Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles of modality heterogeneity, connections, and interactions that have driven subsequent innovations, and propose a taxonomy of six core technical challenges: representation, alignment, reasoning, generation, transference, and quantification covering historical and recent trends. Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.

Three Important Things

1. A New Taxonomy For Multimodal Machine Learning Research

The authors propose a new taxonomy that classifies multimodal machine learning research into the 6 core challenges listed below:

Taxonomy of the 6 core challenges in multimodal machine learning

Representation: Choosing representations that best support learning from multimodal inputs
Alignment: Capturing relationships between inputs from different modalities
Reasoning: Performing inference on multimodal data over multiple steps
Generation: Generating new outputs in a particular modality
Transference: Allowing different modalities to learn from each other
Quantification: Studying multimodal models empirically and theoretically, by quantifying their information content and by exploring the presence and extent of relationships between modalities

2. Contrastive Learning

Contrastive learning is a popular approach for learning interactions between multimodal inputs. It encourages similar samples to be closer together in the representation space and pushes dissimilar samples further apart.
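To make the idea concrete, here is a minimal sketch of one common contrastive objective, an InfoNCE-style loss. This is my own illustration rather than anything prescribed by the survey: the function name `info_nce`, the temperature value, and the toy 2-D "image" and "text" embeddings are all assumptions. Each anchor is pulled toward its paired candidate and pushed away from the other candidates in the batch.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE-style loss, where anchors[i] is paired with candidates[i].

    The temperature of 0.1 is an arbitrary choice for this illustration.
    """
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every candidate in the batch.
        logits = [dot(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the paired candidate as the correct class.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (hypothetical): two images and two captions.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]  # caption i sits near image i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]  # caption i sits near the wrong image

loss_aligned = info_nce(image_emb, text_aligned)
loss_swapped = info_nce(image_emb, text_swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each paired sample is the nearest neighbor of its anchor and large when the pairs are mismatched, which is the behavior a contrastive objective trains the encoders toward.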

3. Attention Maps for Intermediate Concepts

When performing multi-step reasoning on multimodal data, attention maps are a popular choice of intermediate concept because they are human-interpretable and sufficiently general.
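As a rough illustration of why attention maps are easy to inspect, the sketch below computes a scaled dot-product attention map of a single query (say, a word embedding) over a handful of image-region features. The names `attention_map`, `query`, and `regions`, and the toy numbers, are assumptions made for this example and do not come from the survey.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(query, region_features):
    """Scaled dot-product attention weights of one query over a list of regions."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * r for q, r in zip(query, region)) / scale
        for region in region_features
    ]
    return softmax(scores)


# Toy features (hypothetical): one text query and three image regions.
query = [1.0, 0.0]
regions = [[0.1, 0.9], [2.0, 0.0], [0.5, 0.5]]

weights = attention_map(query, regions)
# Index of the region the query attends to most strongly.
top_region = max(range(len(weights)), key=weights.__getitem__)
print(top_region, [round(w, 3) for w in weights])
```

Because the weights form a distribution over regions, they can be overlaid on the input as a heatmap at each reasoning step, which is what makes this kind of intermediate concept readable by a human.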

Most Glaring Deficiency

It’s rather hard to criticize a survey, but I did wonder whether the authors were aware of any research directions that don’t fit anywhere in the taxonomy, or that span several categories.

It would also have been illuminating to see specific examples for the more abstract research questions, for readers like me who are less familiar with work in this space.

Conclusions for Future Work

When looking at a piece of work on multimodal machine learning, it might be helpful to see where it falls in the taxonomy, to understand which dimension of the problem it tackles and which other dimensions remain from which to view the problem. Perhaps the same technique could be applicable across several of the challenge categories.