Multimodal Machine Learning:A Survey and Taxonomy

Abstract
aim: build models that can process and relate information from multiple modalities
new taxonomy:representation, translation, alignment, fusion, and co-learning.
INTRODUCTION                                                                                                                                 
three modalities:                                                                                                                                  
(nlp)natural language which can be both written or spoken;
(cv) visual signals which are often represented with images or videos;
(sr) vocal signals which encode sounds and para-verbal information such as prosody and vocal expressions
challenges:
representation: the complementarity and redundancy
translation:translate one modality to another( open-ended or subjective)
alignment: measure similarity between different modalities and deal with possible long
range dependencies and ambiguities
fusion: join information from two or more modalities to perform a prediction
co-learning: co -training, conceptual grounding, and zero shot learning
Table 1: A summary of applications enabled by multimodal machine learning. For each application area we identify the core technical challenges that need to be addressed in order to tackle it.

Multimodal Machine Learning:A Survey and Taxonomy_第1张图片

2 A PPLICATIONS : A HISTORICAL PERSPECTIVE 略
3 M ULTIMODAL R EPRESENTATIONS
Representing multiple modalities poses many diffificulties:
how to combine the data from heterogeneous sources;
how to deal with different levels of noise;
and how to deal with missing data.
a number of properties for good representations:
smoothness(类比nlp的平滑), temporal and spatial coherence, sparsity, and natural clustering amongst others
additional desirable properties for multimodal representations:
similarity in the representation space should reflect the similarity of the corresponding concepts,
the representation should be easy to obtain even in the absence of some modalities
it should be possible  to fifill-in missing modalities given the observed ones.
3.1 Joint Representations( project unimodal representations together into a multimodal
space
Mathematically, the joint representation is expressed as:
xm = f(x1, . . . , xn)        (1)
Neural networks
To construct a multimodal representation using neural networks each modality starts with several individual neural layers fol lowed by a hidden layer that projects the modalities into a joint space.The joint multimodal representation is then be passed through multiple hidden layers itself or used directly for prediction.(usually be trained end to end)
advantages:
neural network based joint rep resentations comes from their often superior performance
and the ability to pre-train the representations in an unsu pervised manner. The performance gain is, however, depen dent on the amount of data available for training.
disadvantages:
model not being able to handle missing data                                                                              naturally deep networks are often diffificult to train
Probabilistic graphical models
advantages:
generative nature, which allows for an easy way to deal with missing data (
even if a whole modality is missing, the model has a natural way to cope)
generate samples of one modality in the presence of the other one, or both modalities from the representation s
the representation can be trained in an unsupervised manner enabling the use of unlabeled data
disadvantages:
the diffificulty of training them( high computational cost, and the need to use approximate
variational training methods )
Sequential Representation(varying length sequences) 略
3.2 Coordinated Representations ( learn separate representations for each modality but coordinate them through a constraint )
While coordinated representation is as follows:
f ( x 1 ) g ( x 2 )        (2)
Similarity models( minimize the distance between modal ities in the coordinated space
structured coordinated space( Structured coordinated spaces are commonly used in cross-modal hashing
Hashing enforces certain constraints on the resulting multimodal space:
1) it has to be an N -dimensionalHamming space — a binary representation with controllable
number of bits;
2) the same object from different modalities has to have a similar hash code;
3) the space has to be similarity-preserving.
Multimodal Machine Learning:A Survey and Taxonomy_第2张图片
Multimodal Machine Learning:A Survey and Taxonomy_第3张图片
4 T RANSLATION
Given an entity in one modality the task is to generate the same entity in a different modality.
4.1 Example-based
Retrieval-based models ( directly use the retrieved translation without modifying it )
Combination-based models ( rely on more complex rules to create translations based on a number of retrieved instances )
4.2 Generative approaches
Multimodal Machine Learning:A Survey and Taxonomy_第4张图片
Grammar-based models
Encoder-decoder models
Continuous generation models
5 A LIGNMENT
5.1 Explicit alignment
Unsupervised
Supervised
5.2 Implicit alignment
Graphical models
Neural networks         
Multimodal Machine Learning:A Survey and Taxonomy_第5张图片

 

6 F USION
6.1 Model-agnostic approaches
6.2 Model-based approaches
Multiple kernel learning (MKL)
Graphical models
Neural Networks
7 C O - LEARNING
7.1 Parallel data        
Co-training
Transfer learning
7.2 Non-parallel data
Transfer learning
Conceptual grounding
Zero shot learning
7.3 Hybrid data
Multimodal Machine Learning:A Survey and Taxonomy_第6张图片
Multimodal Machine Learning:A Survey and Taxonomy_第7张图片

 

 

 

 

 

你可能感兴趣的:(多模态paper,大数据,多模态)