Paper Reading: DEEP CAPTIONING WITH MULTIMODAL RECURRENT NEURAL NETWORKS (M-RNN)

DEEP CAPTIONING WITH MULTIMODAL RECURRENT NEURAL NETWORKS (M-RNN)

0.Summary

recurrent layer (deep RNN) + word embedding layers = language model part

Future improvements:

1. Use stronger deep neural networks to extract better word embedding matrices and image features, e.g. language model part: LSTM; vision part: SSD, YOLO.

2. Use pre-computed word vectors to initialize the two word embedding layers.

3. Explore more effective model architectures.

4. Use GPUs.

1.Research Objective

  • three tasks:

    (1) generating novel sentences

    (2) retrieving images given a sentence

    (3) retrieving sentences given an image

  • build a multimodal Recurrent Neural Network (m-RNN) for generating **novel** image captions.

  • the m-RNN model also achieves a significant performance improvement over SOTA methods in the retrieval tasks of retrieving images or sentences.

  • project page of this work (Nearest Neighbor as Reference: A Simple Way to Boost the Performance of Image Captioning):

2.Background and Problems

  • image captioning (previous approaches):

    ​ Many previous methods treat it as a retrieval task: They learn a joint embedding to map the features of both sentences and images to the same semantic space. These methods generate image captions by retrieving them from a sentence database.

    details:

    image features: earlier work used deep models to extract global-level image features; more recent work uses object-level image features based on object detection.

    sentence features: dependency-tree Recursive Neural Network

    embedding model: optimize a ranking cost to learn an embedding model, then use it to map both sentence features and image features into a common semantic feature space

    drawback:

    these methods lack the ability to generate novel sentences or to describe images that contain novel combinations of objects and scenes.

  • three categories of methods for generating novel sentence descriptions for images:

    method 1 :

    parse the sentence -> divide it into several parts -> associate each part with an object (or attribute) in the image.

    model: Conditional Random Field model or Markov Random Field model

    method 2:

    retrieve similar captioned images from the training data -> generalize and re-compose the retrieved captions -> generate new descriptions

    method 3 (our model):

    learn a probability density over the joint space of sentences and images; the probability of generating a sentence with the model can serve as the affinity metric for retrieval

    model: RNN, which can store context information in a recurrent layer.

    contributions of our model:

    (1) incorporate a two-layer word embedding system in the m-RNN network structure, which learns word representations more efficiently than a single-layer word embedding.

    (2) do not use the recurrent layer to store the visual information. The image representation is fed into the m-RNN model along with every word in the sentence description, which allows SOTA performance with a relatively small recurrent layer.

3.Method(s)

3.1 Model architecture

[Figure: the m-RNN model architecture]

The whole m-RNN model contains a language model part, a vision part and a multimodal part. The language model part learns a dense feature embedding for each word in the dictionary and stores the semantic temporal context in recurrent layers. The vision part contains a deep Convolutional Neural Network (CNN) which generates the image representation. The multimodal part connects the language model and the deep CNN together by a one-layer representation. (A minimal code sketch of one time step is given after the layer details below.)

  • two word embedding layers: randomly initializing these two layers and learning them from the training data also yields SOTA results (other works use pre-computed word embedding vectors to initialize their models)

    output: the word embedding vector at time $t$, denoted $w(t)$ -- 256-dim

  • recurrent layer:

    input (time $t$): $w(t)$ and $r(t-1)$

    Calculation process: $r(t) = f_2(U_r\cdot r(t-1)+w(t))$

    Parameter: $U_r$ maps $r(t-1)$ into the same vector space as $w(t)$

    $f_2(\cdot)$: Rectified Linear Unit (ReLU)

    "+": element-wise addition

    output (time $t$): $r(t)$ -- 256-dim

  • multimodal layer: connects the language model part and the vision part of the m-RNN model

    three inputs:

    $w(t)$ -- from word embedding layer II; $r(t)$ -- from the recurrent layer; $I$ -- image representation from AlexNet or VggNet

    Calculation process: $m(t)=g_2(V_m\cdot w(t)+V_r\cdot r(t)+V_I\cdot I)$

    $V_m$, $V_r$, $V_I$: these parameters can be seen as mappings from their original spaces into the multimodal space.

    $g_2(\cdot)$ is the element-wise scaled hyperbolic tangent function:

    $g_2(x)=1.7159\cdot \tanh(\frac{2}{3}x)$

    output: $m(t)$ -- 512-dim

  • softmax layer: generates the probability distribution of the next word.

    output: a probability vector whose dimension equals the vocabulary size $M$, which differs across datasets
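Below is a minimal PyTorch sketch of one m-RNN time step, assembled from the layer descriptions above (two word embedding layers, a ReLU recurrent layer, the scaled-tanh multimodal layer, and a softmax over the vocabulary). The dimensions (256/256/512) follow this note; the class and parameter names (`MRNNStep`, `img_feat_dim`, the 3000-word vocabulary, the 4096-dim CNN feature) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRNNStep(nn.Module):
    """One time step of an m-RNN-style captioner (illustrative sketch)."""

    def __init__(self, vocab_size, img_feat_dim, embed_dim=256, rnn_dim=256, multi_dim=512):
        super().__init__()
        # two word embedding layers (embedding I -> embedding II), giving w(t)
        self.embed1 = nn.Embedding(vocab_size, embed_dim)
        self.embed2 = nn.Linear(embed_dim, embed_dim)
        # recurrent layer: r(t) = ReLU(U_r . r(t-1) + w(t)); rnn_dim must equal embed_dim
        self.U_r = nn.Linear(rnn_dim, rnn_dim, bias=False)
        # multimodal layer: m(t) = g2(V_m . w(t) + V_r . r(t) + V_I . I)
        self.V_m = nn.Linear(embed_dim, multi_dim)
        self.V_r = nn.Linear(rnn_dim, multi_dim)
        self.V_I = nn.Linear(img_feat_dim, multi_dim)
        # softmax layer: distribution over the M words of the vocabulary
        self.out = nn.Linear(multi_dim, vocab_size)

    @staticmethod
    def g2(x):
        # element-wise scaled hyperbolic tangent
        return 1.7159 * torch.tanh(2.0 / 3.0 * x)

    def forward(self, word_id, r_prev, img_feat):
        w = self.embed2(self.embed1(word_id))                         # w(t), 256-dim
        r = F.relu(self.U_r(r_prev) + w)                              # r(t), 256-dim
        m = self.g2(self.V_m(w) + self.V_r(r) + self.V_I(img_feat))  # m(t), 512-dim
        return F.log_softmax(self.out(m), dim=-1), r                  # log P(next word), new state

# hypothetical usage: 3000-word vocabulary, 4096-dim CNN feature (e.g. an AlexNet fc7 vector)
model = MRNNStep(vocab_size=3000, img_feat_dim=4096)
log_p, r = model(torch.tensor([1]), torch.zeros(1, 256), torch.randn(1, 4096))
```

Note how the image feature is fed into the multimodal layer at every time step, matching contribution (2): the recurrent layer does not have to store visual information.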

3.2 Training

cost function: a log-likelihood cost function

  • Perplexity: a standard measure for evaluating language models.

    $\log_2 PPL(w_{1:L}\mid I)=-\frac{1}{L}\sum\limits_{n=1}^{L}\log_2 P(w_n\mid w_{1:n-1},I)$

    $L$: length of the word sequence

    $PPL(w_{1:L}\mid I)$: perplexity of the sentence $w_{1:L}$ given the image $I$.

    $P(w_n\mid w_{1:n-1},I)$: the probability of generating the word $w_n$ given $I$ and the previous words $w_{1:n-1}$; it corresponds to the activation of the softmax layer of the model.

  • cost function: average log-likelihood of the words plus a regularization term:

    $C=\frac{1}{N}\sum\limits_{i=1}^{N_s}L_i\cdot\log_2 PPL(w_{1:L_i}^{(i)}\mid I^{(i)})+\lambda_{\theta}\cdot\parallel \theta \parallel_2^{2}$

    $N_s$: number of sentences; $N$: number of words; $L_i$: length of the $i^{th}$ sentence; $\theta$: the model's parameters

  • training objective: minimize the cost function $C$ (a small sketch of this cost is given below)
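As a sanity check on the formulas above, here is a small sketch (not the authors' training code) of computing $C$ from per-word log2-probabilities; `log2_probs_per_sentence`, `params`, and `lambda_theta` are assumed, illustrative inputs.

```python
import torch

def mrnn_cost(log2_probs_per_sentence, params, lambda_theta=1e-4):
    """Average per-word negative log2-likelihood plus an L2 regularizer.

    log2_probs_per_sentence: list of 1-D tensors, each holding
        log2 P(w_n | w_1:n-1, I) for the words of one training sentence.
    params: iterable of model parameter tensors (theta).
    """
    # L_i * log2 PPL(w_1:L_i | I) equals minus the sum of that sentence's
    # word log2-probabilities, so summing over all words and dividing by N
    # (the total word count) gives the data term of C.
    total_neg_log2 = sum(-lp.sum() for lp in log2_probs_per_sentence)
    n_words = sum(lp.numel() for lp in log2_probs_per_sentence)
    data_term = total_neg_log2 / n_words
    reg_term = lambda_theta * sum(p.pow(2).sum() for p in params)
    return data_term + reg_term
```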

vision part: AlexNet or VggNet pre-trained on the ImageNet dataset

language model part: randomly initialized

deep learning platform: Baidu PADDLE; the m-RNN model takes about 25 ms on average to generate a sentence on a single CPU core on Flickr8K

datasets: IAPR TC-12 , Flickr8K , Flickr30K , MS COCO

4.Evaluation

sentence generation:

  • sentence perplexity
  • BLEU scores (B-1, B-2, B-3, B-4); see the NLTK example below
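A hedged example of computing B-1 through B-4 for a single generated caption with NLTK (the paper does not specify its BLEU implementation, and the reference/candidate sentences below are made-up placeholders):

```python
from nltk.translate.bleu_score import sentence_bleu

references = [["a", "man", "rides", "a", "horse", "on", "the", "beach"]]  # tokenized reference caption(s)
candidate = ["a", "man", "is", "riding", "a", "horse"]                    # tokenized generated caption

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform n-gram weights give B-n
    print(f"B-{n}: {sentence_bleu(references, candidate, weights=weights):.3f}")
```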

sentence & image retrieval:

  • R@K (K=1,5,10)
  • Med r (median rank of the ground truth; a small sketch of both metrics follows)
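A minimal sketch of R@K and Med r for the retrieval tasks, assuming `scores` is a (num_queries x num_items) affinity matrix in which item i is the ground truth for query i; the function name and this diagonal ground-truth convention are assumptions for illustration:

```python
import numpy as np

def retrieval_metrics(scores, ks=(1, 5, 10)):
    """R@K: fraction of queries whose ground truth ranks in the top K.
    Med r: median (1-based) rank of the ground-truth item."""
    order = np.argsort(-scores, axis=1)                     # item indices, best-scoring first
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1
                      for i in range(scores.shape[0])])     # rank of ground truth per query
    recalls = {k: float(np.mean(ranks <= k)) for k in ks}
    return recalls, float(np.median(ranks))
```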

5.Conclusion

The m-RNN model consists of a deep RNN and a deep CNN; these two sub-networks interact with each other in a multimodal layer. It achieves SOTA performance on three tasks: sentence generation, sentence retrieval given a query image, and image retrieval given a query sentence. The model is effective at connecting images and sentences and is flexible enough to incorporate more complex image representations and more sophisticated language models.

