Show and Tell 정리

show and tell

vision CNN followed by language genearting RNN


In contrast, we would like to present in this work a single joint model that takes an image I as input, and is trained to maximize the likelihood p(S|I) of producing a target sequence of words S = {S1, S2, . . .} where each word St comes from a given dictionary, that describes the image adequately.


머신 번역기의 예_

The main inspiration of our work comes from recent advances in machine translation, where the task is to transform a sentence S written in a source language, into its translation T in the target language, by maximizing p(T|S).

An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn in used as the initial hidden state of a “decoder” RNN that generates the target sentence.


여기서 encoder RNN CNN으로 대체

CNNs can produce a rich representation of the input image by embedding it to a fixed-length vector, such that this representation can be used for a variety of vision
NIC, our model, is based end-to-end on a neural net tasks


first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences.






Given a powerful sequence model, it is possible to achieve state-of-the-art results by directly maximizing the probability of the correct translation given an input sentence in an “end-to-end” fashion – both for training and inference.



θ : the parameters of our model,

I : an image

S : its correct transcription


S represents any sentence, its length is unbounded.


joint probability over S0 , . . . , SN (N = 문장의 길이)

(dropped the dependency on θ for convenience)





기존의 LSTM RNN모델에서 hidden state에 image information을 추가.




여기서 짚고 갈 것은 S는 one-hot vector 혹은 300 dimensional word-embedding vector

each word as a one-hot vector St of dimension equal to the size of the dictionary



Note that we denote by S0 a special start word and by SN a special stop word which designates the start and end of the sentence.


The image I is only input once, at t = −1, to inform the LSTM about the image contents.


(이미지 정보를 처음에만 주는 것이 매번 주는 것보다 나았다. We empirically verified that feeding the image at each time step as an extra input yields inferior results, as the network can explicitly exploit noise in the image and overfits more easily.)



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s