Image Caption Generator using CNN and LSTM

Technologies Used

  • PyTorch
  • Python
  • spaCy

Design

(design diagram)

Training:

  • The last layer of the CNN module is removed, and a fully connected layer is added that produces a feature vector of size embed_size (e.g., 256). With batch_size=8, the output of the CNN module has shape (8, 256).
  • Each target word is passed through the embedding layer to produce a 256-dimensional embedding. A maximum caption length of max_length=40 is used, so with batch_size=8 the output of the embedding layer has shape (8, 40, 256).
  • The feature vector from the CNN module is concatenated with the output of the embedding layer, giving an input of shape (8, 41, 256). This is fed to the LSTM, which produces a 256-dimensional hidden state and cell state at each time step. A final fc-layer then maps each 256-dimensional hidden state to the vocabulary (vocab_size around 7500+), giving one prediction per target word, i.e. an output of shape (8, 40, 7500+).
  • This process runs over all 40 time steps, since the LSTM processes the sequence word by word, and the whole model is trained end-to-end. A rough sketch of this forward pass is given below.
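The training forward pass described above can be sketched roughly as follows in PyTorch. The module names (EncoderCNN, DecoderLSTM), the ResNet-50 backbone, and the exact hyperparameters (embed_size=256, hidden_size=256, vocab_size=7500) are illustrative assumptions rather than the repository's actual code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """CNN with its classification layer replaced by an FC layer that outputs the feature vector."""
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet50()                        # pretrained weights would be loaded in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])   # drop the last (classifier) layer
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)         # new FC -> (batch, 256)

    def forward(self, images):                            # images: (batch, 3, H, W)
        features = self.backbone(images)                  # (batch, 2048, 1, 1)
        return self.fc(features.flatten(1))               # (batch, 256)

class DecoderLSTM(nn.Module):
    """Embedding + LSTM + FC layer mapping hidden states to vocabulary logits."""
    def __init__(self, embed_size=256, hidden_size=256, vocab_size=7500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):                # features: (batch, 256), captions: (batch, 40)
        emb = self.embed(captions)                        # (batch, 40, 256)
        inputs = torch.cat([features.unsqueeze(1), emb], dim=1)   # image feature prepended -> (batch, 41, 256)
        hiddens, _ = self.lstm(inputs)                    # (batch, 41, 256)
        return self.fc(hiddens)                           # (batch, 41, vocab_size)

# Shape check with dummy data (batch_size=8, max_length=40):
encoder, decoder = EncoderCNN(), DecoderLSTM()
images = torch.randn(8, 3, 224, 224)
captions = torch.randint(0, 7500, (8, 40))
logits = decoder(encoder(images), captions)               # (8, 41, 7500); slice to 40 steps for the loss
```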

Inference:

  • The image passes through the CNN module to generate a feature vector of size 256.

  • This feature vector is passed to the LSTM cell as its first input.

  • The LSTM output is mapped to a probability distribution over the vocab_size words.

  • The loop runs for up to 40 (max_length) steps, or until the <end> token is produced; at each step, the embedding of the predicted word is passed back into the LSTM cell as the next input. A greedy-decoding sketch is shown below.
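Using the same assumed modules as above, greedy decoding at inference time might look like the following; the generate_caption helper and the end_token_id are hypothetical names, not the repository's actual API.

```python
import torch

def generate_caption(encoder, decoder, image, end_token_id, max_length=40):
    """Greedy decoding: feed the image feature first, then feed back each predicted word."""
    encoder.eval()
    decoder.eval()
    with torch.no_grad():
        inputs = encoder(image.unsqueeze(0)).unsqueeze(1)    # (1, 1, 256) image feature as first input
        states = None                                        # LSTM hidden/cell state
        word_ids = []
        for _ in range(max_length):                          # at most 40 steps
            hiddens, states = decoder.lstm(inputs, states)   # (1, 1, 256)
            logits = decoder.fc(hiddens.squeeze(1))          # (1, vocab_size) scores over the vocabulary
            predicted = logits.argmax(dim=1)                 # most probable word id
            word_ids.append(predicted.item())
            if predicted.item() == end_token_id:             # stop once the <end> token is generated
                break
            inputs = decoder.embed(predicted).unsqueeze(1)   # feed its embedding back in: (1, 1, 256)
    return word_ids
```

The returned word ids would then be mapped back to words using the vocabulary built during preprocessing (e.g., from spaCy tokenization).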

Examples

(example generated captions)

Code

Github