Computer Vision Project 2: Image Captioning

As a part of Udacity's Computer Vision Nanodegree, I defined a Recurrent Neural Network in PyTorch, using LSTM cells, and trained the last layer of a pre-trained CNN model to be able to accurately label images with a string of up to 10 words, based on images and captions the network had previously trained on.

This PPT was submitted as a part of my PhD coursework to detail my project process and reflection.


Aaron Snowberger

May 19, 2021


  1. Computer Vision Project 2: Image Captioning
    NAME: Aaron Snowberger (Department of Information and Communication Engineering, 30211060)
    DATE: 2021.05.19
    SUBJECT: Computer Vision (Dong-Geol Choi)
    CURRICULUM: Udacity Computer Vision | Udacity GitHub Project Repo | Certificate of Completion
  2. CONTENTS
    01 Problem Overview
    02 Load & Visualize the Data
      a. Explore MS COCO images
      b. Preprocess data
      c. Natural Language Toolkit (NLTK)
      d. Visualize Vocabulary
      e. Load & observe a batch of data
    03 Network Architecture & Training
      a. Pre-trained CNN Encoder (ResNet-50)
      b. Define RNN Decoder Architecture
      c. Train the RNN! (Visualize Loss & Perplexity)
    04 Get Results
    05 Project Reflection
    06 Conclusion & Future Research
  3. Problem Overview: To automatically caption a variety of images

  4. Problem Overview
    In this project, I defined a Recurrent Neural Network (RNN) architecture that uses a Long Short-Term Memory (LSTM) Decoder to automatically caption novel images it has not seen before. The Microsoft Common Objects in COntext (MS COCO) image dataset and its captions (5 per image) were used to train the model.
    Steps: 1. Load dataset 2.
  5. Load & Visualize the Data: Understanding the data & preprocessing it
  6. Exploring MS COCO Images
    A sample image with the five possible captions included with it.
  7. Data Preprocessing
    1: Image Preprocessing: performs Transforms to standardize the images.
    2: Caption Preprocessing: uses the Natural Language Toolkit (NLTK) to tokenize each caption. (See next slide →)
    vocab_threshold defines the minimum number of times a word must appear in the training captions before it is used as part of the vocabulary. Words appearing less frequently are considered "unknown."
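The effect of vocab_threshold can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's vocabulary class: build_vocab and the toy captions are hypothetical, and a simple lowercase whitespace split stands in for NLTK tokenization.

```python
from collections import Counter


def build_vocab(captions, vocab_threshold):
    """Map every word seen at least vocab_threshold times to an integer index."""
    # Count lowercase, whitespace-separated tokens across all captions
    # (a stand-in for NLTK tokenization).
    counts = Counter(word for caption in captions for word in caption.lower().split())
    # Reserve the first indices for the special tokens listed on the next slide.
    word2idx = {"<start>": 0, "<end>": 1, "<unk>": 2}
    for word, count in sorted(counts.items()):
        if count >= vocab_threshold:
            word2idx[word] = len(word2idx)
    return word2idx


# Hypothetical toy captions, used only to show the effect of the threshold.
toy_captions = [
    "a dog runs on the beach",
    "a dog plays with a ball",
    "a cat sleeps on the sofa",
]
vocab_low = build_vocab(toy_captions, vocab_threshold=1)
vocab_high = build_vocab(toy_captions, vocab_threshold=2)
print(len(vocab_low), len(vocab_high))  # 15 7
```

Raising the threshold shrinks the vocabulary: words such as "beach" that appear only once fall out of it and would be treated as unknown.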
  8. Natural Language Toolkit (NLTK)
    Caption Preprocessing: The Natural Language Toolkit is used to tokenize the caption words.
    Example of the caption preprocessing pipeline: This code converts any caption string into a list of integers, then casts it to a PyTorch tensor.
    These special word/integer pairings are also used:
    0: <start> - indicates the start of a caption
    1: <end> - indicates the end of a caption
    2: <unk> - indicates a word that is not present in the vocabulary list (determined by vocab_threshold), i.e. "unknown"
    (Figure: final PyTorch array & what it represents)
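The caption-to-integers step described on this slide might look roughly like the sketch below. It is a hedged illustration: encode_caption and the small word2idx dictionary are hypothetical, str.split() stands in for nltk.tokenize.word_tokenize, and the final cast to a PyTorch tensor is left out so the example needs no extra libraries.

```python
def encode_caption(caption, word2idx):
    """Convert a caption string into a list of vocabulary indices."""
    # Stand-in tokenizer; the project uses NLTK for this step.
    tokens = caption.lower().split()
    unk = word2idx["<unk>"]
    ids = [word2idx["<start>"]]
    # Words missing from the vocabulary map to the <unk> index.
    ids.extend(word2idx.get(token, unk) for token in tokens)
    ids.append(word2idx["<end>"])
    return ids


# Hypothetical vocabulary; only the three special-token indices come from the slide.
word2idx = {"<start>": 0, "<end>": 1, "<unk>": 2, "a": 3, "dog": 4, "on": 5, "the": 6}
encoded = encode_caption("A dog on the beach", word2idx)
print(encoded)  # [0, 3, 4, 5, 6, 2, 1]
```

Here "beach" is not in the toy vocabulary, so it is encoded as 2 (<unk>), and the list is wrapped in 0 (<start>) and 1 (<end>).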
  9. Visualize Vocabulary
    Visualize the first 10 items in our vocab dataset, which was built from the MS COCO captions with the help of NLTK.
    Check caption lengths in the dataset with: data_loader.dataset.caption_lengths
    The majority are length = 10. Very long & very short captions are rare.
    (Figure panels: vocab_threshold = 5 and vocab_threshold = 3)
  10. Load & Observe a batch of data
    With batch_size = 10, let's load some image & caption Tensors and take a look at them.
    images.shape: torch.Size([10, 3, 224, 224])
    captions.shape: torch.Size([10, 13])
    The sampled images are random, but the results of the sampling will be largely the same each time.
  11. Network Architecture: A pre-trained CNN (ResNet-50) Encoder + RNN Decoder

  12. Pre-trained CNN Encoder (ResNet-50)
    The pre-trained ResNet-50 architecture was used in this project (with the final fully connected layer removed) to extract features from the images. The output was flattened to a vector, then passed through a Linear layer to transform the feature vector to be the same size as the word embedding.
    (Figure: ResNet-50 architecture)
    In sum:
    1. Load pre-trained network - use early layers to perform basic feature extraction
    2. Remove final fully connected layer & replace with a new Linear layer to learn specific info about data
    3. Train the network
    4. Predict captions for novel images & assess accuracy
  13. Define RNN Decoder Architecture
    As the diagram illustrates, the RNN includes the following parts:
    1. Embedding Layer: maps captions -> embedded word vector of embed_size; shape: (vocab_size, embed_size)
    2. LSTM Layer: accepts (1) the embedded word vector from the Embedding Layer and (2) the embedded image feature vector from the CNN; shape: (embed_size, hidden_size, num_layers)
    3. Hidden Layer (not pictured): maps LSTM output to the required PyTorch tensor: ([batch_size, captions.shape[1], vocab_size])
    4. Linear Layer: will act like a softmax layer; shape: (hidden_size, vocab_size)
    5. (Dropout): additional layer (not pictured) to help training, between the LSTM and FC layer
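The parts listed on this slide could be wired together in PyTorch roughly as follows. This is a minimal sketch, not the project's actual decoder: the class layout, the drop_prob value, and the sizes (embed_size = 256, hidden_size = 512, vocab_size = 1000) are assumptions chosen for illustration, while the batch size of 10 and caption length of 13 follow the batch shown on slide 10.

```python
import torch
import torch.nn as nn


class DecoderRNN(nn.Module):
    """Sketch of an LSTM caption decoder (all hyperparameters are assumed)."""

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1, drop_prob=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Drop the last token so the output length equals captions.shape[1]
        # once the image feature is prepended.
        embeddings = self.embed(captions[:, :-1])
        # Use the image feature vector as the first time step of the sequence.
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        lstm_out, _ = self.lstm(inputs)
        # Per-step token scores of shape (batch_size, captions.shape[1], vocab_size).
        return self.fc(self.dropout(lstm_out))


decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=1000)
features = torch.randn(10, 256)               # stand-in for encoder output
captions = torch.randint(0, 1000, (10, 13))   # stand-in for tokenized captions
scores = decoder(features, captions)
print(scores.shape)  # torch.Size([10, 13, 1000])
```

The output matches the tensor shape named in item 3: [batch_size, captions.shape[1], vocab_size].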
  14. Train the Network!
    I trained the network twice on two different vocabulary lists (20+ hours).
    A larger vocabulary makes it more difficult to predict accurately.
    Ultimately, I chose this model for the final prediction step.
    (Figure panels: Training Loss and Training Perplexity for vocab_threshold = 3 and vocab_threshold = 5. Run labels: epochs = 3, steps = 6471; epochs = 3, steps = 3236.)
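Perplexity, plotted alongside the loss on this slide, is the exponential of the average cross-entropy loss. The sketch below assumes the loss is measured in nats (natural log), which is what PyTorch's nn.CrossEntropyLoss computes; the loss values are made up for illustration and are not results from this project.

```python
import math


def perplexity(cross_entropy_loss):
    """Convert an average cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


# Illustrative loss values only, not measurements from the project.
for loss in (4.0, 3.0, 2.0):
    print(f"loss = {loss:.1f} -> perplexity = {perplexity(loss):.1f}")
```

A perplexity of k can be read as the model being about as uncertain as if it were choosing uniformly among k words at each step, so a model guessing uniformly over a 1000-word vocabulary would have a perplexity of 1000.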
  15. Get Results: Where did and didn't the model perform well?

  16. How well did the model perform?
    (Figure labels: "Not very accurate..." and "Good caption predictions!")
  17. Project Reflection: What I learned by training the CNN-RNN model

  18. Reflection
    Actually, one of the most important pieces of code that I needed to program was the sample() method, which allowed me to check the accuracy of the trained model. In this method, we:
    1. Pass initial inputs & states to the RNN to obtain new inputs & states
    2. Pass LSTM output through the fully-connected layer to obtain token scores
    3. Get probability of the most likely next word, & predicted word from the scores
    4. Add predicted word to our array
    5. Prepare the last predicted word as the next LSTM input, to close the RNN loop
    Initially, I had trouble programming this method, and kept outputting an array of repeated word scores: [1, 119, 119, 119, 119, … 0]
    I realized my mistake: I wasn't passing the predicted word BACK into the RNN as the next input, so it was only predicting the first or second word over and over again.
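The feedback step in item 5, and the repeated-word symptom that appears without it, can be illustrated with a toy greedy decoding loop. This is a simplified sketch, not the project's sample() method: toy_step and TOY_SCORES are hypothetical stand-ins for one LSTM plus fully-connected step, the LSTM hidden state is omitted, and the token indices are arbitrary.

```python
START, END = 0, 1

# Hypothetical scores over a 5-token vocabulary, keyed by the current input token.
TOY_SCORES = {
    0: [0.0, 0.1, 0.0, 0.9, 0.0],  # after <start>, token 3 scores highest
    3: [0.0, 0.1, 0.0, 0.0, 0.9],  # after token 3, token 4 scores highest
    4: [0.0, 0.9, 0.0, 0.1, 0.0],  # after token 4, <end> scores highest
}


def toy_step(token):
    """Stand-in for one decoder step: return a score for every vocabulary token."""
    return TOY_SCORES[token]


def greedy_sample(step, max_len=10, feed_back=True):
    """Greedily pick the highest-scoring token at each step."""
    output = []
    current = START
    for _ in range(max_len):
        scores = step(current)
        predicted = max(range(len(scores)), key=scores.__getitem__)
        output.append(predicted)
        if predicted == END:
            break
        if feed_back:
            # Close the loop: the prediction becomes the next input.
            current = predicted
    return output


print(greedy_sample(toy_step))                   # [3, 4, 1]
print(greedy_sample(toy_step, feed_back=False))  # [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
```

With feed_back=False the input never changes, so the same token is predicted until max_len is reached, which mirrors the repeated output described above.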
  19. Conclusion & Future Research: What about other Pre-trained models? Longer training?
  20. Different training
    This project was intended to introduce the concepts of using a Pre-trained CNN as an Encoder to perform feature extraction from an image dataset, and then feeding the results into an RNN Decoder that utilized LSTM cells to predict image captions. My results are sufficient to show that the model works. However, for future research, I'm interested in investigating how much better a different Pre-trained CNN model might perform compared to ResNet-50 (the model used in this project). I'm also interested in investigating what difference a greater number of training epochs might make to the results.
    (Figure: ResNet-50 architecture)
  21. THANKS!
    The Jupyter Notebooks used in this project, including code and output, can be found at: https://github.com/jekkilekki/computer-vision/tree/master/Image%20Captions