As part of Udacity's Computer Vision Nanodegree, I defined a Recurrent Neural Network in PyTorch using LSTM cells, and trained the final layer of a pre-trained CNN so the combined model could accurately label images with a caption of up to 10 words, based on the images and captions the network had previously trained on.
This slide deck was submitted as part of my PhD coursework to detail my project process and reflections.
Computer Vision Project 2:
NAME: Aaron Snowberger (정보통신공학과, 30211060)
SUBJECT: Computer Vision (Dong-Geol Choi)
CURRICULUM: Udacity Computer Vision | Udacity GitHub
Project Repo | Certificate of Completion
01 Problem Overview
02 Load & Visualize the Data
a. Explore MS COCO images
b. Preprocess data
c. Natural Language Toolkit (NLTK)
d. Visualize Vocabulary
e. Load & observe a batch of data
03 Network Architecture & Training
a. Pre-trained CNN Encoder (ResNet-50)
b. Define RNN Decoder Architecture
c. Train the RNN! (Visualize Loss & Perplexity)
04 Get Results
05 Project Reflection
06 Conclusion & Future Research
To automatically caption a variety of images
In this project, I defined a Recurrent Neural Network (RNN)
architecture that uses a Long Short-Term Memory (LSTM) Decoder to
automatically caption novel images it hasn’t seen before.
The Microsoft Common Objects in COntext (MS COCO) image dataset,
with its captions (5 per image), was used to train the model.
1. Load dataset
Load & Visualize the Data
Understanding the data & preprocessing it
Exploring MS COCO Images
Sample image with its five
included possible captions.
1: Image Preprocessing: perform Transforms to standardize the images.
2: Caption Preprocessing: use the Natural Language Toolkit (NLTK) to
tokenize each caption. (See next slide →)
vocab_threshold defines the minimum number of times a word
must appear in the training captions before it is used as part of the
vocabulary. Words appearing less frequently are considered “unknown.”
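A minimal sketch of how vocab_threshold filters the vocabulary (plain Python with a Counter; the real project builds its vocabulary from the full set of MS COCO training captions):

```python
from collections import Counter

def build_vocab(tokenized_captions, vocab_threshold):
    """Keep only words appearing at least vocab_threshold times."""
    counts = Counter(word for caption in tokenized_captions
                     for word in caption)
    return {w for w, c in counts.items() if c >= vocab_threshold}

# Toy captions: "a" appears 3x, "dog" 2x, everything else once.
captions = [["a", "dog", "runs"], ["a", "dog", "sleeps"], ["a", "cat"]]
print(build_vocab(captions, vocab_threshold=2))  # {'a', 'dog'} (order may vary)
```

Lowering the threshold keeps rarer words, so the vocabulary (and the prediction problem) grows.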
Caption Preprocessing: The Natural Language
Toolkit is used to tokenize the caption words.
Example of the caption preprocessing pipeline:
Natural Language Toolkit (NLTK)
This code converts any caption string into a list
of integers, then casts it to a PyTorch tensor.
These special word/integer pairings are also used:
0: <start> - indicates the start of a caption
1: <end> - indicates the end of a caption
2: <unk> - indicates a word that is not
present in the vocabulary list (determined by
vocab_threshold), i.e. “unknown”
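The caption-to-tensor step can be sketched as follows (nltk.tokenize.word_tokenize is swapped for a simple split() to keep the snippet self-contained, and the word2idx mapping is a toy assumption, not the project's vocabulary):

```python
import torch

# Toy vocabulary: 0 = <start>, 1 = <end>, 2 = <unk>, then real words.
word2idx = {"<start>": 0, "<end>": 1, "<unk>": 2,
            "a": 3, "dog": 4, "runs": 5}

def caption_to_tensor(caption):
    # The project uses nltk.tokenize.word_tokenize() here; a lowercase
    # split() stands in to keep the example dependency-free.
    tokens = caption.lower().split()
    ids = [word2idx["<start>"]]
    ids += [word2idx.get(t, word2idx["<unk>"]) for t in tokens]
    ids.append(word2idx["<end>"])
    return torch.tensor(ids, dtype=torch.long)

# "jumps" is out of vocabulary, so it maps to <unk> (2).
print(caption_to_tensor("A dog jumps").tolist())  # [0, 3, 4, 2, 1]
```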
Final PyTorch tensor & what it represents
Visualize the first 10 items in our vocab
dataset that was built from the MS COCO
captions with the help of NLTK.
Check caption lengths in the dataset.
[Figure: distribution of caption lengths; the most common caption
length = 10, and very long captions are rare. Vocabulary built with
vocab_threshold = 5 vs. vocab_threshold = 3.]
Load & Observe a batch of data
With batch_size = 10, let’s load some image & caption Tensors and take a look at them.
images.shape: torch.Size([10, 3, 224, 224])
captions.shape: torch.Size([10, 13])
The sampled images are random, but the results of the
sampling will be relatively the same each time.
A pre-trained CNN (ResNet-50) Encoder + RNN Decoder
Pre-trained CNN Encoder (ResNet-50)
The pre-trained ResNet-50 architecture was used in this project (with the final fully connected layer
removed) to extract features from the images. The output was flattened to a vector, then passed
through a Linear layer to transform the feature vector to be the same size as the word embedding.
1. Load pre-trained network - use early layers to
perform basic feature extraction
2. Remove final fully connected layer & replace with a
new Linear layer to learn specific info about the data
3. Train the network
4. Predict captions for novel images & assess accuracy
Define RNN Decoder Architecture
As the diagram illustrates, the RNN includes the following layers:
1. Embedding Layer: maps captions ->
embedded word vector of embed_size
shape: (vocab_size, embed_size)
2. LSTM Layer: accepts: (1) embedded word
vector from Embedding Layer, (2) embedded
image feature vector from CNN
shape: (embed_size, hidden_size, num_layers)
3. Hidden Layer: (not pictured) maps LSTM
output to the required PyTorch tensor shape:
[batch_size, caption length, vocab_size]
4. Linear Layer: acts like a softmax layer
shape: (hidden_size, vocab_size)
5. (Dropout): additional layer (not pictured) to
help training - between LSTM and FC layer
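The five layers above can be sketched as a Decoder module (the hyperparameter values in the usage lines are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size,
                 num_layers=1, drop_p=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)  # 1. Embedding Layer
        self.lstm = nn.LSTM(embed_size, hidden_size,
                            num_layers, batch_first=True)  # 2. LSTM Layer
        self.dropout = nn.Dropout(drop_p)                  # 5. Dropout
        self.fc = nn.Linear(hidden_size, vocab_size)       # 4. Linear Layer

    def forward(self, features, captions):
        # Embed all caption words except the final one.
        embeddings = self.embed(captions[:, :-1])
        # Prepend the image feature vector as the first LSTM input.
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        out, _ = self.lstm(inputs)                         # 3. hidden states
        # Word scores: [batch_size, caption length, vocab_size]
        return self.fc(self.dropout(out))

decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=1000)
features = torch.randn(4, 256)              # from the CNN Encoder
captions = torch.randint(0, 1000, (4, 13))  # a batch of tokenized captions
scores = decoder(features, captions)
print(scores.shape)  # torch.Size([4, 13, 1000])
```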
Train the Network!
I trained the network twice, on two
different vocabulary lists (20+ hours of training).
vocab_threshold = 3: epochs = 3, steps = 6471
vocab_threshold = 5: epochs = 3, steps = 3236
[Figures: Training Loss & Training Perplexity curves for both runs.]
The larger vocabulary makes it more difficult
to predict accurately. Ultimately, I chose this
model for the final results.
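For reference, the training perplexity tracked alongside the loss is just the exponential of the average cross-entropy loss per word:

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity = exp(average cross-entropy loss per word)."""
    return math.exp(cross_entropy_loss)

# A loss of 2.0 nats per word corresponds to perplexity e^2.
print(round(perplexity(2.0), 3))  # 7.389
```

Lower perplexity means the model is, on average, less "surprised" by the next caption word.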
Where did and didn’t the model perform well?
How well did the model perform?
What I learned by training the CNN-RNN model
Actually, one of the most
important pieces of code that I
needed to program was the
sample() method, which
allowed me to check the
accuracy of the trained model.
In this method, we:
1. Pass initial inputs & states to the
RNN to obtain new outputs & states
2. Pass LSTM output through the
fully-connected layer to obtain word scores
3. Get the probability of the most likely
next word, & the predicted word from those scores
4. Add the predicted word to our output array
5. Prepare the last predicted word as
the next LSTM input, to close the loop
Initially, I had trouble programming this method, and kept
outputting an array of repeated word indices: [1, 119, 119, 119, 119, … 0]
I realized my mistake: I wasn’t passing the predicted word
BACK into the RNN as the next input, so it was only predicting the
first or second word continuously, over and over again.
Conclusion & Future Research
What about other Pre-trained models? Longer or different training?
This project was intended to introduce
the concept of using a Pre-trained CNN
as an Encoder to perform feature
extraction from an image dataset, and
then feed the results into an RNN
Decoder that utilizes LSTM cells to
predict image captions. My results are
sufficient to show that the model works.
However, for future research, I’m
interested in investigating how much
better a different Pre-trained CNN model
might perform compared to ResNet-50
(the model used in this project). I’m also
interested in investigating what
difference a greater number of training
epochs might make to the results.
The Jupyter Notebooks used in this project,
including code and output, can be found at: