As part of Udacity's Computer Vision Nanodegree, I defined a Recurrent Neural Network in PyTorch using LSTM cells, and trained the final layer of a pre-trained CNN so the combined model could accurately label images with a caption of up to 10 words, based on the images and captions the network had previously trained on.
This slide deck was submitted as part of my PhD coursework to detail my project process and reflections.
Computer Vision Project 2:
NAME: Aaron Snowberger (정보통신공학과, 30211060)
SUBJECT: Computer Vision (Dong-Geol Choi)
CURRICULUM: Udacity Computer Vision | Udacity GitHub
Project Repo | Certificate of Completion
01 Problem Overview
02 Load & Visualize the Data
a. Explore MS COCO images
b. Preprocess data
c. Natural Language Toolkit (NLTK)
d. Visualize Vocabulary
e. Load & observe a batch of data
03 Network Architecture & Training
a. Pre-trained CNN Encoder (ResNet-50)
b. Define RNN Decoder Architecture
c. Train the RNN! (Visualize Loss & Perplexity)
04 Get Results
05 Project Reflection
06 Conclusion & Future Research
To automatically caption a variety of images
In this project, I defined a Recurrent Neural Network (RNN)
architecture that uses a Long Short-Term Memory (LSTM) Decoder to
automatically caption novel images it hasn’t seen before.
The Microsoft Common Objects in COntext (MS COCO) image dataset,
with its captions (5 per image), was used to train the model.
1. Load dataset
Load & Visualize the Data
Understanding the data & preprocessing it
Exploring MS COCO Images
Sample image with its five
included possible captions.
1: Image Preprocessing: perform Transforms to standardize the images.
2: Caption Preprocessing: use the Natural Language Toolkit (NLTK) to
tokenize each caption. (See next slide →)
vocab_threshold defines the minimum number of times a word
must appear in the training captions before it is used as part of the
vocabulary. Words appearing less frequently are considered “unknown.”
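A minimal sketch of how vocab_threshold filters the vocabulary (plain Python with a Counter; the real project builds its vocabulary from the full set of MS COCO training captions):

```python
from collections import Counter

def build_vocab(tokenized_captions, vocab_threshold):
    """Keep only words appearing at least vocab_threshold times."""
    counts = Counter(word for caption in tokenized_captions
                     for word in caption)
    return {w for w, c in counts.items() if c >= vocab_threshold}

# Toy captions: "a" appears 3x, "dog" 2x, everything else once.
captions = [["a", "dog", "runs"], ["a", "dog", "sleeps"], ["a", "cat"]]
print(build_vocab(captions, vocab_threshold=2))  # {'a', 'dog'} (order may vary)
```

Lowering the threshold keeps rarer words, so the vocabulary (and the prediction problem) grows.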
Caption Preprocessing: The Natural Language
Toolkit is used to tokenize the caption words.
Example of the caption preprocessing pipeline:
Natural Language Toolkit (NLTK)
This code converts any caption string into a list
of integers, then casts it to a PyTorch tensor.
These special word/integer pairings are also used:
0: <start> - indicates the start of a caption
1: <end> - indicates the end of a caption
2: <unk> - indicates a word that is not
present in the vocabulary list (determined by
vocab_threshold), i.e. “unknown”
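The caption-to-tensor step can be sketched as follows (nltk.tokenize.word_tokenize is swapped for a simple split() to keep the snippet self-contained, and the word2idx mapping is a toy assumption, not the project's vocabulary):

```python
import torch

# Toy vocabulary: 0 = <start>, 1 = <end>, 2 = <unk>, then real words.
word2idx = {"<start>": 0, "<end>": 1, "<unk>": 2,
            "a": 3, "dog": 4, "runs": 5}

def caption_to_tensor(caption):
    # The project uses nltk.tokenize.word_tokenize() here; a lowercase
    # split() stands in to keep the example dependency-free.
    tokens = caption.lower().split()
    ids = [word2idx["<start>"]]
    ids += [word2idx.get(t, word2idx["<unk>"]) for t in tokens]
    ids.append(word2idx["<end>"])
    return torch.tensor(ids, dtype=torch.long)

# "jumps" is out of vocabulary, so it maps to <unk> (2).
print(caption_to_tensor("A dog jumps").tolist())  # [0, 3, 4, 2, 1]
```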
Final PyTorch tensor & what it represents
Visualize the first 10 items in our vocab
dataset that was built from the MS COCO
captions with the help of NLTK.
Check caption lengths in the dataset.
[Figure: distribution of caption lengths; the most common caption
length = 10, and very long captions are rare. Vocabulary built with
vocab_threshold = 5 vs. vocab_threshold = 3.]
Load & Observe a batch of data
With batch_size = 10, let’s load some image & caption Tensors and take a look at them.
images.shape: torch.Size([10, 3, 224, 224])
captions.shape: torch.Size([10, 13])
The sampled images are random, but the results of the
sampling will be relatively the same each time.
A pre-trained CNN (ResNet-50) Encoder + RNN Decoder
Pre-trained CNN Encoder (ResNet-50)
The pre-trained ResNet-50 architecture was used in this project (with the final fully connected layer
removed) to extract features from the images. The output was flattened to a vector, then passed
through a Linear layer to transform the feature vector to be the same size as the word embedding.
1. Load pre-trained network - use early layers to
perform basic feature extraction
2. Remove final fully connected layer & replace with a
new Linear layer to learn specific info about the data
3. Train the network
4. Predict captions for novel images & assess accuracy
Define RNN Decoder Architecture
As the diagram illustrates, the RNN includes the following layers:
1. Embedding Layer: maps captions ->
embedded word vector of embed_size
shape: (vocab_size, embed_size)
2. LSTM Layer: accepts: (1) embedded word
vector from Embedding Layer, (2) embedded
image feature vector from CNN
shape: (embed_size, hidden_size, num_layers)
3. Hidden Layer: (not pictured) maps LSTM
output to the required PyTorch tensor shape:
[batch_size, caption length, vocab_size]
4. Linear Layer: acts like a softmax layer
shape: (hidden_size, vocab_size)
5. (Dropout): additional layer (not pictured) to
help training - between LSTM and FC layer
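The five layers above can be sketched as a Decoder module (the hyperparameter values in the usage lines are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size,
                 num_layers=1, drop_p=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)  # 1. Embedding Layer
        self.lstm = nn.LSTM(embed_size, hidden_size,
                            num_layers, batch_first=True)  # 2. LSTM Layer
        self.dropout = nn.Dropout(drop_p)                  # 5. Dropout
        self.fc = nn.Linear(hidden_size, vocab_size)       # 4. Linear Layer

    def forward(self, features, captions):
        # Embed all caption words except the final one.
        embeddings = self.embed(captions[:, :-1])
        # Prepend the image feature vector as the first LSTM input.
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        out, _ = self.lstm(inputs)                         # 3. hidden states
        # Word scores: [batch_size, caption length, vocab_size]
        return self.fc(self.dropout(out))

decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=1000)
features = torch.randn(4, 256)              # from the CNN Encoder
captions = torch.randint(0, 1000, (4, 13))  # a batch of tokenized captions
scores = decoder(features, captions)
print(scores.shape)  # torch.Size([4, 13, 1000])
```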
Train the Network!
I trained the network twice, on two
different vocabulary lists (20+ hours of training).
vocab_threshold = 3: epochs = 3, steps = 6471
vocab_threshold = 5: epochs = 3, steps = 3236
[Figures: Training Loss & Training Perplexity curves for both runs.]
The larger vocabulary makes it more difficult
to predict accurately. Ultimately, I chose this
model for the final results.
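For reference, the training perplexity tracked alongside the loss is just the exponential of the average cross-entropy loss per word:

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity = exp(average cross-entropy loss per word)."""
    return math.exp(cross_entropy_loss)

# A loss of 2.0 nats per word corresponds to perplexity e^2.
print(round(perplexity(2.0), 3))  # 7.389
```

Lower perplexity means the model is, on average, less "surprised" by the next caption word.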
Where did and didn’t the model perform well?
How well did the model perform?
What I learned by training the CNN-RNN model
Actually, one of the most
important pieces of code that I
needed to program was the
sample() method, which
allowed me to check the
accuracy of the trained model.
In this method, we:
1. Pass initial inputs & states to the
RNN to obtain new outputs & states
2. Pass LSTM output through the
fully-connected layer to obtain word scores
3. Get the probability of the most likely
next word, & the predicted word from those scores
4. Add the predicted word to our output array
5. Prepare the last predicted word as
the next LSTM input, to close the loop
Initially, I had trouble programming this method, and kept
outputting an array of repeated word indices: [1, 119, 119, 119, 119, … 0]
I realized my mistake: I wasn’t passing the predicted word
BACK into the RNN as the next input, so it was only predicting the
first or second word continuously, over and over again.
Conclusion & Future Research
What about other Pre-trained models? Longer or different training?
This project was intended to introduce
the concept of using a Pre-trained CNN
as an Encoder to perform feature
extraction from an image dataset, and
then feed the results into an RNN
Decoder that utilizes LSTM cells to
predict image captions. My results are
sufficient to show that the model works.
However, for future research, I’m
interested in investigating how much
better a different Pre-trained CNN model
might perform compared to ResNet-50
(the model used in this project). I’m also
interested in investigating what
difference a greater number of training
epochs might make to the results.
The Jupyter Notebooks used in this project,
including code and output, can be found at: