
Computer Vision Project 2: Image Captioning

As part of Udacity's Computer Vision Nanodegree, I defined a Recurrent Neural Network in PyTorch using LSTM cells and trained the final layer of a pre-trained CNN so the model could accurately caption images with strings of up to 10 words, based on the images and captions the network had previously been trained on.

This presentation was submitted as part of my PhD coursework to detail my project process and reflection.

Aaron Snowberger

May 19, 2021
Transcript

  1. Computer Vision Project 2:
    Image Captioning
    NAME: Aaron Snowberger (Dept. of Information & Communication Engineering, 30211060)
    DATE: 2021.05.19
    SUBJECT: Computer Vision (Dong-Geol Choi)
    CURRICULUM: Udacity Computer Vision | Udacity GitHub
    Project Repo | Certificate of Completion


  2. CONTENTS
    01 Problem Overview
    02 Load & Visualize the Data
    a. Explore MS COCO images
    b. Preprocess data
    c. Natural Language Toolkit (NLTK)
    d. Visualize Vocabulary
    e. Load & observe a batch of data
    03 Network Architecture & Training
    a. Pre-trained CNN Encoder (ResNet-50)
    b. Define RNN Decoder Architecture
    c. Train the RNN! (Visualize Loss & Perplexity)
    04 Get Results
    05 Project Reflection
    06 Conclusion & Future Research

  3. Problem Overview
    To automatically caption a variety of images

  4. Problem Overview
    In this project, I defined a Recurrent Neural Network (RNN)
    architecture that uses a Long Short-Term Memory (LSTM) Decoder to
    automatically caption novel images it hasn’t seen before.
    The Microsoft Common Objects in COntext (MS COCO) image dataset
    and its captions (5 per image) were used to train the model.
    Steps:
    1. Load & preprocess the dataset
    2. Define the CNN Encoder & RNN Decoder architecture
    3. Train the network
    4. Predict captions for novel images & assess accuracy

  5. Load & Visualize the Data
    Understanding the data & preprocessing it

  6. Exploring MS COCO Images
    (Sample image shown with five included possible captions.)


  7. Data Preprocessing
    1: Image Preprocessing: includes performing Transforms to standardize the images.
    2: Caption Preprocessing: we use the Natural Language Toolkit (NLTK) to
    tokenize each caption. (See next slide →)
    vocab_threshold defines the minimum number of times a word
    must appear in the training captions before it is used as part of the
    vocabulary. Words appearing less frequently are considered “unknown.”
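    As a rough sketch, the image Transforms likely resemble the standard
    ImageNet pipeline below (the exact augmentations and normalization
    constants are assumptions; the 224×224 crop matches the batch shapes
    shown on a later slide):

        import torchvision.transforms as transforms

        # Standardize training images for the ResNet-50 encoder.
        transform_train = transforms.Compose([
            transforms.Resize(256),                      # shorter side -> 256 px
            transforms.RandomCrop(224),                  # random 224x224 crop
            transforms.RandomHorizontalFlip(),           # light augmentation
            transforms.ToTensor(),                       # PIL image -> tensor in [0, 1]
            transforms.Normalize((0.485, 0.456, 0.406),  # ImageNet channel means
                                 (0.229, 0.224, 0.225)), # ImageNet channel stds
        ])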


  8. Caption Preprocessing: Natural Language Toolkit (NLTK)
    The Natural Language Toolkit is used to tokenize the caption words.
    Example of the caption preprocessing pipeline: this code converts any
    caption string into a list of integers, then casts it to a PyTorch tensor.
    These special word/integer pairings are also used:
    0: <start> - indicates the start of a caption
    1: <end> - indicates the end of a caption
    2: <unk> - indicates a word that is not present in the vocabulary list
    (determined by vocab_threshold), i.e. “unknown”
    (Final PyTorch array & what it represents shown on slide.)
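    A minimal sketch of that conversion, assuming a callable vocab object
    that maps words to integer indices (with the <start>, <end>, and <unk>
    entries above); the helper name is illustrative:

        import nltk
        import torch

        def caption_to_tensor(caption, vocab):
            # Tokenize the lowercased caption string into words.
            tokens = nltk.tokenize.word_tokenize(str(caption).lower())
            ids = [vocab('<start>')]                   # 0: start-of-caption marker
            ids += [vocab(token) for token in tokens]  # unknown words map to 2
            ids.append(vocab('<end>'))                 # 1: end-of-caption marker
            return torch.Tensor(ids).long()            # cast to a PyTorch tensor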


  9. Visualize Vocabulary
    Visualize the first 10 items in our vocab dataset that was built from
    the MS COCO captions with the help of NLTK.
    (Vocabulary samples shown for vocab_threshold = 5 and vocab_threshold = 3.)
    Check caption lengths in the dataset with:
    data_loader.dataset.caption_lengths
    The majority are length = 10; very long & very short captions are rare.
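    A sketch of that length check, tallying how often each caption length
    occurs (assuming data_loader is the project's data loader, as above):

        from collections import Counter

        # Count how many training captions have each length.
        counter = Counter(data_loader.dataset.caption_lengths)
        for length, count in sorted(counter.items(), key=lambda p: p[1], reverse=True):
            print('length: %2d -- count: %6d' % (length, count))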


  10. Load & Observe a Batch of Data
    With batch_size = 10, let’s load some image & caption Tensors and take a look at them.
    images.shape: torch.Size([10, 3, 224, 224])
    captions.shape: torch.Size([10, 13])
    The sampled images are random, but the results of the sampling are
    broadly similar each time.
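    A sketch of sampling one such batch; get_train_indices (which picks a
    caption length at random, then samples caption indices of that length)
    follows the Udacity starter code, so treat these names as assumptions:

        import torch.utils.data as data

        # Sample caption indices that all share one caption length,
        # then load the matching image & caption batch.
        indices = data_loader.dataset.get_train_indices()
        data_loader.batch_sampler.sampler = data.sampler.SubsetRandomSampler(indices=indices)

        images, captions = next(iter(data_loader))
        print('images.shape:', images.shape)      # torch.Size([10, 3, 224, 224])
        print('captions.shape:', captions.shape)  # e.g. torch.Size([10, 13])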


  11. Network Architecture
    A pre-trained CNN (ResNet-50) Encoder + RNN Decoder

  12. Pre-trained CNN Encoder (ResNet-50)
    The pre-trained ResNet-50 architecture was used in this project (with the final fully connected layer
    removed) to extract features from the images. The output was flattened to a vector, then passed
    through a Linear layer to transform the feature vector to be the same size as the word embedding.
    (ResNet-50 architecture diagram shown.)
    In sum:
    1. Load pre-trained network - use early layers to
    perform basic feature extraction
    2. Remove final fully connected layer & replace with a
    new Linear layer to learn specific info about data
    3. Train the network
    4. Predict captions for novel images & assess accuracy
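    A sketch of how such an encoder can be defined in PyTorch, following
    steps 1 & 2 above (embed_size and freezing the pre-trained weights are
    assumptions consistent with training only the new layer):

        import torch.nn as nn
        import torchvision.models as models

        class EncoderCNN(nn.Module):
            def __init__(self, embed_size):
                super().__init__()
                resnet = models.resnet50(pretrained=True)  # load pre-trained network
                for param in resnet.parameters():
                    param.requires_grad_(False)            # freeze early layers
                modules = list(resnet.children())[:-1]     # remove final FC layer
                self.resnet = nn.Sequential(*modules)
                # New trainable Linear layer: feature vector -> word-embedding size.
                self.embed = nn.Linear(resnet.fc.in_features, embed_size)

            def forward(self, images):
                features = self.resnet(images)                  # (batch, 2048, 1, 1)
                features = features.view(features.size(0), -1)  # flatten to a vector
                return self.embed(features)                     # (batch, embed_size)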


  13. Define RNN Decoder Architecture
    As the diagram illustrates, the RNN includes the following parts:
    1. Embedding Layer: maps captions -> embedded word vector of embed_size
       shape: (vocab_size, embed_size)
    2. LSTM Layer: accepts (1) the embedded word vector from the Embedding
       Layer, and (2) the embedded image feature vector from the CNN
       shape: (embed_size, hidden_size, num_layers)
    3. Hidden Layer: (not pictured) maps LSTM output to the required PyTorch
       tensor shape: ([batch_size, captions.shape[1], vocab_size])
    4. Linear Layer: will act like a softmax layer
       shape: (hidden_size, vocab_size)
    5. (Dropout): additional layer (not pictured) to help training - placed
       between the LSTM and FC layer
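    A sketch of a decoder with those five parts (the dropout probability
    and batch_first layout are assumptions):

        import torch
        import torch.nn as nn

        class DecoderRNN(nn.Module):
            def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, embed_size)   # 1. Embedding Layer
                self.lstm = nn.LSTM(embed_size, hidden_size,
                                    num_layers, batch_first=True)   # 2. LSTM Layer
                self.dropout = nn.Dropout(0.5)                      # 5. Dropout
                self.linear = nn.Linear(hidden_size, vocab_size)    # 4. Linear Layer

            def forward(self, features, captions):
                # Embed caption words, dropping <end> so inputs & targets align.
                embeddings = self.embed(captions[:, :-1])
                # Prepend the image feature vector as the first LSTM input.
                inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
                hiddens, _ = self.lstm(inputs)
                # 3. Map LSTM output to (batch_size, captions.shape[1], vocab_size).
                return self.linear(self.dropout(hiddens))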


  14. Train the Network!
    I trained the network twice, on two different vocabulary lists (20+ hours of training).
    More vocabulary makes it more difficult to predict accurately.
    Ultimately, I chose this model for the final prediction step.
    (Training Loss & Training Perplexity plots shown for both runs:
    vocab_threshold = 3 with epochs = 3, steps = 6471;
    vocab_threshold = 5 with epochs = 3, steps = 3236.)
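    The perplexity plotted above is simply the exponential of the
    cross-entropy loss. A minimal sketch of one training step, reusing the
    encoder/decoder sketches and a sampled batch (the optimizer choice and
    which parameters train are assumptions):

        import numpy as np
        import torch
        import torch.nn as nn

        criterion = nn.CrossEntropyLoss()
        # Train the decoder plus the encoder's new Linear layer only.
        params = list(decoder.parameters()) + list(encoder.embed.parameters())
        optimizer = torch.optim.Adam(params)

        features = encoder(images)             # encode a batch of images
        outputs = decoder(features, captions)  # word scores per position
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print('Loss: %.4f, Perplexity: %.4f' % (loss.item(), np.exp(loss.item())))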


  15. Get Results
    Where did and didn’t the model perform well?

  16. How well did the model perform?
    (Example test images shown with their predicted captions: some are good
    caption predictions; others are not very accurate...)


  17. Project Reflection
    What I learned by training the CNN-RNN model

  18. Reflection
    One of the most important pieces of code I needed to program was the
    sample() method, which allowed me to check the accuracy of the trained
    model. In this method, we:
    1. Pass initial inputs & states to the RNN to obtain the LSTM output & new states
    2. Pass the LSTM output through the fully-connected layer to obtain token scores
    3. Get the probability of the most likely next word, & the predicted word, from the scores
    4. Add the predicted word to our output array
    5. Prepare the last predicted word as the next LSTM input, to close the RNN loop
    Initially, I had trouble programming this method and kept outputting an
    array of repeated word scores: [1, 119, 119, 119, 119, … 0]. I realized
    my mistake: I wasn’t passing the predicted word BACK into the RNN as the
    next input, so it only predicted the first or second word continuously,
    over and over again.
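    A sketch of that sample() method with the fix in place, written as a
    DecoderRNN method following the five steps above (max_len and the
    stopping check are assumptions, with 1 as the <end> index from the
    preprocessing slide):

        def sample(self, inputs, states=None, max_len=20):
            # inputs: the embedded image feature, shape (1, 1, embed_size).
            output_ids = []
            for _ in range(max_len):
                hiddens, states = self.lstm(inputs, states)  # 1. output & new states
                scores = self.linear(hiddens.squeeze(1))     # 2. token scores
                _, predicted = scores.max(dim=1)             # 3. most likely next word
                output_ids.append(predicted.item())          # 4. save the prediction
                if predicted.item() == 1:                    # stop at the <end> token
                    break
                # 5. Feed the predicted word BACK in as the next LSTM input.
                inputs = self.embed(predicted).unsqueeze(1)
            return output_ids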


  19. Conclusion & Future Research
    What about other Pre-trained models? Longer training?

  20. Different Training
    This project was intended to introduce the concept of using a pre-trained
    CNN as an Encoder to perform feature extraction from an image dataset,
    then feeding the results into an RNN Decoder that utilizes LSTM cells to
    predict image captions. My results are sufficient to show that the model
    works. However, for future research, I’m interested in investigating how
    much better a different pre-trained CNN model might perform compared to
    ResNet-50 (the model used in this project). I’m also interested in
    investigating what difference a greater number of training epochs might
    make in the results.
    (ResNet-50 architecture diagram shown.)


  21. THANKS!
    The Jupyter Notebooks used in this project, including code and output, can be found at:
    https://github.com/jekkilekki/computer-vision/tree/master/Image%20Captions
