
OpenAI GPT and Friends


A brief look at the new generation of models that are dominating Natural Language Processing (NLP). Focusing on GPT but also covering ULMFiT, ELMo, BERT, and GPT-2. Packed with details and insight from the original papers and secondary sources, plus some original content.


Transcript

  1. OpenAI
    Generative Pre-Training
    … and friends
    ULMFiT, ELMo, GPT, BERT, GPT-2
    Robin Ranjit Singh Chauhan
    https://twitter.com/robinc


  2. “with only 100 labeled examples, it matches the
    performance of training from scratch on 100x more data”
    -- ULMFiT paper
    “Our goal is to learn a universal representation that
    transfers with little adaptation to a wide range of tasks”
    -- GPT paper
    ● Heavy: unsupervised pre-training
    ○ On the order of days on large TPU clusters
    ● Light: supervised fine-tuning per task
    ○ On the order of hours on a GPU (see the sketch after this slide)
    Point?
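A minimal sketch of the light fine-tuning step, assuming the Hugging Face transformers library and a GPT-2 checkpoint (neither is named in the deck; both are illustrative choices). The expensive pre-training is already paid for; we only adapt a small classification head on a handful of labeled examples:

```python
# Sketch of supervised fine-tuning on top of a pretrained language model.
# Library (Hugging Face transformers) and checkpoint ("gpt2") are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Tiny labeled set, standing in for the "100 labeled examples" regime
texts = ["great movie", "terrible movie"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                 # a few passes is often enough
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At realistic dataset sizes a loop like this finishes in hours on a single GPU, which is the asymmetry the slide is pointing at.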


  3. ImageNet moment for NLP?
    ● OLD Deep Learning for NLP
    ○ Download stale, static word vectors
    ○ Build what pitiful models we can from such lowly inputs
    ● Deep Learning for images
    ○ Download a fat network pre-trained on ImageNet
    ○ Fine-tune for freshness
    ● NEW Deep Learning for NLP
    ○ Download a fat, happy network pre-trained as a language model
    ○ Fine-tune for freshness
    ○ (OR: GPT-2 zero-shot transfer)


  4. “The next AI revolution will not be supervised or purely reinforced.
    The future is self-supervised learning with massive amounts of data
    and very large networks.”
    -- Yann LeCun


  5. The Secret to Intelligence is... Peek-a-boo??


  6. Genealogy
    ● Dai & Le 2015 [Google]
    ○ Use unlabeled data to improve sequence learning with LSTM RNNs
    ● ULMFiT: Howard and Ruder 2018 [fast.ai]
    ○ LSTM-based
    ● ELMo 2018 [UWash]
    ○ Embeddings from Language Models; LSTM-based
    ● New generation: LSTM -> Transformer
    ○ GPT [OpenAI]
    ■ BERT [Google]
    ■ GPT-2 [OpenAI]
    ■ BigBird (BERT-based) [Microsoft]


  7. GPT: Unsupervised Pre-Training
    ● BooksCorpus
    ○ 7,000 unique unpublished books
    ○ Adventure, Fantasy, and Romance
    ● Pre-training data has long stretches of contiguous text
    ○ Lets the generative model learn to condition on long-range information
    (next-token objective sketched below)
    ○ ELMo: similar-sized data (the 1B Word Benchmark), but shuffled at the
    sentence level -- no long-range structure
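To make the generative pre-training objective concrete, here is a minimal next-token language-modeling sketch in PyTorch. The tiny model is a stand-in for GPT's stack of Transformer decoder blocks; sizes and names are illustrative:

```python
# Minimal sketch of the unsupervised pre-training objective GPT uses:
# left-to-right next-token language modeling with a cross-entropy loss.
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.body = nn.Linear(embed_dim, embed_dim)   # stand-in for masked self-attention blocks
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):                        # tokens: (batch, seq)
        h = torch.tanh(self.body(self.embed(tokens)))
        return self.head(h)                           # logits: (batch, seq, vocab)

model = TinyLM()
tokens = torch.randint(0, vocab_size, (2, 16))        # a batch of token ids from the corpus

logits = model(tokens[:, :-1])                        # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),                   # flatten to (batch*seq, vocab)
    tokens[:, 1:].reshape(-1),                        # shifted targets
)
loss.backward()                                       # maximize log-likelihood of the text
```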


  8. (image-only slide)

  9. Transformer (orig. for Translation)
    ● Many layers of multi-head attention
    ● Runs over the whole sequence at once (one full pass per output token when generating)
    ● Position encoding
    ○ Sinusoidal encoding of token position (GPT swaps in learned position embeddings)
    ● Latest version: Universal Transformers, Dehghani et al. 2019
    (Architecture diagram: the encoder input feeds attention as values and keys;
    the output-so-far feeds attention as queries; decoder self-attention draws
    keys, values, and queries all from the output-so-far)
    GPT uses only the “Decoder” part of the Transformer (attention sketched below)
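A minimal sketch of the masked (causal) scaled dot-product attention inside those decoder blocks, for a single head with illustrative sizes; GPT stacks many heads and many layers of this:

```python
# Masked (causal) scaled dot-product self-attention, one head, toy sizes.
import math
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project into queries / keys / values
    scores = q @ k.T / math.sqrt(k.shape[-1])      # (seq, seq) similarity matrix
    # causal mask: position i may only attend to positions <= i
    seq = x.shape[0]
    mask = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)        # attention distribution per position
    return weights @ v                             # (seq, d_head) weighted sum of values

d_model, d_head, seq_len = 64, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)      # one head; GPT uses many heads and layers
```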


  10. Transformer model vs classic RNN
    ● Inductive power
    ○ RNNs are weak with distant dependencies
    ● Computational concerns
    ○ Processes the whole sequence at once
    ○ Each timestep can be run in parallel
    ○ But: computation grows as the square of sequence length (arithmetic sketched below)
    ■ Very expensive for longer sequences
    ■ GPT-2 context: 1024 tokens
    ■ BERT Large context: 512 tokens
    ● Attention: every token position contributes a training signal in one backward pass
    ○ whereas an RNN requires backprop through many timesteps
    See Yannic Kilcher’s video on Attention Is All You Need
    https://www.youtube.com/watch?v=iDulhoQ2pro
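A back-of-envelope illustration of the quadratic cost noted above: the attention score matrix has seq_len × seq_len entries per head per layer, so doubling the context quadruples that work:

```python
# Why self-attention cost grows as the square of sequence length:
# every position attends to every other position.
for seq_len in (512, 1024, 2048):        # BERT Large, GPT-2, and a hypothetical 2x
    entries = seq_len * seq_len
    print(f"seq_len={seq_len:5d} -> {entries:>9,} attention scores per head per layer")

# seq_len=  512 ->   262,144 attention scores per head per layer
# seq_len= 1024 -> 1,048,576 attention scores per head per layer
# seq_len= 2048 -> 4,194,304 attention scores per head per layer
```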


  11. GPT: Transformer + task-specific input transformations; the aim is to minimize custom per-task components


  12. Byte Pair Encoding
    “open-vocabulary translation by encoding rare
    and unknown words as sequences of subword
    units”
    -- Neural Machine Translation of Rare Words
    with Subword Units, Sennrich et al 2015
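The Sennrich et al. paper includes a compact reference implementation of the BPE merge-learning loop; the sketch below closely follows it, using a toy vocabulary for illustration:

```python
# BPE merge learning, closely following the reference snippet in Sennrich et al. 2015.
# Words are pre-split into characters plus an end-of-word marker </w>; the most
# frequent adjacent symbol pair is merged repeatedly, growing a subword vocabulary.
import re
import collections

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of the chosen pair with its merged symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[pattern.sub(''.join(pair), word)] = v_in[word]
    return v_out

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                       # number of merges sets the subword budget
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best)                           # e.g. ('e', 's'), ('es', 't'), ...
```

GPT applies the same idea with a vocabulary of roughly 40,000 merges, so any string can be tokenized without out-of-vocabulary words.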


  13. BERT: Bidirectional Encoder Representations from Transformers
    ● Bidirectionality; fine-tuning with additional task-specific layers
    ● Explicitly designed to be similar to GPT
    ○ “... many of the design decisions in BERT
    were intentionally chosen to be as close
    to GPT as possible so that the two
    methods could be minimally compared….
    ○ ...The core argument of this work is that the
    two novel pre-training tasks presented in
    Section 3.3 account for the majority of the
    empirical improvements” -- BERT paper
    ● Pretrained on
    ○ BooksCorpus (800M words)
    ○ Wikipedia (2,500M words)


  14. BERT: Pretraining Tasks
    Masked Language Model
    ● Bi-directional design -> can’t use left-to-right generative pre-training
    ● Randomly chosen token masks (~15% of tokens), fixed at data-creation time
    (masking sketched below)
    Next Sentence Prediction
    ● Given two sentences A and B: is B the sentence directly after A?
    Bidirectionality gives more information for GLUE tasks,
    but it precludes text generation.
    Image from http://jalammar.github.io/illustrated-bert/
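A minimal sketch of that masked-LM input preparation: roughly 15% of positions are chosen, and of those 80% become [MASK], 10% a random token, and 10% stay unchanged (the split follows the BERT paper; the token ids and helper name are illustrative):

```python
# BERT-style masked-language-model input preparation (illustrative ids, not BERT's real vocab).
import random

MASK_ID, VOCAB_SIZE = 103, 30000   # illustrative values

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignore in the loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                        # model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
            # else 10%: keep the original token
    return inputs, labels

inputs, labels = mask_tokens([2023, 2003, 1037, 7099, 6251, 1012])
```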


  15. GPT-2
    ● “Zero-shots to SOTA on 7 out of 8 tested
    language modeling datasets…”
    ○ No fine tuning step!
    ● 1.5B parameter Transformer
    ● Trained on 40GB of internet text
    ○ Reddit outbound links with 3+ karma as a proxy for human-curated content
    ● Larger, plus very minor changes vs GPT


  16. GPT-2: Zero Shot SOTA
    ● Summarization
    ○ Append TL;DR:
    ● Translation
    ○ condition the language model on a context of example
    pairs of the format
    ■ english sentence = french sentence
    ○ and then after a final prompt of
    ■ english sentence =
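A sketch of how these zero-shot tasks are posed purely as text prompts, following the formats on this slide; the example sentences and the gpt2_generate helper are hypothetical placeholders:

```python
# Zero-shot task framing as plain text prompts, per the formats above.

def summarization_prompt(article: str) -> str:
    # Summarization: append "TL;DR:" and let the model continue.
    return article + "\nTL;DR:"

def translation_prompt(example_pairs, source_sentence: str) -> str:
    # Translation: a few "english sentence = french sentence" demonstrations,
    # then a final "english sentence =" for the model to complete.
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)

prompt = translation_prompt(
    [("The house is blue.", "La maison est bleue."),
     ("I like coffee.", "J'aime le café.")],
    "Where is the station?",
)
# completion = gpt2_generate(prompt)   # hypothetical helper; the continuation is the translation
```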


  17. GPT-2: Zero-shot results
    (Chart comparing results for GPT-1, BERT Large, and GPT-2)


  18. (image-only slide)

  19. Input
    In a shocking finding, scientist discovered a herd of
    unicorns living in a remote, previously unexplored valley,
    in the Andes Mountains. Even more surprising to the
    researchers was the fact that the unicorns spoke perfect
    English.
    Generated by GPT-2 XL
    The scientist named the population, after their distinctive
    horn, Ovid’s Unicorn. These four-horned, silver-white unicorns
    were previously unknown to science.
    Now, after almost two centuries, the mystery of what sparked
    this odd phenomenon is finally solved.
    Dr. Jorge Pérez, an evolutionary biologist from the University
    of La Paz, and several companions, were exploring the Andes
    Mountains when they found a small valley, with no other
    animals or humans. Pérez noticed that the valley had what
    appeared to be a natural fountain, surrounded by two peaks
    of rock and silver snow.
    However, Pérez also pointed out that it is likely that the only
    way of knowing for sure if unicorns are indeed the
    descendants of a lost alien race is through DNA. “But they
    seem to be able to communicate in English quite well, which I […]
    Input
    Data Science Lightning Presentation Rules
    - The subject should be something about data science.
    - Total Length: 360 to 400 seconds
    - Slide length: 10 to 40 seconds
    - Prepare for 10 minutes of questions after each
    presentation.
    Generated by GPT-2 small (talktotransformer.com)
    - For each lecture, two slides should be provided. No single
    slide.
    - For each session, two presentations should be provided. Each
    presentation should be 15 minutes long.
    - At least one lecture should be on a topic specific to the topic at
    hand.
    - No more than two questions should be asked in a single
    lecture.
    - For each session, we need to create a question about any part
    of our dataset and ask it directly. The number of questions
    should be based upon the volume of data analyzed.
    - The question should be short enough that anyone with a
    background in data science knows what it is.
    - We should present each segment of the problem with some
    numbers: - The size of the dataset: - The number of points of
    each dataset as well as its number of data elements (like lines).
    - The distribution of all data points as follows: * - Random
    numbers. * - Normal


  20. Input (human-written)
    Recycling is good for the world.
    NO! YOU COULD NOT BE MORE WRONG!!
    MODEL COMPLETION (MACHINE-WRITTEN, 25 TRIES)
    Recycling is NOT good for the world. It is bad for the
    environment, it is bad for our health, and it is bad for
    our economy. I’m not kidding. Recycling is not good
    for the environment. It is destructive to the earth and
    it is a major contributor to global warming. Recycling
    is not good for our health. It contributes to obesity
    and diseases like heart disease and cancer.
    Recycling is bad for our economy. It increases the
    cost of a product, and in turn, the price of everything
    that is made with that product. Recycling is not good
    for our nation. We pay a tremendous price for the
    privilege of having the world’s most advanced and
    efficient recycling system. Recycling is a huge,
    colossal waste of time, energy, money, and
    resources. And THAT is why we need to get back to
    basics and get back to basics in our recycling efforts.
    One of the best ways to start is to look at the process
    of creating a paper product. When you make a paper
    product, it is basically a long chain of materials.
    Everything from the raw materials (wood, cardboard,
    paper, etc.), to the reagents (dyes, solvents, etc.) to ….
    We can also imagine the application of these models for malicious
    purposes, including the following (or other applications we can’t
    yet anticipate):
    ● Generate misleading news articles
    ● Impersonate others online
    ● Automate the production of abusive or faked content to post
    on social media
    ● Automate the production of spam/phishing content
    https://openai.com/blog/better-language-models


  21. Future
