
OpenAI GPT and Friends


A brief look at the new generation of models that are dominating Natural Language Processing (NLP). Focusing on GPT but also covering ULMFiT, ELMo, BERT, and GPT-2. Packed with details and insight from the original papers and secondary sources, plus some original content.


Transcript

  1. OpenAI
    Generative Pre-Training
    … and friends
    ULMFiT, ELMo, GPT, BERT, GPT-2
    Robin Ranjit Singh Chauhan
    https://twitter.com/robinc


  2. “with only 100 labeled examples, it matches the
    performance of training from scratch on 100x more data”
    -- ULMFiT paper
    “Our goal is to learn a universal representation that
    transfers with little adaptation to a wide range of tasks”
    -- GPT paper
    ● Heavy: unsupervised pre-training
    ○ On the order of days on large TPU clusters
    ● Light: supervised fine-tuning per task
    ○ On the order of hours on a GPU (see the sketch after this slide)
    Point?
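A minimal sketch of the light fine-tuning step, assuming the Hugging Face transformers library and a GPT-2 checkpoint (neither is named in the deck; both are illustrative choices). The expensive pre-training is already paid for; we only adapt a small classification head on a handful of labeled examples:

```python
# Sketch of supervised fine-tuning on top of a pretrained language model.
# Library (Hugging Face transformers) and checkpoint ("gpt2") are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Tiny labeled set, standing in for the "100 labeled examples" regime
texts = ["great movie", "terrible movie"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                 # a few passes is often enough
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At realistic dataset sizes a loop like this finishes in hours on a single GPU, which is the asymmetry the slide is pointing at.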


  3. ImageNet moment for NLP?
    ● OLD Deep Learning for NLP
    ○ Download stale, static word vectors
    ○ Build what pitiful models we can from such lowly inputs
    ● Deep Learning for images
    ○ Download a fat network pre-trained on ImageNet
    ○ Fine-tune for freshness
    ● NEW Deep Learning for NLP
    ○ Download a fat, happy network pre-trained as a language model
    ○ Fine-tune for freshness
    ○ (OR: GPT-2 zero-shot transfer)


  4. “The next AI revolution will not be supervised or purely reinforced.
    The future is self-supervised learning with massive amounts of data
    and very large networks.”
    -- Yann LeCun


  5. The Secret to Intelligence is... Peek-a-boo??


  6. Genealogy
    ● Dai & Le 2015 [Google]
    ○ Use unlabeled data to improve sequence learning with LSTM RNNs
    ● ULMFiT: Howard and Ruder 2018 [fast.ai]
    ○ LSTM-based
    ● ELMo 2018 [UWash]
    ○ Embeddings from Language Models; LSTM-based
    ● New generation: LSTM -> Transformer
    ○ GPT [OpenAI]
    ■ BERT [Google]
    ■ GPT-2 [OpenAI]
    ■ BigBird (BERT-based) [Microsoft]


  7. GPT: Unsupervised Pre-Training
    ● BooksCorpus
    ○ 7,000 unique unpublished books
    ○ Adventure, Fantasy, and Romance
    ● Pre-training data has long stretches of contiguous text
    ○ Lets the generative model learn to condition on long-range information
    (next-token objective sketched below)
    ○ ELMo: similar-sized data (the 1B Word Benchmark), but shuffled at the
    sentence level -- no long-range structure
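To make the generative pre-training objective concrete, here is a minimal next-token language-modeling sketch in PyTorch. The tiny model is a stand-in for GPT's stack of Transformer decoder blocks; sizes and names are illustrative:

```python
# Minimal sketch of the unsupervised pre-training objective GPT uses:
# left-to-right next-token language modeling with a cross-entropy loss.
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.body = nn.Linear(embed_dim, embed_dim)   # stand-in for masked self-attention blocks
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):                        # tokens: (batch, seq)
        h = torch.tanh(self.body(self.embed(tokens)))
        return self.head(h)                           # logits: (batch, seq, vocab)

model = TinyLM()
tokens = torch.randint(0, vocab_size, (2, 16))        # a batch of token ids from the corpus

logits = model(tokens[:, :-1])                        # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),                   # flatten to (batch*seq, vocab)
    tokens[:, 1:].reshape(-1),                        # shifted targets
)
loss.backward()                                       # maximize log-likelihood of the text
```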


  8. (image-only slide)

  9. Transformer (orig. for Translation)
    ● Many layers of multi-head attention
    ● Runs over the whole sequence at once (one full pass per output token when generating)
    ● Position encoding
    ○ Sinusoidal encoding of token position (GPT swaps in learned position embeddings)
    ● Latest version: Universal Transformers, Dehghani et al. 2019
    (Architecture diagram: the encoder input feeds attention as values and keys;
    the output-so-far feeds attention as queries; decoder self-attention draws
    keys, values, and queries all from the output-so-far)
    GPT uses only the “Decoder” part of the Transformer (attention sketched below)
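A minimal sketch of the masked (causal) scaled dot-product attention inside those decoder blocks, for a single head with illustrative sizes; GPT stacks many heads and many layers of this:

```python
# Masked (causal) scaled dot-product self-attention, one head, toy sizes.
import math
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project into queries / keys / values
    scores = q @ k.T / math.sqrt(k.shape[-1])      # (seq, seq) similarity matrix
    # causal mask: position i may only attend to positions <= i
    seq = x.shape[0]
    mask = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)        # attention distribution per position
    return weights @ v                             # (seq, d_head) weighted sum of values

d_model, d_head, seq_len = 64, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)      # one head; GPT uses many heads and layers
```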


  10. Transformer model vs classic RNN
    ● Inductive power
    ○ RNNs are weak with distant dependencies
    ● Computational concerns
    ○ Processes the whole sequence at once
    ○ Each timestep can be run in parallel
    ○ But: computation grows as the square of sequence length (arithmetic sketched below)
    ■ Very expensive for longer sequences
    ■ GPT-2 context: 1024 tokens
    ■ BERT Large context: 512 tokens
    ● Attention: every token position contributes a training signal in one backward pass
    ○ whereas an RNN requires backprop through many timesteps
    See Yannic Kilcher’s video on Attention Is All You Need
    https://www.youtube.com/watch?v=iDulhoQ2pro
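A back-of-envelope illustration of the quadratic cost noted above: the attention score matrix has seq_len × seq_len entries per head per layer, so doubling the context quadruples that work:

```python
# Why self-attention cost grows as the square of sequence length:
# every position attends to every other position.
for seq_len in (512, 1024, 2048):        # BERT Large, GPT-2, and a hypothetical 2x
    entries = seq_len * seq_len
    print(f"seq_len={seq_len:5d} -> {entries:>9,} attention scores per head per layer")

# seq_len=  512 ->   262,144 attention scores per head per layer
# seq_len= 1024 -> 1,048,576 attention scores per head per layer
# seq_len= 2048 -> 4,194,304 attention scores per head per layer
```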


  11. GPT: Transformer + task-specific input transformations; the aim is to minimize custom per-task components


  12. Byte Pair Encoding
    “open-vocabulary translation by encoding rare
    and unknown words as sequences of subword
    units”
    -- Neural Machine Translation of Rare Words
    with Subword Units, Sennrich et al 2015
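The Sennrich et al. paper includes a compact reference implementation of the BPE merge-learning loop; the sketch below closely follows it, using a toy vocabulary for illustration:

```python
# BPE merge learning, closely following the reference snippet in Sennrich et al. 2015.
# Words are pre-split into characters plus an end-of-word marker </w>; the most
# frequent adjacent symbol pair is merged repeatedly, growing a subword vocabulary.
import re
import collections

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of the chosen pair with its merged symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[pattern.sub(''.join(pair), word)] = v_in[word]
    return v_out

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                       # number of merges sets the subword budget
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best)                           # e.g. ('e', 's'), ('es', 't'), ...
```

GPT applies the same idea with a vocabulary of roughly 40,000 merges, so any string can be tokenized without out-of-vocabulary words.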


  13. BERT: Bidirectional Encoder Representations from Transformers
    ● Bidirectionality; fine-tuning with additional task-specific layers
    ● Explicitly designed to be similar to GPT
    ○ “... many of the design decisions in BERT
    were intentionally chosen to be as close
    to GPT as possible so that the two
    methods could be minimally compared….
    ○ ...The core argument of this work is that the
    two novel pre-training tasks presented in
    Section 3.3 account for the majority of the
    empirical improvements” -- BERT paper
    ● Pretrained on
    ○ BooksCorpus (800M words)
    ○ Wikipedia (2,500M words)


  14. BERT: Pretraining Tasks
    Masked Language Model
    ● Bi-directional design -> can’t use left-to-right generative pre-training
    ● Randomly chosen token masks (~15% of tokens), fixed at data-creation time
    (masking sketched below)
    Next Sentence Prediction
    ● Given two sentences A and B: is B the sentence directly after A?
    Bidirectionality gives more information for GLUE tasks,
    but it precludes text generation.
    Image from http://jalammar.github.io/illustrated-bert/
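A minimal sketch of that masked-LM input preparation: roughly 15% of positions are chosen, and of those 80% become [MASK], 10% a random token, and 10% stay unchanged (the split follows the BERT paper; the token ids and helper name are illustrative):

```python
# BERT-style masked-language-model input preparation (illustrative ids, not BERT's real vocab).
import random

MASK_ID, VOCAB_SIZE = 103, 30000   # illustrative values

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignore in the loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                        # model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
            # else 10%: keep the original token
    return inputs, labels

inputs, labels = mask_tokens([2023, 2003, 1037, 7099, 6251, 1012])
```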


  15. GPT-2
    ● “Zero-shots to SOTA on 7 out of 8 tested
    language modeling datasets…”
    ○ No fine tuning step!
    ● 1.5B parameter Transformer
    ● Trained on 40GB of internet text
    ○ Reddit outbound links with 3+ karma as a proxy for human-curated content
    ● Larger, plus very minor changes vs GPT


  16. GPT-2: Zero Shot SOTA
    ● Summarization
    ○ Append TL;DR:
    ● Translation
    ○ condition the language model on a context of example
    pairs of the format
    ■ english sentence = french sentence
    ○ and then after a final prompt of
    ■ english sentence =
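A sketch of how these zero-shot tasks are posed purely as text prompts, following the formats on this slide; the example sentences and the gpt2_generate helper are hypothetical placeholders:

```python
# Zero-shot task framing as plain text prompts, per the formats above.

def summarization_prompt(article: str) -> str:
    # Summarization: append "TL;DR:" and let the model continue.
    return article + "\nTL;DR:"

def translation_prompt(example_pairs, source_sentence: str) -> str:
    # Translation: a few "english sentence = french sentence" demonstrations,
    # then a final "english sentence =" for the model to complete.
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)

prompt = translation_prompt(
    [("The house is blue.", "La maison est bleue."),
     ("I like coffee.", "J'aime le café.")],
    "Where is the station?",
)
# completion = gpt2_generate(prompt)   # hypothetical helper; the continuation is the translation
```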


  17. GPT-2: Zero-shot results
    (Chart comparing results for GPT-1, BERT Large, and GPT-2)


  18. (image-only slide)

  19. Input
    In a shocking finding, scientist discovered a herd of
    unicorns living in a remote, previously unexplored valley,
    in the Andes Mountains. Even more surprising to the
    researchers was the fact that the unicorns spoke perfect
    English.
    Generated by GPT-2 XL
    The scientist named the population, after their distinctive
    horn, Ovid’s Unicorn. These four-horned, silver-white unicorns
    were previously unknown to science.
    Now, after almost two centuries, the mystery of what sparked
    this odd phenomenon is finally solved.
    Dr. Jorge Pérez, an evolutionary biologist from the University
    of La Paz, and several companions, were exploring the Andes
    Mountains when they found a small valley, with no other
    animals or humans. Pérez noticed that the valley had what
    appeared to be a natural fountain, surrounded by two peaks
    of rock and silver snow.
    However, Pérez also pointed out that it is likely that the only
    way of knowing for sure if unicorns are indeed the
    descendants of a lost alien race is through DNA. “But they
    seem to be able to communicate in English quite well, which I […]
    Input
    Data Science Lightning Presentation Rules
    - The subject should be something about data science.
    - Total Length: 360 to 400 seconds
    - Slide length: 10 to 40 seconds
    - Prepare for 10 minutes of questions after each
    presentation.
    Generated by GPT-2 small (talktotransformer.com)
    - For each lecture, two slides should be provided. No single
    slide.
    - For each session, two presentations should be provided. Each
    presentation should be 15 minutes long.
    - At least one lecture should be on a topic specific to the topic at
    hand.
    - No more than two questions should be asked in a single
    lecture.
    - For each session, we need to create a question about any part
    of our dataset and ask it directly. The number of questions
    should be based upon the volume of data analyzed.
    - The question should be short enough that anyone with a
    background in data science knows what it is.
    - We should present each segment of the problem with some
    numbers: - The size of the dataset: - The number of points of
    each dataset as well as its number of data elements (like lines).
    - The distribution of all data points as follows: * - Random
    numbers. * - Normal


  20. Input (human-written)
    Recycling is good for the world.
    NO! YOU COULD NOT BE MORE WRONG!!
    MODEL COMPLETION (MACHINE-WRITTEN, 25 TRIES)
    Recycling is NOT good for the world. It is bad for the
    environment, it is bad for our health, and it is bad for
    our economy. I’m not kidding. Recycling is not good
    for the environment. It is destructive to the earth and
    it is a major contributor to global warming. Recycling
    is not good for our health. It contributes to obesity
    and diseases like heart disease and cancer.
    Recycling is bad for our economy. It increases the
    cost of a product, and in turn, the price of everything
    that is made with that product. Recycling is not good
    for our nation. We pay a tremendous price for the
    privilege of having the world’s most advanced and
    efficient recycling system. Recycling is a huge,
    colossal waste of time, energy, money, and
    resources. And THAT is why we need to get back to
    basics and get back to basics in our recycling efforts.
    One of the best ways to start is to look at the process
    of creating a paper product. When you make a paper
    product, it is basically a long chain of materials.
    Everything from the raw materials (wood, cardboard,
    paper, etc.), to the reagents (dyes, solvents, etc.) to ….
    We can also imagine the application of these models for malicious
    purposes, including the following (or other applications we can’t
    yet anticipate):
    ● Generate misleading news articles
    ● Impersonate others online
    ● Automate the production of abusive or faked content to post
    on social media
    ● Automate the production of spam/phishing content
    https://openai.com/blog/better-language-models


  21. Future
