Slide 1

Slide 1 text

OpenAI Generative Pre-Training … and friends: ULMFiT, ELMo, GPT, BERT, GPT-2

Robin Ranjit Singh Chauhan
https://twitter.com/robinc

Slide 2

Slide 2 text

Point?

“With only 100 labeled examples, it matches the performance of training from scratch on 100x more data” -- ULMFiT paper

“Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks” -- GPT paper

● Heavy: Unsupervised pre-training
  ○ Like days on large TPU clusters
● Light: Supervised fine-tuning per task
  ○ Like hours on a GPU

Slide 3

Slide 3 text

ImageNet moment for NLP?

● OLD Deep Learning for NLP
  ○ Download stale, static word vectors
  ○ Build what pitiful models we can from such lowly inputs
● Deep Learning for images
  ○ Download a fat network pre-trained on ImageNet
  ○ Fine-tune for freshness
● NEW Deep Learning for NLP
  ○ Download a fat, happy network pre-trained as a language model + fine-tune for freshness (see the sketch below)
  ○ (OR: GPT-2: zero-shot transfer)
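Below is a minimal sketch of the "download a pre-trained language model, then fine-tune" recipe. It assumes the Hugging Face transformers library (not something used in the deck) and a toy one-example "dataset"; it illustrates the shape of the workflow, not the exact procedure from any of the papers.

```python
# Sketch: load weights from unsupervised pre-training, then fine-tune on a tiny
# task-specific text. Assumes `pip install torch transformers`.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")   # the "fat, happy" pre-trained network

# Supervised fine-tuning step: keep the language-modeling objective, but train on
# task-formatted text (here a single toy sentiment example).
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
batch = tokenizer("review: a delightful film. sentiment: positive", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss   # next-token prediction loss
loss.backward()
optimizer.step()
```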

Slide 4

Slide 4 text

The next AI revolution will not be supervised or purely reinforced. The future is self-supervised learning with massive amounts of data and very large networks. --Yann LeCun

Slide 5

Slide 5 text

The Secret to Intelligence is... Peek-a-boo??

Slide 6

Slide 6 text

Genealogy ● Dai & Le 2015 [Google] ○ use unlabeled data to improve sequence learning with LSTM RNN ● ULMFiT: Howard and Ruder 2018 [fast.ai] ○ LSTM-based ● ELMo [UWash] 2018 ○ Embeddings from Language Models; LSTM based ● New generation: LSTM -> Transformer ○ GPT [OpenAI] ■ BERT [Google] ■ GPT-2 [OpenAI] ■ BigBird (BERT-based) [Microsoft]

Slide 7

Slide 7 text

GPT: Unsupervised Pre-Training

● BooksCorpus
  ○ 7,000 unique unpublished books
  ○ Adventure, Fantasy, and Romance genres
● Pre-training data has long stretches of contiguous text
  ○ Allows the generative model to learn to condition on long-range information
  ○ ELMo: similar-size data, but sentences only -- no long-range structure

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Transformer (orig. for Translation)

● Many layers of multi-head attention (see the sketch below)
● Runs all at once (for each output token)
● Position encoding
  ○ Sinusoidal encoding of token position (intuitively, a continuous analogue of binary counting)
● Latest version
  ○ Universal Transformers, Dehghani et al. 2019

[Diagram: Input attention supplies Values + Keys; the output-so-far attention supplies Queries; self-attention uses Keys + Values + Queries]

GPT uses only the “Decoder” part of the Transformer.
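To make the multi-head attention bullet concrete, here is a minimal NumPy sketch of the scaled dot-product attention that each head computes. Shapes and random inputs are illustrative only; a real Transformer adds learned projections, multiple heads, masking, and position encodings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) query, key, and value matrices."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # all-pairs similarities
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted sum of values

seq_len, d_k = 4, 8
x = np.random.randn(seq_len, d_k)
out = scaled_dot_product_attention(x, x, x)            # self-attention: Q = K = V
print(out.shape)                                       # (4, 8)
```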

Slide 10

Slide 10 text

Transformer model vs classic RNN

● Inductive power
  ○ RNNs are weak with distant dependencies
● Computational concerns
  ○ Processes the whole sequence at once
  ○ Each timestep can be run in parallel
  ○ But: computation grows as the square of sequence length (see the sketch below)
    ■ Very expensive for longer sequences
    ■ GPT-2 context: 1024 tokens
    ■ BERT Large context: 512 tokens
● Attention: each token is a complete sample for backprop
  ○ Whereas an RNN requires multi-step backprop through time

See Yannic Kilcher’s video on Attention Is All You Need: https://www.youtube.com/watch?v=iDulhoQ2pro
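A rough back-of-the-envelope sketch of the quadratic cost: every head in every layer scores all pairs of positions, so the attention score matrix alone grows as seq_len². The head counts below are illustrative (12 as in the smallest GPT-2, 16 as in BERT Large).

```python
def attention_scores_per_layer(seq_len, n_heads):
    # One seq_len x seq_len score matrix per head, per layer.
    return n_heads * seq_len * seq_len

print(attention_scores_per_layer(1024, n_heads=12))  # 12,582,912 scores (GPT-2-like context)
print(attention_scores_per_layer(512, n_heads=16))   #  4,194,304 scores (BERT-Large-like context)
print(attention_scores_per_layer(2048, n_heads=12))  # doubling seq_len quadruples the count
```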

Slide 11

Slide 11 text

GPT: Transformer

Task-specific GPT variants aim to minimize custom per-task components.

Slide 12

Slide 12 text

Byte Pair Encoding

“open-vocabulary translation by encoding rare and unknown words as sequences of subword units”
-- Neural Machine Translation of Rare Words with Subword Units, Sennrich et al. 2015
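A minimal sketch of the BPE idea from Sennrich et al.: start from characters (plus an end-of-word marker) and repeatedly merge the most frequent adjacent symbol pair into a new subword unit. The toy corpus and merge count are made up for illustration; real tokenizers also apply the learned merges to encode unseen words.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """word_freqs: {tuple_of_symbols: count}. Returns the learned merge list."""
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
        merges.append(best)
        updated = {}
        for symbols, freq in word_freqs.items():       # replace the pair everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            updated[tuple(out)] = freq
        word_freqs = updated
    return merges

toy = {("l", "o", "w", "</w>"): 5,
       ("l", "o", "w", "e", "r", "</w>"): 2,
       ("n", "e", "w", "e", "s", "t", "</w>"): 6}
print(learn_bpe(toy, num_merges=3))   # e.g. [('w', 'e'), ...] -- frequent pairs become subwords
```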

Slide 13

Slide 13 text

BERT: Bidirectional Encoder Representations from Transformers

● Bidirectionality; fine-tuning with additional layers
● Explicitly designed to be similar to GPT
  ○ “... many of the design decisions in BERT were intentionally chosen to be as close to GPT as possible so that the two methods could be minimally compared ...
  ○ ... The core argument of this work is that the two novel pre-training tasks presented in Section 3.3 account for the majority of the empirical improvements” -- BERT paper
● Pre-trained on
  ○ BooksCorpus (800M words)
  ○ Wikipedia (2,500M words)

Slide 14

Slide 14 text

BERT: Pre-training Tasks

Masked Language Model (see the sketch below)
● Bidirectional design -> can’t use left-to-right generative pre-training
● Fixed word masks

Next Sentence Prediction
● Given two sentences A and B: is B the sentence directly after A?

Bidirectionality gives more information for GLUE tasks, but precludes text generation.

Image from http://jalammar.github.io/illustrated-bert/
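A minimal sketch of how the masked language model input is built: a fraction of tokens is replaced by a [MASK] symbol and only those positions are scored against the original tokens. The 15% rate follows the BERT paper, but this sketch omits BERT's 80/10/10 mask/random/keep refinement and uses whole words rather than WordPiece tokens.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)   # model sees the mask ...
            labels.append(tok)          # ... and must predict the original token
        else:
            inputs.append(tok)
            labels.append(None)         # position not scored in the loss
    return inputs, labels

print(mask_tokens("the unicorns spoke perfect english".split()))
```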

Slide 15

Slide 15 text

GPT-2

● “Zero-shots to SOTA on 7 out of 8 tested language modeling datasets…”
  ○ No fine-tuning step!
● 1.5B-parameter Transformer
● Trained on 40GB of internet text
  ○ Reddit outlinks with 3+ upvotes as a proxy for human-curated content
● Larger, plus very minor changes vs. GPT

Slide 16

Slide 16 text

GPT-2: Zero-Shot SOTA

● Summarization
  ○ Append “TL;DR:” and let the model continue
● Translation (see the sketch below)
  ○ Condition the language model on a context of example pairs of the format
    ■ english sentence = french sentence
  ○ and then a final prompt of
    ■ english sentence =
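A sketch of the zero-shot prompt formats described above. The helper names are hypothetical; the only mechanism is that the language model continues the text, and the continuation is read off as the summary or translation.

```python
def summarization_prompt(article: str) -> str:
    # Append the TL;DR: cue and let the model continue with a summary.
    return article + "\nTL;DR:"

def translation_prompt(example_pairs, english_sentence: str) -> str:
    # Condition on "english = french" example pairs, then prompt with "english =".
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{english_sentence} =")
    return "\n".join(lines)

print(translation_prompt([("good morning", "bonjour"), ("thank you", "merci")],
                         "where is the library?"))
```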

Slide 17

Slide 17 text

GPT-2: Zero-shot results

[Chart: language modeling benchmark results for GPT1, BERT Large, …, GPT2]

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Input:
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Generated by GPT-2 XL:
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow. However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I

Input:
Data Science Lightning Presentation Rules
- The subject should be something about data science.
- Total Length: 360 to 400 seconds
- Slide length: 10 to 40 seconds
- Prepare for 10 minutes of questions after each presentation.

Generated by GPT-2 small (talktotransformer.com):
- For each lecture, two slides should be provided. No single slide.
- For each session, two presentations should be provided. Each presentation should be 15 minutes long.
- At least one lecture should be on a topic specific to the topic at hand.
- No more than two questions should be asked in a single lecture.
- For each session, we need to create a question about any part of our dataset and ask it directly. The number of questions should be based upon the volume of data analyzed.
- The question should be short enough that anyone with a background in data science knows what it is.
- We should present each segment of the problem with some numbers:
  - The size of the dataset:
  - The number of points of each dataset as well as its number of data elements (like lines).
  - The distribution of all data points as follows:
    * - Random numbers.
    * - Normal

Slide 20

Slide 20 text

Prompt (human-written):
Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!!

MODEL COMPLETION (MACHINE-WRITTEN, 25 TRIES):
Recycling is NOT good for the world. It is bad for the environment, it is bad for our health, and it is bad for our economy. I’m not kidding. Recycling is not good for the environment. It is destructive to the earth and it is a major contributor to global warming. Recycling is not good for our health. It contributes to obesity and diseases like heart disease and cancer. Recycling is bad for our economy. It increases the cost of a product, and in turn, the price of everything that is made with that product. Recycling is not good for our nation. We pay a tremendous price for the privilege of having the world’s most advanced and efficient recycling system. Recycling is a huge, colossal waste of time, energy, money, and resources. And THAT is why we need to get back to basics and get back to basics in our recycling efforts. One of the best ways to start is to look at the process of creating a paper product. When you make a paper product, it is basically a long chain of materials. Everything from the raw materials (wood, cardboard, paper, etc.), to the reagents (dyes, solvents, etc.) to ….

We can also imagine the application of these models for malicious purposes, including the following (or other applications we can’t yet anticipate):
● Generate misleading news articles
● Impersonate others online
● Automate the production of abusive or faked content to post on social media
● Automate the production of spam/phishing content

https://openai.com/blog/better-language-models

Slide 21

Slide 21 text

Future …