Jeffrey Wu *1   Rewon Child 1   David Luan 1   Dario Amodei **1   Ilya Sutskever **1

*, ** Equal contribution. 1 OpenAI, San Francisco, California, United States. Correspondence to: Alec Radford <[email protected]>.

Abstract

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset, matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer, and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

1. Introduction

Machine learning systems now excel (in expectation) at tasks they are trained for by using a combination of large datasets, high-capacity models, and supervised learning (Krizhevsky et al., 2012; Sutskever et al., 2014; Amodei et al., 2016). Yet these systems are brittle and sensitive to slight changes in the data distribution (Recht et al., 2018) and task specification (Kirkpatrick et al., 2017). Current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks, eventually without the need to manually create and label a training dataset for each one.

The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the shortcomings of this approach.

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks. Recently, several benchmarks have been proposed, such as GLUE (Wang et al., 2018) and decaNLP (McCann et al., 2018), to begin studying this.

Multitask learning (Caruana, 1997) is a promising framework for improving general performance. However, multitask training in NLP is still nascent. Recent work reports modest performance improvements (Yogatama et al., 2019), and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively (McCann et al., 2018; Bowman et al., 2018). From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives. Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.

The current best performing systems on language tasks utilize a combination of pre-training and supervised fine-tuning.
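As a concrete illustration of the zero-shot conditioning described in the abstract, the sketch below feeds a document and a question to a pretrained language model as plain text and decodes a short continuation after an "A:" cue. This is a minimal sketch, not the paper's own code or evaluation pipeline: it assumes the publicly released small "gpt2" checkpoint loaded through the Hugging Face transformers library, and the passage, question, and "Q:"/"A:" prompt format are invented examples rather than the exact prompt used for CoQA.

```python
# Illustrative sketch of zero-shot reading comprehension by prompting a
# pretrained language model. Assumptions (not from the paper): the Hugging
# Face `transformers` package, the public "gpt2" checkpoint, and a made-up
# passage/question pair.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

passage = (
    "Tom went to the market on Saturday and bought three apples, "
    "a loaf of bread, and a bottle of milk."
)
question = "What did Tom buy at the market?"

# The task is specified entirely by the text the model is conditioned on:
# a document, a question, and an "A:" cue that invites an answer.
prompt = f"{passage}\nQ: {question}\nA:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=20,                     # keep the answer short
        do_sample=False,                       # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
    )

# Keep only the newly generated tokens, and only the first line of them.
answer = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(answer.split("\n")[0].strip())
```

The point of the sketch is that no CoQA training examples and no task-specific parameters are involved; whatever answer quality the model achieves comes solely from language modeling on WebText plus the prompt, which is the sense in which the tasks are learned "without any explicit supervision."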
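The abstract's phrase "log-linear fashion" can be read, under the assumption that it refers to a linear trend in the logarithm of model capacity, as

    performance(N) \approx \alpha + \beta \log N,

where N is the number of model parameters and \alpha, \beta are task-specific constants. This is an illustrative reading of the claim, not a fitted result reported in this section.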