
Deep Learningの都市伝説と現実 (Urban Legends and Realities of Deep Learning)

ABEJA
March 04, 2019


SIX 2019 dev-d-5
Tatsuya Shirakawa @ABEJA, Inc.

"Urban Legends and Realities of Deep Learning"

In the several years since Deep Learning first came into the spotlight, it has become an indispensable core technology with a wide range of applications. At the same time, it has attracted inflated expectations that mix dreams with hopes, and the image of Deep Learning = AI = an all-purpose machine has often run ahead of reality.
In this session, we revisit how Deep Learning, as a technology, can actually be applied to the complex and noisy real world, focusing on several topics that include recent research results.




Transcript

  1. Self-Introduction — Tatsuya Shirakawa (白川達也), ABEJA, Inc. (Researcher). Deep Learning (CV, Graph, NLP, …), Machine Learning, Mathematical Optimization. https://github.com/TatsuyaShirakawa / Tech blog: http://tech-blog.abeja.asia/ Poincaré Embeddings, Graph Convolution, Annotation, ML in Hyperbolic Space.
  2. Researcher at ABEJA: 1. Find it first, 2. Solve it simply, 3. Fail first. • Catching up on the latest technology • Framing new business ideas from a technical perspective • Developing and validating proprietary technology • Building the core logic for high-difficulty tasks • Proposing technical solutions • Fundamentally improving product accuracy • Validating ideas • Re-examining existing methods and ways of thinking

  3. at ABEJA: I want to build the strongest model. Can the strongest model be obtained by exposing it to every possible task? Poincaré Embeddings, Graph Convolution, Annotation. How do we combine everything? How do we extract knowledge efficiently? In what kind of space should models be represented? ML in Hyperbolic Space (a distance sketch follows below).
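Where the slide asks in what space models should live, the Poincaré Embeddings line of work answers with hyperbolic space. Below is a minimal sketch, not taken from the deck, of the Poincaré-ball distance used by Poincaré Embeddings (Nickel & Kiela, 2017); the toy 2-D points are hypothetical, not trained embeddings. Distances blow up near the boundary of the unit ball, which is what lets tree-like hierarchies embed well in very few dimensions.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))"""
    num = 2.0 * np.dot(u - v, u - v)
    den = (1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v)) + eps
    return np.arccosh(1.0 + num / den)

root  = np.array([0.0, 0.0])   # near the origin: broad / general concepts
leaf1 = np.array([0.9, 0.0])   # near the boundary: specific concepts
leaf2 = np.array([0.0, 0.9])

print(poincare_distance(root, leaf1))   # ~2.9
print(poincare_distance(leaf1, leaf2))  # ~5.2, far larger than the Euclidean gap suggests
```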
  4. [Most of this slide's text was corrupted during extraction. Recoverable fragments: a list of painful past incidents with in-store networks and device deployments, and of factors that hurt analysis quality, with examples of occlusion and backlighting.] IoT×AIを活用した小売業向け店舗解析サービスの仕組みとノウハウ (Mechanisms and know-how of an IoT×AI store-analytics service for retail, https://www.slideshare.net/xecus/soracom-ug-explorer-2018-iotxai-115181245). Retail is tough.
  5. Data acquisition → data accumulation → data validation → training-data creation → model design → training → evaluation → deployment → inference → retraining. What each stage takes: APIs, load balancing, and security for ingesting large volumes of data; setting up and managing a data warehouse; checking data validity (correctness); tools and people for creating labeled training data; model design from scratch; hand-off from the development environment to production; ensuring redundancy and GPU resources and building the integration process with the edge side; preparing GPU environments and advanced distribution; version control of data, models, and results; monitoring the deployed model's behavior and updating the model as needed. Operating AI is tough. (A minimal sketch of this loop follows below.)
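A minimal sketch of the loop on this slide, with hypothetical stage names (this is not ABEJA Platform's actual API). The point is that each stage hands versioned artifacts to the next and that post-deployment monitoring feeds back into retraining.

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    kind: str      # "dataset", "model", "evaluation", ...
    version: str   # data, model, and result versions are tracked together

# Hypothetical stage names summarizing the slide's loop and its pain points.
PIPELINE = [
    "acquire_data",          # APIs, load balancing, security for bulk ingestion
    "store_data",            # data warehouse setup and management
    "validate_data",         # check data correctness
    "create_training_data",  # annotation tools and annotators
    "design_model",          # model design from scratch
    "train",                 # GPU environments, advanced distribution
    "evaluate",
    "deploy",                # dev-to-production hand-off, redundancy, edge integration
    "infer",
    "monitor",               # watch the deployed model's behavior
    "retrain",               # update the model as needed, then loop again
]

if __name__ == "__main__":
    dataset = Artifact("dataset", "v1")
    for stage in PIPELINE:
        # In a real system each stage would consume and emit versioned Artifacts.
        print(f"{stage}: consumes/produces versioned artifacts like {dataset}")
```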
  6. Baby vs AI Chelsea Finn, “An agent that can do

    many things”, NeurIPS2018
 https://people.eecs.berkeley.edu/~cbfinn/_files/neurips18_model_the_world_25min.pdf
  7. Experiment: Learning on Noisy Datasets. Classification datasets (MNIST, CIFAR10, CIFAR100). Procedure: replace the labels of 0-100% of the training data with random labels, train a classifier, then test on the clean test set. (Chart labels on the slide: 0-100% label noising, 100% annotation accuracy.) A minimal sketch of this setup follows below.
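A minimal sketch of the setup described above, not the experiment's actual code. It assumes torchvision is available and uses CIFAR-10; any classifier can be plugged into the commented-out training step.

```python
import numpy as np
from torchvision import datasets, transforms

def noisy_targets(targets, noise_ratio, num_classes=10, seed=0):
    """Replace the labels of a fraction `noise_ratio` of the samples with random
    classes (a noised label may coincide with the original by chance)."""
    rng = np.random.default_rng(seed)
    targets = np.array(targets)
    idx = rng.choice(len(targets), size=int(noise_ratio * len(targets)), replace=False)
    targets[idx] = rng.integers(0, num_classes, size=len(idx))
    return targets.tolist()

train_set = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())
test_set = datasets.CIFAR10("data", train=False, download=True,
                            transform=transforms.ToTensor())
train_set.targets = noisy_targets(train_set.targets, noise_ratio=0.4)  # e.g. 40% label noise

# Train any image classifier on `train_set`, then measure accuracy on the
# untouched `test_set`; sweeping noise_ratio from 0.0 to 1.0 reproduces the
# kind of curve sketched on the slide.
```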
  8. "The Devil of Face Recognition is in the Noise", arXiv:1807.11649. The clean dataset proposed in the paper. Using a clean dataset lets you train a high-accuracy model efficiently.
  9. Big Clean Data + DL. (Quadrant: Flashy (キラキラ) vs. Steady (コツコツ).) Artificial general intelligence / Big Data + DL / Big Clean Data + DL / Use AI wisely.
  10. Can we get by with as little annotation as possible? (Quadrant: Flashy vs. Steady.) Artificial general intelligence / Careful Annotation / No More Annotation / Understand Task & Human / Use AI wisely / Big Data + DL / Big Clean Data + DL.
  11. • Temporal Ensembling (ICLR 2017) • Mean Teacher (NeurIPS 2017) • TE++ (an unpublished method by the presenters, based on WAIC). Verifying the effect of well-known SSL methods plus extensions. Notation: x, y is an input-output pair; w is a random variable representing the effect of dropout, data augmentation, etc.; θ are the model parameters (may be omitted). All of these methods add regularization so that the model's output does not fluctuate. (A minimal sketch of this consistency regularization follows below.)
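A minimal sketch, in the slide's notation, of the consistency regularization these methods share: penalize the model when two stochastic forward passes (different realizations of w, i.e. dropout/augmentation noise) disagree, with Mean Teacher's EMA teacher shown as one concrete choice. `student`, `teacher`, `lambda_cons`, and `alpha` are placeholder names, not the authors' code.

```python
import torch
import torch.nn.functional as F

def consistency_loss(student, teacher, x_unlabeled):
    """Penalize disagreement between teacher and student predictions on the same inputs."""
    with torch.no_grad():
        target = F.softmax(teacher(x_unlabeled), dim=1)  # teacher's own noise w'
    pred = F.softmax(student(x_unlabeled), dim=1)        # student's noise w
    return F.mse_loss(pred, target)

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """Mean Teacher: the teacher's weights are an exponential moving average of the student's."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s, alpha=1.0 - alpha)

# Per batch:
#   loss = F.cross_entropy(student(x_labeled), y) \
#          + lambda_cons * consistency_loss(student, teacher, x_unlabeled)
#   loss.backward(); optimizer.step(); ema_update(teacher, student)
```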
  12. Pretraining for natural language processing has recently seen real innovation. Text understanding: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (11 Oct. 2018). Text generation: Language Models are Unsupervised Multitask Learners (14 Feb. 2019).
  13. BERT — a new, powerful unsupervised pretraining method for language-understanding tasks. Pretraining a powerful model (BERT) on the following two tasks, which can be constructed without supervision, substantially pushed SOTA across a wide range of language-understanding tasks: 1. word fill-in-the-blank ("The cat [MASK] on the mat" → sat) and 2. judging whether two sentences are consecutive ("1. The man went to [MASK] store 2. He bought a gallon [MASK] milk" → IsNext / NotNext?). GLUE test results (from the paper). A minimal sketch of how such pretraining examples are built follows below.
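A minimal sketch of how the two pretraining examples on this slide are constructed. It uses plain whitespace tokenization for readability, whereas real BERT uses WordPiece tokens and masks roughly 15% of them (with additional replace/keep rules omitted here).

```python
import random

def make_masked_lm_example(sentence, mask_prob=0.15, mask_token="[MASK]"):
    """Task 1: hide random words; the model must predict the original word at each [MASK]."""
    tokens = sentence.split()
    labels = [None] * len(tokens)   # None = nothing to predict at this position
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok
            tokens[i] = mask_token
    return tokens, labels

def make_nsp_example(sent_a, sent_b, corpus):
    """Task 2: pair sent_a with its true continuation or a random sentence (IsNext / NotNext)."""
    if random.random() < 0.5:
        return sent_a, sent_b, "IsNext"
    return sent_a, random.choice(corpus), "NotNext"

print(make_masked_lm_example("The cat sat on the mat", mask_prob=0.3))
print(make_nsp_example("The man went to the store",
                       "He bought a gallon of milk",
                       corpus=["Penguins are flightless birds."]))
```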
  14. Is ImageNet pretraining a cure-all? (3 papers): Rethinking ImageNet Pre-training (2018.11); ImageNet-Trained CNNs are Biased Towards Texture (2018.11); Using Pre-Training Can Improve Model Robustness and Uncertainty (2019.1).
  15. Annotation First / Pretraining. (Quadrant: Flashy vs. Steady.) Artificial general intelligence / Careful Annotation / No More Annotation / Annotation First, Pretraining / Use AI wisely / Understand Task & Human / Big Data + DL / Big Clean Data + DL.
  16. What happens to Deep Learning from here? (Quadrant: Flashy vs. Steady.) Artificial general intelligence / Careful Annotation / No More Annotation / Trough of disillusionment? / Annotation First, Pretraining / Use AI wisely / Understand Task & Human / Big Data + DL / Big Clean Data + DL.
  17. An era centered on players who use AI correctly. (Quadrant: Flashy vs. Steady.) Artificial general intelligence / Trough of disillusionment? / Phase of productive use / Careful Annotation / No More Annotation / Annotation First, Pretraining / Use AI wisely / Understand Task & Human / Big Data + DL / Big Clean Data + DL.
  18. What AI technology comes next? (Quadrant: Flashy vs. Steady.) Artificial general intelligence / Big Data + DL / Big Clean Data + DL / Trough of disillusionment? / Phase of productive use / Careful Annotation / No More Annotation / Understand Task & Human / Annotation First, Pretraining / Near Future / Use AI wisely.
  19. at ABEJA (recap): I want to build the strongest model. Can the strongest model be obtained by exposing it to every possible task? Poincaré Embeddings, Graph Convolution, Annotation. How do we combine everything? How do we extract knowledge efficiently? In what kind of space should models be represented? ML in Hyperbolic Space.
  20. Poincaré Embeddings, Graph Convolution, Annotation. How do we combine everything? How do we extract knowledge efficiently? In what kind of space should models be represented? ML in Hyperbolic Space. BERT. [Screenshot of the first page of "Language Models are Unsupervised Multitask Learners" (Radford et al.), the GPT-2 paper.] GPT-2. Big Clean Data + Big DL. Taskonomy.
  21. A taxonomy of tasks: Taskonomy (CVPR 2018 Best Paper). Taskonomy = Task + Taxonomy. Analyzes how readily each of 26 vision tasks on indoor images transfers to the others, and validates the effectiveness and efficiency of task-to-task transfer learning. (Tasks in the figure: Autoencoding, Object Class., Scene Class., Curvature, Denoising, Occlusion Edges, Egomotion, Cam. Pose (fix), 2D Keypoint, 3D Keypoint, Cam. Pose (nonfix), Matching, Reshading, Distance, Z-Depth, Normals, Layout, 2.5D Segm., 2D Segm., Semantic Segm., Vanishing Pts., Novel Tasks 1-3.) A sketch of the pairwise-transfer idea follows below.
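A minimal sketch of the kind of pairwise transfer study Taskonomy runs. `transfer_gain` is a hypothetical stub; in the paper this value comes from actually training small readout networks on limited target-task data and normalizing the resulting scores.

```python
from itertools import permutations

# A few of the 26 Taskonomy tasks, for illustration only.
TASKS = ["autoencoding", "denoising", "2d_keypoint", "surface_normals", "semantic_segm"]

def transfer_gain(source, target):
    """Hypothetical stub: how much a target-task readout on a source-pretrained
    encoder beats training the target task from scratch with the same budget."""
    return 0.0  # the paper derives this from real training runs

# Task-affinity matrix: one entry per ordered (source, target) pair.
affinity = {(s, t): transfer_gain(s, t) for s, t in permutations(TASKS, 2)}

# Taskonomy then normalizes these affinities and solves a selection problem on
# the resulting graph to pick the best source task(s) for each new target task.
```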
  22. BERT. [Same screenshot of the GPT-2 paper, "Language Models are Unsupervised Multitask Learners", as on slide 20.] GPT-2. Poincaré Embeddings, Graph Convolution, Annotation. How do we combine everything? How do we extract knowledge efficiently? In what kind of space should models be represented? ML in Hyperbolic Space. Taskonomy. Large Scale Multi Task Transfer. The tools are now in place and our understanding has advanced… Big Clean Data + Big DL.
  23. (Quadrant: Flashy vs. Steady.) Artificial general intelligence / Trough of disillusionment? / Phase of productive use / Careful Annotation / No More Annotation / Use AI wisely / Annotation First, Pretraining / 3B: Big Task + Big Clean Data + Big DL / Near Future / Big Task + Big Clean Data + Big DL + Multi Modal? / Understand Task & Human / Big Data + DL / Big Clean Data + DL.