Slide 68
Poincaré Embeddings
Graph Convolution Annotation
How can we combine everything?
How can we extract knowledge efficiently?
In what kind of space should a model be represented?
ML in Hyperbolic Space
BERT
Language Models are Unsupervised Multitask Learners
Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei**, Ilya Sutskever** (*, ** equal contribution; OpenAI, San Francisco, California, United States)
Abstract
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
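(Not part of the original slide or paper.) As a rough illustration of the conditioning setup the abstract describes, the minimal sketch below prompts a public GPT-2 checkpoint with a document plus a question and reads the generated continuation as the answer. It assumes the Hugging Face transformers library and is not the authors' evaluation code; the document and question are made-up examples.

```python
# Illustrative sketch only: zero-shot question answering by conditioning a
# pretrained language model on a document plus a question.
# Assumes the Hugging Face `transformers` library (with a PyTorch backend).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # public GPT-2 checkpoint

document = ("The Transformer architecture was introduced in 2017 "
            "and relies entirely on self-attention.")
question = "When was the Transformer architecture introduced?"

# The task is specified only through the text the model is conditioned on.
prompt = f"{document}\nQ: {question}\nA:"

full_text = generator(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
answer = full_text[len(prompt):].strip().split("\n")[0]
print(answer)
```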
1. Introduction
Machine learning systems now excel (in expectation) at tasks they are trained for by using a combination of large datasets, high-capacity models, and supervised learning (Krizhevsky et al., 2012) (Sutskever et al., 2014) (Amodei et al., 2016). Yet these systems are brittle and sensitive to slight changes in the data distribution (Recht et al., 2018) and task specification (Kirkpatrick et al., 2017). Current systems are better characterized as narrow experts rather than
competent generalists. We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one.
The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the shortcomings of this approach.
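(Added for illustration, not from the paper.) The recipe described above, collect labeled demonstrations for one task, train to imitate them, and score on an IID held-out split, can be sketched in a few lines; the dataset below is a synthetic placeholder, assuming scikit-learn.

```python
# Minimal sketch of the single-task, IID train/held-out recipe described above.
# The data is a synthetic placeholder; assumes scikit-learn is installed.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# One dataset of labeled demonstrations for one narrow task.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Held-out examples drawn from the same (IID) distribution as the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("IID held-out accuracy:", model.score(X_test, y_test))

# Note: nothing here measures behavior under distribution shift or on other
# tasks, which is exactly the shortcoming the paragraph above points to.
```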
Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks. Recently, several benchmarks have been proposed such as GLUE (Wang et al., 2018) and decaNLP (McCann et al., 2018) to begin studying this.
Multitask learning (Caruana, 1997) is a promising framework for improving general performance. However, multitask training in NLP is still nascent. Recent work reports modest performance improvements (Yogatama et al., 2019) and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively (McCann et al., 2018) (Bowman et al., 2018). From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives. Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.
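(Added for illustration, not from the paper.) The meta-learning framing above treats each (dataset, objective) pair as a single training example drawn from a distribution over tasks. The toy sketch below makes that sampling structure concrete with two hypothetical tasks sharing one model; it assumes PyTorch, and the task names, data, and sizes are placeholders.

```python
# Toy sketch of multitask training where each step samples one
# (dataset, objective) pair; tasks and data are hypothetical placeholders.
# Assumes PyTorch.
import random
import torch
import torch.nn as nn

shared = nn.Linear(16, 8)                          # body shared across tasks
heads = nn.ModuleDict({"task_a": nn.Linear(8, 2),  # task-specific heads
                       "task_b": nn.Linear(8, 5)})
opt = torch.optim.SGD(list(shared.parameters()) + list(heads.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def sample_batch(task: str):
    """Hypothetical loader: random features/labels standing in for a real dataset."""
    n_classes = heads[task].out_features
    return torch.randn(32, 16), torch.randint(n_classes, (32,))

# In current systems the pool of (dataset, objective) pairs is tiny (~10-17),
# so from this perspective the model sees very few "examples" of tasks.
tasks = ["task_a", "task_b"]
for step in range(200):
    task = random.choice(tasks)        # sample one (dataset, objective) pair
    x, y = sample_batch(task)
    loss = loss_fn(heads[task](shared(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
```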
The current best performing systems on language tasks utilize a combination of pre-training and supervised fine-tuning.
GPT-2
Big Clean Data + Big DL
Taskonomy