References
[Brown+,20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
2020. Language models are few-shot learners. In Advances in Neural Information Processing
Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
[Wei+,22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia,
Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in
large language models. In Advances in Neural Information Processing Systems, volume 35,
pages 24824–24837. Curran Associates, Inc.
[Cobbe+,21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro
Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word
problems. arXiv preprint arXiv:2110.14168.
[Ouyang+,22] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela
Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training
language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.