
Paper introduction: Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

Shota Kato
August 21, 2023

Slides for the 15th SNLP (State-of-the-Art NLP) study group
https://sites.google.com/view/snlp-jp/home/2023?authuser=0

Transcript

1. Presenter: Shota Kato (Kyoto University)
Unless otherwise noted, figures and examples are quoted from the paper being introduced; items marked ※ are the presenter's own comments.
Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, Huan Sun. ACL 2023.
https://aclanthology.org/2023.acl-long.153/
The 15th SNLP study group
2. Summary
• Goal: identify what makes chain-of-thought (CoT) prompting work.
• Method:
  • Split CoT prompts into two components and ran an ablation study.
  • Evaluated relevance and coherence.
• Findings:
  • The validity of the reasoning has almost no effect on performance.
  • What matters in CoT is relevance to the input query and coherence across the reasoning steps.
  • LLMs acquire how to reason during pretraining.
3. Prompting large language models
Carefully designing the input (prompt) of a large language model (LLM) lets it achieve high performance even on new tasks.
• In-context learning (ICL) [Brown+,20]
• Chain-of-thought (CoT) prompting [Wei+,22]
[Figure from Wei+,22]
Standard Prompting — model input:
  Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
  A: The answer is 11.
  Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model output: A: The answer is 27. (incorrect)
Chain-of-Thought Prompting — model input (the demonstration now shows the reasoning):
  A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
  Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model output: A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9. (correct)
4. Chain-of-thought (CoT) prompting
(Same standard vs. CoT figure from [Wei+,22] as on the previous slide.)
Using CoT prompts improves performance. Example: on an arithmetic reasoning task [Cobbe+,21], accuracy rises from 15.4 to 48.5 with InstructGPT-175B, text-davinci-002 [Ouyang+,22; Brown+,20].
Why do CoT prompts achieve such high performance? (A sketch of prompt assembly follows.)
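To make the difference concrete, here is a minimal Python sketch, the editor's own rather than code from the paper's repo, of how standard and CoT few-shot prompts are assembled before being sent to a text-completion model; the demo strings and `build_prompt` helper are illustrative.

```python
# Minimal sketch (not the paper's code): assembling few-shot prompts.
# A completion API call would consume the string this produces.

STANDARD_DEMO = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
)

COT_DEMO = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def build_prompt(demos: list[str], query: str) -> str:
    """Concatenate few-shot demonstrations and append the test query."""
    return "".join(demos) + f"Q: {query}\nA:"

query = ("The cafeteria had 23 apples. If they used 20 to make lunch "
         "and bought 6 more, how many apples do they have?")
print(build_prompt([STANDARD_DEMO], query))  # standard prompt
print(build_prompt([COT_DEMO], query))       # CoT prompt
```

The only difference between the two prompts is the answer text inside the demonstrations; the model's decoding setup is unchanged.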
5. Components of a CoT prompt
• Bridging objects: the objects needed to make a correct prediction.
  • Arithmetic reasoning: the numeric parts of the chain (numbers and equations).
  • Factual QA: the subject and object entities.
• Language templates: the text that complements the bridging objects.
Research questions (a sketch of the decomposition follows):
(1) Are correct bridging objects and language templates necessary?
(2) If the answer to (1) is no, what elements are important for an LLM to reason properly?
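A rough sketch of this decomposition for an arithmetic-reasoning step, assuming the definitions above: numbers and simple equations are treated as bridging objects, and blanking them out leaves the language template. The regex and the `[NUM]` placeholder are the editor's own convention, not the paper's.

```python
import re

# Split one CoT step into bridging objects and its language template.
STEP = "2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11."

# Numbers, optionally chained into equations, count as bridging objects here.
BRIDGE_RE = re.compile(r"\d+(?:\.\d+)?(?:\s*[-+*/=]\s*\d+(?:\.\d+)?)*")

bridging_objects = BRIDGE_RE.findall(STEP)
language_template = BRIDGE_RE.sub("[NUM]", STEP)

print(bridging_objects)
# ['2', '3', '6', '5 + 6 = 11']
print(language_template)
# '[NUM] cans of [NUM] tennis balls each is [NUM] tennis balls. [NUM].'
```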
6. Experimental setup
• Language models:
  • InstructGPT-175B [Ouyang+,22; Brown+,20]: text-davinci-002 (main), text-davinci-003
  • PaLM [Chowdhery+,22]
  • Flan-PaLM [Chung+,22]: PaLM + instruction tuning
• Tasks: multi-step reasoning tasks on which CoT improves performance.
  • Arithmetic reasoning: GSM8K [Cobbe+,21]
  • Factual multi-hop QA: Bamboogle [Press+,22]
• Baseline (CoT prompts): modified versions of the prompts used for GSM8K and Bamboogle (GSM8K: 8-shot, Bamboogle: 4-shot).
7. Experiment 1 | Method
• Manually construct invalid-reasoning prompts: CoT prompts whose bridging objects and language templates have been altered; only the parts that contribute to deriving the answer are changed.
• Measure the performance gap between invalid reasoning and the original CoT. (An illustrative pair is sketched after this slide.)
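For illustration only, one possible shape of a valid vs. invalid demonstration pair; these strings are the editor's own, not the paper's actual prompts. The invalid chain garbles the numbers and the logic while still talking about the query's entities.

```python
# Hypothetical example pair (not from the paper's prompt set).
VALID_DEMO = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

# Invalid reasoning: the steps no longer derive the answer correctly,
# but the demo still mentions the same entities as the question.
INVALID_DEMO = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 3 cans. 5 tennis balls of 2 each is "
    "10 tennis balls. 3 * 10 = 30. The answer is 30.\n"
)

print(VALID_DEMO)
print(INVALID_DEMO)
```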
8. Evaluation
• Final output (extrinsic evaluation):
  • GSM8K: accuracy
  • Bamboogle: F1
  • Final-answer metrics alone cannot evaluate the reasoning process, hence:
• Reasoning process (intrinsic evaluation): measure recall / F1 of the bridging objects (sketched below).
  • GSM8K: recall / F1 of the numbers in the reasoning steps (Inter. Recall / F1), using GSM8K's labeled reasoning-step data.
  • Bamboogle: recall of the subject and object entities (Inter. Recall), using manually created labeled data.
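A sketch of the intrinsic metric as read from this slide, not the authors' evaluation code: extract the numbers appearing in a generated chain and score them against a labeled gold set. The function name `inter_recall_f1` is the editor's own.

```python
import re

def extract_numbers(text: str) -> set[str]:
    """Collect the numbers (bridging objects for GSM8K) in a reasoning chain."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def inter_recall_f1(prediction: str, gold_numbers: set[str]) -> tuple[float, float]:
    """Recall / F1 of predicted bridging objects against the gold annotation."""
    pred = extract_numbers(prediction)
    hit = len(pred & gold_numbers)
    recall = hit / len(gold_numbers) if gold_numbers else 0.0
    precision = hit / len(pred) if pred else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, f1

pred = "They used 20 to make lunch. 23 - 20 = 3. 3 + 6 = 9. The answer is 9."
print(inter_recall_f1(pred, {"23", "20", "3", "6", "9"}))  # (1.0, 1.0)
```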
9. Experiment 1 | Results
[Results figure: panels for Flan-PaLM, PaLM, text-davinci-002, text-davinci-003]
• Across all models, invalid reasoning reaches about 90% of CoT's performance.
• The models have presumably acquired multi-step reasoning ability during pretraining.
→ There is no strong link between the validity of the reasoning steps and the quality of the answers.
10. Experiment 1 | Summary
Q: Are correct bridging objects / language templates necessary? A: No.
Intuition says otherwise: given reasoning steps that are invalid as a CoT, an LLM should not be able to reason correctly…
Research questions:
(1) Are correct bridging objects and language templates necessary?
(2) If the answer to (1) is no, what elements are important for an LLM to reason properly?
11. Revisiting the Experiment 1 setup…
The invalid-reasoning prompts still retain information useful for reasoning:
• They contain information about the query — Relevance:
  • Bridging objects: the same numbers appear in the query.
  • Language templates: the topic matches the query.
• The sentences connect and the argument flows — Coherence:
  • e.g., a number introduced in the previous sentence is used in the next one (a toy check is sketched below).
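A toy check, entirely the editor's own construction, for the kind of coherence described above: each reasoning step should reuse at least one number produced by an earlier step.

```python
import re

def is_chained(reasoning: str) -> bool:
    """True if every step after the first reuses a number from earlier steps."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", reasoning.strip()) if s]
    seen: set[str] = set()
    for sent in sentences:
        nums = set(re.findall(r"\d+(?:\.\d+)?", sent))
        if seen and not (nums & seen):
            return False  # this step reuses nothing from earlier steps
        seen.update(nums)
    return True

print(is_chained("23 - 20 = 3. 3 + 6 = 9."))   # True: the 3 carries over
print(is_chained("23 - 20 = 3. 4 + 5 = 12."))  # False: no shared number
```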
12. Experiment 2 | Method
Modify the CoT prompts to create six new prompt variants (two are sketched after this list):
• No coherence for bridging objects
• No relevance for bridging objects
• No coherence for language templates
• No relevance for language templates
• No coherence
• No relevance
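A rough sketch of how two of these six ablations might be constructed; this is not the authors' construction code, only an illustration: shuffling steps destroys coherence, and substituting query-unrelated numbers destroys bridging-object relevance.

```python
import random
import re

def no_coherence(steps: list[str]) -> list[str]:
    """Destroy coherence: the steps no longer follow from one another."""
    shuffled = steps[:]
    random.shuffle(shuffled)
    return shuffled

def no_relevance_bridging(step: str, unrelated_numbers: list[str]) -> str:
    """Destroy bridging-object relevance: swap in numbers unrelated to the query."""
    it = iter(unrelated_numbers)
    return re.sub(r"\d+(?:\.\d+)?", lambda _: next(it), step)

steps = ["23 - 20 = 3.", "3 + 6 = 9.", "The answer is 9."]
print(no_coherence(steps))
print(no_relevance_bridging("23 - 20 = 3.", ["48", "13", "35"]))
# '48 - 13 = 35.'
```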
13. Experiment 2 | Results (1) (text-davinci-002)
• Relevance and coherence are both necessary for CoT.
• Relevance is especially important.
  • Manual inspection of sampled outputs found generations with no relevance to the query.
  • The irrelevant outputs drifted to similar topics ("cats and dogs", "passengers and buses"), presumably topics frequent in the math-related parts of the pretraining corpus.
※ Possibly an effect of the final answers being incorrect in the "no relevance for bridging objects" and "no relevance" prompts…?
14. Experiment 2 | Results (2) (text-davinci-002)
• For bridging objects, relevance matters more.
  • Without relevance, the match rate between the output's bridging objects and the query's was lower than in the other conditions.
• For language templates, coherence matters more.
  • Sampled outputs lost coherence.
※ Possibly an effect of the final answers being incorrect in the "no relevance for bridging objects" and "no relevance" prompts…?
15. Do LLMs learn how to reason from CoT?
• What LLMs learn from CoT demonstrations is limited.
  ✓ LLMs acquire complex reasoning ability during pretraining.
  ✓ CoT's role is to steer the output toward relevance and coherence.
• The more task knowledge a model has, the smaller the performance drop under the ablations. [Results figure: Flan-PaLM]
  ✓ Prior knowledge can be exploited when solving the task.
  ✗ Executing a task that requires generating invalid reasoning steps is difficult.
16. Do LLMs learn how to reason from CoT? (cont.)
(Recap of the previous slide, plus the takeaway:)
Can LLMs learn how to reason from CoT?
• The current results are insufficient to draw a conclusion.
• CoT's main role is to elicit reasoning skills acquired during pretraining.
17. Limitations
• Designing experiments applicable to other reasoning tasks.
  • The methodology used here is not general and does not apply when a CoT prompt's components are uniform, e.g., the last letter concatenation task [Wei+,22].
• Automating the construction of invalid-reasoning prompts.
• Improving the intrinsic evaluation.
  • Gold bridging objects are not always available for evaluation.
  • Developing a comprehensive, reference-free evaluation method remains open. Related: [Golovneva+,23].
18. Summary
• Goal: identify what makes chain-of-thought (CoT) prompting work.
• Method:
  • Split CoT prompts into two components and ran an ablation study.
  • Evaluated relevance and coherence.
• Findings:
  • The validity of the reasoning has almost no effect on performance.
  • What matters in CoT is relevance to the input query and coherence across the reasoning steps.
  • LLMs acquire how to reason during pretraining.
Authors' implementation: https://github.com/sunlab-osu/Understanding-CoT
19. References
[Brown+,20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
[Wei+,22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
[Cobbe+,21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
[Ouyang+,22] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
20. References (cont.)
[Chowdhery+,22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
[Chung+,22] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
[Press+,22] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. 2022. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.
[Golovneva+,23] Olga Golovneva, Moya Peng Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2023. ROSCOE: A suite of metrics for scoring step-by-step reasoning. In The Eleventh International Conference on Learning Representations.
21. Experiment 1 | Results (quantitative analysis of text-davinci-002)
• Invalid reasoning reaches roughly 90% of CoT's performance.
• The drop in performance is uniform across samples of differing difficulty.
• Samples where only one of CoT / invalid reasoning is answered incorrectly: GSM8K 62/196, Bamboogle 6/20.
→ There is no strong link between the validity of the reasoning steps and the quality of the answers.
22. Discussion
• On benchmarks for few-shot reasoning:
  • This study can be seen as a way to quantify an LLM's prior knowledge of multi-step reasoning.
  • Evaluating an LLM's ability to learn how to reason from few-shot examples requires benchmarks on which the LLM has little built-in knowledge.