Shota Kato
August 21, 2023

Paper introduction: Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters


Transcript

1. Presenter: Shota Kato (Kyoto University)
Unless otherwise noted, figures and examples are taken from the presented paper.
Items marked ※ are the presenter's own comments.
Towards Understanding Chain-of-Thought Prompting:
An Empirical Study of What Matters
Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu,
Luke Zettlemoyer, Huan Sun
ACL 2023
https://aclanthology.org/2023.acl-long.153/
Advanced NLP Study Group

2. Summary
• Objective
  • Clarify which elements of chain-of-thought (CoT) prompting contribute to its success.
• Method
  • Split CoT prompts into two components and ran ablation studies.
  • Evaluated relevance and coherence.
• Findings
  • The validity of the reasoning has little effect on performance.
  • In CoT, relevance to the input query and coherence of the reasoning steps are what matter.
  • LLMs learn how to reason during pre-training.

3. Prompting Large Language Models
Crafting the input (prompt) to a large language model (LLM) enables high performance even on new tasks.
• In-context learning (ICL) [Brown+,20]
• Chain-of-thought (CoT) prompting [Wei+,22]
[Figure from Wei+,22: Standard Prompting vs. Chain-of-Thought Prompting]
Standard Prompting (Model Input):
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Chain-of-Thought Prompting (Model Input):
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Chain-of-Thought Prompting (Model Output):
A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.

4. [Same figure as the previous slide, from Wei+,22]
Chain-of-Thought (CoT) Prompting
Crafting the input (prompt) to a large language model (LLM) enables high performance even on new tasks.
• In-context learning (ICL) [Brown+,20]
• Chain-of-thought (CoT) prompting [Wei+,22]
Using CoT prompts improves performance.
Example: arithmetic reasoning task [Cobbe+,21]
Accuracy: 15.4 → 48.5
(InstructGPT-175B text-davinci-002 [Ouyang+,22;Brown+,20])
Why does CoT prompting achieve such high performance?
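The two prompting styles in the figure can be sketched as a small few-shot prompt builder. This is a minimal illustration: the exemplar text comes from the Wei+,22 figure above, but the helper itself is my own, not code from the paper.

```python
# Minimal sketch of standard vs. CoT few-shot prompting.
# Exemplar text is from the Wei+,22 figure; the helper is illustrative.
EXEMPLAR_Q = ("Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
              "Each can has 3 tennis balls. How many tennis balls does he have now?")
STANDARD_A = "A: The answer is 11."
COT_A = ("A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
         "6 tennis balls. 5 + 6 = 11. The answer is 11.")

def build_prompt(question: str, use_cot: bool) -> str:
    """Prepend one worked exemplar; with CoT the exemplar spells out its reasoning."""
    answer = COT_A if use_cot else STANDARD_A
    return f"{EXEMPLAR_Q}\n{answer}\n\nQ: {question}\nA:"

query = ("The cafeteria had 23 apples. If they used 20 to make lunch and "
         "bought 6 more, how many apples do they have?")
print(build_prompt(query, use_cot=True))
```

The only difference between the two conditions is the exemplar answer; the test question itself is identical, which is what makes the ablations in the rest of the talk possible.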

5. Components of a CoT Prompt
Bridging objects
The objects required to make a correct prediction.
• Arithmetic reasoning: the numeric parts of the reasoning (numbers and equations)
• Factual QA: the subject and object entities
Language templates
The textual parts that complement the bridging objects.
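For arithmetic reasoning, the split can be illustrated with a simple regex. This is my own sketch; the paper defines the two components conceptually, not via this particular pattern.

```python
import re

# Bridging objects here are numbers and equations such as "23 - 20 = 3";
# the language template is what remains once they are slotted out.
OBJ_PATTERN = r"\d+(?:\.\d+)?(?:\s*[-+*/=]\s*\d+(?:\.\d+)?)*"

def split_cot_step(step: str):
    """Split one reasoning step into bridging objects and a language template."""
    bridging_objects = re.findall(OBJ_PATTERN, step)
    language_template = re.sub(OBJ_PATTERN, "[OBJ]", step)
    return bridging_objects, language_template

objs, template = split_cot_step(
    "They used 20 to make lunch. So they had 23 - 20 = 3.")
print(objs)      # ['20', '23 - 20 = 3']
print(template)  # 'They used [OBJ] to make lunch. So they had [OBJ].'
```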

6. Components of a CoT Prompt
Bridging objects: the objects required for a correct prediction (arithmetic reasoning: numbers and equations; factual QA: subject and object entities).
Language templates: the textual parts that complement the bridging objects.
Research questions
1. Are accurate bridging objects and language templates necessary?
2. If the answer to 1 is no, which elements are important for an LLM to reason properly?

7. Experiment 1
Are accurate bridging objects / language templates necessary?
Intuition: if the CoT exemplars contain an invalid reasoning process, the LLM should fail to reason correctly…

8. Experimental Setup
• Language models
  • InstructGPT-175B [Ouyang+,22;Brown+,20]
    • text-davinci-002 (main), text-davinci-003
  • PaLM [Chowdhery+,22]
  • Flan-PaLM [Chung+,22]: PaLM + instruction tuning
• Tasks: multi-step reasoning tasks on which CoT improves performance
  • Arithmetic reasoning: GSM8K [Cobbe+,21]
  • Factual multi-hop QA: Bamboogle [Press+,22]
• Baseline (CoT prompts)
  Modified versions of the prompts previously used for GSM8K and Bamboogle
  (GSM8K: 8-shot, Bamboogle: 4-shot).

9. CoT Prompts Used (Arithmetic Reasoning)

10. Experiment 1 | Method
• Manually construct invalid-reasoning prompts:
  prompts whose bridging objects and language templates are altered from the original CoT.
  Only the parts that contribute to the answer are changed.
• Measure the performance gap between invalid reasoning and CoT.

11. Evaluation
• Final answer (extrinsic evaluation)
  • GSM8K: accuracy
  • Bamboogle: F1
  (These metrics alone cannot evaluate the reasoning process.)
• Reasoning process (intrinsic evaluation)
  Measure recall / F1 of the bridging objects.
  • GSM8K: recall / F1 of the numbers in the reasoning chain (Inter. Recall / F1), using GSM8K's annotated reasoning chains.
  • Bamboogle: recall of the subject and object entities (Inter. Recall), using manually created annotations.
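The intrinsic metric can be sketched as set overlap between the bridging objects in the model output and those in the gold annotation. This is an illustrative simplification; the paper's exact matching procedure may differ.

```python
def inter_recall_f1(predicted, gold):
    """Inter. Recall / F1: overlap of bridging objects (e.g. numbers in the
    reasoning chain) between model output and gold annotation."""
    pred_set, gold_set = set(predicted), set(gold)
    hits = len(pred_set & gold_set)
    recall = hits / len(gold_set) if gold_set else 0.0
    precision = hits / len(pred_set) if pred_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, f1

# Numbers extracted from a generated chain vs. the annotated gold chain
r, f1 = inter_recall_f1(["23", "20", "3", "6", "9"], ["23", "20", "3", "9"])
print(r, f1)  # recall 1.0, F1 ≈ 0.889
```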

12. Experiment 1 | Results
[Chart: results for Flan-PaLM, PaLM, text-davinci-002, text-davinci-003]

13. Experiment 1 | Results
[Chart: results for Flan-PaLM, PaLM, text-davinci-002, text-davinci-003]
• For all models, invalid reasoning reaches about 90% of CoT's performance.
• The models have presumably acquired multi-step reasoning ability during pre-training.
There is no strong link between the validity of the reasoning process and the quality of the answer.

14. Experiment 1 | Summary
Q: Are accurate bridging objects / language templates necessary?
A: They are not.
Intuition: if the CoT exemplars contain an invalid reasoning process, the LLM should fail to reason correctly…
Research questions
1. Are accurate bridging objects and language templates necessary?
2. If the answer to 1 is no, which elements are important for an LLM to reason properly?

15. Experiment 2
Which elements are important for an LLM to reason properly?

16. Revisiting the Method of Experiment 1
The invalid-reasoning prompts still retain information that is useful for reasoning.

17. Revisiting the Method of Experiment 1
The invalid-reasoning prompts still retain information that is useful for reasoning.
Relevance: the prompt contains information about the query.
• Bridging objects: the same numbers appear in the query.
• Language templates: the topic is the same as the query's.
Coherence: the sentences connect and follow logically.
• For example, a number introduced in the preceding sentence is used in the next one.
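These two cues can be checked mechanically. The heuristics and names below are my own rough sketch of the slide's definitions, not the paper's implementation:

```python
import re

def shares_numbers_with_query(step: str, query: str) -> bool:
    """Relevance cue: does the step reuse numbers that appear in the query?"""
    return bool(set(re.findall(r"\d+", step)) & set(re.findall(r"\d+", query)))

def reuses_previous_number(prev_step: str, step: str) -> bool:
    """Coherence cue: does the step use a number from the preceding step?"""
    return bool(set(re.findall(r"\d+", prev_step)) & set(re.findall(r"\d+", step)))

query = ("The cafeteria had 23 apples. If they used 20 to make lunch and "
         "bought 6 more, how many apples do they have?")
s1 = "They used 20 to make lunch. So they had 23 - 20 = 3."
s2 = "They bought 6 more apples, so they have 3 + 6 = 9."
print(shares_numbers_with_query(s1, query))  # True
print(reuses_previous_number(s1, s2))        # True (the 3 carries over)
```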

18. Experiment 2 | Method
Modify the CoT prompts to create six new prompt variants:
• No coherence for bridging objects
• No relevance for bridging objects
• No coherence for language templates
• No relevance for language templates
• No coherence
• No relevance

19. Experiment 2 | Example Prompts
• No coherence: randomly shuffle the order of the components.
• No relevance: replace the numbers with randomly sampled ones.
※ In "no relevance for bridging objects," the final answer becomes incorrect.
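The two perturbations can be sketched as follows. This is my own illustration of the slide's description; the paper's prompt construction was partly manual.

```python
import random
import re

def no_coherence(chain: str, seed: int = 0) -> str:
    """Destroy coherence: randomly permute the reasoning sentences while
    keeping their content (and hence relevance to the query) intact."""
    sentences = [s.strip() for s in chain.split(".") if s.strip()]
    random.Random(seed).shuffle(sentences)
    return ". ".join(sentences) + "."

def no_relevance(chain: str, seed: int = 0) -> str:
    """Destroy relevance of the bridging objects: replace every number with
    a random one unrelated to the query (so the final answer becomes wrong)."""
    rng = random.Random(seed)
    return re.sub(r"\d+", lambda _: str(rng.randint(10, 99)), chain)

chain = ("The cafeteria had 23 apples originally. They used 20 to make lunch. "
         "So they had 23 - 20 = 3.")
print(no_coherence(chain))
print(no_relevance(chain))
```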

20. Experiment 2 | Results (text-davinci-002)
• Relevance and coherence are both necessary for CoT.
• Relevance is especially important.
  • Manually inspected samples: the outputs bore no relevance to the query.
  • The irrelevant outputs fell into similar topics ("cats and dogs", "passengers and buses"), presumably topics frequent in the math-related portions of the pre-training corpus.
※ Possibly an effect of the final answers being incorrect in "no relevance for bridging objects" and "no relevance"…?

21. Experiment 2 | Results (text-davinci-002)
• For bridging objects, relevance matters more.
  • Under the no-relevance settings, the agreement rate of bridging objects between the outputs and the query was lower than in the other settings.
• For language templates, coherence matters more.
  • Many sampled outputs lacked coherence.
※ Possibly an effect of the final answers being incorrect in "no relevance for bridging objects" and "no relevance"…?

22. Do LLMs Learn How to Reason from CoT? (Flan-PaLM)
• What LLMs learn about reasoning from CoT is limited.
  ✓ LLMs acquire complex reasoning ability during pre-training.
  ✓ The role of CoT is to steer the output toward relevance and coherence.
• The more task knowledge the model has, the smaller the performance drop under the ablations.
  ✓ The model can exploit prior knowledge when solving the task.
  ✗ It struggles at the task of generating an invalid reasoning process.

23. Do LLMs Learn How to Reason from CoT?
• What LLMs learn about reasoning from CoT is limited.
  ✓ LLMs acquire complex reasoning ability during pre-training.
  ✓ The role of CoT is to steer the output toward relevance and coherence.
• The more task knowledge the model has, the smaller the performance drop under the ablations.
  ✓ The model can exploit prior knowledge when solving the task.
  ✗ It struggles at the task of generating an invalid reasoning process.
Can LLMs learn how to reason from CoT?
• The current results are insufficient to draw a conclusion.
• The main role of CoT is to elicit the reasoning skills acquired during pre-training.

24. Limitations
[Wei+,22]
• Designing experiments applicable to other reasoning tasks
  • The method used here is not general; it cannot be applied when the components of a CoT prompt are homogeneous and cannot be separated.
• Automating the construction of invalid-reasoning prompts
• Improving the intrinsic evaluation
  • Gold bridging objects for evaluation are not always available.
  • Developing a comprehensive, reference-free evaluation method remains open. Related: [Golovneva+,23]

25. Summary
• Objective
  • Clarify which elements of chain-of-thought (CoT) prompting contribute to its success.
• Method
  • Split CoT prompts into two components and ran ablation studies.
  • Evaluated relevance and coherence.
• Findings
  • The validity of the reasoning has little effect on performance.
  • In CoT, relevance to the input query and coherence of the reasoning steps are what matter.
  • LLMs learn how to reason during pre-training.
Author implementation: https://github.com/sunlab-osu/Understanding-CoT

26. References
[Brown+,20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
2020. Language models are few-shot learners. Advances in neural information processing
systems, 33:1877–1901.
[Wei+,22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia,
Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in
large language models. In Advances in Neural Information Processing Systems, volume 35,
pages 24824–24837. Curran Associates, Inc.
[Cobbe+,21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro
Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word
problems. arXiv preprint arXiv:2110.14168.
[Ouyang+,22] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela
Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training
language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

27. References
[Chowdhery+,22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma,
Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian
Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint
arXiv:2204.02311.
[Chung+,22] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus,
Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-
finetuned language models. arXiv preprint arXiv:2210.11416.
[Press+,22] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike
Lewis. 2022. Measuring and narrowing the compositionality gap in language models. arXiv
preprint arXiv:2210.03350.
[Golovneva+,23] Olga Golovneva, Moya Peng Chen, Spencer Poff, Martin Corredor, Luke
Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2023. ROSCOE: A suite of metrics
for scoring step-by-step reasoning. In The Eleventh International Conference on Learning
Representations.

28. Supplementary Material

29. Experiment 1 | Results (quantitative analysis, text-davinci-002)
• Invalid reasoning reaches roughly 90% of CoT's performance.
• The rate of performance drop is uniform across samples of different difficulty.
• Samples where only one of the two prompts (CoT or invalid reasoning) was incorrect: GSM8K: 62/196, Bamboogle: 6/20.
There is no strong link between the validity of the reasoning process and the quality of the answer.

30. Experiment 1 | Results (qualitative analysis)
• There was no clear difference between the rationales produced under CoT and under invalid reasoning.
• In almost all cases where the answer was correct, the reasoning was valid.
• In cases where the answer was wrong, the failure modes were the same as under CoT.

31. Discussion
• On benchmarks for few-shot reasoning
  • This study can be viewed as a way to quantify an LLM's prior knowledge about multi-step reasoning.
  • To evaluate an LLM's ability to learn how to reason from few-shot exemplars, a benchmark covering knowledge the LLM lacks is needed.

32. CoT Prompts Used (Factual QA)