
Ask Me Anything


Seunghyun Hwang

June 01, 2023

Transcript

  1. Ask Me Anything: A Simple Strategy for Prompting Language Models

    Presented by Seunghyun Hwang, 2023. 6. 1. Simran Arora, Avanika Narayan, Mayee Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Christopher Ré. ICLR 2023. Reading club 4, Research outcome 1
  2. Prompt

    • For super-large models such as the 175B-parameter GPT-3[1], fine-tuning is hard and costly • Instead, fix their parameters and apply them to different tasks with different prompts • Hard prompts, prompt tuning[2], transfer learning[3], … [1] Brown, Tom, et al. "Language models are few-shot learners." NeurIPS 2020. [2] Lester, Brian, Rami Al-Rfou, and Noah Constant. "The power of scale for parameter-efficient prompt tuning." 2021. [3] Asai, Akari, et al. "ATTEMPT: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts." EMNLP 2022. Background Information
  3. Prompt engineering

    • Prompt engineering is the process of designing natural language specifications of a task • Manually rewrite task inputs into the prescribed formats on an example-by-example basis[1] • Simplify complex tasks to achieve better performance in the prompting paradigm[2] [1] Mishra, Swaroop, et al. "Reframing Instructional Prompts to GPTk's Language." 2021. [2] Creswell, Antonia, Murray Shanahan, and Irina Higgins. "Selection-Inference: Exploiting large language models for interpretable logical reasoning." 2022. Background Information
  4. Motivation

    • Evaluating LLM prompting performance on a broad set of tasks shows the process to be brittle • Small changes to the prompt result in large performance variations[1],[2] • Significant effort is dedicated to designing a perfect prompt for a task [1] Zhao, Zihao, et al. "Calibrate before use: Improving few-shot performance of language models." PMLR, 2021. [2] Holtzman, Ari, et al. "Surface form competition: Why the highest probability answer isn't always right." 2021. Motivation
  5. Motivation

    • Instead, aggregate the predictions of multiple effective yet imperfect prompts to improve prompting performance • The prompts vote for the input's true label to produce a final prediction Motivation -> Ask Me Anything: A Simple Strategy for Prompting Language Models
  6. Ask Me Anything Prompting

    Supervised task (X, Y), where x ∈ X is the input and y ∈ Y is the output. Unlabeled dataset D = {x_i}, i = 1…n, for which we wish to predict each y_i. Given a prompt p, we use p : X → Y to return the output of the prompted LM, i.e. ŷ = p(x). Define a collection of m prompts P = {p_1, p_2, …, p_m}. (1) Apply P to each x ∈ D: P(x) = [p_1(x), …, p_m(x)]. (2) Aggregate the predictions using an aggregator function φ : Y^m → Y to produce the output ŷ on x. 3 key problems • Effective prompts (1) • Scalable collection (1) • Prompt aggregation (2) Model Structure
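
    A minimal sketch of this setup in Python, assuming a hypothetical run_lm helper that queries the LM; the prompt texts and the simple majority-vote aggregator below are illustrative placeholders, not the paper's exact prompts or final aggregator.

from collections import Counter
from typing import Callable, List

def run_lm(prompt: str) -> str:
    # Hypothetical LM call: send a prompt string, return the model's completion.
    raise NotImplementedError("plug in your LM client here")

def apply_prompts(x: str, prompts: List[Callable[[str], str]]) -> List[str]:
    """P(x) = [p_1(x), ..., p_m(x)]: run every prompt on the same input."""
    return [p(x) for p in prompts]

def majority_vote(votes: List[str]) -> str:
    """A simple aggregator phi : Y^m -> Y (AMA ultimately replaces this with weak supervision)."""
    return Counter(votes).most_common(1)[0][0]

# Each prompt maps an input x to a prediction in Y via the LM.
prompts = [
    lambda x: run_lm(f"{x}\nIs the claim true? Answer yes or no."),
    lambda x: run_lm(f"Context: {x}\nQuestion: Is the statement correct?\nAnswer:"),
]

def predict(x: str) -> str:
    return majority_vote(apply_prompts(x, prompts))
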
  7. Effective Prompts

    • High-quality prompts are a precursor to improvements from aggregation • Previous approaches[1] focus on a single task -> focus on prompt engineering • Is the original standard (hard) prompt format[2] the right one? • ("John invited Mark to come watch Jurassic Park. Output True or False?") - restrictive • ("John invited Mark to come watch Jurassic _", fill-in-the-blank, "Park") - cloze question • ("Where did John invite Mark?") - open-ended question [1] Wei, Jason, et al. "Chain of thought prompting elicits reasoning in large language models." 2022. [2] Brown, Tom, et al. "Language models are few-shot learners." NeurIPS 2020. Model Structure
  8. Effective Prompts

    • The open-ended question format improves performance significantly (on GPT-J 6B) • For example, in WSC, • Restrictive form: "The pronoun 'his' refers to 'Mark' in the context. True or False?" • Open-ended form: given "Mark went to the park with his dog.", reformat to "What does 'his' refer to?" • 38% lift (50% -> 69.2%) Model Structure
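
    To make this reformatting concrete, here is a small illustrative sketch; the template strings and the to_label helper are invented for this example, not taken from the paper.

# Restrictive form: forces the model to emit "True" or "False" directly.
restrictive = (
    'Context: Mark went to the park with his dog.\n'
    'The pronoun "his" refers to "Mark" in the context. True or False?\nAnswer:'
)

# Open-ended QA form: the model answers freely and we map the text back to a label.
open_ended = (
    'Context: Mark went to the park with his dog.\n'
    'Question: What does "his" refer to?\nAnswer:'
)

def to_label(answer: str, candidate: str = "Mark") -> str:
    # Map the free-text answer back to the task's True/False label space.
    return "True" if candidate.lower() in answer.lower() else "False"
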
  9. Effective Prompts

    • Why is the QA prompt format effective? • Intuitively, the task of answering open-ended questions is aligned with the next-token prediction language modeling objective • Analyzing the EleutherAI[1] Pile corpus[2]: • Open-ended QA structures appear roughly 1000x more frequently than the restrictive format • There is a large imbalance between the formats' frequencies in the corpus [1] https://www.eleuther.ai/ [2] Black, Sid, et al. "GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow." 2021. Model Structure
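
    As a rough illustration of this kind of frequency comparison (a toy regex proxy, not the paper's actual corpus analysis), one could count the two structures in a text sample:

import re

# Toy proxy: count open-ended question structures vs. restrictive "True or False"
# structures in a text sample.
open_ended_pat = re.compile(r"\b(who|what|where|when|why|how)\b[^.?!]*\?", re.IGNORECASE)
restrictive_pat = re.compile(r"\btrue or false\b", re.IGNORECASE)

def format_frequencies(text: str) -> dict:
    return {
        "open_ended_qa": len(open_ended_pat.findall(text)),
        "restrictive": len(restrictive_pat.findall(text)),
    }

sample = "Where did John go? He went to the park. What does 'his' refer to? True or False: dogs bark."
print(format_frequencies(sample))  # {'open_ended_qa': 2, 'restrictive': 1}
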
  10. Scalable Collection

    • Prior works manually rewrite each task input into the new format[1],[2] • Given an input x, apply the prompt chain answer(question(x)) • question() : x → q — generates a question from an input x — (1) • answer() : q → a — answers the question q from (1) in the context of x to produce the answer a • question() and answer() also contain demonstrations (in-context examples) • Chains are (1) reused across inputs and (2) different pairs of functional prompts can be combined to create variety Model Structure [1] Mishra, Swaroop, et al. "Reframing Instructional Prompts to GPTk's Language." 2021. [2] Wu, Tongshuang, Michael Terry, and Carrie Jun Cai. "AI Chains: Transparent and controllable human-AI interaction by chaining large language model prompts." CHI 2022.
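
    A minimal sketch of one such prompt()-chain, again assuming the hypothetical run_lm helper; the demonstrations below are shortened and invented for illustration.

def run_lm(prompt: str) -> str:
    # Hypothetical LM call, as above.
    raise NotImplementedError("plug in your LM client here")

# Demonstrations shown to the LM inside each functional prompt (invented examples).
QUESTION_DEMOS = (
    "Claim: John went to the store. Question: Did John go to the store?\n"
    "Claim: The dog ate the bone. Question: Did the dog eat the bone?\n"
)
ANSWER_DEMOS = (
    "Context: Lisa bought a car. Question: Did Lisa buy a car? Answer: yes\n"
)

def question(x: str) -> str:
    """question(): x -> q, turn the task input into an open-ended question."""
    return run_lm(QUESTION_DEMOS + f"Claim: {x} Question:")

def answer(q: str, x: str) -> str:
    """answer(): q -> a, answer q in the context of x."""
    return run_lm(ANSWER_DEMOS + f"Context: {x} Question: {q} Answer:")

def prompt_chain(x: str) -> str:
    # answer(question(x)): one functional prompt chain, reusable across inputs.
    return answer(question(x), x)
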
  11. Scalable Collection
  12. Scalable Collection

    • Vary the in-context demonstrations and the style of the prompt question • Each unique prompt()-chain is a different view of the task • Each unique prompt()-chain emphasizes different aspects of x Model Structure (with the running example: "Who went to the park?", "Did John go to the park?", "Where did John go?")
  13. Prompt aggregation

    • Majority vote (MV) is the primary aggregation strategy in prior prompting work[1],[2] • MV cannot account for dependencies between prompts or for their varied accuracies • Weak supervision (WS) is a powerful framework that learns the accuracies and correlations of noisy sources to create training data[3] • Smith et al.[4] applied WS to aggregate the outputs of hand-curated prompts into a labeled dataset • Prompt()-chains have varied accuracies and dependencies (Appendix A) -> weak supervision Model Structure [1] Jiang, Zhengbao, et al. "How can we know what language models know?" TACL 2020. [2] Schick, Timo, and Hinrich Schütze. "It's not just size that matters: Small language models are also few-shot learners." 2020. [3] Ratner, Alexander, et al. "Snorkel: Rapid training data creation with weak supervision." 2017. [4] Smith, Ryan, et al. "Language models in the loop: Incorporating prompting into weak supervision." 2022.
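
    One concrete way to do this kind of WS aggregation is Snorkel's LabelModel (cited as [3] above); the sketch below assumes the snorkel 0.9.x API and made-up vote data, and illustrates the idea rather than the exact estimator used in the AMA paper.

import numpy as np
from snorkel.labeling.model import LabelModel

# Rows = unlabeled inputs, columns = prompt()-chains; entries are votes in {0, 1},
# with -1 marking an abstain. The values here are made up for illustration.
L = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, -1, 1],
])

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=0)  # learns source accuracies without labels
preds = label_model.predict(L)                    # argmax_y Pr(y | P(x)) per input
print(preds)
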
  14. Prompt aggregation

    Model Structure. G = (V, E), V = {y, P(x)}, and (p_i(x), p_j(x)) ∈ E iff p_i(x) and p_j(x) are conditionally dependent given y. φ_WS(x) = argmax_{y ∈ Y} Pr_G(y | P(x))
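
    For the special case of an edgeless graph G (all prompts conditionally independent given y), the argmax above reduces to a naive-Bayes-style combination of the prompts' learned accuracies; the sketch below assumes binary labels and already-estimated accuracies, and is an illustration rather than the paper's full estimator.

import numpy as np

def ws_aggregate(votes, accuracies, prior=0.5, labels=(0, 1)):
    """phi_WS(x) = argmax_y Pr_G(y | P(x)) for an edgeless G,
    where accuracies[i] approximates Pr(p_i(x) = y)."""
    log_post = []
    for y in labels:
        lp = np.log(prior if y == 1 else 1 - prior)
        for v, acc in zip(votes, accuracies):
            lp += np.log(acc if v == y else 1 - acc)
        log_post.append(lp)
    return labels[int(np.argmax(log_post))]

# Three prompt votes: one highly accurate prompt can outweigh a 2-vs-1 majority.
print(ws_aggregate([1, 0, 0], accuracies=[0.95, 0.6, 0.55]))  # -> 1
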
  15. Information Flow Metric

    • H(y | ŷ), the conditional entropy, measures the amount of uncertainty remaining in the true label y given a prediction ŷ • In our setting, ŷ = φ(P(x)) depends on the two components, P and φ • The first term shows that H(y | ŷ) depends on the quality and quantity of the individual prompts in P(x) • The second term shows that H(y | ŷ) depends on how the aggregation step compresses the information Information Flow
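
    H(y | ŷ) can be estimated directly from predictions on labeled data; a minimal sketch with made-up labels (entropy in bits):

import numpy as np
from collections import Counter

def conditional_entropy(y_true, y_pred):
    """H(y | y_hat): uncertainty about the true label remaining after seeing the prediction."""
    n = len(y_true)
    joint = Counter(zip(y_pred, y_true))   # counts of (y_hat, y) pairs
    pred_marginal = Counter(y_pred)        # counts of y_hat
    h = 0.0
    for (yp, yt), c in joint.items():
        p_joint = c / n                    # p(y_hat, y)
        p_cond = c / pred_marginal[yp]     # p(y | y_hat)
        h -= p_joint * np.log2(p_cond)
    return h

# A perfect predictor leaves no uncertainty; an uninformative one leaves 1 bit.
print(conditional_entropy([1, 0, 1, 0], [1, 0, 1, 0]))  # 0.0
print(conditional_entropy([1, 0, 1, 0], [1, 1, 0, 0]))  # 1.0
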
  16. Experiment

    • 20 popular language benchmarks used in GPT-3[1] • 14 unique LMs including 4 model families (Neo[2], OPT[3], BLOOM, T0[4]) • 125M–175B parameters • Benchmark datasets: • SuperGLUE[5] • NLI[6] • Classification[7] • QA[8] Results [1] Brown, Tom, et al. "Language models are few-shot learners." NeurIPS 2020. [2] Black, Sid, et al. "GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow." 2021. [3] Zhang, Susan, et al. "OPT: Open pre-trained transformer language models." 2022. [4] Sanh, Victor, et al. "Multitask prompted training enables zero-shot task generalization." 2021. [5] Wang, Alex, et al. "SuperGLUE: A stickier benchmark for general-purpose language understanding systems." NeurIPS 2019. [6] Mostafazadeh, Nasrin, et al. "LSDSem 2017 shared task: The story cloze test." 2017. [7] Zhang, Xiang, Junbo Zhao, and Yann LeCun. "Character-level convolutional networks for text classification." NeurIPS 2015. [8] Kasai, Jungo, et al. "RealTime QA: What's the answer right now?" 2022.
  17. Appendix - Prompt aggregation

    Model Structure. Goal: P(Y | λ_1, λ_2, …, λ_m). Accuracies: E[λ_1 Y], E[λ_2 Y], … Correlations: E[λ_1 λ_2], … The goal is to induce this from the accuracies and correlations, but we do not know the graph structure Ĝ
  18. Appendix - Prompt aggregation

    Model Structure. Goal: P(Y | λ_1, λ_2, …, λ_m). Accuracies: E[λ_1 Y], E[λ_2 Y], … Correlations: E[λ_1 λ_2], … The goal is to induce this from the accuracies and correlations; now we know the graph structure Ĝ
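
    As an illustration of how accuracies E[λ_i Y] can be recovered from observable correlations, here is the standard "triplet" identity from the weak-supervision literature, sketched under the assumption that the voters' errors are independent given Y (votes in {-1, +1}); this shows the general idea, not necessarily the exact procedure in the paper's appendix. The data below are synthetic.

import numpy as np

def triplet_accuracies(L):
    """Recover |E[lambda_i * Y]| for three conditionally independent voters from
    their observable pairwise second moments, using
    E[l_i l_j] = E[l_i Y] * E[l_j Y]  =>  E[l_i Y]^2 = E[l_i l_j] * E[l_i l_k] / E[l_j l_k]."""
    o = L.T @ L / len(L)  # observable second moments E[l_i l_j]
    accs = []
    for i, j, k in [(0, 1, 2), (1, 0, 2), (2, 0, 1)]:
        accs.append(np.sqrt(abs(o[i, j] * o[i, k] / o[j, k])))
    return np.array(accs)

# Synthetic check: three voters that agree with the true label with prob 0.9, 0.8, 0.7,
# so E[lambda_i Y] = 2p - 1 should come out near 0.8, 0.6, 0.4.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=2000)
L = np.stack([np.where(rng.random(2000) < p, y, -y) for p in (0.9, 0.8, 0.7)], axis=1)
print(triplet_accuracies(L))
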