
Ask Me Anything


Seunghyun Hwang

June 01, 2023

Transcript

  1. Ask Me Anything: A Simple Strategy for Prompting Language Models

    Presented by Seunghyun Hwang, 2023-06-01. Simran Arora, Avanika Narayan, Mayee Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Christopher Ré. ICLR 2023. Reading club / Research outcome
  2. Prompt

    • For super-large models like 175B-parameter GPT-3 [1], fine-tuning is hard and also costly
    • Instead, fix their parameters and apply them to different tasks with different prompts
    • Hard prompts, prompt tuning [2], transfer learning [3], …
    [1] Brown, Tom, et al. "Language models are few-shot learners." NeurIPS 2020.
    [2] Lester, Brian, Rami Al-Rfou, and Noah Constant. "The power of scale for parameter-efficient prompt tuning." 2021.
    [3] Asai, Akari, et al. "ATTEMPT: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts." EMNLP 2022.
    Background Information
  3. Prompt engineering

    • Prompt engineering is the process of designing natural-language specifications of a task
    • Manually rewrite task inputs into the prescribed formats on an example-by-example basis [1]
    • Simplify complex tasks to achieve better performance in the prompting paradigm [2]
    [1] Mishra, Swaroop, et al. "Reframing Instructional Prompts to GPTk's Language." 2021.
    [2] Creswell, Antonia, Murray Shanahan, and Irina Higgins. "Selection-Inference: Exploiting large language models for interpretable logical reasoning." 2022.
    Background Information
  4. Motivation

    • Evaluating LLM prompting performance on a broad set of tasks shows the process to be brittle
    • Small changes to the prompt result in large performance variations [1],[2]
    • Significant effort is dedicated to designing a perfect prompt for each task
    [1] Zhao, Zihao, et al. "Calibrate before use: Improving few-shot performance of language models." PMLR 2021.
    [2] Holtzman, Ari, et al. "Surface form competition: Why the highest probability answer isn't always right." 2021.
    Motivation
  5. Motivation

    • Instead, aggregate the predictions of multiple effective yet imperfect prompts to improve prompting performance
    • The prompts vote on the input's true label to produce a final prediction
    • Motivation -> Ask Me Anything: A Simple Strategy for Prompting Language Models
  6. Ask Me Anything Prompting

    • Supervised task $(X, Y)$, where $x \in X$ is the input and $y \in Y$ is the output
    • Unlabeled dataset $D = \{x_i\}_{i=1}^{n}$ for which we wish to predict each $y_i$
    • Given a prompt $p$, we use $p : X \to Y$ to denote the output of the prompted LLM, i.e. $\hat{y} = p(x)$
    • Define a collection of $m$ prompts $P = [p_1, p_2, \dots, p_m]$: (1) apply $P$ to each $x \in D$, giving $P(x) = [p_1(x), \dots, p_m(x)]$; (2) aggregate the predictions using an aggregator function $\phi : Y^m \to Y$ to produce the output $\hat{y}$ for $x$
    • 3 key problems: effective prompts (1), scalable collection (1), prompt aggregation (2)
    Model Structure
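A minimal Python sketch of this setup, assuming a hypothetical llm(text) callable (names and templates are illustrative assumptions, not the paper's code):

```python
from collections import Counter
from typing import Callable, List

def make_prompt(template: str, llm: Callable[[str], str]) -> Callable[[str], str]:
    # Each prompt p_i : X -> Y wraps the LLM with a different template.
    def p(x: str) -> str:
        return llm(template.format(input=x))
    return p

def apply_prompts(prompts: List[Callable[[str], str]], x: str) -> List[str]:
    # P(x) = [p_1(x), ..., p_m(x)]
    return [p(x) for p in prompts]

def majority_vote(votes: List[str]) -> str:
    # Simplest aggregator phi : Y^m -> Y; AMA replaces this with weak supervision.
    return Counter(votes).most_common(1)[0][0]
```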
  7. Effective Prompts

    • High-quality prompts are a precursor to improvements from aggregation
    • Previous approaches [1] focus on a single task -> per-task prompt engineering
    • Is the original standard (hard) prompt format [2] the right one?
    • ("John invited Mark to come watch Jurassic Park. Output True or False?") - restrictive
    • ("John invited Mark to come watch Jurassic _", fill in the blank: "Park") - cloze question
    • ("Where did John invite Mark?") - open-ended question
    [1] Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." 2022.
    [2] Brown, Tom, et al. "Language models are few-shot learners." NeurIPS 2020.
    Model Structure
  8. Effective Prompts

    • Open-ended questions improve performance significantly (on GPT-J 6B)
    • For example, in WSC:
    • Restrictive form: "The pronoun 'his' refers to 'Mark' in the context. True or False?"
    • Open-ended form: given "Mark went to the park with his dog.", reformat to "What does 'his' refer to?"
    • 38% lift (50% -> 69.2%)
    Model Structure
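As a sketch of this reformatting step (the helper names and the answer-to-label mapping are illustrative assumptions, not the paper's code):

```python
def to_open_ended(context: str, pronoun: str) -> str:
    # Rewrite a restrictive True/False WSC query as an open-ended question.
    return f"{context} What does '{pronoun}' refer to?"

def map_answer_to_label(answer: str, candidate: str) -> str:
    # Map the model's free-form answer back into the task's label space.
    return "True" if candidate.lower() in answer.lower() else "False"

q = to_open_ended("Mark went to the park with his dog.", "his")
print(q)  # Mark went to the park with his dog. What does 'his' refer to?
print(map_answer_to_label("Mark", "Mark"))  # True
```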
  9. Effective Prompts

    • Why is the QA prompt format effective?
    • Intuitively, answering open-ended questions is aligned with the next-token-prediction language-modeling objective
    • An analysis of EleutherAI's [1] Pile corpus [2] shows:
    • Open-ended QA structures appear about 1000x more frequently than the restrictive format
    • i.e., there are large imbalances between the formats' frequencies in the pretraining corpus
    [1] https://www.eleuther.ai/
    [2] Black, Sid, et al. "GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow." 2021.
    Model Structure
  10. Scalable Collection

    • Prior works manually rewrite each task input into the new format [1],[2]
    • Instead, given an input $x$, apply the prompt-chain $\mathrm{answer}(\mathrm{question}(x))$
    • $\mathrm{question}() : x \to q$ - generates a question form of an input $x$ - (1)
    • $\mathrm{answer}() : q \to a$ - applies the question $q$ from (1) to the context of $x$ to produce the answer $a$
    • $\mathrm{question}()$ and $\mathrm{answer}()$ also contain in-context demonstrations of their tasks
    • Chains are (1) reused across inputs, and (2) different pairs of functional prompts can be combined to create variety (see the sketch below)
    Model Structure
    [1] Mishra, Swaroop, et al. "Reframing Instructional Prompts to GPTk's Language." 2021.
    [2] Wu, Tongshuang, Michael Terry, and Carrie Jun Cai. "AI Chains: Transparent and controllable human-AI interaction by chaining large language model prompts." CHI 2022.
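A minimal sketch of one such chain, assuming a hypothetical llm(text) callable; the demonstrations below are abbreviated placeholders, not the paper's prompts:

```python
def question(x: str, llm) -> str:
    # question() : x -> q, with in-context demonstrations of the rewrite
    demos = (
        "Statement: Jack camped with Mark. Question: Who did Jack camp with?\n"
        "Statement: John invited Mark. Question: Who did John invite?\n"
    )
    return llm(demos + f"Statement: {x} Question:")

def answer(q: str, context: str, llm) -> str:
    # answer() : q -> a, answering q against the original input's context
    demos = (
        "Context: Joe's birthday was yesterday.\n"
        "Question: When was Joe's birthday? Answer: yesterday\n"
    )
    return llm(demos + f"Context: {context}\nQuestion: {q} Answer:")

def prompt_chain(x: str, llm) -> str:
    # One prompt()-chain: p(x) = answer(question(x))
    return answer(question(x, llm), x, llm)
```

Because the demonstrations are fixed, the same chain is reused across all inputs; swapping in different question()/answer() demonstration pairs yields the prompt variety that aggregation later exploits.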
  12. Scalable Collection

    • Vary the in-context demonstrations and the style of the prompt question
    • Each unique prompt()-chain is a different view of the task
    • Each unique prompt()-chain emphasizes different aspects of x
    (With our running example: "Who went to the park?", "Did John go to the park?", "Where did John go?")
    Model Structure
  13. Prompt aggregation

    • Majority vote (MV) is the primary aggregation strategy in prior prompting work [1][2]
    • but MV cannot capture dependencies between prompts or their varied accuracies
    • Weak supervision (WS) is a powerful framework that learns the accuracies and correlations of noisy sources to create training data [3]
    • Smith et al. [4] applied WS to aggregate the outputs of hand-curated prompts into a labeled dataset
    • Prompt()-chains have varied accuracies and dependencies (Appendix A) -> use weak supervision
    Model Structure
    [1] Jiang, Zhengbao, et al. "How can we know what language models know?" TACL 2020.
    [2] Schick, Timo, and Hinrich Schütze. "It's not just size that matters: Small language models are also few-shot learners." 2020.
    [3] Ratner, Alexander, et al. "Snorkel: Rapid training data creation with weak supervision." 2017.
    [4] Smith, Ryan, et al. "Language models in the loop: Incorporating prompting into weak supervision." 2022.
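To see why modeling accuracies matters, here is a toy accuracy-weighted vote (a deliberate simplification: AMA's WS aggregator also models dependencies and needs no labeled accuracies):

```python
import math
from collections import defaultdict

def weighted_vote(votes, accuracies):
    # Each prompt votes with weight log(acc / (1 - acc)), a naive-Bayes-style
    # score, so one highly accurate prompt can outvote several weak ones.
    scores = defaultdict(float)
    for v, acc in zip(votes, accuracies):
        scores[v] += math.log(acc / (1 - acc))
    return max(scores, key=scores.get)

# Two weak prompts say True; one strong prompt says False:
print(weighted_vote(["True", "True", "False"], [0.55, 0.55, 0.95]))
# -> "False", whereas plain majority vote would return "True"
```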
  14. Prompt aggregation

    Model Structure
    $G = (V, E)$, $V = \{y, P(x)\}$, and $(p_i(x), p_j(x)) \in E$ iff $p_i(x)$ and $p_j(x)$ are conditionally dependent given $y$
    $\phi_{\mathrm{WS}}(x) = \arg\max_{y \in Y} \Pr_G(y \mid P(x))$
  15. Information Flow Metric

    • $H(y \mid \hat{y})$, the conditional entropy, measures the amount of uncertainty remaining in the true label $y$ given a prediction $\hat{y}$
    • In our setting, $\hat{y} = \phi(P(x))$ depends on two components, $P$ and $\phi$, and the paper decomposes $H(y \mid \hat{y})$ accordingly into two terms
    • The first term shows $H(y \mid \hat{y})$ depends on the quality and quantity of the individual prompts in $P(x)$
    • The second term shows $H(y \mid \hat{y})$ depends on how the aggregation step compresses the information
    Information Flow
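A small sketch of estimating $H(y \mid \hat{y})$ empirically from paired labels and predictions (the standard definition, not code from the paper):

```python
import math
from collections import Counter

def conditional_entropy(y_true, y_pred):
    # H(y | y_hat) = - sum over (y_hat, y) of p(y_hat, y) * log2 p(y | y_hat)
    n = len(y_true)
    joint = Counter(zip(y_pred, y_true))
    pred_counts = Counter(y_pred)
    h = 0.0
    for (yp, yt), c in joint.items():
        h -= (c / n) * math.log2(c / pred_counts[yp])
    return h  # in bits; 0 means y_hat fully determines y

# A perfectly informative predictor leaves no uncertainty:
print(conditional_entropy(["T", "F", "T"], ["T", "F", "T"]))        # 0.0
print(conditional_entropy(["T", "T", "F", "F"], ["T", "T", "T", "F"]))  # ~0.69
```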
  16. Experiment

    • 20 popular language benchmarks used in GPT-3 [1]
    • 14 unique LMs covering 4 model families (Neo [2], OPT [3], BLOOM, T0 [4])
    • 125M-175B parameters
    • Benchmark datasets: SuperGLUE [5], NLI [6], Classification [7], QA [8]
    Results
    [1] Brown, Tom, et al. "Language models are few-shot learners." NeurIPS 2020.
    [2] Black, Sid, et al. "GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow." 2021.
    [3] Zhang, Susan, et al. "OPT: Open pre-trained transformer language models." 2022.
    [4] Sanh, Victor, et al. "Multitask prompted training enables zero-shot task generalization." 2021.
    [5] Wang, Alex, et al. "SuperGLUE: A stickier benchmark for general-purpose language understanding systems." NeurIPS 2019.
    [6] Mostafazadeh, Nasrin, et al. "LSDSem 2017 shared task: The story cloze test." 2017.
    [7] Zhang, Xiang, Junbo Zhao, and Yann LeCun. "Character-level convolutional networks for text classification." NeurIPS 2015.
    [8] Kasai, Jungo, et al. "RealTime QA: What's the answer right now?" 2022.
  17. Appendix - Prompt aggregation

    Model Structure
    Goal: $P(Y \mid \lambda_1, \lambda_2, \dots, \lambda_m)$
    Accuracies: $E[\lambda_1 Y], E[\lambda_2 Y], \dots$  Correlations: $E[\lambda_1 \lambda_2], \dots$
    The goal is to induce $P(Y \mid \lambda_1, \dots, \lambda_m)$ from the accuracies and correlations, but we do not yet know the dependency graph $\hat{G}$
  18. Appendix - Prompt aggregation

    Model Structure
    Goal: $P(Y \mid \lambda_1, \lambda_2, \dots, \lambda_m)$
    Accuracies: $E[\lambda_1 Y], E[\lambda_2 Y], \dots$  Correlations: $E[\lambda_1 \lambda_2], \dots$
    The goal is to induce $P(Y \mid \lambda_1, \dots, \lambda_m)$ from the accuracies and correlations; now the dependency graph $\hat{G}$ is known (estimated)
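For intuition on these moments, a toy numpy sketch with synthetic votes in {-1, +1} (illustrative only; WS estimators recover the accuracies from the observable correlations without seeing the labels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: true labels y in {-1, +1} and m = 3 prompt outputs
# ("votes") that are independently correct with probability acc_i.
y = rng.choice([-1, 1], size=1000)
accs = (0.9, 0.7, 0.6)
votes = np.stack(
    [y * rng.choice([1, -1], size=1000, p=[a, 1 - a]) for a in accs], axis=1
)

# Accuracies E[lambda_i * Y] (equals 2*acc_i - 1; requires labels to observe)
print((votes * y[:, None]).mean(axis=0))  # ~ [0.8, 0.4, 0.2]

# Pairwise correlations E[lambda_i * lambda_j] are observable WITHOUT labels;
# under conditional independence, E[l_i l_j] = E[l_i Y] * E[l_j Y], the
# identity WS methods exploit to recover accuracies from unlabeled data.
print((votes[:, 0] * votes[:, 1]).mean())  # ~ 0.8 * 0.4 = 0.32
```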