

Training Generative Question-Answering on Synthetic Data Obtained from an Instruct-tuned Model

This paper presents a simple and cost-effective method for synthesizing data to train question-answering systems. Fine-tuning GPT models is a common practice in resource-rich languages like English; however, it becomes challenging for non-English languages due to the scarcity of sufficient question-answer (QA) pairs. Existing approaches use question and answer generators trained on human-authored QA pairs, which incurs substantial human expense. In contrast, we use an instruct-tuned model to generate QA pairs in a zero-shot or few-shot manner. We conduct experiments to compare various strategies for obtaining QA pairs from the instruct-tuned model. The results demonstrate that a model trained on our proposed synthetic data achieves performance comparable to a model trained on manually curated datasets, without incurring human costs.

Kosuke Takahashi

December 07, 2023


Transcript

  1. Training Generative Question-Answering on Synthetic Data Obtained from an Instruct-tuned Model
     Kosuke Takahashi ([email protected]), Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki
     Stockmark, National Institute of Advanced Industrial Science and Technology
  2. Training Steps of LLMs (decoder-only models)
     • Pre-training: the LLM is trained on a large amount of unlabeled text.
     • Fine-tuning with instructions: the LLM is further trained on texts labeled with instructions.
  3. Fine-tuning LLMs improves performance
     reference: https://arxiv.org/abs/2303.10420
     • davinci: plain GPT-3
     • text-XX-YY: GPT-3 fine-tuned on natural-language texts
     • code-XX-YY: GPT-3 fine-tuned on natural-language texts and code
     • gpt-3.5-turbo: GPT-3 fine-tuned for chat
  4. Instruct-tuning improves the performance of Question-Answering
     reference: https://arxiv.org/abs/2303.10420
     • davinci: plain GPT-3
     • text-XX-YY: GPT-3 fine-tuned on natural-language texts
     • code-XX-YY: GPT-3 fine-tuned on natural-language texts and code
     • gpt-3.5-turbo: GPT-3 fine-tuned for chat
  5. Question-Answering (QA) System Using an LLM
     • The LLM generates an answer from a pair of question and context.
     • Answers are not always verbatim extractions from the context.
     Question (query from users): What can be an energy source of the next generation?
     Context: As a next-generation energy source that does not emit CO2, hydrogen is attracting increasing attention. In particular, "green hydrogen" produced using electricity derived from renewable energy is… (omitted)
     Answer: Hydrogen, especially green hydrogen, is attracting increasing attention as a next-generation energy source.
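To make this setup concrete, here is a minimal sketch of context-aware QA with a causal language model via Hugging Face transformers. The checkpoint matches the model fine-tuned later on slide 12, but the prompt template and generation settings are illustrative assumptions, not the authors' code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "cyberagent/open-calm-7b"  # base model named on slide 12
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def answer(context: str, question: str) -> str:
    # Illustrative prompt template; the deck does not show the exact format.
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, i.e. the answer.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```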
  6. Resources of Context-Aware QA Data
     • Japanese QA data is limited compared to English.
     • JSQuAD [Kurihara+, 2022]
       • Japanese version of SQuAD (an English QA dataset)
       • Train data: 62k, about half the size of English SQuAD (130k)
       • Domain: Wikipedia
     • JAQKET [Suzuki+, 2020]
       • Candidate-selection-style QA (no context)
       • Train data: 13k
       • Domain: miscellaneous quiz questions
  7. Related Work: Automatic Generation of QA Pairs
     • Supervised approach [Sachan+, Tang+, Lee+]
       • Learns to generate a QA pair (or either part) from a context
       • Requires data with triples of question, answer, and context
     • Unsupervised approach [Wang+]
       • Uses GPT-3 to generate all of question, answer, and context
       • The quality of the generated data is low: only 54% of generated examples are valid
  8. Proposed: Generation of QA Pairs from Contexts
     • Use an instruct-tuned model to generate QA pairs (gpt-3.5-turbo-0613 was used); a code sketch follows below.
     • No training data is required for QA-pair generation.
     Source context: The G7 Summit Leaders' Statement released on April 20 clearly stated for the first time the acceleration of the phase-out of fossil fuels, but did not include a timetable for the phase-out of coal-fired power generation, which had been the focus of the summit.
     Question: What statement was made at the summit?
     Answer: According to the statement issued at the summit, the accelerated phase-out of fossil fuels was specified.
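A minimal sketch of this generation step, assuming the legacy openai (<1.0) ChatCompletion interface; the prompt wording paraphrases slide 11 rather than reproducing the authors' exact code.

```python
import json
import openai  # assumes openai<1.0; newer versions use OpenAI().chat.completions

def build_prompt(context: str) -> str:
    # Zero-shot, N=1 prompt paraphrased from slide 11.
    return (
        "Based on the given texts, please make a pair of answerable question "
        "and answer. Please make the answer in Japanese polite language. "
        "Please respond in the JSON format.\n"
        "## example\n"
        'texts: "texts to extract the pair of question and answer"\n'
        'output: {"Question": "the question that can be answered from the '
        'texts", "Answer": "the answer to the question"}\n'
        "## input\n"
        "texts: " + context + "\n"
        "output:"
    )

def generate_qa_pair(context: str) -> dict:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",  # the model named on this slide
        messages=[{"role": "user", "content": build_prompt(context)}],
    )
    # The model is asked for JSON; malformed outputs would raise here.
    return json.loads(response["choices"][0]["message"]["content"])
```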
  9. Variation of Source Contexts from which QA Pairs Are Generated
     • News: frequently viewed news articles on Anews (Stockmark's news service)
     • Wiki: randomly sampled Japanese Wikipedia articles
     • JSQuAD: the training data of JSQuAD, annotated on Wikipedia articles by crowd-workers
     The number of contexts is unified across the three sources when training the QA models (see the sketch below).
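A sketch of that size-unification step, under the assumption that it means downsampling every source to the size of the smallest one; the slide does not specify the mechanism.

```python
import random

def unify_context_counts(sources: dict[str, list[str]],
                         seed: int = 0) -> dict[str, list[str]]:
    """Downsample each context source to the size of the smallest source."""
    random.seed(seed)
    n = min(len(contexts) for contexts in sources.values())
    return {name: random.sample(contexts, n) for name, contexts in sources.items()}
```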
  10. Prompts to generate QA pairs: Zero / one example in the prompt
     • The instructions and an explanation of the output structure are described in order.
     • In the one-shot setting, an actual example is given under "## example".
     (Slide figures: the zero-shot prompt to generate QA pairs from a context, and the "## example" section of the one-shot prompt.)
  11. Prompts to generate QA pairs: Number of generated QA pairs
     • When N=3, the prompt specifies the number of QA pairs to be generated.
     Zero-shot prompt, N = 1:
       Based on the given texts, please make a pair of answerable question and answer. Please make the answer in Japanese polite language. Please respond in the JSON format.
       ## example
       texts: "texts to extract the pair of question and answer"
       output: {"Question": "the question that can be answered from the texts", "Answer": "the answer to the question"}
       ## input
       texts: {QA context}
       output:
     Zero-shot prompt, N = 3:
       Based on the given texts, please make three pairs of answerable questions and answers. Please make the answers in Japanese polite language. Please respond in the JSON format.
       ## example
       texts: "texts to extract the pair of question and answer"
       output: [{"Question": "the first question that can be answered from the texts", "Answer": "the answer to the first question"}, {"Question": "the second question that can be answered from the texts", "Answer": "the answer to the second question"}, {"Question": "the third question that can be answered from the texts", "Answer": "the answer to the third question"}]
       ## input
       texts: {QA context}
       output:
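Since the model is asked to return raw JSON, a validation step is natural before the pairs are used for training. This is a sketch of one possible filter; the exact validation rules are not given in the deck.

```python
import json

def parse_qa_pairs(raw: str) -> list[dict]:
    """Keep only well-formed {"Question": ..., "Answer": ...} objects."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []  # discard outputs that are not valid JSON
    pairs = parsed if isinstance(parsed, list) else [parsed]  # N=1 yields one object
    return [
        p for p in pairs
        if isinstance(p, dict) and p.get("Question") and p.get("Answer")
    ]
```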
  12. Evaluation of the Quality of Generated QA Pairs
     • Evaluate generative context-aware QA models trained on each context source (News / Wikipedia / JSQuAD).
     • Test data: the evaluation set of JSQuAD.
     • For the QA task, a Japanese GPT model (CyberAgent's open-calm-7b) is fine-tuned.
     • Automatic evaluation of QA: BERTScore, BLEU (see the sketch below).
     • Human evaluation: conducted by 4 NLP engineers or researchers; scored by the rate of correct answers.
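A minimal sketch of the automatic evaluation, assuming the bert-score and sacrebleu packages; the deck does not specify the exact scoring configuration, so the Japanese tokenizer choice here is an assumption.

```python
from bert_score import score as bert_score
import sacrebleu

def evaluate(predictions: list[str], references: list[str]) -> dict:
    # BERTScore F1, averaged over the test set
    _, _, f1 = bert_score(predictions, references, lang="ja")
    # Corpus BLEU; "ja-mecab" tokenization is an assumption (requires MeCab)
    bleu = sacrebleu.corpus_bleu(predictions, [references], tokenize="ja-mecab")
    return {"BERTscore": f1.mean().item(), "BLEU": bleu.score}
```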
  13. Variations of the Experimental Settings
     • Contexts from which QA pairs are generated: news, wiki, JSQuAD
       • GPT: the Japanese GPT model without any fine-tuning (baseline)
       • Human: training labels (human-authored QA pairs)
     • Number of QA pairs per context: N = 1 or N = 3
     • Whether to use in-context learning:
       • zero-shot: no actual example of a QA pair in the prompt
       • one-shot: one actual example of a QA pair in the prompt
  14. Results of Automatic Evaluation

     Context  N  Prompt  BERTscore  BLEU
     GPT      -  -       0.601      0.00
     Human    -  -       0.899      5.64
     news     1  zero    0.697      0.02
     wiki     1  zero    0.713      0.03
     JSQuAD   1  zero    0.724      1.55
     news     1  one     0.738      0.11
     wiki     1  one     0.775      0.09
     JSQuAD   1  one     0.863      4.83
     news     3  zero    0.713      0.38
     wiki     3  zero    0.706      0.23
     JSQuAD   3  zero    0.740      1.85
     news     3  one     0.747      1.25
     wiki     3  one     0.838      1.66
     JSQuAD   3  one     0.889      6.77

     • Domain similarity of the source context is important.
     • N=3 outperforms N=1, and one-shot outperforms zero-shot.
     • Automatically generated QA pairs achieve scores comparable to human-generated ones.
  15. Results of Human Evaluation

     QA pairs                 Accuracy (%)
     Human                    38.4
     wiki (N=3, one-shot)     16.6
     JSQuAD (N=3, one-shot)   45.4
     Gold                     90.4

     • Our best-performing model, JSQuAD (N=3, one-shot), outperforms Human (the model fine-tuned on human-authored QA pairs).
     • [Gilardi+] also reports that ChatGPT's annotations outperform those of crowd-workers.
     • All QA models are still far less accurate than the gold (human-written) answers.
  16. Conclusion
     • We proposed using an instruction-tuned model to synthesize QA pairs.
     • Our experimental results demonstrate that models trained on automatically generated QA pairs achieve comparable or even superior performance to the model fine-tuned on human-authored QA pairs.
     • The proposed data expansion, which treats data generation as the inverse problem of the QA task, can be applied to other tasks whose inverse problem is easier to solve; a sketch follows below.
       • Relation extraction: automatically generate documents from which a specified triple can be extracted.
       • e.g. triple: (Poland, area in total, 120733)
       • Generated: Poland, officially the Republic of Poland, is a country in Central Europe. It is divided into 16 administrative regions called provinces and covers an area of 120,733 square miles.
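An illustrative sketch of the inverse-problem idea for relation extraction; the prompt wording is our own, and only the example triple comes from the slide.

```python
def triple_to_prompt(subject: str, relation: str, value: str) -> str:
    """Build a prompt asking an instruct-tuned model to write a passage
    from which the given relation triple can be extracted."""
    return (
        "Write a short encyclopedic paragraph that explicitly states the "
        f"following relation, so that the triple ({subject}, {relation}, "
        f"{value}) can be extracted from it."
    )

print(triple_to_prompt("Poland", "area in total", "120733"))
```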