

Training Generative Question-Answering on Synthetic Data Obtained from an Instruct-tuned Model

This paper presents a simple and cost-effective method for synthesizing data to train question-answering systems. Fine-tuning GPT models is a common practice in resource-rich languages like English; however, it becomes challenging for non-English languages due to the scarcity of sufficient question-answer (QA) pairs. Existing approaches use question and answer generators trained on human-authored QA pairs, which incurs substantial human expense. In contrast, we use an instruct-tuned model to generate QA pairs in a zero-shot or few-shot manner. We conduct experiments to compare various strategies for obtaining QA pairs from the instruct-tuned model. The results demonstrate that a model trained on our proposed synthetic data achieves performance comparable to a model trained on manually curated datasets, without incurring human costs.

Kosuke Takahashi

December 07, 2023


Transcript

  1. Training Generative Question-Answering on Synthetic Data Obtained from an Instruct-tuned Model
     Kosuke Takahashi ([email protected]), Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki
     Stockmark, National Institute of Advanced Industrial Science and Technology
  2. Training Steps of LLMs (decoder-only models)
     • Pre-training: the LLM is trained on a large amount of unlabeled text.
     • Fine-tuning with instructions: the LLM is further trained on texts labeled with instructions.
  3. Fine-tuning LLMs improves performance
     reference: https://arxiv.org/abs/2303.10420
     • davinci: plain GPT-3
     • text-XX-YY: GPT-3 fine-tuned on natural-language texts
     • code-XX-YY: GPT-3 fine-tuned on natural-language texts and code
     • gpt-3.5-turbo: GPT-3 fine-tuned for chat
  4. Instruct-tuning improves the performance of Question-Answering
     reference: https://arxiv.org/abs/2303.10420
     • davinci: plain GPT-3
     • text-XX-YY: GPT-3 fine-tuned on natural-language texts
     • code-XX-YY: GPT-3 fine-tuned on natural-language texts and code
     • gpt-3.5-turbo: GPT-3 fine-tuned for chat
  5. Question-Answering (QA) System Using an LLM
     • The LLM generates an answer from a pair of question and context.
     • Answers are not always verbatim extractions from the context.
     Question (query from users): What can be an energy source of the next generation?
     Context: As a next-generation energy source that does not emit CO2, hydrogen is attracting increasing attention. In particular, "green hydrogen" produced using electricity derived from renewable energy is… (omitted)
     Answer: Hydrogen, especially green hydrogen, is attracting increasing attention as a next-generation energy source.
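To make this setup concrete, here is a minimal sketch of context-aware QA with a causal language model via Hugging Face transformers. The checkpoint matches the model fine-tuned later on slide 12, but the prompt template and generation settings are illustrative assumptions, not the authors' code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "cyberagent/open-calm-7b"  # base model named on slide 12
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def answer(context: str, question: str) -> str:
    # Illustrative prompt template; the deck does not show the exact format.
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, i.e. the answer.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```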
  6. Resources of Context-Aware QA Data
     • Japanese QA data is limited compared to English.
     • JSQuAD [Kurihara+, 2022]
       • Japanese version of SQuAD (an English QA dataset)
       • Train data: 62k, about half the size of English SQuAD (130k)
       • Domain: Wikipedia
     • JAQKET [Suzuki+, 2020]
       • Candidate-selection-style QA (no context)
       • Train data: 13k
       • Domain: miscellaneous quiz questions
  7. Related Work: Automatic Generation of QA Pairs
     • Supervised approach [Sachan+, Tang+, Lee+]
       • Learns to generate a QA pair (or either part) from a context
       • Requires data with triples of question, answer, and context
     • Unsupervised approach [Wang+]
       • Uses GPT-3 to generate all of question, answer, and context
       • The quality of the generated data is low: only 54% of generated examples are valid
  8. Proposed: Generation of QA Pairs from Contexts
     • Use an instruct-tuned model to generate QA pairs (gpt-3.5-turbo-0613 was used); a code sketch follows below.
     • No training data is required for QA-pair generation.
     Source context: The G7 Summit Leaders' Statement released on April 20 clearly stated for the first time the acceleration of the phase-out of fossil fuels, but did not include a timetable for the phase-out of coal-fired power generation, which had been the focus of the summit.
     Question: What statement was made at the summit?
     Answer: According to the statement issued at the summit, the accelerated phase-out of fossil fuels was specified.
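A minimal sketch of this generation step, assuming the legacy openai (<1.0) ChatCompletion interface; the prompt wording paraphrases slide 11 rather than reproducing the authors' exact code.

```python
import json
import openai  # assumes openai<1.0; newer versions use OpenAI().chat.completions

def build_prompt(context: str) -> str:
    # Zero-shot, N=1 prompt paraphrased from slide 11.
    return (
        "Based on the given texts, please make a pair of answerable question "
        "and answer. Please make the answer in Japanese polite language. "
        "Please respond in the JSON format.\n"
        "## example\n"
        'texts: "texts to extract the pair of question and answer"\n'
        'output: {"Question": "the question that can be answered from the '
        'texts", "Answer": "the answer to the question"}\n'
        "## input\n"
        "texts: " + context + "\n"
        "output:"
    )

def generate_qa_pair(context: str) -> dict:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",  # the model named on this slide
        messages=[{"role": "user", "content": build_prompt(context)}],
    )
    # The model is asked for JSON; malformed outputs would raise here.
    return json.loads(response["choices"][0]["message"]["content"])
```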
  9. Variation of Source Contexts from which QA Pairs Are Generated
     • News: frequently viewed news articles on Anews (Stockmark's news service)
     • Wiki: randomly sampled Japanese Wikipedia articles
     • JSQuAD: the training data of JSQuAD, annotated on Wikipedia articles by crowd-workers
     The number of contexts is unified across the three sources when training the QA models (see the sketch below).
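A sketch of that size-unification step, under the assumption that it means downsampling every source to the size of the smallest one; the slide does not specify the mechanism.

```python
import random

def unify_context_counts(sources: dict[str, list[str]],
                         seed: int = 0) -> dict[str, list[str]]:
    """Downsample each context source to the size of the smallest source."""
    random.seed(seed)
    n = min(len(contexts) for contexts in sources.values())
    return {name: random.sample(contexts, n) for name, contexts in sources.items()}
```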
  10. Prompts to generate QA pairs: Zero / one example in the prompt
     • The instructions and an explanation of the output structure are described in order.
     • In the one-shot setting, an actual example is given under "## example".
     (Slide figures: the zero-shot prompt to generate QA pairs from a context, and the "## example" section of the one-shot prompt.)
  11. Prompts to generate QA pairs: Number of generated QA pairs
     • When N=3, the prompt specifies the number of QA pairs to be generated.
     Zero-shot prompt, N = 1:
       Based on the given texts, please make a pair of answerable question and answer. Please make the answer in Japanese polite language. Please respond in the JSON format.
       ## example
       texts: "texts to extract the pair of question and answer"
       output: {"Question": "the question that can be answered from the texts", "Answer": "the answer to the question"}
       ## input
       texts: {QA context}
       output:
     Zero-shot prompt, N = 3:
       Based on the given texts, please make three pairs of answerable questions and answers. Please make the answers in Japanese polite language. Please respond in the JSON format.
       ## example
       texts: "texts to extract the pair of question and answer"
       output: [{"Question": "the first question that can be answered from the texts", "Answer": "the answer to the first question"}, {"Question": "the second question that can be answered from the texts", "Answer": "the answer to the second question"}, {"Question": "the third question that can be answered from the texts", "Answer": "the answer to the third question"}]
       ## input
       texts: {QA context}
       output:
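Since the model is asked to return raw JSON, a validation step is natural before the pairs are used for training. This is a sketch of one possible filter; the exact validation rules are not given in the deck.

```python
import json

def parse_qa_pairs(raw: str) -> list[dict]:
    """Keep only well-formed {"Question": ..., "Answer": ...} objects."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []  # discard outputs that are not valid JSON
    pairs = parsed if isinstance(parsed, list) else [parsed]  # N=1 yields one object
    return [
        p for p in pairs
        if isinstance(p, dict) and p.get("Question") and p.get("Answer")
    ]
```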
  12. Evaluation of the Quality of Generated QA Pairs
     • Evaluate generative context-aware QA models trained on each context source (News / Wikipedia / JSQuAD).
     • Test data: the evaluation set of JSQuAD.
     • For the QA task, a Japanese GPT model (CyberAgent's open-calm-7b) is fine-tuned.
     • Automatic evaluation of QA: BERTScore, BLEU (see the sketch below).
     • Human evaluation: conducted by 4 NLP engineers or researchers; scored by the rate of correct answers.
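A minimal sketch of the automatic evaluation, assuming the bert-score and sacrebleu packages; the deck does not specify the exact scoring configuration, so the Japanese tokenizer choice here is an assumption.

```python
from bert_score import score as bert_score
import sacrebleu

def evaluate(predictions: list[str], references: list[str]) -> dict:
    # BERTScore F1, averaged over the test set
    _, _, f1 = bert_score(predictions, references, lang="ja")
    # Corpus BLEU; "ja-mecab" tokenization is an assumption (requires MeCab)
    bleu = sacrebleu.corpus_bleu(predictions, [references], tokenize="ja-mecab")
    return {"BERTscore": f1.mean().item(), "BLEU": bleu.score}
```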
  13. Variations of the Experimental Settings
     • Contexts from which QA pairs are generated: news, wiki, JSQuAD
       • GPT: the Japanese GPT model without any fine-tuning (baseline)
       • Human: training labels (human-authored QA pairs)
     • Number of QA pairs per context: N = 1 or N = 3
     • Whether to use in-context learning:
       • zero-shot: no actual example of a QA pair in the prompt
       • one-shot: one actual example of a QA pair in the prompt
  14. Results of Automatic Evaluation

     Context  N  Prompt  BERTscore  BLEU
     GPT      -  -       0.601      0.00
     Human    -  -       0.899      5.64
     news     1  zero    0.697      0.02
     wiki     1  zero    0.713      0.03
     JSQuAD   1  zero    0.724      1.55
     news     1  one     0.738      0.11
     wiki     1  one     0.775      0.09
     JSQuAD   1  one     0.863      4.83
     news     3  zero    0.713      0.38
     wiki     3  zero    0.706      0.23
     JSQuAD   3  zero    0.740      1.85
     news     3  one     0.747      1.25
     wiki     3  one     0.838      1.66
     JSQuAD   3  one     0.889      6.77

     • Domain similarity of the source context is important.
     • N=3 outperforms N=1, and one-shot outperforms zero-shot.
     • Automatically generated QA pairs achieve scores comparable to human-generated ones.
  15. Results of Human Evaluation

     QA pairs                 Accuracy (%)
     Human                    38.4
     wiki (N=3, one-shot)     16.6
     JSQuAD (N=3, one-shot)   45.4
     Gold                     90.4

     • Our best-performing model, JSQuAD (N=3, one-shot), outperforms Human (the model fine-tuned on human-authored QA pairs).
     • [Gilardi+] also reports that ChatGPT's annotations outperform those of crowd-workers.
     • All QA models are still far less accurate than the gold (human-written) answers.
  16. Conclusion
     • We proposed using an instruction-tuned model to synthesize QA pairs.
     • Our experimental results demonstrate that models trained on automatically generated QA pairs achieve comparable or even superior performance to the model fine-tuned on human-authored QA pairs.
     • The proposed data expansion, which treats data generation as the inverse problem of the QA task, can be applied to other tasks whose inverse problem is easier to solve; a sketch follows below.
       • Relation extraction: automatically generate documents from which a specified triple can be extracted.
       • e.g. triple: (Poland, area in total, 120733)
       • Generated: Poland, officially the Republic of Poland, is a country in Central Europe. It is divided into 16 administrative regions called provinces and covers an area of 120,733 square miles.
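An illustrative sketch of the inverse-problem idea for relation extraction; the prompt wording is our own, and only the example triple comes from the slide.

```python
def triple_to_prompt(subject: str, relation: str, value: str) -> str:
    """Build a prompt asking an instruct-tuned model to write a passage
    from which the given relation triple can be extracted."""
    return (
        "Write a short encyclopedic paragraph that explicitly states the "
        f"following relation, so that the triple ({subject}, {relation}, "
        f"{value}) can be extracted from it."
    )

print(triple_to_prompt("Poland", "area in total", "120733"))
```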