Supplement: dataset generation with gpt-3.5

• Spent $70 to create a 70k-row dataset intended to improve document retrieval accuracy
  – Kaggle - LLM Science Exam | Kaggle
  – I could not make good use of it myself, so I published it on Kaggle
• The 3rd-place finisher put it to good use. #happy
  – The prompt →
• Generates QA pairs for a document together with the sentence used to create each QA

# `delimiter` is defined elsewhere (not shown on this slide)
system_message = f"""
You will be provided with TEXT from wikipedia. \
The TEXT will be delimited with {delimiter} characters.
Output a python list of 3 dict objects, where each object is \
a multiple choice question whose answers should be in \
the given TEXT and that has 5 choices each.

Each object should have the following format:
'question': <question on the TEXT>
'option_1': <question answer option>
'option_2': <question answer option>
'option_3': <question answer option>
'option_4': <question answer option>
'option_5': <question answer option>
'answer': <answer option key label>
'reference_sentence': <original sentence from the TEXT that supports the answer>

You should tell me which one of your proposed options is right \
by assigning the corresponding option's key label in the 'answer' field.
Also, provide the original sentence \
from the TEXT that supports the answer in the 'reference_sentence' field.

The question, the answer, and question answer options should be broad, \
challenging, long, detailed, and based on the TEXT provided.

Additionally, ensure the token distribution of question follows these statistics:
- Mean: 14.22 tokens
- Std Deviation: 7.223939 tokens
- Min: 4 token
- 25th Percentile: 9 tokens
- Median: 13 tokens
- 75th Percentile: 17.25 tokens
- Max: 49 tokens

Additionally, ensure the token distribution of each answer follows these statistics:
- Mean: 30.840 tokens
- Std Deviation: 19.883692 tokens
- Min: 1 token
- 25th Percentile: 16 tokens
- Median: 27.5 tokens
- 75th Percentile: 43.25 tokens
- Max: 100 tokens

Only output the list of objects, with nothing else.
"""
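The prompt asks the model for a Python list of dicts, so each reply has to be parsed and checked before it can go into a dataset. The following is a minimal sketch of one way to do that, not the pipeline actually used for the published dataset. The names `parse_generated_questions` and `REQUIRED_KEYS` and the sample `text` and `reply` are hypothetical. It assumes the `answer` field holds a key label such as `option_2`, and that a usable `reference_sentence` appears verbatim in the TEXT.

```python
import ast

# Keys the prompt asks for in every generated object.
REQUIRED_KEYS = {
    "question", "option_1", "option_2", "option_3", "option_4", "option_5",
    "answer", "reference_sentence",
}
# Assumption: 'answer' holds one of these key labels.
OPTION_LABELS = ("option_1", "option_2", "option_3", "option_4", "option_5")


def parse_generated_questions(raw_output, source_text):
    """Parse a model reply and keep only well-formed, grounded items."""
    try:
        # literal_eval only accepts Python literals, so it does not execute code.
        items = ast.literal_eval(raw_output.strip())
    except (ValueError, SyntaxError):
        return []  # the reply was not a valid Python literal
    if not isinstance(items, list):
        return []

    kept = []
    for item in items:
        # Each item must be a dict with exactly the requested keys.
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            continue
        if item["answer"] not in OPTION_LABELS:
            continue
        # Drop items whose supporting sentence is not found in the TEXT.
        sentence = item["reference_sentence"]
        if not isinstance(sentence, str) or sentence not in source_text:
            continue
        kept.append(item)
    return kept


# Hypothetical sample data, for illustration only.
text = "The Moon orbits Earth. It is Earth's only natural satellite."
reply = """[
  {'question': 'What does the Moon orbit?',
   'option_1': 'Mars', 'option_2': 'Earth', 'option_3': 'Venus',
   'option_4': 'Jupiter', 'option_5': 'Saturn',
   'answer': 'option_2',
   'reference_sentence': 'The Moon orbits Earth.'},
  {'question': 'How old is the Moon?',
   'option_1': 'A', 'option_2': 'B', 'option_3': 'C',
   'option_4': 'D', 'option_5': 'E',
   'answer': 'option_1',
   'reference_sentence': 'The Moon is very old.'}
]"""

kept = parse_generated_questions(reply, text)
print(len(kept), [item["question"] for item in kept])
```

The second sample item is discarded because its `reference_sentence` does not occur in `text`. Checking the reference sentence this way is one reason to generate it alongside each QA: it gives a cheap filter for questions that are not grounded in the document.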