Slide 33
Platform Technology Division Copyright 2020 Sony Semiconductor Solutions Corporation
Supplement: Dataset generation using gpt-3.5
• Spent $70 to create a 70k-row dataset for improving document retrieval accuracy
– Kaggle - LLM Science Exam | Kaggle
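– I could not make good use of it myself, so I published it on Kaggle
• The 3rd-place finisher put it to good use. #happy
– The prompt follows (system_message)
• It generates the QA for a document together with the sentence used to create that QA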
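One way a reference sentence stored next to each question could help measure retrieval accuracy is sketched below. This is an illustration only, not the method used in the competition: the word-overlap retriever, the helper names (`tokenize`, `top_k`, `recall_at_k`), and the toy corpus are all hypothetical. The idea is that each generated row names the sentence that supports its answer, so a retriever can be scored by how often that sentence appears among its top-k results.

```python
import re

def tokenize(text):
    # Lowercased alphanumeric word set; a stand-in for a real tokenizer.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_k(question, sentences, k):
    # Toy retriever (assumption): rank sentences by word overlap with the question.
    q = tokenize(question)
    ranked = sorted(sentences, key=lambda s: len(q & tokenize(s)), reverse=True)
    return ranked[:k]

def recall_at_k(rows, sentences, k):
    # Fraction of rows whose reference sentence is among the top-k retrieved.
    hits = sum(
        row["reference_sentence"] in top_k(row["question"], sentences, k)
        for row in rows
    )
    return hits / len(rows)

# Made-up toy data, only to exercise the functions.
corpus = [
    "The mitochondrion produces most of the cell's ATP.",
    "Ribosomes assemble proteins from amino acids.",
    "The nucleus stores the cell's genetic material.",
]
rows = [
    {"question": "Which organelle produces most of the ATP in a cell?",
     "reference_sentence": corpus[0]},
    {"question": "Where is the genetic material of the cell stored?",
     "reference_sentence": corpus[2]},
]

print(recall_at_k(rows, corpus, k=1))
```

Swapping `top_k` for a real retriever (BM25, an embedding index, and so on) would leave `recall_at_k` unchanged.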
system_message = f"""
You will be provided with TEXT from wikipedia. \
The TEXT will be delimited with {delimiter} characters.
Output a python list of 3 dict objects, where each object is \
a multiple choice question whose answers should be in \
the given TEXT and that has 5 choices each. Each object should have the following format:
'question':
'option_1':
'option_2':
'option_3':
'option_4':
'option_5':
'answer':
'reference_sentence':
You should tell me which one of your proposed options is right \
by assigning the corresponding option's key label in the 'answer' field. Also, provide the original sentence \
from the TEXT that supports the answer in the 'reference_sentence' field.
The question, the answer, and question answer options should be broad, \
challenging, long, detailed, and based on the TEXT provided.
Additionally, ensure the token distribution of question follows these statistics:
- Mean: 14.22 tokens
- Std Deviation: 7.223939 tokens
- Min: 4 token
- 25th Percentile: 9 tokens
- Median: 13 tokens
- 75th Percentile: 17.25 tokens
- Max: 49 tokens
Additionally, ensure the token distribution of each answer follows these statistics:
- Mean: 30.840 tokens
- Std Deviation: 19.883692 tokens
- Min: 1 token
- 25th Percentile: 16 tokens
- Median: 27.5 tokens
- 75th Percentile: 43.25 tokens
- Max: 100 tokens
Only output the list of objects, with nothing else.
"""
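Because the prompt asks for a Python list of dicts as plain text, the reply has to be parsed and checked before it can be added to a dataset. The following is a minimal sketch of one way to do that, assuming the reply arrives as a string. The function name `parse_questions` and the specific checks are hypothetical, not taken from the original pipeline. `ast.literal_eval` is used instead of `eval` so that model output is never executed as code.

```python
import ast

OPTION_KEYS = ["option_1", "option_2", "option_3", "option_4", "option_5"]
REQUIRED_KEYS = {"question", "answer", "reference_sentence", *OPTION_KEYS}

def parse_questions(raw_output, source_text):
    """Return only the well-formed question dicts found in raw_output."""
    try:
        # Safely evaluate the literal; malformed replies are discarded.
        items = ast.literal_eval(raw_output.strip())
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(items, list):
        return []

    valid = []
    for item in items:
        if not isinstance(item, dict) or not REQUIRED_KEYS <= set(item):
            continue
        # 'answer' must be the key label of one of the five options.
        if item["answer"] not in OPTION_KEYS:
            continue
        # The reference sentence must appear verbatim in the source TEXT.
        ref = item["reference_sentence"]
        if not isinstance(ref, str) or ref not in source_text:
            continue
        valid.append(item)
    return valid

# Made-up example reply, only to exercise the function.
text = "Water boils at 100 degrees Celsius at sea level. Ice melts at 0 degrees Celsius."
good = {
    "question": "At what temperature does water boil at sea level?",
    "option_1": "0 degrees Celsius",
    "option_2": "50 degrees Celsius",
    "option_3": "100 degrees Celsius",
    "option_4": "150 degrees Celsius",
    "option_5": "200 degrees Celsius",
    "answer": "option_3",
    "reference_sentence": "Water boils at 100 degrees Celsius at sea level.",
}
bad = dict(good, reference_sentence="This sentence is not in the text.")
raw = repr([good, bad])

print(len(parse_questions(raw, text)))
```

The verbatim substring check is deliberately strict. A model may paraphrase or re-space the sentence, so a looser match (for example after whitespace normalization) may keep more rows at the cost of weaker grounding.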