Slide 37
Examples of large-scale, diverse data

Conventional task-specific datasets:
- MNIST: 60,000 samples of handwritten digits and their labels
- IMDB Movie Reviews: 50,000 samples of movie review texts and polarity labels

Large-scale, diverse datasets:
- LAION-5B: 5 billion image-caption pairs collected from the Web
- The Pile: 825 GB of text collected from the Web
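At this scale, a corpus like The Pile is usually consumed by streaming rather than by downloading 825 GB in full. A minimal sketch using the Hugging Face datasets library; the mirror name monology/pile-uncopyrighted is an assumption (the original Pile hosting has changed over time), not something named on this slide:

```python
from datasets import load_dataset

# Stream a Pile mirror instead of materializing ~825 GB on disk.
# "monology/pile-uncopyrighted" is an assumed community mirror name.
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

# Peek at the first few documents without downloading the whole corpus.
for i, doc in enumerate(pile):
    print(doc["text"][:80])  # first 80 characters of each document
    if i == 2:
        break
```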
Why is the Pile a good training set? Recent work has shown that especially for large models, diversity in data sources improves general cross-domain knowledge of the model, as well as downstream generalization capability. In our evaluations, not only do models trained on the Pile show moderate improvements in traditional language modeling benchmarks, they also show significant improvements on Pile BPB.

Why is the Pile a good benchmark? To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains including books, github repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text …
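The BPB metric mentioned above is straightforward to compute: take the model's summed negative log-likelihood over a document, convert nats to bits, and divide by the document's UTF-8 byte length. A minimal sketch, assuming the NLL has already been obtained from some language model (the function and the example numbers are illustrative, not from the slide):

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a model's summed negative log-likelihood (in nats)
    over `text` into bits per byte (BPB)."""
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    n_bytes = len(text.encode("utf-8"))        # document length in bytes
    return total_bits / n_bytes

# Example: a 1,000-byte document scored with a total NLL of 600 nats
# gives 600 / ln(2) ≈ 865.6 bits, i.e. ≈ 0.87 BPB (lower is better).
print(bits_per_byte(600.0, "x" * 1000))
```

Because BPB normalizes by raw bytes rather than by tokens, it stays comparable across models with different tokenizers, which is part of what makes it usable as a cross-domain benchmark.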