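• [Meta] Patent, news, and Wikipedia domains account for a large share of the data
• [Meta] Most of the data is from the last 10 years, though data more than 10 years old is also included
• [Included] Machine-generated text and benchmark data contamination
• [Included] (Presence of social biases)
• [Excluded] Minority voices may have been inappropriately filtered out
• Recommendations for releasing datasets
  • Reporting website metadata
  • Examining benchmark contamination
  • Social biases and representational harms (acknowledge social biases and the harms they cause)
  • Excluded voices and identities (acknowledge the voices and identities that were excluded)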
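The benchmark contamination check mentioned above can be illustrated with a small sketch. It flags a benchmark example as contaminated when it shares at least one word n-gram with any corpus document. This is only one common heuristic and is not necessarily the procedure used in the study summarized here. The function names, the tokenization rule, and the default `n=8` are all assumptions chosen for illustration.

```python
import re
from typing import Iterable, List, Set, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    benchmark_examples: List[str],
    corpus_docs: Iterable[str],
    n: int = 8,  # hypothetical default; tune for the benchmark at hand
) -> Tuple[float, List[str]]:
    """Return the fraction of benchmark examples sharing an n-gram with the corpus,
    together with the flagged examples themselves."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(normalize(doc), n)

    flagged = [
        ex for ex in benchmark_examples
        if ngrams(normalize(ex), n) & corpus_ngrams
    ]
    rate = len(flagged) / len(benchmark_examples) if benchmark_examples else 0.0
    return rate, flagged


corpus = ["The quick brown fox jumps over the lazy dog near the river bank today."]
bench = [
    "the quick brown fox jumps over the lazy dog",
    "an entirely different sentence with no overlap whatsoever here ok",
]
rate, flagged = contamination_rate(bench, corpus)
```

Holding every corpus n-gram in memory only works for small samples. For a web-scale corpus, a hashed or streaming index would be needed instead.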
Neural Toxic Degeneration in Language Models [Gehman+EMNLP2020]
  • Investigates toxic content and fake news contained in OpenWebText [Gokaslan+2019]
• What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus [Luccioni+ACL2021]
  • Investigates undesirable content in Common Crawl and points out the presence of hate speech and adult content
• Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets [Kreutzer+TACL2022]
  • For five multilingual corpora built by web crawling [El-Kishky+2020; Xue+2021; Ortiz Suárez+2020; Bañón+2020; Schwenk+2019], manually judges the appropriateness of the language labels and the quality of the text

• Systematic analysis of pretraining corpora is generally difficult, so detailed documentation is also missing [Bender+2021; Paullada+2020]
• The effect a pretraining corpus has on a system cannot be known
• Bias may be introduced into downstream tasks [Li+2020; Gehman+2020; Groenwold+2020]