SNLP2022: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Reina Akama
September 20, 2022

Transcript

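  1. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

    Jesse Dodge et al., EMNLP 2021
    Presenter: Reina Akama (Tohoku University)
    14th SNLP (Saisentan NLP Benkyokai, a study group on cutting-edge NLP) @ Ochanomizu University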
  2. Large language models and pretraining corpora (14th SNLP | 2022-09-27)

    • BERT [Devlin+2019] → RoBERTa [Liu+2019]
      • BookCorpus [Zhu+2015], English-language Wikipedia
      • → CommonCrawl News dataset [Nagel 2016], OpenWebtext [Gokaslan+2019], Stories [Trinh+2018]
    • GPT-3 [Brown+2020]
      • CommonCrawl (60%), WebText2 (22%) [Kaplan+2020], Books1 (8%) and Books2 (8%) [Brown+2020], English-language Wikipedia (3%)
    • T5 [Raffel+2019]
      • Colossal Clean Crawled Corpus (C4; filtered CommonCrawl) [Raffel+2019]
    • Switch Transformer [Fedus+2021]
      • Colossal Clean Crawled Corpus (C4; filtered CommonCrawl) [Raffel+2019]
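  4. English Colossal Clean Crawled Corpus (C4)

    • A large-scale English text dataset, "reasonably clean and natural", obtained by filtering Common Crawl (2019-04) [Raffel+ JMLR2020]
      • Keep only documents with at least 5 sentences and sentences with at least 3 words
      • Keep only lines that end in punctuation (period, exclamation mark, question mark, quotation mark, etc.)
      • Keep only documents that langdetect (https://pypi.org/project/langdetect/) classifies as English with probability 0.99 or higher
      • Remove documents containing any word from the “List of Dirty, Naughty, Obscene, or Otherwise Bad Words” (https://git.io/vSyEu)
      • Remove lines containing the word `Javascript` and pages containing a curly brace `{`
      • Where a span of three sentences is duplicated, keep one copy and delete the rest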
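The line-level and page-level heuristics on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline of Raffel et al.: the function name `clean_page`, the crude sentence splitter, and applying the three-word minimum per line are all hypothetical choices, and the langdetect check and the three-sentence deduplication are left out.

```python
import re

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 3       # assumption: the 3-word minimum is applied per line
MIN_SENTENCES_PER_DOC = 5


def clean_page(text, blocklist):
    """Return the filtered page text, or None if the whole page is dropped."""
    # Page-level rule: drop any page that contains a curly brace.
    if "{" in text:
        return None

    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        # Line-level rules: terminal punctuation, minimum length, no "Javascript".
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        if "javascript" in line.lower():
            continue
        kept_lines.append(line)
    cleaned = "\n".join(kept_lines)

    # Page-level rule: drop pages containing any blocklisted word.
    words = set(re.findall(r"[a-z']+", cleaned.lower()))
    if words & blocklist:
        return None

    # Page-level rule: require at least five sentences (crude split on . ! ?).
    sentences = [s for s in re.split(r"[.!?]+", cleaned) if s.strip()]
    if len(sentences) < MIN_SENTENCES_PER_DOC:
        return None
    return cleaned
```

Note that a single blocklisted word discards the entire page, which is the behaviour the later "Excluded data" slides examine.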
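  5. The paper's claims and takeaways

    • Not knowing what is in a pretraining corpus is a problem
      • We cannot know how the pretraining corpus affects the system
      • Bias may also be injected into downstream tasks
    • Why don't we know? The corpora are too large for systematic analysis
      • Analyze the corpus thoroughly once and document it (this paper)
      • An environment that makes post-hoc analysis easy is also important
    • Contributions of the paper
      • Quantitative analysis and detailed documentation of the pretraining corpus C4
      • Methodology for quantification and for analysis
      • Recommendations on releasing datasets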
  6. Metadata

    ① What kind of data is it? Provenance, type, time period, region…
    ② What is included? Machine-generated data, benchmark data…
    ③ What is not included? Topics, identities…
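  7. Metadata − Internet domains

    • Method
      • Link documents to metadata (dates, location information, etc.) by consulting the Internet Archive
      • Count to capture trends
    • Findings
      • .com, .org, and .net are the most common
      • Even though the corpus collects English text, domains of countries and regions whose primary language is not English also appear in the top 25
      • Although outside the top 25, the US government's military domain .mil (33M tokens) and the UK military / Ministry of Defence domain .mod.uk (1M tokens) also contribute many tokens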
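The "count to capture trends" step for domains might look like the following sketch. It assumes the corpus is available as `(url, text)` pairs and counts whitespace-separated tokens; both the function name `tokens_per_suffix` and that tokenization are assumptions for illustration, not the paper's procedure. Taking only the last host label also means a multi-label suffix such as .mod.uk would be reported as .uk.

```python
from collections import Counter
from urllib.parse import urlsplit


def tokens_per_suffix(docs):
    """Sum whitespace-token counts per top-level domain.

    `docs` is an iterable of (url, text) pairs.
    """
    counts = Counter()
    for url, text in docs:
        host = urlsplit(url).hostname or ""
        # Use the last dot-separated label of the host as the top-level domain.
        suffix = "." + host.rsplit(".", 1)[-1] if "." in host else host
        counts[suffix] += len(text.split())
    return counts
```

Calling `tokens_per_suffix(docs).most_common(25)` then gives a top-25 ranking of the kind the slide refers to.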
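  8. Metadata − Websites

    • Method
      • Link documents to metadata (dates, location information, etc.) by consulting the Internet Archive
      • Count to capture trends
    • Findings
      • Patent documents are included in large quantities
        • patents.google.com (rank 1), patents.com (rank 10)
      • Data often used for language model pretraining, such as Wikipedia and news, is also heavily represented
        • NYTimes, LATimes, AlJazeera
      • Books and similar publications are also common
        • Plos, FrontiersIn, Springer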
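  9. Metadata − Utterance Date

    • Method
      • Link documents to metadata (dates, location information, etc.) by consulting the Internet Archive
      • Count to capture trends
    • Findings
      • 92% of C4 is estimated to have been written in the last decade (2011–2019)
      • The distribution is long-tailed: a non-trivial amount of data was written 10–20 years before the data was collected
    [Figure: distribution of first-indexed dates for 1,000,000 URLs randomly sampled from C4]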
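Once a first-indexed date has been looked up for each sampled URL, the share written within a given range is a simple count. The sketch below assumes those dates are already available as `datetime.date` objects; the function name and the sample dates in the usage note are hypothetical.

```python
from collections import Counter
from datetime import date


def share_written_between(first_seen, start_year, end_year):
    """Fraction of dates whose year lies in [start_year, end_year]."""
    years = Counter(d.year for d in first_seen)
    in_range = sum(n for year, n in years.items() if start_year <= year <= end_year)
    total = sum(years.values())
    return in_range / total if total else 0.0
```

For example, `share_written_between([date(2018, 5, 1), date(2003, 7, 7)], 2011, 2019)` counts one of the two made-up dates as falling inside the range.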
  10. Included data

    ① What kind of data is it? Provenance, type, time period, region…
    ② What is included? Machine-generated data, benchmark data…
    ③ What is not included? Topics, identities…
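  11. Included data − Machine-generated text

    • Method
      • Use the linked metadata
      • Treat patent documents as an upper bound on machine-generated documents (machine translation into English)
        - Each country specifies the language in which patents must be written
        - patents.google.com machine-translates patents from each country's patent office into English
    • Findings
      • Most patent documents come from the US patent office, but more than 10% come from patent offices that require patents to be written in a language other than English. These documents were very likely generated by machine.
      • Machine-generated documents on the web are expected to increase over time
    [Figure label: patents]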
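  12. Included data − Benchmark data contamination

    Are the target tokens of natural language generation tasks included?
    • Method
      • Measure exact matches between the pretraining data and benchmark examples, after normalizing capitalization and punctuation
      • n-gram overlap (8 <= n <= 13) [Brown+2020]
    • Findings
      • Matches of roughly 2% to 25% were observed
      • The match rate is higher for datasets with single-sentence targets (Xsum, TIFU-short, AMR-to-text) than for datasets with multi-sentence targets (TIFU-long, WikiBio)
      • Zero-shot evaluation with LAMA requires caution (the examples may already have been seen in training)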
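The exact-match measurement can be illustrated as follows. This is a minimal sketch, assuming both the benchmark examples and the corpus fit in memory as lists of strings; the names `normalize` and `exact_match_rate` are hypothetical, and a real run over a corpus of C4's size would need an index rather than a linear scan.

```python
import string

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCTUATION).split())


def exact_match_rate(benchmark_examples, corpus_docs):
    """Fraction of benchmark examples that occur verbatim in some document."""
    # Pad with spaces so a match has to start and end on word boundaries.
    padded_docs = [f" {normalize(doc)} " for doc in corpus_docs]
    hits = 0
    for example in benchmark_examples:
        needle = normalize(example)
        if needle and any(f" {needle} " in doc for doc in padded_docs):
            hits += 1
    return hits / len(benchmark_examples) if benchmark_examples else 0.0
```

Because a whole example must match, longer multi-sentence targets are less likely to be found than single-sentence ones, which is consistent with the difference in match rates reported on the slide.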
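  13. Included data − Benchmark data contamination

    Are tokens that match GLUE [Wang+2019] included?
    • Method
      • Measure exact matches between the pretraining data and benchmark examples, after normalizing capitalization and punctuation
      • n-gram overlap (8 <= n <= 13) [Brown+2020]
    • Findings
      • Matches of roughly 2% to 50% were observed
      • QNLI, which has the highest match rate, was built from Wikipedia
      • Few-shot evaluation with GLUE requires caution (the examples may already have been seen in training)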
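  15. Why do they match?

    • Hypothesis 1: the benchmark was built from text on the web, so it matches because the source is the same
      • QNLI was built from Wikipedia data
    • Hypothesis 2: data uploaded to the web after the benchmark was created got crawled
      • Several researchers publish GLUE in GitHub repositories
      • Checked whether phrases from the repositories that publish GLUE appear in CommonCrawl, then manually checked whether those pages are included in the data → none found ✘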
  16. Excluded data

    ① What kind of data is it? Provenance, type, time period, region…
    ② What is included? Machine-generated data, benchmark data…
    ③ What is not included? Topics, identities…
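  17. Filtering with the blocklist

    • A large-scale English text dataset, "reasonably clean and natural", obtained by filtering Common Crawl (2019-04)
      • Keep only documents with at least 5 sentences and sentences with at least 3 words
      • Keep only lines that end in punctuation (period, exclamation mark, question mark, quotation mark, etc.)
      • Keep only documents that langdetect (https://pypi.org/project/langdetect/) classifies as English with probability 0.99 or higher
      • Remove documents containing any word from the “List of Dirty, Naughty, Obscene, or Otherwise Bad Words” (https://git.io/vSyEu)
      • Remove lines containing the word `Javascript` and pages containing a curly brace `{`
      • Where a span of three sentences is duplicated, keep one copy and delete the rest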
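  18. Excluded data − Topic

    • Method
      • Randomly sample 100,000 documents from those removed by the blocklist
      • Cluster TF-IDF embeddings into 50 clusters with k-means
      • Visualize with PCA
    • Findings
      • Many clusters are pornography (31%) or hate-speech related
      • Harmless clusters such as science, medicine, health, games, and early childhood education were also observed
    [Figure label: medicine]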
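  19. Excluded data − Topic

    • Method
      • Randomly sample 100,000 documents from those removed by the blocklist
      • Cluster TF-IDF embeddings into 50 clusters with k-means
      • Visualize with PCA
    • Findings
      • Many clusters are pornography (31%) or hate-speech related
      • Harmless clusters such as science, medicine, health, games, and early childhood education were also observed
    [Figure labels: early childhood education; school and business]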
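The TF-IDF and k-means steps of this topic analysis can be sketched with the standard library alone. The version below is a toy under stated assumptions: whitespace tokenization, centroids seeded with the first k documents, and a fixed number of iterations are illustrative choices, not the paper's settings, and the PCA visualization is omitted. In practice one would more likely reach for scikit-learn's `TfidfVectorizer` and `KMeans`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vec = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors seed the centroids (an assumption)."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(vec, centroids[c]))
            for vec in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if not members:
                continue
            totals = Counter()
            for member in members:
                totals.update(member)
            centroids[c] = {t: w / len(members) for t, w in totals.items()}
    return labels
```

With the slide's setup this would be called as `kmeans(tfidf_vectors(sampled_docs), k=50)`, where `sampled_docs` stands for the 100,000 sampled documents.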
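  21. Excluded data − Identity

    • Method
      • Extract documents that mention an identity using regular expressions
      • For these documents, measure the PMI with documents removed by the blocklist
    • Findings
      • Documents discussing LGBTQ+ identities may be removed disproportionately
        • A more serious risk than for race and ethnicity
        • 50 randomly selected documents were evaluated by hand
        • Of these, non-offensive: 22%, non-sexual: 36%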
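The PMI measurement can be written out as PMI(identity; removed) = log2( p(identity, removed) / (p(identity) · p(removed)) ), with each probability estimated by counting documents. The sketch below assumes documents arrive as `(text, was_removed)` pairs and that a single regular expression stands in for an identity mention; the function name and the patterns in the usage note are hypothetical.

```python
import math
import re


def pmi_with_removal(docs, pattern):
    """PMI (base 2) between 'text matches pattern' and 'document was removed'.

    `docs` is a list of (text, was_removed) pairs.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    n = len(docs)
    n_removed = sum(1 for _, removed in docs if removed)
    n_match = sum(1 for text, _ in docs if regex.search(text))
    n_both = sum(1 for text, removed in docs if removed and regex.search(text))
    if n == 0 or n_removed == 0 or n_match == 0:
        return float("nan")   # undefined when either event never occurs
    if n_both == 0:
        return float("-inf")  # the two events never co-occur
    return math.log2((n_both / n) / ((n_match / n) * (n_removed / n)))
```

A positive value for a pattern such as `r"\blesbian\b"` means documents mentioning that identity are removed more often than documents overall; a value near zero means the mention and the removal are roughly independent.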
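  22. Summary

    • Quantitative analysis of the pretraining corpus C4 and the resulting findings
      • [Meta] Patent, news, and Wikipedia domains are heavily represented
      • [Meta] Most of the data is from the last 10 years, but data more than 10 years old is also included
      • [Included] Machine-generated text and benchmark data contamination
      • [Included] (Presence of social biases)
      • [Excluded] Minority voices may have been removed inappropriately
    • Recommendations on releasing datasets
      • Reporting website metadata
      • Examining benchmark contamination
      • Social biases and representational harms (recognize social biases and the harms they cause)
      • Excluded voices and identities (recognize which voices and identities were excluded)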
  23. Bibliographic information

    • A follow-up by the original authors? → No
      • C4 released (JMLR 2020; 2020/06)
      • C4 investigated (EMNLP 2021; 2021/11)
  24. Post-hoc studies of existing corpora

    • RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [Gehman+EMNLP2020]
      • Investigates toxic content and fake news contained in OpenWebText [Gokaslan+2019]
    • What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus [Luccioni+ACL2021]
      • Investigates undesirable content in Common Crawl and points out the presence of hate speech and adult content
    • Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets [Kreutzer+TACL2022]
      • Manually judges the correctness of language labels and the text quality of five multilingual corpora built by web crawling [El-Kishky+2020; Xue+2021; Ortiz Suárez+ 2020; Bañón+2020; Schwenk+2019]

    Systematic analysis of pretraining corpora is generally difficult, so detailed documentation is also missing [Bender+2021; Paullada+2020]
    → We cannot know how the pretraining corpus affects the system, and bias may be injected into downstream tasks [Li+2020; Gehman+2020; Groenwold+2020]
    → Post-hoc studies are useful here
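  25. Impressions

    • An important perspective that also contributes to tackling the black-box problem
      • Analysis of models (e.g., interpretation, probing)
      • Analysis of data
    • Why were these analysis perspectives chosen?
      • An exhaustive investigation is not realistic; which perspectives should be covered first, and on what grounds?
      • Authors: "We recognize that we have investigated only some of the problems that can arise with a dataset of this size"
      • Authors: "Personally identifiable information and copyrighted text are probably present as well; quantifying and removing them is left to future work"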