

Building an Effective Pre-training Corpus for Japanese LLM (TAI AAI #3)

“Swallow” is a Large Language Model (LLM) developed by research teams at Tokyo Institute of Technology and the National Institute of Advanced Industrial Science and Technology (AIST). We are working on building better Japanese LLMs by continually pre-training English LLMs on Japanese text. So far, we have successfully enhanced Japanese performance by continual pre-training from models such as Llama 2, Mistral, Mixtral, and Llama 3. The use of a high-quality pre-training corpus has been a crucial factor in this achievement. In this presentation, I mainly introduce the construction procedure of the “Swallow Corpus”, our proprietary large Japanese web corpus, and how it is combined with other corpora. Time permitting, I also discuss future directions in light of Swallow’s current challenges and recent research trends.

These slides were prepared for a presentation at Tokyo AI Advanced AI (TAI AAI) #3 on August 7, 2024.

Our team has also published a paper, which was accepted to COLM 2024.
These slides are mostly based on that paper.
https://arxiv.org/abs/2404.17733

Kakeru Hattori

August 06, 2024



Transcript

  1. Tokyo AI Advanced AI (TAI AAI) #3 Kakeru Hattori Okazaki

    Laboratory M2 Tokyo Institute of Technology August 7 (Wed), 2024 [email protected] Building an Effective Pre-training Corpus for Japanese LLM
  2. Self Introduction 服部 翔 Kakeru Hattori Tokyo Institute of Technology

    School of Computing, Computer Science, Okazaki Laboratory M2 March 2023: “Integrated Generation of Queries and Summaries in Query-focused Summarization” at the 29th Annual Meeting of the Association for Natural Language Processing (NLP2023) (Hitachi, Ltd. Award) July 2023〜 Joined the “Swallow” project (corpus building) Others SWE internship at Nikkei (SRE team / 1.5 years) 2 @ayase_lab Kakeru Hattori
  3. Agenda I am working on an LLM building project called

    “Swallow”, focusing on improving the pre-training corpus Today’s Topics 1. Overview of the “Swallow” project 2. Building a large Japanese web corpus (Swallow Corpus) 3. Llama 3 Swallow (improved corpus recipes, remaining issues) 3
  4. Swallow Project 5 LLM research and development project at Okazaki

    Lab and Yokota Lab https://swallow-llm.github.io/index.en.html
  5. Project Members 6 • Roles: training, corpus, evaluation and instruction

    tuning • Most are students at Tokyo Tech (mostly Bachelor’s and Master’s)
  6. Project Features 7 • Continual pre-training from English open source

    models ◦ Llama 2 (7B, 70B), Mistral (7B), Mixtral (8x7B), Llama 3 (8B, 70B) • Making a strong effort in evaluating LLMs ◦ Dashboard to visualize evaluation results of various Japanese LLMs • Building a large Japanese web corpus (Swallow Corpus)
  7. Overview of Project History 8 2023.12.19 Swallow (on Llama 2)

    ➢ Using a large Japanese web corpus (Swallow Corpus) -> Today’s main topic 2024.03.11 Swallow-MS 7B / Swallow-MX 8x7B (from Mistral, Mixtral) 2024.04.26 Swallow-*-instruct-v0.1 (improved instruct model) 2024.07.01 Llama 3 Swallow ➢ Swallow Corpus + use of existing corpora -> Today’s sub-topic
  8. Building Japanese LLMs with Continual Pre-training 10 • Teach the

    language, knowledge, and culture of Japan to English LLMs • High-quality Japanese text is needed -> Swallow Corpus (Diagram: Base LLM, e.g. Llama 3, + Japanese corpus as training data → Enhanced LLM)
  9. Swallow Corpus 11 • Extract and refine Japanese texts from

    Common Crawl • Japanese text (≈5%) is further filtered to build a high-quality corpus (Diagram: Common Crawl → ONLY Japanese (≈5%) → ONLY high-quality → Swallow Corpus; only 0.27% of the original remains)
  10. Procedure for Building Swallow Corpus 12

     ❶ Swallow-RAW (remove non-Japanese texts): Step 1 Downloading WARC files → Step 2 Rapid Japanese detection → Step 3 Text extraction → Step 4 Precise Japanese detection ❷ Swallow-CLEAN (remove low-quality Japanese texts): Step 5 Quality filtering → Step 6 Deduplication → Step 7 Filtering by hostnames ❸ Swallow-NORM (normalization): Step 8 Normalizing punctuation → Step 9 Removing footers
  11. Procedure for Building Swallow Corpus 13

     (Pipeline diagram repeated, highlighting ❶ Swallow-RAW: text extraction and Japanese detection, covered next.)
  12. WARC Format Requires Text Extraction, but Can Achieve High Quality 14

     There are two file formats available in Common Crawl: WET format (already-extracted text; no extraction needed; low text quality; used by CC-100, mC4, OSCAR) and WARC format (raw HTML; extraction needed; high text quality; used by the Swallow Corpus) ➔ Higher-quality extraction than existing Japanese corpora built from WET
  13. Which Comes First, Text Extraction or Japanese Detection? 15 •

    Text extraction from HTML is time-consuming • Japanese is ONLY 5%, so we want to reduce the candidates first • However, a Japanese language classifier requires text extraction ◦ HTML tags include English, not “pure” Japanese ➔ Is it possible to do “rough” Japanese detection on raw HTML? 🤔
  14. Step 2 Rapid Japanese Detection before Text Extraction 16 If

    either rule is met, the page is considered Japanese: 1. The HTML “lang” attribute is “ja” 2. The HTML “title” element text is in Japanese (using a classifier) Precision: 0.89 / Recall: 0.97 / F1: 0.926 ➔ The benefit of a 1/14 time reduction outweighs the 3% loss (see the sketch below)
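    A minimal Python sketch of these two rules, for illustration only: the regexes and the kana-count heuristic standing in for the title classifier are assumptions, not the actual Swallow implementation.

      import re

      # Rule 1: the <html> tag declares lang="ja" (attribute may be quoted or not).
      LANG_JA = re.compile(r'<html[^>]*\blang=["\']?ja\b', re.IGNORECASE)
      # Rule 2: the <title> text looks Japanese; a kana count stands in for the classifier here.
      TITLE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)
      KANA = re.compile(r'[\u3040-\u309F\u30A0-\u30FF]')  # hiragana + katakana

      def rapid_japanese_detection(html: str) -> bool:
          """Keep the page if either rule fires, so costly text extraction runs only on candidates."""
          if LANG_JA.search(html):
              return True
          m = TITLE.search(html)
          return bool(m) and len(KANA.findall(m.group(1))) >= 2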
  15. Example of Text Extraction with Trafilatura 17 • Use the Trafilatura

    library for text extraction from HTML • e.g. extracting the Okazaki Lab website → Score: 336.33 (positive) (usage sketch below)
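    A small usage sketch with the Trafilatura library; the URL is only an example.

      import trafilatura

      # Download a page and extract its main text; extract() returns None when it fails.
      downloaded = trafilatura.fetch_url("https://www.nlp.c.titech.ac.jp/")  # e.g. the lab website
      text = trafilatura.extract(downloaded)
      if text:
          print(text[:200])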
  16. Step 4 More Precise Japanese Detection by Classifier 18 •

    Japanese can be successfully detected from characters alone • We trained an SVM on Wikipedia using character n-grams as features ◦ Precision: 0.998 / Recall: 0.993 / F1: 0.996 • We later found that FastText outperforms our classifier overall ◦ A well-tuned FastText model may be the most cost-effective choice (see the sketch below)
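    A toy sketch of a character n-gram SVM language classifier with scikit-learn; the tiny training set stands in for Wikipedia sentences, and the feature settings are assumptions rather than the exact Swallow configuration.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import LinearSVC

      texts = ["これは日本語の文です。", "吾輩は猫である。",
               "This is an English sentence.", "The weather is nice today."]
      labels = ["ja", "ja", "en", "en"]

      clf = make_pipeline(
          TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),  # character n-gram features
          LinearSVC(),
      )
      clf.fit(texts, labels)
      print(clf.predict(["猫が好きです。"]))  # expected: ['ja']

    As noted on the slide, a well-tuned off-the-shelf FastText language identifier may be the more cost-effective choice in practice.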
  17. Procedure for Building Swallow Corpus 19

     (Pipeline diagram repeated, highlighting ❷ Swallow-CLEAN: quality filtering and deduplication, covered next.)
  18. Step 5 Quality Filtering with Heuristic Rules 20 • Documents

    with duplicate expressions are detected based on n-grams ◦ Prevents the LLM from generating the same phrase repeatedly ◦ e.g. Number of occurrences of the most frequent 2-gram / Number of all 2-grams > 0.2 (see the sketch below) • Apply custom rules for Japanese language quality ◦ e.g. Hiragana fraction < 0.2, Katakana fraction > 0.5 * A list of all filtering rules is available in Appendix 1
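    A sketch of the example rule above (share of all 2-grams taken up by the most frequent 2-gram); whether the actual rules count character or word n-grams is not stated on this slide, so character 2-grams are assumed here.

      from collections import Counter

      def most_frequent_ngram_ratio(text: str, n: int = 2) -> float:
          """Share of all n-grams taken up by the single most frequent n-gram."""
          ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
          if not ngrams:
              return 0.0
          return Counter(ngrams).most_common(1)[0][1] / len(ngrams)

      def has_duplicate_expressions(text: str) -> bool:
          return most_frequent_ngram_ratio(text, n=2) > 0.2  # threshold from the slide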
  19. Step 6 Deduplication Prevents “Rote Memorization” of the Corpus 21

    • Common Crawl crawls the same websites at different times ➔ Remove duplicates to prevent LLMs from “memorizing” the documents • Deduplication is executed by detecting near-duplicate documents with MinHash ◦ All documents are converted to character 5-gram feature sets * The details of the MinHash duplicate-detection algorithm are available in Appendix 2
  20. Step 6 After Deduplication, only about 20% of the Older

    Crawls Remain 22 Older documents are reduced considerably since the newer ones are kept; before 2022, only about 20% are left
  21. Step 7 Filtering Based on NG Words and NG URLs

    23 • In addition to Step 5, remove harmful content based on NG words and URLs ◦ Hostname is included in the UT1 blocklist ◦ Percentage of pages containing the name of a dating site (0.001 <) ◦ Percentage of pages containing NG expressions (0.005 <) ◦ *wikipedia.org (the Wikipedia dump is added separately) ◦ *.5ch.net (probably a low-quality forum) (see the sketch below)
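    A rough sketch of hostname-level filtering in the spirit of the rules above; the blocklist contents and NG words are placeholders (only the NG-expression fraction rule is shown), not the actual Swallow lists.

      UT1_BLOCKED_HOSTS = {"bad.example.com"}   # placeholder for hosts in the UT1 blocklist
      NG_WORDS = ["出会い系"]                    # placeholder for NG expressions

      def keep_hostname(host: str, pages: list) -> bool:
          """Drop blocklisted hosts, Wikipedia (added separately from its dump), 5ch,
          and hosts with too many pages containing NG expressions."""
          if host in UT1_BLOCKED_HOSTS:
              return False
          if host.endswith("wikipedia.org") or host.endswith(".5ch.net"):
              return False
          ng_fraction = sum(any(w in p for w in NG_WORDS) for p in pages) / max(len(pages), 1)
          return ng_fraction <= 0.005  # threshold from the slide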
  22. Procedure for Building Swallow Corpus 24

     (Pipeline diagram repeated, highlighting ❸ Swallow-NORM: normalization, covered next.)
  23. Japanese-specific Normalization & Footer Removal 25 • NFKC Normalization ◦

    Normalize full- and half-width alphabets, kana, and symbols • Consider the Japanese-specific use of punctuation ◦ If「,」occurs more often than「、」, unify to「、」 ◦ If「.」occurs more often than「。」, unify to「。」 • Remove typical footer expressions ◦ e.g.「無断転載を禁ず」(“Unauthorized reproduction prohibited”),「この記事へのトラックバック一覧」(“List of trackbacks for this article”) (see the sketch below)
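    A sketch of the normalization above; the footer patterns are just the two examples from the slide, and the blanket punctuation replacement is a simplification of whatever the real pipeline does.

      import unicodedata

      FOOTER_PATTERNS = ["無断転載を禁ず", "この記事へのトラックバック一覧"]

      def normalize_document(text: str) -> str:
          text = unicodedata.normalize("NFKC", text)        # unify full-/half-width forms
          if text.count(",") > text.count("、"):
              text = text.replace(",", "、")
          if text.count(".") > text.count("。"):
              text = text.replace(".", "。")
          lines = [line for line in text.splitlines()
                   if not any(p in line for p in FOOTER_PATTERNS)]  # drop typical footer lines
          return "\n".join(lines)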
  24. Differences from Previous Work (RefinedWeb) 26

     RefinedWeb (https://arxiv.org/abs/2306.01116) pipeline: Step 1 Hostname-based filtering (the UT1 blocklist is available) → Step 2 Text extraction → Step 3 English detection → Step 4 Quality filtering → Step 5 Line-wise correction → Step 6 Deduplication → Step 7 Precise deduplication. Swallow pipeline: Step 1 Downloading WARC files → Step 2 Rapid Japanese detection (economical processing) → Step 3 Text extraction → Step 4 Precise Japanese detection → Step 5 Quality filtering → Step 6 Deduplication → Step 7 Filtering by hostnames → Step 8 Normalizing punctuation → Step 9 Removing footers (Japanese-specific processing)
  25. Swallow Corpus has Boosted LLM Performance 27 • Continual pre-training

    for Llama 2 (13B) with several Japanese corpora ◦ Of the roughly 104.9 billion tokens of training data, 90% was Japanese ◦ Of the Japanese portion, about 1.6 billion tokens was Wikipedia, and the rest was one of the Japanese corpora being compared
  26. Llama 3 Swallow 29 • Llama 3 Swallow was released

    in July 2024! ◦ Overall Japanese performance is comparable to Qwen 2 ◦ Excellent in Japanese knowledge (e.g. QA, common sense, translation) (Charts: question answering and translation scores)
  27. Improved Performance by Incorporating Existing Corpora 30 • Cosmopedia: synthesized

    “textbook-like” text (English) ◦ Contributes to arithmetic reasoning and code generation • Laboro ParaCorpus: English-Japanese parallel corpus ◦ Contributes to translation (especially Ja-En)
  28. Some Issues Remain Unresolved 31 • Issues: code generation, arithmetic

    reasoning, general education ◦ Qwen 2 is significantly better than Llama 3 Swallow in these domains ◦ In particular, code generation is lower than that of the base model (Llama 3) (Charts: code generation, arithmetic reasoning, and general education scores)
  29. What are Next Actions in Pre-training Corpus? 32 Pre-training corpus

    must be a core factor in enhancing Japanese LLMs 1. Further data collection ◦ Web-derived texts may be near exhaustion ◦ Potentially a lot of non-Web text, but often with copyright issues ◦ e.g. PDFs, closed documents, OCR ◦ Synthetic data ◦ e.g. Cosmopedia 2. Make effective use of data already collected ◦ Non-heuristic filtering
  30. The Full List of Filtering Rules (1) 34 Delete documents

    that contain many duplicate expressions The rules of related work (MassiveWeb) are adopted as-is • Number of lines duplicated in other lines / Number of all lines (0.30) • Number of paragraphs overlapping other paragraphs / Total number of paragraphs (0.30) • Number of characters in lines overlapping other lines / Number of all characters (0.20) • Number of characters in paragraphs that overlap with other paragraphs / Total number of characters (0.20) • Number of occurrences of most frequent 2-gram / Number of occurrences of all 2-grams (0.20) • Number of occurrences of most frequent 3-gram / Number of occurrences of all 3-grams (0.18) • Number of occurrences of most frequent 4-gram / Number of occurrences of all 4-grams (0.16) • Total number of 5-grams occurring 2 or more times / Total number of 5-grams (0.15) • Total number of 6-grams occurring 2 or more times / Total number of 6-grams (0.14) • Total number of 7-grams occurring 2 or more times / Total number of 7-grams (0.13) • Total number of 8-grams occurring 2 or more times / Total number of 8-grams (0.12) • Total number of 9-grams occurring 2 or more times / Total number of 9-grams (0.11) • Total number of 10-grams occurring 2 or more times / Total number of 10-grams (0.10)
  31. The Full List of Filtering Rules (2) 35 Deletion of

    documents containing low-quality Japanese • Number of letters (< 400) • Fraction of hiragana letters (< 0.2) • Fraction of katakana letters (0.5 <) • Fraction of Japanese letters (hiragana, katakana, kanji, punctuation) (< 0.5) • Average number of letters in a sentence in a document (< 20, 90 <) • Number of letters in the longest sentence (200 <) • Fraction of sentences ending with an ellipsis (0.2 <)
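    A sketch of a few of the rules above; the thresholds follow the slide, but the character ranges and the sentence splitting are simplified assumptions.

      import re

      HIRAGANA = re.compile(r"[\u3040-\u309F]")
      KATAKANA = re.compile(r"[\u30A0-\u30FF]")
      JAPANESE = re.compile(r"[\u3040-\u30FF\u4E00-\u9FFF、。]")  # kana, kanji, punctuation

      def passes_japanese_quality_rules(text: str) -> bool:
          n = len(text)
          if n < 400:                                        # number of letters
              return False
          if len(HIRAGANA.findall(text)) / n < 0.2:          # hiragana fraction
              return False
          if len(KATAKANA.findall(text)) / n > 0.5:          # katakana fraction
              return False
          if len(JAPANESE.findall(text)) / n < 0.5:          # Japanese letters fraction
              return False
          sentences = [s for s in text.split("。") if s]
          avg = sum(len(s) for s in sentences) / max(len(sentences), 1)
          return 20 <= avg <= 90                             # average sentence length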
  32. Jaccard Coefficients can be Approximated with MinHash 37

     The Jaccard coefficient measures the similarity of two documents (0~1): J(A, B) = |A ∩ B| / |A ∪ B|, i.e. the number of common features divided by the number of total features. Computing it exactly for all document pairs is hugely expensive, so it is approximated with MinHash, the minimum hash value over all of a document’s features under a given hash function. The Jaccard coefficient can then be estimated from MinHash values (see the sketch below).
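    A toy sketch of the idea: the exact Jaccard coefficient of two character 5-gram sets versus a MinHash estimate. Salting Python’s built-in hash is only an illustrative stand-in for independent hash functions.

      def char_ngrams(text: str, n: int = 5) -> set:
          return {text[i:i + n] for i in range(len(text) - n + 1)}

      def jaccard(a: set, b: set) -> float:
          return len(a & b) / len(a | b)

      def minhash_estimate(a: set, b: set, k: int = 400) -> float:
          """Fraction of k hash functions whose minimum value agrees between a and b."""
          agree = sum(min(hash((seed, x)) for x in a) == min(hash((seed, x)) for x in b)
                      for seed in range(k))
          return agree / k

      a = char_ngrams("今日は良い天気ですね。")
      b = char_ngrams("今日は良い天気でした。")
      print(jaccard(a, b), minhash_estimate(a, b))  # the two values should be close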
  33. Probability of the Smallest Hash Value Being a Common Feature

     38 Document A (features → hash values): f_3 0.1 (1st), f_1 0.3 (3rd), f_5 0.4 (4th), f_2 0.7 (7th), f_4 0.8 (8th). Document B: f_3 0.1 (1st), f_7 0.2 (2nd), f_5 0.4 (4th), f_6 0.5 (5th), f_8 0.6 (6th). Both minima are f_3 (0.1), a common feature, so the MinHash values match. The probability that the smallest hash value comes from a common feature, i.e. that the two documents’ MinHash values match, equals the Jaccard coefficient.
  34. Compare Multiple Hash Values Together 39 Algorithm to detect duplicate

    documents: 1. Create b buckets per document, each the concatenation of r MinHash values 2. If any bucket matches exactly between two documents, they are considered duplicates. The probability of being considered a duplicate for a pair with Jaccard coefficient s is 1 − (1 − s^r)^b. In this study, b = 20 and r = 20 (see the sketch below). * Too large a K = b·r increases memory requirements and computational complexity
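    A sketch of the bucketing scheme with b = 20 buckets of r = 20 MinHash values each (K = b·r = 400 hash functions per document); as above, the salted built-in hash is only an illustrative stand-in for independent hash functions.

      B, R = 20, 20  # b buckets, each the concatenation of r MinHash values (K = 400)

      def minhash_signature(features: set, k: int = B * R) -> list:
          return [min(hash((seed, f)) for f in features) for seed in range(k)]

      def buckets(signature: list) -> list:
          """Concatenate R consecutive MinHash values into each of the B buckets."""
          return [tuple(signature[i * R:(i + 1) * R]) for i in range(B)]

      def is_duplicate(sig_a: list, sig_b: list) -> bool:
          """Flag a pair as duplicate if any bucket matches exactly.
          For Jaccard coefficient s this happens with probability 1 - (1 - s**R)**B."""
          return any(x == y for x, y in zip(buckets(sig_a), buckets(sig_b)))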