

Building an Effective Pre-training Corpus for Japanese LLM (TAI AAI #3)

“Swallow” is a Large Language Model (LLM) developed by research teams at Tokyo Institute of Technology and the National Institute of Advanced Industrial Science and Technology (AIST). We are working on building better Japanese LLMs by continually pre-training English LLMs on Japanese text. So far, we have successfully enhanced Japanese performance by continual pre-training from models such as Llama 2, Mistral, Mixtral, and Llama 3. The use of a high-quality pre-training corpus has been a crucial factor in this achievement. In this presentation, I mainly introduce the construction procedure of the “Swallow Corpus”, our proprietary large Japanese web corpus, and how it is combined with other corpora. Time permitting, I also discuss future directions in light of Swallow’s current challenges and recent research trends.

These slides were prepared for a presentation at Tokyo AI Advanced AI (TAI AAI) #3 on August 7, 2024.

Our team has also published a paper, which was accepted to COLM 2024.
These slides are mostly based on that paper.
https://arxiv.org/abs/2404.17733

Kakeru Hattori

August 06, 2024



Transcript

  1. Tokyo AI Advanced AI (TAI AAI) #3 Kakeru Hattori Okazaki

    Laboratory M2 Tokyo Institute of Technology August 7 (Wed), 2024 [email protected] Building an Effective Pre-training Corpus for Japanese LLM
  2. Self Introduction 服部 翔 Kakeru Hattori Tokyo Institute of Technology

    School of Computing, Computer Science, Okazaki Laboratory M2 March 2023: “Integrated Generation of Queries and Summaries in Query-focused Summarization” at the 29th Annual Meeting of the Association for Natural Language Processing (NLP2023) (Hitachi, Ltd. Award) July 2023〜 Joined the “Swallow” project (corpus building) Others SWE internship at Nikkei (SRE team / 1.5 years) 2 @ayase_lab Kakeru Hattori
  3. Agenda I am working on an LLM building project called

    “Swallow”, focusing on improving the pre-training corpus Today’s Topics 1. Overview of the “Swallow” project 2. Building a large Japanese web corpus (Swallow Corpus) 3. Llama 3 Swallow (improved corpus recipes, remaining issues) 3
  4. Swallow Project 5 LLM research and development project at Okazaki

    Lab and Yokota Lab https://swallow-llm.github.io/index.en.html
  5. Project Members 6 • Roles: training, corpus, evaluation and instruction

    tuning • Most are students at Tokyo Tech (mostly Bachelor’s and Master’s)
  6. Project Features 7 • Continual pre-training from English open source

    models ◦ Llama 2 (7B, 70B), Mistral (7B), Mixtral (8x7B), Llama 3 (8B, 70B) • Making a strong effort in evaluating LLMs ◦ Dashboard to visualize evaluation results of various Japanese LLMs • Building a large Japanese web corpus (Swallow Corpus)
  7. Overview of Project History 8 2023.12.19 Swallow (on Llama 2)

    ➢ Using a large Japanese web corpus (Swallow Corpus) -> Today’s main topic 2024.03.11 Swallow-MS 7B / Swallow-MX 8x7B (from Mistral, Mixtral) 2024.04.26 Swallow-*-instruct-v0.1 (improved instruct model) 2024.07.01 Llama 3 Swallow ➢ Swallow Corpus + use of existing corpora -> Today’s sub-topic
  8. Building Japanese LLMs with Continual Pre-training 10 • Teach the

    language, knowledge, and culture of Japan to English LLMs • High-quality Japanese text is needed -> Swallow Corpus (Diagram: Base LLM, e.g. Llama 3, + Japanese corpus as training data → Enhanced LLM)
  9. Swallow Corpus 11 • Extract and refine Japanese texts from

    Common Crawl • Japanese text (≈5%) is further filtered to build a high-quality corpus (Diagram: Common Crawl → ONLY Japanese (≈5%) → ONLY high-quality → Swallow Corpus; only 0.27% of the original remains)
  10. Procedure for Building Swallow Corpus 12

     ❶ Swallow-RAW (remove non-Japanese texts): Step 1 Downloading WARC files → Step 2 Rapid Japanese detection → Step 3 Text extraction → Step 4 Precise Japanese detection ❷ Swallow-CLEAN (remove low-quality Japanese texts): Step 5 Quality filtering → Step 6 Deduplication → Step 7 Filtering by hostnames ❸ Swallow-NORM (normalization): Step 8 Normalizing punctuation → Step 9 Removing footers
  11. Procedure for Building Swallow Corpus 13

     (Pipeline diagram repeated, highlighting ❶ Swallow-RAW: text extraction and Japanese detection, covered next.)
  12. WARC Format Requires Text Extraction, but Can Achieve High Quality 14

     There are two file formats available in Common Crawl: WET format (already-extracted text; no extraction needed; low text quality; used by CC-100, mC4, OSCAR) and WARC format (raw HTML; extraction needed; high text quality; used by the Swallow Corpus) ➔ Higher-quality extraction than existing Japanese corpora built from WET
  13. Which Comes First, Text Extraction or Japanese Detection? 15 •

    Text extraction from HTML is time-consuming • Japanese is ONLY 5%, so we want to reduce the candidates first • However, a Japanese language classifier requires text extraction ◦ HTML tags include English, not “pure” Japanese ➔ Is it possible to do “rough” Japanese detection on raw HTML? 🤔
  14. Step 2 Rapid Japanese Detection before Text Extraction 16 If

    either rule is met, the page is considered Japanese: 1. The HTML “lang” attribute is “ja” 2. The HTML “title” element text is in Japanese (using a classifier) Precision: 0.89 / Recall: 0.97 / F1: 0.926 ➔ The benefit of a 1/14 time reduction outweighs the 3% loss (see the sketch below)
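    A minimal Python sketch of these two rules, for illustration only: the regexes and the kana-count heuristic standing in for the title classifier are assumptions, not the actual Swallow implementation.

      import re

      # Rule 1: the <html> tag declares lang="ja" (attribute may be quoted or not).
      LANG_JA = re.compile(r'<html[^>]*\blang=["\']?ja\b', re.IGNORECASE)
      # Rule 2: the <title> text looks Japanese; a kana count stands in for the classifier here.
      TITLE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)
      KANA = re.compile(r'[\u3040-\u309F\u30A0-\u30FF]')  # hiragana + katakana

      def rapid_japanese_detection(html: str) -> bool:
          """Keep the page if either rule fires, so costly text extraction runs only on candidates."""
          if LANG_JA.search(html):
              return True
          m = TITLE.search(html)
          return bool(m) and len(KANA.findall(m.group(1))) >= 2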
  15. Example of Text Extraction with Trafilatura 17 • Use the Trafilatura

    library for text extraction from HTML • e.g. extracting the Okazaki Lab website → Score: 336.33 (positive) (usage sketch below)
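    A small usage sketch with the Trafilatura library; the URL is only an example.

      import trafilatura

      # Download a page and extract its main text; extract() returns None when it fails.
      downloaded = trafilatura.fetch_url("https://www.nlp.c.titech.ac.jp/")  # e.g. the lab website
      text = trafilatura.extract(downloaded)
      if text:
          print(text[:200])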
  16. Step 4 More Precise Japanese Detection by Classifier 18 •

    Japanese can be successfully detected from characters alone • We trained an SVM on Wikipedia using character n-grams as features ◦ Precision: 0.998 / Recall: 0.993 / F1: 0.996 • We later found that FastText outperforms our classifier overall ◦ A well-tuned FastText model may be the most cost-effective choice (see the sketch below)
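    A toy sketch of a character n-gram SVM language classifier with scikit-learn; the tiny training set stands in for Wikipedia sentences, and the feature settings are assumptions rather than the exact Swallow configuration.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import LinearSVC

      texts = ["これは日本語の文です。", "吾輩は猫である。",
               "This is an English sentence.", "The weather is nice today."]
      labels = ["ja", "ja", "en", "en"]

      clf = make_pipeline(
          TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),  # character n-gram features
          LinearSVC(),
      )
      clf.fit(texts, labels)
      print(clf.predict(["猫が好きです。"]))  # expected: ['ja']

    As noted on the slide, a well-tuned off-the-shelf FastText language identifier may be the more cost-effective choice in practice.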
  17. Procedure for Building Swallow Corpus 19

     (Pipeline diagram repeated, highlighting ❷ Swallow-CLEAN: quality filtering and deduplication, covered next.)
  18. Step 5 Quality Filtering with Heuristic Rules 20 • Documents

    with duplicate expressions are detected based on n-grams ◦ Prevents the LLM from generating the same phrase repeatedly ◦ e.g. Number of occurrences of the most frequent 2-gram / Number of all 2-grams > 0.2 (see the sketch below) • Apply custom rules for Japanese language quality ◦ e.g. Hiragana fraction < 0.2, Katakana fraction > 0.5 * A list of all filtering rules is available in Appendix 1
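    A sketch of the example rule above (share of all 2-grams taken up by the most frequent 2-gram); whether the actual rules count character or word n-grams is not stated on this slide, so character 2-grams are assumed here.

      from collections import Counter

      def most_frequent_ngram_ratio(text: str, n: int = 2) -> float:
          """Share of all n-grams taken up by the single most frequent n-gram."""
          ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
          if not ngrams:
              return 0.0
          return Counter(ngrams).most_common(1)[0][1] / len(ngrams)

      def has_duplicate_expressions(text: str) -> bool:
          return most_frequent_ngram_ratio(text, n=2) > 0.2  # threshold from the slide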
  19. Step 6 Deduplication Prevents “Rote Memorization” of the Corpus 21

    • Common Crawl crawls the same websites at different times ➔ Remove duplicates to prevent LLMs from “memorizing” the documents • Deduplication is executed by detecting near-duplicate documents with MinHash ◦ All documents are converted to character 5-gram feature sets * The details of the MinHash duplicate-detection algorithm are available in Appendix 2
  20. Step 6 After Deduplication, only about 20% of the Older

    Crawls Remain 22 Older documents are reduced considerably since the newer ones are kept; before 2022, only about 20% are left
  21. Step 7 Filtering Based on NG Words and NG URLs

    23 • In addition to Step 5, remove harmful content based on NG words and URLs ◦ Hostname is included in the UT1 blocklist ◦ Percentage of pages containing the name of a dating site (0.001 <) ◦ Percentage of pages containing NG expressions (0.005 <) ◦ *wikipedia.org (the Wikipedia dump is added separately) ◦ *.5ch.net (probably a low-quality forum) (see the sketch below)
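    A rough sketch of hostname-level filtering in the spirit of the rules above; the blocklist contents and NG words are placeholders (only the NG-expression fraction rule is shown), not the actual Swallow lists.

      UT1_BLOCKED_HOSTS = {"bad.example.com"}   # placeholder for hosts in the UT1 blocklist
      NG_WORDS = ["出会い系"]                    # placeholder for NG expressions

      def keep_hostname(host: str, pages: list) -> bool:
          """Drop blocklisted hosts, Wikipedia (added separately from its dump), 5ch,
          and hosts with too many pages containing NG expressions."""
          if host in UT1_BLOCKED_HOSTS:
              return False
          if host.endswith("wikipedia.org") or host.endswith(".5ch.net"):
              return False
          ng_fraction = sum(any(w in p for w in NG_WORDS) for p in pages) / max(len(pages), 1)
          return ng_fraction <= 0.005  # threshold from the slide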
  22. Procedure for Building Swallow Corpus 24

     (Pipeline diagram repeated, highlighting ❸ Swallow-NORM: normalization, covered next.)
  23. Japanese-specific Normalization & Footer Removal 25 • NFKC Normalization ◦

    Normalize full- and half-width alphabets, kana, and symbols • Consider the Japanese-specific use of punctuation ◦ If「,」occurs more often than「、」, unify to「、」 ◦ If「.」occurs more often than「。」, unify to「。」 • Remove typical footer expressions ◦ e.g.「無断転載を禁ず」(“Unauthorized reproduction prohibited”),「この記事へのトラックバック一覧」(“List of trackbacks for this article”) (see the sketch below)
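    A sketch of the normalization above; the footer patterns are just the two examples from the slide, and the blanket punctuation replacement is a simplification of whatever the real pipeline does.

      import unicodedata

      FOOTER_PATTERNS = ["無断転載を禁ず", "この記事へのトラックバック一覧"]

      def normalize_document(text: str) -> str:
          text = unicodedata.normalize("NFKC", text)        # unify full-/half-width forms
          if text.count(",") > text.count("、"):
              text = text.replace(",", "、")
          if text.count(".") > text.count("。"):
              text = text.replace(".", "。")
          lines = [line for line in text.splitlines()
                   if not any(p in line for p in FOOTER_PATTERNS)]  # drop typical footer lines
          return "\n".join(lines)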
  24. Differences from Previous Work (RefinedWeb) 26

     RefinedWeb (https://arxiv.org/abs/2306.01116) pipeline: Step 1 Hostname-based filtering (the UT1 blocklist is available) → Step 2 Text extraction → Step 3 English detection → Step 4 Quality filtering → Step 5 Line-wise correction → Step 6 Deduplication → Step 7 Precise deduplication. Swallow pipeline: Step 1 Downloading WARC files → Step 2 Rapid Japanese detection (economical processing) → Step 3 Text extraction → Step 4 Precise Japanese detection → Step 5 Quality filtering → Step 6 Deduplication → Step 7 Filtering by hostnames → Step 8 Normalizing punctuation → Step 9 Removing footers (Japanese-specific processing)
  25. Swallow Corpus has Boosted LLM Performance 27 • Continual pre-training

    for Llama 2 (13B) with several Japanese corpora ◦ Of the roughly 104.9 billion tokens of training data, 90% was Japanese ◦ Of the Japanese portion, about 1.6 billion tokens was Wikipedia, and the rest was one of the Japanese corpora being compared
  26. Llama 3 Swallow 29 • Llama 3 Swallow was released

    in July 2024! ◦ Overall Japanese performance is comparable to Qwen 2 ◦ Excellent in Japanese knowledge (e.g. QA, common sense, translation) (Charts: question answering and translation scores)
  27. Improved Performance by Incorporating Existing Corpora 30 • Cosmopedia: synthesized

    “textbook-like” text (English) ◦ Contributes to arithmetic reasoning and code generation • Laboro ParaCorpus: English-Japanese parallel corpus ◦ Contributes to translation (especially Ja-En)
  28. Some Issues Remain Unresolved 31 • Issues: code generation, arithmetic

    reasoning, general education ◦ Qwen 2 is significantly better than Llama 3 Swallow in these domains ◦ In particular, code generation is lower than that of the base model (Llama 3) (Charts: code generation, arithmetic reasoning, and general education scores)
  29. What are Next Actions in Pre-training Corpus? 32 Pre-training corpus

    must be a core factor in enhancing Japanese LLMs 1. Further data collection ◦ Web-derived texts may be near exhaustion ◦ Potentially a lot of non-Web text, but often with copyright issues ◦ e.g. PDFs, closed documents, OCR ◦ Synthetic data ◦ e.g. Cosmopedia 2. Make effective use of data already collected ◦ Non-heuristic filtering
  30. The Full List of Filtering Rules (1) 34 Delete documents

    that contain many duplicate expressions The rules of related work (MassiveWeb) are adopted as-is • Number of lines duplicated in other lines / Number of all lines (0.30) • Number of paragraphs overlapping other paragraphs / Total number of paragraphs (0.30) • Number of characters in lines overlapping other lines / Number of all characters (0.20) • Number of characters in paragraphs that overlap with other paragraphs / Total number of characters (0.20) • Number of occurrences of most frequent 2-gram / Number of occurrences of all 2-grams (0.20) • Number of occurrences of most frequent 3-gram / Number of occurrences of all 3-grams (0.18) • Number of occurrences of most frequent 4-gram / Number of occurrences of all 4-grams (0.16) • Total number of 5-grams occurring 2 or more times / Total number of 5-grams (0.15) • Total number of 6-grams occurring 2 or more times / Total number of 6-grams (0.14) • Total number of 7-grams occurring 2 or more times / Total number of 7-grams (0.13) • Total number of 8-grams occurring 2 or more times / Total number of 8-grams (0.12) • Total number of 9-grams occurring 2 or more times / Total number of 9-grams (0.11) • Total number of 10-grams occurring 2 or more times / Total number of 10-grams (0.10)
  31. The Full List of Filtering Rules (2) 35 Deletion of

    documents containing low-quality Japanese • Number of letters (< 400) • Fraction of hiragana letters (< 0.2) • Fraction of katakana letters (0.5 <) • Fraction of Japanese letters (hiragana, katakana, kanji, punctuation) (< 0.5) • Average number of letters in a sentence in a document (< 20, 90 <) • Number of letters in the longest sentence (200 <) • Fraction of sentences ending with an ellipsis (0.2 <)
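    A sketch of a few of the rules above; the thresholds follow the slide, but the character ranges and the sentence splitting are simplified assumptions.

      import re

      HIRAGANA = re.compile(r"[\u3040-\u309F]")
      KATAKANA = re.compile(r"[\u30A0-\u30FF]")
      JAPANESE = re.compile(r"[\u3040-\u30FF\u4E00-\u9FFF、。]")  # kana, kanji, punctuation

      def passes_japanese_quality_rules(text: str) -> bool:
          n = len(text)
          if n < 400:                                        # number of letters
              return False
          if len(HIRAGANA.findall(text)) / n < 0.2:          # hiragana fraction
              return False
          if len(KATAKANA.findall(text)) / n > 0.5:          # katakana fraction
              return False
          if len(JAPANESE.findall(text)) / n < 0.5:          # Japanese letters fraction
              return False
          sentences = [s for s in text.split("。") if s]
          avg = sum(len(s) for s in sentences) / max(len(sentences), 1)
          return 20 <= avg <= 90                             # average sentence length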
  32. Jaccard Coefficients can be Approximated with MinHash 37

     The Jaccard coefficient measures the similarity of two documents (0~1): J(A, B) = |A ∩ B| / |A ∪ B|, i.e. the number of common features divided by the number of total features. Computing it exactly for all document pairs is hugely expensive, so it is approximated with MinHash, the minimum hash value over all of a document’s features under a given hash function. The Jaccard coefficient can then be estimated from MinHash values (see the sketch below).
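    A toy sketch of the idea: the exact Jaccard coefficient of two character 5-gram sets versus a MinHash estimate. Salting Python’s built-in hash is only an illustrative stand-in for independent hash functions.

      def char_ngrams(text: str, n: int = 5) -> set:
          return {text[i:i + n] for i in range(len(text) - n + 1)}

      def jaccard(a: set, b: set) -> float:
          return len(a & b) / len(a | b)

      def minhash_estimate(a: set, b: set, k: int = 400) -> float:
          """Fraction of k hash functions whose minimum value agrees between a and b."""
          agree = sum(min(hash((seed, x)) for x in a) == min(hash((seed, x)) for x in b)
                      for seed in range(k))
          return agree / k

      a = char_ngrams("今日は良い天気ですね。")
      b = char_ngrams("今日は良い天気でした。")
      print(jaccard(a, b), minhash_estimate(a, b))  # the two values should be close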
  33. Probability of the Smallest Hash Value Being a Common Feature

     38 Document A (features → hash values): f_3 0.1 (1st), f_1 0.3 (3rd), f_5 0.4 (4th), f_2 0.7 (7th), f_4 0.8 (8th). Document B: f_3 0.1 (1st), f_7 0.2 (2nd), f_5 0.4 (4th), f_6 0.5 (5th), f_8 0.6 (6th). Both minima are f_3 (0.1), a common feature, so the MinHash values match. The probability that the smallest hash value comes from a common feature, i.e. that the two documents’ MinHash values match, equals the Jaccard coefficient.
  34. Compare Multiple Hash Values Together 39 Algorithm to detect duplicate

    documents: 1. Create b buckets per document, each the concatenation of r MinHash values 2. If any bucket matches exactly between two documents, they are considered duplicates. The probability of being considered a duplicate for a pair with Jaccard coefficient s is 1 − (1 − s^r)^b. In this study, b = 20 and r = 20 (see the sketch below). * Too large a K = b·r increases memory requirements and computational complexity
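    A sketch of the bucketing scheme with b = 20 buckets of r = 20 MinHash values each (K = b·r = 400 hash functions per document); as above, the salted built-in hash is only an illustrative stand-in for independent hash functions.

      B, R = 20, 20  # b buckets, each the concatenation of r MinHash values (K = 400)

      def minhash_signature(features: set, k: int = B * R) -> list:
          return [min(hash((seed, f)) for f in features) for seed in range(k)]

      def buckets(signature: list) -> list:
          """Concatenate R consecutive MinHash values into each of the B buckets."""
          return [tuple(signature[i * R:(i + 1) * R]) for i in range(B)]

      def is_duplicate(sig_a: list, sig_b: list) -> bool:
          """Flag a pair as duplicate if any bucket matches exactly.
          For Jaccard coefficient s this happens with probability 1 - (1 - s**R)**B."""
          return any(x == y for x, y in zip(buckets(sig_a), buckets(sig_b)))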