Slide 1

Building an Effective Pre-training Corpus for Japanese LLM
Kakeru Hattori (Okazaki Laboratory M2, Tokyo Institute of Technology)
Tokyo AI Advanced AI (TAI AAI) #3, August 7 (Wed), 2024
[email protected]

Slide 2

Self Introduction: 服部 翔 Kakeru Hattori (@ayase_lab)
● Tokyo Institute of Technology, School of Computing, Computer Science, Okazaki Laboratory (M2)
● March 2023: The 29th Annual Meeting of the Association for Natural Language Processing (NLP2023), 『クエリ指向要約におけるクエリと要約の統合的な生成』 ("Integrated Generation of Query and Summary in Query-Focused Summarization", Hitachi, Ltd. Award)
● July 2023〜: Joined the “Swallow” project (corpus building)
● Others: SWE internship at Nikkei (SRE team / 1.5 years)

Slide 3

Agenda
I am working on an LLM-building project called “Swallow”, focusing on improving the pre-training corpus.
Today’s topics:
1. Overview of the “Swallow” project
2. Building a large Japanese web corpus (Swallow Corpus)
3. Llama 3 Swallow (improved corpus recipes, remaining issues)

Slide 4

Swallow

Slide 5

Swallow Project
● An LLM research and development project at Okazaki Lab and Yokota Lab
● https://swallow-llm.github.io/index.en.html

Slide 6

Project Members
● Roles: training, corpus, evaluation, and instruction tuning
● Most members are students at Tokyo Tech (mostly Bachelor’s and Master’s)

Slide 7

Project Features
● Continual pre-training from English open-source models
○ Llama 2 (7B, 70B), Mistral (7B), Mixtral (8x7B), Llama 3 (8B, 70B)
● Making a strong effort to evaluate LLMs
○ Dashboard to visualize evaluation results of various Japanese LLMs
● Building a large Japanese web corpus (Swallow Corpus)

Slide 8

Overview of Project History
● 2023.12.19: Swallow (on Llama 2)
➢ Uses a large Japanese web corpus (Swallow Corpus) -> today’s main topic
● 2024.03.11: Swallow-MS 7B / Swallow-MX 8x7B (from Mistral, Mixtral)
● 2024.04.26: Swallow-*-instruct-v0.1 (improved instruct models)
● 2024.07.01: Llama 3 Swallow
➢ Swallow Corpus + use of existing corpora -> today’s sub topic

Slide 9

Swallow Corpus

Slide 10

Building Japanese LLMs with Continual Pre-training
● Teach the language, knowledge, and culture of Japan to English LLMs
● High-quality Japanese text is needed -> Swallow Corpus
(Diagram: Base LLM, e.g. Llama 3, + Japanese corpus (training data) -> Enhanced LLM)

Slide 11

Swallow Corpus
● Extract and refine Japanese texts from Common Crawl
● Japanese text (≈5% of Common Crawl) is further filtered to build a high-quality corpus
○ Keeping ONLY Japanese, then ONLY high-quality text, leaves only 0.27% of the original

Slide 12

Procedure for Building Swallow Corpus
❶ Swallow-RAW (text extraction, Japanese detection; removes non-Japanese texts)
○ Step 1: Downloading WARC files
○ Step 2: Rapid Japanese detection
○ Step 3: Text extraction
○ Step 4: Precise Japanese detection
❷ Swallow-CLEAN (quality filtering, deduplication; removes low-quality Japanese texts)
○ Step 5: Quality filtering
○ Step 6: Deduplication
○ Step 7: Filtering by hostnames
❸ Swallow-NORM (normalization)
○ Step 8: Normalizing punctuations
○ Step 9: Removing footers

Slide 13

Procedure for Building Swallow Corpus (pipeline repeated; this part covers ❶ Swallow-RAW: Steps 1–4, text extraction and Japanese detection)

Slide 14

WARC Format Requires Text Extraction but Can Achieve High Quality
There are two file formats available in Common Crawl:
● WET format: already-extracted text; no extraction needed; low text quality; used by e.g. CC-100, mC4, OSCAR
● WARC format: raw HTML; extraction needed; high text quality; used by the Swallow Corpus
➔ Higher-quality extraction than existing Japanese corpora built from WET
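For concreteness, the sketch below shows how raw HTML payloads can be read from a Common Crawl WARC file with the warcio library; this is an assumed illustration of the pipeline's input, not the project's actual tooling:

    from warcio.archiveiterator import ArchiveIterator

    def iter_html_records(warc_path):
        """Yield (url, html_bytes) for each HTTP response record in a WARC file."""
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue                     # skip request/metadata records
                url = record.rec_headers.get_header("WARC-Target-URI")
                html = record.content_stream().read()   # raw HTML; text still needs extraction
                yield url, html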

Slide 15

Which Comes First, Text Extraction or Japanese Detection?
● Text extraction from HTML is time-consuming
● Japanese is ONLY 5%, so we want to narrow down the candidates first
● However, a Japanese language classifier requires extracted text
○ HTML tags contain English, so raw HTML is not “pure” Japanese
➔ Is it possible to detect Japanese “roughly” while still in HTML format? 🤔

Slide 16

Step 2: Rapid Japanese Detection before Text Extraction
A page is considered Japanese if either rule is met:
1. The HTML “lang” attribute is “ja”
2. The text of the HTML “title” element is in Japanese (using a classifier)
Precision: 0.89 / Recall: 0.97 / F1: 0.926
➔ The benefit of cutting processing time to 1/14 outweighs the 3% loss in recall
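A minimal sketch of the two rules, scanning the raw HTML with regular expressions; `title_is_japanese` stands in for the title-text classifier and is a hypothetical placeholder here:

    import re

    LANG_JA = re.compile(r'<html[^>]*\blang=["\']?ja\b', re.IGNORECASE)
    TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

    def title_is_japanese(title: str) -> bool:
        # Hypothetical stand-in for the classifier used in rule 2; here a crude kana check.
        return bool(re.search(r"[ぁ-んァ-ン]", title))

    def maybe_japanese(html: str) -> bool:
        """Rapid pre-filter: rule 1 (lang="ja") OR rule 2 (Japanese <title> text)."""
        if LANG_JA.search(html):
            return True
        m = TITLE.search(html)
        return bool(m) and title_is_japanese(m.group(1))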

Slide 17

Example of Text Extraction with Trafilatura
● Use the Trafilatura library for text extraction from HTML
● e.g. Okazaki Lab’s website → Score: 336.33 (positive)
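For reference, Trafilatura extraction boils down to a single call. The snippet below is a generic usage sketch (the URL is a placeholder, and the project's exact extraction options are not shown here):

    import trafilatura

    # Download a page (or take the HTML payload of a WARC record) and extract the main text.
    html = trafilatura.fetch_url("https://example.org/")  # placeholder URL
    text = trafilatura.extract(html)  # main body text, or None if nothing usable is found
    if text:
        print(text[:200])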

Slide 18

Step 4: More Precise Japanese Detection by a Classifier
● Japanese can be detected successfully from characters alone
● We trained an SVM on Wikipedia using character n-grams as features
○ Precision: 0.998 / Recall: 0.993 / F1: 0.996
● We later found that fastText outperforms our classifier overall
○ A well-adjusted fastText model may be the most cost-effective choice
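Since a well-adjusted fastText model is mentioned as the likely most cost-effective choice, here is a generic language-identification sketch with the off-the-shelf lid.176.bin model; this is an illustration, not the custom character n-gram SVM actually used in Swallow:

    import fasttext

    # Off-the-shelf language-ID model; download lid.176.bin from the fastText website first.
    model = fasttext.load_model("lid.176.bin")

    def is_japanese(text: str, threshold: float = 0.9) -> bool:
        # fastText expects a single line of text.
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        return labels[0] == "__label__ja" and probs[0] >= threshold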

Slide 19

Procedure for Building Swallow Corpus (pipeline repeated; this part covers ❷ Swallow-CLEAN: Steps 5–7, quality filtering and deduplication)

Slide 20

Step 5: Quality Filtering with Heuristic Rules
● Documents with repeated expressions are detected based on n-grams
○ Prevents the LLM from generating the same phrase repeatedly
○ e.g. occurrences of the most frequent 2-gram / occurrences of all 2-grams > 0.2
● Apply custom rules for Japanese language quality
○ e.g. hiragana fraction < 0.2, katakana fraction > 0.5
* A list of all filtering rules is available in Appendix.1
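A simplified sketch of two of the rules above (the most-frequent-2-gram ratio and the hiragana/katakana fractions); the thresholds follow the slide, while the helper functions are assumptions for illustration:

    from collections import Counter

    def top_2gram_fraction(text: str) -> float:
        grams = [text[i:i + 2] for i in range(len(text) - 1)]
        if not grams:
            return 0.0
        return Counter(grams).most_common(1)[0][1] / len(grams)

    def char_fraction(text: str, lo: str, hi: str) -> float:
        if not text:
            return 0.0
        return sum(lo <= ch <= hi for ch in text) / len(text)

    def passes_quality_filters(text: str) -> bool:
        """Reject repetitive or unnatural documents (subset of the full rule set in Appendix.1)."""
        if top_2gram_fraction(text) > 0.2:           # too repetitive
            return False
        if char_fraction(text, "ぁ", "ゖ") < 0.2:     # too little hiragana
            return False
        if char_fraction(text, "ァ", "ヺ") > 0.5:     # too much katakana
            return False
        return True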

Slide 21

Step 6: Deduplication Prevents “Rote Memorization” of the Corpus
● Common Crawl crawls the same websites at different times
➔ Remove duplicates to prevent LLMs from “memorizing” documents
● Deduplication detects near-duplicate documents with MinHash
○ Each document is converted to a set of character 5-gram features
* The details of the MinHash duplicate-detection algorithm are available in Appendix.2
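A minimal deduplication sketch with the datasketch library, using the character 5-gram feature sets described above; this is an assumed small-scale implementation, while the actual pipeline runs the bucketed MinHash algorithm of Appendix.2 at scale:

    from datasketch import MinHash, MinHashLSH

    def doc_minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for i in range(len(text) - 4):             # character 5-gram feature set
            m.update(text[i:i + 5].encode("utf-8"))
        return m

    def deduplicate(docs, threshold=0.8):
        """docs: mapping doc_id -> text; keeps one representative per near-duplicate cluster."""
        lsh = MinHashLSH(threshold=threshold, num_perm=128)
        kept = []
        for doc_id, text in docs.items():          # iterate newest-first to keep newer documents
            m = doc_minhash(text)
            if lsh.query(m):                       # a near-duplicate has already been kept
                continue
            lsh.insert(doc_id, m)
            kept.append(doc_id)
        return kept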

Slide 22

Step 6: After Deduplication, only about 20% of the Older Crawls Remain
● Older documents are reduced considerably since the newer ones are kept
● For crawls before 2022, only about 20% of documents are left
(Chart: fraction of documents remaining per crawl after Steps 5, 6, and 7)

Slide 23

Step 7: Filtering Based on NG Words and NG URLs
● In addition to Step 5, remove harmful content by NG words and NG URLs; a hostname is removed if:
○ it is included in the UT1 blocklist
○ the fraction of its pages containing the name of a dating site exceeds 0.001
○ the fraction of its pages containing NG expressions exceeds 0.005
○ it matches *wikipedia.org (the Wikipedia dump is added separately)
○ it matches *.5ch.net (probably a low-quality forum)
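A sketch of the hostname-level filter, assuming the UT1 blocklist and the per-host page statistics have been computed beforehand; names and thresholds follow the slide, the function itself is an illustration:

    from urllib.parse import urlparse

    def is_blocked_host(url, ut1_blocklist, dating_fraction, ng_fraction):
        """Return True if the page's hostname should be dropped from the corpus."""
        host = urlparse(url).hostname or ""
        if host in ut1_blocklist:                      # UT1 blocklist
            return True
        if host.endswith("wikipedia.org"):             # the Wikipedia dump is added separately
            return True
        if host.endswith(".5ch.net"):                  # probably a low-quality forum
            return True
        if dating_fraction.get(host, 0.0) > 0.001:     # pages containing dating-site names
            return True
        if ng_fraction.get(host, 0.0) > 0.005:         # pages containing NG expressions
            return True
        return False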

Slide 24

Procedure for Building Swallow Corpus (pipeline repeated; this part covers ❸ Swallow-NORM: Steps 8–9, normalization)

Slide 25

Japanese-specific Normalization & Footer Removal
● NFKC normalization
○ Normalize full- and half-width alphabets, kana, and symbols
● Consider the Japanese-specific use of punctuation
○ If「,」occurs more often than「、」, unify to「、」
○ If「.」occurs more often than「。」, unify to「。」
● Remove typical footer expressions
○ e.g.「無断転載を禁ず」(“unauthorized reproduction prohibited”),「この記事へのトラックバック一覧」(“list of trackbacks to this article”)
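A sketch of this normalization step; NFKC via the standard unicodedata module, with the punctuation-unification rule written as stated above (a real pipeline would likely guard against URLs and numbers when replacing periods):

    import unicodedata

    def normalize_japanese(text: str) -> str:
        # NFKC: unify full- and half-width alphabets, kana, and symbols.
        text = unicodedata.normalize("NFKC", text)
        # Punctuation rule from the slide: unify to 「、」/「。」 when ,/. dominate.
        if text.count(",") > text.count("、"):
            text = text.replace(",", "、")
        if text.count(".") > text.count("。"):
            text = text.replace(".", "。")
        return text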

Slide 26

Differences from Previous Work (RefinedWeb, https://arxiv.org/abs/2306.01116)
● RefinedWeb pipeline: Step 1 Hostname-based filtering (the UT1 blocklist is available) / Step 2 Text extraction / Step 3 English detection / Step 4 Quality filtering / Step 5 Correction in line / Step 6 Deduplication / Step 7 Precise deduplication
● Swallow pipeline: Step 1 Downloading WARC files / Step 2 Rapid Japanese detection / Step 3 Text extraction / Step 4 Precise Japanese detection / Step 5 Quality filtering / Step 6 Deduplication / Step 7 Filtering by hostnames / Step 8 Normalizing punctuations / Step 9 Removing footers
● Diagram annotations: economical processing (rapid Japanese detection before text extraction) and Japanese-specific processing (precise Japanese detection, punctuation normalization, footer removal)

Slide 27

Swallow Corpus has Boosted LLM Performance
● Continual pre-training of Llama 2 (13B) with several Japanese corpora
○ Of the roughly 104.9B training tokens, 90% were Japanese
○ Of the Japanese portion, about 1.6B tokens were Wikipedia, and the rest was the Japanese corpus under comparison

Slide 28

Recent Progress (Llama 3 Swallow)

Slide 29

Llama 3 Swallow
● Llama 3 Swallow was released in July 2024!
○ Overall Japanese performance is comparable to Qwen 2
○ Excellent in Japanese knowledge (e.g. QA, common sense, translation)
(Charts: question answering, translation)

Slide 30

Improved Performance by Incorporating Existing Corpora
● Cosmopedia: synthesized “textbook-like” text (English)
○ Contributes to arithmetic reasoning and code generation
● Laboro ParaCorpus: English-Japanese parallel corpus
○ Contributes to translation (especially Ja-En)

Slide 31

Some Issues Remain Unresolved
● Issues: code generation, arithmetic reasoning, general education
○ Qwen 2 is significantly better than Llama 3 Swallow in these domains
○ In particular, code generation scores lower than the base model (Llama 3)
(Charts: code generation, arithmetic reasoning, general education)

Slide 32

What are the Next Actions for the Pre-training Corpus?
The pre-training corpus must be a core factor in enhancing Japanese LLMs
1. Further data collection
○ Web-oriented texts may be near exhaustion
○ Potentially a lot of non-Web text (e.g. PDFs, closed documents, OCR), but often with copyright issues
○ Synthetic data (e.g. Cosmopedia)
2. Make more effective use of data already collected
○ Non-heuristic filtering

Slide 33

Appendix.1: The Full List of Filtering Rules

Slide 34

The Full List of Filtering Rules (1)
Delete documents that contain many duplicate expressions; the rules of related work (MassiveWeb) are adopted as-is (threshold in parentheses):
● Number of lines duplicated in other lines / number of all lines (0.30)
● Number of paragraphs duplicated in other paragraphs / number of all paragraphs (0.30)
● Number of characters in lines duplicated in other lines / number of all characters (0.20)
● Number of characters in paragraphs duplicated in other paragraphs / number of all characters (0.20)
● Occurrences of the most frequent 2-gram / occurrences of all 2-grams (0.20)
● Occurrences of the most frequent 3-gram / occurrences of all 3-grams (0.18)
● Occurrences of the most frequent 4-gram / occurrences of all 4-grams (0.16)
● Total number of 5-grams occurring 2 or more times / total number of 5-grams (0.15)
● Total number of 6-grams occurring 2 or more times / total number of 6-grams (0.14)
● Total number of 7-grams occurring 2 or more times / total number of 7-grams (0.13)
● Total number of 8-grams occurring 2 or more times / total number of 8-grams (0.12)
● Total number of 9-grams occurring 2 or more times / total number of 9-grams (0.11)
● Total number of 10-grams occurring 2 or more times / total number of 10-grams (0.10)
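A sketch of the first line-level rule, assuming lines are newline-separated; the 0.30 threshold is the one listed above, and the remaining rules follow the same pattern:

    from collections import Counter

    def duplicate_line_fraction(text: str) -> float:
        """Fraction of lines that also occur elsewhere in the document."""
        lines = [line for line in text.splitlines() if line.strip()]
        if not lines:
            return 0.0
        counts = Counter(lines)
        duplicated = sum(c for c in counts.values() if c > 1)   # lines occurring 2+ times
        return duplicated / len(lines)

    def too_many_duplicate_lines(text: str) -> bool:
        return duplicate_line_fraction(text) > 0.30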

Slide 35

The Full List of Filtering Rules (2)
Delete documents containing low-quality Japanese (threshold in parentheses):
● Number of letters (< 400)
● Fraction of hiragana letters (< 0.2)
● Fraction of katakana letters (0.5 <)
● Fraction of Japanese letters (hiragana, katakana, kanji, punctuation) (< 0.5)
● Average number of letters per sentence in the document (< 20, 90 <)
● Number of letters in the longest sentence (200 <)
● Fraction of sentences ending with an ellipsis (0.2 <)

Slide 36

Appendix.2: Details of Deduplication Algorithm

Slide 37

Jaccard Coefficients can be Approximated with MinHash
● Jaccard coefficient: the similarity between two documents (0~1), defined as the number of common features / number of total features, i.e. J(A, B) = |A ∩ B| / |A ∪ B|
● MinHash: the minimum hash value over all elements of a set under a given hash function
● The probability that two sets share the same MinHash equals their Jaccard coefficient, so the Jaccard coefficient can be obtained from MinHash, greatly reducing the computational complexity by approximation

Slide 38

Probability of the Smallest Hash Value Being a Common Feature
● Document A (feature: hash value): f_3: 0.1 (1st), f_1: 0.3 (3rd), f_5: 0.4 (4th), f_2: 0.7 (7th), f_4: 0.8 (8th)
● Document B (feature: hash value): f_3: 0.1 (1st), f_7: 0.2 (2nd), f_5: 0.4 (4th), f_6: 0.5 (5th), f_8: 0.6 (6th)
● The smallest hash value (0.1) belongs to the shared feature f_3, so the MinHash values match
● “Is the smallest hash value a common feature?” is true with probability equal to the Jaccard coefficient
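The example can be checked numerically. The sketch below computes the exact Jaccard coefficient of the two feature sets and estimates it by counting how often the minimum hash value is shared, simulating each hash function with a random assignment of values to features (an illustration, not the production implementation):

    import random

    A = {"f_1", "f_2", "f_3", "f_4", "f_5"}
    B = {"f_3", "f_5", "f_6", "f_7", "f_8"}

    jaccard = len(A & B) / len(A | B)          # 2 / 8 = 0.25

    def minhash_match(a, b, rng):
        # One hash function = one random assignment of hash values to features.
        order = {f: rng.random() for f in a | b}
        return min(a, key=order.get) == min(b, key=order.get)

    rng = random.Random(0)
    trials = 100_000
    estimate = sum(minhash_match(A, B, rng) for _ in range(trials)) / trials
    print(jaccard, round(estimate, 3))          # both should be close to 0.25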

Slide 39

Compare Multiple Hash Values Together
Algorithm to detect duplicate documents:
1. Create r buckets, each a concatenation of b MinHash values
2. If any of the r bucket comparisons is an exact match, the documents are considered duplicates
Probability of being considered a duplicate when the Jaccard coefficient is s: 1 - (1 - s^b)^r
In this study, b = 20 and r = 20
* Making K = br too large will increase memory requirements and computational complexity
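To see what b = 20 and r = 20 mean in practice, a small sketch evaluating the duplicate probability 1 - (1 - s^b)^r for a few Jaccard coefficients s; the sample values are purely illustrative:

    def duplicate_probability(s: float, b: int = 20, r: int = 20) -> float:
        """Probability that two documents with Jaccard coefficient s share at least one bucket."""
        return 1.0 - (1.0 - s ** b) ** r

    for s in (0.5, 0.7, 0.8, 0.9, 0.95):
        # With b = r = 20, the probability rises sharply around s ≈ 0.9.
        print(f"s = {s:.2f} -> P(duplicate) = {duplicate_probability(s):.3f}")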