Slide 25
Japanese-specific Normalization & Footer Removal
● NFKC Normalization
○ Normalize full- and half-width variants of Latin letters, kana (katakana), and symbols
● Account for Japanese-specific punctuation usage
○ If「,」occurs more often than「、」, unify to「、」
○ If「.」occurs more often than「。」, unify to「。」
● Remove typical footer expressions
○ e.g.「無断転載を禁ず」("Unauthorized reproduction prohibited"),「この記事へのトラックバック一覧」("List of trackbacks to this article")
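The three steps on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the order of the steps, the choice to count punctuation over the whole document, and the choice to drop whole lines that contain a footer phrase are all hypothetical. Only `unicodedata.normalize` is a standard-library call; the phrase list holds just the two examples from the slide.

```python
import unicodedata

# Hypothetical phrase list: only the two examples given on the slide.
FOOTER_PHRASES = ("無断転載を禁ず", "この記事へのトラックバック一覧")


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization.

    Among other things, this maps full-width Latin letters and digits to
    their ASCII forms and half-width katakana to full-width katakana.
    """
    return unicodedata.normalize("NFKC", text)


def unify_punctuation(text: str) -> str:
    """Switch to Japanese punctuation when the Western marks dominate.

    Assumption: counts are taken over the whole text, and every occurrence
    is replaced. Note that this also rewrites "." inside decimals or URLs;
    the slide does not say how such cases are handled.
    """
    if text.count(",") > text.count("、"):
        text = text.replace(",", "、")
    if text.count(".") > text.count("。"):
        text = text.replace(".", "。")
    return text


def remove_footer_lines(text: str, phrases=FOOTER_PHRASES) -> str:
    """Drop every line that contains a known footer phrase (assumed policy)."""
    kept = [
        line for line in text.split("\n")
        if not any(phrase in line for phrase in phrases)
    ]
    return "\n".join(kept)


def clean(text: str) -> str:
    """Run the three steps in the order listed on the slide (assumed order)."""
    return remove_footer_lines(unify_punctuation(nfkc(text)))


if __name__ == "__main__":
    sample = "本文です，テストです．\n無断転載を禁ず"
    print(clean(sample))
```

Running NFKC first means full-width「，」and「．」have already become「,」and「.」by the time the punctuation counts are taken, so a single comparison covers both widths. Whether the original pipeline relies on that ordering is an assumption of this sketch.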