Youichi Sekiguchi and Kazuhide Yamamoto. Improving Quality of the Web Corpus. Proceedings of The First International Joint Conference on Natural Language Processing (IJCNLP-04), pp.201-206 (2004.3)
TEXTS
• contain live language information
• contain more words than a newspaper corpus
• yield more case frames than a newspaper corpus
A Web corpus has advantages in both quantity and quality.
(Kanji, Hiragana, Katakana, alphabet, numbers): delete lines that have an unnatural proportion of character types.
Over-spoken style: Web texts contain expressions that are not worth using for corpus construction.
Sentence identification: decide the position of the end of each sentence (necessary because Web texts contain HTML tags).
This phenomenon distorts the statistics.
Web-oriented expression deletion: Japanese Web text sometimes includes smileys and word-smileys.
ex. [smileys] :-) (*^o^*) (-_-)
[word-smileys] ( 笑 ) "smiling", ( 泣 ) "crying"
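A minimal sketch of how lines containing such expressions could be flagged. The patterns are built only from the examples shown on this slide; a real list would be far longer. The function name `contains_smiley` and the choice to drop the whole line, rather than only the expression, are assumptions made for illustration.

```python
import re

# Hypothetical patterns derived only from the slide's examples.
# Word-smileys: a single kanji such as 笑 or 泣 inside ASCII or full-width parentheses.
WORD_SMILEY = re.compile(r"[((]\s*[笑泣]\s*[))]")
# Smileys: the three literal faces listed on the slide.
FACE_SMILEY = re.compile(r":-\)|\(\*\^o\^\*\)|\(-_-\)")


def contains_smiley(line):
    """Return True if the line contains one of the example (word-)smileys."""
    return bool(WORD_SMILEY.search(line) or FACE_SMILEY.search(line))


def drop_smiley_lines(lines):
    """Keep only the lines without smileys (assumption: the whole line is deleted)."""
    return [line for line in lines if not contains_smiley(line)]
```

For example, `drop_smiley_lines(["今日は楽しかった(笑)", "今日は良い天気です。"])` keeps only the second line.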
Comparison with the newspaper corpus
• Wordage
• Word bias
• Number of case frames
Comparison between the Web corpus and the Web text
• Wordage
• Number of case frames
Web A : Web corpus, 21 MB (completely processed)
Web Ao : Web text, 21 MB (only HTML tags removed)
[Chart: unknown words and known words]
Web A was confirmed to contain more known words and fewer unknown words than Web Ao.
wordage
• More case frames
• Not biased toward Web-peculiar words
➔ Our procedures contribute to improving quality (deleting only the HTML tags is not effective)
• Unknown words decreased
• Known words increased
21 MB Web corpus compared with a newspaper corpus of the same size
Number of words: 24% increase
Number of case frames: 3% increase
• Contributes to fewer unknown words and more case frames
general. = There exists no ideal text corpus that satisfies every task.
Statistics: word N-gram
For example:
• Statistical word tagging
• Word sense disambiguation
• Various word usages
We aim to construct a Web corpus that has
➔ as large a size as possible
➔ as wide a range of domains as possible
➔ neither slang-like nor highly formal written style
➔ many statistics that can be obtained within a sentence
Proportion of deleted lines in the Web texts

Cause of deletion            Lines   Proportion (%)
pages                        14878   5.7
Character proportions         4937   1.9
Word smileys                  3003   1.1
Web-oriented expressions      2582   0.9
Over-spoken style             2214   0.8
Smileys                       1736   0.7
A sentence can include:
• Numbers: less than 40%
• Alphabets: less than 40%
• Generic symbols ( 。.、,!? ): less than 30%
• Other symbols: less than 20%
Deleted examples:
★★★★★腰痛こんにゃくゼリー。
720x486 , 720x480/59.94i をサポートしています。
Anthropology resources on the Internet から.
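The character-proportion thresholds above can be sketched as a simple line filter. This is an illustration under stated assumptions, not the authors' implementation: the Unicode ranges used for Japanese characters, the treatment of full-width letters and symbols, the exclusion of whitespace from the counts, and the names `classify` and `keep_line` are all hypothetical.

```python
# Upper limits from the slide: a line is kept only if every share is below its limit.
LIMITS = {
    "number": 0.40,
    "alphabet": 0.40,
    "generic_symbol": 0.30,
    "other_symbol": 0.20,
}

# Assumption: the generic symbols include both ASCII and full-width forms.
GENERIC_SYMBOLS = set("。.、,!?" + ".,!?")


def classify(ch):
    """Assign one character to a coarse character class."""
    code = ord(ch)
    # Assumption: Hiragana/Katakana (U+3040-30FF) and common Kanji (U+4E00-9FFF).
    if 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
        return "japanese"
    if ch in GENERIC_SYMBOLS:
        return "generic_symbol"
    if ch.isdigit():
        return "number"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    # Full-width Latin letters.
    if 0xFF21 <= code <= 0xFF3A or 0xFF41 <= code <= 0xFF5A:
        return "alphabet"
    return "other_symbol"


def keep_line(line):
    """Return True if the line stays below every character-proportion limit."""
    chars = [ch for ch in line if not ch.isspace()]  # assumption: ignore whitespace
    if not chars:
        return False
    counts = {}
    for ch in chars:
        kind = classify(ch)
        counts[kind] = counts.get(kind, 0) + 1
    return all(counts.get(kind, 0) / len(chars) < limit
               for kind, limit in LIMITS.items())
```

Under these assumptions the three deleted examples are rejected: the first for its share of other symbols (★), the second for its share of digits, and the third for its share of alphabetic characters.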
tags. If we simply delete the tags, a sentence may be divided at the wrong position. We determine the position of the end of a sentence using characteristics of the HTML tags.
Rules:
• A period ( 。 ) + <BR>
• A period ( 。 ) + </**> (any end tag)
• <li>
• An "end character" + <br>
End characters: ) > ? ! )>?!♪
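The tag-based rules above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_sentences`, the regular expression used to find tags, and the decision to treat both ASCII and full-width forms as end characters are assumptions. HTML entities are not decoded, and tags that trigger no rule are dropped without splitting.

```python
import re

PERIOD = "。"
# Assumption: ASCII and full-width forms of the end characters, plus the music note.
END_CHARS = tuple(")>?!" + ")>?!" + "♪")
TAG_RE = re.compile(r"(<[^>]+>)")


def split_sentences(html):
    """Split HTML text into sentences using the tag-based boundary rules."""
    sentences = []
    buffer = ""

    def flush():
        nonlocal buffer
        if buffer.strip():
            sentences.append(buffer.strip())
        buffer = ""

    # The capturing group makes split() return text and tags alternately.
    for token in TAG_RE.split(html):
        if not TAG_RE.fullmatch(token):
            buffer += token  # plain text: accumulate it
            continue
        tag = token.lower()
        tail = buffer.rstrip()
        is_br = re.match(r"<br\b", tag) is not None
        is_li = re.match(r"<li\b", tag) is not None
        is_end_tag = tag.startswith("</")
        if is_li:
            flush()  # rule: <li>
        elif tail.endswith(PERIOD) and (is_br or is_end_tag):
            flush()  # rules: period + <br>, period + any end tag
        elif is_br and tail.endswith(END_CHARS):
            flush()  # rule: end character + <br>
        # Any other tag is removed without creating a boundary.
    flush()
    return sentences
```

With this sketch, inline markup such as `<b>…</b>` inside a sentence is removed without splitting it, while `。<br>` and `<li>` produce boundaries.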