Youichi Sekiguchi and Kazuhide Yamamoto. Improving Quality of the Web Corpus. Proceedings of The First International Joint Conference on Natural Language Processing (IJCNLP-04), pp.201-206 (2004.3)
Improving Quality of Web Corpus Youichi SEKIGUCHI Kazuhide YAMAMOTO Department of Electrical Engineering Nagaoka University of Technology Nagaoka, Niigata
What is a Web corpus?
● Made from Web texts
● Contains live language information
● Has more words than a Newspaper corpus
● Has more case frames than a Newspaper corpus
A Web corpus has advantages in both quantity and quality.
Outline (process)
● Goal: improve corpus quality
● Take account of surface expressions: sentence identification, character type proportion, overlapped lines
● Take account of contents: Web-oriented expressions
Process (1/2)
● Character type proportion: Japanese text uses five character types (kanji, hiragana, katakana, alphabet, numbers). Lines with an unnatural proportion of character types are deleted.
● Over-spoken style: Web texts contain expressions that are of no use for corpus construction.
● Sentence identification: decide the position of the end of each sentence (needed because Web texts contain HTML tags).
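
The sentence identification step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: it assumes that block-level tags (the hypothetical `BLOCK_TAGS` set) mark a sentence end, that inline tags such as `<b>` are dropped without splitting, and that the Japanese period `。` also ends a sentence. Script and style contents are not handled.

```python
from html.parser import HTMLParser

# Assumption: these tags act as sentence boundaries; the list is illustrative.
BLOCK_TAGS = {"p", "br", "div", "li", "td", "tr", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class SentenceSplitter(HTMLParser):
    """Collects sentences, using block-level tags and '。' as boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sentences = []
        self._buf = []

    def flush(self):
        # Emit the buffered text as one sentence, if it is not empty.
        text = "".join(self._buf).strip()
        if text:
            self.sentences.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        for ch in data:
            self._buf.append(ch)
            if ch == "。":
                self.flush()


def split_sentences(html: str) -> list:
    parser = SentenceSplitter()
    parser.feed(html)
    parser.close()   # process any text still pending in the parser
    parser.flush()   # emit trailing text that had no closing boundary
    return parser.sentences


page = "<h1>お知らせ</h1><p>今日は<b>良い</b>天気です。明日は雨です。</p>"
sentences = split_sentences(page)
```

With plain tag deletion the heading and the first sentence would run together; here the heading becomes its own unit, while the inline `<b>` tag does not split the sentence.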
Process (2/2)
● Overlapped-line deletion: Web texts sometimes contain duplicated lines, which distort the statistics.
● Web-oriented expression deletion: Japanese Web text sometimes includes smileys and word-smileys.
ex. [smileys] :-) (*^o^*) (-_-)
ex. [word-smileys] ( 笑 ) (smiling), ( 泣 ) (crying)
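
A minimal sketch of these two steps, under assumptions that are not in the slides: duplicated lines are compared after trimming whitespace and the first copy is kept, and the regular expressions below are hypothetical patterns that cover only the example smileys and word-smileys shown above.

```python
import re

# Hypothetical patterns built from the slide's examples only.
WESTERN_SMILEY = re.compile(r"[:;]-[()DP]")                 # :-)  ;-)
KAOMOJI = re.compile(r"[((][\^_\-*;oO.T]{2,7}[))]")       # (*^o^*)  (-_-)
WORD_SMILEY = re.compile(r"[((]\s*[笑泣]\s*[))]")         # ( 笑 )  ( 泣 )


def drop_overlapped_lines(lines):
    """Keep the first occurrence of each line (assumption: exact match
    after stripping surrounding whitespace)."""
    seen = set()
    kept = []
    for line in lines:
        key = line.strip()
        if key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept


def has_web_expression(line: str) -> bool:
    """True if the line contains a smiley or a word-smiley."""
    patterns = (WESTERN_SMILEY, KAOMOJI, WORD_SMILEY)
    return any(p.search(line) for p in patterns)


lines = ["今日は晴れです。", "楽しかった (*^o^*)", "今日は晴れです。 "]
cleaned = [l for l in drop_overlapped_lines(lines) if not has_web_expression(l)]
```

Whether a matching line is deleted entirely or only the expression is stripped is a design choice; the sketch deletes the whole line.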
Outline (evaluate)
● Evaluate our procedure: comparison between the Web corpus and the Web text (wordage, number of case frames)
● Evaluate the Web corpus: comparison with the Newspaper corpus (wordage, word bias, number of case frames)
Result: comparison of wordage
Web A: Web corpus, 21 MB (completely processed)
Web Ao: Web text, 21 MB (only HTML tags removed)
[Bar chart: number of known and unknown words in Web A and Web Ao, axis from 0 to 100000]
Web A was confirmed to have more known words and fewer unknown words than Web Ao.
Result: word distribution on thesaurus
Comparison between the Web corpus and the Newspaper corpus
● The Web corpus has the same word distribution as the Newspaper corpus.
● This confirms that the additional words are not peculiar to the Web.
Conclusion
➔ The Web corpus is a better resource than the Newspaper corpus
● More wordage
● More case frames
● Not biased toward Web-peculiar words
➔ Our procedures contribute to improving quality (deleting HTML tags alone is not effective)
● Unknown words decreased
● Known words increased
Knowledge
● Which is more useful? (Web vs. Newspaper)
Comparing a 21 MB Web corpus with a Newspaper corpus of the same size: number of words increased by 24%, number of case frames increased by 3%.
● Our procedure contributes to reducing unknown words and obtaining more case frames.
What is the quality of a corpus?
● Quality generally depends on the task, i.e. there is no ideal text corpus that satisfies every task.
● For example: statistical word tagging (word N-gram statistics), word sense disambiguation (various word usages)
● We aim to construct a Web corpus that:
➔ is as large as possible
➔ covers as wide a range of domains as possible
➔ excludes both slang-like and highly formal written styles
➔ provides many statistics obtained within a sentence
Resource
● URLs for the Web corpus were obtained from the portal site [ http://www.webring.ne.jp ]
● Web texts: 3505 [MB]
● Web corpus: 223 [MB]
● The Web corpus was constructed by extracting at random to equate it with the Newspaper corpus.
Deleted lines (proportion of deleted lines in the Web texts)
Element                    Numbers  Ratio[%]
Overlapped lines             51691      19.7
Overlapped pages             14878       5.7
Character proportions         4937       1.9
Word smileys                  3003       1.1
Web-oriented expressions      2582       0.9
Over-spoken style             2214       0.8
Smileys                       1736       0.7
Example: character proportion
● We set up our standard proportions for Japanese. A sentence can include:
numbers: less than 40%
alphabets: less than 40%
generic symbols ( 。.、,!? ): less than 30%
other symbols: less than 20%
● Deleted examples:
★★★★★腰痛こんにゃくゼリー。
720x486 , 720x480/59.94i をサポートしています。
Authropology resource son the Internet から.
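
The thresholds on this slide can be turned into a small filter. The sketch below is an assumption-laden illustration: the character classes are approximated with Unicode properties, whitespace is ignored, and both full-width and ASCII punctuation are counted as generic symbols. The names `LIMITS`, `char_class` and `keep_line` are hypothetical.

```python
import unicodedata
from collections import Counter

# Limits from the slide: a line is kept only if every share is below its limit.
LIMITS = {"number": 0.40, "alphabet": 0.40,
          "generic_symbol": 0.30, "other_symbol": 0.20}

# Assumption: full-width and ASCII forms both count as generic symbols.
GENERIC_SYMBOLS = set("。.、,!?.,!?")


def char_class(ch: str) -> str:
    if ch in GENERIC_SYMBOLS:
        return "generic_symbol"
    if ch.isdigit():
        return "number"
    if unicodedata.name(ch, "").startswith(("LATIN", "FULLWIDTH LATIN")):
        return "alphabet"
    if ch.isalnum():
        return "japanese"      # kanji, hiragana, katakana, other letters
    return "other_symbol"


def proportions(line: str) -> dict:
    """Share of each limited class among the non-whitespace characters."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return {}
    counts = Counter(char_class(ch) for ch in chars)
    return {name: counts[name] / len(chars) for name in LIMITS}


def keep_line(line: str) -> bool:
    shares = proportions(line)
    return bool(shares) and all(shares[n] < LIMITS[n] for n in LIMITS)


examples = [
    "★★★★★腰痛こんにゃくゼリー。",
    "720x486 , 720x480/59.94i をサポートしています。",
    "Authropology resource son the Internet から.",
    "今日は良い天気です。",
]
kept = [line for line in examples if keep_line(line)]
```

Under these assumptions the three deleted examples from the slide are rejected (too many other symbols, numbers, and alphabetic characters respectively), and only the ordinary sentence is kept.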
Example: over-spoken style
● Web texts contain over-spoken style sentences such as:
[ English ] I am vvvveeeerrrry happppyyyyy!!!!! very Cooooooool !!!
[ Japanese ] ん゛あーーーーーーーーーーーーーー。 「もーーーーーやだーーーーーーーー!!」
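
One simple way to flag such sentences is to look for a long run of the same character. This is a sketch, not the rule used by the authors: the run length of four and the exclusion of digits (so that numbers such as 10000 are not flagged) are assumptions.

```python
import re

# Assumption: four or more repeats of one non-digit, non-space character
# mark a sentence as over-spoken.
REPEATED_CHAR = re.compile(r"([^\d\s])\1{3,}")


def is_over_spoken(sentence: str) -> bool:
    return REPEATED_CHAR.search(sentence) is not None


samples = [
    "I am vvvveeeerrrry happppyyyyy!!!!!",
    "very Cooooooool !!!",
    "「もーーーーーやだーーーーーーーー!!」",
    "I am very happy.",
    "Price: 10000 yen",
]
flagged = [s for s in samples if is_over_spoken(s)]
```

The first three samples are flagged and the last two are not. A stricter variant could also normalise the repeated run instead of deleting the line.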
Example: sentence identification
● Web pages are laid out with HTML tags. If we simply delete the tags, it becomes unclear where sentences are divided. We determine the position of the end of a sentence using the characteristics of HTML tags.
● A period ( 。 ) +