Improving Quality of the Web Corpus

Youichi Sekiguchi and Kazuhide Yamamoto. Improving Quality of the Web Corpus. Proceedings of The First International Joint Conference on Natural Language Processing (IJCNLP-04), pp.201-206 (2004.3)


  1. Improving Quality of Web Corpus
    Youichi SEKIGUCHI, Kazuhide YAMAMOTO
    Department of Electrical Engineering, Nagaoka University of Technology, Nagaoka, Niigata
  2. What's a Web corpus?
    A Web corpus:
    • is made from Web texts
    • contains live language information
    • has more words than a newspaper corpus
    • has more case frames than a newspaper corpus
    A Web corpus has advantages in both quantity and quality.
  3. Outline (process)
    • Goal: improve corpus quality
    • Take surface expressions into account: sentence identification, character type proportion, overlapped lines
    • Take contents into account: Web-oriented expressions
  4. Process (1/2)
    • Character type proportion: Japanese text uses five character types (Kanji, Hiragana, Katakana, alphabet, numbers). Lines with an unnatural proportion of character types are deleted.
    • Over-spoken style: Web texts contain expressions that are unsuitable for corpus construction.
    • Sentence identification: decide where each sentence ends (needed because Web texts contain HTML tags).
  5. Process (2/2)
    • Overlapped-line deletion: Web texts sometimes contain duplicated lines, which distort the statistics.
    • Web-oriented expression deletion: Japanese Web texts sometimes include smilies and word-smilies.
      Smilies: :-) (*^o^*) (-_-)
      Word-smilies: ( 笑 ) "smiling", ( 泣 ) "crying"
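
The two deletions on this slide can be sketched as a small line filter. This is only an illustration: the regular expressions cover just the examples shown on the slide, the function names `has_web_expression` and `clean_lines` are hypothetical, and deleting the whole line (rather than only the smiley) is an assumption based on the slides counting deleted lines.

```python
import re

# Hypothetical patterns built only from the examples on the slide;
# the authors' actual smiley inventory is not given.
SMILEY_RE = re.compile(r":-\)|\(\*\^o\^\*\)|\(-_-\)")
WORD_SMILEY_RE = re.compile(r"[(（]\s*[笑泣]\s*[)）]")


def has_web_expression(line: str) -> bool:
    """Return True if the line contains a smiley or a word-smiley."""
    return bool(SMILEY_RE.search(line) or WORD_SMILEY_RE.search(line))


def clean_lines(lines):
    """Drop duplicated lines and lines containing Web-oriented expressions."""
    seen = set()
    kept = []
    for line in lines:
        key = line.strip()
        if not key or key in seen:
            continue  # empty, or an overlapped (already seen) line
        seen.add(key)
        if has_web_expression(key):
            continue  # assumption: the whole line is deleted
        kept.append(key)
    return kept
```

Keeping the set of seen lines in memory is the simplest choice; for a corpus of several gigabytes one would more likely store hashes of the lines instead.
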
  6. Outline (evaluate)
    • Evaluate our procedure: compare the Web corpus with the Web text (wordage, number of case frames)
    • Evaluate the Web corpus: compare it with a newspaper corpus (wordage, word bias, number of case frames)
  7. Result: comparison of wordage
    [Chart: numbers of known and unknown words in Web A and Web Ao, axis from 0 to 100000]
    • Web A: Web corpus, 21 MB (fully processed)
    • Web Ao: Web text, 21 MB (only HTML tags removed)
    Web A was confirmed to have more known words and fewer unknown words than Web Ao.
  8. Result: word distribution on a thesaurus
    Comparison between the Web corpus and the newspaper corpus:
    • The Web corpus has the same word distribution as the newspaper corpus.
    • This confirms that the additional words are not peculiar to the Web.
  9. Conclusion
    ➔ The Web corpus is a better resource than the newspaper corpus
    • More wordage
    • More case frames
    • Not biased toward Web-peculiar words
    ➔ Our procedures contribute to improving quality (deleting HTML tags alone is not effective)
    • Unknown words decreased
    • Known words increased
  10. Knowledge
    • Which is more useful, Web or newspaper? Comparing a 21 MB Web corpus with a newspaper corpus of the same size:
      Number of words: 24% increase
      Number of case frames: 3% increase
    • Our procedure contributes to reducing unknown words and obtaining more case frames.
  11. What is the quality of a corpus?
    Quality generally depends on the task: no ideal text corpus satisfies every task. For example:
    • Statistical word tagging: word N-gram statistics
    • Word sense disambiguation: various word usages
    We aim to construct a Web corpus that:
    ➔ is as large as possible
    ➔ covers as wide a range of domains as possible
    ➔ contains neither slang-like nor highly formal written style
    ➔ yields many statistics obtained within a sentence
  12. Resource
    • URLs for the Web corpus were obtained from a portal site [ ]
    • Web texts: 3505 MB
    • Web corpus: 223 MB
    The Web corpus was constructed by random extraction to match the newspaper corpus in size.
  13. Deleted lines
    Proportion of deleted lines in the Web texts:
    Element                    Number  Ratio [%]
    Overlapped lines            51691       19.7
    Overlapped pages            14878        5.7
    Character proportions        4937        1.9
    Word smilies                 3003        1.1
    Web-oriented expressions     2582        0.9
    Over-spoken style            2214        0.8
    Smilies                      1736        0.7
  14. Example of Web corpus
    • アフターファイブの活動として継続していくためには,なぜこうした活動が必要になってくるかを,部会員の一人ひとりが納得できる活動にしていく必要がある.
    • 以前,先行き不透明感が強い.
    • 失った家庭を全国規模で見つめてきた人物は我々だけだし,外国にもいない.
    • 参加型コンテンツ多数あり.
    • Maxtor が開発した Ultra320 インタフェースは, MazAdept 機能を搭載することにより,閉ループ方式で信号品質を改善しています.
    • 子ども達もチラシを見て,10名余り駆けつけてくれた.
  15. Example: character proportion
    • We set our standard proportions for Japanese. A sentence can include:
      numbers: less than 40%
      alphabetic characters: less than 40%
      generic symbols ( 。.、,!? ): less than 30%
      other symbols: less than 20%
    • Deleted examples:
      ★★★★★腰痛こんにゃくゼリー。
      720x486 , 720x480/59.94i をサポートしています。
      Authropology resource son the Internet から.
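
A minimal sketch of how such a threshold check might look, using the four limits from the slide. The character classification is an assumption: the Unicode ranges for Hiragana, Katakana, and Kanji, the full-width symbol variants, ignoring whitespace when counting, and the names `char_class` and `has_natural_proportion` are illustrative and not taken from the slides.

```python
from collections import Counter

# Upper limits from the slide ("less than"), as fractions of the sentence.
LIMITS = {
    "number": 0.40,
    "alphabet": 0.40,
    "generic_symbol": 0.30,
    "other_symbol": 0.20,
}

# Generic symbols from the slide; the full-width variants are an assumption.
GENERIC_SYMBOLS = set("。.、,!?．，！？")


def char_class(ch: str) -> str:
    """Classify one character (approximate, assumed Unicode ranges)."""
    if ch.isdigit():
        return "number"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if "Ａ" <= ch <= "Ｚ" or "ａ" <= ch <= "ｚ":
        return "alphabet"  # full-width Latin letters
    if ch in GENERIC_SYMBOLS:
        return "generic_symbol"
    if "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff" or ch == "々":
        return "japanese"  # Hiragana, Katakana, Kanji
    return "other_symbol"


def has_natural_proportion(sentence: str) -> bool:
    """Return True if every limited class stays below its limit."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return False
    counts = Counter(char_class(c) for c in chars)
    return all(counts[name] / len(chars) < limit
               for name, limit in LIMITS.items())
```

Under these assumptions the three deleted examples on the slide are rejected: the first exceeds the limit for other symbols, the second the limit for numbers, and the third the limit for alphabetic characters.
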
  16. Example: over-spoken style
    • Web texts contain over-spoken style sentences such as:
      [English] I am vvvveeeerrrry happppyyyyy!!!!! / very Cooooooool !!!
      [Japanese] ん゛あーーーーーーーーーーーーーー。 / 「もーーーーーやだーーーーーーーー!!」
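
The slides do not state how over-spoken sentences are detected. One plausible heuristic, shown here purely as an assumption, is to flag any sentence in which the same non-digit character repeats four or more times in a row. Both the threshold and the name `is_over_spoken` are hypothetical.

```python
import re

# Assumed rule: the same character (not a digit, not whitespace)
# repeated 4 or more times in a row marks an over-spoken sentence.
REPEAT_RE = re.compile(r"([^\d\s])\1{3,}")


def is_over_spoken(sentence: str) -> bool:
    """Return True if the sentence contains a long run of one character."""
    return REPEAT_RE.search(sentence) is not None
```

Digits are excluded so that ordinary figures such as 10000 are not flagged. The threshold of four is a guess and would need tuning against real Web text.
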
  17. Example: sentence identification
    • Web pages are laid out with HTML tags. If we simply delete the tags, sentences end up divided at the wrong places, so we determine the end of each sentence using characteristics of the HTML tags.
    Rules:
    • A period ( 。 ) + <BR>
    • A period ( 。 ) + </**> (any end tag)
    • <li>
    • An "end character" + <br>
    End characters: ) > ? ! )>?!♪
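
The four rules can be sketched as a single pass over the tags and text of a page. This is an illustrative reading of the slide, not the authors' implementation: the full-width end characters, the handling of whitespace and character entities, and the name `split_sentences` are assumptions.

```python
import html
import re

PERIOD = "。"
# End characters from the slide; the full-width forms are an assumption.
END_CHARS = ")>?!）＞？！♪"

TAG_RE = re.compile(r"(<[^>]+>)")
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def split_sentences(markup: str):
    """Split HTML into sentences using the tag-based rules on the slide."""
    sentences = []
    buf = ""

    def flush():
        nonlocal buf
        if buf.strip():
            sentences.append(buf.strip())
        buf = ""

    # re.split with one capture group alternates text and tag tokens.
    for i, token in enumerate(TAG_RE.split(markup)):
        if i % 2 == 0:  # text between tags
            buf += html.unescape(token.replace("\n", ""))
            continue
        m = TAG_NAME_RE.match(token)
        if m is None:
            continue  # comments, doctype, etc. carry no boundary
        is_end, name = m.group(1) == "/", m.group(2).lower()
        tail = buf.rstrip()[-1:]  # last visible character so far
        if name == "li" and not is_end:
            flush()  # rule: <li>
        elif name == "br" and tail and (tail == PERIOD or tail in END_CHARS):
            flush()  # rules: period + <BR>, end character + <br>
        elif is_end and tail == PERIOD:
            flush()  # rule: period + any end tag
    flush()
    return sentences
```

A `<br>` that follows an ordinary character is treated as a layout line break and does not end the sentence, so `split_sentences("長い文が<br>途中で折り返される。")` returns the sentence in one piece.
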
  18. Result: number of case frames
    • Comparing Web A with Web Ao: Web A has 170 thousand more case frames.