Improving Quality of the Web Corpus

Youichi Sekiguchi and Kazuhide Yamamoto. Improving Quality of the Web Corpus. Proceedings of The First International Joint Conference on Natural Language Processing (IJCNLP-04), pp.201-206 (2004.3)

Transcript

  1. Improving Quality of
    Web Corpus
    Youichi SEKIGUCHI Kazuhide YAMAMOTO
    Department of Electrical Engineering
    Nagaoka University of Technology
    Nagaoka, Niigata

  2. What is a Web corpus?
    A corpus made from Web texts
    Has live language information
    Has more words than a newspaper corpus
    Has more case frames than a newspaper corpus
    A Web corpus has advantages in both quantity and quality

  3. Outline (process)
    To improve corpus quality:
    Take surface expressions into account
      sentence identification
      character-type proportion
      overlapped lines
    Take contents into account
      Web-oriented expressions

  4. Process (1/2)
    Character-type proportion
    Japanese text uses five character types
    (Kanji, Hiragana, Katakana, alphabet, numbers)
    Delete lines with an unnatural proportion of character types
    Over-spoken style
    Web texts contain expressions that are unsuitable for corpus construction
    Sentence identification
    Decide the position of the end of each sentence
    (because Web texts contain HTML tags)

  5. Process (2/2)
    Overlapped-line deletion
    Web texts sometimes contain duplicated lines.
    Duplicates distort the statistics.
    Web-oriented expression deletion
    Japanese Web texts sometimes include smileys and word-smileys
    e.g. :-) (*^o^*) (-_-) [smileys]
    ( 笑 ) "smiling", ( 泣 ) "crying" [word-smileys]
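
The two deletion steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the smiley list contains only the examples shown on the slide, the function names are hypothetical, and dropping the whole line (rather than only the smiley) is an assumption.

```python
import re

# Word-smileys from the slide: a kanji such as 笑 or 泣 inside parentheses.
# Both ASCII and full-width parentheses are accepted (an assumption).
WORD_SMILEY = re.compile(r"[((]\s*[笑泣]\s*[))]")

# Only the smileys shown on the slide; a real list would be much longer.
SMILEYS = (":-)", "(*^o^*)", "(-_-)")


def has_web_expression(line: str) -> bool:
    """Return True if the line contains a smiley or a word-smiley."""
    return bool(WORD_SMILEY.search(line)) or any(s in line for s in SMILEYS)


def clean_lines(lines):
    """Drop duplicated lines, then drop lines with Web-oriented expressions."""
    seen = set()
    kept = []
    for line in lines:
        key = line.strip()
        if not key or key in seen:
            continue  # empty line, or a line that was already seen
        seen.add(key)
        if has_web_expression(key):
            continue
        kept.append(key)
    return kept
```

Exact string matching keeps the duplicate check simple; near-duplicate lines would need a different comparison.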

  6. Outline (evaluate)
    Evaluate the Web corpus
      Comparison with the newspaper corpus
      Wordage
      Word bias
      Number of case frames
    Evaluate our procedure
      Comparison between the Web corpus and the Web text
      Wordage
      Number of case frames

  7. Result: comparison of wordage
    [Bar chart: known and unknown words for Web A and Web Ao; axis from 0 to 100000]
    Web A: Web corpus, 21 MB (completely processed)
    Web Ao: Web text, 21 MB (only HTML tags removed)
    Web A was confirmed to have more known words and fewer unknown words than Web Ao.

  8. Result: word distribution on thesaurus
    Comparison between the Web corpus and the newspaper corpus
    The Web corpus has the same word distribution as the newspaper corpus
    This confirms that the additional words are not peculiar to the Web

  9. Conclusion
    ➔ The Web corpus is a better resource than the newspaper corpus
    ● More wordage
    ● More case frames
    ● Not biased toward Web-peculiar words
    ➔ Our procedures contribute to improving quality
    (Deleting HTML tags alone is not effective)
    ● Unknown words decreased
    ● Known words increased

  10. Knowledge

    Which is more useful? (Web vs. newspaper)
    Comparing a 21 MB Web corpus with
    a newspaper corpus of the same size:
    Number of words: 24% increase
    Number of case frames: 3% increase

    Contributes to:
    fewer unknown words
    more case frames

  11. What is the quality of a corpus?
    Quality generally depends on the task.
    = No ideal text corpus satisfies every task.
    For example:
    Statistical word tagging: word N-gram statistics
    Word sense disambiguation: various word usages
    We aim to construct a Web corpus that has
    ➔ as large a size as possible
    ➔ as wide a range of domains as possible
    ➔ neither slang-like nor highly formal written style
    ➔ many statistics obtainable within a sentence

  12. Resource
    Web corpus
    URLs collected from a portal site
    [ http://www.webring.ne.jp ]
    Web texts: 3505 MB
    Web corpus: 223 MB
    The Web corpus used for comparison was built by random extraction
    to match the newspaper corpus in size.

  13. Deleted lines
    Element                     Number   Ratio [%]
    Overlapped lines             51691        19.7
    Overlapped pages             14878         5.7
    Character proportions         4937         1.9
    Word smileys                  3003         1.1
    Web-oriented expressions      2582         0.9
    Over-spoken style             2214         0.8
    Smileys                       1736         0.7
    Proportion of deleted lines in the Web texts

  14. Example of Web corpus
    ● アフターファイブの活動として継続していくためには,なぜこうした活動が必要になってくるかを,部会員の一人ひとりが納得できる活動にしていく必要がある.
    ● 以前,先行き不透明感が強い.
    ● 失った家庭を全国規模で見つめてきた人物は我々だけだし,外国にもいない.
    ● 参加型コンテンツ多数あり.
    ● Maxtor が開発した Ultra320 インタフェースは, MazAdept 機能を搭載することにより,閉ループ方式で信号品質を改善しています.
    ● 子ども達もチラシを見て,10名余り駆けつけてくれた.

  15. Example: character proportion

    We set standard proportions for Japanese text.
    A sentence may contain:
    numbers: less than 40%
    alphabetic characters: less than 40%
    generic symbols ( 。.、,!? ): less than 30%
    other symbols: less than 20%
    Examples of deleted lines:
     ★★★★★腰痛こんにゃくゼリー。
      720x486 , 720x480/59.94i をサポートしています。
      Authropology resource son the Internet から.
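
The thresholds above can be turned into a simple line filter. The sketch below is an illustration under stated assumptions, not the authors' code: whitespace is ignored when computing proportions, both half-width and full-width punctuation count as generic symbols, the Unicode ranges for kana and kanji are approximate, and the names `classify` and `keep_line` are hypothetical.

```python
from collections import Counter

# Generic symbols from the slide, in half-width and full-width forms (assumption).
GENERIC_SYMBOLS = set("。.、,!?.,!?")

# Upper limits from the slide: a line is kept only if every share is below its limit.
LIMITS = {
    "number": 0.40,
    "alphabet": 0.40,
    "generic_symbol": 0.30,
    "other_symbol": 0.20,
}


def classify(ch: str) -> str:
    """Assign one character to a coarse character type."""
    if ch.isdecimal():
        return "number"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        return "alphabet"  # full-width Latin letters
    if ch in GENERIC_SYMBOLS:
        return "generic_symbol"
    # Hiragana and Katakana blocks, then the main CJK ideograph block (approximate).
    if "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff":
        return "japanese"
    return "other_symbol"


def keep_line(line: str) -> bool:
    """Return True if every limited character type stays below its limit."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return False
    counts = Counter(classify(ch) for ch in chars)
    return all(counts[kind] / len(chars) < limit for kind, limit in LIMITS.items())
```

With these assumptions, the three deleted examples on the slide are rejected: the first for its share of other symbols, the second for numbers, the third for alphabetic characters.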

  16. Example: over-spoken style

    Web texts contain over-spoken style sentences,
    such as:
    [ English ]
    I am vvvveeeerrrry happppyyyyy!!!!!
    very Cooooooool !!!
    [ Japanese ]
    ん゛あーーーーーーーーーーーーーー。
    「もーーーーーやだーーーーーーーー!!」
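
The slides do not say how over-spoken sentences are detected. One plausible heuristic, shown here purely as an assumption, is to flag lines in which the same letter (including the Japanese long-vowel mark ー) repeats many times in a row. The threshold `MIN_RUN` and the function name are hypothetical.

```python
import re

# Hypothetical threshold: a letter repeated at least this many times in a row.
MIN_RUN = 4

# [^\W\d_] matches a letter-like character (digits and underscore excluded),
# so long digit runs such as "10000" are not flagged.
ELONGATION = re.compile(r"([^\W\d_])\1{%d,}" % (MIN_RUN - 1))


def is_over_spoken(line: str) -> bool:
    """Return True if the line contains an elongated run of one letter."""
    return ELONGATION.search(line) is not None
```

A repetition heuristic catches all four examples on the slide, but it would miss over-spoken sentences that do not rely on elongation.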

  17. Example: sentence identification

    Web pages are laid out with HTML tags.
    If we simply delete the tags,
    sentences get divided at the wrong positions.
    We determine the position of the end of a sentence
    using the characteristics of HTML tags.

    rules:
    A period ( 。 ) +
    A period ( 。 ) + **> (all of the end tags)
    "end character" +
    end characters: ) > ? ! )>?!♪
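
As an illustration of the idea, the sketch below implements only the one rule that is fully legible here: a period ( 。 ) followed by an end tag marks the end of a sentence. The other rules are not implemented because their tags are not shown. The function name and the choice to discard newlines from the HTML source are assumptions.

```python
import re

# A Japanese period directly followed by any end tag, e.g. "。</p>".
PERIOD_THEN_END_TAG = re.compile(r"。\s*</[^>]+>")
ANY_TAG = re.compile(r"<[^>]+>")


def split_sentences(html: str) -> list:
    """Split HTML text into sentences using the period-plus-end-tag rule."""
    # Newlines in HTML source are layout, not sentence boundaries (assumption).
    html = html.replace("\n", "")
    # Mark each sentence end with a newline before the tags are removed.
    marked = PERIOD_THEN_END_TAG.sub("。\n", html)
    text = ANY_TAG.sub("", marked)
    return [s.strip() for s in text.split("\n") if s.strip()]
```

For example, `split_sentences("<p>今日は<b>晴れ</b>です。</p><p>明日は雨です。</p>")` yields two sentences, and the inline `</b>` tag does not create a boundary because no period precedes it.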

  18. Result: number of case frames

    Comparing Web A with Web Ao
    [Bar chart: number of case frames for Web Ao and Web A]
    The number of case frames increased by 170 thousand
