Extracting Troubles from Daily Reports based on Syntactic Pieces

Yoshifumi Kakimoto and Kazuhide Yamamoto. Extracting Troubles from Daily Reports based on Syntactic Pieces. Proceedings of the Annual meetings of the Pacific Asia Conference on Language, Information and Computation (PACLIC 22), pp.411-417 (2008.11)



November 30, 2008


  1. Extracting Troubles from Daily Reports based on Syntactic Pieces

    Yoshifumi Kakimoto, Kazuhide Yamamoto
    Nagaoka University of Technology, 1603-1, Kamitomioka, Nagaoka, Niigata 940-2188 Japan
    {kakimoto, ykaz}@nlp.nagaokaut.ac.jp
  2. Introduction

    In companies:
    - Employees are asked to write daily reports as text data.
    - People browse the reports to cope with troubles, which is expensive.
    We extracted troubles automatically.
  3. Definition of the Trouble

    - Content regarding some problems in daily reports.
    - Troubles must take into account the context of the problem.
    - We dealt with syntactic pieces [Aoki et al. 07].
    Examples of troubles:
      サーバーが → 壊れる (the server breaks)
      遅延が → 発生する (the delay occurs)
  4. System Overview

    [Diagram]
    Training data: trouble reports and no-trouble reports, taken from Livedoor Blog and kakaku.com review boards.
    (A) Extract syntactic pieces and calculate scores -> trouble dictionary.
    (B) Expand the dictionary using a Web corpus.
    New input report -> pieces of input report -> matching of pieces against the dictionary -> trouble information in the new input report.
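The matching step in the overview can be pictured with a minimal sketch. It assumes (this is not stated on the slide) that the dictionary maps a piece, written as a (modifier, modificand) pair, to its score, and that a piece counts as a trouble when its score reaches the dictionary threshold of 0.780 reported in the evaluation. The function names and the example dictionary are hypothetical; the two scores are borrowed from the worked example on a later slide.

```python
# Hypothetical sketch of "matching of pieces": look up each piece of a new
# report in the trouble dictionary and keep those scoring at or above a threshold.

def extract_troubles(report_pieces, dictionary, threshold=0.780):
    """Return the pieces of a report that match dictionary entries above the threshold."""
    return [p for p in report_pieces if dictionary.get(p, -1.0) >= threshold]


def is_trouble_report(report_pieces, dictionary, threshold=0.780):
    """Assumed classification rule: a report is a trouble report if any piece matches."""
    return bool(extract_troubles(report_pieces, dictionary, threshold))


# Illustrative dictionary (English glosses stand in for the Japanese pieces).
example_dictionary = {
    ("delay", "occur"): 0.805,
    ("server", "break"): 0.669,
}

new_report = [("delay", "occur"), ("power", "turn on")]
print(extract_troubles(new_report, example_dictionary))
```

With this threshold only the frequent, high-scoring piece is extracted; lowering the threshold would trade precision for recall.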
  5. Construction of Trouble Dictionary (A)

    - Assign a score to pieces as troubles.
      - The score measures how far a piece deviates between trouble and no-trouble reports.
      - Range of values: -1 to +1.
    - Consider the reliability of scores using frequency.
      - Apply the confidence interval estimation method [Fujimura et al. 04] [Agresti et al. 98].
    - Pieces having positive scores are added to the trouble dictionary.
  6. Expansion of Trouble Dictionary (B)

    If we use only the trouble dictionary:
    - We cannot extract all troubles.
    Goal: tackle troubles not included in the training data.

    Dictionary piece: 動作が遅い (motion is slow)
    Searching of similar verbal nouns: 検索が (search), 表示が (display), 対応が (response), ...
    Expanded pieces: 検索が遅い (search is slow), 表示が遅い (display is slow), 対応が遅い (response is slow), ...
    -> Add to the dictionary built in (A).
  7. Evaluation: Two-Value Classifier

    Evaluation data (made by humans):
    - Trouble reports: 133
    - No-trouble reports: 133
    Results at the threshold giving the highest F-value (threshold: 0.780):
    - Precision: 0.724
    - Recall: 0.827
    - Highest F-value: 0.772
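As a quick check of the reported figures, the F-value can be recomputed from precision and recall. The sketch assumes the F-value is the balanced harmonic mean of the two, which is consistent with the numbers on the slide; the function name is illustrative.

```python
# Balanced F-value (harmonic mean of precision and recall).

def f_value(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the threshold 0.780.
print(round(f_value(0.724, 0.827), 3))  # 0.772
```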
  8. Evaluation: Extracted Troubles

    Bases of trouble information:
    (1) 画面が → 表示されない (does not appear on the screen)
        遅延が → 発生する (the delay occurs)
    (2) サポートに → 電話する (call for support)
        販売店に → 返品する (return goods to selling office)
    (3) コンセントを → 抜く (pull out a plug)
        電源を → 入れる (turn on power)

    Input: 266 reports
    Threshold of the dictionary: 0.780
    Number of extracted troubles: 407
    Precision when base (1) is correct: 0.30
    Precision when base (1) and base (2) are correct: 0.40
  9. Conclusion

    - We developed a system that extracts troubles from reports.
    - Our dictionary is constructed from training data using syntactic pieces.
    - The two-value classifier had an F-value of 0.772.
    - The extracted troubles had a precision of 0.400.
  10. Dictionary expansion example: 遅延が発生する (the delay occurs)

    Top modificands list: 起こる (happen), 集中する (frequent), 多い (is large), ...
    Pieces list of web corpus: エラーが発生する (the error occurs), 苦情が発生する (the complaint occurs), 問題が発生する (the problem occurs), ...
    Top modifiers list: エラーが (the error), 苦情が (the complaint), 問題が (the problem), ...
    -> Add to the dictionary
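One way to read the lists on this slide is as a co-occurrence search over the web corpus. The sketch below is an interpretation, not the authors' confirmed procedure: it assumes web-corpus pieces are (modifier, modificand) pairs, takes the modificands that most often follow the dictionary piece's modifier, and then ranks other modifiers by how many of those modificands they share. The function name, the `top_k` parameter, and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def expand_piece(piece, web_pieces, top_k=3):
    """Propose new (modifier, modificand) pieces by swapping in similar modifiers."""
    modifier, modificand = piece

    # Index the web corpus in both directions.
    by_modifier = defaultdict(Counter)    # modifier -> modificand counts
    by_modificand = defaultdict(Counter)  # modificand -> modifier counts
    for mod, head in web_pieces:
        by_modifier[mod][head] += 1
        by_modificand[head][mod] += 1

    # "Top modificands list": what the original modifier most often attaches to.
    top_heads = [h for h, _ in by_modifier[modifier].most_common(top_k)]

    # "Top modifiers list": other modifiers sharing the most of those modificands.
    votes = Counter()
    for head in top_heads:
        for mod in by_modificand[head]:
            if mod != modifier:
                votes[mod] += 1
    similar = [m for m, _ in votes.most_common(top_k)]

    return [(m, modificand) for m in similar]


# Toy web corpus with English glosses in place of the Japanese pieces.
web = [
    ("delay", "happen"), ("delay", "frequent"),
    ("error", "happen"), ("error", "frequent"),
    ("complaint", "happen"),
    ("weather", "nice"),
]
print(expand_piece(("delay", "occur"), web, top_k=2))
```

Modifiers that share no modificand with the original (here "weather") are never proposed, which keeps the expansion tied to observed usage.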
  11. Pieces, their frequencies in trouble and no-trouble reports, and scores

    Piece               Frequency in trouble reports   Frequency in no-trouble reports   Score
    the server breaks   100                            10                                0.819
    the delay occurs    10000                          1000                              0.819

    Both pieces get the same score, but we want to treat 'the delay occurs' as more reliable than the other one because it is far more frequent.

    Confidence interval:
      the server breaks: ±0.150
      the delay occurs:  ±0.014
    Final scores:
      the server breaks: 0.669
      the delay occurs:  0.805
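The effect of frequency on the final score can be illustrated with a small sketch. It is only an assumption about the method: it uses an Agresti-Coull style half-width for the share of trouble occurrences, a default z of 1.96, and treats the score as twice that share minus one, so the half-width is doubled before being subtracted. The slide does not give the confidence level or the exact formula, so the numbers produced here need not match the slide.

```python
import math


def interval_half_width(pos, neg, z=1.96):
    """Agresti-Coull style half-width for the share pos / (pos + neg)."""
    total = pos + neg
    n_adj = total + z * z
    p_adj = (pos + z * z / 2) / n_adj
    return z * math.sqrt(p_adj * (1 - p_adj) / n_adj)


def penalized_score(score, pos, neg, z=1.96):
    """Lower bound of the score; assumes score = 2 * share - 1, hence the factor 2."""
    return score - 2 * interval_half_width(pos, neg, z)


# Same raw score, different frequencies (values from the slide's example).
rare = penalized_score(0.819, 100, 10)
frequent = penalized_score(0.819, 10000, 1000)
print(rare < frequent)  # True
```

Subtracting the half-width means a piece seen only a few times needs a much more extreme raw score to reach the dictionary threshold.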
  12. Troubles obtained by expansion

    Base   Trouble                                     Expanded trouble
    (1)    悪い → サービス (bad service)                 悪い → イメージ (bad image)
           検索が → 出来ない (can't search)              表示が → 出来ない (can't display)
    (2)    サポートに → 連絡する (inform for support)    サポートに → 相談する (consult for support)
           エラーが → 出る (the error occurs)            苦情が → 出る (the complaint occurs)
    (3)    連絡を → くれる (receive the contact)         返事を → くれる (receive the reply)
  13. Expressions of scores

    w_i: a syntactic piece
    P(w_i): the frequency of trouble reports containing w_i
    N(w_i): the frequency of no-trouble reports containing w_i
    P_doc: the total number of trouble reports
    N_doc: the total number of no-trouble reports
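The score formula itself did not survive on this slide. As a hedged illustration only, the sketch below uses one common form built from the symbols defined above: the normalized difference between the relative frequencies in trouble and no-trouble reports. This form is an assumption, chosen because it stays within the stated range of -1 to +1; the function and parameter names are hypothetical.

```python
def raw_score(p_wi, n_wi, p_doc, n_doc):
    """Assumed score: normalized difference of relative report frequencies.

    p_wi, n_wi: trouble / no-trouble reports containing the piece w_i
    p_doc, n_doc: total trouble / no-trouble reports
    """
    pos_rate = p_wi / p_doc
    neg_rate = n_wi / n_doc
    if pos_rate + neg_rate == 0:
        return 0.0  # piece never observed
    return (pos_rate - neg_rate) / (pos_rate + neg_rate)


# A piece seen only in trouble reports gets +1, only in no-trouble reports -1.
print(raw_score(5, 0, 100, 100), raw_score(0, 5, 100, 100))
```

Dividing by the document totals keeps the score comparable when the two training sets differ in size.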
  14. Confidence interval estimation method

  15. Training Data

    Livedoor Blog
    - Trouble reports: tags or titles have the word 'trouble'.
    - No-trouble reports: tags and titles don't have the word 'trouble'.
    Kakaku.com review boards
    - Trouble reports: tags have 'bad'.
    - No-trouble reports: tags have neither 'bad' nor 'question'.
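The labeling rules above can be written down as a small sketch. It assumes each post is available as a list of tags plus a title, and it uses the English words from the slide where the real data would contain the corresponding Japanese terms. The function names are hypothetical, and returning None for kakaku.com posts tagged 'question' (neither class) is an interpretation of the rule.

```python
def label_livedoor(tags, title):
    """Trouble if any tag or the title contains the word 'trouble'."""
    texts = list(tags) + [title]
    if any("trouble" in text.lower() for text in texts):
        return "trouble"
    return "no-trouble"


def label_kakaku(tags):
    """Trouble if tagged 'bad'; no-trouble only if tagged neither 'bad' nor 'question'."""
    lowered = {tag.lower() for tag in tags}
    if "bad" in lowered:
        return "trouble"
    if "question" in lowered:
        return None  # excluded from both training sets (assumed)
    return "no-trouble"


print(label_livedoor(["pc"], "Network trouble again"))
print(label_kakaku(["question"]))
```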