Extracting Troubles from Daily Reports based on Syntactic Pieces

Yoshifumi Kakimoto, Kazuhide Yamamoto
Nagaoka University of Technology, 1603-1 Kamitomioka, Nagaoka, Niigata 940-2188, Japan
{kakimoto, ykaz}@nlp.nagaokaut.ac.jp

2 Introduction
Companies ask employees to submit daily reports as text data.
People browse these reports in order to cope with troubles.
Browsing every report by hand is expensive.
We therefore extract troubles automatically.

3 Definition of the Trouble
A trouble is content in a daily report that concerns some problem.
A trouble must take into account the context of the problem.
We use syntactic pieces [Aoki et al. 07] as the unit of a trouble.
Examples of troubles:
  the server breaks
  the delay occurs

4 System Overview
Training data: Livedoor Blog and kakaku.com review boards, divided into trouble reports and no-trouble reports.
(A) Extract syntactic pieces from the training data and calculate their scores to build the trouble dictionary.
(B) Expand the dictionary using a Web corpus.
New input report: extract the pieces of the input report, match them against the trouble dictionary, and output the trouble information found in the report.
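The matching step for a new input report can be sketched as follows. This is a minimal illustration, not the authors' implementation: representing a piece as a (modifier, modificand) tuple, storing the dictionary as a mapping from piece to score, the name `extract_troubles`, and the use of `>=` for the threshold test are all assumptions. The default threshold is the 0.780 reported in the evaluation, and the toy scores are the final scores from the worked example on slide 11.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of a syntactic piece: (modifier, modificand).
Piece = Tuple[str, str]


def extract_troubles(
    report_pieces: List[Piece],
    trouble_dictionary: Dict[Piece, float],
    threshold: float = 0.780,
) -> List[Piece]:
    """Return the pieces of a report whose dictionary score reaches the threshold."""
    return [
        piece
        for piece in report_pieces
        # Pieces missing from the dictionary never match.
        if trouble_dictionary.get(piece, float("-inf")) >= threshold
    ]


# Toy dictionary and report, for illustration only.
dictionary = {
    ("the delay", "occurs"): 0.805,
    ("the server", "breaks"): 0.669,
}
report = [
    ("the delay", "occurs"),
    ("the server", "breaks"),
    ("the power", "turn on"),
]
troubles = extract_troubles(report, dictionary)
print(troubles)
```

With these toy values only the first piece is returned, because 0.669 falls below the threshold and the third piece is not in the dictionary.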

5 Construction of Trouble Dictionary (A)
Assign each piece a score that expresses how strongly it indicates a trouble.
The score measures the deviation of the piece between trouble reports and no-trouble reports.
  Range of values: -1 to +1
The reliability of a score is considered together with the frequency of the piece.
  We apply the confidence interval estimation method [Fujimura et al. 04] [Agresti et al. 98].
Pieces with positive scores are added to the trouble dictionary.

6 Expansion of Trouble Dictionary (B)
If we use only the trouble dictionary (A), we cannot extract all troubles.
The expansion tackles troubles that are not included in the training data.
Example, starting from the dictionary piece "motion is slow":
  Search for similar verbal nouns: search, display, response, ...
  Add to the dictionary: "search is slow", "display is slow", "response is slow", ...

7 Evaluation: Two-Values Classifier
Evaluation data (made by humans):
  Trouble reports: 133
  No-trouble reports: 133
Result at the highest F-value:
  Threshold: 0.780
  Precision: 0.724
  Recall: 0.827
  F-value: 0.772
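The reported F-value can be checked from the precision and recall. The sketch below assumes the F-value is the balanced F1 measure (the harmonic mean of precision and recall); the slide does not name the variant, but the numbers are consistent with it. The function name `f_value` is illustrative.

```python
def f_value(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f = f_value(0.724, 0.827)
print(round(f, 3))  # 0.772
```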

8 Evaluation: Extracted Troubles
Input: 266 reports
Threshold of the dictionary: 0.780
Number of extracted troubles: 407
Base  Examples of trouble information
(1)   does not appear on the screen; the delay occurs
(2)   call support; return goods to the store
(3)   pull out a plug; turn on the power
Precision:
  Counting base (1) as correct: 0.30
  Counting base (1) and base (2) as correct: 0.40

9 Conclusion
We developed a system that extracts troubles from reports.
Our dictionary is constructed from the syntactic pieces found in the training data.
The two-values classifier had an F-value of 0.772.
The extracted troubles had a precision of 0.400.

10 Expanding the piece "the delay occurs"
Dictionary piece: the delay occurs
Top modificands list: happen, frequent, is large, ...
Pieces list of Web corpus: the error occurs, the complaint occurs, the problem occurs, ...
Top modifiers list: the error, the complaint, the problem, ...
The resulting pieces are added to the dictionary.
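One way to read this expansion step is sketched below. The slide shows only the intermediate lists, so the similarity criterion here is an assumption rather than the authors' method: a modifier counts as similar when it shares at least `min_shared` of the original modifier's top modificands in the Web corpus, and the new piece must itself be attested in that corpus. The names `expand_piece`, `top_n`, and `min_shared`, and the toy corpus, are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical representation of a syntactic piece: (modifier, modificand).
Piece = Tuple[str, str]


def expand_piece(
    piece: Piece,
    web_pieces: Iterable[Piece],
    top_n: int = 3,
    min_shared: int = 2,
) -> List[Piece]:
    """Propose new pieces by swapping in modifiers that behave like the original one."""
    modifier, modificand = piece

    # Count, for every modifier in the Web corpus, which modificands it attaches to.
    by_modifier: Dict[str, Counter] = defaultdict(Counter)
    for mod, head in web_pieces:
        by_modifier[mod][head] += 1

    # Top modificands of the original modifier.
    top_heads = {head for head, _ in by_modifier[modifier].most_common(top_n)}

    expanded = []
    for other, heads in by_modifier.items():
        if other == modifier or modificand not in heads:
            continue  # the new piece must occur in the Web corpus
        if len(top_heads & set(heads)) >= min_shared:
            expanded.append((other, modificand))
    return sorted(expanded)


# Toy Web corpus, for illustration only.
web_corpus = [
    ("the delay", "occurs"), ("the delay", "happens"), ("the delay", "is frequent"),
    ("the error", "occurs"), ("the error", "happens"),
    ("the complaint", "occurs"), ("the complaint", "is frequent"),
    ("the party", "occurs"),
]
new_pieces = expand_piece(("the delay", "occurs"), web_corpus)
print(new_pieces)
```

In the toy corpus, "the party" is rejected because it shares only one modificand with "the delay", while "the error" and "the complaint" each share two.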

11 Scores and frequency
Piece               Frequency in trouble reports   Frequency in no-trouble reports   Score
the server breaks   100                            10                                0.819
the delay occurs    10000                          1000                              0.819
Both pieces receive the same score, but we want to treat "the delay occurs" as more reliable than the other piece, because it is observed far more often.
Confidence interval:
  the server breaks: ±0.150
  the delay occurs: ±0.014
Final scores (score minus the interval width):
  the server breaks: 0.669
  the delay occurs: 0.805
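The final scores follow from the raw scores and the interval values by simple subtraction. The sketch below only reproduces that arithmetic with the numbers on the slide; reading the final score as "score minus interval width" is inferred from those numbers, and the function name `final_score` is illustrative.

```python
def final_score(score: float, interval_width: float) -> float:
    """Penalize a score by the width of its confidence interval."""
    return score - interval_width


# (raw score, interval width) as given on the slide.
examples = {
    "the server breaks": (0.819, 0.150),
    "the delay occurs": (0.819, 0.014),
}
final_scores = {
    piece: round(final_score(score, width), 3)
    for piece, (score, width) in examples.items()
}
print(final_scores)  # {'the server breaks': 0.669, 'the delay occurs': 0.805}
```

The frequent piece keeps almost all of its score, while the rare piece is pushed down, which is the intended effect of considering frequency.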

12 Troubles obtained by expansion
Base  Trouble               Expanded trouble
(1)   bad service           bad image
(1)   can't search          can't display
(2)   contact support       consult support
(2)   the error occurs      the complaint occurs
(3)   receive the contact   receive the reply

13 Expressions of scores
wi: a syntactic piece
P(wi): the frequency of trouble reports containing wi
N(wi): the frequency of no-trouble reports containing wi
Pdoc: the total number of trouble reports
Ndoc: the total number of no-trouble reports
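The score formula itself did not survive in this transcript. The sketch below shows one form that is consistent with the symbols defined here and with the stated range of -1 to +1: the normalized difference between the piece's rate in trouble reports and its rate in no-trouble reports. It is an assumption, not a reconstruction of the authors' exact expression, and the name `piece_score` is illustrative.

```python
def piece_score(p_w: int, n_w: int, p_doc: int, n_doc: int) -> float:
    """Assumed score in [-1, +1] built from P(wi), N(wi), Pdoc and Ndoc.

    p_w:   number of trouble reports containing the piece, P(wi)
    n_w:   number of no-trouble reports containing the piece, N(wi)
    p_doc: total number of trouble reports, Pdoc
    n_doc: total number of no-trouble reports, Ndoc
    """
    if p_doc <= 0 or n_doc <= 0:
        raise ValueError("p_doc and n_doc must be positive")
    p_rate = p_w / p_doc
    n_rate = n_w / n_doc
    if p_rate + n_rate == 0:
        return 0.0  # the piece never occurs; treat it as neutral
    return (p_rate - n_rate) / (p_rate + n_rate)


print(piece_score(100, 10, 1000, 1000))
print(piece_score(5, 0, 1000, 1000))   # only in trouble reports
print(piece_score(0, 5, 1000, 1000))   # only in no-trouble reports
```

Under this form a piece seen only in trouble reports scores +1 and a piece seen only in no-trouble reports scores -1. Note that the score depends only on the ratio of the two rates, so it is the same for a rare and a frequent piece, which is why a separate reliability term is needed.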

14 confidence interval estimation method
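Only the title of this slide survives in the transcript. As background, the sketch below computes the Agresti-Coull interval for a binomial proportion, a standard adjusted interval associated with the [Agresti et al. 98] citation. Whether the authors used exactly this interval, and how they mapped it onto the -1 to +1 score scale, is not stated, so treating "occurrences in trouble reports" as successes and "all occurrences" as trials is an assumption.

```python
import math
from typing import Tuple


def agresti_coull_interval(
    successes: int, trials: int, z: float = 1.96
) -> Tuple[float, float]:
    """Agresti-Coull confidence interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    n_adj = trials + z * z                      # adjusted sample size
    p_adj = (successes + z * z / 2) / n_adj     # adjusted proportion
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half_width, p_adj + half_width


# Same proportion of trouble occurrences, very different sample sizes.
rare = agresti_coull_interval(100, 110)
frequent = agresti_coull_interval(10000, 11000)
print(rare, frequent)
```

The interval for the frequent piece is much narrower than the one for the rare piece, which is the property the scoring relies on.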

15 Training Data
Livedoor Blog
  Trouble reports: tags or titles contain the word "trouble".
  No-trouble reports: tags and titles do not contain the word "trouble".
kakaku.com review boards
  Trouble reports: tags contain "bad".
  No-trouble reports: tags contain neither "bad" nor "question".
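The labeling rules above can be sketched as two small functions. The English keywords stand in for the tag and title strings actually used on the sites, which are not given here; substring matching for blog entries, exact tag matching for reviews, and returning `None` for reviews that belong to neither set are all assumptions, as are the function names.

```python
from typing import Iterable, Optional


def label_blog_entry(tags: Iterable[str], title: str, keyword: str = "trouble") -> str:
    """Livedoor Blog rule: a trouble report has the keyword in its tags or title."""
    has_keyword = keyword in title or any(keyword in tag for tag in tags)
    return "trouble" if has_keyword else "no-trouble"


def label_review(tags: Iterable[str]) -> Optional[str]:
    """kakaku.com rule: 'bad' marks a trouble report; 'question' is excluded."""
    tag_set = set(tags)
    if "bad" in tag_set:
        return "trouble"
    if "question" in tag_set:
        return None  # used in neither the trouble nor the no-trouble set
    return "no-trouble"


print(label_blog_entry(["pc"], "trouble with my router"))
print(label_review(["good"]))
print(label_review(["question"]))
```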
