Natural Language Processing (13) Text mining and conclusion

自然言語処理研究室 (Natural Language Processing Laboratory)

December 12, 2013

Transcript

  1. 1 / 27 Natural Language Processing (13) Text mining and Overall Conclusion

    Kazuhide Yamamoto, Dept. of Electrical Engineering, Nagaoka University of Technology
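  2. 2 / 27 Poor transport access. No electricity. Expensive! Dissatisfied with the food. Proud of its open-air bath! Popular with women. Quiet surroundings. Private bath available. Feel the changing seasons. Pets OK!

    Hot Spring A: Last year our company's general meeting was also held at Hot Spring A, and I went there for the first time in my life. The water is very salty, and my boss, who has eczema, jumped right out of the bath. This was my second visit, and that Sea of Japan atmosphere was the best after all.
    Temple C in Region B: It is so vast that the guide looked about to get lost... or rather, actually did.
    D Highland (ranch nearby): There are many small inns. Still, it is apparently popular with women. Apparently you can drink fresh milk. Meat too?
    Hot Spring C: Motoyu-kan next to the hospital (you can meet a ghost), Hotel E. Paradise, paradise...
    For you who spend every day so busily, we recommend a relaxing open-air bath.
    Reputation extraction. Summary. Hot Spring A's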
  3. 3 / 27 What is text mining?

    • Text mining is a technology for mining, i.e., extracting useful information from, a large amount of text.
    • The term is not clearly defined; the details of the process depend on what one wishes to obtain from the text.
    • In broad terms, NLP tasks such as information retrieval, document classification, and automatic summarization can all be regarded as text mining processes.
  4. 4 / 27 Example of analysis: Spring A

    Suppose that you are interested in hot spring spot A, so you are looking for information about it. You see that:
    (1) the number of searches for "Spring A" has increased.
        • time-series analysis of queries
    (2) an increasing number of Twitter users have mentioned Spring A recently.
        • time-series analysis of tweets
    (3) Spring A has attracted more attention than Spring B.
        • comparison with other target(s)
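As a rough illustration of observations (2) and (3), the sketch below counts how many posts mention a target in each period and compares two targets. The posts, the monthly periods, and the function name `mentions_per_period` are all made up for this example; real input would come from a query log or a tweet archive, and substring matching is only a stand-in for proper entity matching.

```python
from collections import Counter

# Hypothetical (period, text) posts, invented for illustration only.
POSTS = [
    ("2013-10", "Visited Spring A, lovely open-air bath"),
    ("2013-10", "Spring B was crowded"),
    ("2013-11", "Spring A again this weekend"),
    ("2013-11", "Thinking about Spring A for the holidays"),
    ("2013-11", "Spring B is fine too"),
    ("2013-12", "Spring A in the snow"),
    ("2013-12", "Everyone is talking about Spring A"),
    ("2013-12", "Booked Spring A for New Year"),
]

def mentions_per_period(posts, target):
    """Count posts that mention `target` in each period (case-insensitive substring match)."""
    counts = Counter()
    for period, text in posts:
        if target.lower() in text.lower():
            counts[period] += 1
    # Return the periods in chronological order.
    return dict(sorted(counts.items()))

spring_a = mentions_per_period(POSTS, "Spring A")
spring_b = mentions_per_period(POSTS, "Spring B")

print("Spring A:", spring_a)  # time series for one target
print("Spring B:", spring_b)  # same series for a comparison target
print("A gets more attention:", sum(spring_a.values()) > sum(spring_b.values()))
```

Periods with no mentions are simply absent from the result, so a real analysis would probably fill them with zeros before plotting.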
  5. 5 / 27 Spring A: analysis results

    • Sentiment toward Spring A is positive.
    • Women have a positive sentiment toward Spring A.
    • Spring A gives the impression of a quiet, secluded hot spring spot.
  6. 6 / 27 Text mining: background

    • Because huge amounts of text have accumulated over recent decades, both within companies and in society at large, we are no longer able to handle it manually.
    • Moreover, we want to analyze this text to find useful information.
  7. 8 / 27 Text mining and data mining

    • Data mining uses formatted data, usually given as table(s) of numbers.
    • Text mining deals with unformatted data, usually given as (a collection of) natural language sentences.
    • In other words, the mission of text mining is to construct a table from language. Once the text is formatted as a table, all the techniques of data mining can be applied. In this sense, text mining can be regarded as a pre-process of data mining.
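One common way to "construct a table from language" is a document-term table: one row per document, one column per word, each cell a count. The sketch below is a minimal version under simplifying assumptions: the three example documents are invented, `build_term_table` is a hypothetical helper name, and whitespace splitting stands in for the morphological analysis that Japanese text would need.

```python
from collections import Counter

def build_term_table(documents):
    """Turn raw texts into a table: one row per document, one column per word."""
    # Whitespace tokenization is an assumption; real text needs proper analysis.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Invented example documents.
docs = ["quiet spring quiet inn", "expensive inn", "quiet expensive spring"]
vocabulary, table = build_term_table(docs)

print(vocabulary)
for row in table:
    print(row)
```

Once the text is in this numeric form, ordinary data mining techniques such as correlation analysis or clustering can be applied to the rows.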
  8. 9 / 27 Information retrieval and text mining

    • Information Retrieval (IR) is used when we have a specific, clear search target.
      – e.g., sightseeing in Kyoto.
    • Text mining is used when we need to find the search targets themselves. IR techniques are used throughout the text mining process.
      – e.g., a good sightseeing spot to visit with my family this summer vacation.
  9. 10 / 27 Text classification and mining

    • Text classification assigns texts to several categories that are clearly defined in advance.
    • Clustering groups texts that are similar to one another. Categories are not predefined here, but the number of categories may be given.
    Text mining analyzes text using classification and clustering technologies in order to obtain useful knowledge.
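To make the contrast with classification concrete, here is a small clustering sketch in which no categories are defined in advance: each text joins the first group containing a sufficiently similar text, and otherwise starts a new group. Jaccard similarity over word sets, the threshold of 0.3, and the example texts are all assumptions chosen for illustration; the slides do not prescribe a particular clustering method.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets: shared words / all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster with a similar member."""
    clusters = []  # each cluster is a list of (text, token_set) pairs
    for text in texts:
        tokens = set(text.lower().split())
        for members in clusters:
            if any(jaccard(tokens, other) >= threshold for _, other in members):
                members.append((text, tokens))
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([(text, tokens)])
    return [[text for text, _ in members] for members in clusters]

# Invented example texts.
reviews = [
    "quiet open-air bath",
    "open-air bath with a view",
    "expensive dinner",
    "dinner was expensive",
]
for group in cluster(reviews):
    print(group)
```

The result depends on input order and on the threshold, which is typical of such simple schemes; methods that take the number of categories as input (as the slide mentions) work differently.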
  10. 11 / 27 Details of mining process

    1. sentence analysis by natural language processing
       • morphological analysis
       • parsing
    2. statistical analysis and/or machine learning
       • time-series processing
       • correlation analysis
       • trend analysis
       • topic detection
    3. visualization of analysis results
  11. 12 / 27 Primitive approach

    • conduct morphological analysis
    • count frequencies for all words
    • visualize the trend of a word by plotting changes in its frequency over time.
    • Some analyses do only this, which may be enough to see a trend.
    • However, it is clearly problematic from the viewpoint of language processing.
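The three steps of the primitive approach can be sketched in a few lines. This is only an illustration under stated assumptions: the regular-expression `tokenize` is a crude stand-in for morphological analysis, the dated texts are invented, and the "plot" is a text bar chart rather than a real graph.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Stand-in for morphological analysis: lowercase alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts_by_period(dated_texts):
    """Count the frequency of every word, separately for each period."""
    counts = defaultdict(Counter)
    for period, text in dated_texts:
        counts[period].update(tokenize(text))
    return counts

def trend(counts, word):
    """Frequency of one word over time, in chronological order."""
    return [(period, counts[period][word]) for period in sorted(counts)]

def ascii_plot(series):
    """Very rough visualization: one bar of '#' per period."""
    return "\n".join(f"{period} {'#' * n} {n}" for period, n in series)

# Invented example data.
DATED_TEXTS = [
    ("2013-10", "The bath was quiet"),
    ("2013-11", "Quiet inn, quiet street"),
    ("2013-12", "Quiet, quiet, quiet. Loved the quiet bath"),
]
counts = word_counts_by_period(DATED_TEXTS)
quiet_trend = trend(counts, "quiet")
print(ascii_plot(quiet_trend))
```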
  12. 13 / 27 Problems of the primitive approach

    • unknown words
      – newly coined words often carry key information for judgment.
    • stop words
      – no information is gained from seeing that a particle appears very often.
    • selection of words
      – it is not enough to know that a word is hot.
      – it is not easy to pick out "drastic" increases.
    • we don't know how the target is evaluated.
      – positive or negative?
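Two of these problems, stop words and picking out "drastic" increases, can be partly addressed with simple filtering and a growth ratio. The sketch below is one possible heuristic, not a method from the slides: the stop-word list is a tiny illustrative one, `rising_words` and `min_ratio` are hypothetical names, and add-one smoothing is an assumption made so that brand-new words do not cause division by zero.

```python
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "of", "to", "and", "in"}

def content_counts(tokens):
    """Word frequencies with stop words removed."""
    return Counter(t for t in tokens if t not in STOP_WORDS)

def rising_words(previous_tokens, current_tokens, min_ratio=2.0):
    """Words whose smoothed frequency grew by at least `min_ratio` between two periods."""
    prev = content_counts(previous_tokens)
    curr = content_counts(current_tokens)
    rising = []
    for word, count in curr.items():
        # Add-one smoothing keeps the ratio finite for previously unseen words.
        ratio = (count + 1) / (prev[word] + 1)
        if ratio >= min_ratio:
            rising.append((word, ratio))
    # Largest increase first; ties broken alphabetically.
    return sorted(rising, key=lambda item: (-item[1], item[0]))

# Invented example periods.
previous = "the bath is quiet and the inn is quiet".split()
current = "the snow is deep and the snow is quiet in the snow".split()
print(rising_words(previous, current))
```

Note that "the" is the most frequent token in both periods yet never appears in the result, which is the point of the stop-word filter. The sketch does nothing about unknown words or about polarity.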
  13. 14 / 27 Method for positive/negative classification

    Texts are classified into positive and negative ones basically by the following approach:
    • collect both positive and negative texts,
    • conduct supervised training by machine learning, and
    • classify given texts with the resulting automatic classifier.
    This process tells us whether an input text is regarded as positive or negative.
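The collect, train, classify loop can be illustrated with a multinomial Naive Bayes classifier. The slides do not say which learning algorithm is used, so Naive Bayes is an assumption made here for brevity, as are the class name `NaiveBayesSentiment`, the four training sentences, and whitespace tokenization.

```python
import math
from collections import Counter

class NaiveBayesSentiment:
    """Multinomial Naive Bayes with add-one smoothing over whitespace tokens."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocabulary = set()

    def train(self, labeled_texts):
        """Supervised training from (text, label) pairs."""
        for text, label in labeled_texts:
            tokens = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the label with the highest log-probability for `text`."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)     # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Step 1: collect positive and negative texts (invented examples).
TRAINING = [
    ("the bath was wonderful and quiet", "positive"),
    ("friendly staff and great food", "positive"),
    ("the room was dirty and noisy", "negative"),
    ("terrible food and rude staff", "negative"),
]
# Step 2: supervised training.
classifier = NaiveBayesSentiment()
classifier.train(TRAINING)
# Step 3: classify given texts.
print(classifier.classify("quiet room and great bath"))
print(classifier.classify("dirty room and rude staff"))
```

With only four training sentences the classifier is a toy; the quality of a real one depends mostly on how much labeled text is collected in the first step.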
  14. 15 / 27 Advanced text mining

    • Identification of attributes: what attribute is good/bad?
      – coffee: hardness, milkiness, package, TV commercial (CM), ...
    • Identification of subjective expressions: how good/bad?
      – movie: boring, poor performance, so loud, tear-jerker not sad
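  15. 16 / 27 p/n judgment: problems

    • context
      – "Expensive? You're kidding?"
      – "We want to see it if it's the case."
      – "We don't say that it's not good."
    • multiple word senses
      – 値段が「高い」 (the price is "high", i.e., expensive) vs. 品質が「高い」 (the quality is "high")
      – 「やばい」 (yabai: slang that can mean either "terrible" or "awesome")
    • newly coined words
      – まいうー (maiu: "delicious"), マジパない (majipanai: "seriously incredible"), とにかく神 (tonikaku kami: "simply godlike")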
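The context problem is easy to reproduce with a naive word-list scorer, sketched below. The two word lists and the function `lexicon_polarity` are invented for this illustration and are not a method from the slides; the point is only that counting polarity words ignores questions, irony, and negation.

```python
import re

# Tiny invented polarity word lists.
POSITIVE_WORDS = {"good", "great", "cheap"}
NEGATIVE_WORDS = {"bad", "boring", "expensive"}

def lexicon_polarity(text):
    """Label a text by counting polarity words, ignoring all context."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# The speaker is disputing that the item is expensive, yet the word alone decides.
print(lexicon_polarity("Expensive? You're kidding?"))
# A hedged double negation is reduced to the single word "good".
print(lexicon_polarity("We don't say that it's not good."))
```

Both labels come from a single word each, which is why polarity judgment needs more than word lookup.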
  16. 17 / 27 Who (should) mine text?

    • (a) those who have a large amount of text, or who can produce/collect a large amount of text.
    • (b) those who wish to acquire useful knowledge or information from texts.
    Many people and companies satisfy (a), and many satisfy (b). Interestingly, however, not many have satisfied both so far.
  17. 18 / 27 Contact center (customer desk)

    • In a contact center, all contacts, by both e-mail and phone, are recorded every day.
      – Canon Corp. received around 6 million contacts a year worldwide.
      – An IBM customer center gets over 10 thousand phone calls per month, all of which are transcribed and recorded automatically by a speech recognizer.
    • These companies are aware that these records serve not only to avoid later trouble but also as useful information:
      – in order to find defects in a product/service;
      – in order to develop a new product/service.
  18. 19 / 27 Company's daily report

    • Companies ask their staff to write a daily report as a record of each day's business activities. For example, Lion Corp. in Japan collects around 8,000 reports per month from its employees.
    • All of these reports used to be read by a manager. This is becoming more difficult due to lack of time and the increasing number of reports. Hence there is demand for automatically picking out the reports that are likely to be important.
  19. 20 / 27 Marketing research

    • Product companies need to know the "voice of consumers" in order to develop new products and services.
    • So far, such surveys have been done through paid questionnaires given to consumers recruited by the company.
    • Companies have turned their attention to text mining as an alternative to questionnaires, since it is
      – automatic, and thus low-cost
      – more likely to capture consumers' "real voice"
  20. 21 / 27 Web mining

    There are Web-based text mining services:
    • sentiment analysis
    • finding hot topics (such as on Twitter)
    • and more
  21. 22 / 27 Other examples

    Text mining is used not only in CRM (customer relationship management) but also in other domains:
    • medicine: correlation between symptoms and substances
    • trend analysis of patent documents
    • trend analysis of topics in newspapers
    Currently, text mining tools are available both for free (with limited functions) and commercially (powerful but expensive). You can also analyze texts with just Excel and open-source tools.
  22. 24 / 27 Conclusion 1

    • Natural language processing (NLP) has a history of only about half a century, far shorter than that of areas such as mathematics and electrical engineering.
    • Because only a few products or services have reached practical use in this period, many problems around us remain to be solved.
  23. 25 / 27 Conclusion 2

    • There is a (potential) demand and expectation in society for NLP technology.
    • The amount of electronic text is increasing day by day, particularly on the Web, making it more and more difficult to handle without the help of computers.
    • At the same time, NLP technology for human-computer interaction also remains important, in order to realize a "talking robot".
  24. 26 / 27 Conclusion 3

    • Language is the most important "tool" (or means) for conveying one's intent to others correctly and precisely.
    • Therefore, NLP may help in all human activities, including daily life, education, and all fields of corporate activity.
    • Ultimately, advanced NLP technology may have the potential to change our society into a more and more convenient one.
  25. 27 / 27 ありがとうございました。 Thank you! Merci à tous! Cảm ơn tất cả các bạn! 谢谢! Dankon ĉiuj! ขอบคุณคุณ شكرا لك.