Slide 1

Slide 1 text

1 / 27 Natural Language Processing (13) Text mining and Overall Conclusion Kazuhide Yamamoto Dept. of Electrical Engineering Nagaoka University of Technology

Slide 2

Slide 2 text

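2 / 27 Poor transport access / No electricity / Expensive! / Dissatisfied with the food / Proud of its open-air bath! / Popular with women / Quiet surroundings / Private baths available / Feel the changing seasons / Pets OK! Hot Spring A: "Last year, too, my company held its general meeting at Hot Spring A, and I went there for the first time in my life." "The water is very salty; my boss, who has eczema, jumped right out of the bath." "It was my second visit, and that Sea of Japan atmosphere was the best, as expected." Temple C in Region B: "... it is so vast that the guide looked about to get lost. Actually, the guide did get lost." D Highlands (ranch nearby): "Lots of small inns, but apparently popular with women. Apparently you can drink fresh milk. Meat, too?" Hot Spring C: "Motoyu-kan next to the hospital (you can meet a ghost), Hotel E." "Paradise, paradise..." "For women who spend every day busy, we recommend a relaxing open-air bath." Reputation extraction Summary Hot Spring A's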

Slide 3

Slide 3 text

3 / 27 What is text mining? ● Text mining is a technology for mining, i.e., extracting, useful information from a large amount of text. ● The term is not clearly defined; the details of the process depend on what one wishes to obtain from the text. ● In broad terms, NLP tasks such as information retrieval, document classification, and automatic summarization can all be regarded as text mining processes.

Slide 4

Slide 4 text

4 / 27 Example of analysis: Spring A Suppose that you are interested in hot spring resort A, so you are looking for information about it. You find that: (1) the number of search queries for "Spring A" has increased. ● time-series analysis of queries (2) an increasing number of Twitter users have mentioned Spring A recently. ● time-series analysis of tweets (3) Spring A has attracted more attention than Spring B. ● comparison with other target(s)
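The three observations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (week, text) records are invented, and a plain substring match stands in for real query or tweet analysis.

```python
from collections import Counter

# Hypothetical (week, text) records; real input would be search queries or tweets.
posts = [
    (1, "visited Spring A last weekend"),
    (1, "Spring B was crowded"),
    (2, "Spring A has a great open-air bath"),
    (2, "thinking about Spring A again"),
    (3, "Spring A is so quiet"),
    (3, "Spring A with family"),
    (3, "Spring A or Spring B?"),
]

def weekly_mentions(records, target):
    """Count, for each week, how many texts mention the target string."""
    counts = Counter()
    for week, text in records:
        if target in text:
            counts[week] += 1
    weeks = sorted({week for week, _ in records})
    return [counts[w] for w in weeks]

series_a = weekly_mentions(posts, "Spring A")  # time series for Spring A
series_b = weekly_mentions(posts, "Spring B")  # time series for Spring B

# (1)/(2): is the number of mentions increasing week by week?
a_is_rising = all(x < y for x, y in zip(series_a, series_a[1:]))
# (3): comparison with another target
a_beats_b = sum(series_a) > sum(series_b)

print(series_a, series_b, a_is_rising, a_beats_b)
```

With this toy data, Spring A is mentioned more each week and more often overall than Spring B.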

Slide 5

Slide 5 text

5 / 27 Spring A: analysis results ● Spring A receives positive sentiment. ● Women have positive sentiment toward Spring A. ● Spring A gives the impression of a quiet and secluded hot spring resort.

Slide 6

Slide 6 text

6 / 27 Text mining: background ● Over recent decades, both companies and society as a whole have accumulated a huge amount of text, and we are no longer able to deal with it manually. ● Moreover, we want to analyze it to find useful information.

Slide 7

Slide 7 text

7 / 27

Slide 8

Slide 8 text

8 / 27 Text mining and data mining ● Data mining uses structured data, usually represented as table(s) of numbers. ● Text mining deals with unstructured data, usually represented as (a collection of) natural language sentences. ● In other words, the mission of text mining is to construct a table from language. Once the text is formatted as a table, all the techniques of data mining can be applied. In this sense, text mining can be regarded as a pre-process of data mining.
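One common way to "construct a table from language" is a document-term count table. The following is a minimal sketch, assuming English example sentences and whitespace tokenization; the function name to_table is hypothetical, and real text (Japanese in particular) would need a morphological analyzer instead of split().

```python
from collections import Counter

docs = [
    "the bath was quiet",
    "the food was expensive",
    "quiet and secluded bath",
]

def to_table(texts):
    """Turn raw sentences into a document-term count table.

    Returns the column labels (vocabulary) and one row of counts per text.
    """
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[w] for w in vocab])
    return vocab, rows

vocab, table = to_table(docs)
print(vocab)
for row in table:
    print(row)
```

Each row is now a vector of numbers, so ordinary data mining techniques (correlation, clustering, classification) can take over from here.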

Slide 9

Slide 9 text

9 / 27 Information retrieval and text mining ● Information retrieval (IR) is used when we have a specific and clear search target. – e.g., sightseeing in Kyoto. ● Text mining is used when we need to discover the search targets themselves. IR techniques are used throughout the text mining process. – e.g., a good sightseeing spot to visit with my family this summer vacation

Slide 10

Slide 10 text

10 / 27 Text classification and mining ● Text classification assigns texts to categories that are clearly defined in advance. ● Clustering groups together texts that are similar to one another. Categories are not pre-defined here, but the number of categories may be given. Text mining analyzes text by utilizing classification/clustering technologies in order to obtain useful knowledge.
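To make the contrast with classification concrete, here is a small clustering sketch in which no categories are defined in advance. It is one possible approach, not the method of these slides: texts are compared by Jaccard similarity of their word sets and grouped greedily in a single pass. The threshold value and the example texts are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Similarity of two word sets: size of overlap divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster that contains
    a sufficiently similar text, otherwise start a new cluster."""
    clusters = []  # each cluster is a list of (text, word_set) pairs
    for text in texts:
        words = set(text.lower().split())
        for members in clusters:
            if any(jaccard(words, w) >= threshold for _, w in members):
                members.append((text, words))
                break
        else:
            clusters.append([(text, words)])
    return [[t for t, _ in members] for members in clusters]

texts = [
    "quiet open-air bath",
    "open-air bath with a view",
    "expensive dinner",
    "dinner was expensive",
]
groups = cluster(texts)
for g in groups:
    print(g)
```

The texts about the bath and the texts about dinner end up in separate groups, although no "bath" or "dinner" category was ever declared.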

Slide 11

Slide 11 text

11 / 27 Details of mining process 1. sentence analysis by natural language processing ● morphological analysis ● parsing 2. statistical analysis and/or machine learning ● time-series processing ● correlation analysis ● trend analysis ● topic detection 3. visualization of analysis results

Slide 12

Slide 12 text

12 / 27 Primitive approach ● conduct morphological analysis ● count the frequencies of all words ● visualize the trend of a word by plotting its frequency changes over time. ● Some analyses do only this, which may be enough to see a trend. ● However, it is obviously problematic from the viewpoint of language processing.
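The primitive approach fits in a short Python sketch. The assumptions here are loud ones: the dated texts are invented, whitespace splitting stands in for morphological analysis (Japanese text would need a real analyzer), and a row of "#" characters stands in for a plot.

```python
from collections import Counter, defaultdict

# Hypothetical (period, text) records.
dated_texts = [
    ("2024-01", "the bath is quiet"),
    ("2024-01", "the food is fine"),
    ("2024-02", "quiet bath and quiet room"),
    ("2024-03", "quiet quiet quiet place"),
]

def tokenize(text):
    # Stand-in for morphological analysis; only valid for space-separated text.
    return text.lower().split()

def frequencies_by_period(records):
    """Count the frequency of every word within each period."""
    freq = defaultdict(Counter)
    for period, text in records:
        freq[period].update(tokenize(text))
    return freq

def trend(freq, word):
    """Frequency of one word over time, in chronological order."""
    return [(period, freq[period][word]) for period in sorted(freq)]

freq = frequencies_by_period(dated_texts)
quiet_trend = trend(freq, "quiet")

# Crude text "plot" of the trend.
for period, count in quiet_trend:
    print(period, "#" * count)
```

Note that "the" is counted just as eagerly as "quiet", which is one of the weaknesses of this approach.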

Slide 13

Slide 13 text

13 / 27 Problems of the primitive approach ● unknown words – newly coined words often carry key information for judgment. ● stop words – no information is obtained from seeing that a particle appears very often. ● selection of words – knowing that a word is frequent ("hot") is not enough. – it is not easy to pick out "drastic" increases. ● we don't know how the target is evaluated. – positive or negative?
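Two of these problems, stop words and picking out "drastic" increases, can be partly addressed with simple heuristics. The sketch below is one possible heuristic under stated assumptions, not a standard algorithm: the stop-word list is a tiny illustrative one, and a word counts as rising when its smoothed frequency ratio between two periods reaches a hypothetical min_ratio.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and", "to"}  # tiny illustrative list

def counts(texts):
    """Word frequencies with stop words removed."""
    c = Counter()
    for t in texts:
        c.update(w for w in t.lower().split() if w not in STOP_WORDS)
    return c

def rising_words(before, after, min_ratio=2.0, smoothing=1.0):
    """Words whose smoothed frequency grew by at least min_ratio.

    Add-one smoothing keeps the ratio finite for words unseen in 'before'.
    """
    old, new = counts(before), counts(after)
    scored = []
    for word, n in new.items():
        ratio = (n + smoothing) / (old[word] + smoothing)
        if ratio >= min_ratio:
            scored.append((word, ratio))
    return sorted(scored, key=lambda item: (-item[1], item[0]))

last_month = ["the bath is quiet", "the view is nice"]
this_month = ["the bath is crowded", "crowded and noisy", "the bath is crowded"]

rising = rising_words(last_month, this_month)
print(rising)
```

Here "bath" is frequent in both months and is therefore not reported, while "crowded" stands out. The result still says nothing about whether the change is positive or negative.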

Slide 14

Slide 14 text

14 / 27 Method for positive/negative classification Texts are classified into positive and negative ones basically by the following approach: ● collect both positive and negative texts, ● conduct supervised training by machine learning, and ● classify given texts with the resulting automatic classifier. This process tells us whether an input text is regarded as positive or negative.
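The three steps can be sketched with a multinomial Naive Bayes classifier written in plain Python. Naive Bayes is chosen here only as a simple example of supervised learning; the slides do not name a specific learner. The four training texts are invented, and a real system would need far more data.

```python
import math
from collections import Counter

# Step 1: collect positive and negative texts (hypothetical examples).
train = [
    ("quiet bath and great view", "pos"),
    ("great food and friendly staff", "pos"),
    ("expensive and noisy room", "neg"),
    ("bad food and dirty bath", "neg"),
]

# Step 2: supervised training.
def fit(examples):
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab

# Step 3: classify a given text.
def predict(model, text):
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothed log likelihood
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, "great view"))
print(predict(model, "noisy and expensive"))
```

The classifier only outputs a label; it does not say which attribute is good or bad, or how strongly.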

Slide 15

Slide 15 text

15 / 27 Advanced text mining ● Identification of attributes: which attribute is good/bad? – coffee: hardness, milkiness, package, TV commercial (CM), ... ● Identification of subjective expressions: how good/bad? – movie: boring, poor performance, so loud, a tear-jerker that is not sad

Slide 16

Slide 16 text

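16 / 27 p/n judgment: problems ● context – "Expensive? You're kidding?" – "We want to see it if it's the case." – "We don't say that it's not good." ● multiple word senses – 「高い」 (takai, "high"): negative for a price (値段が高い, "expensive") vs. positive for quality (品質が高い, "high quality") – 「やばい」 (yabai): "terrible" or "awesome", depending on context ● newly coined words – まいうー (maiu, slang for "delicious"), マジパない (maji panai, "seriously incredible"), とにかく神 (tonikaku kami, "simply godlike")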
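The context problem can be demonstrated with a toy lexicon-based scorer. Everything in this sketch is an assumption for illustration: the polarity lexicon, the negator list, and the rule that each negator flips the polarity of the next sentiment word. Contractions such as "don't" are written out as "do not" to keep the tokenizer trivial.

```python
LEXICON = {"good": 1, "great": 1, "bad": -1, "expensive": -1}  # hypothetical
NEGATORS = {"not", "never", "no"}

def tokens(text):
    for mark in "?.!,":
        text = text.replace(mark, " ")
    return text.lower().split()

def naive_score(text):
    """Sum word polarities, ignoring context entirely."""
    return sum(LEXICON.get(w, 0) for w in tokens(text))

def negation_aware_score(text):
    """Each negator flips the polarity of the next sentiment word."""
    score, flip = 0, 1
    for w in tokens(text):
        if w in NEGATORS:
            flip = -flip
        elif w in LEXICON:
            score += flip * LEXICON[w]
            flip = 1
    return score

for sentence in [
    "It is not good.",
    "We do not say that it is not good.",
    "Expensive? You're kidding?",
]:
    print(sentence, naive_score(sentence), negation_aware_score(sentence))
```

Handling negation fixes the first sentence and copes with the double negation in the second, but the third is still scored as negative even though the speaker is denying that the item is expensive. Ambiguous words and newly coined words are not covered by a fixed lexicon at all.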

Slide 17

Slide 17 text

17 / 27 Who (should) mine text? ● (a) those who have a large amount of text, or who can produce/collect a large amount of text. ● (b) those who wish to acquire useful knowledge or information from text. There are many people/companies who satisfy (a), and also many who satisfy (b). Interestingly, however, not many have satisfied both so far.

Slide 18

Slide 18 text

18 / 27 Contact center (customer desk) ● In a contact center, all contacts, both by e-mail and by phone, are recorded every day. – Canon Corp. received around 6 million contacts in a year worldwide. – An IBM customer center receives over 10 thousand phone calls per month, all of which are automatically transcribed and recorded by a speech recognizer. ● These companies are aware that the records serve not only to avoid trouble later, but also as useful information: – for finding defects in a product/service; – for developing a new product/service.

Slide 19

Slide 19 text

19 / 27 Company's daily report ● Companies ask their staff to write a daily report as a record of each day's business activities. For example, Lion Corp. in Japan collects around 8,000 reports per month from its employees. ● All of these reports used to be read by a manager. This is becoming more difficult due to lack of time and the increasing number of reports. Hence there is a demand for automatically picking out the reports that are likely to be important.

Slide 20

Slide 20 text

20 / 27 Marketing research ● Product companies need to know the "voice of consumers" in order to develop new products and services. ● So far, surveys have been conducted through paid questionnaires given to consumers recruited by the company. ● Companies have turned their attention to text mining as an alternative to questionnaires, since it is – automatic, and thus low-cost – more likely to capture consumers' "real voice"

Slide 21

Slide 21 text

21 / 27 Web mining There are Web-based text mining services: ● sentiment analysis ● finding hot topics (such as on Twitter) ● and more

Slide 22

Slide 22 text

22 / 27 Other examples Text mining is used not only in CRM (customer relationship management) but also in other domains: ● medicine: correlation between symptoms and substances ● trend analysis of patent documents ● trend analysis of a topic in newspapers Currently, some tools for text mining are available, both for free (with limited functions) and for sale (powerful but expensive). You can also analyze texts using only Excel and open source tools.

Slide 23

Slide 23 text

23 / 27 Overall Conclusions

Slide 24

Slide 24 text

24 / 27 Conclusion 1 ● Natural language processing (NLP) has a history of only half a century, which is far shorter than that of other areas such as mathematics and electrical engineering. ● As only a few products or services have come into practical use during this period, there are still a lot of problems to be solved around us.

Slide 25

Slide 25 text

25 / 27 Conclusion 2 ● There is a (potential) demand and expectation in society for using NLP technology. ● The number of electronic texts is increasing day by day, particularly on the Web, which makes them more and more difficult to deal with without the help of computers. ● At the same time, NLP technology for human-computer interaction also remains important, in order to realize a "talking robot".

Slide 26

Slide 26 text

26 / 27 Conclusion 3 ● Language is a most important "tool" (or means) for conveying one's intent to others correctly and precisely. ● Therefore, NLP may help all human activities, including our daily life, education, and all fields of corporate activity. ● Ultimately, advanced NLP technology may have the potential to change our society into a more and more convenient one.

Slide 27

Slide 27 text

27 / 27 ありがとうございました。 Thank you! Merci à tous! Cảm ơn tất cả các bạn! 谢谢! Dankon ĉiuj! ขอบคุณคุณ شكرا لك