Text mining is a technology to mine, i.e., to extract potentially useful information from large amounts of text.
• The term is not clearly defined; the details of the process depend on what one wishes to obtain from the text.
• In broad terms, NLP tasks such as information retrieval, document classification, and automatic summarization are all regarded as text mining processes.
you are interested in hot spring spot A, so you are looking for information about it. You see that:
(1) the number of searches for "Spring A" has increased.
• time-series analysis of queries
(2) an increasing number of Twitter users have mentioned Spring A recently.
• time-series analysis of tweets
(3) Spring A has attracted more attention than Spring B.
• comparison with other target(s)
collected a huge amount of text over recent decades, both within companies and in society, we are no longer able to deal with it manually.
• Moreover, we want to analyze it to find useful information.
• Data mining uses formatted data, usually represented as table(s) of figures.
• Text mining deals with unformatted data, usually represented as (a collection of) natural language sentences.
• In other words, the mission of text mining is to construct a table from language. Once the text is formatted as a table, all techniques of data mining can be applied. In this sense, text mining is regarded as a pre-process of data mining.
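One simple way to "construct a table from language" is a word-count table with one row per document and one column per word. The sketch below is only an illustration under stated assumptions: the function names `tokenize` and `build_term_table` and the sample sentences are hypothetical, and whitespace tokenization works for English only (Japanese text would first need word segmentation).

```python
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t]

def build_term_table(documents):
    """Return (vocabulary, rows): one row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))  # column headers
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical sample documents.
documents = [
    "The hot spring was relaxing.",
    "The spring was crowded, very crowded.",
]
vocabulary, rows = build_term_table(documents)
print(vocabulary)
for row in rows:
    print(row)
```

Once the text is in this form, each row can be handed to ordinary data mining methods that expect numeric tables.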
• Information Retrieval (IR) is used when we have a specific, clear search target.
– e.g., sightseeing in Kyoto.
• Text mining is used when we need to find the search targets themselves. IR techniques are used throughout the text mining process.
– e.g., a good sightseeing spot to visit with my family this summer vacation
• Classification assigns texts to several categories that are clearly defined in advance.
• Clustering merges texts that are similar to one another. Categories are not predefined, but the number of categories may be given.
Text mining analyzes text using classification and clustering technologies in order to obtain useful knowledge.
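The difference between the two can be sketched in a few lines. This is a minimal illustration, not a recommended method: the categories and keywords in `CATEGORIES` are hypothetical, classification is reduced to keyword overlap, and clustering is a greedy single pass over Jaccard word-set similarity with an assumed `threshold` (rather than a given number of clusters).

```python
def words(text):
    """Represent a text as the set of its lowercase words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Similarity of two word sets: intersection size over union size."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Classification: categories and their keywords are fixed in advance (hypothetical).
CATEGORIES = {
    "food": {"meal", "dinner", "breakfast", "tasty"},
    "bath": {"bath", "spring", "water", "hot"},
}

def classify(text):
    """Assign the predefined category sharing the most keywords with the text.

    Ties (including no match at all) fall back to the alphabetically first category.
    """
    tokens = words(text)
    return max(sorted(CATEGORIES), key=lambda name: len(CATEGORIES[name] & tokens))

def cluster(texts, threshold=0.3):
    """Greedy clustering: join the first cluster whose seed text is similar enough."""
    clusters = []  # each cluster is a list of texts; its first element is the seed
    for text in texts:
        for group in clusters:
            if jaccard(words(text), words(group[0])) >= threshold:
                group.append(text)
                break
        else:
            clusters.append([text])  # no similar cluster found: start a new one
    return clusters

texts = [
    "the hot spring water was clean",
    "dinner was tasty",
    "the spring water was hot",
    "breakfast and dinner was tasty",
]
print([classify(t) for t in texts])
print(cluster(texts))
```

Note that `classify` needs the category definitions up front, while `cluster` discovers the groups from the texts alone and gives them no names.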
count frequencies for all words
• visualize the trend of a word by plotting how its frequency changes over time.
• Some analyses do only this, which may be enough to see a trend.
• However, this is clearly problematic from the viewpoint of language processing.
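The counting-and-plotting approach described above can be sketched as follows. The period labels, sample texts, and function names are hypothetical, words are split on whitespace only, and a text bar chart stands in for a real plot.

```python
from collections import Counter, defaultdict

def frequency_by_period(dated_texts):
    """Map each period label to a Counter of word frequencies."""
    table = defaultdict(Counter)
    for period, text in dated_texts:
        table[period].update(text.lower().split())
    return table

def trend(table, word):
    """Return (period, count) pairs for one word, in period order."""
    return [(period, table[period][word]) for period in sorted(table)]

# Hypothetical (period, text) pairs.
dated_texts = [
    ("2024-01", "the onsen was quiet"),
    ("2024-02", "the onsen is popular"),
    ("2024-02", "onsen again and again onsen"),
    ("2024-03", "stayed home this month"),
]
table = frequency_by_period(dated_texts)
for period, count in trend(table, "onsen"):
    print(period, "#" * count, count)  # crude text "plot" of the trend
```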
words
– newly coined words often carry key information for judgment.
• stop words
– no information is gained from seeing that a particle appears very often.
• selection of words
– it is not enough to know that a word is hot.
– it is not easy to pick out "drastic" increases.
• evaluation is unknown
– is the word used positively or negatively?
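Two of these problems, stop words and picking out "drastic" increases, can be partly addressed with simple filtering. The sketch below is an assumption-laden illustration: the `STOP_WORDS` list is a tiny hypothetical one, and "drastic" is defined here as a smoothed frequency ratio of at least `min_ratio` between two periods, which is only one possible criterion.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "is", "to", "of", "and", "was"}

def content_counts(texts):
    """Count words across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split() if w not in STOP_WORDS)
    return counts

def rising_words(previous, current, min_ratio=2.0, smoothing=1.0):
    """Return (word, ratio) pairs whose smoothed frequency grew by at least min_ratio."""
    result = {}
    for word, now in current.items():
        # Smoothing keeps the ratio finite for words unseen in the previous period.
        ratio = (now + smoothing) / (previous[word] + smoothing)
        if ratio >= min_ratio:
            result[word] = ratio
    # Largest increase first; ties broken alphabetically.
    return sorted(result.items(), key=lambda item: (-item[1], item[0]))

last_month = content_counts(["the bath was warm", "the food was fine"])
this_month = content_counts([
    "the bath is crowded",
    "crowded bath and crowded lobby",
    "the food is fine",
])
print(rising_words(last_month, this_month))
```

This still says nothing about whether a rising word is used positively or negatively, and it treats newly coined words like any other token.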
into positive ones and negative ones, basically according to the following approach:
• collect both positive and negative texts,
• conduct supervised training by machine learning, and
• classify given texts with the resulting automatic classifier.
This process tells us whether an input text is regarded as positive or negative.
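The three steps above can be sketched with a small naive Bayes classifier. The slides do not name a learning algorithm, so naive Bayes with add-one smoothing is an assumption made here for illustration; the training sentences and the function names `train` and `classify` are likewise hypothetical.

```python
import math
from collections import Counter

def train(labeled_texts):
    """Collect per-label word counts and document counts from (label, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, doc_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Step 1: collected positive and negative texts (hypothetical).
training_data = [
    ("positive", "great movie loved it"),
    ("positive", "great acting and great story"),
    ("negative", "boring movie hated it"),
    ("negative", "poor acting and boring story"),
]
model = train(training_data)                  # step 2: supervised training
print(classify(model, "great story"))         # step 3: classify given texts
print(classify(model, "boring and poor"))
```

A real system would need far more training data, and a word-level model like this one does not handle indirect or ambiguous expressions.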
which attribute is good/bad?
– coffee: hardness, milkiness, packaging, TV commercials, ...
• Identification of subjective expressions: how good/bad?
– movie: boring, poor performance, so loud, tear-jerker not sad
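You're kidding?"
– "We want to see it if it's the case."
– "We don't say that it's not good."
• multiple word senses
– 値段が「高い」 (the price is "high", i.e., expensive) vs. 品質が「高い」 (the quality is "high")
– 「やばい」 (yabai: slang that can mean either "terrible" or "awesome")
• newly coined words
– まいうー (slang for "delicious"), マジパない (slang for "seriously amazing"), とにかく神 ("simply godlike")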
• (a) one who has a large amount of text, or who can produce or collect a large amount of text.
• (b) one who wishes to acquire useful knowledge or information from text.
Many people and companies satisfy (a), and many satisfy (b). Interestingly, however, not many have satisfied both so far.
center, all contacts, both by e-mail and by phone, are recorded every day.
– Canon Corp. received around 6 million contacts worldwide in one year.
– The IBM customer center gets over 10 thousand phone calls per month, all of which are transcribed and recorded automatically by a speech recognizer.
• These companies are aware that the records serve not only to avoid trouble later but also as useful information:
– for finding defects in a product or service;
– for developing a new product or service.
business activities ask staff to write a daily report as a record of each day. For example, Lion Corp. in Japan collects around 8,000 reports per month from its employees.
• All of these reports used to be read by a manager. This is becoming more difficult due to lack of time and the increasing number of reports. Hence there is a demand for automatically picking out the reports that are likely to be important.
know the "voice of consumers" in order to develop new products and services.
• So far, such surveys have been done through paid questionnaires given to consumers recruited by the company.
• Companies have turned their attention to text mining as an alternative to questionnaires, since it
– is automatic, and thus costs little
– offers more chances of collecting the consumers' "real voice"
only CRM (customer relationship management) but also other domains:
• medicine: correlation of symptoms and substances
• trend analysis of patent documents
• trend analysis of a topic in newspapers
Currently, some text mining tools are available, both free (with limited functions) and commercial (powerful but expensive). You can also analyze texts using only Excel and open source tools.
NLP has a history of only half a century, which is far shorter than that of other areas like mathematics and electrical engineering.
• As only a few products or services have come into practical use during this period, many problems around us remain to be solved.
demand and an expectation in society to use NLP technology.
• The number of e-texts is increasing day by day, particularly on the Web, making them more and more difficult to deal with without the help of computers.
• At the same time, NLP technology for human-computer interaction also remains important, in order to realize a "talking robot".
important "tool" (or means) for conveying one's intent to others correctly and precisely.
• Therefore, NLP may help all human activities, including daily life, education, and all fields of corporate activity.
• Ultimately, advanced NLP technology may have the potential to change our society into a more and more convenient one.