Basics of Text Analysis - Viktor Kifer

Basics of Text Analysis with Python

Why Python?
1. Simple language (easy to learn and understand)
2. Fast development
3. Lots of libraries for machine learning and neural networks:
   a. NumPy and SciPy (complex math)
   b. scikit-learn (machine learning)
   c. NLTK and TextBlob (natural language processing)
   d. Theano (neural networks)

Business problems
• Was the document actually written by a particular person?
• What is the document about?
• What is the author's opinion on the problem described?
• Is the document spam?
• Automatically correct mistakes in documents
• Detect the document's language

To the process...

Where to get documents?
• Using public APIs (XML, JSON)
  ◦ Twitter API
  ◦ New York Times API
• Web scraping (HTML, MS Word documents)
  ◦ Comments on a given product on a shopping site
  ◦ Results of some competitions

Natural Language Processing (NLP)
For Python there is the Natural Language Toolkit (NLTK) library, which makes language processing much easier.

Convert text into a vector of features
Split text into sentences
• This problem usually isn’t that hard
• You can often split text on periods, question marks, exclamation marks, etc.
Split sentences into words
• The problem doesn’t look hard at first, but it gets more complicated as you dive deeper
• Examples: It’s, 20-year-old, U.S., New York
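A minimal sketch of why naive splitting falls short. It uses only Python's standard `re` module rather than NLTK, and the sample sentence and both patterns are illustrative assumptions, not part of the talk.

```python
import re

# Sample text is an assumption, chosen to include the tricky cases above.
text = "He moved to the U.S. in 2001. It's a big change!"

# Naive sentence split: break on whitespace that follows '.', '?' or '!'.
naive_sentences = re.split(r"(?<=[.?!])\s+", text)
print(naive_sentences)
# The abbreviation "U.S." is wrongly treated as the end of a sentence.

# Naive word split: keep runs of letters, digits and underscores.
naive_words = re.findall(r"\w+", "It's a 20-year-old law")
print(naive_words)
# "It's" and "20-year-old" are broken into fragments.
```

Both results show that simple rules work for plain text but break on abbreviations, contractions and hyphenated words.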

So tokenization...
nltk.tokenize
• TweetTokenizer
• MWETokenizer
• RegexpTokenizer
• WhitespaceTokenizer
• WordPunctTokenizer
• StanfordTokenizer
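As a rough illustration of how three of these tokenizers differ, the sketch below runs them on one sentence. The sentence and the multi-word expression list are assumptions made for the example; none of these three tokenizers needs extra NLTK data.

```python
from nltk.tokenize import MWETokenizer, WhitespaceTokenizer, WordPunctTokenizer

# Example sentence is an assumption, built from the tricky cases in the slides.
text = "It's a 20-year-old law in New York"

# Split on whitespace only: contractions and hyphenated words stay intact.
whitespace_tokens = WhitespaceTokenizer().tokenize(text)
print(whitespace_tokens)

# Split into runs of word characters and runs of punctuation.
punct_tokens = WordPunctTokenizer().tokenize(text)
print(punct_tokens)

# Re-join known multi-word expressions in an already tokenized text.
mwe_tokenizer = MWETokenizer([("New", "York")], separator=" ")
merged_tokens = mwe_tokenizer.tokenize(whitespace_tokens)
print(merged_tokens)
```

Which tokenizer fits depends on the text: whitespace splitting keeps "20-year-old" whole, while MWETokenizer only merges the expressions it is given.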

Infinitive form, suffixes, prefixes and so on
There are times when you don’t care about the endings, prefixes and suffixes of the words in a text and only need the root of each word.
• Generously -> generous
• Miles -> mile
• Traditional -> tradition

And stemming
nltk.stem
• Porter stemmer
• Snowball stemmer

    from nltk.stem import SnowballStemmer

    print(SnowballStemmer("english").stem("generously"))  # generous
    print(SnowballStemmer("porter").stem("generously"))   # gener

High-dimensional data
You usually want to avoid high-dimensional data, as it makes both training and testing slower.
So common words should be removed.

Stopwords
Fortunately, there are word lists for different languages that contain the words that usually have little impact on the meaning of a sentence.

    # Requires the stopwords corpus: nltk.download('stopwords')
    from nltk.corpus import stopwords

    stop = stopwords.words('english')
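A minimal sketch of the filtering step itself. To stay runnable without downloading a corpus it assumes a tiny hand-written `stop` set in place of the NLTK list; the sample sentence is also an assumption.

```python
# Tiny hand-written stop list, a stand-in (assumption) for
# stopwords.words('english') so the sketch runs without corpus downloads.
stop = {"the", "is", "a", "of", "in", "and", "to"}

tokens = "the capital of the united states is washington".split()

# Keep only the tokens that are not stop words.
content_words = [token for token in tokens if token not in stop]
print(content_words)
```

Half of the tokens disappear here, which is the kind of dimensionality reduction the previous slide asks for.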

Complex terms
We should also consider how to work with multi-word terms:
• New York
• United States
• Washington DC
• Chicago Bulls

Collocations
nltk.collocations
• BigramCollocationFinder
• TrigramCollocationFinder
• BigramAssocMeasures
• TrigramAssocMeasures
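One possible way to use these classes to find multi-word terms is sketched below. The toy word list and the frequency threshold of 2 are assumptions; a real corpus would need a higher threshold and more text.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy corpus (an assumption): "new york" is the only pair that repeats.
words = (
    "new york is big . i love new york . new york never sleeps . "
    "the bulls play in chicago"
).split()

finder = BigramCollocationFinder.from_words(words)

# Drop bigrams seen fewer than two times; rare pairs give noisy scores.
finder.apply_freq_filter(2)

# Rank the remaining bigrams by pointwise mutual information (PMI).
top_bigrams = finder.nbest(BigramAssocMeasures.pmi, 3)
print(top_bigrams)
```

The pairs found this way could then be handed to MWETokenizer so they are treated as single tokens.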

Frequency distributions
One approach to authorship analysis is to compare the word frequencies in the document with the "golden" (reference) frequency distribution of a given author.
nltk.FreqDist
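A rough sketch of that comparison with `nltk.FreqDist`. The `profile_distance` helper, the sum-of-absolute-differences measure and the toy texts are all assumptions for illustration; the talk does not say which distance it uses.

```python
from nltk import FreqDist


def profile_distance(doc_fd, author_fd):
    """Sum of absolute differences in relative word frequency (hypothetical measure)."""
    vocabulary = set(doc_fd) | set(author_fd)
    return sum(abs(doc_fd.freq(w) - author_fd.freq(w)) for w in vocabulary)


# Toy reference texts for two authors and one document of unknown origin.
author_a = FreqDist("the sea the ship the storm and the sea".split())
author_b = FreqDist("a market a price a profit and a loss".split())
unknown = FreqDist("the ship and the sea".split())

distance_a = profile_distance(unknown, author_a)
distance_b = profile_distance(unknown, author_b)

# The smaller distance points at the more likely author.
likely_author = "A" if distance_a < distance_b else "B"
print(likely_author)
```

`FreqDist.freq` returns a relative frequency, so texts of different lengths can be compared directly.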

Classification
NaiveBayesClassifier
• Quite fast learning process
• Good for `bag-of-words` analysis:
  ◦ Sentiment analysis
  ◦ Authorship analysis
DecisionTreeClassifier
• Longer learning process, but can give better results
• Good for `bag-of-collocations` analysis
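A minimal bag-of-words sentiment sketch with `NaiveBayesClassifier`. The `bag_of_words` helper and the six training sentences are assumptions; real training data would be far larger.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(text):
    """Hypothetical feature extractor: mark every word in the text as present."""
    return {word: True for word in text.lower().split()}


# Tiny hand-made training set (an assumption, for illustration only).
train_set = [
    (bag_of_words("great fun and amazingly simple"), "pos"),
    (bag_of_words("what a great library"), "pos"),
    (bag_of_words("simple and great"), "pos"),
    (bag_of_words("boring and slow"), "neg"),
    (bag_of_words("what a slow boring mess"), "neg"),
    (bag_of_words("slow and confusing"), "neg"),
]

classifier = NaiveBayesClassifier.train(train_set)

# Classify a new text using the same feature extractor.
label = classifier.classify(bag_of_words("great fun"))
print(label)
```

The same pattern applies to authorship analysis: the labels become author names instead of "pos" and "neg".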

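Example

    import nltk
    from nltk.corpus import brown

    # suffix_fdist and pos_features are not defined on this slide. In chapter 6
    # of the NLTK book, suffix_fdist counts word endings in the Brown corpus and
    # pos_features(word) returns endswith(<suffix>) flags.
    common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

    tagged_words = brown.tagged_words(categories='news')
    featuresets = [(pos_features(n), g) for (n, g) in tagged_words]

    # Hold out the first 10% for testing and train on the rest.
    size = int(len(featuresets) * 0.1)
    train_set, test_set = featuresets[size:], featuresets[:size]

    classifier = nltk.DecisionTreeClassifier.train(train_set)
    nltk.classify.accuracy(classifier, test_set)   # 0.62705121829935351
    classifier.classify(pos_features('cats'))      # 'NNS'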

Example

    print(classifier.pseudocode(depth=4))

    if endswith(,) == True: return ','
    if endswith(,) == False:
      if endswith(the) == True: return 'AT'
      if endswith(the) == False:
        if endswith(s) == True:
          if endswith(is) == True: return 'BEZ'
          if endswith(is) == False: return 'VBZ'
        if endswith(s) == False:
          if endswith(.) == True: return '.'
          if endswith(.) == False: return 'NN'

Source: http://www.nltk.org/book/ch06.html
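The `endswith(...)` tests in the tree come from a suffix-based feature extractor. The sketch below shows one way such an extractor could look; the five-item suffix list is a hypothetical stand-in for the 100 most common suffixes used in the example.

```python
# Hypothetical short suffix list; the example in the slides takes the 100 most
# common suffixes from a frequency distribution instead.
common_suffixes = [",", "the", "s", "is", "."]


def pos_features(word):
    """Return one boolean 'endswith(<suffix>)' feature per known suffix."""
    word = word.lower()
    return {f"endswith({suffix})": word.endswith(suffix) for suffix in common_suffixes}


features = pos_features("cats")
print(features)
```

Each key matches a test name in the pseudocode, so the tree can be read as a sequence of checks on these flags.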

And if you like scikit-learn

    from nltk.classify import SklearnClassifier
    from sklearn.svm import SVC

    classif = SklearnClassifier(SVC()).train(train_data)

Text analysis with a human face: TextBlob

TextBlob example

    from textblob import TextBlob

    wiki = TextBlob("Python is a high-level, general-purpose programming language.")
    wiki.tags
    # [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'),
    #  ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]

    testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
    testimonial.sentiment
    # Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)

Sentiment analysis of NY Times
Mean: 1.87
Variation: 0.21

GDG Ternopil
