Basics of Text Analysis - Viktor Kifer

GDG Ternopil

March 16, 2016

Transcript

  1. Why Python?
     1. Simple language (easy to learn and understand)
     2. Fast development
     3. Lots of libraries for machine learning and neural networks:
        a. NumPy and SciPy (complex math)
        b. Scikit-Learn (machine learning)
        c. NLTK and TextBlob (natural language processing)
        d. Theano (neural networks)
  2. Business problems
     • Is the document actually written by a given person?
     • What is the document about?
     • What is the author's opinion of the described problem?
     • Is the document spam?
     • Automatically correct mistakes in documents
     • Detect the document's language
  3. Where to get documents?
     • Public APIs (XML, JSON): Twitter API, New York Times API
     • Web scraping (HTML, MS Word documents): e.g. comments on a given product on a shopping site
     • Results of some competitions
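
     A minimal sketch of collecting documents through a public JSON API. The endpoint, query parameters and response fields below are hypothetical; a real service such as the Twitter or New York Times API has its own URL, authentication and response schema.

         import requests

         API_URL = "https://api.example.com/v1/articles"   # hypothetical endpoint
         params = {"q": "machine learning", "api-key": "YOUR_KEY"}

         response = requests.get(API_URL, params=params)
         response.raise_for_status()

         # Keep the raw text of every returned document for later processing
         documents = [item["text"] for item in response.json()["results"]]
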
  4. Natural Language Processing (NLP)
     Python provides the Natural Language ToolKit (NLTK) library, which makes language processing much easier.
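
     Getting started is a pip install plus downloading a few data packages; the exact set below is an assumption based on what the following slides use.

         # pip install nltk
         import nltk

         nltk.download('punkt')       # sentence/word tokenizer models
         nltk.download('stopwords')   # stopword lists (slide 9)
         nltk.download('brown')       # Brown corpus (slide 13)
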
  5. Convert text into a vector of features
     Split text into sentences
     This problem usually isn't that hard: you can split text on dots, question marks, exclamation marks, etc.
     Split sentences into words
     While this problem doesn't look hard at first glance, it becomes more complicated as you dive deeper.
     Examples: It's, 20-year-old, U.S., New York
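
     A short illustration with NLTK's standard tokenizers, showing why naive splitting on punctuation breaks on exactly these examples:

         from nltk.tokenize import sent_tokenize, word_tokenize

         text = "It's a 20-year-old company from the U.S. It moved to New York."

         # A trained sentence splitter handles abbreviations like "U.S."
         # far better than splitting on every dot
         print(sent_tokenize(text))

         # word_tokenize splits "It's" into "It" + "'s" but keeps
         # hyphenated tokens such as "20-year-old" together
         print(word_tokenize(text))
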
  6. So tokenization... nltk.tokenize
     • TweetTokenizer
     • MWETokenizer
     • RegexpTokenizer
     • WhitespaceTokenizer
     • WordPunctTokenizer
     • StanfordTokenizer
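
     Different tokenizers make different trade-offs; a quick comparison of two of them (the sample tweet is made up):

         from nltk.tokenize import TweetTokenizer, WordPunctTokenizer

         text = "@gdg_ternopil NLTK is great!!! #nlp"

         # TweetTokenizer keeps handles and hashtags as single tokens
         print(TweetTokenizer().tokenize(text))

         # WordPunctTokenizer splits on all punctuation, so '@' and '#'
         # become separate tokens
         print(WordPunctTokenizer().tokenize(text))
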
  7. Infinitive form, suffixes, prefixes and so on
     There are times when you don't really care about the endings, prefixes and suffixes of the words in the text, and you only need the root of the word.
     Generously -> generous
     Miles -> mile
     Traditional -> tradition
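
     NLTK offers both stemmers and lemmatizers for this. A sketch of the two; note that a stemmer chops suffixes heuristically, so its output may not be a dictionary word, while a lemmatizer maps to dictionary forms:

         from nltk.stem import PorterStemmer, WordNetLemmatizer

         stemmer = PorterStemmer()
         lemmatizer = WordNetLemmatizer()   # needs nltk.download('wordnet')

         for word in ['generously', 'miles', 'traditional']:
             print(word, '->', stemmer.stem(word), '/', lemmatizer.lemmatize(word))
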
  8. High-dimensional data
     Usually you'd like to avoid high-dimensional data, as it makes both training and testing slower. So common words should be removed.
  9. Stopwords
     Fortunately, there are dictionaries for different languages that list the words that usually have little impact on the meaning of a sentence.

         from nltk.corpus import stopwords
         stop = stopwords.words('english')
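
     Putting it together, dropping stopwords from a tokenized sentence (the sample sentence is made up):

         from nltk.corpus import stopwords
         from nltk.tokenize import word_tokenize

         stop = set(stopwords.words('english'))

         tokens = word_tokenize("This is a simple example of stopword removal")
         content_words = [t for t in tokens if t.lower() not in stop]
         print(content_words)   # roughly ['simple', 'example', 'stopword', 'removal']
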
  10. Complex terms
      We should also consider how to deal with multi-word terms:
      New York
      United States
      Washington DC
      Chicago Bulls
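
      The MWETokenizer from slide 6 is meant for exactly this: given a list of known multi-word expressions, it re-joins them into single tokens.

          from nltk.tokenize import MWETokenizer

          tokenizer = MWETokenizer([('New', 'York'), ('Chicago', 'Bulls')],
                                   separator=' ')

          tokens = tokenizer.tokenize('The Chicago Bulls played in New York'.split())
          print(tokens)   # ['The', 'Chicago Bulls', 'played', 'in', 'New York']
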
  11. Frequency distributions
      One approach to analysing authorship is to compare the frequency of words in the document against a golden (reference) frequency distribution for the given author.
      nltk.FreqDist
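
      Building such a distribution with nltk.FreqDist (the document string is made up):

          from nltk import FreqDist
          from nltk.tokenize import word_tokenize

          document = "the cat sat on the mat and the dog sat on the cat"
          fdist = FreqDist(word_tokenize(document))

          print(fdist['the'])           # 4
          print(fdist.most_common(3))   # 'the' comes first, with count 4
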
  12. Classification
      NaiveBayesClassifier
      • Quite fast learning process
      • Good for `bag-of-words` analysis:
        ◦ Sentiment analysis
        ◦ Authorship analysis
      DecisionTreeClassifier
      • Longer learning process, but can give better results
      • Good for `bag-of-collocations` analysis
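
      A minimal bag-of-words sentiment example with NaiveBayesClassifier; the tiny training set is made up for illustration.

          import nltk

          def bag_of_words(text):
              return {word: True for word in text.lower().split()}

          train_data = [
              (bag_of_words('great talk really enjoyed it'), 'pos'),
              (bag_of_words('awesome library very simple'), 'pos'),
              (bag_of_words('boring talk waste of time'), 'neg'),
              (bag_of_words('terrible and confusing slides'), 'neg'),
          ]

          classifier = nltk.NaiveBayesClassifier.train(train_data)
          print(classifier.classify(bag_of_words('really great talk')))   # likely 'pos'
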
  13. Example

          common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
          tagged_words = brown.tagged_words(categories='news')
          featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
          size = int(len(featuresets) * 0.1)
          train_set, test_set = featuresets[size:], featuresets[:size]
          classifier = nltk.DecisionTreeClassifier.train(train_set)
          nltk.classify.accuracy(classifier, test_set)
          0.62705121829935351
          classifier.classify(pos_features('cats'))
          'NNS'
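
      The slide omits the definitions of suffix_fdist and pos_features; reconstructed from the NLTK book chapter cited on the next slide, they look roughly like this:

          import nltk
          from nltk.corpus import brown

          # Count how often each 1-, 2- and 3-letter suffix occurs
          suffix_fdist = nltk.FreqDist()
          for word in brown.words():
              word = word.lower()
              suffix_fdist[word[-1:]] += 1
              suffix_fdist[word[-2:]] += 1
              suffix_fdist[word[-3:]] += 1

          common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

          # One boolean feature per common suffix
          def pos_features(word):
              features = {}
              for suffix in common_suffixes:
                  features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
              return features
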
  14. Example

          print(classifier.pseudocode(depth=4))
          if endswith(,) == True: return ','
          if endswith(,) == False:
            if endswith(the) == True: return 'AT'
            if endswith(the) == False:
              if endswith(s) == True:
                if endswith(is) == True: return 'BEZ'
                if endswith(is) == False: return 'VBZ'
              if endswith(s) == False:
                if endswith(.) == True: return '.'
                if endswith(.) == False: return 'NN'

      Source: http://www.nltk.org/book/ch06.html
  15. And if you like SciKit-Learn

          from nltk.classify import SklearnClassifier
          from sklearn.svm import SVC

          classif = SklearnClassifier(SVC()).train(train_data)
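
      SklearnClassifier wraps a scikit-learn estimator behind NLTK's classifier interface, so it trains on the same list of (featureset, label) pairs as nltk.NaiveBayesClassifier; for example, the made-up train_data from the sentiment sketch on slide 12 would work here unchanged:

          from nltk.classify import SklearnClassifier
          from sklearn.svm import SVC

          classif = SklearnClassifier(SVC()).train(train_data)
          print(classif.classify(bag_of_words('really great talk')))
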
  16. TextBlob example

          from textblob import TextBlob

          wiki = TextBlob("Python is a high-level, general-purpose programming language.")
          wiki.tags
          [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'),
           ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]

          testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
          testimonial.sentiment
          Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)