
Text analysis and Visualization

Irene Ros
November 15, 2016


Text is one of the most interesting and varied data sources on the web and beyond, but it is also one of the most difficult to work with because it is fundamentally messy, fragmented, and unnormalized. If you have ever wanted to analyze and visualize text but don't know where to start, this talk is for you.
Delivered at Plotcon 2016


Transcript

  1. Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.

     https://www.gutenberg.org/files/11/11-h/11-h.htm
  2. 13 the  11 it  9 to  8 of  8 her  8 a  7 she  7 and  5 was  5 rabbit
     4 very  4 or  4 in  4 alice
     3 with  3 when  3 that  3 so  3 out  3 had  3 but  3 at  3 '
     2 watch  2 waistcoat  2 time  2 thought  2 sister  2 ran  2 pocket  2 pictures  2 on  2 nothing  2 mind  2 for  2 dear  2 conversations  2 by  2 book  2 be  2 as  2 across
     1 would  1 worth  1 wondered  1 white  1 whether  1 what  1 well  1 way  1 use  1 up  1 under  1 twice  1 trouble  1 took  1 tired  1 this  1 think  1 there  1 then  1 take  1 suddenly  1 stupid  1 started  1 sleepy  1 sitting  1 shall  1 seen  1 seemed  1 see  1 say  1 remarkable  1 reading  1 quite  1 pop  1 pleasure  1 pink  1 picking  1 peeped  1 own  1 over  1 ought  1 once  1 oh  1 occurred  1 nor  1 no  1 never  1 natural  1 much  1 making  1 made  1 looked  1 late  1 large  1 just  1 itself  1 its  1 is  1 into  1 i  1 hurried  1 hot  1 hole  1 hedge  1 hear  1 having  1 have  1 getting  1 get  1 fortunately  1 flashed  1 field  1 feet  1 feel  1 eyes  1 either  1 down  1 do  1 did  1 day  1 daisy  1 daisies  1 curiosity  1 could  1 considering  1 close  1 chain  1 burning  1 beginning  1 before  1 bank  1 all  1 afterwards  1 after  1 after  1 actually  1 'without  1 'oh  1 'and
  3. import nltk
     from collections import Counter

     tokens = nltk.word_tokenize(text)
     counts = Counter(tokens)
     sorted_counts = sorted(counts.items(), key=lambda count: count[1], reverse=True)

     sorted_counts
     [(',', 2418), ('the', 1516), ("'", 1129), ('.', 975), ('and', 757), ('to', 717), ('a', 612), ('it', 513), ('she', 507), ('of', 496), ('said', 456), ('!', 450), ('Alice', 394), ('I', 374),...]
  4. REMOVE PUNCTUATION

     # starting point for punctuation: python's string module
     # punctuation is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
     from string import punctuation

     def remove_tokens(tokens, tokens_to_remove):
         return [token for token in tokens if token not in tokens_to_remove]

     no_punc_tokens = remove_tokens(tokens, punctuation)

     before: ['CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning', 'to',...]
     after:  ['CHAPTER', 'I', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning', 'to', 'get',...]
  5. NORMALIZE CASE

     # downcase every token
     def lowercase(tokens):
         return [token.lower() for token in tokens]

     lowercased_tokens = lowercase(no_punc_tokens)

     before: ['CHAPTER', 'I', 'Down', 'Rabbit-Hole', 'Alice', 'beginning', 'get', 'tired', 'sitting',...]
     after:  ['chapter', 'i', 'down', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting',...]
  6. REMOVE STOP WORDS

     # import stopwords from nltk
     from nltk.corpus import stopwords

     stops = stopwords.words('english')
     # stop words look like:
     # [u'i', u'my', u'myself', u'we', u'our',
     #  u'ours', u'you'...]
     filtered_tokens = remove_tokens(lowercased_tokens, stops)

     before: ['chapter', 'i', 'down', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting',...]
     after:  ['chapter', 'down', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting', 'sister',...]
  7. REMOVE FRAGMENTS

     # removes fragmented tokens like: n't, 's
     def remove_word_fragments(tokens):
         return [token for token in tokens if "'" not in token]

     no_frag_tokens = remove_word_fragments(filtered_tokens)

     before: ['chapter', 'down', 'rabbit-hole', "n't", 'beginning', "'s", 'tired', 'sitting', 'sister',...]
     after:  ['chapter', 'down', 'rabbit-hole', 'beginning', 'tired', 'sitting', 'sister',...]
  8. STEMMING

     Converts words to their 'base' form, for example:
     regular = ['house', 'housing', 'housed']
     stemmed = ['hous', 'hous', 'hous']

     from nltk.stem import PorterStemmer

     stemmer = PorterStemmer()
     stemmed_tokens = [stemmer.stem(token) for token in no_frag_tokens]

     before: ['chapter', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing',...]
     after:  ['chapter', 'rabbit-hol', 'alic', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'noth',...]
  9. IMPROVED COUNTS

     before: [(',', 2418), ('the', 1516), ("'", 1129), ('.', 975), ('and', 757), ('to', 717), ('a', 612), ('it', 513), ('she', 507), ('of', 496), ('said', 456), ('!', 450), ('Alice', 394), ('I', 374), ('was', 362), ('in', 351), ('you', 337), ('that', 267), ('--', 264),...]
     after:  [('said', 462), ('alic', 396), ('littl', 128), ('look', 102), ('one', 100), ('like', 97), ('know', 91), ('would', 90), ('could', 86), ('went', 83), ('thought', 80), ('thing', 79), ('queen', 76), ('go', 75), ('time', 74), ('say', 70), ('see', 68), ('get', 66), ('king', 64),...]
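     Editor's note: putting slides 3-8 together, a minimal end-to-end sketch that produces counts like these (assuming the variable text already holds the book's plain text; this is a sketch, not code from the deck, and exact numbers depend on the tokenizer and NLTK version):

     import nltk
     from collections import Counter
     from string import punctuation
     from nltk.corpus import stopwords
     from nltk.stem import PorterStemmer

     # tokenize, drop punctuation tokens, and lowercase
     tokens = nltk.word_tokenize(text)
     tokens = [t.lower() for t in tokens if t not in punctuation]

     # drop stop words and fragments like n't and 's
     stops = set(stopwords.words('english'))
     tokens = [t for t in tokens if t not in stops and "'" not in t]

     # stem and count
     stemmer = PorterStemmer()
     counts = Counter(stemmer.stem(t) for t in tokens)
     counts.most_common(10)  # e.g. [('said', 462), ('alic', 396), ...]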
  10. POS-TAGGING

      POS-tag your raw tokens - punctuation and capitalization matter.

      tagged = nltk.pos_tag(tokens)

      [('CHAPTER', 'NN'), ('I', 'PRP'), ('.', '.'), ('Down', 'RP'), ('the', 'DT'), ('Rabbit-Hole', 'JJ'), ('Alice', 'NNP'), ('was', 'VBD'), ('beginning', 'VBG'), ('to', 'TO'), ('get', 'VB'), ('very', 'RB'), ('tired', 'JJ'), ('of', 'IN'), ('sitting', 'VBG'), ('by', 'IN'), ('her', 'PRP$'), ('sister', 'NN'),...]
  11. CONCORDANCE (KEYWORD IN CONTEXT)

      my_text = nltk.Text(tokens)
      my_text.concordance('Alice')

      Alice was beginning to get very tired of s
      hat is the use of a book , ' thought Alice 'without pictures or conversations ?
      so VERY remarkable in that ; nor did Alice think it so VERY much out of the way
      looked at it , and then hurried on , Alice started to her feet , for it flashed
      hedge . In another moment down went Alice after it , never once considering ho
      ped suddenly down , so suddenly that Alice had not a moment to think about stop
      she fell past it . 'Well ! ' thought Alice to herself , 'after such a fall as t
      own , I think -- ' ( for , you see , Alice had learnt several things of this so
      tude or Longitude I 've got to ? ' ( Alice had no idea what Latitude was , or L
      . There was nothing else to do , so Alice soon began talking again . 'Dinah 'l
      ats eat bats , I wonder ? ' And here Alice began to get rather sleepy , and wen
      dry leaves , and the fall was over . Alice was not a bit hurt , and she jumped
      not a moment to be lost : away went Alice like the wind , and was just in time
      but they were all locked ; and when Alice had been all the way down one side a
  12. N-GRAMS (COLLOCATIONS)

      A set of words that occur together more often than chance.

      from nltk.collocations import BigramCollocationFinder

      finder = BigramCollocationFinder.from_words(filtered_tokens)

      # built-in bigram metrics are in here
      bigram_measures = nltk.collocations.BigramAssocMeasures()

      # we call score_ngrams on the finder to produce a sorted list
      # of bigrams. Each comes with its score from the metric, which
      # is how they are sorted.
      finder.score_ngrams(bigram_measures.raw_freq)

      [(('said', 'the'), 0.007674512539933169), (('of', 'the'), 0.004700179928762898), (('said', 'alice'), 0.004259538060441376), (('in', 'a'), 0.0035618551022656335), (('and', 'the'), 0.002900892299783351),...]
  13. N-GRAMS (COLLOCATIONS)

      A set of words that occur together more often than chance.

      finder.score_ngrams(bigram_measures.likelihood_ratio)

      [(('mock', 'turtle'), 781.0917141765854), (('said', 'the'), 597.9581706687363), (('said', 'alice'), 505.46971076855675), (('march', 'hare'), 461.91931122768904), (('went', 'on'), 376.6417465508724), (('do', "n't"), 372.7029564560615), (('the', 'queen'), 351.39319634691446), (('the', 'king'), 342.27277302768084), (('in', 'a'), 341.4084817025905), (('the', 'gryphon'), 278.40108569878106),...]
  14. CO-OCCURRENCE

      my_text = nltk.Text(tokens)
      my_text.findall('<.*><of><.*>')

      tired of sitting; and of having; use of a; pleasure of making; trouble of getting; out of the; out of it; plenty of time; sides of the; one of the; fear of killing; one of the; nothing of tumbling; top of the; centre of the; things of this; name of the; saucer of milk; sort of way; heap of sticks; row of lamps; made of solid; one of the; doors of the; any of them; out of that; beds of bright; be of very; book of rules; neck of the; sort of mixed; flavour of cherry-tart; flame of a; one of the; legs of the; game of croquet; fond of pretending; enough of me; top of her; way of expecting; Pool of Tears; out of sight; pair of boots; roof of the; ashamed of yourself; gallons of tears; pattering of feet; pair of white; help of any; were of the; any of them; sorts of things; capital of Paris; capital of Rome; waters of the; burst of tears; tired of being; one of the; cause of this; number of bathing; row of lodging; pool of tears; be of any; out of this; tired of swimming; way of speaking; -- of a; one of its; knowledge of history; out of the; end of his; subject of conversation; -- of --;
  15. The occurrence of "cat" in an article in the New York Times: significant.
      The occurrence of "cat" in an article in Cat Weekly Magazine: not significant.
  16. TF

      Term Frequency (TF) = (number of times a term appears in a document) / (total number of terms in the document)

      Example: 1 document, 100 words, 3 occurrences of "cat": 3 / 100 = 0.03
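      Editor's note: the TF example can be checked in a couple of lines (helper name is mine, not from the deck; assumes Python 3 division):

      def term_frequency(term_count, total_terms):
          # share of the document's tokens accounted for by this term
          return term_count / total_terms

      term_frequency(3, 100)  # 0.03 -- "cat" appears 3 times in a 100-word document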
  17. IDF

      Inverse Document Frequency (IDF) = log( total number of documents / number of documents with term t in it )

      Example: 10 million documents, 1,000 containing "cat": log(10,000,000 / 1,000) = 4
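      Editor's note: this example uses a base-10 logarithm; a small sketch (helper name is mine, not from the deck):

      import math

      def inverse_document_frequency(total_docs, docs_with_term):
          # base-10 log, to match the 10,000,000 / 1,000 example above
          return math.log10(total_docs / docs_with_term)

      inverse_document_frequency(10000000, 1000)  # 4.0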
  18. A high TF*IDF weight is reached by a high term frequency in a given document and a low document frequency for the term across the collection. As the term appears in more documents, the ratio inside the logarithm approaches 1, bringing the IDF and TF-IDF closer to zero.

      Examples, assuming each document that contains the word "cat" has it 3 times and has a total of 100 words:

      10,000 documents, 1 document containing "cat":        (3 / 100) * ln(10,000 / 1)      = 0.2763102111592855
      10,000 documents, 100 documents containing "cat":     (3 / 100) * ln(10,000 / 100)    = 0.13815510557964275
      10,000 documents, all 10,000 containing "cat":        (3 / 100) * ln(10,000 / 10,000) = 0.0
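      Editor's note: the three scenarios on this slide use the natural log; a sketch that reproduces them (function name is mine, not from the deck):

      import math

      def tf_idf(tf, total_docs, docs_with_term):
          # natural log here, to match the slide's numbers
          return tf * math.log(total_docs / docs_with_term)

      tf = 3 / 100  # "cat" appears 3 times in a 100-word document

      tf_idf(tf, 10000, 1)      # 0.2763... -- rare term, high weight
      tf_idf(tf, 10000, 100)    # 0.1381...
      tf_idf(tf, 10000, 10000)  # 0.0      -- term in every document, no weight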
  19. Many Bills: Engaging Citizens through Visualizations of Congressional Legislation
      Yannick Assogba, Irene Ros, Joan DiMicco, Matt McKeon, IBM Research
      http://clome.info/papers/manybills_chi.pdf
  20. Document collection

      Document: cat (5), house, dog, monkey
      Document: air, strike (2), tanker, machine
      Document: flight (2), strike, machine, guns
      Document: light, air (4), balloon, flight
      Document: cat, scratch, blood (2), hospital
      Document: flight (4), commercial, aviation
      ...
  21. cat house dog monkey air strike tanker machine flight guns light balloon scratch blood hospital commercial aviation

      [5,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0]
      ...
      [0,0,0,0,1,2,1,1,0,0,0,0,0,0,0,0]
      [0,0,0,0,0,1,0,1,2,1,0,0,0,0,0,0]
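      Editor's note: a small sketch of how count vectors like these can be built with just the standard library (the token lists below are stand-ins for the slide's illustration, not real data, and the vocabulary here is sorted alphabetically rather than hand-ordered as on the slide):

      from collections import Counter

      docs = [
          ['cat', 'cat', 'cat', 'cat', 'cat', 'house', 'dog', 'monkey'],
          ['air', 'strike', 'strike', 'tanker', 'machine'],
          ['flight', 'flight', 'strike', 'machine', 'guns'],
      ]

      # shared vocabulary: every term seen anywhere in the collection, in a fixed order
      vocabulary = sorted({term for doc in docs for term in doc})

      def to_vector(doc):
          # one count per vocabulary term; 0 where the term doesn't appear
          counts = Counter(doc)
          return [counts[term] for term in vocabulary]

      vectors = [to_vector(doc) for doc in docs]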
  22. TOOLS, NO PROGRAMMING

      • AntConc (http://www.laurenceanthony.net/software.html)
      • Overview Project (https://www.overviewdocs.com/)
      • Voyant (http://voyant-tools.org/)
      • Lexos (http://lexos.wheatoncollege.edu/upload)
      • Word and Phrase (http://www.wordandphrase.info/)
      • CorpKit (http://interrogator.github.io/corpkit/index.html)

      Tool collections:
      • DiRT tools - http://dirtdirectory.org/
      • TAPoR (http://tapor.ca/home)
      • http://guides.library.duke.edu/c.php?g=289707&p=1930856

      Many of these are from a great talk by Lynn Cherny - http://ghostweather.slides.com/lynncherny/text-data-analysis-without-programming
  23. NOT COVERED, BUT NOTEWORTHY

      • Topic Modeling
      • Sentiment Analysis
      • Entity Extraction
      • Word2Vec
      • Neural Networks
      • Search
      • Historic Trends