
Text Analysis and Visualization

Irene Ros
November 15, 2016


Text is one of the most interesting and varied data sources on the web and beyond, but it is one of the most difficult to deal with because it is fundamentally a messy, fragmented, and unnormalized format. If you have ever wanted to analyze and visualize text, but don’t know where to get started, this talk is for you.
Delivered at Plotcon 2016


Transcript

  1. Text Analysis and
    Visualization
    Irene Ros
    @ireneros
    http://bocoup.com/datavis


  3. Bocoup Datavis Team
    http://bocoup.com/datavis
    Data Science & Visualization Design & Application Development


  4. https://openvisconf.com


  5. WHY TEXT?


  6. http://www.nbcnews.com/politics/2016-election/trump-shocks-awes-final-new-hampshire-rally-primary-n514266


  7. TEXT IS DATA TOO
    Document Collections


  8. SINGLE DOCUMENT
    Measurements
    Clean up
    Structure
    Word Relationships


  9. Alice was beginning to get very tired of sitting by her sister on the bank, and of having
    nothing to do: once or twice she had peeped into the book her sister was reading, but
    it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice
    'without pictures or conversations?'
    So she was considering in her own mind (as well as she could, for the hot day made
    her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would
    be worth the trouble of getting up and picking the daisies, when suddenly a White
    Rabbit with pink eyes ran close by her.
    There was nothing so very remarkable in that; nor did Alice think it so very much out
    of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when
    she thought it over afterwards, it occurred to her that she ought to have wondered at
    this, but at the time it all seemed quite natural); but when the Rabbit actually took a
    watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started
    to her feet, for it flashed across her mind that she had never before seen a rabbit with
    either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she
    ran across the field after it, and fortunately was just in time to see it pop down a large
    rabbit-hole under the hedge.
    https://www.gutenberg.org/files/11/11-h/11-h.htm


  10. MEASUREMENTS
    BASIC COUNTS
    basic units of text analysis...


  11. 13 the
    11 it
    9 to
    8 of
    8 her
    8 a
    7 she
    7 and
    5 was
    5 rabbit
    4 very
    4 or
    4 in
    4 alice
    3 with
    3 when
    3 that
    3 so
    3 out
    3 had
    3 but
    3 at
    3 '
    2 watch
    2 waistcoat
    2 time
    2 thought
    2 sister
    2 ran
    2 pocket
    2 pictures
    2 on
    2 nothing
    2 mind
    2 for
    2 dear
    2 conversations
    2 by
    2 book
    2 be
    2 as
    2 across
    1 would
    1 worth
    1 wondered
    1 white
    1 whether
    1 what
    1 well
    1 way
    1 use
    1 up
    1 under
    1 twice
    1 trouble
    1 took
    1 tired
    1 this
    1 think
    1 there
    1 then
    1 take
    1 suddenly
    1 stupid
    1 started
    1 sleepy
    1 sitting
    1 shall
    1 seen
    1 seemed
    1 see
    1 say
    1 remarkable
    1 reading
    1 quite
    1 pop
    1 pleasure
    1 pink
    1 picking
    1 peeped
    1 own
    1 over
    1 ought
    1 once
    1 oh
    1 occurred
    1 nor
    1 no
    1 never
    1 natural
    1 much
    1 making
    1 made
    1 looked
    1 late
    1 large
    1 just
    1 itself
    1 its
    1 is
    1 into
    1 i
    1 hurried
    1 hot
    1 hole
    1 hedge
    1 hear
    1 having
    1 have
    1 getting
    1 get
    1 fortunately
    1 flashed
    1 field
    1 feet
    1 feel
    1 eyes
    1 either
    1 down
    1 do
    1 did
    1 day
    1 daisy
    1 daisies
    1 curiosity
    1 could
    1 considering
    1 close
    1 chain
    1 burning
    1 beginning
    1 before
    1 bank
    1 all
    1 afterwards
    1 after
    1 actually
    1 'without
    1 'oh
    1 'and


  12. http://wordle.net


  13. http://graphics.wsj.com/elections/2016/democratic-debate-charts/


  14. CODE TIME!
    Python + NLTK


    import nltk
    from collections import Counter
    tokens = nltk.word_tokenize(text)
    counts = Counter(tokens)
    sorted_counts = sorted(counts.items(), key=lambda count: count[1], reverse=True)
    sorted_counts
    [(',', 2418),
    ('the', 1516),
    ("'", 1129),
    ('.', 975),
    ('and', 757),
    ('to', 717),
    ('a', 612),
    ('it', 513),
    ('she', 507),
    ('of', 496),
    ('said', 456),
    ('!', 450),
    ('Alice', 394),
    ('I', 374),...]


  16. Remove punctuation
    Remove stop words
    Normalize the case
    Remove fragments
    Stemming
    CLEAN-UP
    slight diversion...


  17. REMOVE PUNCTUATION
    # starting point for punctuation from the python string module
    # punctuation is '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    from string import punctuation
    def remove_tokens(tokens, tokens_to_remove):
        return [token for token in tokens if token not in tokens_to_remove]
    no_punc_tokens = remove_tokens(tokens, punctuation)
    ['CHAPTER',
    'I',
    'Down',
    'the',
    'Rabbit-Hole',
    'Alice',
    'was',
    'beginning',
    'to',
    'get',...]
    ['CHAPTER',
    'I',
    '.',
    'Down',
    'the',
    'Rabbit-Hole',
    'Alice',
    'was',
    'beginning',
    'to',...] before


  18. NORMALIZE CASE
    # downcase every token
    def lowercase(tokens):
        return [token.lower() for token in tokens]
    lowercase(no_punc_tokens)
    ['chapter',
    'i',
    'down',
    'rabbit-hole',
    'alice',
    'beginning',
    'get',
    'tired',
    'sitting',...]
    ['CHAPTER',
    'I',
    'Down',
    'Rabbit-Hole',
    'Alice',
    'beginning',
    'get',
    'tired',
    'sitting',...]
    before


  19. REMOVE STOP WORDS
    # import stopwords from nltk
    from nltk.corpus import stopwords
    stops = stopwords.words('english')
    # stop words look like:
    # [u'i', u'my', u'myself', u'we', u'our',
    # u'ours', u'you'...]
    # filter the lowercased tokens (the "before" list below is lowercase)
    filtered_tokens = remove_tokens(lowercase(no_punc_tokens), stops)
    before
    ['chapter',
    'i',
    'down',
    'rabbit-hole',
    'alice',
    'beginning',
    'get',
    'tired',
    'sitting',...]
    ['chapter',
    'down',
    'rabbit-hole',
    'alice',
    'beginning',
    'get',
    'tired',
    'sitting',
    'sister',...]


  20. REMOVE FRAGMENTS
    # Removes fragmented words like: n't, 's
    def remove_word_fragments(tokens):
        return [token for token in tokens if "'" not in token]
    no_frag_tokens = remove_word_fragments(filtered_tokens)
    before
    ['chapter',
    'down',
    'rabbit-hole',
    'beginning',
    'tired',
    'sitting',
    'sister',...]
    ['chapter',
    'down',
    'rabbit-hole',
    'n't',
    'beginning',
    ''s',
    'tired',
    'sitting',
    'sister',...]


  21. STEMMING
    Converts words to their 'base' form, for example:
    regular = ['house', 'housing', 'housed']
    stemmed = ['hous', 'hous', 'hous']
    from nltk.stem import PorterStemmer
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in no_frag_tokens]
    ['chapter',
    'rabbit-hol',
    'alic',
    'begin',
    'get',
    'tire',
    'sit',
    'sister',
    'bank',
    'noth',...]
    ['chapter',
    'rabbit-hole',
    'alice',
    'beginning',
    'get',
    'tired',
    'sitting',
    'sister',
    'bank',
    'nothing',...]
    before
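The clean-up steps from slides 16-21 chain naturally into one small pipeline. Here is a hedged, dependency-free sketch of that chain: it assumes a pre-split token list instead of NLTK's tokenizer, and `STOPS` is a tiny illustrative stop-word set of my own, far shorter than NLTK's, so outputs will differ from the slides' results.

```python
from string import punctuation

# A toy stop-word list for illustration only; NLTK's English list is much longer.
STOPS = {'the', 'a', 'of', 'and', 'to', 'was', 'she', 'her', 'i'}

def clean(tokens):
    """Apply the slides' clean-up steps in order: drop punctuation,
    lowercase, drop stop words, drop fragments like n't and 's."""
    tokens = [t for t in tokens if t not in punctuation]  # remove punctuation
    tokens = [t.lower() for t in tokens]                  # normalize case
    tokens = [t for t in tokens if t not in STOPS]        # remove stop words
    tokens = [t for t in tokens if "'" not in t]          # remove fragments
    return tokens

raw = "Alice was beginning to get very tired , and she did n't know why .".split()
print(clean(raw))
# ['alice', 'beginning', 'get', 'very', 'tired', 'did', 'know', 'why']
```

Stemming is left out here because it genuinely needs NLTK's `PorterStemmer` (shown on the previous slide); everything else is plain Python.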


  22. IMPROVED COUNTS:
    [('said', 462),
    ('alic', 396),
    ('littl', 128),
    ('look', 102),
    ('one', 100),
    ('like', 97),
    ('know', 91),
    ('would', 90),
    ('could', 86),
    ('went', 83),
    ('thought', 80),
    ('thing', 79),
    ('queen', 76),
    ('go', 75),
    ('time', 74),
    ('say', 70),
    ('see', 68),
    ('get', 66),
    ('king', 64),...]
    [(',', 2418),
    ('the', 1516),
    ("'", 1129),
    ('.', 975),
    ('and', 757),
    ('to', 717),
    ('a', 612),
    ('it', 513),
    ('she', 507),
    ('of', 496),
    ('said', 456),
    ('!', 450),
    ('Alice', 394),
    ('I', 374),
    ('was', 362),
    ('in', 351),
    ('you', 337),
    ('that', 267),
    ('--', 264),...]
    before


  23. STRUCTURE
    PART-OF-SPEECH TAGGING
    back to our regular format...


  24. PART OF SPEECH TAGGING
    http://cogcomp.cs.illinois.edu/page/demo_view/pos


  25. http://cogcomp.cs.illinois.edu/page/demo_view/pos


  26. http://tvtropes.org/pmwiki/pmwiki.php/Main/DamselInDistress



  28. http://stereotropes.bocoup.com/tropes/DamselInDistress


  29. http://stereotropes.bocoup.com/gender


  30. POS-TAGGING
    POS tag your raw tokens - punctuation and capitalization matter
    tagged = nltk.pos_tag(tokens)
    [('CHAPTER', 'NN'),
    ('I', 'PRP'),
    ('.', '.'),
    ('Down', 'RP'),
    ('the', 'DT'),
    ('Rabbit-Hole', 'JJ'),
    ('Alice', 'NNP'),
    ('was', 'VBD'),
    ('beginning', 'VBG'),
    ('to', 'TO'),
    ('get', 'VB'),
    ('very', 'RB'),
    ('tired', 'JJ'),
    ('of', 'IN'),
    ('sitting', 'VBG'),
    ('by', 'IN'),
    ('her', 'PRP$'),
    ('sister', 'NN'),...]


  31. WORD RELATIONSHIPS
    CONCORDANCE, N-GRAMS, CO-OCCURRENCE


  32. CONCORDANCE
    my_text = nltk.Text(tokens)
    my_text.concordance('Alice')
    Alice was beginning to get very tired of s
    hat is the use of a book , ' thought Alice 'without pictures or conversations ?
    so VERY remarkable in that ; nor did Alice think it so VERY much out of the way
    looked at it , and then hurried on , Alice started to her feet , for it flashed
    hedge . In another moment down went Alice after it , never once considering ho
    ped suddenly down , so suddenly that Alice had not a moment to think about stop
    she fell past it . 'Well ! ' thought Alice to herself , 'after such a fall as t
    own , I think -- ' ( for , you see , Alice had learnt several things of this so
    tude or Longitude I 've got to ? ' ( Alice had no idea what Latitude was , or L
    . There was nothing else to do , so Alice soon began talking again . 'Dinah 'l
    ats eat bats , I wonder ? ' And here Alice began to get rather sleepy , and wen
    dry leaves , and the fall was over . Alice was not a bit hurt , and she jumped
    not a moment to be lost : away went Alice like the wind , and was just in time
    but they were all locked ; and when Alice had been all the way down one side a
    KEYWORD IN CONTEXT


  33. CONCORDANCE


  34. https://www.washingtonpost.com/graphics/politics/2016-election/debates/oct-13-speakers/
    CONCORDANCE


  35. CONCORDANCE: WORD TREE
    https://www.jasondavies.com/wordtree/?source=alice-in-wonderland.txt&prefix=She


  36. http://www.chrisharrison.net/index.php/Visualizations/WordSpectrum
    Visualizing Google's Bi-Gram Data
    N-GRAMS (COLLOCATIONS)


  37. http://www.chrisharrison.net/index.php/Visualizations/WordAssociations
    N-GRAMS (COLLOCATIONS)


  38. N-GRAMS (COLLOCATIONS)
    A set of words that occur together more often than chance.
    from nltk.collocations import BigramCollocationFinder
    finder = BigramCollocationFinder.from_words(filtered_tokens)
    # built in bigram metrics are in here
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    # we call score_ngrams on the finder to produce a sorted list
    # of bigrams. Each comes with its score from the metric, which
    # is how they are sorted.
    finder.score_ngrams(bigram_measures.raw_freq)
    [(('said', 'the'), 0.007674512539933169),
    (('of', 'the'), 0.004700179928762898),
    (('said', 'alice'), 0.004259538060441376),
    (('in', 'a'), 0.0035618551022656335),
    (('and', 'the'), 0.002900892299783351),...]


  39. N-GRAMS (COLLOCATIONS)
    A set of words that occur together more often than chance.
    finder.score_ngrams(bigram_measures.likelihood_ratio)
    [(('mock', 'turtle'), 781.0917141765854),
    (('said', 'the'), 597.9581706687363),
    (('said', 'alice'), 505.46971076855675),
    (('march', 'hare'), 461.91931122768904),
    (('went', 'on'), 376.6417465508724),
    (('do', "n't"), 372.7029564560615),
    (('the', 'queen'), 351.39319634691446),
    (('the', 'king'), 342.27277302768084),
    (('in', 'a'), 341.4084817025905),
    (('the', 'gryphon'), 278.40108569878106),...]


  40. Phrase Net: X begat Y
    CO-OCCURRENCE
    Frank Van Ham, http://ieeexplore.ieee.org/ieee_pilot/articles/06/ttg2009061169/article.html#article


  41. Phrase Net: X of Y
    old testament new testament


  42. CO-OCCURRENCE
    my_text = nltk.Text(tokens)
    my_text.findall('<.*><of><.*>')
    tired of sitting; and of having; use of a; pleasure of making; trouble
    of getting; out of the; out of it; plenty of time; sides of the; one
    of the; fear of killing; one of the; nothing of tumbling; top of the;
    centre of the; things of this; name of the; saucer of milk; sort of
    way; heap of sticks; row of lamps; made of solid; one of the; doors of
    the; any of them; out of that; beds of bright; be of very; book of
    rules; neck of the; sort of mixed; flavour of cherry-tart; flame of a;
    one of the; legs of the; game of croquet; fond of pretending; enough
    of me; top of her; way of expecting; Pool of Tears; out of sight; pair
    of boots; roof of the; ashamed of yourself; gallons of tears;
    pattering of feet; pair of white; help of any; were of the; any of
    them; sorts of things; capital of Paris; capital of Rome; waters of
    the; burst of tears; tired of being; one of the; cause of this; number
    of bathing; row of lodging; pool of tears; be of any; out of this;
    tired of swimming; way of speaking; -- of a; one of its; knowledge of
    history; out of the; end of his; subject of conversation; -- of --;


  43. COLLECTIONS OF
    DOCUMENTS
    Grouping/Clustering
    Comparison


  44. SIGNIFICANCE
    TF-IDF
    TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY


  45. The occurrence of "cat"
    in an article in
    New York Times
    Significant
    The occurrence of "cat"
    in an article in
    Cat Weekly Magazine
    Not Significant


  46. TF
    Term Frequency (TF) =
    (number of times a term appears in a document) /
    (total number of terms in the document)
    Example: 1 document of 100 words containing "cat" 3 times:
    TF = 3 / 100 = 0.03


  47. IDF
    Inverse Document Frequency (IDF) =
    log( (total number of documents) /
    (number of documents with term t in it) )
    Example: 10 million documents, 1,000 containing "cat":
    IDF = log(10,000,000 / 1,000) = 4


  48. A high weight in TF*IDF is reached by a high term
    frequency (in a given document) and a low frequency in
    the number of documents that contain that term. As the
    term appears in more documents, the ratio inside the
    logarithm approaches 1, bringing the IDF and TF-IDF
    closer to zero.
    10,000 documents, 1 document containing "cat":
    (3 / 100) * ln(10,000 / 1) = 0.2763102111592855
    10,000 documents, 100 documents containing "cat":
    (3 / 100) * ln(10,000 / 100) = 0.13815510557964275
    10,000 documents, all 10,000 documents containing "cat":
    (3 / 100) * ln(10,000 / 10,000) = 0.0
    *assuming each document that contains the word "cat" has it in it 3 times and has a total of 100 words
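The arithmetic on this slide can be reproduced with a few lines of Python. This is just the slide's formula restated as a function; the function and parameter names are mine, and it uses the natural log, matching the worked examples.

```python
import math

def tf_idf(term_count, doc_length, n_docs, n_docs_with_term):
    """TF-IDF from raw counts: term frequency weighted by inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(n_docs / n_docs_with_term)  # natural log, as in the examples above
    return tf * idf

# 3 "cat"s in a 100-word document; 100 of 10,000 documents contain "cat":
print(tf_idf(3, 100, 10_000, 100))     # 0.13815510557964275
# ...and when every document contains "cat", the weight collapses to zero:
print(tf_idf(3, 100, 10_000, 10_000))  # 0.0
```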


  49. Visualizing Email Content:
    Portraying Relationships from
    Conversational Histories
    Fernanda B. Viégas , 2006
    Themail


  50. GROUPING
    CLASSIFICATION, CLUSTERING


  51. World
    Sports
    Entertainment
    Life
    Arts
    News Classifier


  52. News Classifier
    New article, without a subject assigned yet
    World
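The news-classifier idea on these two slides can be sketched in a few lines. This is a crude word-overlap scorer, not the real Naive Bayes or similar model a production classifier would use; the training examples, labels, and smoothing constant are all made up for illustration.

```python
from collections import Counter

# Toy training data: (tokens, label). A real classifier would train on
# thousands of labeled articles, typically with TF-IDF-weighted features.
training = [
    (['election', 'minister', 'treaty'], 'World'),
    (['goal', 'match', 'league'], 'Sports'),
    (['film', 'premiere', 'actor'], 'Entertainment'),
]

def train(examples):
    """Build per-label word counts from labeled token lists."""
    model = {}
    for tokens, label in examples:
        model.setdefault(label, Counter()).update(tokens)
    return model

def classify(model, tokens, smoothing=0.1):
    """Assign the label whose training words best overlap the new article."""
    def score(label):
        counts = model[label]
        return sum(counts.get(t, smoothing) for t in tokens)
    return max(model, key=score)

model = train(training)
print(classify(model, ['treaty', 'election', 'vote']))  # 'World'
```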


  53. Many Bills: Engaging Citizens through
    Visualizations of Congressional Legislation
    Yannick Assogba, Irene Ros, Joan DiMicco, Matt McKeon
    IBM Research
    http://clome.info/papers/manybills_chi.pdf


  55. COMPARISON
    COSINE SIMILARITY, CLUSTERING


  56. Mike Bostock, Sean Carter, http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html


  57. Document
    Document
    Document
    Document
    Document
    ...
    Document
    collection
    cat (5), house, dog, monkey
    air, strike(2), tanker, machine
    flight(2), strike, machine, guns
    light, air(4), balloon, flight
    cat, scratch, blood(2), hospital
    flight(4), commercial, aviation


  58. cat
    house
    dog
    monkey
    air
    strike
    tanker
    machine
    flight
    guns
    light
    balloon
    scratch
    blood
    hospital
    commercial
    aviation
    [5,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0]
    ...
    [0,0,0,0,1,2,1,1,0,0,0,0,0,0,0,0,0]
    [0,0,0,0,0,1,0,1,2,1,0,0,0,0,0,0,0]
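Once documents are term-count vectors like these, the comparison step (the cosine similarity named on slide 55) reduces to the cosine of the angle between two vectors. A minimal sketch; the 17-entry vectors are my padding of the slide's example to match the 17-term vocabulary shown above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors:
    1.0 for identical direction, 0.0 for documents sharing no terms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Vectors over the 17-term vocabulary above (cat, house, ..., aviation):
doc_cat  = [5,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0]  # cat (5), house, dog, monkey
doc_air  = [0,0,0,0,1,2,1,1,0,0,0,0,0,0,0,0,0]  # air, strike (2), tanker, machine
doc_guns = [0,0,0,0,0,1,0,1,2,1,0,0,0,0,0,0,0]  # flight (2), strike, machine, guns

print(cosine_similarity(doc_air, doc_guns))  # shared "strike"/"machine": ~0.43
print(cosine_similarity(doc_air, doc_cat))   # no shared terms: 0.0
```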


  59. K-MEANS CLUSTERING
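K-means itself is compact enough to sketch without a library. This toy version is my own, not from the talk (real work would reach for scikit-learn's `KMeans`); it alternates between assigning points to the nearest centroid and re-averaging, on made-up 2-D "document" points.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: seed k centroids from the data, then repeatedly
    assign each point to its nearest centroid and re-average."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        # (keeping the old centroid if a cluster went empty).
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two well-separated groups of 2-D "document vectors":
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
clusters = kmeans(points, 2)
```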


  60. TOOLS
    • Textkit
    http://learntextvis.github.io/textkit/


  61. TOOLS, NO PROGRAMMING
    • AntConc (http://www.laurenceanthony.net/software.html)
    • Overview Project (https://www.overviewdocs.com/)
    • Voyant (http://voyant-tools.org/)
    • Lexos (http://lexos.wheatoncollege.edu/upload)
    • Word and Phrase (http://www.wordandphrase.info/)
    • CorpKit (http://interrogator.github.io/corpkit/index.html)
    Tool collections:
    • DiRT tools - http://dirtdirectory.org/
    • TAPoR (http://tapor.ca/home)
    • http://guides.library.duke.edu/c.php?g=289707&p=1930856
    Many of these from a great talk by Lynn Cherny - http://ghostweather.slides.com/lynncherny/text-data-analysis-without-programming


  62. NOT COVERED, BUT NOTEWORTHY
    • Topic Modeling
    • Sentiment Analysis
    • Entity Extraction
    • Word2Vec
    • Neural networks
    • Search
    • Historic Trends


  63. GO VISUALIZE SOME
    WORDS


  64. Irene Ros
    [email protected]
    @ireneros
    http://ireneros.com | http://bocoup.com/datavis
    THANK YOU


  65. CITATION
    Icon Created by Piola, Noun Project: https://thenounproject.com/search/?q=document&i=709260
