
MTO On Blast by Robert Elwell

PyCon 2013
March 16, 2013

Introduces the Natural Language Toolkit using language modeling, through the domain of a hip hop gossip blog


Transcript

  1. MTO ON BLAST
    Using Language Models to Identify Endemic
    Constructions in a Hip Hop Gossip Blog
    Robert Elwell, Software Engineer


  2. About This Talk
    •  A light introduction to computational linguistics and
    natural language processing with Python, using NLTK
    •  Some light “web mining” is also involved
    •  A total beginner should walk away from this talk with a
    better idea of the complexity of processing text, and a
    place to start learning


  3. This talk includes explicit rap lyrics and
    gossip headlines in its examples.


  4. A Bit About Me
    •  Computational Linguist, NLP & Search Expert
    •  Cut my teeth on Python in Grad School at UT Austin
    •  Manage Wikia’s Search Stack, using a Python Backend
    •  And, of course…


  5. (image slide)

  6. What does that mean?
    • A regular contributor to http://rap.wikia.com
    • I get my mixtapes from DatPiff
    • I get my videos and industry news from
    World Star Hip Hop
    • I get all of my juicy gossip from
    MediaTakeOut


  7. MediaTakeOut.com


  8. MediaTakeOut.com
    • Lots of unique constructions
    • Exciting headlines with a trademark tone
    • Sensationalizes everything about my
    favorite rappers


  9. A Scheme To Entertain Myself
    •  Create a headline generator that mimics MTO
    •  Build a tribute site to my favorite gossip blog
    •  Discover interesting constructions endemic to the corpus
    •  Combine my two favorite things: hip hop and NLP


  10. Some MTO Headline Examples
    •  TOLD YA SO!!! Rapper French Montana CONFIRMS . . .
    That He And Rapper Trina Are DATING!! (Pics And Video)
•  BALLLLL-LINNNNN-G!!!! Lil Wayne's FATHER Bryan
    BIRDMAN Williams . . . Buys A $4 MILLION
    Lamborghini!!! (Pics Of His NEW RIDE)
    •  MTO WORLD EXCLUSIVE: Rapper A$AP Rocky's
    CELEBRITY Girlfriend DUMPS HIM . . . And Then
    'REMOVES' The Tattoo he Got . . . Of His NAME!!!


  11. Looking for Patterns
    •  Ellipses used for effect
    •  Headlines introduced with an exclamation
    •  Parenthetical asides advertising pics, videos
    •  Hip hop celebrities as a subject


  12. What is a Headline Generator?
    •  A headline generator randomly applies rules from a
    grammar to create a headline
    •  A grammar is a mode of defining well-formed sentences
    •  A Context-Free Grammar (CFG)
    •  Manually write a series of rules to enforce how different classes of
    words and their groupings are ordered
    •  A Language Model
    •  Use statistical knowledge of a piece of text to apply probabilities to
    orders of words


  13. CFGs in Natural Language
    S à NP VP
    NP à D NP
    NP à N PP
    NP à D N
    PP à P NP
    VP à V A
    D à ‘the’
    N à ‘house’ | ‘end’ | ‘street’
    P à ‘at’ | ‘of’
    V à ‘is’
    A à ‘red’


  14. CFGs for Generating Text
    •  Like writing a single Mad Lib, and letting the computer
    play the game with a limited set of values
    •  A sentence is created by rules that group terminal nodes
    into non-terminal nodes until a sentence is created
    •  These rules can be domain-specific
    •  Randomly pick a value that will satisfy a rule that gets you
    closer to a full sentence
    https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-cfg.py
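The random-expansion idea above can be sketched in a few lines of plain Python over the toy grammar from the previous slide. This is a simplified stand-in for the linked mto-cfg.py, not its actual contents; NLTK's `nltk.CFG` class expresses the same grammar more formally.

```python
import random

# The toy CFG from the previous slide, as a dict from each
# non-terminal to its list of alternative productions.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "NP"], ["N", "PP"], ["D", "N"]],
    "PP": [["P", "NP"]],
    "VP": [["V", "A"]],
    "D":  [["the"]],
    "N":  [["house"], ["end"], ["street"]],
    "P":  [["at"], ["of"]],
    "V":  [["is"]],
    "A":  [["red"]],
}

def expand(symbol):
    """Recursively expand a symbol until only terminal words remain."""
    if symbol not in GRAMMAR:                 # terminal: emit the word itself
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    words = []
    for sym in production:
        words.extend(expand(sym))
    return words

print(" ".join(expand("S")))                  # e.g. "the house is red"
```

Because VP always rewrites to V A, every generated sentence ends in "is red" — exactly the Mad Libs quality the slide describes: you only ever see what the rules allow.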


  15. mto-cfg.py


  16. Example Output


  17. Drawbacks of Using Rules
    •  You have to write the rules
    •  You have to enumerate the vocabulary
    •  You never see anything you haven’t mostly hand-coded
    NOT FUN


  18. Language Models
•  Estimate the probability of the Nth word in a sequence
    given the previous N - 1 words
    •  N-Grams: Unigram, Bigram, Trigram (4-gram…)
    •  A trigram language model identifies the probability of the
    next word given the last two words


  19. We Use Probabilities, Too!
    •  Where’s the…
    •  Who’s the…


  20. Tokens vs. Types
    •  Useful when talking about probabilities
    •  Each unique word in a text is a type
    •  A token is an observed instance of a type


  21. Tokens, Types, and Counts
    •  “A dog is a dog” – Will Smith
    Token counts for each type:
    •  ‘dog’
    •  2
    •  ‘a’
    •  2
    •  ‘is’
    •  1


  22. N-Gram Probabilities
    •  The unigram probability of a type is its token count divided
    by the count of all tokens
•  N-gram probability is the probability of the last word in a
    sequence given the N - 1 words that precede it
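The token/type counts and the unigram probability from the last two slides can be computed directly with `collections.Counter` from the standard library (a stand-in here for NLTK's FreqDist):

```python
from collections import Counter

# The Will Smith line from the previous slide, lowercased.
tokens = "a dog is a dog".split()
counts = Counter(tokens)

# Unigram probability: the type's token count over the total token count.
p_dog = counts["dog"] / len(tokens)

print(len(tokens))       # 5 tokens
print(len(set(tokens)))  # 3 types
print(p_dog)             # P('dog') = 0.4
```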


  23. An Exercise in Probability
    “Gold all in my chains.
    Gold all in my rings.
    Gold all in my watch.
    Don’t believe me, just watch.”
    – Trinidad James
    Probability that the first word of any
    sentence is “gold”?
    .75
    Probability that the last word of the
    sentence is “watch”?
    .5
    Probability that the last word of the
    sentence is “watch” given “all in my”?
    .33
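The three answers above can be checked mechanically; a minimal sketch, treating each line of the verse as a sentence:

```python
lyrics = [
    "gold all in my chains",
    "gold all in my rings",
    "gold all in my watch",
    "don't believe me just watch",
]

# P(first word of a sentence is "gold"): 3 of 4 sentences.
first = sum(1 for s in lyrics if s.split()[0] == "gold") / len(lyrics)

# P(last word of a sentence is "watch"): 2 of 4 sentences.
last = sum(1 for s in lyrics if s.split()[-1] == "watch") / len(lyrics)

# P("watch" | "all in my"): of the 4-grams that start with that trigram,
# how many end in "watch"?  The trigram occurs 3 times in the verse.
tokens = " ".join(lyrics).split()
context = ("all", "in", "my")
fourgrams = [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]
matches = [g for g in fourgrams if g[:3] == context]
cond = sum(1 for g in matches if g[3] == "watch") / len(matches)

print(first, last, round(cond, 2))  # 0.75 0.5 0.33
```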


  24. What Do We Need to Make Our
    Language Model Interesting?
    •  A lot of data
    •  No, really, a LOT of data (the more, the better)
    •  An interesting domain space
    •  A methodology for accessing and sanitizing that data


  25. A multi-part problem
    • Web Mining
    • Extracting content from an HTTP-
    accessible data source
    • Language Modeling
    If ya don’t know,
    now ya know!


  26. Web Mining
    •  Access HTML from a script
    •  Use the HTML to extract headline data based on selectors
    •  Iterate over pages by querying selectors
    •  Write headlines to a flat file, newline separated
    https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-scrape.py
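The extraction step can be sketched with the standard library's `html.parser`; the markup and class name below are invented for illustration (the real site's selectors differ, and the linked mto-scrape.py is the actual implementation):

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a fetched page.
SAMPLE = """
<div class="headline"><a href="/1">TOLD YA SO!!! French Montana CONFIRMS . . .</a></div>
<div class="headline"><a href="/2">MTO WORLD EXCLUSIVE: A$AP Rocky . . .</a></div>
"""

class HeadlineParser(HTMLParser):
    """Collects the text inside <div class="headline"> elements."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())

parser = HeadlineParser()
parser.feed(SAMPLE)

# One headline per line, as the slide describes.
for headline in parser.headlines:
    print(headline)
```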


  27. mto-scrape.py


  28. What We Got
    •  > 40,000 headlines
    •  Averaging 16 words and 96 characters per headline
    •  A decent amount of data for a trigram language model


  29. Analyzing Text: Tokenization & Filtering
    •  nltk.tokenize
    •  A body of text is opaque without sentence tokenization
    •  Run this on a string to get a list of sentence-delineated strings
•  Sentences are likewise opaque without word tokenization
    •  “We the best” → [“we”, “the”, “best”]
    •  Some words are functional, or uninteresting
    •  Stopword filter: [“we”, “the”, “best”] → [“we”, “best”]
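A minimal regex-based sketch of the pipeline on this slide, written without NLTK so it is self-contained (nltk.tokenize's sent_tokenize/word_tokenize and nltk.corpus.stopwords do each step more robustly; the stopword list here is a tiny illustrative subset):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and"}   # tiny illustrative list

def sent_tokenize(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_tokenize(sentence):
    """Lowercase word tokenizer, keeping internal apostrophes."""
    return re.findall(r"[\w']+", sentence.lower())

def filter_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

sents = sent_tokenize("We the best. DJ Khaled said so!")
tokens = word_tokenize(sents[0])      # ['we', 'the', 'best']
print(filter_stopwords(tokens))       # ['we', 'best']
```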


  30. Analyzing Text: Frequencies
    •  nltk.probability
    •  FreqDist class encapsulates the frequency distribution for the N-
    grams you generate
•  FreqDist.items() returns (token, count) pairs, most frequent first
    •  These frequencies influence what our model will generate
    https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-analyze.py
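The same frequency bookkeeping can be sketched with `collections.Counter`, which FreqDist resembles (both are dict-like with frequency-ordered views); this is not the linked mto-analyze.py, just the idea:

```python
from collections import Counter

tokens = "gold all in my chains gold all in my rings gold all in my watch".split()

# Bigrams via a zip of the token stream against itself shifted by one.
bigrams = list(zip(tokens, tokens[1:]))
freq = Counter(bigrams)

# These counts are what the language model's probabilities are built from.
print(freq[("gold", "all")])   # 3
print(freq[("my", "chains")])  # 1
print(freq.most_common(3))
```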


  31. mto-analyze.py


  32. mto-analyze.py


  33. mto-analyze.py


  34. NLTK Text class
    •  A useful class for exploring a corpus
    •  Has its own language model functionality (generate())
•  We explicitly use NLTK’s NgramModel class so that we
    can vary the N-gram order used for generation


  35. NLTK NgramModel Class
    •  Takes an order of N-grams, an instance of Text and a
    probability estimator for smoothing
    •  Smoothing attempts to approximate probabilities for unseen
    instances, improving the predictability of a statistical model
    •  The model generates sentences between 5 and 25 words
    •  The sentences are sent to STDOUT
    https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-languagemodel.py
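The generation loop can be sketched without NgramModel (which shipped with the NLTK of this era) as a bare trigram Markov walk; this stand-in has no smoothing and simply stops at an unseen context or the word cap, so it is a simplification of what the linked mto-languagemodel.py does:

```python
import random
from collections import defaultdict

def build_trigram_model(tokens):
    """Map each bigram context to the list of words observed after it."""
    model = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)].append(c)
    return model

def generate(model, seed, max_words=25):
    """Random walk from a two-word seed; duplicate continuations in the
    lists make frequent trigrams proportionally more likely."""
    words = list(seed)
    while len(words) < max_words:
        context = tuple(words[-2:])
        if context not in model:       # unseen context: no smoothing, stop
            break
        words.append(random.choice(model[context]))
    return " ".join(words)

corpus = "gold all in my chains gold all in my rings gold all in my watch".split()
model = build_trigram_model(corpus)
print(generate(model, ("gold", "all")))
```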


  36. mto-languagemodel.py


  37. Example Headlines
    •  MTO WORLD EXCLUSIVE: LEBRON JAMES IS GETTING THE
    LEADING ACTRESS IN DREAMGIRLS... GRAMMYS SINGER GETS
    TO REMAIN ON THE SHAVED HEAD
    •  MTO WORLD EXCLUSIVE: CHRIS BROWN PHOTO'D WITH A LIL
    LATE NEXT MONTH KELLY ROWLAND OUT AND ABOUT...
    MEETING JESUS!
    •  MTO WORLD EXCLUSIVE: KANYE WEST DIED IN DEADLY
    BUFFALO NY PLANE CRASH!
    •  MTO WORLD EXCLUSIVE: MEDIATAKEOUT.COM, PEREZ
    HILTON... AND JAY Z STEPS DOWN FROM OFFICE JOHN
    MCCAIN REFUSES TO MAKE ME LOOK


  38. Observations
    •  “MTO WORLD EXCLUSIVE” is so popular a beginning
    construction that it drowns out other starting phrases
    •  Sentences are not always totally grammatical
    •  We asked for between 5 and 25 words, NOT a grammatical
    sentence
    •  An improvement would be to filter out headlines that can’t be
    parsed with a CFG (provided part-of-speech tags!)
    •  The constructions we saw during the analysis are
    prevalent across generated examples


  39. Generator Site
    •  A PHP script that grabs a random line from a file via shell
    •  “Search” capability that fits a “grep” into that workflow
    http://robertelwell.info/mediatakeout-headline-generator/
    (Content Advisory Applies)


  40. (image slide)

  41. Conclusions
    •  Even a toy application can show how complex working
    with natural language can be
    •  Python’s Natural Language Toolkit is a fun and interesting
    way to cut your teeth on NLP’s most popular topics
    •  Language models can be a fun way to identify
    constructions that are endemic to a corpus


  42. ?uestions?
    robertelwell.info
    wikia.com/User:Relwell
    twitter.com/languagehacker
    twitter.com/mtoheadlinebot
    github.com/relwell
    Interested in learning more Search and NLP at Wikia?
    Contact robert at wikia-inc.com for internship opportunities.
