
MTO On Blast by Robert Elwell

Introduces the Natural Language Toolkit through language modeling, in the domain of a hip hop gossip blog

PyCon 2013

March 16, 2013

Transcript

  1. MTO ON BLAST: Using Language Models to Identify Endemic Constructions in a Hip Hop Gossip Blog. Robert Elwell, Software Engineer
  2. About This Talk
    • A light introduction to computational linguistics and natural language processing with Python, using NLTK
    • Some light “web mining” is also involved
    • A total beginner should walk away from this talk with a better idea of the complexity of processing text, and a place to start learning
  3. A Bit About Me
    • Computational Linguist, NLP & Search Expert
    • Cut my teeth on Python in Grad School at UT Austin
    • Manage Wikia’s Search Stack, using a Python Backend
    • And, of course…
  4. What does that mean?
    • A regular contributor to http://rap.wikia.com
    • I get my mixtapes from DatPiff
    • I get my videos and industry news from World Star Hip Hop
    • I get all of my juicy gossip from MediaTakeOut
  5. MediaTakeOut.com
    • Lots of unique constructions
    • Exciting headlines with a trademark tone
    • Sensationalizes everything about my favorite rappers
  6. A Scheme To Entertain Myself
    • Create a headline generator that mimics MTO
    • Build a tribute site to my favorite gossip blog
    • Discover interesting constructions endemic to the corpus
    • Combine my two favorite things: hip hop and NLP
  7. Some MTO Headline Examples
    • TOLD YA SO!!! Rapper French Montana CONFIRMS . . . That He And Rapper Trina Are DATING!! (Pics And Video)
    • BALLLLL-LINNNNNG!!!! Lil Wayne's FATHER Bryan BIRDMAN Williams . . . Buys A $4 MILLION Lamborghini!!! (Pics Of His NEW RIDE)
    • MTO WORLD EXCLUSIVE: Rapper A$AP Rocky's CELEBRITY Girlfriend DUMPS HIM . . . And Then 'REMOVES' The Tattoo he Got . . . Of His NAME!!!
  8. Looking for Patterns
    • Ellipses used for effect
    • Headlines introduced with an exclamation
    • Parenthetical asides advertising pics, videos
    • Hip hop celebrities as a subject
  9. What is a Headline Generator?
    • A headline generator randomly applies rules from a grammar to create a headline
    • A grammar is a mode of defining well-formed sentences
    • A Context-Free Grammar (CFG)
      • Manually write a series of rules to enforce how different classes of words and their groupings are ordered
    • A Language Model
      • Use statistical knowledge of a piece of text to apply probabilities to orders of words
  10. CFGs in Natural Language (written out in NLTK in the sketch below)
    S → NP VP
    NP → D NP
    NP → N PP
    NP → D N
    PP → P NP
    VP → V A
    D → ‘the’
    N → ‘house’ | ‘end’ | ‘street’
    P → ‘at’ | ‘of’
    V → ‘is’
    A → ‘red’
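    This toy grammar drops straight into NLTK. A minimal sketch, assuming NLTK 3’s CFG.fromstring (the NLTK 2 releases current at the time of the talk spelled this nltk.parse_cfg):

        import nltk
        from nltk.parse.generate import generate

        grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> D NP | N PP | D N
        PP -> P NP
        VP -> V A
        D -> 'the'
        N -> 'house' | 'end' | 'street'
        P -> 'at' | 'of'
        V -> 'is'
        A -> 'red'
        """)

        # Enumerate a few of the sentences the grammar licenses.
        for tokens in generate(grammar, depth=6, n=5):
            print(" ".join(tokens))

    Shallow derivations give strings like “the house is red”; with more depth, the recursive NP rules license “the house at the end of the street is red”.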
  11. CFGs for Generating Text
    • Like writing a single Mad Lib, and letting the computer play the game with a limited set of values
    • A sentence is created by rules that group terminal nodes into non-terminal nodes until a sentence is created
    • These rules can be domain-specific
    • Randomly pick a value that will satisfy a rule that gets you closer to a full sentence (sketched in code below)
    https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-cfg.py
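    The random walk over rules fits in a few lines. A sketch reusing the toy grammar above; the real, domain-specific headline grammar is in the linked mto-cfg.py:

        import random
        import nltk

        grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> D NP | N PP | D N
        PP -> P NP
        VP -> V A
        D -> 'the'
        N -> 'house' | 'end' | 'street'
        P -> 'at' | 'of'
        V -> 'is'
        A -> 'red'
        """)

        def expand(symbol):
            """Randomly expand a symbol until only terminal words remain."""
            if not isinstance(symbol, nltk.grammar.Nonterminal):
                return [symbol]  # terminals are plain strings
            production = random.choice(grammar.productions(lhs=symbol))
            return [word for sym in production.rhs() for word in expand(sym)]

        print(" ".join(expand(grammar.start())))

    Because NP can rewrite to D NP, the recursion occasionally runs long; a less toy-like generator would cap the depth.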
  12. Drawbacks of Using Rules
    • You have to write the rules
    • You have to enumerate the vocabulary
    • You never see anything you haven’t mostly hand-coded
    NOT FUN
  13. Language Models
    • Identifies the probability of the Nth word in a sequence, given the prior sequence of N - 1 words
    • N-Grams: Unigram, Bigram, Trigram (4-gram…)
    • A trigram language model identifies the probability of the next word given the last two words (sketched in code below)
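    Stripped of smoothing, a trigram model is just two tables of counts. A minimal sketch with plain Counters, previewing the Trinidad James lyric from a later slide:

        from collections import Counter

        tokens = "gold all in my chains gold all in my rings".split()

        history = Counter()   # counts of each (w1, w2) pair
        trigrams = Counter()  # counts of each (w1, w2, w3) triple
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            history[(w1, w2)] += 1
            trigrams[(w1, w2, w3)] += 1

        def p(w3, given):
            """P(w3 | w1 w2): relative frequency of w3 after the pair `given`."""
            return trigrams[given + (w3,)] / history[given]

        print(p("my", ("all", "in")))  # 1.0: every "all in" here is followed by "my"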
  14. Tokens vs. Types
    • Useful when talking about probabilities
    • Each unique word in a text is a type
    • A token is an observed instance of a type
  15. Tokens, Types, and Counts
    “A dog is a dog” – Will Smith
    Token counts for each type (computed in code below):
    • ‘dog’: 2
    • ‘a’: 2
    • ‘is’: 1
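    The same bookkeeping in Python, lowercasing so the two spellings of “a” count as one type:

        from collections import Counter

        tokens = "A dog is a dog".lower().split()
        counts = Counter(tokens)

        print(counts)       # Counter({'a': 2, 'dog': 2, 'is': 1})
        print(len(tokens))  # 5 tokens observed
        print(len(counts))  # 3 distinct types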
  16. N-Gram Probabilities
    • The unigram probability of a type is its token count divided by the count of all tokens
    • An N-gram probability is the probability of the last word in a sequence, given the N - 1 words that precede it
  17. An Exercise in Probability
    “Gold all in my chains. Gold all in my rings. Gold all in my watch. Don’t believe me, just watch.” – Trinidad James
    • Probability that the first word of any sentence is “gold”? .75
    • Probability that the last word of a sentence is “watch”? .5
    • Probability that the last word of a sentence is “watch”, given “all in my”? .33
    (The numbers are checked in code below.)
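    A sketch that checks the slide’s numbers, treating the lyric as a four-sentence corpus:

        sentences = [
            "gold all in my chains",
            "gold all in my rings",
            "gold all in my watch",
            "don't believe me just watch",
        ]
        tokenized = [s.split() for s in sentences]

        # P(first word is "gold") = 3/4
        print(sum(s[0] == "gold" for s in tokenized) / len(tokenized))    # 0.75

        # P(last word is "watch") = 2/4
        print(sum(s[-1] == "watch" for s in tokenized) / len(tokenized))  # 0.5

        # P("watch" | "all in my"): "all in my" occurs three times, once before "watch"
        history = ("all", "in", "my")
        follows = [s[i + 3] for s in tokenized
                   for i in range(len(s) - 3)
                   if tuple(s[i:i + 3]) == history]
        print(follows.count("watch") / len(follows))                      # 0.333...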
  18. What Do We Need to Make Our Language Model Interesting?
    • A lot of data
    • No, really, a LOT of data (the more, the better)
    • An interesting domain space
    • A methodology for accessing and sanitizing that data
  19. A multi-part problem
    • Web Mining: extracting content from an HTTP-accessible data source
    • Language Modeling
    If ya don’t know, now ya know!
  20. Web Mining
    • Access HTML from a script
    • Use the HTML to extract headline data based on selectors
    • Iterate over pages by querying selectors
    • Write headlines to a flat file, newline separated (sketched below)
    https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-scrape.py
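    The linked mto-scrape.py is the real scraper; the sketch below only shows the shape of the workflow, using requests and BeautifulSoup. The URL pattern and CSS selector are illustrative placeholders, not MediaTakeOut’s actual markup:

        import requests
        from bs4 import BeautifulSoup

        with open("headlines.txt", "w") as out:
            for page in range(1, 10):  # iterate over paginated listing pages
                html = requests.get("http://mediatakeout.com/?page=%d" % page).text
                soup = BeautifulSoup(html, "html.parser")
                for node in soup.select("h2 a"):  # hypothetical headline selector
                    out.write(node.get_text().strip() + "\n")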
  21. What We Got
    • > 40,000 headlines
    • Averaging 16 words and 96 characters per headline
    • A decent amount of data for a trigram language model
  22. Analyzing Text: Tokenization & Filtering
    • nltk.tokenize
    • A body of text is opaque without sentence tokenization
      • Run this on a string to get a list of sentence-delineated strings
    • Sentences are also opaque without word tokenization
      • “We the best” → [“we”, “the”, “best”]
    • Some words are functional, or uninteresting
      • Stopword filter: [“we”, “the”, “best”] → [“we”, “best”] (sketched below)
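    Both tokenization steps plus the stopword filter, sketched with NLTK. One caveat: NLTK’s stock English stopword list also contains “we”, so reproducing the slide’s exact output implies a custom list:

        import nltk
        from nltk.tokenize import sent_tokenize, word_tokenize

        nltk.download("punkt")  # models the tokenizers depend on

        text = "We the best. We the best."
        sentences = sent_tokenize(text)              # ['We the best.', 'We the best.']
        words = word_tokenize(sentences[0].lower())  # ['we', 'the', 'best', '.']

        stopwords = {"the"}  # toy stopword list matching the slide's example
        print([w for w in words if w.isalpha() and w not in stopwords])
        # ['we', 'best']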
  23. Analyzing Text: Frequencies
    • nltk.probability
    • The FreqDist class encapsulates the frequency distribution for the N-grams you generate
    • FreqDist.items() returns tokens ordered from most to least frequent
    • These frequencies influence what our model will generate (sketched below)
    https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-analyze.py
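    A sketch of the frequency step. In the NLTK 2 releases current at the time of the talk, FreqDist.items() came back sorted by frequency; NLTK 3 spells that FreqDist.most_common():

        from nltk import FreqDist
        from nltk.util import ngrams

        tokens = "gold all in my chains gold all in my rings".split()
        fdist = FreqDist(ngrams(tokens, 3))  # frequency distribution over trigrams

        for trigram, count in fdist.most_common(2):
            print(trigram, count)
        # ('gold', 'all', 'in') 2
        # ('all', 'in', 'my') 2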
  24. NLTK Text class
    • A useful class for exploring a corpus
    • Has its own language model functionality (generate())
    • We explicitly use NLTK’s NgramModel class so that we can vary the N-gram order used to generate
  25. NLTK NgramModel Class
    • Takes an order of N-grams, an instance of Text, and a probability estimator for smoothing
    • Smoothing attempts to approximate probabilities for unseen instances, improving the predictability of a statistical model
    • The model generates sentences between 5 and 25 words
    • The sentences are sent to STDOUT (sketched below)
    https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-languagemodel.py
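    A sketch of the generation step, assuming NLTK 2’s API (NgramModel shipped with NLTK 2, contemporary with the talk, and was removed in NLTK 3); the real script is the linked mto-languagemodel.py:

        import random
        from nltk.model import NgramModel                # NLTK 2 only
        from nltk.probability import LidstoneProbDist

        with open("headlines.txt") as f:                 # one scraped headline per line
            tokens = f.read().lower().split()

        # Lidstone smoothing reserves probability mass for unseen trigrams.
        estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
        model = NgramModel(3, tokens, estimator=estimator)

        # Emit headlines of 5 to 25 words on STDOUT, as the slide describes.
        for _ in range(10):
            print(" ".join(model.generate(random.randint(5, 25))))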
  26. Example Headlines
    • MTO WORLD EXCLUSIVE: LEBRON JAMES IS GETTING THE LEADING ACTRESS IN DREAMGIRLS... GRAMMYS SINGER GETS TO REMAIN ON THE SHAVED HEAD
    • MTO WORLD EXCLUSIVE: CHRIS BROWN PHOTO'D WITH A LIL LATE NEXT MONTH KELLY ROWLAND OUT AND ABOUT... MEETING JESUS!
    • MTO WORLD EXCLUSIVE: KANYE WEST DIED IN DEADLY BUFFALO NY PLANE CRASH!
    • MTO WORLD EXCLUSIVE: MEDIATAKEOUT.COM, PEREZ HILTON... AND JAY Z STEPS DOWN FROM OFFICE JOHN MCCAIN REFUSES TO MAKE ME LOOK
  27. Observations
    • “MTO WORLD EXCLUSIVE” is so popular a beginning construction that it drowns out other starting phrases
    • Sentences are not always totally grammatical
      • We asked for between 5 and 25 words, NOT a grammatical sentence
      • An improvement would be to filter out headlines that can’t be parsed with a CFG (provided part-of-speech tags!)
    • The constructions we saw during the analysis are prevalent across generated examples
  28. Generator Site
    • A PHP script that grabs a random line from a file via shell
    • “Search” capability that fits a “grep” into that workflow (sketched in Python below)
    http://robertelwell.info/mediatakeout-headline-generator/ (Content Advisory Applies)
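    The site itself is PHP, but the random-line-plus-grep workflow fits in a few lines of the talk’s own language. A sketch, assuming the scraped headlines live in headlines.txt:

        import random
        import re
        import sys

        pattern = sys.argv[1] if len(sys.argv) > 1 else ""  # optional "search" term
        with open("headlines.txt") as f:
            matches = [line.strip() for line in f if re.search(pattern, line, re.I)]
        print(random.choice(matches))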
  29. Conclusions
    • Even a toy application can show how complex working with natural language can be
    • Python’s Natural Language Toolkit is a fun and interesting way to cut your teeth on NLP’s most popular topics
    • Language models can be a fun way to identify constructions that are endemic to a corpus