About This Talk • A light introduction to computational linguistics and natural language processing with Python, using NLTK • Some light “web mining” is also involved • A total beginner should walk away from this talk with a better idea of the complexity of processing text, and a place to start learning
A Bit About Me • Computational Linguist, NLP & Search Expert • Cut my teeth on Python in Grad School at UT Austin • Manage Wikia’s Search Stack, using a Python Backend • And, of course…
What does that mean? • A regular contributor to http://rap.wikia.com • I get my mixtapes from DatPiff • I get my videos and industry news from World Star Hip Hop • I get all of my juicy gossip from MediaTakeOut
A Scheme To Entertain Myself • Create a headline generator that mimics MTO • Build a tribute site to my favorite gossip blog • Discover interesting constructions endemic to the corpus • Combine my two favorite things: hip hop and NLP
Some MTO Headline Examples • TOLD YA SO!!! Rapper French Montana CONFIRMS . . . That He And Rapper Trina Are DATING!! (Pics And Video) • BALLLLL-LINNNNNG!!!! Lil Wayne's FATHER Bryan BIRDMAN Williams . . . Buys A $4 MILLION Lamborghini!!! (Pics Of His NEW RIDE) • MTO WORLD EXCLUSIVE: Rapper A$AP Rocky's CELEBRITY Girlfriend DUMPS HIM . . . And Then 'REMOVES' The Tattoo he Got . . . Of His NAME!!!
Looking for Patterns • Ellipses used for effect • Headlines introduced with an exclamation • Parenthetical asides advertising pics, videos • Hip hop celebrities as a subject
What is a Headline Generator? • A headline generator randomly applies rules from a grammar to create a headline • A grammar is a mode of defining well-formed sentences • A Context-Free Grammar (CFG) • Manually write a series of rules to enforce how different classes of words and their groupings are ordered • A Language Model • Use statistical knowledge of a piece of text to apply probabilities to orders of words
CFGs in Natural Language
S → NP VP
NP → D NP
NP → N PP
NP → D N
PP → P NP
VP → V A
D → ‘the’
N → ‘house’ | ‘end’ | ‘street’
P → ‘at’ | ‘of’
V → ‘is’
A → ‘red’
CFGs for Generating Text • Like writing a single Mad Lib, and letting the computer play the game with a limited set of values • A sentence is generated by starting at S and expanding non-terminal nodes with rules until only terminal nodes (words) remain • These rules can be domain-specific • At each step, randomly pick one of the rules that can expand the current non-terminal https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-cfg.py
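To make the Mad Lib idea concrete, here is a minimal sketch of random generation from the toy grammar on the previous slide. It is not the code in mto-cfg.py; the names GRAMMAR, RECURSIVE, expand and max_depth are made up for illustration, and the depth guard is an added assumption so that the recursive NP and PP rules cannot expand forever.

```python
import random

# The toy grammar from the previous slide: non-terminal -> list of expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "NP"], ["N", "PP"], ["D", "N"]],
    "PP": [["P", "NP"]],
    "VP": [["V", "A"]],
    "D": [["the"]],
    "N": [["house"], ["end"], ["street"]],
    "P": [["at"], ["of"]],
    "V": [["is"]],
    "A": [["red"]],
}

# Hypothetical helper set: the non-terminals that can lead back to themselves.
RECURSIVE = {"NP", "PP"}


def expand(symbol, rng, depth=0, max_depth=6):
    """Expand a symbol into a list of words by applying random rules."""
    if symbol not in GRAMMAR:  # terminal node: an actual word
        return [symbol]
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        # Deep in the tree: prefer rules that cannot recurse, if there are any.
        safe = [rhs for rhs in options if not any(s in RECURSIVE for s in rhs)]
        options = safe or options
    rhs = rng.choice(options)
    words = []
    for child in rhs:
        words.extend(expand(child, rng, depth + 1, max_depth))
    return words


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        print(" ".join(expand("S", rng)))
```

Every sentence ends in "is red" because VP has only one expansion, which is exactly the "you never see anything you haven't hand-coded" problem.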
Drawbacks of Using Rules • You have to write the rules • You have to enumerate the vocabulary • You never see anything you haven’t mostly hand-coded NOT FUN
Language Models • Identifies the probability of the Nth word in a sequence given the prior N - 1 words • N-Grams: Unigram, Bigram, Trigram (4-gram…) • A trigram language model identifies the probability of the next word given the last two words
N-Gram Probabilities • The unigram probability of a type is its token count divided by the count of all tokens • N-Gram probability is the probability that the last word in a sequence follows the words before it: the count of the full N-gram divided by the count of its first N - 1 words
An Exercise in Probability “Gold all in my chains. Gold all in my rings. Gold all in my watch. Don’t believe me, just watch.” – Trinidad James Probability that the first word of any sentence is “gold”? .75 Probability that the last word of the sentence is “watch”? .5 Probability that the last word of the sentence is “watch” given “all in my”? .33
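The three answers can be checked by counting. This is a small sketch using only the standard library; the variable names are illustrative, and it reads "given all in my" as "the three words just before the last word are all in my".

```python
import re

lyric = ("Gold all in my chains. Gold all in my rings. "
         "Gold all in my watch. Don't believe me, just watch.")

# One list of lowercase word tokens per sentence.
sentences = [re.findall(r"[a-z']+", s.lower())
             for s in lyric.split(".") if s.strip()]

# P(first word is "gold") = sentences starting with "gold" / all sentences
p_first_gold = sum(s[0] == "gold" for s in sentences) / len(sentences)

# P(last word is "watch") = sentences ending with "watch" / all sentences
p_last_watch = sum(s[-1] == "watch" for s in sentences) / len(sentences)

# P(last word is "watch" | previous three words are "all in my")
context = ["all", "in", "my"]
with_context = [s for s in sentences if s[-4:-1] == context]
p_watch_given_context = (
    sum(s[-1] == "watch" for s in with_context) / len(with_context)
)

print(p_first_gold, p_last_watch, round(p_watch_given_context, 2))
# -> 0.75 0.5 0.33
```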
What Do We Need to Make Our Language Model Interesting? • A lot of data • No, really, a LOT of data (the more, the better) • An interesting domain space • A methodology for accessing and sanitizing that data
Web Mining • Access HTML from a script • Use the HTML to extract headline data based on selectors • Iterate over pages by querying selectors • Write headlines to a flat file, newline separated https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-scrape.py
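The steps above can be sketched with the standard library alone. This is not the code in mto-scrape.py: the class name "headline", the URL template, and the helper names (HeadlineParser, extract_headlines, scrape) are all assumptions for illustration, and a real site needs its own selectors.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HeadlineParser(HTMLParser):
    """Collects the text of every element whose class list has css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0        # > 0 while inside a matching element
        self.buffer = []      # text pieces of the current headline
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse runs of whitespace so each headline is one clean line.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.headlines.append(text)
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_headlines(html, css_class="headline"):
    parser = HeadlineParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.headlines


def scrape(url_template, pages, out_path):
    """Fetch each page and write its headlines to a newline-separated file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for page in pages:
            with urlopen(url_template.format(page=page)) as response:
                html = response.read().decode("utf-8", errors="replace")
            for headline in extract_headlines(html):
                out.write(headline + "\n")
```

A call would look like scrape("http://example.com/archive?page={page}", range(1, 4), "headlines.txt"), where the URL is a placeholder.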
Analyzing Text: Tokenization & Filtering • nltk.tokenize • A body of text is opaque without sentence tokenization • Run a sentence tokenizer on a string to get a list of sentence strings • Sentences are also opaque without word tokenization • “We the best” → [“we”, “the”, “best”] • Some words are functional, or uninteresting • Stopword filter: [“we”, “the”, “best”] → [“we”, “best”]
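Here is a rough stand-in for that pipeline using regular expressions, so it runs without NLTK or its data files. In NLTK the equivalent calls are nltk.tokenize.sent_tokenize and nltk.tokenize.word_tokenize, which handle far more edge cases. The function names and the tiny STOPWORDS set are made up for illustration; a real stopword list is much longer.

```python
import re

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "my"}


def split_sentences(text):
    """Split after sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lowercase the sentence and pull out word-like chunks."""
    return re.findall(r"[a-z0-9$']+", sentence.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t not in stopwords]


if __name__ == "__main__":
    text = "Gold all in my chains. Gold all in my rings."
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        print(tokens, "->", remove_stopwords(tokens))
```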
Analyzing Text: Frequencies • nltk.probability • FreqDist class encapsulates the frequency distribution for the N-grams you generate • FreqDist.items() returns (token, count) pairs, most frequent first in NLTK 2; in NLTK 3, use FreqDist.most_common() for that ordering • These frequencies influence what our model will generate https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-analyze.py
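The counting itself can be sketched with collections.Counter, which is what FreqDist subclasses in NLTK 3. This is not the code in mto-analyze.py; the ngrams and relative_frequency helpers are written here for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def relative_frequency(counts, item):
    """Share of all counted items that are `item`."""
    return counts[item] / sum(counts.values())


tokens = ("gold all in my chains gold all in my rings "
          "gold all in my watch").split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

if __name__ == "__main__":
    print(unigram_counts.most_common(3))
    print(bigram_counts.most_common(3))
    print(relative_frequency(unigram_counts, "gold"))
```

The more often an N-gram shows up in these counts, the more likely a model trained on them is to produce it.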
NLTK Text class • A useful class for exploring a corpus • Has its own language model functionality (generate()) • We explicitly use NLTK’s NgramModel class so that we can vary the order (N) of the N-grams used to generate
NLTK NgramModel Class • Takes an order of N-grams, an instance of Text and a probability estimator for smoothing • Smoothing approximates probabilities for unseen instances, so the model does not assign zero probability to sequences missing from the training data • The model generates sentences between 5 and 25 words • The sentences are sent to STDOUT https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-languagemodel.py
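To show the moving parts without depending on a particular NLTK version, here is a standard-library sketch of an N-gram generator. It is not NLTK's NgramModel and not the code in mto-languagemodel.py. The names train, laplace_prob and generate are illustrative, add-one (Laplace) smoothing is swapped in as the simplest estimator since the slides do not name one, and generation samples from raw counts rather than smoothed probabilities.

```python
import random
from collections import Counter, defaultdict

CORPUS = [
    "gold all in my chains".split(),
    "gold all in my rings".split(),
    "gold all in my watch".split(),
    "don't believe me just watch".split(),
]


def train(sentences, n=3):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            model[context][padded[i + n - 1]] += 1
    return model


def laplace_prob(model, context, word, vocab_size):
    """Add-one smoothed P(word | context); never zero for unseen words."""
    counts = model.get(tuple(context), Counter())
    return (counts[word] + 1) / (sum(counts.values()) + vocab_size)


def generate(model, n=3, min_words=5, max_words=25, rng=None, attempts=1000):
    """Sample sentences until one has between min_words and max_words words."""
    rng = rng or random.Random()
    for _ in range(attempts):
        context = ("<s>",) * (n - 1)
        words = []
        while len(words) < max_words:
            followers = model.get(context)
            if not followers:
                break
            word = rng.choices(list(followers),
                               weights=list(followers.values()))[0]
            if word == "</s>":
                break
            words.append(word)
            context = (context + (word,))[1:]  # slide the context window
        if len(words) >= min_words:
            return words
    raise ValueError("could not generate a long enough sentence")


if __name__ == "__main__":
    trigram_model = train(CORPUS, n=3)
    print(" ".join(generate(trigram_model, n=3)))  # sent to STDOUT
```

Lowering n makes the output more surprising and less grammatical; raising it makes the model parrot its training headlines.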
Example Headlines • MTO WORLD EXCLUSIVE: LEBRON JAMES IS GETTING THE LEADING ACTRESS IN DREAMGIRLS... GRAMMYS SINGER GETS TO REMAIN ON THE SHAVED HEAD • MTO WORLD EXCLUSIVE: CHRIS BROWN PHOTO'D WITH A LIL LATE NEXT MONTH KELLY ROWLAND OUT AND ABOUT... MEETING JESUS! • MTO WORLD EXCLUSIVE: KANYE WEST DIED IN DEADLY BUFFALO NY PLANE CRASH! • MTO WORLD EXCLUSIVE: MEDIATAKEOUT.COM, PEREZ HILTON... AND JAY Z STEPS DOWN FROM OFFICE JOHN MCCAIN REFUSES TO MAKE ME LOOK
Observations • “MTO WORLD EXCLUSIVE” is so popular a beginning construction that it drowns out other starting phrases • Sentences are not always totally grammatical • We asked for between 5 and 25 words, NOT a grammatical sentence • An improvement would be to filter out headlines that can’t be parsed with a CFG (provided part-of-speech tags!) • The constructions we saw during the analysis are prevalent across generated examples
Generator Site • A PHP script that grabs a random line from a file via shell • “Search” capability that fits a “grep” into that workflow http://robertelwell.info/mediatakeout-headline-generator/ (Content Advisory Applies)
Conclusions • Even a toy application can show how complex working with natural language can be • Python’s Natural Language Toolkit is a fun and interesting way to cut your teeth on NLP’s most popular topics • Language models can be a fun way to identify constructions that are endemic to a corpus
?uestions? robertelwell.info wikia.com/User:Relwell twitter.com/languagehacker twitter.com/mtoheadlinebot github.com/relwell Interested in learning more about Search and NLP at Wikia? Contact robert at wikia-inc.com for internship opportunities.