and natural language processing with Python, using NLTK • Some light “web mining” is also involved • A total beginner should walk away from this talk with a better idea of the complexity of processing text, and a place to start learning
Build a headline generator that mimics MTO • Build a tribute site to my favorite gossip blog • Discover interesting constructions endemic to the corpus • Combine my two favorite things: hip hop and NLP
Montana CONFIRMS . . . That He And Rapper Trina Are DATING!! (Pics And Video) • BALLLLL-LINNNNNG!!!! Lil Wayne's FATHER Bryan BIRDMAN Williams . . . Buys A $4 MILLION Lamborghini!!! (Pics Of His NEW RIDE) • MTO WORLD EXCLUSIVE: Rapper A$AP Rocky's CELEBRITY Girlfriend DUMPS HIM . . . And Then 'REMOVES' The Tattoo He Got . . . Of His NAME!!!
applies rules from a grammar to create a headline • A grammar is a formal way of defining well-formed sentences • A Context-Free Grammar (CFG): manually write a series of rules that enforce how different classes of words and their groupings are ordered • A Language Model: use statistical knowledge of a body of text to assign probabilities to orderings of words
Like playing Mad Libs, but letting the computer play the game with a limited set of values • Rules group terminal nodes into non-terminal nodes until a complete sentence is formed • These rules can be domain-specific • Randomly pick a value that satisfies a rule and gets you closer to a full sentence https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-cfg.py
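A minimal sketch of the CFG approach (the rules and vocabulary here are invented toy placeholders, not the grammar in mto-cfg.py): NLTK can enumerate the sentences a grammar licenses, and we can sample one at random.

```python
# Toy CFG in the spirit of MTO headlines; rules and terminals are
# invented placeholders, not the actual mto-cfg.py grammar.
import random
import nltk
from nltk.parse.generate import generate

grammar = nltk.CFG.fromstring("""
  S -> PREFIX SUBJ VERB OBJ SUFFIX
  PREFIX -> 'MTO' 'WORLD' 'EXCLUSIVE:'
  SUBJ -> 'RAPPER' NAME | NAME
  NAME -> 'Trina' | 'Birdman' | 'Kanye'
  VERB -> 'BUYS' | 'DUMPS' | 'CONFIRMS'
  OBJ -> 'A' 'LAMBORGHINI' | 'HIS' 'GIRLFRIEND'
  SUFFIX -> '!!!' | '(Pics)'
""")

# generate() enumerates sentences the grammar licenses; sampling one
# at random stands in for the Mad Libs fill-in step.
headlines = list(generate(grammar, n=100))
print(' '.join(random.choice(headlines)))
```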
the Nth word given the prior sequence of N - 1 words • N-Grams: Unigram, Bigram, Trigram (4-gram…) • A trigram language model identifies the probability of the next word given the last two words
its token count divided by the count of all tokens • N-Gram probability is the probability that the last word in a sequence will appear, given the words that precede it
“Gold all in my chain. Gold all in my rings. Gold all in my watch. Don’t believe me, just watch.” – Trinidad James • Probability that the first word of any sentence is “gold”? 3/4 = .75 • Probability that the last word of a sentence is “watch”? 2/4 = .5 • Probability that the last word of a sentence is “watch”, given “all in my”? 1/3 ≈ .33
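Those counts are easy to reproduce; a quick sketch using NLTK's ConditionalFreqDist over 4-grams (the lower-cased sentence list is assumed from the quote above):

```python
# Reproducing the Trinidad James arithmetic with NLTK.
from nltk import ConditionalFreqDist
from nltk.util import ngrams

sentences = [
    "gold all in my chain".split(),
    "gold all in my rings".split(),
    "gold all in my watch".split(),
    "don't believe me just watch".split(),
]

# P(first word = "gold"): 3 of 4 sentences
firsts = [s[0] for s in sentences]
print(firsts.count("gold") / len(firsts))        # 0.75

# P(last word = "watch"): 2 of 4 sentences
lasts = [s[-1] for s in sentences]
print(lasts.count("watch") / len(lasts))         # 0.5

# P(next word = "watch" | "all in my"): condition 4-grams on a 3-word history
cfd = ConditionalFreqDist(
    (g[:3], g[3]) for s in sentences for g in ngrams(s, 4)
)
print(cfd[("all", "in", "my")].freq("watch"))    # 0.333...
```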
• A lot of data • No, really, a LOT of data (the more, the better) • An interesting domain space • A methodology for accessing and sanitizing that data
Parse the HTML to extract headline data based on selectors • Iterate over pages by querying selectors • Write headlines to a flat file, newline-separated https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-scrape.py
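The actual scraper is in mto-scrape.py above; here is a generic sketch of that workflow, assuming requests and BeautifulSoup, with a placeholder URL and selector:

```python
# Generic scrape-to-flat-file sketch; the URL and CSS selector are
# placeholders, not the ones mto-scrape.py actually uses.
import requests
from bs4 import BeautifulSoup

with open("headlines.txt", "w") as out:
    for page in range(1, 11):                     # iterate over pages
        html = requests.get(f"http://example.com/news?page={page}").text
        soup = BeautifulSoup(html, "html.parser")
        for node in soup.select("h2.headline"):   # placeholder selector
            out.write(node.get_text(strip=True) + "\n")
```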
of text is opaque without sentence tokenization • Run this on a string to get a list of sentence-delineated strings • Sentences are likewise opaque without word tokenization • “We the best” → [“we”, “the”, “best”] • Some words are functional, or uninteresting • Stopword filter: [“we”, “the”, “best”] → [“we”, “best”]
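In NLTK that pipeline is just a few calls (the tokenizers need the 'punkt' data, the stopword list needs 'stopwords'; note NLTK's built-in English list is more aggressive than the example above and also drops “we”):

```python
# Sentence-tokenize, word-tokenize, then filter stopwords.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords  # nltk.download('punkt'); nltk.download('stopwords')

text = "We the best. Another one."
tokens = [w.lower() for s in sent_tokenize(text) for w in word_tokenize(s)]

stops = set(stopwords.words("english"))
content = [w for w in tokens if w.isalpha() and w not in stops]
print(content)  # functional words like "the" (and, in NLTK's list, "we") are gone
```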
frequency distribution for the N-grams you generate • FreqDist.items() returns tokens ordered from most to least frequent • These frequencies influence what our model will generate https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-analyze.py
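A sketch of that analysis step over the scraped headlines (in current NLTK, most_common() is the sorted view the talk gets from items()):

```python
# Frequency distribution over bigrams from the scraped headlines.
from nltk import FreqDist, word_tokenize
from nltk.util import bigrams

with open("headlines.txt") as f:
    tokens = [w for line in f for w in word_tokenize(line.lower())]

fdist = FreqDist(bigrams(tokens))
for gram, count in fdist.most_common(10):  # most frequent first
    print(gram, count)
```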
NLTK’s Text class wraps a corpus • It has its own language-model functionality (generate()) • We explicitly use NLTK’s NgramModel class so that we can vary the order of the N-grams used to generate
NgramModel takes an instance of Text and a probability estimator for smoothing • Smoothing approximates probabilities for unseen instances, making a statistical model more robust • The model generates sentences of between 5 and 25 words • The sentences are sent to STDOUT https://github.com/relwell/MTO-ON-BLAST/blob/master/mto-languagemodel.py
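A sketch of that setup under the NLTK 2.x API the talk used (NgramModel was removed in NLTK 3, so treat this as historical illustration; the Lidstone estimator is one common smoothing choice, not necessarily the talk's):

```python
# NLTK 2.x-era sketch: NgramModel no longer ships with NLTK 3.
import random
from nltk import word_tokenize
from nltk.model import NgramModel             # NLTK 2.x only
from nltk.probability import LidstoneProbDist

with open("headlines.txt") as f:
    tokens = word_tokenize(f.read().lower())

# Lidstone smoothing reserves a little probability mass for unseen n-grams.
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
model = NgramModel(3, tokens, estimator=estimator)  # trigram model

# Generate a "headline" of between 5 and 25 words, as in the talk.
print(" ".join(model.generate(random.randint(5, 25))))
```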
THE LEADING ACTRESS IN DREAMGIRLS... GRAMMYS SINGER GETS TO REMAIN ON THE SHAVED HEAD • MTO WORLD EXCLUSIVE: CHRIS BROWN PHOTO'D WITH A LIL LATE NEXT MONTH KELLY ROWLAND OUT AND ABOUT... MEETING JESUS! • MTO WORLD EXCLUSIVE: KANYE WEST DIED IN DEADLY BUFFALO NY PLANE CRASH! • MTO WORLD EXCLUSIVE: MEDIATAKEOUT.COM, PEREZ HILTON... AND JAY Z STEPS DOWN FROM OFFICE JOHN MCCAIN REFUSES TO MAKE ME LOOK
“MTO WORLD EXCLUSIVE” is such a common construction that it drowns out other starting phrases • Sentences are not always totally grammatical • We asked for between 5 and 25 words, NOT a grammatical sentence • An improvement would be to filter out headlines that can’t be parsed with a CFG (provided part-of-speech tags!) • The constructions we saw during the analysis are prevalent across generated examples
Picks a random line from a file via shell • A “search” capability that fits a “grep” into that workflow (a Python equivalent is sketched below) http://robertelwell.info/mediatakeout-headline-generator/ (Content Advisory Applies)
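The site's actual shell commands aren't shown; here is a hypothetical Python stand-in for the random-line-plus-grep workflow:

```python
# Hypothetical stand-in for the site's shell workflow: serve a random
# pre-generated headline, optionally filtered grep-style by a query.
import random

def random_headline(path="generated.txt", query=None):
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    if query:  # grep-like, case-insensitive filter
        lines = [line for line in lines if query.lower() in line.lower()]
    return random.choice(lines) if lines else None

print(random_headline(query="KANYE"))
```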
working with natural language can be • Python’s Natural Language Toolkit is a fun, approachable way to cut your teeth on NLP’s most popular topics • Language models are a handy way to identify constructions that are endemic to a corpus