Python for Curious People who Like Natural Language a Lot

PyGotham 2014
Python for Curious People who Like Natural Language a Lot
Jackie Cohen | @jczetta | jczetta [at] gmail
slides available: http://bit.ly/jzc-pygotham14
All content in this presentation, unless otherwise noted/referenced, is (c) Jackie Cohen 2014, and licensed under a Creative Commons Attribution (CC-BY 4.0) license.

today I will
✤ define terms and give some background (explain my excitement a bit)
✤ ask some questions I think are relevant
✤ provide the beginnings of answers to those questions
✤ talk a little about why I think those are good questions
✤ hooray.

first, language (!!!)
✤ natural language
  ✤ what humans use to communicate (usually)
  ✤ (English, French, the stuff you say to a friend)
✤ formal language
  ✤ how humans communicate with e.g. computers
  ✤ (Python, SQL, Java, one of many you can invent)

Python is awesome
✤ clarity
✤ teaching
✤ similarities
✤ differences
✤ flexibility

“I love language”
✤ ok, well, language is a major reason I’m a programmer
✤ big overlap in communities
✤ this talk is aimed at celebrating that intersection
✤ so many options (!)

“this ‘linguistics research’ thing sounds neat” “and this building stuff, that sounds neat”
✤ so, can we weave these together well?
✤ the answer to that question is usually yes
✤ but let’s go deeper
✤ super useful: different perspectives (!!!), and curiosity (!!!)
✤ we have some questions to ask and answer

let’s ask:
✤ 1. what’s linguistics, really?
✤ 2. which choices must we make?
✤ 3. where should we start? (and what can we do?)
✤ 4. why does this matter?

1. what’s linguistics?
✤ studying parallel, overlapping systems
✤ the talk (speaking) and the typing (writing) aren’t the same
✤ complex structures are generated from simple rules
✤ tons of subfields

2. what choices must we make?
✤ work with different aspects of natural language:
  ✤ semantics, phonetics, phonology, syntax, history, many social factors ...
✤ work with different libraries and concepts in Python:
  ✤ Natural Language ToolKit, fun ways of conceptualizing parsing, various algorithm implementations...

subfields I’m talkin’ about
✤ phonetics >> looking at the discrete units of language and how we produce them, consume them, put them together
✤ lexical semantics >> what words mean to us, and how they relate to other words in one language system or among many
✤ syntax >> how we parse language, how pieces of it fit together (“grammaticality”)
✤ (also lots of other cool things, though)
✤ ((BRAINS!))

3. where should we start?
if you want to play along (later, or whenever)

    pip install nltk

(edit appropriately for your operating system/environment)
and see: http://www.nltk.org/data.html

interlude on NLTK
✤ major awesome introductory tool
✤ open source!
✤ lots of code samples
✤ exciting doorways
✤ maybe coast here for a bit! maybe dive in. maybe both.
✤ ALL THE OPTIONS ARE OKAY (worth shouting, sorry)
✤ http://www.nltk.org

syntax | parts of speech ‘n stuff
for example. part of speech taggers (are pretty sweet)
different algorithms, different success rates in different contexts

    import nltk

    snippet = "the fox jumped over the lazy brown dog slowly"
    tokenized_snippet = nltk.word_tokenize(snippet)   # split the string into word tokens
    tagged_snippet = nltk.pos_tag(tokenized_snippet)  # list of (word, tag) pairs
    print(tagged_snippet)
    # for example
    # result: [('the', 'DT'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('brown', 'NN'), ('dog', 'NN'), ('slowly', 'RB')]

    # keep only the (word, tag) pairs tagged 'NN' (singular noun)
    sing_nouns = [w for w in tagged_snippet if w[1] == "NN"]

+ extrapolation!
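One possible bit of "extrapolation", sketched without NLTK so it runs anywhere: group the words by the tag they received. The tagged list is copied from the result on the slide rather than recomputed, and `group_by_tag` is a made-up helper name, not part of NLTK.

```python
from collections import defaultdict

# Tagged output copied from the slide above (not recomputed here).
tagged_snippet = [
    ("the", "DT"), ("fox", "NN"), ("jumped", "VBD"), ("over", "IN"),
    ("the", "DT"), ("lazy", "NN"), ("brown", "NN"), ("dog", "NN"),
    ("slowly", "RB"),
]


def group_by_tag(tagged):
    """Map each tag to the list of words that received it, in sentence order."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)


groups = group_by_tag(tagged_snippet)
print(groups["NN"])   # every word the tagger called a singular noun
print(groups["VBD"])  # every word the tagger called a past-tense verb
```

Grouping like this makes the tagger's mistakes easy to spot: "lazy" and "brown" land in the noun bucket even though they are adjectives here.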

syntax | is that grammatical?
Curious green ideas sleep furiously.

I go to work on the bus all the time anymore.

Jump trash sink phone hotel because.

phonetics | phonemes ‘n rhymes
spoken and signed language is made up of individual discrete units
ahoy, International Phonetic Alphabet (and text encodings :o)

cont.
Carnegie Mellon University Pronunciation Dictionary
a text corpus! http://www.speech.cs.cmu.edu/cgi-bin/cmudict
http://praat.org : ‘doing phonetics by computer’!
syllables! stress!
how do you find rhymes?
how to measure prosody?
how does sound production change ________ ?
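A rough sketch of one way to approach "how do you find rhymes?" with pronunciation data. The entries below are hand-typed in the CMU dictionary's style (ARPAbet phones, with a stress digit on each vowel) purely for illustration, not loaded from the corpus. The rule used here, "two words rhyme if they match from the last primary-stressed vowel onward", is an assumption and only one of several reasonable definitions. `count_syllables`, `rhyme_part`, and `rhymes` are made-up helper names.

```python
# Toy lexicon in the style of the CMU Pronouncing Dictionary.
# Hand-typed for illustration; a real run would load the full corpus instead.
PRONUNCIATIONS = {
    "cat": ["K", "AE1", "T"],
    "hat": ["HH", "AE1", "T"],
    "dog": ["D", "AO1", "G"],
    "log": ["L", "AO1", "G"],
    "hotel": ["HH", "OW0", "T", "EH1", "L"],
    "bell": ["B", "EH1", "L"],
}


def count_syllables(phones):
    """Vowel phones end in a stress digit, so counting digits counts syllables."""
    return sum(1 for phone in phones if phone[-1].isdigit())


def rhyme_part(phones):
    """Return the phones from the last primary-stressed vowel to the end."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):
            return tuple(phones[i:])
    return tuple(phones)  # no primary stress marked: fall back to the whole word


def rhymes(a, b, lexicon=PRONUNCIATIONS):
    """Assumed rule: different words whose rhyme parts are identical."""
    return a != b and rhyme_part(lexicon[a]) == rhyme_part(lexicon[b])


print(rhymes("hotel", "bell"))
print(rhymes("cat", "dog"))
print(count_syllables(PRONUNCIATIONS["hotel"]))
```

The same stress digits are a starting point for the prosody question too, since a line of verse becomes a sequence of stressed and unstressed syllables.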

semantics | wordnet, for example

    from nltk.corpus import wordnet as wn
    >>> print(wn.synsets('dog'))
    # [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
    >>> print(wn.synset('dog.n.01').definition())
    # a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
    >>> print(wn.synsets('dog', pos=wn.VERB))
    # [Synset('chase.v.01')]

defining relations between sets of words!
http://www.nltk.org/howto/wordnet.html
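To get a feel for what "relations between sets of words" can look like, here is a tiny stand-in written in plain Python. The `TOY_HYPERNYMS` table is invented for this sketch and is far simpler than WordNet (one parent per word, no senses); the helper names are hypothetical as well. It only illustrates the idea of walking "is a kind of" links.

```python
# Toy "is a kind of" table, invented for illustration. Not WordNet data.
TOY_HYPERNYMS = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word, hypernyms=TOY_HYPERNYMS):
    """Follow 'is a kind of' links upward until there is no parent left."""
    chain = [word]
    while chain[-1] in hypernyms:
        chain.append(hypernyms[chain[-1]])
    return chain


def lowest_common_hypernym(a, b, hypernyms=TOY_HYPERNYMS):
    """Return the first ancestor of b that is also an ancestor of a, if any."""
    ancestors_of_a = set(hypernym_chain(a, hypernyms))
    for candidate in hypernym_chain(b, hypernyms):
        if candidate in ancestors_of_a:
            return candidate
    return None


print(hypernym_chain("dog"))
print(lowest_common_hypernym("dog", "cat"))
```

How far up you have to walk before two words meet is one rough signal of how related they are, which is the intuition behind several similarity measures built on this kind of structure.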

4. why does this matter?
all the reasons! I’ll touch on just a couple

classification
one line importing corpus
one line defining dataset(s) (‘we know about this stuff, have it so you can use it as a basis to make judgments, machine’)
one line randomly shuffling all the data
one line initializing training data and testing data
one line instantiating a classifier object (‘this machine is here to get data + info to train it on, k?’)
then, printing stuff out, using or writing other functions, whatever.
learned lots! not too painful.

cont. (just for reference!)

    import random

    import nltk
    from nltk.corpus import names

    def name_features(word):
        return {'last_letter': word[-1]}
    # let's look at 'what is the last letter of the name'
    # and let's pretend there are only two categories for names, 'male' and 'female'
    # even though there aren't really only two of course

    labeled_names = ([[name, 'male'] for name in names.words('male.txt')] +
                     [[name, 'female'] for name in names.words('female.txt')])
    random.shuffle(labeled_names)
    featuresets = [(name_features(n), g) for (n, g) in labeled_names]
    train_set, test_set = featuresets[500:], featuresets[:500]  # lots of stuff in the corpus, yay
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # time for printing
    print(classifier.classify(name_features('Neo')))
    print(classifier.classify(name_features('Eowyn')))

    # Here's accuracy of this classification for our test data so we know how well we're doin'
    print(nltk.classify.accuracy(classifier, test_set))

    # NLTK can just tell you (with behind scenes work!) which of the features
    # you're looking at are most important. e.g. it's easiest to tell what category
    # a name falls in by looking at whether or not it ends in "a", maybe?
    classifier.show_most_informative_features(5)
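For the curious, here is a rough, from-scratch sketch of the kind of "behind the scenes work" a Naive Bayes classifier does with a single last-letter feature. It is not NLTK's implementation: the eight training names are a hand-picked toy list (with the same pretend two-category labels as above), the add-one smoothing and the `alphabet_size=26` parameter are assumptions, and `train` and `classify` are made-up names.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-picked toy training set, standing in for the names corpus.
TOY_NAMES = [
    ("Anna", "female"), ("Maria", "female"), ("Julia", "female"), ("Ruth", "female"),
    ("John", "male"), ("Kevin", "male"), ("Brian", "male"), ("Luca", "male"),
]


def name_features(word):
    return {"last_letter": word[-1]}


def train(labeled_names):
    """Count how often each label occurs, and each last letter within each label."""
    label_counts = Counter()
    letter_counts = defaultdict(Counter)
    for name, label in labeled_names:
        label_counts[label] += 1
        letter_counts[label][name_features(name)["last_letter"]] += 1
    return label_counts, letter_counts


def classify(name, model, alphabet_size=26):
    """Pick the label maximizing log P(label) + log P(last letter | label)."""
    label_counts, letter_counts = model
    total = float(sum(label_counts.values()))
    letter = name_features(name)["last_letter"]

    def score(label):
        prior = label_counts[label] / total
        # Add-one smoothing so an unseen letter never gets probability zero.
        likelihood = (letter_counts[label][letter] + 1.0) / (label_counts[label] + alphabet_size)
        return math.log(prior) + math.log(likelihood)

    # sorted() keeps tie-breaking deterministic.
    return max(sorted(label_counts), key=score)


model = train(TOY_NAMES)
print(classify("Sophia", model))
print(classify("Ethan", model))
```

With only eight names this says nothing real about names, of course; the point is just that "training" here is counting, and "classifying" is comparing a couple of smoothed ratios.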

parsing and translation
very simplest translation is 1:1 pattern matching
look stuff up in dictionaries!

    lst_b = [dictionary[w] for w in str_a.split()]  # look up each word of the sentence
    str_b = " ".join(lst_b)
    # punctuation is for later

then, stats can help! more complicated pattern matching!
linguistic theory can touch on this lightly or heavily (often, lightly)
using Python, you’ve got lots of niftiness available pre-packaged + a neat community-of-knowledge
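A slightly fuller sketch of the 1:1 lookup idea, runnable as-is. The three-word English-to-French `TOY_DICTIONARY` and the function name `translate_word_for_word` are invented for illustration, and marking unknown words with angle brackets is just one assumed way to handle gaps.

```python
# Toy English-to-French word list, invented for illustration.
TOY_DICTIONARY = {"the": "le", "cat": "chat", "sleeps": "dort"}


def translate_word_for_word(sentence, dictionary):
    """Replace each word by its dictionary entry; flag words we can't look up."""
    words = sentence.lower().split()  # punctuation is still for later
    return " ".join(dictionary.get(w, "<" + w + ">") for w in words)


print(translate_word_for_word("The cat sleeps", TOY_DICTIONARY))
print(translate_word_for_word("The dog sleeps", TOY_DICTIONARY))
```

The second call shows where this breaks down right away: anything outside the dictionary passes through untranslated, and nothing here knows about word order or agreement, which is where the stats and the linguistic theory come in.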

NEAT.

references & relevant links
“Nantucket! Hacking with Verse” by Danielle Sucher & Darius Bacon, !!con 2014: http://bangbangcon.com/2014-transcripts/danielle-sucher-darius-bacon-nantucket-hacking-at-verse.txt (http://bangbangcon.com/)
NLTK: http://nltk.org
Wordnet examples: http://www.nltk.org/howto/wordnet.html
Classification examples: http://www.nltk.org/book/ch06.html
CMU Pronunciation Dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
International Phonetic Alphabet reference: http://en.wikipedia.org/wiki/International_Phonetic_Alphabet
Praat: http://praat.org
Chomsky, “Three models for the description of language” (1956): http://noam-chomsky.org/articles/195609--.pdf
+ endless

thanks!
