Python for Curious People who Like Natural Language a Lot

jczetta
August 16, 2014

Rather than going into the details of algorithms, I give some simple, easy-to-build-upon examples of how Python and open source Python packages can be used to dive quickly into some really awesome aspects of linguistics research. I then bring them together to explain, at a high level, why I believe Python is an excellent bridge between linguists interested in programming, beginning programmers interested in linguistics, and any curious people who like figuring stuff out about languages, all along a spectrum of formality. Given at PyGotham 2014.

Transcript

  1. PyGotham 2014: Python for Curious People who Like Natural Language a Lot

    Jackie Cohen | @jczetta | jczetta [at] gmail
    slides available: http://bit.ly/jzc-pygotham14
    All content in this presentation, unless otherwise noted/referenced, is (c) Jackie Cohen 2014, and licensed under a Creative Commons Attribution (CC-BY 4.0) license.
  2. today I will

    ✤ define terms and give some background (explain my excitement a bit)
    ✤ ask some questions I think are relevant
    ✤ provide the beginnings of answers to those questions
    ✤ talk a little about why I think those are good questions
    ✤ hooray.
  3. first, language (!!!)

    ✤ natural language
        ✤ what humans use to communicate (usually)
        ✤ (English, French, the stuff you say to a friend)
    ✤ formal language
        ✤ how humans communicate with e.g. computers
        ✤ (Python, SQL, Java, one of many you can invent)
  4. “I love language”

    ✤ ok, well, language is a major reason I’m a programmer
    ✤ big overlap in communities
    ✤ this talk is aimed at celebrating that intersection
    ✤ so many options (!)
  5. “this ‘linguistics research’ thing sounds neat” “and this building stuff, that sounds neat”

    ✤ so, can we weave these together well?
    ✤ the answer to that question is usually yes
    ✤ but let’s go deeper
    ✤ super useful: different perspectives (!!!), and curiosity (!!!)
    ✤ we have some questions to ask and answer
  6. let’s ask:

    ✤ 1. what’s linguistics, really?
    ✤ 2. which choices must we make?
    ✤ 3. where should we start? (and what can we do?)
    ✤ 4. why does this matter?
  7. 1. what’s linguistics?

    ✤ studying parallel, overlapping systems
    ✤ the talk (speaking) and the typing (writing) aren’t the same
    ✤ complex structures are generated from simple rules
    ✤ tons of subfields
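To make "complex structures are generated from simple rules" a bit more concrete, here is a minimal sketch of a toy phrase-structure grammar. Everything in it (the rule names, the six-word vocabulary, the `expand` helper) is made up for illustration and is not from the talk or from any library. It also assumes the grammar has no recursive rules, so the enumeration is guaranteed to finish.

```python
from itertools import product

# A made-up toy grammar: each symbol maps to its possible expansions.
# Assumption: no rule refers back to itself, so expansion always terminates.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "Det": [["the"]],
    "N": [["fox"], ["dog"]],
    "V": [["sleeps"], ["chases"]],
}


def expand(symbol):
    """Return every word sequence that `symbol` can produce."""
    if symbol not in GRAMMAR:
        # Anything without a rule is treated as an actual word.
        return [[symbol]]
    results = []
    for production in GRAMMAR[symbol]:
        # Expand each piece, then combine the pieces in every possible way.
        parts = [expand(piece) for piece in production]
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results


sentences = [" ".join(words) for words in expand("S")]
for sentence in sentences:
    print(sentence)
```

Six small rules already yield twelve distinct sentences. Adding a recursive rule would make the set unbounded, which is why this sketch leaves recursion out.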
  8. 2. what choices must we make?

    ✤ work with different aspects of natural language:
        ✤ semantics, phonetics, phonology, syntax, history, many social factors ...
    ✤ work with different libraries and concepts in Python:
        ✤ Natural Language Toolkit, fun ways of conceptualizing parsing, various algorithm implementations...
  9. subfields I’m talkin’ about

    ✤ phonetics >> looking at the discrete units of language and how we produce them, consume them, put them together
    ✤ lexical semantics >> what words mean to us, and how they relate to other words in one language system or among many
    ✤ syntax >> how we parse language, how pieces of it fit together (“grammaticality”)
    ✤ (also lots of other cool things, though)
    ✤ ((BRAINS!))
  10. 3. where should we start?

    if you want to play along (later, or whenever)

        pip install nltk

    (edit appropriately for your operating system/environment)
    and see: http://www.nltk.org/data.html
  11. interlude on NLTK

    ✤ major awesome introductory tool
    ✤ open source!
    ✤ lots of code samples
    ✤ exciting doorways
    ✤ maybe coast here for a bit! maybe dive in. maybe both.
    ✤ ALL THE OPTIONS ARE OKAY (worth shouting, sorry)
    ✤ http://www.nltk.org
  12. syntax | parts of speech ‘n stuff

    for example. part of speech taggers (are pretty sweet)
    different algorithms, different success rates in different contexts

        import nltk

        snippet = "the fox jumped over the lazy brown dog slowly"
        tokenized_snippet = nltk.word_tokenize(snippet)
        tagged_snippet = nltk.pos_tag(tokenized_snippet)
        print(tagged_snippet)  # for example
        # result: [('the', 'DT'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('brown', 'NN'), ('dog', 'NN'), ('slowly', 'RB')]

        # keep only the (word, tag) pairs tagged as singular nouns
        sing_nouns = [w for w in tagged_snippet if w[1] == "NN"]

    + extrapolation!
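As one possible bit of "extrapolation", here is a small sketch that works on the tagged output shown in the slide, pasted in as a literal so it runs without NLTK or its data installed. The helper name `words_with_tag` is made up for this example.

```python
from collections import Counter

# The tagger output from the slide, copied in as plain data.
tagged_snippet = [
    ('the', 'DT'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'),
    ('the', 'DT'), ('lazy', 'NN'), ('brown', 'NN'), ('dog', 'NN'),
    ('slowly', 'RB'),
]


def words_with_tag(tagged, tag):
    """Return just the words whose part-of-speech tag equals `tag`."""
    return [word for word, word_tag in tagged if word_tag == tag]


sing_nouns = words_with_tag(tagged_snippet, "NN")
print(sing_nouns)  # ['fox', 'lazy', 'brown', 'dog']

# How often does each tag show up in the snippet?
tag_counts = Counter(tag for _, tag in tagged_snippet)
print(tag_counts.most_common(2))  # [('NN', 4), ('DT', 2)]
```

Note that 'lazy' and 'brown' come back tagged as nouns in the slide's output, which is a nice small instance of "different success rates in different contexts".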
  13. syntax | is that grammatical?

    Curious green ideas sleep furiously.
    I go to work on the bus all the time anymore.
  14. syntax | is that grammatical?

    Curious green ideas sleep furiously.
    I go to work on the bus all the time anymore.
    Jump trash sink phone hotel because.
  15. phonetics | phonemes ‘n rhymes

    spoken and signed language is made up of individual discrete units
    ahoy, International Phonetic Alphabet (and text encodings :o)
  16. cont.

    Carnegie Mellon University Pronunciation Dictionary
    a text corpus! http://www.speech.cs.cmu.edu/cgi-bin/cmudict
    http://praat.org : ‘doing phonetics by computer’!
    syllables! stress!
    how do you find rhymes?
    how to measure prosody?
    how does sound production change ________ ?
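One simple way to approach "how do you find rhymes?" is sketched below. It assumes pronunciations written in the style the CMU dictionary uses (phoneme symbols, with a digit on each vowel marking stress), but the five entries are hand-typed for illustration rather than loaded from the real corpus, and the "same sounds from the last primary-stressed vowel onward" rule is just one rough definition of rhyme, not the only one. The function names are made up.

```python
# Hand-typed sample entries in CMU-dictionary-style notation (illustrative only).
# A trailing digit marks a vowel: 1 = primary stress, 0 = unstressed.
PRONUNCIATIONS = {
    "cat": ["K", "AE1", "T"],
    "hat": ["HH", "AE1", "T"],
    "dog": ["D", "AO1", "G"],
    "log": ["L", "AO1", "G"],
    "python": ["P", "AY1", "TH", "AA0", "N"],
}


def syllable_count(word, pron=PRONUNCIATIONS):
    """Count vowels, which are the phonemes that carry a stress digit."""
    return sum(1 for phoneme in pron[word] if phoneme[-1].isdigit())


def rhyming_part(phonemes):
    """Return the phonemes from the last primary-stressed vowel to the end."""
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i].endswith("1"):
            return tuple(phonemes[i:])
    return tuple(phonemes)  # no primary stress found: compare the whole word


def rhymes(word, pron=PRONUNCIATIONS):
    """Return the other words whose rhyming part matches that of `word`."""
    target = rhyming_part(pron[word])
    return sorted(
        other for other, phonemes in pron.items()
        if other != word and rhyming_part(phonemes) == target
    )


print(rhymes("cat"))             # ['hat']
print(rhymes("dog"))             # ['log']
print(syllable_count("python"))  # 2
```

Swapping the hand-typed dict for the full pronunciation dictionary would be the natural next step; the comparison logic would stay the same.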
  17. semantics | wordnet, for example

        >>> from nltk.corpus import wordnet as wn
        >>> print(wn.synsets('dog'))
        # [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
        >>> print(wn.synset('dog.n.01').definition())
        # a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
        >>> print(wn.synsets('dog', pos=wn.VERB))
        # [Synset('chase.v.01')]

    defining relations between sets of words!
    http://www.nltk.org/howto/wordnet.html
  18. classification

    one line importing corpus
    one line defining dataset(s) (‘we know about this stuff, have it so you can use it as a basis to make judgments, machine’)
    one line randomly shuffling all the data
    one line initializing training data and testing data
    one line instantiating a classifier object (‘this machine is here to get data + info to train it on, k?’)
    then, printing stuff out, using or writing other functions, whatever.
    learned lots! not too painful.
  19. cont. (just for reference!)

        import random

        import nltk
        from nltk.corpus import names

        def name_features(word):
            return {'last_letter': word[-1]}

        # let's look at 'what is the last letter of the name'
        # and let's pretend there are only two categories for names, 'male' and
        # 'female', even though there aren't really only two of course
        labeled_names = ([[name, 'male'] for name in names.words('male.txt')] +
                         [[name, 'female'] for name in names.words('female.txt')])
        random.shuffle(labeled_names)
        featuresets = [(name_features(n), g) for (n, g) in labeled_names]
        # first 500 are held out for testing; lots of stuff in the corpus, yay
        train_set, test_set = featuresets[500:], featuresets[:500]
        classifier = nltk.NaiveBayesClassifier.train(train_set)

        # time for printing
        print(classifier.classify(name_features('Neo')))
        print(classifier.classify(name_features('Eowyn')))

        # Here's accuracy of this classification for our test data
        # so we know how well we’re doin’
        print(nltk.classify.accuracy(classifier, test_set))

        # NLTK can just tell you (with behind-the-scenes work!) which of the
        # features you're looking at are most informative. e.g. it’s easiest to
        # tell what category a name falls in by looking at whether or not it
        # ends in “a”, maybe?
        classifier.show_most_informative_features(5)
  20. parsing and translation

    very simplest translation is 1:1 pattern matching
    look stuff up in dictionaries!

        lst_b = [dictionary[w] for w in str_a.split()]
        str_b = " ".join(lst_b)
        # punctuation is for later

    then, stats can help! more complicated pattern matching!
    linguistic theory can touch on this lightly or heavily (often, lightly)
    using Python, you’ve got lots of niftiness available pre-packaged + a neat community-of-knowledge
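The word-by-word lookup above can be tried out with a minimal, self-contained sketch like this one. The three-entry English-to-French dictionary and the `translate` helper are made up for illustration, and passing unknown words through unchanged is just one possible choice (the slide's version would raise a `KeyError` instead).

```python
# A made-up, tiny English-to-French lookup table (illustrative only).
dictionary = {"the": "le", "cat": "chat", "sleeps": "dort"}


def translate(sentence, lookup=dictionary):
    """Translate word by word; words missing from the lookup pass through."""
    words = sentence.lower().split()  # punctuation is still for later
    return " ".join(lookup.get(word, word) for word in words)


print(translate("The cat sleeps"))  # le chat dort
print(translate("the dog sleeps"))  # le dog dort
```

The second call shows where 1:1 matching runs out: anything outside the dictionary, along with word order and agreement, needs the statistics and fancier pattern matching mentioned above.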
  21. references & relevant links

    “Nantucket! Hacking with Verse” by Danielle Sucher & Darius Bacon, !!con 2014: http://bangbangcon.com/2014-transcripts/danielle-sucher-darius-bacon-nantucket-hacking-at-verse.txt (http://bangbangcon.com/)
    NLTK: http://nltk.org
    Wordnet examples: http://www.nltk.org/howto/wordnet.html
    Classification examples: http://www.nltk.org/book/ch06.html
    CMU Pronunciation Dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
    International Phonetic Alphabet reference: http://en.wikipedia.org/wiki/International_Phonetic_Alphabet
    Praat: http://praat.org
    Chomsky, “Three models for the description of language” (1956): http://noam-chomsky.org/articles/195609--.pdf
    + endless