
Language sleuthing HOWTO

Presented at the Linux Users of Victoria user group, 2009.

Brianna Laugher

August 03, 2010

Transcript

  1. Language Sleuthing HOWTO
     or Discovering Interesting Things with Python's Natural Language Toolkit
     Brianna Laugher
     modernthings.org
     brianna[@.]laugher.id.au
  2. Because the web is full of language data.
     Because linguistic techniques can reveal unexpected insights.
     Because I don't want to have to read everything.
  3. Getting NLTK
     On Ubuntu 10.04:
         sudo apt-get install python-nltk
     or via pip:
         sudo apt-get install python-pip
         pip install nltk
     or from source at nltk.org/download
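     A quick sanity check, not on the original slides: the corpora used in
     the next few slides have to be fetched once with NLTK's downloader.

         >>> import nltk
         >>> nltk.download('brown')      # tagged corpus used on the next slide
         >>> nltk.download('inaugural')  # plaintext corpus
         >>> nltk.download('stopwords')  # needed later for the tag cloud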
  4. Brown corpus
     A categorized, tagged corpus:
     Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb by/in clearing/vbg up/in any/dti possible/jj misconception/nn in/in your/pp$ minds/nns ,/, wherever/wrb you/ppss are/ber ./. The/at collective/nn by/in which/wdt I/ppss address/vb you/ppo in/in the/at title/nn above/rb is/bez neither/cc patronizing/vbg nor/cc jocose/jj but/cc an/at exact/jj industrial/jj term/nn in/in use/nn among/in professional/jj thieves/nns ./.
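     The same corpus ships with NLTK; a minimal sketch of poking at it
     directly, assuming the 'brown' corpus has been downloaded:

         >>> from nltk.corpus import brown
         >>> brown.categories()[:3]    # genre labels such as 'news'
         >>> brown.tagged_words()[:5]  # (word, part-of-speech tag) pairs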
  5. Inaugural corpus
     A plaintext corpus:
     My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition. Forty-four Americans have now taken the presidential oath. ...
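     Also available directly from NLTK; a small sketch (the fileid for the
     address quoted above should be '2009-Obama.txt' in the downloaded
     corpus):

         >>> from nltk.corpus import inaugural
         >>> inaugural.fileids()[-1]                # most recent address
         >>> inaugural.words('2009-Obama.txt')[:6]  # first few tokens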
  6. BeautifulSoup to the rescue
         >>> from BeautifulSoup import BeautifulSoup as BS
         >>> data = open(filename, 'r').read()
         >>> soup = BS(data)
         >>> print '\n'.join(soup.findAll(text=True))
  7. What about blockquotes?
         >>> bqs = soup.findAll('blockquote')
         >>> [bq.extract() for bq in bqs]
         >>> print '\n'.join(soup.findAll(text=True))
     On 05/08/2007, at 12:05 PM, [...] wrote: If u want it USB bootable, just burn the DSL boot disk to CD and fire it up. Then from the desktop after boot, right click and create the bootable USB key yourself. I havent actually done this myself (only seen the option from the menu), but I am assuming it will be a fairly painless process if you are happy with the stock image. Would be interested in how you go as I have to build 50 USB bootable DSL's in the next couple weeks. Regards, [...]
  8. Getting it into NLTK
         import nltk
         path = 'path/to/files'
         corpus = nltk.corpus.PlaintextCorpusReader(path, '.*\.html')
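     With the reader built, the standard corpus methods work over the
     archived mailing-list files; a quick sketch (the fileid shown is the
     one from the next slide):

         >>> corpus.fileids()[:1]  # e.g. ['2008-12/msg00226.html']
         >>> len(corpus.words())   # token count over the whole archive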
  9. What about our metadata?
     Create a Python dictionary that maps filenames to categories, e.g.
         categories = {}
         categories['2008-12/msg00226.html'] = ['year-2008', 'month-12', 'author-BM<bm@xxxxx>']
         # ...etc
     then...
         import nltk
         path = 'path/to/files/'
         corpus = nltk.corpus.CategorizedPlaintextCorpusReader(path, '.*\.html', cat_map=categories)
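     The slides don't show how the dictionary gets filled in. A hedged
     sketch, assuming year and month can be parsed straight out of paths
     like '2008-12/msg00226.html' (the author categories would need the
     message headers, so they are left out here):

         import os

         categories = {}
         for fileid in corpus.fileids():          # plain reader from slide 8
             yearmonth = os.path.dirname(fileid)  # e.g. '2008-12'
             year, month = yearmonth.split('-')
             categories[fileid] = ['year-' + year, 'month-' + month]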
  10. Simple categories
         cats = corpus.categories()
         authorcats = [c for c in cats if c.startswith('author')]
         #>>> len(authorcats)
         #608
         yearcats = [c for c in cats if c.startswith('year')]
         monthcats = [c for c in cats if c.startswith('month')]
  11. ...who are the top posters?
         posts = [(len(corpus.fileids(author)), author) for author in authorcats]
         posts.sort(reverse=True)
         for count, author in posts[:10]:
             print "%5d\t%s" % (count, author)
     →
          1304  author-JW
          1294  author-RC
          1243  author-CS
          1030  author-JH
           868  author-DP
           752  author-TWB
           608  author-CS#2
           556  author-TL
           452  author-BM
           412  author-RM
     (email me if you're curious to know if you're on it...)
  12. Frequency distributions
         popular = ['ubuntu', 'debian', 'fedora', 'arch']
         niche = ['gentoo', 'suse', 'centos', 'redhat']

         def getcfd(distros, limit):
             cfd = nltk.ConditionalFreqDist(
                 (distro, fileid[:limit])
                 for fileid in corpus.fileids()
                 for w in corpus.words(fileid)
                 for distro in distros
                 if w.lower().startswith(distro))
             return cfd

         popularcfd = getcfd(popular, 4)  # or 7 for months
         popularcfd.plot()
         nichecfd = getcfd(niche, 4)
         nichecfd.plot()
     another “NLTKism”
  13. Random text generation
         import random

         def generate_model(cfdist, word, num=15):
             for i in range(num):
                 print word,
                 word = random.choice(list(cfdist[word]))

         text = [w.lower() for w in corpus.words()]
         bigrams = nltk.bigrams(text)
         cfd = nltk.ConditionalFreqDist(bigrams)
         generate_model(cfd, 'hi', num=20)
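     A design note on the sampling, not spelled out on the slides:
     random.choice over list(cfdist[word]) picks each observed continuation
     with equal probability, however often it actually followed the word;
     swapping in cfdist[word].max() would always pick the most frequent
     continuation instead, at the cost of the output quickly falling into a
     loop.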
  14. hi...
     hi allan : ages since apparently yum erased . attempts now venturing into config run ip 10 431 ms 57
     hi serg it illegal address entries must *, t close relative info many families continue fi into modem and reinstalled
     hi wen and amended :) imageshack does for grade service please blame . warning issued an overall environment consists in
     hi folks i accidentally due cause excitingly stupid idiots , deletion flag on adding option ? branded ) mounting them
     hi guys do composite required </ emulator in for unattended has info to catalyse a dbus will see atz init3
  15. hi from Peter...
         text = [w.lower() for w in corpus.words(categories=[c for c in authorcats if 'PeterL' in c])]
     hi everyone , hence the database schema and that run on memberdb on mail store is 12 . yep ,
     hi anita , your favourite piece of cpu cycles , he was thinking i hear the middle of failure .
     hi anita , same vhost b internal ip / nine seem odd occasion i hazard . 25ghz g4 ibook here
     hi everyone , same ) on removes a "-- nicelevel nn " as intended . 00 . main host basis
     hi cameron , no biggie . candidates in to upgrade . ubuntu dom0 install if there ! now ). txt
     hi cameron , attribution for 30 seconds , and runs out on linux to on www . luv , these
  16. interesting collocations ...or not
         text = [w.lower() for w in corpus.words() if w.isalpha()]
         from nltk.collocations import *
         bigram_measures = nltk.collocations.BigramAssocMeasures()
         finder = BigramCollocationFinder.from_words(text)
         finder.apply_freq_filter(3)
         finder.nbest(bigram_measures.pmi, 10)
     →
     (bufnewfile, bufread), (busmaster, speccycle), (cellx, celly), (cheswick, bellovin), (cread, clocal), (curtail, atl), (dmcrs, rscem), (dmmrbc, dmost), (dmost, dmcrs), ...
  17. oblig tag cloud
         stopwords = nltk.corpus.stopwords.words('english')
         words = [w.lower() for w in corpus.words() if w.isalpha()]
         words = [w for w in words if w not in stopwords]
         word_fd = nltk.FreqDist(words)
         wordmax = word_fd[word_fd.max()]
         wordmin = 1000  # YMMV
         taglist = word_fd.items()
         ranges = getRanges(wordmin, wordmax)
         writeCloud(taglist, ranges, 'tags.html')
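     getRanges and writeCloud aren't shown on the slides. A hypothetical
     sketch of what they might do, purely illustrative: bucket word counts
     into equal-width frequency bands, then emit one <span> per word with a
     font size picked by band.

         def getRanges(wordmin, wordmax, bands=6):
             # Split [wordmin, wordmax] into equal-width frequency bands,
             # one band per font size.
             step = max((wordmax - wordmin) // bands, 1)
             return [(wordmin + i * step, wordmin + (i + 1) * step)
                     for i in range(bands)]

         def writeCloud(taglist, ranges, outfile):
             # Write a minimal HTML tag cloud: bigger band, bigger font.
             out = open(outfile, 'w')
             for word, count in sorted(taglist):
                 if count < ranges[0][0]:
                     continue                # below wordmin: skip
                 size = len(ranges) - 1      # default to the biggest band
                 for i, (lo, hi) in enumerate(ranges):
                     if lo <= count < hi:
                         size = i
                         break
                 out.write('<span style="font-size:%dpt">%s</span>\n'
                           % (10 + 4 * size, word))
             out.close()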
  18. another one for Peter :)
         cats = [c for c in corpus.categories() if 'PeterL' in c]
         words = [w.lower() for w in corpus.words(categories=cats) if w.isalpha()]
         wordmin = 10
     → (resulting tag cloud shown on the slide)
  19. thanks!
     for more corpus fun: http://www.nltk.org/
     The Book: 'Natural Language Processing with Python', 2nd ed. pub. Jan 2010
     These slides are © Brianna Laugher and are released under the Creative Commons Attribution ShareAlike license, v3.0 unported.
     The data set is not free, sadly...