
Language sleuthing HOWTO

Presented at the Linux Users of Victoria user group, 2009.

Brianna Laugher

August 03, 2010


Transcript

  1. Language Sleuthing HOWTO
    or
    Discovering Interesting Things
    with Python's
    Natural Language Tool Kit
    Brianna Laugher
    modernthings.org
    brianna[@.]laugher.id.au


  2. why?
    Corpus linguistics on web
    texts


  3. Because the web is full of
    language data
    Because linguistic techniques
    can reveal unexpected insights
    Because I don't want to have to
    read everything


  4. Like... mailing lists


  5. luv-main as a corpus
    √ Big collection of text
    x Messy data
    x Not annotated


  6. what's interesting?
    conversations
    topics
    change over time
    (authors)


  7. Step 1:
    get the data


  8. wget vs Python script
    √ wget is purpose-built
    √ convenient options like
    --convert-links


  9. Meaningful URLs FTW
    Sympa/MhonArc:
    lists.luv.asn.au/wws/arc/luv-main/2009-04/msg00057.html


  10. (screenshot)

  11. Step 2:
    clean the data


  12. Cleaning for what?
    Remove archive boilerplate
    Remove HTML
    Remove quoted text?
    Remove signatures?
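    None of the cleanup code made it onto the slides; as a rough sketch, the
    quoted-text and signature stripping could look something like this (the
    "-- " separator and ">" prefix are only conventions, so real mail needs
    more care):
    def strip_quotes_and_sig(body):
        kept = []
        for line in body.splitlines():
            if line.rstrip() == '--':          # signature separator
                break
            if line.lstrip().startswith('>'):  # quoted earlier message
                continue
            kept.append(line)
        return '\n'.join(kept)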


  13. (screenshot; visible labels: J.W., J.W., W.E.)


  14. Behind the scenes
    (screenshot; visible labels: J.W., W.E.)


  15. what are we aiming for?
    what do NLTK corpora look like?


  16. Getting NLTK
    sudo apt-get install python-nltk
    in Ubuntu 10.04
    or
    sudo apt-get install python-pip
    pip install nltk
    or
    from source at nltk.org/download


  17. Getting NLTK data...
    an “NLTKism”
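    The command itself isn't on the slide; the usual route is NLTK's
    built-in downloader:
    import nltk
    nltk.download('brown')   # fetch a single corpus non-interactively, or
    nltk.download()          # open the interactive downloader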


  18. (screenshot)

  19. NLTK corpora types


  20. Brown corpus
    A CategorizedTagged corpus:
    Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb by/in
    clearing/vbg up/in any/dti possible/jj
    misconception/nn in/in your/pp$ minds/nns ,/,
    wherever/wrb you/ppss are/ber ./.
    The/at collective/nn by/in which/wdt I/ppss
    address/vb you/ppo in/in the/at title/nn above/rb
    is/bez neither/cc patronizing/vbg nor/cc jocose/jj
    but/cc an/at exact/jj industrial/jj term/nn in/in
    use/nn among/in professional/jj thieves/nns ./.


  21. Inaugural corpus
    A Plaintext corpus:
    My fellow citizens:
    I stand here today humbled by the task before us,
    grateful for the trust you have bestowed, mindful
    of the sacrifices borne by our ancestors. I thank
    President Bush for his service to our nation, as
    well as the generosity and cooperation he has
    shown throughout this transition.
    Forty-four Americans have now taken the
    presidential oath. ...............
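    Both corpora ship with the NLTK data; once downloaded they can be read
    like this (a quick sketch):
    from nltk.corpus import brown, inaugural
    print brown.tagged_words(categories='humor')[:5]   # (word, tag) pairs
    print inaugural.words('2009-Obama.txt')[:10]       # plain word tokens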


  22. But we still have lots of HTML...


  23. (screenshot)

  24. BeautifulSoup to the rescue
    >>> from BeautifulSoup import BeautifulSoup as BS
    >>> data = open(filename,'r').read()
    >>> soup = BS(data)
    >>> print '\n'.join(soup.findAll(text=True))


  25. (screenshot)

  26. notice the blockquote!


    >>> bqs = soup.findAll('blockquote')
    >>> [bq.extract() for bq in bqs]
    >>> print '\n'.join(soup.findAll(text=True))
    On 05/08/2007, at 12:05 PM, [...] wrote:
    If u want it USB bootable, just burn the DSL boot disk to CD and fire it
    up.  Then from the desktop after boot, right click and create the
    bootable USB key yourself.  I havent actually done this myself (only
    seen the option from the menu), but I am assuming it will be a fairly painless
    process if you are happy with the stock image.  Would be interested in
    how you go as I have to build 50 USB bootable DSL's in the next couple weeks.
    Regards,
    [...]
    What about blockquotes?


  28. Step 3:
    analyse the data


  29. Getting it into NLTK
    import nltk
    path = 'path/to/files'
    corpus = nltk.corpus.PlaintextCorpusReader(path, '.*\.html')
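    The reader then behaves like the bundled corpora; a few sanity checks,
    assuming the corpus object above:
    print corpus.fileids()[:3]   # first few message files
    print len(corpus.words())    # total tokens across the archive
    print corpus.sents()[0]      # first sentence, as a list of tokens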


  30. What about our metadata?
    Create a Python dictionary that maps filenames to categories,
    e.g.
    categories = {}
    categories['2008-12/msg00226.html'] = ['year-2008',
                                           'month-12',
                                           'author-BM']
    ....etc
    then...
    import nltk
    path = 'path/to/files/'
    corpus = nltk.corpus.CategorizedPlaintextCorpusReader(path,
                '.*\.html', cat_map=categories)
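    The slide builds the mapping by hand; the year and month categories can
    be derived from the archive paths along these lines (a sketch only; the
    author category would have to come from each message's From: header,
    which isn't shown here):
    import os
    categories = {}
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            if not name.endswith('.html'):
                continue
            relpath = os.path.relpath(os.path.join(dirpath, name), path)
            yearmonth = relpath.split('/')[0]    # e.g. '2008-12'
            year, month = yearmonth.split('-')
            categories[relpath] = ['year-' + year, 'month-' + month]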


  31. Simple categories
    cats = corpus.categories()
    authorcats=[c for c in cats if c.startswith('author')]
    #>>> len(authorcats)
    #608
    yearcats=[c for c in cats if c.startswith('year')]
    monthcats=[c for c in cats if c.startswith('month')]


  32. ...who are the top posters?
    posts = [(len(corpus.fileids(author)), author) for author in authorcats]
    posts.sort(reverse=True)
    for count, author in posts[:10]:
        print "%5d\t%s" % (count, author)

    1304 author-JW
    1294 author-RC
    1243 author-CS
    1030 author-JH
    868 author-DP
    752 author-TWB
    608 author-CS#2
    556 author-TL
    452 author-BM
    412 author-RM
    (email me if you're curious to know if you're on it...)


  33. Frequency distributions
    popular =['ubuntu','debian','fedora','arch']
    niche = ['gentoo','suse','centos','redhat']
    def getcfd(distros, limit):
        cfd = nltk.ConditionalFreqDist(
            (distro, fileid[:limit])
            for fileid in corpus.fileids()
            for w in corpus.words(fileid)
            for distro in distros
            if w.lower().startswith(distro))
        return cfd
    popularcfd = getcfd(popular, 4)  # or 7 for months
    popularcfd.plot()
    nichecfd = getcfd(niche, 4)
    nichecfd.plot()
    another “NLTKism”
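    plot() draws with matplotlib; without it, the same counts can be dumped
    as a plain-text table with
    popularcfd.tabulate()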


  34. 'Popular' distros by month


  35. 'Popular' distros by year


  36. 'Niche' distros by year


  37. Random text generation
    import random
    def generate_model(cfdist, word, num=15):
        # pick each next word at random from the bigram successors
        for i in range(num):
            print word,
            words = list(cfdist[word])
            word = random.choice(words)
    text = [w.lower() for w in corpus.words()]
    bigrams = nltk.bigrams(text)
    cfd = nltk.ConditionalFreqDist(bigrams)
    generate_model(cfd, 'hi', num=20)


  38. hi...
    hi allan : ages since apparently yum erased . attempts
    now venturing into config run ip 10 431 ms 57
    hi serg it illegal address entries must *, t close relative info
    many families continue fi into modem and reinstalled
    hi wen and amended :) imageshack does for grade service
    please blame . warning issued an overall environment
    consists in
    hi folks i accidentally due cause excitingly stupid idiots ,
    deletion flag on adding option ? branded ) mounting them
    hi guys do composite required emulator in for
    unattended has info to catalyse a dbus will see atz init3


  39. hi from Peter...
    text = [w.lower() for w in corpus.words(categories=
                [c for c in authorcats if 'PeterL' in c])]
    hi everyone , hence the database schema and that run on memberdb on mail
    store is 12 . yep ,
    hi anita , your favourite piece of cpu cycles , he was thinking i hear the middle
    of failure .
    hi anita , same vhost b internal ip / nine seem odd occasion i hazard . 25ghz
    g4 ibook here
    hi everyone , same ) on removes a "-- nicelevel nn " as intended . 00 . main
    host basis
    hi cameron , no biggie . candidates in to upgrade . ubuntu dom0 install if there
    ! now ). txt
    hi cameron , attribution for 30 seconds , and runs out on linux to on www .
    luv , these


  40. interesting collocations
    ...or not
    text = [w.lower() for w in corpus.words() if w.isalpha()]
    from nltk.collocations import *
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(text)
    finder.apply_freq_filter(3)
    finder.nbest(bigram_measures.pmi, 10)

    bufnewfile bufread
    busmaster speccycle
    cellx celly
    cheswick bellovin
    cread clocal
    curtail atl
    dmcrs rscem
    dmmrbc dmost
    dmost dmcrs
    ...
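    Raw PMI rewards rare technical tokens like these; a common tweak (not on
    the slide) is to drop stopwords and rank with a less frequency-sensitive
    measure:
    stops = set(nltk.corpus.stopwords.words('english'))
    finder.apply_word_filter(lambda w: w in stops)
    finder.nbest(bigram_measures.likelihood_ratio, 10)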


  41. oblig tag cloud
    stopwords = nltk.corpus.stopwords.words('english')
    words = [w.lower() for w in corpus.words() if w.isalpha()]
    words = [w for w in words if w not in stopwords]
    word_fd = nltk.FreqDist(words)
    wordmax = word_fd[word_fd.max()]
    wordmin = 1000  # YMMV
    taglist = word_fd.items()
    ranges = getRanges(wordmin, wordmax)
    writeCloud(taglist, ranges, 'tags.html')
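    getRanges and writeCloud aren't shown anywhere in the deck; a minimal
    sketch of what such helpers might do (hypothetical, not the speaker's
    code):
    def getRanges(wordmin, wordmax, nbuckets=6):
        # split [wordmin, wordmax] into equal-width frequency buckets,
        # one bucket per font size
        step = (wordmax - wordmin) / float(nbuckets)
        return [(wordmin + i * step, wordmin + (i + 1) * step)
                for i in range(nbuckets)]
    def writeCloud(taglist, ranges, filename):
        # map each word's count to a CSS font size, largest bucket first;
        # words rarer than wordmin fall through and are skipped
        sizes = ['xx-small', 'x-small', 'small', 'medium', 'large', 'x-large']
        out = open(filename, 'w')
        out.write('<div>\n')
        for word, count in sorted(taglist):
            for size, (lo, hi) in reversed(zip(sizes, ranges)):
                if count >= lo:
                    out.write('<span style="font-size:%s"> %s </span>\n'
                              % (size, word))
                    break
        out.write('</div>\n')
        out.close()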


  42. (screenshot)

  43. another one for Peter :)
    cats = [c for c in corpus.categories() if 'PeterL' in c]
    words = [w.lower() for w in corpus.words(categories=cats) if w.isalpha()]
    wordmin = 10


  44. thanks!
    for more corpus fun:
    http://www.nltk.org/
    The Book:
    'Natural Language Processing
    with Python',
    2nd ed. pub. Jan 2010
    These slides are © Brianna Laugher and are released under
    the Creative Commons Attribution ShareAlike license,
    v3.0 unported. The data set is not free, sadly...
