Alyona Medelyan: Understanding human language with Python

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Alyona Medelyan:
Understanding human language with Python
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
@ Kiwi PyCon 2014 - Sunday, 14 Sep 2014 - Track 1
http://kiwi.pycon.org/

**Audience level**

Novice

**Description**

Natural Language Processing (NLP) is an area of Computer Science that studies how computers can understand human language. This talk will explain the main principles behind NLP and introduce some key Python libraries.

**Abstract**

Natural Language Processing (NLP) is an area of Computer Science that studies how computers can understand human language. Thanks to NLP, one day you might have your own intelligent assistant who can understand any dialect, who can answer your questions by finding the right answers on the web, and who can help you communicate in any language through instant and accurate translation.

There are many challenges that still need to be overcome for this to happen, but in the meantime researchers have been finessing the building blocks required for NLP analysis. Some of the algorithms out there are quite powerful already. Many exist in Python and are readily available for any Pythonista to use.

This talk will explain the main principles behind NLP and introduce some key Python libraries. If you are interested in finding out more, please also attend the in-depth tutorial on Friday.

**YouTube**

https://www.youtube.com/watch?v=vZsW-xCXfRI

New Zealand Python User Group

September 14, 2014

Transcript

  1. Understanding human language with Python
    Alyona Medelyan

  2. Who am I?
    Alyona Medelyan
    ▪  In Natural Language Processing since 2000
    ▪  PhD in NLP & Machine Learning from Waikato
    ▪  Author of the open source keyword extraction algorithm Maui
    ▪  Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”
    ▪  Past: Chief Research Officer at Pingar
    ▪  Now: Founder of Entopix, NLP consultancy & software development
    aka @zelandiya

  3. Agenda
    State of NLP
    Recap on fiction vs reality: Are we there yet?
    NLP Complexities
    Why is understanding language so complex?
    NLP using Python
    NLTK, Gensim, TextBlob & Co
    Building NLP applications
    A little bit of data science
    Other NLP areas
    And what’s coming next

  4. State of NLP
    Fiction versus Reality

  5. He (KITT) “always had an ego that was easy to bruise and displayed a very sensitive, but kind and dryly humorous personality.” - Wikipedia

  6. Android Auto: “hands-free operation through voice commands will be emphasized to ensure safe driving”

  7. “by putting this into one's ear one can instantly understand anything said in any language” (Hitchhiker Wiki)

  8. Word Lens: “augmented reality translation”

  9. Two girls use Google Translate to call a real Indian restaurant and order in Hindi…
    How did it go? www.youtube.com/watch?v=wxDRburxwz8

  10. The LCARS (or simply library computer) … used sophisticated artificial intelligence routines to understand and execute vocal natural language commands (From Memory Alpha Wiki)

  11. Let’s try out Google

  12. “Samantha [the OS] proves to be constantly available, always curious and interested, supportive and undemanding”

  13. Siri doesn’t seem to be as “available”

  14. NLP Complexities
    Why is understanding language so complex?

  15. [No text on this slide]

  16. Word segmentation complexities
    ▪  The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
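The “hot dogs” example can be sketched in a few lines of Python 3. A plain tokenizer splits the compound into two tokens, while a greedy merge against a small multiword lexicon keeps it together. The lexicon `MULTIWORD` and both function names are hypothetical, chosen only for illustration; this is not how NLTK segments text.

```python
import re

# Hypothetical lexicon of multiword expressions (an assumption for this sketch).
MULTIWORD = {("hot", "dogs"), ("coney", "island"), ("charles", "feltman")}


def simple_tokenize(text):
    """Lowercase the text and split it into runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def merge_multiwords(tokens, lexicon=MULTIWORD):
    """Greedily join adjacent token pairs that appear in the lexicon."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in lexicon:
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "The first hot dogs were sold by Charles Feltman on Coney Island in 1870."
tokens = simple_tokenize(sentence)
segments = merge_multiwords(tokens)
print(tokens)    # "hot" and "dogs" are separate tokens
print(segments)  # "hot dogs" is a single segment
```

Whether “hot dogs” should be one unit or two depends on the application, which is part of what makes segmentation hard.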

  17. Disambiguation complexities
    Flying planes can be dangerous

  18. NLP using Python
    NLTK, Gensim, TextBlob & Co

  19. What can we do with text?
    text → sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …

  20. NLTK
    Python platform for NLP

  21. How to get to the core words?
    Remove Stopwords with NLTK
    even the acting in transcendence is solid , with the dreamy depp turning in a typically strong performance
    i think that transcendence has a pretty solid acting, with the dreamy depp turning in a strong performance as he usually does
    >>> from nltk.corpus import stopwords
    >>> stop = stopwords.words('english')
    >>> words = ['the', 'acting', 'in', 'transcendence', 'is',
    ...          'solid', 'with', 'the', 'dreamy', 'depp']
    >>> print([word for word in words if word not in stop])
    ['acting', 'transcendence', 'solid', 'dreamy', 'depp']

  22. Getting closer to the meaning:
    Part of Speech tagging with NLTK
    Flying planes can be dangerous
    >>> import nltk
    >>> from nltk.tokenize import word_tokenize
    >>> nltk.pos_tag(word_tokenize("Flying planes can be dangerous"))
    [('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'),
     ('be', 'VB'), ('dangerous', 'JJ')]

  23. Keyword scoring: TFxIDF
    TF: relative frequency of a term t in a document d
    IDF: the inverse proportion of documents d in collection D mentioning term t
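The two ingredients of TFxIDF can be sketched directly in Python 3. This is a minimal illustration rather than the formula from the slide: the base-2 logarithm, the function names, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tf(term, doc):
    """Relative frequency of `term` among the tokens of `doc`."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """log2(N / df): N documents in total, df of them mention `term`."""
    df = sum(1 for doc in docs if term in doc)
    return math.log2(len(docs) / df) if df else 0.0


def tf_idf(term, doc, docs):
    """Score a term in one document against the whole collection."""
    return tf(term, doc) * idf(term, docs)


# Toy collection, made up for illustration and already tokenized.
docs = [
    ["the", "film", "stars", "jolie"],
    ["the", "film", "is", "a", "comedy"],
    ["the", "movie", "is", "a", "film"],
    ["the", "movie", "has", "violence"],
]

for term in ["the", "film", "jolie"]:
    print(term, round(tf_idf(term, docs[0], docs), 3))
```

A word that occurs in every document ("the") scores zero, while a word that is frequent in one document but rare in the collection ("jolie") scores highest.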

  24. TFxIDF with Gensim
    from nltk.corpus import movie_reviews
    from gensim import corpora, models
    # Collect every review as a list of word tokens
    texts = []
    for fileid in movie_reviews.fileids():
        texts.append(movie_reviews.words(fileid))
    # Map each word to an integer id, then turn reviews into bag-of-words vectors
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    # Fit the TFxIDF model on the whole corpus
    tfidf = models.TfidfModel(corpus)

  25. TFxIDF with Gensim (Results)
    for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
        my_id = dictionary.token2id.get(word)
        print('%s\t%s' % (word, tfidf.idfs[my_id]))
    film 0.190174003903
    movie 0.364013496254
    comedy 1.98564470702
    violence 3.2108967825
    jolie 6.96578428466
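One way to read these numbers is to invert the IDF formula and estimate how many reviews mention each word. The sketch below (Python 3) rests on two assumptions: that the weights are log2(N / df), which is believed to be Gensim's default IDF, and that the corpus holds N = 2000 reviews, the figure mentioned later in the talk. The helper name `doc_frequency` is made up for this example.

```python
N_REVIEWS = 2000  # assumed corpus size (the talk mentions 2000 movie reviews)

# IDF values as printed on the slide.
idfs = {
    "film": 0.190174003903,
    "movie": 0.364013496254,
    "comedy": 1.98564470702,
    "violence": 3.2108967825,
    "jolie": 6.96578428466,
}


def doc_frequency(idf, n_docs=N_REVIEWS):
    """Invert idf = log2(n_docs / df) to estimate the document frequency df."""
    return n_docs / 2 ** idf


for word, value in idfs.items():
    print(word, round(doc_frequency(value)))
```

Under these assumptions "jolie" appears in about 16 reviews and "film" in well over a thousand, which is why the rare name gets a much higher weight than the generic word.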

  26. Where does this text belong?
    Text Categorization with NLTK
    TVNZ: “Obama and Hangover star trade insults in interview”
    Categories: Entertainment, Politics
    >>> import nltk
    >>> train_set = [(document_features(d), c) for (d, c) in categorized_documents]
    >>> classifier = nltk.NaiveBayesClassifier.train(train_set)
    >>> doc_features = document_features(new_document)
    >>> category = classifier.classify(doc_features)
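The slide does not show `document_features`. A minimal Python 3 sketch follows, assuming a bag-of-words representation in which each feature records whether a vocabulary word occurs in the document. The vocabulary, the `contains(...)` feature names, and the whitespace tokenization of the headline are assumptions for illustration only.

```python
# Hypothetical vocabulary; in practice it might be the most frequent corpus words.
VOCABULARY = ["obama", "president", "election", "star", "movie", "interview"]


def document_features(document, vocabulary=VOCABULARY):
    """Map a tokenized document to a dict of word-presence features."""
    words = {w.lower() for w in document}
    return {"contains(%s)" % word: (word in words) for word in vocabulary}


headline = "Obama and Hangover star trade insults in interview".split()
features = document_features(headline)
print(features)
```

A dictionary of this shape is what `nltk.NaiveBayesClassifier.train` accepts as the first element of each `(features, label)` pair. The headline triggers both political and entertainment words, which illustrates why the category is ambiguous.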

  27. Sentiment analysis with TextBlob
    >>> from textblob import TextBlob
    >>> blob = TextBlob("I love this library")
    >>> blob.sentiment
    Sentiment(polarity=0.5, subjectivity=0.6)
    # transcendence: a list of paths to review text files (not defined on the slide)
    for review in transcendence:
        blob = TextBlob(open(review).read())
        print('%s %s' % (review, blob.sentiment.polarity))
    ../data/transcendence_1star.txt 0.0170799124247
    ../data/transcendence_5star.txt 0.0874591503268
    ../data/transcendence_8star.txt 0.256845238095
    ../data/transcendence_10star.txt 0.304310344828

  28. Building NLP applications
    A little bit of data science

  29. Keyword extraction in 3h:
    Understanding a movie review
    Keywords: bellboy, jennifer beals, four rooms, beals, rooms, tarantino, madonna, antonio banderas, valeria golino
    …four of the biggest directors in hollywood : quentin tarantino , robert rodriguez , … were all directing one big film with a big and popular cast ...the second room ( jennifer beals ) was better , but lacking in plot ... the bumbling and mumbling bellboy … ruins every joke in the film …
    github.com/zelandiya/KiwiPyCon-NLP-tutorial

  30. Keyword extraction on 2000 movie reviews:
    What makes a successful movie?
    Negative: van damme, zeta-jones, smith, batman, de palma, eddie murphy, killer, tommy lee jones, wild west, mars, murphy, ship, space, brothers, de bont, ...
    Positive: star wars, disney, war, de niro, jackie, alien, jackie chan, private ryan, truman show, ben stiller, cameron, science fiction, cameron diaz, fiction, jack, ...

  31. How can NLP help a beer drinker?
    Sweaty Horse Blanket: Processing the Natural Language of Beer
    by Ben Fields
    vimeo.com/96809735

  32. [No text on this slide]

  33. Other NLP areas
    What’s coming next?

  34. Filling the gaps in machine understanding
    … Jack Ruby, who killed J.F.Kennedy's assassin Lee Harvey Oswald. …
    Freebase: /m/0d3k14, /m/044sb, /m/0d3k14

  35. What’s next?
    Vs.

  36. Conclusions:
    Understanding human language with Python
    Are we there yet?
    NLTK: nltk.org
    radimrehurek.com/gensim
    textblob.readthedocs.org
    scikit-learn.org/stable
    deeplearning.net/software/theano
    More on Twitter: @zelandiya #nlproc
    See also: github.com/zelandiya/KiwiPyCon-NLP-tutorial