Alyona Medelyan: Understanding human language with Python

Understanding human language with Python
Alyona Medelyan

Who am I?
Alyona Medelyan
▪ In Natural Language Processing since 2000
▪ PhD in NLP & Machine Learning from Waikato
▪ Author of the open source keyword extraction algorithm Maui
▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”
▪ Past: Chief Research Officer at Pingar
▪ Now: Founder of Entopix, NLP consultancy & software development
aka @zelandiya

Agenda
State of NLP
  Recap on fiction vs reality: Are we there yet?
NLP Complexities
  Why is understanding language so complex?
NLP using Python
  NLTK, Gensim, TextBlob & Co
Building NLP applications
  A little bit of data science
Other NLP areas
  And what’s coming next

State of NLP
Fiction versus Reality

He (KITT) “always had an ego that was easy to bruise and displayed a very sensitive, but kind and dryly humorous personality.” - Wikipedia

Android Auto: “hands-free operation through voice commands will be emphasized to ensure safe driving”

“by putting this into one's ear one can instantly understand anything said in any language” (Hitchhiker Wiki)

Word Lens: “augmented reality translation”

Two girls use Google Translate to call a real Indian restaurant and order in Hindi… How did it go?
www.youtube.com/watch?v=wxDRburxwz8

The LCARS (or simply library computer) … used sophisticated artificial intelligence routines to understand and execute vocal natural language commands (From Memory Alpha Wiki)

Let’s try out Google

“Samantha [the OS] proves to be constantly available, always curious and interested, supportive and undemanding”

Siri doesn’t seem to be as “available”

NLP Complexities
Why is understanding language so complex?

Word segmentation complexities
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.

Disambiguation complexities
Flying planes can be dangerous

NLP using Python
NLTK, Gensim, TextBlob & Co

What can we do with text?
text → sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …

NLTK
Python platform for NLP

How to get to the core words?
Remove Stopwords with NLTK
even the acting in transcendence is solid , with the dreamy depp turning in a typically strong performance
i think that transcendence has a pretty solid acting, with the dreamy depp turning in a strong performance as he usually does
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> words = ['the', 'acting', 'in', 'transcendence', 'is', 'solid', 'with', 'the', 'dreamy', 'depp']
>>> print([word for word in words if word not in stop])
['acting', 'transcendence', 'solid', 'dreamy', 'depp']
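The effect of stopword removal on the two review sentences can be sketched without NLTK. This is a minimal illustration only: the stoplist below is a hypothetical hand-picked subset (NLTK's English list is much longer), and `core_words` is a made-up helper name.

```python
import re

# Hypothetical mini stoplist for illustration; NLTK's English list is much longer.
STOP = {"the", "in", "is", "with", "a", "even", "i", "that", "has", "as", "he", "does"}

def core_words(sentence, stop=STOP):
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in stop]

a = core_words("even the acting in transcendence is solid , with the dreamy depp "
               "turning in a typically strong performance")
b = core_words("i think that transcendence has a pretty solid acting, with the dreamy "
               "depp turning in a strong performance as he usually does")

# Words that survive in both sentences hint at their common core.
shared = sorted(set(a) & set(b))
print(shared)
```

Once the function words are gone, the two differently worded sentences share most of their remaining vocabulary.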

Getting closer to the meaning:
Part of Speech tagging with NLTK
Flying planes can be dangerous
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> nltk.pos_tag(word_tokenize("Flying planes can be dangerous"))
[('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('dangerous', 'JJ')] ✓
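Here the tagger picked the verb reading of "Flying" (VBG) rather than the adjective one. Once tokens carry tags, they can be filtered by word class. The sketch below works on the tagged list copied from the slide; the choice of noun and adjective tags as "content" and the helper name `content_words` are assumptions for illustration.

```python
# Tagged output copied from the slide (Penn Treebank tags).
tagged = [('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'),
          ('be', 'VB'), ('dangerous', 'JJ')]

# Assumption for illustration: nouns (NN*) and adjectives (JJ*) carry the content.
CONTENT_PREFIXES = ('NN', 'JJ')

def content_words(tagged_tokens, prefixes=CONTENT_PREFIXES):
    """Keep the words whose tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefixes)]

print(content_words(tagged))  # ['planes', 'dangerous']
```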

Keyword scoring: TFxIDF
TF: Relative frequency of a term t in a document d
IDF: The inverse proportion of documents d in collection D mentioning term t
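The two definitions above can be written as tfidf(t, d, D) = tf(t, d) × log(|D| / |{d in D : t in d}|). A minimal pure-Python sketch follows; the base-2 logarithm and the toy documents are assumptions for illustration, and libraries differ in the exact weighting and normalisation they apply.

```python
import math
from collections import Counter

def tf(term, doc):
    """Relative frequency of term in one tokenized document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log2 of (number of documents / number of documents mentioning term)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log2(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Toy collection, invented for this example.
docs = [
    ["a", "film", "about", "a", "film"],
    ["a", "comedy", "film"],
    ["a", "violent", "film"],
    ["a", "quiet", "drama"],
]

for term in ["a", "film", "comedy"]:
    print(term, idf(term, docs))
```

A word that occurs in every document gets an IDF of zero, while a word confined to a single document gets the highest IDF in the collection.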

TFxIDF with Gensim
from nltk.corpus import movie_reviews
from gensim import corpora, models
texts = []
for fileid in movie_reviews.fileids():
    texts.append(movie_reviews.words(fileid))
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

TFxIDF with Gensim (Results)
for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
    my_id = dictionary.token2id.get(word)
    print(word, '\t', tfidf.idfs[my_id])
film 0.190174003903
movie 0.364013496254
comedy 1.98564470702
violence 3.2108967825
jolie 6.96578428466

Where does this text belong?
Text Categorization with NLTK
Entertainment
Politics
TVNZ: “Obama and Hangover star trade insults in interview”
>>> train_set = [(document_features(d), c) for (d,c) in categorized_documents]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> doc_features = document_features(new_document)
>>> category = classifier.classify(doc_features)
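The slide does not show `document_features` or `categorized_documents`. One common shape for them, given here as an assumption rather than the talk's actual code, is a dictionary of word-presence features per tokenized document. The vocabulary and the two training documents below are invented for illustration.

```python
# Hypothetical vocabulary; in practice it might be the most frequent corpus words.
WORD_FEATURES = ["obama", "president", "election", "star", "movie", "interview"]

def document_features(document, word_features=WORD_FEATURES):
    """Map a tokenized document to {'contains(word)': True/False} features."""
    words = set(w.lower() for w in document)
    return {"contains(%s)" % w: (w in words) for w in word_features}

# Invented toy data: (tokenized document, category) pairs.
categorized_documents = [
    (["Obama", "wins", "the", "election"], "Politics"),
    (["Hangover", "star", "in", "new", "movie"], "Entertainment"),
]

train_set = [(document_features(d), c) for (d, c) in categorized_documents]
print(train_set[0])
```

A `train_set` of this shape, a list of (feature dictionary, label) pairs, is what `nltk.NaiveBayesClassifier.train` accepts.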

Sentiment analysis with TextBlob
>>> from textblob import TextBlob
>>> blob = TextBlob("I love this library")
>>> blob.sentiment
Sentiment(polarity=0.5, subjectivity=0.6)
for review in transcendence:
    blob = TextBlob(open(review).read())
    print(review, blob.sentiment.polarity)
../data/transcendence_1star.txt 0.0170799124247
../data/transcendence_5star.txt 0.0874591503268
../data/transcendence_8star.txt 0.256845238095
../data/transcendence_10star.txt 0.304310344828
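A rough intuition for a polarity score is averaging per-word scores from a sentiment lexicon. The sketch below is a deliberately simplified stand-in, not TextBlob's implementation: the lexicon entries and their scores are invented, and real analyzers also handle negation, intensifiers and more.

```python
import re

# Toy lexicon with made-up scores in [-1, 1]; not TextBlob's actual lexicon.
LEXICON = {"love": 0.5, "great": 0.8, "solid": 0.3, "boring": -0.6, "awful": -1.0}

def polarity(text, lexicon=LEXICON):
    """Average the scores of words found in the lexicon; 0.0 if none match."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("I love this library"))
print(polarity("A boring film with great acting"))
print(polarity("The film runs two hours"))
```

Mixed reviews average out towards zero under such a scheme, which may be one reason the polarity values on the slide stay small even for the 10-star review.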

Building NLP applications
A little bit of data science

Keywords extraction in 3h:
Understanding a movie review
bellboy, jennifer beals, four rooms, beals, rooms, tarantino, madonna, antonio banderas, valeria golino
…four of the biggest directors in hollywood : quentin tarantino , robert rodriguez , … were all directing one big film with a big and popular cast ...the second room ( jennifer beals ) was better , but lacking in plot ... the bumbling and mumbling bellboy … ruins every joke in the film …
github.com/zelandiya/KiwiPyCon-NLP-tutorial

Keyword extraction on 2000 movie reviews:
What makes a successful movie?
Negative: van damme, zeta – jones, smith, batman, de palma, eddie murphy, killer, tommy lee jones, wild west, mars, murphy, ship, space, brothers, de bont, ...
Positive: star wars, disney, war, de niro, jackie, alien, jackie chan, private ryan, truman show, ben stiller, cameron, science fiction, cameron diaz, fiction, jack, ...
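The slide does not say how the two keyword lists were derived. One plausible approach, shown purely as an assumption, is to extract keywords per review and rank each keyword by how much more often it appears in positive reviews than in negative ones. The function name and the toy keyword sets below are invented for illustration.

```python
from collections import Counter

def contrast_keywords(pos_docs, neg_docs, top=2):
    """Rank keywords by (share of positive docs) - (share of negative docs)."""
    pos = Counter(k for doc in pos_docs for k in set(doc))
    neg = Counter(k for doc in neg_docs for k in set(doc))
    score = {k: pos[k] / len(pos_docs) - neg[k] / len(neg_docs)
             for k in set(pos) | set(neg)}
    # Highest score first; ties broken alphabetically.
    ranked = sorted(score, key=lambda k: (-score[k], k))
    return ranked[:top], ranked[-top:][::-1]

# Invented toy data: keywords already extracted from each review.
pos_docs = [["star wars", "space"], ["star wars", "jackie chan"], ["space", "disney"]]
neg_docs = [["van damme", "space"], ["van damme", "killer"], ["batman", "killer"]]

positive, negative = contrast_keywords(pos_docs, neg_docs)
print("Positive:", positive)
print("Negative:", negative)
```

A keyword such as "space" that shows up on both sides scores lower than one seen only in positive reviews, so it drops out of the top of either list.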

How NLP can help a beer drinker?
Sweaty Horse Blanket: Processing the Natural Language of Beer by Ben Fields
vimeo.com/96809735

Other NLP areas
What’s coming next?

Filling the gaps in machine understanding
/m/0d3k14  /m/044sb  /m/0d3k14
… Jack Ruby, who killed J.F.Kennedy's assassin Lee Harvey Oswald. …
Freebase

What’s next?
Vs.

Conclusions:
Understanding human language with Python
Are we there yet?
deeplearning.net/software/theano
scikit-learn.org/stable
NLTK: nltk.org
radimrehurek.com/gensim
textblob.readthedocs.org
More on Twitter: @zelandiya #nlproc
See also: github.com/zelandiya/KiwiPyCon-NLP-tutorial

New Zealand Python User Group
