Alyona Medelyan: Understanding human language with Python

Slide 1

Slide 1 text

Understanding human language with Python
Alyona Medelyan

Slide 2

Slide 2 text

Who am I? Alyona Medelyan
▪ In Natural Language Processing since 2000
▪ PhD in NLP & Machine Learning from Waikato
▪ Author of the open source keyword extraction algorithm Maui
▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”
▪ Past: Chief Research Officer at Pingar
▪ Now: Founder of Entopix, NLP consultancy & software development
aka @zelandiya

Slide 3

Slide 3 text

Agenda
▪ State of NLP: recap on fiction vs reality. Are we there yet?
▪ NLP Complexities: why is understanding language so complex?
▪ NLP using Python: NLTK, Gensim, TextBlob & Co
▪ Building NLP applications: a little bit of data science
▪ Other NLP areas: and what’s coming next

Slide 4

Slide 4 text

State of NLP
Fiction versus Reality

Slide 5

Slide 5 text

He (KITT) “always had an ego that was easy to bruise and displayed a very sensitive, but kind and dryly humorous personality.” (Wikipedia)

Slide 6

Slide 6 text

Android Auto: “hands-free operation through voice commands will be emphasized to ensure safe driving”

Slide 7

Slide 7 text

“by putting this into one's ear one can instantly understand anything said in any language” (Hitchhiker Wiki)

Slide 8

Slide 8 text

Word Lens: “augmented reality translation”

Slide 9

Slide 9 text

Two girls use Google Translate to call a real Indian restaurant and order in Hindi… How did it go?
www.youtube.com/watch?v=wxDRburxwz8

Slide 10

Slide 10 text

The LCARS (or simply library computer) … used sophisticated artificial intelligence routines to understand and execute vocal natural language commands (from Memory Alpha Wiki)

Slide 11

Slide 11 text

Let’s try out Google

Slide 12

Slide 12 text

“Samantha [the OS] proves to be constantly available, always curious and interested, supportive and undemanding”

Slide 13

Slide 13 text

Siri doesn’t seem to be as “available”

Slide 14

Slide 14 text

NLP Complexities
Why is understanding language so complex?

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

Word segmentation complexities
▪
▪
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
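The highlighting that distinguished the two copies of the sentence did not survive in this transcript. One plausible reading is that "hot dogs" can be segmented either as two words or as a single unit. A minimal Python sketch of that idea, assuming a small hand-made list of multiword units (MULTIWORD_UNITS, tokenize and segment are hypothetical names, not part of NLTK):

```python
import re

# Hypothetical lexicon of multiword units, lowercased (an assumption for illustration)
MULTIWORD_UNITS = {("hot", "dogs"), ("charles", "feltman"), ("coney", "island")}

def tokenize(text):
    """Naive segmentation: every run of letters or digits is one word."""
    return re.findall(r"\w+", text)

def segment(tokens, units=MULTIWORD_UNITS, max_len=2):
    """Greedy longest-match segmentation that merges known multiword units."""
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then fall back to a single token
        for n in range(max_len, 0, -1):
            chunk = tokens[i:i + n]
            candidate = tuple(t.lower() for t in chunk)
            if n == 1 or candidate in units:
                result.append(" ".join(chunk))
                i += len(chunk)
                break
    return result

sentence = "The first hot dogs were sold by Charles Feltman on Coney Island in 1870."
words = tokenize(sentence)   # "hot" and "dogs" are separate words
merged = segment(words)      # "hot dogs" becomes a single unit
print(words)
print(merged)
```

The first segmentation treats "hot" and "dogs" as independent words; the second keeps "hot dogs" together, which is closer to the intended meaning.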

Slide 17

Slide 17 text

Disambiguation complexities
Flying planes can be dangerous

Slide 18

Slide 18 text

NLP using Python
NLTK, Gensim, TextBlob & Co

Slide 19

Slide 19 text

What can we do with text?
sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …

Slide 20

Slide 20 text

NLTK
Python platform for NLP

Slide 21

Slide 21 text

How to get to the core words? Remove stopwords with NLTK
even the acting in transcendence is solid, with the dreamy depp turning in a typically strong performance
i think that transcendence has a pretty solid acting, with the dreamy depp turning in a strong performance as he usually does
>>> from nltk.corpus import stopwords  # needs nltk.download('stopwords') once
>>> stop = stopwords.words('english')
>>> words = ['the', 'acting', 'in', 'transcendence', 'is', 'solid', 'with', 'the', 'dreamy', 'depp']
>>> print([word for word in words if word not in stop])
['acting', 'transcendence', 'solid', 'dreamy', 'depp']

Slide 22

Slide 22 text

Getting closer to the meaning: Part of Speech tagging with NLTK
Flying planes can be dangerous
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> nltk.pos_tag(word_tokenize("Flying planes can be dangerous"))
[('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('dangerous', 'JJ')] ✓

Slide 23

Slide 23 text

Keyword scoring: TFxIDF
▪ TF: relative frequency of a term t in a document d
▪ IDF: the inverse proportion of documents d in collection D mentioning term t
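The two factors on this slide are multiplied to score a term. A minimal sketch in plain Python, assuming a base-2 logarithm for the inverse document proportion (the slide does not state the base) and a made-up four-document toy corpus; the function names are hypothetical:

```python
import math

def tf(term, doc):
    """Relative frequency of `term` in the tokenized document `doc`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log2 of (number of documents / number of documents mentioning `term`)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log2(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    """TFxIDF score of `term` in `doc`, relative to the collection `docs`."""
    return tf(term, doc) * idf(term, docs)

# Toy collection (an assumption for illustration, not data from the talk)
docs = [
    ["the", "film", "is", "solid"],
    ["the", "film", "is", "a", "comedy"],
    ["the", "violence", "in", "the", "film"],
    ["the", "movie", "is", "dreamy"],
]

doc = docs[1]
scores = {term: tfidf(term, doc, docs) for term in set(doc)}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(term, round(score, 3))
```

A word that occurs in every document ("the") scores zero, while a word that is frequent in one document but rare in the collection ("comedy") scores highest.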

Slide 24

Slide 24 text

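TFxIDF with Gensim
from nltk.corpus import movie_reviews
from gensim import corpora, models
# Collect the tokenized words of every review in the corpus
texts = []
for fileid in movie_reviews.fileids():
    texts.append(movie_reviews.words(fileid))
# Map each word to an integer id, then turn each review into a bag of words
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Fit the TFxIDF model on the bag-of-words corpus
tfidf = models.TfidfModel(corpus)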

Slide 25

Slide 25 text

TFxIDF with Gensim (Results)
for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
    my_id = dictionary.token2id.get(word)
    print('%s\t%s' % (word, tfidf.idfs[my_id]))
film      0.190174003903
movie     0.364013496254
comedy    1.98564470702
violence  3.2108967825
jolie     6.96578428466
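These values are IDF weights, and they are consistent with IDF = log2(D / df), which is gensim's default weighting, with D = 2000 reviews in the movie_reviews corpus. Under that assumption, the number of reviews mentioning a word can be recovered from its IDF. A small sketch (the helper names are hypothetical):

```python
import math

TOTAL_DOCS = 2000  # number of reviews in NLTK's movie_reviews corpus

def idf(doc_freq, total_docs=TOTAL_DOCS):
    """IDF as log2(D / df); assumed to match gensim's default weighting."""
    return math.log2(total_docs / doc_freq)

def estimate_doc_freq(idf_value, total_docs=TOTAL_DOCS):
    """Invert the formula: roughly how many documents mention the word."""
    return round(total_docs / 2 ** idf_value)

# IDF values copied from the results above
reported = {"film": 0.190174003903, "jolie": 6.96578428466}
estimated = {word: estimate_doc_freq(value) for word, value in reported.items()}
print(estimated)
```

Under this assumption "jolie" is mentioned in about 16 of the 2000 reviews, while "film" appears in most of them, which is why its weight is close to zero.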

Slide 26

Slide 26 text

Where does this text belong? Text Categorization with NLTK
Categories: Entertainment, Politics
TVNZ: “Obama and Hangover star trade insults in interview”
>>> import nltk
>>> train_set = [(document_features(d), c) for (d, c) in categorized_documents]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> doc_features = document_features(new_document)
>>> category = classifier.classify(doc_features)
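The slide leaves document_features undefined. A common choice is a bag-of-words feature dictionary that records which vocabulary words occur in a document. A minimal sketch, assuming a tiny hypothetical vocabulary and training set (VOCABULARY and the example documents are made up for illustration); the resulting train_set is a list of (feature dictionary, label) pairs, which is the shape nltk.NaiveBayesClassifier.train expects:

```python
# Hypothetical vocabulary; in practice it could be the most frequent corpus words
VOCABULARY = ["obama", "election", "interview", "star", "movie", "comedy"]

def document_features(document_words, vocabulary=VOCABULARY):
    """Map a tokenized document to {"contains(word)": True/False} features."""
    present = set(word.lower() for word in document_words)
    return {"contains(%s)" % word: (word in present) for word in vocabulary}

# Made-up labelled documents as (list of words, category) pairs
categorized_documents = [
    ("obama wins the election".split(), "politics"),
    ("the comedy star talks about his new movie".split(), "entertainment"),
]
train_set = [(document_features(d), c) for (d, c) in categorized_documents]

new_document = "obama and hangover star trade insults in interview".split()
doc_features = document_features(new_document)
print(doc_features)
```

The TVNZ headline activates both a politics feature ("obama") and an entertainment feature ("star"), which illustrates why this example is hard to categorize.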

Slide 27

Slide 27 text

Sentiment analysis with TextBlob
>>> from textblob import TextBlob
>>> blob = TextBlob("I love this library")
>>> blob.sentiment
Sentiment(polarity=0.5, subjectivity=0.6)
# transcendence: a list of paths to review files, defined elsewhere
for review in transcendence:
    blob = TextBlob(open(review).read())
    print('%s %s' % (review, blob.sentiment.polarity))
../data/transcendence_1star.txt 0.0170799124247
../data/transcendence_5star.txt 0.0874591503268
../data/transcendence_8star.txt 0.256845238095
../data/transcendence_10star.txt 0.304310344828

Slide 28

Slide 28 text

Building NLP applications
A little bit of data science

Slide 29

Slide 29 text

Keyword extraction in 3h: Understanding a movie review
Keywords: bellboy, jennifer beals, four rooms, beals, rooms, tarantino, madonna, antonio banderas, valeria golino
…four of the biggest directors in hollywood: quentin tarantino, robert rodriguez, … were all directing one big film with a big and popular cast ... the second room (jennifer beals) was better, but lacking in plot ... the bumbling and mumbling bellboy … ruins every joke in the film …
github.com/zelandiya/KiwiPyCon-NLP-tutorial

Slide 30

Slide 30 text

Keyword extraction on 2000 movie reviews: What makes a successful movie?
Negative: van damme, zeta-jones, smith, batman, de palma, eddie murphy, killer, tommy lee jones, wild west, mars, murphy, ship, space, brothers, de bont, ...
Positive: star wars, disney, war, de niro, jackie, alien, jackie chan, private ryan, truman show, ben stiller, cameron, science fiction, cameron diaz, fiction, jack, ...

Slide 31

Slide 31 text

How can NLP help a beer drinker?
Sweaty Horse Blanket: Processing the Natural Language of Beer, by Ben Fields
vimeo.com/96809735

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Other NLP areas
What’s coming next?

Slide 34

Slide 34 text

Filling the gaps in machine understanding
/m/0d3k14 /m/044sb /m/0d3k14
… Jack Ruby, who killed J.F.Kennedy's assassin Lee Harvey Oswald. …
Freebase

Slide 35

Slide 35 text

What’s next?
Vs.

Slide 36

Slide 36 text

Conclusions: Understanding human language with Python
Are we there yet?
▪ deeplearning.net/software/theano
▪ scikit-learn.org/stable
▪ NLTK: nltk.org
▪ radimrehurek.com/gensim
▪ textblob.readthedocs.org
More on Twitter: @zelandiya #nlproc
See also: github.com/zelandiya/KiwiPyCon-NLP-tutorial