Slide 1

(a very gentle) Introduction to Natural Language Processing in Python

Marco Bonzanini
London Python Meet-up, Sept 2016

Slide 2

Nice to Meet You

Slide 3

Objectives for Today
• Getting one Python person to try some NLP
• Getting one NLP person to try some Python

Slide 4

http://speakerdeck.com/marcobonzanini
http://github.com/bonzanini/nlp-tutorial

Slide 5

Natural Language

Slide 6

Language
Instrument of communication

Slide 7

Natural
Not planned, not artificial

Slide 8

SELECT name, address
FROM businesses
WHERE business_type = 'pub'
AND postcode_area = 'EC2M'

vs

Where is the nearest pub?

Slide 9

Natural = Easy ?

Slide 10

Language is huge

Slide 11

• OED: 171,000 words in current use (plus 47,000 obsolete)
• Average person: 10,000-40,000 words
  (according to: The Guardian 1986, BBC 2009, and several random people on the Web)
• Average person? (Lies, damned lies and statistics)
• Active vs passive vocabulary

Slide 12

Language is confusing (sometimes)

Slide 13

That that is is that that is not is not is that it it is (That’s proper English)

Slide 14

That that is, is. That that is not, is not. Is that it? It is.

More fun at: https://en.wikipedia.org/wiki/List_of_linguistic_example_sentences
Pics: https://en.wikipedia.org/wiki/Socrates and https://en.wikipedia.org/wiki/Parmenides

Slide 15

Language is ambiguous

Slide 16

Word ambiguity

Slide 17

“They ate pizza with anchovies”
Syntactic ambiguity

Slide 18

Common sense is implied
(but computers don’t really have it)

Slide 19

Natural Language Processing: where Computational Linguistics meets Computer Science
NLP vs Text Mining vs Text Analytics

Slide 20

NLP Goals: Text Data → Useful Information → Actionable Insights

Slide 21

NLP Applications
• Text Classification
• Text Clustering
• Text Summarisation
• Machine Translation
• Semantic Search
• Sentiment Analysis
• Question Answering
• Information Extraction

Slide 22

No content

Slide 23

No content

Slide 24

I was told there would be Python

Slide 25

No content

Slide 26

NLP Pipeline
• pip install nltk
• Sentence Boundary Detection
• Word Tokenisation
• Word Normalisation
• Stop-word removal
• Bigrams / trigrams / n-grams
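The steps above can be sketched as a toy, pure-Python pipeline (the example sentence and the tiny stop list are made up for illustration; the NLTK equivalents follow on the next slides):

```python
# A naive, pure-Python sketch of the pipeline steps above.
text = "I like good olive oil. I like it a lot."

# Sentence boundary detection (naive: split on '. ')
sentences = [s.strip(" .") for s in text.split(". ")]

# Word tokenisation + normalisation (naive: whitespace + lowercase)
tokens = [tok.lower().strip(".") for s in sentences for tok in s.split()]

# Stop-word removal (tiny custom list)
stop_list = {"i", "it", "a"}
content_tokens = [tok for tok in tokens if tok not in stop_list]

# Bigrams: pairs of adjacent tokens
bigrams = list(zip(content_tokens, content_tokens[1:]))
```

The next slides show why each naive step breaks down and which NLTK tool replaces it.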

Slide 27

Sentence Boundary Detection

Slide 28

How about str.split('.')?
How about Mr., Dr. or U.S.A.?
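A quick illustration of why the naive split fails (the example sentence is made up):

```python
# Abbreviations break naive sentence splitting:
text = "Mr. Smith went to London. He liked it."
naive = text.split('.')
# 'Mr' becomes its own "sentence", and a trailing empty string appears.
```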

Slide 29

from nltk.tokenize import sent_tokenize

Slide 30

Word Tokenisation

Slide 31

How about str.split(' ')?
How about punctuation?
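Again, a quick illustration of the problem (the example sentence is made up):

```python
# Punctuation sticks to words when splitting on whitespace:
s = "Where is the nearest pub?"
naive = s.split(' ')
# The last "token" is 'pub?', not 'pub'.
```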

Slide 32

from nltk.tokenize import word_tokenize

Slide 33

>>> s = "Talking about #NLProc in #Python at @python_london meetup"
>>> word_tokenize(s)
['Talking', 'about', '#', 'NLProc', 'in', '#', 'Python', 'at', '@', 'python_london', 'meetup']

Slide 34

from nltk.tokenize import TweetTokenizer

Slide 35

Text Normalisation

Slide 36

>>> "python" == "Python"
False
>>> "python" == "Python".lower()
True

Slide 37

Stemming
• Map a token into its stem
• Fish, Fishes, Fishing → Fish

Slide 38

from nltk.stem import PorterStemmer
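A minimal sketch of the stemmer in use (assumes nltk is installed; the Porter stemmer is rule-based, so no extra corpora need downloading):

```python
from nltk.stem import PorterStemmer

# The slide's example: all three forms map to the same stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in ["fish", "fishes", "fishing"]]
```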

Slide 39

from nltk.stem import SnowballStemmer

Slide 40

Lemmatisation
• Map a token into its lemma
• Go, goes, going, went → Go

Slide 41

from nltk.stem import WordNetLemmatizer

Slide 42

Stop-word Removal

Slide 43

from nltk.corpus import stopwords

stopwords.words('english')

Slide 44

>>> stop_list = [ … ]  # custom
>>> s = "a piece of butter"
>>> [tok for tok in word_tokenize(s) if tok not in stop_list]
['piece', 'butter']

Slide 45

n-grams

Slide 46

from nltk import bigrams
from nltk import trigrams
from nltk import ngrams
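For tokens already in a list, the same idea can be sketched in pure Python with zip (my_ngrams is a hypothetical helper written here for illustration, not part of nltk):

```python
tokens = ["good", "olive", "oil"]

def my_ngrams(tokens, n):
    # Slide an n-wide window over the token list.
    return list(zip(*(tokens[i:] for i in range(n))))

bigrams_ = my_ngrams(tokens, 2)   # pairs of adjacent tokens
trigrams_ = my_ngrams(tokens, 3)  # triples of adjacent tokens
```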

Slide 47

Good for capturing phrases:
“bad movie”, “good olive oil”, …

How about stop-words?

Slide 48

… Now what?

Slide 49

Exploring Text Data

Slide 50

from collections import Counter

frequencies = Counter(all_tokens)
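A minimal runnable sketch (the all_tokens list is made up for illustration):

```python
from collections import Counter

# Counter gives term frequencies and the most common tokens directly.
all_tokens = ["to", "be", "or", "not", "to", "be"]
frequencies = Counter(all_tokens)
top = frequencies.most_common(2)  # the two most frequent tokens
```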

Slide 51

Visualising Text Data

Slide 52

pip install wordcloud

Slide 53

Can we do something smarter?

Slide 54

“You shall know a word by the company it keeps.”
– J.R. Firth, 1957

Slide 55

• From Bag-of-Words to Word Embeddings (e.g. word2vec)
• Similar context = close vectors
• Semantic relationships: vector arithmetic!
• pip install gensim

Slide 56

from gensim.models import Word2Vec

model = Word2Vec(sentences)

model.most_similar(positive=['king', 'woman'], negative=['man'])
[('queen', 0.50882536), ...]

Tutorial: https://rare-technologies.com/word2vec-tutorial/

Slide 57

More NLP Libraries
• spaCy — “industrial-strength NLP”, designed for speed and accuracy
• scikit-learn — Machine Learning, with good support for text (e.g. TfidfVectorizer)
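A minimal TfidfVectorizer sketch (assumes scikit-learn is installed; the docs list is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each document becomes a tf-idf weighted vector over the corpus vocabulary.
docs = ["good movie", "bad movie", "good olive oil"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: docs x vocabulary
vocab = sorted(vectorizer.vocabulary_)      # the learned vocabulary
```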

Slide 58

Summary
• 80/20 rule: preprocessing is 80%?
• Counting words vs Neural Networks (in less than 1h!)
• Rich Python ecosystem

Slide 59

Thank You

https://github.com/bonzanini/nlp-tutorial
@MarcoBonzanini