SELECT name, address
FROM businesses
WHERE business_type = 'pub'
AND postcode_area = 'EC2M'
vs
Where is the nearest pub?
Slide 9
Natural = Easy
?
Slide 10
Language is huge
Slide 11
• OED: 171,000 words in current use (47k obsolete)
• Average person: 10,000-40,000 words
(according to: The Guardian 1986, BBC 2009, and several random people on the Web)
• Average person? (Lies, damn lies and statistics)
• Active vs Passive vocabulary
Slide 12
Language is confusing
(sometimes)
Slide 13
That that is is that that is
not is not is that it it is
(That’s proper English)
Slide 14
That that is, is.
That that is not, is not.
Is that it? It is.
More fun at:
https://en.wikipedia.org/wiki/List_of_linguistic_example_sentences
Pics:
https://en.wikipedia.org/wiki/Socrates and https://en.wikipedia.org/wiki/Parmenides
Slide 15
Language is ambiguous
Slide 16
Word ambiguity
Slide 17
“They ate pizza with anchovies”
Syntactic ambiguity
Slide 18
Common sense
is implied
(but computers don’t really have it)
Slide 19
Natural Language Processing
Computational
Linguistics
Computer
Science
NLP
NLP vs Text Mining vs Text Analytics
Slide 20
NLP Goals
Text Data
Useful Information
Actionable Insights
Slide 21
NLP Applications
• Text Classification
• Text Clustering
• Text Summarisation
• Machine Translation
• Semantic Search
• Sentiment Analysis
• Question Answering
• Information Extraction
Stemming
• Map a token into its stem
• Fish, Fishes, Fishing → Fish
Slide 38
from nltk.stem import PorterStemmer
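A minimal sketch of the stemmer in action, using the example words from the slide above (fish, fishes, fishing all reduce to the same stem):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The slide's example: all three surface forms share one stem
for word in ["fish", "fishes", "fishing"]:
    print(word, "->", stemmer.stem(word))  # every stem is "fish"
```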
Slide 39
from nltk.stem import SnowballStemmer
Slide 40
Lemmatisation
• Map a token into its lemma
• Go, goes, going, went → Go
Slide 41
from nltk.stem import WordNetLemmatizer
Slide 42
Stop-word Removal
Slide 43
from nltk.corpus import stopwords
stopwords.words('english')
Slide 44
>>> stop_list = [ … ] # custom
>>> s = "a piece of butter"
>>> [tok for tok in word_tokenize(s)
...     if tok not in stop_list]
['piece', 'butter']
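A dependency-free version of the snippet above, with a small hypothetical stop list (the slide leaves the actual list up to you) and str.split() standing in for word_tokenize:

```python
# Hypothetical stop list for illustration only
stop_list = ['a', 'an', 'the', 'of', 'in', 'on']

s = "a piece of butter"

# Keep every token that is not in the stop list
tokens = [tok for tok in s.split() if tok not in stop_list]
print(tokens)  # ['piece', 'butter']
```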
Slide 45
n-grams
Slide 46
from nltk import bigrams
from nltk import trigrams
from nltk import ngrams
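A quick sketch of what these helpers return, using a made-up token list; each function yields a lazy generator of tuples:

```python
from nltk import bigrams, trigrams, ngrams

tokens = ['good', 'olive', 'oil']  # hypothetical token list

print(list(bigrams(tokens)))    # [('good', 'olive'), ('olive', 'oil')]
print(list(trigrams(tokens)))   # [('good', 'olive', 'oil')]
print(list(ngrams(tokens, 2)))  # n=2 is equivalent to bigrams
```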
Slide 47
Good for capturing phrases:
“bad movie”, “good olive oil”, …
How about stop-words?
Slide 48
… Now what?
Slide 49
Exploring Text Data
Slide 50
from collections import Counter

frequencies = Counter(all_tokens)
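For example, with a hypothetical token list, Counter gives per-word frequencies and most_common() ranks them:

```python
from collections import Counter

# Hypothetical token list for illustration
all_tokens = ['to', 'be', 'or', 'not', 'to', 'be']

frequencies = Counter(all_tokens)
print(frequencies.most_common(2))  # [('to', 2), ('be', 2)]
```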
Slide 51
Visualising Text Data
Slide 52
pip install wordcloud
Slide 53
Can we do something
smarter?
Slide 54
“You shall know a word
by the company it keeps.”
– J.R. Firth, 1957
Slide 55
• From Bag-of-Words to Word Embeddings
(e.g. word2vec)
• Similar context = close vectors
• Semantic relationships: vector arithmetic!
• pip install gensim
Slide 56
from gensim.models import Word2Vec

model = Word2Vec(sentences)

model.most_similar(positive=['king', 'woman'],
                   negative=['man'])

[('queen', 0.50882536), ...]
Tutorial: https://rare-technologies.com/word2vec-tutorial/
Slide 57
More NLP Libraries
• spaCy — “industrial-strength NLP”
Designed for speed and accuracy
• scikit-learn — Machine Learning
Good support for text
(e.g. TfidfVectorizer)
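As a sketch of the scikit-learn side, TfidfVectorizer turns a (hypothetical) mini-corpus into a sparse document-term matrix of TF-IDF weights:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus for illustration
corpus = [
    "good olive oil",
    "bad movie",
    "good movie",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

print(X.shape)  # (3, 5): 3 documents, 5 distinct terms
```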
Slide 58
Summary
• 80/20 rule: preprocessing is 80%?
• Counting words vs Neural Networks
(in less than 1h!)
• Rich Python ecosystem
Slide 59
Thank You
https://github.com/bonzanini/nlp-tutorial
@MarcoBonzanini