Introduction to NLP in Python @ London Python Meetup Sept 2016

A very gentle introduction to the field of Natural Language Processing (NLP) using Python tools.

The tutorial at https://github.com/bonzanini/nlp-tutorial shows more detailed examples and could be used as a companion for these slides.

Marco Bonzanini

September 29, 2016
Transcript

  1. (a very gentle) Introduction to Natural Language Processing in Python

     Marco Bonzanini, London Python Meet-up, Sept 2016
  2. Nice to Meet You

  3. Objectives for Today
     • Getting one Python person to try some NLP
     • Getting one NLP person to try some Python
  4. http://speakerdeck.com/marcobonzanini http://github.com/bonzanini/nlp-tutorial

  5. Natural Language

  6. Language Instrument of communication

  7. Natural Not planned, not artificial

  8. SELECT name, address
     FROM businesses
     WHERE business_type = 'pub'
     AND postcode_area = 'EC2M'

     vs

     Where is the nearest pub?
  9. Natural = Easy ?

  10. Language is huge

  11. • OED: 171,000 words in current use (47k obsolete)
      • Average person: 10,000-40,000 words
        (according to: The Guardian 1986, BBC 2009, and several random people on the Web)
      • Average person? (Lies, damn lies and statistics)
      • Active vs Passive vocabulary
  12. Language is confusing (sometimes)

  13. That that is is that that is not is not is that it it is
      (That’s proper English)

  14. That that is, is. That that is not, is not. Is that it? It is.
      More fun at: https://en.wikipedia.org/wiki/List_of_linguistic_example_sentences
      Pics: https://en.wikipedia.org/wiki/Socrates and https://en.wikipedia.org/wiki/Parmenides
  15. Language is ambiguous

  16. Word ambiguity

  17. “They ate pizza with anchovies”
      Syntactic ambiguity

  18. Common sense is implied (but computers don’t really have it)

  19. Natural Language Processing sits at the intersection of
      Computational Linguistics and Computer Science
      NLP vs Text Mining vs Text Analytics
  20. NLP Goals Text Data Useful Information Actionable Insights

  21. NLP Applications
      • Text Classification • Text Clustering • Text Summarisation • Machine Translation
      • Semantic Search • Sentiment Analysis • Question Answering • Information Extraction
  22. None
  23. None
  24. I was told there would be Python

  25. None
  26. NLP Pipeline
      • pip install nltk
      • Sentence Boundary Detection • Word Tokenisation • Word Normalisation
      • Stop-word removal • Bigrams / trigrams / n-grams
  27. Sentence Boundary Detection

  28. How about str.split('.') ??? How about Mr., Dr. or U.S.A.?
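A quick sketch of why splitting on full stops is not enough (the example sentence is illustrative):

```python
# Naive sentence splitting on '.' breaks on abbreviations.
text = "Dr. Smith works in the U.S.A. He likes pubs."
fragments = text.split('.')
print(fragments)
# ['Dr', ' Smith works in the U', 'S', 'A', ' He likes pubs', '']
# Six fragments instead of the two actual sentences.
```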

  29. from nltk.tokenize import sent_tokenize

  30. Word Tokenisation

  31. How about str.split(' ') ??? How about punctuation?

  32. from nltk.tokenize import word_tokenize

  33. >>> s = "Talking about #NLProc in #Python at @python_london meetup"
      >>> word_tokenize(s)
      ['Talking', 'about', '#', 'NLProc', 'in', '#', 'Python', 'at', '@', 'python_london', 'meetup']
  34. from nltk.tokenize import TweetTokenizer
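A minimal sketch of the difference, reusing the tweet from slide 33: unlike word_tokenize, TweetTokenizer keeps hashtags and mentions in one piece.

```python
from nltk.tokenize import TweetTokenizer

s = "Talking about #NLProc in #Python at @python_london meetup"

# word_tokenize splits '#' and '@' off; TweetTokenizer keeps them attached.
print(TweetTokenizer().tokenize(s))
# ['Talking', 'about', '#NLProc', 'in', '#Python', 'at', '@python_london', 'meetup']
```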

  35. Text Normalisation

  36. >>> "python" == "Python" False >>> "python" == "Python".lower() True

  37. Stemming
      • Map a token into its stem
      • Fish, Fishes, Fishing → Fish
  38. from nltk.stem import PorterStemmer

  39. from nltk.stem import SnowballStemmer
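Both stemmers can be tried on the fish example from slide 37; a minimal sketch:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Both algorithms reduce inflected forms to a common stem.
words = ['fish', 'fishes', 'fishing']
print([porter.stem(w) for w in words])    # ['fish', 'fish', 'fish']
print([snowball.stem(w) for w in words])  # ['fish', 'fish', 'fish']
```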

  40. Lemmatisation
      • Map a token into its lemma
      • Go, goes, going, went → Go
  41. from nltk.stem import WordNetLemmatizer

  42. Stop-word Removal

  43. from nltk.corpus import stopwords

      stopwords.words('english')

  44. >>> stop_list = [ … ]  # custom
      >>> s = "a piece of butter"
      >>> [tok for tok in word_tokenize(s)
      ...  if tok not in stop_list]
      ['piece', 'butter']
  45. n-grams

  46. from nltk import bigrams
      from nltk import trigrams
      from nltk import ngrams
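A short sketch on an illustrative token list (bigrams and trigrams are just the n=2 and n=3 cases of ngrams):

```python
from nltk import bigrams, trigrams, ngrams

tokens = ['good', 'olive', 'oil', 'from', 'italy']

print(list(bigrams(tokens)))
# [('good', 'olive'), ('olive', 'oil'), ('oil', 'from'), ('from', 'italy')]
print(list(trigrams(tokens)))
# [('good', 'olive', 'oil'), ('olive', 'oil', 'from'), ('oil', 'from', 'italy')]
print(list(ngrams(tokens, 4)))
# [('good', 'olive', 'oil', 'from'), ('olive', 'oil', 'from', 'italy')]
```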
  47. Good for capturing phrases: “bad movie”, “good olive oil”, …
      How about stop-words?
  48. … Now what?

  49. Exploring Text Data

  50. from collections import Counter

      frequencies = Counter(all_tokens)
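With a Counter in hand, the most frequent tokens are one call away; a quick sketch with an illustrative token list:

```python
from collections import Counter

all_tokens = ['to', 'be', 'or', 'not', 'to', 'be']
frequencies = Counter(all_tokens)

print(frequencies.most_common(2))  # [('to', 2), ('be', 2)]
print(frequencies['be'])           # 2
```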

  51. Visualising Text Data

  52. pip install wordcloud @MarcoBonzanini

  53. Can we do something smarter?

  54. “You shall know a word by the company it keeps.”
      –J.R. Firth, 1957
  55. • From Bag-of-Words to Word Embeddings (e.g. word2vec)
      • Similar context = close vectors
      • Semantic relationships: vector arithmetic!
      • pip install gensim
  56. from gensim.models import Word2Vec

      model = Word2Vec(sentences)

      model.most_similar(positive=['king', 'woman'],
                         negative=['man'])

      [('queen', 0.50882536), ...]

      Tutorial: https://rare-technologies.com/word2vec-tutorial/
  57. More NLP Libraries
      • spaCy — “industrial-strength NLP”, designed for speed and accuracy
      • scikit-learn — Machine Learning, with good support for text (e.g. TfidfVectorizer)
  58. Summary
      • 80/20 rule: preprocessing is 80%?
      • Counting words vs Neural Networks (in less than 1h!)
      • Rich Python ecosystem
  59. Thank You https://github.com/bonzanini/nlp-tutorial
 @MarcoBonzanini