Introduction to NLP in Python @ London Python Meetup Sept 2016

A very gentle introduction to the field of Natural Language Processing (NLP) using Python tools.

The tutorial at https://github.com/bonzanini/nlp-tutorial shows more detailed examples and could be used as a companion for these slides.

Marco Bonzanini

September 29, 2016

Transcript

  1. (a very gentle) Introduction to Natural Language Processing in Python

    Marco Bonzanini, London Python Meet-up, Sept 2016
  2. Objectives for Today

    • Getting one Python person to try some NLP
    • Getting one NLP person to try some Python
  3. SELECT name, address
    FROM businesses
    WHERE business_type = 'pub'
    AND postcode_area = 'EC2M'

    vs

    Where is the nearest pub?
  4. • OED: 171,000 words in current use (47k obsolete)
    • Average person: 10,000-40,000 words (according to: The Guardian 1986, BBC 2009, and several random people on the Web)
    • Average person? (Lies, damn lies and statistics)
    • Active vs Passive vocabulary
  5. That that is is that that is not is not is that it it is

    (That’s proper English)
  6. That that is, is. That that is not, is not. Is that it? It is.

    More fun at: https://en.wikipedia.org/wiki/List_of_linguistic_example_sentences
    Pics: https://en.wikipedia.org/wiki/Socrates and https://en.wikipedia.org/wiki/Parmenides
  7. NLP Applications

    • Text Classification
    • Text Clustering
    • Text Summarisation
    • Machine Translation
    • Semantic Search
    • Sentiment Analysis
    • Question Answering
    • Information Extraction
  8. NLP Pipeline

    • pip install nltk
    • Sentence Boundary Detection
    • Word Tokenisation
    • Word Normalisation
    • Stop-word removal
    • Bigrams / trigrams / n-grams
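
To make the pipeline steps concrete, here is a minimal sketch using only the standard library. It is not how NLTK implements these steps: the regular expressions, the function names and the tiny `STOP_WORDS` set are all assumptions chosen for illustration, and normalisation is reduced to lowercasing.

```python
import re

# Hypothetical, deliberately tiny stop list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "and", "to"}


def split_sentences(text):
    """Naive sentence boundary detection: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive word tokenisation: runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


def normalize(tokens):
    """Word normalisation, reduced here to lowercasing."""
    return [tok.lower() for tok in tokens]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [tok for tok in tokens if tok not in stop_words]


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pipeline(text):
    """Run every step and return one list of tokens per sentence."""
    return [remove_stop_words(normalize(tokenize(s))) for s in split_sentences(text)]


text = "The pub is in London. Where is the nearest pub?"
processed = pipeline(text)
print(processed)                # [['pub', 'london'], ['where', 'nearest', 'pub']]
print(ngrams(processed[1], 2))  # [('where', 'nearest'), ('nearest', 'pub')]
```

Each step is a plain function on lists, so any of them can be swapped for a library equivalent (for example NLTK's `word_tokenize`) without touching the rest.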
  9. >>> from nltk.tokenize import word_tokenize
    >>> s = "Talking about #NLProc in #Python at @python_london meetup"
    >>> word_tokenize(s)
    ['Talking', 'about', '#', 'NLProc', 'in', '#', 'Python', 'at', '@', 'python_london', 'meetup']
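
The output above splits `#` and `@` away from the words they mark. One possible way to keep hashtags and mentions whole is a small regular expression, sketched below with the standard library only. The pattern and the `tweet_tokenize` name are assumptions for illustration; NLTK also provides a `TweetTokenizer` class in `nltk.tokenize` aimed at this kind of text.

```python
import re

# An optional leading # or @, followed by one or more word characters
# (letters, digits, underscore).
TOKEN_RE = re.compile(r"[#@]?\w+")


def tweet_tokenize(text):
    """Tokenise text while keeping #hashtags and @mentions as single tokens."""
    return TOKEN_RE.findall(text)


s = "Talking about #NLProc in #Python at @python_london meetup"
print(tweet_tokenize(s))
# ['Talking', 'about', '#NLProc', 'in', '#Python', 'at', '@python_london', 'meetup']
```

This sketch discards punctuation entirely, which may or may not suit the task at hand.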
  10. >>> stop_list = [ … ]  # custom
    >>> s = "a piece of butter"
    >>> [tok for tok in word_tokenize(s) if tok not in stop_list]
    ['piece', 'butter']
  11. • From Bag-of-Words to Word Embeddings (e.g. word2vec)
    • Similar context = close vectors
    • Semantic relationships: vector arithmetic!
    • pip install gensim
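
The idea of "close vectors" and "vector arithmetic" can be illustrated without any trained model. The sketch below uses hand-made 3-dimensional vectors: the numbers are invented purely for illustration and are not real embeddings, and `most_similar` here is a toy helper, not gensim's method of the same name.

```python
import math

# Made-up vectors, chosen by hand for illustration only.
# Real embeddings are learned from a corpus and have far more dimensions.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "pub": [0.0, 0.3, 0.3],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(positive, negative, topn=1):
    """Add the positive vectors, subtract the negative ones,
    then rank the remaining words by cosine similarity to the result."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# king - man + woman lands closest to queen in this toy space.
print(most_similar(positive=["king", "woman"], negative=["man"]))
```

With these toy numbers the top result is "queen"; with learned embeddings the outcome depends on the training corpus.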
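  12. from gensim.models import Word2Vec
    model = Word2Vec(sentences)  # sentences: an iterable of token lists
    # recent gensim versions expose queries on model.wv; older ones used model.most_similar
    model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
    [('queen', 0.50882536), ...]

    Tutorial: https://rare-technologies.com/word2vec-tutorial/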
  13. More NLP Libraries

    • spaCy: “industrial-strength NLP”, designed for speed and accuracy
    • scikit-learn: Machine Learning, good support for text (e.g. TfidfVectorizer)
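
To show what a TF-IDF weighting computes, here is a standard-library sketch of one common textbook variant (raw term count times log(N / document frequency)). It is not scikit-learn's implementation: `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ. The three example documents are invented for illustration.

```python
import math
from collections import Counter

# Invented mini-corpus, already lowercased, tokenised on whitespace.
docs = [
    "the pub is near the station",
    "the meetup is about python",
    "python and nlp at the meetup",
]
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(docs)


def tfidf(tokens):
    """Weight each word by its count in the document times log(N / df)."""
    counts = Counter(tokens)
    return {word: count * math.log(n_docs / df[word]) for word, count in counts.items()}


weights = [tfidf(tokens) for tokens in tokenized]
for word, weight in sorted(weights[0].items(), key=lambda pair: -pair[1]):
    print(f"{word:10s} {weight:.3f}")
```

Words that occur in every document, such as "the", get weight zero, while words unique to one document, such as "pub", score highest.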
  14. Summary

    • 80/20 rule: preprocessing is 80%?
    • Counting words vs Neural Networks (in less than 1h!)
    • Rich Python ecosystem