Introduction to NLP in Python @ London Python Meetup Sept 2016

A very gentle introduction to the field of Natural Language Processing (NLP) using Python tools.

The tutorial at https://github.com/bonzanini/nlp-tutorial shows more detailed examples and could be used as a companion for these slides.

Marco Bonzanini

September 29, 2016
Transcript

  1. (a very gentle) Introduction to Natural Language Processing in Python

     Marco Bonzanini, London Python Meet-up, Sept 2016
  2. Nice to Meet You

  3. Objectives for Today
     • Getting one Python person to try some NLP
     • Getting one NLP person to try some Python
  4. http://speakerdeck.com/marcobonzanini http://github.com/bonzanini/nlp-tutorial

  5. Natural Language

  6. Language Instrument of communication

  7. Natural Not planned, not artificial

  8. SELECT name, address
     FROM businesses
     WHERE business_type = 'pub'
     AND postcode_area = 'EC2M'

     vs

     Where is the nearest pub?
  9. Natural = Easy ?

  10. Language is huge

  11. • OED: 171,000 words in current use (47k obsolete)
      • Average person: 10,000-40,000 words
        (according to: The Guardian 1986, BBC 2009, and several random people on the Web)
      • Average person? (Lies, damn lies and statistics)
      • Active vs Passive vocabulary
  12. Language is confusing (sometimes)

  13. That that is is that that is not is not is that it it is
      (That’s proper English)

  14. That that is, is. That that is not, is not. Is that it? It is.
      More fun at: https://en.wikipedia.org/wiki/List_of_linguistic_example_sentences
      Pics: https://en.wikipedia.org/wiki/Socrates and https://en.wikipedia.org/wiki/Parmenides
  15. Language is ambiguous

  16. Word ambiguity

  17. “They ate pizza with anchovies”
      Syntactic ambiguity

  18. Common sense is implied (but computers don’t really have it)

  19. Natural Language Processing sits at the intersection of
      Computational Linguistics and Computer Science
      NLP vs Text Mining vs Text Analytics
  20. NLP Goals Text Data Useful Information Actionable Insights

  21. NLP Applications
      • Text Classification • Text Clustering • Text Summarisation • Machine Translation
      • Semantic Search • Sentiment Analysis • Question Answering • Information Extraction
  22. None
  23. None
  24. I was told there would be Python

  25. None
  26. NLP Pipeline
      • pip install nltk
      • Sentence Boundary Detection • Word Tokenisation • Word Normalisation
      • Stop-word removal • Bigrams / trigrams / n-grams
  27. Sentence Boundary Detection

  28. How about str.split('.') ??? How about Mr., Dr. or U.S.A.?
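A quick sketch of why splitting on full stops is not enough (the example sentence is illustrative):

```python
# Naive sentence splitting on '.' breaks on abbreviations.
text = "Dr. Smith works in the U.S.A. He likes pubs."
fragments = text.split('.')
print(fragments)
# ['Dr', ' Smith works in the U', 'S', 'A', ' He likes pubs', '']
# Six fragments instead of the two actual sentences.
```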

  29. from nltk.tokenize import sent_tokenize

  30. Word Tokenisation

  31. How about str.split(' ') ??? How about punctuation?

  32. from nltk.tokenize import word_tokenize

  33. >>> s = "Talking about #NLProc in #Python at @python_london meetup"
      >>> word_tokenize(s)
      ['Talking', 'about', '#', 'NLProc', 'in', '#', 'Python', 'at', '@', 'python_london', 'meetup']
  34. from nltk.tokenize import TweetTokenizer
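A minimal sketch of the difference, reusing the tweet from slide 33: unlike word_tokenize, TweetTokenizer keeps hashtags and mentions in one piece.

```python
from nltk.tokenize import TweetTokenizer

s = "Talking about #NLProc in #Python at @python_london meetup"

# word_tokenize splits '#' and '@' off; TweetTokenizer keeps them attached.
print(TweetTokenizer().tokenize(s))
# ['Talking', 'about', '#NLProc', 'in', '#Python', 'at', '@python_london', 'meetup']
```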

  35. Text Normalisation

  36. >>> "python" == "Python" False >>> "python" == "Python".lower() True

  37. Stemming
      • Map a token into its stem
      • Fish, Fishes, Fishing → Fish
  38. from nltk.stem import PorterStemmer

  39. from nltk.stem import SnowballStemmer
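Both stemmers can be tried on the fish example from slide 37; a minimal sketch:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Both algorithms reduce inflected forms to a common stem.
words = ['fish', 'fishes', 'fishing']
print([porter.stem(w) for w in words])    # ['fish', 'fish', 'fish']
print([snowball.stem(w) for w in words])  # ['fish', 'fish', 'fish']
```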

  40. Lemmatisation
      • Map a token into its lemma
      • Go, goes, going, went → Go
  41. from nltk.stem import WordNetLemmatizer

  42. Stop-word Removal

  43. from nltk.corpus import stopwords

      stopwords.words('english')

  44. >>> stop_list = [ … ]  # custom
      >>> s = "a piece of butter"
      >>> [tok for tok in word_tokenize(s)
      ...  if tok not in stop_list]
      ['piece', 'butter']
  45. n-grams

  46. from nltk import bigrams
      from nltk import trigrams
      from nltk import ngrams
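A short sketch on an illustrative token list (bigrams and trigrams are just the n=2 and n=3 cases of ngrams):

```python
from nltk import bigrams, trigrams, ngrams

tokens = ['good', 'olive', 'oil', 'from', 'italy']

print(list(bigrams(tokens)))
# [('good', 'olive'), ('olive', 'oil'), ('oil', 'from'), ('from', 'italy')]
print(list(trigrams(tokens)))
# [('good', 'olive', 'oil'), ('olive', 'oil', 'from'), ('oil', 'from', 'italy')]
print(list(ngrams(tokens, 4)))
# [('good', 'olive', 'oil', 'from'), ('olive', 'oil', 'from', 'italy')]
```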
  47. Good for capturing phrases: “bad movie”, “good olive oil”, …
      How about stop-words?
  48. … Now what?

  49. Exploring Text Data

  50. from collections import Counter

      frequencies = Counter(all_tokens)
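With a Counter in hand, the most frequent tokens are one call away; a quick sketch with an illustrative token list:

```python
from collections import Counter

all_tokens = ['to', 'be', 'or', 'not', 'to', 'be']
frequencies = Counter(all_tokens)

print(frequencies.most_common(2))  # [('to', 2), ('be', 2)]
print(frequencies['be'])           # 2
```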

  51. Visualising Text Data

  52. pip install wordcloud @MarcoBonzanini

  53. Can we do something smarter?

  54. “You shall know a word by the company it keeps.”
      –J.R. Firth, 1957
  55. • From Bag-of-Words to Word Embeddings (e.g. word2vec)
      • Similar context = close vectors
      • Semantic relationships: vector arithmetic!
      • pip install gensim
  56. from gensim.models import Word2Vec

      model = Word2Vec(sentences)

      model.most_similar(positive=['king', 'woman'],
                         negative=['man'])

      [('queen', 0.50882536), ...]

      Tutorial: https://rare-technologies.com/word2vec-tutorial/
  57. More NLP Libraries
      • spaCy — “industrial-strength NLP”, designed for speed and accuracy
      • scikit-learn — Machine Learning, with good support for text (e.g. TfidfVectorizer)
  58. Summary
      • 80/20 rule: preprocessing is 80%?
      • Counting words vs Neural Networks (in less than 1h!)
      • Rich Python ecosystem
  59. Thank You https://github.com/bonzanini/nlp-tutorial
 @MarcoBonzanini