Natural Language Processing with Python @ PyCon UK 2016

Slide 1

Slide 1 text

Natural Language Processing  with Python @MarcoBonzanini and @MiguelMAlvarez github.com/bonzanini/nlp-tutorial

Slide 2

Slide 2 text

Nice to Meet You
• Marco Bonzanini: Freelance Data Scientist
• Miguel Martinez-Alvarez: Head of Research

Slide 3

Slide 3 text

Schedule
• Intro & Logistics (10m)
• Environment Set Up (10m)
• Exploring Text Data (1h + 15m QA)
• Break (10:45 to 11:15)
• Text Classification (1h)
• Bonus Content (30m + 15m QA)

Slide 4

Slide 4 text

The Audience (You!)
• Know some Python already?
• Know some NLP already?
• Both / None of the above?

Slide 5

Slide 5 text

Natural Language Processing
(Diagram labels: Computational Linguistics, Computer Science, NLP)

Slide 6

Slide 6 text

NLP Goals
(Diagram labels: Text Data, Useful Information, Actionable Insights)

Slide 7

Slide 7 text

Formal vs Natural

    SELECT name, address
    FROM businesses
    WHERE business_type = 'pub'
    AND postcode_area = 'CF10'

vs

    Where is the nearest pub?

Slide 8

Slide 8 text

NLP Applications
• Text Classification
• Text Clustering
• Text Summarisation
• Machine Translation
• Semantic Search
• Sentiment Analysis
• Question Answering
• Information Extraction

Slide 9

Slide 9 text

Environment Set Up
• Tested with Python 3.4 and 3.5
• Clone the repository:

    git clone https://github.com/bonzanini/nlp-tutorial
    cd nlp-tutorial

Slide 10

Slide 10 text

Environment Set Up (cont’d)
• Set up virtual environment:

    virtualenv nlp-venv
    source nlp-venv/bin/activate
    pip install -r requirements.txt

Slide 11

Slide 11 text

Environment Set Up (cont’d)
• Set up virtual environment (alternative):

    conda create --name nlp-venv python=3.5
    source activate nlp-venv
    pip install -r requirements.txt

Slide 12

Slide 12 text

Environment Set Up (cont’d)
• Download NLTK data:

    python -m nltk.downloader punkt stopwords reuters

Slide 13

Slide 13 text

Environment Set Up (cont’d)
• Start up Jupyter notebook:

    jupyter notebook

Slide 14

Slide 14 text

Exploring Text Data

Slide 15

Slide 15 text

Goal: Answering Important Questions
What are the most important ingredients in Italian cuisine?

Slide 16

Slide 16 text

recipes_exploratory_analysis.ipynb

Slide 17

Slide 17 text

Recipe Analysis: Summary
• Tokenisation
• Counting words
• Stop-words
• Normalisation
• Stemming
• n-grams
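
The steps in this summary can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the notebook's code: the tutorial installs NLTK for these tasks, whereas the stop-word list, the `naive_stem` helper and the recipe snippets below are all invented for the example, and the toy stemmer is far cruder than a real one such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"the", "a", "and", "of", "with", "in", "to"}


def tokenise(text):
    """Lowercase the text (normalisation) and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Toy stemmer: strip one trailing 's' from longer tokens. Not Porter."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical recipe snippets, invented for this example.
recipes = [
    "Toss the tomatoes with olive oil and basil",
    "Fry garlic in olive oil and add tomatoes",
]

word_counts = Counter()
bigram_counts = Counter()
for recipe in recipes:
    # Tokenise, drop stop-words, then stem what is left.
    kept = [naive_stem(t) for t in tokenise(recipe) if t not in STOP_WORDS]
    word_counts.update(kept)
    bigram_counts.update(ngrams(kept, 2))

print(word_counts.most_common(3))
print(bigram_counts.most_common(1))
```

Counting bigrams as well as single words is what lets a phrase such as "olive oil" surface as one ingredient rather than two unrelated tokens.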

Slide 18

Slide 18 text

pyconuk_exporatory_analysis.ipynb

Slide 19

Slide 19 text

PyConUK Analysis Summary
• “This talk will …”
• TF-IDF
• We’re going to use scikit-learn
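
A minimal sketch of the TF-IDF idea, assuming the point of the "This talk will …" bullet is that words shared by nearly every abstract carry little information. It uses only the standard library to show the arithmetic; the tutorial itself uses scikit-learn, whose `TfidfVectorizer` applies a smoothed IDF and normalisation by default, so its numbers would differ. The abstracts and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Uses tf = raw count / document length and idf = ln(N / df),
    which is one of several common TF-IDF variants.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical talk abstracts, already tokenised and lowercased.
abstracts = [
    ["this", "talk", "will", "cover", "testing"],
    ["this", "talk", "will", "introduce", "asyncio"],
    ["this", "talk", "will", "explore", "text", "mining"],
]

weights = tf_idf(abstracts)

# Terms that appear in every abstract get weight 0; the rest stand out.
distinctive = sorted(t for t, w in weights[1].items() if w > 0)
print(distinctive)
```

With this variant, "this", "talk" and "will" occur in all three abstracts, so their IDF is ln(3/3) = 0 and they drop out regardless of how often they are repeated.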

Slide 20

Slide 20 text

Break

Slide 21

Slide 21 text

Text Classification

Slide 22

Slide 22 text

Text Classification
• “Text categorization (a.k.a. text classification) is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world”
  Scholarpedia (Yiming Yang and Thorsten Joachims)

Slide 23

Slide 23 text

Text Classification
• Binary: only two categories, which are mutually exclusive
  • Spam detection, Anomaly detection, Fraud detection, …
• Multi-class: more than two categories, mutually exclusive
  • Language detection, …
• Multi-label: multiple categories, where each document can be assigned several of them or none
  • News Categorisation, Marketing profiling, …
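
One way to see the difference between the three problem types is in the shape of the labels. The sketch below is illustrative only: the label values and the `to_indicator_rows` helper are made up for this example, and the 0/1 row layout is just one common way to encode multi-label targets.

```python
# Binary: one label per document, drawn from two mutually exclusive classes.
binary_labels = ["spam", "ham", "spam"]

# Multi-class: one label per document, drawn from more than two classes.
multiclass_labels = ["en", "it", "es"]

# Multi-label: a set of labels per document; the set may be empty.
multilabel_labels = [{"politics", "economy"}, set(), {"sport"}]


def to_indicator_rows(label_sets):
    """Turn a list of label sets into 0/1 rows, one column per category."""
    categories = sorted(set().union(*label_sets))
    rows = [[int(c in labels) for c in categories] for labels in label_sets]
    return categories, rows


categories, rows = to_indicator_rows(multilabel_labels)
print(categories)
for row in rows:
    print(row)
```

The second document produces a row of zeros, which is the "none" case that binary and multi-class labels cannot express.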

Slide 24

Slide 24 text

text_classification_Generic.ipynb

Slide 25

Slide 25 text

Text Classification Evaluation

Slide 26

Slide 26 text

Text Classification Evaluation
• “If you cannot measure it, you cannot improve it”.
  Lord Kelvin
• Main metrics for Text Classification: Precision and Recall

Slide 27

Slide 27 text

Text Classification Evaluation Threshold
• 1 document is predicted as belonging to the class, and that 1 prediction is correct
• Only 1 of the 3 documents that truly belong to the class is found
• Precision: 1/1 = 100%, Recall: 1/3 ≈ 33%

Slide 28

Slide 28 text

Text Classification Evaluation Threshold

Slide 29

Slide 29 text

Text Classification Evaluation Threshold
• 3 documents are predicted as belonging to the class, and 2 of those predictions are correct
• 2 of the 3 documents that truly belong to the class are found
• Precision: 2/3 ≈ 67%, Recall: 2/3 ≈ 67%
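
The two threshold examples can be reproduced with a short calculation. This is a sketch under stated assumptions: the document ids and scores are hypothetical, chosen only so that the counts match the slides, and `precision_recall` is an illustrative helper rather than a library function.

```python
def precision_recall(scores, relevant, threshold):
    """Compute precision and recall for one class at a score threshold.

    scores: {doc_id: classifier score for the class}
    relevant: set of doc_ids that truly belong to the class
    """
    predicted = {doc for doc, score in scores.items() if score >= threshold}
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical scores, chosen to reproduce the counts on the slides.
scores = {"d1": 0.9, "d2": 0.7, "d3": 0.6, "d4": 0.4, "d5": 0.2}
relevant = {"d1", "d3", "d4"}

high = precision_recall(scores, relevant, threshold=0.8)
low = precision_recall(scores, relevant, threshold=0.5)

print("threshold 0.8: precision=%.2f recall=%.2f" % high)
print("threshold 0.5: precision=%.2f recall=%.2f" % low)
```

Lowering the threshold from 0.8 to 0.5 admits two more predictions, one right and one wrong, so recall rises from 1/3 to 2/3 while precision falls from 1/1 to 2/3.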

Slide 30

Slide 30 text

text_classification_Evaluation.ipynb

Slide 31

Slide 31 text

Classifying a real collection
text_classification_Reuters.ipynb

Slide 32

Slide 32 text

text_classification_Reuters.ipynb

Slide 33

Slide 33 text

Text Classification Summary
• Types of Classification Problems
• Document Representations: Vectorizers
• Training and predicting
• Evaluation: Precision vs Recall
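
To make the "vectorize, train, predict" sequence concrete, here is a minimal standard-library sketch. It is not the notebooks' code: the slides do not say which classifier they use, so multinomial Naive Bayes with add-one smoothing is swapped in as a simple stand-in, and the `vectorize` function, the `NaiveBayes` class and the training texts are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words representation: token -> count."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.term_counts[label].update(vectorize(text))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for term, freq in vectorize(text).items():
                if term in self.vocab:  # ignore terms never seen in training
                    score += freq * math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, invented for this example.
train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the project agenda",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("claim your free prize"))
print(model.predict("agenda for the project meeting"))
```

The same three-step shape (build a document representation, fit on labelled examples, predict on unseen text) carries over to library implementations; only the vectorizer and the model behind `fit` and `predict` change.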

Slide 34

Slide 34 text

Questions?