Slide 1

Slide 1 text

Beginner’s guide to Machine Learning competitions Christine Doig EuroPython 2015

Slide 2

Slide 2 text

Christine Doig Data Scientist, Continuum Analytics ch_doig chdoig chdoig.github.io

Slide 3

Slide 3 text

Data Science Machine Learning Supervised learning Classification Kaggle Competitions Dataset Setup Feature preparation Modeling Optimization Validation Anaconda Concepts Process NLP Sentiment analysis

Slide 4

Slide 4 text

Data Science Machine Learning Supervised learning Classification Kaggle Competitions Dataset Setup Feature preparation Modeling Optimization Validation Anaconda Concepts Process NLP Sentiment analysis 45min 10 min 1h 5 min 1h

Slide 5

Slide 5 text

Data Science Machine Learning Supervised learning Classification Concepts NLP Sentiment analysis

Slide 6

Slide 6 text

Data Science

Slide 7

Slide 7 text

From the lab to the factory - Data Day Texas
Slides: http://www.slideshare.net/joshwills/production-machine-learninginfrastructure
Video: https://www.youtube.com/watch?v=v-91JycaKjc
data science
http://www.oreilly.com/data/free/files/analyzing-the-analyzers.pdf
http://www.experfy.com/blog/become-data-scientist/
http://www.fico.com/landing/infographic/anatomy-of-a-data-scientist_en.html

Slide 8

Slide 8 text

data science Scientific Computing Distributed Systems Analytics Machine Learning/Stats Web

Slide 9

Slide 9 text

data science Scientific Computing Distributed Systems Analytics Machine Learning/Stats Web Data Scientists/ Modeler Data/Business Analyst Research/Computational Scientist Data Engineers/ Architects Developer

Slide 10

Slide 10 text

data science Scientific Computing Distributed Systems Analytics Machine Learning/Stats Web Data Scientists/ Modeler Data/Business Analyst Research/Computational Scientist Data Engineers/ Architects Developer Model Algorithm Report Application Pipeline/ Architecture

Slide 11

Slide 11 text

data science Scientific Computing Distributed Systems Analytics Machine Learning/Stats Web Data Scientists/ Modeler Data/Business Analyst Research/Computational Scientist Data Engineers/ Architects Models Deep Learning Supervised Clustering SVM Regression Classification Crossvalidation Dimensionality reduction KNN Unsupervised NN Filter Join Select TopK Sort Groupby min summary statistics avg max databases GPUs arrays algorithms performance compute SQL Reporting clusters hadoop hdfs optimization HPC graphs FFT HTML/CSS/JS algebra stream processing deployment servers frontend semantic web batch jobs consistency A/B testing crawling frameworks parallelism availability tolerance DFT spark scraping databases apps NOSQL parallelism interactive data viz pipeline cloud Developer

Slide 12

Slide 12 text

data science Scientific Computing Distributed Systems Analytics Machine Learning/Stats Web Data Scientists/ Modeler Data/Business Analyst Research/Computational Scientist Data Engineers/ Architects Developer PyMC Numba xlwings Bokeh Kafka RDFLib mrjob

Slide 13

Slide 13 text

Machine Learning

Slide 14

Slide 14 text

Machine Learning Unsupervised learning Supervised learning Classification Regression Clustering Latent variables/structure labels no labels categorical quantitative Linear regression Logistic regression SVM Decision trees k-NN K-means Hierarchical clustering *Topic modeling Dimensionality reduction *Topic modeling

Slide 15

Slide 15 text

Exploratory Predictive Machine Learning Unsupervised learning Supervised learning Classification Regression Clustering Latent variables/structure labels no labels categorical quantitative Linear regression Logistic regression SVM Decision trees k-NN K-means Hierarchical clustering *Topic modeling Dimensionality reduction *Topic modeling

Slide 16

Slide 16 text

Exploratory Predictive Machine Learning Unsupervised learning Supervised learning Classification Regression labels no labels categorical quantitative
Unsupervised learning (no labels): group similar individuals together
id  gender  age  job_id
1   F       67   1
2   M       32   2
3   M       45   1
4   F       18   2
Supervised learning (labels)
id  gender  age  job_id  buy/click_ad  money_spent
1   F       67   1       Yes           $1,000
2   M       32   2       No            -
3   M       45   1       No            -
4   F       18   2       Yes           $300
Classification: predict whether an individual is going to buy/click or not
Regression: predict how much the individual is going to spend

Slide 17

Slide 17 text

Natural Language Processing

Slide 18

Slide 18 text

Machine Learning
Natural language processing: field concerned with the interactions between computers and human (natural) languages
tasks
Sentiment analysis: Extract subjective information on polarity (positive or negative) of a document (text, tweet, voice message…), e.g. online reviews to determine how people feel about a particular object or topic.

Slide 19

Slide 19 text

Machine Learning Unsupervised learning Supervised learning Classification Regression Clustering Latent variables/structure labels no labels categorical quantitative Linear regression Logistic regression SVM Decision trees k-NN K-means Hierarchical clustering *Topic modeling Dimensionality reduction *Topic modeling Sentiment analysis e.g. Movie review Positive Negative

Slide 20

Slide 20 text

Setup

Slide 21

Slide 21 text

Kaggle Competitions Dataset Setup Anaconda

Slide 22

Slide 22 text

Setup options
You already have Python installed and your own workflow to install Python packages happy alternative Install dependencies in README
Anaconda: Free Python distribution with a bunch of packages for data science. Python + conda (package manager) + packages. http://continuum.io/downloads too many packages!!!
Miniconda + conda env: Python + conda (package manager). http://conda.pydata.org/miniconda.html
git clone git@github.com:chdoig/ep2015-ml-tutorial.git
cd ep2015-ml-tutorial
conda env create
source activate ep-ml

Slide 23

Slide 23 text

Kaggle https://www.kaggle.com/ hosts online machine learning competitions

Slide 24

Slide 24 text

Kaggle Competition https://www.kaggle.com/c/word2vec-nlp-tutorial

Slide 25

Slide 25 text

Kaggle Competition https://www.kaggle.com/c/word2vec-nlp-tutorial
Bag of Words Meets Bags of Popcorn
Data: 50,000 IMDB movie reviews
labeledTrainData.tsv: 25,000 rows containing an id, sentiment, and text for each review.
testData.tsv: 25,000 rows containing an id and text for each review
Task: predict the sentiment for each review in the test data set

Slide 26

Slide 26 text

Feature preparation Modeling Optimization Validation Process

Slide 27

Slide 27 text

Feature preparation Modeling Optimization Validation Feature extraction Feature selection Feature imputation Feature scaling Feature discretization Neural Networks Decision trees Random forest SVM Naive Bayes classifier Logistic Regression Boosting Bagging Regularization Hold out method Crossvalidation Confusion matrix ROC curve / AUC Hyperparameters Machine Learning Ensemble

Slide 28

Slide 28 text

Feature preparation
Feature extraction: the process of making features from available data to be used by the classification algorithms
Reviews M N Words Model Evaluation Metrics Visualizations NaiveBayes DecisionTrees Feature extraction
id  sentiment  review                  count_words  terrible_word
1   0          the movie was terrible  4            1
2   1          I love it               3            0
3   1          Awesome! Love it!       3            0
4   0          I hated every minute    4            0
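The two hand-made features in the table on this slide (count_words and terrible_word) can be computed with a few lines of plain Python. This is only a minimal sketch: the function name extract_features and the punctuation-stripping rule are assumptions made for illustration, not code from the tutorial notebooks.

```python
# Toy reviews copied from the slide's table: (id, sentiment, review text).
reviews = [
    (1, 0, "the movie was terrible"),
    (2, 1, "I love it"),
    (3, 1, "Awesome! Love it!"),
    (4, 0, "I hated every minute"),
]


def extract_features(text):
    """Return the two features shown on the slide for one review."""
    # Split on whitespace, then strip basic punctuation and lowercase.
    words = [w.strip("!?.,").lower() for w in text.split()]
    return {
        "count_words": len(words),
        "terrible_word": int("terrible" in words),
    }


features = [extract_features(text) for _, _, text in reviews]
for (review_id, sentiment, _), feats in zip(reviews, features):
    print(review_id, sentiment, feats)
```

Each review becomes a small dictionary of numbers, which is the form a classifier can work with.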

Slide 29

Slide 29 text

Feature extraction Text
Tokenization
Simple: transition, metal, oxides, considered, generation, materials, field, electronics, advanced, catalysts, tantalum, v, oxide, reports, synthesis, material, nanometer, size, unusual, properties…
Collocations: transition_metal_oxides, considered, generation, materials, field, electronics, advanced, catalysts, tantalum, oxide, reports, synthesis, material, nanometer_size, unusual, properties, sol_gel_method, biomedical_applications…
Entities: transition, metal_oxides, tantalum, oxide, nanometer_size, unusual_properties, dna, easy_method, biomedical_applications
Combination: transition, metal_oxides, generation, tantalum, oxide, nanometer_size, unusual_properties, sol, dna, easy_method, biomedical_applications
Lemmatization: transition, metal, oxide, consider, generation, material, field, electronic, advance, catalyst, property…
Stopwords
language generic: a, above, across, after, afterwards, again, against, all, …
domain specific: material, temperature, advance, size, …
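Simple tokenization followed by stopword removal could be sketched as below. The STOPWORDS set is a tiny hypothetical list (it starts with the language-generic words from the slide and adds a few more), and the example sentence is invented for illustration; real stopword lists are much longer.

```python
import re

# Hypothetical, very short stopword list for illustration only.
STOPWORDS = {
    "a", "above", "across", "after", "afterwards", "again", "against", "all",
    "the", "are", "as", "of",
}


def tokenize(text):
    """Simple tokenization: lowercase, keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in stopwords]


# Invented sentence, loosely in the spirit of the slide's example tokens.
sentence = "Transition metal oxides are considered as the next generation of materials"
tokens = remove_stopwords(tokenize(sentence))
print(tokens)
```

Collocations, entities and lemmatization need more machinery (for example a trained phrase model or a lemmatizer) and are not covered by this sketch.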

Slide 30

Slide 30 text

Vector Space
Dictionary: 1 - transition, 2 - metal, 3 - oxides, 4 - considered, …
Corpus - Bag of words
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
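The (token id, count) pairs on this slide can be reproduced in spirit with plain Python. The sketch below assumes 0-based ids and two made-up token lists. The helper names build_dictionary and doc2bow are illustrative plain functions, not a library API (the second name simply echoes the bag-of-words idea).

```python
from collections import Counter

# Two made-up tokenized documents for illustration.
docs = [
    ["transition", "metal", "oxides"],
    ["metal", "oxides", "considered", "oxides"],
]


def build_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    dictionary = {}
    for doc in docs:
        for token in doc:
            dictionary.setdefault(token, len(dictionary))
    return dictionary


def doc2bow(doc, dictionary):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(dictionary[t] for t in doc if t in dictionary)
    return sorted(counts.items())


dictionary = build_dictionary(docs)
corpus = [doc2bow(doc, dictionary) for doc in docs]
print(dictionary)
print(corpus)
```

Word order is discarded: each document is described only by which dictionary entries it contains and how often.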

Slide 31

Slide 31 text

Feature_extraction.ipynb

Slide 32

Slide 32 text

Modeling
Naive Bayes Classifier
P(A|B) = P(B|A) * P(A) / P(B)
id  sentiment  review                  count_words  terrible_word
1   0          the movie was terrible  4            1
2   1          I love it               3            0
3   1          Awesome! Love it        3            0
4   0          I hated every minute    4            0
What’s the probability of the review being positive if the word love appears in the review?
P(1 | love) = P(love | 1) * P(1) / P(love) = (2/2 * 2/4) / (2/4) = 100%
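The Bayes' rule calculation on this slide can be checked by counting over the four toy reviews. This is a sketch of that single calculation, not a full Naive Bayes classifier; the helper name contains is an assumption for illustration.

```python
# (sentiment, review) pairs copied from the slide's table.
reviews = [
    (0, "the movie was terrible"),
    (1, "I love it"),
    (1, "Awesome! Love it"),
    (0, "I hated every minute"),
]


def contains(word, text):
    """True if the word appears in the review, ignoring case and punctuation."""
    tokens = [t.strip("!?.,").lower() for t in text.split()]
    return word in tokens


n = len(reviews)
positives = [text for label, text in reviews if label == 1]

p_pos = len(positives) / n                                   # P(1) = 2/4
p_love = sum(contains("love", t) for _, t in reviews) / n    # P(love) = 2/4
p_love_given_pos = (
    sum(contains("love", t) for t in positives) / len(positives)
)                                                            # P(love | 1) = 2/2

# Bayes' rule: P(1 | love) = P(love | 1) * P(1) / P(love)
p_pos_given_love = p_love_given_pos * p_pos / p_love
print(p_pos_given_love)
```

With only four reviews the estimate is extreme; on real data, smoothing is normally added so that unseen words do not produce probabilities of exactly 0 or 1.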

Slide 33

Slide 33 text

Modeling.ipynb

Slide 34

Slide 34 text

Validation
Overfitting: occurs whenever a model learns from patterns that are present in the training data but do not reflect the data-generating process. Seeing more than is actually there. A kind of data hallucination.
http://talyarkoni.org/downloads/ML_Meetup_Yarkoni_Overfitting.pdf

Slide 35

Slide 35 text

Training data Validation Model Evaluate New data Evaluate

Slide 36

Slide 36 text

Validation.ipynb

Slide 37

Slide 37 text

Validation Hold out method Training data Test data accuracy
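A hold-out split and an accuracy score could look like the sketch below. The function names, the 25% test fraction and the fixed seed are assumptions chosen for illustration.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


train, test = holdout_split(range(100))
print(len(train), len(test))
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

The model is fitted on the training rows only, and accuracy is reported on the held-out test rows it has never seen.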

Slide 38

Slide 38 text

Crossvalidation
Test: Final Accuracy, one shot at this!
Training + Validation: Accuracy in each round with validation set
Accuracy = average(Round 1, Round 2, …)
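The "average over rounds" idea can be sketched as a small k-fold loop. The helpers below are hypothetical: fit and predict stand for whatever model is being validated, and the data is assumed to be shuffled already.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, validation_idx) for each of the k rounds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def cross_val_accuracy(fit, predict, X, y, k=5):
    """Average validation accuracy over k rounds."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(model, X[i]) for i in val_idx]
        correct = sum(p == y[i] for p, i in zip(preds, val_idx))
        scores.append(correct / len(val_idx))
    return sum(scores) / len(scores)
```

The test set stays out of this loop entirely; it is used once at the end for the final accuracy.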

Slide 39

Slide 39 text

Confusion matrix Validation Positive reviews Negative reviews 95% 5% Accuracy 95% Real Model prediction

Slide 40

Slide 40 text

Confusion matrix Validation
model/real  positive  negative
positive    95        5
negative    0         0
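The matrix on this slide corresponds to a model that predicts "positive" for every review when 95 of 100 reviews really are positive. A minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative):

```python
from collections import Counter

# 95 positive and 5 negative reviews; the model always answers "positive".
y_real = [1] * 95 + [0] * 5
y_model = [1] * 100


def confusion_matrix(y_real, y_model):
    """Rows: model prediction, columns: real label (same layout as the slide)."""
    counts = Counter(zip(y_model, y_real))
    return [
        [counts[(1, 1)], counts[(1, 0)]],  # predicted positive
        [counts[(0, 1)], counts[(0, 0)]],  # predicted negative
    ]


matrix = confusion_matrix(y_real, y_model)
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_real)
print(matrix)
print(accuracy)
```

Accuracy is 95% even though the model never identifies a negative review, which is why the full matrix is more informative than accuracy alone on imbalanced data.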

Slide 41

Slide 41 text

ROC curve/ AUC Validation true positive false positive 100% true positive 0 % false positive

Slide 42

Slide 42 text

ROC curve/ AUC Validation true positive false positive 100% true positive 0 % false positive AUC
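The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way for small inputs; the function name auc and the example scores are made up for illustration.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # perfect ranking
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # one pair out of order
```

This pairwise version is quadratic in the number of examples, so it is meant for understanding the metric rather than for large datasets.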

Slide 43

Slide 43 text

Kaggle leaderboard

Slide 44

Slide 44 text

Optimization
Ensemble methods
Classifier 1 Classifier 2 Classifier 3
e.g. majority voting
id  cls_1  cls_2  cls_3  ensemble
1   0      0      0      0
2   0      1      1      1
3   1      1      1      1
4   0      0      1      0
e.g. weighted voting: w1 w2 w3
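The majority-voting column in this slide's table, and a weighted variant, can be sketched as follows. The weights used in the example are invented; the slide only names them w1, w2, w3.

```python
# Predictions of the three classifiers per review id, copied from the slide.
predictions = {
    1: (0, 0, 0),
    2: (0, 1, 1),
    3: (1, 1, 1),
    4: (0, 0, 1),
}


def majority_vote(votes):
    """Predict 1 when more than half of the classifiers predict 1."""
    return int(sum(votes) > len(votes) / 2)


def weighted_vote(votes, weights):
    """Predict 1 when the weighted share of 1-votes exceeds one half."""
    score = sum(w * v for w, v in zip(weights, votes))
    return int(score > sum(weights) / 2)


ensemble = {i: majority_vote(v) for i, v in predictions.items()}
print(ensemble)

# With hypothetical weights (1, 1, 3), the third classifier can outvote the others.
print(weighted_vote(predictions[4], (1, 1, 3)))
```

With equal weights, weighted voting reduces to majority voting.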

Slide 45

Slide 45 text

Kaggle forums

Slide 46

Slide 46 text

Data Science Machine Learning Supervised learning Classification Kaggle Competitions Dataset Setup Feature preparation Modeling Optimization Validation Anaconda Concepts Process NLP Sentiment analysis

Slide 47

Slide 47 text

Feature preparation Modeling Optimization Validation Feature extraction Feature selection Feature imputation Feature scaling Feature discretization Neural Networks Decision trees Random forest SVM Naive Bayes classifier Logistic Regression Boosting Bagging Regularization Hold out method Crossvalidation Confusion matrix ROC curve / AUC Hyperparameters Machine Learning Ensemble