Slide 1

Slide 1 text

ML: asking the right questions February 2017

Slide 2

Slide 2 text

Quick recap

Slide 3

Slide 3 text

Hierarchy

Slide 4

Slide 4 text

Explaining Machine Learning

Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.

Slide 5

Slide 5 text

Linear regression

Slide 6

Slide 6 text

Defining the problem

Slide 7

Slide 7 text

Typical starters

- Start with a question: “How can we do XXX?” Then find the data.
- Start with the data: “What can we do with XXX?” Then try to find a problem to solve.

Slide 8

Slide 8 text

Typical starters

- Start with a question: “How can we do XXX?” Then find the data.
- Start with the data: “What can we do with XXX?” Then try to find a problem to solve.

Poor approach
Data may lack a crucial feature

Slide 9

Slide 9 text

Is the problem achievable?

● What category is the problem? (regression, classification, anomaly detection, clustering, recommendation…)

Slide 10

Slide 10 text

Is the problem achievable?

● What category is the problem? (regression, classification, anomaly detection, clustering, recommendation…)
● Can a human do it? (with the same amount of data)

Slide 11

Slide 11 text

Moravec’s paradox “Easy” Extremely hard

Slide 12

Slide 12 text

Is the problem achievable?

● What category is the problem? (regression, classification, anomaly detection, clustering, recommendation…)
● Can a human do it? (with the same amount of data)
● Are there successful similar projects?

Slide 13

Slide 13 text

Is the problem achievable?

● What category is the problem? (regression, classification, anomaly detection, clustering, recommendation…)
● Can a human do it? (with the same amount of data)
● Are there successful similar projects?
● Do you have the data for it?

Slide 14

Slide 14 text

Finding data

- Open data
- Public datasets
- Competitions
- Kaggle
- Mechanical Turk
- Through APIs

Slide 15

Slide 15 text

Create your own data

Date,MorningWeight,YesterdayFactors
2012-06-10,185.0,
2012-06-11,182.6,salad sleep bacon cheese tea halfnhalf icecream
2012-06-12,181.0,sleep egg
2012-06-13,183.6,mottsfruitsnack:2 pizza:0.5 bread:0.5 date:3 dietsnapple splenda milk nosleep
2012-06-14,183.6,coffeecandy:2 egg mayo cheese:2 rice meat bread:0.5 peanut:0.4
2012-06-15,183.4,meat sugarlesscandy salad cherry:4 bread:0 dietsnapple:0.5 egg mayo oliveoil
2012-06-16,183.6,caprise bread grape:0.2 pasadena sugaryogurt dietsnapple:0.5 peanut:0.4 hotdog
2012-06-17,182.6,grape meat pistachio:5 peanut:5 cheese sorbet:5 orangejuice:2
# and so on ...
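A log like the one above has to be parsed before any algorithm can use it. The sketch below is one possible reading of the format, not the original author's tooling: it assumes each factor is either a bare name (counted as 1) or `name:quantity`. The helper names `parse_factors` and `load_log` are made up for this illustration.

```python
import csv
import io

# Three rows copied from the log above, enough to exercise the parser.
SAMPLE = """Date,MorningWeight,YesterdayFactors
2012-06-10,185.0,
2012-06-11,182.6,salad sleep bacon cheese tea halfnhalf icecream
2012-06-13,183.6,mottsfruitsnack:2 pizza:0.5 bread:0.5 date:3 dietsnapple splenda milk nosleep
"""


def parse_factors(text):
    """Turn 'pizza:0.5 milk' into {'pizza': 0.5, 'milk': 1.0}.

    Assumption: a factor without ':quantity' counts as 1.
    """
    factors = {}
    for token in text.split():
        name, _, quantity = token.partition(":")
        factors[name] = float(quantity) if quantity else 1.0
    return factors


def load_log(csv_text):
    """Read the CSV text into a list of dicts with numeric features."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "date": record["Date"],
            "weight": float(record["MorningWeight"]),
            "factors": parse_factors(record["YesterdayFactors"] or ""),
        })
    return rows


log = load_log(SAMPLE)
for row in log:
    print(row["date"], row["weight"], sorted(row["factors"]))
```

Each row becomes a sparse feature dictionary, which is the shape most learning tools expect for this kind of free-form input.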

Slide 16

Slide 16 text

Create your own data

... (output trimmed for brevity) ...
FeatureName         HashVal  ...   Weight  RelScore
nosleep              143407  ...  +0.6654    90.29%
melon                234655  ...  +0.4636    62.91%
sugarlemonade        203375  ...  +0.3975    53.94%
trailmix             174671  ...  +0.3362    45.63%
bread                135055  ...  +0.3345    45.40%
caramelizedwalnut    148079  ...  +0.3316    44.99%
bun                    1791  ...  +0.3094    41.98%
... (trimmed for brevity. Caveat: data is too noisy anyway) ...
stayhome             148879  ...  -0.2690   -36.50%
bacon                 64431  ...  -0.2998   -40.69%
egg                  197743  ...  -0.3221   -43.70%
parmesan               3119  ...  -0.3385   -45.94%
oliveoil             156831  ...  -0.3754   -50.95%
halfnhalf            171855  ...  -0.4673   -63.41%
sleep                127071  ...  -0.7369  -100.00%

Slide 17

Slide 17 text

Challenges of data

Data preparation accounts for about 80% of the work of data scientists

Slide 18

Slide 18 text

Feature derivation

2017-02-08T19:12:18Z

Year: 2017
Month: February
Day of month: 08
Hours: 19
Minutes: 12
Seconds: 18

Slide 19

Slide 19 text

Feature derivation

2017-02-08T19:12:18Z

Year: 2017
Month: February
Day of month: 08
Hours: 19
Minutes: 12
Seconds: 18

Day of week: Wednesday
Week of year: 6
Holiday?: no
Daylight saving?: no

Slide 20

Slide 20 text

Feature derivation

2017-02-08T19:12:18Z

Year: 2017
Month: February
Day of month: 08
Hours: 19
Minutes: 12
Seconds: 18

Day of week: Wednesday
Week of year: 6
Holiday?: no
Daylight saving?: no
Strike?: no

Zone A vacation: no
Zone B vacation: no
Zone C vacation: yes
Weather: Cloudy
Pollution: normal
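The calendar-based features above can be derived from the timestamp alone. Here is a minimal sketch using Python's standard library; the function name `derive_features` and the choice of ISO week numbering are assumptions for this example. Features such as holidays, school-zone vacations, strikes, weather, or pollution need external data sources and are left out.

```python
from datetime import datetime

# Explicit English names, so the result does not depend on the system locale.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def derive_features(timestamp):
    """Expand one UTC timestamp string into several calendar features."""
    moment = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return {
        "year": moment.year,
        "month": MONTHS[moment.month - 1],
        "day_of_month": moment.day,
        "hours": moment.hour,
        "minutes": moment.minute,
        "seconds": moment.second,
        "day_of_week": WEEKDAYS[moment.weekday()],  # weekday(): Monday is 0
        "week_of_year": moment.isocalendar()[1],    # ISO 8601 week number
    }


features = derive_features("2017-02-08T19:12:18Z")
for name, value in features.items():
    print(f"{name}: {value}")
```

A single raw column becomes eight columns that a model can use directly, for example to learn a weekly or hourly pattern.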

Slide 21

Slide 21 text

Generating additional data

You can create additional data from existing data points. For images, you can transform your existing pictures:
- grayscale
- rotate
- saturate/desaturate
- distort slightly
- crop slightly
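As a rough illustration of three of these transformations, the sketch below treats an image as a nested list of `(r, g, b)` tuples and uses only plain Python. All function names are made up for this example, and the rotation is simplified to a quarter turn; a real project would more likely use an image library and small random angles.

```python
def to_grayscale(image):
    """Replace each (r, g, b) pixel by its luma (Rec. 601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in image]


def rotate_90(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def crop(image, margin):
    """Cut `margin` pixels off every side."""
    return [row[margin:len(row) - margin]
            for row in image[margin:len(image) - margin]]


def augment(image):
    """Return several variants of one picture, all keeping the same label."""
    return [to_grayscale(image), rotate_90(image), crop(image, 1)]


# A tiny 3x3 picture: white centre pixel on a black background.
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
picture = [
    [BLACK, BLACK, BLACK],
    [BLACK, WHITE, BLACK],
    [BLACK, BLACK, BLACK],
]
variants = augment(picture)
print(len(variants), "extra training examples from one picture")
```

Each variant is a new training example that carries the label of the original picture.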

Slide 22

Slide 22 text

Objective function

Slide 23

Slide 23 text

Machine Learning: definition

Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.

Slide 24

Slide 24 text

Machine Learning: definition

Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.

But algorithm + data is not enough: there is a third piece

Slide 25

Slide 25 text

Linear regression: least squares
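For a single input variable, the least-squares line has a closed-form solution. The sketch below is one way to compute it in plain Python; the function name `fit_line` and the sample points are made up for illustration.

```python
def fit_line(points):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    count = len(points)
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
slope, intercept = fit_line(points)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The same function works on any list of points: the logic (slope and intercept) comes from the data, not from problem-specific code.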

Slide 26

Slide 26 text

Cost function

Describes either:
- a score to maximize
- an error to minimize

Used to compare two different models (which one is better?)
This is where you specify what’s important for you
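To make the comparison concrete, here is a small sketch that uses mean squared error as the cost and applies it to two candidate models. The data points and both models are invented for this example; mean squared error is only one possible choice of cost.

```python
def mean_squared_error(predict, points):
    """Average of the squared differences between prediction and target."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


def model_a(x):
    return 2.0 * x


def model_b(x):
    return 1.5 * x + 1.0


# Made-up observations, roughly following y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

cost_a = mean_squared_error(model_a, points)
cost_b = mean_squared_error(model_b, points)
better = "model_a" if cost_a < cost_b else "model_b"
print(f"model_a: {cost_a:.3f}  model_b: {cost_b:.3f}  better: {better}")
```

Swapping in a different cost (for instance one that penalizes under-prediction more than over-prediction) can change which model wins, which is why the cost is where you encode what matters to you.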

Slide 28

Slide 28 text

Imbalanced data

Data set with disproportionate classes (99 to 1)
Always predict negative? 99% accuracy!
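The following sketch reproduces this trap on synthetic labels (10 positives among 1,000 examples, an invented data set): a "model" that always answers negative scores 99% accuracy while finding none of the positives.

```python
# Synthetic labels: 1% positives (1), 99% negatives (0).
labels = [1] * 10 + [0] * 990

# A "model" that always predicts negative.
predictions = [0] * len(labels)

correct = sum(1 for p, y in zip(predictions, labels) if p == y)
accuracy = correct / len(labels)

true_positives = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = true_positives / sum(labels)  # share of real positives that were found

print(f"accuracy: {accuracy:.0%}, recall: {recall:.0%}")
```

Accuracy alone hides the failure; recall (or another metric focused on the rare class) exposes it.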

Slide 29

Slide 29 text

Imbalanced classification: different objectives

Blood test
- trying to find patients with a specific illness (1% of the population)
- Should detect all potential sick people
- Most important: low false negatives

Fraud detection
- trying to find fraudulent transactions (1% of all transactions)
- But limited team to investigate
- Most important: low false positives

Slide 31

Slide 31 text

Imbalanced classification: different objectives

Blood test
- trying to find patients with a specific illness (1% of the population)
- Should detect all potential sick people
- Most important: low false negatives

Fraud detection
- trying to find fraudulent transactions (1% of all transactions)
- But limited team to investigate
- Most important: low false positives

Use anomaly detection, not classification

Slide 32

Slide 32 text

Imbalanced classification: different objectives

Blood test
- trying to find patients with a specific illness (1% of the population)
- Should detect all potential sick people
- Most important: low false negatives

HR at Google
- want to find the best candidates among all applications (1% of all applications)
- Never recruit a candidate below a certain threshold
- Most important: low false positives
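One common way to favor either objective is to move the decision threshold applied to a model's score. The sketch below uses invented scores and labels and a hypothetical helper, `count_errors`, to show the trade-off: a low threshold removes false negatives, a high threshold removes false positives.

```python
def count_errors(scores, labels, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    false_positives = sum(1 for s, y in zip(scores, labels)
                          if s >= threshold and y == 0)
    false_negatives = sum(1 for s, y in zip(scores, labels)
                          if s < threshold and y == 1)
    return false_positives, false_negatives


# Made-up model scores (higher means "more likely positive") and true labels.
scores = [0.05, 0.20, 0.35, 0.40, 0.60, 0.65, 0.80, 0.95]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

# Low threshold: nobody sick is missed, at the price of false alarms.
cautious = count_errors(scores, labels, threshold=0.3)

# High threshold: every flagged case is real, but some positives are missed.
selective = count_errors(scores, labels, threshold=0.7)

print("threshold 0.3 -> (FP, FN) =", cautious)
print("threshold 0.7 -> (FP, FN) =", selective)
```

The same model serves both situations; only the threshold, chosen according to which error is more costly, differs.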

Slide 33

Slide 33 text

Fit

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Overfitting: regression

Slide 36

Slide 36 text

Overfitting: classification

Slide 37

Slide 37 text

Solving under-/over-fit

Training set: 80% of data
Test set: 20% of data

Slide 38

Slide 38 text

Overfit diagnosis

Slide 39

Slide 39 text

Solving under-/over-fit

Training set: 60% of data
Test set: 20% of data
Cross-validation set: 20% of data
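A 60/20/20 split can be sketched in a few lines of standard-library Python. The function name `split_dataset` and the fixed seed are assumptions for this example; shuffling first assumes the examples are independent (time series usually need a chronological split instead).

```python
import random


def split_dataset(examples, seed=0):
    """Shuffle, then split into 60% training, 20% cross-validation, 20% test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    train_end = len(shuffled) * 60 // 100
    validation_end = len(shuffled) * 80 // 100
    training = shuffled[:train_end]
    validation = shuffled[train_end:validation_end]
    test = shuffled[validation_end:]
    return training, validation, test


examples = list(range(100))  # stand-in for 100 labeled examples
training, validation, test = split_dataset(examples)
print(len(training), len(validation), len(test))
```

The cross-validation set is used to compare models and tune settings; the test set is kept aside and used once, at the end, to estimate performance on unseen data.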

Slide 40

Slide 40 text

Random bits

Slide 41

Slide 41 text

Accuracy is not the only goal : speed, size, ...

Slide 42

Slide 42 text

Transfer learning

Slide 43

Slide 43 text

Transfer learning

Slide 44

Slide 44 text

Machine Learning Canvas

Slide 45

Slide 45 text

Machine Learning Canvas

Slide 46

Slide 46 text

Design for failure

Slide 47

Slide 47 text

Questions? February 2017