ML Session n°2

ML: asking the right questions
February 2017

Quick recap

Hierarchy

Explaining Machine Learning

Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.

Linear regression

Defining the problem

Typical starters

Start with a question
- “How can we do XXX?”
- Then find the data

Start with the data
- “What can we do with XXX?”
- Try to find a problem to solve
- Poor approach: the data may lack a crucial feature

Is the problem achievable?

• What category is the problem? (regression, classification, anomaly detection, clustering, recommendation…)

• Can a human do it? (with the same amount of data)

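Moravec’s paradox: what is hard for humans can be “easy” for a machine, while what is effortless for humans (perception, mobility) can be extremely hard for it.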

• Are there successful similar projects?

• Do you have the data for it?

Finding data

- Open data
- Public datasets
- Competitions (Kaggle)
- Amazon Mechanical Turk
- Through APIs

Create your own data

Date,MorningWeight,YesterdayFactors
2012-06-10,185.0,
2012-06-11,182.6,salad sleep bacon cheese tea halfnhalf icecream
2012-06-12,181.0,sleep egg
2012-06-13,183.6,mottsfruitsnack:2 pizza:0.5 bread:0.5 date:3 dietsnapple splenda milk nosleep
2012-06-14,183.6,coffeecandy:2 egg mayo cheese:2 rice meat bread:0.5 peanut:0.4
2012-06-15,183.4,meat sugarlesscandy salad cherry:4 bread:0 dietsnapple:0.5 egg mayo oliveoil
2012-06-16,183.6,caprise bread grape:0.2 pasadena sugaryogurt dietsnapple:0.5 peanut:0.4 hotdog
2012-06-17,182.6,grape meat pistachio:5 peanut:5 cheese sorbet:5 orangejuice:2
# and so on ...
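A hand-made log like this needs parsing before an algorithm can use it. A minimal sketch follows, assuming that a bare factor name means a weight of 1.0 and that `name:value` sets an explicit weight. The helper names `parse_factors` and `load_rows` are hypothetical, not from the deck.

```python
import csv
import io

# Hypothetical helper: turn one "YesterdayFactors" cell into {feature: weight}.
# Assumption: a bare name ("sleep") counts as weight 1.0.
def parse_factors(cell):
    features = {}
    for token in cell.split():
        name, separator, value = token.partition(":")
        features[name] = float(value) if separator else 1.0
    return features

# A few rows copied from the slide, used here as sample input.
SAMPLE = """Date,MorningWeight,YesterdayFactors
2012-06-10,185.0,
2012-06-11,182.6,salad sleep bacon cheese tea halfnhalf icecream
2012-06-12,181.0,sleep egg
2012-06-13,183.6,mottsfruitsnack:2 pizza:0.5 bread:0.5 date:3 dietsnapple splenda milk nosleep
"""

def load_rows(text):
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "date": record["Date"],
            "weight": float(record["MorningWeight"]),
            "factors": parse_factors(record["YesterdayFactors"] or ""),
        })
    return rows

rows = load_rows(SAMPLE)
for row in rows:
    print(row["date"], row["weight"], row["factors"])
```

Each row becomes a target value (the morning weight) plus a sparse set of named features, which is the shape most generic learners expect.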

Create your own data

... (output trimmed for brevity) ...
FeatureName        HashVal  ...   Weight  RelScore
nosleep             143407  ...  +0.6654    90.29%
melon               234655  ...  +0.4636    62.91%
sugarlemonade       203375  ...  +0.3975    53.94%
trailmix            174671  ...  +0.3362    45.63%
bread               135055  ...  +0.3345    45.40%
caramelizedwalnut   148079  ...  +0.3316    44.99%
bun                   1791  ...  +0.3094    41.98%
... (trimmed for brevity. Caveat: data is too noisy anyway) ...
stayhome            148879  ...  -0.2690   -36.50%
bacon                64431  ...  -0.2998   -40.69%
egg                 197743  ...  -0.3221   -43.70%
parmesan              3119  ...  -0.3385   -45.94%
oliveoil            156831  ...  -0.3754   -50.95%
halfnhalf           171855  ...  -0.4673   -63.41%
sleep               127071  ...  -0.7369  -100.00%

Challenges of data

Data preparation accounts for about 80% of the work of data scientists.

Feature derivation

2017-02-08T19:12:18Z

Read directly from the timestamp:
- Year: 2017
- Month: February
- Day of month: 08
- Hours: 19
- Minutes: 12
- Seconds: 18

Derived from the calendar:
- Day of week: Wednesday
- Week of year: 6
- Holiday?: no
- Daylight saving?: no

Joined from external sources:
- Zone A vacation: no
- Zone B vacation: no
- Zone C vacation: yes
- Weather: Cloudy
- Strike?: no
- Pollution: normal
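The timestamp-only part of this derivation can be sketched with the standard library. This is a minimal illustration, not the deck's code: the function name `derive_time_features` and the `holidays` parameter are hypothetical, and fields such as weather, strikes or school vacation zones would need external data that is not shown here.

```python
from datetime import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
        "Saturday", "Sunday"]

def derive_time_features(iso_timestamp, holidays=frozenset()):
    # Older Python versions reject a trailing "Z" in fromisoformat,
    # so it is replaced with an explicit UTC offset.
    moment = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    return {
        "year": moment.year,
        "month": MONTHS[moment.month - 1],
        "day_of_month": moment.day,
        "hours": moment.hour,
        "minutes": moment.minute,
        "seconds": moment.second,
        "day_of_week": DAYS[moment.weekday()],      # Monday is index 0
        "week_of_year": moment.isocalendar()[1],    # ISO 8601 week number
        # Assumption: the caller supplies the set of holiday dates.
        "holiday": moment.date() in holidays,
    }

features = derive_time_features("2017-02-08T19:12:18Z")
for name, value in features.items():
    print(name, ":", value)
```

One raw column becomes nine features here, and each of them may correlate with the target in a way the raw timestamp cannot.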

Generating additional data

You can create additional data from existing data points. For images, you can:
- grayscale
- rotate
- saturate/desaturate
- distort slightly
- crop slightly
your existing pictures.
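A minimal sketch of three of these transformations (grayscale, rotate, crop), assuming an image is a list of rows of `(r, g, b)` tuples. A real project would more likely use an imaging library; the function names are hypothetical and chosen for illustration only.

```python
def to_grayscale(image):
    # Common luma weights (Rec. 601); one integer intensity per pixel.
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in image]

def rotate_90_clockwise(image):
    # Reverse the row order, then transpose.
    return [list(row) for row in zip(*image[::-1])]

def crop(image, top, left, height, width):
    return [row[left:left + width] for row in image[top:top + height]]

def augment(image):
    # Returns the original plus three derived variants.
    height, width = len(image), len(image[0])
    return [
        image,
        to_grayscale(image),
        rotate_90_clockwise(image),
        crop(image, 0, 0, height - 1, width - 1),
    ]

white, black, red = (255, 255, 255), (0, 0, 0), (255, 0, 0)
picture = [
    [white, black, red],
    [black, red, white],
]
variants = augment(picture)
print(len(variants), "training images from 1 original")
```

Each variant keeps the original label, so one labelled picture yields several training examples.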

Objective function

Machine Learning: definition

Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.

But algo + data is not enough: there is a third piece.

Linear regression: least squares
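As an illustration of least squares, here is a minimal sketch for a single input variable, using the closed-form solution that minimizes the sum of squared errors. The function names and the sample points are made up for this example.

```python
def fit_least_squares(xs, ys):
    # Closed-form simple linear regression: y is approximated by slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_errors(xs, ys, slope, intercept):
    # The objective that least squares minimizes.
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_least_squares(xs, ys)
print("slope:", slope, "intercept:", intercept)
print("error at the optimum:", sum_squared_errors(xs, ys, slope, intercept))
print("error with another slope:", sum_squared_errors(xs, ys, slope + 0.5, intercept))
```

Any other line gives a larger sum of squared errors on the same points, which is what makes the fitted line "best" under this objective.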

Cost function

Describes:
- a score to maximize
- an error to minimize

Used to compare two different models (which one is better?)

This is where you specify what’s important for you
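To see how the cost function encodes what matters, here is a small sketch with two made-up models. Model A is slightly wrong everywhere; model B is perfect except for one large miss. Mean squared error and mean absolute error, both standard costs, disagree on which one is better. The data and the helper `better_model` are hypothetical.

```python
def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def better_model(cost, actual, candidates):
    # Lower cost wins; candidates maps a model name to its predictions.
    return min(candidates, key=lambda name: cost(actual, candidates[name]))

actual = [10, 10, 10, 10]
candidates = {
    "A": [12, 12, 12, 12],   # off by 2 on every point
    "B": [10, 10, 10, 16],   # exact three times, off by 6 once
}

print("MSE prefers:", better_model(mean_squared_error, actual, candidates))
print("MAE prefers:", better_model(mean_absolute_error, actual, candidates))
```

Squaring punishes the single large miss, so MSE picks A; the absolute error tolerates it, so MAE picks B. Neither is wrong: the choice states whether rare large errors are acceptable.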

Imbalanced data

Population with disproportionate classes (99% / 1%)

Always predict negative? 99% accuracy!
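A quick sketch of this trap on synthetic labels (990 negatives, 10 positives): a "model" that always answers negative scores 99% accuracy and finds none of the positives. Recall is shown as one standard metric that exposes the problem.

```python
def accuracy(labels, predictions):
    correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    return correct / len(labels)

def recall(labels, predictions):
    # Share of actual positives that the model found.
    true_positives = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
    actual_positives = sum(labels)
    return true_positives / actual_positives

labels = [1] * 10 + [0] * 990        # 1% positive class
always_negative = [0] * len(labels)  # the lazy "model"

print("accuracy:", accuracy(labels, always_negative))
print("recall:", recall(labels, always_negative))
```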

Imbalanced classification: different objectives

Blood test
- Trying to find patients with a specific illness (1% of the population)
- Should detect all potential sick people
- Most important: low false negatives

Fraud detection
- Trying to find fraudulent transactions (1% of all transactions)
- But limited team to investigate
- Most important: low false positives

Use anomaly detection, not classification

Imbalanced classification: different objectives

Blood test
- Trying to find patients with a specific illness (1% of the population)
- Should detect all potential sick people
- Most important: low false negatives

HR at Google
- Want to find the best candidates among all applications (1% of all applications)
- Never recruit a candidate below a certain threshold
- Most important: low false positives
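One common way to act on these two objectives, sketched here as an assumption rather than as the deck's method, is to keep the same scoring model and move the decision threshold. The scores and labels below are made up for illustration.

```python
def confusion_counts(labels, scores, threshold):
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            counts["tp"] += 1
        elif predicted_positive and label == 0:
            counts["fp"] += 1
        elif label == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Hypothetical model scores (higher means "more likely positive").
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

# Low threshold: miss nobody, accept some false alarms.
cautious = confusion_counts(labels, scores, threshold=0.3)
# High threshold: flag only sure cases, accept some misses.
selective = confusion_counts(labels, scores, threshold=0.7)

print("low threshold:", cautious)
print("high threshold:", selective)
```

The low threshold gives zero false negatives at the price of two false positives, the blood-test trade-off. The high threshold gives zero false positives but misses one positive, the trade-off wanted when every flagged case is expensive.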

Overfitting: regression

Overfitting: classification

Solving under-/over-fit

- Training set: 80% of data
- Test set: 20% of data

Overfit diagnosis

Solving under-/over-fit

- Training set: 60% of data
- Test set: 20% of data
- Cross-validation set: 20% of data
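A minimal sketch of such a 60/20/20 split, assuming the samples are independent and can be shuffled freely (which would not hold for time series, for instance). The function name `split_dataset` and its parameters are hypothetical.

```python
import random

def split_dataset(samples, train_percent=60, test_percent=20, seed=0):
    # Shuffle a copy so the split does not depend on the original ordering.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100
    n_test = len(shuffled) * test_percent // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # Whatever remains (20% with the defaults) is the cross-validation set.
    cross_validation = shuffled[n_train + n_test:]
    return train, test, cross_validation

train, test, cross_validation = split_dataset(range(100))
print(len(train), len(test), len(cross_validation))
```

A usual division of roles: fit on the training set, compare model variants on the cross-validation set, and touch the test set only once for the final estimate.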

Random bits

Accuracy is not the only goal: speed, size, ...

Transfer learning

Machine Learning Canvas

Design for failure

Questions?

Deck by Adrien Couque
