ML Session n°2

Adrien Couque

February 08, 2017

Transcript

  1. Explaining Machine Learning

    Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.
  3. Typical starters

    - Start with a question: “How can we do XXX?” Then find the data.
    - Start with the data: “What can we do with XXX?” Try to find a problem to solve. Poor approach: the data may lack a crucial feature.
  7. Is the problem achievable?

    • What category is the problem? (regression, classification, anomaly detection, clustering, recommendation…)
    • Can a human do it? (with the same amount of data)
    • Are there successful similar projects?
    • Do you have the data for it?
  8. Create your own data

    Date,MorningWeight,YesterdayFactors
    2012-06-10,185.0,
    2012-06-11,182.6,salad sleep bacon cheese tea halfnhalf icecream
    2012-06-12,181.0,sleep egg
    2012-06-13,183.6,mottsfruitsnack:2 pizza:0.5 bread:0.5 date:3 dietsnapple splenda milk nosleep
    2012-06-14,183.6,coffeecandy:2 egg mayo cheese:2 rice meat bread:0.5 peanut:0.4
    2012-06-15,183.4,meat sugarlesscandy salad cherry:4 bread:0 dietsnapple:0.5 egg mayo oliveoil
    2012-06-16,183.6,caprise bread grape:0.2 pasadena sugaryogurt dietsnapple:0.5 peanut:0.4 hotdog
    2012-06-17,182.6,grape meat pistachio:5 peanut:5 cheese sorbet:5 orangejuice:2
    # and so on ...
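Rows in this format can be turned into feature dictionaries with a few lines of Python. This is only a minimal sketch: the function name `parse_row` is hypothetical, and it assumes that a bare factor name means a weight of 1 while `name:x` means a weight of `x` (the slide does not state this).

```python
def parse_row(line):
    """Split one CSV row into (date, morning_weight, factors)."""
    date, weight, factors_text = line.strip().split(",", 2)
    factors = {}
    for token in factors_text.split():
        # Assumption: a bare name counts as 1.0, "name:x" counts as x.
        name, _, value = token.partition(":")
        factors[name] = float(value) if value else 1.0
    return date, float(weight), factors


row = parse_row(
    "2012-06-13,183.6,mottsfruitsnack:2 pizza:0.5 bread:0.5 date:3 "
    "dietsnapple splenda milk nosleep"
)
print(row)
```

Each row then carries a target (the morning weight) and a sparse set of named features, which is the shape most generic learning algorithms expect.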
  9. Create your own data

    ... (output trimmed for brevity) ...
    FeatureName        HashVal  ...  Weight   RelScore
    nosleep            143407   ...  +0.6654    90.29%
    melon              234655   ...  +0.4636    62.91%
    sugarlemonade      203375   ...  +0.3975    53.94%
    trailmix           174671   ...  +0.3362    45.63%
    bread              135055   ...  +0.3345    45.40%
    caramelizedwalnut  148079   ...  +0.3316    44.99%
    bun                1791     ...  +0.3094    41.98%
    ... (trimmed for brevity. Caveat: data is too noisy anyway) ...
    stayhome           148879   ...  -0.2690   -36.50%
    bacon              64431    ...  -0.2998   -40.69%
    egg                197743   ...  -0.3221   -43.70%
    parmesan           3119     ...  -0.3385   -45.94%
    oliveoil           156831   ...  -0.3754   -50.95%
    halfnhalf          171855   ...  -0.4673   -63.41%
    sleep              127071   ...  -0.7369  -100.00%
  12. Feature derivation

    2017-02-08T19:12:18Z
    - Year: 2017
    - Month: February
    - Day of month: 08
    - Hours: 19
    - Minutes: 12
    - Seconds: 18
    - Day of week: Wednesday
    - Week of year: 6
    - Holiday?: no
    - Daylight saving?: no
    - Zone A vacation: no
    - Zone B vacation: no
    - Zone C vacation: yes
    - Weather: Cloudy
    - Strike?: no
    - Pollution: normal
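The calendar part of this derivation can be sketched with the Python standard library alone. The function name `derive_features` and the dictionary keys are hypothetical. The external features on the slide (holidays, school vacation zones, weather, strikes, pollution) need outside data sources and are left out here.

```python
from datetime import datetime, timezone


def derive_features(timestamp):
    """Expand an ISO 8601 UTC timestamp into calendar features."""
    moment = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return {
        "year": moment.year,
        "month": moment.strftime("%B"),
        "day_of_month": moment.day,
        "day_of_week": moment.strftime("%A"),
        # ISO 8601 week number (weeks start on Monday).
        "week_of_year": moment.isocalendar()[1],
        "hours": moment.hour,
        "minutes": moment.minute,
        "seconds": moment.second,
    }


features = derive_features("2017-02-08T19:12:18Z")
print(features)
```

A single timestamp column becomes eight columns, and a model can now pick up weekly or hourly patterns that are invisible in the raw string.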
  13. Generating additional data

    You can create additional data from existing data points. For images, you can:
    - grayscale
    - rotate
    - saturate/desaturate
    - distort slightly
    - crop slightly
    your existing pictures
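Three of these transformations can be sketched in pure Python. This is an illustration under simplifying assumptions, not a production pipeline: an image is represented as a list of rows of `(r, g, b)` tuples, rotation is limited to 90 degrees, and all function names are hypothetical. A real project would more likely use an image library.

```python
def to_grayscale(image):
    """Replace each (r, g, b) pixel by its average intensity on all channels."""
    return [[((r + g + b) // 3,) * 3 for (r, g, b) in row] for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def crop(image, margin):
    """Remove `margin` pixels from every side."""
    return [
        row[margin:len(row) - margin]
        for row in image[margin:len(image) - margin]
    ]


def augment(image):
    """Return the original picture plus three derived variants."""
    return [image, to_grayscale(image), rotate_90(image), crop(image, 1)]


# A tiny 3x3 picture with arbitrary pixel values, for illustration only.
picture = [[(10 * x, 20 * y, 30) for x in range(3)] for y in range(3)]
variants = augment(picture)
print(len(variants))
```

Every variant keeps the label of the original picture, so one labeled example becomes four.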
  15. Machine Learning: definition

    Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.
    But algo + data is not enough: there is a third piece
  16. Cost function

    Describes
    - a score to maximize
    - an error to minimize
    Used to compare two different models (which one is better?)
    This is where you specify what’s important for you
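As a concrete case of "an error to minimize", here is a minimal sketch that compares two models with mean squared error. The choice of mean squared error is an assumption made for illustration (the slide does not name a specific cost function), and the target and prediction values are invented.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Illustrative values only: true targets and the outputs of two models.
targets = [3.0, 5.0, 7.0]
model_a = [2.5, 5.0, 7.5]
model_b = [3.0, 4.0, 9.0]

error_a = mean_squared_error(model_a, targets)
error_b = mean_squared_error(model_b, targets)

# The model with the lower error is considered better under this cost.
better = "model_a" if error_a < error_b else "model_b"
print(error_a, error_b, better)
```

Squaring penalizes large misses more than small ones. Swapping in absolute error would rank models differently, which is how the cost function encodes what matters to you.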
  20. Imbalanced classification: different objectives

    Blood test
    - trying to find patients with a specific illness (1% of the population)
    - Should detect all potential sick people
    - Most important: low false negatives
    Fraud detection
    - trying to find fraudulent transactions (1% of all transactions)
    - But limited team to investigate
    - Most important: low false positives
    Use anomaly detection, not classification
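The two objectives correspond to two standard metrics: few false negatives means high recall, few false positives means high precision. The sketch below uses invented data with a 1% positive rate and hypothetical function names, and shows why plain accuracy misleads on imbalanced classes.

```python
def confusion_counts(labels, predictions):
    """Return (true_pos, false_pos, false_neg, true_neg) for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def recall(tp, fn):
    """Share of real positives that were found (hurt by false negatives)."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    """Share of flagged cases that were real (hurt by false positives)."""
    return tp / (tp + fp) if tp + fp else 0.0


# Illustrative data: 10 positives among 1000 samples (1%).
labels = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a "model" that never flags anything

tp, fp, fn, tn = confusion_counts(labels, always_negative)
accuracy = (tp + tn) / len(labels)
print(accuracy, recall(tp, fn), precision(tp, fp))
```

The do-nothing model reaches 99% accuracy while finding no positive case at all, so the metric to optimize has to be chosen from the objective: recall for the blood test, precision for the fraud team.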
  21. Imbalanced classification: different objectives

    Blood test
    - trying to find patients with a specific illness (1% of the population)
    - Should detect all potential sick people
    - Most important: low false negatives
    HR at Google
    - want to find the best candidates among all applications (1% of all applications)
    - Never recruit a candidate below a certain threshold
    - Most important: low false positives
  22. Fit

  23. Solving under-/over-fit

    - Training set: 60% of data
    - Test set: 20% of data
    - Cross-validation set: 20% of data
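A 60/20/20 split like this one can be sketched with the standard library. The function name `split_dataset` and the fixed seed are assumptions made for illustration. Shuffling before cutting avoids a split that follows the original order of the data.

```python
import random


def split_dataset(rows, seed=0):
    """Shuffle rows, then cut them into 60% / 20% / 20% parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    train_end = int(len(shuffled) * 0.6)
    validation_end = int(len(shuffled) * 0.8)
    training_set = shuffled[:train_end]
    validation_set = shuffled[train_end:validation_end]
    test_set = shuffled[validation_end:]
    return training_set, validation_set, test_set


training_set, validation_set, test_set = split_dataset(range(100))
print(len(training_set), len(validation_set), len(test_set))
```

In the usual workflow the model is fitted on the training set, settings are tuned against the cross-validation set, and the test set is kept aside for a final estimate on data the model has never seen.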