ML Session n°2

Adrien Couque

February 08, 2017

Transcript

  1. ML: asking the right questions February 2017

  2. Quick recap

  3. Hierarchy

  4. Explaining Machine Learning Machine learning is the idea that there

    are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.
  5. Linear regression

  6. Defining the problem

  7. Typical starters

    • Start with a question (“How can we do XXX?”), then find the data
    • Start with the data (“What can we do with XXX?”), try to find a problem to solve
  8. Typical starters

    • Start with a question (“How can we do XXX?”), then find the data
    • Start with the data (“What can we do with XXX?”), try to find a problem to solve. Poor approach: the data may lack a crucial feature
  9. Is the problem achievable? • What category is the problem?

    (regression, classification, anomaly detection, clustering, recommendation…)
  10. Is the problem achievable? • What category is the problem?

    (regression, classification, anomaly detection, clustering, recommendation…) • Can a human do it? (with the same amount of data)
  11. Moravec’s paradox: tasks that seem “easy” for humans can be extremely hard for machines

  12. Is the problem achievable? • What category is the problem?

    (regression, classification, anomaly detection, clustering, recommendation…) • Can a human do it? (with the same amount of data) • Are there successful similar projects?
  13. Is the problem achievable? • What category is the problem?

    (regression, classification, anomaly detection, clustering, recommendation…) • Can a human do it? (with the same amount of data) • Are there successful similar projects? • Do you have the data for it?
  14. Finding data

    • Open data • Public datasets • Competitions (Kaggle) • Mechanical Turk • Through APIs
  15. Create your own data

    Date,MorningWeight,YesterdayFactors
    2012-06-10,185.0,
    2012-06-11,182.6,salad sleep bacon cheese tea halfnhalf icecream
    2012-06-12,181.0,sleep egg
    2012-06-13,183.6,mottsfruitsnack:2 pizza:0.5 bread:0.5 date:3 dietsnapple splenda milk nosleep
    2012-06-14,183.6,coffeecandy:2 egg mayo cheese:2 rice meat bread:0.5 peanut:0.4
    2012-06-15,183.4,meat sugarlesscandy salad cherry:4 bread:0 dietsnapple:0.5 egg mayo oliveoil
    2012-06-16,183.6,caprise bread grape:0.2 pasadena sugaryogurt dietsnapple:0.5 peanut:0.4 hotdog
    2012-06-17,182.6,grape meat pistachio:5 peanut:5 cheese sorbet:5 orangejuice:2
    # and so on ...
  16. Create your own data

    ... (output trimmed for brevity) ...
    FeatureName        HashVal  ...  Weight    RelScore
    nosleep            143407   ...  +0.6654    90.29%
    melon              234655   ...  +0.4636    62.91%
    sugarlemonade      203375   ...  +0.3975    53.94%
    trailmix           174671   ...  +0.3362    45.63%
    bread              135055   ...  +0.3345    45.40%
    caramelizedwalnut  148079   ...  +0.3316    44.99%
    bun                1791     ...  +0.3094    41.98%
    ... (trimmed for brevity. Caveat: data is too noisy anyway) ...
    stayhome           148879   ...  -0.2690   -36.50%
    bacon              64431    ...  -0.2998   -40.69%
    egg                197743   ...  -0.3221   -43.70%
    parmesan           3119     ...  -0.3385   -45.94%
    oliveoil           156831   ...  -0.3754   -50.95%
    halfnhalf          171855   ...  -0.4673   -63.41%
    sleep              127071   ...  -0.7369  -100.00%
  17. Challenges of data Data preparation accounts for about 80% of

    the work of data scientists
  18. Feature derivation

    2017-02-08T19:12:18Z
    Year: 2017
    Month: February
    Day of month: 08
    Hours: 19
    Minutes: 12
    Seconds: 18
  19. Feature derivation

    2017-02-08T19:12:18Z
    Year: 2017            Day of week: Wednesday
    Month: February       Week of year: 6
    Day of month: 08      Holiday?: no
    Hours: 19             Daylight saving?: no
    Minutes: 12
    Seconds: 18
  20. Feature derivation

    2017-02-08T19:12:18Z
    Year: 2017            Day of week: Wednesday       Zone A vacation: no
    Month: February       Week of year: 6              Zone B vacation: no
    Day of month: 08      Holiday?: no                 Zone C vacation: yes
    Hours: 19             Daylight saving?: no         Weather: Cloudy
    Minutes: 12           Strike?: no
    Seconds: 18           Pollution: normal
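The calendar part of this derivation is mechanical; a minimal Python sketch (external signals like holidays, vacation zones, or weather would need their own lookup tables, so they are left out here):

```python
from datetime import datetime, timezone

def derive_time_features(ts):
    """Expand an ISO-8601 timestamp into calendar features."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        "year": dt.year,
        "month": dt.strftime("%B"),        # e.g. "February"
        "day_of_month": dt.day,
        "day_of_week": dt.strftime("%A"),  # e.g. "Wednesday"
        "week_of_year": dt.isocalendar()[1],
        "hours": dt.hour,
        "minutes": dt.minute,
        "seconds": dt.second,
    }

features = derive_time_features("2017-02-08T19:12:18Z")
```

Flags like “Holiday?” or “Zone C vacation” would then be joined in from external tables keyed on the date.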
  21. Generating additional data

    You can create additional data from existing data points. For images, you can grayscale, rotate, saturate/desaturate, slightly distort, or slightly crop your existing pictures.
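A minimal numpy sketch of such augmentations, using a 90° rotation and a fixed 10% centre crop as stand-ins for the small random transforms you would use in practice:

```python
import numpy as np

def augment(image):
    """Return simple variants of an H x W x 3 image array."""
    h, w, _ = image.shape
    variants = []
    # grayscale: weighted channel average, broadcast back to 3 channels
    gray = image @ np.array([0.299, 0.587, 0.114])
    variants.append(np.repeat(gray[..., None], 3, axis=2))
    # rotate (here by 90 degrees; small random angles are more typical)
    variants.append(np.rot90(image))
    # desaturate: blend halfway toward the grayscale version
    variants.append(0.5 * image + 0.5 * np.repeat(gray[..., None], 3, axis=2))
    # crop slightly: drop a 10% border, keeping the centre
    dh, dw = h // 10, w // 10
    variants.append(image[dh:h - dh, dw:w - dw])
    return variants

outs = augment(np.random.rand(20, 30, 3))
```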
  22. Objective function

  23. Machine Learning: definition Machine learning is the idea that there

    are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.
  24. Machine Learning: definition Machine learning is the idea that there

    are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data. But algorithm + data is not enough: there is a third piece
  25. Linear regression: least squares
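Least squares picks the line that minimises the sum of squared residuals. A minimal numpy sketch (the data values here are made up for illustration):

```python
import numpy as np

# Fit y = a*x + b by ordinary least squares.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])  # roughly y = 2x + 1 with noise

A = np.column_stack([x, np.ones_like(x)])        # design matrix [x | 1]
(a, b), residuals, _, _ = np.linalg.lstsq(A, y, rcond=None)
```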

  26. Cost function

    Describes either a score to maximize or an error to minimize. Used to compare two different models (which one is better?). This is where you specify what’s important to you.
  27. Cost function

    Describes either a score to maximize or an error to minimize. Used to compare two different models (which one is better?). This is where you specify what’s important to you.
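In code, a cost function is just a number you can compare across models. A minimal sketch using mean squared error (the least-squares cost) on made-up predictions from two hypothetical models:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the cost to minimise."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

y_true = [3.0, 5.0, 7.0]
model_a = [2.5, 5.5, 7.0]  # errors: 0.5, -0.5, 0.0
model_b = [1.0, 5.0, 9.0]  # errors: 2.0, 0.0, -2.0

# The cost function decides: model A wins under MSE.
better = "A" if mse(y_true, model_a) < mse(y_true, model_b) else "B"
```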
  28. Imbalanced data

    Population with disproportionate classes (99-1). A model that always predicts negative? 99% accuracy!
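The slide’s point in a few lines of Python:

```python
# 99 negatives, 1 positive: a 99-1 imbalanced population.
labels = [0] * 99 + [1]
predictions = [0] * 100  # a useless model that always answers "negative"

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
# accuracy is 0.99, yet the model never finds the one positive case
```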
  29. Imbalanced classification: different objectives

    Blood test: trying to find patients with a specific illness (1% of the population). Should detect all potential sick people. Most important: low false negatives.
    Fraud detection: trying to find fraudulent transactions (1% of all transactions). But limited team to investigate. Most important: low false positives.
  30. Imbalanced classification: different objectives

    Blood test: trying to find patients with a specific illness (1% of the population). Should detect all potential sick people. Most important: low false negatives.
    Fraud detection: trying to find fraudulent transactions (1% of all transactions). But limited team to investigate. Most important: low false positives.
  31. Imbalanced classification: different objectives

    Blood test: trying to find patients with a specific illness (1% of the population). Should detect all potential sick people. Most important: low false negatives.
    Fraud detection: trying to find fraudulent transactions (1% of all transactions). But limited team to investigate. Most important: low false positives.
    Use anomaly detection, not classification.
  32. Imbalanced classification: different objectives

    Blood test: trying to find patients with a specific illness (1% of the population). Should detect all potential sick people. Most important: low false negatives.
    HR at Google: want to find the best candidates among all applications (1% of all applications). Never recruit a candidate below a certain threshold. Most important: low false positives.
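These two objectives map onto recall (penalises false negatives, the blood-test goal) and precision (penalises false positives, the fraud/HR goal). A minimal sketch on a hypothetical label/prediction pair:

```python
def confusion(y_true, y_pred):
    """Count true positives, false positives, false negatives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def recall(y_true, y_pred):     # blood test: want this near 1
    tp, fp, fn = confusion(y_true, y_pred)
    return tp / (tp + fn)

def precision(y_true, y_pred):  # fraud / HR: want this near 1
    tp, fp, fn = confusion(y_true, y_pred)
    return tp / (tp + fp)

y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]  # one miss (FN), one false alarm (FP)
```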
  33. Fit

  34. None
  35. Overfitting: regression
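A toy illustration of regression overfitting (the data, seed, and polynomial degrees here are made up for the example): a high-degree polynomial can pass through every noisy training point yet predict poorly outside the training range.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = 2 * x + 1 + rng.normal(0, 0.1, size=10)  # truly linear data plus noise

linear = np.polyfit(x, y, 1)    # captures the underlying trend
overfit = np.polyfit(x, y, 9)   # degree 9: threads through all 10 points

# Compare predictions at x = 1.5, outside the training range,
# where the true value is 2 * 1.5 + 1 = 4.
err_linear = abs(np.polyval(linear, 1.5) - 4.0)
err_overfit = abs(np.polyval(overfit, 1.5) - 4.0)
```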

  36. Overfitting: classification

  37. Solving under-/over-fit

    Training set: 80% of data. Test set: 20% of data.
  38. Overfit diagnosis

  39. Solving under-/over-fit

    Training set: 60% of data. Cross-validation set: 20% of data. Test set: 20% of data.
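The 60/20/20 split can be sketched in a few lines of numpy (shuffling first so the three sets are drawn from the same distribution):

```python
import numpy as np

def split(data, train=0.6, cv=0.2, seed=0):
    """Shuffle, then carve the data into train / cross-validation / test."""
    data = np.asarray(data)
    idx = np.random.default_rng(seed).permutation(len(data))
    n_train = int(train * len(data))
    n_cv = int(cv * len(data))
    return (data[idx[:n_train]],
            data[idx[n_train:n_train + n_cv]],
            data[idx[n_train + n_cv:]])

train_set, cv_set, test_set = split(range(100))
```

Fit on the training set, pick the model (or hyperparameters) on the cross-validation set, and report final performance on the untouched test set.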
  40. Random bits

  41. Accuracy is not the only goal: speed, size, ...

  42. Transfer learning

  43. Transfer learning

  44. Machine Learning Canvas

  45. Machine Learning Canvas

  46. Design for failure

  47. Questions? February 2017