Lab to Factory: Robust Machine Learning Systems

Mark Hibberd
December 02, 2016

Data-driven systems and machine learning continue to be a significant trend across our industry. However, most attempts at these systems face serious difficulties due to the tension between the clean, controlled lab environments where statisticians apply their skills and the messy, unpredictable production environments where we want to apply their results at scale.

In this talk, we will provide an overview of the machine learning landscape, with an emphasis on the distinction between machine learning as a scientific practice and the larger concept of machine learning systems. Using this base, we will walk through the challenges of taking machine learning out of the lab and applying it successfully in an industrial setting.

By the conclusion of this talk, the audience should take away a better understanding of machine learning as a practice, together with an idea of what it takes to build and deploy machine-learning systems in an environment that deals with real customers and data at scale.

Mark Hibberd spends his time working on large-scale data and machine learning problems for Ambiata. Mark takes software development seriously. Valuing correctness and reliability, he is constantly looking to learn tools and techniques to support these goals.

This approach has led to a history of building teams that utilise purely-functional programming techniques to help deliver robust products.

Ben is the CTO of Ambiata and once upon a time wrote code, built clusters and even compiled Haskell for GPUs.

Nowadays he spends his time figuring out the best way to combine the people and skills from both Ambiata’s software engineering and data science teams in order to build machine learning systems that have the best chance of working in the real world.

Ben likes to focus on the real world because it means focusing on delivering robust products to customers that solve problems they actually have.

Transcript

  1. 2.
  2. 4.
  3. 5.
  4. 6.
  5. 7.
  6. 9.
  7. 10.
  8. 19.
  9. 21.
  10. 23.
  11. 24.
  12. 25.
  13. 26.
  14. 27.
  15. 38.

    as a programmer the probabilities are higher than ever of working on a machine learning project
  16. 39.

    What do you do if you get pulled into a project with a Machine Learning spin on it?
  17. 43.

    “The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.”
    Tom Mitchell, Machine Learning
  18. 44.
  19. 46.
  20. 47.

    The I Just Read “The Lean Start-Up” Solution

        def spell_check(word)
          # Look the word up in a static word list.
          dictionary = Dictionary.load(file: "dictionary.yaml")
          if dictionary.has_value?(word)
            { correct: true }
          else
            { correct: false, suggestions: ["Use a dictionary ;)"] }
          end
        end
  21. 48.

    The I Just Read “TAOCP” Solution

        /* Helper functions and the Dictionary type are declared elsewhere
           (not shown on the slide). Returns 0 on success, non-zero on error. */
        int spell_check(Dictionary *dictionary, const char *word,
                        char ***suggestions) {
            char **ngrams, **distanced;
            int err;

            /* Candidate words within a bounded Levenshtein distance. */
            err = generate_within_levenshtein_distance(word, &distanced);
            if (err != 0) return err;

            err = generate_ngrams(word, &ngrams);
            if (err != 0) return err;

            /* Matches are written through the suggestions out-parameter. */
            err = matching(dictionary, ngrams, distanced, suggestions);
            if (err != 0) return err;

            return 0;
        }
  22. 49.
  23. 50.
  24. 52.

    “Look at how Google does spell checking: it's not based on dictionaries; it's based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn’t.”
    Joel Spolsky, Joel on Software / 2005-10-17
  25. 53.
  26. 54.
  27. 55.
  28. 56.
  29. 57.
  30. 58.
  31. 59.

    Lots of words that are correctly spelled in a valid context.
  32. 60.
  33. 61.
  34. 75.
  35. 76.
  36. 77.
  37. 78.
  38. 80.
  39. 81.
  40. 82.
  41. 84.

        2016-06-28,1416856,N,ES,H, 21,2015-07-25,0, 11, 1,,1.0,A,S,N,,KHQ,N,1, 6,"BADAJOZ",1, 38937.48,03 - UNIVERSITARIO
        2016-06-28,1202981,N,ES,H, 23,2013-10-18,0, 32, 1,,1.0,I,S,N,,KHE,N,1,29,"MALAGA",0, 56409.06,03 - UNIVERSITARIO
        2016-06-28, 137134,N,ES,V, 51,1999-06-30,0, 204, 1,,1.0,A,S,S,,KAT,N,1,28,"MADRID",1, 443237.88,02 - PARTICULARES
        2016-06-28,1256662,N,ES,V, 32,2014-05-06,0, 25, 1,,1.0,A,S,N,,KFC,N,1, 2,"ALBACETE",1, 69776.79,03 - UNIVERSITARIO
        2016-06-28, 833024,N,ES,V, 36,2009-02-08,0, 88, 1,,1.0,I,S,N,,KFC,N,1,24,"LEON",0, 80136.27,02 - PARTICULARES
        2016-06-28, 198396,N,ES,V, 44,2000-10-13,0, 188, 1,,1.0,A,S,N,,KFC,N,1,28,"MADRID",1, 451931.22,02 - PARTICULARES
        2016-06-28,1055228,N,ES,H, 43,2012-08-31,0, 45, 1,,1,A,S,N,,KFC,N,1,11,"CADIZ",1, 57271.83,02 - PARTICULARES
        2016-06-28,1453594,N,ES,H, 21,2015-09-17,0, 9, 1,,1.0,I,S,N,,KHQ,N,1,15,"CORUÑA, A",0, NA,03 - UNIVERSITARIO
        2016-06-28,1114959,N,ES,V, 48,2012-12-28,0, 42, 1,,1,A,S,N,,KFC,N,1, 6,"BADAJOZ",1, 164920.32,02 - PARTICULARES
        2016-06-28, 193664,N,ES,H, 90,2000-10-09,0, 189, 1,,1.0,I,S,N,,KAT,N,1,50,"ZARAGOZA",0, 63982.68,02 - PARTICULARES
        2016-06-28,1461846,N,ES,H, 22,2015-09-25,0, 9, 1,,1,I,S,N,,KHQ,N,1, 6,"BADAJOZ",0, NA,03 - UNIVERSITARIO
        2016-06-28, 281786,N,ES,V, 84,2001-10-13,0, 176, 1,,1,I,S,N,,KAT,N,1,41,"SEVILLA",0, 204135.63,02 - PARTICULARES
        2016-06-28, 931057,N,ES,V, 25,2011-08-09,0, 58, 1,,1.0,I,S,N,,KHE,N,1, 8,"BARCELONA",0, 71185.62,03 - UNIVERSITARIO
        2016-06-28, 380119,N,ES,H, 66,2002-09-02,0, 166, 1,,1.0,A,S,N,,KFC,N,1,46,"VALENCIA",1, 34973.19,02 - PARTICULARES
        2016-06-28, 509236,N,ES,V, 39,2004-12-30,0, 138, 1,,1,A,S,N,,KFC,N,1,41,"SEVILLA",1, 86109.66,02 - PARTICULARES
        2016-06-28, 755342,N,ES,V, 51,2008-03-24,0, 99, 1,,1.0,A,S,N,,KAT,N,1, 8,"BARCELONA",1, 29992.74,02 - PARTICULARES
        2016-06-28, 678258,N,ES,H, 38,2007-02-20,0, 112, 1,,1.0,I,S,N,,KAT,N,1,28,"MADRID",0, 133180.17,02 - PARTICULARES
        2016-06-28, 103307,N,ES,V, 44,1998-08-04,0, 215, 1,,1.0,I,S,S,,KAT,N,1,28,"MADRID",1, 76519.59,02 - PARTICULARES
        2016-06-28,1308331,N,ES,H, 22,2014-09-16,0, 21, 1,,1,I,S,N,,KHE,N,1,36,"PONTEVEDRA",0, 134962.29,03 - UNIVERSITARIO
        2016-06-28,1006357,N,ES,V, 32,2012-02-27,0, 52, 1,,1,A,S,N,,KFA,N,1,28,"MADRID",1, 65619.90,03 - UNIVERSITARIO
        2016-06-28, 124854,N,ES,V, 45,1999-03-10,0, 207, 1,,1,I,S,N,,KAT,N,1, 8,"BARCELONA",0, NA,02 - PARTICULARES
        2016-06-28, 757178,N,ES,V, 59,2008-03-31,0, 99, 1,,1.0,A,S,N,,KFC,N,1,28,"MADRID",1, 109184.13,02 - PARTICULARES
        2016-06-28, 759426,N,ES,H, 68,2008-04-09,0, 98, 1,,1,I,S,N,,KFA,N,1,28,"MADRID",0, 210710.49,02 - PARTICULARES
        2016-06-28,1193227,N,ES,H, 33,2013-10-09,0, 32, 1,,1.0,I,S,N,,KHE,N,1,50,"ZARAGOZA",0, 42343.29,03 - UNIVERSITARIO
        2016-06-28,1192797,N,ES,V, 25,2013-10-09,0, 32, 1,,1,I,S,N,,KHE,N,1,33,"ASTURIAS",0, 176043.90,03 - UNIVERSITARIO
        2016-06-28,1085653,N,ES,V, 33,2012-10-22,0, 44, 1,,1.0,I,S,N,,KHE,N,1, 8,"BARCELONA",0, 128796.93,03 - UNIVERSITARIO
        2016-06-28,1486100,N,ES,H, 22,2015-10-21,0, 8, 1,,1.0,I,S,N,,KHQ,N,1,28,"MADRID",0, NA,03 - UNIVERSITARIO
        2016-06-28, 31025,N,ES,V, 61,1996-01-12,0, 245, 1,,1.0,A,S,N,,KAT,N,1,28,"MADRID",1, 140976.18,01 - TOP
        2016-06-28,1471619,N,ES,H, 22,2015-10-07,0, 8, 1,,1.0,I,S,N,,KHQ,N,1,19,"GUADALAJARA",0, NA,03 - UNIVERSITARIO
  42. 85.

    - fecha_dato: The table is partitioned for this column
    - ncodpers: Customer code
    - ind_empleado: Employee index: A active, B ex employed, F filial, N not employee, P passive
    - pais_residencia: Customer's country of residence
    - sexo: Customer's sex
    - age: Age
    - fecha_alta: The date on which the customer became the first holder of a contract in the bank
    - ind_nuevo: New customer index. 1 if the customer registered in the last 6 months.
    - antiguedad: Customer seniority (in months)
    - indrel: 1 (First/Primary), 99 (Primary customer during the month but not at the end of the month)
    - ult_fec_cli_1t: Last date as primary customer (if he isn't at the end of the month)
    - indrel_1mes: Customer type at the beginning of the month: 1 (First/Primary customer), 2 (co-owner), P (Potential), 3 (former primary), 4 (former co-owner)
    - tiprel_1mes: Customer relation type at the beginning of the month: A (active), I (inactive), P (former customer), R (Potential)
    - indresi: Residence index (S (Yes) or N (No) if the residence country is the same as the bank country)
    - indext: Foreigner index (S (Yes) or N (No) if the customer's birth country is different from the bank country)
    - conyuemp: Spouse index. 1 if the customer is the spouse of an employee
    - canal_entrada: Channel used by the customer to join
    - indfall: Deceased index. N/S
    - tipodom: Address type. 1, primary address
    - cod_prov: Province code (customer's address)
    - nomprov: Province name
    - ind_actividad_cliente: Activity index (1, active customer; 0, inactive customer)
  43. 86.
  44. 87.

    Something showing what would be required to “execute” on predictions, i.e. plumbing in to decisioning systems, etc.
  45. 88.
  46. 89.
  47. 90.

    “We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”
    Xavier Amatriain and Justin Basilico, Personalisation Science and Engineering at Netflix
  48. 92.
  49. 93.
  50. 94.
  51. 97.

    - good results
      1. more/better data
      2. better algorithms
    -> it's a business decision as to which one to focus on
    -> which has the higher ROI
  52. 99.

    “A wide-spread and uncomfortable trend has emerged: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.”
    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison, Hidden Technical Debt in Machine Learning Systems
  53. 100.
  54. 102.
  55. 103.
  56. 104.
  57. 105.
  58. 106.
  59. 107.
  60. 110.
  61. 112.

    “Traditional abstractions and boundaries may be subtly corrupted or invalidated by the fact that data influences ML system behavior”
    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison, Hidden Technical Debt in Machine Learning Systems
  62. 113.

    “… Indeed, ML is required in exactly those cases when the desired behavior cannot be effectively expressed in software logic without dependency on external data. The real world does not fit into tidy encapsulation.”
    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison, Hidden Technical Debt in Machine Learning Systems
  63. 121.
  64. 122.

    “The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.”
    Tom Mitchell, Machine Learning
  65. 129.
  66. 136.
  67. 137.
  68. 142.
  69. 143.
  70. 144.
  71. 145.
  72. 147.
  73. 148.
  74. 149.
  75. 150.
  76. 153.
  77. 155.

    “At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”
    Pedro Domingos, A Few Useful Things to Know about Machine Learning
  78. 156.
  79. 157.
  80. 158.
  81. 161.
  82. 164.
  83. 171.
  84. 172.
  85. 174.
  86. 175.
  87. 176.
  88. 177.
  89. 178.
  90. 187.
  91. 188.
  92. 194.
  93. 195.
  94. 199.
  95. 203.
  96. 209.
  97. 215.
  98. 219.
  99. 223.

    “CIOs are in trouble right now… We’ve seen exponential growth in data. If I drop data on the floor and lose it, I am a bad CIO but if my budget grows exponentially to handle it, I am also a bad CIO.”
    Stephen Brobst, CTO at Teradata
  100. 224.
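
The contrast between the dictionary-lookup slides (47 and 48) and the Spolsky quote on slide 52 can be made concrete with a small sketch. The following is not Google's method and is not taken from the talk; it is a minimal illustration, assuming a toy corpus (the `CORPUS` string is made up) and the common trick of ranking single-edit candidates by how often they occur in observed text. All function names are illustrative.

```python
import re
from collections import Counter

# Toy corpus standing in for "word usage statistics"; purely illustrative.
CORPUS = """
the spelling of a word is checked against how often the word is used
spelling mistakes are common and a checker should suggest the common word
"""


def words(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Usage statistics: how many times each word was seen in the corpus.
WORD_COUNTS = Counter(words(CORPUS))


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts=WORD_COUNTS):
    """Return the most frequently seen word within one edit, else the input."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    # Rank by observed frequency; break ties alphabetically for determinism.
    return max(candidates, key=lambda w: (counts[w], w))
```

With this toy corpus, `correct("speling")` yields `"spelling"` and `correct("wrod")` yields `"word"`. The behaviour depends entirely on the data fed in, which is the point the later Sculley et al. quotes make about data influencing system behaviour.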
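
The raw rows on slide 84 and the field descriptions on slide 85 hint at the cleaning work needed before any modelling. As a rough sketch, assuming the rows are standard comma-separated values, the snippet below parses two of the rows from the slide with Python's `csv` module, strips the padding, and maps empty strings and `NA` to `None`. The two trailing columns are not described on slide 85, so `extra_1` and `extra_2` are placeholder names, not the dataset's real ones.

```python
import csv
import io

# Column names taken from slide 85; the last two are placeholders because
# the slide does not describe the final two columns.
COLUMNS = [
    "fecha_dato", "ncodpers", "ind_empleado", "pais_residencia", "sexo",
    "age", "fecha_alta", "ind_nuevo", "antiguedad", "indrel",
    "ult_fec_cli_1t", "indrel_1mes", "tiprel_1mes", "indresi", "indext",
    "conyuemp", "canal_entrada", "indfall", "tipodom", "cod_prov",
    "nomprov", "ind_actividad_cliente", "extra_1", "extra_2",
]

# Two rows copied from slide 84: one with a quoted comma and an NA value.
SAMPLE = "\n".join([
    '2016-06-28,1453594,N,ES,H, 21,2015-09-17,0, 9, 1,,1.0,I,S,N,,KHQ,N,1,15,"CORUÑA, A",0, NA,03 - UNIVERSITARIO',
    '2016-06-28,1416856,N,ES,H, 21,2015-07-25,0, 11, 1,,1.0,A,S,N,,KHQ,N,1, 6,"BADAJOZ",1, 38937.48,03 - UNIVERSITARIO',
])


def clean(value):
    """Strip padding and treat empty strings and 'NA' as missing."""
    value = value.strip()
    return None if value in ("", "NA") else value


def parse_rows(text):
    """Parse CSV text into dicts keyed by COLUMNS, rejecting ragged rows."""
    rows = []
    for line_no, raw in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(raw) != len(COLUMNS):
            raise ValueError(
                f"line {line_no}: expected {len(COLUMNS)} fields, got {len(raw)}"
            )
        rows.append({name: clean(v) for name, v in zip(COLUMNS, raw)})
    return rows


rows = parse_rows(SAMPLE)
```

Even this small sample shows the kind of mess a production pipeline has to absorb: a province name containing a comma, padded numbers, and a missing value encoded as `NA`. Values are left as strings here; converting them to typed fields would need decisions (for example, how to treat `1` versus `1.0` in `indrel_1mes`) that the slides do not specify.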