Lab to Factory: Robust Machine Learning Systems

Mark Hibberd
December 02, 2016

Data-driven systems and machine learning continue to be a significant trend across our industry. However, most attempts at these systems face serious difficulties due to the tension between the clean, controlled lab environments where statisticians apply their skills, and the messy, unpredictable production environments where we want to apply their results at scale.

In this talk, we will provide an overview of the machine learning landscape, with an emphasis on the distinction between machine learning as a scientific practice and the larger concept of machine learning systems. Using this base, we will walk through the challenges of taking machine learning out of the lab and applying it successfully in an industrial setting.

By the conclusion of this talk, the audience should take away a better understanding of machine learning as a practice, together with an idea of what it takes to build and deploy machine-learning systems in an environment that deals with real customers and data at scale.

Mark Hibberd spends his time working on large-scale data and machine learning problems for Ambiata. Mark takes software development seriously. Valuing correctness and reliability, he is constantly looking to learn tools and techniques to support these goals.

This approach has led to a history of building teams that utilise purely-functional programming techniques to help deliver robust products.

Ben is the CTO of Ambiata and once upon a time wrote code, built clusters and even compiled Haskell for GPUs.

Nowadays he spends his time figuring out the best way to combine the people and skills from both Ambiata’s software engineering and data science teams in order to build machine learning systems that have the best chance of working in the real world.

Ben likes to focus on the real world because it means focusing on delivering robust products to customers that solve problems they actually have.


Transcript

  1. Lab to Factory

  2. None
  3. machine learning is our saviour

  4. None
  5. None
  6. None
  7. None
  8. https://techcrunch.com/2016/11/26/machine-learning-can-fix-twitter-facebook-and-maybe-even-america/

  9. None
  10. None
  11. machine learning is being democratised

  12. machine learning

  13. [Chart, 2006 to 2016: "machine learning"]

  14. http://www.computerworld.com.au/article/601117/machine-learning-new-face-enterprise-data/

  15. [Chart, 2006 to 2016: "machine learning", "data science"]

  16. http://www.kdnuggets.com/2016/01/businesses-need-one-million-data-scientists-2018.html

  17. https://hbr.org/2016/11/hiring-your-first-chief-ai-officer

  18. http://www.burtchworks.com/2015/03/02/4-ways-to-spot-a-fake-data-scientist/

  19. None
  20. https://www.coursera.org/learn/machine-learning

  21. None
  22. http://www.uts.edu.au/future-students/find-a-course/courses/c04293

  23. None
  24. None
  25. None
  26. None
  27. None
  28. [Chart, 2006 to 2016: "machine learning", "data science"]

  29. [Chart, 2006 to 2016: "machine learning", "data science", "big data"]

  30. [Chart, 2006 to 2016: "machine learning", "data science", "big data", "hadoop"]

  31. [Chart, 2006 to 2016: "hadoop"]

  32. [do palm card version]

  33. [do palm card version]

  34. [Chart, 2006 to 2022]

  35. [Chart, 2006 to 2022]

  36. <there is hype>

  37. <a historical example of hype that programmers were exposed to>

  38. as a programmer the probabilities are higher than ever of working on a machine learning project

  39. What do you do if you get pulled in to a project with a Machine Learning spin on it?

  40. How do you approach the technology interface between science and engineering?

  41. How do you approach the people interface between science and engineering?
  42. What is Machine Learning?

  43. “The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” Tom Mitchell, Machine Learning
  44. None
  45. mispelled Can we accurately identify mispelled words?

  46. None
  47. The I Just Read “The Lean Start-Up” Solution

        def spell_check(word)
          # Dictionary is assumed to be defined elsewhere
          dictionary = Dictionary.load(file: "dictionary.yaml")
          if dictionary.has_value?(word)
            { correct: true }
          else
            { correct: false, suggestions: ["Use a dictionary ;)"] }
          end
        end
  48. The I Just Read “TAOCP” Solution

        /* Dictionary and the helper functions are assumed to be defined elsewhere. */
        int spell_check(Dictionary *dictionary, const char *word, char ***suggestions) {
          char **ngrams, **distanced;
          int err;

          /* candidate words within a small edit distance of the input */
          err = generate_within_levenshtein_distance(word, &distanced);
          if (err != 0) return err;

          err = generate_ngrams(word, &ngrams);
          if (err != 0) return err;

          /* results are written through the suggestions out-parameter */
          err = matching(dictionary, ngrams, distanced, suggestions);
          if (err != 0) return err;

          return 0;
        }
  49. None
  50. None
  51. mispelled Can we accurately identify mispelled words?

  52. Look at how Google does spell checking: it's not based on dictionaries; it's based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn’t. Joel Spolsky, Joel on Software, 2005-10-17
  53. None
  54. None
  55. None
  56. None
  57. None
  58. None
  59. Lots of words that are correctly spelled in a valid context.
  60. None
  61. None
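
The data-driven approach described in the Joel Spolsky quote can be sketched in a few lines of Python. This is a minimal illustration, not what Google or the speakers actually run: it assumes a tiny made-up corpus, considers only candidates one edit away, and picks the candidate that appears most often in the corpus. The names `train`, `edits1` and `correct` are hypothetical.

```python
import re
from collections import Counter


def train(corpus: str) -> Counter:
    """Count how often each lowercase word appears in the corpus."""
    return Counter(re.findall(r"[a-z]+", corpus.lower()))


def edits1(word: str) -> set:
    """All strings one delete, transpose, replace or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, counts: Counter) -> str:
    """Return the most frequent known word within one edit, else the word itself."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.__getitem__) if candidates else word


# A made-up corpus standing in for "word usage statistics".
corpus = "a misspelled word is a word that is spelled wrong; misspelled words happen"
counts = train(corpus)
print(correct("mispelled", counts))  # prints: misspelled
```

Nothing in the code knows how to spell; the behaviour comes entirely from the corpus, so a larger or more relevant corpus changes the answers without changing the program.
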
  62. data driven vs code driven

  63. Fixed Algorithm

  64. General Purpose

  65. Can be Simpler

  66. More Experience

  67. Some Problems Intractable

  68. Learning Algorithm

  69. Restricted Domains

  70. Improve with Smarter Algorithms

  71. Improve with More or Better Data

  72. Can Handle Situations Infeasible for Code Driven Approaches

  73. Can Be a Really Expensive Way to Encode an If Statement
  74. Machine Learning Systems

  75. None
  76. None
  77. None
  78. ? ? ? ? ?

  79. 480,189 users 17,770 movies 100,480,507 ratings

  80. None
  81. None
  82. None
  83. https://www.kaggle.com/c/santander-product-recommendation/data

  84. 2016-06-28,1416856,N,ES,H, 21,2015-07-25,0, 11, 1,,1.0,A,S,N,,KHQ,N,1, 6,"BADAJOZ",1, 38937.48,03 - UNIVERSITARIO
      2016-06-28,1202981,N,ES,H, 23,2013-10-18,0, 32, 1,,1.0,I,S,N,,KHE,N,1,29,"MALAGA",0, 56409.06,03 - UNIVERSITARIO
      2016-06-28, 137134,N,ES,V, 51,1999-06-30,0, 204, 1,,1.0,A,S,S,,KAT,N,1,28,"MADRID",1, 443237.88,02 - PARTICULARES
      2016-06-28,1256662,N,ES,V, 32,2014-05-06,0, 25, 1,,1.0,A,S,N,,KFC,N,1, 2,"ALBACETE",1, 69776.79,03 - UNIVERSITARIO
      2016-06-28, 833024,N,ES,V, 36,2009-02-08,0, 88, 1,,1.0,I,S,N,,KFC,N,1,24,"LEON",0, 80136.27,02 - PARTICULARES
      2016-06-28, 198396,N,ES,V, 44,2000-10-13,0, 188, 1,,1.0,A,S,N,,KFC,N,1,28,"MADRID",1, 451931.22,02 - PARTICULARES
      2016-06-28,1055228,N,ES,H, 43,2012-08-31,0, 45, 1,,1,A,S,N,,KFC,N,1,11,"CADIZ",1, 57271.83,02 - PARTICULARES
      2016-06-28,1453594,N,ES,H, 21,2015-09-17,0, 9, 1,,1.0,I,S,N,,KHQ,N,1,15,"CORUÑA, A",0, NA,03 - UNIVERSITARIO
      2016-06-28,1114959,N,ES,V, 48,2012-12-28,0, 42, 1,,1,A,S,N,,KFC,N,1, 6,"BADAJOZ",1, 164920.32,02 - PARTICULARES
      2016-06-28, 193664,N,ES,H, 90,2000-10-09,0, 189, 1,,1.0,I,S,N,,KAT,N,1,50,"ZARAGOZA",0, 63982.68,02 - PARTICULARES
      2016-06-28,1461846,N,ES,H, 22,2015-09-25,0, 9, 1,,1,I,S,N,,KHQ,N,1, 6,"BADAJOZ",0, NA,03 - UNIVERSITARIO
      2016-06-28, 281786,N,ES,V, 84,2001-10-13,0, 176, 1,,1,I,S,N,,KAT,N,1,41,"SEVILLA",0, 204135.63,02 - PARTICULARES
      2016-06-28, 931057,N,ES,V, 25,2011-08-09,0, 58, 1,,1.0,I,S,N,,KHE,N,1, 8,"BARCELONA",0, 71185.62,03 - UNIVERSITARIO
      2016-06-28, 380119,N,ES,H, 66,2002-09-02,0, 166, 1,,1.0,A,S,N,,KFC,N,1,46,"VALENCIA",1, 34973.19,02 - PARTICULARES
      2016-06-28, 509236,N,ES,V, 39,2004-12-30,0, 138, 1,,1,A,S,N,,KFC,N,1,41,"SEVILLA",1, 86109.66,02 - PARTICULARES
      2016-06-28, 755342,N,ES,V, 51,2008-03-24,0, 99, 1,,1.0,A,S,N,,KAT,N,1, 8,"BARCELONA",1, 29992.74,02 - PARTICULARES
      2016-06-28, 678258,N,ES,H, 38,2007-02-20,0, 112, 1,,1.0,I,S,N,,KAT,N,1,28,"MADRID",0, 133180.17,02 - PARTICULARES
      2016-06-28, 103307,N,ES,V, 44,1998-08-04,0, 215, 1,,1.0,I,S,S,,KAT,N,1,28,"MADRID",1, 76519.59,02 - PARTICULARES
      2016-06-28,1308331,N,ES,H, 22,2014-09-16,0, 21, 1,,1,I,S,N,,KHE,N,1,36,"PONTEVEDRA",0, 134962.29,03 - UNIVERSITARIO
      2016-06-28,1006357,N,ES,V, 32,2012-02-27,0, 52, 1,,1,A,S,N,,KFA,N,1,28,"MADRID",1, 65619.90,03 - UNIVERSITARIO
      2016-06-28, 124854,N,ES,V, 45,1999-03-10,0, 207, 1,,1,I,S,N,,KAT,N,1, 8,"BARCELONA",0, NA,02 - PARTICULARES
      2016-06-28, 757178,N,ES,V, 59,2008-03-31,0, 99, 1,,1.0,A,S,N,,KFC,N,1,28,"MADRID",1, 109184.13,02 - PARTICULARES
      2016-06-28, 759426,N,ES,H, 68,2008-04-09,0, 98, 1,,1,I,S,N,,KFA,N,1,28,"MADRID",0, 210710.49,02 - PARTICULARES
      2016-06-28,1193227,N,ES,H, 33,2013-10-09,0, 32, 1,,1.0,I,S,N,,KHE,N,1,50,"ZARAGOZA",0, 42343.29,03 - UNIVERSITARIO
      2016-06-28,1192797,N,ES,V, 25,2013-10-09,0, 32, 1,,1,I,S,N,,KHE,N,1,33,"ASTURIAS",0, 176043.90,03 - UNIVERSITARIO
      2016-06-28,1085653,N,ES,V, 33,2012-10-22,0, 44, 1,,1.0,I,S,N,,KHE,N,1, 8,"BARCELONA",0, 128796.93,03 - UNIVERSITARIO
      2016-06-28,1486100,N,ES,H, 22,2015-10-21,0, 8, 1,,1.0,I,S,N,,KHQ,N,1,28,"MADRID",0, NA,03 - UNIVERSITARIO
      2016-06-28, 31025,N,ES,V, 61,1996-01-12,0, 245, 1,,1.0,A,S,N,,KAT,N,1,28,"MADRID",1, 140976.18,01 - TOP
      2016-06-28,1471619,N,ES,H, 22,2015-10-07,0, 8, 1,,1.0,I,S,N,,KHQ,N,1,19,"GUADALAJARA",0, NA,03 - UNIVERSITARIO
  85. fecha_dato             The table is partitioned for this column
      ncodpers               Customer code
      ind_empleado           Employee index: A active, B ex employed, F filial, N not employee, P pasive
      pais_residencia        Customer's Country residence
      sexo                   Customer's sex
      age                    Age
      fecha_alta             The date in which the customer became as the first holder of a contract in the bank
      ind_nuevo              New customer Index. 1 if the customer registered in the last 6 months.
      antiguedad             Customer seniority (in months)
      indrel                 1 (First/Primary), 99 (Primary customer during the month but not at the end of the month)
      ult_fec_cli_1t         Last date as primary customer (if he isn't at the end of the month)
      indrel_1mes            Customer type at the beginning of the month ,1 (First/Primary customer), 2 (co-owner ),P (Potential),3 (former primary), 4(former co-owner)
      tiprel_1mes            Customer relation type at the beginning of the month, A (active), I (inactive), P (former customer),R (Potential)
      indresi                Residence index (S (Yes) or N (No) if the residence country is the same than the bank country)
      indext                 Foreigner index (S (Yes) or N (No) if the customer's birth country is different than the bank country)
      conyuemp               Spouse index. 1 if the customer is spouse of an employee
      canal_entrada          channel used by the customer to join
      indfall                Deceased index. N/S
      tipodom                Addres type. 1, primary address
      cod_prov               Province code (customer's address)
      nomprov                Province name
      ind_actividad_cliente  Activity index (1, active customer; 0, inactive customer)
  86. Something showing the reality of where all of these features would have needed to be pulled from

  87. Something showing what would be required to “execute” on predictions - i.e. plumbing in to decisioning systems, etc
  88. None
  89. None
  90. “We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.” Xavier Amatriain and Justin Basilico, Personalisation Science and Engineering at Netflix
  91. Something showing what the Netflix architecture looks like

  92. None
  93. None
  94. None
  95. Ambiata - multiple verticals - different data - getting multiple ML systems to production

  96. Receive data every day, batch score models every day, prepare features every day, x N

  97. - good results - 1. more/better data - 2. better algorithms -> it's a business decision as to which one to focus on -> which has the higher ROI
  98. Production

  99. “A wide-spread and uncomfortable trend has emerged: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.” D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison, Hidden Technical Debt in Machine Learning Systems
  100. None
  101. Data Acquisition

  102. None
  103. None
  104. None
  105. None
  106. None
  107. None
  108. data acquisition is non-trivial

  109. the data will be messy

  110. format zoo

  111. most important property of a robust data architecture is to have hard edges

  112. “Traditional abstractions and boundaries may be subtly corrupted or invalidated by the fact that data influences ML system behavior” D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison, Hidden Technical Debt in Machine Learning Systems

  113. “… Indeed, ML is required in exactly those cases when the desired behavior cannot be effectively expressed in software logic without dependency on external data. The real world does not fit into tidy encapsulation.” D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison, Hidden Technical Debt in Machine Learning Systems
  114. format zoo data-platform

  115. format zoo data-platform “the” format

  116. format zoo data-platform optimise for data-size

  117. format zoo data-platform optimise for i/o performance

  118. format zoo data-platform optimise for tooling

  119. format zoo data-platform security / privacy?
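
One reading of "hard edges" (slide 111) is that raw input is parsed into a strict, typed form at the boundary, and anything that does not fit is rejected there rather than leaking into the platform. The sketch below assumes that reading. The four-column layout is a hypothetical simplification loosely based on the sample rows on slide 84, the two malformed rows are invented for illustration, and `Customer`, `parse_row` and `ingest` are made-up names.

```python
import csv
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class Customer:
    snapshot: date
    customer_id: int
    age: int
    income: Optional[float]  # None when the source says "NA"


def parse_row(row: list) -> Customer:
    """Parse one raw row or raise ValueError; nothing half-parsed gets through."""
    if len(row) != 4:
        raise ValueError(f"expected 4 fields, got {len(row)}")
    snapshot = datetime.strptime(row[0].strip(), "%Y-%m-%d").date()
    customer_id = int(row[1])
    age = int(row[2])
    if not 0 <= age <= 130:
        raise ValueError(f"implausible age: {age}")
    raw_income = row[3].strip()
    income = None if raw_income == "NA" else float(raw_income)
    return Customer(snapshot, customer_id, age, income)


def ingest(lines):
    """Split input into accepted records and rejected (row, reason) pairs."""
    accepted, rejected = [], []
    for row in csv.reader(lines):
        try:
            accepted.append(parse_row(row))
        except ValueError as exc:
            rejected.append((row, str(exc)))
    return accepted, rejected


raw = [
    "2016-06-28,1416856, 21, 38937.48",
    "2016-06-28,1453594, 21, NA",
    "28/06/2016,1202981, 23, 56409.06",  # invented: wrong date format
    "2016-06-28,137134,abc,443237.88",   # invented: non-numeric age
]
accepted, rejected = ingest(raw)
print(len(accepted), len(rejected))  # prints: 2 2
```

Keeping the rejects alongside a reason, instead of silently dropping them, is what makes the edge auditable: the count of rejected rows per day is itself a signal worth monitoring.
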

  120. Data Verification

  121. None
  122. “The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” Tom Mitchell, Machine Learning

  123. the margin of benefit on machine learning systems is often very low

  124. poor quality data can negate any benefit a good model may give you

  125. statistics are good at identifying issues in data

  126. statisticians are good at using statistics to identify issues in data
  127. the lab is an optimal environment for this

  128. we can use all the data or statistically equivalent samples

  129. None
  130. time is just trying to mess with us

  131. data will change over time

  132. data changes must be handled as you go

  133. data issues must be fixed as you go

  134. timeliness of data becomes a quality issue

  135. no escape hatch, you can’t start again

  136. None
  137. None
  138. static checks are important

  139. absolute thresholds are meh

  140. proportional thresholds are ok

  141. statistical properties are good
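
The three kinds of check on slides 139 to 141 can be compared side by side on a single metric, here a daily row count. This is a sketch under assumptions: the history, the thresholds (500 rows, 20%, 3 standard deviations) and the function names are all made up for illustration.

```python
from statistics import mean, stdev


def absolute_check(row_count: int, minimum: int) -> bool:
    """Pass if the count is above a fixed floor."""
    return row_count >= minimum


def proportional_check(row_count: int, previous: int, tolerance: float) -> bool:
    """Pass if the count is within a fraction of the previous day's count."""
    return abs(row_count - previous) <= tolerance * previous


def statistical_check(row_count: int, history: list, max_z: float) -> bool:
    """Pass if the count is within max_z standard deviations of the history."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return row_count == mu
    return abs(row_count - mu) / sigma <= max_z


history = [1000, 1020, 990, 1010, 1005, 995, 1015]  # invented daily row counts
today = 900

print(absolute_check(today, 500))                    # prints: True
print(proportional_check(today, history[-1], 0.20))  # prints: True
print(statistical_check(today, history, 3.0))        # prints: False
```

With these numbers only the statistical check fails: a drop to 900 is well above the fixed floor and inside a 20% band, but it is far outside the day-to-day variation the feed has actually shown.
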

  142. anomalies

  143. None
  144. None
  145. anomaly

  146. need to account for seasonal and growth trends
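
A small sketch of why seasonality matters for anomaly detection: each point is compared with other points at the same position in the weekly cycle, using the median and the median absolute deviation. The series, the 7-day period, the threshold and the function name are assumptions for illustration. The sketch handles a fixed seasonal pattern only; it does not remove growth trends, and it looks at the whole series at once, which a production check running day by day could not do.

```python
from statistics import median


def seasonal_anomalies(series, period=7, threshold=5.0):
    """Flag indices whose value is far from the median of the same phase of the cycle."""
    flagged = []
    for i, value in enumerate(series):
        # Peers are the other observations at the same position in the cycle.
        peers = [series[j] for j in range(i % period, len(series), period) if j != i]
        if len(peers) < 3:
            continue
        centre = median(peers)
        mad = median([abs(p - centre) for p in peers])
        scale = mad if mad > 0 else 1.0  # fall back to one unit when peers agree exactly
        if abs(value - centre) / scale > threshold:
            flagged.append(i)
    return flagged


# Four invented weeks: busy weekdays, quiet weekends.
week = [100, 102, 98, 101, 99, 40, 42]
series = week * 4
series[17] = 40  # a weekday that looks like a weekend

print(seasonal_anomalies(series))  # prints: [17]
```

A value of 40 is perfectly normal on a weekend in this series, so a single global threshold would either miss it on a weekday or raise an alarm every weekend.
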

  147. breakouts

  148. None
  149. None
  150. breakout

  151. need to account for seasonal and growth trends

  152. proportional thresholds are ok

  153. None
  154. Feature Engineering

  155. “At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.” Pedro Domingos, A Few Useful Things to Know about Machine Learning
  156. None
  157. None
  158. None
  159. our systems should be linear in data volume not feature count

  160. we want to be able to throw new features into the mix
  161. None
  162. we can’t afford to reprocess historical data
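
One way to avoid reprocessing history is to keep a small running state per customer and fold each day's batch into it, so features are read from the state rather than recomputed from raw events. The sketch below assumes that approach; it is not a description of Ambiata's system, and `RunningStats`, `apply_batch`, `features` and the event layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    def update(self, amount: float) -> None:
        """Fold a single event into the state in constant time."""
        self.count += 1
        self.total += amount
        self.maximum = max(self.maximum, amount)


def apply_batch(state: dict, batch: list) -> dict:
    """Fold one day's (customer_id, amount) events into the running state."""
    for customer_id, amount in batch:
        state.setdefault(customer_id, RunningStats()).update(amount)
    return state


def features(stats: RunningStats) -> dict:
    """Derive several features from the same state without touching raw history."""
    return {
        "purchase_count": stats.count,
        "mean_spend": stats.total / stats.count,
        "max_spend": stats.maximum,
    }


state = {}
apply_batch(state, [("c1", 10.0), ("c2", 5.0)])  # day 1
apply_batch(state, [("c1", 30.0)])               # day 2
print(features(state["c1"]))
# prints: {'purchase_count': 2, 'mean_spend': 20.0, 'max_spend': 30.0}
```

Each event is touched once, so the cost grows with data volume, and any new feature that can be derived from the existing state is free. A feature that needs a statistic the state never tracked still requires a backfill, which is where the design decision about what state to keep matters.
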

  163. Model Training

  164. None
  165. repeatability

  166. repeatability

  167. repeatability

  168. can we retrain models on demand?

  169. can we reproduce results independently?
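
Two ingredients of repeatability can be sketched briefly: removing hidden randomness from training, and recording exactly which data and configuration produced a model. The "model" below is a deliberate stand-in (a mean over a seeded sample), and `fingerprint`, `train` and the config keys are hypothetical names chosen for illustration.

```python
import hashlib
import json
import random


def fingerprint(rows: list, config: dict) -> str:
    """Hash the training data and configuration so a run can be identified exactly."""
    digest = hashlib.sha256()
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def train(rows: list, config: dict) -> dict:
    """A stand-in 'model': the mean target over a seeded random sample."""
    rng = random.Random(config["seed"])  # local generator, no hidden global state
    sample = rng.sample(rows, config["sample_size"])
    return {"mean_target": sum(r["y"] for r in sample) / len(sample)}


rows = [{"x": i, "y": i * 2.0} for i in range(100)]
config = {"seed": 42, "sample_size": 10}

first = train(rows, config)
second = train(rows, config)
print(first == second)  # prints: True
print(fingerprint(rows, config) == fingerprint(rows, {"sample_size": 10, "seed": 42}))  # prints: True
```

Storing the fingerprint next to each trained model gives a cheap answer to "can we reproduce results independently?": a second party with the same data and configuration should arrive at the same fingerprint and the same model.
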

  170. Model Scoring

  171. None
  172. None
  173. Model Deployment

  174. None
  175. None
  176. None
  177. Monitoring

  178. None
  179. alert fatigue is real

  180. actionability of alarms needs to be supported by your architecture

  181. time to verify failures is high

  182. time to recover failures is high

  183. cost to recover failures is high

  184. cost of false negative is high

  185. cost of false positive is high

  186. Results Delivery

  187. None
  188. None
  189. Change Management

  190. we want to be able to do this again

  191. more models

  192. better models

  193. and we don’t want to make a mistake

  194. None
  195. Delivery

  196. delivery anti patterns

  197. anti-pattern: programmers using open source ML software

  198. anti-pattern: data scientists scheduling R scripts

  199. None
  200. not just programmers

  201. not just machine learners

  202. anti-pattern: we can’t say upfront how long it will take to build a good model
  203. None
  204. time boxing

  205. incremental development

  206. regular reviews

  207. Lab, Factory: investigate opportunities, system build, system operate, analyse performance

  208. anti-pattern: our model performs really well

  209. None
  210. know what success is and know how to measure it

  211. more revenue / profit

  212. more clicks

  213. more customers

  214. less time between actions

  215. None
  216. run experiments

  217. know if impacts are due to you
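
Knowing whether an impact is "due to you" usually means holding back a randomly assigned control group and comparing outcomes. A common way to quantify the comparison, assumed here rather than taken from the talk, is a two-proportion z-test on conversion rates. The counts below are invented and `two_proportion_z` is a hypothetical helper name.

```python
from math import erf, sqrt


def two_proportion_z(conv_t: int, n_t: int, conv_c: int, n_c: int):
    """Return (lift, z, two-sided p-value) for treatment vs control conversion rates."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    pooled = (conv_t + conv_c) / (n_t + n_c)
    se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return p_t - p_c, z, 2 * (1 - cdf)


# Invented example: 5,000 customers scored by the model, 5,000 held back as control.
lift, z, p_value = two_proportion_z(conv_t=260, n_t=5000, conv_c=200, n_c=5000)
print(f"lift={lift:.3f} z={z:.2f} p={p_value:.4f}")
```

The test only says how surprising the difference would be if the model had no effect. It is the random assignment to treatment and control, not the arithmetic, that lets the difference be attributed to the model rather than to seasonality or a concurrent marketing campaign.
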

  218. anti-pattern: google did it

  219. None
  220. latest algorithms aren’t always the answer

  221. more/better data isn’t always the answer

  222. an informed ROI discussion is the answer

  223. “CIOs are in trouble right now… We’ve seen exponential growth in data. If I drop data on the floor and lose it, I am a bad CIO but if my budget grows exponentially to handle it, I am also a bad CIO.” Stephen Brobst, CTO at Teradata
  224. None
  225. Lab to Factory