Lab to Factory: Robust Machine Learning Systems

Mark Hibberd
December 02, 2016

Data-driven systems and machine learning continue to be a significant trend across our industry. However, most attempts at these systems face serious difficulties due to the tension between the clean, controlled lab environments where statisticians apply their skills and the messy, unpredictable production environments where we want to apply their results at scale.

In this talk, we will provide an overview of the machine learning landscape, with an emphasis on the distinction between machine learning as a scientific practice and the larger concept of machine learning systems. Using this base, we will walk through the challenges of taking machine learning out of the lab and applying it successfully in an industrial setting.

By the conclusion of this talk, the audience should take away a better understanding of machine learning as a practice, together with an idea of what it takes to build and deploy machine-learning systems in an environment that deals with real customers and data at scale.

Mark Hibberd spends his time working on large-scale data and machine learning problems for Ambiata. Mark takes software development seriously. Valuing correctness and reliability, he is constantly looking to learn tools and techniques to support these goals.

This approach has led to a history of building teams that utilise purely-functional programming techniques to help deliver robust products.

Ben is the CTO of Ambiata and once upon a time wrote code, built clusters and even compiled Haskell for GPUs.

Nowadays he spends his time figuring out the best way to combine the people and skills from both Ambiata’s software engineering and data science teams in order to build machine learning systems that have the best chance of working in the real world.

Ben likes to focus on the real world because it means focusing on delivering robust products to customers that solve problems they actually have.

Transcript

  1. Lab to Factory

    View Slide

  2. View Slide

  3. machine learning is our saviour

    View Slide

  4. View Slide

  5. View Slide

  6. View Slide

  7. View Slide

  8. https://techcrunch.com/2016/11/26/machine-learning-can-fix-twitter-facebook-and-maybe-even-america/

    View Slide

  9. View Slide

  10. View Slide

  11. machine learning is being democratised

    View Slide

  12. machine learning

    View Slide

  13. 2006 2011 2016
    "machine learning"

    View Slide

  14. http://www.computerworld.com.au/article/601117/machine-learning-new-face-enterprise-data/

    View Slide

  15. 2006 2011 2016
    "machine learning" "data science"

    View Slide

  16. http://www.kdnuggets.com/2016/01/businesses-need-one-million-data-scientists-2018.html

    View Slide

  17. https://hbr.org/2016/11/hiring-your-first-chief-ai-officer

    View Slide

  18. http://www.burtchworks.com/2015/03/02/4-ways-to-spot-a-fake-data-scientist/

    View Slide

  19. View Slide

  20. https://www.coursera.org/learn/machine-learning

    View Slide

  21. View Slide

  22. http://www.uts.edu.au/future-students/find-a-course/courses/c04293

    View Slide

  23. View Slide

  24. View Slide

  25. View Slide

  26. View Slide

  27. View Slide

  28. 2006 2011 2016
    "machine learning" "data science"

    View Slide

  29. 2006 2011 2016
    "machine learning" "data science" "big data"

    View Slide

  30. 2006 2011 2016
    "machine learning" "data science" "big data" "hadoop"

    View Slide

  31. 2006 2011 2016
    "hadoop"

    View Slide

  32. [do palm card version]

    View Slide

  33. [do palm card version]

    View Slide

  34. 2006 2011 2016 2022

    View Slide

  35. 2006 2011 2016 2022

    View Slide


  36. View Slide


  37. View Slide

  38. as a programmer
    the probabilities are higher than ever
    of working on a machine learning project

    View Slide

  39. What do you do if you get pulled in to a project with a
    Machine Learning spin on it?

    View Slide

  40. How do you approach the technology interface between
    science and engineering?

    View Slide

  41. How do you approach the people interface between
    science and engineering?

    View Slide

  42. What is Machine Learning?

    View Slide

  43. “The field of machine learning is concerned
    with the question of how to construct
    computer programs that automatically
    improve with experience.”
    Tom Mitchell
    Machine Learning

    View Slide

  44. View Slide

  45. mispelled
    Can we accurately identify
    mispelled words?

    View Slide

  46. View Slide

  47. The I Just Read “The Lean Start-Up” Solution
    # Dictionary is the slide's placeholder class, not a real library.
    def spell_check(word)
      dictionary = Dictionary.load(file: "dictionary.yaml")
      if dictionary.has_value?(word)
        { correct: true }
      else
        { correct: false, suggestions: ["Use a dictionary ;)"] }
      end
    end

    View Slide

  48. The I Just Read “TAOCP” Solution
    /* Returns 0 on success and writes the candidates to *suggestions;
       any non-zero value is an error code from a helper. */
    int spell_check(Dictionary *dictionary, const char *word,
                    char ***suggestions) {
      char **ngrams, **distanced;
      int err;

      err = generate_within_levenshtein_distance(word, &distanced);
      if (err != 0) return err;

      err = generate_ngrams(word, &ngrams);
      if (err != 0) return err;

      err = matching(dictionary, ngrams, distanced, suggestions);
      if (err != 0) return err;

      return 0;
    }

    View Slide

  49. View Slide

  50. View Slide

  51. mispelled
    Can we accurately identify
    mispelled words?

    View Slide

  52. Look at how Google does spell checking: it's
    not based on dictionaries; it's based on word
    usage statistics of the entire Internet, which is
    why Google knows how to correct my name,
    misspelled, and Microsoft Word doesn’t.
    Joel Spolsky
    Joel on Software / 2005-10-17

    View Slide
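
The data-driven approach in the quote above can be sketched in a few lines: count word usage in a corpus, then suggest the most frequent known words within one edit of the input. This is a minimal sketch in the style of Peter Norvig's well-known toy spelling corrector, not a description of how Google does it. The function names and the idea of passing the corpus as a plain string are assumptions made for illustration.

```ruby
# Toy data-driven spell checker: suggestions are ranked by word counts.

# Count how often each word appears in a corpus string.
def word_counts(corpus)
  corpus.downcase.scan(/[a-z]+/).tally
end

# All strings one edit (delete, swap, replace, insert) away from word.
def edits1(word)
  letters = ("a".."z").to_a
  splits = (0..word.length).map { |i| [word[0, i], word[i..]] }
  tails = splits.reject { |_, r| r.empty? }
  deletes = tails.map { |l, r| l + r[1..] }
  swaps = splits.select { |_, r| r.length > 1 }.map { |l, r| l + r[1] + r[0] + r[2..] }
  replaces = tails.flat_map { |l, r| letters.map { |c| l + c + r[1..] } }
  inserts = splits.flat_map { |l, r| letters.map { |c| l + c + r } }
  (deletes + swaps + replaces + inserts).uniq
end

# Known words are correct; otherwise suggest known words one edit away,
# most frequent first.
def spell_check(word, counts)
  return { correct: true } if counts.key?(word)
  candidates = edits1(word).select { |w| counts.key?(w) }
  { correct: false, suggestions: candidates.sort_by { |w| -counts[w] } }
end
```

The code never changes when the checker needs to improve; only the corpus does. That is the trade the following slides describe: behaviour improves with more or better data rather than with a smarter fixed algorithm.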

  53. View Slide

  54. View Slide

  55. View Slide

  56. View Slide

  57. View Slide

  58. View Slide

  59. Lots of words that are correctly spelled in a valid context.

    View Slide

  60. View Slide

  61. View Slide

  62. data driven
    vs
    code driven

    View Slide

  63. Fixed Algorithm

    View Slide

  64. General Purpose

    View Slide

  65. Can be Simpler

    View Slide

  66. More Experience

    View Slide

  67. Some Problems
    Intractable

    View Slide

  68. Learning Algorithm

    View Slide

  69. Restricted Domains

    View Slide

  70. Improve with
    Smarter Algorithms

    View Slide

  71. Improve with
    More or Better Data

    View Slide

  72. Can Handle Situations
    Infeasible for
    Code Driven Approaches

    View Slide

  73. Can Be a Really
    Expensive Way to
    Encode an If Statement

    View Slide

  74. Machine Learning Systems

    View Slide

  75. View Slide

  76. View Slide

  77. View Slide

  78. ? ? ? ? ?

    View Slide

  79. 480,189 users
    17,770 movies
    100,480,507 ratings

    View Slide

  80. View Slide

  81. View Slide

  82. View Slide

  83. https://www.kaggle.com/c/santander-product-recommendation/data

    View Slide

  84. 2016-06-28,1416856,N,ES,H, 21,2015-07-25,0, 11, 1,,1.0,A,S,N,,KHQ,N,1, 6,"BADAJOZ",1, 38937.48,03 - UNIVERSITARIO
    2016-06-28,1202981,N,ES,H, 23,2013-10-18,0, 32, 1,,1.0,I,S,N,,KHE,N,1,29,"MALAGA",0, 56409.06,03 - UNIVERSITARIO
    2016-06-28, 137134,N,ES,V, 51,1999-06-30,0, 204, 1,,1.0,A,S,S,,KAT,N,1,28,"MADRID",1, 443237.88,02 - PARTICULARES
    2016-06-28,1256662,N,ES,V, 32,2014-05-06,0, 25, 1,,1.0,A,S,N,,KFC,N,1, 2,"ALBACETE",1, 69776.79,03 - UNIVERSITARIO
    2016-06-28, 833024,N,ES,V, 36,2009-02-08,0, 88, 1,,1.0,I,S,N,,KFC,N,1,24,"LEON",0, 80136.27,02 - PARTICULARES
    2016-06-28, 198396,N,ES,V, 44,2000-10-13,0, 188, 1,,1.0,A,S,N,,KFC,N,1,28,"MADRID",1, 451931.22,02 - PARTICULARES
    2016-06-28,1055228,N,ES,H, 43,2012-08-31,0, 45, 1,,1,A,S,N,,KFC,N,1,11,"CADIZ",1, 57271.83,02 - PARTICULARES
    2016-06-28,1453594,N,ES,H, 21,2015-09-17,0, 9, 1,,1.0,I,S,N,,KHQ,N,1,15,"CORUÑA, A",0, NA,03 - UNIVERSITARIO
    2016-06-28,1114959,N,ES,V, 48,2012-12-28,0, 42, 1,,1,A,S,N,,KFC,N,1, 6,"BADAJOZ",1, 164920.32,02 - PARTICULARES
    2016-06-28, 193664,N,ES,H, 90,2000-10-09,0, 189, 1,,1.0,I,S,N,,KAT,N,1,50,"ZARAGOZA",0, 63982.68,02 - PARTICULARES
    2016-06-28,1461846,N,ES,H, 22,2015-09-25,0, 9, 1,,1,I,S,N,,KHQ,N,1, 6,"BADAJOZ",0, NA,03 - UNIVERSITARIO
    2016-06-28, 281786,N,ES,V, 84,2001-10-13,0, 176, 1,,1,I,S,N,,KAT,N,1,41,"SEVILLA",0, 204135.63,02 - PARTICULARES
    2016-06-28, 931057,N,ES,V, 25,2011-08-09,0, 58, 1,,1.0,I,S,N,,KHE,N,1, 8,"BARCELONA",0, 71185.62,03 - UNIVERSITARIO
    2016-06-28, 380119,N,ES,H, 66,2002-09-02,0, 166, 1,,1.0,A,S,N,,KFC,N,1,46,"VALENCIA",1, 34973.19,02 - PARTICULARES
    2016-06-28, 509236,N,ES,V, 39,2004-12-30,0, 138, 1,,1,A,S,N,,KFC,N,1,41,"SEVILLA",1, 86109.66,02 - PARTICULARES
    2016-06-28, 755342,N,ES,V, 51,2008-03-24,0, 99, 1,,1.0,A,S,N,,KAT,N,1, 8,"BARCELONA",1, 29992.74,02 - PARTICULARES
    2016-06-28, 678258,N,ES,H, 38,2007-02-20,0, 112, 1,,1.0,I,S,N,,KAT,N,1,28,"MADRID",0, 133180.17,02 - PARTICULARES
    2016-06-28, 103307,N,ES,V, 44,1998-08-04,0, 215, 1,,1.0,I,S,S,,KAT,N,1,28,"MADRID",1, 76519.59,02 - PARTICULARES
    2016-06-28,1308331,N,ES,H, 22,2014-09-16,0, 21, 1,,1,I,S,N,,KHE,N,1,36,"PONTEVEDRA",0, 134962.29,03 - UNIVERSITARIO
    2016-06-28,1006357,N,ES,V, 32,2012-02-27,0, 52, 1,,1,A,S,N,,KFA,N,1,28,"MADRID",1, 65619.90,03 - UNIVERSITARIO
    2016-06-28, 124854,N,ES,V, 45,1999-03-10,0, 207, 1,,1,I,S,N,,KAT,N,1, 8,"BARCELONA",0, NA,02 - PARTICULARES
    2016-06-28, 757178,N,ES,V, 59,2008-03-31,0, 99, 1,,1.0,A,S,N,,KFC,N,1,28,"MADRID",1, 109184.13,02 - PARTICULARES
    2016-06-28, 759426,N,ES,H, 68,2008-04-09,0, 98, 1,,1,I,S,N,,KFA,N,1,28,"MADRID",0, 210710.49,02 - PARTICULARES
    2016-06-28,1193227,N,ES,H, 33,2013-10-09,0, 32, 1,,1.0,I,S,N,,KHE,N,1,50,"ZARAGOZA",0, 42343.29,03 - UNIVERSITARIO
    2016-06-28,1192797,N,ES,V, 25,2013-10-09,0, 32, 1,,1,I,S,N,,KHE,N,1,33,"ASTURIAS",0, 176043.90,03 - UNIVERSITARIO
    2016-06-28,1085653,N,ES,V, 33,2012-10-22,0, 44, 1,,1.0,I,S,N,,KHE,N,1, 8,"BARCELONA",0, 128796.93,03 - UNIVERSITARIO
    2016-06-28,1486100,N,ES,H, 22,2015-10-21,0, 8, 1,,1.0,I,S,N,,KHQ,N,1,28,"MADRID",0, NA,03 - UNIVERSITARIO
    2016-06-28, 31025,N,ES,V, 61,1996-01-12,0, 245, 1,,1.0,A,S,N,,KAT,N,1,28,"MADRID",1, 140976.18,01 - TOP
    2016-06-28,1471619,N,ES,H, 22,2015-10-07,0, 8, 1,,1.0,I,S,N,,KHQ,N,1,19,"GUADALAJARA",0, NA,03 - UNIVERSITARIO

    View Slide

  85. fecha_dato The table is partitioned for this column
    ncodpers Customer code
    ind_empleado Employee index: A active, B ex employed, F filial, N not employee, P pasive
    pais_residencia Customer's Country residence
    sexo Customer's sex
    age Age
    fecha_alta The date in which the customer became as the first holder of a contract in the bank
    ind_nuevo New customer Index. 1 if the customer registered in the last 6 months.
    antiguedad Customer seniority (in months)
    indrel 1 (First/Primary), 99 (Primary customer during the month but not at the end of the month)
    ult_fec_cli_1t Last date as primary customer (if he isn't at the end of the month)
    indrel_1mes Customer type at the beginning of the month ,1 (First/Primary customer), 2 (co-owner ),P (Potential),3 (former primary), 4(former co-owner)
    tiprel_1mes Customer relation type at the beginning of the month, A (active), I (inactive), P (former customer),R (Potential)
    indresi Residence index (S (Yes) or N (No) if the residence country is the same than the bank country)
    indext Foreigner index (S (Yes) or N (No) if the customer's birth country is different than the bank country)
    conyuemp Spouse index. 1 if the customer is spouse of an employee
    canal_entrada channel used by the customer to join
    indfall Deceased index. N/S
    tipodom Addres type. 1, primary address
    cod_prov Province code (customer's address)
    nomprov Province name
    ind_actividad_cliente Activity index (1, active customer; 0, inactive customer)

    View Slide

  86. Something showing the reality of where all of these
    features would have needed to be pulled from

    View Slide

  87. Something showing what would be required to “execute”
    on predictions - i.e. plumbing in to decisioning systems,
    etc

    View Slide

  88. View Slide

  89. View Slide

  90. “We evaluated some of the new methods
    offline but the additional accuracy gains that
    we measured did not seem to justify the
    engineering effort needed to bring them
    into a production environment.”
    Xavier Amatriain and Justin Basilico
    Personalisation Science and Engineering at Netflix

    View Slide

  91. Something showing what the Netflix architecture looks
    like

    View Slide

  92. View Slide

  93. View Slide

  94. View Slide

  95. Ambiata
    - multiple verticals
    - different data
    - getting multiple ML systems to production

    View Slide

  96. Receive data
    every day
    Batch score models
    every day
    Prepare features
    every day
    x N

    View Slide

  97. - good results
    - 1. more/better data
    - 2. better algorithms
    -> it's a business decision as to which one to focus on -> which has the higher ROI

    View Slide

  98. Production

    View Slide

  99. “A wide-spread and uncomfortable trend has
    emerged: developing and deploying ML systems
    is relatively fast and cheap, but maintaining them
    over time is difficult and expensive.”
    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov,
    Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael
    Young, Jean-Francois Crespo, Dan Dennison
    Hidden Technical Debt in Machine Learning Systems

    View Slide

  100. View Slide

  101. Data Acquisition

    View Slide

  102. View Slide

  103. View Slide

  104. View Slide

  105. View Slide

  106. View Slide

  107. View Slide

  108. data acquisition is non-trivial

    View Slide

  109. the data will be messy

    View Slide

  110. format zoo

    View Slide

  111. the most important property
    of a robust data architecture
    is to have hard edges

    View Slide
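
One way to read "hard edges" is that every record is parsed and validated at the boundary, and anything malformed is rejected there instead of leaking into later stages. The sketch below assumes rows have already been split into a hash of strings. The field names `fecha_dato`, `ncodpers` and `age` come from the Santander sample shown earlier; `income` is a hypothetical name for the numeric column that is sometimes `NA` in that sample, and `parse_customer`, `ingest` and the age range are illustrative choices.

```ruby
require "date"

# Parse one raw record (a hash of strings) into typed values, or raise.
def parse_customer(raw)
  date = Date.iso8601(raw.fetch("fecha_dato").strip)
  code = Integer(raw.fetch("ncodpers").strip, 10)
  age = Integer(raw.fetch("age").strip, 10)
  raise ArgumentError, "age out of range: #{age}" unless (0..130).cover?(age)

  # "NA" is an explicit missing value, not a number to be guessed at.
  income_text = raw.fetch("income").strip
  income = income_text == "NA" ? nil : Float(income_text)

  { snapshot_date: date, customer_code: code, age: age, income: income }
end

# Split a batch into accepted records and rejects with a reason attached.
def ingest(rows)
  accepted = []
  rejected = []
  rows.each do |raw|
    begin
      accepted << parse_customer(raw)
    rescue ArgumentError, KeyError => e
      rejected << { row: raw, reason: e.message }
    end
  end
  [accepted, rejected]
end
```

Keeping the rejects with their reasons, rather than dropping them silently, gives the later verification and monitoring stages something concrete to count.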

  112. “Traditional abstractions and boundaries
    may be subtly corrupted or invalidated by
    the fact that data influences ML system
    behavior”
    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov,
    Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael
    Young, Jean-Francois Crespo, Dan Dennison
    Hidden Technical Debt in Machine Learning Systems

    View Slide

  113. “… Indeed, ML is required in exactly those cases
    when the desired behavior cannot be effectively
    expressed in software logic without dependency
    on external data. The real world does not fit into
    tidy encapsulation.”
    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov,
    Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael
    Young, Jean-Francois Crespo, Dan Dennison
    Hidden Technical Debt in Machine Learning Systems

    View Slide

  114. format zoo data-platform

    View Slide

  115. format zoo data-platform
    “the” format

    View Slide

  116. format zoo data-platform
    optimise for data-size

    View Slide

  117. format zoo data-platform
    optimise for i/o performance

    View Slide

  118. format zoo data-platform
    optimise for tooling

    View Slide

  119. format zoo data-platform
    security / privacy?

    View Slide

  120. Data Verification

    View Slide

  121. View Slide

  122. “The field of machine learning is concerned
    with the question of how to construct
    computer programs that automatically
    improve with experience.”
    Tom Mitchell
    Machine Learning

    View Slide

  123. the margin of benefit on machine learning
    systems is often very low

    View Slide

  124. poor quality data can negate any benefit
    a good model may give you

    View Slide

  125. statistics are good at identifying issues in data

    View Slide

  126. statisticians are good at using statistics to
    identify issues in data

    View Slide

  127. the lab is an optimal environment for this

    View Slide

  128. we can use all the data or
    statistically equivalent samples

    View Slide

  129. View Slide

  130. time is just trying to mess with us

    View Slide

  131. data will change over time

    View Slide

  132. data changes must be handled as you go

    View Slide

  133. data issues must be fixed as you go

    View Slide

  134. timeliness of data becomes a quality issue

    View Slide

  135. no escape hatch, you can’t start again

    View Slide

  136. View Slide

  137. View Slide

  138. static checks are important

    View Slide

  139. absolute thresholds are meh

    View Slide

  140. proportional thresholds are ok

    View Slide

  141. statistical properties are good

    View Slide
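
The three kinds of check on the preceding slides can be compared side by side on a single metric, such as the number of rows received per day. This is a minimal sketch; the function names and every limit in it (1,000 rows, 20%, 3 standard deviations) are illustrative assumptions, not recommendations.

```ruby
# Three ways to check today's row count against history.

# Absolute: a fixed floor that goes stale as soon as volumes grow.
def absolute_ok?(today, min: 1_000)
  today >= min
end

# Proportional: compare with the previous day, tolerating a relative change.
def proportional_ok?(today, yesterday, tolerance: 0.2)
  ((today - yesterday).abs.to_f / yesterday) <= tolerance
end

# Statistical: accept values within `limit` sample standard deviations
# of the historical mean.
def statistical_ok?(today, history, limit: 3.0)
  mean = history.sum.to_f / history.size
  variance = history.sum { |x| (x - mean)**2 } / (history.size - 1)
  std = Math.sqrt(variance)
  return today == mean if std.zero?
  ((today - mean) / std).abs <= limit
end
```

The statistical check adapts its tolerance to how noisy the metric has actually been, which is why it tends to produce fewer spurious alarms than a hand-picked threshold.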

  142. anomalies

    View Slide

  143. View Slide

  144. View Slide

  145. anomaly

    View Slide

  146. need to account for
    seasonal and growth trends

    View Slide
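
A simple way to account for seasonality and growth is to compare today's value with the same point in earlier periods (for example, the same weekday in previous weeks), scaled by the growth observed between those periods. The sketch below is deliberately naive and assumes a fixed period of 7 and a growth rate estimated from only the two most recent peers; the function name and tolerance are illustrative.

```ruby
# Flag the last value of a series as anomalous if it deviates from a
# seasonal, growth-adjusted expectation by more than `tolerance`.
def seasonal_anomaly?(series, period: 7, tolerance: 0.25)
  today = series.last

  # Collect the values one, two, three... periods before today.
  peers = []
  index = series.size - 1 - period
  while index >= 0
    peers << series[index]
    index -= period
  end
  return false if peers.size < 2 # not enough history to judge

  # Growth per period, estimated from the two most recent peers.
  growth = peers[0].to_f / peers[1]
  expected = peers[0] * growth
  ((today - expected).abs / expected) > tolerance
end
```

With a weekly pattern that dips at weekends, a plain day-over-day comparison would raise an alarm every Saturday; comparing against the previous Saturdays does not.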

  147. breakouts

    View Slide

  148. View Slide

  149. View Slide

  150. breakout

    View Slide

  151. need to account for
    seasonal and growth trends

    View Slide

  152. proportional thresholds are ok

    View Slide

  153. View Slide

  154. Feature Engineering

    View Slide

  155. “At the end of the day, some machine learning
    projects succeed and some fail. What makes the
    difference? Easily the most important factor is
    the features used.”
    Pedro Domingos
    A Few Useful Things to Know about Machine Learning

    View Slide

  156. View Slide

  157. View Slide

  158. View Slide

  159. our systems should be linear in
    data volume not feature count

    View Slide

  160. we want to be able to throw
    new features into the mix

    View Slide

  161. View Slide

  162. we can’t afford to reprocess historical data

    View Slide
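
One way to avoid reprocessing history is to keep small running aggregates per customer and fold each day's data into them, so that a day's work is proportional to that day's volume. The sketch below assumes events arrive as `[customer_id, amount]` pairs; `update_state`, `features` and the two features shown are hypothetical.

```ruby
# Fold one day's events into running per-customer aggregates.
# state maps customer id => { count:, total: }; the input state is not mutated.
def update_state(state, events)
  events.each_with_object(state.transform_values(&:dup)) do |(customer, amount), acc|
    entry = (acc[customer] ||= { count: 0, total: 0.0 })
    entry[:count] += 1
    entry[:total] += amount
  end
end

# Derive features from the aggregates that are already kept.
def features(state)
  state.transform_values do |s|
    { purchase_count: s[:count], mean_spend: s[:total] / s[:count] }
  end
end
```

The limitation is worth stating: a new feature is cheap only if it can be derived from state that is already maintained. A feature that needs a new kind of aggregate still has to start from somewhere, which is the tension the surrounding slides point at.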

  163. Model Training

    View Slide

  164. View Slide

  165. repeatability

    View Slide

  166. repeatability

    View Slide

  167. repeatability

    View Slide

  168. can we retrain models on demand?

    View Slide

  169. can we reproduce results independently?

    View Slide
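
Retraining on demand and reproducing results independently both depend on pinning everything that determines a training run: the data, the parameters and any source of randomness. The sketch below uses a stand-in "model" (the mean of a seeded random half of the data) purely to show the mechanics; `run_fingerprint` and `train` are hypothetical names, and reproducibility of the shuffle is assumed only for the same Ruby version.

```ruby
require "digest"

# Fingerprint everything that determines a training run, so two runs
# can be compared before comparing their results.
def run_fingerprint(data, params, seed)
  Digest::SHA256.hexdigest([data, params.sort, seed].inspect)
end

# Stand-in for training: a seeded shuffle-and-split, with the mean of the
# training half as the "model". Same data and seed give the same result.
def train(data, seed)
  rng = Random.new(seed)
  shuffled = data.shuffle(random: rng)
  training = shuffled.first(data.size / 2)
  training.sum.to_f / training.size
end
```

Storing the fingerprint next to each trained model makes it possible to say whether a changed result came from changed inputs or from something uncontrolled.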

  170. Model Scoring

    View Slide

  171. View Slide

  172. View Slide

  173. Model Deployment

    View Slide

  174. View Slide

  175. View Slide

  176. View Slide

  177. Monitoring

    View Slide

  178. View Slide

  179. alert fatigue is real

    View Slide

  180. actionability of alarms
    needs to be supported by your
    architecture

    View Slide

  181. time to verify failures is high

    View Slide

  182. time to recover failures is high

    View Slide

  183. cost to recover failures is high

    View Slide

  184. cost of false negative is high

    View Slide

  185. cost of false positive is high

    View Slide

  186. Results Delivery

    View Slide

  187. View Slide

  188. View Slide

  189. Change Management

    View Slide

  190. we want to be able to do this again

    View Slide

  191. more models

    View Slide

  192. better models

    View Slide

  193. and we don’t want to make a mistake

    View Slide

  194. View Slide

  195. Delivery

    View Slide

  196. delivery anti patterns

    View Slide

  197. anti-pattern:
    programmers using open source ML software

    View Slide

  198. anti-pattern:
    data scientists scheduling R scripts

    View Slide

  199. View Slide

  200. not just programmers

    View Slide

  201. not just machine learners

    View Slide

  202. anti-pattern:
    we can’t say upfront how long it will take to build a good model

    View Slide

  203. View Slide

  204. time boxing

    View Slide

  205. incremental development

    View Slide

  206. regular reviews

    View Slide

  207. Lab Factory
    investigate
    opportunities
    system
    build
    system
    operate
    analyse
    performance

    View Slide

  208. anti-pattern:
    our model performs really well

    View Slide

  209. View Slide

  210. know what success is
    and
    know how to measure it

    View Slide

  211. more revenue / profit

    View Slide

  212. more clicks

    View Slide

  213. more customers

    View Slide

  214. less time between actions

    View Slide

  215. View Slide

  216. run experiments

    View Slide

  217. know if impacts are due to you

    View Slide
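
Knowing whether an impact is due to you usually means holding back a control group and comparing outcomes. The sketch below shows one common shape for this: a deterministic hash-based assignment, and the difference in conversion rate between groups. The function names, the experiment-name salt and the 50/50 split are assumptions, and no significance test is included.

```ruby
require "digest"

# Deterministically assign a customer to :control or :treatment.
# Hashing the id with the experiment name keeps the assignment stable
# across runs and independent between experiments.
def assign_group(customer_id, experiment, treatment_share: 0.5)
  bucket = Digest::SHA256.hexdigest("#{experiment}:#{customer_id}")[0, 8].to_i(16)
  bucket / 0xFFFFFFFF.to_f < treatment_share ? :treatment : :control
end

# Conversion rate per group, and the absolute uplift of treatment over control.
def uplift(outcomes)
  rates = outcomes.group_by { |o| o[:group] }.transform_values do |rows|
    rows.count { |o| o[:converted] }.to_f / rows.size
  end
  rates[:treatment] - rates[:control]
end
```

A positive uplift on a small sample can easily be noise, so in practice the comparison would be paired with a significance test before anyone claims the result.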

  218. anti-pattern:
    google did it

    View Slide

  219. View Slide

  220. latest algorithms aren’t always the answer

    View Slide

  221. more/better data isn’t always the answer

    View Slide

  222. an informed ROI discussion is the answer

    View Slide

  223. “CIOs are in trouble right now… We’ve seen
    exponential growth in data. If I drop data on
    the floor and lose it, I am a bad CIO but if
    my budget grows exponentially to handle it,
    I am also a bad CIO.”
    Stephen Probst
    CTO at Teradata

    View Slide

  224. View Slide

  225. Lab to Factory

    View Slide