Lab to Factory: Robust Machine Learning Systems

Lab to Factory

machine learning is our saviour

https://techcrunch.com/2016/11/26/machine-learning-can-fix-twitter-facebook-and-maybe-even-america/

machine learning is being democratised

machine learning

[chart, 2006 to 2016: "machine learning"]

http://www.computerworld.com.au/article/601117/machine-learning-new-face-enterprise-data/

[chart, 2006 to 2016: "machine learning", "data science"]

http://www.kdnuggets.com/2016/01/businesses-need-one-million-data-scientists-2018.html

https://hbr.org/2016/11/hiring-your-first-chief-ai-officer

http://www.burtchworks.com/2015/03/02/4-ways-to-spot-a-fake-data-scientist/

https://www.coursera.org/learn/machine-learning

http://www.uts.edu.au/future-students/find-a-course/courses/c04293

[chart, 2006 to 2016: "machine learning", "data science"]

[chart, 2006 to 2016: "machine learning", "data science", "big data"]

[chart, 2006 to 2016: "machine learning", "data science", "big data", "hadoop"]

[chart, 2006 to 2016: "hadoop"]

[do palm card version]

[chart, time axis extended: 2006 to 2022]

as a programmer, you are more likely than ever to end up working on a machine learning project

What do you do if you get pulled into a project with a Machine Learning spin on it?

How do you approach the technology interface between science and
engineering?

How do you approach the people interface between science and
engineering?

What is Machine Learning?

“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.”
Tom Mitchell, Machine Learning

mispelled Can we accurately identify mispelled words?

The I Just Read “The Lean Start-Up” Solution

```ruby
# Dictionary is a helper class that the slide does not define.
def spell_check(word)
  dictionary = Dictionary.load(file: "dictionary.yaml")
  if dictionary.has_value?(word)
    { correct: true }
  else
    { correct: false, suggestions: ["Use a dictionary ;)"] }
  end
end
```

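The I Just Read “TAOCP” Solution

```c
/* Dictionary and the three helper functions are not defined on the slide.
   Returns 0 on success and writes the matches to *suggestions. */
int spell_check(Dictionary *dictionary, const char *word, char ***suggestions) {
  char **ngrams, **distanced;
  int err;

  /* Candidate words within a small edit distance of the input. */
  err = generate_within_levenshtein_distance(word, &distanced);
  if (err != 0) return err;

  /* Character n-grams of the input. */
  err = generate_ngrams(word, &ngrams);
  if (err != 0) return err;

  /* Keep the candidates that the dictionary knows about. */
  err = matching(dictionary, ngrams, distanced, suggestions);
  if (err != 0) return err;

  return 0;
}
```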

mispelled Can we accurately identify mispelled words?

“Look at how Google does spell checking: it's not based on dictionaries; it's based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn’t.”
Joel Spolsky, Joel on Software, 2005-10-17
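
A rough sketch of what "based on word usage statistics" can mean in practice, in the spirit of Peter Norvig's well-known toy spelling corrector. It is not how Google does it. The tiny `CORPUS`, the single-edit candidate set and the function names are all assumptions made for illustration; the point is only that the correction comes from counts in the data rather than from a fixed word list.

```python
from collections import Counter
import re

# Toy corpus standing in for "word usage statistics of the entire Internet".
CORPUS = """
the spelling of a misspelled word can be corrected when the
corrected word is seen often enough in text people actually write
misspelled words are rare and correctly spelled words are common
"""

WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit of word."""
    word = word.lower()
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing better is known, so leave the input unchanged
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

With a bigger corpus the same code gets better without changing a line, which is the data driven trade in miniature.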

Lots of words that are correctly spelled in a valid context. [repeated to fill the slide]

data driven vs code driven

Fixed Algorithm

General Purpose

Can be Simpler

More Experience

Some Problems Intractable

Learning Algorithm

Restricted Domains

Improve with Smarter Algorithms

Improve with More or Better Data

Can Handle Situations Infeasible for Code Driven Approaches

Can Be a Really Expensive Way to Encode an If Statement

Machine Learning Systems

? ? ? ? ?

480,189 users 17,770 movies 100,480,507 ratings

https://www.kaggle.com/c/santander-product-recommendation/data

2016-06-28,1416856,N,ES,H, 21,2015-07-25,0, 11, 1,,1.0,A,S,N,,KHQ,N,1, 6,"BADAJOZ",1, 38937.48,03 - UNIVERSITARIO
2016-06-28,1202981,N,ES,H, 23,2013-10-18,0, 32, 1,,1.0,I,S,N,,KHE,N,1,29,"MALAGA",0, 56409.06,03 - UNIVERSITARIO
2016-06-28, 137134,N,ES,V, 51,1999-06-30,0, 204, 1,,1.0,A,S,S,,KAT,N,1,28,"MADRID",1, 443237.88,02 - PARTICULARES
2016-06-28,1256662,N,ES,V, 32,2014-05-06,0, 25, 1,,1.0,A,S,N,,KFC,N,1, 2,"ALBACETE",1, 69776.79,03 - UNIVERSITARIO
2016-06-28, 833024,N,ES,V, 36,2009-02-08,0, 88, 1,,1.0,I,S,N,,KFC,N,1,24,"LEON",0, 80136.27,02 - PARTICULARES
2016-06-28, 198396,N,ES,V, 44,2000-10-13,0, 188, 1,,1.0,A,S,N,,KFC,N,1,28,"MADRID",1, 451931.22,02 - PARTICULARES
2016-06-28,1055228,N,ES,H, 43,2012-08-31,0, 45, 1,,1,A,S,N,,KFC,N,1,11,"CADIZ",1, 57271.83,02 - PARTICULARES
2016-06-28,1453594,N,ES,H, 21,2015-09-17,0, 9, 1,,1.0,I,S,N,,KHQ,N,1,15,"CORUÑA, A",0, NA,03 - UNIVERSITARIO
2016-06-28,1114959,N,ES,V, 48,2012-12-28,0, 42, 1,,1,A,S,N,,KFC,N,1, 6,"BADAJOZ",1, 164920.32,02 - PARTICULARES
2016-06-28, 193664,N,ES,H, 90,2000-10-09,0, 189, 1,,1.0,I,S,N,,KAT,N,1,50,"ZARAGOZA",0, 63982.68,02 - PARTICULARES
2016-06-28,1461846,N,ES,H, 22,2015-09-25,0, 9, 1,,1,I,S,N,,KHQ,N,1, 6,"BADAJOZ",0, NA,03 - UNIVERSITARIO
2016-06-28, 281786,N,ES,V, 84,2001-10-13,0, 176, 1,,1,I,S,N,,KAT,N,1,41,"SEVILLA",0, 204135.63,02 - PARTICULARES
2016-06-28, 931057,N,ES,V, 25,2011-08-09,0, 58, 1,,1.0,I,S,N,,KHE,N,1, 8,"BARCELONA",0, 71185.62,03 - UNIVERSITARIO
2016-06-28, 380119,N,ES,H, 66,2002-09-02,0, 166, 1,,1.0,A,S,N,,KFC,N,1,46,"VALENCIA",1, 34973.19,02 - PARTICULARES
2016-06-28, 509236,N,ES,V, 39,2004-12-30,0, 138, 1,,1,A,S,N,,KFC,N,1,41,"SEVILLA",1, 86109.66,02 - PARTICULARES
2016-06-28, 755342,N,ES,V, 51,2008-03-24,0, 99, 1,,1.0,A,S,N,,KAT,N,1, 8,"BARCELONA",1, 29992.74,02 - PARTICULARES
2016-06-28, 678258,N,ES,H, 38,2007-02-20,0, 112, 1,,1.0,I,S,N,,KAT,N,1,28,"MADRID",0, 133180.17,02 - PARTICULARES
2016-06-28, 103307,N,ES,V, 44,1998-08-04,0, 215, 1,,1.0,I,S,S,,KAT,N,1,28,"MADRID",1, 76519.59,02 - PARTICULARES
2016-06-28,1308331,N,ES,H, 22,2014-09-16,0, 21, 1,,1,I,S,N,,KHE,N,1,36,"PONTEVEDRA",0, 134962.29,03 - UNIVERSITARIO
2016-06-28,1006357,N,ES,V, 32,2012-02-27,0, 52, 1,,1,A,S,N,,KFA,N,1,28,"MADRID",1, 65619.90,03 - UNIVERSITARIO
2016-06-28, 124854,N,ES,V, 45,1999-03-10,0, 207, 1,,1,I,S,N,,KAT,N,1, 8,"BARCELONA",0, NA,02 - PARTICULARES
2016-06-28, 757178,N,ES,V, 59,2008-03-31,0, 99, 1,,1.0,A,S,N,,KFC,N,1,28,"MADRID",1, 109184.13,02 - PARTICULARES
2016-06-28, 759426,N,ES,H, 68,2008-04-09,0, 98, 1,,1,I,S,N,,KFA,N,1,28,"MADRID",0, 210710.49,02 - PARTICULARES
2016-06-28,1193227,N,ES,H, 33,2013-10-09,0, 32, 1,,1.0,I,S,N,,KHE,N,1,50,"ZARAGOZA",0, 42343.29,03 - UNIVERSITARIO
2016-06-28,1192797,N,ES,V, 25,2013-10-09,0, 32, 1,,1,I,S,N,,KHE,N,1,33,"ASTURIAS",0, 176043.90,03 - UNIVERSITARIO
2016-06-28,1085653,N,ES,V, 33,2012-10-22,0, 44, 1,,1.0,I,S,N,,KHE,N,1, 8,"BARCELONA",0, 128796.93,03 - UNIVERSITARIO
2016-06-28,1486100,N,ES,H, 22,2015-10-21,0, 8, 1,,1.0,I,S,N,,KHQ,N,1,28,"MADRID",0, NA,03 - UNIVERSITARIO
2016-06-28, 31025,N,ES,V, 61,1996-01-12,0, 245, 1,,1.0,A,S,N,,KAT,N,1,28,"MADRID",1, 140976.18,01 - TOP
2016-06-28,1471619,N,ES,H, 22,2015-10-07,0, 8, 1,,1.0,I,S,N,,KHQ,N,1,19,"GUADALAJARA",0, NA,03 - UNIVERSITARIO

fecha_dato: The table is partitioned for this column
ncodpers: Customer code
ind_empleado: Employee index: A active, B ex employed, F filial, N not employee, P pasive
pais_residencia: Customer's Country residence
sexo: Customer's sex
age: Age
fecha_alta: The date in which the customer became as the first holder of a contract in the bank
ind_nuevo: New customer Index. 1 if the customer registered in the last 6 months.
antiguedad: Customer seniority (in months)
indrel: 1 (First/Primary), 99 (Primary customer during the month but not at the end of the month)
ult_fec_cli_1t: Last date as primary customer (if he isn't at the end of the month)
indrel_1mes: Customer type at the beginning of the month, 1 (First/Primary customer), 2 (co-owner), P (Potential), 3 (former primary), 4 (former co-owner)
tiprel_1mes: Customer relation type at the beginning of the month, A (active), I (inactive), P (former customer), R (Potential)
indresi: Residence index (S (Yes) or N (No) if the residence country is the same than the bank country)
indext: Foreigner index (S (Yes) or N (No) if the customer's birth country is different than the bank country)
conyuemp: Spouse index. 1 if the customer is spouse of an employee
canal_entrada: channel used by the customer to join
indfall: Deceased index. N/S
tipodom: Addres type. 1, primary address
cod_prov: Province code (customer's address)
nomprov: Province name
ind_actividad_cliente: Activity index (1, active customer; 0, inactive customer)

Something showing the reality of where all of these features
would have needed to be pulled from

Something showing what would be required to “execute” on predictions
- i.e. plumbing in to decisioning systems, etc

“We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”
Xavier Amatriain and Justin Basilico, Personalisation Science and Engineering at Netflix

Something showing what the Netflix architecture looks like

Ambiata - multiple verticals - different data - getting multiple
ML systems to production

Receive data every day
Prepare features every day
Batch score models every day
x N

good results come from:
1. more/better data
2. better algorithms
-> it's a business decision as to which one to focus on
-> whichever has the higher ROI

Production

“A wide-spread and uncomfortable trend has emerged: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.”
D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison, Hidden Technical Debt in Machine Learning Systems

Data Acquisition

data acquisition is non-trivial

the data will be messy

format zoo

the most important property of a robust data architecture is to have hard edges
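
One way to read "hard edges" is sketched below: every row is parsed into a typed record at the boundary, and anything that does not fit is rejected there instead of leaking downstream. The `Customer` record, the `RejectedRow` error and the age range are hypothetical names and limits chosen for illustration. The column positions follow the Santander sample rows shown earlier; treating column 22 as income, with `NA` meaning missing, is an assumption about the Kaggle file and is not stated in the deck.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Customer:
    snapshot: date
    customer_code: int
    age: int
    province: str
    income: Optional[float]  # None when the source says "NA" (assumed meaning)


class RejectedRow(ValueError):
    """Raised at the edge so bad rows never reach feature code."""


def parse_row(line: str) -> Customer:
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) != 24:
        raise RejectedRow(f"expected 24 fields, got {len(fields)}")
    try:
        income_raw = fields[22].strip()
        customer = Customer(
            snapshot=date.fromisoformat(fields[0].strip()),
            customer_code=int(fields[1]),  # int() tolerates the padding
            age=int(fields[5]),
            province=fields[20].strip(),
            income=None if income_raw == "NA" else float(income_raw),
        )
    except ValueError as err:
        raise RejectedRow(str(err)) from err
    if not 0 <= customer.age <= 120:  # illustrative plausibility range
        raise RejectedRow(f"implausible age {customer.age}")
    return customer
```

Everything behind `parse_row` can then rely on types and ranges, and the rejected rows become a number you can watch.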

“Traditional abstractions and boundaries may be subtly corrupted or invalidated by the fact that data influences ML system behavior”
D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison, Hidden Technical Debt in Machine Learning Systems

“… Indeed, ML is required in exactly those cases when the desired behavior cannot be effectively expressed in software logic without dependency on external data. The real world does not fit into tidy encapsulation.”
D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison, Hidden Technical Debt in Machine Learning Systems

format zoo data-platform

format zoo data-platform “the” format

format zoo data-platform optimise for data-size

format zoo data-platform optimise for i/o performance

format zoo data-platform optimise for tooling

format zoo data-platform security / privacy?

Data Verification

“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.”
Tom Mitchell, Machine Learning

the margin of benefit on machine learning systems is often
very low

poor quality data can negate any benefit a good model
may give you

statistics are good at identifying issues in data

statisticians are good at using statistics to identify issues in
data

the lab is an optimal environment for this

we can use all the data or statistically equivalent samples

time is just trying to mess with us

data will change over time

data changes must be handled as you go

data issues must be fixed as you go

timeliness of data becomes a quality issue

no escape hatch, you can’t start again

static checks are important

absolute thresholds are meh

proportional thresholds are ok

statistical properties are good
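
The three kinds of check can be put side by side on something as simple as a daily row count. This is a minimal sketch: the function names and the default limits (`minimum`, `max_change`, `max_sigma`) are illustrative assumptions, not recommended values.

```python
from statistics import mean, stdev


def absolute_check(count, minimum=1000):
    """Fixed floor: goes stale as volumes grow or shrink."""
    return count >= minimum


def proportional_check(count, previous, max_change=0.2):
    """Relative to yesterday: tracks scale, but is blind to slow drift."""
    return abs(count - previous) <= max_change * previous


def statistical_check(count, history, max_sigma=3.0):
    """Relative to the spread of recent history (a simple z-score)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return count == mu
    return abs(count - mu) / sigma <= max_sigma
```

With a history of 10000, 10100, 9900, 10050, 9950 rows and 9000 rows today, the absolute and proportional checks both pass, while the statistical check fails because the recent spread is only about 80 rows.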

anomalies

anomaly

need to account for seasonal and growth trends

breakouts

breakout

need to account for seasonal and growth trends

proportional thresholds are ok
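
A minimal sketch of accounting for seasonal and growth trends before applying a proportional threshold. It assumes a weekly season (`period=7`), estimates growth from the last two full weeks, and uses an illustrative `tolerance`; a real system would likely use a proper time series decomposition instead. All names here are hypothetical.

```python
def expected_today(history, period=7):
    """Seasonal-naive forecast scaled by period-over-period growth.

    history holds one value per day, oldest first, ending yesterday,
    and must cover at least two full periods.
    """
    if len(history) < 2 * period:
        raise ValueError("need at least two full periods of history")
    last_period = history[-period:]
    prior_period = history[-2 * period:-period]
    growth = sum(last_period) / sum(prior_period)
    # history[-period] is the value from the same weekday one period ago.
    return history[-period] * growth


def is_anomalous(today, history, tolerance=0.25, period=7):
    """Flag today if it is off the forecast by more than tolerance."""
    expected = expected_today(history, period)
    return abs(today - expected) > tolerance * expected
```

A quiet Saturday after a busy Friday is then compared with last Saturday rather than with yesterday, so the weekly dip does not page anyone.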

Feature Engineering

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”
Pedro Domingos, A Few Useful Things to Know about Machine Learning

our systems should be linear in data volume not feature
count

we want to be able to throw new features into
the mix

we can’t afford to reprocess historical data
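
One common way to avoid reprocessing history is to keep small running aggregates per customer and fold each new day into them. The sketch below assumes events arrive as `(customer_id, amount)` pairs; the `RunningFeatures` class and the three example features are hypothetical. Each day costs time proportional to that day's data only.

```python
from collections import defaultdict


class RunningFeatures:
    """Per-customer aggregates updated one day at a time."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)
        self.maximum = {}

    def update(self, daily_events):
        """Fold one day's (customer_id, amount) events into the state."""
        for customer_id, amount in daily_events:
            self.count[customer_id] += 1
            self.total[customer_id] += amount
            previous = self.maximum.get(customer_id, amount)
            self.maximum[customer_id] = max(previous, amount)

    def features(self, customer_id):
        """Current feature values, without touching historical data."""
        n = self.count.get(customer_id, 0)
        if n == 0:
            return {"count": 0, "mean": None, "max": None}
        return {
            "count": n,
            "mean": self.total[customer_id] / n,
            "max": self.maximum[customer_id],
        }
```

The trade-off is that a feature added later only sees data from the day it was introduced, unless its state can be derived from aggregates that were already being kept.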

Model Training

repeatability

can we retrain models on demand?

can we reproduce results independently?
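
A small sketch of what repeatability asks for: record exactly which data and settings produced a model, and keep randomness seeded and local. The "model" here is a deliberately trivial stand-in (the mean of a seeded sample), and `fingerprint`, `train` and `train_with_manifest` are hypothetical names, not part of any particular framework.

```python
import hashlib
import json
import random


def fingerprint(rows):
    """Stable hash of the training data, independent of row order."""
    digest = hashlib.sha256()
    for row in sorted(json.dumps(r, sort_keys=True) for r in rows):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def train(rows, seed, sample_size):
    """Stand-in 'model': the mean of a seeded random sample of y."""
    rng = random.Random(seed)  # local generator, no hidden global state
    ordered = sorted(rows, key=lambda r: json.dumps(r, sort_keys=True))
    sample = rng.sample(ordered, sample_size)
    return sum(r["y"] for r in sample) / sample_size


def train_with_manifest(rows, seed=42, sample_size=3):
    """Train and return the model together with what produced it."""
    model = train(rows, seed, sample_size)
    manifest = {
        "data_sha256": fingerprint(rows),
        "seed": seed,
        "sample_size": sample_size,
    }
    return model, manifest
```

If two runs share a manifest they should produce the same model; if they do not, something outside the manifest is influencing training and needs to be pulled into it.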

Model Scoring

Model Deployment

Monitoring

alert fatigue is real

actionability of alarms needs to be supported by your architecture

time to verify failures is high

time to recover failures is high

cost to recover failures is high

cost of false negative is high

cost of false positive is high

Results Delivery

Change Management

we want to be able to do this again

more models

better models

and we don’t want to make a mistake

Delivery

delivery anti patterns

anti-pattern: programmers using open source ML software

anti-pattern: data scientists scheduling R scripts

not just programmers

not just machine learners

anti-pattern: we can’t say upfront how long it will take
to build a good model

time boxing

incremental development

regular reviews

Lab Factory investigate opportunities system build system operate analyse performance

anti-pattern: our model performs really well

know what success is and know how to measure it

more revenue / profit

more clicks

more customers

less time between actions

run experiments

know if impacts are due to you
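
To know whether an impact is due to you, the usual move is a randomised experiment with a control group and a significance test on the outcome. The sketch below uses a standard two-proportion z-test on conversion counts; the function names and the example numbers are illustrative, and it assumes customers were randomly assigned to the treated and control groups.

```python
import math


def two_proportion_z(conv_treated, n_treated, conv_control, n_control):
    """z statistic for the difference between two conversion rates."""
    p_treated = conv_treated / n_treated
    p_control = conv_control / n_control
    pooled = (conv_treated + conv_control) / (n_treated + n_control)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_treated + 1 / n_control))
    return (p_treated - p_control) / std_err


def p_value_two_sided(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))
```

For example, 120 conversions out of 1000 treated customers against 100 out of 1000 in control looks like a 20% lift, yet the test gives a p-value of roughly 0.15, so the difference could plausibly be noise.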

anti-pattern: google did it

latest algorithms aren’t always the answer

more/better data isn’t always the answer

an informed ROI discussion is the answer

“CIOs are in trouble right now… We’ve seen exponential growth in data. If I drop data on the floor and lose it, I am a bad CIO but if my budget grows exponentially to handle it, I am also a bad CIO.”
Stephen Brobst, CTO at Teradata

Lab to Factory

More Decks by Mark Hibberd
