Data Products or getting models into Production

Slide 1

Slide 1 text

Data Products
Or how to get models into production
PyData track at PyCon Italy, Friday 17th of April 2015
[email protected]
All opinions my own

Slide 2

Slide 2 text

Who am I?
I work as a Data Scientist for a large telecommunications company
Masters in Mathematics
Specialized in Statistics and Machine Learning
Interned at Amazon
Was a consultant for a while
I've been an analytics product architect on one product
Occasional contributor to Pandas and other projects
@springcoil

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

We can't agree what data science is
I think a data scientist is someone with enough programming ability to leverage their mathematical skills and domain-specific knowledge to turn data into solutions.
The solution should ideally be a product.

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

To help the business most
I believe that data science offers the most value when the models are in production. Some of us call this a 'Data Product'.
In this talk I will explain how to use ScienceOps from Yhat to put a model into production.
Why should Amazon or Google get all the fun? Or the competitive advantage?

Slide 9

Slide 9 text

The last mile problem
Sean Taylor at Facebook calls this the 'last mile problem': how do you translate the insight into something people use?

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

It is hard to incorporate data into day-to-day operations.

Slide 14

Slide 14 text

Data scientists are not software engineers
Although it is not acknowledged by some!
Producing models in code is not the same as producing a good web application: you need domain-specific knowledge of model building and the challenges that presents.

Slide 15

Slide 15 text

R and D != Engineering
Many software engineers think that data science is just an engineering problem. However, scoping a model-building task is hard: you never quite know how to scope it effectively.
Takeaway: Make sure your stakeholders are ready for such high-risk, high-reward projects.

Slide 16

Slide 16 text

Hiring data scientists is hard...

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Why?
The data science process involves something like OSEMIC: Obtain, Scrub, Explore, Model, Interpret, Communicate.
Building the model involved porting code from Matlab and understanding a new domain-specific problem.
The API data sources were messy and hard to understand.

Slide 19

Slide 19 text

Case study: Problem description
A client was working on a visualization tool and needed to provide the results of a differential equation to users in a usable form.
The research problem was already solved, so once the code had been prototyped in Python, what next?
One key requirement was that the results of the 'mathematical engine' had to be incorporated quickly into a Ruby on Rails / JavaScript based product.
The challenge, therefore, is one of interoperability.

Slide 20

Slide 20 text

Possible solutions (and their problems)
Write models in Ruby --> Turned out Ruby doesn't have an ODE solver
Port code to Java --> Cross-language validation
PMML --> Doesn't have great language support
Batch jobs --> High maintenance and configuration
More tools, more work, more time

Slide 21

Slide 21 text

My first solution
Teach Math....

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

So I did what all data scientists do when stuck...

Slide 26

Slide 26 text

I found these guys

Slide 27

Slide 27 text

I could use stuff from YhatHQ to build a model as a service...

Slide 28

Slide 28 text

This is a much better solution!
I used ScienceOps from YhatHQ
Key Tenets
1. Work with the tools you already know
2. Iterate quickly
3. Low touch
4. No rewriting code
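To make "model as a service" concrete, here is a minimal sketch using only the Python standard library. It is not the ScienceOps API: `toy_model`, `handle_request` and `app` are hypothetical names, and the model is a stand-in that averages its inputs. The point is only the shape of the contract: JSON in, JSON out, so a Rails or JavaScript client never has to know the model is written in Python.

```python
import json


def toy_model(inputs):
    """Hypothetical stand-in for a real model: returns the mean of the inputs."""
    values = inputs["values"]
    return {"mean": sum(values) / len(values)}


def handle_request(body, model=toy_model):
    """Decode a JSON request body, run the model, and encode a JSON reply."""
    try:
        inputs = json.loads(body)
        return 200, json.dumps({"result": model(inputs)})
    except (ValueError, KeyError, TypeError, ZeroDivisionError) as exc:
        # Malformed JSON or unusable inputs become a 400 with an error message.
        return 400, json.dumps({"error": str(exc)})


def app(environ, start_response):
    """Minimal WSGI application exposing handle_request over HTTP."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length).decode("utf-8")
    status_code, payload = handle_request(body)
    status = "200 OK" if status_code == 200 else "400 Bad Request"
    data = payload.encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(data)))])
    return [data]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` will serve `app` on port 8000. A hosted service takes care of that part, plus deployment and scaling.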

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Code!
http://bit.ly/1J3T4qf

    import numpy as np
    from scipy import integrate

    # The model constants (bs, astr, N, c1, c2, tdS, tt, z0, C) must be
    # defined before this point; their values are not shown on the slide.
    A1 = bs * (astr * N) ** 2
    A2 = c1 / tdS
    A4 = A1 * z0  # moved above A3, which uses it
    A3 = (1 + bs) * (A4 * N) ** 2
    A5 = A3 * z0
    A6 = C
    A7 = 0.5 * ((c2 / tt) + (c1 / tdS))
    A8 = (c2 / tt) - (c1 / tdS)

    def dX_dt(X, t=0):
        """Return the right-hand side of the three-variable ODE system."""
        return np.array([-A1 * X[2] + A4,
                         -A2 * X[1] + A3 * X[2] - A5,
                         X[0] - X[1]])

    t = np.linspace(0, 35, 1000)  # time grid
    X0 = np.array([0, 1, 0])      # initial conditions
    X, infodict = integrate.odeint(dX_dt, X0, t, full_output=True)
    infodict['message']
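The snippet on the slide leaves its constants undefined, so it cannot be run as shown. Below is a self-contained sketch of the same three-variable system wrapped in a function. The coefficient values are made up purely so the example executes, and `solve_system` is a hypothetical name, not something from the talk. Returning plain lists and strings keeps the result easy to serialize for a non-Python client.

```python
import numpy as np
from scipy import integrate


def solve_system(a1, a2, a3, a4, a5, t_end=35.0, n_points=1000):
    """Integrate the three-variable linear ODE system from the slide.

    The coefficients are hypothetical inputs; the talk does not give
    numeric values for A1..A5.
    """
    def dX_dt(X, t=0):
        return np.array([
            -a1 * X[2] + a4,
            -a2 * X[1] + a3 * X[2] - a5,
            X[0] - X[1],
        ])

    t = np.linspace(0, t_end, n_points)
    X0 = np.array([0.0, 1.0, 0.0])  # same initial conditions as the slide
    X, info = integrate.odeint(dX_dt, X0, t, full_output=True)
    # Convert arrays to lists so the result is JSON-serializable.
    return {"t": t.tolist(), "X": X.tolist(), "message": info["message"]}


# Made-up coefficients, chosen only so the example runs.
result = solve_system(a1=1.0, a2=0.5, a3=0.2, a4=0.1, a5=0.05,
                      t_end=5.0, n_points=50)
```

Each row of `result["X"]` holds the three state variables at the matching entry of `result["t"]`, and `result["message"]` carries the solver's status text.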

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

What are the key takeaways?
1. The 'magic quickly' problem
2. Lack of a shared language between software engineers and data scientists - but investing in the right tooling by using open standards allows success.
3. To help data scientists and analysts succeed, your business needs to be prepared to invest in tooling.

Slide 33

Slide 33 text

Magic quickly
Research is not engineering!
https://xkcd.com/1425/

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Lack of a shared language
Statisticians and software engineers don't necessarily have a shared language. Services like ScienceOps help bridge the gap.
"Watch for high skew and kurtosis"
Think about your team balance in your projects. Math folk versus coders.
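For coders who have not met the terms in that quoted warning, here is a small sketch of what a statistician would check, using `scipy.stats.skew` and `scipy.stats.kurtosis`. The two samples are synthetic and `describe_shape` is a hypothetical helper name. Skew measures asymmetry, and excess kurtosis measures how heavy the tails are compared with a normal distribution.

```python
import numpy as np
from scipy import stats


def describe_shape(sample):
    """Return the sample skewness and excess kurtosis as plain floats."""
    return {
        "skew": float(stats.skew(sample)),
        # stats.kurtosis uses Fisher's definition by default: normal -> 0.
        "excess_kurtosis": float(stats.kurtosis(sample)),
    }


rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)            # roughly bell-shaped
heavy_right_tail = rng.lognormal(size=10_000)  # long tail of large values

symmetric_shape = describe_shape(symmetric)
skewed_shape = describe_shape(heavy_right_tail)
```

Both numbers sit near zero for the normal sample and are clearly positive for the log-normal one. A few extreme values dominate data like that, which is what the warning is flagging.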

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Invest in tooling
For your analysts and data scientists to succeed, you need to invest in infrastructure to empower them.
Think carefully about how you want your company to spend its innovation tokens, and take advantage of the excellent tools available, like ScienceOps and AWS.
I think there is great scope for entrepreneurs to take advantage of this arbitrage opportunity and empower data scientists by building good tooling and platforms.
Contribute to Open Source Software such as the PyData stack!

Slide 38

Slide 38 text

Alternatives to YhatHQ (that I know of)

Slide 39

Slide 39 text

Lessons learned
I can write a model in Python and have it deployed!
Software engineers aren't data scientists and shouldn't be expected to write models in code.
Models only provide value when they are in production.
Getting information from stakeholders is really valuable in improving models.

Slide 40

Slide 40 text

Successes
Within a few months it was possible to have an analytics product in production, using information consumed from a variety of APIs.
I have no idea how else I could have deployed the models, except maybe by using PMML.
Total development time was 3 months, with 5 people. Only two (including myself) were working full time on this project.
That development time includes the time it took us to learn the domain-specific knowledge: models, API sources, etc.

Slide 41

Slide 41 text

Other kinds of data science products
Credit risk modelling
Customer attrition modelling
Recommendation engines
Airline delay analysis
The list goes on....

Slide 42

Slide 42 text

Wanna learn more?
www.yhathq.com
[email protected]

Slide 43

Slide 43 text

No content