Slide 1

Slide 1 text

This talk was first given at the Open Data Science Conference in San Francisco, November 14th-15th 2015. Please contact me on Twitter (@ianhuston) with any questions or comments.

Slide 2

Slide 2 text

I’m a data scientist in Pivotal Labs, working in the London office with our clients to get value from their data.

Slide 3

Slide 3 text

My talk today is about deploying and running models, something I’ve recently heard people call Data Ops. This is often overlooked as part of a data scientist’s job, but can be a major part of the role, as this Microsoft report shows.

Slide 4

Slide 4 text

My first computer was an Amiga 500 and my first programming language was EasyAMOS, an Amiga variant of BASIC. It was really hard to share my Star Trek blaster game with my friends because they needed the interpreter installed to run it. This was my first hard lesson in distribution and packaging.

Slide 5

Slide 5 text

Anyone in academia (outside a CS department, maybe) will recognise the lack of quality in the code produced, especially with self-taught programmers often writing code to be run only once. A great way to scare a postdoc is to ask them how to reproduce the exact figure in their paper. Despite reproducibility being a very important part of science, code reproducibility has been badly neglected. Often nobody uses version control, and different results are produced by commenting lines of code in and out between runs.

Slide 6

Slide 6 text

Software Carpentry is a great group that tries to bring industry software engineering practices to academia. This is where I first properly learned about version control, testing and packaging of applications. There is now also a Data Carpentry group doing similar work in the area of data analysis. Courses are available at a university near you now!

Slide 7

Slide 7 text

As part of my research I created a library of numerical code for simulations of the early universe. Lesson learned: code survives a lot longer than you expect or plan for. Make it reusable and easily deployable.

Slide 8

Slide 8 text

When I moved into industry, I thought I would finally get to see how to do things properly. My first experiences were a little different though, with not much use of version control, very manual deployment procedures, and lots of one-off scripting and glue code holding everything together.

Slide 9

Slide 9 text

Now, as a member of Pivotal Labs, the agile development arm of Pivotal, I get to see how we help our clients transform the way they code, including the following practices. Continuous Improvement: learn through constant questioning and testing. Test Driven Development: to be confident in your implementation. Continuous Integration and Delivery: to deliver business value quickly and update very often (multiple times a day). Developers are now moving to the idea of Cloud Native Apps, which are designed to take full advantage of the cloud.
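To make the test-first idea concrete, here is a hypothetical sketch (not code from the talk) of how a data scientist might apply it to a small feature-scaling helper with pytest: the tests are written first, then the function is implemented until they pass.

```python
# Hypothetical test-driven development example for a small feature-scaling
# helper; the function name and behaviour are illustrative assumptions.

def normalise(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_normalise_scales_to_unit_range():
    assert normalise([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]

def test_normalise_handles_constant_input():
    assert normalise([3.0, 3.0]) == [0.0, 0.0]
```

Running `pytest` against a file like this before the function exists gives a failing test to drive the implementation.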

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

This simple haiku gets across the essence of the Cloud Native philosophy. I feel this sentiment is maybe even more true of data scientists than developers. If you can take away the distractions and the low level problems, that frees up data scientists to worry about the actual analysis. No data scientist should have to worry about what version of Ubuntu is available, or need to check whether the latest OpenSSH vulnerability has been patched.

Slide 12

Slide 12 text

Another way to think of this is to compare old software development practices with what Internet-scale companies like Netflix do nowadays. In particular, Shared Responsibility means that developers and data scientists should be part of the team that runs the application, not just throwing code over the wall to operations.

Slide 13

Slide 13 text

One implication of this viewpoint is that you need to build your app to be ready to fail, making it disposable but also readily scalable. If you accept the constraints of the platform you are operating on, you can rely on the platform to provide the reliability and scalability required.

Slide 14

Slide 14 text

Engineers at Heroku came up with this list of rules that a cloud native app should follow. There are 12 in all; here we highlight just a few. The goal is to let the platform handle all the details and focus on delivering value at the top of the stack.
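One of the twelve factors is keeping configuration in the environment rather than in the code. A minimal sketch of what that looks like in Python, where the variable names are my own illustration rather than anything from the talk:

```python
# Sketch of the "config in the environment" factor: connection details come
# from environment variables supplied by the platform, not from files in the repo.
# (DATABASE_URL and MODEL_VERSION are illustrative names, not a fixed convention.)
import os

DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/dev_db")
MODEL_VERSION = os.environ.get("MODEL_VERSION", "latest")

print("Connecting to %s, serving model version %s" % (DATABASE_URL, MODEL_VERSION))
```

The same code then runs unchanged in development and production, with only the environment differing.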

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

I hear many similar complaints from data scientists, in particular the desire to have matching development and production environments and to manage dependencies and versions of libraries. Time spent building servers and clusters as unique snowflakes is time spent not delivering value.
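One small step towards keeping development and production in sync is recording the exact library versions in use. A minimal sketch, assuming pip is available in the environment:

```python
# Minimal sketch: capture the exact versions of every installed package so the
# same environment can be rebuilt elsewhere (equivalent to running `pip freeze`).
import subprocess
import sys

def freeze_environment(path="requirements.txt"):
    """Write pinned package versions to a requirements file."""
    frozen = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=True,
    ).stdout
    with open(path, "w") as f:
        f.write(frozen)

if __name__ == "__main__":
    freeze_environment()
```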

Slide 17

Slide 17 text

This is my work-in-progress view of what data scientists need beyond what app developers need. Reproducibility: record all necessary inputs, random seeds and data versions. Data services: the unit of functionality should be the predictive model, and these models should be exposed as services. Pipelines: have a code-based description rather than a manual, one-off setup.
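As an illustration of the reproducibility point, a run can write out everything needed to repeat it alongside its output. This is only a sketch with made-up field names and paths, not a prescribed format:

```python
# Sketch of recording a model run's inputs so it can be reproduced exactly:
# the random seed, data version and parameters are saved next to the output.
# (Field names and the output path are illustrative assumptions.)
import datetime
import json
import random

def run_experiment(data_version, params, seed=42):
    random.seed(seed)  # fix the seed so the run can be repeated
    # ... load data for `data_version`, train and evaluate the model here ...
    manifest = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "data_version": data_version,
        "seed": seed,
        "params": params,
    }
    with open("run_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)

run_experiment(data_version="2015-11-01", params={"n_estimators": 100})
```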

Slide 18

Slide 18 text

Firstly, we focus on provisioning and deployment. One platform I have used a lot recently is Cloud Foundry, the open source building block on which these other services have been built.

Slide 19

Slide 19 text

When you deploy an app with CF, a lot of the hassle of making your model available is taken away. For example, load balancing and routing are set up automatically, and the correct package versions are installed based on your requirements file. As your app runs, it is continually checked, and any failed or stopped app instance will be restarted.
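To make this concrete, here is a minimal sketch of a Python web app that Cloud Foundry could run, assuming Flask is pinned in requirements.txt so the Python buildpack installs it. The platform supplies the port to listen on via the PORT environment variable and routes traffic to it:

```python
# Minimal sketch of a web app ready for `cf push`.
# Assumes Flask is listed in requirements.txt; the platform supplies PORT
# and takes care of routing, load balancing and restarting failed instances.
import os
from flask import Flask

app = Flask(__name__)

@app.route("/")
def score():
    # A real deployment would load a trained model and return predictions here.
    return "model service is up"

if __name__ == "__main__":
    port = int(os.environ.get("PORT", 8080))
    app.run(host="0.0.0.0", port=port)
```

Deploying is then a single `cf push` from the app directory.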

Slide 20

Slide 20 text

This tutorial goes through a few first steps with Cloud Foundry by pushing simple Python based applications.

Slide 21

Slide 21 text

One good option for setting up data pipeline configuration programmatically is to use Spring XD and Spring Cloud Data Flow, parts of the open source Spring ecosystem. The domain specific language is based on Unix pipes, so a stream definition reads like source | processor | sink, and it allows for very simple transformations and connections between data sources. Multiple routes can be specified, so it is easy to create, for example, a Lambda architecture.

Slide 22

Slide 22 text

Many integrations are available for Spring XD, including multiple message transports, PMML for model scoring, and batch orchestration of MapReduce, Spark and other systems.

Slide 23

Slide 23 text

The Phoenix Project is a fictional account of an IT manager’s struggle to get a project into production, and it introduces a lot of DevOps techniques. It is interesting to note that the project they are trying to release is actually a recommender system, so from the very beginning data science has been part of the DevOps movement.

Slide 24

Slide 24 text

If you have any questions please contact me on Twitter (@ianhuston).