A personal journey towards Cloud Native Data Science - with speaker notes

This talk was first given at the Open Data Science
Conference in San Francisco, November 14th-15th 2015. Please contact me on Twitter (@ianhuston) with any questions or comments.

I’m a data scientist in Pivotal Labs, working in the
London office with our clients to get value from their data.

My talk today is about deploying and running models, something
I’ve recently heard people call Data Ops. This is often overlooked as part of a data scientist’s job, but can be a major part of the role, as this Microsoft report shows.

My first computer was an Amiga 500 and my first
programming language was EasyAMOS, which is an Amiga variant of BASIC. It was really hard to share my Star Trek blaster game with my friends, because they needed the interpreter to run it. This was my first hard lesson in distribution and packaging.

Anyone in academia (outside CS dept maybe) will recognise the
lack of quality in the code produced, especially with self-taught programmers often writing code to be run only once. A great way to scare a postdoc is to ask them how to reproduce the exact figure in their paper. Despite reproducibility being a very important part of science, code reproducibility has been badly neglected. Often nobody uses version control, and people get different results by commenting out lines of code between runs.

Software Carpentry is a great group that tries to bring
industry software engineering practices to academia. This is where I first properly learned about version control, testing and packaging of applications. There is now also a Data Carpentry group doing similar work in the area of data analysis. Courses are available at a university near you now!

As part of my research I created a library of
numerical code for simulations of the early universe. Lesson learned: code survives a lot longer than you expect or plan for. Make it reusable and easily deployable.

When I moved into industry, I thought I would finally
get to see how to do things properly. First experiences were a little different though, with not much use of version control, very manual deployment procedures and lots of one-off scripting and glue code holding everything together.

Now as a member of Pivotal Labs, the agile development
arm of Pivotal, I get to see how we help our clients transform the way they code, including the following practices. Continuous Improvement: learn through constant questioning and testing. Test Driven Development: be confident in your implementation. Continuous Integration and Delivery: deliver business value quickly and update very often (multiple times a day). Developers are now moving to the idea of Cloud Native Apps, which are designed to take full advantage of the

This simple haiku gets across the essence of the Cloud
Native philosophy. I feel this sentiment is maybe even more true of data scientists than developers. If you can take away the distractions and the low level problems, that frees up data scientists to worry about the actual analysis. No data scientist should have to worry about what version of Ubuntu is available, or need to check whether the latest OpenSSH vulnerability has been patched.

Another way to think of this is to compare old
software development practices with what Internet-scale companies like Netflix do nowadays. In particular, Shared Responsibility means that developers and data scientists should be part of the team that runs the application, not just throw code over the wall to operations.

One implication of this viewpoint is that you need to
build your app to be ready to fail, making it disposable but also readily scalable. If you accept the constraints of the platform you are operating on, you can rely on the platform to provide the reliability and scalability required.

Engineers at Heroku came up with this list of rules
that a cloud native app should follow. There are 12 in all; here we highlight just a few. The goal is to let the platform handle all the details so you can focus on delivering value at the top of the stack.

I hear many similar complaints from data scientists, in particular
the desire to have similar dev and production environments and to manage dependencies and versions of libraries. Time spent building servers and clusters as unique snowflakes is time not spent delivering value.

This is my work-in-progress view of what data scientists
need beyond what app developers need. Reproducibility: record all necessary inputs, including random seeds and data versions. Data services: the unit of functionality should be the predictive model, and models should be exposed as services. Pipelines: use a code-based description rather than manual, one-off setup.
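As a rough illustration of the reproducibility point, here is a minimal Python sketch that stores the random seed and a hash of the input data alongside the result. The function names, the toy data and the "analysis" itself are all hypothetical; the only idea being shown is that everything needed to repeat a run gets recorded with its output.

```python
import hashlib
import json
import random


def fingerprint(data_bytes):
    """Return a SHA-256 hex digest identifying one exact version of the data."""
    return hashlib.sha256(data_bytes).hexdigest()


def run_analysis(data_bytes, seed):
    """Hypothetical analysis: draw a seeded random sample of the input rows."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    rows = data_bytes.decode("utf-8").splitlines()
    sample = rng.sample(rows, k=min(3, len(rows)))
    # Record everything needed to repeat this run next to the result itself.
    record = {
        "seed": seed,
        "data_sha256": fingerprint(data_bytes),
        "n_rows": len(rows),
    }
    return sample, record


# Toy input standing in for a real data file.
data = b"a,1\nb,2\nc,3\nd,4\ne,5\n"
sample, record = run_analysis(data, seed=42)
print(json.dumps(record, sort_keys=True))
```

Running the analysis again with the same bytes and the same seed gives the same sample, and any change to the data shows up as a different hash in the record.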

Firstly we focus on provisioning and deployment. One platform I
have used a lot recently is Cloud Foundry, which is the open source building block on which these other services have been built.

When you deploy an app with CF, a lot of
the hassle of making your model available is taken away. For example, load balancing and routing are set up automatically, and the correct package versions are installed based on your requirements. As your app runs, it is continually checked, and any failed or stopped app will be restarted.

This tutorial goes through a few first steps with Cloud
Foundry by pushing simple Python based applications.
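To give a flavour of what such a simple Python application might look like, here is a minimal sketch (not taken from the tutorial) of a model exposed as a web service using only the standard library. The `predict` function is a hypothetical stand-in for a real trained model, and the sketch assumes the platform passes the port to listen on through a `PORT` environment variable.

```python
import json
import os
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def predict(x):
    """Hypothetical stand-in for a trained model: a fixed linear rule."""
    return 2.0 * x + 1.0


def app(environ, start_response):
    """WSGI app: a request such as /?x=3 returns the prediction as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    try:
        x = float(query["x"][0])
    except (KeyError, ValueError):
        start_response("400 Bad Request", [("Content-Type", "application/json")])
        return [b'{"error": "expected a numeric query parameter x"}']
    body = json.dumps({"x": x, "prediction": predict(x)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def main():
    # Assumption: the platform supplies the port via PORT; 8080 is a local fallback.
    port = int(os.environ.get("PORT", "8080"))
    make_server("0.0.0.0", port, app).serve_forever()
```

Calling `main()` starts the server. The app keeps no state between requests, so the platform is free to restart it or run several copies side by side.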

One good option for setting up data pipeline configuration programmatically
is Spring XD and Spring Cloud Data Flow, open source projects from the Spring ecosystem. The domain specific language is based on Unix pipes and allows for very simple transformations and connections between data sources. Multiple routes can be specified, so it is easy to create, for example, a Lambda architecture.
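The pipe idea can be illustrated in a few lines of Python. This is not the Spring XD DSL, just a hypothetical sketch of the same principle: the pipeline is described in code by chaining small stages with `|`, rather than being wired up by hand. The `Stage`, `map_stage` and `filter_stage` names are invented for this example.

```python
class Stage:
    """One step in a pipeline; wraps a function from iterable to iterable."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new stage that feeds a's output into b.
        return Stage(lambda items: other.func(self.func(items)))

    def run(self, items):
        """Push the items through the pipeline and collect the results."""
        return list(self.func(items))


def map_stage(f):
    """Stage that applies f to every item."""
    return Stage(lambda items: (f(i) for i in items))


def filter_stage(pred):
    """Stage that keeps only the items for which pred is true."""
    return Stage(lambda items: (i for i in items if pred(i)))


# The whole pipeline is one readable line of code.
parse = map_stage(float)
positive = filter_stage(lambda v: v > 0)
double = map_stage(lambda v: v * 2)

pipeline = parse | positive | double
print(pipeline.run(["1.5", "-2", "4"]))
```

Because the description lives in code, it can be version controlled, reviewed and re-run like any other part of the analysis.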

Many integrations are available for Spring XD including multiple message
transports, PMML for model scoring, and batch orchestration of Map Reduce, Spark and other systems.

The Phoenix Project is a fictional account of an IT
manager’s struggle to get a project into production, and it introduces a lot of DevOps techniques. It is interesting to note that the project they are trying to release is actually a recommender system, so from the very beginning, data science has been part of the DevOps movement.

If you have any questions please contact me on Twitter
(@ianhuston).

Ian Huston
