Talk delivered at Open Data Science Conference, San Francisco, November 2015
This version includes speaker notes.
As more and more applications are being deployed in the cloud, developers are learning how to best design software for this new type of system. These Cloud Native applications are single-purpose, stateless and easily scalable. This contrasts with traditional approaches which created large, monolithic, and fragile systems. What can data scientists learn from these new approaches, and how can we make sure our models are easily deployed in this kind of system? In this talk I will discuss the principles of cloud native design, how they apply to data science driven applications and how data scientists can get started using open source cloud native platforms.
This talk was first given at the Open Data Science
Conference in San Francisco, November 14th-15th 2015.
Please contact me on Twitter (@ianhuston) with any
questions or comments.
I’m a data scientist at Pivotal Labs, working in the London
office with our clients to get value from their data.
My talk today is about deploying and running models,
something I’ve recently heard people call Data Ops. This is
often overlooked as part of a data scientist’s job, but can be
a major part of the role, as this Microsoft report shows.
My first computer was an Amiga 500 and my first
programming language was EasyAMOS, which is an Amiga
variant of BASIC.
It was really hard to share my Star Trek blaster game with my
friends because the interpreter was needed to run it.
This was my first hard lesson in distribution and packaging.
Anyone in academia (outside a CS department, maybe) will
recognise the lack of quality in the code produced there,
especially with self-taught programmers often writing code
intended to be run only once.
A great way to scare a postdoc is to ask them how to
reproduce the exact figure in their paper. Despite
reproducibility being a very important part of science, code
reproducibility has been badly neglected.
Often nobody uses version control, and different results are
produced by commenting out lines of code between runs.
Software Carpentry is a great group that tries to bring
industry software engineering practices to academia.
This is where I first properly learned about version control,
testing and packaging of applications.
There is now also a Data Carpentry group doing similar work
in the area of data analysis.
Courses are available at a university near you now!
As part of my research I created a library of numerical code
for simulations of the early universe.
Lesson learned: code survives a lot longer than you expect
or plan for.
Make it reusable and easily deployable.
When I moved into industry, I thought I would finally get to
see how to do things properly.
My first experiences were a little different though, with not much
use of version control, very manual deployment procedures,
and lots of one-off scripting and glue code holding everything together.
Now as a member of Pivotal Labs, the agile
development arm of Pivotal, I get to see how we help
our clients transform the way they code, including these practices:
Continuous Improvement: learn through constant
questioning and testing
Test Driven Development: to be confident in your code
as it changes (a small test sketch follows this list)
Continuous Integration and Delivery: to deliver
business value quickly and update very often (multiple
times a day)
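As a rough illustration of the test-first mindset, here is a minimal pytest-style example; the function and data are invented for this sketch, not taken from a real project:

    import pandas as pd

    def drop_incomplete_rows(df):
        """Remove any row containing a missing value."""
        return df.dropna()

    def test_drop_incomplete_rows_removes_nulls():
        raw = pd.DataFrame({"age": [34, None, 58],
                            "income": [51000.0, 62000.0, None]})
        cleaned = drop_incomplete_rows(raw)
        assert len(cleaned) == 1  # only the fully populated row survives
        assert not cleaned.isnull().values.any()

Writing the test before the function forces you to decide what the code should do before you write it.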
Developers are now moving to the idea of Cloud Native
Apps, which are designed to take full advantage of the
cloud platforms they run on.
This simple haiku gets across the essence of the Cloud
Native approach.
I feel this sentiment is maybe even more true of data
scientists than developers.
If you can take away the distractions and the low level
problems, that frees up data scientists to worry about the
actual data science problems.
No data scientist should have to worry about what version of
Ubuntu is available, or need to check whether the latest
OpenSSH vulnerability has been patched.
Another way to think of this is to compare old software
development practices with what Internet-scale companies
like Netflix do nowadays.
In particular, Shared Responsibility means that developers and
data scientists should be part of the team that runs the
application, not just throwing code over the wall to an
operations team.
One implication of this viewpoint is that you need to build
your app to be ready to fail, making it disposable but also
quick to start up again.
If you accept the constraints of the platform you are
operating on, you can rely on it to provide the
reliability and scalability required.
Engineers at Heroku came up with this list of rules, known as
the Twelve-Factor App, that a cloud native app should follow.
There are 12 in all; here we highlight just a few.
The goal is to let the platform handle all the details and focus
on delivering value at top of stack.
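For example, one of the twelve factors is to store configuration in the environment rather than in code. A minimal sketch of what that looks like in Python (the variable names here are invented for illustration):

    import os

    # Twelve-factor style configuration: settings come from environment
    # variables set by the platform, never hard-coded in the source.
    # DATABASE_URL and MODEL_THRESHOLD are hypothetical names.
    database_url = os.environ["DATABASE_URL"]
    model_threshold = float(os.environ.get("MODEL_THRESHOLD", "0.5"))

    print("Connecting to {0} with threshold {1}".format(
        database_url, model_threshold))

This keeps the same code deployable to dev, test and production, with only the environment changing between them.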
I hear many similar complaints from data scientists, in
particular the desire to have similar dev and production
environments and to manage dependencies and package versions.
Time spent building servers/clusters as unique snowflakes is
time you are not delivering value.
This is my work-in-progress view of what data
scientists need beyond what app developers need:
Reproducibility: record everything needed to rerun an
analysis, including inputs, random seeds and data versions
Data services: the unit of functionality should be the
predictive model, and these should be exposed as services
with clear APIs (a sketch follows this list)
Pipelines: have a code-based description, not a manual,
one-off setup.
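As a rough sketch of the first two points, here is a toy predictive model exposed as an HTTP service, with the random seed recorded so the training run can be reproduced; the model, route and names are all invented for this example:

    import numpy as np
    from flask import Flask, jsonify, request
    from sklearn.linear_model import LogisticRegression

    SEED = 42  # record the seed so training is reproducible
    rng = np.random.RandomState(SEED)

    # Train a toy model on synthetic data; a real project would also
    # record the data version and feature definitions used.
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    model = LogisticRegression().fit(X, y)

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expects JSON like {"features": [0.1, -0.3]}
        features = request.get_json()["features"]
        prediction = model.predict([features])[0]
        return jsonify({"prediction": int(prediction), "seed": SEED})

    if __name__ == "__main__":
        app.run()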
Firstly we focus on provisioning and deployment.
One platform I have used a lot recently is Cloud Foundry,
which is the open source building block on which these
other services have been built.
When you deploy an app with CF, a lot of the hassle of
making your model available is taken away.
For example, load balancing and routing are set up
automatically, and the correct package versions are installed
based on your requirements.
As your app runs, it is continually checked and any failed/
stopped app will be restarted.
This tutorial goes through a few first steps with Cloud
Foundry by pushing simple Python-based applications.
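For instance, the only platform-specific detail a minimal Python web app needs is to listen on the port Cloud Foundry assigns through the PORT environment variable; routing, load balancing and installing the packages listed in requirements.txt are handled for you on cf push. A sketch, assuming Flask is listed in requirements.txt:

    import os
    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        return "Hello from Cloud Foundry!"

    if __name__ == "__main__":
        # Cloud Foundry tells the app which port to bind to via $PORT.
        port = int(os.environ.get("PORT", 8080))
        app.run(host="0.0.0.0", port=port)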
One good option for setting up data pipeline configuration
programmatically is to use Spring XD and Spring Cloud Data
Flow, open source projects from the Spring family.
The domain specific language is based on Unix pipes and
allows for very simple transformations and connections
between data sources.
Multiple routes can be specified, so it is easy to create, for
example, a Lambda architecture.
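To give a flavour of the DSL, here is the kind of stream definition you would enter in the Spring XD shell, with a tap creating a second route off the same stream (module options simplified for this sketch):

    xd:> stream create --name mystream --definition "http --port=9000 | transform --expression=payload.toUpperCase() | log" --deploy

    xd:> stream create --name mystream-archive --definition "tap:stream:mystream > file" --deploy

The first stream accepts HTTP posts, transforms each payload and logs it; the tap copies the same messages to a file sink without disturbing the original stream.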
Many integrations are available for Spring XD, including
multiple message transports, PMML for model scoring, and
batch orchestration of MapReduce, Spark and other workloads.
The Phoenix Project is a fictional account of an IT manager’s
struggle to get a project into production, and it
introduces a lot of DevOps techniques.
It is interesting to note that the project they are trying to release
is actually a recommender system, so from the very
beginning data science has been part of the DevOps story.
If you have any questions please contact me on Twitter (@ianhuston).