A personal journey towards Cloud Native Data Science - with speaker notes

Ian Huston
November 15, 2015

Talk delivered at Open Data Science Conference, San Francisco, November 2015

This version includes speaker notes.

As more and more applications are being deployed in the cloud, developers are learning how to best design software for this new type of system. These Cloud Native applications are single-purpose, stateless and easily scalable. This contrasts with traditional approaches which created large, monolithic, and fragile systems. What can data scientists learn from these new approaches, and how can we make sure our models are easily deployed in this kind of system? In this talk I will discuss the principles of cloud native design, how they apply to data science driven applications and how data scientists can get started using open source cloud native platforms.



  1. This talk was first given at the Open Data Science Conference in San Francisco, November 14th-15th 2015. Please contact me on Twitter (@ianhuston) with any questions or comments.
  2. I’m a data scientist in Pivotal Labs, working in the London office with our clients to get value from their data.
  3. My talk today is about deploying and running models, something I’ve recently heard people call Data Ops. This is often overlooked as part of a data scientist’s job, but can be a major part of the role, as this Microsoft report shows.
  4. My first computer was an Amiga 500 and my first programming language was EasyAMOS, which is an Amiga variant of BASIC. It was really hard to share my Star Trek blaster game with my friends, because they needed the interpreter to run it. This was my first hard lesson in distribution and packaging.
  5. Anyone in academia (outside the CS department, maybe) will recognise the lack of quality in the code produced, especially with self-taught programmers often writing code to be run only once. A great way to scare a postdoc is to ask them how to reproduce the exact figure in their paper. Despite reproducibility being a very important part of science, code reproducibility has been badly neglected. Often nobody uses version control, and people get different results by commenting out lines of code between runs.
  6. Software Carpentry is a great group that tries to bring industry software engineering practices to academia. This is where I first properly learned about version control, testing and packaging of applications. There is now also a Data Carpentry group doing similar work in the area of data analysis. Courses are available at a university near you now!
  7. As part of my research I created a library of numerical code for simulations of the early universe. Lesson learned: code survives a lot longer than you expect or plan for. Make it reusable and easily deployable.
  8. When I moved into industry, I thought I would finally get to see how to do things properly. My first experiences were a little different though, with not much use of version control, very manual deployment procedures and lots of one-off scripting and glue code holding everything together.
  9. Now as a member of Pivotal Labs, the agile development arm of Pivotal, I get to see how we help our clients transform the way they code, including the following practices. Continuous Improvement: learn through constant questioning and testing. Test Driven Development: be confident in your implementation. Continuous Integration and Delivery: deliver business value quickly and update very often (multiple times a day). Developers are now moving to the idea of Cloud Native Apps, which are designed to take full advantage of the
  10. This simple haiku gets across the essence of the Cloud Native philosophy. I feel this sentiment is maybe even more true of data scientists than developers. If you can take away the distractions and the low level problems, that frees up data scientists to worry about the actual analysis. No data scientist should have to worry about what version of Ubuntu is available, or need to check whether the latest OpenSSH vulnerability has been patched.
  11. Another way to think of this is to compare old software development practices with what Internet scale companies like Netflix do nowadays. In particular, Shared Responsibility means that developers and data scientists should be part of the team that runs the application, not just throw code over the wall at operations.
  12. One implication of this viewpoint is that you need to build your app to be ready to fail, making it disposable but also readily scalable. If you accept the constraints of the platform you are operating on, you can rely on the platform to provide the reliability and scalability required.
  13. Engineers at Heroku came up with this list of rules that a cloud native app should follow. There are 12 in all; here we highlight just a few. The goal is to let the platform handle all the details so you can focus on delivering value at the top of the stack.
  14. I hear many similar complaints from data scientists, in particular the desire to have similar dev and production environments and to manage dependencies and versions of libraries. Time spent building servers and clusters as unique snowflakes is time you are not delivering value.
  15. This is my work-in-progress view of what data scientists need beyond what app developers need. Reproducibility: record all necessary inputs, including random seeds and data versions. Data services: the unit of functionality should be the predictive model, and models should be exposed as services. Pipelines: have a code-based description, not a manual or one-off setup.
  16. Firstly we focus on provisioning and deployment. One platform I have used a lot recently is Cloud Foundry, which is the open source building block on which these other services have been built.
  17. When you deploy an app with CF, a lot of the hassle of making your model available is taken away. For example, load balancing and routing are set up automatically, and the correct package versions are installed based on your requirements. As your app runs, it is continually checked, and any failed or stopped app will be restarted.
  18. This tutorial goes through a few first steps with Cloud Foundry by pushing simple Python based applications.
  19. One good option for setting up data pipeline configuration programmatically is to use Spring XD and Spring Cloud Data Flow, parts of the open source Spring Framework. The domain specific language is based on Unix pipes and allows for very simple transformations and connections between data sources. Multiple routes can be specified, so it is easy to create, for example, a Lambda architecture.
  20. Many integrations are available for Spring XD, including multiple message transports, PMML for model scoring, and batch orchestration of MapReduce, Spark and other systems.
  21. The Phoenix Project is a fictional account of an IT manager’s struggle to get a project into production, and it introduces a lot of DevOps techniques. It is interesting to note that the project they are trying to release is actually a recommender system, so from the very beginning, data science has been part of the DevOps movement.
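
The reproducibility point in note 15 (record all necessary inputs, random seeds and data versions) could be sketched roughly as below. This is only an illustrative sketch, not code from the talk: the function names `data_fingerprint` and `run_experiment`, the `sample_size` parameter and the toy "model" (the mean of a random subsample) are all hypothetical, and a content hash is just one assumed way of identifying a data version.

```python
import hashlib
import json
import random


def data_fingerprint(raw_bytes):
    """Return a SHA-256 hex digest identifying one exact version of the data."""
    return hashlib.sha256(raw_bytes).hexdigest()


def run_experiment(raw_bytes, seed, params):
    """Run a toy analysis and return the result together with its inputs.

    Hypothetical example: the 'model' is just the mean of a random subsample.
    """
    # A seeded generator local to this run, so no hidden global state is used.
    rng = random.Random(seed)
    values = [float(v) for v in raw_bytes.decode("utf-8").split(",")]
    sample = rng.sample(values, params["sample_size"])
    result = sum(sample) / len(sample)
    # Everything needed to repeat the run is stored next to the result.
    return {
        "seed": seed,
        "params": params,
        "data_sha256": data_fingerprint(raw_bytes),
        "result": result,
    }


if __name__ == "__main__":
    data = b"1.0,2.0,3.0,4.0,5.0,6.0"
    record = run_experiment(data, seed=42, params={"sample_size": 3})
    print(json.dumps(record, sort_keys=True, indent=2))
```

Storing a record like this alongside each figure or model means a later run with the same seed, parameters and data hash should give the same result, and a changed hash flags that the data itself has moved on.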