A personal journey towards Cloud Native Data Science - with speaker notes

Ian Huston
November 15, 2015

Talk delivered at Open Data Science Conference, San Francisco, November 2015

This version includes speaker notes.

As more and more applications are deployed in the cloud, developers are learning how best to design software for this new type of system. These Cloud Native applications are single-purpose, stateless, and easily scalable. This contrasts with traditional approaches, which created large, monolithic, and fragile systems. What can data scientists learn from these new approaches, and how can we make sure our models are easily deployed in this kind of system? In this talk I discuss the principles of cloud native design, how they apply to data-science-driven applications, and how data scientists can get started using open source cloud native platforms.


  1. This talk was first given at the Open Data Science
    Conference in San Francisco, November 14th-15th 2015.

    Please contact me on Twitter (@ianhuston) with any
    questions or comments.

2. I’m a data scientist at Pivotal Labs, working in the London
    office with our clients to get value from their data.

  3. My talk today is about deploying and running models,
    something I’ve recently heard people call Data Ops. This is
    often overlooked as part of a data scientist’s job, but can be
    a major part of the role, as this Microsoft report shows.

  4. My first computer was an Amiga 500 and my first
    programming language was EasyAMOS, which is an Amiga
    variant of BASIC.

It was really hard to share my Star Trek blaster game with
    my friends, because they would have needed the interpreter
    to run it. This was my first hard lesson in distribution
    and packaging.

5. Anyone in academia (outside the CS department, maybe) will
    recognise the lack of quality in the code produced,
    especially with self-taught programmers often writing code
    meant to be run only once.

    A great way to scare a postdoc is to ask them how to
    reproduce an exact figure in their paper. Despite
    reproducibility being a very important part of science,
    code reproducibility has been badly neglected.

    Often nobody uses version control, and different results
    are produced by commenting lines of code in and out
    between runs.

  6. Software Carpentry is a great group that tries to bring
    industry software engineering practices to academia.
    This is where I first properly learned about version control,
    testing and packaging of applications.

There is also a Data Carpentry group doing similar work
    in the area of data analysis.
    Courses are available at a university near you now!

  7. As part of my research I created a library of numerical code
    for simulations of the early universe.

    Lesson learned: code survives a lot longer than you expect
    or plan for.
    Make it reusable and easily deployable.

  8. When I moved into industry, I thought I would finally get to
    see how to do things properly.

My first experiences were a little different though, with
    not much use of version control, very manual deployment
    procedures, and lots of one-off scripts and glue code
    holding everything together.

9. Now as a member of Pivotal Labs, the agile
    development arm of Pivotal, I get to see how we help
    our clients transform the way they code, including the
    following practices:

    Continuous Improvement: learn through constant
    questioning and testing.

    Test Driven Development: to be confident in your code
    as you change it (see the sketch below).

    Continuous Integration and Delivery: to deliver
    business value quickly and update very often (multiple
    times a day).
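
    As a flavour of the test-first style, here is a minimal
    pytest sketch (the preprocess function is a hypothetical
    example, not from the talk):

        # test_preprocess.py -- run with `pytest`
        def preprocess(values):
            """Drop missing values and scale so the largest is 1."""
            clean = [v for v in values if v is not None]
            top = max(clean)
            return [v / top for v in clean]

        def test_preprocess_drops_missing_and_scales():
            assert preprocess([2, None, 4]) == [0.5, 1.0]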

Developers are now moving to the idea of Cloud Native
    Apps, which are designed to take full advantage of the
    cloud.

  10. (No speaker notes.)

  11. This simple haiku gets across the essence of the Cloud
    Native philosophy.
    I feel this sentiment is maybe even more true of data
    scientists than developers.

If you can take away the distractions and the low-level
    problems, that frees up data scientists to focus on the
    actual analysis.
    No data scientist should have to worry about what version of
    Ubuntu is available, or need to check whether the latest
    OpenSSH vulnerability has been patched.

  12. Another way to think of this is to compare old software
    development practices with what Internet scale companies
    like Netflix do nowadays.

In particular, Shared Responsibility means that developers
    and data scientists should be part of the team that runs
    the application, not just throwing code over the wall at
    operations.

13. One implication of this viewpoint is that you need to
    build your app to be ready to fail, making it disposable
    but also readily scalable.

    If you accept the constraints of the platform you are
    operating on you can rely on the platform to provide the
    reliability and scalability required.

14. Engineers at Heroku came up with this list of rules that a
    cloud native app should follow, known as the Twelve-Factor
    App. There are 12 in all; here we highlight just a few.

    The goal is to let the platform handle all the details and
    focus on delivering value at the top of the stack.
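
    One of the twelve factors, "Config", says configuration
    should live in the environment rather than in your code. A
    minimal Python sketch (DATABASE_URL is a hypothetical
    example of a setting the platform would inject):

        import os

        # Read settings from the environment; never hard-code them.
        DATABASE_URL = os.environ["DATABASE_URL"]
        DEBUG = os.environ.get("DEBUG", "false") == "true"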

  15. (No speaker notes.)

16. I hear many similar complaints from data scientists, in
    particular the desire to have similar development and
    production environments and to manage dependencies and
    package versions consistently.

    Time spent building servers/clusters as unique snowflakes is
    time you are not delivering value.

17. This is my work-in-progress view of what data scientists
    need beyond what app developers need:

    Reproducibility: record all the inputs necessary, including
    random seeds and data versions (see the sketch below).

    Data services: the unit of functionality should be the
    predictive model, and these models should be exposed as
    services.

    Pipelines: have a code-based description, not a manual,
    one-off setup.
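
    A minimal sketch of recording those inputs for a NumPy-based
    analysis (the file names here are hypothetical examples):

        import hashlib
        import json
        import random
        import sys

        import numpy as np

        def record_run(data_path, seed=42):
            # Fix the random seeds up front so the run can be repeated.
            random.seed(seed)
            np.random.seed(seed)

            # Fingerprint the data so the exact version is recorded.
            with open(data_path, "rb") as f:
                data_sha256 = hashlib.sha256(f.read()).hexdigest()

            manifest = {
                "seed": seed,
                "data_path": data_path,
                "data_sha256": data_sha256,
                "python": sys.version,
                "numpy": np.__version__,
            }
            with open("run_manifest.json", "w") as f:
                json.dump(manifest, f, indent=2)

            # ... the actual analysis would follow here ...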

  18. Firstly we focus on provisioning and deployment.

    One platform I have used a lot recently is Cloud Foundry,
    which is the open source building block on which these
    other services have been built.

  19. When you deploy an app with CF, a lot of the hassle of
    making your model available is taken away.

For example, load balancing and routing are set up
    automatically, and the correct package versions are
    installed based on your requirements.

    As your app runs, it is continually checked and any failed/
    stopped app will be restarted.
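
    For instance, a minimal manifest.yml describes the app (the
    name model-api is a hypothetical example), and one command
    deploys it while another scales it out:

        ---
        applications:
        - name: model-api
          memory: 512M
          instances: 2

        $ cf push                    # deploy using manifest.yml
        $ cf scale model-api -i 4    # scale out to four instances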

  20. This tutorial goes through a few first steps with Cloud
    Foundry by pushing simple Python based applications.
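
    As a flavour of those first steps, here is a minimal sketch
    of a Python web app that Cloud Foundry can run (the /predict
    endpoint is a hypothetical example; a requirements.txt
    listing flask triggers the Python buildpack):

        # app.py
        import os

        from flask import Flask, jsonify

        app = Flask(__name__)

        @app.route("/predict")
        def predict():
            # A real app would call a trained model here.
            return jsonify({"prediction": 0.5})

        if __name__ == "__main__":
            # Cloud Foundry supplies the port to bind to via $PORT.
            port = int(os.environ.get("PORT", 8080))
            app.run(host="0.0.0.0", port=port)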

21. One good option for setting up data pipeline configuration
    programmatically is Spring XD together with Spring Cloud
    Data Flow, both parts of the open source Spring ecosystem.

    The domain specific language is based on Unix pipes and
    allows for very simple transformations and connections
    between data sources.
    Multiple routes can be specified, so it is easy to create, for
    example, a Lambda architecture.
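
    In the Spring XD shell a stream definition looks roughly
    like this (the stream names are hypothetical examples); the
    tap adds a second route off the same data:

        xd:> stream create --name scores --definition "http | log" --deploy
        xd:> stream create --name archive --definition "tap:stream:scores > file" --deploy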

22. Many integrations are available for Spring XD, including
    multiple message transports, PMML for model scoring, and
    batch orchestration of MapReduce, Spark and other
    workloads.

  23. The Phoenix Project is a fictional account of an IT manager’s
    struggle with getting a project into production and
    introduces a lot of DevOps techniques.

It is interesting to note that the project they are trying
    to release is actually a recommender system, so from the
    very beginning data science has been part of the DevOps
    story.

  24. If you have any questions, please contact me on Twitter
    (@ianhuston).
