
Ian Huston
November 15, 2015

A personal journey towards Cloud Native Data Science

Talk delivered at Open Data Science Conference, San Francisco, November 2015

As more and more applications are deployed in the cloud, developers are learning how best to design software for this new type of system. These Cloud Native applications are single-purpose, stateless and easily scalable. This contrasts with traditional approaches, which created large, monolithic, and fragile systems. What can data scientists learn from these new approaches, and how can we make sure our models are easily deployed in this kind of system? In this talk I will discuss the principles of cloud native design, how they apply to data-science-driven applications, and how data scientists can get started using open source cloud native platforms.

Transcript

  1. A personal journey towards Cloud Native Data Science. Ian Huston @ianhuston, Open Data Science Conference (@OPENDATASCI), San Francisco, 2015
  2. At Microsoft, 8 of the 16 data scientists interviewed work on or manage the operationalization of predictive models. (The Emerging Role of Data Scientists on Software Development Teams. Microsoft Research Technical Report MSR-TR-2015-30. http://research.microsoft.com/apps/pubs/default.aspx?id=242286)
  3. Academia: How to Scare a Postdoc. “Which version of your code and data was used to create Figure 3 in v2 of your recent paper?”
  4. Research library writing
     •  Packaging & dependencies
     •  Installation & deployment
     •  Trying to automate deployment over a cluster
  5. Starting out with Data Science
     •  Big & fast bare metal appliances
     •  Lots of scripting and glue code
     •  Manual deployments
     •  Lots of cron jobs
  6. Cloud Native Haiku:
     Here is my source code
     Run it on the cloud for me
     I do not care how.
     (Onsi Fakhouri, @onsijoe)
  7. Then vs. Now
     •  assume fragile infrastructure → assume reliable infrastructure
     •  release code every 3 months → release code early and often
     •  works in my environment → shared responsibility
     •  tightly coupled → loosely coupled
  8. The twelve factors (http://12factor.net)
     •  One codebase with revision control
     •  Explicitly declare dependencies
     •  Stateless processes: attach external data stores
     •  Parity between dev & production environments
     •  …
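The “attach external data stores” and dev/production parity points translate directly into code: a stateless process should take its backing-service location from the environment rather than hard-coding it. A minimal Python sketch of this pattern (the variable name `DATABASE_URL` is a common convention, not something from the talk):

```python
import os

def get_database_url():
    # 12-factor config: the backing data store is attached via the
    # environment, so the same code runs unchanged in dev and production.
    url = os.environ.get("DATABASE_URL")
    if url is None:
        raise RuntimeError("DATABASE_URL not set; declare backing services explicitly")
    return url
```

The process stays stateless: swapping the dev database for the production one means changing the environment, not the code.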
  9. Things I hear from data scientists: How do I …
     •  speed up the set-up of my system?
     •  keep different versions of Python/R in sync?
     •  make my dev environment the same as production?
     •  make my models easily available?
     •  make repeatable/reproducible model runs?
  10. What is Cloud Native Data Science? 12 factors, plus:
     •  Reproducibility
     •  Expose models as services
     •  Explicit configuration for data pipelines
     •  What else?
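“Expose models as services” can be as small as wrapping the model in an HTTP endpoint. A sketch using only the Python standard library; the mean-of-features “model” is a placeholder for any real trained model:

```python
import json

def predict(features):
    # Placeholder model: replace with a real model's predict call.
    return sum(features) / float(len(features))

def app(environ, start_response):
    # Minimal WSGI application exposing the model as JSON over HTTP.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    features = json.loads(environ["wsgi.input"].read(length).decode())
    body = json.dumps({"prediction": predict(features)}).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

# To serve locally:
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, app).serve_forever()
```

Because it is a plain WSGI app with externalized state, it deploys to a cloud native platform the same way as any other web application.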
  11. Cloud Foundry: focus on provisioning & deployment. Open source platform powering GE Predix, Intel Trusted Analytics Platform, IBM Bluemix, SAP HANA Cloud, HP Helion Cloud
  12. Deploy your app with cf push
     •  CF determines the app type (Java, Python, Ruby, …)
     •  Installs the necessary environment
     •  Provisions & binds data sources
     •  Creates (sub-)domain, routing and load balancing
     •  Continual app health checks & restarts
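The `cf push` behaviour above is typically driven by a declarative manifest checked in alongside the code. A hedged sketch of what one might look like for a Python model service; all names and values here are illustrative, not taken from the talk:

```yaml
# manifest.yml: read by `cf push` to deploy the app declaratively
applications:
- name: model-service        # illustrative app name
  memory: 512M
  instances: 2               # scale out by raising this count
  buildpack: python_buildpack
  services:
  - my-postgres              # a previously provisioned data source, bound at push time
```

With the manifest in place, deployment is the haiku from earlier: `cf push`, and the platform takes care of the rest.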
  13. How do I get started? The Cloud Foundry for Data Science tutorial (https://github.com/ihuston/python-cf-examples) covers:
     •  Deploying simple Python applications
     •  Scaling instances
     •  Provisioning & connecting to data sources
     •  Using Conda for Python package management
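The Conda step comes down to pinning the environment explicitly so that dev and production match. A sketch of a Conda `environment.yml`; the environment name and package list are illustrative, not from the tutorial:

```yaml
# environment.yml: declare the Python version and packages explicitly
# so the same environment can be recreated anywhere
name: cf-data-science
dependencies:
  - python=2.7
  - numpy
  - pandas
  - flask
```

Recreating the environment is then a single `conda env create -f environment.yml`, which answers the earlier “keep different versions of Python in sync” question.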
  14. Focus on data pipelines: Spring XD & Spring Cloud Data Flow
     •  Pipelines for composable data services
     •  DSL based on Unix pipes: http | filter | transform | hdfs
     •  Multiple paths with taps: mypipeline.filter > newtransform | redis
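The pipe-and-filter idea behind the DSL is easy to see in plain Python. This is an analogy to illustrate composition, not Spring XD itself, and the stage names are made up:

```python
def pipeline(*stages):
    # Compose single-argument stages left to right, mirroring the
    # source | processor | sink shape of the stream DSL above.
    def run(value):
        for stage in stages:
            value = stage(value)
        return value
    return run

# Rough analogue of: source | filter | transform | sink
keep_positive = lambda xs: [x for x in xs if x > 0]
double = lambda xs: [2 * x for x in xs]
process = pipeline(keep_positive, double)
```

Each stage stays single-purpose and testable on its own, which is exactly the composability the DSL is after.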
  15. Integrations
     •  Data ingestion and pipeline processing: Kafka, RabbitMQ, MQTT, JMS, HTTP, GPDB, HAWQ; partition, filter, transform, split, aggregate
     •  Real-time analytics and complex event processing: Spark Streaming, RxJava, PMML scoring; Redis, GemFire, Cassandra, etc.
     •  Batch workflow orchestration + ETL: MapReduce, HDFS, Pig, Hive, GPDB, HAWQ, Spark; RDBMS, file, FTP, log, Mongo, Splunk
  16. Where can I learn more?
     •  DevOps novel: The Phoenix Project by Kim, Behr & Stafford
     •  12factor.net
     •  cloudfoundry.org
     •  github.com/ihuston/python-cf-examples
     •  cloud.spring.io/spring-cloud-dataflow
     •  projects.spring.io/spring-xd/