A personal journey towards Cloud Native Data Science

Open Data Science Conference (@OPENDATASCI), San Francisco, 2015
Ian Huston @ianhuston

Who am I? Data Scientist http://github.com/ihuston @ianhuston

At Microsoft, 8 of the 16 data scientists interviewed work on or manage the operationalization of predictive models.
Source: The Emerging Role of Data Scientists on Software Development Teams. Microsoft Research Technical Report MSR-TR-2015-30. http://research.microsoft.com/apps/pubs/default.aspx?id=242286

My story…

Academia: How to Scare a Postdoc
“Which version of your code and data was used to create Figure 3 in v2 of your recent paper?”

http://software-carpentry.org/

Research library writing
•  Packaging & Dependencies
•  Installation & Deployment
•  Trying to automate deployment across a cluster

Starting out with Data Science
•  Big & Fast Bare Metal Appliances
•  Lots of scripting and glue code
•  Manual deployments
•  Lots of cron jobs

Test Driven Development
Continuous Integration & Delivery
Continuous Improvement
Cloud Native Applications

What does Cloud Native mean?

Cloud Native Haiku
Here is my source code
Run it on the cloud for me
I do not care how.
-  Onsi Fakhouri @onsijoe

Then: assume reliable infrastructure, release code every 3 months, "works in my environment", tightly coupled
Now: assume fragile infrastructure, release code early and often, shared responsibility, loosely coupled

Implications
Build for failure
Make apps disposable & scalable
Accept the constraints of the platform

•  One Codebase with revision control
•  Explicitly declare dependencies
•  Stateless Processes – attach external data stores
•  Parity between dev & production environments
•  …
http://12factor.net
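As one way to picture the "stateless processes" and "attach external data stores" factors, here is a minimal sketch that reads all settings from the environment at start-up. The variable names (`DATABASE_URL`, `MODEL_PATH`, `DEBUG`) and the `Settings` class are hypothetical, chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Configuration read once at start-up; nothing is hard-coded per environment."""
    database_url: str
    model_path: str
    debug: bool


def load_settings(environ=None):
    """Build Settings from environment variables (the names are hypothetical)."""
    env = os.environ if environ is None else environ
    return Settings(
        # Required: raises KeyError early if the data store is not attached.
        database_url=env["DATABASE_URL"],
        # Optional values fall back to development-friendly defaults.
        model_path=env.get("MODEL_PATH", "model.pkl"),
        debug=env.get("DEBUG", "false").lower() == "true",
    )
```

The same code then runs unchanged in dev and production; only the environment differs, which is what makes parity between the two easier to keep.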

Why should we apply this to Data Science?

Things I hear from data scientists: How do I …
•  speed up the set up of my system?
•  keep different versions of Python/R in sync?
•  make my dev environment the same as production?
•  make my models easily available?
•  make repeatable/reproducible model runs?
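For the last question, one possible starting point is to store a small provenance record next to every model run. The sketch below assumes the code lives in a git checkout and the input is a single file; `run_record` and its field names are hypothetical.

```python
import hashlib
import subprocess
import sys


def file_sha256(path, chunk_size=1 << 20):
    """Checksum a data file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_commit():
    """Return the current commit hash, or 'unknown' outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def run_record(data_path, params):
    """Collect what is needed to answer 'which code and data made this result?'"""
    return {
        "code_version": git_commit(),
        "data_sha256": file_sha256(data_path),
        "python_version": sys.version.split()[0],
        "params": params,
    }
```

Writing the returned dictionary out as JSON alongside each figure or model artifact is one way to answer the "Figure 3 in v2" question later.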

What is Cloud Native Data Science?
12 factors + Reproducibility
Expose models as services
Explicit configuration for data pipelines
What else?
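To make "expose models as services" concrete, here is a minimal sketch using only the Python standard library. The linear "model" and its coefficients are stand-ins invented for illustration; a real service would load a trained model instead.

```python
import json
import os
from wsgiref.simple_server import make_server

# Stand-in for a trained model: hypothetical linear coefficients.
COEFFICIENTS = [0.5, -1.0]
INTERCEPT = 2.0


def predict(features):
    """Score one observation with the stand-in linear model."""
    return INTERCEPT + sum(c * x for c, x in zip(COEFFICIENTS, features))


def app(environ, start_response):
    """WSGI app: POST {"features": [...]} and receive {"prediction": ...}."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        features = [float(x) for x in payload["features"]]
        if len(features) != len(COEFFICIENTS):
            raise ValueError("wrong number of features")
        status, body = "200 OK", {"prediction": predict(features)}
    except (KeyError, TypeError, ValueError) as exc:
        status, body = "400 Bad Request", {"error": str(exc)}
    data = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(data)))])
    return [data]


def serve():
    """Listen on the port given by the environment, defaulting to 8000."""
    port = int(os.environ.get("PORT", "8000"))
    make_server("", port, app).serve_forever()
```

Calling `serve()` starts the service locally. The process keeps no state between requests, so it can be scaled out or restarted freely.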

Focus on Provisioning & Deployment
Cloud Foundry: Open Source platform powering GE Predix, Intel Trusted Analytics Platform, IBM Bluemix, SAP HANA Cloud, HP Helion Cloud

Deploy your app with cf push
CF determines app type (Java, Python, Ruby, …)
Installs necessary environment
Provisions & binds data sources
Creates (sub-)domain, routing and load balancing
Continual app health checks & restarts
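From the app's side, bound data sources show up in the `VCAP_SERVICES` environment variable as JSON, and the port to listen on arrives in `PORT`. The sketch below shows one way to read both. The helper names are hypothetical, and the keys inside `credentials` vary by service, so treat `uri` in the usage note as an assumption.

```python
import json
import os


def service_credentials(name, environ=None):
    """Find the credentials of a bound service instance by its instance name."""
    env = os.environ if environ is None else environ
    # VCAP_SERVICES maps each service label to a list of bound instances.
    services = json.loads(env.get("VCAP_SERVICES", "{}"))
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == name:
                return instance.get("credentials", {})
    raise LookupError(f"no bound service named {name!r}")


def listen_port(environ=None, default=8080):
    """Return the port assigned by the platform, or a local default."""
    env = os.environ if environ is None else environ
    return int(env.get("PORT", default))
```

For a service instance created under the hypothetical name `my-postgres`, `service_credentials("my-postgres").get("uri")` would return its connection string if that service publishes one under `uri`.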

How do I get started?
Cloud Foundry for Data Science tutorial: https://github.com/ihuston/python-cf-examples
Covers:
•  Deploying simple Python applications
•  Scaling instances
•  Provisioning & Connecting to data sources
•  Using Conda for Python package management

Focus on data pipelines
Spring XD & Spring Cloud Data Flow
Pipelines for composable data services
DSL based on Unix pipes: http | filter | transform | hdfs
Multiple paths with taps: mypipeline.filter > newtransform | redis
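The pipe-and-tap idea can be illustrated in plain Python with generators. This is only an analogy under my own naming (`pipeline`, `keep`, `transform`, `tap`), not the Spring XD or Spring Cloud Data Flow API.

```python
def pipeline(source, *stages):
    """Chain generator stages left to right, like `a | b | c`."""
    stream = iter(source)
    for stage in stages:
        stream = stage(stream)
    return stream


def keep(predicate):
    """Stage that passes on only the items matching the predicate."""
    def stage(stream):
        return (item for item in stream if predicate(item))
    return stage


def transform(func):
    """Stage that applies func to every item."""
    def stage(stream):
        return (func(item) for item in stream)
    return stage


def tap(sink):
    """Stage that copies each item to a side sink without changing the main flow."""
    def stage(stream):
        for item in stream:
            sink(item)
            yield item
    return stage


tapped = []
main = pipeline(
    [1, 2, 3, 4, 5, 6],
    keep(lambda n: n % 2 == 0),            # filter
    tap(lambda n: tapped.append(n * 10)),  # second path fed from after the filter
    transform(lambda n: n + 1),            # transform
)
result = list(main)
print(result)  # [3, 5, 7]
print(tapped)  # [20, 40, 60]
```

The tap sees the same filtered items as the main path but processes them independently, which is the role `mypipeline.filter > newtransform | redis` plays in the DSL.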

Integrations
Data Ingestion and Pipeline Processing
Kafka, RabbitMQ, MQTT, JMS, HTTP, GPDB, HAWQ
Partition, Filter, Transform, Split, Aggregate
Real Time Analytics and Complex Event Processing
Spark Streaming, RxJava, PMML Scoring
Redis, GemFire, Cassandra, etc.
Batch Workflow Orchestration + ETL
Map Reduce, HDFS, PIG, Hive, GPDB, HAWQ, Spark
RDBMS, FILE, FTP, Log, Mongo, Splunk

Where can I learn more?
DevOps Novel: The Phoenix Project by Kim, Behr & Spafford
12factor.net
cloudfoundry.org
github.com/ihuston/python-cf-examples
cloud.spring.io/spring-cloud-dataflow
projects.spring.io/spring-xd/

@ianhuston
