Talk delivered at Open Data Science Conference, San Francisco, November 2015
As more and more applications are deployed in the cloud, developers are learning how best to design software for this new type of system. These Cloud Native applications are single-purpose, stateless, and easily scalable, in contrast with traditional approaches, which created large, monolithic, and fragile systems. What can data scientists learn from these new approaches, and how can we make sure our models are easily deployed in this kind of system? In this talk I will discuss the principles of cloud native design, how they apply to data-science-driven applications, and how data scientists can get started using open source cloud native platforms.
A personal journey towards
Cloud Native Data Science
San Francisco | 2015
Who am I?
8 of the 16 data scientists interviewed
work on or manage the operationalization
of predictive models
The Emerging Role of Data Scientists on Software Development Teams
Microsoft Research. Technical Report. MSR-TR-2015-30
Academia: How to Scare a Postdoc
“Which version of your code and data was used to create Figure 3 in v2 of your recent
Research library writing
• Packaging & Dependencies
• Installation & Deployment
• Try to automate deploy over cluster
Starting out with Data Science
• Big & Fast Bare Metal Appliances
• Lots of scripting and glue code
• Manual deployments
• Lots of cron jobs
Test Driven Development
Continuous Integration & Delivery
Cloud Native Applications
What does Cloud Native mean?
Cloud Native Haiku
Here is my source code
Run it on the cloud for me
I do not care how.
- Onsi Fakhouri
Traditional | Cloud Native
assume reliable infrastructure | assume fragile infrastructure
release code every 3 months | release code early and often
works in my environment | shared responsibility
tightly coupled | loosely coupled
Build for failure
Make app disposable & scalable
Accept constraints of platform
• One Codebase with revision control
• Explicitly declare dependencies
• Stateless Processes – attach external data stores
• Parity between dev & production environments
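Two of these principles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variable names MODEL_STORE_URL and LOG_LEVEL, the default URL, and the dict that stands in for an attached data store are all hypothetical and do not come from the talk.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Configuration read from the environment rather than hard-coded."""
    model_store_url: str  # hypothetical: location of the attached data store
    log_level: str


def load_settings(environ=None):
    """Build Settings from environment variables (names are illustrative)."""
    env = os.environ if environ is None else environ
    return Settings(
        model_store_url=env.get("MODEL_STORE_URL", "redis://localhost:6379/0"),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )


def handle_request(payload, store):
    """Stateless handler: the counter lives in the attached store,
    so any instance of the process can serve the next request."""
    count = store.get("requests", 0) + 1
    store["requests"] = count
    return {"echo": payload, "requests_seen": count}
```

Because the same code reads its settings from the environment in every stage, dev and production differ only in the values supplied, which is one way to approach the parity bullet above.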
Why should we apply this to Data Science?
Things I hear from data scientists:
How do I …
• speed up the set up of my system?
• keep different versions of Python/R in sync?
• make my dev environment the same as production?
• make my models easily available?
• make repeatable/reproducible model runs?
What is Cloud Native Data Science?
12 factors +
Expose models as services
Explicit configuration for data pipelines
Focus on Provisioning & Deployment
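As a rough sketch of what "expose models as services" could look like, the following uses only the Python standard library. The linear model and its weights are made up for illustration, and reading the port from a PORT environment variable is an assumption about the hosting platform, not something stated in the slides.

```python
import json
import os
from wsgiref.simple_server import make_server


def predict(features):
    """Placeholder model: a fixed linear score (weights are made up)."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


def app(environ, start_response):
    """WSGI app: send a JSON body like {"features": [1.0, 2.0]}, get a score."""
    try:
        size = int(environ.get("CONTENT_LENGTH") or 0)
        body = json.loads(environ["wsgi.input"].read(size) or b"{}")
        result = {"score": predict(body["features"])}
        status = "200 OK"
    except (KeyError, ValueError, TypeError):
        result = {"error": "expected JSON with a 'features' list"}
        status = "400 Bad Request"
    payload = json.dumps(result).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]


def main():
    # Assumption: the platform supplies the listening port via PORT.
    port = int(os.environ.get("PORT", "8080"))
    make_server("", port, app).serve_forever()
```

Calling main() starts the server. Keeping predict separate from the HTTP handling means the model can be tested without a network and swapped for a real one later.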
Open Source platform powering:
GE Predix, Intel Trusted Analytics Platform,
IBM Bluemix, SAP HANA Cloud,
HP Helion Cloud
Deploy your app with cf push
CF determines app type (Java, Python, Ruby, …)
Installs necessary environment
Provisions & binds data sources
Creates (sub-)domain, routing and load balancing
Continual app health checks & restarts
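Once a data source is bound, Cloud Foundry passes its connection details to the app in the VCAP_SERVICES environment variable as JSON. A minimal sketch of reading it from Python follows; the helper name bound_credentials and the service name used in the usage note are hypothetical.

```python
import json
import os


def bound_credentials(service_name, environ=None):
    """Return the credentials dict of the bound service instance
    whose "name" field equals service_name."""
    env = os.environ if environ is None else environ
    services = json.loads(env.get("VCAP_SERVICES", "{}"))
    # VCAP_SERVICES maps each service label to a list of bound instances.
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == service_name:
                return instance.get("credentials", {})
    raise LookupError(f"no bound service named {service_name!r}")
```

An app might call bound_credentials("my-redis") at startup and build its client from the returned dict, so no hostnames or passwords need to appear in the codebase.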
How do I get started?
• Deploying simple Python applications
• Scaling instances
• Provisioning & Connecting to data sources
• Using Conda for Python package management
Cloud Foundry for Data Science tutorial
Focus on data pipelines
Spring XD & Spring Cloud Data Flow
Pipelines for composable data services
DSL based on Unix pipes:
http | filter | transform | hdfs
Multiple paths with taps:
mypipeline.filter > newtransform | redis
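The idea behind the pipe-style DSL and taps can be imitated with plain Python generators. This is an analogy only, not Spring XD code: every function name below is hypothetical, and lists stand in for the real sources and sinks.

```python
def source(items):
    """Stand-in for an ingest module such as http: yields incoming messages."""
    yield from items


def keep(predicate):
    """Filter stage: pass on only the messages matching the predicate."""
    def stage(stream):
        return (m for m in stream if predicate(m))
    return stage


def transform(fn):
    """Transform stage: apply fn to every message."""
    def stage(stream):
        return (fn(m) for m in stream)
    return stage


def tap(sink):
    """Copy each message to a side sink without changing the main flow."""
    def stage(stream):
        for m in stream:
            sink(m)
            yield m
    return stage


def pipeline(stream, *stages):
    """Chain stages left to right, like modules joined with '|'."""
    for stage in stages:
        stream = stage(stream)
    return stream


def run_example():
    tapped = []  # plays the role of the tap's destination
    stored = list(pipeline(
        source([1, 2, 3, 4]),
        keep(lambda m: m % 2 == 0),
        tap(tapped.append),          # tap placed after the filter stage
        transform(lambda m: m * 10),
    ))
    return stored, tapped
```

Here run_example() sends the even numbers through both paths: the main pipeline stores the transformed values while the tap collects the filtered, untransformed ones.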
Data Ingestion and Pipeline Processing
Kafka, RabbitMQ, MQTT, JMS, HTTP, GPDB, HAWQ
Partition, Filter, Transform, Split, Aggregate
Real Time Analytics and Complex Event Processing
Spark Streaming, RxJava, PMML Scoring
Redis, GemFire, Cassandra, etc.
Batch Workflow Orchestration + ETL
Map Reduce, HDFS, PIG, Hive, GPDB, HAWQ, Spark
RDBMS, FILE, FTP, Log, Mongo, Splunk
Where can I learn more?
DevOps Novel: The Phoenix Project by Kim, Behr & Spafford