A personal journey towards Cloud Native Data Science

Ian Huston
November 15, 2015

Talk delivered at Open Data Science Conference, San Francisco, November 2015

As more applications are deployed in the cloud, developers are learning how best to design software for this new type of system. These Cloud Native applications are single-purpose, stateless, and easily scalable, in contrast to traditional approaches that produced large, monolithic, and fragile systems. What can data scientists learn from these new approaches, and how can we make sure our models are easily deployed in this kind of system? In this talk I will discuss the principles of cloud native design, how they apply to data-science-driven applications, and how data scientists can get started with open source cloud native platforms.

Transcript

  1. A personal journey towards
    Cloud Native Data Science
    Open Data Science Conference (@OPENDATASCI)
    San Francisco | 2015
    Ian Huston
    @ianhuston

  2. Who am I?
    Data Scientist
    http://github.com/ihuston
    @ianhuston

  3. At Microsoft
    8 of the 16 data scientists interviewed
    work on or manage the operationalization
    of predictive models

    Source: The Emerging Role of Data Scientists on Software Development Teams
    Microsoft Research. Technical Report. MSR-TR-2015-30
    http://research.microsoft.com/apps/pubs/default.aspx?id=242286

  4. Academia: How to Scare a Postdoc
    “Which version of your
    code and data
    was used to create
    Figure 3 in v2 of your recent
    paper?”

  5. http://software-carpentry.org/

  6. Research library writing
    •  Packaging & Dependencies
    •  Installation & Deployment
    •  Trying to automate deployment across a cluster

  7. Starting out with Data Science
    •  Big & Fast Bare Metal Appliances
    •  Lots of scripting and glue code
    •  Manual deployments
    •  Lots of cron jobs

  8. Test Driven Development
    Continuous Integration & Delivery
    Continuous Improvement
    Cloud Native Applications

  9. What does Cloud Native mean?

  10. Cloud Native Haiku

    Here is my source code
    Run it on the cloud for me
    I do not care how.
    -  Onsi Fakhouri
    @onsijoe

  11. Then | Now
    assume reliable infrastructure | assume fragile infrastructure
    release code every 3 months | release code early and often
    works in my environment | shared responsibility
    tightly coupled | loosely coupled

  12. Implications
    Build for failure

    Make app disposable & scalable

    Accept constraints of platform

  13. •  One Codebase with revision control
    •  Explicitly declare dependencies
    •  Stateless Processes – attach external data stores
    •  Parity between dev & production environments
    •  …
    http://12factor.net
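
One way to read the stateless-process and dev/production-parity points is that a process learns where its attached data store lives from its environment rather than from hard-coded values. The following is a minimal Python sketch under that assumption; the variable names `MODEL_VERSION` and `DATA_STORE_URL` and their defaults are hypothetical and do not come from the talk.

```python
import os


def load_config(environ=os.environ):
    """Build the app's settings from environment variables.

    MODEL_VERSION and DATA_STORE_URL are hypothetical names used only
    for illustration; the defaults make the sketch runnable locally.
    """
    return {
        "model_version": environ.get("MODEL_VERSION", "dev"),
        "data_store_url": environ.get("DATA_STORE_URL", "sqlite:///:memory:"),
        "port": int(environ.get("PORT", "8080")),
    }


# The same code runs unchanged in dev and production; only the
# environment differs.
dev_config = load_config({})
prod_config = load_config({"MODEL_VERSION": "1.4.2", "PORT": "9000"})
```

Passing the environment in as an argument keeps the function easy to test and keeps state out of the process itself.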

  14. Why should we apply this
    to Data Science?

  15. Things I hear from data scientists:
    How do I …
    •  speed up the set up of my system?
    •  keep different versions of Python/R in sync?
    •  make my dev environment the same as production?
    •  make my models easily available?
    •  make repeatable/reproducible model runs?
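
For the last question, one possible starting point is to record what went into each model run alongside its output. The sketch below is an illustration only, not something shown in the talk: `run_manifest` and `reproducible_sample` are hypothetical helpers, and the code version is assumed to be supplied by the caller (for example a commit hash from revision control).

```python
import hashlib
import random
import sys


def run_manifest(data_bytes, seed, code_version):
    """Describe a model run well enough to repeat it later."""
    return {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "seed": seed,
        "code_version": code_version,  # supplied by the caller
        "python": sys.version.split()[0],
    }


def reproducible_sample(values, k, seed):
    """Draw a sample using a private, explicitly seeded generator."""
    rng = random.Random(seed)
    return rng.sample(values, k)


data = b"x,y\n1,2\n3,4\n"
manifest = run_manifest(data, seed=42, code_version="abc1234")
first = reproducible_sample(list(range(100)), 5, manifest["seed"])
second = reproducible_sample(list(range(100)), 5, manifest["seed"])
```

Storing the manifest next to each result makes it possible to say which code and data produced a given figure.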

  16. What is Cloud Native Data Science?
    12 factors +

    Reproducibility

    Expose models as services

    Explicit configuration for data pipelines

    What else?
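
"Expose models as services" could look something like the following sketch, which wraps a stand-in model in a small WSGI application using only the Python standard library. The linear `predict` function, its weights, and the JSON request shape are all assumptions made for illustration; a real model and web framework would replace them.

```python
import json


def predict(features):
    """Stand-in model: a fixed linear score with hypothetical weights."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


def app(environ, start_response):
    """WSGI app: accepts a JSON body {"features": [...]} and returns a prediction."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        body = json.dumps({"prediction": predict(payload["features"])})
        status = "200 OK"
    except (KeyError, ValueError, TypeError):
        body = json.dumps({"error": "expected JSON with a 'features' list"})
        status = "400 Bad Request"
    data = body.encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(data)))])
    return [data]


def serve(port=8080):
    """Serve the app locally with the standard library's reference server."""
    from wsgiref.simple_server import make_server
    make_server("0.0.0.0", port, app).serve_forever()
```

The app keeps no state between requests, so any number of identical instances can sit behind a load balancer.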

  17. Focus on Provisioning & Deployment
    Cloud Foundry: Open Source platform powering:
    GE Predix, Intel Trusted Analytics Platform,
    IBM Bluemix, SAP HANA Cloud,
    HP Helion Cloud

  18. Deploy your app with cf push
    CF determines app type (Java, Python, Ruby, …)

    Installs necessary environment

    Provisions & binds data sources

    Creates (sub-)domain, routing and load balancing

    Continual app health checks & restarts
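
From the app's side, binding a data source means that Cloud Foundry places the service's credentials in the `VCAP_SERVICES` environment variable as JSON. The sketch below shows one way a Python app might look them up; the helper name `bound_credentials`, the service name `my-db`, and the sample credentials are hypothetical.

```python
import json
import os


def bound_credentials(service_name, environ=os.environ):
    """Return the credentials of the bound service instance with this name.

    VCAP_SERVICES maps each service label to a list of bound instances;
    returns None when no instance has the requested name.
    """
    services = json.loads(environ.get("VCAP_SERVICES", "{}"))
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == service_name:
                return instance.get("credentials", {})
    return None


# Hypothetical environment, shaped like what a bound app would see.
sample_env = {
    "VCAP_SERVICES": json.dumps({
        "user-provided": [
            {"name": "my-db", "credentials": {"uri": "postgres://example/db"}}
        ]
    })
}
creds = bound_credentials("my-db", sample_env)
missing = bound_credentials("not-bound", sample_env)
```

Because the lookup falls back to an empty mapping, the same code also runs outside the platform, where it simply finds no bound services.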

  19. How do I get started?
    Cloud Foundry for Data Science tutorial
    https://github.com/ihuston/python-cf-examples
    Covers:
    •  Deploying simple Python applications
    •  Scaling instances
    •  Provisioning & Connecting to data sources
    •  Using Conda for Python package management

  20. Focus on data pipelines
    Spring XD & Spring Cloud Data Flow
    Pipelines for composable data services


    DSL based on Unix pipes:
    http | filter | transform | hdfs

    Multiple paths with taps:
    mypipeline.filter > newtransform | redis
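
The idea behind the pipe and tap notation can be sketched in plain Python. This is not the Spring XD DSL or API, only an illustration of composing stages left to right and copying the stream at one point; the helper names `pipeline`, `keep`, `transform`, and `tap` are hypothetical.

```python
from functools import reduce


def pipeline(*stages):
    """Compose stages left to right, in the spirit of `a | b | c`."""
    def run(items):
        return reduce(lambda stream, stage: stage(stream), stages, items)
    return run


def keep(predicate):
    """Stage that passes on only the items matching the predicate."""
    return lambda stream: (x for x in stream if predicate(x))


def transform(fn):
    """Stage that applies fn to every item."""
    return lambda stream: (fn(x) for x in stream)


def tap(sink):
    """Stage that copies each item to `sink` and passes it on unchanged."""
    def stage(stream):
        for x in stream:
            sink.append(x)
            yield x
    return stage


tapped = []  # receives a copy of the stream after the filter step
main = pipeline(
    keep(lambda x: x % 2 == 0),
    tap(tapped),
    transform(lambda x: x * 10),
)
result = list(main(range(6)))
```

Each stage only sees an iterable and returns an iterable, so stages can be reordered or reused in other pipelines without changes.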

  21. Integrations
    Data Ingestion and Pipeline Processing
    Kafka, RabbitMQ, MQTT, JMS, HTTP, GPDB, HAWQ
    Partition, Filter, Transform, Split, Aggregate
    Real Time Analytics and Complex Event Processing
    Spark Streaming, RxJava, PMML Scoring
    Redis, GemFire, Cassandra, etc.
    Batch Workflow Orchestration + ETL
    Map Reduce, HDFS, PIG, Hive, GPDB, HAWQ, Spark
    RDBMS, FILE, FTP, Log, Mongo, Splunk

  22. Where can I learn more?
    DevOps Novel: The Phoenix Project by Kim, Behr & Spafford

    12factor.net

    cloudfoundry.org
    github.com/ihuston/python-cf-examples

    cloud.spring.io/spring-cloud-dataflow
    projects.spring.io/spring-xd/

