
A personal journey towards Cloud Native Data Science

Ian Huston
November 15, 2015


Talk delivered at Open Data Science Conference, San Francisco, November 2015

As more applications are deployed in the cloud, developers are learning how best to design software for this new type of system. These Cloud Native applications are single-purpose, stateless and easily scalable, in contrast with traditional approaches that produced large, monolithic and fragile systems. What can data scientists learn from these new approaches, and how can we make sure our models are easily deployed in this kind of system? In this talk I will discuss the principles of cloud native design, how they apply to data-science-driven applications, and how data scientists can get started with open source cloud native platforms.


Transcript

  1. A personal journey towards
    Cloud Native Data Science
    @OPENDATASCI
    OPEN DATA SCIENCE CONFERENCE
    Ian Huston
    @ianhuston
    SAN FRANCISCO | 2015


  2. Who am I?
    Data Scientist
    http://github.com/ihuston
    @ianhuston


  3. At Microsoft
    8 of the 16 data scientists interviewed
    work on or manage the operationalization
    of predictive models





    The Emerging Role of Data Scientists on Software Development Teams
    Microsoft Research. Technical Report. MSR-TR-2015-30

    http://research.microsoft.com/apps/pubs/default.aspx?id=242286


  4. My story…


  5. Academia: How to Scare a Postdoc
    “Which version of your
    code and data
    was used to create
    Figure 3 in v2 of your recent
    paper?”
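
One way to be ready for that question is to store a small provenance record next to every figure. The following is a minimal sketch, not the approach from the talk: the function name `provenance_record`, the record fields and the example commit string are all hypothetical, and in practice the code version would come from your revision control system.

```python
import hashlib
import json


def provenance_record(data: bytes, code_version: str, figure: str) -> dict:
    """Describe which data and code version produced a figure (hypothetical schema)."""
    return {
        "figure": figure,
        # A content hash changes whenever the underlying data changes.
        "data_sha256": hashlib.sha256(data).hexdigest(),
        # Assumed to be supplied by the caller, e.g. a revision control commit ID.
        "code_version": code_version,
    }


if __name__ == "__main__":
    record = provenance_record(b"x,y\n1,2\n", "abc1234", "Figure 3")
    print(json.dumps(record, indent=2, sort_keys=True))
```

Saving this record alongside the figure file means the answer can be looked up rather than reconstructed from memory.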


  6. http://software-carpentry.org/


  7. Research library writing
    •  Packaging & Dependencies
    •  Installation & Deployment
    •  Trying to automate deployment across a cluster


  8. Starting out with Data Science


    •  Big & Fast Bare Metal Appliances
    •  Lots of scripting and glue code
    •  Manual deployments
    •  Lots of cron jobs


  9. Test Driven Development
    Continuous Integration & Delivery
    Continuous Improvement
    Cloud Native Applications


  10. What does Cloud Native mean?


  11. Cloud Native Haiku

    Here is my source code
    Run it on the cloud for me
    I do not care how.
    -  Onsi Fakhouri
    @onsijoe


  12. Then                            Now
    assume reliable infrastructure    assume fragile infrastructure
    release code every 3 months       release code early and often
    works in my environment           shared responsibility
    tightly coupled                   loosely coupled


  13. Implications
    Build for failure

    Make app disposable & scalable

    Accept constraints of platform





  14. •  One Codebase with revision control
    •  Explicitly declare dependencies
    •  Stateless Processes – attach external data stores
    •  Parity between dev & production environments
    •  …
    http://12factor.net
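
Two of the factors above, stateless processes and attached external data stores, usually come down to reading connection details from the environment rather than hard-coding them. A minimal sketch; the variable names `DATABASE_URL` and `MODEL_VERSION` and the `load_settings` helper are illustrative assumptions, not something the slide prescribes.

```python
import os
from typing import Mapping, NamedTuple


class Settings(NamedTuple):
    database_url: str
    model_version: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables (names are hypothetical)."""
    try:
        database_url = env["DATABASE_URL"]
    except KeyError:
        raise RuntimeError("DATABASE_URL must be set in the environment") from None
    # Optional setting with an explicit default.
    return Settings(database_url, env.get("MODEL_VERSION", "latest"))


if __name__ == "__main__":
    # The same code runs unchanged in dev and production; only the environment differs.
    print(load_settings({"DATABASE_URL": "postgres://localhost/dev"}))
```

Passing the environment in as a mapping keeps the function easy to test and makes the dependency on configuration explicit.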


  15. Why should we apply this
    to Data Science?


  16. Things I hear from data scientists:
    How do I …
    •  speed up the set up of my system?
    •  keep different versions of Python/R in sync?
    •  make my dev environment the same as production?
    •  make my models easily available?
    •  make repeatable/reproducible model runs?


  17. What is Cloud Native Data Science?
    12 factors +

    Reproducibility

    Expose models as services

    Explicit configuration for data pipelines

    What else?
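
As an illustration of "expose models as services" combined with reproducibility, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: the toy linear "model", the fixed seed, the `x` query parameter and the `serve` helper are hypothetical, and a real deployment would likely use a proper web framework and a trained model.

```python
import json
import random
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def train_model(seed: int = 42):
    """Stand-in for real training: a fixed seed makes the 'model' reproducible."""
    rng = random.Random(seed)
    slope, intercept = rng.uniform(0, 1), rng.uniform(0, 1)
    return lambda x: slope * x + intercept


MODEL = train_model()


def app(environ, start_response):
    """WSGI app: a request with ?x=<number> returns a JSON prediction."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    try:
        x = float(query["x"][0])
    except (KeyError, ValueError):
        start_response("400 Bad Request", [("Content-Type", "application/json")])
        return [json.dumps({"error": "pass a numeric x parameter"}).encode()]
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps({"x": x, "prediction": MODEL(x)}).encode()]


def serve(port: int = 8080) -> None:
    """Run the app with the reference WSGI server (for local experiments only)."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts the service locally; because the process holds no state beyond the seeded model, any number of identical instances can run side by side.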


  18. Focus on Provisioning & Deployment
    Open Source platform powering:
    GE Predix, Intel Trusted Analytics Platform,
    IBM Bluemix, SAP HANA Cloud,
    HP Helion Cloud


  19. Deploy your app with cf push
    CF determines app type (Java, Python, Ruby, …)

    Installs necessary environment

    Provisions & binds data sources

    Creates (sub-)domain, routing and load balancing

    Continual app health checks & restarts
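
Once a data source is bound, Cloud Foundry passes its connection details to the app in the `VCAP_SERVICES` environment variable as JSON. A minimal sketch of reading it from Python; the helper name `service_credentials`, the service name `my-db` and the credential fields in the sample are hypothetical, and the exact fields depend on the service broker.

```python
import json
import os
from typing import Mapping


def service_credentials(name: str, env: Mapping[str, str] = os.environ) -> dict:
    """Return the credentials of the bound service instance called `name`."""
    services = json.loads(env.get("VCAP_SERVICES", "{}"))
    # VCAP_SERVICES maps each service label to a list of bound instances.
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == name:
                return instance.get("credentials", {})
    raise LookupError(f"no bound service named {name!r}")


if __name__ == "__main__":
    # Hypothetical sample of what a bound database might look like.
    sample = {"VCAP_SERVICES": json.dumps(
        {"postgres": [{"name": "my-db", "credentials": {"uri": "postgres://example/db"}}]}
    )}
    print(service_credentials("my-db", sample))
```

Looking services up by instance name keeps the application code independent of which provider backs the data store.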


  20. How do I get started?
    Covers:
    •  Deploying simple Python applications
    •  Scaling instances
    •  Provisioning & Connecting to data sources
    •  Using Conda for Python package management
    Cloud Foundry for Data Science tutorial
    https://github.com/ihuston/python-cf-examples


  21. Focus on data pipelines
    Spring XD & Spring Cloud Data Flow
    Pipelines for composable data services


    DSL based on Unix pipes:
    http | filter | transform | hdfs


    Multiple paths with taps:
    mypipeline.filter > newtransform | redis
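
The pipe-and-tap idea can be illustrated outside Spring XD with plain Python generators. This is only an analogy under stated assumptions, not the Spring XD DSL or API: the stage names mirror the slide, while `tap` and the in-memory sinks are hypothetical stand-ins for `hdfs` and `redis`.

```python
from typing import Callable, Iterable, Iterator, List


def filter_stage(items: Iterable[int]) -> Iterator[int]:
    """Keep only even readings (an arbitrary example predicate)."""
    return (x for x in items if x % 2 == 0)


def transform_stage(items: Iterable[int]) -> Iterator[int]:
    """Scale each reading by ten (an arbitrary example transform)."""
    return (x * 10 for x in items)


def tap(items: Iterable[int], side_sink: Callable[[int], None]) -> Iterator[int]:
    """Pass items through unchanged while copying each one to a second path."""
    for x in items:
        side_sink(x)
        yield x


if __name__ == "__main__":
    main_sink: List[int] = []   # stands in for hdfs
    tap_sink: List[int] = []    # stands in for redis
    source = range(6)           # stands in for http
    # source | filter | transform | main_sink, with a tap after the filter
    main_sink.extend(transform_stage(tap(filter_stage(source), tap_sink.append)))
    print(main_sink, tap_sink)
```

Each stage only knows about the stream it receives, so stages can be recombined or tapped without changing their code.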


  22. Integrations
    Data Ingestion and Pipeline Processing
    Kafka, RabbitMQ, MQTT, JMS, HTTP, GPDB, HAWQ
    Partition, Filter, Transform, Split, Aggregate
    Real Time Analytics and Complex Event Processing
    Spark Streaming, RxJava, PMML Scoring
    Redis, GemFire, Cassandra, etc.
    Batch Workflow Orchestration + ETL
    Map Reduce, HDFS, PIG, Hive, GPDB, HAWQ, Spark
    RDBMS, FILE, FTP, Log, Mongo, Splunk


  23. Where can I learn more?
    DevOps Novel: The Phoenix Project by Kim, Behr & Spafford

    12factor.net

    cloudfoundry.org
    github.com/ihuston/python-cf-examples

    cloud.spring.io/spring-cloud-dataflow
    projects.spring.io/spring-xd/



  24. @ianhuston
