Kubernetes for Data Engineers

This talk introduces Kubernetes and then focuses on topics relevant to data engineers: in particular, how to run stateful workloads on Kubernetes and how to run machine learning workloads that use GPUs on Kubernetes.


Rohit Agarwal

April 13, 2018


  1. Kubernetes for Data Engineers. Rohit Agarwal <agarwalrohit@google.com>, Software Engineer, Google Cloud. @mindprince
  2. What is Kubernetes? Open source. Container orchestrator. Runs everywhere. Focus on applications, not machines.
  3. Why Kubernetes? Workload portability. Legacy compatible. Modular. Declarative, not imperative.
  4. Kubernetes for stateless applications: Deployment and ReplicaSet. Self-healing. Autoscaling. Rollouts and rollbacks. De facto standard.
  5. Applications that Data Engineers care about: Stateful. Databases. Data processing frameworks. Machine learning frameworks.
  6. Running stateful applications: YARN (MapReduce, Hive, Spark, etc.). Rest of workloads: bespoke deployments. Siloed clusters and underutilization. No standard; management pain.
  7. Kubernetes can help: All workloads. Standardized tooling. Borg for the rest of the world.
  8. Running stateful applications on Kubernetes
  9. StatefulSet: Stable, unique network identifiers. Stable, persistent storage. Ordered, graceful deployment and scaling. Ordered, graceful termination. Ordered, automated rolling updates. Built-in, no need to reinvent.
  10. Operators: Extensions. Encode domain-specific operational knowledge. Control loops: observe, rectify, repeat.
  11. Lots of Operators: etcd. Prometheus. Kafka. Postgres. Elasticsearch. Redis. And so on...
  12. Native integration: Spark on Kubernetes. JupyterHub. (In progress) Airflow on
  13. ML workloads: Kubeflow project. Operators for TensorFlow, PyTorch, Caffe2, MXNet… Lots of activity.
  14. GPUs in Kubernetes: Support for NVIDIA GPUs. Support for scheduling any device (GPUs, FPGAs, InfiniBand, etc.).
  15. Recap: Stateless > Deployment and ReplicaSet. Simple stateful > StatefulSet. Distributed databases > Operators. Spark/Airflow > Native integration. ML > Kubeflow.
  16. Get involved: It’s not done yet. #sig-big-data #wg-machine-learning
  17. Questions?
  18. Thank you!
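
Slide 9 lists stable network identifiers and ordered scaling as StatefulSet guarantees. The sketch below models only the naming and ordering rules in plain Python; it does not talk to a cluster. It assumes the default ordered pod management and the default `cluster.local` cluster domain. The function names and the example names (`kafka`, `kafka-hs`, `data`) are illustrative, not part of the talk.

```python
def pod_names(statefulset: str, replicas: int) -> list[str]:
    """Pods are named <statefulset>-<ordinal>, with ordinals 0..replicas-1."""
    return [f"{statefulset}-{i}" for i in range(replicas)]


def pod_dns(statefulset: str, service: str, namespace: str, ordinal: int,
            cluster_domain: str = "cluster.local") -> str:
    """Stable DNS name of one pod behind the StatefulSet's headless Service."""
    return f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"


def scale_plan(statefulset: str, current: int, desired: int) -> list[tuple[str, str]]:
    """Ordered steps to move from `current` to `desired` replicas.

    Scale-up creates pods in increasing ordinal order; scale-down deletes
    them in decreasing ordinal order, one at a time.
    """
    if desired >= current:
        return [("create", f"{statefulset}-{i}") for i in range(current, desired)]
    return [("delete", f"{statefulset}-{i}")
            for i in range(current - 1, desired - 1, -1)]


if __name__ == "__main__":
    print(pod_names("kafka", 3))
    print(pod_dns("kafka", "kafka-hs", "data", 0))
    print(scale_plan("kafka", 3, 1))
```

Because the name and DNS entry depend only on the ordinal, a replacement for `kafka-1` comes back as `kafka-1`, which is what lets clustered systems keep fixed peer lists.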
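
Slide 10 describes Operators as control loops: observe, rectify, repeat. A minimal sketch of that pattern follows, using a toy in-memory `Cluster` class in place of a real database and the Kubernetes API. Every name here (`Cluster`, `observe`, `rectify`, `control_loop`, the `db-*` members) is hypothetical, and the loop stops once the state converges, whereas a real operator keeps watching indefinitely.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for an application managed by a hypothetical operator."""
    members: set[str] = field(default_factory=set)


def observe(cluster: Cluster) -> set[str]:
    """Observe: read the actual state (here, a snapshot of current members)."""
    return set(cluster.members)


def rectify(cluster: Cluster, desired: set[str]) -> list[str]:
    """Rectify: apply the changes that move actual state toward desired state."""
    actual = observe(cluster)
    actions = []
    for name in sorted(desired - actual):   # missing members are added
        cluster.members.add(name)
        actions.append(f"add {name}")
    for name in sorted(actual - desired):   # unexpected members are removed
        cluster.members.discard(name)
        actions.append(f"remove {name}")
    return actions


def control_loop(cluster: Cluster, desired: set[str], max_rounds: int = 10) -> list[str]:
    """Repeat: reconcile until a round finds nothing left to do."""
    log = []
    for _ in range(max_rounds):
        actions = rectify(cluster, desired)
        if not actions:
            break
        log.extend(actions)
    return log


if __name__ == "__main__":
    cluster = Cluster({"db-0", "db-9"})
    print(control_loop(cluster, {"db-0", "db-1", "db-2"}))
    print(sorted(cluster.members))
```

The domain-specific knowledge an operator encodes would live in `rectify`: for example, adding a database member may require joining it to the quorum before removing another.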
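
Slide 14 mentions support for NVIDIA GPUs. As a rough illustration, a Pod asks for GPUs through the `nvidia.com/gpu` resource in its container limits. The helper below builds such a manifest as a Python dictionary. The function name, pod name, and image are placeholders, and the sketch assumes the cluster already has GPU nodes with the NVIDIA device plugin installed.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests whole NVIDIA GPUs.

    GPUs are requested in `limits` and only in whole units.
    """
    if not isinstance(gpus, int) or gpus < 1:
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,  # placeholder image, not from the talk
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest("trainer", "example.com/trainer:latest", gpus=2)
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.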