Kubernetes for Data Engineers

Slide 1

Kubernetes for Data Engineers

Rohit Agarwal, Software Engineer, Google Cloud (@mindprince)

Slide 2

What is Kubernetes?

Open source. Container orchestrator. Runs everywhere. Focus on applications, not machines.

Slide 3

Why Kubernetes?

Workload portability. Legacy compatible. Modular. Declarative, not imperative.

Slide 4

Kubernetes for stateless applications

Deployment and ReplicaSet. Self-healing. Autoscaling. Rollouts and rollbacks. De facto standard.
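
As an illustration of the Deployment model, here is a minimal sketch that builds a Deployment manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The name `web`, the image `nginx:1.25`, the replica count, and the rolling-update limits are assumptions chosen for the example, not values from the talk.

```python
import json


def deployment_manifest(name, image, replicas):
    """Build a minimal apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Desired state: the Deployment's ReplicaSet keeps this many Pods
            # running and replaces any that fail (self-healing).
            "replicas": replicas,
            # The selector must match the labels on the Pod template below.
            "selector": {"matchLabels": labels},
            # Rollouts replace Pods gradually instead of all at once.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 1, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


# Hypothetical application name and image, for illustration only.
manifest = deployment_manifest("web", "nginx:1.25", replicas=3)
print(json.dumps(manifest, indent=2))
```

Saving the output to a file such as `deployment.json` and running `kubectl apply -f deployment.json` would submit the desired state. Changing the image and re-applying triggers a rollout, and `kubectl rollout undo deployment/web` rolls it back.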

Slide 5

Applications that Data Engineers care about

Stateful. Databases. Data processing frameworks. Machine learning frameworks.

Slide 6

Running stateful applications

YARN: MapReduce, Hive, Spark, etc. Other workloads: bespoke deployments. Siloed clusters and underutilization. No standards, and painful management.

Slide 7

Kubernetes can help

All workloads. Standardized tooling. Borg for the rest of the world.

Slide 8

Running stateful applications on Kubernetes

Slide 9

StatefulSet

Stable, unique network identifiers. Stable, persistent storage. Ordered, graceful deployment and scaling. Ordered, graceful termination. Ordered, automated rolling updates. Built-in, no need to reinvent.
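
The stable identities and ordering guarantees follow fixed naming rules, which the sketch below reproduces in plain Python. It assumes a headless Service governing the set, the default `cluster.local` cluster domain, and the default ordered Pod management policy. The names `kafka`, `kafka-headless`, and the claim template `data` are hypothetical.

```python
def statefulset_identities(name, service, namespace, replicas, claim_template="data"):
    """Return the per-Pod identities a StatefulSet assigns, in ordinal order."""
    pods = []
    for ordinal in range(replicas):
        # Pod names are <statefulset-name>-<ordinal> and survive rescheduling.
        pod = f"{name}-{ordinal}"
        pods.append({
            "pod": pod,
            # Stable DNS name via the governing headless Service
            # (assumes the default cluster domain, cluster.local).
            "dns": f"{pod}.{service}.{namespace}.svc.cluster.local",
            # Each Pod gets its own PersistentVolumeClaim, named
            # <volumeClaimTemplate-name>-<pod-name>.
            "pvc": f"{claim_template}-{pod}",
        })
    return pods


# Hypothetical three-replica set, for illustration only.
pods = statefulset_identities("kafka", "kafka-headless", "default", 3)

# Pods are created in ascending ordinal order and terminated in reverse.
startup_order = [p["pod"] for p in pods]
shutdown_order = list(reversed(startup_order))

for p in pods:
    print(p["pod"], p["dns"], p["pvc"])
print("startup: ", startup_order)
print("shutdown:", shutdown_order)
```

Because a rescheduled Pod keeps its name, DNS entry, and claim, a database replica can find its peers and its data again after a restart.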

Slide 10

Operators

Extensions. Encode domain-specific operational knowledge. Control loops: observe, rectify, repeat.
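
The observe, rectify, repeat cycle can be sketched as a toy reconciliation loop. This is an in-memory simulation under stated assumptions, not a real operator: an actual operator would watch the Kubernetes API and act on real resources. The functions `reconcile` and `control_loop`, and the idea of tracking cluster members as a set of ordinals, are hypothetical names invented for the example.

```python
def reconcile(desired_replicas, observed_members):
    """Compare desired and observed state; return the corrective actions."""
    observed = sorted(observed_members)
    actions = []
    if len(observed) < desired_replicas:
        # Fill the lowest unused ordinals until the desired size is reached.
        used = set(observed)
        ordinal = 0
        while len(used) < desired_replicas:
            if ordinal not in used:
                actions.append(("add_member", ordinal))
                used.add(ordinal)
            ordinal += 1
    elif len(observed) > desired_replicas:
        # Remove surplus members, highest ordinal first.
        for ordinal in reversed(observed[desired_replicas:]):
            actions.append(("remove_member", ordinal))
    return actions


def control_loop(desired_replicas, cluster, max_iterations=10):
    """Observe, rectify, repeat until observed state matches desired state."""
    for _ in range(max_iterations):
        actions = reconcile(desired_replicas, cluster)  # observe and diff
        if not actions:
            break  # converged: nothing left to rectify
        for verb, ordinal in actions:  # rectify
            if verb == "add_member":
                cluster.add(ordinal)
            else:
                cluster.discard(ordinal)
    return cluster


# A three-member cluster that has lost member 1.
cluster = {0, 2}
print(reconcile(3, cluster))
print(control_loop(3, cluster))
```

The domain-specific knowledge lives in the actions themselves. For a real database, adding a member might mean registering it with its peers first or restoring from a backup, which is what an operator encodes.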

Slide 11

Lots of Operators

etcd, Prometheus, Kafka, Postgres, Elasticsearch, Redis, and so on...

Slide 12

Native integration

Spark on Kubernetes. JupyterHub. (In progress) Airflow on Kubernetes.
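
With Spark's native integration, `spark-submit` talks to the Kubernetes API server directly, and the driver and executors run as Pods. The sketch below only assembles the command line. The API server URL, image name, and jar path are placeholders invented for the example, and the helper `spark_submit_args` is hypothetical.

```python
import shlex


def spark_submit_args(api_server, image, main_class, app_jar, executors=2):
    """Assemble a spark-submit command line targeting a Kubernetes cluster."""
    return [
        "spark-submit",
        # The k8s:// prefix selects the Kubernetes scheduler backend.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a Pod.
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        # Container image used for the driver and executor Pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        # The local:// scheme means the jar is already inside the image.
        app_jar,
    ]


# All values below are placeholders, not real endpoints or images.
args = spark_submit_args(
    api_server="https://kubernetes.example.com:6443",
    image="registry.example.com/spark:latest",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
)
print(shlex.join(args))
```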

Slide 13

ML workloads

Kubeflow project. Operators for TensorFlow, PyTorch, Caffe2, MXNet… Lots of activity.

Slide 14

GPUs in Kubernetes

Support for NVIDIA GPUs. Support for scheduling any device (GPUs, FPGAs, InfiniBand, etc.).
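
Devices are requested like any other resource, by name, in a container's resource limits. The sketch below builds a Pod manifest that asks for NVIDIA GPUs through the `nvidia.com/gpu` resource name. It assumes the cluster's nodes run a device plugin advertising that resource. The Pod name and image are hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, resource="nvidia.com/gpu"):
    """Build a Pod manifest that requests whole devices via resource limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [{
                "name": name,
                "image": image,
                # Device resources are requested in whole units under limits;
                # the scheduler places the Pod on a node with enough free devices.
                "resources": {"limits": {resource: gpus}},
            }],
        },
    }


# Hypothetical training job image, for illustration only.
pod = gpu_pod_manifest("trainer", "registry.example.com/trainer:latest", gpus=2)
print(json.dumps(pod, indent=2))
```

Other device types follow the same pattern under a different resource name, whichever one their device plugin advertises.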

Slide 15

Recap

Stateless > Deployment and ReplicaSet
Simple stateful > StatefulSet
Distributed databases > Operators
Spark/Airflow > Native integration
ML > Kubeflow