Kubernetes for Data Engineers

Rohit Agarwal <[email protected]>, Software Engineer, Google Cloud, @mindprince

What is Kubernetes?
Open source. Container orchestrator. Runs everywhere. Focus on applications, not machines.

Why Kubernetes?
Workload portability. Legacy compatible. Modular. Declarative, not imperative.

Kubernetes for stateless applications
Deployment and ReplicaSet. Self-healing. Autoscaling. Rollouts and rollbacks. De facto standard.
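To make "declarative" concrete for the stateless case, here is a minimal sketch that builds an apps/v1 Deployment manifest as a plain Python dict. The function name, the application name, and the image are illustrative assumptions, not part of the talk. Kubernetes accepts manifests in JSON as well as YAML, so the dict can be serialized with the standard library.

```python
import json


def deployment_manifest(name, image, replicas=3):
    """Return an apps/v1 Deployment manifest as a plain dict.

    `name` and `image` are supplied by the caller; nothing here talks to
    a cluster. The dict only describes the desired state.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Desired number of pods; the Deployment's ReplicaSet keeps
            # this many running and replaces pods that disappear.
            "replicas": replicas,
            # The selector must match the labels in the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


def to_json(manifest):
    """Serialize a manifest so it can be saved and applied with kubectl."""
    return json.dumps(manifest, indent=2)
```

Writing `to_json(deployment_manifest("web", "example.com/web:1.0"))` to a file and running `kubectl apply -f` on it would hand the desired state to the cluster; the name and image here are hypothetical.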

Applications that Data Engineers care about
Stateful. Databases. Data processing frameworks. Machine learning frameworks.

Running stateful applications
YARN: MapReduce, Hive, Spark, etc. Rest of workloads: bespoke deployments. Siloed clusters and underutilization. No standard, and management pain.

Kubernetes can help
All workloads. Standardized tooling. Borg for the rest of the world.

Running stateful applications on Kubernetes

StatefulSet
Stable, unique network identifiers. Stable, persistent storage. Ordered, graceful deployment and scaling. Ordered, graceful termination. Ordered, automated rolling updates. Built-in, no need to reinvent.
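The naming and ordering guarantees above can be sketched in a few lines. This is an illustration of the conventions a StatefulSet follows, not code that talks to a cluster: pods are named `<name>-<ordinal>`, each gets a DNS name under the governing headless Service, and each volume claim is named `<claim template>-<pod>`. The function names, the `data` claim template, and the `cluster.local` domain are assumptions for the example; the ordering shown is the default (OrderedReady) behavior.

```python
def statefulset_identities(name, service, replicas, namespace="default",
                           claim_template="data",
                           cluster_domain="cluster.local"):
    """List the stable identity of each replica of a StatefulSet."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{name}-{ordinal}"
        identities.append({
            # Stable pod name: StatefulSet name plus ordinal index.
            "pod": pod,
            # Stable DNS name through the headless Service.
            "dns": f"{pod}.{service}.{namespace}.svc.{cluster_domain}",
            # Stable storage: one claim per pod, reattached on reschedule.
            "pvc": f"{claim_template}-{pod}",
        })
    return identities


def startup_order(identities):
    """Pods are created one at a time, lowest ordinal first."""
    return [identity["pod"] for identity in identities]


def termination_order(identities):
    """Pods are terminated in reverse ordinal order."""
    return list(reversed(startup_order(identities)))
```

For a hypothetical three-replica `kafka` StatefulSet behind a `kafka-headless` Service, the second replica is always `kafka-1`, with its own claim `data-kafka-1`, no matter which node it lands on.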

Operators
Extensions. Encode domain-specific operational knowledge. Control loops: observe, rectify, repeat.
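The observe, rectify, repeat cycle can be shown with a toy loop over an in-memory "cluster". This is a simplified sketch of the control-loop idea only: a real operator watches the Kubernetes API and applies application-specific steps, and every name below is hypothetical.

```python
def observe(cluster):
    """Observe: read the actual state (here, the member count)."""
    return len(cluster["members"])


def rectify(cluster, desired):
    """Rectify: take one step that moves actual state toward desired."""
    actual = observe(cluster)
    if actual < desired:
        cluster["members"].append(f"member-{cluster['next_id']}")
        cluster["next_id"] += 1
        return "added"
    if actual > desired:
        cluster["members"].pop()
        return "removed"
    return "noop"


def reconcile(cluster, desired, max_steps=10):
    """Repeat: keep rectifying until nothing is left to do."""
    actions = []
    for _ in range(max_steps):
        action = rectify(cluster, desired)
        if action == "noop":
            break
        actions.append(action)
    return actions
```

Because the loop compares desired and actual state on every pass, it recovers from a lost member the same way it handles an initial scale-up: the next pass sees the gap and closes it.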

Lots of Operators
etcd. Prometheus. Kafka. Postgres. Elasticsearch. Redis. And so on...

Native integration
Spark on Kubernetes. JupyterHub. (In progress) Airflow on Kubernetes.
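As an illustration of the Spark integration, the sketch below assembles a `spark-submit` command that targets a Kubernetes API server instead of YARN. The `k8s://` master prefix, cluster deploy mode, and the two `--conf` keys are documented Spark options; the helper names and all argument values (API server URL, image, jar path, class) are placeholders to be replaced.

```python
import shlex


def spark_submit_args(api_server, image, app_resource, main_class,
                      executors=2, name="spark-job"):
    """Build the argument list for submitting a Spark job to Kubernetes."""
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to schedule on Kubernetes.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        # Container image used for the driver and executor pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        app_resource,
    ]


def as_command_line(args):
    """Render the argument list as a single shell-quoted string."""
    return " ".join(shlex.quote(arg) for arg in args)
```

The argument list can be passed to `subprocess.run` on a machine with Spark installed, or printed with `as_command_line` for review.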

ML workloads
Kubeflow project. Operators for TensorFlow, PyTorch, Caffe2, MXNet… Lots of activity.

GPUs in Kubernetes
Support for NVIDIA GPUs. Support for scheduling any device (GPUs, FPGAs, InfiniBand, etc.).
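Devices are requested like any other resource. The sketch below builds a Pod manifest that asks for NVIDIA GPUs through the `nvidia.com/gpu` resource name, which the NVIDIA device plugin advertises on GPU nodes. The function name, pod name, and image are illustrative assumptions; other devices would use whatever resource name their device plugin registers.

```python
def gpu_pod_manifest(name, image, gpus=1, resource="nvidia.com/gpu"):
    """Return a v1 Pod manifest that requests `gpus` devices."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                # Extended resources such as GPUs are set under limits;
                # the scheduler only places the pod on a node that has
                # this many devices free.
                "resources": {"limits": {resource: gpus}},
            }],
        },
    }
```

Passing a different `resource`, for example a hypothetical `example.com/fpga`, reuses the same mechanism for non-GPU devices.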

Recap
Stateless > Deployment and ReplicaSet
Simple stateful > StatefulSet
Distributed databases > Operators
Spark/Airflow > Native integration
ML > Kubeflow

Get involved
It’s not done yet. #sig-big-data #wg-machine-learning

Questions?

Thank you!
