Kubernetes for Data Engineers
Rohit Agarwal
Software Engineer, Google Cloud
@mindprince
Slide 2
What is Kubernetes?
Open source.
Container orchestrator.
Runs everywhere.
Focus on applications, not machines.
Slide 3
Why Kubernetes?
Workload portability.
Legacy compatible.
Modular.
Declarative, not imperative.
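To make "declarative, not imperative" concrete, here is a minimal sketch in Python of a single reconciliation step. It is a toy model, not Kubernetes controller code: the `reconcile` function, the action tuples, and the pod names are all hypothetical. The point it illustrates is that the user states only the desired replica count, and the system works out what to do from the observed state.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions that move the observed state toward the desired state.

    Toy model (assumption): `running_pods` is a list of pod names, and an
    action is a (verb, pod_name) tuple. Neither is a real Kubernetes type.
    """
    actions = []
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: request one creation per missing replica.
        actions.extend(("create", None) for _ in range(diff))
    elif diff < 0:
        # Too many pods: delete the surplus (here, the last ones listed).
        actions.extend(("delete", name) for name in running_pods[diff:])
    # diff == 0: observed state already matches, so nothing to do.
    return actions


if __name__ == "__main__":
    # One pod crashed: the same declaration (3 replicas) yields a "create".
    print(reconcile(3, ["app-a", "app-b"]))
```

Running the same function repeatedly against fresh observations is what makes the model self-correcting: nobody issues a "restart that pod" command.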
Slide 4
Kubernetes for stateless applications
Deployment and ReplicaSet.
Self-healing.
Autoscaling.
Rollouts and rollbacks.
De facto standard.
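As an illustration of the autoscaling bullet, the sketch below applies the scaling rule described in the Kubernetes Horizontal Pod Autoscaler documentation: desired replicas = ceil(current replicas × current metric / target metric). The function name and the sample numbers are assumptions for illustration, and the sketch deliberately omits the tolerance band and the min/max replica bounds that the real autoscaler also applies.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """Simplified HPA-style scaling rule (no tolerance, no min/max bounds)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * (current_metric / target_metric))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target: scale out.
    print(desired_replicas(4, 90, 60))
    # 4 replicas averaging 30% CPU against a 60% target: scale in.
    print(desired_replicas(4, 30, 60))
```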
Slide 5
Applications that Data Engineers care about
Stateful.
Databases.
Data processing frameworks.
Machine learning frameworks.
Slide 6
Running stateful applications
YARN: MapReduce, Hive, Spark, etc.
Other workloads: bespoke deployments.
Siloed clusters and underutilization.
No standard tooling; painful to manage.
Slide 7
Kubernetes can help
All workloads.
Standardized tooling.
Borg for the rest of the world.
Slide 8
Running stateful applications on Kubernetes
Slide 9
StatefulSet
Stable, unique network identifiers.
Stable, persistent storage.
Ordered, graceful deployment and scaling.
Ordered, graceful termination.
Ordered, automated rolling updates.
Built-in, no need to reinvent.
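The StatefulSet guarantees above follow simple, predictable naming and ordering rules. The Python sketch below models them under the default settings (ordinals starting at 0, ordered pod management, default cluster domain `cluster.local`). The helper functions are hypothetical, and the names `web`, `nginx`, and `www` are placeholder examples, not anything from this talk.

```python
def pod_names(statefulset, replicas):
    """Stable pod identities: <statefulset-name>-<ordinal>, ordinals 0..N-1."""
    return [f"{statefulset}-{i}" for i in range(replicas)]


def pod_dns(pod, service, namespace="default", cluster_domain="cluster.local"):
    """Stable DNS name of a pod behind the StatefulSet's headless Service."""
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"


def pvc_name(claim_template, pod):
    """Each pod keeps its own PersistentVolumeClaim: <template-name>-<pod-name>."""
    return f"{claim_template}-{pod}"


def startup_order(statefulset, replicas):
    """Pods are created one at a time, from ordinal 0 upward."""
    return pod_names(statefulset, replicas)


def termination_order(statefulset, replicas):
    """Pods are terminated in reverse ordinal order, highest first."""
    return list(reversed(pod_names(statefulset, replicas)))


if __name__ == "__main__":
    for pod in startup_order("web", 3):
        print(pod, pod_dns(pod, "nginx"), pvc_name("www", pod))
    print("terminate:", termination_order("web", 3))
```

Because a rescheduled pod comes back with the same name, DNS entry, and volume claim, a database replica can be addressed and re-attached to its data without any custom bookkeeping.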