Kubernetes-like Reconciliation Protocol for Managed Flink Services

Slide 1

Slide 1 text

Kubernetes-like Reconciliation Protocol for Managed Flink Services
Sharon Xie, Flink Babysitter Founding Engineer @ Decodable

Slide 2

Slide 2 text

Our Journey to Automate Babysitting Flink

Slide 3

Slide 3 text

Agenda
● The declarative UX for managed Flink services
● Step-by-step implementation
● Q&A

Slide 4

Slide 4 text

Wishes for Managed Flink Services

Slide 5

Slide 5 text

Declarative vs. Imperative
Declarative (What)
● I want a chocolate cake to feed 10 people
Imperative (How)
● Drive to store;
● Buy eggs, cocoa powder, butter, flour;
● Drive home;
● Preheat oven;
● Mix ingredients;
● Place in a baking tray…

Slide 6

Slide 6 text

Platform for Managed Flink Services

Slide 7

Slide 7 text

Challenges
● Network is unreliable
● Arbitrary network latency
● Software and hardware can fail
● Flink jobs are sensitive to external changes

Slide 8

Slide 8 text

Kubernetes Reconciliation Protocol
Step 1: Get Target & Actual State
Step 2: Reconcile
if (Target State != Actual State) {
    // FIX IT
}
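
The two steps above can be sketched as a pure function that compares the target and actual state and returns the actions needed to close the gap. This is a minimal illustration, not the implementation from the talk: `JobSpec`, its fields, and the action names are hypothetical, and `None` is assumed to mean "no job should exist" or "no job exists".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical job description; the fields are illustrative only."""
    image: str
    parallelism: int


def reconcile(target: Optional[JobSpec], actual: Optional[JobSpec]) -> List[str]:
    """Return the actions that move `actual` toward `target`."""
    if target == actual:
        return []                    # states match: nothing to fix
    if target is None:
        return ["delete"]            # a job is running that should not exist
    if actual is None:
        return ["create"]            # a job should exist but does not
    return ["delete", "create"]      # spec drifted: replace the job
```

Because the function only looks at the current difference and never at how the system got there, it can be called again at any time and will return an empty list once the two states agree.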

Slide 9

Slide 9 text

V0: Get target and actual state

Slide 10

Slide 10 text

V0 Result
✅ Basic declarative API
✅ Single Source of Truth Store
  ○ Can always recreate the service based off the DB
❌ Can’t update Flink jobs
❌ No reconciliation

Slide 11

Slide 11 text

V1: Update the Flink cluster when the pipeline is updated

Slide 12

Slide 12 text

Flink Controller
● Debezium
  ○ Gets notified when the target state changes
● FlinkDeployer
  ○ Takes actions based on the job specification
● StateWatch
  ○ Implements K8S Watch API
  ○ Listens for Flink state changes and updates the DB with the actual state

Slide 13

Slide 13 text

Flink Controller
● Stateless
  ○ Debezium does a full table scan when the service starts
● Idempotent
  ○ FlinkDeployer can issue the same commands (create/delete a cluster) without changing the result
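
The idempotency property can be illustrated with a small in-memory stand-in for a deployer. This is a sketch under stated assumptions: `InMemoryDeployer` and its methods are hypothetical names, a plain dict stands in for the real cluster backend, and the boolean return value (whether anything changed) is added here only to make the behavior observable.

```python
from typing import Dict


class InMemoryDeployer:
    """Hypothetical deployer; a dict stands in for the real cluster backend."""

    def __init__(self) -> None:
        self.clusters: Dict[str, dict] = {}

    def create(self, name: str, spec: dict) -> bool:
        """Ensure the cluster exists with `spec`; return True if anything changed."""
        if self.clusters.get(name) == spec:
            return False             # already in the desired state: no-op
        self.clusters[name] = dict(spec)
        return True

    def delete(self, name: str) -> bool:
        """Ensure the cluster is gone; return True if anything changed."""
        return self.clusters.pop(name, None) is not None
```

Issuing `create` twice with the same spec, or `delete` twice for the same name, leaves the same end state as issuing it once, which is what lets a stateless controller replay commands after a restart.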

Slide 14

Slide 14 text

V1 Result
✅ Flink clusters match the target states if no errors
❌ Lacks error handling
  ○ Flink job creation/deletion can fail
❌ StateWatch can miss events
  ○ When a Flink cluster is deleted during the service restart/downtime

Slide 15

Slide 15 text

V2: Reconcile if actual state doesn’t match target

Slide 16

Slide 16 text

V2 Result
✅ Reconciler runs a scheduled task for eventual consistency
  ● Any transient network issues can be recovered
  ● Missing StateWatch events can be reconciled
❌ No auto healing for Flink runtime issues
❌ No auto scaling for workload changes
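
A scheduled reconciler of this kind can be sketched as a loop that periodically compares the full target and actual state and retries whatever is still different. This is an illustration only: the function names, the dict-based state, the `apply` callback, and the use of `ConnectionError` to model a transient network failure are all assumptions, not details from the talk.

```python
import time
from typing import Callable, Dict


def reconcile_once(
    target: Dict[str, dict],
    actual: Dict[str, dict],
    apply: Callable[[str, str], None],
) -> int:
    """Run one pass over all pipelines; return how many actions succeeded."""
    fixed = 0
    for name in sorted(set(target) | set(actual)):
        if target.get(name) == actual.get(name):
            continue                 # already consistent
        action = "deploy" if name in target else "delete"
        try:
            apply(action, name)
            fixed += 1
        except ConnectionError:
            pass                     # transient failure: the next tick retries
    return fixed


def run_reconciler(target, actual, apply, ticks: int, interval_s: float = 0.0) -> None:
    """Run `ticks` reconciliation passes, sleeping `interval_s` between them."""
    for _ in range(ticks):
        reconcile_once(target, actual, apply)
        time.sleep(interval_s)
```

Since each pass starts from a fresh comparison rather than from a stream of events, a cluster that was deleted while the watcher was down simply shows up as a difference on the next tick.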

Slide 17

Slide 17 text

V3: Auto Healing and Scaling

Slide 18

Slide 18 text

V3: Auto Healing and Scaling

Slide 19

Slide 19 text

Auto Controller
● Stop jobs that are unrecoverable
  ○ E.g., external system issues
● Rules engine to auto fix issues
  ○ When Flink RPC times out, increase akka.ask.timeout
● Scale up/down based on the metrics
  ○ E.g., lag is going up for an extended period of time, scale up with a larger machine
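
A rules engine along these lines can be sketched as a list of (condition, fix) pairs evaluated against observed symptoms. Everything except the `akka.ask.timeout` key is an assumption made for illustration: the symptom names, the 15-minute threshold, the `machine.size` key, the size ladder, and the choice to double the timeout. The timeout value is assumed to be written as a number followed by a unit, such as "10 s".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]
Symptoms = Dict[str, float]

MACHINE_SIZES = ["small", "medium", "large"]   # hypothetical size ladder


def double_ask_timeout(config: Config) -> Config:
    # Assumes a value such as "10 s"; falls back to 10 seconds if unset.
    seconds = int(config.get("akka.ask.timeout", "10 s").split()[0])
    return {**config, "akka.ask.timeout": f"{seconds * 2} s"}


def scale_up(config: Config) -> Config:
    # "machine.size" is a made-up key; the largest size stays unchanged.
    current = MACHINE_SIZES.index(config.get("machine.size", "small"))
    bigger = MACHINE_SIZES[min(current + 1, len(MACHINE_SIZES) - 1)]
    return {**config, "machine.size": bigger}


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Symptoms], bool]
    fix: Callable[[Config], Config]


RULES = [
    Rule("rpc-timeout", lambda s: s.get("rpc_timeouts", 0) > 0, double_ask_timeout),
    Rule("sustained-lag", lambda s: s.get("lag_rising_minutes", 0) >= 15, scale_up),
]


def apply_rules(symptoms: Symptoms, config: Config,
                rules: List[Rule] = RULES) -> Tuple[Config, List[str]]:
    """Return the adjusted config and the names of the rules that fired."""
    applied = []
    for rule in rules:
        if rule.matches(symptoms):
            config = rule.fix(config)
            applied.append(rule.name)
    return config, applied
```

In a declarative setup the adjusted config would become part of the new target state, so the same reconciliation path that handles user updates would also roll out the automatic fix.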

Slide 20

Slide 20 text

Event Order Challenges

Slide 21

Slide 21 text

Solution
● Version: monotonically increasing with every successful update from the API
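
One way a monotonically increasing version can resolve event-order problems is to tag every state event with the version it refers to and discard events that are older than what has already been recorded. The sketch below assumes that scheme; `PipelineRecord`, `apply_event`, and the state names are hypothetical, and the talk does not specify how the version is compared.

```python
from dataclasses import dataclass


@dataclass
class PipelineRecord:
    """Hypothetical stored record of the last observed state."""
    version: int = 0
    state: str = "UNKNOWN"


def apply_event(record: PipelineRecord, event_version: int, state: str) -> bool:
    """Apply a state event unless it refers to an older version; return True if applied."""
    if event_version < record.version:
        return False                 # stale event for a superseded spec: ignore
    record.version = event_version
    record.state = state
    return True
```

Events carrying the same version are still accepted, so a job can move through several states (for example, from starting to running) under one version of its spec.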

Slide 22

Slide 22 text

✅ Fully managed Flink service with continuous reconciliation

Slide 24

Slide 24 text

One More Thing…

Slide 25

Slide 25 text

BYOC - Bring Your Own Cloud
● Control plane must be able to authenticate the data planes
● Network communication should be encrypted

Slide 26

Slide 26 text

V4: Support BYOC

Slide 27

Slide 27 text

BYOC
● Bidirectional gRPC channel over mTLS
  ○ 🔒 Encryption
  ○ Authentication
● DB access lives in the control plane
● Data plane continues processing data in the case of a prolonged network partition

Slide 28

Slide 28 text

Other Benefits
● Resource Efficiency
  ○ A control plane can manage multiple data planes
● Can relocate data planes to different k8s clusters for
  ○ Disaster recovery
  ○ Better resource utilization

Slide 29

Slide 29 text

Summary
● Users ❤ Declarative APIs
● Continuous reconciliation makes distributed error handling easier with eventual consistency
● Control / Data plane separation enables a more flexible architecture

Slide 30

Slide 30 text

Kubernetes-like Reconciliation Protocol for Managed Flink Services
Q&A
@sharon_rxie