Kubernetes-like Reconciliation Protocol for Managed Flink Services

Sharon Xie
November 13, 2023

This talk was given at Flink Forward 2023.

Want your Flink jobs to keep running without failures? Inspired by the robustness of Kubernetes, we created a managed Flink service that brings a similar experience. Users specify the desired state of their Flink jobs, and our platform ensures the jobs remain in that state. We embraced Kubernetes-style reconciliation loops: constant monitoring, comparison of actual and desired states, and proactive actions to resolve any issues.

We've diverged from the conventional Kubernetes operator approach. Our implementation lets a single control plane manage multiple data planes, and allows Flink jobs to be relocated to different Kubernetes clusters for better cluster utilization and for disaster recovery. With Debezium integration at its core, our reconciliation protocol guarantees efficiency and scalability.

In this talk, you will learn how we designed and implemented such a reconciliation protocol, including various reconciliation methods tailored to the unique demands of Flink.

Transcript

  1. Kubernetes-like
    Reconciliation Protocol for
    Managed Flink Services
    Sharon Xie, Flink Babysitter
    Founding Engineer @ Decodable

  2. Our Journey
    to Automate
    Babysitting
    Flink

  3. Agenda
    ● The declarative UX for managed Flink services
    ● Step by Step Implementation
    ● Q&A

  4. Wishes for
    Managed Flink
    Services

  5. Declarative vs. Imperative
    Declarative (What)
    ● I want a chocolate cake to feed 10 people
    Imperative (How)
    ● Drive to store;
    ● Buy eggs, cocoa powder, butter, flour;
    ● Drive home;
    ● Preheat Oven;
    ● Mix Ingredients;
    ● Place in a baking tray…

  6. Platform for Managed Flink Services

  7. Challenges
    ● Network is unreliable
    ● Arbitrary network latency
    ● Software and hardware can fail
    ● Flink jobs are sensitive to external changes

  8. Kubernetes Reconciliation Protocol
    Step 1: Get Target & Actual State
    Step 2: Reconcile
    If (Target State != Actual State) {
    // FIX IT
    }
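The two steps on this slide can be sketched in Python as a single comparison function. This is a minimal illustration, not the implementation from the talk: `JobState`, its fields, and the action names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobState:
    """Hypothetical job state; the field names are illustrative only."""
    status: str        # e.g. "RUNNING" or "STOPPED"
    parallelism: int


def reconcile(target: Optional[JobState], actual: Optional[JobState]) -> str:
    """Return the action needed to move `actual` toward `target`."""
    if target == actual:
        return "noop"      # states already match, nothing to fix
    if target is None:
        return "delete"    # a job exists but is no longer wanted
    if actual is None:
        return "create"    # a job is wanted but does not exist
    return "update"        # the job exists but differs from the target


print(reconcile(JobState("RUNNING", 4), None))                    # create
print(reconcile(JobState("RUNNING", 4), JobState("RUNNING", 2)))  # update
```

Returning an action instead of executing it keeps the comparison easy to test separately from whatever component talks to the cluster.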

  9. V0: Get target
    and actual state

  10. V0 Result
    ✅ Basic declarative API
    ✅ Single Source of Truth Store
    ○ Can always recreate the service based off the DB
    ❌ Can’t update Flink jobs
    ❌ No reconciliation

  11. V1: Update the Flink cluster when the
    pipeline is updated

  12. Flink Controller
    ● Debezium
    ○ Gets notified when the target
    state changes
    ● FlinkDeployer
    ○ Takes actions based on the
    job specification
    ● StateWatch
    ○ Implements K8S Watch API
    ○ Listens for Flink state changes
    and updates the DB with the actual
    state

  13. Flink Controller
    ● Stateless
    ○ Debezium does a full table
    scan when the service starts
    ● Idempotent
    ○ FlinkDeployer can issue the
    same commands
    (create/delete a cluster)
    without changing the result
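To make the idempotency property concrete, here is a minimal Python sketch. The `InMemoryCluster` stand-in and the `ensure_created` / `ensure_deleted` method names are assumptions for illustration; the slides do not show the actual FlinkDeployer interface.

```python
from typing import Dict


class InMemoryCluster:
    """Hypothetical stand-in for the Kubernetes cluster that hosts Flink jobs."""

    def __init__(self) -> None:
        self.jobs: Dict[str, dict] = {}


class FlinkDeployer:
    """Sketch of an idempotent deployer: repeating a command is harmless."""

    def __init__(self, cluster: InMemoryCluster) -> None:
        self.cluster = cluster

    def ensure_created(self, job_id: str, spec: dict) -> bool:
        """Create or update the job; return True only if something changed."""
        if self.cluster.jobs.get(job_id) == spec:
            return False                     # already in the desired state
        self.cluster.jobs[job_id] = dict(spec)
        return True

    def ensure_deleted(self, job_id: str) -> bool:
        """Delete the job if present; return True only if something changed."""
        return self.cluster.jobs.pop(job_id, None) is not None


cluster = InMemoryCluster()
deployer = FlinkDeployer(cluster)
spec = {"parallelism": 2}
print(deployer.ensure_created("job-a", spec))  # True
print(deployer.ensure_created("job-a", spec))  # False
print(deployer.ensure_deleted("job-a"))        # True
print(deployer.ensure_deleted("job-a"))        # False
```

Because a repeated command leaves the cluster unchanged, replaying every row after a full table scan on restart is safe.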

  14. V1 Result
    ✅ Flink clusters match the target states if no errors occur
    ❌ Lacks error handling
    ○ Flink job creation/deletion can fail
    ❌ StateWatch can miss events
    ○ When a Flink cluster is deleted during the service
    restart/downtime

  15. V2: Reconcile if actual state doesn’t
    match target

  16. V2 Result
    ✅ Reconciler runs a scheduled task for eventual
    consistency
    ● Any transient network issues can be recovered
    ● Missing StateWatch events can be reconciled
    ❌ No auto healing for Flink runtime issues
    ❌ No auto scaling for workload changes
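A scheduled reconciler of this kind might look like the following Python sketch. The function name, its callbacks, and the use of `ConnectionError` to model a transient network issue are all assumptions; the talk does not show this code.

```python
import time
from typing import Callable, Dict, Optional


def run_reconciler(
    get_target: Callable[[], Dict[str, str]],
    get_actual: Callable[[], Dict[str, str]],
    fix: Callable[[str, Optional[str]], None],
    interval_s: float,
    rounds: int,
) -> None:
    """Every `interval_s` seconds, fix jobs whose actual state has drifted."""
    for _ in range(rounds):
        try:
            target, actual = get_target(), get_actual()
            for job_id in target.keys() | actual.keys():
                if target.get(job_id) != actual.get(job_id):
                    fix(job_id, target.get(job_id))
        except ConnectionError:
            # Transient failure: give up on this round, the next one retries.
            pass
        time.sleep(interval_s)


target = {"job-a": "RUNNING"}
actual: Dict[str, str] = {}   # e.g. the cluster was deleted while the watch was down
attempts = []


def flaky_fix(job_id: str, desired: Optional[str]) -> None:
    """Hypothetical fixer that fails once, then succeeds."""
    attempts.append(job_id)
    if len(attempts) == 1:
        raise ConnectionError("transient network issue")
    if desired is None:
        actual.pop(job_id, None)
    else:
        actual[job_id] = desired


run_reconciler(lambda: target, lambda: actual, flaky_fix, interval_s=0.0, rounds=3)
print(actual)  # {'job-a': 'RUNNING'}
```

The first round fails, the second repairs the drift, and the third finds nothing to do, which is the eventual-consistency behavior the slide describes.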

  17. V3:
    Auto
    Healing
    and
    Scaling

  19. Auto Controller
    ● Stop jobs that are unrecoverable
    ○ E.g., external system issues
    ● Rules engine to auto fix issues
    ○ When Flink RPC times out, increase
    akka.ask.timeout
    ● Scale up/down based on the metrics
    ○ E.g., when lag keeps rising for an extended period of time,
    scale up to a larger machine
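The rules-engine idea can be sketched in Python as a list of (matcher, fix) pairs. Everything here is illustrative: the `Rule` type, the error-string match, the doubling policy, and the assumed "10 s" value format are not taken from the talk. Only the `akka.ask.timeout` key comes from the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Config = Dict[str, str]


@dataclass
class Rule:
    """Hypothetical rule: if `matches(error)` is true, apply `fix` to the config."""
    name: str
    matches: Callable[[str], bool]
    fix: Callable[[Config], Config]


def double_ask_timeout(config: Config) -> Config:
    """Double akka.ask.timeout; assumes values formatted like '10 s'."""
    current_s = int(config.get("akka.ask.timeout", "10 s").split()[0])
    return {**config, "akka.ask.timeout": f"{current_s * 2} s"}


RULES: List[Rule] = [
    # Assumed matcher: treat any error mentioning AskTimeoutException as an RPC timeout.
    Rule("rpc-timeout", lambda err: "AskTimeoutException" in err, double_ask_timeout),
]


def auto_fix(error: str, config: Config, rules: List[Rule]) -> Config:
    """Return the config produced by the first matching rule, else the config unchanged."""
    for rule in rules:
        if rule.matches(error):
            return rule.fix(config)
    return config


fixed = auto_fix("akka.pattern.AskTimeoutException: Ask timed out",
                 {"akka.ask.timeout": "10 s"}, RULES)
print(fixed)  # {'akka.ask.timeout': '20 s'}
```

Under this sketch the fix is only a new target configuration; applying it to the job would still go through the normal reconciliation path.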

  20. Event Order Challenges

  21. Solution
    ● Version: monotonically increasing with every successful update from the API
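One way a monotonically increasing version can address event ordering is to drop any event that is not newer than the last one applied. The Python sketch below assumes that usage; the slide does not say how the version is consumed, and `VersionGate` is a hypothetical name.

```python
from typing import Dict


class VersionGate:
    """Accept an event only if its version is newer than the last accepted one."""

    def __init__(self) -> None:
        self._latest: Dict[str, int] = {}   # job id -> last applied version

    def accept(self, job_id: str, version: int) -> bool:
        if version <= self._latest.get(job_id, -1):
            return False                     # stale or duplicate event: ignore
        self._latest[job_id] = version
        return True


gate = VersionGate()
print(gate.accept("job-a", 2))  # True
print(gate.accept("job-a", 1))  # False, arrived late
print(gate.accept("job-a", 3))  # True
```

Versions are tracked per job, so a late event for one job never blocks updates to another.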

  22. ✅ Fully managed Flink service
    with continuous reconciliation

  24. One More Thing…

  25. BYOC - Bring Your Own Cloud
    ● Control plane must be able to authenticate the data planes
    ● Network communication should be encrypted

  26. V4: Support BYOC

  27. BYOC
    ● Bidirectional gRPC channel over mTLS
    ○ 🔒 Encryption
    ○ Authentication
    ● DB access lives in the control plane
    ● Data plane continues processing data in the case
    of a prolonged network partition

  28. Other Benefits
    ● Resource Efficiency
    ○ A control plane can manage multiple data planes
    ● Can relocate data planes to different k8s
    clusters for
    ○ Disaster recovery
    ○ Better resource utilization

  29. Summary
    ● Users ❤ Declarative APIs
    ● Continuous reconciliation makes distributed
    error handling easier with eventual
    consistency
    ● Control / Data plane separation enables a
    more flexible architecture

  30. Kubernetes-like Reconciliation Protocol for
    Managed Flink Services
    Q&A
    @sharon_rxie
