This talk was given at Flink Forward 2023.
Want your Flink jobs to keep running without failures? Inspired by the robustness of Kubernetes, we created a managed Flink service that brings a similar experience. Users specify the desired Flink job states, and our platform ensures Flink jobs remain in that state. We embraced Kubernetes style reconciliation loops - constant monitoring, comparison of actual and desired states, and proactive actions to resolve any issues.
We've diverged from the conventional Kubernetes operator approach. Our implementation enables a single control plane to manage multiple data planes, and allows relocating Flink jobs to different Kubernetes clusters for cluster utilization and disaster recovery scenarios. With Debezium integration at its core, our reconciliation protocol guarantees efficiency and scalability.
In this talk, you will learn how we designed and implemented such a reconciliation protocol, including various reconciliation methods tailored to the unique demands of Flink.