The top 3 challenges running multi-tenant Flink at scale

Sharon Xie
November 22, 2023

This talk was given at Flink Forward 2022.

Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain operational cost-efficiency. We’ve learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges of building and operating this platform with Flink, and how we solved them.

  1. Flink at Decodable
    • Runs all the data processing jobs
    • Mixed deployment modes for different use cases
  2. Challenge 1: Abstract Flink away
    ✅ Connect to the data sources and destinations
    ✅ Declare data processing logic
    🚫 Distributed checkpointing systems
    🚫 State migrations
    🚫 Job/Task configurations
  3. Our Model
    Connections
    • Handle connectivity with external systems
    • Connector-specific configurations
    • Let records with a valid schema pass through
    • Small Flink states
    Pipelines
    • Handle data processing
    • Same sources and sinks
    • Records are already in a valid form
    • Flink states can vary a lot based on the query
  4. Separate Connectivity VS Processing
    • Clear ownership, responsibility and access control
    • Reusability
    • Well-defined contract for workflow automation
    • Optimize & scale separately
  5. Challenge 2: Managing Isolation
    • Reducing the cost with resource sharing VS Increasing the risk of
      ◦ Noisy neighbors
      ◦ Blast radius
      ◦ Security
  6. Three layers
    • Infra (CPU, Disk, Memory, Network)
    • Data processing (Flink)
    • Configuration storage
  7. Infra cost-effective deployment
    [Diagram] Cell 1 and Cell 2 placed on shared infrastructure:
    • VPC: Subnets, SGs, ...
    • K8S Cluster (EKS: cluster + nodes + ...): K8s Namespaces decodable-cell-1-control, decodable-cell-1-data, decodable-cell-2-control, decodable-cell-2-data
    • DB Cluster: RDS Aurora cluster + R/W instances
    • Kafka Cluster: MSK (AWS Managed Kafka), with topics shown for each cell
  8. Job Isolation - Flink
    • Mini-clusters for preview jobs
    • Application mode clusters for connections and pipelines
    [Diagram] K8S Deployment: Application mode cluster (JobManager, TaskManager, TaskManager, TaskManager)
    [Diagram] K8S Job: Preview cluster pool (Mini-cluster 1, Mini-cluster 2, Mini-cluster N)
  9. Isolate sensitive info
    [Diagram] Components: API Service, Secrets Manager, Flink Cluster, K8S Secret, IRSA*, across the Control Plane and Data Plane
    [Diagram] Step labels: Create connection, Store Secrets; Activate connection, Retrieve Secrets; Launch
    IRSA*: AWS IAM Roles for Service Accounts
  10. Isolate sensitive info
    • Principle of least privilege
    • Minimize the # of services with access to the sensitive info
    • Audit everything
  11. Challenge 3: Observability
    Internally
    • Things should just work
    • We commit to SLAs with customers
    • We evaluate ourselves against our SLOs
    Users
    • Know if their jobs are healthy
    • Understand performance metrics
    • Get notified with actionable messages if something goes wrong
  12. Internal key metrics
    Primary signals:
    • Each running job has a healthy Flink deployment
    • # successful checkpoints per job is increasing
    Secondary signals:
    • Error rate / # job restarts
    • Consumer lag
  13. User facing
    Error classifier:
    • Unexpected server error (Page)
    • Temporary server error (Monitor)
    • User configuration error (Notify)
  14. Parting thoughts
    The paradox of choice
    • Massive Flink configurations
    • Different APIs, deployment modes
    Isolation is very important (and hard).
    For us, scalability is really about increasing the # of users who can do real-time data processing.
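
As a rough illustration of the connection/pipeline split on slides 3 and 4, the plain-Python sketch below has a connection pass only schema-valid records downstream, so the pipeline can assume valid input. It does not use Flink, and it is not Decodable's implementation; the names `conforms`, `connection`, `pipeline` and the example schema are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator

Schema = Dict[str, type]   # field name -> expected Python type (assumed shape)
Record = Dict[str, Any]


def conforms(record: Record, schema: Schema) -> bool:
    """True if the record has exactly the schema's fields with matching types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[name], expected) for name, expected in schema.items()
    )


def connection(records: Iterable[Record], schema: Schema) -> Iterator[Record]:
    """Connectivity side: only records with a valid schema pass through."""
    for record in records:
        if conforms(record, schema):
            yield record


def pipeline(records: Iterable[Record]) -> Iterator[Record]:
    """Processing side: input is already valid, so no re-validation here."""
    for record in records:
        yield {**record, "amount_cents": round(record["amount"] * 100)}


schema = {"user": str, "amount": float}
raw = [
    {"user": "a", "amount": 1.5},   # valid
    {"user": "b"},                  # missing field, dropped by the connection
    {"user": "c", "amount": "x"},   # wrong type, dropped by the connection
]
processed = list(pipeline(connection(raw, schema)))
```

Because validation lives only in `connection`, the same validated stream could feed several pipelines, which is one way to read the "Reusability" and "Well-defined contract" bullets.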
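
Slide 7 shows per-cell Kubernetes namespaces such as `decodable-cell-1-control` and `decodable-cell-1-data`. Assuming the pattern is `decodable-cell-<N>-<plane>` (inferred only from those labels), a helper could derive the names for any cell. `cell_namespaces` is a hypothetical name, not something from the talk.

```python
from typing import Dict

PLANES = ("control", "data")  # the two planes named on the slide


def cell_namespaces(cell: int) -> Dict[str, str]:
    """Map each plane to its namespace name for the given cell number."""
    if cell < 1:
        raise ValueError("cell numbers are assumed to start at 1")
    return {plane: f"decodable-cell-{cell}-{plane}" for plane in PLANES}


namespaces = cell_namespaces(2)
```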
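
The signals on slide 12 might be evaluated along the lines of the sketch below. It assumes each job is sampled periodically into a snapshot; the completed-checkpoint count could come from a source such as Flink's `numberOfCompletedCheckpoints` metric. The `JobSnapshot` shape, the function names and the thresholds are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class JobSnapshot:
    """One point-in-time observation of a job (hypothetical shape)."""
    deployment_ready: bool
    completed_checkpoints: int
    restarts: int
    consumer_lag: int


def primary_healthy(window: Sequence[JobSnapshot]) -> bool:
    """Primary signals: deployment is healthy and checkpoints are increasing."""
    if len(window) < 2:
        return False  # not enough samples to see an increase
    first, last = window[0], window[-1]
    return last.deployment_ready and (
        last.completed_checkpoints > first.completed_checkpoints
    )


def secondary_warnings(
    window: Sequence[JobSnapshot], max_restarts: int = 3, max_lag: int = 10_000
) -> List[str]:
    """Secondary signals: restarts within the window and current consumer lag."""
    if not window:
        return []
    first, last = window[0], window[-1]
    warnings = []
    if last.restarts - first.restarts > max_restarts:
        warnings.append("too many restarts")
    if last.consumer_lag > max_lag:
        warnings.append("consumer lag too high")
    return warnings


ok = [JobSnapshot(True, 10, 0, 5), JobSnapshot(True, 12, 0, 7)]
stalled = [JobSnapshot(True, 10, 0, 5), JobSnapshot(True, 10, 4, 50_000)]
```

Here `primary_healthy(ok)` holds, while `stalled` fails the primary check because its checkpoint count did not advance, and it also raises both secondary warnings.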
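
The error classifier on slide 13 maps each failure to one of three classes, each with its own response (Page, Monitor, Notify). A minimal sketch follows. Only the three classes and their actions come from the slide; the rules deciding which exception falls into which class are assumptions, and `UserConfigurationError` is a hypothetical exception type.

```python
from enum import Enum


class ErrorClass(Enum):
    """The three classes from the slide; the value is the response action."""
    UNEXPECTED_SERVER_ERROR = "Page"
    TEMPORARY_SERVER_ERROR = "Monitor"
    USER_CONFIGURATION_ERROR = "Notify"


class UserConfigurationError(Exception):
    """Hypothetical: raised when a user-supplied setting is invalid."""


def classify(exc: BaseException) -> ErrorClass:
    """Assumed rules: config errors notify the user, transient I/O is monitored,
    and anything unrecognized pages the on-call engineer."""
    if isinstance(exc, UserConfigurationError):
        return ErrorClass.USER_CONFIGURATION_ERROR
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.TEMPORARY_SERVER_ERROR
    return ErrorClass.UNEXPECTED_SERVER_ERROR


action = classify(UserConfigurationError("unknown topic")).value
```

Defaulting unrecognized errors to the paging class is a conservative choice in this sketch: a failure nobody anticipated is treated as the most urgent until it is given a rule of its own.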