The top 3 challenges running multi-tenant Flink at scale

Sharon Xie
November 22, 2023

This talk was given at Flink Forward 2022.

Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain operational cost-efficiency. We’ve learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.

Transcript

  1. The top 3 challenges running multi-tenant Flink at scale
    Sharon Xie
    Founding Engineer at Decodable

  2. Flink at Decodable
    ● Runs all the data processing jobs
    ● Mixed deployment modes for different use cases

  3. Challenge 1: Abstract Flink away
    ✅ Connect to the data sources and destinations
    ✅ Declare data processing logic
    🚫 Distributed checkpointing systems
    🚫 State migrations
    🚫 Job/Task configurations

  4. Our Model
    Connections
    ● Handle connectivity with external systems
    ● Connector-specific configurations
    ● Let records with a valid schema pass through
    ● Small Flink states
    Pipelines
    ● Handle data processing
    ● Same sources and sinks
    ● Records are already in a valid form
    ● Flink states can vary a lot based on the query
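The connection-side rule that only records with a valid schema pass through can be pictured as a simple filter. This is a minimal sketch under stated assumptions, not Decodable's implementation: the schema representation (a mapping from field name to Python type) and the helper names `conforms` and `pass_valid` are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical schema representation: field name -> expected Python type.
Schema = Dict[str, type]
Record = Dict[str, Any]


def conforms(record: Record, schema: Schema) -> bool:
    """Return True if the record has exactly the schema's fields with matching types."""
    if record.keys() != schema.keys():
        return False
    return all(isinstance(record[name], expected) for name, expected in schema.items())


def pass_valid(records: Iterable[Record], schema: Schema) -> Iterator[Record]:
    """Yield only the records that conform to the schema; drop the rest."""
    for record in records:
        if conforms(record, schema):
            yield record


schema = {"id": int, "name": str}
incoming = [
    {"id": 1, "name": "a"},    # valid
    {"id": "x", "name": "b"},  # wrong type for "id"
    {"id": 2},                 # missing "name"
]
valid = list(pass_valid(incoming, schema))
```

Because validation happens at the connection, a downstream pipeline in this sketch could assume every record it reads already matches the schema.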

  5. Separate Connectivity VS Processing
    ● Clear ownership, responsibility and access control
    ● Reusability
    ● Well-defined contract for workflow automation
    ● Optimize & scale separately

  6. Challenge 2: Managing Isolation
    ● Reducing the cost with resource sharing VS Increasing the risk of
    ○ Noisy neighbors
    ○ Blast radius
    ○ Security

  7. Three layers
    ● Infra (CPU, Disk, Memory, Network)
    ● Data processing (Flink)
    ● Configuration storage

  8. Infra cost-effective deployment
    [Diagram: one VPC (subnets, SGs, ...) with shared clusters hosting multiple cells (Cell 1, Cell 2, ... n)]
    ● K8S Cluster (EKS: cluster + nodes + ...)
    ○ Cell 1 K8s Namespaces: decodable-cell-1-control, decodable-cell-1-data
    ○ Cell 2 K8s Namespaces: decodable-cell-2-control, decodable-cell-2-data
    ● DB Cluster (RDS Aurora cluster + R/W instances)
    ● Kafka Cluster (MSK, AWS Managed Kafka)
    ○ topics for each cell
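The namespace names on this slide suggest a per-cell naming pattern with one control namespace and one data namespace. A minimal sketch of that pattern follows; the helper `cell_namespaces` is hypothetical, and the pattern itself is inferred from the slide's two examples rather than confirmed as a general rule.

```python
from typing import Dict


def cell_namespaces(cell_id: int) -> Dict[str, str]:
    """Build the control and data namespace names for a cell.

    The "decodable-cell-<n>-control" / "decodable-cell-<n>-data" pattern is
    inferred from the slide's examples for cells 1 and 2.
    """
    if cell_id < 1:
        raise ValueError("cell_id must be a positive integer")
    prefix = f"decodable-cell-{cell_id}"
    return {"control": f"{prefix}-control", "data": f"{prefix}-data"}


namespaces = cell_namespaces(2)
```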

  9. Infra max isolation deployment
    [Diagram: a Cell with its own VPC, DB Cluster, Kafka Cluster, and K8S Cluster]

  10. Job Isolation - Flink
    ● Mini-clusters for preview jobs
    ● Application mode clusters for connections and pipelines
    [Diagram]
    ○ K8S Deployment: Application mode cluster (JobManager, TaskManager, TaskManager, TaskManager)
    ○ K8S Job: Preview cluster pool (Mini-cluster 1, Mini-cluster 2, ... Mini-cluster N)

  11. Isolate sensitive info
    [Diagram: secret handling across the Control Plane and the Data Plane]
    ● Components: API Service, Secrets Manager, K8S Secret, Flink Cluster
    ● Create connection: Store Secrets
    ● Activate connection: Retrieve Secrets, Launch Flink Cluster
    ● IRSA*
    IRSA*: AWS IAM Roles for Service Accounts

  12. Isolate sensitive info
    ● Principle of least privilege
    ● Minimize the # of services with access to the sensitive info
    ● Audit everything

  13. Challenge 3: Observability
    Internally
    ● Things should just work
    ● We commit to SLAs with customers
    ● We evaluate ourselves against our SLOs
    Users
    ● Know if their jobs are healthy
    ● Understand performance metrics
    ● Get notified with actionable messages if something goes wrong

  14. Internal key metrics
    Primary signals:
    ● Each running job has a healthy Flink deployment
    ● # successful checkpoints per job is increasing
    Secondary signals:
    ● Error rate / # job restarts
    ● Consumer lag
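The primary checkpoint signal (the number of successful checkpoints per job keeps increasing) could be checked roughly as follows. This is a sketch under stated assumptions, not the monitoring Decodable runs: the `CheckpointSample` type, the `is_checkpointing_healthy` helper, and the 600-second stall threshold are hypothetical. In practice the counter might be read from Flink's `numberOfCompletedCheckpoints` metric.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CheckpointSample:
    """One reading of a job's cumulative successful-checkpoint counter (hypothetical type)."""
    timestamp_s: float  # when the counter was read, in seconds
    completed: int      # cumulative number of successful checkpoints


def is_checkpointing_healthy(samples: Sequence[CheckpointSample],
                             max_stall_s: float = 600.0) -> bool:
    """Return True if the counter grew during the last `max_stall_s` seconds.

    `samples` must be ordered by timestamp. The threshold is an assumed value.
    """
    if len(samples) < 2:
        return True  # not enough data to judge
    latest = samples[-1]
    window_start = latest.timestamp_s - max_stall_s
    # Baseline: the newest sample taken at or before the start of the window.
    baseline = None
    for sample in samples:
        if sample.timestamp_s <= window_start:
            baseline = sample
    if baseline is None:
        return True  # job observed for less than one full window
    return latest.completed > baseline.completed


stalled = [CheckpointSample(0, 5), CheckpointSample(300, 5), CheckpointSample(600, 5)]
progressing = [CheckpointSample(0, 5), CheckpointSample(300, 6), CheckpointSample(600, 6)]
```

The sketch ignores counter resets after a job restart; a real check would need to account for them.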

  15. User facing
    Error classifier:
    ● Unexpected server error (Page)
    ● Temporary server error (Monitor)
    ● User configuration error (Notify)
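The three error classes and their actions can be sketched as a small classifier. This is illustrative only: the exception types `UserConfigurationError` and `TemporaryServerError`, the `classify` function, and the rule that built-in timeout and connection errors count as temporary are all assumptions, not details from the talk.

```python
from enum import Enum


class ErrorClass(Enum):
    UNEXPECTED_SERVER_ERROR = "unexpected server error"
    TEMPORARY_SERVER_ERROR = "temporary server error"
    USER_CONFIGURATION_ERROR = "user configuration error"


class Action(Enum):
    PAGE = "page"        # wake up the on-call engineer
    MONITOR = "monitor"  # watch for recovery, no immediate page
    NOTIFY = "notify"    # tell the user what to fix


# Mapping taken from the slide: class -> action.
ACTIONS = {
    ErrorClass.UNEXPECTED_SERVER_ERROR: Action.PAGE,
    ErrorClass.TEMPORARY_SERVER_ERROR: Action.MONITOR,
    ErrorClass.USER_CONFIGURATION_ERROR: Action.NOTIFY,
}


class UserConfigurationError(Exception):
    """Hypothetical: raised when a user-supplied setting is invalid."""


class TemporaryServerError(Exception):
    """Hypothetical: raised for failures expected to resolve on retry."""


def classify(exc: Exception) -> ErrorClass:
    """Map an exception to an error class; anything unrecognized is unexpected."""
    if isinstance(exc, UserConfigurationError):
        return ErrorClass.USER_CONFIGURATION_ERROR
    if isinstance(exc, (TemporaryServerError, TimeoutError, ConnectionError)):
        return ErrorClass.TEMPORARY_SERVER_ERROR
    return ErrorClass.UNEXPECTED_SERVER_ERROR


action = ACTIONS[classify(UserConfigurationError("bad credentials"))]
```

Defaulting unrecognized failures to the paging class is a design choice in this sketch: it errs on the side of alerting when an error has not been explicitly categorized.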

  16. Parting thoughts
    The paradox of choice
    ● Massive Flink configurations
    ● Different APIs, deployment modes
    Isolation is very important (and hard).
    For us, scalability is really about increasing the # of users who can do real-time data processing.

  17. 2022
    Build real-time data apps & services. Fast.
    decodable.co
