Deeply Declarative Data Pipelines (Ryanne Dolan, LinkedIn) | RTA Summit 2023

With Flink and Kubernetes, it’s possible to deploy stream processing jobs with just SQL and YAML. This low-code approach can certainly save a lot of development time. However, there is more to data pipelines than just streaming SQL. We must wire up many different systems, thread schemas through them, and, worst of all, write a lot of configuration.

In this talk, we’ll explore just how “declarative” we can make streaming data pipelines on Kubernetes. I’ll show how we can go deeper by adding more and more operators to the stack. How deep can we go?

StarTree
May 23, 2023

Transcript

  1. Deeply Declarative Data
    Pipelines
    Ryanne Dolan
    Snr Staff SWE, LinkedIn


  2. What are data pipelines?
    ● transfer data from sources to sinks
    ● replicating (e.g. MirrorMaker)
    ● filtering, routing, joining streams (e.g. Flink)
    ● ETL, rETL (e.g. Kafka Connect, Gobblin)
    ● CDC (e.g. Debezium, Brooklin)
    ● includes batch, but we'll focus on streaming


  3. Deeply Declarative?
Imperative ↔ Declarative:
     ● Explicit ↔ Implicit
     ● Concrete ↔ Abstract
     ● General-purpose ↔ Domain-specific
     ● Internally simple ↔ Internally complex
     ● Hard ↔ Easy
     Spectrum: ASM → Java → SQL → YAML


  4. 1. Preheat the oven to 450°
    2. Mix flour and water
    ...


  5. When do we need data pipelines?
[Diagram] online operational database → data pipeline → analytics database


  6. Data pipelines as connectors
[Diagram] MySQL → MySQL source connector (CDC) → Kafka → HDFS sink connector (ETL) → HDFS, all running on Kafka Connect


  7. Connectors as configuration
{
       "name": "inventory-connector",
       "config": {
         "connector.class": "io.debezium.connector.mysql.MySqlConnector",
         "tasks.max": "1",
         "database.hostname": "mysql",
         "database.port": "3306",
         "database.user": "debezium",
         "database.password": "dbz",
         "database.server.id": "184054",
         "database.server.name": "dbserver1",
         "database.include.list": "inventory",
         "database.history.kafka.bootstrap.servers": "kafka:9092",
         "database.history.kafka.topic": "schema-changes.inventory"
       }
     }
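     Deploying a connector is then just an HTTP call. A minimal sketch in Java, assuming a Kafka Connect REST endpoint at connect:8083 and the JSON above saved locally (the file name is hypothetical):

     import java.net.URI;
     import java.net.http.HttpClient;
     import java.net.http.HttpRequest;
     import java.net.http.HttpResponse;
     import java.nio.file.Files;
     import java.nio.file.Path;

     public class DeployConnector {
       public static void main(String[] args) throws Exception {
         // read the connector config JSON from a local file
         String config = Files.readString(Path.of("inventory-connector.json"));
         // POST it to the Kafka Connect REST API to create the connector
         HttpRequest request = HttpRequest.newBuilder(URI.create("http://connect:8083/connectors"))
             .header("Content-Type", "application/json")
             .POST(HttpRequest.BodyPublishers.ofString(config))
             .build();
         HttpResponse<String> response = HttpClient.newHttpClient()
             .send(request, HttpResponse.BodyHandlers.ofString());
         System.out.println(response.statusCode() + " " + response.body());
       }
     }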


  8. Simple transformation pipelines
[Diagram] MySQL → MySQL source connector (CDC) → Filter (SMT) → Flatten (SMT) → Kafka → HDFS sink connector (ETL) → HDFS


  9. Transformations as configuration
{
       "transforms": "Filter,Flatten",
       "transforms.Filter.type": "org.apache.kafka.connect.transforms.Filter",
       "transforms.Filter.predicate": "IsMuj",
       "transforms.Flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
       "transforms.Flatten.delimiter": "_",
       "predicates": "IsMuj",
       "predicates.IsMuj.type": "io.confluent.connect.transforms.predicates.FieldValuePredicate",
       "predicates.IsMuj.field": "$.name",
       "predicates.IsMuj.value": "\"muj\"",
       ...
     }


  10. Data pipelines as configuration
[Diagram] same pipeline as before, with JSON records flowing end to end: MySQL → MySQL source connector (CDC) → Filter (SMT) → Flatten (SMT) → Kafka → HDFS sink connector (ETL) → HDFS


  11. Complex data pipelines
[Diagram] online operational database → change stream → filter and join (with an event stream) → view → analytics database


  12. Apache Kafka Connect and Kafka Streams
[Diagram] online operational database → MySQL source connector (CDC) + Filter (SMT) → Kafka → Kafka Streams app → Kafka → analytics database


  13. Apache Flink
[Diagram] DB → change stream → Flink job (joined with an event stream) → upserts → DB


  14. Data pipelines power important stuff
[Diagram] advertiser accounts (new accounts) and ad click events feed a pipeline producing very important data: who do we charge? how much do we charge? (money stuff)


  15. How can we make
    data pipelines easy?


  16. How can we build
    new data pipelines
    with as little one-off
    code as possible?


  17. Stream processing with Apache Flink
[Diagram] DB → change stream → Flink (joined with an event stream) → upserts → DB


  18. Stream processing with the Flink Datastream API
// Join the two datastreams on the ad id field using a time window of one hour and a custom join function
     DataStream<Tuple2<AdvertiserAccountCDC, AdClickEvent>> joined = accounts
         .join(clicks)
         .where(account -> account.id) // Key selector for accounts
         .equalTo(click -> click.adId) // Key selector for clicks
         .window(TumblingEventTimeWindows.of(Time.hours(1))) // Window assigner for joining streams based on event time
         .apply(new JoinFunction<AdvertiserAccountCDC, AdClickEvent, Tuple2<AdvertiserAccountCDC, AdClickEvent>>() {
           @Override
           public Tuple2<AdvertiserAccountCDC, AdClickEvent> join(AdvertiserAccountCDC account, AdClickEvent click) {
             return new Tuple2<>(account, click); // Return a tuple of account and click
           }
         });


  19. Flink Table API introduces connectors
    TableFactoryService.find(StreamTableSourceFactory.class, new HashMap<>())
    .connector(new DebeziumDynamicTableSourceFactory()) // use Debezium as the connector
    .option("connector.hostname", "localhost")
    .option("connector.port", "3306")
    .option("connector.username", "flinkuser")
    .option("connector.password", "flinkpw")
    .option("connector.database-name", "inventory") // monitor all tables under inventory database
    .format(new JsonFormatFactory()) // use Json format factory
    .option("json-schema", "...") // specify the JSON schema of the records
    .option("json-fail-on-missing-field", true) // fail if a field is missing
    .inAppendMode()
    .schema(...) // specify the table schema of the source
    .createTemporaryTable("cdc_source"); // register the source as a temporary table
    // create a table from the source descriptor
    Table cdcTable = tableEnv.from("cdc_source");
    // filter out delete operations from the table
    Table filteredTable = cdcTable.where("!op.equals('d')"); // keep only insert and update operations
    // print the filtered table
    filteredTable.execute().print();


  20. Flink SQL API
-- create a CDC source table for inventory database (DDL: Data Definition Language)
     CREATE TABLE inventory (
       id INT NOT NULL,
       name STRING,
       description STRING,
       amount INT
     ) WITH (
       'connector' = 'mysql-cdc',
       'hostname' = 'localhost',
       'port' = '3306',
       'username' = 'flinkuser',
       'password' = 'flinkpw',
       'database-name' = 'inventory_db',
       'table-name' = 'inventory'
     );
     -- filter out delete operations from the table (standard query language)
     SELECT * FROM inventory WHERE op != 'd';


  21. Flink SQL for data pipelines
INSERT INTO stock          -- sink
     SELECT * FROM inventory    -- source
     WHERE quantity > 0;        -- transformation


  22. Flink SQL API
// create a StreamTableEnvironment for streaming queries
     StreamTableEnvironment tableEnv = ...; // see planners section
     // register a MySQL table named "products" in Flink SQL
     String sourceDDL =
       "CREATE TABLE products (" +
       "  id INT NOT NULL," +
       "  name STRING," +
       "  description STRING," +
       "  weight DECIMAL(10,3)" +
       ") WITH (" +
       "  'connector' = 'mysql-cdc'," +
       "  'hostname' = 'localhost'," +
       "  'port' = '3306'," +
       "  'username' = 'flinkuser'," +
       "  'password' = 'flinkpw'," +
       "  'database-name' = 'inventory'," +
       "  'table-name' = 'products'" + // monitor table `products` in database `inventory`
       ")";
     tableEnv.executeSql(sourceDDL);
     // define a dynamic aggregating query
     String query =
       "SELECT id, name, SUM(weight) as total_weight FROM products GROUP BY id, name";
     // convert the Table API query result into a DataStream of retract messages
     DataStream<Row> resultStream =
       tableEnv.toRetractStream(tableEnv.sqlQuery(query), Row.class)
         .filter(x -> x.f0) // only emit update changes (true) and filter out delete changes (false)
         .map(x -> x.f1);   // drop change flag
     // print the result stream
     resultStream.print();


  23. FlinkDeployment YAML with the Flink K8s Operator
apiVersion: flink.apache.org/v1beta1
     kind: FlinkDeployment
     metadata:
       name: basic-example
     spec:
       image: flink:1.16
       flinkVersion: v1_16
       flinkConfiguration:
         taskmanager.numberOfTaskSlots: "2"
       serviceAccount: flink
       jobManager:
         resource:
           memory: "2048m"
           cpu: 1
       taskManager:
         resource:
           memory: "2048m"
           cpu: 1
       job:
         jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
         parallelism: 2
         upgradeMode: stateless
    https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.4/examples/basic.yaml


  24. What did a human need to do?
    ● Write Java and compile a jar
    ● Store the jar somewhere (e.g. in Docker image)
    ● Write FlinkDeployment YAML
    ● Deploy the YAML
    → Flink K8s operator steps in
    → Flink job running!


  25. We can do better.


  26. Level 1
    Flink SQL in YAML
    ● "Flink SQL Runner" application
● FlinkSQLJob CRD
    ● New k8s operator


  27. Flink SQL Runner
import java.io.BufferedReader;
     import java.io.FileReader;
     import org.apache.flink.table.api.EnvironmentSettings;
     import org.apache.flink.table.api.TableEnvironment;

     public class FlinkSqlRunner {
       public static void main(String[] args) throws Exception {
         // check the arguments
         if (args.length != 1) {
           System.err.println("Usage: FlinkSqlRunner <sql-file>");
           return;
         }
         // get the sql file name
         String sqlFile = args[0];
         // create a table environment in streaming mode
         EnvironmentSettings settings = EnvironmentSettings.newInstance()
             .inStreamingMode()
             .build();
         TableEnvironment tableEnv = TableEnvironment.create(settings);
         // read the sql file and execute each semicolon-terminated statement;
         // statements can span multiple lines, so buffer until we see the ';'
         StringBuilder statement = new StringBuilder();
         try (BufferedReader reader = new BufferedReader(new FileReader(sqlFile))) {
           String line;
           while ((line = reader.readLine()) != null) {
             // skip empty or comment lines
             if (line.trim().isEmpty() || line.trim().startsWith("--")) {
               continue;
             }
             statement.append(line).append('\n');
             if (line.trim().endsWith(";")) {
               tableEnv.executeSql(statement.toString());
               statement.setLength(0);
             }
           }
         }
       }
     }
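     The idea: package this runner as a jar, point the FlinkDeployment's job.jarURI at it, and pass the path of the SQL script (mounted from a ConfigMap, as the next slides show) as the single program argument.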


  28. FlinkSQLJob Custom Resource
apiVersion: rta/v1alpha1
     kind: FlinkSQLJob
     metadata:
       name: itemcount
     spec:
       sql: |
         CREATE TABLE source (
           user_id STRING,
           item_id STRING,
           category_id STRING,
           behavior STRING,
           ts TIMESTAMP(3)
         ) WITH (
           'connector' = 'kafka',
           'topic' = 'user_behavior',
           'properties.bootstrap.servers' = 'kafka:9092',
           'format' = 'json'
         );
         CREATE TABLE sink (
           user_id STRING,
           item_count BIGINT
         ) WITH (
           'connector' = 'jdbc',
           'url' = 'jdbc:mysql://mysql:3306/flink',
           'table-name' = 'user_item_count',
           'username' = 'root',
           'password' = ''
         );
         INSERT INTO sink
         SELECT user_id, COUNT(DISTINCT item_id) AS item_count
         FROM source
         WHERE behavior = 'buy'
         GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
       parallelism: 2


  29. flink-sql-job-operator
[Diagram] FlinkSQLJob (CR YAML) → flink-sql-job-operator → SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator → JobMgr deployment, Job deployment, etc. (native Kubernetes YAML)


  30. Operators are made up of control loops
[Diagram] controllers 1, 2, 3 inside Kubernetes, each watching its own YAML resources and reconciling them
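     Each controller is a reconcile loop: observe the declared spec, compare it to the actual state, converge. A minimal sketch in plain Java (all names hypothetical; real operators are built on frameworks like java-operator-sdk and react to watch events rather than polling):

     import java.util.Map;
     import java.util.concurrent.ConcurrentHashMap;

     public class ControlLoop {
       private final Map<String, String> desired = new ConcurrentHashMap<>();  // name -> declared spec
       private final Map<String, String> observed = new ConcurrentHashMap<>(); // name -> actual state

       public void reconcile() {
         // create or update anything whose actual state drifted from its spec
         desired.forEach((name, spec) -> {
           if (!spec.equals(observed.get(name))) {
             observed.put(name, spec); // stand-in for creating/patching a real resource
           }
         });
         // delete anything that no longer has a declared spec
         observed.keySet().removeIf(name -> !desired.containsKey(name));
       }

       public static void main(String[] args) throws InterruptedException {
         ControlLoop loop = new ControlLoop();
         loop.desired.put("flinksqljob/itemcount", "parallelism: 2");
         while (true) {        // run forever: declarative state is continuously enforced
           loop.reconcile();
           Thread.sleep(1000);
         }
       }
     }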


  31. Deploying a FlinkSQLJob
$ kubectl apply -f wordcount.yaml
     $ kubectl get flinksqljob wordcount -o yaml
     # Output:
     apiVersion: rta/v1beta1
     kind: FlinkSQLJob
     metadata:
       ...
     spec:
       ...
     status:
       # The current state of the job.
       state: Running


  32. What did a human need to do?
    ● Write SQL in a YAML file
    ● Deploy the YAML
    → Flink SQL Job operator steps in
    → Flink K8s operator steps in
    → Flink job running!


  33. Where do we get this stuff?
apiVersion: rta/v1alpha1
     kind: FlinkSQLJob
     metadata:
       name: itemcounts
     spec:
       sql: |
         CREATE TABLE source (
           user_id STRING,
           item_id STRING,
           category_id STRING,
           behavior STRING,
           ts TIMESTAMP(3)
         ) WITH (
           'connector' = 'kafka',
           'topic' = 'user_behavior',
           'properties.bootstrap.servers' = 'kafka:9092',
           'format' = 'json'
         );
         CREATE TABLE sink (
           user_id STRING,
           item_count BIGINT
         ) WITH (
           'connector' = 'jdbc',
           'url' = 'jdbc:mysql://mysql:3306/flink',
           'table-name' = 'user_item_count',
           'username' = 'root',
           'password' = ''
         );
         INSERT INTO sink
         SELECT user_id, COUNT(DISTINCT item_id) AS item_count
         FROM source
         WHERE behavior = 'buy'
         GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
       parallelism: 2


  34. Level 2
    Table YAML
    ● FlinkTable CRD
    ● smarter operator


  35. FlinkTable Custom Resource
apiVersion: rta/v1alpha1
     kind: FlinkTable
     metadata:
       name: UserBehavior
     spec:
       ddl: |
         CREATE TABLE UserBehavior (
           user_id STRING,
           item_id STRING,
           category_id STRING,
           behavior STRING,
           ts TIMESTAMP(3)
         ) WITH (
           'connector' = 'kafka',
           'topic' = 'user_behavior',
           'properties.bootstrap.servers' = 'kafka:9092',
           'format' = 'json'
         );
     ---
     apiVersion: rta/v1alpha1
     kind: FlinkTable
     metadata:
       name: UserItemCounts
     spec:
       ddl: |
         CREATE TABLE UserItemCounts (
           user_id STRING,
           item_count BIGINT
         ) WITH (
           'connector' = 'jdbc',
           'url' = 'jdbc:mysql://mysql:3306/flink',
           'table-name' = 'user_item_count',
           'username' = 'root',
           'password' = ''
         );


  36. Reference FlinkTables in FlinkSQLJobs
apiVersion: rta/v1alpha1
     kind: FlinkSQLJob
     metadata:
       name: itemcounts
     spec:
       tables:
         - UserItemCounts
         - UserBehavior
       sql: |
         INSERT INTO UserItemCounts
         SELECT user_id, COUNT(DISTINCT item_id) AS item_count
         FROM UserBehavior
         WHERE behavior = 'buy'
         GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;


  37. flink-sql-job-operator
[Diagram] FlinkSQLJob (CR YAML) + referenced FlinkTable CRs → flink-sql-job-operator → DDL+SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator
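     The new work the operator does here is mostly string assembly: look up each referenced FlinkTable, prepend its DDL to the job's SQL, and write the result to the ConfigMap. A rough sketch (the lookup map stands in for fetching FlinkTable CRs from the API server):

     import java.util.List;
     import java.util.Map;
     import java.util.stream.Collectors;

     public class ScriptAssembler {
       // ddlByTable stands in for each FlinkTable CR's spec.ddl
       static String assemble(List<String> tables, String jobSql, Map<String, String> ddlByTable) {
         String ddl = tables.stream()
             .map(ddlByTable::get)       // DDL of each referenced FlinkTable
             .collect(Collectors.joining("\n"));
         return ddl + "\n" + jobSql;     // DDL first, then the INSERT INTO ... query
       }

       public static void main(String[] args) {
         Map<String, String> ddlByTable = Map.of(
             "UserBehavior", "CREATE TABLE UserBehavior (...) WITH (...);",
             "UserItemCounts", "CREATE TABLE UserItemCounts (...) WITH (...);");
         System.out.println(assemble(
             List.of("UserItemCounts", "UserBehavior"),
             "INSERT INTO UserItemCounts SELECT ... FROM UserBehavior ...;",
             ddlByTable));
       }
     }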


  38. What happens when you change or delete a FlinkTable?
[Diagram] same resource graph as the previous slide; the generated DDL+SQL script depends on every referenced FlinkTable


  39. What happens when you delete a FlinkSQLJob?
[Diagram] same resource graph, with ownerReferences pointing from the generated FlinkDeployment back to the FlinkSQLJob


  40. What happens when you delete a FlinkSQLJob?
[Diagram] same resource graph; ownerReferences on both the SQL script (ConfigMap) and the FlinkDeployment point back to the FlinkSQLJob, so Kubernetes garbage-collects them when it is deleted
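     To get that cascade, the operator stamps every generated resource with an ownerReference back to the FlinkSQLJob. A sketch using the fabric8 Kubernetes client (names taken from the earlier examples):

     import io.fabric8.kubernetes.api.model.ConfigMap;
     import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
     import io.fabric8.kubernetes.api.model.OwnerReferenceBuilder;

     public class OwnedConfigMap {
       // build the SQL-script ConfigMap owned by its FlinkSQLJob, so deleting
       // the FlinkSQLJob garbage-collects the ConfigMap automatically
       static ConfigMap sqlScript(String jobName, String jobUid, String script) {
         return new ConfigMapBuilder()
             .withNewMetadata()
               .withName(jobName + "-sql")
               .addToOwnerReferences(new OwnerReferenceBuilder()
                   .withApiVersion("rta/v1alpha1")
                   .withKind("FlinkSQLJob")
                   .withName(jobName)
                   .withUid(jobUid)        // uid of the owning FlinkSQLJob
                   .withController(true)
                   .build())
             .endMetadata()
             .addToData("job.sql", script) // the assembled DDL+SQL script
             .build();
       }
     }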


  41. Level 3
    Automatic Table Registration
    ● smarter data plane
    ● centralized catalog


  42. Why do we need to write these by hand?
apiVersion: rta/v1alpha1
     kind: FlinkTable
     metadata:
       name: UserBehavior
     spec:
       ddl: |
         CREATE TABLE UserBehavior (
           user_id STRING,
           item_id STRING,
           category_id STRING,
           behavior STRING,
           ts TIMESTAMP(3)
         ) WITH (
           'connector' = 'kafka',
           'topic' = 'user_behavior',
           'properties.bootstrap.servers' = 'kafka:9092',
           'format' = 'json'
         );
     ---
     apiVersion: rta/v1alpha1
     kind: FlinkTable
     metadata:
       name: UserItemCounts
     spec:
       ddl: |
         CREATE TABLE UserItemCounts (
           user_id STRING,
           item_count BIGINT
         ) WITH (
           'connector' = 'jdbc',
           'url' = 'jdbc:mysql://mysql:3306/flink',
           'table-name' = 'user_item_count',
           'username' = 'root',
           'password' = ''
         );


  43. Why do we need to write these by hand?
apiVersion: rta/v1alpha1
     kind: FlinkTable
     metadata:
       name: ImportantMoneyStuff
     spec:
       ddl: |
         CREATE TABLE ImportantMoneyStuff (
           account STRING,
           moneys BIGINT
         ) WITH (
           'connector' = 'pinot',
           ...
         );
     [Diagram] Pinot Table → ?


  44. Metadata catalogs
[Diagram] data plane (Pinot, MySQL, Kafka) publishes metadata to DataHub; flink-sql-job-operator reads the catalog plus the FlinkSQLJob (CR YAML)


  45. Reference tables in the catalog
apiVersion: rta/v1alpha1
     kind: FlinkSQLJob
     metadata:
       name: itemcounts
     spec:
       datasets:
         - urn:li:dataset:(urn:li:dataPlatform:pinot,project.dataset.UserItemCounts,PROD)
         - urn:li:dataset:(urn:li:dataPlatform:kafka,project.dataset.UserBehavior,PROD)
       sql: |
         INSERT INTO UserItemCounts
         SELECT user_id, COUNT(DISTINCT item_id) AS item_count
         FROM UserBehavior
         WHERE behavior = 'buy'
         GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
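     With the catalog in the loop, the operator can generate each CREATE TABLE itself: resolve the URN in DataHub, read the schema, and map the platform to a Flink connector. A hypothetical sketch of that mapping (the URN parsing and field list are illustrative; this is not DataHub's client API):

     import java.util.Map;
     import java.util.stream.Collectors;

     public class DdlGenerator {
       static String ddlFor(String urn, String tableName, Map<String, String> fields) {
         // e.g. urn:li:dataset:(urn:li:dataPlatform:kafka,project.dataset.UserBehavior,PROD)
         String platform = urn.substring(
             urn.indexOf("dataPlatform:") + "dataPlatform:".length(), urn.indexOf(','));
         String columns = fields.entrySet().stream()
             .map(e -> "  " + e.getKey() + " " + e.getValue())
             .collect(Collectors.joining(",\n"));
         String connector = switch (platform) { // map platform -> Flink connector
           case "kafka" -> "'connector' = 'kafka'";
           case "pinot" -> "'connector' = 'pinot'";
           default -> throw new IllegalArgumentException("no connector for " + platform);
         };
         return "CREATE TABLE " + tableName + " (\n" + columns + "\n) WITH (\n  "
             + connector + "\n  -- remaining options also come from the catalog\n);";
       }
     }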


  46. Level 4
    Declarative infra
    ● declarative data plane
    ● even more operators


  47. Data plane operators
[Diagram] FlinkSQLJob (CR YAML) → flink-sql-job-operator ← DataHub metadata; KafkaTopic (CR YAML) → kafka-topic-operator → Kafka; PinotTable (CR YAML) → pinot-table-operator → Pinot


  48. Data plane YAML
apiVersion: rta/v1beta1
     kind: KafkaTopic
     metadata:
       name: my-topic
       namespace: rta
       ownerReferences:
         - name: kafka-cluster
           kind: KafkaCluster
           uid: abc123...
         - name: my-table
           kind: PinotTable
           uid: abc123...
     spec:
       clusterRef:
         name: kafka-cluster
       name: my-topic
       partitions: 3
     ---
     apiVersion: rta/v1alpha1
     kind: PinotTable
     metadata:
       name: my-table
       namespace: rta
     spec:
       tableType: REALTIME
       schemaRef: my-schema
       kafkaRef:
         name: my-topic


  49. Data plane operators
[Diagram] same data plane operators as before: declarative KafkaTopic and PinotTable CRs drive kafka-topic-operator and pinot-table-operator, while DataHub supplies metadata to flink-sql-job-operator


  50. Level 5
    End-to-End Ownership
    ● pipeline operator
    ● multiple owners


  51. Pipeline YAML
apiVersion: rta/v1beta1
     kind: Pipeline
     metadata:
       name: my-pipeline
       namespace: rta
     spec:
       resources:
         - kind: KafkaTopic
           name: my-topic
         - kind: PinotTable
           name: my-table
         - kind: AvroSchema
           name: my-schema
       sql: |
         INSERT INTO my-topic ...


52. Pipeline operator
     [Diagram] Pipeline (CR YAML) → pipeline-operator → KafkaTopic, PinotTable, Schema, and FlinkSQLJob (CR YAMLs); the FlinkSQLJob flows on through flink-sql-job-operator → SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator


53. End-to-end ownership
     [Diagram] the same resource graph, with the Pipeline as the root owner of every resource generated beneath it


54. Multiple owners
     [Diagram] the same resource graph, with several Pipeline CRs sharing ownership of common resources like topics, tables, and schemas


  55. Thank you!
    @dolanRyanne
    in/RyanneDolan
    All code from Bing Chat
    All images from OpenAI Dall-E
