Deeply Declarative Data Pipelines (Ryanne Dolan, LinkedIn) | RTA Summit 2023

With Flink and Kubernetes, it's possible to deploy stream processing jobs with just SQL and YAML. This low-code approach can certainly save a lot of development time. However, there is more to data pipelines than just streaming SQL. We must wire up many different systems, thread through schemas, and, worst of all, write a lot of configuration.

In this talk, we’ll explore just how “declarative” we can make streaming data pipelines on Kubernetes. I’ll show how we can go deeper by adding more and more operators to the stack. How deep can we go?

StarTree

May 23, 2023

Transcript

  1. What are data pipelines?
      • transfer data from sources to sinks
      • replicating (e.g. MirrorMaker)
      • filtering, routing, joining streams (e.g. Flink)
      • ETL, rETL (e.g. Kafka Connect, Gobblin)
      • CDC (e.g. Debezium, Brooklin)
      • includes batch, but we'll focus on streaming

  2. Deeply Declarative?
      Imperative end of the spectrum: explicit, concrete, general-purpose, internally simple, hard (e.g. ASM, Java)
      Declarative end of the spectrum: implicit, abstract, domain-specific, internally complex, easy (e.g. SQL, YAML)

  3. Data pipelines as connectors
      [diagram: MySQL → MySQL source connector (CDC) → Kafka → HDFS sink connector (ETL) → HDFS, running on Kafka Connect]

  4. Connectors as configuration
      {
        "name": "inventory-connector",
        "config": {
          "connector.class": "io.debezium.connector.mysql.MySqlConnector",
          "tasks.max": "1",
          "database.hostname": "mysql",
          "database.port": "3306",
          "database.user": "debezium",
          "database.password": "dbz",
          "database.server.id": "184054",
          "database.server.name": "dbserver1",
          "database.include.list": "inventory",
          "database.history.kafka.bootstrap.servers": "kafka:9092",
          "database.history.kafka.topic": "schema-changes.inventory"
        }
      }

  5. Simple transformation pipelines
      [diagram: MySQL → MySQL source connector (CDC) → Filter (SMT) → Flatten (SMT) → Kafka → HDFS sink connector (ETL) → HDFS]

  6. Transformations as configuration
      {
        "transforms": "Filter,Flatten",
        "transforms.Filter.type": "org.apache.kafka.connect.transforms.Filter",
        "transforms.Filter.predicate": "IsMuj",
        "transforms.Flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
        "transforms.Flatten.delimiter": "_",
        "predicates": "IsMuj",
        "predicates.IsMuj.type": "io.confluent.connect.transforms.predicates.FieldValuePredicate",
        "predicates.IsMuj.field": "$.name",
        "predicates.IsMuj.value": "\"muj\"",
        ...
      }

  7. Data pipelines as configuration
      [diagram: MySQL → MySQL source connector (CDC, JSON config) → Filter (SMT) → Flatten (SMT) → Kafka → HDFS sink connector (ETL, JSON config) → HDFS]

  8. Apache Kafka Connect and Kafka Streams
      [diagram: online operational database → MySQL source connector (CDC) with Filter (SMT) → Kafka → Kafka Streams app → Kafka → analytics database]

  9. Data pipelines power important stuff
      [diagram: advertiser accounts (new accounts) and ad click events feed very important data — who do we charge? how much do we charge? — in other words, money stuff]

  10. Stream processing with the Flink DataStream API
      // Join the two datastreams on the ad id field using a one-hour time window and a custom join function
      DataStream<Tuple2<AdvertiserAccountCDC, AdClickEvent>> joined = accounts
          .join(clicks)
          .where(account -> account.id)                         // key selector for accounts
          .equalTo(click -> click.adId)                         // key selector for clicks
          .window(TumblingEventTimeWindows.of(Time.hours(1)))   // window assigner for joining streams based on event time
          .apply(new JoinFunction<AdvertiserAccountCDC, AdClickEvent, Tuple2<AdvertiserAccountCDC, AdClickEvent>>() {
              @Override
              public Tuple2<AdvertiserAccountCDC, AdClickEvent> join(AdvertiserAccountCDC account, AdClickEvent click) {
                  return new Tuple2<>(account, click);          // return a tuple of account and click
              }
          });

  11. Flink Table API introduces connectors
      TableFactoryService.find(StreamTableSourceFactory.class, new HashMap<>())
          .connector(new DebeziumDynamicTableSourceFactory())   // use Debezium as the connector
          .option("connector.hostname", "localhost")
          .option("connector.port", "3306")
          .option("connector.username", "flinkuser")
          .option("connector.password", "flinkpw")
          .option("connector.database-name", "inventory")       // monitor all tables under inventory database
          .format(new JsonFormatFactory())                      // use JSON format factory
          .option("json-schema", "...")                         // specify the JSON schema of the records
          .option("json-fail-on-missing-field", true)           // fail if a field is missing
          .inAppendMode()
          .schema(...)                                          // specify the table schema of the source
          .createTemporaryTable("cdc_source");                  // register the source as a temporary table

      // create a table from the source descriptor
      Table cdcTable = tableEnv.from("cdc_source");

      // filter out delete operations from the table
      Table filteredTable = cdcTable.where("!op.equals('d')");  // keep only insert and update operations

      // print the filtered table
      filteredTable.execute().print();

  12. Flink SQL API
      -- create a CDC source table for the inventory database (Data Definition Language)
      CREATE TABLE inventory (
        id INT NOT NULL,
        name STRING,
        description STRING,
        amount INT
      ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'localhost',
        'port' = '3306',
        'username' = 'flinkuser',
        'password' = 'flinkpw',
        'database-name' = 'inventory_db',
        'table-name' = 'inventory'
      );

      -- filter out delete operations from the table (query language)
      SELECT * FROM inventory WHERE op != 'd';

  13. Flink SQL for data pipelines
      INSERT INTO stock          -- sink
      SELECT * FROM inventory    -- source
      WHERE quantity > 0;        -- transformation

  14. Flink SQL API
      // create a table environment for a specific planner, batch or streaming
      StreamTableEnvironment tableEnv = ...; // see planners section

      // register a MySQL table named "products" in Flink SQL
      String sourceDDL =
          "CREATE TABLE products (" +
          "  id INT NOT NULL," +
          "  name STRING," +
          "  description STRING," +
          "  weight DECIMAL(10,3)" +
          ") WITH (" +
          "  'connector' = 'mysql-cdc'," +
          "  'hostname' = 'localhost'," +
          "  'port' = '3306'," +
          "  'username' = 'flinkuser'," +
          "  'password' = 'flinkpw'," +
          "  'database-name' = 'inventory'," +
          "  'table-name' = 'products'" +   // monitor table `products` in database `inventory`
          ")";
      tableEnv.executeSql(sourceDDL);

      // define a dynamic aggregating query
      String query = "SELECT id, name, SUM(weight) as total_weight FROM products GROUP BY id, name";

      // convert the Table API query result into a DataStream
      DataStream<Row> resultStream = tableEnv.toRetractStream(tableEnv.sqlQuery(query), Row.class)
          .filter(x -> x.f0)   // only emit update changes (true) and filter out delete changes (false)
          .map(x -> x.f1);     // drop the change flag

      // print the result stream
      resultStream.print();

  15. FlinkDeployment YAML with the Flink K8s Operator
      apiVersion: flink.apache.org/v1beta1
      kind: FlinkDeployment
      metadata:
        name: basic-example
      spec:
        image: flink:1.16
        flinkVersion: v1_16
        flinkConfiguration:
          taskmanager.numberOfTaskSlots: "2"
        serviceAccount: flink
        jobManager:
          resource:
            memory: "2048m"
            cpu: 1
        taskManager:
          resource:
            memory: "2048m"
            cpu: 1
        job:
          jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
          parallelism: 2
          upgradeMode: stateless
      https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.4/examples/basic.yaml

  16. What did a human need to do?
      • Write Java and compile a jar
      • Store the jar somewhere (e.g. in a Docker image)
      • Write FlinkDeployment YAML
      • Deploy the YAML → Flink K8s operator steps in → Flink job running!

  17. Level 1: Flink SQL in YAML
      • "Flink SQL Runner" application
      • FlinkSQLJob CRD (sketched below)
      • New k8s operator

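The deck doesn't show the operator's source code, but a FlinkSQLJob CRD could be modeled with fabric8's CRD generator along these lines. This is only a sketch: the class and field names are assumptions, mirroring the spec fields that appear on the following slides.

      // Hypothetical FlinkSQLJob CRD model (names assumed, not from the deck).
      import io.fabric8.kubernetes.api.model.Namespaced;
      import io.fabric8.kubernetes.client.CustomResource;
      import io.fabric8.kubernetes.model.annotation.Group;
      import io.fabric8.kubernetes.model.annotation.Version;

      import java.util.List;

      @Group("rta")
      @Version("v1alpha1")
      public class FlinkSQLJob extends CustomResource<FlinkSQLJobSpec, FlinkSQLJobStatus> implements Namespaced {}

      class FlinkSQLJobSpec {
        public String sql;           // SQL script to run (DDL + INSERT INTO ...)
        public List<String> tables;  // optional FlinkTable references (see slide 26)
        public int parallelism = 1;
      }

      class FlinkSQLJobStatus {
        public String state;         // e.g. "Running" (see slide 22)
      }
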
  18. Flink SQL Runner
      public class FlinkSqlRunner {
        public static void main(String[] args) throws Exception {
          // check the arguments
          if (args.length != 1) {
            System.err.println("Usage: FlinkSqlRunner <sql-file>");
            return;
          }
          // get the sql file name
          String sqlFile = args[0];
          // create a streaming execution environment
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
          // create a table environment with the blink planner in streaming mode
          EnvironmentSettings settings = EnvironmentSettings.newInstance()
              .useBlinkPlanner()
              .inStreamingMode()
              .build();
          TableEnvironment tableEnv = TableEnvironment.create(settings);
          // read the sql file line by line and execute each statement (assumes one statement per line)
          try (BufferedReader reader = new BufferedReader(new FileReader(sqlFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
              // skip empty or comment lines
              if (line.trim().isEmpty() || line.startsWith("--")) {
                continue;
              }
              // execute the statement and print the result if any
              tableEnv.executeSql(line);
            }
          }
        }
      }

  19. FlinkSQLJob Custom Resource
      apiVersion: rta/v1alpha1
      kind: FlinkSQLJob
      metadata:
        name: itemcount
      spec:
        sql: |
          CREATE TABLE source (
            user_id STRING,
            item_id STRING,
            category_id STRING,
            behavior STRING,
            ts TIMESTAMP(3)
          ) WITH (
            'connector' = 'kafka',
            'topic' = 'user_behavior',
            'properties.bootstrap.servers' = 'kafka:9092',
            'format' = 'json'
          );
          CREATE TABLE sink (
            user_id STRING,
            item_count BIGINT
          ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:mysql://mysql:3306/flink',
            'table-name' = 'user_item_count',
            'username' = 'root',
            'password' = ''
          );
          INSERT INTO sink
          SELECT user_id, COUNT(DISTINCT item_id) AS item_count
          FROM source
          WHERE behavior = 'buy'
          GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
        parallelism: 2

  20. flink-sql-job-operator
      [diagram: FlinkSQLJob (CR YAML) → flink-sql-job-operator → SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator → Kubernetes: JobManager deployment (native YAML), Job deployment (native YAML), etc.]

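As a rough illustration of the first hop in that chain (not shown in the deck), the flink-sql-job-operator might package the CR's SQL into a ConfigMap that the Flink SQL Runner mounts. The sketch below assumes the fabric8 Kubernetes client; the names and the ConfigMap layout are guesses.

      // Sketch: build and apply a ConfigMap holding a FlinkSQLJob's SQL script (fabric8 client assumed).
      import io.fabric8.kubernetes.api.model.ConfigMap;
      import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
      import io.fabric8.kubernetes.client.KubernetesClient;
      import io.fabric8.kubernetes.client.KubernetesClientBuilder;

      import java.util.Map;

      public class SqlConfigMapFactory {

        // Name the ConfigMap after the FlinkSQLJob and store the script under a single key.
        static ConfigMap sqlScript(String jobName, String namespace, String sql) {
          return new ConfigMapBuilder()
              .withNewMetadata()
                .withName(jobName + "-sql")
                .withNamespace(namespace)
              .endMetadata()
              .withData(Map.of("job.sql", sql))
              .build();
        }

        public static void main(String[] args) {
          try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            ConfigMap cm = sqlScript("itemcount", "rta", "INSERT INTO sink SELECT ... ;");
            client.resource(cm).serverSideApply();  // apply the ConfigMap to the cluster
          }
        }
      }
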
  21. Operators are made up of control loops
      [diagram: controllers 1, 2, and 3 each watch Kubernetes and reconcile their own YAML resources]

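In Java, one of those control loops can be quite small. The sketch below assumes the Java Operator SDK (the deck doesn't name a framework) and reuses the hypothetical FlinkSQLJob model from the earlier sketch: the reconciler is invoked whenever a FlinkSQLJob changes and drives the cluster toward the declared state.

      // Minimal control-loop sketch with the Java Operator SDK (framework choice is an assumption).
      import io.javaoperatorsdk.operator.api.reconciler.Context;
      import io.javaoperatorsdk.operator.api.reconciler.ControllerConfiguration;
      import io.javaoperatorsdk.operator.api.reconciler.Reconciler;
      import io.javaoperatorsdk.operator.api.reconciler.UpdateControl;

      @ControllerConfiguration
      public class FlinkSQLJobReconciler implements Reconciler<FlinkSQLJob> {

        @Override
        public UpdateControl<FlinkSQLJob> reconcile(FlinkSQLJob job, Context<FlinkSQLJob> context) {
          // 1. Observe: read the desired state from the CR (spec.sql, spec.parallelism, ...).
          // 2. Act: create or update the SQL ConfigMap and FlinkDeployment children.
          // 3. Report: publish the observed state on the CR's status.
          FlinkSQLJobStatus status = new FlinkSQLJobStatus();
          status.state = "Running";  // simplified; a real loop would inspect the child FlinkDeployment
          job.setStatus(status);
          return UpdateControl.patchStatus(job);
        }
      }
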
  22. Deploying a FlinkSQLJob
      $ kubectl apply -f wordcount.yaml
      $ kubectl get flinksqljob wordcount -o yaml
      # Output:
      apiVersion: rta/v1beta1
      kind: FlinkSQLJob
      metadata: ...
      spec: ...
      status:
        # The current state of the job.
        state: Running

  23. What did a human need to do?
      • Write SQL in a YAML file
      • Deploy the YAML → Flink SQL Job operator steps in → Flink K8s operator steps in → Flink job running!

  24. Where do we get this stuff?
      apiVersion: rta/v1alpha1
      kind: FlinkSQLJob
      metadata:
        name: itemcounts
      spec:
        sql: |
          CREATE TABLE source (
            user_id STRING,
            item_id STRING,
            category_id STRING,
            behavior STRING,
            ts TIMESTAMP(3)
          ) WITH (
            'connector' = 'kafka',
            'topic' = 'user_behavior',
            'properties.bootstrap.servers' = 'kafka:9092',
            'format' = 'json'
          );
          CREATE TABLE sink (
            user_id STRING,
            item_count BIGINT
          ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:mysql://mysql:3306/flink',
            'table-name' = 'user_item_count',
            'username' = 'root',
            'password' = ''
          );
          INSERT INTO sink
          SELECT user_id, COUNT(DISTINCT item_id) AS item_count
          FROM source
          WHERE behavior = 'buy'
          GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
        parallelism: 2

  25. FlinkTable Custom Resource
      apiVersion: rta/v1alpha1
      kind: FlinkTable
      metadata:
        name: UserBehavior
      spec:
        ddl: |
          CREATE TABLE UserBehavior (
            user_id STRING,
            item_id STRING,
            category_id STRING,
            behavior STRING,
            ts TIMESTAMP(3)
          ) WITH (
            'connector' = 'kafka',
            'topic' = 'user_behavior',
            'properties.bootstrap.servers' = 'kafka:9092',
            'format' = 'json'
          );
      ---
      apiVersion: rta/v1alpha1
      kind: FlinkTable
      metadata:
        name: UserItemCounts
      spec:
        ddl: |
          CREATE TABLE UserItemCounts (
            user_id STRING,
            item_count BIGINT
          ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:mysql://mysql:3306/flink',
            'table-name' = 'user_item_count',
            'username' = 'root',
            'password' = ''
          );

  26. Reference FlinkTables in FlinkSQLJobs
      apiVersion: rta/v1alpha1
      kind: FlinkSQLJob
      metadata:
        name: itemcounts
      spec:
        tables:
          - UserItemCounts
          - UserBehavior
        sql: |
          INSERT INTO UserItemCounts
          SELECT user_id, COUNT(DISTINCT item_id) AS item_count
          FROM UserBehavior
          WHERE behavior = 'buy'
          GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;

  27. flink-sql-job-operator
      [diagram: FlinkSQLJob (CR YAML) plus two FlinkTable (CR YAML) resources → flink-sql-job-operator → DDL+SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator]

  28. What happens when you change or delete a FlinkTable?
      [diagram: same chain as slide 27 — FlinkSQLJob (CR YAML) + FlinkTable (CR YAML) ×2 → flink-sql-job-operator → DDL+SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator]

  29. What happens when you delete a FlinkSQLJob?
      [diagram: same chain as slide 27, now with an ownerReferences link tying a generated resource back to the FlinkSQLJob]

  30. What happens when you delete a FlinkSQLJob?
      [diagram: same chain; ownerReferences on both the SQL script (ConfigMap) and the FlinkDeployment point back to the FlinkSQLJob, so deleting the FlinkSQLJob garbage-collects them]

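Stamping that ownerReference on each generated child is a few lines per resource. The sketch below assumes the fabric8 model classes; once the reference is in place, deleting the FlinkSQLJob lets Kubernetes garbage-collect the ConfigMap and FlinkDeployment it produced.

      // Sketch: mark a generated child resource as owned by its parent (fabric8 model assumed).
      import io.fabric8.kubernetes.api.model.HasMetadata;
      import io.fabric8.kubernetes.api.model.OwnerReference;
      import io.fabric8.kubernetes.api.model.OwnerReferenceBuilder;

      import java.util.List;

      public class OwnerRefs {

        static void ownChild(HasMetadata parent, HasMetadata child) {
          OwnerReference ref = new OwnerReferenceBuilder()
              .withApiVersion(parent.getApiVersion())   // e.g. rta/v1alpha1
              .withKind(parent.getKind())               // e.g. FlinkSQLJob
              .withName(parent.getMetadata().getName())
              .withUid(parent.getMetadata().getUid())
              .withController(true)
              .withBlockOwnerDeletion(true)
              .build();
          child.getMetadata().setOwnerReferences(List.of(ref));
        }
      }
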
  31. Why do we need to write these by hand?
      apiVersion: rta/v1alpha1
      kind: FlinkTable
      metadata:
        name: UserBehavior
      spec:
        ddl: |
          CREATE TABLE UserBehavior (
            user_id STRING,
            item_id STRING,
            category_id STRING,
            behavior STRING,
            ts TIMESTAMP(3)
          ) WITH (
            'connector' = 'kafka',
            'topic' = 'user_behavior',
            'properties.bootstrap.servers' = 'kafka:9092',
            'format' = 'json'
          );
      ---
      apiVersion: rta/v1alpha1
      kind: FlinkTable
      metadata:
        name: UserItemCounts
      spec:
        ddl: |
          CREATE TABLE UserItemCounts (
            user_id STRING,
            item_count BIGINT
          ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:mysql://mysql:3306/flink',
            'table-name' = 'user_item_count',
            'username' = 'root',
            'password' = ''
          );

  32. Why do we need to write these by hand?
      apiVersion: rta/v1alpha1
      kind: FlinkTable
      metadata:
        name: ImportantMoneyStuff
      spec:
        ddl: |
          CREATE TABLE ImportantMoneyStuff (
            account STRING,
            moneys BIGINT
          ) WITH (
            'connector' = 'pinot',
            ...
          );
      [diagram: how does this FlinkTable relate to the actual Pinot table?]

  33. Reference tables in the catalog
      apiVersion: rta/v1alpha1
      kind: FlinkSQLJob
      metadata:
        name: itemcounts
      spec:
        datasets:
          - urn:li:dataset:(urn:li:dataPlatform:pinot,project.dataset.UserItemCounts,PROD)
          - urn:li:dataset:(urn:li:dataPlatform:kafka,project.dataset.UserBehavior,PROD)
        sql: |
          INSERT INTO UserItemCounts
          SELECT user_id, COUNT(DISTINCT item_id) AS item_count
          FROM UserBehavior
          WHERE behavior = 'buy'
          GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;

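One way the operator could use those URNs (purely illustrative; the Catalog interface below is invented for the sketch and would wrap a metadata service such as DataHub) is to render a CREATE TABLE statement for each referenced dataset and prepend it to the job's SQL:

      // Hypothetical sketch: expand catalog URNs into Flink DDL before submitting the SQL script.
      import java.util.List;
      import java.util.stream.Collectors;

      interface Catalog {
        // Look up a dataset by URN and render a Flink CREATE TABLE statement for it (invented API).
        String ddlFor(String datasetUrn);
      }

      public class SqlScriptAssembler {

        // Prepend generated DDL for each referenced dataset to the user's SQL.
        static String assemble(Catalog catalog, List<String> datasetUrns, String sql) {
          String ddl = datasetUrns.stream()
              .map(catalog::ddlFor)
              .collect(Collectors.joining("\n"));
          return ddl + "\n" + sql;
        }
      }
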
  34. Data plane operators
      [diagram: flink-sql-job-operator handles FlinkSQLJob (CR YAML); pinot-table-operator handles PinotTable (CR YAML) for Pinot; kafka-topic-operator handles KafkaTopic (CR YAML) for Kafka; metadata lives in DataHub]

  35. Data plane YAML
      apiVersion: rta/v1beta1
      kind: KafkaTopic
      metadata:
        name: my-topic
        namespace: rta
        ownerReferences:
          - name: kafka-cluster
            kind: KafkaCluster
            uid: abc123...
          - name: my-table
            kind: PinotTable
            uid: abc123...
      spec:
        clusterRef:
          name: kafka-cluster
        name: my-topic
        partitions: 3
      ---
      apiVersion: rta/v1alpha1
      kind: PinotTable
      metadata:
        name: my-table
        namespace: rta
      spec:
        tableType: REALTIME
        schemaRef: my-schema
        kafkaRef:
          name: my-topic

  36. Data plane operators
      [diagram: same as slide 34 — flink-sql-job-operator with FlinkSQLJob (CR YAML), pinot-table-operator with PinotTable (CR YAML) for Pinot, kafka-topic-operator with KafkaTopic (CR YAML) for Kafka, and metadata in DataHub]

  37. Pipeline YAML
      apiVersion: rta/v1beta1
      kind: Pipeline
      metadata:
        name: my-pipeline
        namespace: rta
      spec:
        resources:
          - kind: KafkaTopic
            name: my-topic
          - kind: PinotTable
            name: my-table
          - kind: AvroSchema
            name: my-schema
        sql: |
          INSERT INTO my-topic ...

  38. Pipeline operator
      [diagram: Pipeline (CR YAML) → pipeline-operator → KafkaTopic (CR YAML) + PinotTable (CR YAML) + Schema (CR YAML) + FlinkSQLJob (CR YAML) → flink-sql-job-operator → SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator]

  39. End-to-end ownership
      [diagram: the same chain as slide 38, with the Pipeline owning every resource generated beneath it, down to the FlinkDeployment]

  40. Multiple owners
      [diagram: several Pipeline (CR YAML) resources sharing the same generated KafkaTopic, PinotTable, Schema, and FlinkSQLJob resources through the pipeline-operator and flink-sql-job-operator chain]
