Slide 1

Deeply Declarative Data Pipelines
Ryanne Dolan, Snr Staff SWE, LinkedIn

Slide 2

What are data pipelines?
● transfer data from sources to sinks
● replicating (e.g. MirrorMaker)
● filtering, routing, joining streams (e.g. Flink)
● ETL, rETL (e.g. Kafka Connect, Gobblin)
● CDC (e.g. Debezium, Brooklin)
● includes batch, but we'll focus on streaming

Slide 3

Deeply Declarative?
Explicit ↔ Implicit
Concrete ↔ Abstract
Imperative ↔ Declarative
General-purpose ↔ Domain-specific
Internally simple ↔ Internally complex
Hard ↔ Easy
Examples: ASM, SQL, YAML, Java

Slide 4

1. Preheat the oven to 450°
2. Mix flour and water
...

Slide 5

When do we need data pipelines?
(diagram: online operational database → data pipeline → analytics database)

Slide 6

Data pipelines as connectors
(diagram: MySQL → MySQL source connector (CDC) → Kafka → HDFS sink connector (ETL) → HDFS, running on Kafka Connect)

Slide 7

Connectors as configuration
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.include.list": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}

Slide 8

Simple transformation pipelines
(diagram: MySQL → MySQL source connector (CDC) → Kafka → HDFS sink connector (ETL) → HDFS, with Filter (SMT) and Flatten (SMT) applied in the connectors)

Slide 9

Transformations as configuration
{
  "transforms": "Filter,Flatten",
  "transforms.Filter.type": "org.apache.kafka.connect.transforms.Filter",
  "transforms.Filter.predicate": "IsMuj",
  "transforms.Flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
  "transforms.Flatten.delimiter": "_",
  "predicates": "IsMuj",
  "predicates.IsMuj.type": "io.confluent.connect.transforms.predicates.FieldValuePredicate",
  "predicates.IsMuj.field": "$.name",
  "predicates.IsMuj.value": "\"muj\"",
  ...
}

Slide 10

Data pipelines as configuration
(diagram: the same pipeline, MySQL → MySQL source connector (CDC) → Kafka → HDFS sink connector (ETL) → HDFS with Filter (SMT) and Flatten (SMT), where each connector is defined by a JSON config)

Slide 11

Complex data pipelines
(diagram: online operational database, event stream, change stream → filter and join → view → analytics database)

Slide 12

Apache Kafka Connect and Kafka Streams
(diagram: online operational database → MySQL source connector (CDC) + Filter (SMT) → Kafka → Kafka Streams app → Kafka → analytics database)
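The Kafka Streams app in this diagram is ordinary application code that someone has to write, build, and deploy. A minimal sketch of what such an app might look like, assuming made-up topic names and a toy filter over the CDC records (not code from the talk):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// Minimal sketch of the "Kafka Streams app" box: read CDC records from one topic,
// keep only the interesting ones, and write them to a downstream topic that feeds
// the analytics database. Topic names and the filter predicate are illustrative only.
public class FilterApp {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cdc-filter");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> changes = builder.stream("dbserver1.inventory.accounts");
    changes
        .filter((key, value) -> value != null && !value.contains("\"op\":\"d\"")) // drop deletes
        .to("accounts-filtered");                                                 // consumed downstream

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}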

Slide 13

Apache Flink
(diagram: DB → change stream → Flink job ← event stream; Flink job → upserts → DB)

Slide 14

Data pipelines power important stuff
(diagram: advertiser accounts + ad click events → very important data: who do we charge? how much do we charge? new accounts, money stuff)

Slide 15

How can we make data pipelines easy?

Slide 16

How can we build new data pipelines with as little one-off code as possible?

Slide 17

Stream processing with Apache Flink
(diagram: DB → change stream → Flink ← event stream; Flink → upserts → DB)

Slide 18

Stream processing with the Flink DataStream API
// Join the two datastreams on the ad id field using a time window of one hour and a custom join function
DataStream<Tuple2<AdvertiserAccountCDC, AdClickEvent>> joined = accounts
    .join(clicks)
    .where(account -> account.id)                        // Key selector for accounts
    .equalTo(click -> click.adId)                        // Key selector for clicks
    .window(TumblingEventTimeWindows.of(Time.hours(1)))  // Window assigner for joining streams based on event time
    .apply(new JoinFunction<AdvertiserAccountCDC, AdClickEvent, Tuple2<AdvertiserAccountCDC, AdClickEvent>>() {
      @Override
      public Tuple2<AdvertiserAccountCDC, AdClickEvent> join(AdvertiserAccountCDC account, AdClickEvent click) {
        return new Tuple2<>(account, click); // Return a tuple of account and click
      }
    });

Slide 19

Flink Table API introduces connectors
TableFactoryService.find(StreamTableSourceFactory.class, new HashMap<>())
    .connector(new DebeziumDynamicTableSourceFactory()) // use Debezium as the connector
    .option("connector.hostname", "localhost")
    .option("connector.port", "3306")
    .option("connector.username", "flinkuser")
    .option("connector.password", "flinkpw")
    .option("connector.database-name", "inventory")     // monitor all tables under inventory database
    .format(new JsonFormatFactory())                     // use Json format factory
    .option("json-schema", "...")                        // specify the JSON schema of the records
    .option("json-fail-on-missing-field", true)          // fail if a field is missing
    .inAppendMode()
    .schema(...)                                         // specify the table schema of the source
    .createTemporaryTable("cdc_source");                 // register the source as a temporary table

// create a table from the source descriptor
Table cdcTable = tableEnv.from("cdc_source");
// filter out delete operations from the table
Table filteredTable = cdcTable.where("!op.equals('d')"); // keep only insert and update operations
// print the filtered table
filteredTable.execute().print();

Slide 20

Flink SQL API
-- create a CDC source table for inventory database
CREATE TABLE inventory (
  id INT NOT NULL,
  name STRING,
  description STRING,
  amount INT
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flinkuser',
  'password' = 'flinkpw',
  'database-name' = 'inventory_db',
  'table-name' = 'inventory'
);

-- filter out delete operations from the table
SELECT * FROM inventory WHERE op != 'd';

Data Definition Language / Structured Query Language

Slide 21

Flink SQL for data pipelines
INSERT INTO stock          -- sink
SELECT * FROM inventory    -- source
WHERE quantity > 0;        -- transformation

Slide 22

Flink SQL API
// create a TableEnvironment for specific planner batch or streaming
TableEnvironment tableEnv = ...; // see planners section

// register a MySQL table named "products" in Flink SQL
String sourceDDL = "CREATE TABLE products (" +
    " id INT NOT NULL," +
    " name STRING," +
    " description STRING," +
    " weight DECIMAL(10,3)" +
    ") WITH (" +
    " 'connector' = 'mysql-cdc'," +
    " 'hostname' = 'localhost'," +
    " 'port' = '3306'," +
    " 'username' = 'flinkuser'," +
    " 'password' = 'flinkpw'," +
    " 'database-name' = 'inventory'," +
    " 'table-name' = 'products'" + // monitor table `products` in database `inventory`
    ")";
tableEnv.executeSql(sourceDDL);

// define a dynamic aggregating query
String query = "SELECT id, name, SUM(weight) as total_weight FROM products GROUP BY id, name";

// convert the Table API query result into DataStream
DataStream<Row> resultStream = tableEnv.toRetractStream(tableEnv.sqlQuery(query), Row.class)
    .filter(x -> x.f0) // only emit update changes (true) and filter out delete changes (false)
    .map(x -> x.f1);   // drop change flag

// print the result stream
resultStream.print();

Slide 23

FlinkDeployment YAML with the Flink K8s Operator
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic-example
spec:
  image: flink:1.16
  flinkVersion: v1_16
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: stateless

https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.4/examples/basic.yaml

Slide 24

What did a human need to do?
● Write Java and compile a jar
● Store the jar somewhere (e.g. in Docker image)
● Write FlinkDeployment YAML
● Deploy the YAML
→ Flink K8s operator steps in → Flink job running!

Slide 25

We can do better.

Slide 26

Level 1: Flink SQL in YAML
● "Flink SQL Runner" application
● FlinkSqlJob CRD
● New k8s operator

Slide 27

Flink SQL Runner
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlRunner {
  public static void main(String[] args) {
    // check the arguments
    if (args.length != 1) {
      System.err.println("Usage: FlinkSqlRunner <sql-file>");
      return;
    }
    // get the sql file name
    String sqlFile = args[0];
    // create a streaming execution environment
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // create a table environment with blink planner in streaming mode
    EnvironmentSettings settings = EnvironmentSettings.newInstance()
        .useBlinkPlanner()
        .inStreamingMode()
        .build();
    TableEnvironment tableEnv = TableEnvironment.create(settings);
    // read the sql file line by line and execute each statement
    try (BufferedReader reader = new BufferedReader(new FileReader(sqlFile))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // skip empty or comment lines
        if (line.trim().isEmpty() || line.startsWith("--")) {
          continue;
        }
        // execute the statement and print the result if any
        tableEnv.executeSql(line);
      }
    } catch (IOException e) {
      throw new RuntimeException("Failed to read " + sqlFile, e);
    }
  }
}

Slide 28

FlinkSQLJob Custom Resource
apiVersion: rta/v1alpha1
kind: FlinkSQLJob
metadata:
  name: itemcount
spec:
  sql: |
    CREATE TABLE source (
      user_id STRING,
      item_id STRING,
      category_id STRING,
      behavior STRING,
      ts TIMESTAMP(3)
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'user_behavior',
      'properties.bootstrap.servers' = 'kafka:9092',
      'format' = 'json'
    );
    CREATE TABLE sink (
      user_id STRING,
      item_count BIGINT
    ) WITH (
      'connector' = 'jdbc',
      'url' = 'jdbc:mysql://mysql:3306/flink',
      'table-name' = 'user_item_count',
      'username' = 'root',
      'password' = ''
    );
    INSERT INTO sink
    SELECT user_id, COUNT(DISTINCT item_id) AS item_count
    FROM source
    WHERE behavior = 'buy'
    GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
  parallelism: 2

Slide 29

flink-sql-job-operator
(diagram: FlinkSQLJob (CR YAML) → flink-sql-job-operator → SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator → JobMgr deployment (native YAML), Job deployment (native YAML), etc., on Kubernetes)
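Conceptually, the flink-sql-job-operator just translates one declarative resource into two others. A rough sketch of that translation, assuming a hypothetical runner image, jar path, and ConfigMap mount point (this is not the actual operator code):

// Rough sketch of the translation flink-sql-job-operator performs: given the SQL from a
// FlinkSQLJob CR, emit a ConfigMap holding the script and a FlinkDeployment that runs the
// Flink SQL Runner jar against it. Image, jar path, and mount path are hypothetical.
public class FlinkSqlJobTranslator {

  static String configMapYaml(String jobName, String sql) {
    return String.join("\n",
        "apiVersion: v1",
        "kind: ConfigMap",
        "metadata:",
        "  name: " + jobName + "-sql",
        "data:",
        "  job.sql: |",
        indent(sql, "    "));
  }

  static String flinkDeploymentYaml(String jobName, int parallelism) {
    return String.join("\n",
        "apiVersion: flink.apache.org/v1beta1",
        "kind: FlinkDeployment",
        "metadata:",
        "  name: " + jobName,
        "spec:",
        "  image: example/flink-sql-runner:latest   # hypothetical runner image",
        "  flinkVersion: v1_16",
        "  serviceAccount: flink",
        "  job:",
        "    jarURI: local:///opt/flink/usrlib/flink-sql-runner.jar",
        "    args: [\"/etc/flink-sql/job.sql\"]     # where the ConfigMap is mounted",
        "    parallelism: " + parallelism,
        "    upgradeMode: stateless");
        // (volume mount for the ConfigMap omitted for brevity)
  }

  private static String indent(String text, String prefix) {
    return prefix + text.replace("\n", "\n" + prefix);
  }
}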

Slide 30

Operators are made up of control loops
(diagram: controller 1, controller 2, controller 3, each watching YAML resources in Kubernetes)
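Each controller in this picture is a reconcile loop: read the desired state from YAML, read the actual state of the world, and act to close the gap. A minimal, framework-free sketch of that pattern, with hypothetical type parameters and suppliers standing in for real watches:

import java.util.function.Supplier;

// Minimal sketch of the control-loop pattern behind every operator shown here. The
// Desired/Actual types and suppliers are hypothetical stand-ins; real operators use
// watches and frameworks (e.g. java-operator-sdk), not a sleep-based poll.
public class ControlLoop<D, A> implements Runnable {

  public interface Reconciler<D, A> {
    void reconcile(D desired, A actual) throws Exception; // create/update/delete children
  }

  private final Supplier<D> desiredState;  // e.g. read the CR YAML from the API server
  private final Supplier<A> actualState;   // e.g. list the resources that currently exist
  private final Reconciler<D, A> reconciler;

  public ControlLoop(Supplier<D> desiredState, Supplier<A> actualState, Reconciler<D, A> reconciler) {
    this.desiredState = desiredState;
    this.actualState = actualState;
    this.reconciler = reconciler;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        reconciler.reconcile(desiredState.get(), actualState.get());
      } catch (Exception e) {
        e.printStackTrace(); // errors are expected; try again on the next tick
      }
      try {
        Thread.sleep(10_000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}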

Slide 31

Deploying a FlinkSQLJob
$ kubectl apply -f wordcount.yaml
$ kubectl get flinksqljob wordcount -o yaml
# Output:
apiVersion: rta/v1beta1
kind: FlinkSQLJob
metadata: ...
spec: ...
status:
  # The current state of the job.
  state: Running

Slide 32

What did a human need to do?
● Write SQL in a YAML file
● Deploy the YAML
→ Flink SQL Job operator steps in → Flink K8s operator steps in → Flink job running!

Slide 33

Where do we get this stuff?
apiVersion: rta/v1alpha1
kind: FlinkSQLJob
metadata:
  name: itemcounts
spec:
  sql: |
    CREATE TABLE source (
      user_id STRING,
      item_id STRING,
      category_id STRING,
      behavior STRING,
      ts TIMESTAMP(3)
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'user_behavior',
      'properties.bootstrap.servers' = 'kafka:9092',
      'format' = 'json'
    );
    CREATE TABLE sink (
      user_id STRING,
      item_count BIGINT
    ) WITH (
      'connector' = 'jdbc',
      'url' = 'jdbc:mysql://mysql:3306/flink',
      'table-name' = 'user_item_count',
      'username' = 'root',
      'password' = ''
    );
    INSERT INTO sink
    SELECT user_id, COUNT(DISTINCT item_id) AS item_count
    FROM source
    WHERE behavior = 'buy'
    GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
  parallelism: 2

Slide 34

Level 2: Table YAML
● FlinkTable CRD
● smarter operator

Slide 35

FlinkTable Custom Resource
apiVersion: rta/v1alpha1
kind: FlinkTable
metadata:
  name: UserBehavior
spec:
  ddl: |
    CREATE TABLE UserBehavior (
      user_id STRING,
      item_id STRING,
      category_id STRING,
      behavior STRING,
      ts TIMESTAMP(3)
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'user_behavior',
      'properties.bootstrap.servers' = 'kafka:9092',
      'format' = 'json'
    );
---
apiVersion: rta/v1alpha1
kind: FlinkTable
metadata:
  name: UserItemCounts
spec:
  ddl: |
    CREATE TABLE UserItemCounts (
      user_id STRING,
      item_count BIGINT
    ) WITH (
      'connector' = 'jdbc',
      'url' = 'jdbc:mysql://mysql:3306/flink',
      'table-name' = 'user_item_count',
      'username' = 'root',
      'password' = ''
    );

Slide 36

Reference FlinkTables in FlinkSQLJobs
apiVersion: rta/v1alpha1
kind: FlinkSQLJob
metadata:
  name: itemcounts
spec:
  tables:
    - UserItemCounts
    - UserBehavior
  sql: |
    INSERT INTO UserItemCounts
    SELECT user_id, COUNT(DISTINCT item_id) AS item_count
    FROM UserBehavior
    WHERE behavior = 'buy'
    GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
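The smarter operator at this level presumably resolves each referenced FlinkTable to its DDL and prepends it to the job's SQL before handing the script to the runner. A rough sketch of that assembly step, with a hypothetical ddlForTable lookup standing in for reading the FlinkTable CRs:

import java.util.List;
import java.util.function.Function;

// Rough sketch of how a Level 2 operator could assemble the script for the SQL runner:
// prepend the DDL of each referenced FlinkTable to the job's SQL statement.
public class SqlScriptAssembler {
  public static String assemble(List<String> tableRefs,
                                String jobSql,
                                Function<String, String> ddlForTable) {
    StringBuilder script = new StringBuilder();
    for (String table : tableRefs) {
      script.append(ddlForTable.apply(table)).append("\n"); // e.g. CREATE TABLE UserBehavior (...) WITH (...);
    }
    script.append(jobSql);                                   // e.g. INSERT INTO UserItemCounts SELECT ...
    return script.toString();
  }
}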

Slide 37

flink-sql-job-operator
(diagram: FlinkTable (CR YAML) x2 + FlinkSQLJob (CR YAML) → flink-sql-job-operator → DDL+SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator)

Slide 38

What happens when you change or delete a FlinkTable?
(diagram: FlinkTable (CR YAML) x2 + FlinkSQLJob (CR YAML) → flink-sql-job-operator → DDL+SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator)

Slide 39

What happens when you delete a FlinkSQLJob?
(diagram: as above, now showing an ownerReferences link from a generated resource back to the FlinkSQLJob)

Slide 40

What happens when you delete a FlinkSQLJob?
(diagram: as above, with ownerReferences on both the SQL script (ConfigMap) and the FlinkDeployment pointing back to the FlinkSQLJob)

Slide 41

Level 3: Automatic Table Registration
● smarter data plane
● centralized catalog

Slide 42

Why do we need to write these by hand?
apiVersion: rta/v1alpha1
kind: FlinkTable
metadata:
  name: UserBehavior
spec:
  ddl: |
    CREATE TABLE UserBehavior (
      user_id STRING,
      item_id STRING,
      category_id STRING,
      behavior STRING,
      ts TIMESTAMP(3)
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'user_behavior',
      'properties.bootstrap.servers' = 'kafka:9092',
      'format' = 'json'
    );
---
apiVersion: rta/v1alpha1
kind: FlinkTable
metadata:
  name: UserItemCounts
spec:
  ddl: |
    CREATE TABLE UserItemCounts (
      user_id STRING,
      item_count BIGINT
    ) WITH (
      'connector' = 'jdbc',
      'url' = 'jdbc:mysql://mysql:3306/flink',
      'table-name' = 'user_item_count',
      'username' = 'root',
      'password' = ''
    );

Slide 43

Why do we need to write these by hand?
apiVersion: rta/v1alpha1
kind: FlinkTable
metadata:
  name: ImportantMoneyStuff
spec:
  ddl: |
    CREATE TABLE ImportantMoneyStuff (
      account STRING,
      moneys BIGINT
    ) WITH (
      'connector' = 'pinot',
      ...
    );
(diagram: Pinot Table ?)

Slide 44

Metadata catalogs
(diagram: the data plane (Pinot, MySQL, Kafka) publishes metadata to DataHub; the flink-sql-job-operator consults DataHub when handling a FlinkSQLJob (CR YAML))

Slide 45

Reference tables in the catalog
apiVersion: rta/v1alpha1
kind: FlinkSQLJob
metadata:
  name: itemcounts
spec:
  datasets:
    - urn:li:dataset:(urn:li:dataPlatform:pinot,project.dataset.UserItemCounts,PROD)
    - urn:li:dataset:(urn:li:dataPlatform:kafka,project.dataset.UserBehavior,PROD)
  sql: |
    INSERT INTO UserItemCounts
    SELECT user_id, COUNT(DISTINCT item_id) AS item_count
    FROM UserBehavior
    WHERE behavior = 'buy'
    GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
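With the catalog in the loop, the operator can generate each CREATE TABLE statement from dataset metadata instead of asking a human for it. A rough sketch of that idea; DatasetMetadata and its lookup are hypothetical stand-ins, not the DataHub client API:

import java.util.Map;
import java.util.stream.Collectors;

// Rough sketch of Level 3: derive Flink DDL from catalog metadata instead of writing it
// by hand. The DatasetMetadata shape (and however it gets populated from DataHub) is a
// hypothetical stand-in for illustration.
public class DdlGenerator {

  public record DatasetMetadata(String tableName,
                                Map<String, String> columns,        // column name -> Flink SQL type
                                Map<String, String> connectorProps) // e.g. connector, topic, format
  {}

  public static String toCreateTable(DatasetMetadata md) {
    String columnList = md.columns().entrySet().stream()
        .map(e -> "  " + e.getKey() + " " + e.getValue())
        .collect(Collectors.joining(",\n"));
    String withClause = md.connectorProps().entrySet().stream()
        .map(e -> "  '" + e.getKey() + "' = '" + e.getValue() + "'")
        .collect(Collectors.joining(",\n"));
    return "CREATE TABLE " + md.tableName() + " (\n" + columnList + "\n) WITH (\n" + withClause + "\n);";
  }
}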

Slide 46

Level 4: Declarative infra
● declarative data plane
● even more operators

Slide 47

Data plane operators
(diagram: FlinkSQLJob (CR YAML) → flink-sql-job-operator; PinotTable (CR YAML) → pinot-table-operator → Pinot; KafkaTopic (CR YAML) → kafka-topic-operator → Kafka; metadata flowing to DataHub)

Slide 48

Data plane YAML
apiVersion: rta/v1beta1
kind: KafkaTopic
metadata:
  name: my-topic
  namespace: rta
  ownerReferences:
    - name: kafka-cluster
      kind: KafkaCluster
      uid: abc123...
    - name: my-table
      kind: PinotTable
      uid: abc123...
spec:
  clusterRef:
    name: kafka-cluster
  name: my-topic
  partitions: 3
---
apiVersion: rta/v1alpha1
kind: PinotTable
metadata:
  name: my-table
  namespace: rta
spec:
  tableType: REALTIME
  schemaRef: my-schema
  kafkaRef:
    name: my-topic
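A kafka-topic-operator reconciling a KafkaTopic CR ultimately drives Kafka's admin API toward the declared state. A rough sketch of that reconcile step using Kafka's AdminClient; error handling, partition expansion, and config updates are omitted, and the replication factor here is an assumption:

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

// Rough sketch of the reconcile step a kafka-topic-operator might perform for a
// KafkaTopic CR: ensure the topic exists with the declared partition count.
public class KafkaTopicReconciler {
  private final AdminClient admin;

  public KafkaTopicReconciler(String bootstrapServers) {
    Properties props = new Properties();
    props.put("bootstrap.servers", bootstrapServers);
    this.admin = AdminClient.create(props);
  }

  public void reconcile(String topicName, int partitions) throws Exception {
    Set<String> existing = admin.listTopics().names().get();
    if (!existing.contains(topicName)) {
      // desired state says the topic should exist; create it (replication factor assumed)
      admin.createTopics(java.util.List.of(new NewTopic(topicName, partitions, (short) 3))).all().get();
    }
    // if it already exists, a fuller implementation would compare partition count, configs, etc.
  }
}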

Slide 49

Data plane operators
(diagram: FlinkSQLJob (CR YAML) → flink-sql-job-operator; PinotTable (CR YAML) → pinot-table-operator → Pinot; KafkaTopic (CR YAML) → kafka-topic-operator → Kafka; metadata flowing to DataHub)

Slide 50

Level 5: End-to-End Ownership
● pipeline operator
● multiple owners

Slide 51

Pipeline YAML
apiVersion: rta/v1beta1
kind: Pipeline
metadata:
  name: my-pipeline
  namespace: rta
spec:
  resources:
    - kind: KafkaTopic
      name: my-topic
    - kind: PinotTable
      name: my-table
    - kind: AvroSchema
      name: my-schema
  sql: |
    INSERT INTO my-topic ...
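The pipeline operator is mostly fan-out: for each declared resource it ensures a child CR exists and is owned by the Pipeline, so end-to-end cleanup falls out of Kubernetes garbage collection. A rough, framework-free sketch, where ResourceClient and the generated FlinkSQLJob name are hypothetical stand-ins:

import java.util.List;
import java.util.Map;

// Rough sketch of the fan-out a pipeline-operator might perform for a Pipeline CR:
// create or update each child custom resource (KafkaTopic, PinotTable, AvroSchema,
// FlinkSQLJob) and stamp it with an ownerReference pointing back at the Pipeline.
public class PipelineReconciler {

  public interface ResourceClient {
    void createOrUpdate(String kind, String name, Map<String, Object> spec, String ownerUid);
  }

  public record ResourceRef(String kind, String name) {}

  private final ResourceClient client;

  public PipelineReconciler(ResourceClient client) {
    this.client = client;
  }

  public void reconcile(String pipelineUid, List<ResourceRef> resources, String sql) {
    for (ResourceRef ref : resources) {
      // each child CR is owned by the Pipeline, so deleting the Pipeline cleans it up
      client.createOrUpdate(ref.kind(), ref.name(), Map.of(), pipelineUid);
    }
    // the SQL itself becomes a FlinkSQLJob, also owned by the Pipeline (name is illustrative)
    client.createOrUpdate("FlinkSQLJob", "pipeline-sql", Map.of("sql", sql), pipelineUid);
  }
}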

Slide 52

Pipeline operator
(diagram: Pipeline (CR YAML) → pipeline-operator → KafkaTopic (CR YAML), PinotTable (CR YAML), Schema (CR YAML), FlinkSQLJob (CR YAML); FlinkSQLJob → flink-sql-job-operator → SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator)

Slide 53

End-to-end ownership
(diagram: the same chain, Pipeline (CR YAML) → pipeline-operator → KafkaTopic, PinotTable, Schema, and FlinkSQLJob CRs → flink-sql-job-operator → SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator)

Slide 54

Multiple owners
(diagram: several Pipeline (CR YAML) resources → pipeline-operator → shared KafkaTopic, PinotTable, Schema, and FlinkSQLJob CRs → flink-sql-job-operator → SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator)

Slide 55

Thank you!
@dolanRyanne
in/RyanneDolan
All code from Bing Chat
All images from OpenAI Dall-E