Deeply Declarative Data Pipelines (Ryanne Dolan, LinkedIn) | RTA Summit 2023

With Flink and Kubernetes, it’s possible to deploy stream processing jobs with just SQL and YAML. This low-code approach can certainly save a lot of development time. However, there is more to data pipelines than just streaming SQL. We must wire up many different systems, thread schemas through them, and, worst of all, write a lot of configuration.

In this talk, we’ll explore just how “declarative” we can make streaming data pipelines on Kubernetes. I’ll show how we can go deeper by adding more and more operators to the stack. How deep can we go?

StarTree
May 23, 2023

Transcript

  1. Deeply Declarative Data
    Pipelines
    Ryanne Dolan
    Snr Staff SWE, LinkedIn


  2. What are data pipelines?
    ● transfer data from sources to sinks
    ● replicating (e.g. MirrorMaker)
    ● filtering, routing, joining streams (e.g. Flink)
    ● ETL, rETL (e.g. Kafka Connect, Gobblin)
    ● CDC (e.g. Debezium, Brooklin)
    ● includes batch, but we'll focus on streaming


  3. Deeply Declarative?
Imperative ↔ Declarative:
     ● Explicit ↔ Implicit
     ● Concrete ↔ Abstract
     ● General-purpose ↔ Domain-specific
     ● Internally simple ↔ Internally complex
     ● Hard ↔ Easy
     Spectrum: ASM → Java → SQL → YAML


  4. 1. Preheat the oven to 450°
    2. Mix flour and water
    ...


  5. When do we need data pipelines?
[Diagram] online operational database → data pipeline → analytics database


  6. Data pipelines as connectors
[Diagram] MySQL → MySQL source connector (CDC) → Kafka → HDFS sink connector (ETL) → HDFS, all running on Kafka Connect


  7. Connectors as configuration
{
       "name": "inventory-connector",
       "config": {
         "connector.class": "io.debezium.connector.mysql.MySqlConnector",
         "tasks.max": "1",
         "database.hostname": "mysql",
         "database.port": "3306",
         "database.user": "debezium",
         "database.password": "dbz",
         "database.server.id": "184054",
         "database.server.name": "dbserver1",
         "database.include.list": "inventory",
         "database.history.kafka.bootstrap.servers": "kafka:9092",
         "database.history.kafka.topic": "schema-changes.inventory"
       }
     }
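     Deploying a connector is then just an HTTP call. A minimal sketch in Java, assuming a Kafka Connect REST endpoint at connect:8083 and the JSON above saved locally (the file name is hypothetical):

     import java.net.URI;
     import java.net.http.HttpClient;
     import java.net.http.HttpRequest;
     import java.net.http.HttpResponse;
     import java.nio.file.Files;
     import java.nio.file.Path;

     public class DeployConnector {
       public static void main(String[] args) throws Exception {
         // read the connector config JSON from a local file
         String config = Files.readString(Path.of("inventory-connector.json"));
         // POST it to the Kafka Connect REST API to create the connector
         HttpRequest request = HttpRequest.newBuilder(URI.create("http://connect:8083/connectors"))
             .header("Content-Type", "application/json")
             .POST(HttpRequest.BodyPublishers.ofString(config))
             .build();
         HttpResponse<String> response = HttpClient.newHttpClient()
             .send(request, HttpResponse.BodyHandlers.ofString());
         System.out.println(response.statusCode() + " " + response.body());
       }
     }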


  8. Simple transformation pipelines
[Diagram] MySQL → MySQL source connector (CDC) → Filter (SMT) → Flatten (SMT) → Kafka → HDFS sink connector (ETL) → HDFS


  9. Transformations as configuration
{
       "transforms": "Filter,Flatten",
       "transforms.Filter.type": "org.apache.kafka.connect.transforms.Filter",
       "transforms.Filter.predicate": "IsMuj",
       "transforms.Flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
       "transforms.Flatten.delimiter": "_",
       "predicates": "IsMuj",
       "predicates.IsMuj.type": "io.confluent.connect.transforms.predicates.FieldValuePredicate",
       "predicates.IsMuj.field": "$.name",
       "predicates.IsMuj.value": "\"muj\"",
       ...
     }


  10. Data pipelines as configuration
[Diagram] same pipeline as before, with JSON records flowing end to end: MySQL → MySQL source connector (CDC) → Filter (SMT) → Flatten (SMT) → Kafka → HDFS sink connector (ETL) → HDFS


  11. Complex data pipelines
[Diagram] online operational database → change stream → filter and join (with an event stream) → view → analytics database


  12. Apache Kafka Connect and Kafka Streams
[Diagram] online operational database → MySQL source connector (CDC) + Filter (SMT) → Kafka → Kafka Streams app → Kafka → analytics database


  13. Apache Flink
[Diagram] DB → change stream → Flink job (joined with an event stream) → upserts → DB


  14. Data pipelines power important stuff
[Diagram] advertiser accounts (new accounts) and ad click events feed a pipeline producing very important data: who do we charge? how much do we charge? (money stuff)


  15. How can we make
    data pipelines easy?


  16. How can we build
    new data pipelines
    with as little one-off
    code as possible?


  17. Stream processing with Apache Flink
[Diagram] DB → change stream → Flink (joined with an event stream) → upserts → DB


  18. Stream processing with the Flink Datastream API
// Join the two datastreams on the ad id field using a time window of one hour and a custom join function
     DataStream<Tuple2<AdvertiserAccountCDC, AdClickEvent>> joined = accounts
         .join(clicks)
         .where(account -> account.id) // Key selector for accounts
         .equalTo(click -> click.adId) // Key selector for clicks
         .window(TumblingEventTimeWindows.of(Time.hours(1))) // Window assigner for joining streams based on event time
         .apply(new JoinFunction<AdvertiserAccountCDC, AdClickEvent, Tuple2<AdvertiserAccountCDC, AdClickEvent>>() {
           @Override
           public Tuple2<AdvertiserAccountCDC, AdClickEvent> join(AdvertiserAccountCDC account, AdClickEvent click) {
             return new Tuple2<>(account, click); // Return a tuple of account and click
           }
         });


  19. Flink Table API introduces connectors
    TableFactoryService.find(StreamTableSourceFactory.class, new HashMap<>())
    .connector(new DebeziumDynamicTableSourceFactory()) // use Debezium as the connector
    .option("connector.hostname", "localhost")
    .option("connector.port", "3306")
    .option("connector.username", "flinkuser")
    .option("connector.password", "flinkpw")
    .option("connector.database-name", "inventory") // monitor all tables under inventory database
    .format(new JsonFormatFactory()) // use Json format factory
    .option("json-schema", "...") // specify the JSON schema of the records
    .option("json-fail-on-missing-field", true) // fail if a field is missing
    .inAppendMode()
    .schema(...) // specify the table schema of the source
    .createTemporaryTable("cdc_source"); // register the source as a temporary table
    // create a table from the source descriptor
    Table cdcTable = tableEnv.from("cdc_source");
    // filter out delete operations from the table
    Table filteredTable = cdcTable.where("!op.equals('d')"); // keep only insert and update operations
    // print the filtered table
    filteredTable.execute().print();


  20. Flink SQL API
-- create a CDC source table for inventory database (DDL: Data Definition Language)
     CREATE TABLE inventory (
       id INT NOT NULL,
       name STRING,
       description STRING,
       amount INT
     ) WITH (
       'connector' = 'mysql-cdc',
       'hostname' = 'localhost',
       'port' = '3306',
       'username' = 'flinkuser',
       'password' = 'flinkpw',
       'database-name' = 'inventory_db',
       'table-name' = 'inventory'
     );
     -- filter out delete operations from the table (standard query language)
     SELECT * FROM inventory WHERE op != 'd';


  21. Flink SQL for data pipelines
INSERT INTO stock          -- sink
     SELECT * FROM inventory    -- source
     WHERE quantity > 0;        -- transformation


  22. Flink SQL API
// create a StreamTableEnvironment for streaming queries
     StreamTableEnvironment tableEnv = ...; // see planners section
     // register a MySQL table named "products" in Flink SQL
     String sourceDDL =
       "CREATE TABLE products (" +
       "  id INT NOT NULL," +
       "  name STRING," +
       "  description STRING," +
       "  weight DECIMAL(10,3)" +
       ") WITH (" +
       "  'connector' = 'mysql-cdc'," +
       "  'hostname' = 'localhost'," +
       "  'port' = '3306'," +
       "  'username' = 'flinkuser'," +
       "  'password' = 'flinkpw'," +
       "  'database-name' = 'inventory'," +
       "  'table-name' = 'products'" + // monitor table `products` in database `inventory`
       ")";
     tableEnv.executeSql(sourceDDL);
     // define a dynamic aggregating query
     String query =
       "SELECT id, name, SUM(weight) as total_weight FROM products GROUP BY id, name";
     // convert the Table API query result into a DataStream of retract messages
     DataStream<Row> resultStream =
       tableEnv.toRetractStream(tableEnv.sqlQuery(query), Row.class)
         .filter(x -> x.f0) // only emit update changes (true) and filter out delete changes (false)
         .map(x -> x.f1);   // drop change flag
     // print the result stream
     resultStream.print();


  23. FlinkDeployment YAML with the Flink K8s Operator
apiVersion: flink.apache.org/v1beta1
     kind: FlinkDeployment
     metadata:
       name: basic-example
     spec:
       image: flink:1.16
       flinkVersion: v1_16
       flinkConfiguration:
         taskmanager.numberOfTaskSlots: "2"
       serviceAccount: flink
       jobManager:
         resource:
           memory: "2048m"
           cpu: 1
       taskManager:
         resource:
           memory: "2048m"
           cpu: 1
       job:
         jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
         parallelism: 2
         upgradeMode: stateless
    https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.4/examples/basic.yaml


  24. What did a human need to do?
    ● Write Java and compile a jar
    ● Store the jar somewhere (e.g. in Docker image)
    ● Write FlinkDeployment YAML
    ● Deploy the YAML
    → Flink K8s operator steps in
    → Flink job running!


  25. We can do better.


  26. Level 1
    Flink SQL in YAML
    ● "Flink SQL Runner" application
● FlinkSQLJob CRD
    ● New k8s operator


  27. Flink SQL Runner
import java.io.BufferedReader;
     import java.io.FileReader;
     import org.apache.flink.table.api.EnvironmentSettings;
     import org.apache.flink.table.api.TableEnvironment;

     public class FlinkSqlRunner {
       public static void main(String[] args) throws Exception {
         // check the arguments
         if (args.length != 1) {
           System.err.println("Usage: FlinkSqlRunner <sql-file>");
           return;
         }
         // get the sql file name
         String sqlFile = args[0];
         // create a table environment in streaming mode
         EnvironmentSettings settings = EnvironmentSettings.newInstance()
             .inStreamingMode()
             .build();
         TableEnvironment tableEnv = TableEnvironment.create(settings);
         // read the sql file and execute each semicolon-terminated statement;
         // statements can span multiple lines, so buffer until we see the ';'
         StringBuilder statement = new StringBuilder();
         try (BufferedReader reader = new BufferedReader(new FileReader(sqlFile))) {
           String line;
           while ((line = reader.readLine()) != null) {
             // skip empty or comment lines
             if (line.trim().isEmpty() || line.trim().startsWith("--")) {
               continue;
             }
             statement.append(line).append('\n');
             if (line.trim().endsWith(";")) {
               tableEnv.executeSql(statement.toString());
               statement.setLength(0);
             }
           }
         }
       }
     }
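     The idea: package this runner as a jar, point the FlinkDeployment's job.jarURI at it, and pass the path of the SQL script (mounted from a ConfigMap, as the next slides show) as the single program argument.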


  28. FlinkSQLJob Custom Resource
apiVersion: rta/v1alpha1
     kind: FlinkSQLJob
     metadata:
       name: itemcount
     spec:
       sql: |
         CREATE TABLE source (
           user_id STRING,
           item_id STRING,
           category_id STRING,
           behavior STRING,
           ts TIMESTAMP(3)
         ) WITH (
           'connector' = 'kafka',
           'topic' = 'user_behavior',
           'properties.bootstrap.servers' = 'kafka:9092',
           'format' = 'json'
         );
         CREATE TABLE sink (
           user_id STRING,
           item_count BIGINT
         ) WITH (
           'connector' = 'jdbc',
           'url' = 'jdbc:mysql://mysql:3306/flink',
           'table-name' = 'user_item_count',
           'username' = 'root',
           'password' = ''
         );
         INSERT INTO sink
         SELECT user_id, COUNT(DISTINCT item_id) AS item_count
         FROM source
         WHERE behavior = 'buy'
         GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
       parallelism: 2


  29. flink-sql-job-operator
[Diagram] FlinkSQLJob (CR YAML) → flink-sql-job-operator → SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator → JobMgr deployment, Job deployment, etc. (native Kubernetes YAML)


  30. Operators are made up of control loops
[Diagram] controllers 1, 2, 3 inside Kubernetes, each watching its own YAML resources and reconciling them
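     Each controller is a reconcile loop: observe the declared spec, compare it to the actual state, converge. A minimal sketch in plain Java (all names hypothetical; real operators are built on frameworks like java-operator-sdk and react to watch events rather than polling):

     import java.util.Map;
     import java.util.concurrent.ConcurrentHashMap;

     public class ControlLoop {
       private final Map<String, String> desired = new ConcurrentHashMap<>();  // name -> declared spec
       private final Map<String, String> observed = new ConcurrentHashMap<>(); // name -> actual state

       public void reconcile() {
         // create or update anything whose actual state drifted from its spec
         desired.forEach((name, spec) -> {
           if (!spec.equals(observed.get(name))) {
             observed.put(name, spec); // stand-in for creating/patching a real resource
           }
         });
         // delete anything that no longer has a declared spec
         observed.keySet().removeIf(name -> !desired.containsKey(name));
       }

       public static void main(String[] args) throws InterruptedException {
         ControlLoop loop = new ControlLoop();
         loop.desired.put("flinksqljob/itemcount", "parallelism: 2");
         while (true) {        // run forever: declarative state is continuously enforced
           loop.reconcile();
           Thread.sleep(1000);
         }
       }
     }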


  31. Deploying a FlinkSQLJob
$ kubectl apply -f wordcount.yaml
     $ kubectl get flinksqljob wordcount -o yaml
     # Output:
     apiVersion: rta/v1beta1
     kind: FlinkSQLJob
     metadata:
       ...
     spec:
       ...
     status:
       # The current state of the job.
       state: Running


  32. What did a human need to do?
    ● Write SQL in a YAML file
    ● Deploy the YAML
    → Flink SQL Job operator steps in
    → Flink K8s operator steps in
    → Flink job running!


  33. Where do we get this stuff?
apiVersion: rta/v1alpha1
     kind: FlinkSQLJob
     metadata:
       name: itemcounts
     spec:
       sql: |
         CREATE TABLE source (
           user_id STRING,
           item_id STRING,
           category_id STRING,
           behavior STRING,
           ts TIMESTAMP(3)
         ) WITH (
           'connector' = 'kafka',
           'topic' = 'user_behavior',
           'properties.bootstrap.servers' = 'kafka:9092',
           'format' = 'json'
         );
         CREATE TABLE sink (
           user_id STRING,
           item_count BIGINT
         ) WITH (
           'connector' = 'jdbc',
           'url' = 'jdbc:mysql://mysql:3306/flink',
           'table-name' = 'user_item_count',
           'username' = 'root',
           'password' = ''
         );
         INSERT INTO sink
         SELECT user_id, COUNT(DISTINCT item_id) AS item_count
         FROM source
         WHERE behavior = 'buy'
         GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
       parallelism: 2


  34. Level 2
    Table YAML
    ● FlinkTable CRD
    ● smarter operator


  35. FlinkTable Custom Resource
apiVersion: rta/v1alpha1
     kind: FlinkTable
     metadata:
       name: UserBehavior
     spec:
       ddl: |
         CREATE TABLE UserBehavior (
           user_id STRING,
           item_id STRING,
           category_id STRING,
           behavior STRING,
           ts TIMESTAMP(3)
         ) WITH (
           'connector' = 'kafka',
           'topic' = 'user_behavior',
           'properties.bootstrap.servers' = 'kafka:9092',
           'format' = 'json'
         );
     ---
     apiVersion: rta/v1alpha1
     kind: FlinkTable
     metadata:
       name: UserItemCounts
     spec:
       ddl: |
         CREATE TABLE UserItemCounts (
           user_id STRING,
           item_count BIGINT
         ) WITH (
           'connector' = 'jdbc',
           'url' = 'jdbc:mysql://mysql:3306/flink',
           'table-name' = 'user_item_count',
           'username' = 'root',
           'password' = ''
         );


  36. Reference FlinkTables in FlinkSQLJobs
apiVersion: rta/v1alpha1
     kind: FlinkSQLJob
     metadata:
       name: itemcounts
     spec:
       tables:
         - UserItemCounts
         - UserBehavior
       sql: |
         INSERT INTO UserItemCounts
         SELECT user_id, COUNT(DISTINCT item_id) AS item_count
         FROM UserBehavior
         WHERE behavior = 'buy'
         GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;


  37. flink-sql-job-operator
[Diagram] FlinkSQLJob (CR YAML) + referenced FlinkTable CRs → flink-sql-job-operator → DDL+SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator
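     The new work the operator does here is mostly string assembly: look up each referenced FlinkTable, prepend its DDL to the job's SQL, and write the result to the ConfigMap. A rough sketch (the lookup map stands in for fetching FlinkTable CRs from the API server):

     import java.util.List;
     import java.util.Map;
     import java.util.stream.Collectors;

     public class ScriptAssembler {
       // ddlByTable stands in for each FlinkTable CR's spec.ddl
       static String assemble(List<String> tables, String jobSql, Map<String, String> ddlByTable) {
         String ddl = tables.stream()
             .map(ddlByTable::get)       // DDL of each referenced FlinkTable
             .collect(Collectors.joining("\n"));
         return ddl + "\n" + jobSql;     // DDL first, then the INSERT INTO ... query
       }

       public static void main(String[] args) {
         Map<String, String> ddlByTable = Map.of(
             "UserBehavior", "CREATE TABLE UserBehavior (...) WITH (...);",
             "UserItemCounts", "CREATE TABLE UserItemCounts (...) WITH (...);");
         System.out.println(assemble(
             List.of("UserItemCounts", "UserBehavior"),
             "INSERT INTO UserItemCounts SELECT ... FROM UserBehavior ...;",
             ddlByTable));
       }
     }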


  38. What happens when you change or delete a FlinkTable?
[Diagram] same resource graph as the previous slide; the generated DDL+SQL script depends on every referenced FlinkTable


  39. What happens when you delete a FlinkSQLJob?
[Diagram] same resource graph, with ownerReferences pointing from the generated FlinkDeployment back to the FlinkSQLJob


  40. What happens when you delete a FlinkSQLJob?
[Diagram] same resource graph; ownerReferences on both the SQL script (ConfigMap) and the FlinkDeployment point back to the FlinkSQLJob, so Kubernetes garbage-collects them when it is deleted
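     To get that cascade, the operator stamps every generated resource with an ownerReference back to the FlinkSQLJob. A sketch using the fabric8 Kubernetes client (names taken from the earlier examples):

     import io.fabric8.kubernetes.api.model.ConfigMap;
     import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
     import io.fabric8.kubernetes.api.model.OwnerReferenceBuilder;

     public class OwnedConfigMap {
       // build the SQL-script ConfigMap owned by its FlinkSQLJob, so deleting
       // the FlinkSQLJob garbage-collects the ConfigMap automatically
       static ConfigMap sqlScript(String jobName, String jobUid, String script) {
         return new ConfigMapBuilder()
             .withNewMetadata()
               .withName(jobName + "-sql")
               .addToOwnerReferences(new OwnerReferenceBuilder()
                   .withApiVersion("rta/v1alpha1")
                   .withKind("FlinkSQLJob")
                   .withName(jobName)
                   .withUid(jobUid)        // uid of the owning FlinkSQLJob
                   .withController(true)
                   .build())
             .endMetadata()
             .addToData("job.sql", script) // the assembled DDL+SQL script
             .build();
       }
     }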


  41. Level 3
    Automatic Table Registration
    ● smarter data plane
    ● centralized catalog


  42. Why do we need to write these by hand?
apiVersion: rta/v1alpha1
     kind: FlinkTable
     metadata:
       name: UserBehavior
     spec:
       ddl: |
         CREATE TABLE UserBehavior (
           user_id STRING,
           item_id STRING,
           category_id STRING,
           behavior STRING,
           ts TIMESTAMP(3)
         ) WITH (
           'connector' = 'kafka',
           'topic' = 'user_behavior',
           'properties.bootstrap.servers' = 'kafka:9092',
           'format' = 'json'
         );
     ---
     apiVersion: rta/v1alpha1
     kind: FlinkTable
     metadata:
       name: UserItemCounts
     spec:
       ddl: |
         CREATE TABLE UserItemCounts (
           user_id STRING,
           item_count BIGINT
         ) WITH (
           'connector' = 'jdbc',
           'url' = 'jdbc:mysql://mysql:3306/flink',
           'table-name' = 'user_item_count',
           'username' = 'root',
           'password' = ''
         );


  43. Why do we need to write these by hand?
apiVersion: rta/v1alpha1
     kind: FlinkTable
     metadata:
       name: ImportantMoneyStuff
     spec:
       ddl: |
         CREATE TABLE ImportantMoneyStuff (
           account STRING,
           moneys BIGINT
         ) WITH (
           'connector' = 'pinot',
           ...
         );
     [Diagram] Pinot Table → ?


  44. Metadata catalogs
[Diagram] data plane (Pinot, MySQL, Kafka) publishes metadata to DataHub; flink-sql-job-operator reads the catalog plus the FlinkSQLJob (CR YAML)


  45. Reference tables in the catalog
apiVersion: rta/v1alpha1
     kind: FlinkSQLJob
     metadata:
       name: itemcounts
     spec:
       datasets:
         - urn:li:dataset:(urn:li:dataPlatform:pinot,project.dataset.UserItemCounts,PROD)
         - urn:li:dataset:(urn:li:dataPlatform:kafka,project.dataset.UserBehavior,PROD)
       sql: |
         INSERT INTO UserItemCounts
         SELECT user_id, COUNT(DISTINCT item_id) AS item_count
         FROM UserBehavior
         WHERE behavior = 'buy'
         GROUP BY TUMBLE(ts, INTERVAL '1' HOUR), user_id;
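     With the catalog in the loop, the operator can generate each CREATE TABLE itself: resolve the URN in DataHub, read the schema, and map the platform to a Flink connector. A hypothetical sketch of that mapping (the URN parsing and field list are illustrative; this is not DataHub's client API):

     import java.util.Map;
     import java.util.stream.Collectors;

     public class DdlGenerator {
       static String ddlFor(String urn, String tableName, Map<String, String> fields) {
         // e.g. urn:li:dataset:(urn:li:dataPlatform:kafka,project.dataset.UserBehavior,PROD)
         String platform = urn.substring(
             urn.indexOf("dataPlatform:") + "dataPlatform:".length(), urn.indexOf(','));
         String columns = fields.entrySet().stream()
             .map(e -> "  " + e.getKey() + " " + e.getValue())
             .collect(Collectors.joining(",\n"));
         String connector = switch (platform) { // map platform -> Flink connector
           case "kafka" -> "'connector' = 'kafka'";
           case "pinot" -> "'connector' = 'pinot'";
           default -> throw new IllegalArgumentException("no connector for " + platform);
         };
         return "CREATE TABLE " + tableName + " (\n" + columns + "\n) WITH (\n  "
             + connector + "\n  -- remaining options also come from the catalog\n);";
       }
     }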


  46. Level 4
    Declarative infra
    ● declarative data plane
    ● even more operators


  47. Data plane operators
[Diagram] FlinkSQLJob (CR YAML) → flink-sql-job-operator ← DataHub metadata; KafkaTopic (CR YAML) → kafka-topic-operator → Kafka; PinotTable (CR YAML) → pinot-table-operator → Pinot


  48. Data plane YAML
apiVersion: rta/v1beta1
     kind: KafkaTopic
     metadata:
       name: my-topic
       namespace: rta
       ownerReferences:
         - name: kafka-cluster
           kind: KafkaCluster
           uid: abc123...
         - name: my-table
           kind: PinotTable
           uid: abc123...
     spec:
       clusterRef:
         name: kafka-cluster
       name: my-topic
       partitions: 3
     ---
     apiVersion: rta/v1alpha1
     kind: PinotTable
     metadata:
       name: my-table
       namespace: rta
     spec:
       tableType: REALTIME
       schemaRef: my-schema
       kafkaRef:
         name: my-topic


  49. Data plane operators
[Diagram] same data plane operators as before: declarative KafkaTopic and PinotTable CRs drive kafka-topic-operator and pinot-table-operator, while DataHub supplies metadata to flink-sql-job-operator


  50. Level 5
    End-to-End Ownership
    ● pipeline operator
    ● multiple owners


  51. Pipeline YAML
apiVersion: rta/v1beta1
     kind: Pipeline
     metadata:
       name: my-pipeline
       namespace: rta
     spec:
       resources:
         - kind: KafkaTopic
           name: my-topic
         - kind: PinotTable
           name: my-table
         - kind: AvroSchema
           name: my-schema
       sql: |
         INSERT INTO my-topic ...


52. Pipeline operator
     [Diagram] Pipeline (CR YAML) → pipeline-operator → KafkaTopic, PinotTable, Schema, and FlinkSQLJob (CR YAMLs); the FlinkSQLJob flows on through flink-sql-job-operator → SQL script (ConfigMap) + FlinkDeployment (CR YAML) → flink-kubernetes-operator


53. End-to-end ownership
     [Diagram] the same resource graph, with the Pipeline as the root owner of every resource generated beneath it


54. Multiple owners
     [Diagram] the same resource graph, with several Pipeline CRs sharing ownership of common resources like topics, tables, and schemas


  55. Thank you!
    @dolanRyanne
    in/RyanneDolan
    All code from Bing Chat
    All images from OpenAI Dall-E
