Practical Change Data Streaming Use Cases With Apache Kafka and Debezium (QCon San Francisco 2019)

Debezium (noun | de·be·zi·um | /dɪ:ˈbɪ:ziːəm/) - Secret Sauce for Change Data Capture

Apache Kafka is a highly popular option for asynchronous event propagation between microservices. Things get challenging though when adding a service’s database to the picture: How can you avoid inconsistencies between Kafka and the database?
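As a rough illustration of the inconsistency in question, the following sketch models a service that writes to its database and to Kafka as two independent steps. Everything here is an assumption made for illustration: `DualWriteSketch`, the in-memory lists standing in for the database and the Kafka topic, and the `brokerAvailable` flag are hypothetical and not taken from the talk.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a naive dual write; both "stores" are in-memory stand-ins. */
final class DualWriteSketch {

    final List<String> database = new ArrayList<>(); // stand-in for the service's database
    final List<String> kafka = new ArrayList<>();    // stand-in for a Kafka topic
    boolean brokerAvailable = true;                  // hypothetical failure switch

    /** Two independent writes without a shared transaction. */
    void placeOrder(String order) {
        database.add(order); // write 1: succeeds and stays committed
        if (!brokerAvailable) {
            throw new IllegalStateException("broker unavailable");
        }
        kafka.add(order);    // write 2: never happens if the broker is down
    }

    public static void main(String[] args) {
        DualWriteSketch service = new DualWriteSketch();
        service.placeOrder("order-1");
        service.brokerAvailable = false;
        try {
            service.placeOrder("order-2");
        } catch (IllegalStateException e) {
            // The database already contains order-2, the topic does not.
        }
        System.out.println("database=" + service.database + ", kafka=" + service.kafka);
    }
}
```

Reversing the order of the two writes only moves the problem: the event would then be published even if the database write fails.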

Enter change data capture (CDC) and Debezium. By capturing changes from the log files of the database, Debezium gives you both reliable and consistent inter-service messaging via Kafka and instant read-your-own-write semantics for services themselves.
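To make the difference between reading the log and querying the tables more concrete, here is a minimal in-memory sketch. It assumes a hypothetical table with an `updated_at`-style timestamp per row and an append-only change log; the class `PollingVsLogSketch` and the string format of the log entries are illustrative only and say nothing about how Debezium itself is implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: what a polling query sees versus what a change log contains. */
final class PollingVsLogSketch {

    private final Map<Integer, String> rows = new TreeMap<>();  // current table contents
    private final Map<Integer, Long> updatedAt = new HashMap<>(); // per-row modification time
    private final List<String> log = new ArrayList<>();          // append-only change log
    private long clock = 0;

    void insert(int id, String value) { write("c", id, value); }

    void update(int id, String value) { write("u", id, value); }

    void delete(int id) {
        rows.remove(id);
        updatedAt.remove(id);
        clock++;
        log.add("d:" + id);
    }

    private void write(String op, int id, String value) {
        rows.put(id, value);
        updatedAt.put(id, ++clock);
        log.add(op + ":" + id + "=" + value);
    }

    /** Query-based capture: roughly "SELECT ... WHERE updated_at > since". */
    List<String> poll(long since) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<Integer, String> e : rows.entrySet()) {
            if (updatedAt.get(e.getKey()) > since) {
                changed.add(e.getKey() + "=" + e.getValue());
            }
        }
        return changed;
    }

    /** Log-based capture: every change from the given log offset on, in order. */
    List<String> readLog(int fromOffset) {
        return new ArrayList<>(log.subList(fromOffset, log.size()));
    }

    public static void main(String[] args) {
        PollingVsLogSketch table = new PollingVsLogSketch();
        table.insert(1, "a");
        table.update(1, "b");
        table.insert(2, "x");
        table.delete(2);
        System.out.println("poll: " + table.poll(0));    // latest state of surviving rows only
        System.out.println("log:  " + table.readLog(0)); // all four changes, including the delete
    }
}
```

In this sketch the poll between two points in time misses the intermediate value of row 1 and never learns that row 2 existed, while the log contains all four changes.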

In this session you’ll see how to leverage CDC for reliable microservices integration, e.g. using the outbox pattern, as well as many other CDC applications, such as maintaining audit logs, automatically keeping your full-text search index in sync, and driving streaming queries. We’ll also discuss practical matters, e.g. HA set-ups, best practices for running Debezium in production on and off Kubernetes, and the many use cases enabled by Kafka Connect's single message transformations.
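The outbox pattern mentioned above can be sketched in a few lines. The column names (`Id`, `AggregateType`, `AggregateId`, `Type`, `Payload`) follow the "Outbox" table shown in the slides; the classes `OutboxRoutingSketch`, `OutboxRow` and `Message`, as well as the `<aggregate type>-events` topic naming, are assumptions made for this example and not Debezium's actual API.

```java
import java.util.Locale;

/** Sketch: turning a captured "Outbox" table row into a Kafka-style message. */
final class OutboxRoutingSketch {

    /** One row of the outbox table, as inserted by the service in its own transaction. */
    static final class OutboxRow {
        final String id;
        final String aggregateType;
        final String aggregateId;
        final String type;
        final String payload;

        OutboxRow(String id, String aggregateType, String aggregateId, String type, String payload) {
            this.id = id;
            this.aggregateType = aggregateType;
            this.aggregateId = aggregateId;
            this.type = type;
            this.payload = payload;
        }
    }

    /** Simplified message: destination topic, key, event type and value. */
    static final class Message {
        final String topic;
        final String key;
        final String eventType;
        final String value;

        Message(String topic, String key, String eventType, String value) {
            this.topic = topic;
            this.key = key;
            this.eventType = eventType;
            this.value = value;
        }
    }

    /** Routes by aggregate type and keys by aggregate id. */
    static Message route(OutboxRow row) {
        // Hypothetical naming scheme: one topic per aggregate type.
        String topic = row.aggregateType.toLowerCase(Locale.ROOT) + "-events";
        // Keying by aggregate id keeps all events of one aggregate in one partition, hence in order.
        return new Message(topic, row.aggregateId, row.type, row.payload);
    }

    public static void main(String[] args) {
        // Sample payload is a placeholder, not the full payload from the talk.
        OutboxRow row = new OutboxRow("ec6e", "Order", "123", "OrderCreated", "{\"id\": 123}");
        Message message = route(row);
        System.out.println(message.topic + " / " + message.key + " / " + message.eventType);
    }
}
```

The service only ever writes to its own database (business tables plus the outbox table, in one transaction); publishing to Kafka is left to the change data capture pipeline.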

Gunnar Morling

November 12, 2019

Transcript

  1. Practical Change Data Streaming Use Cases With Apache Kafka and Debezium. Gunnar Morling, Software Engineer, @gunnarmorling
  2. 3

  3. 4

  4. (1) The Issue with Dual Writes: What's the problem? Change data capture to the rescue! (2) CDC Use Cases & Patterns: Replication, Audit Logs, Microservices. (3) Practical Matters: Deployment Topologies, Running on Kubernetes, Single Message Transforms.
  5. Gunnar Morling: Open source software engineer at Red Hat (Debezium, Hibernate). Spec Lead for Bean Validation 2.0. Other projects: Deptective, MapStruct. Java Champion. #CDCUseCases
  6. A Common Problem: Updating Multiple Resources. Diagram: Order Service, Database.
  7. A Common Problem: Updating Multiple Resources. Diagram: Order Service, Database, Cache.
  8. A Common Problem: Updating Multiple Resources. Diagram: Order Service, Database, Cache, Search Index.
  9. A Common Problem: Updating Multiple Resources. Diagram: Order Service, Database, Cache, Search Index. “Friends Don't Let Friends Do Dual Writes”
  10. A Better Solution: Streaming Change Events From the Database. Diagram: Order Service.
  11. A Better Solution: Streaming Change Events From the Database. Diagram: Order Service, Change Data Capture, event stream C C U C U U D C (C - Create, U - Update, D - Delete).
  12. A Better Solution: Streaming Change Events From the Database. Diagram: Order Service, Change Data Capture, event stream C C U C U U D C (C - Create, U - Update, D - Delete).
  13. Debezium: Change Data Capture Platform. CDC for multiple databases; based on transaction logs; snapshotting, filtering etc.; fully open-source, very active community; via Apache Kafka or embedded; many production deployments (e.g. WePay, Convoy, JW Player, Usabilla, BlaBlaCar etc.).
  14. Debezium Connectors: MySQL, Postgres, MongoDB, SQL Server, Cassandra (Incubating), Oracle (Incubating, based on XStream). Possible future additions: DB2? MariaDB?
  15. Log- vs. Query-Based CDC

                                                         Query-Based   Log-Based
        All data changes are captured                         -            +
        No polling delay or overhead                          -            +
        Transparent to writing applications and models        -            +
        Can capture deletes and old record state              -            +
        Installation/Configuration                            +            -
  16. Change Event Structure. Key: primary key of table. Value: describing the change event (old row state, new row state, metadata). Serialization formats: JSON, Avro. Example value:

        { "before": null, "after": { "id": 1004, "first_name": "Anne", "last_name": "Kretchmar", "email": "[email protected]" }, "source": { "name": "dbserver1", "server_id": 0, "ts_sec": 0, "file": "mysql-bin.000003", "pos": 154, "row": 0, "snapshot": true, "db": "inventory", "table": "customers" }, "op": "c", "ts_ms": 1486500577691 }
  17. (1) The Issue with Dual Writes: What's the problem? Change data capture to the rescue! (2) CDC Use Cases & Patterns: Replication, Audit Logs, Microservices. (3) Practical Matters: Deployment Topologies, Running on Kubernetes, Single Message Transforms.
  18. Data Replication: Zero-Code Streaming Pipelines. Diagram: Postgres, MySQL, Apache Kafka.
  19. Data Replication: Zero-Code Streaming Pipelines. Diagram: Postgres, MySQL, Kafka Connect, Apache Kafka.
  20. Data Replication: Zero-Code Streaming Pipelines. Diagram: Postgres, MySQL, Kafka Connect (DBZ PG, DBZ MySQL), Apache Kafka.
  21. Data Replication: Zero-Code Streaming Pipelines. Diagram: Postgres, MySQL, Kafka Connect (DBZ PG, DBZ MySQL), Apache Kafka, Kafka Connect (ES Connector), Elasticsearch.
  22. Data Replication: Zero-Code Streaming Pipelines. Diagram: Postgres, MySQL, Kafka Connect (DBZ PG, DBZ MySQL), Apache Kafka, Kafka Connect (ES Connector, JDBC Connector), Elasticsearch, Data Warehouse.
  23. Data Replication: Zero-Code Streaming Pipelines. Diagram: Postgres, MySQL, Kafka Connect (DBZ PG, DBZ MySQL), Apache Kafka, Kafka Connect (ES Connector, JDBC Connector, ISPN Connector), Elasticsearch, Data Warehouse, Infinispan.
  24. Auditing. Diagram: CRM Service, Source DB, Kafka Connect (DBZ), Apache Kafka (Customer Events). "Transactions" table:

        Id    User     Use Case
        tx-1  Bob      Create Customer
        tx-2  Sarah    Delete Customer
        tx-3  Rebecca  Update Customer

  25. Auditing. Diagram: CRM Service, Source DB, Kafka Connect (DBZ), Apache Kafka (Customer Events, Transactions). Same "Transactions" table (tx-1 Bob, tx-2 Sarah, tx-3 Rebecca).
  26. Auditing. Diagram: CRM Service, Source DB, Kafka Connect (DBZ), Apache Kafka (Customer Events, Transactions), Kafka Streams. Same "Transactions" table.
  27. Auditing. Diagram: CRM Service, Source DB, Kafka Connect (DBZ), Apache Kafka (Customer Events, Transactions, Enriched Customer Events), Kafka Streams. Same "Transactions" table.
  28. Auditing. Customers:

        { "before": { "id": 1004, "last_name": "Kretchmar", "email": "[email protected]" }, "after": { "id": 1004, "last_name": "Kretchmar", "email": "[email protected]" }, "source": { "name": "dbserver1", "table": "customers", "txId": "tx-3" }, "op": "u", "ts_ms": 1486500577691 }

  29. Customers:

        { "before": { "id": 1004, "last_name": "Kretchmar", "email": "[email protected]" }, "after": { "id": 1004, "last_name": "Kretchmar", "email": "[email protected]" }, "source": { "name": "dbserver1", "table": "customers", "txId": "tx-3" }, "op": "u", "ts_ms": 1486500577691 }

      Transactions, with key { "id": "tx-3" }:

        { "before": null, "after": { "id": "tx-3", "user": "Rebecca", "use_case": "Update customer" }, "source": { "name": "dbserver1", "table": "transactions", "txId": "tx-3" }, "op": "c", "ts_ms": 1486500577691 }

  30. Same Customers and Transactions events as on the previous slide, shown in a different arrangement.
  31. Auditing. Enriched Customers:

        { "before": { "id": 1004, "last_name": "Kretchmar", "email": "[email protected]" }, "after": { "id": 1004, "last_name": "Kretchmar", "email": "[email protected]" }, "source": { "name": "dbserver1", "table": "customers", "txId": "tx-3", "user": "Rebecca", "use_case": "Update customer" }, "op": "u", "ts_ms": 1486500577691 }
  32. Auditing: non-trivial join implementation. No ordering across topics; need to buffer change events until TX data is available. bit.ly/debezium-auditlogs

        @Override
        public KeyValue<JsonObject, JsonObject> transform(JsonObject key, JsonObject value) {
            // First try to enrich and emit the events buffered earlier.
            boolean enrichedAllBufferedEvents = enrichAndEmitBufferedEvents();

            // Earlier events are still waiting for TX data: queue this one behind them.
            if (!enrichedAllBufferedEvents) {
                bufferChangeEvent(key, value);
                return null;
            }

            KeyValue<JsonObject, JsonObject> enriched = enrichWithTxMetaData(key, value);

            // TX data for this event is not available yet: buffer it for later.
            if (enriched == null) {
                bufferChangeEvent(key, value);
            }

            return enriched;
        }
  33. Microservice Architectures: Data Synchronization. Propagate data between different services without coupling; each service keeps optimised views locally. Diagram: Order, Item and Stock apps, each with a local DB; Item Changes, Stock Changes.
  34. Outbox Pattern: Separate Events Table. Diagram: Order Service, Source DB (Orders, Outbox), Kafka Connect (DBZ), Apache Kafka (Order Events, Credit Worthiness Check Events), Shipment Service, Customer Service.
  35. Outbox Pattern: Separate Events Table. Same diagram as on the previous slide. bit.ly/debezium-outbox-pattern. "Outbox" table:

        Id    AggregateType  AggregateId  Type                 Payload
        ec6e  Order          123          OrderCreated         { "id" : 123, ... }
        8af8  Order          456          OrderDetailCanceled  { "id" : 456, ... }
        890b  Customer       789          InvoiceCreated       { "id" : 789, ... }
  36. Strangler Pattern: Migrating from Monoliths to Microservices. https://martinfowler.com/bliki/StranglerFigApplication.html
  37. Strangler Pattern. Diagram labels: Router, CDC, Transformation, Customer, Customer', Reads/Writes, Reads.
  38. Strangler Pattern. Diagram labels: Router, Customer, Reads/Writes (twice), CDC (twice).
  39. (1) The Issue with Dual Writes: What's the problem? Change data capture to the rescue! (2) CDC Use Cases & Patterns: Replication, Audit Logs, Microservices. (3) Practical Matters: Deployment Topologies, Running on Kubernetes, Single Message Transforms.
  40. Deployment Topologies: Can't Change Binlog Mode? Diagram: Primary, Secondary, CDC.
  41. Deployment Topologies: High Availability for Connectors. Diagram: CDC, CDC, Deduplicator.
  42. Running on Kubernetes: Deployment via Operators. YAML-based custom resource definitions for Kafka/Connect clusters, topics etc.; the operator applies the configuration. Advantages: automated deployment and scaling; simplified upgrading; portability across clouds.

        apiVersion: "kafka.strimzi.io/v1alpha1"
        kind: "KafkaConnector"
        metadata:
          name: "inventory-connector"
          labels:
            connect-cluster: my-connect-cluster
        spec:
          class: i.d.c.p.PostgresConnector
          tasksMax: 1
          config:
            database.hostname: "postgres"
            database.port: "5432"
            database.user: "bob"
            database.password: "secret"
            database.dbname: "prod"
            database.server.name: "dbserver1"
            schema.whitelist: "inventory"
  43. Running on Kubernetes: Operating Kafka Connect. Distributed mode; offsets stored in Kafka; configuration via REST. Single node: no re-balancing issues (< Apache Kafka 2.3). Single connector: health checks based on REST API. Fight duplication: Jsonnet templates.

        // a database + connector per tenant
        {
          "name": "inventory-connector",
          "config": {
            "connector.class": "i.d.c.p.PostgresConnector",
            "tasks.max": "1",
            "database.hostname": "postgres",
            "database.port": "5432",
            "database.user": "bob",
            "database.password": "secret",
            "database.dbname": std.extVar('tenant'),
            "database.server.name": std.extVar('tenant'),
            "schema.whitelist": "inventory"
          }
        }
  44. Single Message Transformations: The Swiss Army Knife of Kafka Connect. Format conversions; time/date fields; extract new row state; aggregate sharded tables to single topic; keep compatibility with existing consumers. Image © Emilian Robert Vicol, https://flic.kr/p/c8s6Y3
  45. Single Message Transformations: Externalizing Large Column Values. Diagram: DBZ, Amazon S3.
  46. Single Message Transformations: Externalizing Large Column Values. Diagram: DBZ, Amazon S3.

        { "before": { ... }, "after": { "id": 1004, "last_name": "Kretchmar", "email": "[email protected]", "image": "imgs-<offset>-after" }, ... }
  47. Takeaways. Change Data Capture – Liberation for your data! Enabling use cases such as replication, streaming queries, maintaining CQRS read models etc. Microservices: outbox and strangler patterns. Debezium: open-source CDC for a growing number of databases. “Friends Don't Let Friends Do Dual-Writes”
  48. Resources. Website: debezium.io, debezium.io/documentation/online-resources, debezium.io/blog. Strimzi (Kafka on Kubernetes): strimzi.io. Latest news: @debezium.
  49. Outlook: View Materialization. Awareness of Transaction Boundaries. Topic with BEGIN/END markers; enable consumers to buffer all events of one transaction.

      BEGIN:

        { "transactionId" : "tx-123", "eventType" : "begin transaction", "ts_ms": 1486500577125 }

      END:

        { "transactionId" : "tx-123", "ts_ms": 1486500577691, "eventType" : "end transaction", "eventCount" : [ { "name" : "dbserver1.inventory.Order", "count" : 1 }, { "name" : "dbserver1.inventory.OrderLine", "count" : 5 } ] }
  50. 65
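
The BEGIN/END markers from slide 49 suggest how a consumer could buffer all events of one transaction before acting on them. The following is only a sketch under stated assumptions: the class `TxBoundaryBufferSketch`, its method names, and the idea of summing the per-table counts of the END marker into one expected total are hypothetical, and the marker format shown in the talk was still an outlook at the time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Sketch: buffering change events until a whole transaction has been seen. */
final class TxBoundaryBufferSketch {

    // Change events received so far, grouped by transaction id.
    private final Map<String, List<String>> buffered = new HashMap<>();
    // Total number of events announced by the END marker, per transaction id.
    private final Map<String, Integer> expectedCounts = new HashMap<>();

    /** Buffers one change event; returns the full transaction once it is complete. */
    Optional<List<String>> onChangeEvent(String txId, String event) {
        buffered.computeIfAbsent(txId, id -> new ArrayList<>()).add(event);
        return releaseIfComplete(txId);
    }

    /** Records the END marker, which may arrive before some of the change events. */
    Optional<List<String>> onEndMarker(String txId, int totalEventCount) {
        expectedCounts.put(txId, totalEventCount);
        return releaseIfComplete(txId);
    }

    private Optional<List<String>> releaseIfComplete(String txId) {
        Integer expected = expectedCounts.get(txId);
        List<String> events = buffered.getOrDefault(txId, Collections.emptyList());
        if (expected == null || events.size() < expected) {
            return Optional.empty(); // END marker or some events still missing
        }
        expectedCounts.remove(txId);
        buffered.remove(txId);
        return Optional.of(events);
    }

    public static void main(String[] args) {
        TxBoundaryBufferSketch buffer = new TxBoundaryBufferSketch();
        buffer.onChangeEvent("tx-123", "Order 1");
        buffer.onEndMarker("tx-123", 2); // one more event is still outstanding
        Optional<List<String>> tx = buffer.onChangeEvent("tx-123", "OrderLine 1");
        System.out.println(tx.isPresent() ? "complete: " + tx.get() : "incomplete");
    }
}
```

Because there is no ordering across topics, the sketch releases a transaction only when both the END marker and the announced number of events have arrived, in whichever order that happens.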