Practical Change Data Streaming Use Cases With Apache Kafka and Debezium (QCon San Francisco 2019)

Debezium (noun | de·be·zi·um | /dɪ:ˈbɪ:ziːəm/) - Secret Sauce for Change Data Capture

Apache Kafka is a highly popular option for asynchronous event propagation between microservices. Things get challenging though when adding a service’s database to the picture: How can you avoid inconsistencies between Kafka and the database?

Enter change data capture (CDC) and Debezium. By capturing changes from the log files of the database, Debezium gives you both reliable and consistent inter-service messaging via Kafka and instant read-your-own-write semantics for services themselves.
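
As a rough illustration of what a consumer of such change events might do, the sketch below replays create, update, and delete events into an in-memory view keyed by primary key. The `op` codes and the before/after row states follow the change event structure shown on slide 19 of the transcript; the `CustomerView` and `ChangeEvent` classes and the map-based store are hypothetical stand-ins, not Debezium APIs.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical local view (e.g. a cache) kept in sync by replaying change events in order. */
final class CustomerView {

    /** Simplified stand-in for the value of a change event. */
    static final class ChangeEvent {
        final String op;                  // "c" = create, "u" = update, "d" = delete
        final Map<String, Object> before; // old row state, null for creates
        final Map<String, Object> after;  // new row state, null for deletes

        ChangeEvent(String op, Map<String, Object> before, Map<String, Object> after) {
            this.op = op;
            this.before = before;
            this.after = after;
        }
    }

    private final Map<Object, Map<String, Object>> rowsById = new HashMap<>();

    /** Applies one event: creates and updates upsert the new state, deletes remove the row. */
    void apply(ChangeEvent event) {
        switch (event.op) {
            case "c":
            case "u":
                rowsById.put(event.after.get("id"), event.after);
                break;
            case "d":
                rowsById.remove(event.before.get("id"));
                break;
            default:
                throw new IllegalArgumentException("Unknown op: " + event.op);
        }
    }

    Map<String, Object> get(Object id) {
        return rowsById.get(id);
    }

    public static void main(String[] args) {
        CustomerView view = new CustomerView();
        Map<String, Object> v1 = Map.of("id", 1004, "email", "annek@example.com");
        Map<String, Object> v2 = Map.of("id", 1004, "email", "annek@noanswer.org");
        view.apply(new ChangeEvent("c", null, v1));
        view.apply(new ChangeEvent("u", v1, v2));
        System.out.println(view.get(1004).get("email")); // annek@noanswer.org
        view.apply(new ChangeEvent("d", v2, null));
        System.out.println(view.get(1004));              // null
    }
}
```

Because the view is derived only from the event stream, the writing service needs no knowledge of it, which is the decoupling the abstract refers to.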

In this session you’ll see how to leverage CDC for reliable microservices integration, e.g. using the outbox pattern, as well as many other CDC applications, such as maintaining audit logs, automatically keeping your full-text search index in sync, and driving streaming queries. We’ll also discuss practical matters, e.g. HA set-ups, best practices for running Debezium in production on and off Kubernetes, and the many use cases enabled by Kafka Connect's single message transformations.
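
The outbox pattern mentioned above can be sketched roughly as follows: the service writes its business row and an event row into an outbox table within one local transaction, and a CDC relay later turns committed outbox rows into messages. This is a minimal in-memory model, not real database or Debezium code. `OutboxSketch`, `placeOrder`, `pollRelay`, and the routing by aggregate type are assumptions made for illustration; the outbox columns follow the "Outbox" table on slide 42 of the transcript.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory model of the outbox pattern; all names are hypothetical. */
final class OutboxSketch {

    /** One row of the outbox table (Id, AggregateType, AggregateId, Type, Payload). */
    static final class OutboxRow {
        final String id, aggregateType, aggregateId, type, payload;

        OutboxRow(String id, String aggregateType, String aggregateId, String type, String payload) {
            this.id = id;
            this.aggregateType = aggregateType;
            this.aggregateId = aggregateId;
            this.type = type;
            this.payload = payload;
        }
    }

    // Committed state of the service's local database.
    final List<String> orders = new ArrayList<>();
    final List<OutboxRow> outbox = new ArrayList<>();
    // Position up to which the relay has read the outbox (stand-in for a log offset).
    private int relayOffset = 0;

    /**
     * Writes the order and its event in one "transaction": both writes are staged
     * first and only applied together, or not at all if the transaction fails.
     */
    void placeOrder(String orderId, boolean failBeforeCommit) {
        String stagedOrder = orderId;
        OutboxRow stagedEvent = new OutboxRow(
                "evt-" + orderId, "Order", orderId, "OrderCreated", "{ \"id\": \"" + orderId + "\" }");
        if (failBeforeCommit) {
            throw new IllegalStateException("transaction rolled back");
        }
        // Commit: apply both staged writes together.
        orders.add(stagedOrder);
        outbox.add(stagedEvent);
    }

    /** Stand-in for the CDC relay: returns "topic: payload" for rows committed since the last poll. */
    List<String> pollRelay() {
        List<String> messages = new ArrayList<>();
        while (relayOffset < outbox.size()) {
            OutboxRow row = outbox.get(relayOffset++);
            messages.add(row.aggregateType + ": " + row.payload);
        }
        return messages;
    }

    public static void main(String[] args) {
        OutboxSketch service = new OutboxSketch();
        service.placeOrder("123", false);
        try {
            service.placeOrder("456", true);
        } catch (IllegalStateException e) {
            // Rolled back: neither the order nor its event was written.
        }
        System.out.println(service.orders);
        System.out.println(service.pollRelay());
    }
}
```

The point of the design is that the service never talks to Kafka directly, so there is no second write that could succeed or fail independently of the database transaction.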

Gunnar Morling

November 12, 2019

Transcript

  1. Practical Change Data Streaming Use Cases With Apache Kafka and Debezium. Gunnar Morling, Software Engineer, @gunnarmorling
  2. DATA

  3. 3

  4. 4

  5. (1) The Issue with Dual Writes: What's the problem? Change data capture to the rescue! (2) CDC Use Cases & Patterns: Replication, Audit Logs, Microservices. (3) Practical Matters: Deployment Topologies, Running on Kubernetes, Single Message Transforms.
  6. Gunnar Morling: open source software engineer at Red Hat (Debezium, Hibernate); Spec Lead for Bean Validation 2.0; other projects: Deptective, MapStruct; Java Champion. @gunnarmorling #CDCUseCases
  7. A Common Problem: Updating Multiple Resources. [Diagram: Order Service, Database]
  8. A Common Problem: Updating Multiple Resources. [Diagram: Order Service, Cache, Database]
  9. A Common Problem: Updating Multiple Resources. [Diagram: Order Service, Cache, Database, Search Index]
  10. A Common Problem: Updating Multiple Resources. [Diagram: Order Service, Cache, Database, Search Index] "Friends Don't Let Friends Do Dual Writes"
  11. A Better Solution: Streaming Change Events From the Database. [Diagram: Order Service]
  12. A Better Solution: Streaming Change Events From the Database. Change Data Capture. [Diagram: Order Service; change event stream C C U C U U D C, where C = Create, U = Update, D = Delete]
  13. A Better Solution: Streaming Change Events From the Database. Change Data Capture. [Diagram: Order Service; change event stream C C U C U U D C, where C = Create, U = Update, D = Delete]
  14. Change Data Capture With Debezium
  15. Debezium: Change Data Capture Platform. CDC for multiple databases; based on transaction logs; snapshotting, filtering etc.; fully open-source, very active community; via Apache Kafka or embedded; many production deployments (e.g. WePay, Convoy, JW Player, Usabilla, BlaBlaCar etc.).
  16. Debezium Connectors: MySQL, Postgres, MongoDB, SQL Server, Cassandra (Incubating), Oracle (Incubating, based on XStream). Possible future additions: DB2? MariaDB?
  17. Meme idea: Robin Moffatt

  18. Log- vs. Query-Based CDC

                                                            Query-Based   Log-Based
          All data changes are captured                     -             +
          No polling delay or overhead                      -             +
          Transparent to writing applications and models    -             +
          Can capture deletes and old record state          -             +
          Installation/Configuration                        +             -
  19. Change Event Structure. Key: primary key of table. Value: describing the change event (old row state, new row state, metadata). Serialization formats: JSON, Avro.

          {
            "before": null,
            "after": { "id": 1004, "first_name": "Anne", "last_name": "Kretchmar", "email": "annek@noanswer.org" },
            "source": { "name": "dbserver1", "server_id": 0, "ts_sec": 0, "file": "mysql-bin.000003", "pos": 154, "row": 0, "snapshot": true, "db": "inventory", "table": "customers" },
            "op": "c",
            "ts_ms": 1486500577691
          }
  20. (1) The Issue with Dual Writes: What's the problem? Change data capture to the rescue! (2) CDC Use Cases & Patterns: Replication, Audit Logs, Microservices. (3) Practical Matters: Deployment Topologies, Running on Kubernetes, Single Message Transforms.
  21. CDC – "Liberation for Your Data"
  22. Data Replication: Zero-Code Streaming Pipelines. [Diagram: Postgres, MySQL, Apache Kafka]
  23. Data Replication: Zero-Code Streaming Pipelines. [Diagram: Postgres, MySQL, Kafka Connect, Apache Kafka]
  24. Data Replication: Zero-Code Streaming Pipelines. [Diagram: Postgres, MySQL, Kafka Connect, DBZ PG, DBZ MySQL, Apache Kafka]
  25. Data Replication: Zero-Code Streaming Pipelines. [Diagram: Postgres, MySQL, Kafka Connect, DBZ PG, DBZ MySQL, Apache Kafka, ES Connector, Elasticsearch]
  26. Data Replication: Zero-Code Streaming Pipelines. [Diagram: Postgres, MySQL, Kafka Connect, DBZ PG, DBZ MySQL, Apache Kafka, ES Connector, Elasticsearch, JDBC Connector, Data Warehouse]
  27. Data Replication: Zero-Code Streaming Pipelines. [Diagram: Postgres, MySQL, Kafka Connect, DBZ PG, DBZ MySQL, Apache Kafka, ES Connector, Elasticsearch, JDBC Connector, Data Warehouse, ISPN Connector, Infinispan]
  28. Data Replication: Low-Latency Streaming Pipelines. https://medium.com/convoy-tech/
  29. Auditing. [Diagram: CRM Service, Source DB, Kafka Connect, DBZ, Apache Kafka, Customer Events]
  30. Auditing. [Diagram: CRM Service, Source DB, Kafka Connect, DBZ, Apache Kafka, Customer Events] "Transactions" table (Id, User, Use Case): tx-1, Bob, Create Customer; tx-2, Sarah, Delete Customer; tx-3, Rebecca, Update Customer.
  31. Auditing. [Diagram: CRM Service, Source DB, Kafka Connect, DBZ, Apache Kafka, Customer Events, Transactions] "Transactions" table (Id, User, Use Case): tx-1, Bob, Create Customer; tx-2, Sarah, Delete Customer; tx-3, Rebecca, Update Customer.
  32. Auditing. [Diagram: CRM Service, Source DB, Kafka Connect, DBZ, Apache Kafka, Customer Events, Transactions, Kafka Streams] "Transactions" table (Id, User, Use Case): tx-1, Bob, Create Customer; tx-2, Sarah, Delete Customer; tx-3, Rebecca, Update Customer.
  33. Auditing. [Diagram: CRM Service, Source DB, Kafka Connect, DBZ, Apache Kafka, Customer Events, Transactions, Kafka Streams, Enriched Customer Events] "Transactions" table (Id, User, Use Case): tx-1, Bob, Create Customer; tx-2, Sarah, Delete Customer; tx-3, Rebecca, Update Customer.
  34. Auditing. Customers:

          {
            "before": { "id": 1004, "last_name": "Kretchmar", "email": "annek@example.com" },
            "after": { "id": 1004, "last_name": "Kretchmar", "email": "annek@noanswer.org" },
            "source": { "name": "dbserver1", "table": "customers", "txId": "tx-3" },
            "op": "u",
            "ts_ms": 1486500577691
          }
  35. Customers:

          {
            "before": { "id": 1004, "last_name": "Kretchmar", "email": "annek@example.com" },
            "after": { "id": 1004, "last_name": "Kretchmar", "email": "annek@noanswer.org" },
            "source": { "name": "dbserver1", "table": "customers", "txId": "tx-3" },
            "op": "u",
            "ts_ms": 1486500577691
          }

      Transactions (key { "id": "tx-3" }):

          {
            "before": null,
            "after": { "id": "tx-3", "user": "Rebecca", "use_case": "Update customer" },
            "source": { "name": "dbserver1", "table": "transactions", "txId": "tx-3" },
            "op": "c",
            "ts_ms": 1486500577691
          }
  36. Customers:

          {
            "before": { "id": 1004, "last_name": "Kretchmar", "email": "annek@example.com" },
            "after": { "id": 1004, "last_name": "Kretchmar", "email": "annek@noanswer.org" },
            "source": { "name": "dbserver1", "table": "customers", "txId": "tx-3" },
            "op": "u",
            "ts_ms": 1486500577691
          }

      Transactions (key { "id": "tx-3" }):

          {
            "before": null,
            "after": { "id": "tx-3", "user": "Rebecca", "use_case": "Update customer" },
            "source": { "name": "dbserver1", "table": "transactions", "txId": "tx-3" },
            "op": "c",
            "ts_ms": 1486500577691
          }
  37. Auditing. Enriched Customers:

          {
            "before": { "id": 1004, "last_name": "Kretchmar", "email": "annek@example.com" },
            "after": { "id": 1004, "last_name": "Kretchmar", "email": "annek@noanswer.org" },
            "source": { "name": "dbserver1", "table": "customers", "txId": "tx-3", "user": "Rebecca", "use_case": "Update customer" },
            "op": "u",
            "ts_ms": 1486500577691
          }
  38. Auditing: non-trivial join implementation. There is no ordering across topics, so change events need to be buffered until the TX data is available. bit.ly/debezium-auditlogs

          @Override
          public KeyValue<JsonObject, JsonObject> transform(JsonObject key, JsonObject value) {
              // Try to flush previously buffered events first. If some are still
              // waiting, buffer the current event behind them and emit nothing.
              boolean enrichedAllBufferedEvents = enrichAndEmitBufferedEvents();
              if (!enrichedAllBufferedEvents) {
                  bufferChangeEvent(key, value);
                  return null;
              }

              // A null result is treated as "not enrichable yet": buffer the event.
              KeyValue<JsonObject, JsonObject> enriched = enrichWithTxMetaData(key, value);
              if (enriched == null) {
                  bufferChangeEvent(key, value);
              }
              return enriched;
          }
  39. Microservice CDC Patterns

  40. Microservice Architectures: Data Synchronization. Propagate data between different services without coupling. Each service keeps optimised views locally. [Diagram: Order, Item, and Stock apps, each with a local DB; Item Changes, Stock Changes]
  41. Outbox Pattern: Separate Events Table. [Diagram: Order Service, Source DB (Orders, Outbox), Kafka Connect, DBZ, Apache Kafka, Order Events, Credit Worthiness Check Events, Shipment Service, Customer Service]
  42. Outbox Pattern: Separate Events Table. [Diagram: Order Service, Source DB (Orders, Outbox), Kafka Connect, DBZ, Apache Kafka, Order Events, Credit Worthiness Check Events, Shipment Service, Customer Service] bit.ly/debezium-outbox-pattern. "Outbox" table:

          Id     AggregateType   AggregateId   Type                  Payload
          ec6e   Order           123           OrderCreated          { "id" : 123, ... }
          8af8   Order           456           OrderDetailCanceled   { "id" : 456, ... }
          890b   Customer        789           InvoiceCreated        { "id" : 789, ... }
  43. Strangler Pattern: Migrating from Monoliths to Microservices. https://martinfowler.com/bliki/StranglerFigApplication.html
  44. Strangler Pattern. [Diagram: Customer]

  45. Strangler Pattern. [Diagram: Router, Customer, Customer', CDC, Transformation; arrows labelled Reads/Writes and Reads]
  46. Strangler Pattern. [Diagram: Router, Customer, CDC, CDC; arrows labelled Reads/Writes and Reads/Writes]
  47. (1) The Issue with Dual Writes: What's the problem? Change data capture to the rescue! (2) CDC Use Cases & Patterns: Replication, Audit Logs, Microservices. (3) Practical Matters: Deployment Topologies, Running on Kubernetes, Single Message Transforms.
  48. Deployment Topologies: Basic Set-Up. [Diagram: CDC]
  49. Deployment Topologies: Database High Availability. [Diagram: CDC]
  50. Deployment Topologies: Database High Availability. [Diagram: CDC]
  51. Deployment Topologies: Automatic Fail-over. [Diagram: HA Proxy, CDC]
  52. Deployment Topologies: Automatic Fail-over. [Diagram: HA Proxy, CDC]
  53. Deployment Topologies: Can't Change Binlog Mode? [Diagram: Primary, Secondary, CDC]
  54. Deployment Topologies: High Availability for Connectors. [Diagram: CDC, CDC, Deduplicator]
  55. Running on Kubernetes: Deployment via Operators. YAML-based custom resource definitions for Kafka/Connect clusters, topics etc.; the operator applies the configuration. Advantages: automated deployment and scaling, simplified upgrading, portability across clouds.

          apiVersion: "kafka.strimzi.io/v1alpha1"
          kind: "KafkaConnector"
          metadata:
            name: "inventory-connector"
            labels:
              connect-cluster: my-connect-cluster
          spec:
            class: i.d.c.p.PostgresConnector
            tasksMax: 1
            config:
              database.hostname: "postgres"
              database.port: "5432"
              database.user: "bob"
              database.password: "secret"
              database.dbname: "prod"
              database.server.name: "dbserver1"
              schema.whitelist: "inventory"
  56. Running on Kubernetes: Operating Kafka Connect. Distributed mode; offsets stored in Kafka; configuration via REST. Single node: no re-balancing issues (< Apache Kafka 2.3). Single connector: health checks based on REST API. Fight duplication: Jsonnet templates.

          // a database + connector per tenant
          {
            "name": "inventory-connector",
            "config": {
              "connector.class": "i.d.c.p.PostgresConnector",
              "tasks.max": "1",
              "database.hostname": "postgres",
              "database.port": "5432",
              "database.user": "bob",
              "database.password": "secret",
              "database.dbname": std.extVar('tenant'),
              "database.server.name": std.extVar('tenant'),
              "schema.whitelist": "inventory"
            }
          }
  57. Single Message Transformations: The Swiss Army Knife of Kafka Connect. Format conversions; time/date fields; extract new row state; aggregate sharded tables to single topic; keep compatibility with existing consumers. Image © Emilian Robert Vicol, https://flic.kr/p/c8s6Y3
  58. Single Message Transformations: Externalizing Large Column Values. [Diagram: DBZ, Amazon S3]
  59. Single Message Transformations: Externalizing Large Column Values. [Diagram: DBZ, Amazon S3]

          {
            "before": { ... },
            "after": { "id": 1004, "last_name": "Kretchmar", "email": "annek@noanswer.org", "image": "imgs-<offset>-after" },
            ...
          }
  60. Takeaways. Change Data Capture – Liberation for your data! Enabling use cases such as replication, streaming queries, maintaining CQRS read models etc. Microservices: outbox and strangler patterns. Debezium: open-source CDC for a growing number of databases. "Friends Don't Let Friends Do Dual-Writes"
  61. DATA

  62. Resources. Website: debezium.io (see also debezium.io/documentation/online-resources and debezium.io/blog). Strimzi (Kafka on Kubernetes): strimzi.io. Latest news: @debezium
  63. Q&A. gunnar@hibernate.org, @gunnarmorling

  64. Outlook: View Materialization. Awareness of Transaction Boundaries: topic with BEGIN/END markers; enable consumers to buffer all events of one transaction.

      BEGIN:

          { "transactionId" : "tx-123", "eventType" : "begin transaction", "ts_ms": 1486500577125 }

      END:

          {
            "transactionId" : "tx-123",
            "ts_ms": 1486500577691,
            "eventType" : "end transaction",
            "eventCount" : [
              { "name" : "dbserver1.inventory.Order", "count" : 1 },
              { "name" : "dbserver1.inventory.OrderLine", "count" : 5 }
            ]
          }
  65. 65
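
The idea from slide 64, letting consumers buffer all events of one transaction, could be sketched along these lines. The class and method names are hypothetical, change events are reduced to plain strings, and the END marker is assumed to deliver one total event count (the sum of the per-table counts shown on the slide). Since slide 38 notes that there is no ordering across topics, the sketch accepts the END marker either before or after the data events.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical buffer that holds change events per transaction until all have arrived. */
final class TransactionBuffer {
    private final Map<String, List<String>> eventsByTx = new HashMap<>();
    private final Map<String, Integer> expectedCountByTx = new HashMap<>();

    /** Called for each change event; returns the whole transaction once it is complete. */
    Optional<List<String>> onChangeEvent(String txId, String event) {
        eventsByTx.computeIfAbsent(txId, id -> new ArrayList<>()).add(event);
        return releaseIfComplete(txId);
    }

    /** Called for an "end transaction" marker carrying the total number of events. */
    Optional<List<String>> onEnd(String txId, int totalEventCount) {
        expectedCountByTx.put(txId, totalEventCount);
        return releaseIfComplete(txId);
    }

    private Optional<List<String>> releaseIfComplete(String txId) {
        Integer expected = expectedCountByTx.get(txId);
        List<String> buffered = eventsByTx.getOrDefault(txId, List.of());
        if (expected == null || buffered.size() < expected) {
            return Optional.empty(); // END marker or some data events are still missing
        }
        eventsByTx.remove(txId);
        expectedCountByTx.remove(txId);
        return Optional.of(buffered);
    }

    public static void main(String[] args) {
        TransactionBuffer buffer = new TransactionBuffer();
        System.out.println(buffer.onChangeEvent("tx-123", "Order 1 created"));     // Optional.empty
        System.out.println(buffer.onEnd("tx-123", 2));                             // Optional.empty
        System.out.println(buffer.onChangeEvent("tx-123", "OrderLine 1 created")); // both events
    }
}
```

A consumer built this way only ever sees complete transactions, which is what a materialized view needs to avoid exposing intermediate states.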