Syncing your Database To OpenSearch In Real-Time (JCON Slovenia)

You've been tasked with implementing a data streaming pipeline for propagating data changes from your operational Postgres database to a search index in OpenSearch. Data views in OpenSearch should be denormalized for fast querying, and of course there should be no noticeable impact on the production database.

In this session we'll discuss how to build this data pipeline using two popular open-source projects: Debezium for log-based change data capture (CDC) and Apache Flink for stream processing. Join us for this talk and learn about:

* Setting up change data streams with Debezium
* Efficiently building nested data structures from 1:n joins
* Deployment options: Kafka Connect vs. Flink CDC

We'll also touch on some advanced aspects like observability and consistency checks for your real-time data pipeline.
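
To make the idea of building nested data structures from a 1:n join more concrete, here is a minimal sketch in plain Python. It is not the Flink-based approach from the talk; it only illustrates the bookkeeping involved. The class name `NestedDocumentBuilder`, the parent/child tables (orders and order lines), and the column names `id` and `order_id` are all assumptions made for this example.

```python
from collections import defaultdict


class NestedDocumentBuilder:
    """Keeps one denormalized document per parent row, with child rows nested as a list.

    Hypothetical helper for illustration; table and column names are assumptions.
    """

    def __init__(self):
        self.parents = {}                  # parent id -> parent row
        self.children = defaultdict(dict)  # parent id -> {child id -> child row}

    def upsert_parent(self, row):
        self.parents[row["id"]] = row
        return self.document(row["id"])

    def upsert_child(self, row):
        self.children[row["order_id"]][row["id"]] = row
        return self.document(row["order_id"])

    def delete_child(self, row):
        self.children[row["order_id"]].pop(row["id"], None)
        return self.document(row["order_id"])

    def document(self, parent_id):
        """Returns the current nested document, or None if the parent is not known yet."""
        parent = self.parents.get(parent_id)
        if parent is None:
            return None  # a child arrived before its parent; nothing to index yet
        rows = self.children[parent_id]
        # Sort by child id for a stable order and drop the foreign key from nested rows.
        lines = [
            {k: v for k, v in rows[child_id].items() if k != "order_id"}
            for child_id in sorted(rows)
        ]
        return {**parent, "lines": lines}


builder = NestedDocumentBuilder()
builder.upsert_parent({"id": 1, "customer": "Alice"})
builder.upsert_child({"id": 10, "order_id": 1, "item": "pen", "quantity": 2})
document = builder.upsert_child({"id": 11, "order_id": 1, "item": "ink", "quantity": 1})
print(document)
```

Each change to a parent or child row yields the full, updated document for the affected parent, which could then be written to the search index under the parent's key. In a real pipeline this state would live in the stream processor rather than in process memory.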

Gunnar Morling

June 03, 2024

Transcript

  1. Syncing your Database To OpenSearch In Real-Time

    Gunnar Morling, Software Engineer, Decodable, @gunnarmorling
    Image © massmatt https://flic.kr/p/25eF9D3 (CC BY 2.0)
  2. From Postgres to OpenSearch | @gunnarmorling Gunnar Morling

    • Software engineer at Decodable
    • Former project lead of Debezium
    • kcctl 🧸, JfrUnit, ModiTect, MapStruct
    • Spec Lead for Bean Validation 2.0
    • Java Champion
    • 1⃣ 🐝 🏎
  3. Debezium in a Nutshell: Open-Source Change Data Capture

    • A CDC Platform
      ◦ Based on transaction logs
      ◦ Snapshotting, filtering, etc.
      ◦ Outbox support
      ◦ Web-based UI
    • Fully open-source, very active community
    • Large production deployments
  4. Debezium Supported Databases

    • Core
      ◦ MySQL, MariaDB
      ◦ Postgres
      ◦ SQL Server
      ◦ MongoDB
      ◦ Db2, Informix
      ◦ Oracle
    • Community-led:
      ◦ Vitess, Cassandra, Spanner
    • External: ScyllaDB, Yugabyte
  5. Debezium: Data Change Events

    • Old and new row state
    • Metadata on table, TX id, etc.
    • Operation type, timestamp
  8. Debezium: Becoming the De-Facto CDC Standard

    https://debezium.io/blog/2021/09/22/deep-dive-into-a-debezium-community-connector-scylla-cdc-source-connector/
  9. Apache Flink Common Use Cases

    • Real-time reporting/dashboards
    • Low-latency alerting, notifications
    • Materialized view maintenance, caches
    • Real-time cross-database sync, lookup joins, windowed joins, aggregations
    • Machine learning: model serving, feature engineering
    • Change data capture, data integration
    https://flink.apache.org/poweredby.html
  10. Apache Flink APIs for Application Development

    Image source: “Change Data Capture with Flink SQL and Debezium” by Marta Paes at DataEngBytes (https://noti.st/morsapaes/liQzgs/change-data-capture-with-flink-sql-and-debezium)
  12. Nested Data Structures: UDFs to the Rescue

    https://www.youtube.com/@decodable
  13. Take Aways

    • Debezium: Real-time change event streams for your data
    • Debezium and Apache Flink: Power house of change stream processing
      ◦ Data Integration
      ◦ Data Cleansing
      ◦ Denormalization
      ◦ Aggregations
      ◦ Pattern Matching
  14. Towards Production: What To Consider

    • Provisioning and updating infrastructure
    • Deployment and (auto-)scaling
    • Observability
    • State management
    • Schema management and inference
    • Developer experience
    • CI/CD
    • Security and access control
  15. Learn More

    • Debezium: @debezium | https://debezium.io/
    • Apache Flink: @ApacheFlink | https://flink.apache.org/
    • Getting started with Flink: github.com/decodableco/examples → flink-learn
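
The "Data Change Events" slide lists what a Debezium event carries: old and new row state, source metadata, an operation type, and a timestamp. As a rough illustration, the sketch below models such an event as a Python dictionary using Debezium's envelope field names (`before`, `after`, `source`, `op`, `ts_ms`) and applies it to an in-memory dictionary standing in for a search index. The table, column values, and the `apply_change` helper are made up for this example, and only a subset of the `source` fields is shown.

```python
# Illustrative update event; all values are invented for this sketch.
# How much of "before" is populated depends on the table's replica identity setting.
update_event = {
    "before": {"id": 1001, "email": "old@example.com"},
    "after": {"id": 1001, "email": "new@example.com"},
    "source": {
        "connector": "postgresql",
        "db": "inventory",
        "schema": "public",
        "table": "customers",
        "txId": 556,
        "lsn": 24023128,
    },
    "op": "u",  # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1717400000000,
}


def apply_change(index, event, key="id"):
    """Applies one change event to a dict that stands in for the search index.

    Hypothetical helper; assumes every row has a primary key column named by `key`.
    """
    op = event["op"]
    if op in ("c", "r", "u"):
        row = event["after"]
        index[row[key]] = row  # upsert the new row state
    elif op == "d":
        index.pop(event["before"][key], None)  # remove the deleted row
    else:
        raise ValueError(f"unsupported operation type: {op!r}")
    return index


index = {1001: {"id": 1001, "email": "old@example.com"}}
apply_change(index, update_event)
print(index[1001]["email"])
```

Treating creates, snapshot reads, and updates uniformly as upserts keeps the sink idempotent, which helps when events are redelivered after a restart.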