Mastering Data Sync

Posedio
November 19, 2025

Join Damjan for a critical session on leveraging modern data platform tooling to achieve real-time bi-directional data synchronization. We'll move beyond theory to practical implementation, focusing specifically on how our platform addresses the challenges of two-way syncing between core enterprise systems.

Transcript

  1. Do it RIGHT. The story starts in London
     • Software, Cloud and Data – KubeCon EU 2025
     • Customer call for a proposal:
       o Migrate data between systems
       o Ensure a bi-directional sync between the systems for an extended period
       o Use data fusion
     • Prepare a PoC in the hotel
     • Have a cold one to celebrate :)
     DISCLAIMER: Ongoing project, so I will be brief on details
  2. Do it RIGHT. How hard can it be?
     • Extract data from Oracle
     • Use DataFusion to rename fields (plain-SQL sketch below)
     • Load the data into the destination (sadly, also Oracle)
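
     A minimal sketch of what "rename fields" boils down to in plain SQL, assuming a database link (src_link) and hypothetical table and column names rather than the project's real schema:

        -- Copy a table from the source Oracle system, renaming columns on the way.
        -- src_link, table and column names are placeholders for illustration.
        INSERT INTO customer_target (customer_id, full_name, created_at)
        SELECT cust_no,        -- source column, renamed to customer_id
               cust_name,      -- renamed to full_name
               created_ts      -- renamed to created_at
        FROM   customer_source@src_link;
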
  3. Do it RIGHT. Hub and Spoke architecture
     • Head office is responsible for master data
     • Point of Sale office (PoS) is responsible for transactional data
     • We want to migrate PoS offices one-by-one, to reduce risk and downtime
       o A day of downtime costs several hundred thousand €
  4. Do it RIGHT. Can AI help?
     • No
     • Mapping is too complicated
     • Training data does not exist
  5. Do it RIGHT. Data mapping
     • Source and destination systems work differently
     • A data catalogue is only the first step, but not enough for a mapping (sketch of a mapping table below)
     • A monolith with wide tables and domain-driven “microservices” are not always compatible
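
     One way to make the mapping itself tangible is to keep it as data. The table below is a hypothetical sketch of that idea, not the project's actual catalogue schema:

        -- Hypothetical mapping metadata: one row per source-to-target column pair.
        CREATE TABLE column_mapping (
            source_table   VARCHAR2(128) NOT NULL,
            source_column  VARCHAR2(128) NOT NULL,
            target_table   VARCHAR2(128) NOT NULL,
            target_column  VARCHAR2(128) NOT NULL,
            transform_sql  VARCHAR2(4000),   -- optional expression, e.g. TRIM(cust_name)
            CONSTRAINT pk_column_mapping
                PRIMARY KEY (source_table, source_column, target_table, target_column)
        );
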
  6. Do it RIGHT. Reverse mapping
     • If the head office gets migrated first, any changes to the head office need to be reverse-mapped to the model of the PoS
     • If we roll out in waves, some PoS locations will use the old system, and some will use the new system
  7. Do it RIGHT. Tools for the job
     • Is data fusion really a requirement?
     • Why? What problem is the tool meant to solve? Is there a different/better way to solve the problem?
  8. Do it RIGHT. Dynamically generated values
     • Primary key is auto-generated from a central sequencing service
     • Foreign key must reference the correct primary key (crosswalk sketch below)
     • Works well through the UI
     • Is challenging when migrating and when it has to be kept in sync across systems
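
     One common way to keep foreign keys pointing at freshly generated primary keys is a key crosswalk table. The sketch below uses a plain Oracle sequence in place of the central sequencing service, assumes the source tables have already been staged locally, and invents all table names:

        -- Remember which new id was generated for each old id, so child rows can follow.
        CREATE TABLE key_map (
            source_table  VARCHAR2(128) NOT NULL,
            old_id        NUMBER        NOT NULL,
            new_id        NUMBER        NOT NULL,
            CONSTRAINT pk_key_map PRIMARY KEY (source_table, old_id)
        );

        -- Parent rows: draw new ids and record the mapping (order_stage is a local staging copy).
        INSERT INTO key_map (source_table, old_id, new_id)
        SELECT 'ORDERS', o.order_id, order_seq.NEXTVAL
        FROM   order_stage o;

        INSERT INTO order_target (order_id, customer_id, created_at)
        SELECT km.new_id, o.customer_id, o.created_at
        FROM   order_stage o
        JOIN   key_map km ON km.source_table = 'ORDERS' AND km.old_id = o.order_id;

        -- Child rows: translate the foreign key through the crosswalk.
        INSERT INTO order_item_target (order_id, item_no, quantity)
        SELECT km.new_id, i.item_no, i.quantity
        FROM   order_item_stage i
        JOIN   key_map km ON km.source_table = 'ORDERS' AND km.old_id = i.order_id;
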
  9. Do it RIGHT. Data loss
     • How do you map functions that are not bijective? (sketch below)
     • If you map cars to cars and trucks to trucks, life is easy
     • If you map BMW, VW, Audi to cars, you have no problems
     • If you need to map cars to BMW, VW and Audi, things get complicated
     • Remember the reverse-migration!
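
     A small illustration with invented tables: the forward mapping is a simple lookup, but the reverse direction is ambiguous unless the original value is kept somewhere:

        -- Many-to-one value mapping: BMW, VW and Audi all collapse to 'CAR'.
        CREATE TABLE vehicle_type_map (
            source_value  VARCHAR2(64) PRIMARY KEY,   -- 'BMW', 'VW', 'AUDI', ...
            target_value  VARCHAR2(64) NOT NULL       -- 'CAR', 'TRUCK', ...
        );

        -- Forward direction: a simple lookup.
        SELECT v.vehicle_id, m.target_value
        FROM   vehicle_stage v
        JOIN   vehicle_type_map m ON m.source_value = v.brand;

        -- Reverse direction: 'CAR' alone cannot tell you whether it was BMW, VW or Audi.
        -- The original value has to be preserved (extra column, crosswalk table, ...)
        -- or the reverse mapping will lose data.
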
  10. Do it RIGHT. Delta loads
     • How can you tell if something has changed in the underlying database? (sketch below)
       o The easy way
       o The hard way
     • Loosening constraints
       o Only need deltas for transactional data
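
     The deck does not spell out the easy and the hard way. One common "easy way" (an assumption here, not necessarily what the project uses) is a last-modified column plus a watermark from the previous run; Oracle's ORA_ROWSCN pseudocolumn gives a coarse fallback:

        -- "Easy way" sketch: last-modified timestamp plus the previous run's watermark.
        SELECT *
        FROM   sale_source@src_link
        WHERE  last_modified > :last_watermark;

        -- Fallback sketch: ORA_ROWSCN marks changed rows, but only per block unless the
        -- table was created with ROWDEPENDENCIES, so it is coarse-grained.
        SELECT s.*, s.ORA_ROWSCN AS row_scn
        FROM   sale_source s
        WHERE  s.ORA_ROWSCN > :last_scn;
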
  11. Do it RIGHT. Data comes in many forms
     • Oracle database
       o Can we set up a cut-off point? (sketch below)
       o What about data from the future?
     • File-based custom (self-written) protocols
     • REST endpoints
     • SharePoint + Excel
     • CSV (and TSV and in-betweens)
     • DWH and historic data
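
     For the Oracle source, a cut-off point mostly means a filter plus an explicit check for rows dated after the cut-off, so "data from the future" becomes visible instead of silently slipping through. Dates and names below are invented:

        -- Extract only rows up to the agreed cut-off ...
        SELECT *
        FROM   booking_source@src_link
        WHERE  booking_date <= DATE '2025-10-31';      -- hypothetical cut-off

        -- ... and count what arrives dated after it ("data from the future").
        SELECT COUNT(*) AS future_rows
        FROM   booking_source@src_link
        WHERE  booking_date > DATE '2025-10-31';
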
  12. Do it RIGHT. Testing
     • Smoke tests
       o How many rows did we load? (sketch below)
     • Sampling
     • How do you test dynamically generated values?
       o Timestamps, sequences, loss of precision, type casting, etc.
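
     The smoke test and the sampling idea translate into two small queries; the db link and table names are again placeholders:

        -- Smoke test: did the row count survive the trip?
        SELECT (SELECT COUNT(*) FROM order_source@src_link) AS source_rows,
               (SELECT COUNT(*) FROM order_target)          AS target_rows
        FROM   dual;

        -- Sampling: compare stable business columns for a slice of keys; generated
        -- values (new ids, timestamps) are deliberately left out of the comparison.
        SELECT customer_no, order_total, order_status
        FROM   order_source@src_link
        WHERE  customer_no BETWEEN 1000 AND 1999       -- hypothetical sample slice
        MINUS
        SELECT customer_no, order_total, order_status
        FROM   order_target
        WHERE  customer_no BETWEEN 1000 AND 1999;
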
  13. Do it RIGHT. Engineering is peoplework
     • Many experts in each system
       o But no experts in both systems
       o Generic solution vs. a customised approach
     • Communication between different teams in different time zones
  14. Do it RIGHT. Error handling
     • What happens if a pipeline crashes?
       o Roll back
       o Stop
       o Continue
     • Dead letter tables and error tables (sketch below)
       o Different error messages for different error classes
     • Acceptable error rate
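
     Oracle ships one building block for this kind of dead letter table: DML error logging. Whether the project uses it or its own error tables is not stated in the deck; the sketch just shows the mechanism, with invented names:

        -- Create the error ("dead letter") table for the target once.
        BEGIN
          DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name     => 'ORDER_TARGET',
                                       err_log_table_name => 'ERR$_ORDER_TARGET');
        END;
        /

        -- Failing rows land in the error table with their ORA- message and a tag,
        -- instead of rolling back the whole batch.
        INSERT INTO order_target (order_id, customer_id, order_total)
        SELECT order_id, customer_id, order_total
        FROM   order_source@src_link
        LOG ERRORS INTO err$_order_target ('head office, wave 1')
        REJECT LIMIT UNLIMITED;
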
  15. Do it RIGHT. Generating pipelines
     • Automatic pipeline generation based on the mapping data (sketch below)
     • Manual fixes might be required
     • Changes to the mapping require a regeneration and rerun of the pipeline
       o We underestimated the change rate by a wide margin
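
     With the mapping kept as data (like the hypothetical column_mapping table sketched under Data mapping above), the load statements can be generated rather than hand-written. A rough sketch of the idea:

        -- Generate one INSERT ... SELECT per table pair out of the mapping metadata.
        -- Both LISTAGGs are ordered the same way so the column lists line up.
        SELECT 'INSERT INTO ' || target_table || ' ('
               || LISTAGG(target_column, ', ') WITHIN GROUP (ORDER BY target_column)
               || ') SELECT '
               || LISTAGG(NVL(transform_sql, source_column), ', ')
                      WITHIN GROUP (ORDER BY target_column)
               || ' FROM ' || source_table || '@src_link;' AS generated_stmt
        FROM   column_mapping
        GROUP  BY source_table, target_table;
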
  16. Do it RIGHT. Running pipelines
     • Triggering pipelines
       o Single pipelines and batches that run in parallel
     • Latency – how long does it take to do a full migration (of a table or of the system)?
       o The database becomes the bottleneck
       o Some things can’t run in parallel
  17. Do it RIGHT. Large-scale challenges
     • Sequence
       o In which order does data need to be loaded? (sketch below)
       o Can you follow the foreign key constraints?
     • Database constraints
       o What happens with FKs that reference data from before the cut-off period?
       o How do we deal with business transactions?
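
     The load-order question can at least be scoped from the data dictionary: list which table references which, then sort topologically; cycles and self-references are the cases that still need a manual decision. A sketch against USER_CONSTRAINTS:

        -- Every foreign key (constraint type 'R') points at the PK/UK it references.
        SELECT child.table_name   AS child_table,
               parent.table_name  AS parent_table,
               child.constraint_name
        FROM   user_constraints child
        JOIN   user_constraints parent
          ON   parent.constraint_name = child.r_constraint_name
        WHERE  child.constraint_type = 'R'
        ORDER  BY parent.table_name, child.table_name;
        -- A topological sort over these (child -> parent) edges yields a candidate
        -- load order: parents before children.
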
  18. Do it RIGHT. Database triggers
     • Triggers are evil
     • They change your data
     • They fill other tables
     • They block execution and disable parallelism
     • They don’t tell you why things fail
     • Can they safely be disabled? (sketch below)
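
     If the answer to the last question is yes (a business decision, not a technical one), the mechanics themselves are short; table and trigger names are invented:

        -- Disable every trigger on a table for the duration of the load ...
        ALTER TABLE order_target DISABLE ALL TRIGGERS;

        -- ... or only a specific one:
        ALTER TRIGGER trg_order_audit DISABLE;

        -- ... and re-enable afterwards.
        ALTER TABLE order_target ENABLE ALL TRIGGERS;
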
  19. Do it RIGHT. Feedback loops
     • Dashboard to check the status of the pipelines (sketch below)
     • How much data was loaded?
     • What is the data quality?
     • What is the error rate?
     • Visibility for the developers, testers and data-input teams
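
     The dashboard numbers boil down to aggregations over a run log. The pipeline_run table below is a hypothetical stand-in for whatever the platform actually records:

        -- Rows loaded, rows rejected and error rate per target table, last 30 days.
        SELECT target_table,
               COUNT(*)            AS runs,
               SUM(rows_loaded)    AS rows_loaded,
               SUM(rows_rejected)  AS rows_rejected,
               ROUND(100 * SUM(rows_rejected)
                     / NULLIF(SUM(rows_loaded) + SUM(rows_rejected), 0), 2) AS error_rate_pct
        FROM   pipeline_run
        WHERE  started_at >= SYSDATE - 30
        GROUP  BY target_table
        ORDER  BY error_rate_pct DESC NULLS LAST;
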
  20. Do it RIGHT. Why not a platform?
     • We are building a specific solution for a specific problem
     • We are optimising for the existing skills of the project members (excel > csv > parquet; sql > java) rather than for abstractions
  21. Do it RIGHT. Some numbers
     • Team of 5 people
     • So far we have gone through 20 677 430 rows for the head office
     • Around 450 000 JSONs through the REST API
     • About 89 000 pipeline runs in the last 30 days
     • Approximately 800 € / month in cloud costs