Mastering Data Sync

Posedio
November 19, 2025

Join Damjan for a critical session on leveraging modern data platform tooling to achieve real-time bi-directional data synchronization. We'll move beyond theory to practical implementation, focusing specifically on how our platform addresses the challenges of two-way syncing between core enterprise systems.

Transcript

  1. Do it RIGHT. The story starts in London
     • Software, Cloud and Data – KubeCon EU 2025
     • Customer call for a proposal:
       o Migrate data between systems
       o Ensure a bi-directional sync between the systems for an extended period
       o Use data fusion
     • Prepare a PoC in the hotel
     • Have a cold one to celebrate :)
     DISCLAIMER: Ongoing project, so I will be brief on details
  2. Do it RIGHT. How hard can it be?
     • Extract data from Oracle
     • Use DataFusion to rename fields (plain-SQL sketch below)
     • Load the data into the destination (sadly, also Oracle)
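
     A minimal sketch of what "rename fields" boils down to in plain SQL, assuming a database link (src_link) and hypothetical table and column names rather than the project's real schema:

        -- Copy a table from the source Oracle system, renaming columns on the way.
        -- src_link, table and column names are placeholders for illustration.
        INSERT INTO customer_target (customer_id, full_name, created_at)
        SELECT cust_no,        -- source column, renamed to customer_id
               cust_name,      -- renamed to full_name
               created_ts      -- renamed to created_at
        FROM   customer_source@src_link;
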
  3. Do it RIGHT. Hub and Spoke architecture
     • Head office is responsible for master data
     • Point of Sale office (PoS) is responsible for transactional data
     • We want to migrate PoS offices one-by-one, to reduce risk and downtime
       o A day of downtime costs several hundred thousand €
  4. Do it RIGHT. Can AI help?
     • No
     • Mapping is too complicated
     • Training data does not exist
  5. Do it RIGHT. Data mapping
     • Source and destination systems work differently
     • A data catalogue is only the first step, but not enough for a mapping (sketch of a mapping table below)
     • A monolith with wide tables and domain-driven “microservices” are not always compatible
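
     One way to make the mapping itself tangible is to keep it as data. The table below is a hypothetical sketch of that idea, not the project's actual catalogue schema:

        -- Hypothetical mapping metadata: one row per source-to-target column pair.
        CREATE TABLE column_mapping (
            source_table   VARCHAR2(128) NOT NULL,
            source_column  VARCHAR2(128) NOT NULL,
            target_table   VARCHAR2(128) NOT NULL,
            target_column  VARCHAR2(128) NOT NULL,
            transform_sql  VARCHAR2(4000),   -- optional expression, e.g. TRIM(cust_name)
            CONSTRAINT pk_column_mapping
                PRIMARY KEY (source_table, source_column, target_table, target_column)
        );
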
  6. Do it RIGHT. Reverse mapping
     • If the head office gets migrated first, any changes to the head office need to be reverse-mapped to the model of the PoS
     • If we roll out in waves, some PoS locations will use the old system, and some will use the new system
  7. Do it RIGHT. Tools for the job
     • Is data fusion really a requirement?
     • Why? What problem is the tool meant to solve? Is there a different/better way to solve the problem?
  8. Do it RIGHT. Dynamically generated values
     • Primary key is auto-generated from a central sequencing service
     • Foreign key must reference the correct primary key (crosswalk sketch below)
     • Works well through the UI
     • Is challenging when migrating and when it has to be kept in sync across systems
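
     One common way to keep foreign keys pointing at freshly generated primary keys is a key crosswalk table. The sketch below uses a plain Oracle sequence in place of the central sequencing service, assumes the source tables have already been staged locally, and invents all table names:

        -- Remember which new id was generated for each old id, so child rows can follow.
        CREATE TABLE key_map (
            source_table  VARCHAR2(128) NOT NULL,
            old_id        NUMBER        NOT NULL,
            new_id        NUMBER        NOT NULL,
            CONSTRAINT pk_key_map PRIMARY KEY (source_table, old_id)
        );

        -- Parent rows: draw new ids and record the mapping (order_stage is a local staging copy).
        INSERT INTO key_map (source_table, old_id, new_id)
        SELECT 'ORDERS', o.order_id, order_seq.NEXTVAL
        FROM   order_stage o;

        INSERT INTO order_target (order_id, customer_id, created_at)
        SELECT km.new_id, o.customer_id, o.created_at
        FROM   order_stage o
        JOIN   key_map km ON km.source_table = 'ORDERS' AND km.old_id = o.order_id;

        -- Child rows: translate the foreign key through the crosswalk.
        INSERT INTO order_item_target (order_id, item_no, quantity)
        SELECT km.new_id, i.item_no, i.quantity
        FROM   order_item_stage i
        JOIN   key_map km ON km.source_table = 'ORDERS' AND km.old_id = i.order_id;
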
  9. Do it RIGHT. Data loss
     • How do you map functions that are not bijective? (sketch below)
     • If you map cars to cars and trucks to trucks, life is easy
     • If you map BMW, VW, Audi to cars, you have no problems
     • If you need to map cars to BMW, VW and Audi, things get complicated
     • Remember the reverse-migration!
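
     A small illustration with invented tables: the forward mapping is a simple lookup, but the reverse direction is ambiguous unless the original value is kept somewhere:

        -- Many-to-one value mapping: BMW, VW and Audi all collapse to 'CAR'.
        CREATE TABLE vehicle_type_map (
            source_value  VARCHAR2(64) PRIMARY KEY,   -- 'BMW', 'VW', 'AUDI', ...
            target_value  VARCHAR2(64) NOT NULL       -- 'CAR', 'TRUCK', ...
        );

        -- Forward direction: a simple lookup.
        SELECT v.vehicle_id, m.target_value
        FROM   vehicle_stage v
        JOIN   vehicle_type_map m ON m.source_value = v.brand;

        -- Reverse direction: 'CAR' alone cannot tell you whether it was BMW, VW or Audi.
        -- The original value has to be preserved (extra column, crosswalk table, ...)
        -- or the reverse mapping will lose data.
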
  10. Do it RIGHT. Delta loads
     • How can you tell if something has changed in the underlying database? (sketch below)
       o The easy way
       o The hard way
     • Loosening constraints
       o Only need deltas for transactional data
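
     The deck does not spell out the easy and the hard way. One common "easy way" (an assumption here, not necessarily what the project uses) is a last-modified column plus a watermark from the previous run; Oracle's ORA_ROWSCN pseudocolumn gives a coarse fallback:

        -- "Easy way" sketch: last-modified timestamp plus the previous run's watermark.
        SELECT *
        FROM   sale_source@src_link
        WHERE  last_modified > :last_watermark;

        -- Fallback sketch: ORA_ROWSCN marks changed rows, but only per block unless the
        -- table was created with ROWDEPENDENCIES, so it is coarse-grained.
        SELECT s.*, s.ORA_ROWSCN AS row_scn
        FROM   sale_source s
        WHERE  s.ORA_ROWSCN > :last_scn;
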
  11. Do it RIGHT. Data comes in many forms
     • Oracle database
       o Can we set up a cut-off point? (sketch below)
       o What about data from the future?
     • File-based custom (self-written) protocols
     • REST endpoints
     • SharePoint + Excel
     • CSV (and TSV and in-betweens)
     • DWH and historic data
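
     For the Oracle source, a cut-off point mostly means a filter plus an explicit check for rows dated after the cut-off, so "data from the future" becomes visible instead of silently slipping through. Dates and names below are invented:

        -- Extract only rows up to the agreed cut-off ...
        SELECT *
        FROM   booking_source@src_link
        WHERE  booking_date <= DATE '2025-10-31';      -- hypothetical cut-off

        -- ... and count what arrives dated after it ("data from the future").
        SELECT COUNT(*) AS future_rows
        FROM   booking_source@src_link
        WHERE  booking_date > DATE '2025-10-31';
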
  12. Do it RIGHT. Testing
     • Smoke tests
       o How many rows did we load? (sketch below)
     • Sampling
     • How do you test dynamically generated values?
       o Timestamps, sequences, loss of precision, type casting, etc.
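
     The smoke test and the sampling idea translate into two small queries; the db link and table names are again placeholders:

        -- Smoke test: did the row count survive the trip?
        SELECT (SELECT COUNT(*) FROM order_source@src_link) AS source_rows,
               (SELECT COUNT(*) FROM order_target)          AS target_rows
        FROM   dual;

        -- Sampling: compare stable business columns for a slice of keys; generated
        -- values (new ids, timestamps) are deliberately left out of the comparison.
        SELECT customer_no, order_total, order_status
        FROM   order_source@src_link
        WHERE  customer_no BETWEEN 1000 AND 1999       -- hypothetical sample slice
        MINUS
        SELECT customer_no, order_total, order_status
        FROM   order_target
        WHERE  customer_no BETWEEN 1000 AND 1999;
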
  13. Do it RIGHT. Engineering is peoplework
     • Many experts in each system
       o But no experts in both systems
       o Generic solution vs. a customised approach
     • Communication between different teams in different time zones
  14. Do it RIGHT. Error handling
     • What happens if a pipeline crashes?
       o Roll back
       o Stop
       o Continue
     • Dead letter tables and error tables (sketch below)
       o Different error messages for different error classes
     • Acceptable error rate
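
     Oracle ships one building block for this kind of dead letter table: DML error logging. Whether the project uses it or its own error tables is not stated in the deck; the sketch just shows the mechanism, with invented names:

        -- Create the error ("dead letter") table for the target once.
        BEGIN
          DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name     => 'ORDER_TARGET',
                                       err_log_table_name => 'ERR$_ORDER_TARGET');
        END;
        /

        -- Failing rows land in the error table with their ORA- message and a tag,
        -- instead of rolling back the whole batch.
        INSERT INTO order_target (order_id, customer_id, order_total)
        SELECT order_id, customer_id, order_total
        FROM   order_source@src_link
        LOG ERRORS INTO err$_order_target ('head office, wave 1')
        REJECT LIMIT UNLIMITED;
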
  15. Do it RIGHT. Generating pipelines
     • Automatic pipeline generation based on the mapping data (sketch below)
     • Manual fixes might be required
     • Changes to the mapping require a regeneration and rerun of the pipeline
       o We underestimated the change rate by a wide margin
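
     With the mapping kept as data (like the hypothetical column_mapping table sketched under Data mapping above), the load statements can be generated rather than hand-written. A rough sketch of the idea:

        -- Generate one INSERT ... SELECT per table pair out of the mapping metadata.
        -- Both LISTAGGs are ordered the same way so the column lists line up.
        SELECT 'INSERT INTO ' || target_table || ' ('
               || LISTAGG(target_column, ', ') WITHIN GROUP (ORDER BY target_column)
               || ') SELECT '
               || LISTAGG(NVL(transform_sql, source_column), ', ')
                      WITHIN GROUP (ORDER BY target_column)
               || ' FROM ' || source_table || '@src_link;' AS generated_stmt
        FROM   column_mapping
        GROUP  BY source_table, target_table;
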
  16. Do it RIGHT. Running pipelines
     • Triggering pipelines
       o Single pipelines and batches that run in parallel
     • Latency – how long does it take to do a full migration (of a table or of the system)?
       o The database becomes the bottleneck
       o Some things can’t run in parallel
  17. Do it RIGHT. Large-scale challenges
     • Sequence
       o In which order does data need to be loaded? (sketch below)
       o Can you follow the foreign key constraints?
     • Database constraints
       o What happens with FKs that reference data from before the cut-off period?
       o How do we deal with business transactions?
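
     The load-order question can at least be scoped from the data dictionary: list which table references which, then sort topologically; cycles and self-references are the cases that still need a manual decision. A sketch against USER_CONSTRAINTS:

        -- Every foreign key (constraint type 'R') points at the PK/UK it references.
        SELECT child.table_name   AS child_table,
               parent.table_name  AS parent_table,
               child.constraint_name
        FROM   user_constraints child
        JOIN   user_constraints parent
          ON   parent.constraint_name = child.r_constraint_name
        WHERE  child.constraint_type = 'R'
        ORDER  BY parent.table_name, child.table_name;
        -- A topological sort over these (child -> parent) edges yields a candidate
        -- load order: parents before children.
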
  18. Do it RIGHT. Database triggers
     • Triggers are evil
     • They change your data
     • They fill other tables
     • They block execution and disable parallelism
     • They don’t tell you why things fail
     • Can they safely be disabled? (sketch below)
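
     If the answer to the last question is yes (a business decision, not a technical one), the mechanics themselves are short; table and trigger names are invented:

        -- Disable every trigger on a table for the duration of the load ...
        ALTER TABLE order_target DISABLE ALL TRIGGERS;

        -- ... or only a specific one:
        ALTER TRIGGER trg_order_audit DISABLE;

        -- ... and re-enable afterwards.
        ALTER TABLE order_target ENABLE ALL TRIGGERS;
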
  19. Do it RIGHT. Feedback loops
     • Dashboard to check the status of the pipelines (sketch below)
     • How much data was loaded?
     • What is the data quality?
     • What is the error rate?
     • Visibility for the developers, testers and data-input teams
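
     The dashboard numbers boil down to aggregations over a run log. The pipeline_run table below is a hypothetical stand-in for whatever the platform actually records:

        -- Rows loaded, rows rejected and error rate per target table, last 30 days.
        SELECT target_table,
               COUNT(*)            AS runs,
               SUM(rows_loaded)    AS rows_loaded,
               SUM(rows_rejected)  AS rows_rejected,
               ROUND(100 * SUM(rows_rejected)
                     / NULLIF(SUM(rows_loaded) + SUM(rows_rejected), 0), 2) AS error_rate_pct
        FROM   pipeline_run
        WHERE  started_at >= SYSDATE - 30
        GROUP  BY target_table
        ORDER  BY error_rate_pct DESC NULLS LAST;
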
  20. Do it RIGHT. Why not a platform?
     • We are building a specific solution for a specific problem
     • We are optimising for the existing skills of the project members (excel > csv > parquet; sql > java) rather than for abstractions
  21. Do it RIGHT. Some numbers
     • Team of 5 people
     • So far we have gone through 20 677 430 rows for the head office
     • Around 450 000 JSONs through the REST API
     • About 89 000 pipeline runs in the last 30 days
     • Approximately 800 € / month in cloud costs