Old Dogs, New Tricks
A pragmatic guide for data movement
Sharon Xie,
Founding Engineer, Decodable
Slide 2
About Me
Software Engineer
Product Manager
Solution Architect
Customer Support
Slide 3
Agenda
● Where is Data Movement?
● Data Movement Patterns
● A Unified Approach to Data Movement
Slide 4
Data Movement Is Everywhere
Slide 5
Data Movement Use Cases
● Online
○ Caches and search index
○ User-facing analytics
○ Monitoring and alerting
● Offline
○ Data analytics
○ Business intelligence
○ ML model training
Slide 6
Old Dogs
● Batch ETL (Extract, Transform, Load)
● ELT (Extract, Load, Transform)
● Point to point
Slide 7
Batch ETL
Runs data movement on a predefined schedule
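As a rough illustration of the batch pattern, the sketch below shows one scheduled run that extracts every row, transforms it in flight, and loads the result. The table names (`orders`, `orders_usd`), the cents-to-dollars transform, and the `run_batch_etl` function are all hypothetical; in practice a scheduler such as cron would invoke the function at each interval.

```python
import sqlite3

def run_batch_etl(source, target):
    """One scheduled run: extract all rows, transform in flight, load the result."""
    # Extract: read the full source table.
    rows = source.execute("SELECT id, amount_cents FROM orders").fetchall()
    # Transform: convert cents to dollars before the data reaches the target.
    transformed = [(order_id, cents / 100.0) for order_id, cents in rows]
    # Load: write the transformed rows, replacing any earlier copy.
    target.execute(
        "CREATE TABLE IF NOT EXISTS orders_usd (id INTEGER PRIMARY KEY, amount REAL)"
    )
    target.executemany("INSERT OR REPLACE INTO orders_usd VALUES (?, ?)", transformed)
    target.commit()
    return len(transformed)

# Hypothetical source system with two orders.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1250), (2, 990)])

target = sqlite3.connect(":memory:")
loaded = run_batch_etl(source, target)
```

Data written to the source between two runs is invisible to the target until the next run, which is why this pattern struggles with online use cases.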
Slide 8
Batch ETL
● ✅ Well-known pattern
● ✅ Robust systems
● ❌ Online use cases
Slide 9
ELT
● Loads raw data into data systems
● Transformation in the data systems
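For contrast with ETL, here is a minimal sketch of the ELT ordering: raw records land in the data system untouched, and the transformation then runs inside that system as SQL. An in-memory SQLite database stands in for the warehouse, and the `raw_orders` and `paid_orders` tables are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Load: raw records are stored exactly as they arrived.
warehouse.execute(
    "CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)"
)
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 990, "cancelled"), (3, 400, "paid")],
)

# Transform: executed by the data system itself, after loading.
warehouse.execute(
    """
    CREATE TABLE paid_orders AS
    SELECT id, amount_cents / 100.0 AS amount
    FROM raw_orders
    WHERE status = 'paid'
    """
)

paid = warehouse.execute("SELECT id, amount FROM paid_orders ORDER BY id").fetchall()
```

Note that the untransformed rows remain in `raw_orders`, which is the property that becomes a problem when data must be cleaned or masked before it is stored.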
Slide 10
ELT
● ✅ Simple to use
● ❌ Online use cases
● ❌ When data must be transformed before
storing
○ E.g., security and compliance
Slide 11
Point-to-point
Direct connectivity and transformation
Slide 12
Point-to-point
● ✅ Specialized for use cases
● 🟡 Lack of abstraction and data inconsistency
issues
Slide 13
New Tricks
● Change Data Capture (CDC)
● Event Streaming
● Stream Processing
Slide 14
Change Data Capture (CDC)
Capture database change events in real time
Slide 15
Example - Debezium
Log-Based Change Data Capture
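To make the idea concrete, the sketch below replays a few change events into a downstream copy of a table. The events use a heavily simplified version of the Debezium envelope, keeping only `op` (`c` create, `u` update, `d` delete, `r` snapshot read), `before`, and `after`; real events also carry source metadata and timestamps. The `apply_change` helper and the `id` key column are assumptions made for illustration.

```python
def apply_change(replica, event):
    """Apply one simplified change event to an in-memory replica keyed by id."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Creates, snapshot reads, and updates carry the new row state in "after".
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        # Deletes carry the last known row state in "before".
        replica.pop(event["before"]["id"], None)
    return replica

events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "b@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "c@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "c@example.com"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

Because the events are read from the database log in commit order, replaying them in that order keeps the replica consistent with the source.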
Slide 16
Change Data Capture (CDC)
● ✅ Enable real-time data movement
○ ✅ Online use cases
● 🟡 Must integrate with other technologies
Slide 17
Event Streaming
● Real-time event ingestion in persistent storage
● Decoupled event producers and consumers
Slide 18
Example - Apache Kafka
Slide 19
Event Streaming
● ✅ Online use cases
● ✅ Source once, consume multiple times
● 🟡 Processing is limited
○ Additional infrastructure for complex transformations
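The "source once, consume multiple times" property can be sketched with a toy append-only log in which each consumer tracks its own read offset. This is an in-memory illustration only, not the Kafka API; the `EventLog` class and the consumer names are hypothetical.

```python
class EventLog:
    """Toy append-only log: producers append, consumers read at their own pace."""

    def __init__(self):
        self._events = []    # events are kept after being read
        self._offsets = {}   # next unread position, tracked per consumer

    def append(self, event):
        self._events.append(event)

    def poll(self, consumer):
        """Return every event this consumer has not yet seen."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch

log = EventLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

search_first = log.poll("search-indexer")   # sees the first three events
log.append("order-4")
analytics_first = log.poll("analytics")     # a late consumer still sees all four
search_second = log.poll("search-indexer")  # sees only the new event
```

The producer never needs to know which consumers exist, which is the decoupling the slide refers to. The log only stores and delivers events, so anything beyond simple reads still needs a separate processing layer.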
Slide 20
Stream Processing
Continuous and incremental processing over
streaming data
Stream Processing
● ✅ Online use cases
● ✅ Support complex transformations
● 🟡 Hard to operationalize
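A minimal sketch of what "continuous and incremental" means: the aggregate is updated once per arriving event instead of being recomputed over the whole dataset. The `running_totals` function and the event fields (`customer`, `amount`) are hypothetical, and a production stream processor would additionally handle state persistence, failures, and out-of-order events.

```python
from collections import defaultdict

def running_totals(events):
    """Yield the updated per-customer total after each incoming event."""
    totals = defaultdict(float)
    for event in events:
        # Incremental: only the affected key's state is touched.
        totals[event["customer"]] += event["amount"]
        yield event["customer"], totals[event["customer"]]

stream = [
    {"customer": "a", "amount": 10.0},
    {"customer": "b", "amount": 5.0},
    {"customer": "a", "amount": 2.5},
]

updates = list(running_totals(stream))
```

Because `running_totals` is a generator, it works equally on an unbounded iterator of events; `list` is used here only because the sample stream is finite.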
Slide 23
Conclusion
● Data stack is heterogeneous
● Many patterns for data movement with
different trade-offs
● Newer patterns focus on online use cases
Slide 24
As an engineer
● In which systems does the data I need live? Where
does it need to go?
● How does that data need to be queried?
● What are the latency characteristics?
● Can this data be updated or is it immutable?
● Do I need to do any transformation before it hits the
target system?
● What is the schema and format of the source and
destination?
● What kind of guarantees are required on this data?
● How should failures be handled?
Slide 25
💡
A Unified Data Movement Platform
Slide 26
What about
Slide 27
A Unified Data Movement Platform
Should:
● Abstract away the technologies
● Automatically choose the most appropriate
technologies
● Support the full range of use cases, from simple
to complex
Slide 28
Principles
● Real-time is the default
● 1:n Connectivity
● ETL > ELT
● Unified UX
Slide 29
Real-time Is the Default
CDC + Event Streaming + Stream Processing
Slide 30
1:n Connectivity
Source once, consume for different use cases
Slide 31
ETL > ELT
● E, T, L all interface with streaming data
● Ability to transform data when needed
● Zero additional cost or latency when it’s not needed
Slide 32
Unified UX - Abstraction
● Source Connectors
○ Turn all data into streaming data
● Stream Processing (Optional)
○ Continuously process streaming data
● Sink Connectors
○ Consume streaming data and write it to the
destination systems
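The three-part abstraction above can be sketched as a source that yields records, an optional processing step, and a sink that writes them. This is an illustrative shape only, not the API of any particular platform; `run_pipeline`, `mask_email`, and the record fields are hypothetical.

```python
def run_pipeline(source, sink, transform=None):
    """Move records from source to sink, applying transform only if one is given."""
    delivered = 0
    for record in source:
        if transform is not None:
            record = transform(record)
            if record is None:
                continue  # the transform chose to drop this record
        sink(record)
        delivered += 1
    return delivered

def mask_email(record):
    """Example transform: hide the local part of an email before storage."""
    domain = record["email"].split("@", 1)[1]
    return {**record, "email": "***@" + domain}

users = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.org"}]

masked_sink, raw_sink = [], []
run_pipeline(iter(users), masked_sink.append, transform=mask_email)  # ETL-style
run_pipeline(iter(users), raw_sink.append)                           # no transform step
```

When no transform is supplied, records pass straight from source to sink, so the same pipeline shape serves both the transform-first and load-first cases.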
Slide 33
Unified UX
Declarative YAML for
● ETL or ELT
● SQL or language-specific
processing
● Online or offline use cases
Slide 34
Summary
● Data movement is a stubborn problem
● A unified data movement platform should be the
equivalent of Kubernetes for data movement