Old Dogs, New Tricks – A pragmatic guide for modern data movement platforms

Old Dogs, New Tricks
A pragmatic guide for data movement
Sharon Xie, Founding Engineer, Decodable

About Me
• Software Engineer
• Product Manager
• Solution Architect
• Customer Support

Agenda
• Where is Data Movement?
• Data Movement Patterns
• A Unified Approach to Data Movement

Data Movement Is Everywhere

Data Movement Use Cases
• Online
  ◦ Caches and search index
  ◦ User-facing analytics
  ◦ Monitoring and alerting
• Offline
  ◦ Data analytics
  ◦ Business intelligence
  ◦ ML model training

Old Dogs
• Batch ETL (Extract, Transform, Load)
• ELT (Extract, Load, Transform)
• Point-to-point

Batch ETL
Runs data movement on a predefined schedule
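
A minimal sketch of one batch ETL run, assuming two SQLite databases stand in for the source and target systems. The table names (`orders`, `orders_report`), the columns, and the function `run_batch_etl` are hypothetical and chosen only for illustration.

```python
import sqlite3


def run_batch_etl(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """One scheduled run: extract every order, transform it, load a fresh snapshot."""
    # Extract: read the full source table (hypothetical `orders` schema).
    rows = source.execute("SELECT id, amount_cents FROM orders").fetchall()
    # Transform: convert cents to dollars before loading.
    transformed = [(order_id, cents / 100) for order_id, cents in rows]
    # Load: replace the previous snapshot in the target.
    target.execute("DELETE FROM orders_report")
    target.executemany(
        "INSERT INTO orders_report (id, amount_usd) VALUES (?, ?)", transformed
    )
    target.commit()
    return len(transformed)


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1250), (2, 399)])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders_report (id INTEGER PRIMARY KEY, amount_usd REAL)")

# A scheduler (cron, for example) would call this once per interval.
loaded = run_batch_etl(source, target)
```

Between runs the target only reflects the last snapshot, so anything written to the source after a run stays invisible until the next one.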

Batch ETL
• ✅ Well-known pattern
• ✅ Robust systems
• ❌ Online use cases

ELT
• Loads raw data into data systems
• Transformation in the data systems
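
As a rough illustration of the load-then-transform order, the sketch below uses an in-memory SQLite database as a stand-in for the data system. The `raw_events` and `spend_by_user` tables and their columns are assumptions made up for this example.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Load: land the raw records untouched (hypothetical `raw_events` table).
warehouse.execute(
    "CREATE TABLE raw_events (user_id TEXT, email TEXT, amount_cents INTEGER)"
)
warehouse.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("u1", "a@example.com", 500),
        ("u1", "a@example.com", 250),
        ("u2", "b@example.com", 100),
    ],
)

# Transform: done afterwards, in SQL, inside the data system itself.
warehouse.execute(
    """
    CREATE TABLE spend_by_user AS
    SELECT user_id, SUM(amount_cents) AS total_cents
    FROM raw_events
    GROUP BY user_id
    """
)

totals = dict(warehouse.execute("SELECT user_id, total_cents FROM spend_by_user"))
```

Note that the raw email addresses are stored before any transformation runs, which is the situation that matters when data has to be cleaned or masked before it is allowed to land.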

ELT
• ✅ Simple to use
• ❌ Online use cases
• ❌ When data must be transformed before storing
  ◦ E.g., security and compliance

Point-to-point
Direct connectivity and transformation

Point-to-point
• ✅ Specialized for use cases
• 🟡 Lack of abstraction and data inconsistency issues

New Tricks
• Change Data Capture (CDC)
• Event Streaming
• Stream Processing

Change Data Capture (CDC)
Capture database change events in real time

Example - Debezium Log-Based Change Data Capture
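
To make the idea of change events concrete, here is a small sketch of a consumer that replays them into an in-memory replica. The events are a trimmed-down imitation of the Debezium envelope (`op`, `before`, `after`); real Debezium events carry more fields, such as source metadata and timestamps. The `id` primary key, the sample rows, and `apply_change` are assumptions for illustration.

```python
from typing import Any

ChangeEvent = dict[str, Any]


def apply_change(replica: dict[int, dict[str, Any]], event: ChangeEvent) -> None:
    """Apply one change event to a replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":  # delete: only the `before` image is populated
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


events: list[ChangeEvent] = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}},
    {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": None},
]

replica: dict[int, dict[str, Any]] = {}
for event in events:
    apply_change(replica, event)
```

Because every insert, update, and delete arrives as its own event, the replica can follow the source continuously instead of waiting for a scheduled snapshot.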

Change Data Capture (CDC)
• ✅ Enable real-time data movement
  ◦ ✅ Online use cases
• 🟡 Must integrate with other technologies

Event Streaming
• Real-time event ingestion in persistent storage
• Decoupled event producers and consumers
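
The decoupling can be sketched with a toy append-only log in which each consumer keeps its own read offset. This is an in-memory illustration only, not Kafka's API, and it skips persistence entirely; `EventLog`, `poll`, and the consumer names are hypothetical.

```python
class EventLog:
    """Append-only log; each consumer tracks its own read offset."""

    def __init__(self) -> None:
        self._events: list[str] = []
        self._offsets: dict[str, int] = {}

    def append(self, event: str) -> int:
        """Producer side: add an event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer: str) -> list[str]:
        """Consumer side: return events not yet seen by this consumer."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch


log = EventLog()
log.append("order-created")
log.append("order-paid")

search_batch = log.poll("search-indexer")    # reads the first two events
log.append("order-shipped")
analytics_batch = log.poll("analytics")      # a new consumer still sees all three
search_batch_2 = log.poll("search-indexer")  # resumes from its own offset
```

The producer never needs to know which consumers exist, and a consumer added later can still read events from the beginning.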

Example - Apache Kafka

Event Streaming
• ✅ Online use cases
• ✅ Source once, consume multiple times
• 🟡 Processing is limited
  ◦ Additional infrastructure for complex transformations

Stream Processing
Continuous and incremental processing over streaming data
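
One way to picture incremental processing is a running aggregate that is updated per event rather than recomputed over the whole dataset. The generator below is a plain-Python sketch of that idea, not Flink code; the `(user, amount)` event shape and `running_totals` are assumptions.

```python
from typing import Iterable, Iterator


def running_totals(events: Iterable[tuple[str, int]]) -> Iterator[tuple[str, int]]:
    """Emit the updated total for a key each time an event for it arrives."""
    totals: dict[str, int] = {}  # state kept between events
    for key, amount in events:
        totals[key] = totals.get(key, 0) + amount
        yield key, totals[key]


stream = [("u1", 5), ("u2", 3), ("u1", 7)]
updates = list(running_totals(stream))
```

Each incoming event touches only the state for its own key, so the result stays current without rescanning earlier events.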

Technology - Apache Flink
• Highly scalable
• Exactly-once processing semantics
• Layered APIs: Streaming SQL (easy to use) ↔ DataStream (expressive)

Stream Processing
• ✅ Online use cases
• ✅ Support complex transformations
• 🟡 Hard to operationalize

Conclusion
• Data stack is heterogeneous
• Many patterns for data movement with different trade-offs
• Newer patterns focus on online use cases

As an engineer
• In which systems does the data I need live? Where does it need to go?
• How does that data need to be queried?
• What are the latency characteristics?
• Can this data be updated or is it immutable?
• Do I need to do any transformation before it hits the target system?
• What is the schema and format of the source and destination?
• What kind of guarantees are required on this data?
• How should failures be handled?

💡 A Unified Data Movement Platform

What about

A Unified Data Movement Platform Should:
• Abstract away the technologies
• Automatically choose the most appropriate technologies
• Support the full range of simple to complex use cases

Principles
• Real-time is the default
• 1:n Connectivity
• ETL > ELT
• Unified UX

Real-time Is the Default
CDC + Event Streaming + Stream Processing

1:n Connectivity
Source once, consume for different use cases

ETL > ELT
• E, T, L all interface with streaming data
• Ability to transform data when needed
• Zero additional cost or latency when it’s not

Unified UX - Abstraction
• Source Connectors
  ◦ Turn all data into streaming data
• Stream Processing (Optional)
  ◦ Continuously process streaming data
• Sink Connectors
  ◦ Consume streaming data and write it to the destination systems
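
The three building blocks above can be sketched as plain Python callables, with the processing step optional. This is an illustrative shape only, not Decodable's API; `run_pipeline`, `orders_source`, `mask_email`, and the record fields are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, Optional

Record = dict
Source = Callable[[], Iterable[Record]]                  # turns data into a stream
Processor = Callable[[Iterable[Record]], Iterator[Record]]
Sink = Callable[[Iterable[Record]], None]                # writes a stream somewhere


def run_pipeline(source: Source, sink: Sink, processor: Optional[Processor] = None) -> None:
    """Connect a source to a sink, passing through a processor only if one is given."""
    stream = source()
    if processor is not None:
        stream = processor(stream)
    sink(stream)


def orders_source() -> Iterator[Record]:
    yield {"id": 1, "email": "a@example.com", "total": 20}
    yield {"id": 2, "email": "b@example.com", "total": 35}


def mask_email(records: Iterable[Record]) -> Iterator[Record]:
    """Example transformation applied before the data reaches the sink."""
    for record in records:
        yield {**record, "email": "***"}


warehouse: list[Record] = []
cache: list[Record] = []

run_pipeline(orders_source, warehouse.extend)                # no processing step
run_pipeline(orders_source, cache.extend, processor=mask_email)
```

When no processor is supplied the records flow straight from source to sink, so the transformation step adds nothing to pipelines that do not need it.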

Unified UX
Declarative YAML for
• ETL or ELT
• SQL or language-specific processing
• Online or offline use cases

Summary
• Data movement is a stubborn problem
• A unified data movement platform should be the equivalent of Kubernetes for data movement

Q&A @sharon_rxie
