Slide 1

Slide 1 text

Old Dogs, New Tricks A pragmatic guide for data movement Sharon Xie, Founding Engineer, Decodable

Slide 2

Slide 2 text

About Me Software Engineer Product Manager Solution Architect Customer Support

Slide 3

Slide 3 text

Agenda ● Where is Data Movement? ● Data Movement Patterns ● A Unified Approach to Data Movement

Slide 4

Slide 4 text

Data Movement Is Everywhere

Slide 5

Slide 5 text

Data Movement Use Cases ● Online ○ Caches and search indexes ○ User-facing analytics ○ Monitoring and alerting ● Offline ○ Data analytics ○ Business intelligence ○ ML model training

Slide 6

Slide 6 text

Old Dogs ● Batch ETL (Extract, Transform, Load) ● ELT (Extract, Load, Transform) ● Point to point

Slide 7

Slide 7 text

Batch ETL Runs data movement on a predefined schedule
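A scheduled batch run can be sketched as below. This is a minimal illustration, not any particular tool's implementation: the `run_batch` function, the row fields (`amount_cents`, `updated_at`), and the daily window are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def run_batch(source_rows, warehouse, window_start, window_end):
    """Move every row whose updated_at falls in [window_start, window_end)."""
    # Extract: select only the rows that changed inside the scheduled window.
    batch = [r for r in source_rows if window_start <= r["updated_at"] < window_end]
    # Transform: reshape the rows before they reach the destination.
    transformed = [{"id": r["id"], "amount_usd": r["amount_cents"] / 100} for r in batch]
    # Load: append the transformed rows to the destination (a list stands in for a warehouse).
    warehouse.extend(transformed)
    return len(transformed)


# Hypothetical source data spanning two days.
source = [
    {"id": 1, "amount_cents": 1250, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "amount_cents": 300, "updated_at": datetime(2024, 1, 1, 23, 30)},
    {"id": 3, "amount_cents": 999, "updated_at": datetime(2024, 1, 2, 0, 5)},
]
warehouse = []
day = datetime(2024, 1, 1)
moved = run_batch(source, warehouse, day, day + timedelta(days=1))  # moves rows 1 and 2
```

Row 3 arrives five minutes after the window closes and is not visible in the destination until the next scheduled run, which is the latency that makes this pattern a poor fit for online use cases.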

Slide 8

Slide 8 text

Batch ETL ● ✅ Well known pattern ● ✅ Robust systems ● ❌ Online use cases

Slide 9

Slide 9 text

ELT ● Loads raw data into data systems ● Transformation in the data systems

Slide 10

Slide 10 text

ELT ● ✅ Simple to use ● ❌ Online use cases ● ❌ When data must be transformed before storing ○ E.g., security and compliance
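The ELT flow and its compliance caveat can be sketched with an in-memory SQLite database standing in for the data system. The table and column names (`raw_orders`, `orders_clean`, `email`) are hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")

# Load: raw records land in the data system untouched.
conn.execute("CREATE TABLE raw_orders (id INTEGER, email TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "a@example.com", 1250), (2, "b@example.com", 300)],
)

# Transform: the data system itself derives the cleaned table with SQL.
conn.execute(
    "CREATE TABLE orders_clean AS "
    "SELECT id, amount_cents / 100.0 AS amount_usd FROM raw_orders"
)

rows = conn.execute("SELECT id, amount_usd FROM orders_clean ORDER BY id").fetchall()
# The raw table still holds the untransformed email addresses.
raw_emails = conn.execute("SELECT COUNT(*) FROM raw_orders WHERE email IS NOT NULL").fetchone()[0]
```

The cleaned table no longer exposes emails, but the raw table still stores them. That is the case where transformation has to happen before the data is stored.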

Slide 11

Slide 11 text

Point-to-point Direct connectivity and transformation

Slide 12

Slide 12 text

Point-to-point ● ✅ Specialized for use cases ● 🟡 Lack of abstraction and data inconsistency issues

Slide 13

Slide 13 text

New Tricks ● Change Data Capture (CDC) ● Event Streaming ● Stream Processing

Slide 14

Slide 14 text

Change Data Capture (CDC) Capture database change events in real-time

Slide 15

Slide 15 text

Example - Debezium Log-Based Change Data Capture
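To make the idea concrete, here is a minimal sketch of a consumer that replays change events into a replica. The events are simplified and modeled on the Debezium envelope fields `op`, `before`, and `after`; real Debezium events also carry schema and source metadata. The `apply_change` helper and the row shape are assumptions for illustration.

```python
def apply_change(replica, event):
    """Apply one simplified change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, and update all carry the new row state in "after".
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        # Delete carries the last known row state in "before".
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return replica


# Hypothetical change stream for an orders table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
# replica now mirrors the source table: {1: {"id": 1, "status": "paid"}}
```

Replaying the events in order keeps a cache or search index in step with the source database without polling it.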

Slide 16

Slide 16 text

Change Data Capture (CDC) ● ✅ Enable real-time data movement ○ ✅ Online use cases ● 🟡 Must integrate with other technologies

Slide 17

Slide 17 text

Event Streaming ● Real-time event ingestion in persistent storage ● Decoupled event producers and consumers

Slide 18

Slide 18 text

Example - Apache Kafka

Slide 19

Slide 19 text

Event Streaming ● ✅ Online use cases ● ✅ Source once, consume multiple times ● 🟡 Processing is limited ○ Additional infrastructure for complex transformations
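"Source once, consume multiple times" can be sketched with a toy in-memory log that tracks one read offset per consumer. This is not Kafka's API. The `EventLog` class and the consumer names are hypothetical and only illustrate the decoupling.

```python
class EventLog:
    """Toy append-only log with an independent read offset per consumer."""

    def __init__(self):
        self._events = []
        self._offsets = {}

    def append(self, event):
        # Producers only append; they know nothing about consumers.
        self._events.append(event)
        return len(self._events) - 1  # position of the appended event

    def poll(self, consumer, max_events=10):
        # Each consumer resumes from its own offset, so reads never interfere.
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
for order_id in (1, 2, 3):
    log.append({"order_id": order_id})

cache_view = log.poll("cache-updater")    # all three events
search_view = log.poll("search-indexer")  # the same three events, read independently
nothing_new = log.poll("cache-updater")   # [] until a producer appends again
```

Events are written once and stay in the log, so adding a new consumer requires no change on the producer side.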

Slide 20

Slide 20 text

Stream Processing Continuous and incremental processing over streaming data
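Incremental processing means each event updates existing state rather than triggering a recomputation over all history. A minimal sketch, assuming a hypothetical `RunningTotals` operator and event fields `customer` and `amount` (this is plain Python, not a Flink API):

```python
from collections import defaultdict


class RunningTotals:
    """Keeps a per-customer running total, updated one event at a time."""

    def __init__(self):
        self.totals = defaultdict(int)

    def process(self, event):
        # Only the state for this event's key is touched.
        self.totals[event["customer"]] += event["amount"]
        # Emit the updated result immediately, without waiting for a batch.
        return event["customer"], self.totals[event["customer"]]


operator = RunningTotals()
stream = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
]
outputs = [operator.process(event) for event in stream]
# outputs == [("a", 10), ("b", 5), ("a", 17)]
```

A result is available after every event, which is what makes the pattern suitable for online use cases.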

Slide 21

Slide 21 text

Technology - Apache Flink ● Highly Scalable ● Exactly-once processing semantics ● Layered APIs: Streaming SQL (easy to use) ↔ DataStream (expressive)

Slide 22

Slide 22 text

Stream Processing ● ✅ Online use cases ● ✅ Support complex transformations ● 🟡 Hard to operationalize

Slide 23

Slide 23 text

Conclusion ● Data stack is heterogeneous ● Many patterns for data movement with different trade-offs ● Newer patterns focus on online use cases

Slide 24

Slide 24 text

As an engineer ● In which systems does the data I need live? Where does it need to go? ● How does that data need to be queried? ● What are the latency characteristics? ● Can this data be updated or is it immutable? ● Do I need to do any transformation before it hits the target system? ● What is the schema and format of the source and destination? ● What kind of guarantees are required on this data? ● How should failures be handled?

Slide 25

Slide 25 text

💡 A Unified Data Movement Platform

Slide 26

Slide 26 text

What about

Slide 27

Slide 27 text

A Unified Data Movement Platform Should: ● Abstract away the technologies ● Automatically choose the most appropriate technologies ● Support the full range of simple to complex use cases

Slide 28

Slide 28 text

Principles ● Real-time is the default ● 1:n Connectivity ● ETL > ELT ● Unified UX

Slide 29

Slide 29 text

Real-time Is the Default CDC + Event Streaming + Stream Processing

Slide 30

Slide 30 text

1:n Connectivity Source once, consume for different use cases

Slide 31

Slide 31 text

ETL > ELT ● E, T, L all interface with streaming data ● Ability to transform data when needed ● Zero additional cost or latency when it’s not

Slide 32

Slide 32 text

Unified UX - Abstraction ● Source Connectors ○ Turn all data into streaming data ● Stream Processing (Optional) ○ Continuously process streaming data ● Sink Connectors ○ Consume streaming data and write it to the destination systems
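The three-part abstraction can be sketched with Python generators. All function names here (`source_connector`, `mask_email`, `sink_connector`, `run_pipeline`) are hypothetical and do not represent any product's API. They only show how an optional processing stage slots in between a source and a sink.

```python
def source_connector(records):
    """Turn any iterable of records into a stream of events."""
    for record in records:
        yield record


def mask_email(stream):
    """Optional processing stage: hide the local part of each email address."""
    for event in stream:
        _, _, domain = event["email"].partition("@")
        yield {**event, "email": "***@" + domain}


def sink_connector(stream, destination):
    """Consume the stream and write each event to the destination."""
    for event in stream:
        destination.append(event)


def run_pipeline(records, destination, processor=None):
    stream = source_connector(records)
    if processor is not None:
        # The processing stage is inserted only when a transformation is needed.
        stream = processor(stream)
    sink_connector(stream, destination)


records = [{"id": 1, "email": "a@example.com"}]

raw_copy = []
run_pipeline(records, raw_copy)  # no processor: events pass straight through

masked_copy = []
run_pipeline(records, masked_copy, processor=mask_email)  # transform before storing
```

The same source and sink serve both runs. Leaving out the processor adds no extra stage, and supplying one transforms the data before it reaches the destination.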

Slide 33

Slide 33 text

Unified UX Declarative YAML for ● ETL or ELT ● SQL or language-specific processing ● Online or offline use cases

Slide 34

Slide 34 text

Summary ● Data movement is a stubborn problem ● A unified data movement platform should be the equivalent of Kubernetes for data movement

Slide 35

Slide 35 text

Q&A @sharon_rxie