Kubernetes for Data Engineers: Building Scalable, Reliable Data Pipelines

Slide 1

Slide 1 text

K8S Summit TW 2024 Kubernetes for Data Engineers: Building Scalable, Reliable Data Pipelines Shuhsi Lin 20241024 Photo by Krishna Mantripragada on Unsplash

Slide 2

Slide 2 text

About Me
Shuhsi Lin: working in smart manufacturing & AI, with data and people
sciwork member
Interested in
● Agile / Engineering Culture / Developer Experience
● Team Coaching
● Data Engineering
Photo by NordWood Themes on Unsplash

Slide 3

Slide 3 text

Agenda
01 Data Pipeline: ETL/ELT & complex data pipelines
02 Scalable & Reliable + Maintainable: what are scalable and reliable (+ maintainable) pipelines
03 dbt & Data Pipelines on K8S: how to be scalable and reliable (+ maintainable) with dbt + K8S
04 Recap & More: more to do for scalable, reliable and maintainable data pipelines

Slide 4

Slide 4 text

Scenario: SmartPizza (like Pizzaxxx, but smarter?)
● Daily work: take orders with specific recipes and make pizza
● Data (tables)
○ Order
○ Recipes
○ Customer
○ Inventory
○ …
Assume SmartPizza:
- Operates 20,000 branches worldwide
- Serves 4 million customers per day
- Makes 5 million pizzas per day
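A back-of-the-envelope sketch of the load these assumed figures imply. The even spread across branches and across 24 hours is a simplifying assumption made here for illustration; real traffic would likely peak around meal times.

```python
# Rough load estimate from the scenario's assumed figures.
BRANCHES = 20_000
CUSTOMERS_PER_DAY = 4_000_000
PIZZAS_PER_DAY = 5_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Averages only: assumes an even spread over branches and over the day.
pizzas_per_branch = PIZZAS_PER_DAY / BRANCHES
customers_per_branch = CUSTOMERS_PER_DAY / BRANCHES
pizzas_per_second = PIZZAS_PER_DAY / SECONDS_PER_DAY

print(f"{pizzas_per_branch:.0f} pizzas per branch per day")
print(f"{customers_per_branch:.0f} customers per branch per day")
print(f"{pizzas_per_second:.1f} pizzas per second on average")
```

Averages like these set a floor for pipeline capacity; peak-hour load is what scaling has to absorb.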

Slide 5

Slide 5 text

Data Pipelines https://unsplash.com/photos/a-group-of-pipes-that-are-connected-to-each-other-Xlg2KbYFUoM

Slide 6

Slide 6 text

Hello ETL/ELT World
ETL: Extract → Transform → Load
ELT: Extract → Load → Transform
Data source → ETL/ELT pipelines → Data store/target → Data applications
Input Data → Operation → Output Data
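A minimal sketch of the difference between the two orderings, using Python's built-in sqlite3 as a stand-in for the data store. The table names, fields, and cleaning rules are hypothetical and only serve to show where the transform step runs.

```python
import sqlite3

# Hypothetical raw order records, standing in for a data source.
RAW_ORDERS = [
    {"order_id": 1, "pizza": " margherita ", "qty": "2"},
    {"order_id": 2, "pizza": "PEPPERONI", "qty": "1"},
]


def extract():
    """Extract: read records from the source."""
    return list(RAW_ORDERS)


def etl(conn):
    """ETL: transform in pipeline code first, then load the clean rows."""
    clean = [
        (r["order_id"], r["pizza"].strip().lower(), int(r["qty"]))
        for r in extract()
    ]
    conn.execute("CREATE TABLE orders_etl (order_id INT, pizza TEXT, qty INT)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", clean)


def elt(conn):
    """ELT: load the raw rows as-is, then transform inside the store with SQL."""
    raw = [(r["order_id"], r["pizza"], r["qty"]) for r in extract()]
    conn.execute("CREATE TABLE raw_orders (order_id INT, pizza TEXT, qty TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id,
               lower(trim(pizza)) AS pizza,
               CAST(qty AS INTEGER) AS qty
        FROM raw_orders
        """
    )


conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
print(etl_rows == elt_rows)  # both routes produce the same cleaned table
```

In the ELT route the raw table stays in the store, so the SQL transform can be rerun or changed later without extracting again.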

Slide 7

Slide 7 text

Data Pipeline
Extract/Acquire Data → Transform/Process (parse, filter, split/merge, route) → Load/Ingest/Move
Diverse sources and diverse targets: data stores, applications, BI tools, target data stores

Slide 8

Slide 8 text

Simplistic Data Flow
Data Store/Application A → Acquire/Ingest Data → Process and Analyze Data → Data Store/Application B
● Data movement as flow
● Moving data content from A to B

Slide 9

Slide 9 text

Many Flow-like Data in the Real World
Across organizations / business units / geographic locations

Slide 10

Slide 10 text

https://mattturck.com/landscape/mad2024.pdf https://mad.firstmark.com/ The 2024 MAD (Machine Learning, AI & Data) Landscape

Slide 11

Slide 11 text

medium @milkdd datafold blog.bytebytego.com Linkedin Post

Slide 12

Slide 12 text

https://a16z.com/emerging-architectures-for-modern-data-infrastructure/ We will talk about this later

Slide 13

Slide 13 text

What often happens in a complex data pipeline https://unsplash.com/photos/chrome-plated-industrial-background-equipment-industrial-tools-and-machinery-for-the-production-in-factory-shops-dairy-factory-steel-water-pipeline-chrome-pipes-modern-factory-interior-production-line-maze-of-metal-pipes-background-NLsF62mB1oc

Slide 14

Slide 14 text

Unreliable? Questionable? Quality issues?

Slide 15

Slide 15 text

Pipeline debt: technical debt in data pipelines
● Undocumented/unmanaged
● Untested
● Unstable
Ref: Down with pipeline debt / introducing Great Expectations. https://greatexpectations.io/blog/down-with-pipeline-debt-introducing-great-expectations

Slide 16

Slide 16 text

Designing Data-Intensive Applications
Ref: Designing Data-Intensive Applications, 2nd Edition, by Martin Kleppmann and Chris Riccomini, O'Reilly, 2025
Reliable: the ability of a system to perform its required functions consistently over time without failure.
● Data Integrity and Consistency
● Fault Tolerance and Error Handling
● Recoverability and Disaster Recovery
● Monitoring, Alerting, and Observability
● Security and Compliance
Maintainable: the ease with which a data pipeline can be understood, modified, extended, and troubleshot over its lifecycle.
● Modular, Reusable, and Evolvable Design
● Standardization and Best Practices
● Simplicity and Ease of Understanding
● Configurability and Operability
● Comprehensive Documentation and Knowledge Sharing
● Version Control and Collaboration Practices
● Automated Testing and Validation
Scalable: the system's ability to handle increasing amounts of data, higher processing loads, and more complex transformations efficiently and effectively.
● Handling Increased Load and Concurrency
● Performance Optimization
● Dynamic Scaling Strategies and Resource Management
● Elasticity and Automated Scaling
● Reliability, Fault Tolerance, and Automation

Slide 17

Slide 17 text

Reliability + Scalability + Maintainability in SmartPizza https://unsplash.com/photos/pizza-in-oven-IfQlwNqJqV8

Slide 18

Slide 18 text

Real-Time Order Processing and Delivery Optimization
Orders Request → ETL/ELT pipelines (Order Processing, Delivery Optimization) → Data store/target → Real-time delivery estimates
Challenges
● Variable Load
● Resource Allocation
● Complex Codebase
● Deployment Difficulties
● Lack of Modular System
● Lack of Visibility
● Inefficient Debugging
● Order Processing Failures
● Data Inconsistency
● Schema Changes
● Single Points of Failure

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

What is dbt?
dbt (data build tool): transform data using the same practices that software engineers use to build applications.
● Centralized
● Version Control
● Documentation
● Modularity
● Open-Source
https://www.getdbt.com/product/what-is-dbt https://github.com/dbt-labs/dbt-core

Slide 21

Slide 21 text

dbt models
Workflow: write model → layer models → Run / Build
A dbt model = a single .sql file; dbt code = SQL + Jinja
● SQL select statement
dbt models reference each other
● Creates natural dependencies
● dbt determines model execution order
1 command
● Create DAG
● Parallel execution
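The idea that references between models imply an execution order can be sketched with Python's standard graphlib (Python 3.9+). This is an illustration of topological ordering in general, not dbt's actual scheduler, and the model names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical models mapped to the models they reference (upstream dependencies).
MODEL_REFS = {
    "stg_orders": set(),
    "stg_customers": set(),
    "stg_recipes": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_pizza_sales": {"orders_enriched", "stg_recipes"},
}


def execution_batches(refs):
    """Group models into batches; models in one batch do not depend on
    each other, so they could run in parallel."""
    sorter = TopologicalSorter(refs)
    sorter.prepare()  # raises CycleError if the references form a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


for i, batch in enumerate(execution_batches(MODEL_REFS), start=1):
    print(i, batch)
```

Here the three staging models form the first batch, and each downstream model waits until everything it references has finished.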

Slide 22

Slide 22 text

https://docs.getdbt.com/docs/collaborate/explore-projects dbt doc

Slide 23

Slide 23 text

https://docs.getdbt.com/docs/collaborate/explore-projects model lineage graph

Slide 24

Slide 24 text

Ensure Reliability + Scalability + Maintainability in Data pipeline https://unsplash.com/photos/person-writing-bucket-list-on-book-RLw-UC03Gwc

Slide 25

Slide 25 text

Solutions Provided by dbt Modular Data Transformations Data Testing and Validation Reliable Scalable Maintainable ● Modular SQL Models ● Refactoring Support ● Built-in Testing Framework ● Schema Tests ● Custom Tests Version Control Integration ● Git Integration ● Collaborative Development Documentation and Data Lineage ● Auto-Generated Documentation ● Data Lineage Graphs Handling Schema Changes ● Incremental Models ● Macro Support Performance Optimization ● Compiled SQL ● Materializations

Slide 26

Slide 26 text

Ensure Reliability in Data https://unsplash.com/photos/person-writing-bucket-list-on-book-RLw-UC03Gwc

Slide 27

Slide 27 text

6 Dimensions of Data Quality
● Accuracy: the degree to which data correctly reflects the real-world object or event
● Completeness: expected comprehensiveness; are all datasets and data items recorded?
● Consistency: data across all systems reflects the same information and is in sync across the data stores
● Timeliness: information is available when it is expected and needed
● Uniqueness: there is only one instance of the information in a database
● Validity: information conforms to a specific format and follows business rules
https://kamal-ahmed.github.io/DQ-Dimensions.github.io/ https://hub.getdbt.com/infinitelambda/dq_tools/latest/
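Three of these dimensions (completeness, uniqueness, validity) can be checked mechanically. The sketch below uses plain Python on hypothetical order rows; the field names and the simple email pattern are assumptions for illustration, not part of any dbt package.

```python
import re

# Hypothetical order rows; field names are illustrative only.
ORDERS = [
    {"order_id": 1, "customer_email": "ann@example.com", "qty": 2},
    {"order_id": 2, "customer_email": None, "qty": 1},
    {"order_id": 2, "customer_email": "bob@example", "qty": 3},
]

# Deliberately simple format rule: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows where the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def duplicates(rows, field):
    """Values that appear more than once, violating uniqueness."""
    seen, dupes = set(), set()
    for r in rows:
        value = r[field]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def invalid(rows, field, pattern):
    """Non-null values that do not match the expected format."""
    return [
        r[field]
        for r in rows
        if r[field] is not None and not pattern.match(r[field])
    ]


print(completeness(ORDERS, "customer_email"))
print(duplicates(ORDERS, "order_id"))
print(invalid(ORDERS, "customer_email", EMAIL_RE))
```

Accuracy, consistency, and timeliness usually need a reference outside the table itself (the real-world event, another system, or a clock), so they are harder to express as a single-table check.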

Slide 28

Slide 28 text

How do we test?

Slide 29

Slide 29 text

Test data content
● Built-in
○ Singular data tests
○ Generic data tests
● Packages
○ dbt_utils
○ dbt_expectations
○ elementary
○ …
Test data schema
● dbt (model) contract
Test data code
● dbt unit test
● Recce

Slide 30

Slide 30 text

Two Types of Testing
Assertion about code (DEV/TEST env, CI): data frozen, code changed. Validating the code that processes data before it is deployed to prod.
● ETL code: pytest, ...
● Model code: dbt unit testing, Recce
Assumption about data (Prod env): code frozen, data changed. Validating the data as it's loaded into production.
● Data content: pydantic, Great Expectations, dbt test, dbt_utils/expectations/elementary…
● Data schemas: dbt data contracts
In both cases: Input data → Code → Output data
Ref: Webinar on: Testing frameworks in dbt. (2023)
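The two kinds of test can be contrasted in a few lines of plain Python. The business rule, field names, and thresholds below are hypothetical; in a dbt project the same split would roughly correspond to unit tests on model code versus data tests on loaded tables.

```python
def order_total(qty, unit_price, discount=0.0):
    """Transformation code under test (hypothetical business rule)."""
    return round(qty * unit_price * (1 - discount), 2)


def test_order_total():
    """Assertion about code: fixed inputs, run in DEV/TEST or CI
    whenever the code changes."""
    assert order_total(2, 10.0) == 20.0
    assert order_total(2, 10.0, discount=0.25) == 15.0


def check_batch(rows):
    """Assumption about data: run in production on every new batch while
    the code stays the same. Returns the rows that break the assumptions."""
    return [r for r in rows if r["qty"] <= 0 or r["unit_price"] < 0]


test_order_total()

todays_batch = [
    {"order_id": 1, "qty": 2, "unit_price": 10.0},
    {"order_id": 2, "qty": 0, "unit_price": 12.5},
]
bad_rows = check_batch(todays_batch)
print([r["order_id"] for r in bad_rows])
```

The code test fails only when someone changes `order_total`; the data check can fail on any day, even when no code was deployed.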

Slide 31

Slide 31 text

dbt run on Kubernetes https://unsplash.com/photos/opened-window-panel-uH2J-RqAChI

Slide 32

Slide 32 text

dbt run on Kubernetes
CronJob (schedule: "0 * * * *") + ConfigMap → dbt run starts → Run query (SQL) → Database / Data warehouse
ETL/ELT pipelines: DataStore → DataStore
● Scalability
● High Availability
● Resource Optimization
● Automation
● Monitoring and Logging
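A sketch of what such a CronJob could look like, built as a Python dictionary and printed as JSON (which Kubernetes accepts as a manifest format alongside YAML). The job name, image, ConfigMap name, and mount path are hypothetical, and the hourly schedule `0 * * * *` is an assumed reading of the schedule on the slide. Warehouse credentials are not shown here and would normally come from a Secret rather than a ConfigMap.

```python
import json


def dbt_cronjob(name, image, schedule, configmap):
    """Build a batch/v1 CronJob manifest that runs `dbt run` on a schedule."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Do not start a new run while the previous one is still going.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    "backoffLimit": 2,  # retry a failed run up to twice
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": "dbt",
                                    "image": image,
                                    "command": [
                                        "dbt", "run",
                                        "--profiles-dir", "/etc/dbt",
                                    ],
                                    "volumeMounts": [
                                        {"name": "dbt-profiles",
                                         "mountPath": "/etc/dbt"}
                                    ],
                                }
                            ],
                            "volumes": [
                                {"name": "dbt-profiles",
                                 "configMap": {"name": configmap}}
                            ],
                        }
                    },
                }
            },
        },
    }


manifest = dbt_cronjob(
    name="dbt-hourly",  # hypothetical job name
    image="registry.example.com/smartpizza/dbt:latest",  # hypothetical image
    schedule="0 * * * *",  # assumed: at minute 0 of every hour
    configmap="dbt-profiles",  # hypothetical ConfigMap holding profiles.yml
)
print(json.dumps(manifest, indent=2))
```

The sketch assumes the dbt project is baked into the image and that the image's working directory is the project root, so only `profiles.yml` needs to be mounted.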

Slide 33

Slide 33 text

Modern data platform using dbt in GCP https://www.analytics8.com/blog/best-in-breed-data-stack-platform-bigquery-dbt-and-looker/ Landing Staging Warehouse Mart

Slide 34

Slide 34 text

Modern data platform using dbt in AWS https://aws.amazon.com/tw/blogs/big-data/create-a-modern-data-platform-using-the-data-build-tool-dbt-in-the-aws-cloud/

Slide 35

Slide 35 text

dbt Cloud on Microsoft Fabric https://www.getdbt.com/blog/dbt-cloud-on-microsoft-fabric

Slide 36

Slide 36 text

dbt + cube https://cube.dev/docs/guides/dbt semantic layer

Slide 37

Slide 37 text

Scalability of SmartPizza https://unsplash.com/photos/baked-pizza-in-oven-x5jilo3ck3o Key Takeaways and Next

Slide 38

Slide 38 text

The Analytics Development Lifecycle (ADLC) https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle An integrated, iterative process Ingest/collect Store Process Output Source

Slide 39

Slide 39 text

Key Takeaways and Tips
● Data pipelines are increasingly complex
○ More challenges (SmartPizza)
■ Reliability
■ Scalability
■ Maintainability
● dbt + Kubernetes
○ Embrace Modular Architecture
○ Implement Robust Data and Code Testing
○ Self-healing and replication for fault tolerance
○ Automate Deployment Processes
○ Utilize Auto-Scaling Features
● More
○ Orchestrating dbt Workflows
○ Enhance Observability
○ Secure Data and Applications
○ Promote Collaboration and Version Control

Slide 40

Slide 40 text

Thanks!