Kubernetes for Data Engineers: Building Scalable, Reliable Data Pipelines

K8S Summit TW 2024
Kubernetes for Data Engineers: Building Scalable, Reliable Data Pipelines
Shuhsi Lin
2024-10-24
Photo by Krishna Mantripragada on Unsplash

About Me
Shuhsi Lin
Working in smart manufacturing & AI, with data and people
sciwork member
Find me on
Interested in
• Agile / Engineering Culture / Developer Experience
• Team Coaching
• Data Engineering
Photo by NordWood Themes on Unsplash

Agenda
01 Data Pipeline: ETL/ELT & complex data pipelines
02 Scalable & Reliable + Maintainable: What are scalable and reliable (+ maintainable) pipelines
03 dbt & Data Pipelines on K8S: How to be scalable and reliable (+ maintainable) with dbt + K8S
04 Recap & More: More to do for scalable, reliable and maintainable data pipelines

Scenario: SmartPizza (like Pizzaxxx, but smarter?)
• Daily work: take orders with specific recipes and make pizzas
• Data (tables)
  ◦ Order
  ◦ Recipes
  ◦ Customer
  ◦ Inventory
  ◦ …
Assume we:
- Operate 20,000 branches worldwide
- Serve 4 million customers per day
- Make 5 million pizzas per day
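To get a feel for the scale in this scenario, the stated figures can be turned into average rates. This is only a back-of-the-envelope sketch: the three inputs come from the scenario, while the function name and the idea of using flat daily averages are assumptions (real traffic would peak around meal times).

```python
# Back-of-the-envelope load figures for the SmartPizza scenario.
# The three inputs are taken from the scenario; everything derived
# from them is a flat daily average, which is an assumption.
BRANCHES = 20_000
CUSTOMERS_PER_DAY = 4_000_000
PIZZAS_PER_DAY = 5_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def average_load() -> dict[str, float]:
    """Return average rates derived from the scenario's daily totals."""
    return {
        "pizzas_per_branch_per_day": PIZZAS_PER_DAY / BRANCHES,
        "customers_per_branch_per_day": CUSTOMERS_PER_DAY / BRANCHES,
        "pizzas_per_customer": PIZZAS_PER_DAY / CUSTOMERS_PER_DAY,
        "pizzas_per_second": PIZZAS_PER_DAY / SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in average_load().items():
        print(f"{name}: {value:.2f}")
```

On average this is 250 pizzas per branch per day and roughly 58 pizzas per second across all branches, so any pipeline that tracks orders has to absorb a steady stream of events rather than an occasional batch.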

Data Pipelines https://unsplash.com/photos/a-group-of-pipes-that-are-connected-to-each-other-Xlg2KbYFUoM

Hello ETL/ELT World
ETL: Extract → Transform → Load
ELT: Extract → Load → Transform
Input data → Operation → Output data
Data source → ETL/ELT pipelines → Data store/target → Data applications

Data Pipeline
Diverse sources → Data pipeline → Diverse targets
• Extract/Acquire data
• Transform/Process: parse, filter, split/merge, route
• Load/Ingest/Move
• Stores and targets shown: data store, BI tool, target data store, application

Simplistic Data Flow
Data Store/Application A → Acquire/Ingest data → Process and analyze data → Data Store/Application B
• Data movement as flow
• Moving data content from A to B

Many Flow-like Data in the Real World
Across organizations / business units / geographic locations

The 2024 MAD (Machine Learning, AI & Data) Landscape
https://mattturck.com/landscape/mad2024.pdf
https://mad.firstmark.com/

Sources: Medium @milkdd, Datafold, blog.bytebytego.com, LinkedIn post

https://a16z.com/emerging-architectures-for-modern-data-infrastructure/
We will talk about this later

What often happens in a complex data pipeline
https://unsplash.com/photos/chrome-plated-industrial-background-equipment-industrial-tools-and-machinery-for-the-production-in-factory-shops-dairy-factory-steel-water-pipeline-chrome-pipes-modern-factory-interior-production-line-maze-of-metal-pipes-background-NLsF62mB1oc

Unreliable? Questionable? Quality issues?

Pipeline debt
Technical debt in data pipelines
• Undocumented / unmanaged
• Untested
• Unstable
Down with pipeline debt / introducing Great Expectations: https://greatexpectations.io/blog/down-with-pipeline-debt-introducing-great-expectations

Designing Data-Intensive Applications
Designing Data-Intensive Applications, 2nd Edition, by Martin Kleppmann and Chris Riccomini, O'Reilly, 2025
Reliable
The ability of a system to perform its required functions consistently over time without failure.
• Data Integrity and Consistency
• Fault Tolerance and Error Handling
• Recoverability and Disaster Recovery
• Monitoring, Alerting, and Observability
• Security and Compliance
Scalable
The system's ability to handle increasing amounts of data, higher processing loads, and more complex transformations efficiently and effectively.
• Handling Increased Load and Concurrency
• Performance Optimization
• Dynamic Scaling Strategies and Resource Management
• Elasticity and Automated Scaling
• Reliability, Fault Tolerance, and Automation
Maintainable
The ease with which a data pipeline can be understood, modified, extended, and troubleshot over its lifecycle.
• Modular, Reusable, and Evolvable Design
• Standardization and Best Practices
• Simplicity and Ease of Understanding
• Configurability and Operability
• Comprehensive Documentation and Knowledge Sharing
• Version Control and Collaboration Practices
• Automated Testing and Validation

Reliability + Scalability + Maintainability in SmartPizza https://unsplash.com/photos/pizza-in-oven-IfQlwNqJqV8

Real-Time Order Processing and Delivery Optimization
Diagram labels: Orders, Request, Real-time delivery estimates, ETL/ELT pipelines, Data store/target, Order Processing, Delivery Optimization
Challenges
• Variable Load
• Resource Allocation
• Complex Codebase
• Deployment Difficulties
• Lack of Modular System
• Lack of Visibility
• Inefficient Debugging
• Order Processing Failures
• Data Inconsistency
• Schema Changes
• Single Points of Failure

What is dbt?
dbt (data build tool)
Transform data using the same practices that software engineers use to build applications.
• Centralized
• Version Control
• Documentation
• Modularity
• Open-Source
https://www.getdbt.com/product/what-is-dbt
https://github.com/dbt-labs/dbt-core

dbt models: write models, layer models, run/build models
dbt model = a single .sql file
dbt code = SQL + Jinja
• SQL select statement
dbt models reference each other
• Creates natural dependencies
• dbt determines model execution order
1 command
• Create DAG
• Parallel execution
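How references between models turn into an execution order can be sketched in a few lines of Python. This is an illustration of the idea only, not how dbt is implemented: dbt renders Jinja rather than scanning with a regular expression, and the model names, SQL, and helper functions below are all hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models, keyed by model name (one entry per .sql file).
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_recipes": "select * from raw.recipes",
    "order_items": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_recipes') }} r on o.recipe_id = r.id"
    ),
    "daily_sales": (
        "select order_date, count(*) as pizzas "
        "from {{ ref('order_items') }} group by 1"
    ),
}

# Simplified stand-in for Jinja rendering: find ref('model_name') calls.
REF_PATTERN = re.compile(r"ref\(\s*['\"]([\w.]+)['\"]\s*\)")


def build_dag(models):
    """Map each model to the set of models it references."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def execution_batches(models):
    """Group models into batches; models in one batch could run in parallel."""
    sorter = TopologicalSorter(build_dag(models))
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(execution_batches(MODELS), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

In this sketch the two staging models have no dependencies, so they land in the first batch together, which is the kind of parallelism the DAG makes possible.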

dbt docs
https://docs.getdbt.com/docs/collaborate/explore-projects

Model lineage graph
https://docs.getdbt.com/docs/collaborate/explore-projects

Ensure Reliability + Scalability + Maintainability in Data pipeline https://unsplash.com/photos/person-writing-bucket-list-on-book-RLw-UC03Gwc

Solutions Provided by dbt
Reliable · Scalable · Maintainable
Modular Data Transformations
• Modular SQL Models
• Refactoring Support
Data Testing and Validation
• Built-in Testing Framework
• Schema Tests
• Custom Tests
Version Control Integration
• Git Integration
• Collaborative Development
Documentation and Data Lineage
• Auto-Generated Documentation
• Data Lineage Graphs
Handling Schema Changes
• Incremental Models
• Macro Support
Performance Optimization
• Compiled SQL
• Materializations

Ensure Reliability in Data https://unsplash.com/photos/person-writing-bucket-list-on-book-RLw-UC03Gwc

6 Dimensions of Data Quality
• Accuracy: the degree to which data correctly reflects the real-world object or event
• Completeness: expected comprehensiveness; are all datasets and data items recorded?
• Consistency: data across all systems reflects the same information and is in sync across the data stores
• Timeliness: information is available when it is expected and needed
• Uniqueness: there is only one instance of the information in a database
• Validity: information conforms to a specific format and follows business rules
https://kamal-ahmed.github.io/DQ-Dimensions.github.io/
https://hub.getdbt.com/infinitelambda/dq_tools/latest/
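Several of these dimensions can be expressed as simple checks over rows. The sketch below is a plain-Python illustration, not the dq_tools package or dbt's own tests; the column names, the order id format, and the function names are assumptions made up for the example. Accuracy and consistency are left out because they need a reference source to compare against.

```python
import re
from datetime import datetime, timedelta

# Hypothetical business rule: order ids look like "ORD-000123".
ORDER_ID_FORMAT = re.compile(r"^ORD-\d{6}$")


def is_complete(rows, column):
    """Completeness: every row has a non-null value in the column."""
    return all(row.get(column) is not None for row in rows)


def is_unique(rows, column):
    """Uniqueness: no value appears more than once in the column."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


def is_valid(rows, column, pattern):
    """Validity: every value matches the expected format."""
    return all(pattern.match(str(row.get(column))) is not None for row in rows)


def is_timely(rows, column, now, max_age):
    """Timeliness: every record is no older than max_age at time now."""
    return all(now - row[column] <= max_age for row in rows)


if __name__ == "__main__":
    now = datetime(2024, 10, 24, 12, 0)
    orders = [
        {"order_id": "ORD-000001", "customer_id": 7, "created_at": now - timedelta(minutes=5)},
        {"order_id": "ORD-000002", "customer_id": None, "created_at": now - timedelta(minutes=2)},
    ]
    print("complete:", is_complete(orders, "customer_id"))
    print("unique:", is_unique(orders, "order_id"))
    print("valid:", is_valid(orders, "order_id", ORDER_ID_FORMAT))
    print("timely:", is_timely(orders, "created_at", now, timedelta(hours=1)))
```

In a dbt project, checks like the uniqueness and completeness ones are usually declared rather than hand-written, for example with the built-in `unique` and `not_null` generic tests.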

How do we test?

Test data content
• Built-in
  ◦ Singular data tests
  ◦ Generic data tests
• Packages
  ◦ dbt_utils
  ◦ dbt_expectations
  ◦ dbt_elementary
  ◦ …
Test data schema
• dbt (model) contract
Test data code
• dbt unit test
• Recce

Two Types of Testing
ref: Webinar on: Testing frameworks in dbt (2023)
Assertion about code (DEV/TEST env, CI): data frozen, code changed
Validating the code that processes data before it is deployed to prod.
• ETL code
  ◦ pytest, ...
• Model code
  ◦ dbt unit testing
  ◦ Recce
Assumption about data (Prod env): code frozen, data changed
Validating the data as it's loaded into production.
• Data content
  ◦ pydantic
  ◦ Great Expectations
  ◦ dbt test
  ◦ dbt_utils / dbt_expectations / elementary …
• Data schemas
  ◦ dbt data contracts
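The difference between the two kinds of test is easier to see side by side. The following is a minimal sketch in plain Python rather than in dbt or any of the tools listed; the `order_total` transformation, its fields, and the validation rules are hypothetical.

```python
def order_total(quantity: int, unit_price: float, discount: float = 0.0) -> float:
    """Hypothetical transformation logic: price of one order line."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return round(quantity * unit_price * (1.0 - discount), 2)


def test_order_total():
    """Code test: the input data is frozen, so a failure means the code changed."""
    assert order_total(2, 10.0) == 20.0
    assert order_total(4, 2.5, discount=0.5) == 5.0


def validate_orders(rows):
    """Data test: the code is frozen, so a failure means the incoming data changed."""
    failures = []
    for index, row in enumerate(rows):
        quantity = row.get("quantity")
        unit_price = row.get("unit_price")
        if quantity is None or quantity <= 0:
            failures.append((index, "quantity must be positive"))
        if unit_price is None or unit_price < 0:
            failures.append((index, "unit_price must be non-negative"))
    return failures


if __name__ == "__main__":
    test_order_total()  # would run in CI, before deployment
    todays_orders = [
        {"quantity": 2, "unit_price": 10.0},
        {"quantity": 0, "unit_price": 8.0},
    ]
    print(validate_orders(todays_orders))  # would run on each production load
```

The first function would typically run under pytest in CI with fixed fixtures, while the second would run against every new batch in production, where only the data varies.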

dbt run on Kubernetes https://unsplash.com/photos/opened-window-panel-uH2J-RqAChI

dbt run on Kubernetes
Diagram labels: DataStore, ETL/ELT pipelines, ConfigMap, CronJob (schedule: "0 * * * *"), dbt run starts, Run query (SQL), Database, Data warehouse
• Scalability
• High Availability
• Resource Optimization
• Automation
• Monitoring and Logging
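A scheduled dbt run of this kind is usually described by a Kubernetes CronJob manifest. The sketch below builds one as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. It is an illustration under assumptions: the job name, container image, and ConfigMap name are hypothetical, and the hourly schedule "0 * * * *" is assumed to be what the slide's schedule stands for.

```python
import json


def dbt_cronjob_manifest(name, image, schedule, config_map):
    """Build a batch/v1 CronJob manifest that runs `dbt run` on a schedule."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Do not start a new run while the previous one is still going.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    # Retry a failed run a couple of times before giving up.
                    "backoffLimit": 2,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": "dbt",
                                    "image": image,
                                    "command": ["dbt", "run"],
                                    # Connection settings come from a ConfigMap.
                                    "envFrom": [
                                        {"configMapRef": {"name": config_map}}
                                    ],
                                }
                            ],
                        }
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = dbt_cronjob_manifest(
        name="dbt-hourly-run",            # hypothetical job name
        image="example.com/dbt:latest",   # hypothetical image with dbt installed
        schedule="0 * * * *",             # assumed: at minute 0 of every hour
        config_map="dbt-config",          # hypothetical ConfigMap name
    )
    print(json.dumps(manifest, indent=2))
```

Credentials such as warehouse passwords would normally come from a Secret rather than a ConfigMap; the ConfigMap is used here only because it is the object named on the slide.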

Modern data platform using dbt in GCP
https://www.analytics8.com/blog/best-in-breed-data-stack-platform-bigquery-dbt-and-looker/
Landing → Staging → Warehouse → Mart

Modern data platform using dbt on AWS
https://aws.amazon.com/tw/blogs/big-data/create-a-modern-data-platform-using-the-data-build-tool-dbt-in-the-aws-cloud/

dbt Cloud on Microsoft Fabric https://www.getdbt.com/blog/dbt-cloud-on-microsoft-fabric

dbt + Cube (semantic layer)
https://cube.dev/docs/guides/dbt

Scalability of SmartPizza https://unsplash.com/photos/baked-pizza-in-oven-x5jilo3ck3o Key Takeaways and Next

The Analytics Development Lifecycle (ADLC)
An integrated, iterative process
https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle
Source → Ingest/collect → Store → Process → Output

Key Takeaways and Tips
• Data pipelines are becoming more and more complex
  ◦ More challenges (SmartPizza)
    ▪ Reliability
    ▪ Scalability
    ▪ Maintainability
• dbt + Kubernetes
  ◦ Embrace Modular Architecture
  ◦ Implement Robust Data and Code Testing
  ◦ Self-healing and replication for fault tolerance
  ◦ Automate Deployment Processes
  ◦ Utilize Auto-Scaling Features
• More
  ◦ Orchestrating dbt Workflows
  ◦ Enhance Observability
  ◦ Secure Data and Applications
  ◦ Promote Collaboration and Version Control

Thanks!
