suci
October 24, 2024

Kubernetes for Data Engineers: Building Scalable, Reliable Data Pipelines

K8S Summit 2024 Taipei

Transcript

  1. K8S Summit TW 2024

    Kubernetes for Data Engineers: Building Scalable, Reliable Data Pipelines
    Shuhsi Lin, 20241024
    Photo by Krishna Mantripragada on Unsplash
  2. About Me

    Shuhsi Lin, sciwork member
    Working in smart manufacturing & AI, with data and people
    Interested in
    • Agile / Engineering Culture / Developer Experience
    • Team Coaching
    • Data Engineering
    Photo by NordWood Themes on Unsplash
  3. Agenda

    01 Data Pipeline: ETL/ELT & complex data pipelines
    02 Scalable & Reliable + Maintainable: what are scalable and reliable (+ maintainable) pipelines
    03 dbt & Data Pipelines on K8S: how to be scalable and reliable (+ maintainable) with dbt + K8S
    04 Recap & More: more to do for scalable, reliable and maintainable data pipelines
  4. Scenario: SmartPizza (like Pizzaxxx, but smarter?)

    • Daily work: get orders with specific recipes to make pizza
    • Data (tables)
      ◦ Order
      ◦ Recipes
      ◦ Customer
      ◦ Inventory
      ◦ …
    Assume SmartPizza would:
    - Operate 20,000 branches worldwide
    - Serve 4 million customers per day
    - Make 5 million pizzas per day
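
To get a feel for the scale these assumptions imply, the numbers can be turned into per-branch and per-second averages. This is only back-of-the-envelope arithmetic on the three figures from the slide. It assumes load is spread evenly over branches and over 24 hours, which real traffic would not be, so peak rates would be higher.

```python
# Figures taken from the SmartPizza scenario slide.
BRANCHES = 20_000
CUSTOMERS_PER_DAY = 4_000_000
PIZZAS_PER_DAY = 5_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Averages under the (unrealistic) assumption of perfectly even load.
pizzas_per_branch_per_day = PIZZAS_PER_DAY / BRANCHES
customers_per_branch_per_day = CUSTOMERS_PER_DAY / BRANCHES
pizzas_per_customer = PIZZAS_PER_DAY / CUSTOMERS_PER_DAY
avg_pizzas_per_second = PIZZAS_PER_DAY / SECONDS_PER_DAY

print(f"pizzas per branch per day:    {pizzas_per_branch_per_day:.0f}")
print(f"customers per branch per day: {customers_per_branch_per_day:.0f}")
print(f"pizzas per customer:          {pizzas_per_customer:.2f}")
print(f"average pizzas per second:    {avg_pizzas_per_second:.1f}")
```

On average that is 250 pizzas per branch per day and roughly 58 pizza records per second across the whole company, before any peak-hour multiplier.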
  5. Hello ETL/ELT World

    • ETL: Extract → Transform → Load
    • ELT: Extract → Load → Transform
    Diagram: Data source (input data) → ETL/ELT pipelines (operation) → Data store/target (output data) → Data applications
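
The difference between the two orderings can be sketched with an in-memory SQLite database standing in for the data store. Everything here is illustrative: the `RAW_ORDERS` rows, the table names and the cleaning rules are assumptions made up for the example, not part of the talk. In the ETL path the cleaning happens in Python before loading; in the ELT path the raw rows are loaded first and cleaned with SQL inside the store.

```python
import sqlite3

# Hypothetical raw input, as it might arrive from a source system.
RAW_ORDERS = [
    {"order_id": 1, "pizza": " margherita ", "qty": "2"},
    {"order_id": 2, "pizza": "Pepperoni", "qty": "1"},
]


def extract():
    """Extract: read rows from the source (here, a constant list)."""
    return list(RAW_ORDERS)


def etl(conn):
    """ETL: transform in application code, then load the cleaned rows."""
    rows = extract()
    cleaned = [
        (r["order_id"], r["pizza"].strip().lower(), int(r["qty"])) for r in rows
    ]
    conn.execute("CREATE TABLE orders_etl (order_id INTEGER, pizza TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def elt(conn):
    """ELT: load the raw rows as-is, then transform with SQL in the store."""
    rows = extract()
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, pizza TEXT, qty TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :pizza, :qty)", rows
    )
    conn.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT order_id, lower(trim(pizza)) AS pizza, CAST(qty AS INTEGER) AS qty "
        "FROM raw_orders"
    )


conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
print(etl_rows)
print(elt_rows)
```

Both paths end with the same cleaned table. What differs is where the transformation runs and whether the raw data is kept in the store.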
  6. Data Pipeline

    Diagram: Diverse sources (data, data store) → Data Pipeline → Diverse targets (target data store, BI tool, application)
    • Extract/Acquire
    • Transform/Process
      ◦ Parse
      ◦ Filter
      ◦ Split/merge
      ◦ Route
    • Load/Ingest/Move
  7. Simplistic Data Flow

    Diagram: Data Store/Application A → Acquire/Ingest Data → Process and Analyze Data → Data Store/Application B
    • Data movement as flow
    • Moving data content from A to B
  8. Pipeline debt

    Technical debt in data pipelines
    • Undocumented / unmanaged
    • Untested
    • Unstable
    Ref: Down with pipeline debt / introducing Great Expectations. https://greatexpectations.io/blog/down-with-pipeline-debt-introducing-great-expectations
  9. Designing Data-Intensive Applications

    Ref: Designing Data-Intensive Applications, 2nd Edition, by Martin Kleppmann and Chris Riccomini, O'Reilly, 2025

    Reliable: the ability of a system to perform its required functions consistently over time without failure.
    • Data Integrity and Consistency
    • Fault Tolerance and Error Handling
    • Recoverability and Disaster Recovery
    • Monitoring, Alerting, and Observability
    • Security and Compliance

    Scalable: the system's ability to handle increasing amounts of data, higher processing loads, and more complex transformations efficiently and effectively.
    • Handling Increased Load and Concurrency
    • Performance Optimization
    • Dynamic Scaling Strategies and Resource Management
    • Elasticity and Automated Scaling
    • Reliability, Fault Tolerance, and Automation

    Maintainable: the ease with which a data pipeline can be understood, modified, extended, and troubleshot over its lifecycle.
    • Modular, Reusable, and Evolvable Design
    • Standardization and Best Practices
    • Simplicity and Ease of Understanding
    • Configurability and Operability
    • Comprehensive Documentation and Knowledge Sharing
    • Version Control and Collaboration Practices
    • Automated Testing and Validation
  10. Challenges: Real-Time Order Processing and Delivery Optimization

    Diagram: Orders (request) → ETL/ELT pipelines → Data store/target → Order Processing, Delivery Optimization (real-time delivery estimates)
    • Variable Load
    • Resource Allocation
    • Complex Codebase
    • Deployment Difficulties
    • Lack of Modular System
    • Lack of Visibility
    • Inefficient Debugging
    • Order Processing Failures
    • Data Inconsistency
    • Schema Changes
    • Single Points of Failure
  11. What is dbt?

    dbt (data build tool): transform data using the same practices that software engineers use to build applications.
    • Centralized
    • Version Control
    • Documentation
    • Modularity
    • Open-Source
    https://www.getdbt.com/product/what-is-dbt
    https://github.com/dbt-labs/dbt-core
  12. dbt models

    Workflow: write models → run / build
    • dbt model = a single .sql file
    • dbt code = SQL + Jinja
      ◦ SQL select statement
    • dbt models reference each other
      ◦ Creates natural dependencies
      ◦ dbt determines the model execution order
    • 1 command
      ◦ Creates the DAG
      ◦ Parallel execution
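
How references between models turn into an execution order can be sketched in a few lines. The sketch below is not how dbt is implemented. It only mimics the idea: scan each model's SQL for `{{ ref('...') }}` calls, treat each reference as a dependency edge, and topologically sort the resulting DAG. The model names and SQL are hypothetical; `graphlib` is in the Python standard library from 3.9.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models: name -> SQL + Jinja text, one entry per .sql file.
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "customer_orders": """
        select c.customer_id, count(o.order_id) as order_count
        from {{ ref('stg_customers') }} c
        left join {{ ref('stg_orders') }} o using (customer_id)
        group by 1
    """,
    "daily_report": "select * from {{ ref('customer_orders') }}",
}

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def build_dag(models):
    """Map each model to the set of models it references."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def execution_batches(dag):
    """Group models into batches; models within a batch have no mutual
    dependencies, so they could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


dag = build_dag(MODELS)
batches = execution_batches(dag)
for i, batch in enumerate(batches, start=1):
    print(f"batch {i}: {batch}")
```

The two staging models land in the first batch because neither references another model, which is where parallel execution becomes possible.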
  13. Solutions Provided by dbt

    Reliable / Scalable / Maintainable
    • Modular Data Transformations
      ◦ Modular SQL Models
      ◦ Refactoring Support
    • Data Testing and Validation
      ◦ Built-in Testing Framework
      ◦ Schema Tests
      ◦ Custom Tests
    • Version Control Integration
      ◦ Git Integration
      ◦ Collaborative Development
    • Documentation and Data Lineage
      ◦ Auto-Generated Documentation
      ◦ Data Lineage Graphs
    • Handling Schema Changes
      ◦ Incremental Models
      ◦ Macro Support
    • Performance Optimization
      ◦ Compiled SQL
      ◦ Materializations
  14. 6 Dimensions of Data Quality

    • Accuracy: the degree to which data correctly reflects the real-world object or event
    • Completeness: expected comprehensiveness; are all datasets and data items recorded?
    • Consistency: data across all systems reflects the same information and is in sync across the data stores
    • Timeliness: information is available when it is expected and needed
    • Uniqueness: there is only one instance of the information appearing in a database
    • Validity: information conforms to a specific format and follows business rules
    https://kamal-ahmed.github.io/DQ-Dimensions.github.io/
    https://hub.getdbt.com/infinitelambda/dq_tools/latest/
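
Some of these dimensions can be measured directly as ratios over a table. The following is a minimal sketch for three of them (completeness, uniqueness, validity) in plain Python. The sample rows, the column names and the deliberately simple email pattern are all made up for illustration; they do not come from the talk or from any of the linked packages.

```python
import re

# Hypothetical order rows with deliberate quality problems:
# a missing email, a duplicated order_id, and a malformed email.
ORDERS = [
    {"order_id": 1, "customer_email": "amy@example.com"},
    {"order_id": 2, "customer_email": None},
    {"order_id": 2, "customer_email": "bob@example.com"},
    {"order_id": 3, "customer_email": "not-an-email"},
]

# Intentionally simple format rule, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, column):
    """Share of rows where the column is present (not None)."""
    return sum(r[column] is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among all values of the column."""
    values = [r[column] for r in rows]
    return len(set(values)) / len(values)


def validity(rows, column, pattern):
    """Share of non-missing values that match the expected format."""
    present = [r[column] for r in rows if r[column] is not None]
    return sum(bool(pattern.match(v)) for v in present) / len(present)


email_completeness = completeness(ORDERS, "customer_email")
order_id_uniqueness = uniqueness(ORDERS, "order_id")
email_validity = validity(ORDERS, "customer_email", EMAIL_RE)

print(f"completeness(customer_email) = {email_completeness:.2f}")
print(f"uniqueness(order_id)         = {order_id_uniqueness:.2f}")
print(f"validity(customer_email)     = {email_validity:.2f}")
```

Accuracy, consistency and timeliness usually need a reference outside the table itself (the real-world value, another system, or an arrival deadline), so they are left out of this sketch.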
  15. Test data content

    • Built-in
      ◦ Singular data tests
      ◦ Generic data tests
    • Packages
      ◦ dbt_utils
      ◦ dbt_expectations
      ◦ dbt_elementary
      ◦ …

    Test data schema
    • dbt (model) contract

    Test data code
    • dbt unit test
    • Recce
  16. Two Types of Testing

    Assertion about Code (DEV/TEST env, CI)
    • Data frozen, code changed
    • Validating the code that processes data before it is deployed to prod
    • ETL code
      ◦ pytest, ...
    • Model code
      ◦ dbt unit testing
      ◦ Recce

    Assumption about Data (Prod env)
    • Code frozen, data changed
    • Validating the data as it's loaded into production
    • Data content
      ◦ pydantic
      ◦ great expectations
      ◦ dbt test
      ◦ dbt_utils/expectations/elementary…
    • Data schemas
      ◦ dbt data contracts

    Ref: Webinar on: Testing frameworks in dbt. (2023)
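
The contrast between the two kinds of test can be shown with one small transformation. In this sketch, `order_total_cents` and `validate_order` are hypothetical names invented for the example. The first test pins the input and checks the code (it would run in CI, for instance under pytest); the second leaves the code alone and checks whatever data arrives at run time.

```python
def order_total_cents(unit_price_cents, qty):
    """Transformation under test: total price of one order line, in cents."""
    return unit_price_cents * qty


# 1) Assertion about code: fixed input, known expected output.
#    Fails only if someone changes the transformation logic.
def test_order_total_cents():
    assert order_total_cents(999, 3) == 2997


# 2) Assumption about data: the code is fixed, the incoming rows vary.
#    Returns the problems found instead of raising, so a pipeline
#    can decide whether to reject, quarantine, or alert.
def validate_order(row):
    problems = []
    if row.get("order_id") is None:
        problems.append("order_id is missing")
    if not isinstance(row.get("qty"), int) or row["qty"] <= 0:
        problems.append("qty must be a positive integer")
    return problems


test_order_total_cents()
print(validate_order({"order_id": 7, "qty": 2}))
print(validate_order({"order_id": None, "qty": 0}))
```

In a dbt project the same split would map to the tools listed on the slide, with unit tests on the code side and data tests or contracts on the data side.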
  17. dbt run on Kubernetes

    Diagram: data stores → ETL/ELT pipelines → database / data warehouse
    • A CronJob (schedule: "0 * * * *") with a configMap starts dbt run
    • dbt run runs queries (SQL) against the database / data warehouse

    • Scalability
    • High Availability
    • Resource Optimization
    • Automation
    • Monitoring and Logging
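
One way the CronJob on this slide could be described is sketched below as a Python function that builds a `batch/v1` CronJob manifest and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The slide does not show the actual manifest, so every concrete value here is an assumption: the job name, the image reference, the ConfigMap name and the mount path are hypothetical placeholders, and the image is assumed to already contain dbt and the project files.

```python
import json


def dbt_cronjob_manifest(name, image, schedule, profiles_configmap):
    """Build a Kubernetes CronJob manifest that runs `dbt run` on a schedule.

    All arguments are placeholders chosen by the caller; nothing here
    is taken from the talk's actual deployment.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Do not start a new run while the previous one is still going.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    # Retry a failed run a limited number of times.
                    "backoffLimit": 2,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": "dbt",
                                    "image": image,
                                    "command": [
                                        "dbt", "run",
                                        "--profiles-dir", "/etc/dbt",
                                    ],
                                    "volumeMounts": [
                                        {"name": "dbt-profiles", "mountPath": "/etc/dbt"}
                                    ],
                                }
                            ],
                            # profiles.yml is mounted from a ConfigMap.
                            "volumes": [
                                {
                                    "name": "dbt-profiles",
                                    "configMap": {"name": profiles_configmap},
                                }
                            ],
                        }
                    },
                }
            },
        },
    }


manifest = dbt_cronjob_manifest(
    name="smartpizza-dbt-run",                      # hypothetical
    image="registry.example.com/smartpizza/dbt:1",  # hypothetical
    schedule="0 * * * *",                           # top of every hour
    profiles_configmap="dbt-profiles",              # hypothetical
)
print(json.dumps(manifest, indent=2))
```

Saved as a script, the output could be piped into `kubectl apply -f -`. In a real setup the warehouse credentials would more likely come from a Secret than from the ConfigMap.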
  18. Key Takeaways and Tips

    • Data pipelines are more and more complex
      ◦ More challenges (SmartPizza)
        ▪ Reliability
        ▪ Scalability
        ▪ Maintainability
    • dbt + Kubernetes
      ◦ Embrace Modular Architecture
      ◦ Implement Robust Data and Code Testing
      ◦ Self-healing and replication for fault tolerance
      ◦ Automate Deployment Processes
      ◦ Utilize Auto-Scaling Features
    • More
      ◦ Orchestrating dbt Workflows
      ◦ Enhance Observability
      ◦ Secure Data and Applications
      ◦ Promote Collaboration and Version Control