Designing Functional Data Pipelines for Reproducibility and Maintainability

Event: EuroPython 2021
Date: 29 July 2021
Location: Online

When building data pipelines at scale, it is crucial that they are reliable, scalable and extensible as business needs evolve. Designing data pipelines for reproducibility and maintainability is a challenge, as testing and debugging across compute units (threads/cores/compute nodes) are often complex and time-consuming due to dependencies and shared state at runtime. In this talk, I will share common challenges in designing reproducible and maintainable data pipelines at scale, and explore the use of functional programming in Python to build scalable, production-ready data pipelines designed for reproducibility and maintainability. Through analogies and realistic examples inspired by data pipeline designs in production environments, you will learn about:

1. What is Functional Programming, and how it differs from other programming paradigms
2. Key Principles of Functional Programming
3. How "control flow" is implemented in Functional Programming
4. Functional design patterns for data pipeline design in Python, and how they improve reproducibility and maintainability
5. Whether it is possible to write a purely functional program in Python

This talk assumes a basic understanding of building data pipelines with functions and classes/objects. While the main target audience is data scientists/engineers and developers building data-intensive applications, anyone with hands-on experience in imperative programming (including Python) should be able to understand the key concepts and use cases in functional programming.

Ong Chin Hwee

July 29, 2021

Transcript

  1. Designing Functional Data Pipelines for Reproducibility and Maintainability
    By: Chin Hwee Ong (@ongchinhwee)
    29 July 2021
  2. About me
    Ong Chin Hwee 王敬惠
    • Data Engineer @ DT One
    • Aerospace Engineering + Computational Modelling
    • Speaker and (occasional) writer on data processing
    Slides link: bit.ly/ep2021-design-fp-data
  3. Basic Design Pattern: Data Pipeline
  4. Designing a Data Pipeline at Scale
    • Reliable
      ◦ Data pipeline must produce the desired output → Reproducibility
    • Scalable
      ◦ Data pipeline must run independently across multiple nodes → Parallelism
    • Extensible
      ◦ Able to extend data pipeline with changing business logic → Maintainability
  5. Challenges in Designing Data Pipelines at Scale
    • Reproducibility during Testing
  7. Challenges in Designing Data Pipelines at Scale
    • Reproducibility during Testing
      ◦ Dependencies in data pipeline design
        ▪ Data source
        ▪ Computation logic
  8. Challenges in Designing Data Pipelines at Scale
    • Reproducibility during Testing
      ◦ Challenge: Given the same data source, how do we ensure that we replicate the same result every time we re-run the same process?
  9. Challenges in Designing Data Pipelines at Scale
    • Reproducibility in Production
  12. Challenges in Designing Data Pipelines at Scale
    • Reproducibility in Production
      ◦ Debugging parallel/concurrent code at runtime due to shared states
        ▪ E.g. What is the current state of the data source?
  13. Challenges in Designing Data Pipelines at Scale
    • Reproducibility in Production
      ◦ Challenge: How do we design data pipelines that run the same computation logic across multiple nodes and reproduce predictable results every time?
  14. Challenges in Designing Data Pipelines at Scale
    • Maintainability during Debugging
      ◦ “Works in testing, breaks in production” 😔
        ▪ Edge cases and inefficiencies not detected in test cases, causing performance issues and/or failures in production
        ▪ Complexities in debugging and logging for parallelism
  15. Challenges in Designing Data Pipelines at Scale
    • Maintainability during Debugging
      ◦ Challenge: How do we design data pipelines that are readable and maintainable at their core, to reduce inefficiencies in production debugging at scale?
  16. Challenges in Designing Data Pipelines at Scale
    • Maintainability when Adding New Features
      ◦ Adding new features to an evolving (growing) codebase
        ▪ Code reasoning becomes more challenging with increasing code complexity
        ▪ Risk of introducing unintended behaviour due to dependencies
  17. Challenges in Designing Data Pipelines at Scale
    • Maintainability when Adding New Features
      ◦ Challenge: How do we design data pipelines that adapt well to changing business and technical requirements and ensure developer productivity?
  18. Data Pipelines as Functions
  19. What is Functional Programming?
    • Declarative style of programming that emphasizes writing software using only:
      ◦ Pure functions; and
      ◦ Immutable values.
  20. 3 Key Principles of Functional Programming
    • Pure functions and avoiding side effects
    • Immutability
    • Referential transparency
  21. Pure Functions and Avoiding Side Effects
  22. The concept of a “pure function”
    • Pure function
      ◦ Output depends only on its input parameters and its internal algorithm
      ◦ No side effects
    ⇒ same function f, same input parameter x → same result y regardless of number of invocations
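
A minimal sketch of the distinction, assuming a hypothetical currency conversion step (the function names and rate values are illustrative and not taken from the talk). The impure version reads module-level state, so the same input can give different outputs; the pure version depends only on its arguments.

```python
# Hypothetical example: converting an amount with an exchange rate.
exchange_rate = 1.5  # module-level state (illustrative value)


def convert_impure(amount: float) -> float:
    """Impure: the result depends on a global that can change between calls."""
    return amount * exchange_rate


def convert_pure(amount: float, rate: float) -> float:
    """Pure: the result depends only on the arguments."""
    return amount * rate


# Same input, different results once the global changes.
first = convert_impure(100.0)
exchange_rate = 2.0
second = convert_impure(100.0)

# Same input, same result, however many times it is called.
pure_results = [convert_pure(100.0, 1.5) for _ in range(3)]
```

Passing the rate in as a parameter makes the dependency explicit, which is what lets a test pin it to a known value.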
  23. Pure Function: Making Pizza
    160°C, 10 mins
    PUT THEM TOGETHER
  24. “Impure” Function: Making Pizza with Side Effects
    160°C, 10 mins
    PUT THEM TOGETHER
    Side Effect: Radiation Heat
  25. “Impure” Function: Making Pizza with Side Effects
    180°C, 10 mins
    PUT THEM TOGETHER
    Side Effect: Oven Overheat, Burnt Pizza! 😖
  26. What is a side effect?
    • A function with side effects changes state outside the local function scope
      ◦ Examples:
        ▪ modifying a variable or data structure in place
        ▪ modifying a global state
        ▪ performing any I/O operation
        ▪ throwing an exception with an error
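
The first example in the list above (modifying a data structure in place) could look like the following sketch. The record layout and function names are assumptions made for illustration only.

```python
def add_flag_in_place(records: list[dict]) -> None:
    """Side effect: mutates the caller's dictionaries in place."""
    for record in records:
        record["valid"] = record["amount"] > 0


def add_flag(records: list[dict]) -> list[dict]:
    """No side effect: builds new dictionaries and leaves the input untouched."""
    return [{**record, "valid": record["amount"] > 0} for record in records]


raw = [{"amount": 5}, {"amount": -1}]
flagged = add_flag(raw)       # raw is unchanged after this call
snapshot = [dict(r) for r in raw]
add_flag_in_place(raw)        # raw now carries the extra "valid" key
```

With `add_flag`, any other step that still holds a reference to `raw` keeps seeing the data it started with.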
  27. The concept of Immutability
  28. The concept of Immutability
    • Immutability of an assigned variable
      ◦ Once a value is assigned to a variable, the state of the variable cannot be changed.
    ⇒ Disciplined state management
    ⇒ Prevents side effects resulting from state change → “pure function”
  29. The concept of Immutability: Key Implication
    • Key implication: Ease of writing parallel/concurrent programs
  31. The concept of Referential Transparency
    A function is referentially transparent when an expression can be substituted by its equivalent algorithm without affecting the program logic for all programs
  32. Conditions for Referential Transparency
    • Pure function
    • Deterministic
      ◦ Expression always returns the same output given the same input
  34. Conditions for Referential Transparency
    • Pure function
    • Deterministic
      ◦ Expression returns the same output given the same input
    • Idempotent
      ◦ Expression can be applied multiple times without changing the result beyond its initial application
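
To make "deterministic" and "idempotent" concrete, here is a small sketch with two hypothetical cleaning functions (the names are illustrative). Both are deterministic, but only the first is idempotent.

```python
def normalise_code(code: str) -> str:
    """Deterministic and idempotent: trims whitespace and upper-cases."""
    return code.strip().upper()


def append_suffix(code: str) -> str:
    """Deterministic but not idempotent: every application changes the value."""
    return code + "_v2"


once = normalise_code(" sg ")
twice = normalise_code(normalise_code(" sg "))   # same as applying it once

suffix_once = append_suffix("sg")
suffix_twice = append_suffix(append_suffix("sg"))  # keeps growing
```

An idempotent step can be re-run safely after a partial failure, since re-applying it to data that is already clean changes nothing.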
  36. Equational Reasoning
    • A key consequence of referential transparency
      ◦ Expression can be replaced with its equivalent result
  37. Functional Control Flow
  38. Function Composition
  39. Functions are Values
    • In Python, functions are first-class objects.
    • A function can be:
      ◦ assigned to a variable
      ◦ passed as a parameter to other functions
      ◦ returned as a value from other functions
  40. Higher-order Functions
    • Key consequence of first-class functions
    • A higher-order function has at least one of these properties:
      ◦ Accepts functions as parameters
      ◦ Returns a function as a value
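
As a sketch of both properties at once, the hypothetical `compose` helper below accepts functions as parameters and returns a new function. It is not part of the standard library; it is built here from `functools.reduce`.

```python
from functools import reduce
from typing import Callable


def compose(*functions: Callable) -> Callable:
    """Return a function that applies `functions` from left to right."""
    def pipeline(value):
        return reduce(lambda acc, fn: fn(acc), functions, value)
    return pipeline


# Functions are values: they can be assigned to variables and passed around.
strip = str.strip
to_upper = str.upper

clean = compose(strip, to_upper)
result = clean("  sgd ")
```

Each step stays a small, separately testable function, and the pipeline itself is just another value.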
  41. Anonymous Functions
    • Also known as “lambda expressions” in Python
    • Using a function as input without defining a named function object
  42. map
  43. filter
  44. reduce
  45. map/filter/reduce vs for-loops
    Managing state changes of mutable variables in a for-loop
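
The slide images for map, filter and reduce are not part of this transcript, so the following is only an illustrative comparison with made-up data. The for-loop version updates a mutable accumulator on every iteration; the functional version expresses the same computation as a chain of transformations.

```python
from functools import reduce

amounts = [120, -5, 40, 0, 15]

# For-loop: `total_loop` is mutable state that changes on each iteration.
total_loop = 0
for amount in amounts:
    if amount > 0:
        total_loop += amount * 2

# map/filter/reduce: no variable is reassigned along the way.
positive = filter(lambda a: a > 0, amounts)
doubled = map(lambda a: a * 2, positive)
total_functional = reduce(lambda acc, a: acc + a, doubled, 0)
```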
  46. Recursion as a form of “functional iteration”
    • Recursion is a form of self-referential function composition
      ◦ Takes the results of itself as inputs into another instance of itself
      ◦ To prevent infinite recursive loop, base case required as terminating condition
  47. Recursion as a form of “functional iteration”
  48. Recursion as a form of “functional iteration”
    • Tail-call optimization
      ◦ Objective: reduce stack frame consumption in call stack
      ◦ Tail call: does nothing other than returning the value of function call
      ◦ Identify tail calls and compile them to iterative loops
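
A short sketch of the two shapes of recursion described above, using a hypothetical summing function. In the first version the addition happens after the recursive call returns; in the second the recursive call is the last operation, which is the form a tail-call-optimizing compiler can turn into a loop.

```python
def total(values: tuple) -> int:
    """Plain recursion: work remains to be done after the recursive call."""
    if not values:  # base case terminates the recursion
        return 0
    return values[0] + total(values[1:])


def total_tail(values: tuple, acc: int = 0) -> int:
    """Tail-recursive form: the running result is carried in `acc`."""
    if not values:  # base case returns the accumulated result
        return acc
    return total_tail(values[1:], acc + values[0])
```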
  50. Functional Design Patterns for Data Pipeline Design (in Python)
  51. Built-in Higher-order Functions
    • map/filter vs list comprehensions
  52. Built-in Higher-order Functions
    • map/filter vs list comprehensions
      ◦ List comprehensions are syntactic sugar for map/filter operations on a data collection (list)
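
For example, with some made-up sensor readings, the two spellings below produce the same list:

```python
readings = [3, 8, 12, 5, 20]

# map/filter: keep readings above 5, then scale them.
via_map_filter = list(map(lambda r: r * 10, filter(lambda r: r > 5, readings)))

# The equivalent list comprehension.
via_comprehension = [r * 10 for r in readings if r > 5]
```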
  53. Built-in Higher-order Functions
    • Using map/filter in data transformations
  54. Built-in Higher-order Functions
    • Benefits of using map/filter in data transformations
      ◦ Keeping data and transformation logic separate
        ▪ Improved code reusability with better transparency of transformation logic
  55. Extending map/filter to parallel/concurrent programming
  56. Extending map/filter to parallel/concurrent programming
    • Using multiprocessing.Pool or concurrent.futures
      ◦ Generate iterator using map, then filter results to a collection (list)
    More details on parallel processing and concurrent.futures: My EuroPython 2020 talk "Speed Up Your Data Processing"
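
A minimal sketch of that pattern with `concurrent.futures`, assuming a hypothetical `parse_amount` step. A thread pool is used here to keep the example self-contained; `ProcessPoolExecutor` exposes the same `map` method for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_amount(raw: str) -> float:
    """Pure transformation applied to each element independently."""
    return float(raw.strip())


def run_parallel(raw_values: list[str], max_workers: int = 4) -> list[float]:
    """Map over the inputs in a worker pool, then filter into a list."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        parsed = executor.map(parse_amount, raw_values)  # yields in input order
        return list(filter(lambda value: value > 0, parsed))
```

Because `parse_amount` shares no state between calls, the work can be split across workers without locks.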
  57. Immutable Data Structures
    • Once an immutable data structure is created, it cannot be changed
    • Benefits:
      ◦ Easier to reason - “what you see is what you get”
      ◦ Easier to test - worry about the logic, not the state
      ◦ Thread-safe - easier for parallelism
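
The built-in tuple is the simplest immutable collection in Python. The sketch below, with illustrative values, contrasts it with a list: an "update" to a tuple means building a new tuple rather than changing the existing one.

```python
prices_list = [10, 20, 30]
prices_tuple = (10, 20, 30)

prices_list[0] = 99  # allowed: lists are mutable

try:
    prices_tuple[0] = 99  # not allowed: tuples do not support item assignment
    mutated = True
except TypeError:
    mutated = False

# Build a new tuple instead; the original is left as it was.
updated_prices = (99,) + prices_tuple[1:]
```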
  58. Tuple vs List
  60. Namedtuple vs Class vs Dictionary
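
The slide content for this comparison is not in the transcript, so the following is only a sketch of how a named tuple behaves, using a hypothetical `Transaction` record. Fields are accessed by name as with a class, but cannot be reassigned; `_replace` returns a modified copy.

```python
from typing import NamedTuple


class Transaction(NamedTuple):
    """Hypothetical immutable record with named fields."""
    transaction_id: str
    amount: float
    currency: str


txn = Transaction("t-001", 12.5, "EUR")

try:
    txn.amount = 20.0  # named tuple fields are read-only
    assigned = True
except AttributeError:
    assigned = False

corrected = txn._replace(amount=20.0)  # new object; `txn` is unchanged
as_dict = corrected._asdict()          # convert to a dictionary when needed
```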

  64. Structural Pattern Matching (PEP 634)
    • Python 3.10 feature inspired by similar syntax in Scala
    • Especially useful for conditional matching of data structure patterns
    match Item:
        case Something:
            do_something()
  65. Structural Pattern Matching (PEP 634)
    • match/case expressions vs if/elif/else
  66. Structural Pattern Matching (PEP 634)
    • Pattern matching for maintainability of data schema
    Note: Example based on case classes and pattern matching syntax in Scala
    Dataclasses used as the Python equivalent of Scala case classes
  67. Recursion in Python
    • Tail-call optimization not supported in Python
      ◦ Optimization has to be implemented manually
    • Recursion limit of 1000 (by default) as a prevention mechanism against call stack overflow in CPython implementation
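
One way to do the manual optimization is to rewrite the tail-recursive function as a loop. The sketch below uses hypothetical summing functions and sizes the input from `sys.getrecursionlimit()` so that the recursive version exceeds the limit whatever it is set to.

```python
import sys


def total_recursive(values: tuple, acc: int = 0) -> int:
    """Tail-recursive in form, but CPython still adds a stack frame per call."""
    if not values:
        return acc
    return total_recursive(values[1:], acc + values[0])


def total_iterative(values: tuple) -> int:
    """Manual rewrite of the tail call as a loop: constant stack depth."""
    acc = 0
    for value in values:
        acc += value
    return acc


# An input deeper than the interpreter's recursion limit.
big = tuple(range(sys.getrecursionlimit() + 100))

try:
    total_recursive(big)
    hit_limit = False
except RecursionError:
    hit_limit = True

iterative_result = total_iterative(big)
```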
  68. Type Systems
    • Python has support for type hints (though not enforced at runtime)
  69. Type Systems
    • Type checking with mypy
  70. Type Systems
    • Type checking with mypy
    • Preventing bugs at runtime by ensuring type safety and consistency across the data pipeline
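
A small sketch of type-annotated pipeline steps, with a hypothetical `Reading` record and parser. Python itself ignores the annotations when running the code; a static checker such as mypy reads them before the pipeline runs.

```python
from typing import NamedTuple, Optional


class Reading(NamedTuple):
    sensor_id: str
    celsius: float


def parse_reading(line: str) -> Optional[Reading]:
    """Parse 'sensor_id,celsius'; return None for malformed lines."""
    parts = line.split(",")
    if len(parts) != 2:
        return None
    try:
        return Reading(parts[0].strip(), float(parts[1]))
    except ValueError:
        return None


def to_fahrenheit(reading: Reading) -> float:
    """Expects a Reading, not an Optional[Reading]."""
    return reading.celsius * 9 / 5 + 32
```

Because `parse_reading` is annotated as returning `Optional[Reading]`, mypy would report a call like `to_fahrenheit(parse_reading(line))` as an incompatible argument type, which prompts handling the `None` case explicitly.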
  71. Can we write a purely functional data pipeline in Python?
  72. Can we write a purely functional data pipeline in Python?
    Short Answer: Not really.
  73. “Functional Core, Imperative Shell”
    • I/O operations still needed for reading and writing data outside of the application domain
    • Keeping core domain logic and infrastructure code separate
    Ref: Gary Bernhardt's PyCon 2013 talk on "Boundaries"
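
A minimal sketch of that separation, assuming a hypothetical CSV input with `currency` and `amount` columns. The core function is a plain data-in, data-out transformation; all reading and writing happens in the shell.

```python
import csv
from typing import Iterable, TextIO


def total_by_currency(rows: Iterable[dict]) -> dict:
    """Functional core: no I/O, output depends only on `rows`."""
    totals: dict = {}  # local accumulator, never visible outside this call
    for row in rows:
        currency = row["currency"]
        totals[currency] = totals.get(currency, 0.0) + float(row["amount"])
    return totals


def run(source: TextIO, sink: TextIO) -> None:
    """Imperative shell: reads input, calls the core, writes output."""
    rows = list(csv.DictReader(source))
    totals = total_by_currency(rows)
    for currency, total in sorted(totals.items()):
        sink.write(f"{currency},{total}\n")
```

The core can be tested with plain dictionaries, and the shell with in-memory streams such as `io.StringIO`, without touching real files.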
  74. “Functional Core, Imperative Shell”
  75. Key Takeaways
    • Adopt functional design patterns when designing data pipelines at scale (parallel and distributed workflows)
      ◦ Reproducible
      ◦ Scalable
      ◦ Maintainable
    • “Functional Core, Imperative Shell” to manage side effects separately from data pipeline logic
  76. Reach out to me!
    ongchinhwee
    @ongchinhwee
    hweecat
    https://ongchinhwee.me
    And check out my ongoing series on Functional Programming at: https://ongchinhwee.me/tag/functional-programming