PyData Global 2021 - Designing Functional Data Pipelines for Reproducibility and Maintainability

Designing Functional Data Pipelines for Reproducibility and Maintainability
Event: PyData Global 2021
Date: 29 October 2021
Location: Online

When building data pipelines at scale, it is crucial to design pipelines that are reliable, scalable and extensible as business needs evolve. Designing data pipelines for reproducibility and maintainability is a challenge, as testing and debugging across compute units (threads, cores, compute nodes) are often complex and time-consuming due to dependencies and shared state at runtime. In this talk, I will share common challenges in designing reproducible and maintainable data pipelines at scale, and explore the use of functional programming in Python and Apache Spark to build scalable, production-ready data pipelines designed for reproducibility and maintainability. Through analogies and realistic examples inspired by data pipeline designs in production environments, you will learn about:

1. What is Functional Programming, and how it differs from other programming paradigms
2. Key Principles of Functional Programming
3. How "control flow" is implemented in Functional Programming
4. Functional design patterns for data pipeline design in Python and Apache Spark, and how they improve reproducibility and maintainability
5. Whether it is possible to write a purely functional program

This talk assumes a basic understanding of building data pipelines with functions and classes/objects. While the main target audience is data scientists/engineers and developers building data-intensive applications, anyone with hands-on experience in imperative programming (including Python) should be able to understand the key concepts and use cases in functional programming.

Ong Chin Hwee

October 29, 2021

Transcript

  1. Designing Functional Data Pipelines
    for Reproducibility and Maintainability
    By: Chin Hwee Ong (@ongchinhwee)
    29 October 2021

  2. About me
    Ong Chin Hwee 王敬惠
    ● Data Engineer @ DT One
    ● Aerospace Engineering +
    Computational Modelling
    ● Speaker and (occasional) writer on
    data processing
    @ongchinhwee
    Slides link:
    bit.ly/pg2021-design-fp-data

  3. Basic Design Pattern: Data Pipeline

  4. Designing a Data Pipeline at Scale
    ● Reliable
    ○ Data pipeline must produce the desired output → Reproducibility
    ● Scalable
    ○ Data pipeline must run independently across multiple nodes →
    Parallelism
    ● Extensible
    ○ Able to extend data pipeline with changing business logic →
    Maintainability

  5. Challenges in Designing Data Pipelines at Scale
    ● Reproducibility during Testing

  6. Challenges in Designing Data Pipelines at Scale
    ● Reproducibility during Testing

  7. Challenges in Designing Data Pipelines at Scale
    ● Reproducibility during Testing
    ○ Dependencies in data pipeline design
    ■ Data source
    ■ Computation logic

  8. Challenges in Designing Data Pipelines at Scale
    ● Reproducibility during Testing
    ○ Challenge: Given the same data source, how do we ensure
    that we replicate the same result every time we re-run the
    same process?

  9. Challenges in Designing Data Pipelines at Scale
    ● Reproducibility in Production

  10. Challenges in Designing Data Pipelines at Scale
    ● Reproducibility in Production

  11. Challenges in Designing Data Pipelines at Scale
    ● Reproducibility in Production

  12. Challenges in Designing Data Pipelines at Scale
    ● Reproducibility in Production
    ○ Debugging parallel/concurrent code at runtime due to
    shared states
    ■ E.g. What is the current state of the data source?

  13. Challenges in Designing Data Pipelines at Scale
    ● Reproducibility in Production
    ○ Challenge: How do we design data pipelines that run the
    same computation logic across multiple nodes and
    reproduce predictable results every time?

  14. Challenges in Designing Data Pipelines at Scale
    ● Maintainability during Debugging
    ○ “Works in testing, breaks in production” 😔
    ■ Edge cases and inefficiencies not detected in test cases
    causing performance issues and/or failures in production
    ■ Complexities in debugging and logging for parallelism

  15. Challenges in Designing Data Pipelines at Scale
    ● Maintainability during Debugging
    ○ Challenge: How do we design data pipelines that are
    readable and maintainable at their core to reduce
    inefficiencies in production debugging at scale?

  16. Challenges in Designing Data Pipelines at Scale
    ● Maintainability when Adding New Features
    ○ Adding new features to an evolving (growing) codebase
    ■ Code reasoning becomes more challenging with increasing
    code complexity
    ■ Risk of introducing unintended behaviour due to
    dependencies

  17. Challenges in Designing Data Pipelines at Scale
    ● Maintainability when Adding New Features
    ○ Challenge: How do we design data pipelines that adapt
    well to changing business and technical requirements and
    ensure developer productivity?

  18. Data Pipelines as Functions

  19. What is Functional Programming?
    ● Declarative style of programming that emphasizes writing
    software using only:
    ○ Pure functions; and
    ○ Immutable values.

  20. 3 Key Principles of Functional Programming
    ● Pure functions and avoid side effects
    ● Immutability
    ● Referential transparency

  21. Pure Function and Avoid Side Effects

  22. The concept of a “pure function”
    ● Pure function
    ○ Output depends only on its input parameters and its
    internal algorithm
    ○ No side effects
    ⇒ same function f, same input parameter x → same result y
    regardless of number of invocations
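
As a minimal sketch of the definition above (the function names and the module-level counter are hypothetical, chosen only for illustration), compare a function whose output depends only on its arguments with one that reads and mutates outside state:

```python
def scale(values: tuple, factor: int) -> tuple:
    """Pure: the result depends only on the arguments; nothing else is touched."""
    return tuple(v * factor for v in values)


_calls = 0  # hypothetical module-level state, used only to show impurity


def scale_and_count(values: tuple, factor: int) -> tuple:
    """Impure: the result also depends on, and changes, state outside the function."""
    global _calls
    _calls += 1
    return tuple(v * factor * _calls for v in values)
```

Calling `scale((1, 2), 3)` any number of times gives the same tuple, while `scale_and_count` returns a different result on each invocation even though its arguments are unchanged.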

  23. Pure Function: Making Pizza
    160°C, 10 mins
    PUT THEM TOGETHER

  24. “Impure” Function: Making Pizza with Side Effects
    160°C, 10 mins
    PUT THEM TOGETHER
    Side Effect:
    Radiation Heat

  25. “Impure” Function: Making Pizza with Side Effects
    180°C, 10 mins
    PUT THEM TOGETHER
    Side Effect:
    Oven Overheat,
    Burnt Pizza! 😖

  26. What is a side effect?
    ● A function with side effects changes state outside the
    local function scope
    ○ Examples:
    ■ modifying a variable or data structure in place
    ■ modifying a global state
    ■ performing any I/O operation
    ■ throwing an exception with an error
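
To make the first example in the list concrete, here is a small sketch (the record layout and function names are assumptions for illustration) contrasting in-place modification with returning a new value:

```python
def normalise_in_place(rows: list) -> None:
    """Side effect: mutates the caller's dictionaries in place."""
    for row in rows:
        row["name"] = row["name"].strip().lower()


def normalise(rows: list) -> list:
    """No side effect: builds new dictionaries and leaves the input untouched."""
    return [{**row, "name": row["name"].strip().lower()} for row in rows]


raw = [{"name": "  Ada "}]
clean = normalise(raw)  # raw is unchanged after this call
```

After `normalise(raw)` the original rows still hold their untrimmed names, so any other code sharing `raw` sees no surprise. `normalise_in_place(raw)` would change them for every holder of the list.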

  27. The concept of Immutability

  28. The concept of Immutability
    ● Immutability of an assigned variable
    ○ Once a value is assigned to a variable, the state of the
    variable cannot be changed.
    ⇒ Disciplined state management
    ⇒ Prevents side effect resulting from state change → “pure
    function”
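
Python does not enforce immutable variable bindings, so the idea is usually approximated with immutable values. One possible sketch, assuming a hypothetical `Reading` record, uses a frozen dataclass from the standard library:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Reading:
    """Hypothetical record type; frozen=True rejects attribute assignment."""
    sensor_id: str
    celsius: float


original = Reading(sensor_id="s1", celsius=21.5)

try:
    original.celsius = 30.0  # attempted state change
    mutated = True
except FrozenInstanceError:
    mutated = False

# "Changing" a value means creating a new one; the original stays as it was.
updated = replace(original, celsius=30.0)
```

Because `original` can never change, a function that receives it cannot cause a side effect through it, which is the link to pure functions drawn on the slide.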

  29. The concept of Immutability: Key Implication
    ● Key implication: Ease of writing parallel/concurrent programs

  30. @ongchinhwee

  31. The concept of Referential Transparency
    An expression is referentially transparent when it can be
    substituted by its equivalent result without affecting the
    program logic, for all programs

  32. Conditions for Referential Transparency
    ● Pure function
    ● Deterministic
    ○ Expression always returns the same output given the same input

  33. @ongchinhwee

  34. Conditions for Referential Transparency
    ● Pure function
    ● Deterministic
    ○ Expression returns the same output given the same input
    ● Idempotent
    ○ Expression can be applied multiple times without changing the
    result beyond its initial application
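
The difference between "deterministic" and "idempotent" can be sketched with three small functions (all names are hypothetical and serve only as illustration):

```python
import random


def jitter(x: float) -> float:
    """Not deterministic: the output depends on hidden random state."""
    return x + random.random()


def increment(x: int) -> int:
    """Deterministic, but not idempotent: applying it twice changes the result."""
    return x + 1


def dedupe_sorted(values) -> tuple:
    """Deterministic and idempotent: re-applying it to its own output is a no-op."""
    return tuple(sorted(set(values)))


once = dedupe_sorted([3, 1, 3, 2])
twice = dedupe_sorted(once)  # same as `once`
```

In a data pipeline, a step like `dedupe_sorted` can be safely retried or re-run on its own output, whereas a step like `increment` cannot.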

  35. @ongchinhwee

  36. Equational Reasoning
    ● A key consequence of referential transparency
    ○ Expression can be replaced with its equivalent result
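
A tiny worked example of the substitution, assuming a hypothetical pure function `square`:

```python
def square(x: int) -> int:
    """Pure function, so any call can be replaced by its result."""
    return x * x


total = square(3) + square(3)

# Substituting each call with its equivalent result leaves the program unchanged.
substituted = 9 + 9
```

The same substitution would not be safe for a function that logs, mutates state, or reads a clock, because replacing the call would remove or alter that effect.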

  37. Functional Control Flow

  38. Function Composition

  39. Functions are Values
    ● In Python, functions are first-class objects.
    ● A function can be:
    ○ assigned to a variable
    ○ passed as a parameter to other functions
    ○ returned as a value from other functions
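
The three bullet points above can be sketched as follows (function names are hypothetical):

```python
def to_upper(text: str) -> str:
    return text.upper()


# Assigned to a variable
shout = to_upper


# Passed as a parameter to another function
def apply(func, value):
    return func(value)


# Returned as a value from another function
def make_suffixer(suffix: str):
    def add_suffix(text: str) -> str:
        return text + suffix
    return add_suffix


exclaim = make_suffixer("!")
```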

  40. Higher-order Functions
    ● Key consequence of first-class functions
    ● A higher-order function has at least one of these
    properties:
    ○ Accepts functions as parameters
    ○ Returns a function as a value
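
One common higher-order function is a composition helper, which both accepts functions and returns one. The `compose` helper below is a hypothetical sketch, not a standard-library function:

```python
from functools import reduce


def compose(*funcs):
    """Return a function that applies `funcs` to a value from left to right."""
    def pipeline(value):
        return reduce(lambda acc, func: func(acc), funcs, value)
    return pipeline


# A small cleaning pipeline built from existing string functions.
clean_column_name = compose(
    str.strip,
    str.lower,
    lambda s: s.replace(" ", "_"),
)
```

With this sketch, `clean_column_name("  Order ID ")` evaluates to `"order_id"`, and each step can be tested on its own.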

  41. Anonymous Functions
    ● Also known as “lambda expressions” in Python
    ● Using a function as input without defining a named function object

  42. map

  43. filter

  44. reduce

  45. map/filter/reduce vs for-loops
    Managing state changes of
    mutable variables in a for-loop
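
The slide's code is not captured in this transcript. As a stand-in sketch with made-up order amounts, the same computation can be written with a for-loop that mutates an accumulator and with `filter`, `map` and `reduce`:

```python
from functools import reduce

orders = (12, 250, 8, 99)  # hypothetical order amounts in whole dollars

# Imperative: `total_loop` is mutable state that changes on every iteration.
total_loop = 0
for amount in orders:
    if amount >= 10:
        total_loop += amount * 100  # convert to cents

# Functional: each stage is an expression with no variable being reassigned.
large_orders = filter(lambda amount: amount >= 10, orders)
in_cents = map(lambda amount: amount * 100, large_orders)
total_functional = reduce(lambda acc, cents: acc + cents, in_cents, 0)
```

Both versions produce the same total. The functional version has no intermediate state to track when reasoning about or testing each stage.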

  46. Recursion as a form of “functional iteration”
    ● Recursion is a form of self-referential function composition
    ○ Takes the results of itself as inputs into another instance of
    itself
    ○ To prevent infinite recursive loop, base case required as
    terminating condition
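
A minimal sketch of recursion with a base case, assuming a hypothetical `total` function over a tuple of integers:

```python
def total(values: tuple) -> int:
    """Sum a tuple recursively instead of with a loop."""
    if not values:  # base case: terminates the recursion
        return 0
    head, *tail = values
    # The result of the recursive call feeds into this call's result.
    return head + total(tuple(tail))
```

For example, `total((1, 2, 3, 4))` unfolds to `1 + (2 + (3 + (4 + 0)))`.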

  47. Recursion as a form of “functional iteration”

  48. Recursion as a form of “functional iteration”
    ● Tail-call optimization
    ○ Objective: reduce stack frame consumption in call stack
    ○ Tail call: does nothing other than returning the value of
    function call
    ○ Identify tail calls and compile them to iterative loops
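
CPython does not perform tail-call optimization, so a tail-recursive function in Python still consumes one stack frame per call and is bounded by the interpreter's recursion limit. The sketch below (hypothetical function names) shows a tail-recursive sum and the iterative loop that a tail-call-optimizing compiler would effectively turn it into:

```python
def total_tail(values: tuple, acc: int = 0) -> int:
    """Tail-recursive: the recursive call is the last thing the function does."""
    if not values:
        return acc
    return total_tail(values[1:], acc + values[0])


def total_loop(values: tuple, acc: int = 0) -> int:
    """Hand-written loop equivalent: parameters become loop variables."""
    while values:
        values, acc = values[1:], acc + values[0]
    return acc
```

In Python the loop form is the safer choice for long inputs, since `total_tail` would raise `RecursionError` once the input is longer than the recursion limit allows.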

  49. Recursion as a form of “functional iteration”

  50. Functional Design Patterns for
    Data Pipeline Design

  51. Immutable Data Structures
    ● Once an immutable data structure is created, it cannot be
    changed
    ● Benefits:
    ○ Easier to reason about - “what you see is what you get”
    ○ Easier to test - worry about the logic, not the state
    ○ Thread-safe - easier for parallelism
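
As an illustrative sketch (the `Order` type and its fields are assumptions, not taken from the slides), a `typing.NamedTuple` gives an immutable record with named fields:

```python
from typing import NamedTuple


class Order(NamedTuple):
    """Hypothetical immutable record."""
    order_id: int
    amount: float


order = Order(order_id=1, amount=20.0)

try:
    order.amount = 25.0  # named tuples reject attribute assignment
    changed = True
except AttributeError:
    changed = False

# _replace returns a new Order and leaves the original untouched.
corrected = order._replace(amount=25.0)
```

Because `order` cannot change after creation, it can be shared between threads or pipeline stages without defensive copying.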

  52. Tuple vs List

  53. Tuple vs List

  54. Namedtuple vs Class vs Dictionary

  55. Namedtuple vs Class vs Dictionary

  56. Namedtuple vs Class vs Dictionary

  57. Namedtuple vs Class vs Dictionary

  58. Data Transformations
    ● map/filter in data transformations

  59. Data Transformations
    ● map/filter (and its derivatives) in data transformations
    ○ Keeping data and transformation logic separate
    ■ Improved code reusability with better transparency
    of transformation logic
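
Keeping the transformation as a standalone pure function means the same logic can be reused with the built-in `map` or handed to a worker pool unchanged. A sketch, assuming a hypothetical temperature conversion and using the standard library's `ThreadPoolExecutor`:

```python
from concurrent.futures import ThreadPoolExecutor


def to_celsius(fahrenheit: float) -> float:
    """Transformation logic, kept separate from the data it is applied to."""
    return (fahrenheit - 32) * 5 / 9


readings_f = (32.0, 212.0, 50.0)  # hypothetical input data

# Sequential application with the built-in map.
sequential = tuple(map(to_celsius, readings_f))

# Same function, applied concurrently; Executor.map preserves input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = tuple(pool.map(to_celsius, readings_f))
```

Since `to_celsius` has no shared state, the threaded run needs no locks and gives the same result as the sequential one.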

  60. Extending map/filter to parallel/concurrent programming

  61. Data Actions / Aggregations
    ● reduce in data actions / aggregations

  62. Data Actions / Aggregations
    ● reduce (and its derivatives) in data actions / aggregations
    ○ Transformations first, actions last
    ■ Transformation logic can be applied to each
    element / partition
    ■ Actions / aggregations consolidate results from
    partitions
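
The "transformations first, actions last" ordering can be sketched in plain Python. This is not the Spark API; the partitions and the `transform` function are made up to mimic how per-partition work is followed by a final consolidation:

```python
from functools import reduce

# Hypothetical data already split into partitions.
partitions = ((1, 2, 3), (4, 5), (6,))


def transform(partition: tuple) -> tuple:
    """Transformation: applied to each partition independently."""
    return tuple(x * x for x in partition if x % 2 == 0)


# Transformations first: each partition is processed on its own.
transformed = tuple(map(transform, partitions))

# Actions last: consolidate per-partition results into a single value.
partial_sums = tuple(map(sum, transformed))
total = reduce(lambda a, b: a + b, partial_sums, 0)
```

Because `transform` never looks outside its own partition, the `map` step could run on separate workers, with only the final `reduce` needing all results in one place.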

  63. Functional Design Patterns in Apache Spark
    ● Resilient Distributed Datasets (RDDs)
    ○ Low-level data abstraction in Apache Spark
    ○ Immutable and read-only
    ○ Designed for fault-tolerant parallel operations with
    logical partitioning across nodes

  64. Transformations vs Actions in Apache Spark

  65. Transformations vs Actions in Apache Spark

  66. Structural Pattern Matching (PEP 634)
    ● Python 3.10 feature inspired by similar syntax in Scala
    ● Especially useful for conditional matching of data structure
    patterns
    match Item:
        case Something:
            do_something()

  67. Structural Pattern Matching (PEP 634)
    ● match/case expressions vs if/elif/else

  68. Structural Pattern Matching (PEP 634)
    ● Pattern matching for maintainability of data schema
    Note: Example based on
    case classes and pattern
    matching syntax in Scala
    Dataclasses used as the
    Python equivalent of Scala
    case classes

  69. Type Systems
    ● Python has support for type hints (though they are not enforced at
    runtime)

  70. Type Systems
    ● Type checking with mypy

  71. Type Systems
    ● Type checking with mypy
    ● Preventing bugs at runtime by ensuring type safety and
    consistency across the data pipeline
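
A short sketch of why a static checker is needed on top of type hints (the `double` function is hypothetical): the interpreter ignores annotations at runtime, so a mismatched call runs without complaint and yields an unintended value.

```python
def double(x: int) -> int:
    """Annotated to take and return an int."""
    return x * 2


ok = double(2)

# The annotation is not enforced at runtime: a str is accepted and repeated.
# A static type checker such as mypy would report this call as an
# incompatible argument type before the pipeline ever runs.
surprising = double("ab")
```

Running a type checker in CI is one way to catch this class of mismatch before it surfaces as bad data downstream.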

  72. Can we write a purely functional data
    pipeline in Python?

  73. Short Answer: Not really.
    Can we write a purely functional data
    pipeline in Python?

  74. “Functional Core, Imperative Shell”
    ● I/O operations still needed for reading
    and writing data outside of the
    application domain
    ● Keeping core domain logic and
    infrastructure code separate
    Ref: Gary Bernhardt's PyCon 2013 talk on "Boundaries"
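
One possible sketch of the pattern, with hypothetical function names and a made-up JSON-lines input format: the core is a pure function, and the shell is the only place that reads or writes.

```python
import json


def summarise(records: tuple) -> dict:
    """Functional core: pure aggregation with no I/O."""
    amounts = tuple(record["amount"] for record in records)
    return {"count": len(amounts), "total": sum(amounts)}


def run(read_text, write_text) -> None:
    """Imperative shell: all reading and writing happens here."""
    lines = read_text().splitlines()
    records = tuple(json.loads(line) for line in lines if line)
    write_text(json.dumps(summarise(records)))


# In-memory stand-ins for a real source and sink.
sink = []
run(lambda: '{"amount": 5}\n{"amount": 7}\n', sink.append)
```

The core can be unit-tested with plain tuples, while the shell is thin enough that swapping the source or sink (file, queue, database) does not touch the domain logic.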

  75. “Functional Core, Imperative Shell”

  76. Key Takeaways
    ● Adopt functional design patterns when designing data
    pipelines at scale (parallel and distributed workflows)
    ○ Reproducible
    ○ Scalable
    ○ Maintainable
    ● “Functional Core, Imperative Shell” to manage side effects
    separately from data pipeline logic

  77. Reach out to
    me!
    : ongchinhwee
    : @ongchinhwee
    : hweecat
    : https://ongchinhwee.me
    And check out my ongoing
    series on Functional
    Programming at:
    https://ongchinhwee.me/tag/functional-programming