
Creating extensible workflows with off-label use of Python


Workflow-oriented systems have many uses, including data processing and analysis, ETL, CI/CD, and more. But creating a programmatic interface to a workflow system is a delicate balancing act: we want the API to be flexible enough to support useful work, but also constrained enough that tasks run cooperatively within the larger system.

We faced this challenge when designing the task API for the Pants build system. We needed to allow custom task code to enjoy the benefits of complex features like caching, concurrency and remote execution, without every task author having to reason about them.

In this talk we'll show how we found the right balance through unconventional use of Python's type annotations, coroutines, and dataclasses. Combining these seemingly disparate features in the context of a workflow engine allows you to build elegant extensibility APIs with just the right amount of flexibility.

Benjy

June 09, 2021



Transcript

  1. Creating extensible workflows with
    off-label use of Python
    Benjy Weinberger
    Maintainer, Pants Build
    PyCon US 2021


  2. About me
    ● 25 years' experience as a
    Software Engineer.
    ● Worked at Google, Twitter, Foursquare.
    ● Maintainer of the Pants OSS project.
    ● Co-founder of Toolchain.


  3. What is a workflow?
    A sequence of tasks that processes data to produce
    a desired result.


  4. Workflows show up all over the place
    ● Processing uploaded images
    ● Building ML models
    ● Aggregating ad clicks
    ● ETL
    ● CI/CD
    And many, many more examples.


  5. Example: Processing Uploaded Images
    [Diagram: tasks Validate, Extract Metadata, Resize, Store;
    data labels Image data, Image name, DB Entry]


  6. Workflow is defined by a task graph
    A directed, acyclic graph in which the vertices are
    tasks and the edges are direct data dependencies:
    B → A if B requires A's output as one of its inputs.
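
As a rough sketch of the definition above, the image-processing tasks can be written as a dependency mapping and ordered with the standard library's `graphlib` (Python 3.9+). The edges between tasks are assumed for illustration; the slide's diagram does not survive in this transcript.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each task maps to the set of tasks whose
# output it consumes. These edges are assumptions, not taken from the talk.
task_graph = {
    "validate": set(),
    "extract_metadata": {"validate"},
    "resize": {"validate"},
    "store": {"extract_metadata", "resize"},
}

# static_order() yields every task after all of its dependencies and
# raises graphlib.CycleError if the graph is not acyclic.
execution_order = list(TopologicalSorter(task_graph).static_order())
print(execution_order)
```

Any order that respects the edges is valid, so tasks with no path between them (here, `extract_metadata` and `resize`) are candidates for running concurrently.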


  7. Workflow system design
    A non-trivial workflow system requires a Task API.
    Allows you to plug in task implementations that the
    system can use at runtime.


  8. Motivating example: Software builds
    Pants is a scalable software build system with a
    design emphasis on user-friendliness.
    Implements a workflow system in which:
    ● The workflow engine is implemented in Rust
    ● The tasks themselves are implemented in Python


  9. Workflow for software builds
    Tasks are build steps:
    ● generating code
    ● resolving dependencies
    ● compiling
    ● testing
    ● linting
    ● formatting
    ● type-checking
    ● packaging


  10. Design goals
    ● Fine-grained tasks
    ● Caching
    ● Concurrency
    ● Remote execution
    ● Extensibility


  11. Task API design challenges
    ● Tasks must run cooperatively
    ● Tasks must not have side effects
    ● Task dependencies must be explicit
    But also…
    ● Tasks must be straightforward to write


  12. Python to the rescue!
    Specifically:
    ● type annotations
    ● asyncio
    ● dataclasses


  13. ● type annotations
    ● asyncio
    ● dataclasses


  14. Rules
    A rule is a pure function that maps a set of
    statically-declared input types to a statically-declared
    output type.


  15. Example
    @rule
    def run_python_test(
        test_file: PythonTestFile,
        pytest_config: PyTestConfig,
        test_options: TestOptions,
    ) -> TestResult:
        """Runs pytest on one test file."""
        ...


  16. Building the rule graph
    Given a set of rules, we construct a rule graph by
    introspecting the type annotations:
    B → A if B has A's output type as one of its input
    types.
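
A minimal sketch of that introspection, using only `typing.get_type_hints`. The types, rules, and the `build_rule_graph` helper are hypothetical and are not the Pants implementation; the sketch also assumes each output type has a single producer.

```python
from typing import Callable, Dict, List, get_type_hints

# Hypothetical data types and rules, for illustration only.
class SourceFile: ...
class ParsedFile: ...
class LintResult: ...

def parse(source: SourceFile) -> ParsedFile: ...
def lint(parsed: ParsedFile) -> LintResult: ...

def build_rule_graph(rules: List[Callable]) -> Dict[str, List[str]]:
    """Map each rule name to the rules that produce one of its input types."""
    hints = {rule.__name__: get_type_hints(rule) for rule in rules}
    # Index rules by the type they return.
    producers = {h["return"]: name for name, h in hints.items()}
    graph = {}
    for name, h in hints.items():
        input_types = [t for param, t in h.items() if param != "return"]
        # An input type with no producing rule must come from outside the graph.
        graph[name] = [producers[t] for t in input_types if t in producers]
    return graph

rule_graph = build_rule_graph([parse, lint])
print(rule_graph)
```

Here `lint` depends on `parse` only because `parse` returns the type that `lint` takes as a parameter; neither function names the other.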


  17. Static validation
    Rules are statically validated for ambiguity,
    reachability, and satisfiability.
    You can register custom rules to extend functionality.
    No wiring necessary!
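
The checks might look roughly like the sketch below. It uses deliberately simplified definitions, which are assumptions rather than the engine's actual ones: "ambiguous" means two rules produce the same output type, and "unsatisfiable" means a rule needs a type that no rule produces and that is not supplied as a root input. Reachability is omitted, and all names are hypothetical.

```python
from typing import Callable, List, Set, get_type_hints

# Hypothetical types and rules, for illustration only.
class Config: ...
class Report: ...
class Summary: ...

def make_report(config: Config) -> Report: ...
def make_summary(report: Report) -> Summary: ...
def other_report(config: Config) -> Report: ...

def validate(rules: List[Callable], root_types: Set[type]) -> List[str]:
    """Return a list of problems; an empty list means the rule set is valid."""
    problems = []
    producers = {}
    for rule in rules:
        output_type = get_type_hints(rule)["return"]
        if output_type in producers:
            problems.append(
                f"ambiguous: {producers[output_type]} and {rule.__name__} "
                f"both produce {output_type.__name__}"
            )
        else:
            producers[output_type] = rule.__name__
    # Types that are either given up front or produced by some rule.
    available = set(root_types) | set(producers)
    for rule in rules:
        for param, input_type in get_type_hints(rule).items():
            if param != "return" and input_type not in available:
                problems.append(
                    f"unsatisfiable: {rule.__name__} needs {input_type.__name__}"
                )
    return problems

print(validate([make_report, make_summary], {Config}))
print(validate([make_report, other_report], {Config}))
```

Because the checks read only annotations, they can run before any rule body executes.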


  18. Type annotations provide
    ● Fine-grained tasks ✓
    ● Caching
    ● Concurrency
    ● Remote execution
    ● Extensibility ✓


  19. ● type annotations
    ● asyncio
    ● dataclasses


  20. Rules - a correction
    A rule is a pure coroutine (rather than a plain function) that maps a set of
    statically-declared input types to a statically-declared
    output type.


  21. Example
    @rule
    async def run_python_test(
        test_file: PythonTestFile,
        pytest_config: PyTestConfig,
        test_options: TestOptions,
    ) -> TestResult:
        """Runs pytest on one test file."""
        ...


  22. Why coroutines?
    As a rule runs, if it needs some extra input, it can
    yield back to the workflow engine:
    if not test_file.is_empty():
        pytest = await Get(PyTest, pytest_config)
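
To illustrate how a coroutine can yield back to something other than an asyncio event loop, here is a toy driver. The `Get` class, the `run` function, and the resolver below are hypothetical stand-ins written for this sketch; they are not the Pants engine or its API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Get:
    """Hypothetical request: 'give me a value of output_type for subject'."""
    output_type: type
    subject: object

    def __await__(self):
        # Hand the request to whoever is driving the coroutine, then
        # resume with the value that the driver passes to send().
        result = yield self
        return result

async def greet_rule(name: str) -> str:
    greeting = await Get(str, name)  # control returns to the driver here
    return f"{greeting}!"

def run(coroutine, resolve):
    """Drive a coroutine to completion, answering each Get via resolve()."""
    try:
        request = coroutine.send(None)  # run until the first await
        while True:
            request = coroutine.send(resolve(request))
    except StopIteration as done:
        return done.value  # the coroutine's return value

result = run(greet_rule("PyCon"), lambda req: f"Hello, {req.subject}")
print(result)
```

Each `await` is a point where the driver decides how to satisfy the request, which is where a real engine could consult a cache or dispatch the work elsewhere.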


  23. Coroutines are powerful
    Rules are applied dynamically, on the fly, rather than
    execution being precomputed statically.
    However, even in this case, rules are still statically
    validated for ambiguity, reachability, and satisfiability.


  24. Coroutines can express concurrency
    test_results = await MultiGet(
        Get(TestResult, test_file)
        for test_file in test_files
    )
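
For readers who know standard asyncio, a loosely analogous pattern is `asyncio.gather`. The sketch below is a stand-in using only the standard library, with hypothetical names (`run_test`, `run_all`); it does not show how `MultiGet` itself is implemented.

```python
import asyncio
from typing import List

async def run_test(test_file: str) -> str:
    # Stand-in for real work; sleep(0) just yields control to the event loop.
    await asyncio.sleep(0)
    return f"{test_file}: passed"

async def run_all(test_files: List[str]) -> List[str]:
    # gather() schedules all the coroutines concurrently and returns
    # their results in the same order as the inputs.
    return await asyncio.gather(*(run_test(f) for f in test_files))

test_results = asyncio.run(run_all(["test_a.py", "test_b.py"]))
print(test_results)
```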


  25. Coroutines help avoid side effects
    init_files = await Get(
        Snapshot, PathGlobs(["**/__init__.py"])
    )
    result = await Get(
        ProcessResult,
        Process(argv=["/bin/echo", "hello world"]),
    )


  26. Coroutines provide natural control points for
    ● Fine-grained tasks ✓
    ● Caching
    ● Concurrency ✓
    ● Remote execution ✓
    ● Extensibility ✓


  27. ● type annotations
    ● asyncio
    ● dataclasses


  28. Cacheability
    For caching to work, rule input types must be
    immutable and hashable.


  29. It's trivial to make types cacheable
    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class PyTestConfig:
        version: str
        plugins: Tuple[str, ...]

    pytest_config = PyTestConfig(
        version="pytest>=5.3.5,<5.4",
        plugins=("pytest-cov>=2.8.1,<2.9",),
    )
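
A small sketch of why immutability and hashability matter for caching: a frozen dataclass compares and hashes by value, so two equal inputs map to the same cache entry. The `ToolConfig` type, the `resolve_tool` function, and the dict-based cache are hypothetical and are much simpler than a real engine's cache.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class ToolConfig:
    version: str
    plugins: Tuple[str, ...]  # a tuple, not a list, so the field is hashable

cache: Dict[ToolConfig, str] = {}
calls: List[ToolConfig] = []  # records each time real work is done

def resolve_tool(config: ToolConfig) -> str:
    if config not in cache:
        calls.append(config)
        cache[config] = f"resolved {config.version}"
    return cache[config]

first = resolve_tool(ToolConfig("1.0", ("cov",)))
second = resolve_tool(ToolConfig("1.0", ("cov",)))  # equal value: cache hit

try:
    # Frozen instances reject attribute assignment, so a cache key
    # cannot change after it has been stored.
    ToolConfig("1.0", ()).version = "2.0"
    mutated = True
except FrozenInstanceError:
    mutated = False

print(first == second, len(calls), mutated)
```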


  30. Together, these three features provide
    ● Fine-grained tasks ✓
    ● Caching ✓
    ● Concurrency ✓
    ● Remote execution ✓
    ● Extensibility ✓


  31. Summary
    Using Python features in unusual ways allows us to
    expose a simple programming model to a complex
    system.
    You write Pythonic code, and caching, concurrency
    and remote execution "just happen".


  32. Thanks for attending!
    I'll be happy to take any questions.
    You can find us in Startup Row!
    You can also find more about Pants at
    https://www.pantsbuild.org/
