Benjy
June 09, 2021

Creating extensible workflows with off-label use of Python

Workflow-oriented systems have many uses, including data processing and analysis, ETL, CI/CD, and more. But creating a programmatic interface to a workflow system is a delicate balancing act: we want the API to be flexible enough to support useful work, but also constrained enough that tasks run cooperatively within the larger system.

We faced this challenge when designing the task API for the Pants build system. We needed to allow custom task code to enjoy the benefits of complex features like caching, concurrency and remote execution, without every task author having to reason about them.

In this talk we'll show how we found the right balance through unconventional use of Python's type annotations, coroutines, and dataclasses. Combining these seemingly disparate features in the context of a workflow engine allows you to build elegant extensibility APIs with just the right amount of flexibility.

Transcript

  1. About me
    • 25 years' experience as a Software Engineer.
    • Worked at Google, Twitter, Foursquare.
    • Maintainer of the Pants OSS project.
    • Co-founder of Toolchain.
  2. What is a workflow?
    A sequence of tasks that processes data to produce a desired result.
  3. Workflows show up all over the place
    • Processing uploaded images
    • Building ML models
    • Aggregating ad clicks
    • ETL
    • CI/CD
    And many, many more examples.
  4. Workflow is defined by a task graph
    A directed, acyclic graph in which the vertices are tasks and the edges are direct data dependencies: B → A if B requires A's output as one of its inputs.
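
To make the B → A edge convention concrete, here is a minimal sketch of deriving an execution order from a task graph. The function name `execution_order` and the example task names are illustrative assumptions, not anything taken from Pants.

```python
from typing import Dict, List, Set


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return tasks ordered so that every task comes after its dependencies.

    `deps[b]` is the set of tasks whose output task `b` needs (the B → A edges).
    """
    order: List[str] = []
    done: Set[str] = set()
    in_progress: Set[str] = set()

    def visit(task: str) -> None:
        if task in done:
            return
        if task in in_progress:
            # Revisiting a task we are still expanding means the graph is not acyclic.
            raise ValueError(f"cycle detected at task {task!r}")
        in_progress.add(task)
        for dep in sorted(deps.get(task, ())):
            visit(dep)
        in_progress.discard(task)
        done.add(task)
        order.append(task)

    for task in sorted(deps):
        visit(task)
    return order


# Hypothetical build workflow: each task maps to the tasks it depends on.
build = {
    "test": {"compile"},
    "compile": {"resolve", "codegen"},
    "resolve": set(),
    "codegen": set(),
}
order = execution_order(build)
print(order)
```

Dependencies are visited in sorted order only to make the result deterministic; any order that places a task after all of its dependencies would be equally valid.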
  5. Workflow system design
    A non-trivial workflow system requires a Task API. Allows you to plug in task implementations that the system can use at runtime.
  6. Motivating example: Software builds
    Pants is a scalable software build system with a design emphasis on user-friendliness. Implements a workflow system in which:
    • The workflow engine is implemented in Rust
    • The tasks themselves are implemented in Python
  7. Workflow for software builds
    Tasks are build steps: generating code, linting, resolving dependencies, formatting, compiling, type-checking, testing, packaging.
  8. Task API design challenges
    • Tasks must run cooperatively
    • Tasks must not side-effect
    • Task dependencies must be explicit
    But also…
    • Tasks must be straightforward to write
  9. Rules
    A rule is a pure function that maps a set of statically-declared input types to a statically-declared output type.
  10. Building the rule graph
    Given a set of rules, we construct a rule graph by introspecting the type annotations: B → A if B has A's output type as one of its input types.
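
As a rough illustration of building a graph from annotations, the sketch below uses `typing.get_type_hints` to read each function's input and output types. The `toy_rule` decorator, the example types, and `build_rule_graph` are hypothetical stand-ins written for this example; they are not the Pants implementation, and the rules here are plain functions for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, get_type_hints


@dataclass(frozen=True)
class SourceFiles:
    paths: Tuple[str, ...]


@dataclass(frozen=True)
class CompiledClasses:
    count: int


@dataclass(frozen=True)
class TestResult:
    passed: bool


REGISTERED: List[Callable] = []


def toy_rule(func: Callable) -> Callable:
    """Hypothetical registration decorator: just remembers the function."""
    REGISTERED.append(func)
    return func


@toy_rule
def compile_sources(sources: SourceFiles) -> CompiledClasses:
    return CompiledClasses(count=len(sources.paths))


@toy_rule
def run_tests(classes: CompiledClasses) -> TestResult:
    return TestResult(passed=classes.count > 0)


def build_rule_graph(rules: List[Callable]) -> Dict[str, List[str]]:
    """Map each rule name to the names of the rules that produce its inputs."""
    # Index rules by their declared return type.
    producers = {get_type_hints(r)["return"]: r for r in rules}
    graph: Dict[str, List[str]] = {}
    for r in rules:
        hints = get_type_hints(r)
        input_types = [t for name, t in hints.items() if name != "return"]
        # Edge B → A whenever one of B's input types is A's output type.
        graph[r.__name__] = [
            producers[t].__name__ for t in input_types if t in producers
        ]
    return graph


graph = build_rule_graph(REGISTERED)
print(graph)
```

Note that neither rule names the other: the edge from `run_tests` to `compile_sources` is derived purely from the `CompiledClasses` annotation, and `SourceFiles` has no producer because it is treated as an externally supplied input.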
  11. Static validation
    Rules are statically validated for ambiguity, reachability, satisfiability. You can register custom rules to extend functionality. No wiring necessary!
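
The sketch below shows what two of these checks might look like on a toy rule set, under simplifying assumptions: rules are described only by type names, "ambiguity" means two rules produce the same output type, and "satisfiability" means every input type is either produced by some rule or supplied as a root. The `validate` function and all names in it are hypothetical; the validation Pants actually performs is not shown in the slides and is likely more involved.

```python
from typing import Dict, List, Set, Tuple

# A toy rule description: (input type names, output type name).
Rule = Tuple[Tuple[str, ...], str]


def validate(rules: Dict[str, Rule], roots: Set[str]) -> List[str]:
    """Return a list of problems found before any rule is run."""
    errors: List[str] = []

    # Group rule names by the type they produce.
    producers: Dict[str, List[str]] = {}
    for name, (_, output) in rules.items():
        producers.setdefault(output, []).append(name)

    # Ambiguity: more than one rule can produce the same type.
    for output, names in sorted(producers.items()):
        if len(names) > 1:
            errors.append(
                f"ambiguous: {output} is produced by {', '.join(sorted(names))}"
            )

    # Satisfiability: every input must come from a rule or from a root.
    for name, (inputs, _) in sorted(rules.items()):
        for needed in inputs:
            if needed not in producers and needed not in roots:
                errors.append(
                    f"unsatisfiable: {name} needs {needed}, which nothing provides"
                )
    return errors


good = {
    "compile": (("Sources",), "Classes"),
    "test": (("Classes",), "TestResult"),
}
bad = dict(good)
bad["compile_fast"] = (("Sources",), "Classes")
bad["package"] = (("Manifest",), "Jar")

good_errors = validate(good, roots={"Sources"})
bad_errors = validate(bad, roots={"Sources"})
print(good_errors)
print(bad_errors)
```

Because the checks need only the declared types, they can run when rules are registered, before any task executes.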
  12. Rules - a correction
    A rule is a pure coroutine, not a pure function, that maps a set of statically-declared input types to a statically-declared output type.
  13. Why coroutines?
    As a rule runs, if it needs some extra input, it can yield back to the workflow engine:
        if not test_file.is_empty():
            pytest = await Get(PyTest, pytest_config)
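
One way to picture "yielding back to the engine" is a driver loop that runs a coroutine by hand with `send()`, without asyncio. In this minimal sketch, `ToyGet`, `run`, and `providers` are hypothetical stand-ins for illustration; the real engine in Pants is written in Rust and its `Get` is not implemented this way as far as the slides show.

```python
from dataclasses import dataclass
from typing import Any, Callable, Coroutine, Dict, Tuple


@dataclass(frozen=True)
class ToyGet:
    """A request for a value of `output_type`, computed from `input`."""

    output_type: type
    input: Any

    def __await__(self):
        # Hand this request to whoever is driving the coroutine. The value the
        # driver passes to send() becomes the result of the `await` expression.
        result = yield self
        return result


@dataclass(frozen=True)
class Greeting:
    name: str


@dataclass(frozen=True)
class Message:
    text: str


async def shout(name: str) -> str:
    # Only ask the engine for more input when it is actually needed.
    if not name:
        return ""
    message = await ToyGet(Message, Greeting(name))
    return message.text.upper()


def run(coro: Coroutine, providers: Dict[Tuple[type, type], Callable]) -> Any:
    """Drive `coro` to completion, answering each ToyGet it yields."""
    try:
        request = coro.send(None)  # run until the first await, if any
        while True:
            provider = providers[(request.output_type, type(request.input))]
            # This is the control point: the driver decides how the value is
            # produced before resuming the coroutine with it.
            request = coro.send(provider(request.input))
    except StopIteration as finished:
        return finished.value


providers = {(Message, Greeting): lambda g: Message(f"hello {g.name}")}
result = run(shout("pants"), providers)
print(result)
```

The coroutine never calls a provider directly. Every request passes through `run`, which is the spot where a real engine could choose to consult a cache, run requests concurrently, or dispatch work elsewhere.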
  14. Coroutines are powerful
    Rules are applied dynamically, on the fly, rather than execution being precomputed statically. However, even in this case, rules are still statically validated for ambiguity, reachability, satisfiability.
  15. Coroutines help avoid side effects
        init_files = await Get(Snapshot, PathGlobs(["**/__init__.py"]))
        result = await Get(ProcessResult, Process(argv=["/bin/echo", "hello world"]))
  16. Coroutines provide natural control points for
    • Fine-grained tasks ✓
    • Caching
    • Concurrency ✓
    • Remote execution ✓
    • Extensibility ✓
  17. It's trivial to make types cacheable
        @dataclass(frozen=True)
        class PyTestConfig:
            version: str
            plugins: Tuple[str, ...]

        pytest_config = PyTestConfig(
            version="pytest>=5.3.5,<5.4",
            plugins=("pytest-cov>=2.8.1,<2.9",),
        )
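
A short sketch of why a frozen dataclass works as a cache key: with `frozen=True` (and the default `eq=True`) instances are immutable, compare by value, and are hashable as long as their fields are. The `describe` function and the dictionary cache below are illustrative assumptions, not how Pants stores results.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PyTestConfig:
    version: str
    # A tuple rather than a list, so that the field itself is hashable.
    plugins: Tuple[str, ...]


cache: Dict[PyTestConfig, str] = {}
computed: List[PyTestConfig] = []  # records each time real work is done


def describe(config: PyTestConfig) -> str:
    """Hypothetical expensive step, memoized on the config value."""
    if config not in cache:
        computed.append(config)
        cache[config] = f"{config.version} with {len(config.plugins)} plugin(s)"
    return cache[config]


a = PyTestConfig(version="pytest>=5.3.5,<5.4", plugins=("pytest-cov>=2.8.1,<2.9",))
b = PyTestConfig(version="pytest>=5.3.5,<5.4", plugins=("pytest-cov>=2.8.1,<2.9",))

first = describe(a)
second = describe(b)  # equal value, so this is a cache hit

try:
    a.version = "pytest>=6"
    mutated = True
except FrozenInstanceError:
    mutated = False  # frozen instances reject attribute assignment
```

Two separately constructed configs with equal fields share one cache entry, and because instances cannot be mutated after creation, a key cannot drift out of sync with the result stored under it.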
  18. Coroutines provide natural control points for
    • Fine-grained tasks ✓
    • Caching ✓
    • Concurrency ✓
    • Remote execution ✓
    • Extensibility ✓
  19. Summary
    Using Python features in unusual ways allows us to expose a simple programming model to a complex system. You write Pythonic code, and caching, concurrency and remote execution "just happen".
  20. Thanks for attending!
    I'll be happy to take any questions. You can find us in Startup Row! You can also find more about Pants at https://www.pantsbuild.org/