Slide 1

Slide 1 text

Creating extensible workflows with off-label use of Python
Benjy Weinberger
Maintainer, Pants Build
PyCon US 2021

Slide 2

Slide 2 text

About me ● 25 years' experience as a Software Engineer. ● Worked at Google, Twitter, Foursquare. ● Maintainer of the Pants OSS project. ● Co-founder of Toolchain.

Slide 3

Slide 3 text

What is a workflow? A sequence of tasks that processes data to produce a desired result.

Slide 4

Slide 4 text

Workflows show up all over the place ● Processing uploaded images ● Building ML models ● Aggregating ad clicks ● ETL ● CI/CD And many, many more examples.

Slide 5

Slide 5 text

Example: Processing Uploaded Images [Diagram: a task graph with the tasks Validate, Extract Metadata, Resize and Store; data labels: Image data, Image name, DB Entry]

Slide 6

Slide 6 text

Workflow is defined by a task graph A directed, acyclic graph in which the vertices are tasks and the edges are direct data dependencies: B → A if B requires A's output as one of its inputs.
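As a minimal sketch of the definition above, a task graph can be written as a mapping from each task to the tasks whose output it needs, and a valid execution order falls out of a topological sort. The task names come from the image-processing example; the exact edges between them are an assumption made for illustration, since the diagram did not survive extraction.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks whose output it requires (B -> A).
# The edges below are assumed for illustration only.
task_graph = {
    "validate": set(),
    "extract_metadata": {"validate"},
    "resize": {"validate"},
    "store": {"extract_metadata", "resize"},
}

# static_order() yields every task after all of its dependencies,
# and raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(task_graph).static_order())
print(order)
```

Any order that respects the edges is valid, which is what leaves a workflow engine free to run independent tasks (here, the metadata and resize steps) concurrently.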

Slide 7

Slide 7 text

Workflow system design A non-trivial workflow system requires a Task API. Allows you to plug in task implementations that the system can use at runtime.

Slide 8

Slide 8 text

Motivating example: Software builds Pants is a scalable software build system with a design emphasis on user-friendliness. Implements a workflow system in which: ● The workflow engine is implemented in Rust ● The tasks themselves are implemented in Python

Slide 9

Slide 9 text

Workflow for software builds Tasks are build steps: ● generating code ● linting ● resolving dependencies ● formatting ● compiling ● type-checking ● testing ● packaging

Slide 10

Slide 10 text

Design goals ● Fine-grained tasks ● Caching ● Concurrency ● Remote execution ● Extensibility

Slide 11

Slide 11 text

Task API design challenges ● Tasks must run cooperatively ● Tasks must not side-effect ● Task dependencies must be explicit But also… ● Tasks must be straightforward to write

Slide 12

Slide 12 text

Python to the rescue! Specifically: ● type annotations ● asyncio ● dataclasses

Slide 13

Slide 13 text

● type annotations ● asyncio ● dataclasses

Slide 14

Slide 14 text

Rules A rule is a pure function that maps a set of statically-declared input types to a statically-declared output type.

Slide 15

Slide 15 text

Example

    @rule
    def run_python_test(test_file: PythonTestFile,
                        pytest_config: PyTestConfig,
                        test_options: TestOptions) -> TestResult:
        """Runs pytest on one test file."""
        ...

Slide 16

Slide 16 text

Building the rule graph Given a set of rules, we construct a rule graph by introspecting the type annotations: B → A if B has A's output type as one of its input types.
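The introspection step can be sketched with the standard library's `typing.get_type_hints`. This is a toy illustration, not Pants's implementation: the `rule` decorator, the `RULES` registry, `build_rule_graph`, the `summarize_coverage` rule and the `CoverageReport` type are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, get_type_hints


class PythonTestFile: ...
class PyTestConfig: ...
class TestResult: ...
class CoverageReport: ...  # hypothetical type, for illustration only


RULES: List[Callable] = []


def rule(func: Callable) -> Callable:
    """Toy stand-in for a @rule decorator: it only records the function."""
    RULES.append(func)
    return func


@rule
def run_python_test(test_file: PythonTestFile, config: PyTestConfig) -> TestResult: ...


@rule
def summarize_coverage(result: TestResult) -> CoverageReport: ...


def build_rule_graph(rules: List[Callable]) -> Dict[str, List[str]]:
    """Map each rule name to the rules that produce its input types."""
    # Index rules by their declared return type.
    producers = {get_type_hints(r)["return"]: r for r in rules}
    graph: Dict[str, List[str]] = {}
    for r in rules:
        hints = get_type_hints(r)
        inputs = [t for name, t in hints.items() if name != "return"]
        # Edge B -> A whenever B takes A's output type as an input.
        graph[r.__name__] = [producers[t].__name__ for t in inputs if t in producers]
    return graph


graph = build_rule_graph(RULES)
print(graph)
```

Input types that no rule produces (here `PythonTestFile` and `PyTestConfig`) simply have no incoming edge in this sketch; a real engine would have to supply them some other way.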

Slide 17

Slide 17 text

Static validation Rules are statically validated for ambiguity, reachability and satisfiability. You can register custom rules to extend functionality. No wiring necessary!
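To give a feel for what two of these checks might look like, here is a simplified sketch that works purely from type annotations, before any rule runs. It assumes a deliberately naive notion of ambiguity (two rules declaring the same output type) and of satisfiability (every input type is either a root type or produced by some rule); the real checks are likely more nuanced. All names below are hypothetical.

```python
from typing import get_type_hints


class SourceFile: ...
class Compiled: ...
class Packaged: ...
class Unknown: ...


def compile_source(src: SourceFile) -> Compiled: ...
def compile_again(src: SourceFile) -> Compiled: ...
def package(compiled: Compiled, extra: Unknown) -> Packaged: ...


def validate(rules, root_types):
    """Return a list of problems found by inspecting annotations only."""
    errors = []
    producers = {}
    for r in rules:
        out = get_type_hints(r)["return"]
        producers.setdefault(out, []).append(r.__name__)

    # Naive ambiguity check: more than one rule for the same output type.
    for out, names in producers.items():
        if len(names) > 1:
            errors.append(f"ambiguous: {out.__name__} is produced by {', '.join(names)}")

    # Naive satisfiability check: every input must come from somewhere.
    available = set(root_types) | set(producers)
    for r in rules:
        for name, t in get_type_hints(r).items():
            if name != "return" and t not in available:
                errors.append(f"unsatisfiable: {r.__name__} needs {t.__name__}")
    return errors


errors = validate([compile_source, compile_again, package], root_types={SourceFile})
for e in errors:
    print(e)
```

Because nothing is executed, problems like these can be reported when rules are registered rather than in the middle of a build.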

Slide 18

Slide 18 text

Type annotations provide ● Fine-grained tasks✓ ● Caching ● Concurrency ● Remote execution ● Extensibility✓

Slide 19

Slide 19 text

● type annotations ● asyncio ● dataclasses

Slide 20

Slide 20 text

Rules - a correction A rule is a pure coroutine (rather than a pure function) that maps a set of statically-declared input types to a statically-declared output type.

Slide 21

Slide 21 text

Example

    @rule
    async def run_python_test(test_file: PythonTestFile,
                              pytest_config: PyTestConfig,
                              test_options: TestOptions) -> TestResult:
        """Runs pytest on one test file."""
        ...

Slide 22

Slide 22 text

Why coroutines? As a rule runs, if it needs some extra input, it can yield back to the workflow engine:

    if not test_file.is_empty():
        pytest = await Get(PyTest, pytest_config)
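The mechanics of "yielding back to the engine" can be sketched in plain Python, without an event loop. In this toy version, `Get` is an awaitable that hands itself to whoever is driving the coroutine, and a small `run` loop plays the role of the engine. The `Get`, `resolve` and `run` definitions here are assumptions written for illustration; they are not Pants's actual classes.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Get:
    """Toy request: 'give me an output_type computed from input_value'."""
    output_type: type
    input_value: Any

    def __await__(self):
        # Suspend the rule and hand this request to the driver;
        # resume with whatever value the driver sends back.
        result = yield self
        return result


@dataclass(frozen=True)
class PyTestConfig:
    version: str


@dataclass(frozen=True)
class PyTest:
    requirement: str


async def run_python_test(test_file_is_empty: bool, config: PyTestConfig) -> str:
    if test_file_is_empty:
        return "skipped"  # no extra input was ever requested
    pytest = await Get(PyTest, config)
    return f"ran tests with {pytest.requirement}"


def resolve(request: Get) -> Any:
    """Toy engine logic: knows how to build a PyTest from a PyTestConfig."""
    if request.output_type is PyTest:
        return PyTest(requirement=f"pytest{request.input_value.version}")
    raise LookupError(f"no way to produce {request.output_type.__name__}")


def run(coro) -> Any:
    """Drive a rule coroutine, answering each Get it yields."""
    to_send = None
    while True:
        try:
            request = coro.send(to_send)
        except StopIteration as done:
            return done.value
        to_send = resolve(request)


print(run(run_python_test(False, PyTestConfig(">=5.3.5"))))
print(run(run_python_test(True, PyTestConfig(">=5.3.5"))))
```

Every `await` is a point where the driver regains control, so it is free to answer the request from a cache, compute it locally, or send it elsewhere without the rule knowing the difference.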

Slide 23

Slide 23 text

Coroutines are powerful Rules are applied dynamically, on the fly, rather than execution being precomputed statically. However, even in this case, rules are still statically validated for ambiguity, reachability and satisfiability.

Slide 24

Slide 24 text

Coroutines can express concurrency

    test_results = await MultiGet(
        Get(TestResult, test_file) for test_file in test_files
    )
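For readers who have not seen this pattern, the closest standard-library analogy is `asyncio.gather`, which also awaits several independent requests at once and returns their results in order. The sketch below is only an analogy: `MultiGet` in the slide is handled by the workflow engine, and the `run_one_test` and `run_all` names are hypothetical.

```python
import asyncio
from typing import List


async def run_one_test(test_file: str) -> str:
    # A yield point standing in for real work (e.g. running a process).
    await asyncio.sleep(0)
    return f"{test_file}: passed"


async def run_all(test_files: List[str]) -> List[str]:
    # All requests are issued together; results come back in input order.
    return await asyncio.gather(*(run_one_test(f) for f in test_files))


results = asyncio.run(run_all(["a_test.py", "b_test.py"]))
print(results)
```

The rule author only states that the requests are independent; how many actually run in parallel is left to whatever is scheduling them.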

Slide 25

Slide 25 text

Coroutines help avoid side effects

    init_files = await Get(Snapshot, PathGlobs(["**/__init__.py"]))
    result = await Get(
        ProcessResult,
        Process(argv=["/bin/echo", "hello world"])
    )

Slide 26

Slide 26 text

Coroutines provide natural control points for ● Fine-grained tasks✓ ● Caching ● Concurrency✓ ● Remote execution✓ ● Extensibility✓

Slide 27

Slide 27 text

● type annotations ● asyncio ● dataclasses

Slide 28

Slide 28 text

Cacheability For caching to work, rule input types must be immutable and hashable.

Slide 29

Slide 29 text

It's trivial to make types cacheable

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class PyTestConfig:
        version: str
        plugins: Tuple[str, ...]

    pytest_config = PyTestConfig(
        version="pytest>=5.3.5,<5.4",
        plugins=("pytest-cov>=2.8.1,<2.9",)
    )
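A short sketch of why a frozen dataclass is enough: `frozen=True` (with the default `eq=True`) makes instances immutable and hashable, so two equal configs are the same cache key. The dictionary cache and the `resolve_pytest` function are hypothetical stand-ins for whatever the engine really does.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PyTestConfig:
    version: str
    plugins: Tuple[str, ...]  # a tuple, not a list, so the field is hashable


cache: Dict[PyTestConfig, str] = {}
calls: List[PyTestConfig] = []


def resolve_pytest(config: PyTestConfig) -> str:
    """Pretend-expensive step, memoized on its (hashable) input."""
    if config not in cache:
        calls.append(config)  # record that real work happened
        cache[config] = f"resolved {config.version}"
    return cache[config]


a = PyTestConfig("pytest>=5.3.5,<5.4", ("pytest-cov>=2.8.1,<2.9",))
b = PyTestConfig("pytest>=5.3.5,<5.4", ("pytest-cov>=2.8.1,<2.9",))

resolve_pytest(a)
resolve_pytest(b)  # equal to `a`, so this is a cache hit
print(len(calls))

try:
    a.version = "pytest>=6"  # frozen instances reject mutation
    mutated = True
except FrozenInstanceError:
    mutated = False
print(mutated)
```

Note that a frozen dataclass is only hashable if all of its fields are, which is why `plugins` is declared as a tuple.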

Slide 30

Slide 30 text

Dataclasses provide ● Fine-grained tasks✓ ● Caching✓ ● Concurrency✓ ● Remote execution✓ ● Extensibility✓

Slide 31

Slide 31 text

Summary Using Python features in unusual ways allows us to expose a simple programming model for a complex system. You write Pythonic code, and caching, concurrency and remote execution "just happen".

Slide 32

Slide 32 text

Thanks for attending! I'll be happy to take any questions. You can find us in Startup Row! You can also find more about Pants at https://www.pantsbuild.org/