Hermetic Environments in Pantsbuild

Christopher Neugebauer • [email protected] @chrisjrn • @pantsbuild
Hermetic Environments in Pantsbuild, and how they make development tools more efficient, no matter how large your codebase
Hi! I’m Christopher Neugebauer, I work as an engineer at Toolchain, and I’m also a maintainer on the Pantsbuild open source project. Feel free to tweet me at the handle at the bottom of this slide if you want to loudly disagree with me, or send me questions by the email on my screen. Today’s talk is about hermetic environments, which is an approach Pants uses to make sure that we can predictably model tasks that you run while writing and testing code.

@chrisjrn • @pantsbuild • pantsbuild.org
WARNING: I am not a security person. This is not a security talk.
First up, I want to point out that this is not a security talk. A lot of the approaches we’re talking about trade off guaranteeing absolute correctness and security at every stage against delivering speedy performance. If you have different goals, you’d probably make different choices.

Today: Hermetic environments: Why, What, and How
• A high-level look at Pants
• What is reproducibility?
• Sandboxing for hermetic builds
• How Pants implements these things
The focus of today’s talk is on how reproducibility can make your build tools perform better, and what sandboxing techniques can deliver that reproducibility without sacrificing performance. That theoretical material is sandwiched between an introduction to Pants and how Pants actually implements these things, so you can see how this all works in practice.

What is Pants?
We’re going to start the talk with a quick introduction to Pants, and some of the problems that motivate the techniques we’re talking about in this talk…

Pants is a build system
Pantsbuild is a Build System, which is a term that’s a bit of a holdover from compiled languages, where you need to run lots of tools in a specific order to get your code to run at all. We orchestrate all of the tools that interact with your code, everything from linting to testing and all the way to building a package for deployment or distribution. So even in Python, where there’s no compilation step, it can help orchestrate correctness tools like pytest, mypy, or formatters and linters like flake8 and black, so that you can run them more efficiently, and only have to interact with one tool to run all of them.

Terminology! Goals, Rules, Processes
There’s a small amount of jargon that I am going to need to get out of the way first, which will hopefully make the rest of the talk go more smoothly. In the Build Systems community, and in Pants, there are terms for the units of work that get run in your development workflows. The ones I’m going to use a lot are Goals, Rules, and Processes.

Goals are things that an end-user will ask Pants to do for them. Things like “run this test”, or “typecheck the source files in this directory”, or “package this library”. Most CI and pre-commit workflows are made out of multiple Goals.

Rules are the individual steps that Pants needs to perform to accomplish a Goal. A Rule might be something like “figure out which files to give to mypy to run the type checker”, or “figure out which source file is the entry point for a Python executable”. Goals are made up of Rules, and Rules can themselves run more Rules.

Processes are where we run the actual underlying tools that Pants is orchestrating. A Process might be something like “run pytest with these specific source files”.

Pants 2 is a Python tool for Python People
Now, today we’re talking about Pants 2, which is a new tool inspired by the original version of Pants developed at Twitter. It’s a complete rebuild from the ground up by a community of open source developers led by us at Toolchain, and the first releases were made very much with Python in mind. We’re trying to be good members of the Python community, and we want Pants to be a great experience for Python-focused codebases of any size. If you’re using it on a smallish Python-only codebase, we fit into the same category of tools as Tox…

Multiple languages, Large codebases
… but the goal of Pants is to grow with your codebase: this means supporting multiple languages, and being as efficient within a large codebase as it is with a small codebase. The ideal world for us is for a developer to be able to work on Scala, or Go, or Python code, and interact with the same tool commands no matter what language they use, and for those tools to be pleasant enough that they WANT to use them. Finally, we aim to make it easier to use a monorepo development workflow, and get the configuration management and code re-use benefits that come with that, by making it as efficient and easy to reason about as a multi-repo setup.

Better performance with the same underlying tools
The main goal for us is to complete your goals faster, while making use of the same underlying tools. Pants can do things like identify which rules can be run in parallel, or eliminate duplicated or unnecessary work. The idea is that as your codebase scales, the work that Pants does to configure and orchestrate your tools will be more efficient and effective than just running those tools in their default configurations. As an example…

    % ./pants test helloworld::
    …
    09:29:25.73 [INFO] Completed: Run Pytest - helloworld/translator/translator_test.py:tests succeeded.
    09:29:25.87 [INFO] Completed: Run Pytest - helloworld/greet/greeting_test.py:tests succeeded.
    ✓ helloworld/greet/greeting_test.py:tests succeeded in 0.50s.
    ✓ helloworld/translator/translator_test.py:tests succeeded in 0.38s.
Let’s say you have a test suite with two test files.

Pants will run your test suite! That’s not surprising!

    % echo "# Let's modify a test" \
        >> helloworld/translator/translator_test.py
    % ./pants test helloworld::
    09:31:41.20 [INFO] Completed: Run Pytest - helloworld/greet/greeting_test.py:tests succeeded.
    09:31:41.89 [INFO] Completed: Run Pytest - helloworld/translator/translator_test.py:tests succeeded.
    ✓ helloworld/greet/greeting_test.py:tests succeeded in 0.50s (memoized).
    ✓ helloworld/translator/translator_test.py:tests succeeded in 0.59s.
If you edit one of those two test files,

Pants will only actually re-run the test file that you edited. You can see that `greeting_test` is reused: that’s what `memoized` means. That’s cool! It cuts the runtime of this trivial case in half! It’s still not that surprising! Most tools do something like this! What is surprising?

    % git reset --hard
    HEAD is now at 93a76bc Upgrade to 2.10.0 (#99)
    % ./pants test helloworld::
    09:35:42.60 [INFO] Completed: Run Pytest - helloworld/greet/greeting_test.py:tests succeeded.
    09:35:42.61 [INFO] Completed: Run Pytest - helloworld/translator/translator_test.py:tests succeeded.
    ✓ helloworld/greet/greeting_test.py:tests succeeded in 0.50s (memoized).
    ✓ helloworld/translator/translator_test.py:tests succeeded in 0.38s (memoized).
First, let’s revert and run the tests again.

Note that both tests are memoized this time. Rather than just saving the most recent state of the tests, Pants has cached both of our previous runs! Since we’re running tests that we’ve run in the past, in exactly the same configuration, Pants will just reuse those results.

    an implementation" \
        >> helloworld/greet/greeting.py
    % ./pants test helloworld::
    09:38:23.71 [INFO] Completed: Run Pytest - helloworld/translator/translator_test.py:tests succeeded.
    09:38:24.31 [INFO] Completed: Run Pytest - helloworld/greet/greeting_test.py:tests succeeded.
    ✓ helloworld/greet/greeting_test.py:tests succeeded in 0.51s.
    ✓ helloworld/translator/translator_test.py:tests succeeded in 0.38s (memoized).
Now, what happens if, instead of changing a test file, I change the implementation file that is under test?

Edit the implementation: only one test is re-run! Again, Pants only re-runs the test that runs against the implementation that changed. Everything else is cached. That’s because Pants automatically understands the dependencies in your codebase through static analysis, and can use that information to figure out which tests must be run again. When you start dealing with codebases with hundreds or thousands of test and implementation files, decisions like this mean you don’t have to remember which tests are relevant to the parts of the codebase you’ve changed. When you have a huge, slow test suite, this can make it pleasant to regularly run your tests as you develop.

Remote caching and execution
We can run tools more efficiently locally, but to scale properly, we’re working to properly support remote caching and remote execution. Remote caching means that if one person on your team runs a given rule and another person needs an identical rule completed, we can fetch that from a cache instead of running it again. It’s surprising how often a team can run exactly the same rule over and over again. That’s a huge waste of time and money. Solving remote caching means thinking about a lot of problems around how we run the underlying tools that Pants orchestrates…

“What rules must be run to accomplish a goal?”
… like what rules actually need to be run in order to finish a user’s goal? Do we split our goals into one process, or several?

“Was this rule already run?”
Once we’ve split up the rules, can we test, with accuracy, to see if a given rule was already run?

“Can we reuse this result?”
And if a rule has already been run, can we reuse its output instead of running it again?

Was the result reproducible enough for our needs?
All of these boil down to one question: how do we make sure the end result of a rule is reliably reproducible? Can we be confident that if we use a result from a previous run, the end result is going to be valid? If we can run rules and be confident that running that same rule again will yield exactly the same result, then we don’t need to run those rules again…

Reproducibility: how much do you need?
… Which leads us to our first concept: Reproducibility. Reproducibility is the idea that if you run the same rule, you end up with the same result. It seems like a simple enough concept, but there’s enough of a sticking point that I’m going to need to talk a bit about what we mean by “the same”…

Reproducible Builds
… the elephant in the room here is this thing called “Reproducible Builds”. Reproducible Builds are a process in open source releasing that provides you with a guarantee that a binary package corresponds to a given set of source files.

Reproducible Builds
identical code + identical environment + identical dependencies = identical package
The idea behind reproducible builds is that if you start with a given snapshot of your codebase, and run it in a very well-specified environment, and guarantee that the dependencies are the same, you’ll end up with exactly the same package as the published binary.

Reproducible Builds
Pants can be used as part of a Reproducible Builds workflow, but most development teams do not actually need this level of guarantee: Ops people installing internal software tend to have high levels of trust of the developers at their own company. So Pants doesn’t go out of its way to be cryptographically reproducible, and that’s not what we’re going to be talking about today.

Improved performance for developer workflows
Our primary goal here is ensuring a useful level of correctness, but we care about the completion time for your goals, so that you spend less time at your desk waiting for tests to pass, and you spend less time waiting for your CI to go green.

Improved performance through Parallelisation
For us, Reproducibility means being able to be certain that you’ll get the same results running rules sequentially or in parallel.

Improved performance through Reducing the workload
And reproducibility means being confident that a given rule can be cached and reused rather than being run multiple times.

Predictable modelling
So rather than guaranteeing an identical result, the reproducibility we care about is predictably modelling the behaviour of the rule. Mathematically, we model each rule as a pure function from inputs to outputs. A rule with the same inputs should yield the same outputs.

This is easy for most rules!
For rules that are implemented entirely inside Pants, Python gives us all the tools we need. We make heavy use of frozen dataclasses, which are easy to cache and check for equality.

Processes are Rules too
The problem for us is that processes are rules too. Indeed, they end up being the rules that underpin basically every other rule that gets run. The rules that run entirely inside Pants are usually just setting up inputs and configuration for the rules that run processes. But processes are really difficult to model. They’re impacted by dependency versions, by operating system characteristics and more. Being able to parallelise or cache any rule means being able to make sure that we can make the results of processes as predictable as the code that we write ourselves in Python. Being predictable means modelling processes so they’re as cacheable as any other rule we run.

Degrees of Reproducibility
• “works on my machine”
• Same results in similar environments
• Cryptographic reproducibility
So if we want to be predictable, we can’t rely on “works on my machine” because…

… that’s not reproducible at all! And full cryptographic reproducibility…

… is more of a guarantee than most internal development teams actually need. What we care about is making sure we don’t get wrong results as long as we start with a similar environment. Annoyingly, this brings about the question…

Similar environments?
… what does it mean for an environment to be similar? Obviously this discussion is going to centre on Python tooling, but the same general concepts apply in other language ecosystems such as the JVM.

• OS (and architecture)
• Python version
• Dependency versions
• Tool configurations
In Python, the environment consists of four aspects. The first is the operating system, which may also include the architecture (particularly on macOS). Then there is the actual Python version. Then there are the versions of dependencies: Pants captures these using lockfiles, so we can capture the dependencies down to the specific artefact. And then there’s the configuration of the tools that are run in each rule.
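One way to picture “similar environment” is as a fingerprint over those four aspects. The sketch below is an illustration only, not the way Pants computes anything: `environment_fingerprint` is a made-up helper, and hashing the lockfile text stands in for pinning dependencies down to specific artefacts.

```python
import hashlib
import json
import platform


def environment_fingerprint(lockfile_text: str, tool_config: dict[str, str]) -> str:
    """Hash the four aspects of the environment into a single hex string."""
    parts = {
        "os": platform.system(),              # operating system
        "arch": platform.machine(),           # CPU architecture
        "python": platform.python_version(),  # interpreter version
        "lockfile": hashlib.sha256(lockfile_text.encode()).hexdigest(),
        "tool_config": tool_config,           # e.g. flags passed to a tool
    }
    # sort_keys makes the serialisation independent of dict ordering.
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()


lock_a = "requests==2.27.1\n"
lock_b = "requests==2.28.0\n"
config = {"pytest_args": "-q"}

print(environment_fingerprint(lock_a, config) == environment_fingerprint(lock_a, config))
print(environment_fingerprint(lock_a, config) == environment_fingerprint(lock_b, config))
```

Two runs on the same machine with the same lockfile and configuration share a fingerprint; bumping a single pinned version changes it, which is the signal that an old result should not be reused.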

Dependency versions
Now, even without Pants, most of these things are easy to control in Python. Your OS rarely changes, it’s easy enough to fix to a specific Python version, and it’s easy enough to keep your tools configured in the same way. By far the most complicated of these in Python land is how to handle dependency versions, particularly if you’re the sort of developer who incrementally adds dependencies as you go.

The understated chaos of virtualenvs
If you spend a lot of time working in a given virtualenv, it can be very easy for your dependencies to drift away from what your codebase actually specifies, and what your collaborators might be working with. This is because standard Python tools really encourage you to build up your environment one package at a time, and then freeze your requirements when you’re ready to release.

Environments need to be predictable
This really runs contrary to the idea of environments being predictable. A much better approach is to make sure that every process gets run in a configuration that fully corresponds to the lockfile.

Pants creates a new environment for every task
The best way to provide a predictable environment is to create a completely new environment for every process that needs to be run.

Tools don’t need to run in compatible environments
One advantage of this is that you can manage the versions and dependencies of your tools separately. The version of black you run won’t need to be tied to the version of pytest you run just because they happen to share dependencies.

Processes: input files + environment → output artefacts + side effects
There’s another reason why we care about starting from clean environments for every process, which is that often processes produce outputs that we don’t care about, or modify the environment in ways that might impact subsequent runs of the same tool, or the behaviour of other tools. So in terms of modelling, a reproducible process for us is one…

Reproducible-Enough Processes: input files + environment → output artefacts + side effects
… where we specify the input files and the environment, and we only collect the output artefacts that we actually care about.

Reusing the Environment and why it’s bad
[Diagram: Initial Setup → Run process 1 → Run process 2 → Run process 3 → Final Result. Every step runs in the same environment, which accumulates the input files, the configs (C1, C2, C3) and the side effects (S1, S2, S3) of all earlier steps.]
To make it a bit clearer, if we retain the side-effects of any given process, modelling the behaviour of the process becomes dependent on the order in which they get run. You need to run a given process to collect its side effects, and you need to make sure the side-effects are tracked. Rather than having a predictable model of the processes, you have a model of one step in a chain of processes, which is not worth modelling at all.

[Diagram: the same chain, but Config 2 is changed to Config 2a. Process 2 now produces Side Effects 2a, so process 3 runs against C1, C2a, C3 and S1, S2a, S3.]
But what happens if we change some of the inputs in one step of the chain? You break that chain, and you need to rerun every subsequent step.

[Diagram: the same chain, followed by “Start again”. The environment still holds S1, S2, S3 from the previous pass, so running process 1 again produces different side effects (S1, S1a, …).]
Even worse, if you keep the side-effects around and want to achieve the same goal again, correctly, you have to run everything, because the starting environment has changed. This is tedious, so most tools re-use the environment and assume the side-effects don’t matter. This leads to subtle contamination which breaks things, and that’s why most tools have a `clean` command to deal with garbage that individual processes leave around.

If you don’t re-use the environment, you can run tasks however you want.
… on the other hand, if you don’t re-use the environment, then you actually break this dependency on ordering.

Re-ordering processes
This means you can re-order processes so that they’re more efficient. This could be things as simple as running formatters before you run linters, through to dividing up a test suite to run on multiple cores in separate processes…

Skipping rules entirely (and re-using results)
… through to skipping rules that have been run earlier, even by someone else.

Pants doesn’t need a clean goal! (There’s nothing to clean!)
And it can do all of this without an explicit `clean` goal, because by not re-using the environment, Pants cleans as it goes!

Hermetic Environments isolate side-effects between tasks in a workflow
So that’s what a hermetic environment is. It’s an environment that isolates the side-effects of a process from other processes in the same workflow. It means that if one process does something unfortunate that would invalidate the behaviour of another process, that effect is not captured. Indeed, it means only capturing the effects of a process that we actually intend to capture, and discarding everything else.

• Known environment
• Known dependencies
• Carefully specified inputs
• Carefully specified outputs
And the way we achieve a hermetic build is by preparing a knowable environment, containing dependencies that we can model, into which we place known input files. The only things we preserve from this environment are specific output files.

Sandboxes for build tools
So I just explained some theory about what constitutes a hermetic environment. The question now is how we go about actually preparing these environments. So let’s talk a bit about sandboxing techniques…

Isolation: Process, Filesystem, OS resources
The goal of a sandbox is to isolate processes from one another so that their execution does not interfere with each other. Isolation comes in many forms, but generally speaking, the things we tend to care about are:
• Making sure that processes are isolated from each other
• Making sure that processes don’t write over files produced by other processes
• Making sure that OS resources are allocated fairly

Docker: Good enough, most of the time.
… these days, 90% of processes that need isolation can be adequately handled by Docker.

Docker has overheads
And honestly, you can do hermetic environments with Docker. Another monorepo build system, Bazel, does in fact do this, but it’s one of several sandboxing options you can choose from, because Docker, even if it’s lighter than a virtual machine, still has performance overheads…

Multiple Dockers have Multiple Overheads
… Replicating a containerised environment multiple times means materialising a containerised OS multiple times. There are a lot of underlying files that just don’t change very often, so that’s a lot of repeated, and often unnecessary, work. Building up truly isolated environments, be it through Docker or through other means, is really, really slow.

Safety vs Performance
So all sandboxing approaches have to trade between the level of isolation and the speed at which your sandboxes can be built. So that raises the following question:

How much isolation do we need to predictably model our processes?
How much isolation do we actually need to predictably model our processes? And the answer is actually… not a lot. Again, our needs are not really built around ensuring security…

Build tools tend to be trustworthy*
* you audit your tools before you use them, right?
… most tools that Pants orchestrates are trustworthy. You’re already using them on your system to build your software. You’ve probably audited their functionality as much as your organisation needs. Pants doesn’t make these tools do anything they can’t do on their own. Beyond that, build tools tend to do a predictable amount of work, and they tend to do a good job of only reading the files you tell them to, and outputting files in places you tell them to…

Guidance, not enforcement
… so unlike in cases where processes might run away, like a server facing the internet, the processes that Pants runs don’t really need enforcement of isolation; they just need to stay in their lane.

Enforcement: Separate machines, Virtual machines, Containers, Jails (chroot)
Enforcement approaches to isolation include things like running processes on dedicated machines, or containers, or even just running in a filesystem chroot jail. All of these require some amount of operating system to be put in place before you can run a process, but they give you some level of actual resource isolation in return.

Guidance: venv
Guidance tools are a bit less common, but virtualenvs are one such tool. All they do is modify your `PATH` environment variable so that when you ask to run Python, the first one your shell finds is the version that you specified, with the pip dependencies that you’ve installed.

venv doesn’t stop you from running other Pythons
But the key observation is that venv doesn’t stop you from being able to run other versions of Python that are on your system. It just makes it easier to run one specific version of Python and harder to run the rest.

Isolation through obscurity! (But in a good way, I swear)
Pants does more or less the same thing. Rather than putting you in a predictable place inside a completely isolated environment, Pants runs processes inside the host OS, but the working directory is a temporary directory in an unpredictable part of your filesystem. We set environment variables from scratch. The tools that Pants runs are usually well-behaved. They’ll only access the files that you ask them to. If you configure a Python tool to load dependencies from a specific place, it’ll do that rather than looking to where the OS stores them. That’s how virtualenv works; we just do it a bit more aggressively.

Run trustworthy tools with predictable configurations
So because the tools that Pants runs tend to be trustworthy, and are configurable in a way that makes them not interfere with files they aren’t told to,

Inside the host OS
… we’re able to run those tasks inside the host OS. We run them in a temporary directory that is created especially for a given process, and we never run another process in the same directory.

Copied in: Input files, Dependencies
Copied out: Output files
So to create our environments, all we need to do is copy in the input files and the dependencies that aren’t in the host environment, and copy the build artefacts out when the process is done.
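The copy-in, run, copy-out cycle can be sketched with nothing but the standard library. This is an assumption-laden illustration rather than Pants’ implementation: `run_in_sandbox` is a hypothetical helper, inputs are passed as in-memory bytes for brevity, and the example assumes a POSIX-like host where Python starts with a minimal environment.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_sandbox(
    argv: list[str],
    input_files: dict[str, bytes],
    output_paths: list[str],
    env: dict[str, str],
) -> dict[str, bytes]:
    """Run `argv` in a fresh temporary directory and keep only named outputs."""
    with tempfile.TemporaryDirectory() as sandbox:
        root = Path(sandbox)
        # Copy in: only the declared inputs exist in the working directory.
        for relative_path, content in input_files.items():
            target = root / relative_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(content)
        # Run on the host OS, with environment variables set from scratch.
        subprocess.run(argv, cwd=root, env=env, check=True)
        # Copy out: anything not listed here is discarded with the directory.
        return {path: (root / path).read_bytes() for path in output_paths}


# A stand-in "tool" that writes one real output and one stray scratch file.
tool = (
    "import pathlib;"
    "text = pathlib.Path('in.txt').read_text();"
    "pathlib.Path('out.txt').write_text(text.upper());"
    "pathlib.Path('scratch.tmp').write_text('junk')"
)
outputs = run_in_sandbox(
    [sys.executable, "-c", tool],
    input_files={"in.txt": b"hello"},
    output_paths=["out.txt"],
    env={"PATH": os.defpath},
)
print(outputs)
```

The stray `scratch.tmp` never leaves the sandbox, which is the whole point: the side effect is thrown away along with the directory, so there is nothing to clean up afterwards.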

How Pants does all this
So there’s still one more thing that we need to discuss here, which is how Pants actually does all this stuff in practice. Caching the results of rules is only useful if it’s faster to compute a cache key than it is to run the rule itself. Copying files around can be really slow and wasteful, and if preparing environments makes it substantially slower to run the processes, then users just won’t tolerate it.

Running a process with result caching
[Flowchart: Process request (input files, environment) → “Have we run this?” → Yes: fetch result from cache / No: run process, then store result into cache → Process result (output files, stdout, stderr).]
So once you decide you might want to cache the results of rules, the execution steps look somewhat like this: we create some sort of process request, and send that request into the internals of Pants, and sometime later, Pants sends back a ProcessResult object. To a Python developer, this is an asyncio call, and we don’t care what goes on under the hood… Under the hood, Pants makes the decision to either run the process, or fetch it out of the cache.
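That flow can be sketched as below. Treat it as a simplified model under stated assumptions, not Pants’ API: the field names on `ProcessRequest` and `ProcessResult`, the `cache_key` function and the `CachingRunner` class are invented for illustration, and the sketch is synchronous where the real call is awaited.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProcessRequest:
    """Hypothetical request: what to run, on which files, in what environment."""

    argv: tuple[str, ...]
    input_digest: str                # stands in for a reference to the input files
    env: tuple[tuple[str, str], ...]


@dataclass(frozen=True)
class ProcessResult:
    stdout: bytes
    stderr: bytes
    output_digest: str               # stands in for a reference to the output files


def cache_key(request: ProcessRequest) -> str:
    """Derive a stable key from everything that can influence the result."""
    payload = json.dumps([request.argv, request.input_digest, request.env])
    return hashlib.sha256(payload.encode()).hexdigest()


class CachingRunner:
    def __init__(self, execute: Callable[[ProcessRequest], ProcessResult]) -> None:
        self._execute = execute
        self._cache: dict[str, ProcessResult] = {}
        self.executions = 0

    def run(self, request: ProcessRequest) -> ProcessResult:
        key = cache_key(request)
        if key not in self._cache:        # "Have we run this?" -> No
            self.executions += 1
            self._cache[key] = self._execute(request)
        return self._cache[key]           # Yes: fetch the result from the cache


def fake_execute(request: ProcessRequest) -> ProcessResult:
    # Placeholder for really running the tool in a sandbox.
    return ProcessResult(b"ok", b"", "out-for-" + request.input_digest)


runner = CachingRunner(fake_execute)
request = ProcessRequest(("pytest", "greeting_test.py"), "abc123", (("LANG", "C"),))
runner.run(request)
runner.run(request)
print(runner.executions)
```

The caller only ever sees a request go in and a result come out; whether the process really ran is an internal decision, which is what lets the cache later move from a local dictionary to something shared by a whole team.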

Have we run this?
The point where we can end up saving time comes from answering the question of whether we’ve already run a given task. Hermetic environments only answer one question: if we have the same files and configuration, will we get the same result? They don’t answer the question of whether a set of files is the same.

Inputs and outputs can be large
And that’s a problem because you can end up having hundreds or thousands of files, and files are annoyingly mutable. Reasoning about the files themselves is slow and unreliable…

Content-Addressable Storage
Pants solves this by using content-addressable storage (in our case, it’s LMDB). Having our own system to reason about the files that we’re modelling means that we get to decide what operations are cheap and what operations aren’t, and we get to do that without the constraints that filesystems normally put on us. It’s also useful because as people writing orchestration code, we very rarely care about individual files, but we do care about sets of files: we want to work with a set of source files, or the resolved dependencies of an application, or perhaps a combination of both.

@chrisjrn • @pantsbuild • pantsbuild.org Digests If you’re a rule
author, your window into the content-addressable storage is a thing called a Digest, which is a reference to a set of files…

@chrisjrn • @pantsbuild • pantsbuild.org Every process produces an output
Digest … so when we run a Process, what we end up with is a Digest that represents the files that we asked to copy out of the process execution environment, and nothing else.

@chrisjrn • @pantsbuild • pantsbuild.org Every process accepts an input
Digest Similarly, when we talk about copying input files into the execution environment, what we’re actually doing is materialising the content specified by the digest into the execution environment.

@chrisjrn • @pantsbuild • pantsbuild.org Digests are lightweight Digests are
lightweight, which makes them cheap to use in a cache key…

@chrisjrn • @pantsbuild • pantsbuild.org Digests are immutable And importantly
for us, unlike files on a filesystem, items in our content-addressable storage are immutable, just like the rest of the things that we attempt to reason about in Pants. So Digests are immutable, and they refer to immutable content…

@chrisjrn • @pantsbuild • pantsbuild.org Digest operations are very
cheap … and most importantly, the sorts of operations that we perform in our rule code in Pants are very cheap compared with doing the same sorts of operations on a filesystem. This includes things like renaming files, moving batches of files into different directories, and merging multiple digests. Under the hood, we only store relative paths, so that we can materialise files into a temporary directory as the rest of our execution model requires…
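
To see why these operations can be cheap, here is a sketch that treats a set of files as nothing more than a mapping from relative path to content key. This is an assumption made for illustration, not how Pants stores digests internally, and `add_prefix`, `merge` and `fingerprint` are hypothetical helpers rather than Pants APIs. Note that no file contents are read or copied anywhere.

```python
import hashlib
from pathlib import PurePosixPath

# A "snapshot" here is just: relative path -> key of the content in the store.
Snapshot = dict[str, str]


def add_prefix(snapshot: Snapshot, prefix: str) -> Snapshot:
    """Move every file under a directory by rewriting paths only."""
    return {str(PurePosixPath(prefix) / path): key for path, key in snapshot.items()}


def merge(*snapshots: Snapshot) -> Snapshot:
    """Combine several file sets; the same path with different content is an error."""
    merged: Snapshot = {}
    for snapshot in snapshots:
        for path, key in snapshot.items():
            if merged.get(path, key) != key:
                raise ValueError(f"conflicting content for {path}")
            merged[path] = key
    return merged


def fingerprint(snapshot: Snapshot) -> str:
    """One short hash that identifies the whole set of files."""
    hasher = hashlib.sha256()
    for path in sorted(snapshot):  # sorted, so insertion order does not matter
        hasher.update(f"{path}\0{snapshot[path]}\n".encode("utf-8"))
    return hasher.hexdigest()


sources = {"app.py": "key-of-app-py"}
dependencies = {"requests/__init__.py": "key-of-requests-init"}
sandbox_inputs = merge(add_prefix(sources, "src"), dependencies)
```

Moving a thousand files into a subdirectory becomes a thousand string operations rather than a thousand filesystem calls.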

@chrisjrn • @pantsbuild • pantsbuild.org Files only appear if you
explicitly ask for them … and speaking of materialising files, the key thing is that we don’t actually materialise the files into the host filesystem until we start running processes that need files to be in place on the filesystem. This approach means that we don’t waste our time doing piles of IO in order to maintain hermetic environments. We only do the bare minimum amount of copying files around to ensure that we end up with the files that we actually want.
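
Here is a sketch of what "materialise only when a process needs it" could look like. It assumes a file set is a mapping of relative path to content key and that the contents live in a plain dict; `materialise` and `run_in_sandbox` are hypothetical helpers, not Pants functions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path, PurePosixPath


def materialise(snapshot: dict[str, str], blobs: dict[str, bytes], root: Path) -> None:
    """Write each file in the snapshot under root, using its relative path."""
    for rel_path, key in snapshot.items():
        pure = PurePosixPath(rel_path)
        if pure.is_absolute() or ".." in pure.parts:
            # Only relative paths inside the sandbox are allowed.
            raise ValueError(f"not a safe relative path: {rel_path}")
        target = root.joinpath(*pure.parts)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(blobs[key])


def run_in_sandbox(argv, snapshot, blobs) -> subprocess.CompletedProcess:
    """Create a fresh temporary directory, write the inputs into it, run argv there."""
    with tempfile.TemporaryDirectory() as tmp:
        # This is the only point where file contents touch the host filesystem.
        materialise(snapshot, blobs, Path(tmp))
        return subprocess.run(argv, cwd=tmp, capture_output=True, check=False)


blobs = {"k1": b"hello from the sandbox\n"}
snapshot = {"src/greeting.txt": "k1"}
result = run_in_sandbox(
    (sys.executable, "-c", "print(open('src/greeting.txt').read().strip())"),
    snapshot,
    blobs,
)
```

Everything before the call to `run_in_sandbox` is bookkeeping on small in-memory values; the disk IO happens once, right before the process starts, and the directory is thrown away afterwards.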

@chrisjrn • @pantsbuild • pantsbuild.org Modelling a process So now
that we have digests that can reliably represent input or output files, we can model a process entirely in Python. In Pants, when we request to run a process, we supply the command we want to run, a series of environment variables, and the files in the environment, specified as a Digest. Under the hood, it’s a data class, which is efficient to cache in Python.
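
A sketch of what such a data class could look like follows. This is not Pants' own `Process` class; `ProcessRequest` and its fields are hypothetical, and the input digest is represented by a plain string for simplicity.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class ProcessRequest:
    """Everything needed to describe one process run, as immutable values."""
    argv: tuple[str, ...]
    input_digest: str  # reference to the input files, not the files themselves
    env: tuple[tuple[str, str], ...] = ()

    @classmethod
    def create(
        cls,
        argv: Sequence[str],
        input_digest: str,
        env: Optional[Mapping[str, str]] = None,
    ) -> "ProcessRequest":
        # Sort the environment so that two equal environments compare equal
        # regardless of the order in which they were written down.
        return cls(tuple(argv), input_digest, tuple(sorted((env or {}).items())))


request = ProcessRequest.create(
    ["pytest", "tests/"], input_digest="abc123", env={"LANG": "C", "TZ": "UTC"}
)
```

Because the class is frozen and every field is itself immutable, instances are hashable and compare by value, so a request can be used directly as a dictionary key.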

@chrisjrn • @pantsbuild • pantsbuild.org Have we run this? So
with that, we have a lightweight reference to a lot of immutable files (a thing that’s accurate enough to use in a cache key) and a result that is itself cheap to store into a cache. So now the question of “have we run this rule before?” becomes easy and lightweight to answer:

@chrisjrn • @pantsbuild • pantsbuild.org Same args? Same Digest? Same
env? If we’ve got a process with the same args, input digest, environment, et cetera, then we can be confident that we’ve already run it. What’s more, this question can be answered equally effectively by a local cache or a remote cache.
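
One way to turn "same args, same digest, same env" into a single lookup is to hash all three together. The sketch below assumes SHA-256 over a canonical JSON encoding; that choice and the `cache_key` helper are illustrative, not what Pants does internally.

```python
import hashlib
import json
from typing import Mapping, Sequence


def cache_key(argv: Sequence[str], input_digest: str, env: Mapping[str, str]) -> str:
    """Derive one stable string from everything that can affect the result."""
    payload = json.dumps(
        {
            "argv": list(argv),
            "input_digest": input_digest,
            # Sorted pairs, so the order of the env mapping is irrelevant.
            "env": sorted(env.items()),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = cache_key(["pytest", "tests/"], "abc123", {"LANG": "C", "TZ": "UTC"})
```

The key is a short plain string that depends only on the request, which is why a local cache and a remote cache can answer the same question: any machine that builds the same request computes the same key.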

@chrisjrn • @pantsbuild • pantsbuild.org Execution environment = Host OS
+ Predictable env + Materialised digest And so that’s it. For Pants, a process execution consists of the Host OS, an environment that we can predictably model, and the contents of a digest that we can reliably store inside a cache.

@chrisjrn • @pantsbuild • pantsbuild.org Now you know about hermetic
builds! So that’s the end of this talk! We’ve done a high-level look into hermetic environments and how they unlock some of Pants’ more interesting time-saving features.

@chrisjrn • @pantsbuild • pantsbuild.org Recap Hermetic Environments and Pants
• Developer workflows get faster if we make good choices about running each task • Hermetic Environments let us predictably model build tasks • Build processes do not need much process isolation • A content-addressable database makes tool orchestration and cache key computation faster We saw that we’re able to make existing Python tools run faster, or not at all, if we can make good choices about how tools get run and when. We saw that we need to be able to predictably model processes, which is tricky unless you have a predictable environment. We did some handwaving that build processes do not need a whole lot of process isolation, or at least, not a lot to predictably model their behaviour. And then we saw that using a content-addressable database makes it cheaper to do orchestration tasks, like merging the set of dependencies and a set of source files. It also makes it easier to reason about cache keys.

@chrisjrn • @pantsbuild • pantsbuild.org Pantsbuild and Pex At the
rest of PyCon • Sprinting Tomorrow! • Pantsbuild and PEX • Come say hi to me, Benjy, or John • Docs and demos at pantsbuild.org • Join us on Slack! • TALK NOTES: https://blog.pantsbuild.org/pycon-us-2022-talk/
