Applied Science Fiction: Operating a Research-Led Product (with notes)

Presented at SRECon 2022.

Noah Kantrowitz

March 14, 2022

Transcript

  1. APPLIED SCIENCE FICTION Operating a Research-Led Product Noah Kantrowitz -

    @kantrn - SRECon 2022 - March 14 @kantrn - SRECon 2022 1
  2. Hi there, I'm Noah Kantrowitz. I'm an engineer at Geomagical

    Labs. We do computer vision and augmented reality stuff for IKEA. NOAH KANTROWITZ » He/him » coderanger.net / @kantrn » Kubernetes and Python » SRE/Platform for Geomagical Labs, a division of IKEA » We do CV/AR for the home @kantrn - SRECon 2022 2
  3. But I'm not really here to talk about the products

    I work on, more about how we build them. I know "MLOps" is the popular term but one, I think it sounds silly. And two, we do a lot more than just machine learning as most places would think about it. Our production code runs the gamut from old-school computer vision algorithms to brand new research. Really I just call myself an engineer but buzzwords are fun. DEV SCI OPS » Machine Learning++ » Traditional CV algorithms, new research papers, everything » A lot of experimentation, but also running in real-time » Needs: autonomy, stability, performance » Wants: "play with cool new science", iterate fast @kantrn - SRECon 2022 3
  4. So why bother extending self-service all the way to our

    research teams? On the practical side, we don't have much of an option. There are too many scientists and too few platform engineers. Additionally, most of our testing can only be done after a deployment is in place. We'll talk about our CI in a moment but full corpus tests take hours so running them before merge would massively slow down development. Also I just think it's how all platforms should run but that's a story for another day. WHY RESEARCHER LED? » ~1 platform engineer : 4 researchers » SRE is me, I am SRE, it me » Post deployment testing » (I think ops should be a service provider anyway) @kantrn - SRECon 2022 4
  5. A quick aside about pipelines, or more generally DAG (directed

    acyclic graph) runners. Like most ML systems, we keep our complexity under control by dividing things into smaller work stages and connecting them up. Our internal jargon is "pipelines" and "modules" but they go by many names depending on your libraries and frameworks. It's all the same design in the end. PIPELINES @kantrn - SRECon 2022 5
  6. All of our researchers write code all day, we don't

    have anyone who is working purely on math papers or something. But their backgrounds are generally academic research or similar research-heavy teams elsewhere. Very few have ever had someone sit down and explain what makes a good or bad unit test, for example. So step one of this journey was setting up guardrails so they would feel confident in moving forward. BUILDING GUARDRAILS » (Most) researchers are not engineers » And don't want to be engineers » Build safe paths to move forward » Strong mutual respect for specialization » Progress with confidence and quality @kantrn - SRECon 2022 6
  7. When I joined the team, there was some basic CI

    in place, but it was only running nightly so feedback was slow and a lot of "we know that test is failing, ignore it". This is not fine. We quickly moved to a more standard setup running tests on each commit with pull request status blocking. Buildkite has been great for this as we can centralize all our CI configs, keeping them always unified across all modules. One notable problem has been that many of our tests require a GPU for CUDA, and running those nodes 24/7 is expensive. Using a Kubernetes Job-based CI runner lets us use our existing cluster autoscaling to handle CI as well. CI » Nightly -> per-commit » Buildkite for custom pre-processing » Testing with GPUs: Kubernetes Jobs @kantrn - SRECon 2022 7
  8. So what is actually running in CI? Mostly it's functional

    tests, we have canned work units to run the modules against, along with an internal API to emit metrics and indicate whether lower or higher values are better. The harness runs the corpus, checks whether the metrics are better or worse than the last main branch build, and updates stored metrics if needed. Simple, catches most issues, but slow to run. We also encourage unit tests and report unit test coverage, but we don't require it. WRITING TESTS? » Functional tests » Easy harness » Numeric metrics » Unit tests, as applicable and comfortable @kantrn - SRECon 2022 8
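
A minimal sketch of what a metric-reporting harness like this could look like, purely for illustration: the names (`report_metric`, `Direction`, `compare_to_baseline`) and the JSON baseline format are hypothetical, not the actual internal API.

```python
# Hypothetical sketch of a metric-reporting API for functional tests.
# Names and the baseline file format are illustrative, not the real harness.
import enum
import json
import pathlib


class Direction(enum.Enum):
    LOWER_IS_BETTER = "lower"
    HIGHER_IS_BETTER = "higher"


def report_metric(results: dict, name: str, value: float, direction: Direction) -> None:
    """Record one numeric metric emitted while processing a canned work unit."""
    results[name] = {"value": value, "direction": direction.value}


def compare_to_baseline(results: dict, baseline_path: pathlib.Path) -> list:
    """Return the metrics that regressed relative to the last main branch build."""
    baseline = json.loads(baseline_path.read_text()) if baseline_path.exists() else {}
    regressions = []
    for name, entry in results.items():
        if name not in baseline:
            continue  # new metric, nothing to compare against yet
        old, new = baseline[name]["value"], entry["value"]
        worse = new > old if entry["direction"] == "lower" else new < old
        if worse:
            regressions.append(name)
    return regressions


# Usage inside a functional test:
#   results = {}
#   report_metric(results, "reprojection_error", 0.042, Direction.LOWER_IS_BETTER)
#   assert not compare_to_baseline(results, pathlib.Path("baseline.json"))
```
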
  9. Next on the list was revamping our code review process.

    We set up review gates from a subset of the research teams, letting the most technical of the researchers help out the others by checking for common code quality issues and other basics. This is still research code, we're not expecting maximum design patterns, but a little bit of review goes a long way. CODE REVIEW » Deputize the most technical » Code quality, obvious performance issues, test coverage » Science review: I stay out of it @kantrn - SRECon 2022 9
  10. As we watched the review process evolve, we started seeing

    some similar friction as in our other engineering teams. Debates over formatting, frustration with repeated common issues. To me this sounded like a place to try static analysis tools. There was some worry it would be too complex or off-putting for the research teams but I think it's gone really well and has helped a lot with those review issues and general code quality. STATIC ANALYSIS » Pre-Commit: tool runner » Black: code formatter » Isort: more code formatting » Flake8: code quality » Overall well received @kantrn - SRECon 2022 10
  11. I said before that our individual DAG steps are called

    modules. I'm not going to get too deep into the technology but in specific terms these are Celery worker daemons listening on RabbitMQ queues. The important things here are that they are all independent projects that can be developed and managed on their own. We have a basic skeleton that's shared between all of them to deal with stuff like setting up the RabbitMQ connections and other boilerplate, but inside that it's up to each author. Some modules are 99% C++, others are all Python making REST calls, most are somewhere in the middle with tools like PyTorch. MODULES » Queue worker daemon » Horizontally scalable » Standardized but only barely @kantrn - SRECon 2022 11
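
Since the modules are described as Celery workers consuming RabbitMQ queues, here is a hedged sketch of what one module's skeleton could look like. The broker URL, queue name, and task body are placeholders, not the actual shared skeleton.

```python
# Sketch of a "module": a Celery worker daemon bound to its own RabbitMQ queue.
# Broker URL, queue name, and the task body are illustrative placeholders.
from celery import Celery

app = Celery("mymod", broker="amqp://guest:guest@rabbitmq:5672//")

# Each module owns a dedicated queue, which later lets multiple versions of the
# same module run side by side on differently named queues.
app.conf.task_default_queue = "mymod-latest"


@app.task(name="mymod.process")
def process(work_unit_id: str) -> dict:
    """Process one work unit and hand results to the next pipeline stage."""
    # ... fetch assets, run the CV/ML step, write outputs ...
    return {"work_unit_id": work_unit_id, "status": "ok"}

# Started with something like: celery -A mymod worker -Q mymod-latest
```
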
  12. The guiding principle for our pipeline and module deployment has

    been "let Noah take a vacation". We want to keep it simple and hands-off. The heart of it is a small and well- tested Kubernetes operator, this avoids the drift risk of things like Helm. To make it easier for the teams to control their own releases, the only required input is creating and pushing a git tag, after that the automation takes over. DEPLOYMENT » Ludicrously Automated » Custom operator » CI deploys on tag or main @kantrn - SRECon 2022 12
  13. Originally we followed a standard 3-component SemVer model but

    we found that limiting and difficult to manage over time so we streamlined things a bit. We have a continually-updating "latest" release of each module that deploys on any merge to main, and when a tag is pushed that will automatically create a release branch for that x.y version and then deploy a new instance of the module. VERSIONING » SemVer in our hearts » mymod-latest » Branch main » mymod-x.y » Branch release-x.y for hotfixes @kantrn - SRECon 2022 13
  14. So we have a bunch of versioned container images, what

    now? In a normal application we would have one running copy and a new release would roll out to replace the old one. Here we do it a little differently, every version runs independently, listening on a different task queue. This means we can leave old versions running indefinitely. This allows a researcher who is heads-down on one particular system and doesn't want to work against a moving target to keep that environment until they are ready to test against newer stuff. You can think of it as blue/green deployment but with a much bigger rainbow. PARALLEL INSTANCES » New release, new instance » Latest builds update in place » Old versions are left time-locked » Unless there's a critical issue » Pipeline definitions pin module versions @kantrn - SRECon 2022 14
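
One way to picture the per-version queues: a pipeline step sends its work unit to a queue whose name pins the module version. The naming scheme below is an assumption for illustration, not the real convention.

```python
# Sketch: dispatching a work unit to one specific, pinned module version.
# The "<module>-<version>" queue naming and task name are illustrative.
from celery import Celery

app = Celery("pipeline", broker="amqp://guest:guest@rabbitmq:5672//")


def dispatch(module: str, version: str, work_unit_id: str):
    """Send a work unit to the queue owned by a single module version."""
    queue = f"{module}-{version}"  # e.g. "mymod-1.4", or "mymod-latest"
    return app.send_task(f"{module}.process", args=[work_unit_id], queue=queue)


# A pipeline that pins mymod at 1.4 keeps getting 1.4 until its definition changes:
# dispatch("mymod", "1.4", "run-6362")
```
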
  15. But running everything forever would get very expensive. Solution: autoscaling

    queue workers based on queue depth. We use KEDA and Kubernetes HPAs as the core of this system, with our own custom load estimations on top, but KEDA's built-in queue depth scaling is a great place to start for simpler systems. This means that while every old version is "running", it doesn't use up any resources until a pipeline asks for it. AUTOSCALING » Running every version forever is $$$ » We already have a queue driven architecture » Autoscaling! » ??? » Profit @kantrn - SRECon 2022 15
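
The arithmetic behind queue-depth autoscaling is worth seeing in isolation: pick a target backlog per worker and clamp the result to a replica range. This is the concept KEDA's queue scalers implement; the numbers and function below are illustrative, not our actual load estimator.

```python
# Conceptual sketch of queue-depth-driven scaling, clamped to a replica range.
# The target and limits are made-up numbers, not production values.
import math


def desired_replicas(queue_depth: int, target_per_worker: int = 5,
                     min_replicas: int = 0, max_replicas: int = 20) -> int:
    """How many workers we'd want for the current backlog on one queue."""
    wanted = math.ceil(queue_depth / target_per_worker)
    return max(min_replicas, min(max_replicas, wanted))


# desired_replicas(0)   -> 0   (an idle old version costs nothing)
# desired_replicas(12)  -> 3
# desired_replicas(500) -> 20  (capped at max_replicas)
```
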
  16. If there has been one problem with our versioning it's

    the -latest instances. They are overall positive, but definitely leave room for accidental issues where work can overlap. The main benefit has been using them for doing larger corpus runs before cutting a release. I mentioned before that we do a few work units as functional tests in CI but before a release, we usually want to throw a lot more at it. In the past this was done locally on a developer workstation but that was already slow at the time and today would be excruciating. Overall good but if you're using a similar architecture, warn everyone of the risks. LATEST » Work collisions » Like an unpinned dependency » Needed for corpus runs » Overall positive @kantrn - SRECon 2022 16
  17. Once we have a bunch of running modules, we need

    the DAG to connect them up. We started out with plain JSON definitions, but these got unwieldy to edit. A GUI editor helped for a while but still required a lot of duplicated effort as pipelines started to have repeated subunits in them. We've now moved to a very thin Python DSL which compiles down to the same JSON so we have some programmatic flexibility without the costs of a full executable environment. PIPELINE DEFINITIONS » Declarative is nice » DSL is more realistic » Compilable DSL is best @kantrn - SRECon 2022 17
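
To make "a thin Python DSL that compiles down to JSON" concrete, here is a toy sketch. The class names and output schema are invented for the example; the real DSL is not shown in the talk.

```python
# Toy sketch of a Python DSL that compiles a pipeline DAG down to JSON.
# The class name, method names, and JSON schema are hypothetical.
import json


class Pipeline:
    def __init__(self, name: str):
        self.name = name
        self.steps = []

    def step(self, module: str, version: str, after=None) -> str:
        """Add a module at a pinned version, optionally after earlier steps."""
        self.steps.append({"module": module, "version": version, "after": after or []})
        return module

    def compile(self) -> str:
        """Emit the same JSON shape a hand-written definition would contain."""
        return json.dumps({"name": self.name, "steps": self.steps}, indent=2)


# Repeated sub-graphs become plain Python functions instead of copy-pasted JSON.
p = Pipeline("room-layout-experiment")
capture = p.step("capture-ingest", "2.1")
layout = p.step("layout-estimator", "1.4", after=[capture])
p.step("render-preview", "latest", after=[layout])
print(p.compile())
```
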
  18. We have two main types of pipeline definitions, system defaults

    and one-off experiments. System default pipelines are the ones for normal users: either currently active, a past version which is inactive but can still be used for debugging, or a proposed new end-user pipeline that is being tested. For the one-offs, these are staging-only pipelines used by researchers while developing or testing specific modules. Making it easy to create experimental pipelines with only a subset of the modules has helped improve iteration time a lot. DEFAULTS ONE-OFFS @kantrn - SRECon 2022 18
  19. This structure gives us all the tools we need to

    build a full release process. There are four main gates: PR merge, module release, pipeline release, and marking a pipeline as the default for users. We tend to let merges and module releases happen whenever it makes sense, and new pipelines are on a weekly cadence to match our full corpus runs every weekend. RELEASE FLOW » Develop locally » Make a PR, code review, etc » Merge to main » Test -latest in staging » Tag a release » Make a new pipeline » Test new pipeline in staging » Propose new pipeline for prod » Team approvals » Mark pipeline as default » Smoke test in prod » Repeat @kantrn - SRECon 2022 19
  20. All of this so far has been workflow and development

    tooling, but what about production? In keeping with our theme of building guardrails, we have some helpers there too. A utils library covers some common functionality, and that library can be held to a higher quality standard. We also have a config-file-driven system for picking which file assets to sync as inputs and outputs so each module author can manage that themselves. RUNTIME SUPPORT » Helper libraries » Format conversion » Logging and metrics » Common algorithms » Asset sync, in and out » Retries! @kantrn - SRECon 2022 20
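
As an illustration of the asset sync idea: a module declares which inputs to pull and which outputs to push, and shared helpers do the transfer. Everything here (the manifest keys, bucket, boto3 as the storage client) is an assumption for the sketch, not the actual config format.

```python
# Hypothetical sketch of config-driven asset sync around a work unit.
# Manifest layout, bucket name, and boto3 as the storage client are assumptions.
import pathlib

import boto3

ASSET_MANIFEST = {
    "inputs": ["capture/frames.tar", "capture/camera_poses.json"],
    "outputs": ["layout/room_mesh.glb", "layout/metrics.json"],
}

BUCKET = "pipeline-assets"  # hypothetical bucket
s3 = boto3.client("s3")


def sync_inputs(run_id: str, workdir: pathlib.Path) -> None:
    """Pull the declared inputs before the module's real work starts."""
    for key in ASSET_MANIFEST["inputs"]:
        dest = workdir / key
        dest.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file(BUCKET, f"{run_id}/{key}", str(dest))


def sync_outputs(run_id: str, workdir: pathlib.Path) -> None:
    """Push the declared outputs back once the module finishes."""
    for key in ASSET_MANIFEST["outputs"]:
        s3.upload_file(str(workdir / key), BUCKET, f"{run_id}/{key}")
```
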
  21. But if I had to pick one runtime support feature,

    it's automatic retries. It's a simple thing but a lot of this stuff isn't quite an exact science and slow completion is better than none. We do have to monitor things to make sure our error rate doesn't get too high, but structured and controlled retries are worth their weight in gold with experimental systems. NO REALLY, RETRIES! » CELERY_ACKS_LATE » Queue based retries » Pipeline based retries @kantrn - SRECon 2022 21
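
In Celery terms these retry layers map onto standard settings: late acknowledgement so a lost worker's message is redelivered, plus per-task automatic retries with backoff. A sketch with illustrative values (the limits here are not our production numbers):

```python
# Sketch of queue-level and task-level retry configuration in Celery.
# Retry counts and backoff values are illustrative, not production settings.
from celery import Celery

app = Celery("mymod", broker="amqp://guest:guest@rabbitmq:5672//")

# CELERY_ACKS_LATE in old-style settings: the message is only acknowledged after
# the task finishes, so a crashed worker's work unit is redelivered by RabbitMQ.
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True


@app.task(
    name="mymod.process",
    autoretry_for=(Exception,),   # retry on any failure from the experimental code
    retry_backoff=True,           # exponential backoff between attempts
    retry_backoff_max=600,        # cap the backoff at 10 minutes
    retry_jitter=True,
    max_retries=3,
)
def process(work_unit_id: str) -> dict:
    """One work unit; slow completion beats no completion."""
    ...
    return {"work_unit_id": work_unit_id, "status": "ok"}
```
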
  22. In my mind, the flip side of retries are timeouts.

    Retries improve availability at the cost of performance; timeouts are the reverse. Every DAG runner system should support some kind of timeout for each work unit overall, and when possible you can have them on smaller steps internally as well. Also, as a pro-tip for monitoring, anything with a timeout on it should also have a histogram metric, or whatever the equivalent is in your tooling. TIMEOUTS » Trust with retries but verify with timeouts » Nested timeouts » Work unit » Individual steps » If it has a timeout, it should also have a metric @kantrn - SRECon 2022 22
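
Sticking with Celery for the sketch, the work-unit timeout maps onto task time limits, with a soft limit giving the task a chance to record the failure (and bump that metric) before the hard kill. The specific numbers are illustrative:

```python
# Sketch of nested timeouts: a hard limit on the whole work unit plus a soft
# limit that lets the task clean up first. The specific numbers are made up.
from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery("mymod", broker="amqp://guest:guest@rabbitmq:5672//")
app.conf.task_time_limit = 900        # hard kill after 15 minutes
app.conf.task_soft_time_limit = 840   # raise SoftTimeLimitExceeded a bit earlier


@app.task(name="mymod.process")
def process(work_unit_id: str) -> dict:
    try:
        # ... individual steps can carry their own, smaller timeouts as well ...
        return {"work_unit_id": work_unit_id, "status": "ok"}
    except SoftTimeLimitExceeded:
        # Record the timeout (and the metric that should accompany it) before giving up.
        return {"work_unit_id": work_unit_id, "status": "timeout"}
```
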
  23. I'm not going to harp on it much because you've

    heard a million people tell you structured logging is good, but it really is worth doing. The key feature for us was using structlog to attach some global context to all log output during each work unit with all the various IDs so we can slice and dice the search in Loki. All that code can live in the shared utility libraries so the module authors don't have to worry about it. We opted for logfmt instead of JSON to be a bit more human readable while also working well with queries. ASIDE: STRUCTURED LOGGING » You already know this is good » But really, it is » logfmt is a nice mix of parsing but also humans » Context variables that attach to every log line » ts=2022-... level=INFO msg="Hello world" run_id=6362 @kantrn - SRECon 2022 23
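
A minimal structlog setup along these lines: key=value (logfmt-style) rendering plus context variables bound once per work unit. The exact processor chain is an assumption; the shared library's real configuration may differ.

```python
# Minimal structlog configuration: key=value ("logfmt"-style) output with
# per-work-unit context variables. The processor choices here are illustrative.
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,          # pull in bound context
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", key="ts"),
        structlog.processors.KeyValueRenderer(key_order=["ts", "level", "event"]),
    ]
)

log = structlog.get_logger()

# Bind the work unit's IDs once; every log line in this context carries them.
structlog.contextvars.bind_contextvars(run_id=6362)
log.info("Hello world")
# -> ts=2022-... level=info event='Hello world' run_id=6362
structlog.contextvars.clear_contextvars()
```
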
  24. Technical solutions are important but for any developer to feel

    happy and supported, we need social support systems too. We tried weekly office hours early on but got very low uptake, even with prompting there was a lot of "I don't want to bother you" and as our timezone footprint grew, scheduling was hard. Internal presentations helped a bit more, but were difficult to wrangle volunteers for and often incomprehensible to teams further from the subject matter. Testing weeks were really where we found success though. SOCIAL SUPPORT » Office hours » Brown-bag talks » Testing weeks @kantrn - SRECon 2022 24
  25. Test weeks are a recasting of a "tech debt week"

    or "personal projects week". For 5 days we put all feature development and non-critical bug fixes on the back burner and everyone works on quality-related projects. For our pipelines this has mostly been improving test coverage, upgrading dependencies, and cleaning up fixtures. This was also how we introduced the static analysis setup I talked about earlier, which worked great as everyone could learn it together and support each other. We aim to do these around once a quarter, scheduled as best we can to avoid any major external deadlines for all teams. TESTING WEEK » No new features or fixes » All quality all the time » Great place to introduce new tools » Gets everyone talking about tests/quality @kantrn - SRECon 2022 25
  26. Like most of you, I hope, we have a suite

    of tools for debugging issues in staging and production. Our main application is written using Django so we lean heavily on Admin for internal UIs. I'll probably regret advocating for custom admin views in public but it has worked well for us so far. We also make a lot of Grafana dashboards for other teams to use, as well as posting bot messages in Slack for things like error conditions or daily reports. Domo has been a good tool for less technical folks to build quick visualizations. It's not as flexible as Grafana but it's almost infinitely easier to use and we're glad to have both. PLATFORM TOOLS » Django admin » View customization‽ » Grafana and Loki » Domo » Slack notifications @kantrn - SRECon 2022 26
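
The "view customization" bullet refers to hanging extra pages off Django admin. A hedged sketch using the stock ModelAdmin.get_urls() hook; the Run model, its fields, and the template name are hypothetical.

```python
# Sketch of a custom Django admin view hung off a ModelAdmin via get_urls().
# The Run model, URL name, fields, and template are hypothetical examples.
from django.contrib import admin
from django.template.response import TemplateResponse
from django.urls import path

from myapp.models import Run  # hypothetical model representing pipeline runs


@admin.register(Run)
class RunAdmin(admin.ModelAdmin):
    list_display = ("id", "pipeline", "status", "created")

    def get_urls(self):
        custom = [
            path(
                "debug-dashboard/",
                self.admin_site.admin_view(self.debug_dashboard),
                name="run-debug-dashboard",
            ),
        ]
        return custom + super().get_urls()

    def debug_dashboard(self, request):
        """Internal-only page summarizing recent failed runs for debugging."""
        context = dict(
            self.admin_site.each_context(request),
            title="Recent failed runs",
            failed_runs=Run.objects.filter(status="failed")[:50],
        )
        return TemplateResponse(request, "admin/run_debug_dashboard.html", context)
```
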
  27. As with all learning projects, our training corpus is both

    huge and important. Many of these were created by the teams themselves during development so personal info isn't a worry but there will always be edge cases from real users. Legally this is all authorized in the terms of service but to be ethically good stewards for our users it's important to balance our research needs with expectations of privacy. As much as possible we try to get reproduction cases from our internal QA folks or paid user testers, when that isn't enough we've built all the management tools to automatically hide PII and anonymize data before it's added to a corpus. CORPUS MANAGEMENT » Production edge cases? » Personal info (PII) » Easy copy tools w/ anonymization » Paid testers whenever possible @kantrn - SRECon 2022 27
  28. On the planning and organization side, we do overall strategic

    planning in 4 month blocks. We aim for a mix of top-down for external priorities and bottom-up for internal ones. Where this intersects our topic here is mostly stressing that research teams should be included in all this planning along with everyone else. Getting early signals of friction is important to keeping ahead of it. PLANNING » Cross-team OKRs » Team-level plans » Two week discussion cycles » Talk to each other! @kantrn - SRECon 2022 28
  29. Not everything has been perfect, so for balance let's go

    through some pain points. OPEN ISSUES @kantrn - SRECon 2022 29
  30. We do our best to wrap developer tools in easy-to-use

    bash scripts or Makefiles, but weird failures of internal components like Python dependency errors have to go somewhere. These are usually only barely understandable if you have deep knowledge of the tools, and completely useless to most. This is a problem for all teams but here especially. BAD ERROR MESSAGES » Pip and Poetry » Docker » The usual suspects » "Could you look at this build failure?" @kantrn - SRECon 2022 30
  31. Academia plays a bit fast and loose with software and

    data licensing sometimes. The whole of machine learning is crowded with a lot of intellectual property and it can be hard to navigate. Internal training has helped a lot but we are only as strong as our weakest link here. LICENSING » What is safe to use » Experiments vs. production » I am not a lawyer, talk to your legal team » Internal training @kantrn - SRECon 2022 31
  32. I'm going to tempt fate and mention that we don't

    yet have an on-call rotation for our researchers. We do for the platform team but it has never actually been used as our only two outages so far happened during the day and were noticed in Slack before pager alerts went out. But as we look into the future this will probably be needed at some point and will probably not be very popular. ON CALL » Just hasn't come up yet » Do it when needed » Likely unpopular @kantrn - SRECon 2022 32
  33. Most of this is all the usual wisdom for any

    self-service platform, but tweaked a bit for the needs of a research organization. Build tools with empathy as a feature and everyone benefits. DAG runner tools keep complex systems manageable and accessible. Running old versions in parallel allows for looser release management as long as input and output contracts are kept to. Automatic retries at every level of the system keep things reliable in the face of randomness. Observability should be for everyone, not just SREs, so work with your teams to expose the data they need. And keep your plans aligned so everyone is moving together. » Engineering tools work for research too » As long as you build them well » DAGs are your friend » New versions as new deploys » Retry failures everywhere » Observability is for everyone » Plan together, succeed together @kantrn - SRECon 2022 34
  34. And really in the end, be kind and open and

    empathetic and you'll end up with a system that helps everyone be at their best. In the end that's what we're all here for. BE KIND @kantrn - SRECon 2022 35
  35. » Intros » DevSciOps » ML++ » Needs » Wants

    » Building Guardrails @kantrn - SRECon 2022 38