Slide 1

Slide 1 text

APPLIED SCIENCE FICTION
Operating a Research-Led Product
Noah Kantrowitz - @kantrn - SRECon 2022 - March 14
@kantrn - SRECon 2022

Slide 2

Slide 2 text

NOAH KANTROWITZ
» He/him
» coderanger.net / @kantrn
» Kubernetes and Python
» SRE/Platform for Geomagical Labs, a division of IKEA
» We do CV/AR for the home

Slide 3

Slide 3 text

DEV SCI OPS
» Machine Learning++
» Traditional CV algorithms, new research papers, everything
» A lot of experimentation, but also running in real-time
» Needs: autonomy, stability, performance
» Wants: "play with cool new science", iterate fast

Slide 4

Slide 4 text

WHY RESEARCHER LED?
» ~1 platform engineer : 4 researchers
» SRE is me, I am SRE, it me
» Post deployment testing
» (I think ops should be a service provider anyway)

Slide 5

Slide 5 text

PIPELINES

Slide 6

Slide 6 text

BUILDING GUARDRAILS
» (Most) researchers are not engineers
» And don't want to be engineers
» Build safe paths to move forward
» Strong mutual respect for specialization
» Progress with confidence and quality

Slide 7

Slide 7 text

CI
» Nightly -> per-commit
» Buildkite for custom pre-processing
» Testing with GPUs: Kubernetes Jobs
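
A minimal sketch of the "Testing with GPUs: Kubernetes Jobs" idea, assuming CI submits one Job per commit. The function name, image name, job naming scheme, and test command are hypothetical and not taken from the talk. Only the `batch/v1` Job fields and the `nvidia.com/gpu` resource name (as exposed by the NVIDIA device plugin) are standard Kubernetes.

```python
import json


def gpu_test_job(commit_sha: str, image: str, gpus: int = 1) -> dict:
    """Build a batch/v1 Job manifest that runs a test suite on a GPU node."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # Hypothetical naming scheme: one Job per commit.
        "metadata": {"name": f"ci-gpu-tests-{commit_sha[:8]}"},
        "spec": {
            # A failed test run should fail the build, not be retried silently.
            "backoffLimit": 0,
            # Let the cluster clean up finished Jobs after an hour.
            "ttlSecondsAfterFinished": 3600,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "tests",
                            "image": image,
                            # Hypothetical test command.
                            "command": ["pytest", "tests/functional"],
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


job = gpu_test_job("0123456789abcdef", "registry.example/mymod:ci")
# kubectl accepts JSON manifests as well as YAML.
print(json.dumps(job, indent=2))
```

A CI step could pipe the printed manifest to `kubectl create -f -` and then wait for the Job to complete.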

Slide 8

Slide 8 text

WRITING TESTS?
» Functional tests
» Easy harness
» Numeric metrics
» Unit tests, as applicable and comfortable
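
One way an "easy harness" with numeric metrics might look, sketched under assumptions: the metric (mean absolute error), the function names, and the threshold are all hypothetical. The point is that a researcher supplies a callable, sample inputs, and expected values, and the harness turns that into a pass/fail check.

```python
from statistics import mean
from typing import Callable, Sequence


def mean_abs_error(predicted: Sequence[float], expected: Sequence[float]) -> float:
    """Average absolute difference between predictions and expected values."""
    return mean(abs(p - e) for p, e in zip(predicted, expected))


def check_metric(
    run: Callable[[float], float],
    inputs: Sequence[float],
    expected: Sequence[float],
    max_error: float,
) -> float:
    """Run the algorithm on each input and fail if the error is too high."""
    predicted = [run(x) for x in inputs]
    error = mean_abs_error(predicted, expected)
    assert error <= max_error, f"error {error:.3f} exceeds limit {max_error:.3f}"
    return error


# Stand-in "algorithm" for illustration only.
error = check_metric(lambda x: x * 2.0, [1.0, 2.0], [2.0, 4.5], max_error=0.5)
```

Returning the metric as well as asserting on it lets CI record the number over time, not just the pass/fail result.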

Slide 9

Slide 9 text

CODE REVIEW
» Deputize the most technical
» Code quality, obvious performance issues, test coverage
» Science review: I stay out of it

Slide 10

Slide 10 text

STATIC ANALYSIS
» Pre-Commit: tool runner
» Black: code formatter
» Isort: more code formatting
» Flake8: code quality
» Overall well received

Slide 11

Slide 11 text

MODULES
» Queue worker daemon
» Horizontally scalable
» Standardized but only barely

Slide 12

Slide 12 text

DEPLOYMENT
» Ludicrously Automated
» Custom operator
» CI deploys on tag or main

Slide 13

Slide 13 text

VERSIONING
» SemVer in our hearts
» mymod-latest
  » Branch main
» mymod-x.y
  » Branch release-x.y for hotfixes
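
The naming scheme above can be sketched as a mapping from a Git ref to an instance name. This assumes tags look like `1.4.2` or `v1.4.2` and that patch releases share the `x.y` instance. The tag format and the function name are assumptions, not something the slides state.

```python
import re
from typing import Optional

# Assumed tag format: optional "v" prefix, then MAJOR.MINOR.PATCH.
_TAG = re.compile(r"refs/tags/v?(\d+)\.(\d+)\.(\d+)")


def instance_name(module: str, ref: str) -> Optional[str]:
    """Return the instance a build of `ref` deploys to, or None if it is not deployed."""
    if ref == "refs/heads/main":
        return f"{module}-latest"
    match = _TAG.fullmatch(ref)
    if match:
        major, minor, _patch = match.groups()
        # Hotfix tags cut from release-x.y land on the same x.y instance.
        return f"{module}-{major}.{minor}"
    return None
```

For example, `instance_name("mymod", "refs/tags/v1.4.2")` gives `mymod-1.4`, while a feature branch gives `None` and is never deployed.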

Slide 14

Slide 14 text

PARALLEL INSTANCES
» New release, new instance
» Latest builds update in place
» Old versions are left time-locked
» Unless there's a critical issue
» Pipeline definitions pin module versions

Slide 15

Slide 15 text

AUTOSCALING
» Running every version forever is $$$
» We already have a queue driven architecture
» Autoscaling!
» ???
» Profit
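
The slides do not say how replica counts are computed, so the following is only an illustrative sketch of queue-driven scaling. It assumes each worker can absorb a fixed number of queued work units and that idle versions scale to zero, which is what makes keeping old versions around affordable. The function name and default numbers are hypothetical.

```python
import math


def desired_replicas(
    queue_depth: int, units_per_worker: int = 4, max_replicas: int = 10
) -> int:
    """Pick a worker count for one module instance from its queue depth."""
    if queue_depth <= 0:
        # Nothing queued: an idle (often old) version costs nothing.
        return 0
    wanted = math.ceil(queue_depth / units_per_worker)
    return min(max_replicas, wanted)
```

With these defaults, a queue depth of 9 asks for 3 workers, and any backlog of 40 or more is capped at 10.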

Slide 16

Slide 16 text

LATEST
» Work collisions
» Like an unpinned dependency
» Needed for corpus runs
» Overall positive

Slide 17

Slide 17 text

PIPELINE DEFINITIONS
» Declarative is nice
» DSL is more realistic
» Compilable DSL is best
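
A rough idea of what "compilable" could mean here: a pipeline written as plain Python objects is validated and flattened into an ordered plan before anything runs. The `Step` class, `compile_pipeline`, and the step and module names are hypothetical, and this is not the DSL used in the talk. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+) to order the DAG.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    module: str  # pinned module instance, e.g. "mymod-1.4"
    after: Tuple[str, ...] = ()  # names of steps that must finish first


def compile_pipeline(steps: List[Step]) -> List[Tuple[str, str]]:
    """Validate the steps and return (step, module) pairs in run order."""
    by_name = {step.name: step for step in steps}
    for step in steps:
        for dep in step.after:
            if dep not in by_name:
                raise ValueError(f"{step.name} depends on unknown step {dep}")
    graph = {step.name: set(step.after) for step in steps}
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    order = TopologicalSorter(graph).static_order()
    return [(name, by_name[name].module) for name in order]


plan = compile_pipeline(
    [
        Step("render", "render-2.0", after=("depth",)),
        Step("load", "loader-1.3"),
        Step("depth", "depth-1.4", after=("load",)),
    ]
)
```

Compiling up front means a typo in a dependency name or an accidental cycle fails at definition time instead of partway through a run.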

Slide 18

Slide 18 text

DEFAULTS ONE-OFFS

Slide 19

Slide 19 text

RELEASE FLOW
» Develop locally
» Make a PR, code review, etc
» Merge to main
» Test -latest in staging
» Tag a release
» Make a new pipeline
» Test new pipeline in staging
» Propose new pipeline for prod
» Team approvals
» Mark pipeline as default
» Smoke test in prod
» Repeat

Slide 20

Slide 20 text

RUNTIME SUPPORT
» Helper libraries
» Format conversion
» Logging and metrics
» Common algorithms
» Asset sync, in and out
» Retries!

Slide 21

Slide 21 text

NO REALLY, RETRIES!
» CELERY_ACKS_LATE
» Queue based retries
» Pipeline based retries
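
`CELERY_ACKS_LATE` is the older uppercase name of Celery's `task_acks_late` setting, which acknowledges a message after the task has run, not before. The sketch below is a standard-library simulation of the general "acknowledge only after success" idea behind queue-based retries. It does not use Celery and is not Celery's exact behavior. The `drain` function and the attempt limit are assumptions made for illustration.

```python
from collections import deque


def drain(queue, handler, max_attempts=3):
    """Process (payload, attempts) entries; failed work goes back on the queue."""
    done, dead = [], []
    while queue:
        payload, attempts = queue[0]  # peek only: not acknowledged yet
        try:
            result = handler(payload)
        except Exception:
            queue.popleft()
            if attempts + 1 < max_attempts:
                queue.append((payload, attempts + 1))  # redeliver later
            else:
                dead.append(payload)  # give up after max_attempts
        else:
            queue.popleft()  # late ack: remove only after success
            done.append(result)
    return done, dead


calls = {}


def flaky(payload):
    """Stand-in work function that fails the first time it sees "b"."""
    calls[payload] = calls.get(payload, 0) + 1
    if payload == "b" and calls[payload] == 1:
        raise RuntimeError("transient failure")
    return payload.upper()


done, dead = drain(deque([("a", 0), ("b", 0)]), flaky)
```

Because work can run more than once under this model, handlers need to be safe to repeat.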

Slide 22

Slide 22 text

TIMEOUTS
» Trust with retries but verify with timeouts
» Nested timeouts
  » Work unit
  » Individual steps
» If it has a timeout, it should also have a metric

Slide 23

Slide 23 text

ASIDE: STRUCTURED LOGGING
» You already know this is good
» But really, it is
» logfmt is a nice mix of parsing but also humans
» Context variables that attach to every log line
» ts=2022-... level=INFO msg="Hello world" run_id=6362

Slide 24

Slide 24 text

SOCIAL SUPPORT
» Office hours
» Brown-bag talks
» Testing weeks

Slide 25

Slide 25 text

TESTING WEEK
» No new features or fixes
» All quality all the time
» Great place to introduce new tools
» Gets everyone talking about tests/quality

Slide 26

Slide 26 text

PLATFORM TOOLS
» Django admin
» View customization‽
» Grafana and Loki
» Domo
» Slack notifications

Slide 27

Slide 27 text

CORPUS MANAGEMENT
» Production edge cases?
» Personal info (PII)
» Easy copy tools w/ anonymization
» Paid testers whenever possible
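
To make "copy tools w/ anonymization" concrete, here is a hedged sketch of the record-level step such a tool might apply while copying a production edge case into a test corpus. The field names, the `anonymize` function, and the salted-hash approach are all assumptions. A salted hash is pseudonymization, not full anonymization, so whether it is sufficient is a question for a privacy or legal review.

```python
import hashlib

# Hypothetical names of fields that must never leave production.
PII_FIELDS = {"email", "name", "address"}


def anonymize(record: dict, salt: str) -> dict:
    """Return a copy of `record` without PII fields and with a hashed user ID."""
    clean = {key: val for key, val in record.items() if key not in PII_FIELDS}
    if "user_id" in clean:
        raw = (salt + str(clean["user_id"])).encode("utf-8")
        # Stable pseudonym so related records still group together.
        clean["user_id"] = hashlib.sha256(raw).hexdigest()[:16]
    return clean


record = {"scan_id": 17, "user_id": 42, "email": "a@example.com", "room": "kitchen"}
clean = anonymize(record, salt="staging-copy")
```

The original record is left untouched, so the same function can be used in a dry-run mode that only reports what would be removed.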

Slide 28

Slide 28 text

PLANNING
» Cross-team OKRs
» Team-level plans
» Two week discussion cycles
» Talk to each other!

Slide 29

Slide 29 text

OPEN ISSUES

Slide 30

Slide 30 text

BAD ERROR MESSAGES
» Pip and Poetry
» Docker
» The usual suspects
» "Could you look at this build failure?"

Slide 31

Slide 31 text

LICENSING
» What is safe to use
» Experiments vs. production
» I am not a lawyer, talk to your legal team
» Internal training

Slide 32

Slide 32 text

ON CALL
» Just hasn't come up yet
» Do it when needed
» Likely unpopular

Slide 33

Slide 33 text

WRAP UP

Slide 34

Slide 34 text

» Engineering tools work for research too
» As long as you build them well
» DAGs are your friend
» New versions as new deploys
» Retry failures everywhere
» Observability is for everyone
» Plan together, succeed together

Slide 35

Slide 35 text

BE KIND

Slide 36

Slide 36 text

THANK YOU

Slide 37

Slide 37 text

QUESTIONS?

Slide 38

Slide 38 text

» Intros
» DevSciOps
» ML++
» Needs
» Wants
» Building Guardrails