2025 Update update at OxCon 2025

A review of ongoing work on software update in the Oxide system, followed by discussion of challenges working on large, long-term projects.

David

September 17, 2025

Transcript

  1. 2025 Update update
     • About software update
     • Path to self-service update
     • Current status and next steps
     • Experience report for a long-running project
  2. Software update
     • Upgrading all the updateable software in the rack. (Obviously?)
     • There are hundreds of updateable components within a single rack! (see the sketch below)
       ◦ Each sled/switch/PSC: SP software, RoT bootloader, RoT
       ◦ Each sled: host OS (phase 1, phase 2)
       ◦ Control plane components (one per disk + another 1-2 dozen)
       ◦ (Not yet included: SSD firmware)
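To make the scale concrete, here is a minimal Rust sketch of one way the updateable components in a rack might be modeled. The types and names are hypothetical, invented for illustration; they are not Omicron's actual data model.

```rust
// Hypothetical sketch (not Omicron's actual types): one way to model the
// kinds of updateable components described in the slide above.

/// A single piece of software in the rack that can be updated.
#[derive(Debug, Clone)]
enum UpdateableComponent {
    /// Service Processor firmware on a sled, switch, or PSC.
    SpFirmware { board: BoardKind },
    /// Root of Trust bootloader.
    RotBootloader { board: BoardKind },
    /// Root of Trust firmware.
    Rot { board: BoardKind },
    /// Host OS phase 1 (boot) or phase 2 image on a compute sled.
    HostOs { sled: u8, phase: HostOsPhase },
    /// A control plane zone (one per disk, plus another 1-2 dozen services).
    ControlPlaneZone { sled: u8, zone_name: String },
}

#[derive(Debug, Clone, Copy)]
enum BoardKind {
    Sled(u8),
    Switch(u8),
    Psc(u8),
}

#[derive(Debug, Clone, Copy)]
enum HostOsPhase {
    Phase1,
    Phase2,
}

fn main() {
    // With 32 sleds, 2 switches, and the power shelf controllers, plus
    // per-disk and shared control plane zones, the component count quickly
    // reaches the hundreds mentioned above.
    let example = UpdateableComponent::HostOs { sled: 7, phase: HostOsPhase::Phase2 };
    println!("{example:?}");
}
```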
  3. Upgrade today
     • Release process produces a single ZIP file with all the software in the system (~2-3 GiB)
     • When release is ready, Support reaches out to customer to schedule upgrade
     • Upgrade process (1-2 hours):
       ◦ Oxide support engineer connects via technician port (often via jump host; authenticated via hardware token)
       ◦ Uses a combination of TUI and Unix shell to:
         ▪ “Park the rack” – shut (almost) everything down
         ▪ Replace all software
         ▪ Apply database schema updates
         ▪ Bring everything back up
  4. Problems
     • It’s not self-service
       ◦ Oxide support requires access (usually via jump host)
       ◦ Non-starter for many (most?) customers
     • Instances offline for duration of upgrade (1-2 hours)
       ◦ Impacts users
       ◦ Constrains scheduling
       ◦ Constrains workloads that can be deployed on Oxide
     • It doesn’t scale for us
  5. Aside: importance of software update
     • A top company priority for 2+ years
     • Upgrade is how Oxide continues to deliver value to customers after their initial purchase (bugfixes, features)
       ◦ Customers buy racks with the expectation that the software will improve over time.
     • Nearly every customer needs upgrade to be better operationalized in order to go into production (i.e., buy more racks)
     • Upgrade experience has to be smooth for any of this to work!
  6. Self-service upgrade: operator view
     • Operator downloads ZIP file (TUF repo) from Oxide to their machine (laptop, workstation, etc.), validates checksum
     • Using web console or CLI (or API), operator:
       ◦ Uploads ZIP file to Oxide system
       ◦ Configures the target system release
       ◦ Waits for the upgrade to finish
       ◦ If anything goes wrong: system pauses updates with clear explanation
     • Operator focuses on high-level policy, not implementation details
     • Goal: scales up with number of racks
  7. Self-service update
     • Implies: API-driven
     • Implies: fully automated
     • Implies: control plane is driving the update
       ◦ (Implies: control plane is online during the update)
         ▪ Replacing all the parts of the ___ while it’s ___ (plane while it’s in the air, car while it’s driving down the freeway, pick your metaphor)
  8. OxCon 2023
     • First customer install: July 2023 (upgrades work basically the way they do today)
     • RFD 418 laid out the vision
       ◦ Autonomous, self-service, minimal impact, risk mitigation
       ◦ Exercise it in development
       ◦ Foundations: policy vs. planning vs. execution, operability (supportability)
  9. 2023 - 2024: foundations for dynamic reconfiguration
     • Developed foundations: inventory, blueprints (see the sketch below)
     • Deliverables:
       ◦ Sled expungement (N different projects)
       ◦ Sled addition (N different projects)
       ◦ Disk expungement
       ◦ Designs: TUF repo distribution, API versioning
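As a rough illustration of the inventory/blueprint split mentioned above: inventory describes what the system has been observed to look like, a blueprint describes what it should look like, and planning is the step that produces a new blueprint. The types below are hypothetical stand-ins, not Omicron's real Reconfigurator model.

```rust
// Rough sketch of the inventory/blueprint split (hypothetical types; the real
// Reconfigurator data model in Omicron is much richer).

/// Inventory: what the control plane has most recently observed about a sled.
#[derive(Clone, PartialEq, Debug)]
struct SledObserved {
    serial: String,
    host_os_version: String,
}

/// Blueprint: a complete description of what the system should look like.
#[derive(Clone, Debug)]
struct Blueprint {
    generation: u64,
    sleds: Vec<SledObserved>,
}

/// Planning: a pure step from (observed state, parent blueprint, target) to a
/// new blueprint. Execution happens separately, by comparing reality against
/// the current target blueprint and acting on the differences.
fn plan(inventory: &[SledObserved], parent: &Blueprint, target_os: &str) -> Blueprint {
    let sleds = inventory
        .iter()
        .cloned()
        .map(|mut sled| {
            sled.host_os_version = target_os.to_string();
            sled
        })
        .collect();
    Blueprint { generation: parent.generation + 1, sleds }
}

fn main() {
    let inventory = vec![SledObserved {
        serial: "BRM0001".to_string(),
        host_os_version: "13".to_string(),
    }];
    let parent = Blueprint { generation: 1, sleds: inventory.clone() };
    let next = plan(&inventory, &parent, "14");
    println!("blueprint {} wants {:?}", next.generation, next.sleds);
}
```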
  10. Aside: dynamic reconfiguration
     • Components (and their dependencies) can come and go while the system is online
     • Examples of work unrelated to the mechanics of adding/removing/upgrading a component:
       ◦ Oximeter needs to stop collecting from producers that are gone
       ◦ Sagas that were running when Nexus was expunged need to be dealt with
       ◦ Internal NTP zones need to use the latest set of boundary NTP zones
       ◦ CockroachDB needs to start replicating data to new cluster nodes
       ◦ Anything with network dependencies needs to notice when those dependencies come or go
       ◦ Need to keep track of what IPs and other networking resources are in use
       ◦ (there’s a very long tail of these things)
  11. OxCon 2024
     • What we said was next:
       ◦ Dogfood of Theseus (ran into LRTQ limitation)
       ◦ Multi-node Clickhouse (did much of this, deprioritized)
       ◦ First steps towards upgrade
     • Team meeting: brainstorm first upgrade demos / milestones
  12. 2024 - 2025: planned work done
     • API versioning: design, dropshot support, developer workflow, update sequence, etc.
     • TUF repo management: upload, distribution, trust
     • Upgrade mechanisms: control plane zones, SP, RoT bootloader, RoT, host OS phases 1 and 2
     • Planning: orchestration of updates, planner reports (why is it stuck?), safety checks (don’t update a sled with CockroachDB on it if the CockroachDB cluster is unhealthy; see the sketch below)
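The safety-check idea in the last bullet can be sketched as a simple gate the planner consults before scheduling work. This is an illustrative sketch with made-up types, not the actual Omicron planner code.

```rust
// Illustrative sketch of the kind of safety check described above. The names
// are made up; the real planner checks in Omicron are more involved.

#[derive(Clone, Copy, PartialEq)]
enum CockroachClusterHealth {
    Healthy,
    Degraded,
}

struct Sled {
    serial: &'static str,
    runs_cockroachdb: bool,
}

/// Decide whether the planner may schedule an update for this sled right now.
fn safe_to_update(sled: &Sled, crdb_health: CockroachClusterHealth) -> Result<(), String> {
    if sled.runs_cockroachdb && crdb_health != CockroachClusterHealth::Healthy {
        // Refuse rather than risk taking another CockroachDB node offline
        // while the cluster is already short on healthy replicas.
        return Err(format!(
            "not updating sled {}: it hosts CockroachDB and the cluster is unhealthy",
            sled.serial
        ));
    }
    Ok(())
}

fn main() {
    let sled = Sled { serial: "BRM42220011", runs_cockroachdb: true };
    match safe_to_update(&sled, CockroachClusterHealth::Degraded) {
        Ok(()) => println!("proceed with update"),
        // This is also the kind of message a planner report can surface to
        // explain why an upgrade appears stuck.
        Err(reason) => println!("paused: {reason}"),
    }
}
```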
  13. 2024 - 2025: unplanned work done
     • Foundation work:
       ◦ Planner sequencing (#6999, came out of colo issue)
       ◦ Rendezvous tables
       ◦ Blippy
     • MUPdate + update coexistence (RFD 556)
     • CockroachDB decommissioning woes
     • Sled Agent reconciler, dataset bugs (#7313)
     • Undrainable sagas: RFD 555 and associated tools
  14. Delivering self-service upgrade
     • Deliverable: self-service upgrade in R17 (targeting end of September) (side note: how do you deliver an upgrade feature in one release?)
     • Blockers remaining:
       ◦ Nexus handoff
       ◦ Console support
       ◦ Lots of loose ends (better status, more safeties, etc.)
  15. Important non-blockers
     • Developer guard rails to catch upgrade breakage (update all internal APIs to use versioning; lock down Clickhouse schema)
     • #4745: lack of rack health checks
     • Switch update risks (dendrite#49)
     • Archival / cleanup
       ◦ Old blueprints and logs
       ◦ TUF repo deletion
       ◦ Decommissioning Cockroach nodes
       ◦ GC of expunged zones and datasets within blueprint
     • External API versioning
     • #8316: transient zone root datasets
     • #8769: NTP zone needs to be updated before host OS
  16. Major risk: lack of automated end-to-end testing
     • Lots of unit tests and integration tests with simulated Omicron
     • No end-to-end automated tests for upgrade (a lot of work, no hardware available, no a4x2 in CI)
     • Goals: regular tests for:
       ◦ Upgrade from previous commit on “main” to latest (exercises upgrade from each commit)
       ◦ Upgrade from previous release to latest (exercises upgrade to each commit)
       ◦ MUPdate + update together
  17. Next deliverable
     • Non-disruptive update
       ◦ VMs live-migrated before sleds are rebooted
       ◦ Never update so many Crucible zones at once that a customer volume could be impacted (see the sketch below)
       ◦ System makes best effort to ensure there will be enough capacity to do this (but hard choices may be required)
     • Original target: 2025 Q4 (will very likely change)
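The Crucible constraint above amounts to a batching rule: no single customer volume should have more than one of its zones taken down in the same update batch. Below is a minimal sketch under that assumption; the types are hypothetical, and real placement data would come from the control plane.

```rust
// A minimal sketch (hypothetical types) of the constraint above: never take
// down so many Crucible zones at once that any single volume loses more than
// one of its replicas.

use std::collections::{HashMap, HashSet};

/// Which Crucible zones hold a replica of each customer volume.
type VolumePlacement = HashMap<&'static str, Vec<&'static str>>;

/// From a set of candidate zones, pick a batch that is safe to update together:
/// at most one replica per volume may be in the batch.
fn safe_update_batch(candidates: &[&'static str], placement: &VolumePlacement) -> Vec<&'static str> {
    let mut batch = Vec::new();
    let mut touched_volumes: HashSet<&str> = HashSet::new();
    for zone in candidates {
        // Volumes that would lose a replica if we updated this zone now.
        let volumes: Vec<&str> = placement
            .iter()
            .filter(|(_, zones)| zones.contains(zone))
            .map(|(volume, _)| *volume)
            .collect();
        // Only add the zone if none of its volumes are already affected by the batch.
        if volumes.iter().all(|v| !touched_volumes.contains(v)) {
            touched_volumes.extend(volumes);
            batch.push(*zone);
        }
    }
    batch
}

fn main() {
    let mut placement = VolumePlacement::new();
    placement.insert("vol-a", vec!["zone-1", "zone-2", "zone-3"]);
    placement.insert("vol-b", vec!["zone-3", "zone-4", "zone-5"]);

    // zone-1, zone-2, and zone-3 all hold a replica of vol-a, so only the
    // first of them lands in this batch.
    let batch = safe_update_batch(&["zone-1", "zone-2", "zone-3", "zone-4"], &placement);
    println!("update together: {batch:?}");
}
```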
  18. Long-running projects
     • Delivering a long-term project like upgrade is hard
     • We have other long-term projects like this
     • What follows is an experience report
       ◦ I think it’s worked pretty well, but take it as just a data point (I’d love feedback, positive or negative)
       ◦ I’d be interested to hear how other teams are doing this
  19. My big fear (well, one of them)
     • My big fear (since 2023): upgrade perpetually feels “a year away”
     • Focus: how do we keep our eyes on the prize?
     • Prioritization: how do we decide what to punt?
     • Bonus topic: one weird trick for sequencing / decoupling work
  20. Focus challenges
     • One enemy: “organizational procrastination”
       ◦ Next steps toward upgrade are often not clear
       ◦ There are lots of other important problems where the next steps are clear, e.g. #3732 configurable external DNS IPs
       ◦ Other problems are both real and present!
         ▪ e.g., cannot factory-reset / restart RSS
     • How do we avoid letting the dozens of small, well-understood problems starve work on the big, hard problems (like upgrade)?
     • This is a balance. Being responsive to changing priorities is important, too.
  21. Losing focus
     • It’s easy to lose focus when:
       ◦ We get stuck on technical issues
       ◦ We don’t know what the next step is
       ◦ We run across other important problems (or they are thrust upon us)
  22. Maintaining focus: daily watercooler
     • 30-minute, optional Meet – every day
     • Low-pressure way to surface issues that are keeping people stuck (upgrade-related stuff, yes, but also rust-analyzer, a4x2, test flakes, etc.)
     • Goal is to feel more like “working alongside my colleagues”, not “another meeting”
       ◦ It’s not where we make big decisions
       ◦ Extended silence is fine
       ◦ It’s truly optional
  23. Maintaining focus: making a path
     • We (I) spend a lot of time both:
       ◦ Laying out the big picture … (e.g., RFD 418, RFD 504, RFD 565)
       ◦ … but also laying out very concrete next steps for each work stream (stuff folks can pick up for at least the next 1-2 weeks)
     • Goal is to avoid getting stuck silently on “what’s next”
     • This takes work!
  24. Maintaining focus: demos
     • Demos are a great tool for focusing
       ◦ Before: planning for them forces prioritization
       ◦ The demo itself usually represents a big point of de-risking
       ◦ After: inspires follow-on work, sustains momentum
     • Also a great tool for communicating:
       ◦ With each other about how subsystems are shaping up
       ◦ With the rest of the company about progress
     • Personally, I find them very motivating
     • For update: this has been a struggle (because of the problem space), but very much worth it and something we keep striving for
  25. Maintaining focus: prioritization
     • Me constantly annoying people: “is fixing this more urgent/important than taking the next step on upgrade?” (hard to compare: “solve one problem” vs. “take one step”)
  26. Prioritization: punt on third down
     • Mantra: the “Upgrade MVP” need only be as complete/correct as what we’re doing today.
       ◦ Perspective: customers have asked why they can’t just do what we’re doing today themselves!
     • Useful exercise, in the small and in the large:
       ◦ What could we ship in 2 weeks? (N weeks)
       ◦ What constraints would we put on it if we had to ship something in 2 weeks? (N weeks)
       ◦ Basically how we got to “shipping in 2025 Q3”
     • Problem: not a lot of obvious room for cutting scope in upgrade. Right?
  27. Reducing scope of upgrade
     • In retrospect: the big area for cutting scope is the set of operational conditions the system can handle on its own
     • This is not cutting back on rigor.
       ◦ Every system has operational limits. (The system cannot handle 31 sleds failing.)
       ◦ The system cannot upgrade itself at all today.
       ◦ Useful deliverable: a system that can upgrade itself on a sunny day
         ▪ De-risks a lot
         ▪ Shows customers their investment is paying off
         ▪ Can still be useful.
     • There’s a balance: edge cases can inform architecture!
  28. Prioritizing “edge cases”
     • Is this likely to come up in practice?
     • How bad is the impact?
       ◦ Permanent?
       ◦ Support call?
       ◦ Operator annoyance?
     • Avoid all-or-nothing thinking
       ◦ “Fixing this is hard, but making it safe (but annoying) is easy” (e.g., resetting the SP instead of just the host OS)
     • Would we rather ship upgrade later? (Would the customer rather wait for an upgrade system that can handle that edge case? Or get their hands on one that works on sunny days and see it improve over time?)
  29. Examples of what we’ve punted on
     • Proceeding with upgrade when a sled is down
       ◦ Determination: annoying, unlikely in the short term, handleable by support
     • Resetting the SP when all we need is to bounce the host OS
       ◦ Determination: higher impact than ideal, but that’s okay
     • Handling MGS state changing behind our back (e.g., sled moves cubbies during execution)
       ◦ Determination: unlikely, not catastrophic
     • TUF repo authentication
       ◦ Impact: slightly annoying customer flow, but no compromise on security
     • Clickhouse schema update
       ◦ Impact: we’ll have more work to do when we next need to update this schema
  30. More examples
     Edge cases we’ve spent a lot of time discussing (and largely punted on):
     • What if the two M.2 devices contain different configuration? What if they disagree about whether a MUPdate has happened?
     • What if a sled fails during the Nexus handoff after the point of no return but before new Nexus instances take over?
     • What if a sled fails after its host OS is updated but before the rest of the upgrade is completed?
     • What if we have a rack power-off after expunging a CockroachDB node and before we’ve managed to decommission it?
  31. More punts
     • Rollback (big can of worms: RFD 534)
     • Cleaning up technical debt :(
       ◦ Dataset ids
       ◦ Blueprint builder cleanup
       ◦ More faithful emulation in the test suite
     • Deploying multi-node Clickhouse (this punt was pretty late)
  32. Prioritizing process
     • Company-wide, we prioritize big features in the product roundtable.
     • How do we make more fine-grained calls about risks and blockers?
       ◦ Generally, I’ve just been making these calls (trying to get consensus) and communicating them out/up
     • Do we track product-wide “important non-blockers” (ongoing risks)?
       ◦ This could help with the ongoing background fretting about these problems (“well, we still don’t handle …”)
     • Do we review these and (re)prioritize them?
  33. Decoupling / sequencing work
     • Write down a data structure representing the interface (see the sketch below)
     • Expect it to change, but both sides can start working from it
       ◦ E.g., planning vs. execution during SP update
       ◦ E.g., rendezvous tables
       ◦ E.g., recent work on Nexus handoff
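A hypothetical example of the technique: once something like the struct below is agreed on, one person can work on the planning side and another on the execution side in parallel. The names here are invented for illustration, not Omicron's real types.

```rust
// Hypothetical illustration of the technique: agree on a shared data structure
// first, so planning work and execution work can proceed independently.
// (These types are made up for the example.)

/// The agreed-upon interface: what the planner asks for, what the executor does.
#[derive(Debug, Clone, PartialEq)]
struct PendingSpUpdate {
    /// Which board to update, identified by serial number.
    board_serial: String,
    /// The firmware version we expect to find before starting (a precondition,
    /// so a stale plan is rejected rather than blindly applied).
    expected_version: String,
    /// The firmware version to install.
    target_version: String,
}

// One side of the interface: the planner produces pending updates...
fn plan_sp_updates(current_versions: &[(String, String)], target: &str) -> Vec<PendingSpUpdate> {
    current_versions
        .iter()
        .filter(|(_, version)| version.as_str() != target)
        .map(|(serial, version)| PendingSpUpdate {
            board_serial: serial.clone(),
            expected_version: version.clone(),
            target_version: target.to_string(),
        })
        .collect()
}

// ...and the other side consumes them, without either side needing the
// other's internals to be finished.
fn execute_sp_update(update: &PendingSpUpdate) {
    println!(
        "would update SP on {} from {} to {}",
        update.board_serial, update.expected_version, update.target_version
    );
}

fn main() {
    let current = vec![
        ("BRM0001".to_string(), "1.0.0".to_string()),
        ("BRM0002".to_string(), "1.1.0".to_string()),
    ];
    for update in plan_sp_updates(&current, "1.1.0") {
        execute_sp_update(&update);
    }
}
```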
  34. Thank you to the update team
     • I’m extremely proud of what our team has done (and I hope you all are, too!)
     • Not just the end result, but the teamwork that’s gone into it
     • Thanks to everybody who’s helped out!
  35. Upgrade project
     • We have our own watercooler! Come join us.
       ◦ M/W/F: 9am PT
       ◦ T/Th: 1:30pm PT
     • Matrix: oxide-update channel
     • Documentation:
       ◦ RFD 418 Towards automated system update
       ◦ RFD 565 Reconfigurator-driven system upgrade
       ◦ https://github.com/oxidecomputer/omicron/blob/main/docs/reconfigurator.adoc
     • Project board: https://github.com/orgs/oxidecomputer/projects/44/views/16
     • Project page: https://github.com/oxidecomputer/meta/tree/master/engineering/reconfigurator/upgrade