2025 Update update at OxCon 2025

A review of ongoing work on software update in the Oxide system, followed by discussion of challenges working on large, long-term projects.

David

September 17, 2025

Transcript

  1. 2025 Update update
     • About software update
     • Path to self-service update
     • Current status and next steps
     • Experience report for a long-running project
  2. Software update
     • Upgrading all the updateable software in the rack. (Obviously?)
     • There are hundreds of updateable components within a single rack! (see the sketch below)
       ◦ Each sled/switch/PSC: SP software, RoT bootloader, RoT
       ◦ Each sled: host OS (phase 1, phase 2)
       ◦ Control plane components (one per disk + another 1-2 dozen)
       ◦ (Not yet included: SSD firmware)
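To make the scale concrete, here is a minimal Rust sketch of one way the updateable components in a rack might be modeled. The types and names are hypothetical, invented for illustration; they are not Omicron's actual data model.

```rust
// Hypothetical sketch (not Omicron's actual types): one way to model the
// kinds of updateable components described in the slide above.

/// A single piece of software in the rack that can be updated.
#[derive(Debug, Clone)]
enum UpdateableComponent {
    /// Service Processor firmware on a sled, switch, or PSC.
    SpFirmware { board: BoardKind },
    /// Root of Trust bootloader.
    RotBootloader { board: BoardKind },
    /// Root of Trust firmware.
    Rot { board: BoardKind },
    /// Host OS phase 1 (boot) or phase 2 image on a compute sled.
    HostOs { sled: u8, phase: HostOsPhase },
    /// A control plane zone (one per disk, plus another 1-2 dozen services).
    ControlPlaneZone { sled: u8, zone_name: String },
}

#[derive(Debug, Clone, Copy)]
enum BoardKind {
    Sled(u8),
    Switch(u8),
    Psc(u8),
}

#[derive(Debug, Clone, Copy)]
enum HostOsPhase {
    Phase1,
    Phase2,
}

fn main() {
    // With 32 sleds, 2 switches, and the power shelf controllers, plus
    // per-disk and shared control plane zones, the component count quickly
    // reaches the hundreds mentioned above.
    let example = UpdateableComponent::HostOs { sled: 7, phase: HostOsPhase::Phase2 };
    println!("{example:?}");
}
```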
  3. Upgrade today
     • Release process produces a single ZIP file with all the software in the system (~2-3 GiB)
     • When release is ready, Support reaches out to customer to schedule upgrade
     • Upgrade process (1-2 hours):
       ◦ Oxide support engineer connects via technician port (often via jump host; authenticated via hardware token)
       ◦ Uses a combination of TUI and Unix shell to:
         ▪ “Park the rack” – shut (almost) everything down
         ▪ Replace all software
         ▪ Apply database schema updates
         ▪ Bring everything back up
  4. Problems
     • It’s not self-service
       ◦ Oxide support requires access (usually via jump host)
       ◦ Non-starter for many (most?) customers
     • Instances offline for duration of upgrade (1-2 hours)
       ◦ Impacts users
       ◦ Constrains scheduling
       ◦ Constrains workloads that can be deployed on Oxide
     • It doesn’t scale for us
  5. Aside: importance of software update
     • A top company priority for 2+ years
     • Upgrade is how Oxide continues to deliver value to customers after their initial purchase (bugfixes, features)
       ◦ Customers buy racks with the expectation that the software will improve over time.
     • Nearly every customer needs upgrade to be better operationalized in order to go into production (i.e., buy more racks)
     • Upgrade experience has to be smooth for any of this to work!
  6. Self-service upgrade: operator view
     • Operator downloads ZIP file (TUF repo) from Oxide to their machine (laptop, workstation, etc.), validates checksum
     • Using web console or CLI (or API), operator:
       ◦ Uploads ZIP file to Oxide system
       ◦ Configures the target system release
       ◦ Waits for the upgrade to finish
       ◦ If anything goes wrong: system pauses updates with clear explanation
     • Operator focuses on high-level policy, not implementation details
     • Goal: scales up with number of racks
  7. Self-service update
     • Implies: API-driven
     • Implies: fully automated
     • Implies: control plane is driving the update
       ◦ (Implies: control plane is online during the update)
         ▪ Replacing all the parts of the ___ while it’s ___ (plane while it’s in the air, car while it’s driving down the freeway, pick your metaphor)
  8. OxCon 2023
     • First customer install: July 2023 (upgrades work basically the way they do today)
     • RFD 418 laid out the vision
       ◦ Autonomous, self-service, minimal impact, risk mitigation
       ◦ Exercise it in development
       ◦ Foundations: policy vs. planning vs. execution, operability (supportability)
  9. 2023 - 2024: foundations for dynamic reconfiguration
     • Developed foundations: inventory, blueprints (see the sketch below)
     • Deliverables:
       ◦ Sled expungement (N different projects)
       ◦ Sled addition (N different projects)
       ◦ Disk expungement
       ◦ Designs: TUF repo distribution, API versioning
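As a rough illustration of the inventory/blueprint split mentioned above: inventory describes what the system has been observed to look like, a blueprint describes what it should look like, and planning is the step that produces a new blueprint. The types below are hypothetical stand-ins, not Omicron's real Reconfigurator model.

```rust
// Rough sketch of the inventory/blueprint split (hypothetical types; the real
// Reconfigurator data model in Omicron is much richer).

/// Inventory: what the control plane has most recently observed about a sled.
#[derive(Clone, PartialEq, Debug)]
struct SledObserved {
    serial: String,
    host_os_version: String,
}

/// Blueprint: a complete description of what the system should look like.
#[derive(Clone, Debug)]
struct Blueprint {
    generation: u64,
    sleds: Vec<SledObserved>,
}

/// Planning: a pure step from (observed state, parent blueprint, target) to a
/// new blueprint. Execution happens separately, by comparing reality against
/// the current target blueprint and acting on the differences.
fn plan(inventory: &[SledObserved], parent: &Blueprint, target_os: &str) -> Blueprint {
    let sleds = inventory
        .iter()
        .cloned()
        .map(|mut sled| {
            sled.host_os_version = target_os.to_string();
            sled
        })
        .collect();
    Blueprint { generation: parent.generation + 1, sleds }
}

fn main() {
    let inventory = vec![SledObserved {
        serial: "BRM0001".to_string(),
        host_os_version: "13".to_string(),
    }];
    let parent = Blueprint { generation: 1, sleds: inventory.clone() };
    let next = plan(&inventory, &parent, "14");
    println!("blueprint {} wants {:?}", next.generation, next.sleds);
}
```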
  10. Aside: dynamic reconfiguration
     • Components (and their dependencies) can come and go while the system is online
     • Examples of work unrelated to the mechanics of adding/removing/upgrading a component:
       ◦ Oximeter needs to stop collecting from producers that are gone
       ◦ Sagas that were running when Nexus was expunged need to be dealt with
       ◦ Internal NTP zones need to use the latest set of boundary NTP zones
       ◦ CockroachDB needs to start replicating data to new cluster nodes
       ◦ Anything with network dependencies needs to notice when those dependencies come or go
       ◦ Need to keep track of what IPs and other networking resources are in use
       ◦ (there’s a very long tail of these things)
  11. OxCon 2024
     • What we said was next:
       ◦ Dogfood of Theseus (ran into LRTQ limitation)
       ◦ Multi-node Clickhouse (did much of this, deprioritized)
       ◦ First steps towards upgrade
     • Team meeting: brainstorm first upgrade demos / milestones
  12. 2024 - 2025: planned work done
     • API versioning: design, dropshot support, developer workflow, update sequence, etc.
     • TUF repo management: upload, distribution, trust
     • Upgrade mechanisms: control plane zones, SP, RoT bootloader, RoT, host OS phases 1 and 2
     • Planning: orchestration of updates, planner reports (why is it stuck?), safety checks (don’t update a sled with CockroachDB on it if the CockroachDB cluster is unhealthy; see the sketch below)
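The safety-check idea in the last bullet can be sketched as a simple gate the planner consults before scheduling work. This is an illustrative sketch with made-up types, not the actual Omicron planner code.

```rust
// Illustrative sketch of the kind of safety check described above. The names
// are made up; the real planner checks in Omicron are more involved.

#[derive(Clone, Copy, PartialEq)]
enum CockroachClusterHealth {
    Healthy,
    Degraded,
}

struct Sled {
    serial: &'static str,
    runs_cockroachdb: bool,
}

/// Decide whether the planner may schedule an update for this sled right now.
fn safe_to_update(sled: &Sled, crdb_health: CockroachClusterHealth) -> Result<(), String> {
    if sled.runs_cockroachdb && crdb_health != CockroachClusterHealth::Healthy {
        // Refuse rather than risk taking another CockroachDB node offline
        // while the cluster is already short on healthy replicas.
        return Err(format!(
            "not updating sled {}: it hosts CockroachDB and the cluster is unhealthy",
            sled.serial
        ));
    }
    Ok(())
}

fn main() {
    let sled = Sled { serial: "BRM42220011", runs_cockroachdb: true };
    match safe_to_update(&sled, CockroachClusterHealth::Degraded) {
        Ok(()) => println!("proceed with update"),
        // This is also the kind of message a planner report can surface to
        // explain why an upgrade appears stuck.
        Err(reason) => println!("paused: {reason}"),
    }
}
```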
  13. 2024 - 2025: unplanned work done
     • Foundation work:
       ◦ Planner sequencing (#6999, came out of colo issue)
       ◦ Rendezvous tables
       ◦ Blippy
     • MUPdate + update coexistence (RFD 556)
     • CockroachDB decommissioning woes
     • Sled Agent reconciler, dataset bugs (#7313)
     • Undrainable sagas: RFD 555 and associated tools
  14. Delivering self-service upgrade
     • Deliverable: self-service upgrade in R17 (targeting end of September) (side note: how do you deliver an upgrade feature in one release?)
     • Blockers remaining:
       ◦ Nexus handoff
       ◦ Console support
       ◦ Lots of loose ends (better status, more safeties, etc.)
  15. Important non-blockers
     • Developer guard rails to catch upgrade breakage (update all internal APIs to use versioning; lock down Clickhouse schema)
     • #4745: lack of rack health checks
     • Switch update risks (dendrite#49)
     • Archival / cleanup
       ◦ Old blueprints and logs
       ◦ TUF repo deletion
       ◦ Decommissioning Cockroach nodes
       ◦ GC of expunged zones and datasets within blueprint
     • External API versioning
     • #8316: transient zone root datasets
     • #8769: NTP zone needs to be updated before host OS
  16. Major risk: lack of automated end-to-end testing
     • Lots of unit tests and integration tests with simulated Omicron
     • No end-to-end automated tests for upgrade (a lot of work, no hardware available, no a4x2 in CI)
     • Goals: regular tests for:
       ◦ Upgrade from previous commit on “main” to latest (exercises upgrade from each commit)
       ◦ Upgrade from previous release to latest (exercises upgrade to each commit)
       ◦ MUPdate + update together
  17. Next deliverable
     • Non-disruptive update
       ◦ VMs live-migrated before sleds are rebooted
       ◦ Never update so many Crucible zones at once that a customer volume could be impacted (see the sketch below)
       ◦ System makes best effort to ensure there will be enough capacity to do this (but hard choices may be required)
     • Original target: 2025 Q4 (will very likely change)
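The Crucible constraint above amounts to a batching rule: no single customer volume should have more than one of its zones taken down in the same update batch. Below is a minimal sketch under that assumption; the types are hypothetical, and real placement data would come from the control plane.

```rust
// A minimal sketch (hypothetical types) of the constraint above: never take
// down so many Crucible zones at once that any single volume loses more than
// one of its replicas.

use std::collections::{HashMap, HashSet};

/// Which Crucible zones hold a replica of each customer volume.
type VolumePlacement = HashMap<&'static str, Vec<&'static str>>;

/// From a set of candidate zones, pick a batch that is safe to update together:
/// at most one replica per volume may be in the batch.
fn safe_update_batch(candidates: &[&'static str], placement: &VolumePlacement) -> Vec<&'static str> {
    let mut batch = Vec::new();
    let mut touched_volumes: HashSet<&str> = HashSet::new();
    for zone in candidates {
        // Volumes that would lose a replica if we updated this zone now.
        let volumes: Vec<&str> = placement
            .iter()
            .filter(|(_, zones)| zones.contains(zone))
            .map(|(volume, _)| *volume)
            .collect();
        // Only add the zone if none of its volumes are already affected by the batch.
        if volumes.iter().all(|v| !touched_volumes.contains(v)) {
            touched_volumes.extend(volumes);
            batch.push(*zone);
        }
    }
    batch
}

fn main() {
    let mut placement = VolumePlacement::new();
    placement.insert("vol-a", vec!["zone-1", "zone-2", "zone-3"]);
    placement.insert("vol-b", vec!["zone-3", "zone-4", "zone-5"]);

    // zone-1, zone-2, and zone-3 all hold a replica of vol-a, so only the
    // first of them lands in this batch.
    let batch = safe_update_batch(&["zone-1", "zone-2", "zone-3", "zone-4"], &placement);
    println!("update together: {batch:?}");
}
```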
  18. Long-running projects
     • Delivering a long-term project like upgrade is hard
     • We have other long-term projects like this
     • What follows is an experience report
       ◦ I think it’s worked pretty well, but take it as just a data point (I’d love feedback, positive or negative)
       ◦ I’d be interested to hear how other teams are doing this
  19. My big fear (well, one of them)
     • My big fear (since 2023): upgrade perpetually feels “a year away”
     • Focus: how do we keep our eyes on the prize?
     • Prioritization: how do we decide what to punt?
     • Bonus topic: one weird trick for sequencing / decoupling work
  20. Focus challenges
     • One enemy: “organizational procrastination”
       ◦ Next steps toward upgrade are often not clear
       ◦ There are lots of other important problems where the next steps are clear, e.g. #3732 configurable external DNS IPs
       ◦ Other problems are both real and present!
         ▪ e.g., cannot factory-reset / restart RSS
     • How do we avoid letting the dozens of small, well-understood problems starve work on the big, hard problems (like upgrade)?
     • This is a balance. Being responsive to changing priorities is important, too.
  21. Losing focus
     • It’s easy to lose focus when:
       ◦ We get stuck on technical issues
       ◦ We don’t know what the next step is
       ◦ We run across other important problems (or they are thrust upon us)
  22. Maintaining focus: daily watercooler
     • 30-minute, optional Meet – every day
     • Low-pressure way to surface issues that are keeping people stuck (upgrade-related stuff, yes, but also rust-analyzer, a4x2, test flakes, etc.)
     • Goal is to feel more like “working alongside my colleagues”, not “another meeting”
       ◦ It’s not where we make big decisions
       ◦ Extended silence is fine
       ◦ It’s truly optional
  23. Maintaining focus: making a path
     • We (I) spend a lot of time both:
       ◦ Laying out the big picture … (e.g., RFD 418, RFD 504, RFD 565)
       ◦ … but also laying out very concrete next steps for each work stream (stuff folks can pick up for at least the next 1-2 weeks)
     • Goal is to avoid getting stuck silently on “what’s next”
     • This takes work!
  24. Maintaining focus: demos
     • Demos are a great tool for focusing
       ◦ Before: planning for them forces prioritization
       ◦ The demo itself usually represents a big point of de-risking
       ◦ After: inspires follow-on work, sustains momentum
     • Also a great tool for communicating:
       ◦ With each other about how subsystems are shaping up
       ◦ With the rest of the company about progress
     • Personally, I find them very motivating
     • For update: this has been a struggle (because of the problem space), but very much worth it and something we keep striving for
  25. Maintaining focus: prioritization
     • Me constantly annoying people: “is fixing this more urgent/important than taking the next step on upgrade?” (hard to compare: “solve one problem” vs. “take one step”)
  26. Prioritization: punt on third down
     • Mantra: the “Upgrade MVP” need only be as complete/correct as what we’re doing today.
       ◦ Perspective: customers have asked why they can’t just do what we’re doing today themselves!
     • Useful exercise, in the small and in the large:
       ◦ What could we ship in 2 weeks? (N weeks)
       ◦ What constraints would we put on it if we had to ship something in 2 weeks? (N weeks)
       ◦ Basically how we got to “shipping in 2025 Q3”
     • Problem: not a lot of obvious room for cutting scope in upgrade. Right?
  27. Reducing scope of upgrade
     • In retrospect: the big area for cutting scope is the set of operational conditions the system can handle on its own
     • This is not cutting back on rigor.
       ◦ Every system has operational limits. (The system cannot handle 31 sleds failing.)
       ◦ The system cannot upgrade itself at all today.
       ◦ Useful deliverable: a system that can upgrade itself on a sunny day
         ▪ De-risks a lot
         ▪ Shows customers their investment is paying off
         ▪ Can still be useful.
     • There’s a balance: edge cases can inform architecture!
  28. Prioritizing “edge cases”
     • Is this likely to come up in practice?
     • How bad is the impact?
       ◦ Permanent?
       ◦ Support call?
       ◦ Operator annoyance?
     • Avoid all-or-nothing thinking
       ◦ “Fixing this is hard, but making it safe (but annoying) is easy” (e.g., resetting the SP instead of just the host OS)
     • Would we rather ship upgrade later? (Would the customer rather wait for an upgrade system that can handle that edge case? Or get their hands on one that works on sunny days and see it improve over time?)
  29. Examples of what we’ve punted on
     • Proceeding with upgrade when a sled is down
       ◦ Determination: annoying, unlikely in the short term, handleable by support
     • Resetting the SP when all we need is to bounce the host OS
       ◦ Determination: higher impact than ideal, but that’s okay
     • Handling MGS state changing behind our back (e.g., sled moves cubbies during execution)
       ◦ Determination: unlikely, not catastrophic
     • TUF repo authentication
       ◦ Impact: slightly annoying customer flow, but no compromise on security
     • Clickhouse schema update
       ◦ Impact: we’ll have more work to do when we next need to update this schema
  30. More examples
     Edge cases we’ve spent a lot of time discussing (and largely punted on):
     • What if the two M.2 devices contain different configuration? What if they disagree about whether a MUPdate has happened?
     • What if a sled fails during the Nexus handoff after the point of no return but before new Nexus instances take over?
     • What if a sled fails after its host OS is updated but before the rest of the upgrade is completed?
     • What if we have a rack power-off after expunging a CockroachDB node and before we’ve managed to decommission it?
  31. More punts
     • Rollback (big can of worms: RFD 534)
     • Cleaning up technical debt :(
       ◦ Dataset ids
       ◦ Blueprint builder cleanup
       ◦ More faithful emulation in the test suite
     • Deploying multi-node Clickhouse (this punt was pretty late)
  32. Prioritizing process
     • Company-wide, we prioritize big features in the product roundtable.
     • How do we make more fine-grained calls about risks and blockers?
       ◦ Generally, I’ve just been making these calls (trying to get consensus) and communicating them out/up
     • Do we track product-wide “important non-blockers” (ongoing risks)?
       ◦ This could help with the ongoing background fretting about these problems (“well, we still don’t handle …”)
     • Do we review these and (re)prioritize them?
  33. Decoupling / sequencing work
     • Write down a data structure representing the interface (see the sketch below)
     • Expect it to change, but both sides can start working from it
       ◦ E.g., planning vs. execution during SP update
       ◦ E.g., rendezvous tables
       ◦ E.g., recent work on Nexus handoff
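A hypothetical example of the technique: once something like the struct below is agreed on, one person can work on the planning side and another on the execution side in parallel. The names here are invented for illustration, not Omicron's real types.

```rust
// Hypothetical illustration of the technique: agree on a shared data structure
// first, so planning work and execution work can proceed independently.
// (These types are made up for the example.)

/// The agreed-upon interface: what the planner asks for, what the executor does.
#[derive(Debug, Clone, PartialEq)]
struct PendingSpUpdate {
    /// Which board to update, identified by serial number.
    board_serial: String,
    /// The firmware version we expect to find before starting (a precondition,
    /// so a stale plan is rejected rather than blindly applied).
    expected_version: String,
    /// The firmware version to install.
    target_version: String,
}

// One side of the interface: the planner produces pending updates...
fn plan_sp_updates(current_versions: &[(String, String)], target: &str) -> Vec<PendingSpUpdate> {
    current_versions
        .iter()
        .filter(|(_, version)| version.as_str() != target)
        .map(|(serial, version)| PendingSpUpdate {
            board_serial: serial.clone(),
            expected_version: version.clone(),
            target_version: target.to_string(),
        })
        .collect()
}

// ...and the other side consumes them, without either side needing the
// other's internals to be finished.
fn execute_sp_update(update: &PendingSpUpdate) {
    println!(
        "would update SP on {} from {} to {}",
        update.board_serial, update.expected_version, update.target_version
    );
}

fn main() {
    let current = vec![
        ("BRM0001".to_string(), "1.0.0".to_string()),
        ("BRM0002".to_string(), "1.1.0".to_string()),
    ];
    for update in plan_sp_updates(&current, "1.1.0") {
        execute_sp_update(&update);
    }
}
```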
  34. Thank you to the update team
     • I’m extremely proud of what our team has done (and I hope you all are, too!)
     • Not just the end result, but the teamwork that’s gone into it
     • Thanks to everybody who’s helped out!
  35. Upgrade project
     • We have our own watercooler! Come join us.
       ◦ M/W/F: 9am PT
       ◦ T/Th: 1:30pm PT
     • Matrix: oxide-update channel
     • Documentation:
       ◦ RFD 418 Towards automated system update
       ◦ RFD 565 Reconfigurator-driven system upgrade
       ◦ https://github.com/oxidecomputer/omicron/blob/main/docs/reconfigurator.adoc
     • Project board: https://github.com/orgs/oxidecomputer/projects/44/views/16
     • Project page: https://github.com/oxidecomputer/meta/tree/master/engineering/reconfigurator/upgrade