PremDay #3 - From half a rack to 1MW

Proton presents 12 years of scaling across 3 regions

Premday

June 12, 2026

Transcript

  1. PREMDAY: From half a rack to 1 MW. 12 years scaling Proton across 3 regions

    Sebastien Ceuterickx. Engineering Director, Infrastructure, Proton. Treasurer, Premday.
  2. From quarter-rack to 1 MW: 12 years, 3 regions, 6 sites

    2014: 1st bunker DC.
    2017: 1st private cage, 50 kW.
    2021: 1st DC in Germany, 250 kW.
    2024: 1st DC in Norway, 400 kW.
    2026: 3 regions, 1 MW.
    100% green energy. PUE < 1.2.
  3. Things that went wrong, but not a single user data loss

    Wrong PSU sizing → fire · Customs & logistics hell · Loading dock too small · Racks tipped over · Supply chain disaster · Wrong disk swapped · Global power outage
    ...also: rack flooded from above · NIC overheating to 100°C · CPU throttling · Dead on Arrival ~1% · Security incidents · Firmware-upgrade campaigns
  4. When you think your bad day is over...

    Flat tire. Black ice. Snowstorm.
    ...you still have to drive home.
  5. Pushing storage platforms to their limits: the exabyte path

    From 100 TB to 500 PB: the phased approach for 99.999999999% (11 nines) durability.
    2014–2018, MySQL, 100 TB: emails & attachments stored in MySQL. Humble start.
    2018–2024, Ceph on bare metal, 1 PB → 30+ PB per cluster: multiblob to kill the 192× overhead on kB blobs. EC 4+2 → 8+3 once 3,000 OSDs × 1% = 12-day MTBF. From 12-HDD to 24-HDD machines.
    2024–today, Ceph on K8s, 300 PB raw: 3 regions. Operations at scale, on Kubernetes since end-2024. Reached the 65 PB per-cluster limit.
    2025–today, custom cold tier, +200 PB: 50% cheaper than Ceph. Low EC over 3 regions. Ceph for hot blobs (<3 mo); older data → tiering on a platform with higher TB per CPU/RAM.
    What's next: Ceph full flash · TiKV from kB blobs · HDD JBOD + SSD buffer · power-optimized drives · SMR · tape-to-glacier
    We optimized in 3 dimensions: less raw TB per user TB, lower $/TB, lower W/TB.
  6. Lesson 1: Storage. Storage is a systems problem.

    What it actually costs us: 50–60% hardware, 25–40% electricity. Both grow more expensive over time. Buying more is not a strategy.
    Workload-shaped:
    • kB blobs, billions per cluster. 192× overhead off-the-shelf; we engineer around it (multiblob, OMAP).
    • ceph-ansible → cephadm → Rook
    SLAs that drive every choice:
    • N+1 metro · survive the loss of any DC + 2 machines/site · 11 nines durability
    • p99 50 ms for hot data (<3 mo)
    • p99 100 ms for all data
    What we ask vendors now:
    • Density: TB/U, JBOD-friendly, toolless design
    • Power per TB: embrace SMR, low-power profiles, cooling efficiency
    • Lifecycle data: we plan around it
    • Co-design: tiered systems (HDD JBOD + SSD buffer + tape)
    Five years ago we asked for more drives. Today we ask vendors to design with us.
  7. Lesson 2: Hardware provisioning. What scales badly when you grow 100×

    Average 3% DOA rate: $1.5M of dead hardware at current scale.
    5 days to 3 months to fix a DOA part! RMA pipelines aren't built for our scale, or our urgency.
    Hidden cost per failed unit: hardware immobilized · OPEX overhead · rack space wasted
    Cascade on deployment timelines: burn-in 48 h · provisioning hrs · cluster-add hrs · one fail → re-batch → k$ loss
    This increases my TCO. I bake it into every order!
  8. Lesson 3: Standardization & automation

    Heterogeneity is fine when humans handle it. At scale, velocity and cost-efficiency require standardization + automation.
    Where we came from: 50 SKUs for a team of 10. Startup mode: build fast, ship to market. Optimized for flexibility and cost. Humans absorbed the heterogeneity.
    What scale forced:
    Fewer SKUs: dual vendors per platform.
    Automation everywhere: inventory · burn-in · failed-unit registry · provisioning
    Tight integration: facilities, compute, data; not siloed.
    Standardization isn't ideology; it's what makes scaling possible through automation.
  9. Hand-over to CERN

    PROTON: ~1 MW · 3 regions · 1 EB in 2027
    CERN: 250,000 drives · 10 EB · HL-LHC
    Different scales. Same walls. Now: what we both see.
  10. Joint diagnosis: what we both converge on

    Densify (↓ $/TB): 100+ drives JBOD · 400+ Gbps uplink · SMR drives
    Optimize (↓ W/TB): single-socket nodes · EPYC and ARM? · flash caching tier
    Automate (100k+ drives): "lazy" repair, batch RMA · automated inventory · proactive monitoring (OCP Datacenter SAS-SATA Device Specification)
    1 MW or 250,000 drives: the answer is the same.
  11. The vendor relationship: where the vendor model breaks

    Firmware quality · lifecycle visibility: firmware validation → a major manual workload. OEM "latest and best" firmware reduces testing flexibility. Known firmware issues create recurring engineering overhead. From 5-7 years to 7-10 years.
    One workload fits all: designed for the median. kB blobs & large physics files miss the spec.
    Procurement, not partnership: vendors sell SKUs. We want to co-design hardware.
    We're not asking for a different vendor. We're asking for a different conversation.
  12. We pivot to OCP

    CAPEX −30% vs. optimized 19″ legacy solution.
    OPEX −20%: energy efficiency · tool-less L11 maintenance
    Technology leadership + DLC: faster next-gen CPU onboarding · Direct Liquid Cooling (another −20% OPEX) · OpenBMC standardization
    Supply chain control: direct procurement, multi-trusted manufacturers. Component-level partnerships for strategic parts.
    Two infrastructures. One open specification. Meet us at OCP.
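
The arithmetic behind two of the figures above (slide 5's "EC 4+2 → 8+3 once 3,000 OSDs × 1% = 12-day MTBF" and slide 7's "3% DOA rate, $1.5M of dead hardware") can be reproduced with a small back-of-envelope sketch. It rests on assumptions the slides do not state: that the "1%" is an annualized drive failure rate, that failures are independent, and that the $1.5M is simply 3% of total hardware spend. The function names are illustrative only and do not come from the talk.

```python
# Back-of-envelope checks for figures quoted on slides 5 and 7.
# ASSUMPTIONS (not stated on the slides): "1%" is an annualized failure
# rate per drive, failures are independent, and the $1.5M of dead hardware
# equals the DOA rate times total hardware spend. Names are hypothetical.


def ec_raw_per_usable(k: int, m: int) -> float:
    """Raw bytes stored per usable byte for a k+m erasure-coding profile."""
    return (k + m) / k


def days_between_failures(drives: int, annual_failure_rate: float) -> float:
    """Mean number of days between drive failures somewhere in the fleet."""
    failures_per_year = drives * annual_failure_rate
    return 365.0 / failures_per_year


def implied_hardware_spend(dead_value: float, doa_rate: float) -> float:
    """Hardware spend implied by a dead-on-arrival value and a DOA rate."""
    return dead_value / doa_rate


if __name__ == "__main__":
    for k, m in ((4, 2), (8, 3)):
        print(f"EC {k}+{m}: {ec_raw_per_usable(k, m):.3f} raw TB per usable TB, "
              f"tolerates {m} lost chunks")
    print(f"3,000 drives at 1%/year: one failure every "
          f"{days_between_failures(3000, 0.01):.1f} days")
    print(f"Implied hardware spend: "
          f"${implied_hardware_spend(1.5e6, 0.03):,.0f}")
```

Under these assumptions, moving from 4+2 to 8+3 lowers raw storage per usable byte from 1.5 to 1.375 while tolerating three lost chunks instead of two, and a fleet of 3,000 drives sees roughly one failure every 12 days, which matches the figure on slide 5. The implied hardware spend is only an inference from the two slide-7 numbers, not a figure given in the talk.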