PremDay #3 - From half a rack to 1MW

PREMDAY
From half a rack to 1 MW: 12 years scaling Proton across 3 regions
Sebastien Ceuterickx, Engineering Director, Infrastructure, Proton; Treasurer, Premday

From quarter-rack to 1 MW: 12 years, 3 regions, 6 sites
• 2014: 1st bunker DC
• 2017: 1st private cage, 50 kW
• 2021: 1st DC in Germany, 250 kW
• 2024: 1st DC in Norway, 400 kW
• 2026: 3 regions, 1 MW
100% green energy, PUE < 1.2

Things that went wrong, but not a single user data loss
• Wrong PSU sizing → fire
• Customs & logistics hell
• Loading dock too small
• Racks tipped over
• Supply chain disaster
• Wrong disk swapped
• Global power outage
...also: rack flooded from above · NIC overheating to 100°C · CPU throttling · Dead on Arrival ~1% · Security incidents · Firmware-upgrade campaigns

When you think your bad day is over...
Flat tire · Black ice · Snowstorm
...you still have to drive home.

Pushing storage platforms to their limits: the exabyte path
From 100 TB to 500 PB: the phased approach for 99.999999999% (11 nines) durability

2014-2018: MySQL, 100 TB
Emails & attachments stored in MySQL. Humble start.

2018-2024: Ceph on bare metal, 1 PB → 30+ PB per cluster
Multiblob to kill 192× overhead on kB blobs. EC 4+2 → 8+3 once 3,000 OSDs × 1% = 12-day MTBF. 12 to 24 HDD machines.

2024-today: Ceph on K8s, 300 PB raw
3 regions. Operations at scale, on Kubernetes since end-2024. Reach 65 PB per cluster limit.

2025-today: Custom cold tier, +200 PB
50% cheaper than Ceph. Low EC over 3 regions. Ceph for hot blobs (<3 mo). Older data → tiering on a higher TB per CPU/RAM platform.

What's next
Ceph full flash · TiKV from kB blobs · HDD JBOD + SSD buffer · power-optimized drives · SMR · tape-to-glacier

We optimized in 3 dimensions: less raw TB / user TB, lower $/TB, lower Watt/TB.
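The two figures on the Ceph slide can be checked with simple arithmetic. The sketch below assumes the "1%" is an annual failure rate per OSD and uses a 365-day year; the function names are illustrative and do not come from the deck. It shows that 3,000 OSDs at 1% per year means a drive failure roughly every 12 days, and that moving from EC 4+2 to EC 8+3 both tolerates one more lost chunk and stores less raw data per user byte.

```python
def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per user byte in a k+m erasure-coded pool."""
    return (k + m) / k


def days_between_failures(num_osds: int, annual_failure_rate: float) -> float:
    """Mean number of days between drive failures across the whole fleet.

    Assumes failures are spread evenly over a 365-day year.
    """
    failures_per_year = num_osds * annual_failure_rate
    return 365.0 / failures_per_year


if __name__ == "__main__":
    for k, m in [(4, 2), (8, 3)]:
        print(f"EC {k}+{m}: {ec_overhead(k, m):.3f}x raw per user byte, "
              f"tolerates {m} lost chunks")
    print(f"{days_between_failures(3000, 0.01):.1f} days between failures")
```

With a failure arriving that often, a repair window of a few days overlaps the next failure regularly, which is one plausible reading of why the wider 8+3 profile became attractive.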

Lesson 1: Storage
Storage is a systems problem.

What it actually costs us
50-60% hardware, 25-40% electricity. Both grow more expensive over time. Buying more is not a strategy.

Workload-shaped
• kB blobs, billions per cluster. 192× overhead off-the-shelf; we engineer around it (multiblob, OMAP).
• ceph-ansible → cephadm → rook

SLAs that drive every choice
• N+1 metro · survive any DC + 2 machines/site, 11 nines durability
• p99 50 ms hot data (<3 mo)
• p99 100 ms for all data

What we ask vendors now
• Density: TB / U, JBOD-friendly, toolless design
• Power per TB: embrace SMR, low-power profiles, cooling efficiency
• Lifecycle data: we plan around it
• Co-design: tiered systems, HDD JBOD + SSD buffer + tape

Five years ago we asked for more drives. Today we ask vendors to design with us.
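To see how kB blobs can cost two orders of magnitude more raw space than their payload, here is a simplified model of space amplification. It is not Ceph's actual on-disk layout. The deck does not state the parameters behind 192×, so the 1 KiB blob, the EC 4+2 profile, and the 32 KiB allocation unit are hypothetical values chosen only because they reproduce that figure. The model also ignores metadata. Packing many blobs into one object ("multiblob" on the slide) is modeled as simply concatenating payloads.

```python
import math


def stored_bytes(payload: int, k: int, m: int, alloc_unit: int) -> int:
    """Raw bytes used by one object in a k+m EC pool.

    Simplified model: the payload is split into k data chunks, m parity
    chunks of the same size are added, and every chunk is rounded up to
    a whole number of allocation units on disk.
    """
    chunk = math.ceil(payload / k)
    chunk_on_disk = math.ceil(chunk / alloc_unit) * alloc_unit
    return (k + m) * chunk_on_disk


def amplification(blob_size: int, blobs_per_object: int,
                  k: int, m: int, alloc_unit: int) -> float:
    """Raw bytes stored per payload byte when blobs are packed together."""
    payload = blob_size * blobs_per_object
    return stored_bytes(payload, k, m, alloc_unit) / payload


if __name__ == "__main__":
    KIB = 1024
    # Hypothetical parameters, chosen to illustrate the effect.
    for packed in (1, 64, 4096):
        factor = amplification(1 * KIB, packed, k=4, m=2, alloc_unit=32 * KIB)
        print(f"{packed:>5} blobs per object: {factor:.2f}x")
```

Under these assumptions a single 1 KiB blob occupies six 32 KiB chunks, and packing enough blobs per object brings the factor back down to the bare 1.5× of EC 4+2.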

Lesson 2: Hardware provisioning
What scales badly when you grow 100×

Average 3% DOA rate
$1.5M of dead hardware at current scale.

5 days to 3 months to fix a DOA part!
RMA pipelines aren't built for our scale, or our urgency.

Hidden cost per failed unit
Hardware immobilized · OPEX overhead · rack space wasted

Cascade on deployment timelines
Burn-in 48 h · provisioning hrs · cluster-add hrs · one fail → re-batch → k$ loss

This increases my TCO. I bake it into every order!
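The "one fail → re-batch" cascade gets worse as batches grow, and a short calculation shows why. The sketch assumes units fail independently at the 3% DOA rate from the slide. The batch sizes and the order-uplift rule are illustrative assumptions, since the deck does not say how the DOA rate is baked into orders.

```python
import math


def p_batch_has_doa(doa_rate: float, batch_size: int) -> float:
    """Probability that at least one unit in a batch is dead on arrival.

    Assumes units fail independently of each other.
    """
    return 1.0 - (1.0 - doa_rate) ** batch_size


def units_to_order(units_needed: int, doa_rate: float) -> int:
    """Units to order so the expected number of working units meets the need.

    One possible way to bake the DOA rate into an order.
    """
    return math.ceil(units_needed / (1.0 - doa_rate))


if __name__ == "__main__":
    for size in (10, 20, 50):  # hypothetical batch sizes
        print(f"batch of {size}: "
              f"{p_batch_has_doa(0.03, size):.0%} chance of at least one DOA")
    print(f"order {units_to_order(100, 0.03)} units to expect 100 working")
```

At a 3% rate, a batch of a few dozen machines is more likely than not to contain a dead unit, so a 48 h burn-in followed by a re-batch becomes the normal case.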

Lesson 3: Standardization & automation
Heterogeneity is fine when humans handle it. At scale, for velocity and cost-efficiency: standardization + automation.

Where we came from
• 50 SKUs for a team of 10
• Startup mode: build fast, ship to market.
• Optimized for flexibility and cost
• Humans absorbed the heterogeneity.

What scale forced
• Fewer SKUs: dual vendors per platform
• Automation everywhere: inventory · burn-in · failed-unit registry · provisioning
• Tight integration: facilities and compute data, not siloed

Standardization isn't ideology; it's what makes scaling possible through automation.

Hand-over to CERN
PROTON: ~1 MW · 3 regions · 1 EB in 2027
CERN: 250,000 drives · 10 EB · HL-LHC
Different scales. Same walls.
Now: what we both see

Joint diagnosis: what we both converge on

Densify (↓ $/TB)
• 100+ drives JBOD
• 400+ Gbps uplink
• SMR drives

Optimize (↓ W/TB)
• Single-socket nodes
• EPYC and ARM?
• Flash caching tier

Automate (100k+ drives)
• "Lazy" repair, batch RMA
• Automated inventory
• Proactive monitoring: OCP Datacenter SAS-SATA Device Specification

1 MW or 250,000 drives: the answer is the same.

The vendor relationship: where the vendor model breaks

Firmware quality
• Firmware validation → a major manual workload
• OEM "latest and best" firmware reduces testing flexibility
• Known firmware issues create recurring engineering overhead

Lifecycle visibility
• From 5-7 years to 7-10 years

One workload fits all
• Designed for the median. kB blobs & large physics files miss the spec.

Procurement, not partnership
• Vendors sell SKUs. We want to co-design hardware.

We're not asking for a different vendor. We're asking for a different conversation.

We pivot to OCP

CAPEX −30%
vs optimized 19″ legacy solution

OPEX −20%
Energy efficiency · tool-less L11 maintenance

Technology leadership + DLC
Faster next-gen CPU onboarding · Direct Liquid Cooling (another −20% OPEX) · OpenBMC standardization

Supply chain control
Direct procurement, multiple trusted manufacturers. Component-level partnerships for strategic parts.

Two infrastructures. One open specification. Meet us at OCP.
