VirtualiZarr & Icechunk: Build a cloud-optimized datacube in 3 lines

Talk given at the Cloud-Native Geospatial Forum 2025

Tom Nicholas

May 06, 2025

Transcript

  1. VirtualiZarr & Icechunk: Build a cloud-optimised datacube in 3 lines
    Tom Nicholas - CNG - May 2025
  2. About me
    Tom Nicholas, PhD, Forward Engineer
    • Plasma physicist
    • Pivoted to geoscience data
    • Open source maintainer
    (Slide panels: Open source contributions; Prior experience)
    Mission statement: To empower people to use scientific data to solve humanity’s greatest challenges
  3. EARTHMOVER.IO: Story
    • 18 months ago I was asked to cloud-optimize some data - I’m finally done 😅
    • It was such a pain that I wrote a new package - now it’s 3 lines of Python 💪
    • This approach (VirtualiZarr + Icechunk) should work for any Level 3/4 dataset 🌐
    • Avoids copying the data - a cloud-native bridge for archival data 📜☁
    Collaborators:
    • VirtualiZarr: Max Jones, Sean Harkins, Aimee Barciauskas & Kyle Barron (DevSeed), Raphael Hagen (CarbonPlan), Julia Signell (Element84)
    • OAE Efficiency Map: Shane Loeffler & Kata Martin (CarbonPlan), Matt Long ([C]Worthy)
  4. Dataset: [C]Worthy OAE Efficiency Atlas
    • Ensemble of climate model simulations
    • Arranged in a grid (“tiles”)
    • On S3 (in Source Cooperative’s bucket)
    🙄 NetCDF
    😬 50TB (200TB uncompressed)
    😱 500,000 netCDF4 files!
    😵💫 Logically a 6-dimensional datacube ☝ (3D + time + 2 ensemble dimensions)
  5. Beautiful geospatial visualization
    https://carbonplan.org/research/oae-efficiency/
    (Visualization credit: Shane Loeffler)
  6. Ugly distribution: “Pile of files / tiles”
    → What passes for a “cloud data archive”
    • Result of a naive “lift and shift”
  7. Problem: not “cloud-optimized”
    • Client code has to look at metadata spread throughout each file
    • Do this for every file, then check they can be combined
    • If you called xr.open_mfdataset that’s what it would do
    Time to open: 1 minute per file x 500,000 files = an entire year!!! 🤯
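The "entire year" figure on this slide can be checked with quick arithmetic. The sketch below uses only the two numbers quoted on the slide (1 minute per file, 500,000 files) and assumes the files are opened strictly one after another.

```python
# Back-of-envelope check of the slide's "time to open" estimate.
MINUTES_PER_FILE = 1   # figure quoted on the slide
N_FILES = 500_000      # figure quoted on the slide

# Assumes files are opened serially, with no parallelism.
total_minutes = MINUTES_PER_FILE * N_FILES
total_days = total_minutes / (60 * 24)
total_years = total_days / 365

print(f"{total_days:.0f} days, about {total_years:.2f} years")
```

So the serial estimate comes out at a little under a year, consistent with the "~1 year" shown in the summary table later in the deck.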
  8. Rewrite as Zarr!
    50TB + 50TB
    • Time to open now <1 second! 🥳
    • Uses 2x the storage 👎
  9. Icechunk stores “virtual chunks”
    • Transactional, cloud-native storage engine for Zarr
    • Works together with Zarr Python 3 and Xarray
    • Supports virtual chunks
    • Core implemented in Rust; thin Python wrapper
    • Rust + Julia + C interoperability
    • 100% open source (Apache 2.0)
    https://github.com/earth-mover/icechunk/
    https://icechunk.io
  10. VirtualiZarr extracts virtual chunks
    • Combining chunk references from different files == array concatenation
    • So use xarray.concat()! (By wrapping ManifestArrays)
    • Can write references to Icechunk stores
    • (also to Kerchunk format)
    (Diagram labels: import virtualizarr as vz; vz.open_virtual_dataset( ); “chunk manifests”; “Virtual” xr.Dataset( ))
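To make "combining chunk references == array concatenation" concrete, here is a toy model in plain Python. It is not VirtualiZarr's ManifestArray implementation: the manifest layout (a dict from chunk-grid index to a path, byte offset, and length), the function name concat_manifests, and the S3 paths are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Toy model (an assumption, not VirtualiZarr's real data structure):
# a manifest maps a chunk-grid index to (file path, byte offset, byte length).
ChunkRef = Tuple[str, int, int]
Manifest = Dict[Tuple[int, ...], ChunkRef]


def concat_manifests(first: Manifest, second: Manifest, axis: int = 0) -> Manifest:
    """Concatenate two manifests along `axis` by shifting the second's chunk indices."""
    # Number of chunks the first manifest spans along the concatenation axis.
    shift = max(idx[axis] for idx in first) + 1
    combined = dict(first)
    for idx, ref in second.items():
        shifted = idx[:axis] + (idx[axis] + shift,) + idx[axis + 1:]
        combined[shifted] = ref  # only the reference moves; no data is copied
    return combined


# Hypothetical files: two chunks in a.nc, one chunk in b.nc.
file_a = {
    (0, 0): ("s3://bucket/a.nc", 4096, 1000),
    (1, 0): ("s3://bucket/a.nc", 5096, 1000),
}
file_b = {(0, 0): ("s3://bucket/b.nc", 4096, 1000)}

combined = concat_manifests(file_a, file_b, axis=0)
for index in sorted(combined):
    print(index, combined[index])
```

The point of the sketch is that concatenation only rewrites chunk-grid indices, so the cost depends on the number of references rather than the size of the underlying data.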
  11. Scaling to 500,000 files
    • Map-reduce problem, called via vz.open_virtual_mfdataset(<filepaths>)
    • Serverlessly map vz.open_virtual_dataset over files as AWS Lambdas
    • (Max 1000 lambdas at once, so batch this)
    • Demo…
    (Diagram labels: vz.open_virtual_dataset( ) x 3, feeding into xarray.concat(…) 🧑💻)
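The batched map-reduce pattern described on this slide can be sketched with the standard library alone. In the sketch, open_refs is a hypothetical stand-in for the per-file map step, list extension stands in for the concatenation (reduce) step, and a local thread pool stands in for AWS Lambda; only the limit of 1000 comes from the slide.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List

MAX_CONCURRENT = 1000  # the slide's limit on simultaneous Lambdas


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def open_refs(path: str) -> List[str]:
    # Hypothetical stand-in for the "map" step (extracting references from one file).
    return [f"{path}#chunk0"]


def map_reduce(paths: List[str], batch_size: int = MAX_CONCURRENT) -> List[str]:
    combined: List[str] = []
    # A local thread pool stands in for the serverless workers here.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for batch in batched(paths, batch_size):
            # "Map" over one batch; pool.map returns results in input order.
            for refs in pool.map(open_refs, batch):
                combined.extend(refs)  # "reduce": stand-in for concatenation
    return combined


paths = [f"file_{i}.nc" for i in range(2500)]  # hypothetical file names
result = map_reduce(paths)
print(len(result), "references from", len(list(batched(paths, MAX_CONCURRENT))), "batches")
```

Processing one batch at a time also gives a natural checkpoint: a run that records which batches have finished can be resumed without redoing them.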
  12. Demo
    • 10 GB/s
    • From laptop
    • Batched
    • Resumable
    • General
    • 3 lines (ish…)
  13. Polishing ✨
    → Replicate demo yourself with upcoming VirtualiZarr release
    • Scaling demo shows possibilities not polish - relies on some unmerged features
    • But VirtualiZarr and Icechunk work together today - come to our workshop to see!
    • This approach is much easier than any other I know of
    • Also note TIFF + JPEG + GRIB support in VirtualiZarr is coming
    • Talk to Sean Harkins @ DevSeed about that
  14. Summary
    • Level 3/4 datasets often in archival formats with Zarr-like (array) structure
    • Virtual Icechunk stores point at files without copying the data
    • Build virtual datacubes using Xarray syntax via VirtualiZarr
    • Icechunk allows incremental updates as new data arrives

    Format                  NetCDF4    “Native” Zarr   Icechunk 🧊
    # of URLs               500,000    1               1
    Time to open            ~1 year    < 1 sec         < 1 sec
    Storage increase        0%         100%            <0.0004%
    Convert using Xarray?   N/a        Yes             Yes
    Version-controlled?     No         No              Yes
    Update-safe?            No         No              Yes
  15. Bonus: “Which file formats?”
    • A lot of them!
    • Currently does netCDF4, HDF5, netCDF3, “native” Zarr (v3), FITS
    • TIFF, COG, GRIB and more on the horizon
    • Can write your own custom “reader” for a more niche format
    • e.g. WIP reader for HuggingFace’s SafeTensors format for ML model weights
  16. Bonus: “What datasets can you virtualize?”
    • Currently some additional requirements imposed by the Zarr data model
    • Homogeneous chunk shapes
    • Homogeneous chunk codecs (e.g. compression)
    • Per-array metadata
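The two homogeneity requirements listed above can be illustrated as a simple pre-flight check. This is a simplified sketch, not how VirtualiZarr validates inputs: the FileArrayInfo record, its fields, and the can_virtualize_together helper are hypothetical, and real codec descriptions carry more detail than a single string.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FileArrayInfo:
    """Hypothetical summary of how one file stores a given array."""
    path: str
    chunk_shape: Tuple[int, ...]
    codec: str  # simplified: a single name instead of a full codec pipeline


def can_virtualize_together(infos: List[FileArrayInfo]) -> bool:
    """Return True if all files share one chunk shape and one codec."""
    shapes = {info.chunk_shape for info in infos}
    codecs = {info.codec for info in infos}
    return len(shapes) == 1 and len(codecs) == 1


# Hypothetical files: c.nc uses a different chunk shape from a.nc and b.nc.
uniform = [
    FileArrayInfo("a.nc", (1, 180, 360), "zlib"),
    FileArrayInfo("b.nc", (1, 180, 360), "zlib"),
]
mixed = uniform + [FileArrayInfo("c.nc", (1, 90, 360), "zlib")]

print(can_virtualize_together(uniform))
print(can_virtualize_together(mixed))
```

A check like this only flags the mismatch; under the constraint described on the slide, files that fail it could not be combined into one virtual array as-is.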
  17. Bonus: “But you can’t change the chunks”
    • Correct. That is the primary downside of this approach.
    • You can choose to write native chunks alongside the virtual chunks though
    • Allows you to ensure small coordinate variables are all one chunk
    • Allows you to incrementally overwrite data with more suitable chunking after virtual ingestion
  18. Bonus: “What about Kerchunk?”

    Format                  NetCDF4    “Native” Zarr   Kerchunk    Icechunk 🧊
    # of URLs               500,000    1               1           1
    Time to open            ~1 year    < 1 sec         < 1 sec     < 1 sec
    Storage increase        0%         100%            0.0004%     0.0004%
    Convert using Xarray?   N/a        Yes             No          Yes
    Version-controlled?     No         No              No          Yes
    Update-safe?            No         No              No          Yes

    Kerchunk is two things:
    1. Python package
      • VirtualiZarr package replaces this
    2. Format for storing references
      • Icechunk format is an alternative
      • (Though VirtualiZarr can write to both)
  19. Bonus: “How does the parallelization work?”
    • vz.open_virtual_mfdataset accepts an Executor
    • Follows the concurrent.futures interface (idea from Cubed)
    • Comes with a LithopsExecutor and a DaskDelayedExecutor, and works with the Python ThreadPoolExecutor
    • Lithops
      • Open-source package abstracting over serverless APIs of various cloud providers
      • Had to build a runtime using Docker first
      • But then my Python function just runs on AWS Lambdas
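The value of following the concurrent.futures interface is that the calling code does not care where the work runs. The sketch below shows the pattern with the standard library only; it makes no claim about VirtualiZarr's internals. SerialExecutor, map_over_files, and describe are hypothetical names introduced for illustration.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor
from typing import Callable, List


class SerialExecutor(Executor):
    """Hypothetical executor that runs each task immediately in the calling thread."""

    def submit(self, fn, /, *args, **kwargs):
        future: Future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


def map_over_files(fn: Callable[[str], str], paths: List[str], executor: Executor) -> List[str]:
    """Apply `fn` to every path using whichever executor the caller supplies."""
    futures = [executor.submit(fn, path) for path in paths]
    return [future.result() for future in futures]  # results in submission order


def describe(path: str) -> str:
    # Stand-in for real per-file work.
    return path.upper()


paths = ["a.nc", "b.nc"]  # hypothetical file names

with SerialExecutor() as executor:
    serial = map_over_files(describe, paths, executor)

with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = map_over_files(describe, paths, executor)

print(serial == threaded)
```

Because map_over_files depends only on submit and result, swapping in an executor backed by serverless functions would not require changing the calling code, which appears to be the design idea the slide describes.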