Plasma Physicist • Pivoted to geoscience data • Open source maintainer
Tom Nicholas, PhD
FORWARD ENGINEER
MISSION STATEMENT
To empower people to use scientific data to solve humanity’s greatest challenges
Was asked to cloud-optimize some data - I’m finally done 😅
Such a pain I wrote a new package - now it’s 3 lines of Python 💪
This approach (VirtualiZarr + Icechunk) should work for any Level 3/4 dataset 🌐
Avoids copying the data - a cloud-native bridge for archival data 📜☁
Collaborators:
VirtualiZarr: Max Jones, Sean Harkins, Aimee Barciauskas & Kyle Barron (DevSeed), Raphael Hagen (CarbonPlan), Julia Signell (Element84)
OAE Efficiency Map: Shane Loeffler & Kata Martin (CarbonPlan), Matt Long ([C]Worthy)
as
• Ensemble of climate model simulations
• Arranged in a grid (“tiles”)
• On S3 (in Source Cooperative’s bucket)
🙄 NetCDF
😬 50TB (200TB uncompressed)
😱 500,000 netCDF4 files!
😵💫 Logically 6-dimensional data cube
☝ (3D + time + 2 ensemble dimensions)
to look at metadata spread throughout each file
• Do this for every file, then check they can be combined
• If you called xr.open_mfdataset that’s what it would do
Time to open: 1 minute per file x 500,000 files = an entire year!!! 🤯
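The "entire year" figure can be checked with a quick back-of-envelope calculation. This is only a sketch: the 1 minute per file is the rough number quoted on the slide, not something measured here.

```python
# Back-of-envelope estimate: roughly 1 minute of metadata reads per file,
# repeated serially for every file in the archive.
N_FILES = 500_000
MINUTES_PER_FILE = 1  # rough figure from the slide, not a measurement

total_minutes = N_FILES * MINUTES_PER_FILE
total_days = total_minutes / (60 * 24)

print(f"{total_days:.0f} days")  # close to a full year of wall-clock time
```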
Transactional, cloud-native storage engine for Zarr
Works together with Zarr Python 3 and Xarray
Supports virtual chunks
https://github.com/earth-mover/icechunk/
https://icechunk.io
Core implemented in Rust; thin Python wrapper
Rust + Julia + C interoperability
100% open source (Apache 2.0)
virtual chunks
import virtualizarr as vz
[Diagram: vz.open_virtual_dataset( ) produces a “Virtual” xr.Dataset( ) holding “chunk manifests”]
• Combining chunk references from different files == array concatenation
• So use xarray.concat()! (By wrapping ManifestArrays)
• Can write references to Icechunk stores
• (also to Kerchunk format)
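To see why combining chunk references is the same operation as array concatenation, here is a toy model in plain Python. It is a sketch only and not VirtualiZarr's actual implementation: `ChunkRef`, `Manifest`, `concat_manifests`, and the `s3://bucket/...` paths are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Where one chunk's bytes live inside an existing file (hypothetical type)."""
    path: str
    offset: int
    length: int


# A toy "chunk manifest": chunk grid index -> byte-range reference.
Manifest = dict[tuple[int, ...], ChunkRef]


def concat_manifests(manifests, grid_shapes, axis):
    """Concatenate manifests along `axis` by shifting chunk grid indices.

    No chunk bytes are read or copied; only the index of each reference changes.
    """
    combined: Manifest = {}
    shift = 0
    for manifest, grid_shape in zip(manifests, grid_shapes):
        for index, ref in manifest.items():
            new_index = index[:axis] + (index[axis] + shift,) + index[axis + 1:]
            combined[new_index] = ref
        shift += grid_shape[axis]
    return combined


# Two made-up files, each with a chunk grid of shape (1, 2): one time step, two x chunks.
file_a = {
    (0, 0): ChunkRef("s3://bucket/a.nc", 4096, 1000),
    (0, 1): ChunkRef("s3://bucket/a.nc", 5096, 1000),
}
file_b = {
    (0, 0): ChunkRef("s3://bucket/b.nc", 4096, 1000),
    (0, 1): ChunkRef("s3://bucket/b.nc", 5096, 1000),
}

# Concatenating along axis 0 (time) gives a (2, 2) chunk grid.
combined = concat_manifests([file_a, file_b], [(1, 2), (1, 2)], axis=0)
```

After concatenation, the references from the second file sit at grid indices `(1, 0)` and `(1, 1)`, still pointing at the original byte ranges.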
map-reduce problem, called via vz.open_virtual_mfdataset(<filepaths>)
• Serverlessly map vz.open_virtual_dataset over files as AWS Lambdas
• (Max 1000 lambdas at once, so batch this)
• Demo…
[Diagram: 🧑💻 fans out to several vz.open_virtual_dataset( ) calls, whose results are combined with xarray.concat(…)]
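The batched map-reduce pattern can be sketched with the standard library alone. Local threads stand in for Lambda invocations here, and `open_one`, `combine`, `map_reduce`, and the file names are hypothetical placeholders, not VirtualiZarr functions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

MAX_CONCURRENT = 1000  # the "max 1000 lambdas at once" limit from the slide


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def open_one(path):
    """Hypothetical stand-in for the map step (opening one file virtually)."""
    return {"path": path, "n_chunks": 1}


def combine(results):
    """Hypothetical stand-in for the reduce step (concatenating the results)."""
    return {
        "paths": [r["path"] for r in results],
        "n_chunks": sum(r["n_chunks"] for r in results),
    }


def map_reduce(paths, max_concurrent=MAX_CONCURRENT):
    results = []
    with ThreadPoolExecutor(max_workers=32) as pool:
        for batch in batched(paths, max_concurrent):
            # Each batch is fully consumed before the next one is submitted,
            # so no more than `max_concurrent` tasks are ever outstanding.
            results.extend(pool.map(open_one, batch))
    return combine(results)


paths = [f"tile_{i:06d}.nc" for i in range(2500)]  # made-up file names
summary = map_reduce(paths)
```

`pool.map` preserves input order, so the reduce step sees results in the same order as the file list, which matters when the combine step is a concatenation.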
upcoming VirtualiZarr release
• Scaling demo shows possibilities not polish - relies on some unmerged features
• But VirtualiZarr and Icechunk work together today - come to our workshop to see!
• This approach is much easier than any other I know of
• Also note TIFF + JPEG + GRIB support in VirtualiZarr is coming
• Talk to Sean Harkins @ DevSeed about that
datasets often in archival formats with Zarr-like (array) structure
Virtual Icechunk stores point at files without copying the data
Build virtual data cubes using Xarray syntax via VirtualiZarr
Icechunk allows incremental updates as new data arrives

Format                 NetCDF4   “Native” Zarr   Icechunk 🧊
# of URLs              500,000   1               1
Time to open           ~1 year   < 1 sec         < 1 sec
Storage increase       0%        100%            <0.0004%
Convert using Xarray?  N/a       Yes             Yes
Version-controlled?    No        No              Yes
Update-safe?           No        No              Yes
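To put the storage-increase figure in absolute terms, here is a rough calculation. It assumes the percentage is relative to the 50TB compressed archive and that TB means 10^12 bytes; the slides do not state either explicitly.

```python
# Assumption: the percentage in the table is relative to the 50 TB compressed archive.
ARCHIVE_BYTES = 50e12              # 50 TB, decimal units assumed
OVERHEAD_FRACTION = 0.0004 / 100   # upper bound taken from the table

max_reference_bytes = ARCHIVE_BYTES * OVERHEAD_FRACTION
print(f"{max_reference_bytes / 1e6:.0f} MB")  # on the order of a couple hundred MB
```

Under those assumptions the virtual store needs at most a few hundred megabytes of references, compared with a second 50TB copy for a “native” Zarr conversion.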
of them!
• Currently does netCDF4, HDF5, netCDF3, “native” Zarr (v3), FITS
• TIFF, COG, GRIB and more on the horizon
• Can write your own custom “reader” for a more niche format
• e.g. WIP reader for HuggingFace’s SafeTensors format for ML model weights
can you virtualize?”
• Currently some additional requirements imposed by Zarr data model
• Homogeneous chunk shapes
• Homogeneous chunk codecs (e.g. compression)
• Per-array metadata
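A simplified check for the first two requirements might look like the sketch below. `ArrayInfo` and `can_combine` are hypothetical names for illustration, the codec strings are made up, and real tools may apply more nuanced rules than strict equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayInfo:
    """Hypothetical summary of one variable as stored in one file."""
    chunk_shape: tuple[int, ...]
    codecs: tuple[str, ...]


def can_combine(arrays):
    """Return (ok, reason): whether all arrays share chunk shape and codecs."""
    first = arrays[0]
    for i, arr in enumerate(arrays[1:], start=1):
        if arr.chunk_shape != first.chunk_shape:
            return False, f"array {i}: chunk shape {arr.chunk_shape} != {first.chunk_shape}"
        if arr.codecs != first.codecs:
            return False, f"array {i}: codecs {arr.codecs} != {first.codecs}"
    return True, "ok"


same = [ArrayInfo((1, 180, 360), ("zlib",)), ArrayInfo((1, 180, 360), ("zlib",))]
mixed = [ArrayInfo((1, 180, 360), ("zlib",)), ArrayInfo((1, 90, 360), ("zlib",))]

ok_same, _ = can_combine(same)
ok_mixed, reason = can_combine(mixed)
```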
the chunks”
• Correct. That is the primary downside of this approach.
• You can choose to write native chunks alongside the virtual chunks though
• Allows you to ensure small coordinate variables are all one chunk
• Allows you to incrementally overwrite data with more suitable chunking after virtual ingestion
Format                 NetCDF4   “Native” Zarr   Kerchunk   Icechunk 🧊
# of URLs              500,000   1               1          1
Time to open           ~1 year   < 1 sec         < 1 sec    < 1 sec
Storage increase       0%        100%            0.0004%    0.0004%
Convert using Xarray?  N/a       Yes             No         Yes
Version-controlled?    No        No              No         Yes
Update-safe?           No        No              No         Yes

Kerchunk is two things:
1. Python package
• VirtualiZarr package replaces this
2. Format for storing references
• Icechunk format is an alternative
• (Though VirtualiZarr can write to both)
ation work?”
• vz.open_virtual_mfdataset accepts an Executor
• Follows concurrent.futures interface (idea from Cubed)
• Comes with a LithopsExecutor, a DaskDelayedExecutor, and works with the Python ThreadPoolExecutor
• Lithops
• Open-source package abstracting over serverless APIs of various cloud providers
• Had to build a runtime using Docker first
• But then my Python function just runs on AWS Lambdas
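The appeal of following the concurrent.futures interface is that any class implementing `submit` can be swapped in. The sketch below uses only the standard library: `SerialExecutor` and `map_over_files` are hypothetical names for illustration and are not part of VirtualiZarr.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor


class SerialExecutor(Executor):
    """Minimal executor that runs each task immediately in the calling thread."""

    def submit(self, fn, /, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


def map_over_files(fn, paths, executor_cls=SerialExecutor):
    """Hypothetical helper: apply `fn` to every path using any Executor class."""
    with executor_cls() as executor:
        # Executor.map is inherited from the base class and built on submit().
        return list(executor.map(fn, paths))


paths = ["a.nc", "bb.nc", "ccc.nc"]  # made-up file names
serial = map_over_files(len, paths)
threaded = map_over_files(len, paths, executor_cls=ThreadPoolExecutor)
```

Because the calling code only relies on `submit`, `map`, and the context-manager protocol, a serverless or Dask-backed executor can be substituted without changing it.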