
The Future of Data Driven Discovery in the Cloud

This keynote talk was given at JupyterCon in NYC.
https://www.oreilly.com/ideas/the-future-of-data-driven-discovery-in-the-cloud

What are the limiting factors to scientific progress? Of course, we always want more experiments, better measurements, and bigger simulations. But many fields, from microbiology to climate science and astronomy, are struggling to make sense of the data they already have, due to the size and complexity of modern scientific datasets. Academic science still mostly operates on a download model: datasets are downloaded and stored on-premises, where they are accessible to local computers. As datasets reach the petabyte scale, this model is breaking down. Its inefficiency presents severe barriers to reproducibility and interdisciplinary research. Without a different approach to infrastructure, it will be difficult for basic science to reap the benefits of machine learning and artificial intelligence.

Ryan Abernathey makes the case for the large-scale migration of scientific data and research to the cloud. The cloud offers a way to make the largest datasets instantly accessible to the most sophisticated computational techniques. A global scientific data commons could usher in a golden age of data-driven discovery. Drawing on his experience with the Pangeo project, Ryan demonstrates that the technology to build it mostly already exists. Jupyter, which enables scientists to interact naturally with remote systems, is a key element of the proposed infrastructure. Other important elements are flexible frameworks for interactive distributed computing (such as Dask) and cloud-optimized storage formats. The biggest challenge is social—convincing the stakeholders and funders that the benefits of migrating to the cloud outweigh the considerable costs. A partnership between academia, government, and industry could catalyze a phase transition to a vastly more productive and exciting cloud-native scientific process.

Ryan Abernathey

August 24, 2018

Transcript

  1. The Future of Data-Driven Discovery in the Cloud
     Ryan Abernathey
     Associate Professor, Columbia University / Lamont-Doherty Earth Observatory
     http://rabernat.github.io
     Physical oceanographer, dabbler in scientific Python development (xarray), founder of Pangeo
  2. What do datasets like these all have in common?
     • TBs to PBs in size
     • Produced through large, government-funded science projects
     • Cited in thousands of papers (used by thousands of scientists)
     • Ripe for new data-driven analysis methods (machine learning)
     • Trapped behind slow FTP servers, frustrating portals, and fragmented access APIs
  3. Big science from big data!
     "Global extent of rivers and streams", G. Allen & T. Pavelsky, Science, 28 Jun 2018. DOI: 10.1126/science.aat0636
     • Rivers and streams cover 45% more surface area than previously thought!
     • Major implications for the CO2 budget
     • Created by processing many TB of Landsat images
  4. What we did before (* ca. 2014)
     Data provider: FTP service, weird API, weird GUI data browser
     $ wget ftp://all/the/files/*
     $ python download_script.py
     "Let's work on something else…"
     My workstation:
     result = []
     for file in all_files:
         result.append(process(file))
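
A minimal, self-contained sketch of the download-then-loop pattern this slide describes. The FTP URL, the local file pattern, and the body of process() are placeholders, not part of the original talk.

    # Hypothetical reconstruction of the pre-cloud workflow: download
    # everything up front, then loop over the local copies one at a time.
    #
    #   $ wget -r ftp://data.provider.example/all/the/files/   # hours to days
    #   $ python download_script.py                            # provider-specific API

    import glob

    def process(path):
        # Placeholder for the per-file analysis (e.g. open the file and
        # compute a statistic); the real work depends on the dataset.
        return path

    result = []
    for file in glob.glob("downloaded_files/*.nc"):  # serial loop on the workstation
        result.append(process(file))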
  5. Dark repository*
     $ wget ftp://all/the/files/*
     "Local copy of a dataset created to enable users to actually compute on the data."
     * Balaji et al., 2018. Requirements for a global data infrastructure in support of CMIP6. Geoscientific Model Development Discussions.
  6. What we did before
     • Data has to be extracted from the remote server
     • Have to decide what data to download a priori
     • Analysis is slow: try something -> go get lunch -> check results
     • Lots of checking Facebook / Twitter
     Consequences:
     ❌ Scientists (Ph.D. students / postdocs) end up working as data engineers
     ❌ Conservative approach to science: look only for "expected" things
     ❌ Provenance of data is obscured (what if a correction is issued?)
     ❌ Nearly impossible to reproduce the full workflow (code + environment + data)
  7. What we do today
     Data provider: FTP service, weird API, weird GUI data browser
     $ wget ftp://all/the/files/*
     $ python download_script.py
     My local server / cluster: file.0001.nc, file.0002.nc, file.0003.nc, …
     My laptop:
     import dask.dataframe as ddf
     df = ddf.read_csv('file.*.csv')
     df.foo.value_counts()
     import xarray as xr
     ds = xr.open_mfdataset('file.*.nc')
     ds.mean(dim=['time', 'lon'])
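
For reference, a self-contained version of the two snippets on this slide, assuming the files have already been downloaded to local disk; the file patterns, dimension names, and the column name foo are placeholders.

    import dask.dataframe as ddf
    import xarray as xr

    # Tabular data: read every downloaded CSV lazily as one dask DataFrame,
    # then trigger the parallel computation explicitly.
    df = ddf.read_csv("file.*.csv")
    counts = df["foo"].value_counts().compute()

    # Gridded data: open every downloaded netCDF file as a single xarray
    # Dataset backed by dask arrays, then reduce over time and longitude.
    ds = xr.open_mfdataset("file.*.nc")
    mean_field = ds.mean(dim=["time", "lon"]).compute()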
  8. What we do today
     • Still have a dark repository, but…
     • Analysis is fast!
     • We can think about datasets, not files
     • We can iterate quickly and explore new ideas
     Consequences:
     ✅ Scientists spend more time being scientists
     ❌ Still constrained by what we decided to download
     ❌ Provenance of data is obscured (what if a correction is issued?)
     ❌ Nearly impossible to reproduce the full workflow (code + environment + data)
  9. Where we should go (* pangeo.io)
     Commercial cloud / large HPC:
     Object storage (data provider's buckets): data_chunk.0000, data_chunk.0001, data_chunk.0002, data_chunk.0003, …
     Compute (scientist's compute nodes): Jupyter pod + Dask pods
     My laptop:
     import dask.dataframe as ddf
     df = ddf.read_parquet('s3://bckt')
     df.foo.value_counts()
     import xarray as xr
     ds = xr.open_zarr('gs://my/bucket')
     ds.mean(dim=['time', 'lon'])
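
A sketch of how the pieces in this diagram fit together from the scientist's side. The scheduler address and bucket path are hypothetical, and it assumes gcsfs/fsspec is installed so xarray can read gs:// URLs directly.

    import xarray as xr
    from dask.distributed import Client

    # Attach the notebook session to a Dask cluster running next to the data
    # (the "Dask pods" in the diagram); the address is a placeholder.
    client = Client("tcp://dask-scheduler:8786")

    # Open a cloud-optimized (Zarr) dataset straight from object storage.
    # Only the metadata is read here; no array data moves yet.
    ds = xr.open_zarr("gs://example-bucket/example-dataset.zarr")

    # The reduction runs in parallel on the cluster, next to the data;
    # only the small result comes back to the laptop.
    result = ds.mean(dim=["time", "lon"]).compute()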
  10. pangeo.pydata.org
      Building blocks of modular big-data science gateways:
      zero2jupyterhub + parallel computing + domain software + cloud-optimized data
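
As one illustration of how those building blocks combine in a notebook on a Pangeo-style hub, the hypothetical cell below uses dask-gateway to request workers (Pangeo deployments have also used dask-kubernetes) and xarray plus Zarr for the data; the bucket path is a placeholder.

    from dask_gateway import Gateway
    import xarray as xr

    gateway = Gateway()              # hub-configured gateway address and auth
    cluster = gateway.new_cluster()  # parallel-computing block (Dask workers)
    cluster.scale(8)                 # request 8 worker pods
    client = cluster.get_client()

    # Domain-software + cloud-optimized-data blocks (xarray + Zarr).
    ds = xr.open_zarr("gs://example-bucket/example.zarr")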
  11. Cloud-native science: scalable storage and compute, connected by a fat pipe
      ✅ Scientists write expressive code to interact lazily with full datasets.
      ✅ Calculations on big datasets run at interactive speed.
      ✅ No duplication of data; the provenance chain is preserved.
      ✅ Puts the curiosity, discovery, and fun back into science!
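
To make "interact lazily" concrete, here is a small illustration using synthetic data; the same pattern applies unchanged to a multi-terabyte Zarr store in the cloud. The variable names and array sizes are made up.

    import numpy as np
    import pandas as pd
    import xarray as xr

    # Synthetic stand-in for a large cloud dataset, chunked with dask.
    ds = xr.Dataset(
        {"sst": (("time", "lat", "lon"), np.random.rand(365, 180, 360))},
        coords={"time": pd.date_range("2000-01-01", periods=365)},
    ).chunk({"time": 30})

    monthly = ds["sst"].groupby("time.month").mean()  # builds a task graph only
    result = monthly.compute()                        # work happens here, in parallel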
  12. Government HPC vs. Commercial Cloud
      Access:
        HPC:   ✅ Available to all federally funded projects  ❌ Available only to federally funded projects
        Cloud: ✅ Available globally to anyone with a credit card  ❌ Authentication is not integrated with existing research infrastructure
      Cost:
        HPC:   ✅ Cost is hidden from researchers and borne by funding agencies  ❌ Allocations, quotas, limits
        Cloud: ❌ Cost is borne by individual researchers and hidden from funding agencies  ✅ Economies of scale, unlimited resources
      Compute:
        HPC:   ✅ Homogeneous, high-performance nodes  ❌ Queues, batch scheduling, ssh access  ❌ Fixed-size compute
        Cloud: ✅ Flexible hardware (big, small, GPU)  ✅ Instant provisioning of unlimited resources  ✅ Spot market: burstable, volatile
      Storage:
        HPC:   ✅ Fast parallel filesystems (e.g. GPFS)
        Cloud: ✅ Fast object storage
  13. pangeo.io
      Join the discussion: http://github.com/pangeo-data/pangeo/issues