MAST archive and operations in the MMA era

Arfon Smith, Data Science Mission Office
EXPANDING THE FRONTIERS OF SPACE ASTRONOMY

STScI: Data Science Mission Office
• Raise MAST archive (and data management) to ‘Mission’ status
• Responsible for DMS portfolio for all missions (HST, JWST, Kepler/K2, TESS, WFIRST, etc.)
• Exploring new technologies, services and infrastructure for data management
• Developing community expertise in combining data science and astronomy
https://archive.stsci.edu/reports/BigDataSDTReport_Final.pdf

Three things happening at STScI/MAST that might be interesting…
1. Science platforms as environments for archival data analysis and transient follow-up.
2. Event-driven ‘serverless’ architectures for data processing.
3. Community contribution at all levels of the data management infrastructure.

1. Science platforms as environments for archival data analysis and transient follow-up.

MAST: Multi-mission archive

Archives & Services: Download data, compute locally
[Diagram labels: Data, Raw data, Calibration pipelines, Software, TOPCAT, Services, CasJobs, VO services, JSON API]
http://archive.stsci.edu
http://mast.stsci.edu

MAST: Archival publication rates

Science Platforms
[Diagram labels: Data, Tools, Compute, APIs, Web Portals, Notebooks, Internet]
A Science Platform is an environment which combines data storage, computational capabilities, software tools and interfaces for users to interact with the underlying components.

Key part of LSST data management system

Science Platforms aka ‘server-side analytics’
“Notebook-like interface integrated with astronomical data services” - Gregory Dubois-Felsmann (LSST DM)

BYOS: Shareable, composable computational environments
Composable machine images: FROM lsstsqre/pipeline

CADC, DES, ESAC, IPAC, JHU, LSST, NASA, NCSA, NDS, NED, NOAO, STScI
https://github.com/spacetelescope/science-platforms-workshop

Cloud-hosted data analysis environment

Some reasons Science Platforms are exciting
• Being developed by many major projects & archives (MAST, IPAC, LSST, ESAC, CADC, NOAO…) - in a semi-coordinated fashion.
• Provide access to large, high-value datasets and the ability to compute against them (server-side analytics).
• Potentially provide access to substantial scalable compute resources including GPUs.
• Leverages existing programmatic interfaces to astronomical archives (e.g. VO services and other APIs).
• Convergence of technologies/conventions (notebook-driven analyses) for repeatable, reliable data exploration and analysis.
• Potential environment for transient event analyses and broker development.

Notebook-driven analysis: Not just academia

Broad, rich ecosystem
https://speakerdeck.com/jakevdp/the-unexpected-effectiveness-of-python-in-science

2. Event-driven ‘serverless’ architectures for data processing.

Serverless/Function as a Service (FaaS) computing
• Write a function (e.g. in Python, C++, Julia, Haskell, Fortran…)
• Upload the function to a cloud computing platform (AWS Lambda, Google Cloud Functions, Azure Functions, Apache OpenWhisk)
• Define resources required for cloud function to execute (CPU/RAM)
• Trigger function based on event rules (e.g. event posted to API or appearing in event stream)
• SCIENCE!
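The "write a function" step above can be sketched in Python as follows. This is a minimal illustration rather than anything from the talk: the `handler(event, context)` entry point follows the AWS Lambda Python convention, while `measure_brightness` and the event fields `image_id` and `pixels` are hypothetical names chosen for the example.

```python
import json


def measure_brightness(pixels):
    """Hypothetical 'business logic': the mean value of a list of pixels."""
    return sum(pixels) / len(pixels)


def handler(event, context=None):
    """Entry point in the AWS Lambda Python style: handler(event, context).

    The event shape {"image_id": ..., "pixels": [...]} is an assumption
    made for this sketch, not a schema defined by MAST or AWS.
    """
    mean = measure_brightness(event["pixels"])
    return {
        "statusCode": 200,
        "body": json.dumps({"image_id": event["image_id"], "mean": mean}),
    }


if __name__ == "__main__":
    # Local call standing in for the platform's event trigger.
    print(handler({"image_id": "example-001", "pixels": [1.0, 2.0, 3.0]}))
```

On a real platform the function would be uploaded along with its resource settings (CPU/RAM), and the event would come from a trigger rule instead of a direct call.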

Hubble public data in the (AWS) cloud
~140TB public HST data from ACS, COS, STIS, WFC3, WFPC2

Hubble public data in the (AWS) cloud
Robust, programmatic access to cloud-hosted data

Hubble public data in the (AWS) cloud
MAST Labs exploratory technical blog: mast-labs.stsci.io

Next-generation serverless pipeline processing
“…In this post we’re going to show you how to process 122,000 WFC3/IR images on AWS Lambda in about 2 minutes (and for about $2)”
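Taken at face value, the quoted figures work out to roughly 1,000 images per second and under two thousandths of a cent per image. The short sketch below does that arithmetic. The per-image runtime passed to `min_concurrency` is a hypothetical input, since the slide does not state how long a single image takes.

```python
import math

N_IMAGES = 122_000       # from the slide
WALL_SECONDS = 2 * 60    # "about 2 minutes"
TOTAL_COST_USD = 2.0     # "about $2"


def throughput(n_images, seconds):
    """Average images processed per second of wall-clock time."""
    return n_images / seconds


def cost_per_image(total_cost, n_images):
    """Average cost of processing one image, in USD."""
    return total_cost / n_images


def min_concurrency(n_images, seconds, seconds_per_image):
    """Fewest simultaneous workers needed to finish within `seconds`,
    assuming every image takes `seconds_per_image` (a hypothetical figure)."""
    return math.ceil(n_images * seconds_per_image / seconds)


if __name__ == "__main__":
    print(f"{throughput(N_IMAGES, WALL_SECONDS):.0f} images/s")
    print(f"${cost_per_image(TOTAL_COST_USD, N_IMAGES):.7f} per image")
    # Assumed 1 s per image, purely for illustration.
    print(min_concurrency(N_IMAGES, WALL_SECONDS, 1.0), "concurrent workers")
```

The concurrency estimate shows why this kind of job suits a platform that can run very many function instances at once.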

NISAR: 85TB/day

Some reasons we’re excited about serverless computing
• Allows engineers & astronomers to focus on ‘business logic’ of their analysis rather than thinking about infrastructure.
• Can be triggered from multiple settings (e.g. automated background tasks or inline analysis steps)
• Event-driven & responsive - can be very cost effective.
• Potentially interesting for more data & compute intensive archive functionalities.
• SCALE: Makes massively parallel computations easy*…
* Easy to shoot yourself in the foot too
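One way to read the "easy, but easy to shoot yourself in the foot" point is that fanning out is trivial to write, so it helps to cap how many tasks are in flight. The sketch below uses a local thread pool as a stand-in for cloud function invocations. `process_image`, `fan_out` and the `max_workers` default are hypothetical choices for illustration, not part of any MAST service.

```python
from concurrent.futures import ThreadPoolExecutor


def process_image(image_id):
    """Hypothetical per-image task; a real one would invoke a cloud function."""
    return image_id, len(image_id)


def fan_out(image_ids, max_workers=8):
    """Run process_image over all ids with at most `max_workers` in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and yields (id, result) pairs.
        return dict(pool.map(process_image, image_ids))


if __name__ == "__main__":
    print(fan_out(["a", "bb", "ccc"], max_workers=2))
```

Capping concurrency is one guard against an accidental flood of invocations. Which limit is sensible depends on the platform and on whatever services the function calls.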

3. Community contribution at all levels of the data management infrastructure.

Archive, Services, Software
[Diagram labels: Data, Raw data, Calibration pipelines, Software, TOPCAT, Services, CasJobs, VO services, JSON API]
http://archive.stsci.edu
http://mast.stsci.edu

Status quo (until relatively recently)
[Diagram labels: CasJobs, VO services, JSON API, TOPCAT]
http://archive.stsci.edu
http://mast.stsci.edu
Traditionally thought of as science center activities

Community contributions at all levels
[Diagram labels: Data, Software, Services; Community alert brokers/agents; Community-built services; ‘L3’ data products; Community software (e.g. Astropy); Community software + L3/L4/L5 pipelines; L3/L4/L5 data products (HLSPs)]
Reliance on community contributions at all levels

Community software

Change in the way technology is created

Community contributions / co-creation of technology
• Open source is now the ‘new normal’ in many sectors (especially data science)
• What might the different roles be for projects/facilities, science teams, individuals?
• Communities often form around shared challenges, shared data products
• Easier to recognize innovations created by others when working with similar data

Community software initiative
• Core Infrastructure
  • Contributing to core, shared libraries (e.g. FITS, coordinate systems)
• Community Outreach & Support
  • User Support
  • Documentation
• Emerging efforts
  • LSST Photometry
  • JWST NIRSpec
  • MAST Astroquery

Thanks!
https://mast-labs.stsci.io
