Research Computing Intro Slides

RESEARCH COMPUTING IN EARTH SCIENCE EESC-GR6901 | Fall 2018 |
Department of Earth and Environmental Science Instructors Ryan Abernathey ([email protected])  TA: Xiaomeng Jin ([email protected]) Meeting Time Tuesday / Thursday 10:10am - 11:25am  506 Schermerhorn Prerequisites DEES grad student status or instructor permission. Access to a laptop. Website https://rabernat.github.io/research_computing_2018/

What drives progress in geoscience? New Ideas New Observations New
Simulations E 5 r 0 jUj p ðN/jUj jfj/jUj P 1D (k) ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi N2 2 jUj2k2 q ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi jUj2k2 2 f2 q dk, (3) where k 5 (k, l) is now the wavenumber in the reference frame along and across the mean flow U and P 1D (k) 5 1 2p ð1‘ 2‘ jkj jkj P 2D (k, l) dl (4) is the effective one-dimensional (1D) topographic spectrum. Hence, the wave radiation from 2D topography reduces to an equivalent problem of wave radiation from 1D topography with the effective spectrum given by P1D (k). The effective 1D spectrum captures the effects of 2D topography on lee-wave radiation in the subcritical topography limit, that is, P2D (k, l) and P1D (k) result into identical radiation estimates for small steepness pa- rameters. However, the suppression of wave radiation in the critical topography limit is different in 1D and 2D; c. Bottom topography Simulations are configured with multiscale topography characterized by small-scale abyssal hills a few ki- lometers wide based on multibeam observations from Drake Passage. The topographic spectrum associated with abyssal hills is well described by an anisotropic parametric representation proposed by Goff and Jordan (1988): P 2D (k, l) 5 2pH2(m 2 2) k 0 l 0 1 1 k2 k2 0 1 l2 l2 0 !2m/2 , (5) where k0 and l0 set the wavenumbers of the large hills, m is the high-wavenumber spectral slope, related to the pa- rameter n used in Goff and Jordan (1988) as m 5 2(n 1 1), and H is the root-mean-square (rms) topographic height. Nikurashin and Ferrari (2010b) estimated with a least squares fit to multibeam data that representative values in the Drake Passage region are k0 5 2.3 3 1024 m21, l0 5 24 21 FIG. 3. Averaged profiles of (left) stratification (s21) and (right) flow speed (m s21) in the bottom 2 km from observations (gray), initial condition in the simulations (black), and final state in 2D (blue) and 3D (red) simulations.

Anatomy of a data analysis pipeline in computational oceanography Goal:
identify coherent Lagrangian vortices from satellite data Input Data: sea-surface height satellite measurements (geostrophic velocity) Desired Outputs: statistics of vortex position, size, etc.

Analysis Pipeline Step 1: Process original geostrophic velocity data into
format needed by model download 1000s of netCDF4 files  (~100 GB) .nc interpolate from 1/4 to 1/10 degree grid perform Helmholtz decomposition and remove divergence .bin write custom binary file format  (~300 GB) Tools: Step 2: use ocean model to simulate trajectories of water parcels dX dt = u MITgcm customize, compile and configure FORTRAN model on HPC cluster use model solve the Lagrangian advection equation numerically for a mesh of 32 million particles sort and reformat output particle data, write netCDF  (~10 TB) .nc

Analysis Pipeline Step 3: Apply Lagrangian vortex identification algorithm apply
techniques from image- processing in customized way validate results and save individual vortex data in .json files Tools: Step 4: perform high-level visualization and statistical analysis injest .json data into an object database rotation tensor ⇥t t0 is not dynamically consistent because (16) exhibits the same memory ussed for (12). ynamic consistency of t t0 implies that the total angle swept by this tensor around its of rotation is dynamically consistent. This angle t t0 (x 0 ), called intrinsic rotation angle 1)), therefore satisfies t t0 (x 0 ) = t s (x 0 ) + s t0 (x 0 ), s, t 2 [t 0 , t 1 ]. on, as shown in Haller (2015b), t t0 (x 0 ) is objective both in two and three dimensions. In nsions, even the tensor t t0 itself turns out to be objective, not just its associated scalar 0 ). the results obtained in Haller (2015b), the intrinsic dynamic rotation t t0 (x 0 ) can be as t t0 (x 0 ) = 1 2 LAVDt t0 (x 0 ), (17) Lagrangian-Averaged Vorticity Deviation (LAVD) defined here as LAVDt t0 (x 0 ) := Z t t0 |!(x(s; x 0 ), s) ¯ !(s)| ds. (18) tivity of t t0 and LAVD can be confirmed directly from formula (6). Indeed, under a observer change x = Q(t)y + b(t), the transformed vorticity ˜ !(y, t) satisfies !(y(s), s) ˜ ¯ !(s)| = QT (s)!(x(s), s) + QT (t) ˙ q(t) QT (s)¯ !(s) + QT (t) ˙ q(t) = QT (s) [!(x(s), s) ¯ !(s)] = |!(x(s), s) ¯ !(s)| , (19) he rotation matrix QT (s) preserves the length of vectors. We summarize the results of n as follows: 1. For an infinitesimal fluid volume starting from x 0 , the LAVDt t0 (x 0 ) field is a dynami- stent and objective measure of bulk material rotation relative to the spatial mean-rotation d volume U(t). Specifically, LAVDt t0 (x 0 ) is twice the intrinsic dynamic rotation angle by the relative rotation tensor t t0 . The latter tensor is obtained from the dynamically decomposition Ft t0 = t t0 ⇥t t0 Mt t0 (20) 6 evaluate complicated mathematical formulas make pretty maps of vortex data aggregate and plot statistics of vortex properties

Computers Involved my laptop: the only screen I ever see,
used for “small data” analysis, code development, and connecting to other computers. Low storage (~100 GB) and memory (16 GB), compute (4 cores) group server: higher storage capacity (100 TB), memory (1 TB), and compute (32 cores) used for interactive data analysis, batch processing, shared! cluster: many nodes can operate in parallel to accomplish massive computing tasks (1000s of cores). Coordinating this requires special software. queue system

Code Written / Used • MITgcm: a large, community-developed FORTRAN
ocean model  http://mitgcm.org/ • Our customizations to MITgcm:  https://github.com/rabernat/mitgcm_2D_global • Floater: a python package for reading and writing Lagrangian float data and performing specialized calculations related to Lagrangian Coherent structures:  https://github.com/rabernat/floater • The project-specific python pipelines for processing and labeling the float data: https://github.com/rabernat/global_rclv • A javascript / python web app for interactively viewing eddy positions:  https://github.com/rabernat/eddy_map @rabernat @anirban89 @nathantieltarshsish @geosciz

Challenges in Research Computing • Complexity  The things we need
to do are very complex and hard! • Reproducibility  It’s not easy to reproduce others’ work (or even our own!) • Data Size  It’s hard to move around many TB of data.  

Problem: complexity Solution: write as little code as possible!

source: stackoverﬂow.com choose a high level programming language with a 
healthy, active community and a broad range of packages Python, R, Julia

aosp y SciP y Credit: Stephan Hoyer, Jake Vanderplas (SciPy
2015) Scientiﬁc Python “Ecosystem”

Anatomy of an open-source software package https://github.com/pydata/xarray

Problem: reproducibility Solution: learn and embrace reproducible practices

–Donoho, D. et al. (2009), Reproducible research in computational harmonic
analysis, Comp. Sci. Eng. 11(1):8–18, doi: 10.1109/MCSE.2009.15 “an article about computational science … is not the scholarship itself, it’s merely scholarship advertisement. The actual scholarship is the complete software development environment and the complete set of instructions which generated the ﬁgures.”

1. For every result, keep track of how it was
produced 2. Avoid manual data-manipulation steps 3. Archive the exact versions of all external programs used 4. Version-control all custom scripts 5. Record all intermediate results, when possible in standard formats 6. For analyses that include randomness, note underlying random seeds 7. Always store raw data behind plots 8. Generate hierarchical analysis output, allowing layers of increasing detail to be inspected 9. Connect textual statements to underlying results 10. Provide public access to scripts, runs, and results Sandve, G. K. et al. (2013), Ten simple rules for reproducible computational research, PLOS Comp. Bio. (editorial), Vol. 9(10):1–4, doi: 10.1371/journal.pcbi.1003285

Problem: big data Solution: learn how to use HPC and
cloud computing

Big science from big data! Global extent of rivers and
streams G. Allen & T. Pavelsky,  Science 28 Jun 2018. DOI: 10.1126/science.aat0636 • Water covers 45% more surface area than previously thought! • Major implications for CO2 budget • Created by processing many TB of Landsat images

What do datasets like these all have in common? •
TBs - PBs in size • Produced through large, government-funded science projects • Cited in thousands of papers (used by thousands of scientists) • Ripe for new data-driven analysis methods (machine learning) • Trapped behind slow FTP servers, frustrating portals, and fragmented access APIs

The old way data provider FTP Service $ wget ftp://all/the/files/*
$ python download_script.py Weird API Weird GUI data browser Let’s work on something else… ? My Workstation * ca. 2014 result = [] for file in all_files: result.append(process(file ))

Dark Repository* $ wget ftp://all/the/files/* * Balaji et al., 2018.
Requirements for a global data infrastructure in support of CMIP6. Geoscientiﬁc Model Development Discussions. “Local copy of a dataset created to enable users to actually compute on the data.”

• Data has to be extracted from the remote server
• Have to decide what data to download a priori • Analysis is slow • Try something -> go get lunch -> check results • Lots of checking Facebook / Twitter The old way Consequences: ❌ Scientists (Ph.D. students / postdocs) are actually being data engineers ❌ Conservative approach to science, look for “expected” things ❌ Provenance of data is obscured (what if a correction is issued?) ❌ Nearly impossible to reproduce the full workﬂow (code + environment + data)

What we do today data provider FTP Service $ wget
ftp://all/the/files/* my local server / cluster $ python download_script.py Weird API Weird GUI data browser file.0001.nc file.0002.nc file.0003.nc My Laptop import dask.dataframe as ddf df = ddf.read_csv('file.*.csv') df.foo.value_counts() import xarray as xr ds = xr.open_mfdataset('file.*.nc') ds.mean(dim=[‘time', ‘lon'])

What we do today • Still have a dark repository,
but… • Analysis is fast! • We can think about datasets, not ﬁles • We can iterate quickly and explore new ideas Consequences: ✅ Scientists spend more time being scientists ❌ Still constrained by what we decided to download ❌ Provenance of data is obscured (what if a correction is issued?) ❌ Nearly impossible to reproduce the full workﬂow (code + environment + data)

Where we should go commercial cloud / large HPC data_chunk.0000
My Laptop import dask.dataframe as ddf df = ddf.read_parquet(‘s3://bckt‘) df.foo.value_counts() import xarray as xr ds = xr.open_zarr(‘gs://my/bucket‘) ds.mean(dim=[‘time', ‘lon']) object storage data provider’s buckets data_chunk.0001 data_chunk.0002 data_chunk.0003 compute scientist’s compute nodes Dask pod Dask pod Dask pod Jupyter pod * pangeo.io

pangeo.io zero2jupyterhub + + + parallel computing domain software cloud-optimized
data building blocks of modular big data science gateways

✅ Scientists write expressive code to interact lazily with full
datasets. ✅ Calculations on big datasets run at interactive speed. ✅ No duplication of data, provenance chain is preserved. ✅ Puts the curiousity, discovery, and fun back into science! Cloud-native science scalable storage compute fat pipe

JupyterHub for our course: https://rces.pangeo.io Temporary url:  http://104.198.65.235/

Learning Goals • be fluent in the basics of Unix-based
workflows (command line / file system) • feel comfortable navigating JupyterHub Environment • be able to construct complete, well-structured programs in Python • read and write most common geoscience data formats • perform basic exploratory data analysis on Earth Science data • “Tabular data”: rows and columns • “Gridded data”: multidimensional numerical arrays • use visualization to enhance interpretation of Earth Science data, including maps and interactive visualizations • practice reproducible research through version control  

Grading Grading: Weekly assignment (70%), ﬁnal project (30%) Assignment Grading
Rubric: • Total: 100 • All questions complete: 50 • All questions correct: 30  (e.g., if there are 10 questions, each questions has 5 points for completeness and 3 points for correctness) • Clean, elegant, eﬃcient code: rate between 0 and 10 • Clear comments and explanations: rate between 0 and 10 • Late penalty: -20 per day (24 hrs)

Final Project • A project that uses one or more
of the programming languages covered in this course to carry out an extensive data analysis, modeling or visualization task. • You will be required to brieﬂy present and demo your ﬁnal project during the last week of classes. • More details to be given soon.

Plagiarism • “Plagiarism is the unacknowledged use of the work
of others. It comes from the Latin word plagiaries, meaning ‘kidnapper’.” - https://www.college.columbia.edu/ academics/academicdishonesty • Copying someone else's code and turning it in as your own work for an assignment or for the ﬁnal project is plagiarism • You will receive 0 credit for plagiarized work • However, open source software is all about reuse. The key is the License:  https://choosealicense.com/licenses/ • Since the purpose of this course is to teach you how to program, you are required to write your own original codes for the assignments and ﬁnal project.

date topic Assignment T 4 Sept Course introduction, Overview of
JupyterLab Th 6 Sept Core Python Language T 11 Sept in class partner activity Th 13 Sept Python Functions and Classes assignment 1 due T 18 Sept in class partner activity Th 20 Sept Numpy and Matplotlib I assignment 2 due T 25 Sept in class partner activity Th 27 Sept Exploring the Scipy Library assignment 3 due T 2 Oct in class partner activity Th 4 Oct Pandas for Tabular Data I: Basics assignment 4 due T 9 Oct in class partner activity Th 11 Oct Pandas for Tabular Data II: Advanced Pandas assignment 5 due T 16 Oct in class partner activity Th 18 Oct XArray for Multidimensional Data I: Basics assignment 6 due T 23 Oct in class partner activity Th 25 Oct XArray for Multidimensional Data II: Intermediate assignment 7 due T 30 Oct in class partner activity Th 1 Nov Using python on your computer assignment 8 due T 6 Nov NO CLASS Th 8 Nov Making maps with Basemap T 13 Nov in class partner activity Th 15 Nov Other mapping packages assignment 9 due T 20 Nov in class partner activity Th 22 Nov NO CLASS T 27 Nov Geoscience specifc packages assignment 10 due Th 29 Nov TBD T 4 Dec TBD Th 6 Dec TBD

Course Website https://rabernat.github.io/research_computing_2018 Assignments Lecture Notes Schedule

Class Structure Thursday Tuesday interactive lecture led by instructor (follow
along on laptop) work on group assignment work on solo assignment weekly hw due new topic begins new hw assigned

Bring your laptop to class The best way to learn
how to code is to start coding, so bring your laptop to class so you can be practicing the various coding commands while we present them. You will also need your laptop to work on the assignments during the in- class activity days.

baseline survey https://tinyurl.com/rces-survey

create your student GitHub account: https://education.github.com/pack Tell us your GitHub
username:  https://tinyurl.com/rces-github

JupyterHub for our course: https://rces.pangeo.io Temporary url:  http://104.198.65.235/

Research Computing Intro Slides

Research Computing Intro Slides

More Decks by Ryan Abernathey

Other Decks in Science

Featured

Transcript