
Research Computing Intro Slides

Ryan Abernathey
September 04, 2018

My introductory slides to the 2018 iteration of Research Computing in Earth Science.

Transcript

  1. RESEARCH COMPUTING IN EARTH SCIENCE
     EESC-GR6901 | Fall 2018 | Department of Earth and Environmental Science
     Instructor: Ryan Abernathey ([email protected])
     TA: Xiaomeng Jin ([email protected])
     Meeting Time: Tuesday / Thursday, 10:10am - 11:25am, 506 Schermerhorn
     Prerequisites: DEES grad student status or instructor permission. Access to a laptop.
     Website: https://rabernat.github.io/research_computing_2018/
  2. What drives progress in geoscience? New Ideas | New Observations | New Simulations
     [Slide shows a journal-article excerpt on lee-wave radiation from multiscale bottom topography: wave radiation from 2D topography reduces to an equivalent 1D problem with an effective spectrum P1D(k); abyssal-hill topography a few kilometers wide, based on multibeam observations from Drake Passage, follows the anisotropic Goff and Jordan (1988) spectrum P2D(k, l) = 2*pi*H^2*(mu - 2) / (k0*l0) * (1 + k^2/k0^2 + l^2/l0^2)^(-mu/2), with representative Drake Passage parameters estimated by Nikurashin and Ferrari (2010b). Fig. 3: averaged profiles of stratification (s^-1) and flow speed (m s^-1) in the bottom 2 km from observations (gray), initial condition in the simulations (black), and final state in 2D (blue) and 3D (red) simulations.]
  3. Anatomy of a data analysis pipeline in computational oceanography
     Goal: identify coherent Lagrangian vortices from satellite data
     Input Data: sea-surface height satellite measurements (geostrophic velocity)
     Desired Outputs: statistics of vortex position, size, etc.
  4. Analysis Pipeline
     Step 1: Process original geostrophic velocity data into the format needed by the model
     • download 1000s of netCDF4 files (~100 GB, .nc)
     • interpolate from 1/4-degree to 1/10-degree grid
     • perform Helmholtz decomposition and remove divergence
     • write custom binary file format (~300 GB, .bin)
     Step 2: Use an ocean model to simulate trajectories of water parcels (dX/dt = u)
     • customize, compile, and configure the MITgcm FORTRAN model on an HPC cluster
     • use the model to solve the Lagrangian advection equation numerically for a mesh of 32 million particles
     • sort and reformat output particle data, write netCDF (~10 TB, .nc)
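Step 2's governing equation, dX/dt = u, can be sketched with a minimal forward-Euler integrator. This is a toy stand-in for MITgcm's particle advection, not the real pipeline; the solid-body-rotation velocity field below is illustrative, chosen only because its exact trajectories (circles) are easy to check.

```python
import math

def advect(x, y, velocity, dt, nsteps):
    """Integrate dX/dt = u(X) for one particle with forward Euler."""
    for _ in range(nsteps):
        ux, uy = velocity(x, y)
        x += ux * dt
        y += uy * dt
    return x, y

def solid_body_rotation(x, y):
    # Illustrative velocity field: counter-clockwise rotation about the origin.
    return -y, x

# A particle starting at (1, 0) traces one full circle of radius ~1.
nsteps = int(2 * math.pi / 0.001)
xf, yf = advect(1.0, 0.0, solid_body_rotation, dt=0.001, nsteps=nsteps)
print(round(math.hypot(xf, yf), 2))  # radius stays close to 1.0
```

Forward Euler slowly inflates the circle, which is why real trajectory codes use higher-order schemes; for 32 million particles the loop above would also be vectorized rather than run per particle.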
  5. Analysis Pipeline
     Step 3: Apply Lagrangian vortex identification algorithm
     • apply techniques from image processing in a customized way
     • validate results and save individual vortex data in .json files
     Step 4: Perform high-level visualization and statistical analysis
     • ingest .json data into an object database
     • evaluate complicated mathematical formulas
     • make pretty maps of vortex data
     • aggregate and plot statistics of vortex properties
     [Slide shows a journal-article excerpt (Haller, 2015b) defining the Lagrangian-Averaged Vorticity Deviation, LAVD(x0; t0, t) = integral from t0 to t of |omega(x(s; x0), s) - mean omega(s)| ds, an objective, dynamically consistent measure of bulk material rotation: twice the intrinsic rotation angle swept by the relative rotation tensor.]
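The LAVD integral referenced on this slide discretizes naturally as a time sum along particle trajectories. A minimal numpy sketch, using synthetic vorticity values rather than the real satellite-derived fields:

```python
import numpy as np

def lavd(omega, dt):
    """Discrete Lagrangian-Averaged Vorticity Deviation.

    omega: array of shape (ntime, nparticles), vorticity sampled along
    each particle trajectory. Returns, per particle, the time integral
    of |omega - spatial mean vorticity|.
    """
    deviation = np.abs(omega - omega.mean(axis=1, keepdims=True))
    return deviation.sum(axis=0) * dt

# Synthetic example: one particle spins at 2.0, the other at 0.0, so both
# deviate from the spatial mean (1.0) by 1.0 at every one of 10 steps.
omega = np.tile([2.0, 0.0], (10, 1))
print(lavd(omega, dt=0.1))  # -> [1. 1.]
```

Particles whose LAVD is a local maximum, surrounded by closed LAVD contours, are the candidate coherent vortices that the image-processing step then extracts.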
  6. Computers Involved
     • my laptop: the only screen I ever see; used for “small data” analysis, code development, and connecting to other computers. Low storage (~100 GB), memory (16 GB), and compute (4 cores).
     • group server: higher storage capacity (100 TB), memory (1 TB), and compute (32 cores); used for interactive data analysis and batch processing; shared!
     • cluster: many nodes can operate in parallel to accomplish massive computing tasks (1000s of cores); coordinating this requires special software (queue system).
  7. Code Written / Used
     • MITgcm: a large, community-developed FORTRAN ocean model: http://mitgcm.org/
     • Our customizations to MITgcm: https://github.com/rabernat/mitgcm_2D_global
     • Floater: a Python package for reading and writing Lagrangian float data and performing specialized calculations related to Lagrangian coherent structures: https://github.com/rabernat/floater
     • The project-specific Python pipelines for processing and labeling the float data: https://github.com/rabernat/global_rclv
     • A JavaScript / Python web app for interactively viewing eddy positions: https://github.com/rabernat/eddy_map
     @rabernat @anirban89 @nathantieltarshsish @geosciz
  8. Challenges in Research Computing
     • Complexity: The things we need to do are very complex and hard!
     • Reproducibility: It’s not easy to reproduce others’ work (or even our own!)
     • Data Size: It’s hard to move around many TB of data.

  9. Choose a high-level programming language with a healthy, active community and a broad range of packages: Python, R, Julia
     (source: stackoverflow.com)
  10. Scientific Python “Ecosystem”
     Credit: Stephan Hoyer, Jake Vanderplas (SciPy 2015)
  11. “an article about computational science … is not the scholarship itself, it’s merely scholarship advertisement. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”
     –Donoho, D. et al. (2009), Reproducible research in computational harmonic analysis, Comp. Sci. Eng. 11(1):8-18, doi: 10.1109/MCSE.2009.15
  12. Ten simple rules for reproducible computational research:
     1. For every result, keep track of how it was produced
     2. Avoid manual data-manipulation steps
     3. Archive the exact versions of all external programs used
     4. Version-control all custom scripts
     5. Record all intermediate results, when possible in standard formats
     6. For analyses that include randomness, note underlying random seeds
     7. Always store raw data behind plots
     8. Generate hierarchical analysis output, allowing layers of increasing detail to be inspected
     9. Connect textual statements to underlying results
     10. Provide public access to scripts, runs, and results
     Sandve, G. K. et al. (2013), Ten simple rules for reproducible computational research, PLOS Comp. Bio. (editorial), Vol. 9(10):1-4, doi: 10.1371/journal.pcbi.1003285
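Rules 3 and 6 are easy to automate at the top of an analysis script. A minimal sketch, using only the standard library; the manifest filename and fields are illustrative, not prescribed by the rules:

```python
import json
import platform
import random
import sys

def start_run(seed=42, manifest_path="run_manifest.json"):
    """Fix the random seed (rule 6) and record the environment (rule 3)."""
    random.seed(seed)
    manifest = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "random_seed": seed,
    }
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

manifest = start_run(seed=42)
# With the seed fixed and recorded, any draw below is exactly repeatable.
first_draw = random.random()
```

Committing the manifest alongside the outputs (rule 4) lets a future reader re-run the analysis under the same seed and compare environments.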
  13. Big science from big data!
     Global extent of rivers and streams: G. Allen & T. Pavelsky, Science, 28 Jun 2018, DOI: 10.1126/science.aat0636
     • Water covers 45% more surface area than previously thought!
     • Major implications for the CO2 budget
     • Created by processing many TB of Landsat images
  14. What do datasets like these all have in common?
     • TBs - PBs in size
     • Produced through large, government-funded science projects
     • Cited in thousands of papers (used by thousands of scientists)
     • Ripe for new data-driven analysis methods (machine learning)
     • Trapped behind slow FTP servers, frustrating portals, and fragmented access APIs
  15. The old way
     data provider: FTP Service | Weird API | Weird GUI data browser
     My Workstation (ca. 2014):
     $ wget ftp://all/the/files/*
     $ python download_script.py
     result = []
     for file in all_files:
         result.append(process(file))
     “Let’s work on something else… ?”
  16. Dark Repository*
     $ wget ftp://all/the/files/*
     “Local copy of a dataset created to enable users to actually compute on the data.”
     * Balaji et al., 2018. Requirements for a global data infrastructure in support of CMIP6. Geoscientific Model Development Discussions.
  17. The old way
     • Data has to be extracted from the remote server
     • Have to decide what data to download a priori
     • Analysis is slow: try something -> go get lunch -> check results; lots of checking Facebook / Twitter
     Consequences:
     ❌ Scientists (Ph.D. students / postdocs) are actually being data engineers
     ❌ Conservative approach to science: look for “expected” things
     ❌ Provenance of data is obscured (what if a correction is issued?)
     ❌ Nearly impossible to reproduce the full workflow (code + environment + data)
  18. What we do today
     data provider: FTP Service | Weird API | Weird GUI data browser
     my local server / cluster:
     $ wget ftp://all/the/files/*
     $ python download_script.py
     file.0001.nc file.0002.nc file.0003.nc
     My Laptop:
     import dask.dataframe as ddf
     df = ddf.read_csv('file.*.csv')
     df.foo.value_counts()
     import xarray as xr
     ds = xr.open_mfdataset('file.*.nc')
     ds.mean(dim=['time', 'lon'])
  19. What we do today
     • Still have a dark repository, but…
     • Analysis is fast!
     • We can think about datasets, not files
     • We can iterate quickly and explore new ideas
     Consequences:
     ✅ Scientists spend more time being scientists
     ❌ Still constrained by what we decided to download
     ❌ Provenance of data is obscured (what if a correction is issued?)
     ❌ Nearly impossible to reproduce the full workflow (code + environment + data)
  20. Where we should go*
     commercial cloud / large HPC
     object storage (data provider’s buckets): data_chunk.0000 data_chunk.0001 data_chunk.0002 data_chunk.0003
     compute (scientist’s compute nodes): Dask pod | Dask pod | Dask pod | Jupyter pod
     My Laptop:
     import dask.dataframe as ddf
     df = ddf.read_parquet('s3://bckt')
     df.foo.value_counts()
     import xarray as xr
     ds = xr.open_zarr('gs://my/bucket')
     ds.mean(dim=['time', 'lon'])
     * pangeo.io
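The chunked-object-storage idea behind this slide can be sketched in plain Python, with no dask or zarr required. The bucket and chunk names here are illustrative; real object stores hold compressed array chunks that dask workers fetch in parallel:

```python
import numpy as np

# Pretend each "object" in the bucket holds one chunk of a big array.
bucket = {f"data_chunk.{i:04d}": np.full(5, float(i)) for i in range(4)}

def lazy_chunks(store):
    """Yield chunks on demand instead of downloading the whole dataset."""
    for key in sorted(store):
        yield store[key]

# A streaming mean touches one chunk at a time, the way dask reduces
# a zarr or parquet dataset without ever materializing it in memory.
total, count = 0.0, 0
for chunk in lazy_chunks(bucket):
    total += chunk.sum()
    count += chunk.size

print(total / count)  # -> 1.5
```

The payoff is the same as on the slide: compute moves to the data, and only small reduced results (here, one number) travel back to the laptop.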
  21. pangeo.io
     zero2jupyterhub + parallel computing + domain software + cloud-optimized data: the building blocks of modular big data science gateways
  22. Cloud-native science: scalable storage + compute + fat pipe
     ✅ Scientists write expressive code to interact lazily with full datasets.
     ✅ Calculations on big datasets run at interactive speed.
     ✅ No duplication of data; provenance chain is preserved.
     ✅ Puts the curiosity, discovery, and fun back into science!
  23. Learning Goals
     • be fluent in the basics of Unix-based workflows (command line / file system)
     • feel comfortable navigating the JupyterHub environment
     • be able to construct complete, well-structured programs in Python
     • read and write the most common geoscience data formats
     • perform basic exploratory data analysis on Earth Science data
       • “Tabular data”: rows and columns
       • “Gridded data”: multidimensional numerical arrays
     • use visualization to enhance interpretation of Earth Science data, including maps and interactive visualizations
     • practice reproducible research through version control
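The tabular/gridded distinction in the goals above can be made concrete with a tiny numpy sketch (synthetic numbers; real coursework would use pandas for tables and xarray for grids):

```python
import numpy as np

# "Tabular data": rows are observations, columns are named variables.
stations = {
    "lat":  np.array([40.8, 34.1, 41.9]),
    "temp": np.array([15.2, 22.5, 12.7]),   # degrees C, synthetic
}
warm = stations["temp"] > 15.0              # boolean row selection

# "Gridded data": one variable on a multidimensional (time, lat, lon) grid.
grid = np.arange(24.0).reshape(2, 3, 4)     # 2 times x 3 lats x 4 lons
time_mean = grid.mean(axis=0)               # collapse the time dimension

print(warm.sum(), time_mean.shape)          # -> 2 (3, 4)
```

The operations differ in kind: tables get filtered and grouped by column values, while grids get reduced or sliced along named dimensions, which is exactly the pandas/xarray split in the schedule.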

  24. Grading
     Grading: Weekly assignments (70%), final project (30%)
     Assignment Grading Rubric:
     • Total: 100
     • All questions complete: 50
     • All questions correct: 30 (e.g., if there are 10 questions, each question has 5 points for completeness and 3 points for correctness)
     • Clean, elegant, efficient code: rated between 0 and 10
     • Clear comments and explanations: rated between 0 and 10
     • Late penalty: -20 per day (24 hrs)
  25. Final Project
     • A project that uses one or more of the programming languages covered in this course to carry out an extensive data analysis, modeling, or visualization task.
     • You will be required to briefly present and demo your final project during the last week of classes.
     • More details to be given soon.
  26. Plagiarism
     • “Plagiarism is the unacknowledged use of the work of others. It comes from the Latin word plagiarius, meaning ‘kidnapper’.” - https://www.college.columbia.edu/academics/academicdishonesty
     • Copying someone else’s code and turning it in as your own work for an assignment or for the final project is plagiarism.
     • You will receive 0 credit for plagiarized work.
     • However, open source software is all about reuse. The key is the license: https://choosealicense.com/licenses/
     • Since the purpose of this course is to teach you how to program, you are required to write your own original code for the assignments and final project.
  27. Schedule (date | topic | assignment)
     T 4 Sept | Course introduction, Overview of JupyterLab |
     Th 6 Sept | Core Python Language |
     T 11 Sept | in-class partner activity |
     Th 13 Sept | Python Functions and Classes | assignment 1 due
     T 18 Sept | in-class partner activity |
     Th 20 Sept | Numpy and Matplotlib I | assignment 2 due
     T 25 Sept | in-class partner activity |
     Th 27 Sept | Exploring the Scipy Library | assignment 3 due
     T 2 Oct | in-class partner activity |
     Th 4 Oct | Pandas for Tabular Data I: Basics | assignment 4 due
     T 9 Oct | in-class partner activity |
     Th 11 Oct | Pandas for Tabular Data II: Advanced Pandas | assignment 5 due
     T 16 Oct | in-class partner activity |
     Th 18 Oct | XArray for Multidimensional Data I: Basics | assignment 6 due
     T 23 Oct | in-class partner activity |
     Th 25 Oct | XArray for Multidimensional Data II: Intermediate | assignment 7 due
     T 30 Oct | in-class partner activity |
     Th 1 Nov | Using Python on your computer | assignment 8 due
     T 6 Nov | NO CLASS |
     Th 8 Nov | Making maps with Basemap |
     T 13 Nov | in-class partner activity |
     Th 15 Nov | Other mapping packages | assignment 9 due
     T 20 Nov | in-class partner activity |
     Th 22 Nov | NO CLASS |
     T 27 Nov | Geoscience-specific packages | assignment 10 due
     Th 29 Nov | TBD |
     T 4 Dec | TBD |
     Th 6 Dec | TBD |
  28. Class Structure
     Thursday: interactive lecture led by instructor (follow along on laptop); weekly hw due; new topic begins; new hw assigned
     Tuesday: work on group assignment; work on solo assignment
  29. Bring your laptop to class
     The best way to learn how to code is to start coding, so bring your laptop to class so you can practice the various coding commands while we present them. You will also need your laptop to work on the assignments during the in-class activity days.