Just Do It: Reproducible Research in CFD

Keynote presentation at the International Parallel CFD Conference 2017, Glasgow, Scotland
http://www.strath.ac.uk/engineering/parcfd2017/

—Please cite as:
Barba, Lorena A. (2017): Just Do It: Reproducible Research in CFD. figshare.
https://doi.org/10.6084/m9.figshare.5011751.v1

—Abstract:
Reproducibility hit the mainstream in the last couple of years, after more than two decades of back-alley campaigns. For example, six months ago, the US National Science Foundation (NSF) issued a “Dear Colleague Letter: Encouraging Reproducibility in Computing and Communications Research.” The movement has often been associated with open data and open-source code, without which one could hardly reproduce a previous computational result. But it is one thing to share code and data for a statistical analysis or a bioinformatics workflow, and quite another to achieve reproducible research in parallel CFD. My research group has been practicing open science for years, and we learned the hard way that open code is merely a first step. We need to exhaustively document our computational research, to encourage and accept publication of negative results, and to apply defensive tactics against bad code: version control, modular code, testing, and code review. In this talk, I will share our lessons learned from a replication campaign on our own previous study (arXiv:1605.04339, accepted), and make a call to action. The tools and methods require training, but running a lab for reproducibility is your decision. Just do it!

Lorena A. Barba

May 17, 2017

Transcript

  1. 3.

    NSF 17-022, Dear Colleague Letter: Encouraging Reproducibility in Computing and Communications Research. CISE, October 21, 2016. https://www.nsf.gov/pubs/2017/nsf17022/nsf17022.jsp
  2. 4.

    NSF SBE subcommittee on replicability in science: “Reproducibility refers to the ability of a researcher to duplicate results of a prior study using the same materials as were used by the original investigator.” “… new evidence is provided by new experimentation, defined in the NSF report as ‘replicability.’” SBE, May 2015
  3. 5.

    “When a result can be reproduced by multiple scientists, it validates the original results and readiness to progress to the next phase of research.” https://www.nih.gov/research-training/rigor-reproducibility
  4. 9.
  5. 12.

    Reproducible Research Track (peer reviewed). Lorena A. Barba, George Washington University, labarba@gwu.edu; George K. Thiruvathukal, Loyola University Chicago, gkt@cs.luc.edu. https://www.computer.org/cise/
  6. 13.

    Def.— Reproducible research: Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results. Schwab, M., Karrenbach, N., Claerbout, J. (2000) “Making scientific computations reproducible,” Computing in Science and Engineering, Vol. 2(6):61–67
  7. 14.

    Jon F. Claerbout, Professor Emeritus of Geophysics, Stanford University … pioneered the use of computers in processing and filtering seismic exploration data [Wikipedia] … from 1991, he required theses to conform to a standard of reproducibility.
  8. 15.

    Invited paper at the October 1992 meeting of the Society of Exploration Geophysicists. http://library.seg.org/doi/abs/10.1190/1.1822162
  9. 16.

    “In 1990, we set this sequence of goals:
    1. Learn how to merge a publication with its underlying computational analysis.
    2. Teach researchers how to prepare a document in a form where they themselves can reproduce their own research results a year or more later by ‘pressing a single button.’
    3. Learn how to leave finished work in a condition where coworkers can reproduce the calculation, including the final illustration, by pressing a button in its caption.
    4. Prepare a complete copy of our local software environment so that graduating students can take their work away with them to other sites, press a button, and reproduce their Stanford work.
    5. Merge electronic documents written by multiple authors (SEP reports).
    6. Export electronic documents to numerous other sites (sponsors) so they can readily reproduce a substantial portion of our Stanford research.”
  10. 19.

    Reproducibility PI Manifesto (2012):
    ‣ I teach my graduate students about reproducibility
    ‣ All our research code (and writing) is under version control
    ‣ We always carry out verification & validation (and make them public)
    ‣ For main results, we share data, plotting script & figure under CC-BY
    ‣ We upload a preprint to arXiv at the time of submission to a journal
    ‣ We release code at the time of submission of a paper to a journal
    ‣ We add a “Reproducibility” declaration at the end of each paper
    ‣ I develop a consistent open-science policy & keep an up-to-date web presence
  11. 20.

    “Private reproducibility”: we can rebuild our own past research results from the precise version of the code that was used to create them.
  12. 21.

    What is Science? American Physical Society, Ethics and Values, 1999: “The success and credibility of science are anchored in the willingness of scientists to […] expose their ideas and results to independent testing and replication by others. This requires the open exchange of data, procedures and materials.” https://www.aps.org/policy/statements/99_6.cfm
  13. 22.

    Data and Code Sharing Recommendations:
    ‣ assign a unique identifier to every version of the data and code
    ‣ describe in each publication the computing environment used
    ‣ use open licenses and non-proprietary formats
    ‣ publish under open-access conditions (and/or post preprints)
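The second recommendation, describing the computing environment used, can be partly automated with the standard library. A minimal sketch, assuming a Python workflow; the output file name `environment.json` is illustrative, not from the talk:

```python
import json
import platform
import sys

def capture_environment(path="environment.json"):
    """Record basic machine and interpreter details to cite in a publication."""
    env = {
        "python_version": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
    }
    with open(path, "w") as f:
        json.dump(env, f, indent=2)
    return env

env = capture_environment()
print(sorted(env))
```

Calling this at the start of every run script gives each result a machine-readable record of where it was produced.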
  14. 23.

    Open-source licenses: people can coordinate their work freely, within the confines of copyright law, while making access and wide distribution a priority.
  15. 25.

    “The key is prevention via the training of more people on techniques for data analysis and reproducible research.” Leek & Peng, PNAS 2015
  16. 26.

    A syllabus for research computing:
    1. command-line utilities in Unix/Linux
    2. an open-source scientific software ecosystem (our favorite is Python's)
    3. software version control (we like the distributed kind: our favorite is git / GitHub)
    4. good practices for scientific software development: code hygiene and testing
    5. knowledge of licensing options for sharing software
    https://barbagroup.github.io/essential_skills_RRC/
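Item 4 of the syllabus, code hygiene and testing, can start as small as assertion-based unit tests that run on every change. A sketch with a hypothetical trapezoidal-rule helper (not code from the group's solvers):

```python
def trapezoid(y, dx):
    """Composite trapezoidal rule for equally spaced samples y with spacing dx."""
    return dx * (sum(y) - 0.5 * (y[0] + y[-1]))

def test_trapezoid_linear():
    # The rule is exact for linear integrands: the integral of y = x on [0, 1] is 0.5.
    n = 10
    dx = 1.0 / n
    y = [i * dx for i in range(n + 1)]
    assert abs(trapezoid(y, dx) - 0.5) < 1e-12

test_trapezoid_linear()
print("ok")
```

A test runner such as pytest would discover `test_trapezoid_linear` automatically; the point is that every numerical kernel carries a check of a known answer.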
  17. 27.
  18. 28.

    In parallel, even two runs with identical input data can differ! Different versions of your code, external libraries, and even compilers may change results.
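One root cause: floating-point addition is not associative, so any change in reduction order (for example, a different number of parallel processes) can change a sum. A self-contained illustration:

```python
# Floating-point addition is not associative: the same three numbers,
# summed in a different order, give different results in double precision.
a, b, c = 1e16, 1.0, -1e16

left_to_right = (a + b) + c  # 1e16 + 1.0 rounds back to 1e16, so the result is 0.0
reordered = (a + c) + b      # exact cancellation first, then add 1.0: result is 1.0

print(left_to_right, reordered)  # 0.0 1.0
```

In a parallel reduction, the order of these additions depends on how the data is partitioned, which is why bitwise-identical results across process counts cannot be expected.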
  19. 29.

    In HPC, peers may not be able to reproduce the results, but they will trust them more if they are built on a consistent practice of reproducible research.
  20. 30.

    Def.— Replication: Arriving at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses. Roger D. Peng (2011), “Reproducible Research in Computational Science,” Science, Vol. 334, Issue 6060, pp. 1226–1227
  21. 32.
  22. 33.

    Experiments—snake profile lift & drag: at higher Re, a maximum lift coefficient of 1.9 was produced while drag remained approximately the same. At higher angles of attack, the lift gradually decreased while the drag rapidly increased. The lift-to-drag ratio exhibited similar behavior, reaching a maximum of 2.7 at 35° due to the peak in lift for the higher Reynolds numbers. The lift increased up to an angle of attack of 35°, exhibiting robust aerodynamic performance by maintaining a high lift coefficient between 20° and 60°, and near-maximum L/D values over a range of angles of attack between 15° and 40°. Credit: Holden, MSc Thesis, VA Tech (2011)
  23. 34.

    Simulations—snake profile lift coefficient:
    ‣ Immersed boundary method: reproduces the lift signature, at the same angle of attack (35°) but a different Reynolds number
    ‣ This is in 2D
    [Figure: CL vs. angle of attack, 0–50°, for Re = 500, 1000, 1500, 2000, 2500, 3000]
  24. 36.
  25. 37.

    Four CFD solvers:
    ‣ cuIBM— Used in the original study; written in CUDA C to exploit GPUs, serial on the CPU. Uses the NVIDIA Cusp library for solving sparse linear systems. https://github.com/barbagroup/cuIBM
    ‣ OpenFOAM— Free and open-source CFD package with a suite of numerical solvers. Core discretization scheme: finite-volume method applied on mesh cells of arbitrary shape. http://www.openfoam.org
    ‣ IBAMR— A parallel code using the immersed boundary method on Cartesian meshes, with adaptive mesh refinement. https://github.com/ibamr/ibamr
    ‣ PetIBM— Our own re-implementation of cuIBM, but for distributed-memory parallel systems. It uses the PETSc library for solving sparse linear systems in parallel. https://github.com/barbagroup/PetIBM
  26. 39.

    OpenFOAM ‣ Vorticity field at t = 52: angle of attack = 35°, Re = 2000. Mesh of ~700k triangles created with the free software Gmsh.
  27. 41.
  28. 44.

    Results:
    ‣ Using IBAMR in the manner of other immersed-boundary codes gave the wrong answer
    ‣ The “trick” for this code took us months to find (no-slip markers needed inside the body)
    ‣ Finally, the results match
  29. 45.
  30. 47.

    cuIBM vs. PetIBM:
    ‣ both written by the same developer, implementing the same method
    ‣ cuIBM: CUDA C, Cusp linear algebra library — used an algebraic multigrid preconditioner and a conjugate gradient (CG) solver
    ‣ PetIBM: C++, PETSc linear algebra library — the CG solver crashed because of an indefinite preconditioner, so we were forced to switch to bi-CG stabilized
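The failure mode is easy to reproduce in miniature: plain CG divides by pᵀAp, which is guaranteed positive only for a positive-definite operator. A pure-Python sketch of the first CG step on a symmetric indefinite 2×2 matrix (an illustration of the breakdown mechanism, not the PETSc code in question):

```python
# First conjugate-gradient step on A = diag(1, -1): symmetric but indefinite.
A = [[1.0, 0.0], [0.0, -1.0]]
b = [1.0, 1.0]

def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# With the initial guess x0 = 0, the residual and first search direction equal b.
r = b[:]
p = r[:]

denominator = dot(p, matvec(A, p))  # p^T A p = 1*1 + 1*(-1) = 0
print(denominator)  # 0.0 -> the step length alpha = r^T r / p^T A p is undefined
```

With an indefinite operator (or preconditioner), this denominator can vanish or go negative, which is why a method such as BiCGSTAB, which does not require definiteness, is the usual fallback.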
  31. 49.

    What is going on? ‣ Time signature of force coefficients: (top) PetIBM vs. cuIBM; (bottom) two runs with a slightly shifted body
  32. 52.
  33. 54.

    What makes research reproducible? “… authors provide all the necessary data and the computer code to run the analysis again, re-creating the results.” ‣ But what data are necessary?
    - open data & open-source code
    - actual meshes used, BCs, comprehensive parameter sets
    - exhaustive records of the process, automated workflows: launch via running scripts, store command-line arguments for every run, capture complete environment settings
    - post-processing and visualization should be scripted, avoiding GUIs for manipulation of images
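Storing command-line arguments and environment settings for every run can be a single helper called at the top of each launch script. A sketch of one possible shape, assuming a Python launcher; the manifest file name and the `OMP_` filter are illustrative choices, not the group's actual tooling:

```python
import json
import os
import sys
import time

def log_run(path="run_manifest.json"):
    """Capture what was launched, with which arguments, where, and when."""
    record = {
        "command": sys.argv,
        "cwd": os.getcwd(),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        # Keep only threading-related variables here; a real manifest might keep more.
        "environment": {k: v for k, v in os.environ.items() if k.startswith("OMP_")},
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

record = log_run()
print(record["cwd"])
```

Committing these manifests alongside the outputs makes every figure traceable back to the exact invocation that produced it.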
  34. 56.

    Reproducible Research: 10 Simple Rules
    1. For every result, keep track of how it was produced
    2. Avoid manual data-manipulation steps
    3. Archive the exact versions of all external programs used
    4. Version-control all custom scripts
    5. Record all intermediate results, when possible in standard formats
    6. For analyses that include randomness, note underlying random seeds
    7. Always store raw data behind plots
    8. Generate hierarchical analysis output, allowing layers of increasing detail to be inspected
    9. Connect textual statements to underlying results
    10. Provide public access to scripts, runs, and results
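Rule 6 in practice: fix the seed, record it with the results, and a rerun regenerates the identical pseudo-random stream. A minimal sketch; the seed value and function are illustrative:

```python
import random

SEED = 20170517  # recorded alongside the results; the value here is arbitrary

def noisy_samples(seed, n=5):
    """Draw n Gaussian samples from a private, seeded generator (no global state)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# Two runs with the same recorded seed reproduce the same data exactly.
print(noisy_samples(SEED) == noisy_samples(SEED))  # True
```

Using a private `random.Random` instance, rather than the module-level functions, keeps the stream reproducible even if other code also draws random numbers.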
  35. 58.

    “The key is prevention via the training of more people on techniques for data analysis and reproducible research.” Leek & Peng, PNAS 2015
  36. 62.

    ReproPacks: For main results in a paper, we share data, plotting script & figure under CC-BY. A file bundle with input data, running scripts, plotting scripts, and the figure. We cite our own figure in the caption!
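Assembling such a bundle takes only a few lines: gather the inputs, scripts, and figure into one archive with a manifest. A standard-library sketch; the file names and manifest layout are hypothetical, not the group's published ReproPack format:

```python
import json
import zipfile

def make_repropack(archive, files, figure_caption):
    """Bundle data, scripts, and a figure caption into one citable zip archive."""
    manifest = {"files": sorted(files), "caption": figure_caption}
    with zipfile.ZipFile(archive, "w") as z:
        for name, content in files.items():
            z.writestr(name, content)
        z.writestr("MANIFEST.json", json.dumps(manifest, indent=2))

make_repropack(
    "repropack_fig.zip",
    {
        "data/forces.csv": "t,cl,cd\n0.0,0.0,0.0\n",
        "scripts/plot_forces.py": "# plotting script goes here\n",
    },
    "Lift and drag coefficients vs. time (hypothetical figure).",
)
```

Depositing the archive in a repository that mints a DOI (e.g., figshare or Zenodo) is what makes the figure itself citable in its own caption.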
  37. 67.