Working towards deterministic scientific pipelines

This presentation is about what I did in the last 3 months during my stay in South Africa. I mostly worked on migrating MeqTrees and related software to GitHub and on making it easier to install. MeqTrees is a software package for working with Measurement Equations, written by Oleg Smirnov and others. MeqTrees is now installable with 4 commands on Ubuntu 14.04 and 12.04.
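The four commands referred to here are the ones listed on slide 11 of the transcript. As a minimal sketch, they could be wrapped in a small shell script that only prints the steps unless told otherwise. The names `run`, `install_meqtrees` and the `DRY_RUN` variable are illustrative assumptions, not part of MeqTrees or the PPA.

```shell
#!/bin/sh
# Sketch: the four install steps from slide 11 (Ubuntu 12.04 and 14.04).
# run, install_meqtrees and DRY_RUN are hypothetical names for illustration.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Each step only starts if the previous one succeeded.
install_meqtrees() {
    run sudo apt-get install python-software-properties software-properties-common &&
    run sudo add-apt-repository ppa:ska-sa/main &&
    run sudo apt-get update &&
    run sudo apt-get install meqtrees
}

install_meqtrees
```

Run as is, the script prints the four commands. Setting `DRY_RUN=0` in the environment would execute them instead, which needs sudo rights on a supported Ubuntu release.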

I also investigated other technologies, such as OS virtualisation and process isolation, and how they could be applied to improve productivity in radio astronomy in general. The result is Papino, a GitHub project that can be used to quickly set up a radio astronomy software development environment, and that can also serve as a foundation for scientific software development and collaboration.

Gijs Molenaar

April 15, 2014

Transcript

  1. Working towards deterministic scientific pipelines • Gijs Molenaar • gijs@pythonic.nl • @gijzelaerr

  2. Who am I • Gijs Molenaar • Scientific Software Engineer • University of Amsterdam • 2+ years • Background in AI (no astronomer)
  3. Previously • TRAnsient detection Pipeline (TRAP) • AARTFAAC • http://www.transientskp.org/ • http://www.aartfaac.org/
  4. problem • scientist A wants to share code / results with scientist B • how to install? • often quite complex, specific per: • OS • Distribution • Library version • Compile flags • System environment • Paths
  5. Solution • Formalise installation procedure • package it up!

  6. Why am I in ZA • Installability of MeqTrees and related projects • Help where possible
  7. MeqTrees • Modernise installation • Migrate to GitHub • create Debian packages • Launchpad PPA
  8. Migrate to GitHub • http://meqtrees.net • All code, doc and issues now there • all the emails (sorry!)
  9. Create Debian packages • all the new Debian repos (sorry!) • contain deb files that transform source tarball into Debian package
  10. • https://launchpad.net/~ska-sa/+archive/main/

  11. installation Ubuntu 12.04 and 14.04
    ‣ sudo apt-get install python-software-properties software-properties-common
    ‣ sudo add-apt-repository ppa:ska-sa/main
    ‣ sudo apt-get update
    ‣ sudo apt-get install meqtrees
  12. In other news • Casacore & pyrap may be included in the standard Debian archive • Automated updates of measures data
  13. CASA • Get working on Ubuntu (12.04) • not a success • Compiles & installs (after a lot of patching) • BUT depends on too old libraries (very very old IPython)
  14. Papino • Radio astronomy springboard • Virtual machine • isolated environment • complete operating system • Vagrant • Docker • https://github.com/ska-sa/papino
  15. why? • Solves many problems • Just include complete OS with everything preinstalled • Easy to install • Easy to adjust (fork, change & build)
  16. advantages • Much easier to install • package once, install everywhere • formalised installation procedure • no interference with other packages • Old distro / library? No problem!
  17. back to CASA • solution! • almost. No shared process / memory space • Don’t have a good solution (yet)
  18. vagrant • CLI around VirtualBox • Virtualisation • install vagrant
    • $ git clone https://github.com/ska-sa/papino
    • $ cd papino
    • $ vagrant up
  19. docker • Linux container framework • NO virtualisation but isolation! • install docker
    • $ git clone https://github.com/ska-sa/papino
    • $ cd papino
    • $ sudo ./docker.sh
  20. papino github

  21. fork it

  22. demo

  23. nbdiff • http://nbdiff.org/ • http://nbviewer.ipython.org/github/ska-sa/papino/blob/master/notebooks/papino.ipynb
  24. reproducible science • Moraila, Gina, et al. "Measuring Reproducibility in Computer Systems Research." (2013).
  25. I believe • Vagrant / docker / papino can help here! • Fork papino, commit your code • make code reproducible • make code open
  26. Future • Parallelisation • Data locality. Move the calculation to the data. • Disco? Hadoop?
  27. questions? • keep updated and follow me on Twitter: @gijzelaerr
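
Slides 18 and 19 give two routes into Papino that differ only in the last command. As a rough sketch, the helper below prints the steps for whichever backend is named. `papino_steps` is a hypothetical function written for this illustration and is not part of the papino repository; the commands it prints are the ones shown on the slides, and Vagrant or Docker is assumed to be installed already.

```shell
#!/bin/sh
# Sketch: print the Papino setup steps from slides 18 and 19 for one backend.
# papino_steps is a hypothetical helper, not part of the papino repository.

papino_steps() {
    backend="${1:-}"
    case "$backend" in
        vagrant) start='vagrant up' ;;        # full virtual machine
        docker)  start='sudo ./docker.sh' ;;  # container: isolation only
        *)
            echo "unknown backend: '$backend' (use vagrant or docker)" >&2
            return 1
            ;;
    esac
    # The first two steps are the same for both routes.
    printf '%s\n' \
        'git clone https://github.com/ska-sa/papino' \
        'cd papino' \
        "$start"
}

# Example: show the steps for the Vagrant route.
papino_steps vagrant
```

Printing the steps rather than running them keeps the sketch safe to try on any machine; the output can be copied into a terminal once the chosen backend is installed.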