Software packaging and data pipelines

Gijs Molenaar
December 12, 2017



  1. Software packaging and data pipelines
    • Gijs Molenaar
    • PSR-FRB-SEARCHSOFT 2017

  2. Me
    • Not an astronomer
    • Software engineer
    • Ex-UvA
    • Working on data reduction pipelines
  3. • KERN - radio astronomy software suite
    • CWL - workflow language
  4. KERN: the problem
    • Installing scientific software
    • Compile flags
    • Dependencies
    • Patches
    • Environment variables
    • Consistency & reproducibility
  5. What is KERN
    • Debian packages (Ubuntu LTS)
    • Released every 6 months
    • KERN-3, released November 27, 2017
    • 73 packages and growing
  6. Advantages
    • No compilation
    • Consistency between computers
    • Consistency between platforms
    • Zero-knowledge setup
  7. To install presto on Ubuntu 16.04:
        $ sudo apt-add-repository -s ppa:kernsuite/kern-3
        $ sudo apt-get update
        $ sudo apt-get install presto
  8. casacore-data casacore aoflagger python-casacore casasynthesis casarest lofar wsclean rmextract losoto pyvo lsmtool prefactor factor 21cmfast aips attrdict cassbeam chgcentre drive-casa pyxis rfimasker rpfits sagecal scatterbrane tempo2 sigproc sigpyproc simfast21 simms sourcery stimela tempo tigger tirific tkp dysco tmv karma katversion katpoint katdal kittens makems meqtrees-cattery meqtrees-timba cub montblanc msutils obit oskar owlcat psrcat spdlog sopt purify purr pymoresane transitions python-keepalive python-typing cwltool psrchive dspsr parseltongue presto multinest singularity-container casalite galsim cubical mt-imager
  9. Pulsar packages
    • Not tested enough
    • There might be bugs
    • Let me know
  10. Pulsar problems
    • Mostly non-versioned
    • Hacky culture: fork, modify code
    • No (unit) testing
    • Does not conform to standards (installation, UNIX, Python)
    • Algorithms probably genius, but software fragile and poorly written
  11. I don’t have Ubuntu? • Use containerisation

  12. Containerisation • Docker • Singularity • udocker? (user space docker)

  13. What about GPU?
    • GOOD QUESTION!
    • Is a bit of a problem
    • Not a problem if no containerisation
    • Nvidia kernel module / library version needs to match
    • For Docker there is a workaround
  14. I use Docker already
    • Is your container 2GB?
    • You don’t know how to ‘combine’ containers?
    • Does it take ages to build your container?
    • You are doing it wrong.
  15. Containerisation is not package management

  16. Example Dockerfile
        FROM kernsuite/base:3
        RUN docker-apt-install presto

  17. Packaging software since 1993


  19. Common Workflow Language

  20. A standard for building pipelines

  21. pipeline /ˈpʌɪplʌɪn/ noun
    “a linear sequence of specialised tasks processing measurements, with the eventual goal to produce a paper”
  22. Example
    • Bash script
    • Easy to make (one-to-one with the command line)
  23. Problems
    • Platform specific
    • Data products hard to manage
    • What is input, what is output?
    • What if the pipeline fails halfway?
    • Parallelisation? Implementation specific
    • Easy to hack together but doesn’t ‘scale’
  24. What we want
    • A formal description of the tools
    • A formal description of how to combine them
    • Split of responsibilities
  25. Task
    • Any piece of software that finishes in a ‘reasonable amount of time’
    • Takes input
    • Has arguments
    • Produces output
    • Assumed to be deterministic
  26. Workflow
    • Combine tasks into a workflow
    • A DAG (Directed Acyclic Graph)
    • Connect input to output
    • Manage parameters
    • Indicate what can run in parallel (implicit)
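
The implicit parallelism of a DAG can be illustrated with a minimal Python sketch. The step names are borrowed from the workflow excerpt later in the talk, but the dependency edges are an assumption made for illustration, and `parallel_batches` is a hypothetical helper, not part of any CWL runner. Tasks that become ready in the same round do not depend on each other and could run side by side.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks whose output it consumes.
# These edges are an illustrative assumption, not the talk's actual pipeline.
DEPENDENCIES = {
    "prepsubband": set(),
    "makezaplist": set(),
    "realfft": {"prepsubband"},
    "zapbirds": {"realfft", "makezaplist"},
    "accelsearch": {"zapbirds"},
}


def parallel_batches(dependencies):
    """Return a list of batches; tasks within one batch are independent."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(DEPENDENCIES)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Nothing in the graph says "run these in parallel"; the batches fall out of the input/output connections alone, which is what "implicit" means on the slide.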
  27. Deterministic
    • ‘Functional’
    • Output is the same if input and parameters don’t change
    • Cache results
    • Tasks without inter-dependencies can run in parallel
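
Why determinism makes caching safe can be shown with a small Python sketch. It assumes inputs and parameters are JSON-serialisable values; a real runner would hash file contents and persist results on disk. The names `cache_key`, `run_cached` and `fold` are hypothetical and exist only for this example.

```python
import hashlib
import json

_cache = {}  # in-memory only; a real runner would persist results


def cache_key(task_name, inputs, parameters):
    """Hash everything that can influence the output of a deterministic task."""
    payload = json.dumps(
        {"task": task_name, "inputs": inputs, "parameters": parameters},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(task, inputs, parameters):
    """Run task(inputs, parameters) unless an identical call is already cached."""
    key = cache_key(task.__name__, inputs, parameters)
    if key not in _cache:
        _cache[key] = task(inputs, parameters)
    return _cache[key]


calls = []  # records every real execution, to show when the cache is hit


def fold(inputs, parameters):
    """Hypothetical stand-in for a real processing step."""
    calls.append(inputs)
    return sum(inputs) * parameters["gain"]


first = run_cached(fold, [1, 2, 3], {"gain": 2})
second = run_cached(fold, [1, 2, 3], {"gain": 2})  # identical call: cache hit
third = run_cached(fold, [1, 2, 3], {"gain": 3})   # parameter changed: reruns
```

If a task were not deterministic, the second call could legitimately differ from the first and reusing the stored result would be wrong.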
  28. Making a Task
    • How to run the actual program (KERN, containers)
    • Formalise IO
  29. What does it look like?
        cwlVersion: v1.0
        class: CommandLineTool
        baseCommand:
        inputs:
          inf:
            type: File
          birds:
            type: File
            inputBinding:
              position: 1
        outputs:
          zaplist:
            type: File
            outputBinding:
              glob: $(inputs.birds.nameroot).zaplist
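
The output glob `$(inputs.birds.nameroot).zaplist` on slide 29 relies on the `nameroot` property of a CWL `File`, which is the basename without its final extension. A rough Python approximation, assuming `os.path.splitext` matches that behaviour closely enough and using a made-up input path:

```python
import os.path


def nameroot(path):
    """Approximate CWL's File.nameroot: the basename without its last extension."""
    root, _ext = os.path.splitext(os.path.basename(path))
    return root


def zaplist_glob(birds_path):
    """Evaluate $(inputs.birds.nameroot).zaplist for a given birds file."""
    return nameroot(birds_path) + ".zaplist"


# "/data/obs1.birds" is a hypothetical path used only for illustration.
print(zaplist_glob("/data/obs1.birds"))
```

The runner therefore looks for an output named after the input file, so the tool description does not need to hard-code any file name.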
  30.
        steps:
          realfft_subbands:
            run: steps/realfft.cwl
            in: {dat: prepsubband/dats}
            scatter: dat
            out: [fft]
          zapbirds:
            run: steps/zapbirds.cwl
            in: {zapfile: makezaplist/zaplist, fft: realfft/fft}
            scatter: [fft]
            out: [zapped]
          accelsearch_subbands:
            run: steps/accelsearch.cwl
            in: {dat: zapbirds/zapped, numharm: numharm}
            scatter: [dat, inf]
            scatterMethod: dotproduct
            out: [candidates_binary, candidates_text]
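
The `scatterMethod: dotproduct` setting on slide 30 pairs the i-th element of each scattered array into one job, instead of trying every combination. The Python sketch below mimics that pairing and contrasts it with a flat cross product. The function names and file names are hypothetical; this is an illustration of the semantics, not how any CWL runner is implemented.

```python
from itertools import product


def scatter_dotproduct(**arrays):
    """Pair the i-th element of every array; all arrays must have equal length."""
    lengths = {len(values) for values in arrays.values()}
    if len(lengths) > 1:
        raise ValueError("dotproduct needs input arrays of equal length")
    names = list(arrays)
    return [dict(zip(names, combo)) for combo in zip(*arrays.values())]


def scatter_flat_crossproduct(**arrays):
    """Build one job for every combination of elements across the arrays."""
    names = list(arrays)
    return [dict(zip(names, combo)) for combo in product(*arrays.values())]


# Hypothetical file names, for illustration only.
dats = ["sub0.dat", "sub1.dat"]
infs = ["sub0.inf", "sub1.inf"]

jobs = scatter_dotproduct(dat=dats, inf=infs)
all_pairs = scatter_flat_crossproduct(dat=dats, inf=infs)

print(len(jobs), "jobs with dotproduct,", len(all_pairs), "with a cross product")
```

Dotproduct is the natural choice when each `.dat` file has exactly one matching `.inf` file, as in the accelsearch step.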
  32. CWL runners
    • CWL is a ‘standard’. Multiple implementations.
    • Reference runner
    • Toil
    • Slurm
    • Mesos
    • Arvados
  33. Prototype
    • 2 pipelines
    • Prefactor and presto
    • Get running on Cartesius
    • Dutch National Supercomputer
    • Probably using Toil, Slurm and Singularity
  34. Future Work
    • Improve packages
    • Add more packages
    • Add CWL definitions for all tasks
    • Enhance tooling (progress and result visualisation)
  35. Conclusion
    • CWLing is a work in progress
    • Feel free to try it out
    • Report packaging bugs if you encounter problems
  36. Questions?

  37. Come tomorrow!
    • Happy hour at 5
    • 135 Albert Rd, Woodstock, Cape Town