Software packaging and data pipelines

Gijs Molenaar
December 12, 2017

PSR-FRB-SEARCHSOFT 2017

PULSAR AND FRB SEARCH SOFTWARE IN THE ERA OF REAL-TIME SURVEYS

Transcript

  1. Me • Not astronomer • Software engineer • Ex UvA • Working on data reduction pipelines
  2. KERN: the problem • Installing scientific software • Compile flags • Dependencies • Patches • Environment variables • Consistency & reproducibility
  3. What is KERN • Debian packages (Ubuntu LTS) • Released every 6 months • KERN-3, released November 27, 2017 • 73 packages and growing
  4. To install presto on Ubuntu 16.04:

        $ sudo apt-add-repository -s ppa:kernsuite/kern-3
        $ sudo apt-get update
        $ sudo apt-get install presto
  5. casacore-data casacore aoflagger python-casacore casasynthesis casarest lofar wsclean rmextract losoto pyvo lsmtool prefactor factor 21cmfast aips attrdict cassbeam chgcentre drive-casa pyxis rfimasker rpfits sagecal scatterbrane tempo2 sigproc sigpyproc simfast21 simms sourcery stimela tempo tigger tirific tkp dysco tmv karma katversion katpoint katdal kittens makems meqtrees-cattery meqtrees-timba cub montblanc msutils obit oskar owlcat psrcat spdlog sopt purify purr pymoresane transitions python-keepalive python-typing cwltool psrchive dspsr parseltongue presto multinest singularity-container casalite galsim cubical mt-imager
  6. Pulsar problems • Mostly non-versioned • Hacky culture: fork, modify code • No (unit) testing • Does not conform to standards (installation, UNIX, Python) • Algorithms probably genius, but software fragile and poorly written
  7. What about GPU? • GOOD QUESTION! • Is a bit of a problem. • Not a problem if no containerisation • Nvidia kernel module / library version needs to match • For Docker there is a workaround: https://github.com/NVIDIA/nvidia-docker
  8. I use Docker already • Is your container 2GB? • You don’t know how to ‘combine’ containers? • Does it take ages to build your container? • You are doing it wrong.
  9. pipeline /ˈpʌɪplʌɪn/ noun “a linear sequence of specialised tasks processing measurements, with the eventual goal to produce a paper”
  10. Problems • Platform specific • Data products hard to manage • What is input, what is output? • What if the pipeline fails halfway? • Parallelisation? Implementation specific • Easy to hack together but doesn’t ‘scale’
  11. What we want • A formal description of the tools • A formal description of how to combine them • Split of responsibilities
  12. Task • Any piece of software that finishes in a ‘reasonable amount of time’ • Takes input • Has arguments • Produces output • We assume deterministic
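The task definition above can be sketched as a small data structure. This is only an illustration, not how KERN or CWL represent tasks: the `Task` class, its field names and the file names are hypothetical, and `-numharm` is used simply because `numharm` appears as a parameter later in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Task:
    """Hypothetical description of one task: a command plus its declared IO."""
    command: str                                             # program to run
    inputs: List[str]                                        # files the task reads
    arguments: Dict[str, str] = field(default_factory=dict)  # tunable parameters
    outputs: List[str] = field(default_factory=list)         # files the task writes

    def command_line(self) -> List[str]:
        """Build the argument vector; flags are sorted so the result is stable."""
        flags: List[str] = []
        for name in sorted(self.arguments):
            flags += [f"-{name}", self.arguments[name]]
        return [self.command, *flags, *self.inputs]


# Illustrative file names only.
accelsearch = Task(
    command="accelsearch",
    inputs=["beam.fft"],
    arguments={"numharm": "8"},
    outputs=["beam_ACCEL_0"],
)
print(accelsearch.command_line())
```

Declaring inputs, arguments and outputs up front is what lets a workflow engine decide when a task can run and what it produces, without reading the program itself.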
  13. Workflow • Combine tasks into workflow • A DAG (Directed Acyclic Graph) • Connect input to output • Manage parameters • Indicate what can run in parallel (implicit)
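A minimal sketch of how a DAG makes parallelism implicit, using Python's standard `graphlib` (Python 3.9+). The step names are simplified from the presto example later in the deck and the edges are an assumption about how those steps connect; a real CWL runner derives them from the `in:` and `out:` fields.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it consumes (assumed edges).
dependencies = {
    "prepsubband": set(),
    "makezaplist": set(),
    "realfft": {"prepsubband"},
    "zapbirds": {"realfft", "makezaplist"},
    "accelsearch": {"zapbirds"},
}


def parallel_batches(graph):
    """Group steps into batches; steps within one batch do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are available
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Nobody has to mark steps as parallel by hand: any steps that land in the same batch can be dispatched at the same time.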
  14. Deterministic • ‘Functional’ • Output same if input and parameters don’t change • Cache results • If no inter-dependencies can run in parallel
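Why determinism allows caching can be sketched as follows. This is an assumption-laden toy, not the caching scheme of any particular CWL runner: the in-memory `_cache`, the `run_cached` helper and the `scale_sum` stand-in task are all hypothetical.

```python
import hashlib
import json

_cache = {}   # hypothetical in-memory store; a real runner would persist results
calls = []    # records which tasks actually executed


def cache_key(task_name, inputs, parameters):
    """Hash the task name, inputs and parameters into a stable key."""
    payload = json.dumps(
        {"task": task_name, "inputs": inputs, "parameters": parameters},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(task_name, func, inputs, parameters):
    """Run func only if this exact combination has not been seen before."""
    key = cache_key(task_name, inputs, parameters)
    if key not in _cache:
        calls.append(task_name)
        _cache[key] = func(inputs, parameters)
    return _cache[key]


def scale_sum(inputs, parameters):
    """Stand-in for a deterministic task."""
    return sum(inputs["samples"]) * parameters["gain"]


first = run_cached("scale_sum", scale_sum, {"samples": [1, 2, 3]}, {"gain": 2})
second = run_cached("scale_sum", scale_sum, {"samples": [1, 2, 3]}, {"gain": 2})
third = run_cached("scale_sum", scale_sum, {"samples": [1, 2, 3]}, {"gain": 3})
print(first, second, third, len(calls))
```

The second call is served from the cache because nothing changed; the third runs again because a parameter did. The same idea lets a pipeline that failed halfway resume without redoing finished steps.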
  15. Making a Task • How to run the actual program (KERN, containers) • Formalise IO
  16. What does it look like

        cwlVersion: v1.0
        class: CommandLineTool
        baseCommand: makezaplist.py
        inputs:
          inf:
            type: File
          birds:
            type: File
            inputBinding:
              position: 1
        outputs:
          zaplist:
            type: File
            outputBinding:
              glob: $(inputs.birds.nameroot).zaplist
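The `glob` expression in the tool description uses the `nameroot` of the `birds` file, which in CWL is the file's basename without its extension. A rough Python equivalent, assuming an ordinary single-extension file name (the path below is hypothetical):

```python
import os


def zaplist_name(birds_path):
    """Approximate what $(inputs.birds.nameroot).zaplist resolves to."""
    basename = os.path.basename(birds_path)
    nameroot, _nameext = os.path.splitext(basename)  # drop the final extension
    return nameroot + ".zaplist"


expected_output = zaplist_name("/data/obs1.birds")
print(expected_output)
```

So the runner knows which file to collect as the `zaplist` output without the tool having to report it.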
  17.

        steps:
          realfft_subbands:
            run: steps/realfft.cwl
            in: {dat: prepsubband/dats}
            scatter: dat
            out: [fft]
          zapbirds:
            run: steps/zapbirds.cwl
            in: {zapfile: makezaplist/zaplist, fft: realfft/fft}
            scatter: [fft]
            out: [zapped]
          accelsearch_subbands:
            run: steps/accelsearch.cwl
            in: {dat: zapbirds/zapped, numharm: numharm}
            scatter: [dat, inf]
            scatterMethod: dotproduct
            out: [candidates_binary, candidates_text]
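The `scatter` and `scatterMethod: dotproduct` fields in the steps above mean: run the step once per element, pairing the scattered arrays element by element. A sketch of that pairing, contrasted with CWL's `flat_crossproduct` method; the helper functions and file names are hypothetical and only mimic the semantics, they are not part of any CWL runner.

```python
from itertools import product


def scatter_dotproduct(**arrays):
    """Pair the i-th element of every array: one job per index."""
    lengths = {len(values) for values in arrays.values()}
    if len(lengths) != 1:
        raise ValueError("dotproduct scatter needs arrays of equal length")
    names = list(arrays)
    return [dict(zip(names, combo)) for combo in zip(*arrays.values())]


def scatter_flat_crossproduct(**arrays):
    """Combine every element with every other: one job per combination."""
    names = list(arrays)
    return [dict(zip(names, combo)) for combo in product(*arrays.values())]


# Illustrative file names only.
dats = ["sub0.dat", "sub1.dat"]
infs = ["sub0.inf", "sub1.inf"]

dot_jobs = scatter_dotproduct(dat=dats, inf=infs)
cross_jobs = scatter_flat_crossproduct(dat=dats, inf=infs)
print(dot_jobs)
print(len(cross_jobs))
```

With `dotproduct`, each `.dat` file stays with its own `.inf` file, and the resulting jobs are independent, so a runner is free to execute them in parallel.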
  18. CWL runners • CWL is a ‘standard’. Multiple implementations. • Reference runner • Toil • Slurm • Mesos • Arvados
  19. Prototype • 2 pipelines • Prefactor and presto • Get running on Cartesius • Dutch National Supercomputer • Probably using Toil, Slurm and Singularity • https://github.com/EOSC-LOFAR/prefactor-cwl • https://github.com/EOSC-LOFAR/presto-cwl
  20. Future Work • Improve packages • Add more packages • Add CWL definitions for all tasks • Enhance tooling (progress and result visualisation)
  21. Conclusion • CWLing is work in progress • Feel free to try it out • Report packaging bugs if you encounter problems