EOSC LOFAR pilot final presentation

Gijs Molenaar
January 08, 2018

This presentation discusses the results of the EOSC LOFAR pilot. The goal is to demonstrate the usability of CWL for radio astronomical data reduction pipelines. Target platforms include a MacBook, a single server, a cluster and a supercomputer.

Transcript

  1. EOSC LOFAR pilot 8 January 2018 Gijs Molenaar

  2. Agenda • Presentation(s) 1 - 2 hours • Hackathon? Let's get our hands dirty
  3. Deliverables • KERN packages • CWLifying 3 pipelines • Running pipelines on various platforms • Usability report • Demonstration use case
  4. KERN: the problem • Installing scientific software • Compile flags • Dependencies • Patches • Environment variables • Consistency & reproducibility • Centralize and minimise agony
  5. What is KERN • Debian packages (Ubuntu LTS) • Released every 6 months • KERN-3, released November 27, 2017 • 73 packages and growing
  6. KERN advantages • No compilation • Consistency between computers • Consistency between platforms • Zero knowledge setup
  7. Install prefactor on Ubuntu 16.04
    $ sudo apt-add-repository -s ppa:kernsuite/kern-3
    $ sudo apt-get update
    $ sudo apt-get install prefactor
  8. Packages in KERN: casacore-data casacore aoflagger python-casacore casasynthesis casarest lofar wsclean rmextract losoto pyvo lsmtool prefactor factor 21cmfast aips attrdict cassbeam chgcentre drive-casa spdlog sopt purify purr pymoresane pyxis rfimasker rpfits sagecal scatterbrane tempo2 sigproc sigpyproc simfast21 simms sourcery dysco tmv karma katversion katpoint katdal kittens makems meqtrees-cattery meqtrees-timba cub montblanc msutils obit oskar owlcat psrcat stimela tempo tigger tirific tkp transitions python-keepalive python-typing cwltool psrchive dspsr parseltongue presto multinest singularity-container casalite galsim cubical mt-imager
  9. Containerisation • Docker • Singularity • uDocker (user space docker)

  10. What about GPUs? • Good question! • It is a bit of a problem • Not a problem without containerisation • Nvidia kernel module and library versions need to match • For Docker there is a workaround: https://github.com/NVIDIA/nvidia-docker
  11. Example Dockerfile
    FROM kernsuite/base:3
    RUN docker-apt-install prefactor

  12. I use Docker already • Is your container 2 GB? • You don’t know how to ‘combine’ containers? • Does it take ages to build your container? • You are doing it wrong.
  13. Containerisation is not package management

  14. http://kernsuite.info

  15. Common Workflow Language

  16. A standard for building pipelines

  17. pipeline /ˈpʌɪplʌɪn/ noun “a linear sequence of specialised tasks processing measurements, with the eventual goal to produce a paper”
  18. Example • Bash script • Easy to make (maps one-to-one onto the command line)
  19. Problems • Platform specific • Data products hard to manage • What is input, what is output? • What if the pipeline fails halfway? • Parallelisation? Implementation specific • Easy to hack together but doesn’t ‘scale’
  20. What we want • A formal description of the tools • A formal description of how to combine them • Separation of responsibilities
  21. Task • Any piece of software that finishes in a ‘reasonable amount of time’ • Takes input • Has arguments • Produces output • We assume it is deterministic
  22. Workflow • Combine tasks into a workflow • A DAG (Directed Acyclic Graph) • Connect input to output • Manage parameters • Indicate what can run in parallel (implicit)
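
To make the "implicit parallelism" point concrete, here is a minimal sketch in Python of how a runner could derive parallel batches from the DAG alone. The task names and the `parallel_batches` helper are hypothetical and are not part of CWL or any runner; the sketch only assumes the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a task, each value is the set of
# tasks whose output it consumes.
workflow = {
    "flag_a": set(),
    "flag_b": set(),
    "calibrate": {"flag_a", "flag_b"},
    "image": {"calibrate"},
}


def parallel_batches(graph):
    """Return lists of tasks; tasks in the same list do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are available
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(workflow)
print(batches)
```

Nobody has to mark `flag_a` and `flag_b` as parallel: it follows from the fact that neither consumes the other's output.
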
  23. Deterministic • ‘Functional’ • Output is the same if input and parameters don’t change • Results can be cached • Tasks without inter-dependencies can run in parallel
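
A rough illustration of why determinism makes caching possible: if the output depends only on the tool, its inputs and its parameters, a hash of those three can serve as a cache key. This is a sketch under that assumption, not how any particular CWL runner implements caching; `cache_key`, `run_cached` and `fake_flagger` are hypothetical names.

```python
import hashlib
import json

_cache = {}
calls = []


def cache_key(tool, input_digests, parameters):
    """Hash everything that can influence the output of a deterministic task."""
    payload = json.dumps(
        {"tool": tool, "inputs": input_digests, "parameters": parameters},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(tool, input_digests, parameters, run):
    """Call run() only when this exact combination has not been seen before."""
    key = cache_key(tool, input_digests, parameters)
    if key not in _cache:
        _cache[key] = run()
    return _cache[key]


def fake_flagger():
    # Stand-in for an expensive task; records that it actually ran.
    calls.append("ran")
    return "flagged.MS"


first = run_cached("flagger", {"ms": "abc123"}, {"threshold": 3}, fake_flagger)
second = run_cached("flagger", {"ms": "abc123"}, {"threshold": 3}, fake_flagger)
third = run_cached("flagger", {"ms": "abc123"}, {"threshold": 5}, fake_flagger)
```

The second call is served from the cache; the third runs again because a parameter changed.
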
  24. Making a Task • How to run the actual program (KERN, containers) • Formalise IO
  25. What does it look like
    cwlVersion: v1.0
    class: CommandLineTool
    baseCommand: makezaplist.py
    inputs:
      inf:
        type: File
      birds:
        type: File
        inputBinding:
          position: 1
    outputs:
      zaplist:
        type: File
        outputBinding:
          glob: $(inputs.birds.nameroot).zaplist
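
As a reading aid for the tool description above, the following Python sketch mimics what a runner does with it: only inputs that carry an `inputBinding` end up on the command line, ordered by `position`, and `nameroot` is the file's basename without its extension. The helpers `build_command` and `nameroot` and the file paths are hypothetical simplifications; a real runner also stages files and evaluates expressions.

```python
import os


def nameroot(path):
    """Basename without its final extension, mirroring CWL's File.nameroot."""
    return os.path.splitext(os.path.basename(path))[0]


def build_command(base_command, inputs, bindings):
    """Append inputs that have an inputBinding, ordered by their position."""
    positioned = sorted(
        (binding["position"], inputs[name]) for name, binding in bindings.items()
    )
    return [base_command] + [value for _, value in positioned]


# Hypothetical file names, for illustration only.
inputs = {"inf": "/data/obs.inf", "birds": "/data/obs.birds"}
# In the description above only 'birds' has an inputBinding; 'inf' has none,
# so it is made available to the tool but not passed as an argument.
bindings = {"birds": {"position": 1}}

command = build_command("makezaplist.py", inputs, bindings)
expected_output = nameroot(inputs["birds"]) + ".zaplist"
print(command, expected_output)
```
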
  26. CWL != CWL runner • CWL is the open standard/definition • Multiple runner backends. We use: • CWL reference runner • Toil
  27. CWL runners
    • CWL reference runner: Python with a bit of Node.js • 100% CWL standard compatible • Should always work • No parallelisation • No integration with schedulers
    • Toil: Python with a bit of Node.js • Should also be compatible • Parallelisation • Support for schedulers
  28. CWL bugs solved for radio astro • Fixed various problems with measurement sets • Toil now properly supports nested directory structures • Sorting of product arrays (became apparent with presto) • In-place writing (no copy or copy-on-write, non-functional)
  29. Pipelines • Prefactor - imaging domain • Presto tutorial - time domain • Spiel (telescope simulation) - SKA SA project
  30. Systems • Laptop • Server • Cluster (HPC cloud SURFsara) • Supercomputer (Cartesius) • Grid
  31. Datasets • Tiny - 1 subband, 4 timesteps, L591513_SB000_uv_delta_t_4.MS (25 MB) • Small - 20 subbands, 10 timesteps, L570745_SB*_uv_first10.MS (500 MB) • Big - ?
  32. Prefactor pipeline

  33. Prefactor • Nice! • Public (GitHub) & Open Source • Documented • Python
  34. Prefactor issues • Not a Python project • Some parts a bit hacky (scripts) • Using parset plugins maybe not a good idea • Python 3? • Managed to work around everything
  35. Parset plugins • Call Python code from CWL • Actually not that bad • Can do something similar with CASA
  36. Presto pipeline

  37. Problems • Some programs crash inside the Docker container • Interesting pipeline, many parallel products • Found and solved problems in CWL related to sorting
  38. Spiel pipeline

  39. Spiel - active development • Basic functionality works • Interesting proof of concept because of CASA tasks • Hopefully becomes more advanced using GalSim
  40. Putting everything together!

  41. Running on laptop: macOS High Sierra 10.13.2 • MacBook Pro (Retina, 13-inch, late 2013) • 2.4 GHz Intel Core i5 • 6 GB 1600 MHz DDR3 • 250 GB SSD
  42. Running on macOS • Containerisation - Docker on macOS • Docker runs inside a VM • Reference runner works • Singularity and uDocker don’t work (not Linux) • Toil doesn’t work: its hardcoded tmp path can’t be mounted into the VM • Can’t run the big dataset, too big
  43. Running on server: Ubuntu 16.04 • Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz • 56 virtual cores (2 processors, 14 cores each, hyperthreading enabled) • 500 GB memory
  44. Running on server • CWL reference runner works, Toil works • Massive speedup with Toil due to parallelisation • Native, Docker, uDocker and Singularity all work • All pipelines work • 100% done :)
  45. Running on SURFsara HPC cloud • A bit problematic, not easy to use, buggy • Managed to automate some parts • https://github.com/EOSC-LOFAR/ansible/ • Python script to set up the cluster and fetch IP addresses • Ansible rules to install CWL, KERN, Mesos • “Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively”
  46. Ansible • Configure the system in a programmatic way using Ansible • A bit out of scope • Set up our own Mesos cluster • We have something working
  47. Bug in Toil + CWL
    Traceback (most recent call last):
      File "/usr/local/bin/toil-cwl-runner", line 9, in <module>
        load_entry_point('toil==3.13.0a1', 'console_scripts', 'toil-cwl-runner')()
      File "/usr/local/lib/python2.7/dist-packages/toil/cwl/cwltoil.py", line 870, in main
        with Toil(options) as toil:
      File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 668, in __enter__
        config.setOptions(self.options)
      File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 212, in setOptions
        setBatchOptions(self, setOption)
      File "/usr/local/lib/python2.7/dist-packages/toil/batchSystems/options.py", line 83, in setOptions
        batchSystem = factory()
      File "/usr/local/lib/python2.7/dist-packages/toil/batchSystems/registry.py", line 32, in _mesosBatchSystemFactory
        from toil.batchSystems.mesos.batchSystem import MesosBatchSystem
      File "/usr/local/lib/python2.7/dist-packages/toil/batchSystems/mesos/batchSystem.py", line 43, in <module>
        import mesos.native
      File "/usr/lib/python2.7/site-packages/mesos/native/__init__.py", line 17, in <module>
        from mesos.executor import MesosExecutorDriver
      File "/usr/lib/python2.7/site-packages/mesos/executor/__init__.py", line 17, in <module>
        from ._executor import MesosExecutorDriverImpl as MesosExecutorDriver
      File "/usr/lib/python2.7/site-packages/mesos/interface/mesos_pb2.py", line 23, in <module>
        [serialized protobuf descriptor omitted]
    TypeError: __init__() got an unexpected keyword argument 'syntax'
    https://github.com/BD2KGenomics/toil/issues/2005
  48. Running on Cartesius • Reference runner works • Toil works • Singularity works • uDocker works • Docker not available • Recommended usage: Toil + Singularity + Slurm
  49. Issues • Our jobs are quite small, use the “short” partition • $ export TOIL_SLURM_ARGS="-t 0:5:00 -p short" • The CWL job scheduler should become smarter and group together tiny tasks • CWL scripts should self-describe their computational requirements
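
The "group together tiny tasks" wish could look roughly like the Python sketch below: tasks with a runtime estimate are packed greedily into batches that each fit within the walltime requested from the scheduler. This describes a possible approach, not existing Toil or CWL behaviour; `group_tasks`, the task names and the runtime estimates are all hypothetical. The 300 second limit corresponds to the `-t 0:5:00` setting above.

```python
def group_tasks(estimated_seconds, walltime_limit):
    """Greedily pack (name, seconds) pairs into batches that fit the limit."""
    batches, current, used = [], [], 0
    for name, seconds in estimated_seconds:
        if seconds > walltime_limit:
            raise ValueError(f"{name} alone exceeds the walltime limit")
        if used + seconds > walltime_limit:
            # Current batch is full: close it and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += seconds
    if current:
        batches.append(current)
    return batches


# Hypothetical per-subband runtime estimates in seconds.
tasks = [("sb000", 120), ("sb001", 90), ("sb002", 100), ("sb003", 60)]
grouped = group_tasks(tasks, walltime_limit=300)
print(grouped)
```

Each batch would then be submitted as one scheduler job instead of four, which reduces queueing overhead for very short tasks.
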
  50. Overall issues • Singularity support in CWL not optimal (yet) • Mesos support in Toil broken • CWL task grouping & scheduling • Many tiny issues
  51. Overall status overview • Everything is almost done, roughly 2 days of work left • Still need to process the ‘big dataset’ • Benchmarks • Grid? • Report • Demonstration use case
  52. Future Work • Make more CWL tasks • Make more packages • Improve CWL standard • Improve Toil • Improve interoperability with containerisation (Singularity) and schedulers • Result/progress visualisation