EOSC LOFAR pilot final presentation

Gijs Molenaar
January 08, 2018

This presentation discusses the results of the EOSC LOFAR pilot. The goal is to demonstrate the usability of CWL (Common Workflow Language) for radio astronomical data reduction pipelines. Target platforms include a MacBook, a single server, a cluster and a supercomputer.

Transcript

  1. Deliverables • KERN packages • CWLifying 3 pipelines • Running pipelines on various platforms • Usability report • Demonstration use case
  2. KERN: the problem • Installing scientific software • Compile flags • Dependencies • Patches • Environment variables • Consistency & reproducibility • Centralise and minimise agony
  3. What is KERN • Debian packages (Ubuntu LTS) • Released every 6 months • KERN-3, released November 27, 2017 • 73 packages and growing
  4. KERN advantages • No compilation • Consistency between computers • Consistency between platforms • Zero-knowledge setup
  5. Install prefactor on Ubuntu 16.04
    $ sudo apt-add-repository -s ppa:kernsuite/kern-3
    $ sudo apt-get update
    $ sudo apt-get install prefactor
  6. Packages in KERN: casacore-data casacore aoflagger python-casacore casasynthesis casarest lofar wsclean rmextract losoto pyvo lsmtool prefactor factor 21cmfast aips attrdict cassbeam chgcentre drive-casa spdlog sopt purify purr pymoresane pyxis rfimasker rpfits sagecal scatterbrane tempo2 sigproc sigpyproc simfast21 simms sourcery dysco tmv karma katversion katpoint katdal kittens makems meqtrees-cattery meqtrees-timba cub montblanc msutils obit oskar owlcat psrcat stimela tempo tigger tirific tkp transitions python-keepalive python-typing cwltool psrchive dspsr parseltongue presto multinest singularity-container casalite galsim cubical mt-imager
  7. What about GPU • GOOD QUESTION! • Is a bit of a problem • Not a problem without containerisation • Nvidia kernel module and library versions need to match • For Docker there is a workaround: https://github.com/NVIDIA/nvidia-docker
  8. I use Docker already • Is your container 2 GB? • You don’t know how to ‘combine’ containers? • Does it take ages to build your container? • You are doing it wrong.
  9. pipeline /ˈpʌɪplʌɪn/ noun “a linear sequence of specialised tasks processing measurements, with the eventual goal to produce a paper”
  10. Problems • Platform specific • Data products hard to manage • What is input, what is output? • What if the pipeline fails halfway? • Parallelisation? Implementation specific • Easy to hack together but doesn’t ‘scale’
  11. What we want • A formal description of the tools • A formal description of how to combine them • Split of responsibilities
  12. Task • Any piece of software that finishes in a ‘reasonable amount of time’ • Takes input • Has arguments • Produces output • We assume it is deterministic
  13. Workflow • Combine tasks into a workflow • A DAG (Directed Acyclic Graph) • Connect input to output • Manage parameters • Indicate what can run in parallel (implicit)
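The implicit parallelism of a DAG can be illustrated with a minimal Python sketch. It is not how cwltool or Toil schedule jobs; the task names in `workflow` are hypothetical, and the sketch assumes Python 3.9 or newer for the standard-library `graphlib` module. Tasks that become ready at the same time have no mutual dependencies and could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks whose output it needs.
workflow = {
    "flag": set(),
    "make_skymodel": set(),
    "calibrate": {"flag"},
    "image": {"calibrate"},
    "subtract": {"image", "make_skymodel"},
}


def parallel_stages(graph):
    """Return a list of stages; tasks inside one stage do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are available
        stages.append(ready)
        sorter.done(*ready)  # mark the whole stage as finished
    return stages


stages = parallel_stages(workflow)
for number, stage in enumerate(stages, start=1):
    print(number, stage)
```

Nothing in `workflow` says "run these two together"; the first stage contains both `flag` and `make_skymodel` only because neither consumes the other's output.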
  14. Deterministic • ‘Functional’ • Output is the same if input and parameters don’t change • Results can be cached • Tasks without inter-dependencies can run in parallel
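Why determinism makes caching safe can be shown with a small sketch, under the assumption that a task is a pure function of its input bytes and parameters. The helper names (`cache_key`, `run_cached`, `count_bytes`) are made up for illustration and do not correspond to any CWL runner API.

```python
import hashlib
import json

_cache = {}   # maps a content hash to a previously computed result
calls = []    # records real executions, to show when the cache is hit


def cache_key(task_name, input_bytes, parameters):
    """Hash everything that can influence the output of a deterministic task."""
    digest = hashlib.sha256()
    digest.update(task_name.encode() + b"\0")
    digest.update(input_bytes + b"\0")
    digest.update(json.dumps(parameters, sort_keys=True).encode())
    return digest.hexdigest()


def run_cached(task_name, func, input_bytes, parameters):
    """Run func only if this exact (task, input, parameters) was not seen before."""
    key = cache_key(task_name, input_bytes, parameters)
    if key not in _cache:
        _cache[key] = func(input_bytes, **parameters)
    return _cache[key]


def count_bytes(data, scale=1):
    """Stand-in for a real processing step."""
    calls.append(scale)
    return len(data) * scale


first = run_cached("count", count_bytes, b"visibilities", {"scale": 2})
second = run_cached("count", count_bytes, b"visibilities", {"scale": 2})  # cache hit
third = run_cached("count", count_bytes, b"visibilities", {"scale": 3})   # new parameters
print(first, second, third, calls)
```

The second call returns the stored result without running the task again; changing a parameter changes the key and forces a fresh run.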
  15. Making a Task • How to run the actual program (KERN, containers) • Formalise IO
  16. What does it look like
    cwlVersion: v1.0
    class: CommandLineTool
    baseCommand: makezaplist.py
    inputs:
      inf:
        type: File
      birds:
        type: File
        inputBinding:
          position: 1
    outputs:
      zaplist:
        type: File
        outputBinding:
          glob: $(inputs.birds.nameroot).zaplist
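As a rough reading aid, the following Python sketch mimics how a runner might interpret the tool description above. It is an illustration, not the cwltool implementation, and the file paths are hypothetical. It assumes that only inputs with an `inputBinding` appear on the command line, and that `nameroot` is the basename without its final extension.

```python
import os


def nameroot(path):
    """Basename without its final extension, mirroring the CWL File.nameroot field."""
    return os.path.splitext(os.path.basename(path))[0]


def describe_invocation(inf, birds):
    """Return the command line and the output file name the glob would match."""
    # `inf` has no inputBinding, so it is made available to the tool
    # but not passed as an argument; `birds` is bound at position 1.
    argv = ["makezaplist.py", birds]
    expected_output = nameroot(birds) + ".zaplist"
    return argv, expected_output


# Hypothetical input paths, for illustration only.
argv, zaplist = describe_invocation("data/example_rfifind.inf", "data/example.birds")
print(argv)
print(zaplist)
```

The output name is derived from the `birds` input rather than hard-coded, which is what lets the same description be reused for any observation.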
  17. CWL != CWL runner • CWL is the open standard/definition • Multiple runner backends. We use: • CWL reference runner • Toil
  18. CWL runners • CWL reference runner: Python with a bit of Node.js; 100% CWL standard compatible; should always work; no parallelisation; no integration with schedulers • Toil: Python with a bit of Node.js; should also be compatible; parallelisation; support for schedulers
  19. CWL bugs solved for radio astro • Fixed various problems with measurement sets • Toil now properly supports nested directory structures • Sorting of product arrays (became apparent with presto) • In-place writing (no copy or copy-on-write, non-functional)
  20. Pipelines • Prefactor - imaging domain • Presto tutorial - time domain • Spiel (telescope simulation) - SKA SA project
  21. Systems • Laptop • Server • Cluster (SURFsara HPC cloud) • Supercomputer (Cartesius) • Grid
  22. Datasets • Tiny - 1 subband, 4 timesteps, L591513_SB000_uv_delta_t_4.MS (25 MB) • Small - 20 subbands, 10 timesteps, L570745_SB*_uv_first10.MS (500 MB) • Big - ?
  23. Prefactor issues • Not a Python project • Some parts a bit hacky (scripts) • Using parset plugins maybe not a good idea • Python 3? • Managed to work around everything
  24. Parset plugins • Call Python code from CWL • Actually not that bad • Can do something similar with CASA
  25. Problems • Some programs crash inside a Docker container • Interesting pipeline, many parallel products • Found and solved problems in CWL related to sorting
  26. Spiel - active development • Basic functionality works • Interesting proof of concept because of CASA tasks • Hopefully becomes more advanced using GalSim
  27. Running on Laptop • macOS High Sierra 10.13.2 • MacBook Pro (Retina, 13-inch, late 2013) • 2.4 GHz Intel Core i5 • 6 GB 1600 MHz DDR3 • 250 GB SSD
  28. Running on OS X • Containerisation - Docker OS X • Docker inside VM • Reference runner works • Singularity and uDocker don’t work (not Linux) • Toil doesn’t work due to a hardcoded tmp path that can’t be mounted into the VM • Can’t run the big dataset, it is too big
  29. Running on server • Ubuntu 16.04 • Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz • 56 virtual cores (2 processors, 14 cores each, hyperthreading enabled) • 500 GB memory
  30. Running on server • CWL reference runner works, Toil works • Massive speedup with Toil due to parallelisation • Native, Docker, uDocker and Singularity all work • All pipelines work • 100% done :)
  31. Running on SURFsara HPC cloud • A bit problematic, not easy to use, buggy • Managed to automate some parts • https://github.com/EOSC-LOFAR/ansible/ • Python script to set up the cluster and fetch IP addresses • Ansible rules to install CWL, KERN, Mesos • “Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively”
  32. Ansible • Configure the system programmatically using Ansible • A bit out of scope • Set up our own Mesos cluster • We have something working
  33. Bug in Toil + CWL
    Traceback (most recent call last):
      File "/usr/local/bin/toil-cwl-runner", line 9, in <module>
        load_entry_point('toil==3.13.0a1', 'console_scripts', 'toil-cwl-runner')()
      File "/usr/local/lib/python2.7/dist-packages/toil/cwl/cwltoil.py", line 870, in main
        with Toil(options) as toil:
      File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 668, in __enter__
        config.setOptions(self.options)
      File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 212, in setOptions
        setBatchOptions(self, setOption)
      File "/usr/local/lib/python2.7/dist-packages/toil/batchSystems/options.py", line 83, in setOptions
        batchSystem = factory()
      File "/usr/local/lib/python2.7/dist-packages/toil/batchSystems/registry.py", line 32, in _mesosBatchSystemFactory
        from toil.batchSystems.mesos.batchSystem import MesosBatchSystem
      File "/usr/local/lib/python2.7/dist-packages/toil/batchSystems/mesos/batchSystem.py", line 43, in <module>
        import mesos.native
      File "/usr/lib/python2.7/site-packages/mesos/native/__init__.py", line 17, in <module>
        from mesos.executor import MesosExecutorDriver
      File "/usr/lib/python2.7/site-packages/mesos/executor/__init__.py", line 17, in <module>
        from ._executor import MesosExecutorDriverImpl as MesosExecutorDriver
      File "/usr/lib/python2.7/site-packages/mesos/interface/mesos_pb2.py", line 23, in <module>
        n\x03gid\x18\x07 \x01(\t\"r\n\x06\x44\x65vice ... \x10org.apache.mesosB\x06Protos')
    TypeError: __init__() got an unexpected keyword argument 'syntax'
    https://github.com/BD2KGenomics/toil/issues/2005
  34. Running on Cartesius • Reference runner works • Toil works • Singularity works • uDocker works • Docker not available • Recommended usage: Toil + Singularity + Slurm
  35. Issues • Our jobs are quite small, so use the “short” partition • $ export TOIL_SLURM_ARGS="-t 0:5:00 -p short" • The CWL job scheduler should become smarter and group tiny tasks together • CWL scripts should describe their own computational requirements
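The idea of grouping tiny tasks can be sketched as a simple packing problem. This is a hypothetical illustration, not something Toil or cwltool does: the task names and runtime estimates are invented, and the 300 second limit is taken from the `-t 0:5:00` setting above. Each resulting batch would be submitted as a single Slurm job instead of one job per task.

```python
def group_tasks(estimated_seconds, limit_seconds=300):
    """Greedy first-fit packing of task runtimes into batches under a wall-time limit."""
    batches = []  # each entry is [total_seconds, [task names]]
    # Place the longest tasks first so that small ones fill the remaining gaps.
    ordered = sorted(estimated_seconds.items(), key=lambda item: -item[1])
    for name, seconds in ordered:
        if seconds > limit_seconds:
            raise ValueError(f"{name} does not fit in a {limit_seconds} s job")
        for batch in batches:
            if batch[0] + seconds <= limit_seconds:
                batch[0] += seconds
                batch[1].append(name)
                break
        else:
            # No existing batch has room, so open a new one.
            batches.append([seconds, [name]])
    return [names for _, names in batches]


# Hypothetical runtime estimates in seconds.
tasks = {"flag_sb000": 40, "flag_sb001": 45, "calibrate": 200, "image": 280, "plot": 10}
batches = group_tasks(tasks)
print(batches)
```

Five scheduler submissions become two here. A real implementation would also have to respect the dependencies between tasks, which this sketch ignores.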
  36. Overall issues • Singularity support in CWL not optimal (yet) • Mesos support in Toil broken • CWL task grouping & scheduling • Many tiny issues
  37. Overall status overview • Almost everything is done, about 2 days of work left • Still need to process the ‘big dataset’ • Benchmarks • Grid? • Report • Demonstration use case
  38. Future Work • Make more CWL tasks • Make more packages • Improve CWL standard • Improve Toil • Improve interoperability with containerisation (Singularity) and schedulers • Result/progress visualisation