Multicore scores and resource optimisation within the Galaxy Project

Multicore World 2013
February 19, 2013

Transcript

  1. Motivation
     • Significant levels of parallelism are here to stay
     • However, many users of scientific computing:
       – have little interest in why parallelism is important to speed increases
       – do not have time or support to redevelop all their legacy code
     • Our desire: increase multicore use in Otago Biochemistry
       – promoting container platform changes to better support multicore
       – presenting simple metrics to non-technologists to focus development efforts where the most effective difference can be made
     • (... and to get a degree)
  2. The Galaxy Project
     • Galaxy provides web-based access to bioinformatics tools
       – http://galaxyproject.org/
       – Users at Otago Biochemistry seem to find it highly useful
     • Galaxy provides a consistent, accessible interface that wraps stand-alone analysis software
       – scientists can focus on their actual work
       – no need for skilling up in computing methods
     • Experiments are built as ‘pipelines’ of ‘tools’
       – The tool invocations provide parameter pages
       – Pipelines can be reused easily
  3. [Slide with no transcribed text]

  4. Downsides of Galaxy
     • User-friendly abstraction hides problems, as well as detail
       – As experiment datasets grow, so too will the difficulties caused
     • Gross computing inefficiencies hidden from end users
       – Many tools make poor use of the underlying resources: cores, etc.
     • Separates tool developers from end users
       – Users may not understand whether to blame Galaxy or tools
       – Tool developers may miss out on feedback
       – Users may not realise they should be expecting better performance
  5. Keeping an eye on the Galaxy
     • We began developing a monitoring framework for Galaxy
       – Galaxy does little monitoring of pipeline/tool optimisation itself
     • System monitoring
       – Overall performance information
       – Prevents system blindly exhausting all resources
     • Pipeline monitoring
       – Check configuration of tools in a pipeline before execution
       – Detect excessive projected resource use
       – Suggest optimisations (e.g. minimising intermediate data size through reordering)
  6. Resource monitoring specifics
     • Users are prompted with a warning if RAM usage stays above a given threshold for a sustained period.
     • A warning is presented to the user if a large number of processes are persistently blocked on I/O.
     • The total percentage complete is now displayed for some of the commonly used Galaxy tools.
     • When known to be effective, the percentage of work complete is extrapolated to an estimated time to completion.
     • Tools with pre-classified RAM consumption patterns, based on key input parameters, will provide estimated RAM use.
     • Before executing tools, a history is consulted:
       – can suggest if invocations appear to be unreasonable
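The sustained-RAM warning and the time-to-completion extrapolation described above could be sketched roughly as follows. This is only an illustration, not the framework's actual code: the function names, the 90% threshold and the four-sample window are assumptions, and the linear extrapolation only makes sense for tools whose progress is roughly linear in time.

```python
from typing import Optional, Sequence


def sustained_above(samples: Sequence[float], threshold: float, window: int) -> bool:
    """Return True if the last `window` samples all exceed `threshold`.

    `samples` are hypothetical RAM-usage fractions (0.0 to 1.0) taken at a
    fixed interval, so a single short spike does not trigger the warning.
    """
    if window <= 0 or len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])


def estimate_remaining(elapsed_s: float, fraction_done: float) -> Optional[float]:
    """Linearly extrapolate the remaining seconds from the fraction complete.

    Returns None when no usable progress figure has been reported yet.
    """
    if not 0.0 < fraction_done <= 1.0:
        return None
    total_s = elapsed_s / fraction_done
    return total_s - elapsed_s


# Made-up sample values, for illustration only.
ram = [0.70, 0.93, 0.95, 0.96, 0.97]
if sustained_above(ram, threshold=0.90, window=4):
    print("Warning: RAM usage has stayed above 90% for the last 4 samples")

print(estimate_remaining(elapsed_s=120.0, fraction_done=0.25))  # 360.0
```

Requiring a whole window of samples above the threshold is one simple way to express "sustained"; a real monitor might instead use a moving average.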
  7. Case study: Beagle phasing
     • Beagle is a Markov-Chain Monte-Carlo (MCMC) algorithm
       – Phasing ‘determines’ which alleles, i.e. alternative forms of a particular gene, come from which chromosome in a parent
     • beagle (implemented in Java) was using one core!
       – However, the data could be split, trading reduced precision for increased speed
     • RAM use has a two-phase pattern
       – Used to provide warnings if all resources will be consumed
     [Figure: beagle processing data on an 8 GiB system]
  8. History-based reporting
     • Provide warnings & information about likely tool behaviour
     • User can ignore all information
     • Can catch cases that crashed the system
       – Also have the system monitor working in parallel
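One way the history-based check could work is sketched below. Everything here is an assumption made for illustration: the `PastRun` record, the `history_warnings` function, the rule of scaling the worst observed RAM-per-input-byte ratio, and the example figures (which are made up, not measurements). The warnings are advisory only, matching the point that the user can ignore them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PastRun:
    """Hypothetical record of one earlier tool invocation."""
    tool: str
    input_bytes: int
    peak_ram_bytes: int
    crashed: bool


def history_warnings(history: List[PastRun], tool: str, input_bytes: int,
                     ram_available_bytes: int) -> List[str]:
    """Collect advisory warnings for a planned invocation from past runs."""
    warnings = []
    runs = [r for r in history if r.tool == tool]
    # A crash on an input no larger than this one suggests this run may fail too.
    if any(r.crashed and r.input_bytes <= input_bytes for r in runs):
        warnings.append(f"{tool}: a previous run on a smaller or equal input crashed")
    # Crude projection: scale the worst observed RAM-per-input-byte ratio.
    ratios = [r.peak_ram_bytes / r.input_bytes for r in runs if r.input_bytes > 0]
    if ratios and max(ratios) * input_bytes > ram_available_bytes:
        warnings.append(f"{tool}: projected peak RAM exceeds available memory")
    return warnings


# Illustrative, made-up history for a planned run on an 8 GiB machine.
history = [
    PastRun("beagle", 1_000_000, 2_000_000_000, crashed=False),
    PastRun("beagle", 5_000_000, 8_000_000_000, crashed=True),
]
for message in history_warnings(history, "beagle", 6_000_000, 8 * 2**30):
    print("Warning:", message)
```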
  9. Case study: Ensembl
     • Ensembl variant effect predictor is a tool developed by EMBL-EBI and the Wellcome Trust Sanger Institute
       – Implemented in Perl
       – Also only used one core by default!
     • Cannot simply partition input, due to windowing function
       – ... but can turn off the windowing function
       – then get an extra 80% in throughput per core (up to 16) ...
       – ... even though computation was slowed down for each instance
     • The tool’s developers added a process forking feature
       – (usually match to core count)
  10. Multicore scores
      • Multicore score aims to focus efforts on increasing the efficiency of bioinformatics tools contained within Galaxy
        – Score is easy to compute: over tool or pipeline
        – Provides a direct measure of relative efficiency (usually)
        – Easy to explain to scientists:
          • they can focus and prioritise developers’ future efforts
      • The multicore score is the CPU utilisation of all cores over the course of a workflow or tool execution, normalised to the number of cores (C), and the total time taken (T)
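The slide's exact formula is not in the transcript. One reading of the definition above is the mean per-core utilisation over the run: sum the utilisation of every core over time, then divide by C × T. The sketch below assumes utilisation is sampled at a fixed interval, so the number of samples stands in for T; the function name and the sample layout are illustrative assumptions.

```python
from typing import Sequence


def multicore_score(samples: Sequence[Sequence[float]]) -> float:
    """Mean per-core utilisation over a run, in the range 0.0 to 1.0.

    `samples[t][c]` is the utilisation (0.0 to 1.0) of core c during the t-th
    fixed-length sampling interval, so len(samples) stands in for T and
    len(samples[0]) for C.
    """
    if not samples or not samples[0]:
        raise ValueError("need at least one sample and one core")
    cores = len(samples[0])
    if any(len(row) != cores for row in samples):
        raise ValueError("every sample must cover the same number of cores")
    busy = sum(sum(row) for row in samples)
    return busy / (cores * len(samples))


# A single-threaded tool saturating one core of eight scores 1/8.
single_threaded = [[1.0] + [0.0] * 7 for _ in range(10)]
print(multicore_score(single_threaded))  # 0.125
```

Under this reading, a tool that keeps every core fully busy scores 1.0, which is what makes the number easy to compare across tools and pipelines.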
  11. Conclusions and Future Work
      • Galaxy’s abstractions are greatly appreciated by scientists, but risk hiding performance problems
        – We developed a resource monitoring framework in response
      • Simple aggregate metrics can give a good estimate of whether everything is “going OK” in Galaxy
        – Many Galaxy tools are making poor use of multicore currently
      • Ideally a resource utilisation protocol between Galaxy and its tools would allow scheduling of the tools in workflows for the most efficient CPU use