Slide 1

Slide 1 text

Multicore scores and resource optimisation within the Galaxy Project
Edward Hills ([email protected])
David Eyers ([email protected])

Slide 2

Slide 2 text

Motivation
• Significant levels of parallelism are here to stay
• However, many users of scientific computing:
  – have little interest in why parallelism is important to speed increases
  – do not have time or support to redevelop all their legacy code
• Our desire: increase multicore use in Otago Biochemistry
  – promoting container platform changes to better support multicore
  – presenting simple metrics to non-technologists to focus development efforts where the most effective difference can be made
• (... and to get a degree)

Slide 3

Slide 3 text

The Galaxy Project
• Galaxy provides web-based access to bioinformatics tools
  – http://galaxyproject.org/
  – Users at Otago Biochemistry seem to find it highly useful
• Galaxy provides a consistent, accessible interface that wraps stand-alone analysis software
  – scientists can focus on their actual work
  – no need for skilling up in computing methods
• Experiments are built as ‘pipelines’ of ‘tools’
  – The tool invocations provide parameter pages
  – Pipelines can be reused easily

Slide 4

Slide 4 text


Slide 5

Slide 5 text

Downsides of Galaxy
• User-friendly abstraction hides problems, as well as detail
  – As experiment datasets grow, so too will the difficulties caused
• Gross computing inefficiencies are hidden from end users
  – Many tools make poor use of the underlying resources: cores, etc.
• Separates tool developers from end users
  – Users may not understand whether to blame Galaxy or the tools
  – Tool developers may miss out on feedback
  – Users may not realise they should be expecting better performance

Slide 6

Slide 6 text

Keeping an eye on the Galaxy
• We began developing a monitoring framework for Galaxy
  – Galaxy does little monitoring of pipeline/tool optimisation itself
• System monitoring
  – Overall performance information
  – Prevents the system from blindly exhausting all resources
• Pipeline monitoring
  – Check configuration of tools in a pipeline before execution
  – Detect excessive projected resource use
  – Suggest optimisations (e.g. minimising intermediate data size through reordering)

Slide 7

Slide 7 text

Resource monitoring specifics
• Users are shown a warning if sustained RAM usage is above a given threshold
• Users are shown a warning if a large number of processes are persistently blocking for I/O
• The total percentage complete is now displayed for some of the commonly used Galaxy tools
• When known to be effective, the percentage of work complete is extrapolated to an estimated time to completion
• Tools with pre-classified RAM consumption patterns, based on key input parameters, will provide estimated RAM use
• Before executing tools, a history is consulted:
  – can suggest if invocations appear to be unreasonable
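
The sustained-threshold warning and the time-to-completion extrapolation above can be sketched as follows. This is a minimal illustration only, not the framework's actual code: the class name `SustainedThresholdMonitor`, the function `estimate_seconds_remaining`, the window size, the 90% threshold, and the assumption that progress is linear in time are all hypothetical choices made for the example.

```python
from collections import deque
from typing import Deque, Optional


class SustainedThresholdMonitor:
    """Flags a metric only when every sample in a full window exceeds the threshold.

    Hypothetical helper: the slides do not specify how "sustained" is defined.
    Here it means "above the threshold for `window` consecutive samples".
    """

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.samples: Deque[float] = deque(maxlen=window)

    def add_sample(self, value: float) -> bool:
        """Record one sample (e.g. RAM use in percent); return True if a warning is due."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)


def estimate_seconds_remaining(percent_complete: float,
                               elapsed_seconds: float) -> Optional[float]:
    """Linearly extrapolate time to completion from the percentage of work done.

    Assumes work proceeds at a constant rate, which only holds for some tools.
    Returns None when no sensible estimate can be made.
    """
    if percent_complete <= 0 or percent_complete > 100:
        return None
    return elapsed_seconds * (100.0 - percent_complete) / percent_complete


# Example: warn once RAM use has stayed above 90% for three samples in a row.
monitor = SustainedThresholdMonitor(threshold=90.0, window=3)
warnings = [monitor.add_sample(v) for v in [95, 96, 50, 97, 98, 99]]
```

A short spike (the dip to 50 resets the run) does not trigger the warning; only the final three consecutive high samples do. The same monitor shape could be applied to the count of processes blocked on I/O.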

Slide 8

Slide 8 text

Case study: Beagle phasing
• Beagle is a Markov-Chain Monte-Carlo (MCMC) algorithm
  – (Phasing ‘determines’ which alleles, i.e. alternative forms of a particular gene, come from which chromosome in a parent)
• beagle (implemented in Java) was using one core!
  – However, could split the data, reduce precision, and increase speed
• RAM use has a two-phase pattern
  – Used to provide warnings if all resources will be consumed
[Figure: beagle processing data on an 8 GiB system]

Slide 9

Slide 9 text

History-based reporting
• Provide warnings & information about likely tool behaviour
• User can ignore all information
• Can catch cases that crashed the system
  – Also have the system monitor working in parallel

Slide 10

Slide 10 text

Case study: Ensembl
• Ensembl variant effect predictor is a tool developed by EMBL-EBI and the Wellcome Trust Sanger Institute
  – Implemented in Perl
  – Also only used one core by default!
• Cannot simply partition input, due to windowing function
  – ... but can turn off the windowing function
  – then get an extra 80% in throughput per core (up to 16) ...
  – ... even though computation was slowed down for each instance
• The tool’s developers added a process forking feature
  – (usually matched to core count)

Slide 11

Slide 11 text

Multicore scores
• The multicore score aims to focus efforts on increasing the efficiency of bioinformatics tools contained within Galaxy
  – Score is easy to compute: over a tool or a pipeline
  – Provides a direct measure of relative efficiency (usually)
  – Easy to explain to scientists:
    • they can focus and prioritise developers’ future efforts
• The multicore score is the CPU utilisation of all cores over the course of a workflow or tool execution, normalised to the number of cores (C) and the total time taken (T)
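
The definition above can be read as "total CPU-busy time across all cores, divided by C × T". The sketch below follows that reading; the exact formula shown on the original slide is not reproduced in this transcript, so the function name `multicore_score` and the per-core busy-seconds input format are assumptions made for illustration.

```python
from typing import Sequence


def multicore_score(busy_seconds_per_core: Sequence[float],
                    wall_seconds: float) -> float:
    """Return the fraction of available core-time that was actually used.

    busy_seconds_per_core: CPU-busy time measured on each core during the run
                           (hypothetical input format; one entry per core, so
                           C = len(busy_seconds_per_core)).
    wall_seconds:          total elapsed time T of the tool or workflow.

    A score of 1.0 means every core was busy for the whole run; a
    single-threaded tool on an 8-core machine scores at most 1/8.
    """
    cores = len(busy_seconds_per_core)
    if cores == 0 or wall_seconds <= 0:
        raise ValueError("need at least one core and a positive elapsed time")
    return sum(busy_seconds_per_core) / (cores * wall_seconds)


# Example: a single-threaded tool keeps one of eight cores busy for 100 s.
single_threaded = multicore_score([100.0] + [0.0] * 7, wall_seconds=100.0)
```

Under this reading, the single-core behaviour reported for beagle and the Ensembl variant effect predictor would show up directly as a low score on a many-core host.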

Slide 12

Slide 12 text

Conclusions and Future Work
• Galaxy’s abstractions are greatly appreciated by scientists, but risk hiding performance problems
  – We developed a resource monitoring framework in response
• Simple aggregate metrics can give a good estimate of whether everything is “going OK” in Galaxy
  – Many Galaxy tools currently make poor use of multicore
• Ideally, a resource utilisation protocol between Galaxy and its tools would allow the tools in workflows to be scheduled for the most efficient CPU use