
Alan McCulloch: tardis - an interpreter for command-line parallel execution


@ Kiwi PyCon 2014 - Saturday, 13 Sep 2014 - Track 2




This talk is about simplifying the command-line interface to local or cluster-based parallel computing.


Ideally the user of a command shell would be unaware whether their commands were executed as a single process on the local machine, or as many concurrent processes on either their local machine or a remote cluster, apart from the reduced time taken to complete the command if executed as concurrent processes. We have developed an approach which we call “command conditioning”, in which the user marks up a command with hints to the interpreter. These hints are used to transform the marked-up command into “(re)conditioned” native shell commands, which the interpreter then launches concurrently and monitors, collecting and collating their output and termination status. We have implemented an initial Python-based command-conditioning interpreter called tardis. We describe tardis, give examples of the class of compute tasks for which it is suited, and briefly outline key compute-cluster design characteristics which support this approach. We also touch on future work such as potential integration with Galaxy, a popular Python-based workflow system.




New Zealand Python User Group

September 13, 2014


  1. tardish> - a Python-based command-line interpreter of decorated data-processing
     commands (not all of which have been awarded medals). Uses PyZMQ.
  2. High Performance/Throughput Computing ("HPC") core resources at AgResearch (why
     still a command-line focus?)
     • CentOS 6 based, using ZFS filesystem for mass storage
     • 5 medium to large RAM (0.5 - 1 TB) SMP (48 to 64 cores) compute servers
     • 63 small to medium RAM (32GB - 128GB) SMP (4 to 16 cores) blade compute servers
     • Condor for job scheduling and management
     • Same shared filesystem seen by all servers and blades; 4 NAS heads (one for replication)
     • Approx. 800TB available mass storage in total (ZFS, mounted over NFS4)
     • Python 2.6.6 (mostly) (no Python 3 yet)
  3. What are we doing with it - why still often on the command-line?
     • Genetic improvement (sheep, cattle, ryegrass, clover)
     • Reduction in environmental impact (e.g. reduce livestock and pasture green-house
       gas emissions; reduce microbial and chemical freshwater contamination)
     • Economic sustainability and growth
  4. What we are aiming for here: ideally users would be able to interact with a very
     big, fairly complicated compute infrastructure using commands essentially identical
     to those they would use at their own linux workstation, the only difference being
     that their command will return more quickly.

     Without decoration, using a standard unix shell:
         sh>grep "AACTCAGA" SheepSeqs_R1.fastq
     With decoration, using an HPC-and-decoration-aware shell:
         tardish>grep "AACTCAGA" _condition_fastq_input_SheepSeqs_R1.fastq.gz
  5. What is going on under the hood when a decorated command is processed:
         tardish>grep "AACTCAGA" _condition_fastq_input_SheepSeqs_R1.fastq.gz
     becomes
         sh>grep "AACTCAGA" SheepSeqs_R1.fastq.001
         sh>grep "AACTCAGA" SheepSeqs_R1.fastq.002
         sh>grep "AACTCAGA" SheepSeqs_R1.fastq.003
         ...
         sh>grep "AACTCAGA" SheepSeqs_R1.fastq.637
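     The fan-out sketched on this slide can be illustrated in a few lines of Python - a
     toy sketch, not the actual tardis implementation; the chunk size, file names and
     helper names here are hypothetical:

     ```python
     import itertools

     def fastq_records(lines):
         """Group a FASTQ text stream into 4-line records."""
         it = iter(lines)
         while True:
             record = list(itertools.islice(it, 4))
             if not record:
                 return
             yield record

     def condition(records, chunksize, basename):
         """Yield (chunk_filename, chunk_records) pairs of at most chunksize records."""
         it = iter(records)
         for n in itertools.count(1):
             chunk = list(itertools.islice(it, chunksize))
             if not chunk:
                 return
             yield ("%s.%03d" % (basename, n), chunk)

     # Fan a decorated grep out into one native command per chunk
     # (the data here is fake, purely for illustration).
     lines = []
     for i in range(10):   # 10 fake FASTQ records
         lines += ["@seq%d" % i, "AACTCAGAGGT", "+", "IIIIIIIIIII"]

     commands = ['grep "AACTCAGA" %s' % name
                 for name, chunk in condition(fastq_records(lines), 4, "SheepSeqs_R1.fastq")]
     # one native grep per chunk file, numbered .001, .002, ...
     ```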
  6. For every command that you enter at the tardish> prompt, something like this is
     generated in a temp folder somewhere....

     intrepid$ ls -slt /home/mccullocha/galaxy/hpc/dev/tardis_eH4nsA | head -40
     total 137
     14 -rw-rw-r-- 1 mccullocha mccullocha 53676 Sep 11 11:49 tardis.log
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:49 run14.sh.stderr
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:49 run14.sh.stdout
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:49 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00015
      2 -rw-rw-r-- 1 mccullocha mccullocha    23 Sep 11 11:49 run14.sh.log
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:49 sample.blastout.00013
      2 -rwxr-xr-x 1 mccullocha mccullocha   569 Sep 11 11:49 run14.sh
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:49 sample.blastout.00011
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:49 sample.blastout.00012
      7 -rw-rw-r-- 1 mccullocha mccullocha  2255 Sep 11 11:49 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00014
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run13.sh.stderr
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run13.sh.stdout
      7 -rw-rw-r-- 1 mccullocha mccullocha    23 Sep 11 11:48 run13.sh.log
      7 -rwxr-xr-x 1 mccullocha mccullocha   569 Sep 11 11:48 run13.sh
      7 -rw-rw-r-- 1 mccullocha mccullocha  2243 Sep 11 11:48 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00013
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run12.sh.stderr
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run12.sh.stdout
      7 -rw-rw-r-- 1 mccullocha mccullocha    23 Sep 11 11:48 run12.sh.log
      7 -rwxr-xr-x 1 mccullocha mccullocha   569 Sep 11 11:48 run12.sh
      2 -rw-rw-r-- 1 mccullocha mccullocha  1978 Sep 11 11:48 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00012
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run11.sh.stderr
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run11.sh.stdout
      2 -rw-rw-r-- 1 mccullocha mccullocha    23 Sep 11 11:48 run11.sh.log
      2 -rwxr-xr-x 1 mccullocha mccullocha   569 Sep 11 11:48 run11.sh
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 sample.blastout.00010
      2 -rw-rw-r-- 1 mccullocha mccullocha  2069 Sep 11 11:48 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00011
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 sample.blastout.00009
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run10.sh.stderr
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run10.sh.stdout
      2 -rw-rw-r-- 1 mccullocha mccullocha    23 Sep 11 11:48 run10.sh.log
      2 -rwxr-xr-x 1 mccullocha mccullocha   569 Sep 11 11:48 run10.sh
      2 -rw-rw-r-- 1 mccullocha mccullocha  1719 Sep 11 11:48 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00010
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 sample.blastout.00008
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run9.sh.stderr
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run9.sh.stdout
      2 -rw-rw-r-- 1 mccullocha mccullocha    23 Sep 11 11:48 run9.sh.log
      2 -rwxr-xr-x 1 mccullocha mccullocha   569 Sep 11 11:48 run9.sh
      2 -rw-rw-r-- 1 mccullocha mccullocha  1614 Sep 11 11:48 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00009
      1 -rw-rw-r-- 1 mccullocha mccullocha     0 Sep 11 11:48 run8.sh.stderr
  7. Command decorators do for shell commands what python function/method decorators do
     for python functions and methods. See e.g. the python-decorator based workflow
     system "Ruffus" (http://www.ruffus.org.uk/).
     • Ruffus: you write a simple python function implementing your processing step, and
       then decorate it with a python decorator from the ruffus decorator library. The
       decorator "re-conditions" your function so that it can be executed as part of a
       workflow on a large system.
     • tardish>: you write a simple command implementing your processing step, and then
       decorate it with a decorator from the tardish> decorator library. tardish>
       "re-conditions" your input data and command into a series of input files and
       commands which are executed on a large system; it then waits for the results and
       merges them to give the output you expected (as though you had run the original
       command on your small system - but faster!).
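     For readers unfamiliar with Python decorators, the analogy can be made concrete
     with a minimal sketch - hypothetical names throughout, this is neither Ruffus nor
     tardis code - in which a decorator "re-conditions" a plain function the same way
     command decoration re-conditions a plain shell command:

     ```python
     import functools

     def parallel_over_chunks(chunks):
         """Hypothetical decorator: run a one-argument step over many input chunks
         and merge the per-chunk results, mimicking command conditioning."""
         def decorate(step):
             @functools.wraps(step)
             def conditioned(whole_input):
                 parts = chunks(whole_input)          # split ("condition") the input
                 results = [step(p) for p in parts]   # in tardis these run concurrently
                 return [r for part in results for r in part]  # merge ("uncondition")
             return conditioned
         return decorate

     @parallel_over_chunks(lambda seqs: [seqs[i:i + 2] for i in range(0, len(seqs), 2)])
     def find_motif(seqs):
         """The plain processing step: a grep-like search over a list of sequences."""
         return [s for s in seqs if "AACTCAGA" in s]

     hits = find_motif(["AACTCAGAGG", "TTTT", "xxAACTCAGA", "GGGG", "AACTCAGA"])
     ```

     The caller sees the same interface as the undecorated step, just as a tardish>
     user types essentially the same command they would at a plain shell.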
  8. • Ruffus...."Handles even fiendishly complicated pipelines which would cause make
       or scons to go cross-eyed and recursive"
     • tardish> handles simple data processing pipelines that can be expressed as either
       a single unix shell command, or a one-line composition of commands (such as
       cmd1 | cmd2 | cmd3 ... or cmd1 ; cmd2 ; cmd3 ...)
     (tardish> solves a much smaller problem than Ruffus)
  9. See also:
     • xargs - like tardish>, takes a command as one of its arguments. The command is
       bound to command-arguments also provided by xargs to yield a series of commands,
       which are then executed.
     • gnu parallel - like tardish> and xargs, takes a command (or a piped composition
       of commands) as one of its arguments, and binds it to command-arguments. A series
       of bound commands are then executed concurrently.
     • Unlike tardish>, these utilities don't make use of command decoration.
  10. ...but wait, there's more! Further advantages of command decoration include
      seamless un-compression and format conversion (the bane of every
      bioinformatician's life):

      sh>gunzip -c CloverSeqs.fastq.gz | fastq_to_fasta -o CloverSeqs.fasta
      sh>blastn -query CloverSeqs.fasta -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 -out contamination_check.txt

      tardish>blastn -query _condition_fastq2fasta_input_CloverSeqs.fastq.gz -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 -out _condition_text_output_contamination_check.txt
  11. Command decoration enriches the semantic content of a command by allowing us to
      implicitly refer to bits of it. For example we can seamlessly take random samples
      of the input (potentially very handy, but seldom done because it is very poorly
      supported in bioinformatics):

      tardish>set samplerate=.0001
      tardish>blastn -query _condition_fastq2fasta_input_CloverSeqs.fastq.gz -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 -out _condition_text_output_contamination_check.txt
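      Record-level random sampling of the kind that `set samplerate=.0001` implies can
      be sketched as a Bernoulli sample over FASTQ records - an illustration only, not
      the tardis implementation, with made-up data:

      ```python
      import random

      def sample_fastq(lines, samplerate, seed=None):
          """Yield each 4-line FASTQ record with probability samplerate."""
          rng = random.Random(seed)
          it = iter(lines)
          while True:
              record = [next(it, None) for _ in range(4)]
              if record[0] is None:
                  return
              if rng.random() < samplerate:
                  for line in record:
                      yield line

      # Illustrative data: 1000 fake records, keep roughly 10% of them.
      lines = []
      for i in range(1000):
          lines += ["@seq%d" % i, "ACGT", "+", "IIII"]

      kept = list(sample_fastq(lines, 0.1, seed=42))
      # kept is a whole number of 4-line records, roughly 10% of the input
      ```

      Sampling at the record level rather than the line level is the point: a FASTQ
      record is 4 lines, so a naive line-level sample would produce garbage.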
  12. ...many labs developing and implementing algorithms => many command-line
      utilities, and no standards! => many decorators. Example: handling "by-products"
      of commands: "product decorators".

      sh>gunzip RGSeqs.fastq.gz
      sh>/usr/bin/DynamicTrim.pl -probcutoff 0.05 RGSeqs.fastq > trim.stdout

      tardish>/usr/bin/DynamicTrim.pl -probcutoff 0.05 _condition_fastq_input_RGSeqs.fastq.gz > _condition_text_output_trim.stdout _condition_fastq_product_.trimmed,trimmed.fastq _condition_uncompressedtext_product_.trimmed_segments,segments.txt _condition_uncompressedpdf_product_.trimmed_segments.hist.pdf,histogram.pdf

      The command also generates the following output files, which tardish> will need to
      find and join back up at the end:
      RGSeqs.trimmed
      RGSeqs.trimmed_segments
      RGSeqs.trimmed_segments.hist.pdf
  13. More decorators! - handling simple one-line pipelines and paired-file
      dependencies: "throughput" and "paired input" decorators.

      sh>gunzip BugSeqs_P1.fastq.gz
      sh>gunzip BugSeqs_P2.fastq.gz
      sh>bwa aln protein_mRNAs.fa BugSeqs_P1.fa > P1.sai
      sh>bwa aln protein_mRNAs.fa BugSeqs_P2.fa > P2.sai
      sh>bwa sampe protein_mRNAs.fa P1.sai P2.sai BugSeqs_P1.fa BugSeqs_P2.fa > bugs.sam

      tardish>bwa aln protein_mRNAs.fa _condition_paired_fastq_input_BugSeqs_P1.fastq.gz > _condition_throughput_P1.sai ; bwa aln protein_mRNAs.fa _condition_paired_fastq_input_BugSeqs_P2.fastq.gz > _condition_throughput_P2.sai ; bwa sampe protein_mRNAs.fa _condition_throughput_P1.sai _condition_throughput_P2.sai _condition_paired_fastq_input_BugSeqs_P1.fastq.gz _condition_paired_fastq_input_BugSeqs_P2.fastq.gz > _condition_sam_output_bugs.sam
  14. So many decorators! - command completion in tardish>:

      tardish> blastn _condition_
      _condition_bam_output_             _condition_bam_product_            _condition_blastxml_output_
      _condition_blastxml_product_       _condition_compressedtext_input_   _condition_fasta_input_
      _condition_fasta_output_           _condition_fasta_product_          _condition_fastq2fasta_input_
      _condition_fastq2fasta_output_     _condition_fastq2fasta_product_    _condition_fastq_input_
      _condition_fastq_output_           _condition_fastq_product_          _condition_headlesssam_output_
      _condition_paired_fastq_input_     _condition_pdf_output_             _condition_pdf_product_
      _condition_sam_output_             _condition_sam_product_            _condition_text_input_
      _condition_text_output_            _condition_text_product_           _condition_uncompressedfasta_output_
      _condition_uncompressedfasta_product_ _condition_uncompressedfastq_output_ _condition_uncompressedfastq_product_
      _condition_uncompressedpdf_output_ _condition_uncompressedpdf_product_ _condition_uncompressedsam_output_
      _condition_uncompressedsam_product_ _condition_uncompressedtext_output_ _condition_uncompressedtext_product_
  15. Caveat: "command decoration" is fruitful (only) because many of our HPC compute
      jobs have a very simple "linear" structure (aka "embarrassingly parallel"). These
      kinds of jobs can easily be factorised into computationally identical, independent
      pieces:

          f(x1 ⊕ x2 ⊕ x3 ⊕ ...) = f(x1) ⊕ f(x2) ⊕ f(x3) ⊕ ...

      Example: pattern searching (e.g. grep, blast, bwa, ...):

          search(data1 ⊕ data2) ≡ search(data1) ⊕ search(data2)
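      The factorisation property can be checked concretely for a grep-like search - a
      toy Python check in which ⊕ is list concatenation:

      ```python
      def search(records, motif="AACTCAGA"):
          """A grep-like, embarrassingly parallel processing step."""
          return [r for r in records if motif in r]

      data1 = ["AACTCAGAGG", "TTTT"]
      data2 = ["xxAACTCAGA", "GGGG"]

      # searching the whole input ...
      whole = search(data1 + data2)
      # ... equals the concatenation of searches over the pieces
      pieces = search(data1) + search(data2)
      assert whole == pieces
      ```

      Jobs with this property can be split, run concurrently, and merged by simple
      concatenation, which is exactly what the conditioning/unconditioning steps do.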
  16. (However some jobs are not so easily factorised...) e.g.
      • Matrix inversion (e.g. fitting statistical models)
      • DNA sequence assembly - i.e. assembling complete genomes from large databases
        of small DNA fragments
  17. Running decorated commands from a standard unix shell, for example as part of a
      rule in a makefile or existing script.

      In tardish:
      tardish>set samplerate=.0001
      tardish>set chunksize=1500000
      tardish>blastn -query _condition_fastq2fasta_input_CloverSeqs.fastq.gz -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 > _condition_text_output_contamination_check.txt

      Equivalent, from a standard shell:
      sh>tardis.py -c 1500000 -s .0001 -w blastn -query _condition_fastq2fasta_input_CloverSeqs.fastq.gz -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 \> _condition_text_output_contamination_check.txt

      When running the tardis engine from a standard shell you need to remember to
      escape shell meta-characters such as ; > | if these are part of the command you
      want run - otherwise /bin/sh will mess with them (whereas when using /bin/tardish
      you don't need to escape them).
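      When generating such tardis.py invocations from a script, the standard library can
      do the escaping for you; a small sketch using `shlex.quote` (Python 3.3+; on the
      Python 2.6 of the talk, `pipes.quote` plays the same role):

      ```python
      import shlex

      def shell_safe(tokens):
          """Quote each token so /bin/sh passes it through to tardis.py literally."""
          return " ".join(shlex.quote(t) for t in tokens)

      # '>' must reach tardis.py as an argument, not be interpreted by the shell
      tokens = ["tardis.py", "-c", "1500000", "-s", ".0001", "-w",
                "blastn", "-query", "_condition_fastq2fasta_input_CloverSeqs.fastq.gz",
                ">", "_condition_text_output_contamination_check.txt"]
      cmdline = shell_safe(tokens)
      # the redirect token comes out quoted, so /bin/sh leaves it alone
      ```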
  18. Processing batch files using the tardish> interpreter (via the "shebang"
      convention):

      #!/bin/tardish
      !echo starting at `date`
      # this script contamination-checks a 1 in 10000 random sample of the data
      set samplerate=.0001
      # set non-default chunksize (note the size is before sampling.
      # So chunks in this run will only contain around 150 seqs each)
      set chunksize=1500000
      # don't use condor for this run. Tardis engine will launch
      # concurrent processes up to MAX_PROCESSES (subprocess module)
      set hpctype=local    # default is launch using condor
      # Run the command. Note that by default output is compressed - so this
      # command will yield a final output file of contamination_check.txt.gz
      blastn -query _condition_fastq2fasta_input_CloverSeqs.fastq.gz -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 -out _condition_text_output_contamination_check.txt
      !echo finished at `date`

      Implementation - supporting a script file as well as stdin:

      parser = argparse.ArgumentParser()
      parser.add_argument("scriptfilename", default=None, nargs="?")
      . . .
      args = vars(parser.parse_args())
      # if we have a script file then read commands from that
      shell_stdin = sys.stdin
      if args["scriptfilename"] is not None:
          shell_stdin = open(args["scriptfilename"], "r")
  19. How it works (generators and itertools). The command

      blastn -query _condition_fastq2fasta_input_seqs.fa -db nt -out _condition_text_output_sample.seqs.out

      is tokenised as

      ['blastn', '-query', '_condition_fastq2fasta_input_seqs.fa', '-db', 'nt', '-out', '_condition_text_output_sample.seqs.out']

      and turned into a list of iterators - plain tokens become repeat() iterators, and
      conditioning directives become data-conditioner objects:

      [repeat('blastn'), repeat('-query'), <tardis.fastq2fastaDataConditioner object at 0x185b9d0>, repeat('-db'), repeat('nt'), repeat('-out'), <tardis.textDataConditioner object at 0x185ba10>]

      These are zipped together (<itertools.izip object at 0x185e3f8>) to yield the
      conditioned commands:

      blastn -query seqs.fa.00001 -db nt -out sample.seqs.out.0001
      blastn -query seqs.fa.00002 -db nt -out sample.seqs.out.0002
      blastn -query seqs.fa.00003 -db nt -out sample.seqs.out.0003
      blastn -query seqs.fa.00004 -db nt -out sample.seqs.out.0004
      blastn -query seqs.fa.00005 -db nt -out sample.seqs.out.0005
      blastn -query seqs.fa.00006 -db nt -out sample.seqs.out.0006
      blastn -query seqs.fa.00007 -db nt -out sample.seqs.out.0007
      ...etc...

      2014-08-22 13:51:04,704 INFO getconditionedOutput : have received 80 of 80 job products
      2014-08-22 13:51:04,712 INFO 1 output unconditioners is unconditioning
      textDataConditioner.unconditionOutput : unconditioning the following conditioned files ['sample.seqs.out.0001','sample.seqs.out.0002',...] to sample.seqs.out
      Executing ['cat', 'sample.seqs.out.0001', 'sample.seqs.out.0002',...]
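      The zip-of-iterators trick is easy to reproduce with the standard library - a
      simplified sketch, not the actual tardis classes (on Python 2, `itertools.izip`
      plays the role of the built-in `zip` used here):

      ```python
      from itertools import islice, repeat

      class TextOutputConditioner:
          """Stand-in for a tardis data conditioner: yields numbered output file names."""
          def __init__(self, basename):
              self.basename = basename
              self.n = 0
          def __iter__(self):
              return self
          def __next__(self):
              self.n += 1
              return "%s.%04d" % (self.basename, self.n)

      class InputConditioner(TextOutputConditioner):
          """Stand-in for an input conditioner: yields numbered input chunk names."""
          def __next__(self):
              self.n += 1
              return "%s.%05d" % (self.basename, self.n)

      # One iterator per token: constant tokens repeat forever,
      # conditioned tokens count upward through chunk/output names.
      iterators = [repeat('blastn'), repeat('-query'), InputConditioner('seqs.fa'),
                   repeat('-db'), repeat('nt'),
                   repeat('-out'), TextOutputConditioner('sample.seqs.out')]

      # zipping the iterators yields one fully-bound command per chunk
      commands = [" ".join(tokens) for tokens in islice(zip(*iterators), 3)]
      ```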
  20. "Inducting" design pattern - shared variables without using global or class
      variables, or churning constructors. Newly created objects are "inducted into the
      club" instead of using class variables. (Class variables were used originally in
      the engine to share state information, but these broke everything once I started
      using the engine classes as a library - i.e. importing the classes.) This design
      pattern is a type of prototyping.

      def induct(self, other):
          super(localhpcJob, self).induct(other)
          other.workerList = self.workerList
          return

      prototype = localhpcJob()
      . . .
      worker = localhpcJob()
      prototype.induct(worker)
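      A self-contained version of the pattern, with simplified class names and fields
      assumed for illustration (the real localhpcJob carries different state):

      ```python
      class Job:
          """Base class: induct() copies shared state onto a newly created object."""
          def __init__(self):
              self.logger = None
          def induct(self, other):
              other.logger = self.logger

      class LocalHpcJob(Job):
          def __init__(self):
              super(LocalHpcJob, self).__init__()
              self.workerList = []
          def induct(self, other):
              # each class in the hierarchy shares its own slice of state
              super(LocalHpcJob, self).induct(other)
              other.workerList = self.workerList

      # the first object created acts as the prototype...
      prototype = LocalHpcJob()
      prototype.workerList.append("worker-host-1")

      # ...and inducts each newcomer, so all objects share the same state
      # without any class or global variables
      worker = LocalHpcJob()
      prototype.induct(worker)
      ```

      Because the shared objects live on instances rather than on the class, importing
      the classes as a library no longer entangles unrelated users of the module.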
  21. Command decoration and distributed application support: connecting the tardish>
      interpreter (client) to a tardish> server on a remote host, using PyZMQ ("sockets
      on steroids").

      Client-side configuration:
      [tardish]
      rhost=
      rport=3391

      Server-side configuration:
      [tardish]
      lport=3391
      [tardis_engine]
      valid_command_patterns=grep cat awk [t]*blast[nxp] bwa bowtie flexbar

      Example script:
      #!/bin/tardish
      # no-shared-filesystem example - files will be copied to and from the server, and
      # file monikers will resolve to the actual file-path on the server
      file my_seqs is /dataset/clover_seqs/active/build1.2/translations.faa
      file my_output is /dataset/clover_seqs/active/build1.2/translations.faa.hits
      blastx -query _condition_fasta_input_$put(my_seqs) -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 -out _condition_text_output_$get(my_output)
  22. (Actually tardish> always runs in client-server mode)

      impulsive$ tardish
      (starting a dedicated tardish server to handle this client....)
      (server pid=1539)
      (tardish server listening on port 6107)
      Welcome to tardish (commands will be queued to tardish server at localhost:6107)
      Use tab for tab completion of conditioning directives (tab three times to list all)
      tardish>
  23. Python modules used:
      • itertools (iterator algebra for big data!)
      • cmd (command interpreter framework)
      • argparse (formerly optparse)
      • ConfigParser (.tardishrc)
      • subprocess (set hpctype=local)
      • dict()
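      The cmd module listed above provides the interpreter loop and the tab completion
      of _condition_ directives almost for free; a minimal sketch (directive list
      abbreviated, and this is not the real tardish code):

      ```python
      import cmd

      class TardishShell(cmd.Cmd):
          """Toy command interpreter in the style of tardish>."""
          prompt = "tardish>"
          directives = ["_condition_fastq_input_", "_condition_text_output_",
                        "_condition_paired_fastq_input_"]  # abbreviated list

          def do_set(self, arg):
              """set name=value - record an interpreter variable."""
              name, _, value = arg.partition("=")
              setattr(self, name.strip(), value.strip())

          def completenames(self, text, *ignored):
              # offer _condition_ directives as completions
              return [d for d in self.directives if d.startswith(text)]

          def default(self, line):
              # a real interpreter would condition, launch and monitor the command here
              print("would condition and run: %s" % line)

          def do_exit(self, arg):
              return True  # returning True ends the cmd loop

      shell = TardishShell()
      shell.onecmd("set samplerate=.0001")
      ```

      Calling `shell.cmdloop()` would start the interactive read-eval loop with the
      tardish> prompt; `onecmd` dispatches a single line, which is handy for testing.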
  24. Maturity levels (on a 0-10 scale, where 10 is something pretty mature):
      • tardis engine: maturity > 2 - "works reliably for me and my colleagues and does
        real work - but not yet packaged or deployed outside our environment. Fairly
        stable"
      • tardish>: maturity < 0.1 - "Bleeding edge. Works fairly reliably for me and has
        done a small amount of real work - but not yet used by anybody else. Not yet
        stable"
  25. Future work:
      Engine:
      • Abstract the interface to the cluster manager (probably using the DRMAA API).
        Currently condor-dependent.
      • Improve the performance of the data conditioner (e.g. the default Biopython
        sequence parse is probably a bit slow). (Test under pypy)
      • Can probably improve the code with a judicious decorator or two
      • Probably add support for more command decorators
      tardish> interpreter:
      • Stabilise and beta-test - it's pretty new
      • Refactor to use the python multiprocessing module when running as a server
      Both:
      • Windows port
      • Possible Galaxy (python based bioinformatics workflow system) integration
      • Increase maturity level
  26. The tardishian principle of performance computing: the result of using a big, very
      complicated and fast computer is indistinguishable from using a small, very
      simple, slow computer in combination with a tardis.
  27. tardish> = tardis engine + command-line interface

  28. THANKS to:
      > AgResearch Linux and HPC Systems Engineers Simon Guest and Russel Smithies for a
        fantastic HPC setup
      > AgResearch IT team for a fantastic IT infrastructure
      > AgResearch Bioinformaticians for trying and using the tardis engine every now
        and then, and letting me use it to run some of their big jobs
      ...and THANKS to YOU for your attention

      tardish>exit
      https://bitbucket.org/agr-bifo/tardis