Slide 1

Slide 1 text

tardish>
A Python-based command-line interpreter of decorated data processing commands (not all of which have been awarded medals)
PyZMQ

Slide 2

Slide 2 text

High Performance/Throughput Computing (“HPC”) Core Resources at AgResearch (why still a command-line focus?)
• CentOS 6 based, using the ZFS filesystem for mass storage
• 5 medium to large RAM (0.5 - 1 TB) SMP (48 to 64 cores) compute servers
• 63 small to medium RAM (32 GB - 128 GB) SMP (4 to 16 cores) blade compute servers
• Condor for job scheduling and management
• Same shared filesystem seen by all servers and blades. 4 NAS heads (one for replication)
• Approx. 800 TB available mass storage in total (ZFS, mounted over NFS4)
• Python 2.6.6 (mostly) (no Python 3 yet)

Slide 3

Slide 3 text

What are we doing with it – why still often on the command-line?
[Figure labels: Time, Data]
• Genetic improvement (sheep, cattle, ryegrass, clover)
• Reduction in environmental impact (e.g. reduce livestock and pasture greenhouse gas emissions; reduce microbial and chemical freshwater contamination)
• Economic sustainability and growth

Slide 4

Slide 4 text

What we are aiming for here: ideally users would be able to interact with a very big, fairly complicated compute infrastructure using commands essentially identical to those they would use at their own Linux workstation, the only difference being that their command will return more quickly.
Without decoration, using a standard unix shell:
    sh>grep "AACTCAGA" SheepSeqs_R1.fastq
With decoration, using an HPC-and-decoration-aware shell:
    tardish>grep "AACTCAGA" _condition_fastq_input_SheepSeqs_R1.fastq.gz

Slide 5

Slide 5 text

What is going on under the hood when a decorated command is processed:
    tardish>grep "AACTCAGA" _condition_fastq_input_SheepSeqs_R1.fastq.gz
is expanded into one plain command per chunk of the input:
    sh>grep "AACTCAGA" SheepSeqs_R1.fastq.001
    sh>grep "AACTCAGA" SheepSeqs_R1.fastq.002
    sh>grep "AACTCAGA" SheepSeqs_R1.fastq.003
    ...
    sh>grep "AACTCAGA" SheepSeqs_R1.fastq.637
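A minimal sketch of the splitting step, assuming the simple four-lines-per-record FASTQ layout. The helper names (chunk_fastq_lines, chunk_commands) and the three-digit suffix are illustrative assumptions, not the tardis engine's actual code.

```python
import itertools

def chunk_fastq_lines(lines, records_per_chunk):
    """Yield lists of lines, each holding at most records_per_chunk records.

    Assumes every FASTQ record occupies exactly 4 lines.
    """
    iterator = iter(lines)
    lines_per_chunk = 4 * records_per_chunk
    while True:
        chunk = list(itertools.islice(iterator, lines_per_chunk))
        if not chunk:
            return
        yield chunk

def chunk_commands(pattern, base_name, n_chunks):
    """Build one plain shell command per chunk file (file naming is hypothetical)."""
    return ['grep "%s" %s.%03d' % (pattern, base_name, i)
            for i in range(1, n_chunks + 1)]

# Five toy records, two records per chunk
records = ["@r%d\nACGT\n+\nIIII\n" % i for i in range(5)]
lines = "".join(records).splitlines(True)
chunks = list(chunk_fastq_lines(lines, records_per_chunk=2))
commands = chunk_commands("AACTCAGA", "SheepSeqs_R1.fastq", len(chunks))
```

Because the chunker is a generator, the whole input never has to sit in memory at once, which matters for files that split into hundreds of pieces.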

Slide 6

Slide 6 text

For every command that you enter at the tardish> prompt, something like this is generated in a temp folder somewhere…
intrepid$ ls -slt /home/mccullocha/galaxy/hpc/dev/tardis_eH4nsA | head -40
total 137
14 -rw-rw-r-- 1 mccullocha mccullocha 53676 Sep 11 11:49 tardis.log
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:49 run14.sh.stderr
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:49 run14.sh.stdout
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:49 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00015
2 -rw-rw-r-- 1 mccullocha mccullocha 23 Sep 11 11:49 run14.sh.log
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:49 sample.blastout.00013
2 -rwxr-xr-x 1 mccullocha mccullocha 569 Sep 11 11:49 run14.sh
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:49 sample.blastout.00011
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:49 sample.blastout.00012
7 -rw-rw-r-- 1 mccullocha mccullocha 2255 Sep 11 11:49 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00014
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run13.sh.stderr
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run13.sh.stdout
7 -rw-rw-r-- 1 mccullocha mccullocha 23 Sep 11 11:48 run13.sh.log
7 -rwxr-xr-x 1 mccullocha mccullocha 569 Sep 11 11:48 run13.sh
7 -rw-rw-r-- 1 mccullocha mccullocha 2243 Sep 11 11:48 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00013
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run12.sh.stderr
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run12.sh.stdout
7 -rw-rw-r-- 1 mccullocha mccullocha 23 Sep 11 11:48 run12.sh.log
7 -rwxr-xr-x 1 mccullocha mccullocha 569 Sep 11 11:48 run12.sh
2 -rw-rw-r-- 1 mccullocha mccullocha 1978 Sep 11 11:48 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00012
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run11.sh.stderr
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run11.sh.stdout
2 -rw-rw-r-- 1 mccullocha mccullocha 23 Sep 11 11:48 run11.sh.log
2 -rwxr-xr-x 1 mccullocha mccullocha 569 Sep 11 11:48 run11.sh
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 sample.blastout.00010
2 -rw-rw-r-- 1 mccullocha mccullocha 2069 Sep 11 11:48 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00011
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 sample.blastout.00009
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run10.sh.stderr
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run10.sh.stdout
2 -rw-rw-r-- 1 mccullocha mccullocha 23 Sep 11 11:48 run10.sh.log
2 -rwxr-xr-x 1 mccullocha mccullocha 569 Sep 11 11:48 run10.sh
2 -rw-rw-r-- 1 mccullocha mccullocha 1719 Sep 11 11:48 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00010
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 sample.blastout.00008
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run9.sh.stderr
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run9.sh.stdout
2 -rw-rw-r-- 1 mccullocha mccullocha 23 Sep 11 11:48 run9.sh.log
2 -rwxr-xr-x 1 mccullocha mccullocha 569 Sep 11 11:48 run9.sh
2 -rw-rw-r-- 1 mccullocha mccullocha 1614 Sep 11 11:48 processed_S121Rumen-100MS_S73_L001_R2_001.fastq.trimmed.00009
1 -rw-rw-r-- 1 mccullocha mccullocha 0 Sep 11 11:48 run8.sh.stderr

Slide 7

Slide 7 text

Command decorators do for shell commands what Python function/method decorators do for Python functions and methods. See e.g. the Python-decorator based workflow system “Ruffus” (http://www.ruffus.org.uk/)
• Ruffus: you write a simple Python function implementing your processing step, and then decorate it with a Python decorator from the Ruffus decorator library. The decorator “re-conditions” your function so that it can be executed as part of a workflow on a large system
• tardish>: you write a simple command implementing your processing step, and then decorate it with a decorator from the tardish> decorator library. tardish> “re-conditions” your input data and command into a series of input files and commands which are executed on a large system, waits for the results, and merges them to give the output you expected (as though you had run the original command on your small system – but faster!)
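To make the analogy concrete, here is a small sketch of a Python function decorator that "re-conditions" a function so that it runs on chunks of its input and merges the results. The decorator name chunked is invented for this illustration; it is not part of Ruffus or of tardis.

```python
import functools

def chunked(chunk_size):
    """Decorator factory: run the wrapped function per chunk, then merge."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(items):
            results = []
            for start in range(0, len(items), chunk_size):
                # On a cluster, each of these calls would be a separate job
                results.extend(func(items[start:start + chunk_size]))
            return results
        return wrapper
    return decorate

@chunked(chunk_size=2)
def find_matches(seqs):
    """The simple processing step, written as if for a small system."""
    return [s for s in seqs if "AACTCAGA" in s]

seqs = ["TTAACTCAGATT", "CCCC", "AACTCAGA", "GGGG", "GAACTCAGAG"]
matches = find_matches(seqs)
```

The caller still writes find_matches(seqs) exactly as before; only the decoration changes how the work is carried out, which is the property command decoration aims for at the shell prompt.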

Slide 8

Slide 8 text

• Ruffus… “Handles even fiendishly complicated pipelines which would cause make or scons to go cross-eyed and recursive”
• tardish> handles simple data processing pipelines that can be expressed as either a single unix shell command, or a one-line composition of commands (such as cmd1 | cmd2 | cmd3 … or cmd1 ; cmd2 ; cmd3 …)
(tardish> solves a much smaller problem than Ruffus)

Slide 9

Slide 9 text

See also:
• xargs - like tardish>, takes a command as one of its arguments. The command is bound to command-arguments also provided by xargs to yield a series of commands which are then executed
• GNU parallel - like tardish> and xargs, takes a command (or a piped composition of commands) as one of its arguments, and binds it to command-arguments. A series of bound commands are then executed concurrently
• Unlike tardish>, these utilities don’t make use of command decoration

Slide 10

Slide 10 text

…but wait, there’s more! Further advantages of command decoration include seamless uncompression and format conversion (the bane of every bioinformatician’s life)
Without decoration:
    sh>gunzip -c CloverSeqs.fastq.gz | fastq_to_fasta -o CloverSeqs.fasta
    sh>blastn -query CloverSeqs.fasta -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 -out contamination_check.txt
With decoration:
    tardish>blastn -query _condition_fastq2fasta_input_CloverSeqs.fastq.gz -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 -out _condition_text_output_contamination_check.txt
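A rough sketch of what a fastq2fasta input conditioner has to do: open the file (gzipped or not) and rewrite each record. This version assumes strict four-line FASTQ records and uses only the standard library; the function names are hypothetical, and the real conditioner may well use a proper sequence parser instead.

```python
import gzip
import io

def open_maybe_gzipped(path):
    """Open a text file, transparently uncompressing names ending in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from FASTQ lines (assumes 4 lines per record)."""
    for i, line in enumerate(fastq_lines):
        position = i % 4
        if position == 0:
            # '@name' header becomes '>name'
            yield ">" + line.rstrip("\n")[1:] + "\n"
        elif position == 1:
            # sequence line is kept; '+' and quality lines are dropped
            yield line.rstrip("\n") + "\n"

demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
fasta = "".join(fastq_to_fasta(demo))
```

With a real file, the same generator would be fed from open_maybe_gzipped("CloverSeqs.fastq.gz").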

Slide 11

Slide 11 text

Command decoration enriches the semantic content of a command by allowing us to implicitly refer to bits of it. For example, we can seamlessly take random samples of the input (potentially very handy, but seldom done because it is very poorly supported in bioinformatics)
    tardish>set samplerate=.0001
    tardish>blastn -query _condition_fastq2fasta_input_CloverSeqs.fastq.gz -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 -out _condition_text_output_contamination_check.txt
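One simple way to take such a sample is to keep each record independently with probability samplerate. The sketch below assumes that approach; whether the tardis engine samples this way is not stated on the slide, and sample_records is a made-up name.

```python
import random

def sample_records(records, samplerate, seed=None):
    """Yield each record independently with probability samplerate.

    Works on any iterable, so the input never needs to fit in memory.
    """
    rng = random.Random(seed)
    for record in records:
        if rng.random() < samplerate:
            yield record

kept_all = list(sample_records(range(100), 1.0))
kept_none = list(sample_records(range(100), 0.0))
first_run = list(sample_records(range(1000), 0.1, seed=1))
second_run = list(sample_records(range(1000), 0.1, seed=1))
```

The number of records kept varies from run to run unless a seed is fixed; a fixed seed makes a sampled run repeatable.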

Slide 12

Slide 12 text

… many labs developing and implementing algorithms => many command-line utilities, and no standards! => many decorators.
Example: handling “by-products” of commands: “product decorators”
Without decoration:
    sh>gunzip RGSeqs.fastq.gz
    sh>/usr/bin/DynamicTrim.pl -probcutoff 0.05 RGSeqs.fastq > trim.stdout
DynamicTrim.pl also generates the following output files, which tardish> will need to find and join back up at the end:
    RGSeqs.trimmed
    RGSeqs.trimmed_segments
    RGSeqs.trimmed_segments.hist.pdf
With decoration:
    tardish>/usr/bin/DynamicTrim.pl -probcutoff 0.05 _condition_fastq_input_RGSeqs.fastq.gz > _condition_text_output_trim.stdout _condition_fastq_product_.trimmed,trimmed.fastq _condition_uncompressedtext_product_.trimmed_segments,segments.txt _condition_uncompressedpdf_product_.trimmed_segments.hist.pdf,histogram.pdf

Slide 13

Slide 13 text

More decorators! - handling simple one-line pipelines and paired-file dependencies: “throughput” and “paired input” decorators
Without decoration:
    sh>gunzip BugSeqs_P1.fastq.gz
    sh>gunzip BugSeqs_P2.fastq.gz
    sh>bwa aln protein_mRNAs.fa BugSeqs_P1.fastq > P1.sai
    sh>bwa aln protein_mRNAs.fa BugSeqs_P2.fastq > P2.sai
    sh>bwa sampe protein_mRNAs.fa P1.sai P2.sai BugSeqs_P1.fastq BugSeqs_P2.fastq > bugs.sam
With decoration:
    tardish> bwa aln protein_mRNAs.fa _condition_paired_fastq_input_BugSeqs_P1.fastq.gz > _condition_throughput_P1.sai ; bwa aln protein_mRNAs.fa _condition_paired_fastq_input_BugSeqs_P2.fastq.gz > _condition_throughput_P2.sai ; bwa sampe protein_mRNAs.fa _condition_throughput_P1.sai _condition_throughput_P2.sai _condition_paired_fastq_input_BugSeqs_P1.fastq.gz _condition_paired_fastq_input_BugSeqs_P2.fastq.gz > _condition_sam_output_bugs.sam

Slide 14

Slide 14 text

So many decorators! - command completion in tardish>
tardish> blastn _condition_
_condition_bam_output_            _condition_fasta_product_        _condition_headlesssam_output_  _condition_text_output_                _condition_uncompressedpdf_product_
_condition_bam_product_           _condition_fastq2fasta_input_    _condition_paired_fastq_input_  _condition_text_product_               _condition_uncompressedsam_output_
_condition_blastxml_output_       _condition_fastq2fasta_output_   _condition_pdf_output_          _condition_uncompressedfasta_output_   _condition_uncompressedsam_product_
_condition_blastxml_product_      _condition_fastq2fasta_product_  _condition_pdf_product_         _condition_uncompressedfasta_product_  _condition_uncompressedtext_output_
_condition_compressedtext_input_  _condition_fastq_input_          _condition_sam_output_          _condition_uncompressedfastq_output_   _condition_uncompressedtext_product_
_condition_fasta_input_           _condition_fastq_output_         _condition_sam_product_         _condition_uncompressedfastq_product_
_condition_fasta_output_          _condition_fastq_product_        _condition_text_input_          _condition_uncompressedpdf_output_
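The decorator names listed here follow a visible pattern: _condition_, then a format, then a role (input, output or product), then the file name. A sketch of how a decorated token could be pulled apart under that assumption; the regular expression and parse_decorated are illustrative only, not the engine's parser, and they do not cover the _condition_throughput_ decorator, which has no role part.

```python
import re

# Assumed shape: _condition_<format>_<input|output|product>_<target>
DECORATOR = re.compile(
    r"^_condition_(?P<format>[a-z0-9_]+?)_(?P<role>input|output|product)_(?P<target>.*)$"
)

def parse_decorated(token):
    """Return (format, role, target) for a decorated token, else None."""
    match = DECORATOR.match(token)
    if match is None:
        return None
    return match.group("format"), match.group("role"), match.group("target")

single = parse_decorated("_condition_fastq2fasta_input_CloverSeqs.fastq.gz")
paired = parse_decorated("_condition_paired_fastq_input_BugSeqs_P1.fastq.gz")
plain = parse_decorated("nt")
```

The non-greedy format group lets multi-word formats such as paired_fastq come out whole while still stopping at the first role keyword.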

Slide 15

Slide 15 text

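Caveat: “Command decoration” is fruitful (only) because many of our HPC compute jobs have a very simple “linear” structure (aka “embarrassingly parallel”). These kinds of jobs can easily be factorised into computationally identical independent pieces: the input is a ⨁-combination of chunks, and the output is the corresponding ⨁-combination of per-chunk results.
[Equations: two decompositions of the form “whole = piece ⨁ piece ⨁ piece …”; the symbol names did not survive extraction]
Example: pattern searching (e.g. grep, blast, bwa, …)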
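A hedged restatement of the factorisation idea in symbols. The notation is an assumption made for illustration and is not taken from the slide: $D$ is the input data, $D_1, \dots, D_n$ its chunks, $f$ the command being run, and $\oplus$ whatever operation joins pieces back together (for plain text output, concatenation).

```latex
D = D_1 \oplus D_2 \oplus \cdots \oplus D_n
\quad\Longrightarrow\quad
f(D) = f(D_1) \oplus f(D_2) \oplus \cdots \oplus f(D_n)
```

For a line-oriented pattern search such as grep this holds because each line is tested independently of every other line, so searching the chunks separately and concatenating the hits in order gives the same result as searching the whole file.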

Slide 16

Slide 16 text

(However, some jobs are not so easily factorised…) e.g.
• Matrix inversion (e.g. fitting statistical models)
• DNA sequence assembly – i.e. assembling complete genomes from large databases of small DNA fragments

Slide 17

Slide 17 text

Running decorated commands from a standard unix shell, for example as part of a rule in a makefile or existing script
In tardish>:
    tardish>set samplerate=.0001
    tardish>set chunksize=1500000
    tardish>blastn -query _condition_fastq2fasta_input_CloverSeqs.fastq.gz -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 > _condition_text_output_contamination_check.txt
From a standard shell:
    sh>tardis.py -c 1500000 -s .0001 -w blastn -query _condition_fastq2fasta_input_CloverSeqs.fastq.gz -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 \> _condition_text_output_contamination_check.txt
When running the tardis engine from a standard shell, you need to remember to escape shell meta-characters such as ; > | if these are part of the command you want run; otherwise /bin/sh will mess with them (whereas when using /bin/tardish you don’t need to escape them)
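If the tardis.py command line is being assembled by a script rather than typed by hand, the escaping can be left to Python's shlex module. A small sketch: shell_safe is a hypothetical helper, and the option values are simply those from the example on this slide.

```python
import shlex

def shell_safe(argv):
    """Join arguments into one shell line, quoting any meta-characters."""
    return " ".join(shlex.quote(arg) for arg in argv)

argv = [
    "tardis.py", "-c", "1500000", "-s", ".0001", "-w",
    "blastn", "-query", "_condition_fastq2fasta_input_CloverSeqs.fastq.gz",
    ">", "_condition_text_output_contamination_check.txt",
]
line = shell_safe(argv)
```

shlex.quote wraps the > in single quotes, which protects it from /bin/sh in the same way as the backslash does, so the redirection is passed through as an ordinary argument instead of being acted on by the calling shell.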

Slide 18

Slide 18 text

Processing batch files using the tardish> interpreter (via “shebang” convention)
    #!/bin/tardish
    !echo starting at `date`
    # this script contamination-checks a 1 in 10000 random sample of the data
    set samplerate=.0001
    # set non-default chunksize (note the size is before sampling.
    # So chunks in this run will only contain around 150 seqs each)
    set chunksize=1500000
    # don’t use condor for this run. Tardis engine will launch
    # concurrent processes up to MAX_PROCESSES (subprocess module)
    set hpctype=local # default is launch using condor
    # Run the command. Note that by default output is compressed – so this
    # command will yield a final output file of contamination_check.txt.gz
    blastn -query _condition_fastq2fasta_input_CloverSeqs.fastq.gz -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 -out _condition_text_output_contamination_check.txt
    !echo finished at `date`
How the interpreter picks up the script file named on its command line:
    import argparse
    import sys

    parser = argparse.ArgumentParser()
    # optional positional argument: the script passed in by the shebang line
    parser.add_argument("scriptfilename", default=None, nargs="?")
    . . .
    args = vars(parser.parse_args())
    # if we have a script file then read commands from that
    shell_stdin = sys.stdin
    if args["scriptfilename"] is not None:
        shell_stdin = open(args["scriptfilename"], "r")
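A toy illustration of the other half of this mechanism: handing an alternative stdin to an interpreter built on the standard cmd module, so that the same command loop serves both the interactive prompt and a script. MiniShell and its methods are invented for this sketch and only record what they are given; this is not the tardish> source.

```python
import cmd
import io

class MiniShell(cmd.Cmd):
    """Toy interpreter: remembers 'set' values and records other commands."""
    prompt = ""
    use_rawinput = False  # read from self.stdin instead of calling input()

    def __init__(self, stdin, stdout):
        cmd.Cmd.__init__(self, stdin=stdin, stdout=stdout)
        self.settings = {}
        self.commands = []

    def precmd(self, line):
        # treat '#' lines as comments by turning them into empty lines
        return "" if line.lstrip().startswith("#") else line

    def emptyline(self):
        # the cmd default would repeat the previous command; do nothing
        pass

    def do_set(self, arg):
        name, _, value = arg.partition("=")
        self.settings[name.strip()] = value.strip()

    def default(self, line):
        # anything that is not a built-in is a data processing command
        self.commands.append(line)

    def do_EOF(self, arg):
        return True  # end of script: leave the command loop

script = io.StringIO(
    "# sample 1 in 10000\n"
    "set samplerate=.0001\n"
    "set hpctype=local\n"
    "grep AACTCAGA _condition_fastq_input_SheepSeqs_R1.fastq.gz\n"
)
shell = MiniShell(stdin=script, stdout=io.StringIO())
shell.cmdloop()
```

Passing an open script file instead of the StringIO object gives the batch behaviour; passing sys.stdin gives the interactive one.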

Slide 19

Slide 19 text

How it works (generators and itertools)
The decorated command
    blastn -query _condition_fastq2fasta_input_seqs.fa -db nt -out _condition_text_output_sample.seqs.out
is tokenised:
    ['blastn', '-query', '_condition_fastq2fasta_input_seqs.fa', '-db', 'nt', '-out', '_condition_text_output_sample.seqs.out']
Plain tokens become repeat() iterators and decorated tokens become generators of chunk file names:
    [repeat('blastn'), repeat('-query'), <input chunk generator>, repeat('-db'), repeat('nt'), repeat('-out'), <output chunk generator>]
Stepping through these in parallel yields the commands to run:
    blastn -query seqs.fa.00001 -db nt -out sample.seqs.out.0001
    blastn -query seqs.fa.00002 -db nt -out sample.seqs.out.0002
    blastn -query seqs.fa.00003 -db nt -out sample.seqs.out.0003
    blastn -query seqs.fa.00004 -db nt -out sample.seqs.out.0004
    blastn -query seqs.fa.00005 -db nt -out sample.seqs.out.0005
    blastn -query seqs.fa.00006 -db nt -out sample.seqs.out.0006
    blastn -query seqs.fa.00007 -db nt -out sample.seqs.out.0007
    ...etc ...
Log extract from collecting and merging the results:
    ...
    2014-08-22 13:51:04,704 INFO getconditionedOutput : have received 80 of 80 job products
    2014-08-22 13:51:04,712 INFO 1 output unconditioners is unconditioning
    textDataConditioner.unconditionOutput : unconditioning the following conditioned files ['sample.seqs.out.0001','sample.seqs.out.0002',...] to sample.seqs.out
    Executing ['cat', 'sample.seqs.out.0001', 'sample.seqs.out.0002',...]
    ...
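The repeat-and-zip idea can be sketched in a few lines. The numbered() generator below is a stand-in invented for this example (a real chunk generator would split data and stop when the input runs out); itertools.repeat, itertools.count, itertools.islice and zip are standard library. Under Python 2, itertools.izip would take the place of the lazy zip used here.

```python
import itertools

def numbered(base, width):
    """Endless stand-in for a chunk generator: base.00001, base.00002, ..."""
    for i in itertools.count(1):
        yield "%s.%0*d" % (base, width, i)

argument_streams = [
    itertools.repeat("blastn"),
    itertools.repeat("-query"),
    numbered("seqs.fa", 5),          # plays the role of the input generator
    itertools.repeat("-db"),
    itertools.repeat("nt"),
    itertools.repeat("-out"),
    numbered("sample.seqs.out", 4),  # plays the role of the output generator
]

# zip pulls one argument from each stream per command; islice caps the
# endless stand-in streams at three commands for this demonstration
commands = [" ".join(args)
            for args in itertools.islice(zip(*argument_streams), 3)]
```

With finite chunk generators no explicit cap is needed, because zip stops as soon as the shortest stream is exhausted while the repeat() streams simply keep supplying the fixed arguments.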

Slide 20

Slide 20 text

“Inducting” design pattern – shared variables without using global or class variables or churning constructors
“Inducting” newly created objects “into the club” (instead of using class variables). Class variables were used originally in the engine to share state information, but these broke everything once I started using the engine classes as a library, i.e. importing the classes. This design pattern is a type of prototyping.
    # method of class localhpcJob
    def induct(self, other):
        # let the parent class share its state first, then share our own
        super(localhpcJob, self).induct(other)
        other.workerList = self.workerList
        return

    prototype = localhpcJob()
    . . .
    worker = localhpcJob()
    prototype.induct(worker)
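A self-contained sketch of the inducting pattern, with a made-up two-level class hierarchy (HpcJob, LocalHpcJob) standing in for the engine classes. The point it tries to show is that an inducted object ends up holding references to the prototype's shared state, so nothing needs to live at class or module level.

```python
class HpcJob(object):
    """Hypothetical base class holding state shared by all jobs."""
    def __init__(self):
        self.settings = None

    def induct(self, other):
        # share by reference: both objects see the same dict
        other.settings = self.settings

class LocalHpcJob(HpcJob):
    """Hypothetical subclass adding its own shared state."""
    def __init__(self):
        super(LocalHpcJob, self).__init__()
        self.worker_list = None

    def induct(self, other):
        super(LocalHpcJob, self).induct(other)
        other.worker_list = self.worker_list

prototype = LocalHpcJob()
prototype.settings = {"hpctype": "local"}
prototype.worker_list = []

worker = LocalHpcJob()
prototype.induct(worker)
worker.worker_list.append("job-1")  # visible to everyone in the club
```

Two independent prototypes give two independent "clubs", which is what shared class variables cannot offer once the classes are imported and reused as a library.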

Slide 21

Slide 21 text

Command decoration and distributed application support: connecting the tardish> interpreter to a tardish> on a remote host
tardish> client configuration:
    [tardish]
    rhost=10.10.2.114
    rport=3391
tardish> server configuration:
    [tardish]
    lport=3391
    [tardis_engine]
    valid_command_patterns=grep cat awk [t]*blast[nxp] bwa bowtie flexbar
Client and server talk over PyZMQ (“sockets on steroids”)
Client script:
    #!/bin/tardish
    # no-shared-filesystem-example – files will be copied to and from server and file monikers
    # will resolve to the actual file-path on the server
    file my_seqs is /dataset/clover_seqs/active/build1.2/translations.faa
    file my_output is /dataset/clover_seqs/active/build1.2/translations.faa.hits
    blastx -query _condition_fasta_input_$put(my_seqs) -num_threads 2 -db nt -evalue 1.0e-10 -max_target_seqs 1 -outfmt 7 -out _condition_text_output_$get(my_output)
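A server that accepts commands from remote clients needs some way of refusing arbitrary ones, which is presumably what valid_command_patterns is for. The sketch below reads that setting with configparser (the Python 3 name of the ConfigParser module) and checks the first word of a command against each pattern. Treating the patterns as regular expressions that must match the whole command name is an assumption of this example, as is the helper name is_valid_command.

```python
import configparser
import re

CONFIG_TEXT = """
[tardish]
lport=3391

[tardis_engine]
valid_command_patterns=grep cat awk [t]*blast[nxp] bwa bowtie flexbar
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)
port = config.getint("tardish", "lport")
patterns = config.get("tardis_engine", "valid_command_patterns").split()

def is_valid_command(command_line):
    """True if the command name fully matches one of the allowed patterns."""
    words = command_line.split()
    if not words:
        return False
    # assumption: each pattern is a regex applied to the command name only
    return any(re.fullmatch(p, words[0]) for p in patterns)

allowed = is_valid_command("tblastx -query seqs.fa -db nt")
refused = is_valid_command("rm -rf /dataset")
```

Read as a regular expression, [t]*blast[nxp] accepts blastn, blastx, blastp and their t-prefixed variants, which is why a regex reading was chosen for this sketch.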

Slide 22

Slide 22 text

(Actually tardish> always runs in client-server mode)
    impulsive$ tardish
    (starting a dedicated tardish server to handle this client....)
    (server pid=1539)
    (tardish server listening on port 6107)
    Welcome to tardish (commands will be queued to tardish server at localhost:6107)
    Use tab for tab completion of conditioning directives (tab three times to list all)
    tardish>

Slide 23

Slide 23 text

Python modules used
• itertools (iterator algebra for big data!)
• cmd (command interpreter framework)
• argparse (formerly optparse)
• ConfigParser (.tardishrc)
• subprocess (set hpctype=local)
dict()

Slide 24

Slide 24 text

Maturity Level
[Figure: maturity scale from 0 to 10, with markers for tardish>, the tardis engine and “something pretty mature”]
• "t > 2: works reliably for me and my colleagues and does real work - but not yet packaged or deployed outside our environment. Fairly stable"
• "t < 0.1: Bleeding edge. Works fairly reliably for me and has done a small amount of real work - but not yet used by anybody else. Not yet stable"

Slide 25

Slide 25 text

Future work:
Engine:
• Abstract the interface to the cluster manager (probably using the DRMAA API). Currently Condor-dependent
• Improve the performance of the data conditioner (e.g. the default Biopython sequence parser is probably a bit slow) (test under PyPy)
• Can probably improve the code with a judicious decorator or two
• Probably add support for more command decorators
tardish> interpreter:
• Stabilise and beta-test – it’s pretty new
• Refactor to use the Python multiprocessing module when running as a server
Both:
• Windows port
• Possible Galaxy (Python-based bioinformatics workflow system) integration
• Increase maturity level

Slide 26

Slide 26 text

The tardishian principle of performance computing: the result of using a big, very complicated and fast computer is indistinguishable from using a small, very simple, slow computer in combination with a tardis

Slide 27

Slide 27 text

[Figure: pictorial equation of the form “(image) = (image) + command-line interface”]
tardish>

Slide 28

Slide 28 text

tardish> THANKS to
> AgResearch Linux and HPC Systems Engineers Simon Guest and Russel Smithies for a fantastic HPC setup
> AgResearch IT team for a fantastic IT infrastructure
> AgResearch Bioinformaticians for trying and using the tardis engine every now and then, and letting me use it to run some of their big jobs
tardish> and THANKS to YOU for your attention
tardish>exit
https://bitbucket.org/agr-bifo/tardis