
Snake makes Vital-it a nicer place

I tried to explain the logic behind build/workflow languages. Later on I pinpoint the differences between snakemake and GNU make, and at the end I introduce a template I wrote for computation on Vital-it, a Swiss LSF cluster for bioinformatics.

Kamil S Jaroň

March 29, 2018

Transcript

  1. bash: execute line 1; execute line 2; execute line 3
    make: 1. read the script; 2. define relationships between files; 3. execute what is needed to create/update the desired output
  2. Makefile
        target : dependencies
            recipe
    shell
        make
    Make builds the target:
        if the target does not exist
        if any of the dependencies has a newer timestamp
        only if all dependencies exist
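The rebuild conditions on this slide can be sketched in a few lines of Python. This is only an illustration of the timestamp logic, not how GNU make is implemented; the function name `needs_rebuild` is made up for this example.

```python
import os


def needs_rebuild(target, dependencies):
    """Decide, make-style, whether `target` has to be (re)built.

    Hypothetical helper for illustration; real make would also try to
    build missing dependencies from other rules before giving up.
    """
    missing = [dep for dep in dependencies if not os.path.exists(dep)]
    if missing:
        # All dependencies must exist before the recipe can run.
        raise FileNotFoundError(f"missing dependencies: {missing}")
    if not os.path.exists(target):
        # The target does not exist yet, so it has to be created.
        return True
    target_mtime = os.path.getmtime(target)
    # Rebuild if any dependency has a newer timestamp than the target.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

Calling `needs_rebuild("hamlet_wc.txt", ["hamlet.txt"])` would return `True` until the word count file exists and is at least as recent as `hamlet.txt`.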
  3. Variables
        hamlet_wc.txt : hamlet.txt
            wc -w hamlet.txt > hamlet_wc.txt
    $< the first dependency
    $^ all dependencies
    $@ target
  4. Variables
        hamlet_wc.txt : hamlet.txt
            wc -w $< > $@
    $< the first dependency
    $^ all dependencies
    $@ target
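To make the substitution concrete, here is a small Python sketch that expands the three automatic variables in a recipe string. The function name is hypothetical and the sketch is simplified: real make, for instance, drops duplicate names from `$^`.

```python
def expand_automatic_variables(recipe, target, dependencies):
    """Replace $@, $< and $^ in a recipe the way the slide describes.

    Illustrative only; this is plain string replacement, not make's parser.
    """
    first = dependencies[0] if dependencies else ""
    return (
        recipe.replace("$@", target)                  # $@ is the target
        .replace("$<", first)                         # $< is the first dependency
        .replace("$^", " ".join(dependencies))        # $^ is all dependencies
    )


if __name__ == "__main__":
    print(expand_automatic_variables("wc -w $< > $@", "hamlet_wc.txt", ["hamlet.txt"]))
    # prints: wc -w hamlet.txt > hamlet_wc.txt
```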
  5. Pattern rules
    Makefile
        %_wc.txt : %.txt
            wc -w $< > $@
    executed from bash
        make hamlet_wc.txt # requires hamlet.txt
        make romeo_wc.txt # requires romeo.txt
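The way a pattern rule turns a requested target into its dependency can be sketched as follows. This is a simplified model of `%` matching with a made-up function name, assuming exactly one `%` in the target pattern.

```python
def match_pattern_rule(target, target_pattern, dependency_pattern):
    """Return the dependency implied by a pattern rule, or None if no match.

    Hypothetical helper: `%` stands for a non-empty stem shared by the
    target and the dependency, as in `%_wc.txt : %.txt`.
    """
    prefix, suffix = target_pattern.split("%", 1)
    if not (target.startswith(prefix) and target.endswith(suffix)):
        return None
    if len(target) <= len(prefix) + len(suffix):
        # The stem would be empty, so the rule does not apply.
        return None
    stem = target[len(prefix):len(target) - len(suffix)]
    return dependency_pattern.replace("%", stem, 1)


if __name__ == "__main__":
    print(match_pattern_rule("hamlet_wc.txt", "%_wc.txt", "%.txt"))  # prints: hamlet.txt
    print(match_pattern_rule("romeo_wc.txt", "%_wc.txt", "%.txt"))   # prints: romeo.txt
```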
  6. PHONY targets
    Makefile
        .PHONY : both
        both : hamlet_wc.txt romeo_wc.txt
        %_wc.txt : %.txt
            wc -w $< > $@
    executed from bash
        make both
  7. ./configure    # generate a Makefile tailored to the system
    make           # compile source code into programs
    make install   # move programs somewhere (e.g. /usr/local/bin)
  8. GNU make: exhaustive documentation; stable; limited programming; local files; running in memory; no isolation of env
    snakemake: tutorial instead of documentation; sometimes misbehaving; python flavoured!; cluster support; lock files; virtual env
  9. snakemake rules have names
        rule bwa_map:
            input:
                "data/genome.fa",
                "data/samples/A.fastq"
            output:
                "mapped_reads/A.bam"
            shell:
                "bwa mem {input} | samtools view -Sb - > {output}"
  10. Snakefiles have verbose wildcards
        rule download_all:
            input:
                "data/monkey/genome.fa.gz",
                "data/lion/genome.fa.gz"
        rule download_genome:
            output:
                "data/{sp}/genome.fa.gz"
            shell:
                "download_genome.sh {wildcards.sp} {output}"
  11. Snakefiles have blocks of "procedural" code
        species_with_genomes = []
        with open('tables/genome_table.tsv') as tab:
            tab.readline()
            for textline in tab:
                line = textline.split()
                if line[2] != 'NA':
                    species_with_genomes.append(line[0])
        rule download_all:
            input:
                expand("data/{sp}/genome.fa.gz", sp=species_with_genomes)
        rule download_genome:
            output:
                "data/{sp}/genome.fa.gz"
            shell:
                "download_genome.sh {wildcards.sp} genome_table.tsv {output}"
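The procedural block can be tried outside of snakemake with plain Python. The sketch below assumes a hypothetical table layout (species name in the first column, `NA` in the third column when no genome is available) and mimics the single-wildcard use of `expand` with `str.format`; the column names and rows are invented for the example.

```python
import io


def species_with_genomes(table):
    """Collect species whose third column is not 'NA' (header line skipped)."""
    species = []
    next(table, None)  # skip the header line
    for textline in table:
        fields = textline.split()
        if len(fields) > 2 and fields[2] != "NA":
            species.append(fields[0])
    return species


def expand_targets(pattern, species):
    """Stand-in for expand(pattern, sp=species) with a single wildcard."""
    return [pattern.format(sp=sp) for sp in species]


if __name__ == "__main__":
    # Made-up table contents, only to exercise the functions.
    table = io.StringIO(
        "species\tsize\tgenome\n"
        "monkey\t3.0\tGCA_1\n"
        "lion\t2.4\tGCA_2\n"
        "unicorn\t1.0\tNA\n"
    )
    print(expand_targets("data/{sp}/genome.fa.gz", species_with_genomes(table)))
    # prints: ['data/monkey/genome.fa.gz', 'data/lion/genome.fa.gz']
```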
  12. Running snakemake on Vital-it
    shell
        # snakemake v3.13.0 is installed
        # Warning, following lines are stupid
        snakemake download_all --jobs 10 --cluster "bsub \
            -J snakejobs \
            -q normal \
            -n 1 \
            -M 5000000 \
            -R \"span[hosts=1] rusage[tmp=50000] span[ptile=1]\" \
            -o \"logs/log.out\" \
            -e \"logs/log.err\""
  13. Snakefile can specify resources in rules
    Snakefile
        rule download_genome:
            threads: 1
            resources: mem=2000000, tmp=3000
            output: "data/{sp}/genome.fa.gz"
            shell: "download_genome.sh {wildcards.sp} genome_table.tsv {output}"
  14. When executed, resources are pulled for every submitted job
    shell
        snakemake download_all --jobs 10 --cluster "bsub \
            -J {rule} \
            -q normal \
            -n {threads} \
            -M {resources.mem} \
            -R \"span[hosts=1] rusage[tmp={resources.tmp}]\" \
            -o \"logs/{rule}.{wildcards}.out\" \
            -e \"logs/{rule}.{wildcards}.err\""
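The placeholder filling can be imitated with Python's `str.format`, which supports attribute access such as `{resources.mem}`. This is a sketch of the idea only, not snakemake's implementation. The function name is made up, and the wildcards are passed as a plain string with an assumed value, since the exact text snakemake substitutes for `{wildcards}` is not reproduced here.

```python
from types import SimpleNamespace

# Same placeholders as on the slide, written on one logical line.
CLUSTER_TEMPLATE = (
    "bsub -J {rule} -q normal -n {threads} -M {resources.mem} "
    '-R "span[hosts=1] rusage[tmp={resources.tmp}]" '
    '-o "logs/{rule}.{wildcards}.out" -e "logs/{rule}.{wildcards}.err"'
)


def render_submit_command(template, rule, threads, resources, wildcards):
    """Fill the per-job placeholders of a submission command template."""
    return template.format(
        rule=rule,
        threads=threads,
        resources=SimpleNamespace(**resources),  # allows {resources.mem}
        wildcards=wildcards,                     # assumed string form
    )


if __name__ == "__main__":
    print(render_submit_command(
        CLUSTER_TEMPLATE,
        rule="download_genome",
        threads=1,
        resources={"mem": 2000000, "tmp": 3000},
        wildcards="monkey",
    ))
```

Each submitted job gets its own command line, so memory, temporary disk space and log names follow whatever the rule declares.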
  15. Classical job scripts
        #BSUB -L /bin/bash
        #BSUB -q normal
        #BSUB -n 16
        #BSUB -M 25165824
        #BSUB -R "rusage[tmp=70000] span[ptile=16]"
        INPUTDIR=/scratch/data/raw_reads/
        INPUT=raw_reads.fq.gz
        LOCALDIR=/scratch/local/daily/$USER/$JOBID
        TARGETPATH=/scratch/data/$1/trimmed_reads
        mkdir -p $LOCALDIR $TARGETPATH
        cp $INPUTDIR/$INPUT .
        trimmomatic PE -threads 16 ... $INPUT
        mv reads[12].fq.gz $TARGETPATH
        rm $INPUT
        rmdir $LOCALDIR
  16. My alternative
        scripts/use_local.sh <script> <arguments> <output>
    the first argument is the script that gets executed
    the last argument is the output (can be specified using wildcards or a directory)
    1. copy to local disk all arguments that are valid files
    2. execute the script (first argument)
    3. move the output back to where snakemake was executed (last argument)
    4. remove all files that were copied as an input
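The four steps can be sketched in Python as below. This is not the actual `use_local.sh`; it is an illustration under several assumptions: the function name and its `local_root` parameter are invented, the command is given as a list (so an interpreter can be prepended), and the script is assumed to write its output to the path passed as its last argument.

```python
import os
import shutil
import subprocess
import tempfile


def run_on_local_disk(command, arguments, output, local_root=None):
    """Run `command arguments output` in a scratch directory, then clean up."""
    start_dir = os.getcwd()
    # Paths in the command must stay valid after changing the working directory.
    command = [os.path.abspath(c) if os.path.isfile(c) else c for c in command]
    local_dir = tempfile.mkdtemp(dir=local_root)
    copied, local_args = [], []
    try:
        # 1. copy to local disk all arguments that are valid files
        for arg in arguments:
            if os.path.isfile(arg):
                local_copy = os.path.join(local_dir, os.path.basename(arg))
                shutil.copy(arg, local_copy)
                copied.append(local_copy)
                local_args.append(os.path.basename(arg))
            else:
                local_args.append(arg)
        # 2. execute the script inside the local directory
        local_output = os.path.basename(output)
        subprocess.run([*command, *local_args, local_output],
                       cwd=local_dir, check=True)
        # 3. move the output back to where the function was called
        destination = os.path.join(start_dir, output)
        os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
        shutil.move(os.path.join(local_dir, local_output), destination)
        # 4. remove all files that were copied as an input
        for path in copied:
            os.remove(path)
    finally:
        shutil.rmtree(local_dir, ignore_errors=True)
    return destination
```

A call such as `run_on_local_disk(["bash", "scripts/trim.sh"], ["reads.fq.gz"], "trimmed/reads.fq.gz")` (hypothetical file names) would leave the original input untouched and only the output in the project directory.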
  18. Snakemake project template
    https://github.com/KamilSJaron/snakemake-vital-it-template
        ssh prd.vital-it.ch
        cd /scratch/beegfs/monthly/$USER
        git clone \
            git@github.com:KamilSJaron/snakemake-vital-it-template.git
        mv snakemake-vital-it-template the_coolest_study_ever
        cd the_coolest_study_ever
        git remote set ... ...