Nextflow tutorial - ACGT'14

Nextflow is a DSL for data-driven pipelines. This tutorial introduces its main concepts and shows some practical examples.

Paolo Di Tommaso

May 29, 2014

Transcript

  1. A DSL FOR DATA-DRIVEN PIPELINES

    Paolo Di Tommaso, ACGT retreat - 29 May '14
  2. WHAT IS NEXTFLOW

    A fluent DSL modelled around the UNIX pipe concept that simplifies writing parallel pipelines in a portable manner
  3. None
  4. None
  5. WAT?! YET ANOTHER .. !?

  6. PLENTY OF PIPELINE FRAMEWORKS!

  7. ANY BIOINFORMATICIAN IS A LINUX HACKER

  8. cat sequence | blast -in - | head -n 10 | t_coffee > result
  9. THE RATIONALE

    • Fast prototyping
    • Smooth integration with the Linux world
    • High-level parallelisation model
  10. NEXTFLOW

    • Portable across different execution platforms (clusters and cloud), i.e. enables reproducibility
    • Error handling and crash recovery
    • Simplifies debugging by making it possible to reproduce errors
  11. (Architecture diagram; components: Script interpreter, Dataflow parallelisation & synchronisation, Tasks dispatcher, Executors, VFS, Groovy Runtime, Java VM 7+)
  12. DATAFLOW

    • Declarative computational model for concurrent process execution
    • Processes wait for data; when an input set is ready, the process is executed
    • They communicate by using dataflow variables, i.e. async FIFO queues called channels
    • The synchronization is managed automatically
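
The dataflow model above can be sketched in plain Groovy. This is only an analogy, not how Nextflow is implemented: a `java.util.concurrent.LinkedBlockingQueue` stands in for a channel, two threads stand in for processes, and the `STOP` marker is a hypothetical end-of-stream signal introduced for this example.

```groovy
import java.util.concurrent.LinkedBlockingQueue

// Plain-Groovy analogy only: a FIFO queue stands in for a channel.
def channel = new LinkedBlockingQueue<Object>()
def STOP = new Object()                       // hypothetical end-of-stream marker
def received = Collections.synchronizedList([])

// Consumer "process": blocks until a value is available, then handles it.
def consumer = Thread.start {
    while (true) {
        def item = channel.take()             // waits for the next value
        if (item.is(STOP)) break
        received << item * 2
    }
}

// Producer "process": emits values without waiting for the consumer.
def producer = Thread.start {
    [1, 2, 3].each { channel.put(it) }
    channel.put(STOP)
}

producer.join()
consumer.join()
println received                              // prints [2, 4, 6]
```

Neither thread synchronises with the other explicitly; the queue is the only point of contact, which is the property the slide attributes to channels.
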
  13. p1

  14. p2 p1

  15. “Dataflow variables are spectacularly expressive in concurrent programming when compared to explicit synchronisation”

    Henri E. Bal, Jennifer G. Steiner, Andrew S. Tanenbaum. Programming languages for distributed computing systems (1989)
  16. MAIN PRIMITIVES

    • Processes: run any piece of script
    • Channels: unidirectional async queues that allow processes to communicate
    • Operators: transform channel content
  17. (Pipeline diagram; labels: input dataset, splitting, map, filter, task, collectFile)
  18. GET STARTED

    Prerequisites: Java 7 or 8

    Install by using the following command:

        wget -qO- get.nextflow.io | bash

        nextflow
  19. THE BASICS: Variables and assignments

        x = 1
        y = 10.5
        str = 'hello world!'
        p = x; q = y

        int x = 1
        double y = 10.5
        String str = 'hello world!'
  20. THE BASICS: Printing values

        x = 1
        y = 10.5
        str = 'hello world!'

        print x
        print str
        print str + '\n'
        println str
  21. THE BASICS: Printing values

        x = 1
        y = 10.5
        str = 'hello world!'

        print(x)
        print(str)
        print(str + '\n')
        println(str)
  22. MORE ON STRINGS

        str = 'bioinformatics'
        print str[0]

        print "$str is cool!"
        print "Current path: $PWD"

        str = '''
              multi
              line
              string
              '''

        str = """
              User: $USER
              Home: $HOME
              """
  23. LISTS

        simpleList = [1,2,5]
        strList = ['a','z']
        emptyList = []

        simpleList.add(anyValue)
        simpleList << anyValue

        print simpleList[0]
        print simpleList[1]
        print simpleList[0..3]

        print simpleList.size()
  24. MAPS

        map = [:]

        map = [ k1: 10, k2: 20, k3: 'str' ]

        print map.k1
        print map['k1']
        print map.get('k1')

        map.k1 = 'Hello'
        map['k1'] = 'Hello'
        map.put('k1', 'Hello')

        print map.size()
  25. CONTROL STATEMENTS

        if( x ) {
            print x
        }

        if( x == 1 ) {
            print x
        }

        if( x > 2 ) {
            // do this
        }
        else {
            // do that
        }
  26. CONTROL STATEMENTS

        for( int i=0,n=10; i<n; i++ ) {
            ..
        }

        list = [1,2,3]
        for( x : list ) {
            print x
        }

        list.each {
            print it
        }

        map.each { k, v ->
            println "$k contains $v"
        }
  27. FUNCTIONS

        def foo() {
            print 'Hello'
        }

        foo()

        def bar( x, y ) {
            x+y
        }

        print bar(1,2)
  28. CLOSURES

    Allow you to reference functions as variables

        sayHello = {
            print 'Hello'
        }

        sayHello()
        sayHello.call()

        printSum = { a, b -> print a+b }
        printSum( 5, 7 )
  29. CLOSURES

    Pass a closure as an argument

        def foo( f ) {
            // nextInt() is an instance method, so a Random object is needed
            x = new Random().nextInt()
            f.call(x)
        }

        foo( { println it + 1 } )
        foo { println it*it }
  30. CLOSURES

        for( x : list ) {
            print x
        }

        list.each {
            print it
        }

        map.each { k, v ->
            println "$k equals $v"
        }
  31. FILES

        my_file = file('any/path/to/file.txt')

        // print file content
        print my_file.text

        // save file content
        my_file.text = 'some content ..'

        // read file line by line
        my_file.eachLine {
            println it
        }
  32. READING FASTA FILES

        my_file = file('any/path/to/file.txt')

        // split sequence by sequence
        my_file.splitFasta() {
            print it
        }

        // chunks of 10 sequences
        my_file.splitFasta(by: 10) {
            print it
        }

        // parse into map objects
        my_file.splitFasta(record: [id: true, sequence: true]) {
            print it.id
            print it.sequence
        }
  33. EXAMPLE 1

    • Print the content of a file
    • Read a FASTA file and print the sequences
    • Read a FASTA file and print only the IDs
  34. HOW TO READ IT

        // the input channel is declared before the process that reads it
        str = Channel.from('hello', 'hola', 'bonjour', 'ciao')

        process sayHello {

            input:
            val str

            output:
            stdout into result

            """
            echo $str world!
            """
        }

        result.subscribe { print it }
  35. PROCESS INPUTS

        process procName {

            input:
            <input type> <name> [from <source channel>] [attributes]

            """
            <your script>
            """
        }
  36. PROCESS INPUTS

        process procName {

            input:
            val x from ch_1
            file y from ch_2
            file 'data.fa' from ch_3
            stdin from ch_4
            set (x, 'file.txt') from ch_5

            """
            <your script>
            """
        }
  37. PROCESS INPUTS

        num = Channel.from( 1, 2, 3 )

        process basicExample {

            input:
            val x from num

            """
            echo process job $x
            """
        }
  38. PROCESS INPUTS

        proteins = Channel.fromPath( '/some/path/data.fa' )

        process blastThemAll {

            input:
            file 'query.fa' from proteins

            "blastp -query query.fa -db nr"
        }
  39. PROCESS OUTPUTS

        process randomNum {

            output:
            file 'result.txt' into numbers

            '''
            echo $RANDOM > result.txt
            '''
        }

        numbers.subscribe { println "Received: " + it.text }
  40. EXAMPLE 2

    • Execute a process running a BLAST job given an input file
    • Execute a BLAST job emitting the produced output
  41. PIPELINE PARAMETERS

    Simply declare some variables prefixed by params:

        params.p1 = 'alpha'
        params.p2 = 'beta'
        :

    When launching your script you can override the default values:

        $ nextflow <script.nf> --p1 'delta' --p2 'gamma'
  42. SPLITTING CONTENT

    You can split text objects or files using the splitting methods:

    • splitText - line by line
    • splitCsv - comma separated values format
    • splitFasta - by FASTA sequences
    • splitFastq - by FASTQ sequences
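
To make the idea behind record-based splitting concrete, here is a minimal plain-Groovy sketch of what splitting FASTA text by sequence and then chunking the records could look like. It is not Nextflow's `splitFasta` implementation; `splitFastaText` and `chunk` are hypothetical helpers written for this illustration, and the sample sequences are made up.

```groovy
// Hypothetical helper: split FASTA text into records, one per '>' header line.
def splitFastaText(String text) {
    def records = []
    def current = null
    text.eachLine { line ->
        if (line.startsWith('>')) {
            // a header starts a new record
            current = [id: line.substring(1).trim(), sequence: '']
            records << current
        } else if (current != null && line.trim()) {
            // sequence lines are appended to the record being built
            current.sequence += line.trim()
        }
    }
    return records
}

// Hypothetical helper: group records into chunks of n (the last may be smaller).
def chunk(List records, int n) { records.collate(n) }

def fasta = '''>seq1
ACGT
TTGA
>seq2
GGCC
>seq3
AAAA
'''

def records = splitFastaText(fasta)
def chunks = chunk(records, 2)
println records*.id        // prints [seq1, seq2, seq3]
println chunks.size()      // prints 2
```
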
  43. SPLITTING CONTENT

        params.query = "$HOME/sample.fa"
        params.chunkSize = 5

        fasta = file(params.query)
        seq = Channel.from(fasta).splitFasta(by: params.chunkSize)

        process blast {

            input:
            file 'seq.fa' from seq

            output:
            file 'out' into blast_result

            """
            blastp -db $DB -query seq.fa -outfmt 6 > out
            """
        }
  44. COLLECT FILE

    The collectFile operator gathers the items produced by upstream processes

        my_items.collectFile(name:'result.txt')

    Collect all items into a single file
  45. COLLECT FILE

    The collectFile operator gathers the items produced by upstream processes

        my_items.collectFile(storeDir:'path/name') {

            def key = getKeyByItem(it)
            def content = getContentByItem(it)
            [ key, content ]

        }

    Collect the items and group them into files whose names are defined by a grouping criterion
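
The grouping behaviour described above can be illustrated without Nextflow. The following plain-Groovy sketch assumes each upstream item is a hypothetical `[key, content]` pair (the sample names and contents are invented) and writes one file per key into a temporary directory; it mimics the idea of the grouping closure, not the operator's actual implementation.

```groovy
import java.nio.file.Files

// Hypothetical items: [key, content] pairs as a grouping closure might return them.
def items = [
    ['sampleA', 'hit1\n'],
    ['sampleB', 'hit2\n'],
    ['sampleA', 'hit3\n'],
]

// Concatenate the content of all items sharing the same key.
def grouped = [:]
items.each { key, content ->
    grouped[key] = (grouped[key] ?: '') + content
}

// Write each group to a file named after its key.
def outDir = Files.createTempDirectory('collect').toFile()
grouped.each { key, content ->
    new File(outDir, "${key}.txt").text = content
}

println grouped.keySet()                       // prints [sampleA, sampleB]
print new File(outDir, 'sampleA.txt').text     // prints hit1 and hit3 on separate lines
```
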
  46. EXAMPLE 3

    Split a FASTA file, execute a BLAST query for each chunk and gather the results
  47. MULTIPLE INPUT FILES

    Simply use the Channel.fromPath method instead of Channel.from

        Channel.fromPath('any/path/file.txt')
        Channel.fromPath('any/path/*.txt')
        Channel.fromPath('any/path/**.txt')
        Channel.fromPath('any/path/**/*.txt')
        Channel.fromPath('any/path/**/*.txt', maxDepth: 3)
  48. EXAMPLE 4

    Split many FASTA files and execute a BLAST query for each of them
  49. CONFIG FILE

    Allows you to save common options and environment settings into a file. By default it uses nextflow.config in the current path

        params.p1 = 'alpha'
        params.p2 = 'beta'

        env.VAR_1 = 'some_value'
        env.CACHE_4_TCOFFEE = '/some/path/cache'
        env.LOCKDIR_4_TCOFFEE = '/some/path/lock'

        process.executor = 'sge'
  50. CONFIG FILE

    Alternative, (almost) equivalent syntax

        params {
            p1 = 'alpha'
            p2 = 'beta'
        }

        env {
            VAR_1 = 'some_value'
            CACHE_4_TCOFFEE = '/some/path/cache'
            LOCKDIR_4_TCOFFEE = '/some/path/lock'
        }

        process {
            executor = 'sge'
        }
  51. USING THE CLUSTER

    Simply define the SGE executor in nextflow.config

        // default properties for any process
        process.executor = 'sge'
        process.queue = 'short'
        process.clusterOptions = '-pe smp 2'
        process.scratch = true

        // specific process settings
        process.$procName.queue = 'long'
        process.$procName.clusterOptions = '-l h_rt=12:00:0'

        // set the max number of SGE jobs
        executor.$sge.queueSize = 100
  52. MORE ON OPERATORS

    Operators are commonly used to transform channel content

        Channel
            .from( 1, 2, 3, 4, 5 )
            .map { it * it }

        // it emits
        1
        4
        9
        16
        25
  53. MORE ON OPERATORS

    Operators can also be used to filter, fork and combine channels. Moreover, they can be chained in order to implement custom behaviour

        target1 = Channel.create()
        target2 = Channel.create()

        Channel
            .fromPath('misc/sample.fa')
            .splitFasta( record: [id: true, seqString: true ])
            .filter { record ->
                record.id =~ /^ENST0.*/
            }
            .into(target1, target2)
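
For readers who know Groovy collections better than channels, the `map` and `filter` operators have close analogues in `collect` and `findAll`. The sketch below uses ordinary lists in place of channels, so it runs eagerly rather than asynchronously; the record ids and sequences are invented for illustration.

```groovy
// Plain Groovy lists stand in for channels in this sketch.

// collect plays the role of the map operator
def squares = [1, 2, 3, 4, 5].collect { it * it }

// Hypothetical records shaped like splitFasta(record: [id: true, seqString: true]) output
def records = [
    [id: 'ENST0001', seqString: 'ACGT'],
    [id: 'ENSG0002', seqString: 'TTGA'],
    [id: 'ENST0003', seqString: 'GGCC'],
]

// findAll plays the role of the filter operator, with the same regex test
def selected = records.findAll { record -> record.id =~ /^ENST0.*/ }

println squares          // prints [1, 4, 9, 16, 25]
println selected*.id     // prints [ENST0001, ENST0003]
```
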
  54. EXAMPLE 5

    A toy RNAseq pipeline that:

    • Indexes a reference genome
    • Maps a collection of read pairs
    • Assembles a transcript for each read pair

    This example will run using a Docker container
  55. DOCKER

    • Enables you to run processes in an isolated environment
    • You can package and distribute a self-contained executable environment
    • To date it runs only on Linux (partially on OS X); Docker plans to support Windows as well
  56. RESOURCES

    • nextflow.io
    • nextflow.readthedocs.org
    • groups.google.com/forum/#!forum/nextflow
    • github.com/nextflow-io/ACGT14-tutorial

  57. THANKS!