Nextflow tutorial - ACGT'14

Nextflow is a DSL for data-driven pipelines. This tutorial introduces its main concepts and shows some practical examples.

Paolo Di Tommaso

May 29, 2014

Transcript

  1. A DSL FOR DATA-DRIVEN PIPELINES
    Paolo Di Tommaso
    ACGT retreat - 29 May '14

  2. WHAT IS NEXTFLOW
    A fluent DSL modelled around the UNIX pipe concept
    that simplifies writing parallel pipelines in a portable manner

  5. WAT?!
    YET ANOTHER .. !?

  6. PLENTY OF PIPELINE FRAMEWORKS!

  7. ANY BIOINFORMATICIAN IS A LINUX HACKER

  8. cat sequence | blast -in - | head -n 10 | t_coffee > result

  9. THE RATIONALE
    • Fast prototyping
    • Smooth integration with the Linux world
    • High-level parallelisation model

  10. NEXTFLOW
    • Portable across different execution platforms
    (clusters and cloud), i.e. it enables reproducibility
    • Error handling and crash recovery
    • Simplifies debugging by making it possible to reproduce errors

  11. [Architecture diagram] VFS, Groovy Runtime, Executors, Tasks dispatcher,
    Dataflow parallelisation & synchronisation, Script interpreter, Java VM 7+

  12. DATAFLOW
    • Declarative computational model for concurrent process execution
    • Processes wait for data; when an input set is ready, the process is executed
    • They communicate by using dataflow variables, i.e. async FIFO queues called channels
    • Synchronization is managed automatically
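The dataflow model described on this slide can be sketched in plain Groovy, without Nextflow. This is only an illustration under stated assumptions: the queue named `channel`, the `STOP` marker and the two threads are hypothetical stand-ins for a channel and two processes, and nothing here reflects how Nextflow itself is implemented.

```groovy
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical stand-in for a channel: an asynchronous FIFO queue
def channel = new LinkedBlockingQueue()
// Hypothetical end-of-stream marker (not a Nextflow concept shown in the slides)
def STOP = new Object()
def received = []

// "Process" p1: emits values into the channel
def p1 = Thread.start {
    ['hello', 'hola', 'bonjour', 'ciao'].each { channel.put(it) }
    channel.put(STOP)
}

// "Process" p2: waits until a value is available, then handles it
def p2 = Thread.start {
    while (true) {
        def item = channel.take()   // blocks until data is ready
        if (item.is(STOP)) break
        received << item.toUpperCase()
    }
}

p1.join()
p2.join()
println received
```

Neither thread contains explicit locking: the queue alone decides when the consumer may proceed, which is the point of the "synchronization is managed automatically" bullet.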

  15. “Dataflow variables are spectacularly expressive in concurrent programming
    when compared to explicit synchronisation”
    – Henri E. Bal, Jennifer G. Steiner, Andrew S. Tanenbaum,
    Programming languages for distributed computing systems (1989)

  16. MAIN PRIMITIVES
    • Processes: run any piece of script
    • Channels: unidirectional async queues that allow the processes to communicate
    • Operators: transform channel content

  17. [Pipeline diagram] input dataset, splitting, map, filter, task, collectFile

  18. GET STARTED
    Prerequisites: Java 7 or 8
    Install by using the following command:
    wget -qO- get.nextflow.io | bash
    nextflow

  19. THE BASICS
    Variables and assignments
    x = 1
    y = 10.5
    str = 'hello world!'
    p = x; q = y

    int x = 1
    double y = 10.5
    String str = 'hello world!'

  20. THE BASICS
    Printing values
    x = 1
    y = 10.5
    str = 'hello world!'
    print x
    print str
    print str + '\n'
    println str

  21. THE BASICS
    Printing values
    x = 1
    y = 10.5
    str = 'hello world!'
    print(x)
    print(str)
    print(str + '\n')
    println(str)

  22. MORE ON STRINGS
    str = 'bioinformatics'
    print str[0]

    print "$str is cool!"
    print "Current path: $PWD"

    str = '''
          multi
          line
          string
          '''

    str = """
          User: $USER
          Home: $HOME
          """

  23. LISTS
    simpleList = [1,2,5]
    strList = ['a','z']
    emptyList = []

    simpleList.add(anyValue)
    simpleList << anyValue

    print simpleList[0]
    print simpleList[1]
    print simpleList[0..3]

    print simpleList.size()

  24. MAPS
    map = [:]

    map = [ k1: 10, k2: 20, k3: 'str' ]

    print map.k1
    print map['k1']
    print map.get('k1')

    map.k1 = 'Hello'
    map['k1'] = 'Hello'
    map.put('k1', 'Hello')

    print map.size()

  25. CONTROL STATEMENTS
    if( x ) {
        print x
    }

    if( x == 1 ) {
        print x
    }

    if( x > 2 ) {
        // do this
    }
    else {
        // do that
    }

  26. CONTROL STATEMENTS
    for( int i=0, n=10; i<n; i++ ) {
        ..
    }

    list = [1,2,3]
    for( x : list ) {
        print x
    }

    list.each {
        print it
    }

    map.each { k, v ->
        println "$k contains $v"
    }

  27. FUNCTIONS
    def foo() {
        print 'Hello'
    }

    foo()

    def bar( x, y ) {
        x+y
    }

    print bar(1,2)

  28. CLOSURES
    Allows you to reference functions as variables
    sayHello = {
        print 'Hello'
    }
    sayHello()
    sayHello.call()

    printSum = { a, b -> print a+b }
    printSum( 5, 7 )

  29. CLOSURES
    Pass a closure as an argument
    def foo( f ) {
        // nextInt() is an instance method, so a Random object is needed
        x = new Random().nextInt()
        f.call(x)
    }
    foo( { println it + 1 } )
    foo { println it*it }

  30. CLOSURES
    for( x : list ) {
        print x
    }

    list.each {
        print it
    }

    map.each { k, v ->
        println "$k equals $v"
    }

  31. FILES
    my_file = file('any/path/to/file.txt')

    print my_file.text
    // save file content
    my_file.text = 'some content ..'
    // read line by line
    my_file.eachLine {
        println it
    }

  32. READING FASTA FILES
    my_file = file('any/path/to/file.txt')

    // split sequence by sequence
    my_file.splitFasta() {
        print it
    }

    // chunk of 10 sequences
    my_file.splitFasta(by: 10) {
        print it
    }

    // parse into map objects
    my_file.splitFasta(record: [id: true, sequence: true]) {
        print it.id
        print it.sequence
    }

  33. EXAMPLE 1
    • Print the content of a file
    • Read a FASTA file and print the sequences
    • Read a FASTA file and print only the IDs

  34. HOW TO READ IT
    str = Channel.from('hello', 'hola', 'bonjour', 'ciao')

    process sayHello {

        input:
        val str

        output:
        stdout into result

        """
        echo $str world!
        """
    }

    result.subscribe { print it }

  35. PROCESS INPUTS
    process procName {

        input:
        <input type> <input name> [from <source channel>] [attributes]

        """

        """

    }

  36. PROCESS INPUTS
    process procName {

        input:
        val x from ch_1
        file y from ch_2
        file 'data.fa' from ch_3
        stdin from ch_4
        set (x, 'file.txt') from ch_5

        """

        """

    }

  37. PROCESS INPUTS
    num = Channel.from( 1, 2, 3 )

    process basicExample {

        input:
        val x from num

        """
        echo process job $x
        """

    }

  38. PROCESS INPUTS
    proteins = Channel.fromPath( '/some/path/data.fa' )

    process blastThemAll {

        input:
        file 'query.fa' from proteins

        "blastp -query query.fa -db nr"

    }

  39. PROCESS OUTPUTS
    process randomNum {

        output:
        file 'result.txt' into numbers

        '''
        echo $RANDOM > result.txt
        '''

    }

    numbers.subscribe { println "Received: " + it.text }

  40. EXAMPLE 2
    • Execute a process running a BLAST job given an input file
    • Execute a BLAST job emitting the produced output

  41. PIPELINE PARAMETERS
    Simply declare some variables prefixed by params:
    params.p1 = 'alpha'
    params.p2 = 'beta'
    :
    When launching your script you can override the default values:
    $ nextflow <your script> --p1 'delta' --p2 'gamma'
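The "defaults that the command line can override" idea can be sketched in plain Groovy. This is a naive illustration, not Nextflow's actual argument handling: the `cliArgs` list and the pairing logic are hypothetical and only show the intended effect.

```groovy
// Defaults, as they would be declared in the script
def params = [p1: 'alpha', p2: 'beta']

// Hypothetical command-line arguments (only p1 is overridden here)
def cliArgs = ['--p1', 'delta']

// Naive override: every "--name value" pair replaces the default for that name
cliArgs.collate(2).each { pair ->
    def name = pair[0].substring(2)   // strip the leading "--"
    params[name] = pair[1]
}

println params
```

Any parameter that is not mentioned on the command line keeps its default value, which is why `p2` is untouched in this sketch.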

  42. SPLITTING CONTENT
    You can split text objects or files using the splitting methods:
    • splitText - line by line
    • splitCsv - comma separated values format
    • splitFasta - by FASTA sequences
    • splitFastq - by FASTQ sequences
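To make "splitting by FASTA sequences" concrete, here is a conceptual sketch in plain Groovy. It does not call Nextflow and is not how `splitFasta` is implemented; the sample text and the `records` list are hypothetical and only show what one record per sequence means.

```groovy
// Hypothetical sample data, not taken from the tutorial
def fasta = '''>seq1
ACGT
TTGA
>seq2
GGCC
'''

// Naive split: a new record starts at every '>' header line,
// and the following lines are appended to its sequence
def records = []
fasta.eachLine { line ->
    if (line.startsWith('>')) {
        records << [id: line.substring(1).trim(), sequence: '']
    } else if (records && line.trim()) {
        records[-1].sequence += line.trim()
    }
}

records.each { println "${it.id}: ${it.sequence}" }
```

Each element of `records` is a map with an `id` and a `sequence` key, similar in shape to the map objects produced by the `record:` option shown in the READING FASTA FILES slide.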

  43. SPLITTING CONTENT
    params.query = "$HOME/sample.fa"
    params.chunkSize = 5

    fasta = file(params.query)
    seq = Channel.from(fasta).splitFasta(by: params.chunkSize)

    process blast {
        input:
        file 'seq.fa' from seq

        output:
        file 'out' into blast_result

        """
        blastp -db $DB -query seq.fa -outfmt 6 > out
        """
    }

  44. COLLECT FILE
    The collectFile operator allows you to gather
    items produced by upstream processes
    my_items.collectFile(name:'result.txt')
    Collects all items into a single file

  45. COLLECT FILE
    The collectFile operator allows you to gather
    items produced by upstream processes
    my_items.collectFile(storeDir:'path/name') {

        def key = getKeyByItem(it)
        def content = getContentByItem(it)
        [ key, content ]

    }
    Collects the items and groups them into files
    whose names are defined by a grouping criterion
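The grouping behaviour described here, where each item is mapped to a `[ key, content ]` pair and contents sharing a key end up together, can be sketched in plain Groovy. This is a hedged illustration only: the `items` list, the `.txt` naming and the in-memory `files` map are hypothetical, and no real files are written.

```groovy
// Hypothetical items, e.g. hits tagged with the query they belong to
def items = [
    [query: 'q1', hit: 'hitA'],
    [query: 'q2', hit: 'hitB'],
    [query: 'q1', hit: 'hitC']
]

// One in-memory buffer per key; each key plays the role of a file name
def files = [:].withDefault { new StringBuilder() }

items.each { item ->
    def key = "${item.query}.txt".toString()   // grouping criterion
    def content = item.hit + '\n'
    files[key] << content                      // append to that key's buffer
}

files.each { name, text -> println "$name:\n$text" }
```

Items with the same key are appended in arrival order, so the number of resulting buffers equals the number of distinct keys rather than the number of items.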

  46. EXAMPLE 3
    Split a FASTA file, execute a BLAST query for each chunk
    and gather the results

  47. MULTIPLE INPUT FILES
    Simply use the Channel.fromPath method instead of Channel.from
    Channel.fromPath('any/path/file.txt')
    Channel.fromPath('any/path/*.txt')
    Channel.fromPath('any/path/**.txt')
    Channel.fromPath('any/path/**/*.txt')
    Channel.fromPath('any/path/**/*.txt', maxDepth: 3)

  48. EXAMPLE 4
    Split many FASTA files and execute a BLAST query
    for each of them

  49. CONFIG FILE
    Allows you to save common options and environment settings into a file.
    By default it uses nextflow.config in the current path
    params.p1 = 'alpha'
    params.p2 = 'beta'

    env.VAR_1 = 'some_value'
    env.CACHE_4_TCOFFEE = '/some/path/cache'
    env.LOCKDIR_4_TCOFFEE = '/some/path/lock'

    process.executor = 'sge'

  50. CONFIG FILE
    An alternative, (almost) equivalent syntax
    params {
        p1 = 'alpha'
        p2 = 'beta'
    }

    env {
        VAR_1 = 'some_value'
        CACHE_4_TCOFFEE = '/some/path/cache'
        LOCKDIR_4_TCOFFEE = '/some/path/lock'
    }

    process {
        executor = 'sge'
    }

  51. USING THE CLUSTER
    Simply define the SGE executor in nextflow.config
    // default properties for any process
    process.executor = 'sge'
    process.queue = 'short'
    process.clusterOptions = '-pe smp 2'
    process.scratch = true

    // specific process settings
    process.$procName.queue = 'long'
    process.$procName.clusterOptions = '-l h_rt=12:00:0'

    // set the max number of SGE jobs
    executor.$sge.queueSize = 100

  52. MORE ON OPERATORS
    Operators are commonly used to transform channel content
    Channel
        .from( 1, 2, 3, 4, 5 )
        .map { it * it }
        .subscribe { println it }

    // it prints
    1
    4
    9
    16
    25

  53. MORE ON OPERATORS
    Operators can also be used to filter, fork and combine channels.
    Moreover, they can be chained in order to implement custom behaviour.
    target1 = Channel.create()
    target2 = Channel.create()

    Channel
        .fromPath('misc/sample.fa')
        .splitFasta( record: [id: true, seqString: true ])
        .filter { record ->
            record.id =~ /^ENST0.*/
        }
        .into(target1, target2)
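The filtering step in this chain can be tried out on an ordinary list in plain Groovy. This is only an analogy under stated assumptions: `findAll` and `collect` are Groovy collection methods standing in for the channel operators, and the `records` list is hypothetical sample data shaped like the `id` / `seqString` records above.

```groovy
// Hypothetical records with the same keys as in the slide
def records = [
    [id: 'ENST0001', seqString: 'ACGT'],
    [id: 'XYZ0002',  seqString: 'GGCC'],
    [id: 'ENST0003', seqString: 'TTAA']
]

// Keep only the records whose id matches the pattern used in the slide;
// the =~ operator yields a matcher that is truthy when a match is found
def selected = records.findAll { record -> record.id =~ /^ENST0.*/ }

// Chain a second step: extract the ids of the surviving records
def ids = selected.collect { it.id }
println ids
```

The closure passed to `findAll` is the same predicate as the one given to `filter` in the slide, so it is a convenient way to check a pattern before wiring it into a pipeline.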

  54. EXAMPLE 5
    A toy RNAseq pipeline that:
    • Indexes a reference genome
    • Maps a collection of read pairs
    • Assembles a transcript for each read pair
    This example will run using a Docker container

  55. DOCKER
    • Enables running processes in an isolated environment
    • You can package and distribute a self-contained
    executable environment
    • To date it runs only on Linux (partially on OS X);
    Docker plans to support Windows as well.

  56. RESOURCES
    • nextflow.io
    • nextflow.readthedocs.org
    • groups.google.com/forum/#!forum/nextflow
    • github.com/nextflow-io/ACGT14-tutorial

  57. THANKS!