
Nextflow CRG tutorial

Paolo Di Tommaso
February 26, 2015

Transcript

  1. Paolo Di Tommaso

    Comparative bioinformatics

    Notredame Lab - CRG

    26 Feb 2015

  2. WHAT NEXTFLOW IS
    • A computing runtime which executes Nextflow
    pipeline scripts

    • A programming DSL that simplifies the writing of highly
    parallel computational pipelines, reusing your
    existing scripts and tools

  3. NEXTFLOW DSL
    • It is NOT a new programming language

    • It extends the Groovy scripting language

    • It provides a multi-paradigm programming
    environment


  4. MULTI-PARADIGM
    Imperative, object-oriented programming

    +

    Declarative concurrency (dataflow programming model)

  5. (Architecture diagram: the Nextflow script interpreter, dataflow
    parallelisation & synchronisation, tasks dispatcher, executors and VFS,
    layered on the Groovy runtime and the Java VM 7+.)

  6. HOW TO INSTALL
    Use the following command:

    wget -qO- get.nextflow.io | bash

    It downloads the nextflow launcher in the current directory.

  7. GET STARTED
    Log in to your course laptop:

    $ cd ~/crg-course
    $ vagrant up
    $ vagrant ssh

    Once in the virtual machine:

    $ cd ~/nextflow-tutorial
    $ git pull
    $ nextflow info

  8. THE BASICS
    Variables and assignments:

    x = 1
    y = 10.5
    str = 'hello world!'
    p = x; q = y

  9. THE BASICS
    Printing values:

    x = 1
    y = 10.5
    str = 'hello world!'

    print x
    print str
    print str + '\n'
    println str

  10. THE BASICS
    Printing values:

    x = 1
    y = 10.5
    str = 'hello world!'

    print(x)
    print(str)
    print(str + '\n')
    println(str)

  11. MORE ON STRINGS
    str = 'bioinformatics'
    print str[0]

    print "$str is cool!"
    print "Current path: $PWD"

    str = '''
           multi
           line
           string
        '''

    str = """
           User: $USER
           Home: $HOME
           """

  12. COMMON STRUCTURES &
    PROGRAMMING IDIOMS
    • Data structures: Lists & Maps

    • Control statements: if, for, while, etc.

    • Functions and classes

    • File I/O operations
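
    A minimal Groovy sketch of these idioms (data.txt is just an illustrative
    placeholder path, not part of the tutorial material):

    // lists and maps
    list = [1, 2, 3]
    map  = [tool: 'blast', cpus: 2]

    // control statements
    if( list.size() > 2 )
        println 'more than two items'

    for( x in list )
        println x

    // functions
    def square(n) { n * n }
    println square(3)

    // file I/O: read a whole file as text (data.txt is a placeholder)
    println new File('data.txt').text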


  13. 6-PAGE PRIMER
    http://refcardz.dzone.com/refcardz/groovy

  14. MAIN ABSTRACTIONS
    • Processes: run any piece of script

    • Channels: unidirectional async queues that
    allow the processes to communicate

    • Operators: transform channels content
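
    As a rough sketch of how the three pieces fit together (the process name,
    channel names and command below are illustrative only):

    greetings = Channel.from('hello', 'hola', 'ciao')      // channel

    process sayIt {                                        // process
        input:
        val x from greetings

        output:
        stdout into said

        "echo $x"
    }

    said.map { it.trim().toUpperCase() }                   // operator
        .subscribe { println it }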


  15. CHANNELS
    • A channel connects two processes/operators

    • Write operations are NOT blocking

    • Read operations are blocking

    • Once an item is read, it is removed from the queue
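
    A tiny sketch of the queue behaviour (values are arbitrary): each item is
    delivered once and removed as it is consumed, so a queue channel is normally
    read by a single process or operator.

    ch = Channel.from(1, 2, 3)

    // the subscriber consumes the items; once read they are gone from the queue
    ch.subscribe { println "got $it" }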


  16. CHANNELS
    some_items  = Channel.from(10, 20, 30, ..)
    my_channel  = Channel.create()
    single_file = Channel.fromPath('some/file/name')
    more_files  = Channel.fromPath('some/data/path/*')

    // more_files emits one item per matching file: file x, file y, file z, ...

  17. OPERATORS
    • Functions applied to channels

    • Transform channels content

    • Can also be used to filter, fork and combine
    channels

    • Operators can be chained to implement custom
    behaviours
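
    For instance, a small sketch of filtering plus chaining (the numbers are
    arbitrary):

    Channel.from(1, 2, 3, 4, 5, 6)
        .filter { it % 2 == 0 }        // keep even values only
        .map    { it * 10 }            // transform each item
        .subscribe { println it }      // prints 20, 40, 60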


  18. OPERATORS
    nums   = Channel.from(1, 2, 3, 4)
    square = nums.map { it -> it * it }

    // nums emits:   1, 2, 3, 4
    // square emits: 1, 4, 9, 16

  19. OPERATORS CHAINING
    Channel.from(1, 2, 3, 4)
        .map { it -> [it, it*it] }
        .subscribe { num, sqr -> println "Square of: $num is $sqr" }

    // it prints
    Square of: 1 is 1
    Square of: 2 is 4
    Square of: 3 is 9
    Square of: 4 is 16

  20. SPLIT FASTA FILE(S)
    Channel.fromPath('/some/path/fasta.fa')
        .splitFasta()
        .view()

    Channel.fromPath('/some/path/fasta.fa')
        .splitFasta(by: 3)
        .view()

    Channel.fromPath('/some/path/*.fa')
        .splitFasta(by: 3)
        .view()

  21. SPLITTING OPERATORS
    You can split text objects or files using the splitting methods:

    • splitText - line by line

    • splitCsv - comma separated values format

    • splitFasta - by FASTA sequences

    • splitFastq - by FASTQ sequences
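
    For instance, minimal sketches of splitText and splitCsv (the file paths
    are placeholders):

    // emit a text file line by line
    Channel.fromPath('data/records.txt')
        .splitText()
        .view()

    // emit each CSV row as a list of its values
    Channel.fromPath('data/table.csv')
        .splitCsv()
        .view()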


  22. EXAMPLE 1
    • Split a FASTA file into its sequences

    • Parse a FASTA file and count the number of
    sequences matching a specified ID

  23. EXAMPLE 1
    $ nextflow run channel_split.nf

    $ nextflow run channel_filter.nf

  24. PROCESS
    str = Channel.from('hello', 'hola', 'bonjour', 'ciao')

    process sayHello {

         input:
         val str

         output:
         stdout into result

         script:
         """
         echo $str world!
         """
    }

    result.subscribe { print it }

  25. PROCESS INPUTS
    process procName {

       input:
       <qualifier> <name> [from <channel>] [attributes]

       """

       """

    }

  26. PROCESS INPUTS
    process procName {

       input:
       val  x from ch_1
       file y from ch_2
       file 'data.fa' from ch_3
       stdin from ch_4
       set (x, 'file.txt') from ch_5

       """

       """

    }

  27. PROCESS INPUTS
    proteins = Channel.fromPath( '/some/path/data.fa' )

    process blastThemAll {

       input:
       file 'query.fa' from proteins

       "blastp -query query.fa -db nr"

    }

  28. PROCESS OUTPUTS
    process randomNum {

         output:
         file 'result.txt' into numbers

         '''
         echo $RANDOM > result.txt
         '''

    }

    numbers.subscribe { println "Received: " + it.text }

  29. USE YOUR FAVOURITE PROGRAMMING LANG
    process pyStuff {

           script:
           """
           #!/usr/bin/env python

           x = 'Hello'
           y = 'world!'
           print "%s - %s" % (x, y)
           """
    }

  30. EXAMPLE 2
    • Execute a process running a BLAST job given an
    input file

    • Execute a BLAST job emitting the produced
    output


  31. EXAMPLE 2
    $ nextflow run process_input.nf

    $ nextflow run process_output.nf

  32. PIPELINE PARAMETERS
    params.p1 = 'alpha'
    params.p2 = 'beta'
    :

    Simply declare some variables prefixed by params.
    When launching your script you can override
    the default values:

    $ nextflow run <script> --p1 'delta' --p2 'gamma'
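
    A small sketch of a parameter driving a process (the script name, parameter
    and file paths below are illustrative, not part of the tutorial material):

    // count_seqs.nf (illustrative)
    params.query = 'data/sample.fa'

    process countSeqs {
        input:
        file query from Channel.fromPath(params.query)

        "grep -c '>' $query"
    }

    $ nextflow run count_seqs.nf --query 'data/another.fa'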


  33. COLLECT FILE
    The collectFile operator allows you to gather
    items produced by upstream processes.

    my_results.collectFile(name: 'result.txt')

    Collect all items into a single file.

  34. COLLECT FILE
    The collectFile operator allows you to gather
    items produced by upstream processes.

    my_items.collectFile(storeDir: 'path/name') {

          def key     = get_a_key_from_the_item(it)
          def content = get_the_item_value(it)
          [ key, content ]

    }

    Collect the items and group them into files
    whose names are defined by a grouping criterion.
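
    For example, a concrete sketch of the grouping form, assuming my_items
    emits (id, sequence) pairs (the names are illustrative):

    my_items.collectFile(storeDir: 'results') { item ->
        def id  = item[0]              // the grouping key
        def seq = item[1]              // the content to append
        [ "${id}.fa", seq ]            // one file per id, named after the id
    }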


  35. EXAMPLE 3
    • Split a FASTA file, execute a BLAST query for
    each chunk and gather the results

    • Split multiple FASTA files and execute a BLAST
    query for each chunk

  36. EXAMPLE 3
    $ nextflow run split_fasta.nf

    $ nextflow run split_fasta.nf --chunkSize 2

    $ nextflow run split_fasta.nf --chunkSize 2 --query data/p\*.fa

    $ nextflow run split_and_collect.nf

  37. UNDERSTANDING MULTIPLE INPUTS
    (Diagram: a process reading from two input channels; each task consumes one
    value from each channel, so task 1 produces out x and task 2 produces out y,
    until one of the channels is exhausted.)

  38. UNDERSTANDING MULTIPLE INPUTS
    (Diagram: when one of the input channels emits a single value, that value is
    reused for every set of inputs, spawning task 1, task 2, task 3, ... task n,
    producing out x, out y, out z, ...)
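
    A rough sketch of this behaviour (the channel contents and the process below
    are illustrative): one task is spawned for each set of values taken from the
    input channels.

    nums    = Channel.from(1, 2, 3)
    letters = Channel.from('a', 'b', 'c')

    process pairThem {
        input:
        val x from nums
        val y from letters

        output:
        stdout into pairs

        "echo $x-$y"
    }

    // one task per pair of values, e.g. 1-a, 2-b, 3-c
    pairs.subscribe { print it }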


  39. CONFIG FILE
    • Pipeline configuration can be externalised to a file
    named nextflow.config

    • parameters

    • environment variables

    • required resources (mem, cpus, queue, etc.)

    • modules/containers


  40. CONFIG FILE
    params.p1 = 'alpha'
    params.p2 = 'beta'

    env.VAR_1 = 'some_value'
    env.CACHE_4_TCOFFEE = '/some/path/cache'
    env.LOCKDIR_4_TCOFFEE = '/some/path/lock'

    process.executor = 'sge'

  41. CONFIG FILE
    params {
       p1 = 'alpha'
       p2 = 'beta'
    }

    env {
       VAR_1 = 'some_value'
       CACHE_4_TCOFFEE = '/some/path/cache'
       LOCKDIR_4_TCOFFEE = '/some/path/lock'
    }

    process {
       executor = 'sge'
    }

    An alternative, (almost) equivalent syntax.

  42. HOW TO USE DOCKER
    Specify in the config file the Docker image to use:

    process {
        container = '<docker-image>'
    }

    Add the -with-docker flag when launching it:

    $ nextflow run <script> -with-docker

  43. EXAMPLE 4
    Launch a pipeline using a Docker container


  44. EXAMPLE 4
    $ nextflow run blast_extract.nf -with-docker

  45. HOW TO USE THE CLUSTER
    Define the CRG executor in nextflow.config:

    // default properties for any process
    process {
      executor = 'crg'
      queue = 'short'
      cpus = 2
      memory = '4GB'
      scratch = true
    }

  46. PROCESS RESOURCES
    // default properties for any process
    process {
      executor = 'crg'
      queue = 'short'
      scratch = true
    }

    // cpus for process 'foo'
    process.$foo.cpus = 2

    // resources for 'bar'
    process.$bar.queue = 'long'
    process.$bar.cpus = 4
    process.$bar.memory = '4GB'

  47. ENVIRONMENT MODULES
    Specify in the config file the modules required:

    process.$foo.module = 'Bowtie2/2.2.3'

    process.$bar.module = 'TopHat/2.0.12:Boost/1.55.0'

  48. EXAMPLE 5
    Execute a pipeline on the cluster

  49. EXAMPLE 5
    Log in to ANT-LOGIN:

    $ ssh username@ant-login.linux.crg.es

    If you have module configured:

    $ module avail
    $ module purge
    $ module load nextflow/0.12.3-goolf-1.4.10-no-OFED-Java-1.7.0_21

    Otherwise install it by downloading it from the internet:

    $ curl -fsSL get.nextflow.io | bash

  50. EXAMPLE 5
    Create the following nextflow.config file:

    process {
       executor = 'crg'
       queue = 'course'
       scratch = true
    }

    Launch the pipeline execution:

    $ nextflow run rnatoy -with-docker -with-trace

  51. RESOURCES
    project home

    http://nextflow.io

    tutorials

    https://github.com/nextflow-io/examples

    community

    http://groups.google.com/forum/#!forum/nextflow
