Paolo Di Tommaso
Comparative bioinformatics
Notredame Lab - CRG
26 Feb 2015
WHAT NEXTFLOW IS
• A computing runtime that executes Nextflow pipeline scripts
• A programming DSL that simplifies writing highly parallel computational pipelines, reusing your existing scripts and tools
NEXTFLOW DSL
• It is NOT a new programming language
• It extends the Groovy scripting language
• It provides a multi-paradigm programming
environment
MULTI-PARADIGM
Imperative, object-oriented programming
+
Declarative concurrency (dataflow programming model)
HOW TO INSTALL
Use the following command:

$ wget -qO- get.nextflow.io | bash

It creates the nextflow launcher in the current directory.
GET STARTED
Log in from your course laptop:

$ cd ~/crg-course
$ vagrant up
$ vagrant ssh

Once in the virtual machine:

$ cd ~/nextflow-tutorial
$ git pull
$ nextflow info
THE BASICS
Variables and assignments:

x = 1
y = 10.5
str = 'hello world!'
p = x; q = y
THE BASICS
Printing values:

x = 1
y = 10.5
str = 'hello world!'

print x
print str
print str + '\n'
println str
THE BASICS
Printing values (function-call syntax):

x = 1
y = 10.5
str = 'hello world!'

print(x)
print(str)
print(str + '\n')
println(str)
MORE ON STRINGS
str = 'bioinformatics'
print str[0]

print "$str is cool!"
print "Current path: $PWD"

str = '''
multi
line
string
'''

str = """
User: $USER
Home: $HOME
"""
COMMON STRUCTURES &
PROGRAMMING IDIOMS
• Data structures: Lists & Maps
• Control statements: if, for, while, etc.
• Functions and classes
• File I/O operations
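The Groovy side of these structures can be sketched in a few lines (plain Groovy, runnable outside Nextflow; all names below are illustrative):

```groovy
// Data structures: a List and a Map
def nums = [1, 2, 3, 4]
def ages = [bob: 32, alice: 29]

// Control statements: if inside a for loop
def evens = []
for (n in nums) {
    if (n % 2 == 0)
        evens << n
}

// A function defined as a closure
def square = { it * it }

println evens          // prints [2, 4]
println square(5)      // prints 25
```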
6-PAGE PRIMER
http://refcardz.dzone.com/refcardz/groovy
MAIN ABSTRACTIONS
• Processes: run any piece of script
• Channels: unidirectional async queues that allow processes to communicate
• Operators: transform channel content
CHANNELS
• A channel connects two processes/operators
• Write operations are NOT blocking
• Read operations are blocking
• Once an item is read, it is removed from the queue
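These semantics resemble a plain blocking queue. A hedged sketch using the JDK's LinkedBlockingQueue (illustrative only, not Nextflow's actual implementation):

```groovy
import java.util.concurrent.LinkedBlockingQueue

// A channel behaves like a queue: writes don't block, reads do,
// and a read removes the item.
def ch = new LinkedBlockingQueue()
ch.offer(10)            // write: returns immediately
ch.offer(20)
assert ch.take() == 10  // read: blocks until an item is available
assert ch.take() == 20  // each item is consumed exactly once
```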
CHANNELS
some_items = Channel.from(10, 20, 30, ..)
my_channel = Channel.create()
single_file = Channel.fromPath('some/file/name')
more_files = Channel.fromPath('some/data/path/*')
(the glob pattern emits one item per matching file: file x, file y, file z)
OPERATORS
• Functions applied to channels
• Transform channel content
• Can also be used to filter, fork and combine channels
• Operators can be chained to implement custom behaviours
OPERATORS
nums = Channel.from(1,2,3,4)
square = nums.map { it -> it * it }
(diagram: nums emits 1, 2, 3, 4; map squares each item; square emits 1, 4, 9, 16)
OPERATORS CHAINING
Channel.from(1,2,3,4)
       .map { it -> [it, it*it] }
       .subscribe { num, sqr -> println "Square of: $num is $sqr" }

// it prints
Square of: 1 is 1
Square of: 2 is 4
Square of: 3 is 9
Square of: 4 is 16
SPLITTING OPERATORS
You can split text objects or files using the splitting methods:
• splitText - line by line
• splitCsv - comma-separated values format
• splitFasta - by FASTA sequences
• splitFastq - by FASTQ sequences
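To see the idea behind splitFasta, here is a plain-Groovy sketch that breaks a multi-sequence FASTA text into records at each '>' header. This is illustrative only — in a pipeline you would use the splitFasta operator itself, and the sample sequences are made up:

```groovy
// A toy two-sequence FASTA text (invented data)
def fasta = '''>seq1
ACGT
>seq2
TTGA
'''

// Split at each line starting with '>', drop the leading empty
// chunk, and restore the '>' on each record
def records = fasta.split(/(?m)^>/)
                   .findAll { it }
                   .collect { '>' + it.trim() }

println records.size()     // prints 2
```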
EXAMPLE 1
• Split a FASTA file into its sequences
• Parse a FASTA file and count the number of sequences matching a specified ID
EXAMPLE 1
$ nextflow run channel_split.nf
$ nextflow run channel_filter.nf
PROCESS
process sayHello {

    input:
    val str

    output:
    stdout into result

    script:
    """
    echo $str world!
    """
}

str = Channel.from('hello', 'hola', 'bonjour', 'ciao')
result.subscribe { print it }
PROCESS INPUTS
process procName {

    input:
    val x from ch_1
    file y from ch_2
    file 'data.fa' from ch_3
    stdin from ch_4
    set (x, 'file.txt') from ch_5

    """
    """
}
PROCESS INPUTS
proteins = Channel.fromPath('/some/path/data.fa')

process blastThemAll {

    input:
    file 'query.fa' from proteins

    "blastp -query query.fa -db nr"
}
USE YOUR FAVOURITE PROGRAMMING LANG
process pyStuff {

    script:
    """
    #!/usr/bin/env python

    x = 'Hello'
    y = 'world!'
    print "%s - %s" % (x,y)
    """
}
EXAMPLE 2
• Execute a process running a BLAST job given an
input file
• Execute a BLAST job emitting the produced
output
EXAMPLE 2
$ nextflow run process_input.nf
$ nextflow run process_output.nf
PIPELINES PARAMETERS
Simply declare some variables prefixed by params:

params.p1 = 'alpha'
params.p2 = 'beta'

When launching your script you can override the default values:

$ nextflow run --p1 'delta' --p2 'gamma'
COLLECT FILE
The collectFile operator allows you to gather items produced by upstream processes:

my_results.collectFile(name:'result.txt')

Collects all items into a single file.
COLLECT FILE
The collectFile operator allows you to gather items produced by upstream processes:

my_items.collectFile(storeDir:'path/name') {

    def key = get_a_key_from_the_item(it)
    def content = get_the_item_value(it)
    [ key, content ]

}

Collects the items and groups them into files whose names are defined by a grouping criterion.
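The grouping idea can be sketched in plain Groovy (illustrative only — collectFile does the actual file writing for you; the 'key:value' item format and the parsing closure below are invented for the example):

```groovy
// Each item yields a [key, content] pair; items sharing a key
// end up concatenated under the same key (i.e. the same file).
def items = ['chr1:aaa', 'chr2:bbb', 'chr1:ccc']
def toPair = { it.split(':') as List }     // item -> [key, content]

def grouped = items.collect { toPair(it) }
                   .groupBy { it[0] }       // key -> list of pairs
                   .collectEntries { k, pairs ->
                       [k, pairs.collect { it[1] }.join('\n')]
                   }
```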
EXAMPLE 3
• Split a FASTA file, execute a BLAST query for each chunk and gather the results
• Split multiple FASTA files and execute a BLAST query for each chunk
EXAMPLE 3
$ nextflow run split_fasta.nf
$ nextflow run split_fasta.nf --chunkSize 2
$ nextflow run split_fasta.nf --chunkSize 2 --query data/p\*.fa
$ nextflow run split_and_collect.nf
UNDERSTANDING MULTIPLE INPUTS
(diagram: a process reading from multiple input channels; each task consumes one item from each channel, and execution ends when a channel is exhausted)
UNDERSTANDING MULTIPLE INPUTS
(diagram: tasks 1, 2, 3, ... n; when one input channel provides a single value, that value is reused by every task)
CONFIG FILE
• Pipeline configuration can be externalised to a file
named nextflow.config
• parameters
• environment variables
• required resources (mem, cpus, queue, etc)
• modules/containers
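A minimal nextflow.config sketch combining these settings; all concrete values below are hypothetical examples:

```groovy
// nextflow.config — illustrative values only
params.query = 'data/sample.fa'     // pipeline parameter default

env.BLAST_DB = '/some/db/path'      // environment variable seen by tasks

process {
    executor = 'crg'                // required resources
    queue    = 'short'
    cpus     = 2
    memory   = '4GB'
    module   = 'Blast/2.2.29'       // environment module to load
}
```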
HOW TO USE DOCKER
Specify in the config file the Docker image to use:

process {
    container =
}

Add the -with-docker flag when launching it:

$ nextflow run -with-docker
EXAMPLE 4
Launch a pipeline using a Docker container
EXAMPLE 4
$ nextflow run blast_extract.nf -with-docker
HOW TO USE THE CLUSTER
Define the CRG executor in nextflow.config:

// default properties for any process
process {
    executor = 'crg'
    queue = 'short'
    cpus = 2
    memory = '4GB'
    scratch = true
}
PROCESS RESOURCES
// default properties for any process
process {
    executor = 'crg'
    queue = 'short'
    scratch = true
}

// cpus for process 'foo'
process.$foo.cpus = 2

// resources for 'bar'
process.$bar.queue = 'long'
process.$bar.cpus = 4
process.$bar.memory = '4GB'
ENVIRONMENT MODULE
Specify in the config file the modules required:

process.$foo.module = 'Bowtie2/2.2.3'
process.$bar.module = 'TopHat/2.0.12:Boost/1.55.0'
EXAMPLE 5
Execute a pipeline on the cluster
EXAMPLE 5
Log in to ANT-LOGIN:

$ ssh username@ant-login.linux.crg.es

If you have module configured:

$ module avail
$ module purge
$ module load nextflow/0.12.3-goolf-1.4.10-no-OFED-Java-1.7.0_21

Otherwise install it by downloading from the internet:

$ curl -fsSL get.nextflow.io | bash
EXAMPLE 5
Create the following nextflow.config file:

process {
    executor = 'crg'
    queue = 'course'
    scratch = true
}

Launch the pipeline execution:

$ nextflow run rnatoy -with-docker -with-trace
RESOURCES
project home: http://nextflow.io
tutorials: https://github.com/nextflow-io/examples
community: http://groups.google.com/forum/#!forum/nextflow