Slide 1

Slide 1 text

THE STATE OF NEXTFLOW
Paolo Di Tommaso, CRG
Nextflow workshop 2018, 22 Nov 2018

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

WHERE WE ARRIVED

Slide 4

Slide 4 text

Oct 2017: Execution reports (Phil Ewels)
Nov 2017: AWS Batch (Francesco Strozzi)

Slide 5

Slide 5 text

Jan 2018: Syntax highlighting
Feb/Mar 2018: Kubernetes

Slide 6

Slide 6 text

Apr 2018: Alexander Peltzer, Phil Ewels, Andreas Wilm
Jun 2018: Google Summer of Code (Edgar Garriga)

Slide 7

Slide 7 text

Jul 2018: GA4GH TES
Aug 2018: Performance improvements

Slide 8

Slide 8 text

Sep 2018: Dockstore (https://dockstore.org/)
Oct 2018: New release

Slide 9

Slide 9 text

Oct 2018: Seqera Labs
Nov 2018: New contributors (pull request #926)

Slide 10

Slide 10 text

HEALTH METRICS
• 76 citations
• 90k LoC (+35%)
• 650 stars and 136 forks on GitHub (+100%)

Slide 11

Slide 11 text

HEALTH METRICS
• Downloads: 14K / month (+150%)
• GitHub tickets: +80%

Slide 12

Slide 12 text

NEW CONTRIBUTORS
• Sven Fillinger, Quantitative Biology Center, Germany
• Lorenz Gerber, Genome Institute Singapore
• Ólafur Haukur Flygenring, WuXi NextCode, Iceland
• Jonathan Sheffi, Google Genomics, USA
• Rad Suchecki, CSIRO, Australia
• Luke Goodsell, Achilles Therapeutics, UK

Slide 13

Slide 13 text

WHAT'S NEXT?

Slide 14

Slide 14 text

SYNTAX POLISHING

Channel.from( .. )          →  channel.of( .. )
Channel.fromPath( .. )      →  channel.ofFiles( .. )
Channel.fromFilePairs( .. ) →  channel.ofFilePairs( .. )

process foo {
  input:
  file x from file('/some/path')

  """
  your_command $x
  """
}

→

process foo {
  input:
  path x from '/some/path'

  """
  your_command $x
  """
}

Slide 15

Slide 15 text

CHANNELS DEDUP

Channel.fromPath('/something/*')
  .into{ this_ch; that_ch }

process foo {
  input:
  file x from this_ch

  """
  your_command $x
  """
}

process bar {
  input:
  file x from that_ch

  """
  your_command $x
  """
}

→

Channel.fromPath('/something/*')
  .fork(2)
  .set{ this_ch }

process foo {
  input:
  file x from this_ch

  """
  your_command $x
  """
}

process bar {
  input:
  file x from this_ch

  """
  your_command $x
  """
}

Slide 16

Slide 16 text

EVENTS & METADATA

process foo {
  input:
  set pair_id, file(reads) from something_ch

  onSuccess:
  // do something

  onError:
  // do something else
  metadata_ch << [
    name: task.name,
    failed: true,
    sample: pair_id,
    cmd_ver: task.version ]

  onTerminate:
  // an action when the last task has completed

  script:
  """
  your_command $x
  """
}

Slide 17

Slide 17 text

EVENTS & METADATA (2)

workflow.onComplete {
  // action
}

process.onComplete {
  // action
}

process.onSuccess {
  // action
}

process.onError { task ->
  metadata_ch << [
    name: task.name,
    failed: true,
    sample: task.context.pair_id, // ???
    cmd_ver: task.version ]
}

Slide 18

Slide 18 text

RUNTIME CLEANUP
• Workflow execution may produce a lot of temporary data
• Deleting it breaks the ability to resume
• Work in progress to clean it up at runtime without breaking resume (issue #452)

[Diagram labels: fromPath('/data/*'), p1, q1, t1, t2, t3, ω1]

Slide 19

Slide 19 text

RESOURCES HANDLING

process foo {
  time { 1.h * task.attempt }
  memory { 10.GB * task.attempt }
  errorStrategy 'retry'

  input:
  file 'some-in.txt' from in_ch

  output:
  file 'some-out.bam' into out_ch

  script:
  """
  your_command --mem $task.memory
  """
}
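
Note: the `{ 1.h * task.attempt }` and `{ 10.GB * task.attempt }` closures on this slide scale the time and memory limits with the attempt number, so each retry gets more resources. The Python sketch below only illustrates that idea outside of Nextflow. It is not how Nextflow implements retries: `Resources`, `resources_for`, `run_with_retry` and `hungry_task` are hypothetical names, and the 25 GB threshold is an invented value for the demo.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Resources:
    time_hours: int
    memory_gb: int


def resources_for(attempt: int, base_time_hours: int = 1,
                  base_memory_gb: int = 10) -> Resources:
    """Scale both limits linearly with the 1-based attempt number."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    return Resources(base_time_hours * attempt, base_memory_gb * attempt)


def run_with_retry(task: Callable[[Resources], str],
                   max_attempts: int = 3) -> Tuple[str, int]:
    """Re-run the task with larger limits until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(resources_for(attempt)), attempt
        except MemoryError:
            continue  # retry with the next, larger allocation
    raise RuntimeError(f"task failed after {max_attempts} attempts")


def hungry_task(res: Resources) -> str:
    # Hypothetical task that needs at least 25 GB to finish (assumed value).
    if res.memory_gb < 25:
        raise MemoryError("not enough memory")
    return f"done with {res.memory_gb} GB"
```

With these assumed values, `run_with_retry(hungry_task)` fails at 10 GB and 20 GB and succeeds on the third attempt with 30 GB.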

Slide 20

Slide 20 text

RESOURCES PREDICTION
• Train a prediction model on the real resource usage and input metadata of the first n tasks
• Predict the resources of the following task executions with the previously trained model

process foo {
  time auto
  memory auto

  input:
  file 'some-in.txt' from in_ch

  output:
  file 'some-out.bam' into out_ch

  script:
  """
  your_command --mem $task.memory
  """
}
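
Note: the slide does not say which model is meant. As a minimal sketch of the train-then-predict idea, the Python below assumes the simplest case: a least-squares line relating input size to peak memory, fitted on the first n tasks and then used for later ones. `fit_line`, `predict_memory_gb` and the `safety_margin` parameter are hypothetical and not part of Nextflow.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # All inputs have the same size: fall back to the mean usage.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_memory_gb(model: Tuple[float, float], input_size_gb: float,
                      safety_margin: float = 1.2) -> float:
    """Predicted peak memory for a new task, padded by an assumed safety margin."""
    slope, intercept = model
    return (slope * input_size_gb + intercept) * safety_margin


# Observed (input size in GB, peak memory in GB) for the first n = 3 tasks;
# the numbers are made up for illustration.
observed_sizes = [1.0, 2.0, 3.0]
observed_memory = [3.0, 5.0, 7.0]
model = fit_line(observed_sizes, observed_memory)
request_gb = predict_memory_gb(model, input_size_gb=4.0)
```

A real predictor would likely use more input metadata than file size and would still need a retry fallback for tasks that exceed the prediction.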

Slide 21

Slide 21 text

WORKFLOW COMPOSITION?

nextflow run prj/foo
  input: file from ch_1
  output: file into ch_2

nextflow run prj/bar
  input: file from ch_2
  output: file into ch_3

prj/foo + prj/bar + ...

process foo {
  script:
  workflow('prj/foo')
}

process bar {
  script:
  workflow('prj/bar')
}

Slide 22

Slide 22 text

DATA INTEGRATION
• Sequence Read Archive (NCBI SRA)
• GA4GH HTSGET
• SQL/NoSQL big data sources (e.g. AWS Athena or Google BigTable)
• Domain-specific databases, e.g. iRODS

Slide 23

Slide 23 text

HYBRID DEPLOYMENTS
• S3 storage
• NFS storage
• GS storage
• DOS storage

Slide 24

Slide 24 text

NOTEBOOKS

Slide 25

Slide 25 text

MY TAKE
• Genomics data and health records are exploding
• Data will be fragmented and siloed across many different public clouds and private organisations
• Computing environments are heterogeneous, e.g. clouds, HPC clusters, interactive notebooks
• There is unlikely to be a one-size-fits-all solution
• Increasing need for portable workflows and hybrid computing that enable transparent multi-cloud and multi-platform deployments

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

CREDITS
Brendan Bouffler, Cedric Notredame, Anna Sole, Damjana Kastelic