The state of Nextflow

A quick update on the project status, recent changes and open challenges presented at the Nextflow Workshop 2018.

Paolo Di Tommaso

November 22, 2018

Transcript

  1. THE STATE OF NEXTFLOW • Paolo Di Tommaso, CRG • Nextflow workshop 2018, 22 Nov 2018
  2. None
  3. WHERE WE ARRIVED

  4. Oct 2017: Execution reports (Phil Ewels) • Nov 2017: AWS Batch (Francesco Strozzi)
  5. Jan 2018: Syntax highlighting • Feb/Mar 2018: Kubernetes

  6. Apr 2018: Alexander Peltzer, Phil Ewels, Andreas Wilm • Jun 2018: Google Summer of Code (Edgar Garriga)
  7. Jul 2018: GA4GH TES • Aug 2018: Performance improvements

  8. Sep 2018: Dockstore (https://dockstore.org/) • Oct 2018: New release

  9. Oct 2018: Seqera Labs • Nov 2018: New contributors, pull request #926
  10. HEALTH METRICS
    • 76 citations
    • 90k LoC (+35%)
    • 650 stars and 136 forks on GitHub (+100%)
  11. HEALTH METRICS
    • Downloads: 14K / month (+150%)
    • GitHub tickets: +80%

  12. NEW CONTRIBUTORS
    • Sven Fillinger, Quantitative Biology Center, Germany
    • Lorenz Gerber, Genome Institute Singapore
    • Ólafur Haukur Flygenring, WuXi NextCode, Iceland
    • Jonathan Sheffi, Google Genomics, USA
    • Rad Suchecki, CSIRO, Australia
    • Luke Goodsell, Achilles Therapeutics, UK
  13. WHAT'S NEXT?

  14. SYNTAX POLISHING
        Channel.from( .. )           →  channel.of( .. )
        Channel.fromPath( .. )       →  channel.ofFiles( .. )
        Channel.fromFilePairs( .. )  →  channel.ofFilePairs( .. )

        process foo {
          input: file x from file('/some/path')
          """
          your_command $x
          """
        }
        →
        process foo {
          input: path x from '/some/path'
          """
          your_command $x
          """
        }
  15. CHANNELS DEDUP
        Channel.fromPath('/something/*')
          .into{ this_ch; that_ch }
        process foo {
          input: file x from this_ch
          """
          your_command $x
          """
        }
        process bar {
          input: file x from that_ch
          """
          your_command $x
          """
        }
        →
        Channel.fromPath('/something/*')
          .fork(2)
          .set{ this_ch }
        process foo {
          input: file x from this_ch
          """
          your_command $x
          """
        }
        process bar {
          input: file x from this_ch
          """
          your_command $x
          """
        }
  16. EVENTS & METADATA
        process foo {
          input: set pair_id, file(reads) from something_ch
          onSuccess:
            // do something
          onError:
            // do something else
            metadata_ch << [
              name: task.name,
              failed: true,
              sample: pair_id,
              cmd_ver: task.version ]
          onTerminate:
            // an action when the last task has completed
          script:
          """
          your_command $x
          """
        }
  17. EVENTS & METADATA (2)
        workflow.onComplete { // action }
        process.onComplete { // action }
        process.onSuccess { // action }
        process.onError { task ->
          metadata_ch << [
            name: task.name,
            failed: true,
            sample: task.context.pair_id, // ???
            cmd_ver: task.version ]
        }
  18. RUNTIME CLEANUP
    • Workflow execution may produce a lot of temporary data
    • Deleting it breaks the ability to resume
    • Work in progress to clean it up at runtime without breaking resume (issue #452)
    [Diagram labels: fromPath('/data/*'), p1, q1, t1, t2, t3, ω1]
  19. RESOURCES HANDLING
        process foo {
          time { 1.h * task.attempt }
          memory { 10.GB * task.attempt }
          errorStrategy 'retry'
          input: file 'some-in.txt' from in_ch
          output: file 'some-out.bam' into out_ch
          script:
          """
          your_command --mem $task.memory
          """
        }
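The retry pattern on slide 19 can be illustrated outside Nextflow. The following is a minimal Python sketch, not Nextflow's implementation: `Resources`, `resources_for`, `run_with_retry` and `needs_25_gb` are hypothetical names invented for this example, and the limit of three attempts is an assumption. It only shows the idea of multiplying a base request by the attempt number each time a task fails.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    hours: int
    memory_gb: int


def resources_for(attempt, base_hours=1, base_memory_gb=10):
    """Scale the base request linearly with the 1-based attempt number,
    mirroring `1.h * task.attempt` and `10.GB * task.attempt`."""
    return Resources(base_hours * attempt, base_memory_gb * attempt)


def run_with_retry(task, max_attempts=3):
    """Run `task` with growing resources; re-raise after the last attempt.

    max_attempts=3 is an assumption made for this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        res = resources_for(attempt)
        try:
            return task(res), attempt
        except MemoryError:
            if attempt == max_attempts:
                raise


def needs_25_gb(res):
    """Hypothetical task that fails unless it gets at least 25 GB."""
    if res.memory_gb < 25:
        raise MemoryError(f"{res.memory_gb} GB is not enough")
    return f"ok with {res.memory_gb} GB"


result, attempts = run_with_retry(needs_25_gb)
print(result, "after", attempts, "attempts")
```

With these made-up numbers the task fails at 10 GB and 20 GB and succeeds on the third attempt with 30 GB.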
  20. RESOURCES PREDICTION
    • Train a prediction model with real resource usage and input metadata for the first n tasks
    • Predict the resources of the following task executions with the previously trained model
        process foo {
          time auto
          memory auto
          input: file 'some-in.txt' from in_ch
          output: file 'some-out.bam' into out_ch
          script:
          """
          your_command --mem $task.memory
          """
        }
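The slide does not say which kind of model is meant. As one possible reading, the sketch below fits a straight line (ordinary least squares) to the peak memory observed for the first few tasks as a function of input size, then uses it to size later tasks. Everything here is an assumption made for illustration: the function names `fit_line` and `predict_memory`, the choice of a linear model, the 20% safety margin, and the sample numbers, which are invented and not measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_memory(model, input_size_gb, safety_margin=1.2):
    """Predict a memory request (GB) for a new task, padded by a margin."""
    slope, intercept = model
    return (slope * input_size_gb + intercept) * safety_margin


# Invented observations for the first n = 4 tasks:
# input size in GB and peak memory in GB.
observed_input_gb = [1.0, 2.0, 3.0, 4.0]
observed_peak_mem_gb = [3.0, 5.0, 7.0, 9.0]

model = fit_line(observed_input_gb, observed_peak_mem_gb)
request_gb = predict_memory(model, 10.0)
print(f"request about {request_gb:.1f} GB for a 10 GB input")
```

A real predictor would likely need more features than input size and a fallback, such as the retry strategy from slide 19, for tasks where the prediction turns out too low.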
  21. WORKFLOW COMPOSITION?
        nextflow run prj/foo
        nextflow run prj/bar

        input: file <?> from ch_1
        output: file <?> into ch_2
        input: file <?> from ch_2
        output: file <?> into ch_3

        prj/foo + prj/bar + ...

        process foo { script: workflow('prj/foo') }
        process bar { script: workflow('prj/bar') }
  22. DATA INTEGRATION
    • Sequence Read Archive (NCBI SRA)
    • GA4GH HTSGET
    • SQL/NoSQL big data sources (e.g. AWS Athena or Google BigTable)
    • Domain-specific databases, e.g. iRODS
  23. HYBRID DEPLOYMENTS • S3 storage • NFS storage • GS storage • DOS storage

  24. NOTEBOOKS

  25. MY TAKE
    • Genomics data and health records are exploding
    • Data will be fragmented and siloed across many different public clouds and private organisations
    • Heterogeneous computing environments, e.g. clouds, HPC clusters, interactive notebooks
    • A one-size-fits-all solution is unlikely
    • Increasing need for portable workflows and hybrid computing that enable transparent multi-cloud and multi-platform deployments
  26. None
  27. CREDITS • Brendan Bouffler • Cedric Notredame • Anna Sole • Damjana Kastelic