
The state of Nextflow


A quick update on the project status, recent changes and open challenges presented at the Nextflow Workshop 2018.

Paolo Di Tommaso

November 22, 2018


Transcript

  1. THE STATE OF NEXTFLOW
    Paolo Di Tommaso, CRG
    Nextflow workshop 2018, 22 Nov 2018



  3. WHERE WE ARRIVED


  4. Oct 2017: Execution reports (Phil Ewels)
    Nov 2017: AWS Batch (Francesco Strozzi)

  5. Jan 2018: Syntax highlighting
    Feb/Mar 2018: Kubernetes

  6. Apr 2018: Alexander Peltzer, Phil Ewels, Andreas Wilm
    Jun 2018: Google Summer of Code (Edgar Garriga)

  7. Jul 2018: GA4GH TES
    Aug 2018: Performance improvements

  8. Sep 2018: Dockstore (https://dockstore.org/)
    Oct 2018: New release

  9. Oct 2018: Seqera Labs
    Nov 2018: New contributors (pull request #926)

  10. HEALTH METRICS
    • 76 citations
    • 90k LoC (+35%)
    • 650 stars and 136 forks on GitHub (+100%)

  11. HEALTH METRICS
    Downloads 14K / month (+150%)
    GitHub tickets +80%


  12. NEW CONTRIBUTORS
    Sven Fillinger, Quantitative Biology Center, Germany
    Lorenz Gerber, Genome Institute Singapore
    Ólafur Haukur Flygenring, WuXi NextCode, Iceland
    Jonathan Sheffi, Google Genomics, USA
    Rad Suchecki, CSIRO, Australia
    Luke Goodsell, Achilles Therapeutics, UK

  13. WHAT'S NEXT?


  14. SYNTAX POLISHING
    // current syntax
    Channel.from( .. )
    Channel.fromPath( .. )
    Channel.fromFilePairs( .. )

    // proposed syntax
    channel.of( .. )
    channel.ofFiles( .. )
    channel.ofFilePairs( .. )

    // current syntax
    process foo {
      input:
      file x from file('/some/path')

      """
      your_command $x
      """
    }

    // proposed syntax
    process foo {
      input:
      path x from '/some/path'

      """
      your_command $x
      """
    }

  15. CHANNELS DEDUP
    // current syntax: one channel copy per consumer
    Channel.fromPath('/something/*')
      .into{ this_ch; that_ch }

    process foo {
      input:
      file x from this_ch

      """
      your_command $x
      """
    }

    process bar {
      input:
      file x from that_ch

      """
      your_command $x
      """
    }

    // proposed syntax: the same channel read by both processes
    Channel.fromPath('/something/*')
      .fork(2)
      .set{ this_ch }

    process foo {
      input:
      file x from this_ch

      """
      your_command $x
      """
    }

    process bar {
      input:
      file x from this_ch

      """
      your_command $x
      """
    }

  16. EVENTS & METADATA
    process foo {
      input:
      set pair_id, file(reads) from something_ch

      onSuccess:
      // do something

      onError:
      // do something else
      metadata_ch << [ name: task.name,
                       failed: true,
                       sample: pair_id,
                       cmd_ver: task.version ]

      onTerminate:
      // an action when the last task has completed

      script:
      """
      your_command $reads
      """
    }

  17. EVENTS & METADATA (2)
    workflow.onComplete {
      // action
    }

    process.onComplete {
      // action
    }

    process.onSuccess {
      // action
    }

    process.onError { task ->
      metadata_ch << [ name: task.name,
                       failed: true,
                       sample: task.context.pair_id, // ???
                       cmd_ver: task.version ]
    }

  18. RUNTIME CLEANUP
    • Workflow execution may produce a lot of temporary data
    • Deleting it affects the ability to resume
    • Work is in progress to clean it up at runtime without breaking resume (issue #452)
    [Diagram: dataflow starting at fromPath('/data/*') with nodes p1, q1, t1, t2, t3 and ω1]
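One way to picture the runtime cleanup idea above is reference counting: the temporary data of a task becomes deletable once every downstream consumer of that data has finished. The Python sketch below only illustrates that bookkeeping. The class `CleanupTracker` and its methods are hypothetical names, not Nextflow code, and the sketch does not model what resume would additionally need (for example, keeping the task hash and a record of its outputs).

```python
class CleanupTracker:
    """Track which tasks still have unfinished downstream consumers."""

    def __init__(self):
        # task id -> set of consumer ids that have not finished yet
        self.pending = {}

    def register(self, task_id, consumers):
        """Record the downstream consumers of a finished task's outputs."""
        self.pending[task_id] = set(consumers)

    def consumer_done(self, consumer_id):
        """Mark a consumer as finished.

        Returns the ids of tasks whose temporary data is no longer
        needed by any registered consumer.
        """
        freed = []
        for task_id, waiting in self.pending.items():
            if consumer_id in waiting:
                waiting.discard(consumer_id)
                if not waiting:
                    freed.append(task_id)
        # Remove freed entries only after iterating over the dict
        for task_id in freed:
            del self.pending[task_id]
        return freed


tracker = CleanupTracker()
tracker.register("t1", ["q1_a", "q1_b"])
first = tracker.consumer_done("q1_a")   # t1 is still needed by q1_b
second = tracker.consumer_done("q1_b")  # t1 can now be cleaned up
```

In this sketch a task registered with no consumers is never reported as deletable, which keeps final outputs safe by default.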

  19. RESOURCES HANDLING
    process foo {
      time { 1.h * task.attempt }
      memory { 10.GB * task.attempt }
      errorStrategy 'retry'

      input:
      file 'some-in.txt' from in_ch

      output:
      file 'some-out.bam' into out_ch

      script:
      """
      your_command --mem $task.memory
      """
    }
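The retry pattern on this slide (request more time and memory on each attempt) can be sketched outside Nextflow as follows. This is a minimal Python illustration, not how Nextflow implements it: `Resources`, `resources_for`, `run_with_retry` and the failing task `needs_25_gb` are hypothetical names, and the 1 hour and 10 GB base values are simply copied from the slide.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    hours: int
    memory_gb: int


def resources_for(attempt: int) -> Resources:
    """Scale the request linearly with the 1-based attempt number."""
    return Resources(hours=1 * attempt, memory_gb=10 * attempt)


def run_with_retry(task, max_attempts: int = 3):
    """Call task(resources) until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(resources_for(attempt))
        except MemoryError:
            # Re-raise on the last attempt, otherwise retry with more resources
            if attempt == max_attempts:
                raise


def needs_25_gb(res: Resources) -> str:
    # Hypothetical task that fails unless it is given at least 25 GB
    if res.memory_gb < 25:
        raise MemoryError(f"{res.memory_gb} GB is not enough")
    return f"ok with {res.memory_gb} GB"


result = run_with_retry(needs_25_gb)
```

Here the first two attempts (10 GB and 20 GB) fail and the third attempt, with 30 GB, succeeds.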

  20. RESOURCES PREDICTION
    • Train a prediction model on the real resource usage and input metadata of the first n tasks
    • Predict the resources of the following task executions with the previously trained model

    process foo {
      time auto
      memory auto

      input:
      file 'some-in.txt' from in_ch

      output:
      file 'some-out.bam' into out_ch

      script:
      """
      your_command --mem $task.memory
      """
    }
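The slide does not say which model would be used. As a purely illustrative sketch, the Python below assumes the simplest possible choice: a least-squares line relating input size to observed peak memory, fitted on the first n tasks and then used to size later tasks. The function names, the sample observations and the 20% headroom factor are all assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct input sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_memory(model, input_gb, headroom=1.2):
    """Predicted memory request in GB, padded by a safety factor."""
    slope, intercept = model
    return headroom * (slope * input_gb + intercept)


# Hypothetical observations from the first n = 3 tasks:
# (input size in GB, peak memory in GB)
observed = [(1.0, 3.0), (2.0, 5.0), (4.0, 9.0)]
model = fit_line([x for x, _ in observed], [y for _, y in observed])
request_gb = predict_memory(model, input_gb=10.0)
```

A real predictor would also need a fallback for tasks whose usage exceeds the prediction, for example the retry strategy from the previous slide.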

  21. WORKFLOW COMPOSITION?
    nextflow run prj/foo
      input: file > from ch_1
      output: file > into ch_2
    nextflow run prj/bar
      input: file > from ch_2
      output: file > into ch_3
    prj/foo + prj/bar + ...
    process foo {
      script:
      workflow('prj/foo')
    }

    process bar {
      script:
      workflow('prj/bar')
    }

  22. DATA INTEGRATION
    • Sequence Read Archive (NCBI SRA)
    • GA4GH HTSGET
    • SQL/NoSQL big data sources (e.g. AWS Athena, Google BigTable)
    • Domain-specific databases, e.g. iRODS

  23. HYBRID DEPLOYMENTS
    S3 storage
    NFS storage
    GS storage
    DOS storage

  24. NOTEBOOKS


  25. MY TAKE
    • Genomics data and health records are exploding
    • Data will be fragmented and siloed across many different public clouds and private organizations
    • Computing environments are heterogeneous, e.g. clouds, HPC clusters, interactive notebooks
    • A one-size-fits-all solution is unlikely
    • There is a growing need for portable workflows and hybrid computing that enable transparent multi-cloud and multi-platform deployments


  27. CREDITS
    Brendan Bouffler
    Cedric Notredame
    Anna Sole
    Damjana Kastelic
