
HiCOMB 2018: Practical lessons from scaling read aligners to hundreds of threads

Presented Monday, May 21, 2018 at HiCOMB (in conjunction with IPDPS) in Vancouver, Canada

Ben Langmead

May 21, 2018

Transcript

  1. Practical lessons from scaling read
    aligners to hundreds of threads
    Ben Langmead
    Assistant Professor, JHU Computer Science
    [email protected], langmead-lab.org, @BenLangmead
    HiCOMB, May 21, 2018


  2. Upstream genomics software
    Much genomics software loops over reads & is
    embarrassingly parallel
    Read aligners, metagenomics binners, error
    correctors, transcriptome quantifiers, ...
    This assumption is broken, e.g., by spliced aligners
    (mildly) and assemblers (severely)
    while(next_read()) {
    ...
    }
    [Diagram: downstream tools that are not read-at-a-time: variant calling,
    transcript assembly, peak calling, differential expression]


  3. How many threads?
    It's typical to compare tools based on the speed of a
    single thread or at low thread counts
    The Bowtie 2 comparison is single-threaded:
    Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2.
    Nat Methods. 2012 Mar 4;9(4):357-9.


  4. How many threads?
    Kraken speed comparison: single thread
    Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence
    classification using exact alignments. Genome Biol. 2014 Mar 3;15(3):R46.


  5. How many threads?
    Jain, C., Koren, S., Dilthey, A., Phillippy, A. M., & Aluru, S. (2018). A Fast Adaptive
    Algorithm for Computing Whole-Genome Homology Maps. bioRxiv, 259986.
    MashMap2: 8 threads


  6. How many threads?
    The CLARK-S metagenomics binning tool
    demonstrates scaling results up to 8 threads
    Ounit R, Lonardi S. Higher classification sensitivity of short metagenomic
    reads with CLARK-S. Bioinformatics. 2016 Dec 15;32(24):3823-3825.


  7. How many threads?
    Investigations at higher thread counts tend to be
    done by computer scientists; up to 64 threads here:
    Lenis, J., & Senar, M. A. (2017). A performance comparison of data and
    memory allocation strategies for sequence aligners on NUMA
    architectures. Cluster Computing, 20(3), 1909-1924.


  8. Intel Xeon Phi “Knights Landing”
    http://www.hardwareluxx.com/index.php/news/hardware/cpu/37217-intel-shows-xeon-phi-knights-landing-wafer-die.html
    [Die image: grid of dozens of numbered cores, with the footprints of
    4, 8, and 64 threads highlighted for scale]
    Henceforth "KNL"


  9. MP & MT
    Multiprocessing (MP): keeping the cores busy by
    running many processes in parallel
    Multithreading (MT): keeping the cores busy by
    running many threads in a single process
    MP:
    bowtie -p 1 index in1of8 out1 &
    bowtie -p 1 index in2of8 out2 &
    bowtie -p 1 index in3of8 out3 &
    bowtie -p 1 index in4of8 out4 &
    bowtie -p 1 index in5of8 out5 &
    bowtie -p 1 index in6of8 out6 &
    bowtie -p 1 index in7of8 out7 &
    bowtie -p 1 index in8of8 out8 &
    MT:
    bowtie -p 8 index in out


  10. MP & MT
    MT is often combined with MP to keep per-process thread
    count from creating a thread-scaling bottleneck, giving a
    spectrum of MP + MT mixes:
    bowtie -p 4 index in1of2 out &
    bowtie -p 4 index in2of2 out &
    bowtie -p 2 index in1of4 out &
    bowtie -p 2 index in2of4 out &
    bowtie -p 2 index in3of4 out &
    bowtie -p 2 index in4of4 out &


  11. When pure MT scales well, we can choose points anywhere on
    the spectrum, achieving high throughput in more scenarios
    • When reads arrive in fast, hard-to-buffer stream
    • When MP static load balancing is not balanced
    • When MP dynamic load balancing doesn't scale well or
    is too complex to implement
    • When low single-job latency is desired
    [Diagram: MP + MT spectrum, from pure MP to pure MT]


  12. Thread model
    All threads can do any
    kind of work: input,
    output, alignment
    Input & output are
    synchronized via locks
    allowing just one
    thread at a time
    While a thread holds a
    lock, it is in the critical
    section
    [Animation: each thread is either running, waiting, or in the
    critical section]
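    A minimal C++ sketch of this model appears below; parse_one_read,
    align, and the output writing are stand-ins for illustration, not
    Bowtie's actual internals.

        #include <iostream>
        #include <mutex>
        #include <sstream>
        #include <string>
        #include <thread>
        #include <vector>

        std::istringstream input("ACGT\nGGCC\nTTAA\nCGCG\n"); // stand-in for FASTQ
        std::mutex in_lock, out_lock;

        bool parse_one_read(std::string& r) {  // body of the input critical section
            return static_cast<bool>(std::getline(input, r));
        }
        std::string align(const std::string& r) {  // stand-in for real alignment
            return r + "\taligned";
        }
        void worker() {
            for (;;) {
                std::string read;
                {
                    std::lock_guard<std::mutex> g(in_lock);  // one reader at a time
                    if (!parse_one_read(read)) return;
                }
                std::string sam = align(read);  // the parallel work; no lock held
                std::lock_guard<std::mutex> g(out_lock);     // one writer at a time
                std::cout << sam << '\n';
            }
        }
        int main() {
            std::vector<std::thread> threads;
            for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
            for (auto& t : threads) t.join();
        }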


  13. Questions explored
    Can certain lock types reduce the
    overhead of entering and
    exiting critical sections?
    Can different input parsing
    and output writing strategies
    reduce the frequency and duration
    of critical sections?
    How do our file formats affect
    the complexity of critical sections?
    Overall: can pure MT
    compete with MP+MT?


  14. Tools & versions
    Bowtie v1.1.2, Bowtie 2 v2.2.9, HISAT v0.1.6-beta
    • All used here for (unspliced) DNA alignment only;
    HISAT's spliced alignment disabled
    • All code changes are in publicly available
    branches (see preprint Supplementary Note 1)
    • Intel Threading Building Blocks (TBB) 2017 Update 5
    Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling read aligners
    to hundreds of threads on general-purpose processors. bioRxiv, 205328.


  15. XSEDE: Stampede 2
    • 4,200 Intel Knights Landing nodes, each with 68 cores,
    96GB of DDR RAM, and 16GB of high speed MCDRAM
    • 1,736 Intel Xeon Skylake nodes, each with 48 cores and
    192GB of RAM
    https://www.tacc.utexas.edu/systems/stampede2
    At Texas Advanced Computing Center (TACC)


  16. Reads
    Reads are a mix from 3 projects that did HiSeq 2000
    100 x 100 paired-end sequencing:
    • Platinum Genomes (ERR194147)
    • 1000 Genomes Project (SRR069520)
    • Rustagi et al., low-coverage whole-genome DNA
    sequencing of a South Asian individual (SRR3947551)
    Aligned to hg38
    Number of input reads per experiment tuned so 1
    thread takes ~1 minute


  17. Experimental setup
    FASTQ is read from a local disk
    SAM output is written to the same local disk
    • Broadwell: 7200 RPM SATA HD
    • Skylake & KNL: local solid-state drive (/tmp)


  18. Locking comparison
    Compare locking mechanisms
    • TinyThread++ & TBB spin are simple atomic-
    operation spin loops
    • TBB standard: tries lock; sleeps if unavailable (like
    standard pthreads lock / FUTEX mechanism)
    • TBB queueing: spins, but without an atomic op;
    each thread spins on a separate cache line
    (an MCS-style lock)
    • MP Baseline: multiple processes, 16 threads each
    (so it's really MP+MT)
    [Plots: thread time vs. # threads for each locking mechanism.
    Legend: TinyThread++ spin, TBB spin, TBB standard, TBB queueing, MP baseline]
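    For reference, the TBB locks are drop-in types used roughly as below
    (the headers and scoped_lock idiom are standard TBB 2017-era API; the
    surrounding function is illustrative):

        #include <tbb/mutex.h>          // "TBB standard": sleeps if unavailable
        #include <tbb/queuing_mutex.h>  // "TBB queueing": MCS-style, per-thread cache line
        #include <tbb/spin_mutex.h>     // "TBB spin": atomic-operation spin loop

        tbb::queuing_mutex in_lock;     // swap the mutex type to change mechanisms

        void input_critical_section() {
            // scoped_lock acquires in its constructor, releases in its destructor
            tbb::queuing_mutex::scoped_lock guard(in_lock);
            // ... read & parse input while holding the lock ...
        }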


  19. Locking comparison
    [Plots: thread time vs. # threads on KNL for Bowtie, Bowtie 2, and HISAT.
    Legend: TinyThread++ spin, TBB spin, TBB standard, TBB queueing, MP baseline]
    • Results are for weak scaling: as thread count increases, the
    amount of per-thread work is held constant
    • Horizontal axis: # threads
    • Vertical axis: wall-clock running time of the longest-running
    thread, excluding index load time


  20. Locking comparison
    [Plots: thread time vs. # threads on KNL for Bowtie, Bowtie 2, and HISAT
    under each locking mechanism]
    • Bowtie 2, the slowest of the tools, shows less "spread"
    • Of the MT methods, the TBB queueing lock performs best
    • MP baseline is best by far -- more work to do if 100%
    MT is to compete with the MP/MT mix
    Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling read aligners
    to hundreds of threads on general-purpose processors. bioRxiv, 205328.


  21. Locking comparison
    Spinning using an atomic operation is worse than
    spinning with a normal memory read
    • An atomic op is treated like a write by the
    cache-coherence infrastructure
    • Many threads spinning at once leads to a flood of
    cache-coherence messages, clogging the bus
    Details in preprint
    Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling read aligners
    to hundreds of threads on general-purpose processors. bioRxiv, 205328.
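    To illustrate the difference, a generic test-and-test-and-set spinlock
    is sketched below (the general technique, not TBB's implementation):
    waiters spin on a plain load, which stays in the local cache, and issue
    the bus-visible atomic exchange only when the lock looks free.

        #include <atomic>

        class TtasSpinLock {
            std::atomic<bool> locked{false};
        public:
            void lock() {
                for (;;) {
                    // Test: plain load; no coherence write traffic while waiting
                    while (locked.load(std::memory_order_relaxed)) { /* spin */ }
                    // Test-and-set: one atomic op, issued only when it may succeed
                    if (!locked.exchange(true, std::memory_order_acquire))
                        return;
                }
            }
            void unlock() { locked.store(false, std::memory_order_release); }
        };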


  22. Parsing comparison
    To address the remaining
    difference between MT
    and MP+MT, we try to
    reduce the frequency
    and duration of the
    input-related critical
    sections


  23. Parsing strategies
    (Original) O-parsing:
        obtain lock
        read = read_and_parse()
        release lock

    (Deferred) D-parsing:
        buf = new empty buffer
        obtain lock
        nl = 0
        while nl < 4:
            c = read_char()
            if c is newline:
                nl = nl + 1
            append c to buf
        release lock
        read = parse_fastq(buf)

    (Batch deferred) B-parsing:
        bufs = array of N empty buffers
        reads = array of N empty reads
        obtain lock
        for i = 1 to N:
            nl = 0
            while nl < 4:
                c = read_char()
                if c is newline:
                    nl = nl + 1
                append c to bufs[i]
        release lock
        for i = 1 to N:
            reads[i] = parse_fastq(bufs[i])

    (Block deferred) L-parsing:
        reads = array of N empty reads
        obtain lock
        buf = read B bytes
        release lock
        for i = 1 to N:
            reads[i] = parse_fastq(buf)
            advance buf to next read
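    A compact C++ rendering of the B-parsing critical section is below,
    assuming 4-line FASTQ records and a generic std::istream; this is a
    sketch of the strategy, not the actual Bowtie source.

        #include <istream>
        #include <mutex>
        #include <string>
        #include <vector>

        // B-parsing sketch: hold the input lock only long enough to copy N
        // raw records into private buffers; parsing happens after release.
        std::vector<std::string> read_batch(std::istream& in,
                                            std::mutex& in_lock,
                                            std::size_t n) {
            std::vector<std::string> bufs(n);
            std::lock_guard<std::mutex> g(in_lock);  // critical section begins
            for (auto& buf : bufs) {
                int nl = 0;
                char c;
                while (nl < 4 && in.get(c)) {        // a FASTQ record is 4 lines
                    if (c == '\n') ++nl;
                    buf.push_back(c);
                }
            }
            return bufs;  // lock released here; call parse_fastq() outside
        }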


  24. Parsing comparison (unpaired)
    [Plots: thread time vs. # threads on KNL for Bowtie, Bowtie 2, and HISAT.
    Legend: Original (O), Deferred (D), Batch deferred (B), MP baseline]
    • B-parsing (batch deferred) outperforms D-parsing
    (deferred), which outperforms O-parsing
    • MP baseline is still best, though MT with B-parsing
    already scales comparably for Bowtie 2
    Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling read aligners
    to hundreds of threads on general-purpose processors. bioRxiv, 205328.


  25. Parsing strategies
    (Same pseudocode as slide 23, repeated for reference)


  26. Blocked FASTQ
    [Figure: (a) standard end-1 and end-2 FASTQ files annotated with byte
    offsets; records begin at arbitrary offsets (e.g. read3 at 62 and 58).
    (b) blocked FASTQ with B = 64, N = 2: padding (filled rectangles)
    shifts the offsets so each group of N records starts at a multiple
    of B (read3 at 64 in both files)]
    • Use padding to prevent FASTQ records from spanning
    certain fixed B-byte boundaries (sketched below)
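    The idea can be sketched in C++ as below, under loud assumptions: 4-line
    records each ending in a newline, space characters as padding, and each
    block's N records totaling at most B bytes (the real format details are
    in the preprint).

        #include <fstream>
        #include <string>
        #include <vector>

        // Emit FASTQ so every group of N records starts at a multiple of B
        // bytes, padding out the tail of each block.
        void write_blocked(std::ofstream& out,
                           const std::vector<std::string>& records, // 4 lines each
                           std::size_t B, std::size_t N) {
            std::size_t in_block = 0, block_bytes = 0;
            for (const std::string& rec : records) {
                out << rec;                       // rec includes trailing newline
                block_bytes += rec.size();
                if (++in_block == N) {            // block full: pad to B bytes
                    out << std::string(B - block_bytes, ' ');
                    in_block = 0;
                    block_bytes = 0;
                }
            }
        }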


  27. Parsing strategies
    (Same pseudocode as slide 23, repeated for reference)


  28. Final comparison
    Compare:
    • B-parsing
    • Block deferred (L) parsing with 1 output file
    • Block deferred (L) parsing with 16 output files (to
    address the output bottleneck; sketched below)
    • MP baseline: multiple processes, 16 threads each
    • For Bowtie 2: BWA-MEM v0.7.16a as an additional comparison
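    The 16-output-files mode can be sketched as output striping, each
    stripe with its own lock (hypothetical structure and file naming; the
    actual implementation lives in the branches cited in the preprint):

        #include <array>
        #include <fstream>
        #include <mutex>
        #include <string>

        constexpr int NOUT = 16;
        std::array<std::ofstream, NOUT> outs;
        std::array<std::mutex, NOUT> out_locks;

        void open_outputs(const std::string& prefix) {
            for (int i = 0; i < NOUT; ++i)   // e.g. out.0.sam ... out.15.sam
                outs[i].open(prefix + "." + std::to_string(i) + ".sam");
        }

        // Each thread writes to a fixed stripe, so writers contend on any
        // one lock roughly 1/NOUT as often as with a single output file.
        void write_sam_striped(int thread_id, const std::string& sam_line) {
            const int i = thread_id % NOUT;
            std::lock_guard<std::mutex> g(out_locks[i]);
            outs[i] << sam_line << '\n';
        }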
    [Plots: thread time vs. # threads.
    Legend: Batch (B), Block (L) 1 output, Block (L) 16 outputs, MP baseline, BWA-MEM]


  29. Final comparison
    [Plots: thread time vs. # threads on KNL for Bowtie, Bowtie 2, and HISAT.
    Legend: Batch (B), Block (L) 1 output, Block (L) 16 outputs, MP baseline, BWA-MEM]
    • Block parsing competes very favorably with the MP
    baseline -- finally, pure MT is competitive with MP
    • For Bowtie 2, all tested modes scaled better than
    BWA-MEM
    Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling read aligners
    to hundreds of threads on general-purpose processors. bioRxiv, 205328.


  30. Final comparison
    [Plots: thread time vs. # threads for Bowtie, Bowtie 2, and HISAT on
    Skylake, Broadwell, and KNL. Legend: Batch (B), Block (L) 1 output,
    Block (L) 16 outputs, MP baseline, BWA-MEM]
    • Similar results for Broadwell & Skylake
    • Bowtie 2 memory scaling also superior to BWA-MEM
    (results in preprint)
    Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling read aligners
    to hundreds of threads on general-purpose processors. bioRxiv, 205328.


  31. Changes to tools
    All evaluated modes are available from public repos
    • See supplementary note 1 in preprint
    B-parsing with the TBB queueing lock is now the
    default in Bowtie & Bowtie 2
    • Block (L) parsing can't be the default because typical
    FASTQ isn't padded
    Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling read aligners
    to hundreds of threads on general-purpose processors. bioRxiv, 205328.


  32. Conclusions
    • Pure MT can often scale at least as well as
    MP+MT, unlocking the full spectrum of MP/MT
    trade-offs
    • Thread scaling and file formats are linked; our
    (least) favorite file formats could be better
    designed with an eye to thread scaling
    • Gains described here can generalize to other
    embarrassingly parallel software
    • MT scaling should be a first-class concern when
    describing and evaluating upstream genomics
    tools in the literature


  33. Preprint
    Scaling read aligners to hundreds of threads on
    general-purpose processors
    Ben Langmead (1,2,*), Christopher Wilks (1,2), Valentin Antonescu (1), and Rone Charles (1)
    (1) Department of Computer Science, Johns Hopkins University
    (2) Center for Computational Biology, Johns Hopkins University
    * Correspondence to: [email protected]
    February 4, 2018
    Abstract
    General-purpose processors can now contain many dozens of processor cores and support
    hundreds of simultaneous threads of execution. To make best use of these threads, genomics
    software must contend with new and subtle computer architecture issues. We discuss some
    of these and propose methods for improving thread scaling in tools that analyze each read
    independently, such as read aligners. We implement these methods in new versions of Bowtie,
    Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the
    recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by
    variable-record-length file formats like FASTQ and suggest changes that enable superior scaling.
    1 Introduction
    General-purpose processors are now capable of running hundreds of threads of execution
    simultaneously in parallel. Intel's Xeon Phi "Knights Landing" architecture supports 256–288
    simultaneous threads across 64–72 physical processor cores [1, 2]. With severe physical limits
    on clock speed [3], future architectures will likely support more simultaneous threads rather
    than faster in- [...]
    Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling read aligners
    to hundreds of threads on general-purpose processors. bioRxiv, 205328.


  34. Thanks
    Valentin Antonescu
    Daehwan Kim
    Chris Wilks
    Rone Charles
    Funding
    • Intel Parallel Computing Center 2014-2016
    • NIH R01GM118568
    • XSEDE allocation TG-CIE170020
