
HiCOMB 2018: Practical lessons from scaling read aligners to hundreds of threads

Presented Monday, May 21, 2018 at HiCOMB (in conjunction with IPDPS) in Vancouver, Canada

Ben Langmead

May 21, 2018
Transcript

  1. Practical lessons from scaling read aligners to hundreds of threads

     Ben Langmead, Assistant Professor, JHU Computer Science
     [email protected] | langmead-lab.org | @BenLangmead
     HiCOMB, May 21, 2018
  2. Upstream genomics software

     Much genomics software loops over reads and is embarrassingly
     parallel: read aligners, metagenomics binners, error correctors,
     transcriptome quantifiers, ... The assumption is broken, e.g., by
     spliced aligners (mildly) and by assemblers (severely).

         while(next_read()) { ... }

     Downstream tools pictured on the slide: variant caller, transcript
     assembler, peak caller, differential expression.
  3. How many threads?

     It's typical to compare tools based on the speed of a single thread
     or at a low thread count. The Bowtie 2 comparison is single-threaded:
     Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2.
     Nat Methods. 2012 Mar 4;9(4):357-9.
  4. How many threads?

     The Kraken speed comparison uses a single thread: Wood DE, Salzberg
     SL. Kraken: ultrafast metagenomic sequence classification using exact
     alignments. Genome Biol. 2014 Mar 3;15(3):R46.
  5. How many threads?

     MashMap2: 8 threads. Jain, C., Koren, S., Dilthey, A., Phillippy,
     A. M., & Aluru, S. (2018). A Fast Adaptive Algorithm for Computing
     Whole-Genome Homology Maps. bioRxiv, 259986.
  6. How many threads?

     The CLARK-S metagenomics binning tool demonstrates scaling results up
     to 8 threads: Ounit R, Lonardi S. Higher classification sensitivity
     of short metagenomic reads with CLARK-S. Bioinformatics. 2016 Dec
     15;32(24):3823-3825.
  7. How many threads?

     Investigations at higher thread counts tend to be done by computer
     scientists; up to 64 here: Lenis, J., & Senar, M. A. (2017). A
     performance comparison of data and memory allocation strategies for
     sequence aligners on NUMA architectures. Cluster Computing, 20(3),
     1909-1924.
  8. Intel Xeon Phi "Knights Landing"

     [Die photo with cores numbered 1-74; labels: 4 threads, 8 threads,
     64 threads]
     http://www.hardwareluxx.com/index.php/news/hardware/cpu/37217-intel-shows-xeon-phi-knights-landing-wafer-die.html
     Henceforth "KNL".
  9. MP & MT

     Multiprocessing (MP): keeping the cores busy by running many
     processes in parallel.
     Multithreading (MT): keeping the cores busy by running many threads
     in a single process.

     MP:
         bowtie -p 1 index in1of8 out1 &
         bowtie -p 1 index in2of8 out2 &
         bowtie -p 1 index in3of8 out3 &
         bowtie -p 1 index in4of8 out4 &
         bowtie -p 1 index in5of8 out5 &
         bowtie -p 1 index in6of8 out6 &
         bowtie -p 1 index in7of8 out7 &
         bowtie -p 1 index in8of8 out8 &

     MT:
         bowtie -p 8 index in out
 10. MP & MT

     The MP + MT spectrum, between pure MP and pure MT:

         bowtie -p 4 index in1of2 out &
         bowtie -p 4 index in2of2 out &

         bowtie -p 2 index in1of4 out &
         bowtie -p 2 index in2of4 out &
         bowtie -p 2 index in3of4 out &
         bowtie -p 2 index in4of4 out &

     MT is often combined with MP to keep the per-process thread count
     from creating a thread-scaling bottleneck.
 11. When pure MT scales well, we can choose points anywhere on the
     spectrum, achieving high throughput in more scenarios:
     • When reads arrive in a fast, hard-to-buffer stream
     • When MP static load balancing is not balanced
     • When MP dynamic load balancing doesn't scale well or is too
       complex to implement
     • When low single-job latency is desired
 12. Thread model

     All threads can do any kind of work: input, output, alignment. Input
     & output are synchronized via locks allowing just one thread at a
     time. While a thread holds a lock, it is in the critical section.
     (Thread states in the diagram: running, waiting, in critical
     section.)
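     As a minimal C++ illustration of this thread model (not the
     aligners' actual code): each worker takes an input lock to fetch a
     record, does its work unlocked, then takes an output lock to write.
     The file name "reads.txt" and the one-line-per-record format are
     assumptions for the sketch.

         // Sketch: every worker alternates between a locked input
         // critical section, unlocked "alignment" work, and a locked
         // output critical section.
         #include <fstream>
         #include <iostream>
         #include <mutex>
         #include <string>
         #include <thread>
         #include <vector>

         std::mutex in_lock, out_lock;

         void worker(std::istream& in, std::ostream& out) {
             std::string rec;
             while (true) {
                 {
                     std::lock_guard<std::mutex> g(in_lock);  // input critical section
                     if (!std::getline(in, rec)) return;      // one record per iteration
                 }
                 // ... alignment work on 'rec' happens here, outside any lock ...
                 {
                     std::lock_guard<std::mutex> g(out_lock); // output critical section
                     out << rec << '\n';
                 }
             }
         }

         int main() {
             std::ifstream in("reads.txt");  // hypothetical input file
             unsigned n = std::thread::hardware_concurrency();
             std::vector<std::thread> pool;
             for (unsigned i = 0; i < n; ++i)
                 pool.emplace_back(worker, std::ref(in), std::ref(std::cout));
             for (auto& t : pool) t.join();
         }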
 13. Questions explored

     • Can certain lock types reduce the overhead of entering and exiting
       critical sections?
     • Can different input parsing and output writing strategies reduce
       the frequency and duration of critical sections?
     • How do our file formats affect the complexity of critical
       sections?
     • Overall: can pure MT compete with MP+MT?
 14. Tools & versions

     Bowtie v1.1.2, Bowtie 2 v2.2.9, HISAT v0.1.6-beta
     • All used here for (unspliced) DNA alignment only; HISAT's spliced
       alignment disabled
     • All code changes are in publicly-available branches (see preprint
       Supplementary Note 1)
     • Intel Threading Building Blocks (TBB) 2017 Update 5
     Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling
     read aligners to hundreds of threads on general-purpose processors.
     bioRxiv, 205328.
 15. XSEDE: Stampede 2

     At the Texas Advanced Computing Center (TACC)
     • 4,200 Intel Knights Landing nodes, each with 68 cores, 96GB of DDR
       RAM, and 16GB of high-speed MCDRAM
     • 1,736 Intel Xeon Skylake nodes, each with 48 cores and 192GB of
       RAM
     https://www.tacc.utexas.edu/systems/stampede2
 16. Reads

     Reads are a mix from 3 projects that did HiSeq 2000 100 x 100
     paired-end sequencing:
     • Platinum Genomes (ERR194147)
     • 1000 Genomes Project (SRR069520)
     • Rustagi et al., low-coverage whole-genome DNA sequencing of South
       Asian individuals (SRR3947551)
     Aligned to hg38. The number of input reads per experiment is tuned
     so that 1 thread takes ~1 minute.
 17. Experimental setup

     FASTQ is read from a local disk; SAM output is written to the same
     local disk.
     • Broadwell: 7200 RPM SATA HD
     • Skylake & KNL: local solid-state drive (/tmp)
 18. Locking comparison

     Compare locking mechanisms:
     • TinyThread++ & TBB spin: simple atomic-operation spin loops
     • TBB standard: tries the lock; sleeps if unavailable (like the
       standard pthreads lock / futex mechanism)
     • TBB queueing: spins, but without an atomic op, and each thread
       spins on a separate cache line (AKA MCS lock)
     • MP baseline: multiple processes, 16 threads each (so it's really
       MP+MT)
     [Plots: thread time vs. # threads for TinyThread++ spin, TBB spin,
     TBB standard, TBB queueing, and the MP baseline]
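     To illustrate the queueing idea, below is a minimal MCS-style lock
     sketch in C++; it shows the concept behind the "TBB queueing" entry
     above, not TBB's actual implementation. Each waiter spins on a flag
     in its own queue node, so waiters poll separate cache lines rather
     than one shared word.

         #include <atomic>

         struct MCSNode {
             std::atomic<MCSNode*> next{nullptr};
             std::atomic<bool> locked{false};
         };

         class MCSLock {
             std::atomic<MCSNode*> tail{nullptr};
         public:
             void lock(MCSNode& me) {
                 me.next.store(nullptr, std::memory_order_relaxed);
                 me.locked.store(true, std::memory_order_relaxed);
                 MCSNode* prev = tail.exchange(&me, std::memory_order_acq_rel);
                 if (prev) {  // lock is held: enqueue behind predecessor
                     prev->next.store(&me, std::memory_order_release);
                     // Spin on our own node's flag (a private cache line).
                     while (me.locked.load(std::memory_order_acquire)) {}
                 }
             }
             void unlock(MCSNode& me) {
                 MCSNode* succ = me.next.load(std::memory_order_acquire);
                 if (!succ) {
                     MCSNode* expected = &me;  // no successor yet: try to clear tail
                     if (tail.compare_exchange_strong(expected, nullptr,
                                                      std::memory_order_acq_rel))
                         return;
                     // A successor is mid-enqueue; wait for the link.
                     while (!(succ = me.next.load(std::memory_order_acquire))) {}
                 }
                 succ->locked.store(false, std::memory_order_release);  // hand off
             }
         };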
 19. Locking comparison: thread time

     [Plots: thread time vs. # threads on KNL for Bowtie, Bowtie 2, and
     HISAT; series: TinyThread++ spin, TBB spin, TBB standard, TBB
     queueing, MP baseline]
     • Results are for weak scaling: as the thread count increases, the
       amount of per-thread work is held constant
     • Horizontal axis: # threads
     • Vertical axis: wall-clock running time of the longest-running
       thread, excluding index load time
 20. Locking comparison

     [Plots: thread time vs. # threads on KNL for Bowtie, Bowtie 2, and
     HISAT]
     • Bowtie 2, the slowest of the tools, has less "spread"
     • Of the MT methods, the TBB queueing lock seems to perform best
     • The MP baseline is best by far -- more work to do if 100% MT is to
       compete with the MP/MT mix
     Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling
     read aligners to hundreds of threads on general-purpose processors.
     bioRxiv, 205328.
 21. Locking comparison

     Spinning using an atomic operation is worse than spinning with a
     normal memory read:
     • An atomic op is treated like a write by the cache coherence
       infrastructure
     • Many threads spinning at once leads to a flood of cache coherence
       messages, clogging the bus
     Details in the preprint: Langmead, B., Wilks, C., Antonescu, V., &
     Charles, R. (2017). Scaling read aligners to hundreds of threads on
     general-purpose processors. bioRxiv, 205328.
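     A minimal sketch of the difference (not TBB's code): the first lock
     spins on an atomic exchange, generating coherence traffic on every
     iteration; the second, test-and-test-and-set style, spins on a plain
     load and only attempts the atomic operation when the lock looks
     free.

         #include <atomic>

         class SpinLock {  // "simple atomic-operation spin loop"
             std::atomic<bool> locked{false};
         public:
             void lock() {
                 // Every failed attempt is a read-modify-write: each
                 // waiter repeatedly invalidates the line for the others.
                 while (locked.exchange(true, std::memory_order_acquire)) {}
             }
             void unlock() { locked.store(false, std::memory_order_release); }
         };

         class TTASLock {  // test-and-test-and-set: spin on a read
             std::atomic<bool> locked{false};
         public:
             void lock() {
                 for (;;) {
                     // Spin with a normal load; no coherence write traffic.
                     while (locked.load(std::memory_order_relaxed)) {}
                     // Only now try the atomic op; retry if another thread won.
                     if (!locked.exchange(true, std::memory_order_acquire))
                         return;
                 }
             }
             void unlock() { locked.store(false, std::memory_order_release); }
         };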
 22. Parsing comparison

     To address the remaining difference between MT and MP+MT, we try to
     reduce the frequency and duration of the input-related critical
     sections.
 23. Parsing strategies

     (Original) O-parsing:
         obtain lock
         read = read_and_parse()
         release lock

     (Deferred) D-parsing:
         buf = new empty buffer
         obtain lock
         nl = 0
         while nl < 4:
             c = read_char()
             if c is newline: nl = nl + 1
             append c to buf
         release lock
         read = parse_fastq(buf)

     (Batch deferred) B-parsing:
         bufs = array of N empty buffers
         reads = array of N empty reads
         obtain lock
         for i = 1 to N:
             nl = 0
             while nl < 4:
                 c = read_char()
                 if c is newline: nl = nl + 1
                 append c to bufs[i]
         release lock
         for i = 1 to N:
             reads[i] = parse_fastq(bufs[i])

     (Block deferred) L-parsing:
         reads = array of N empty reads
         obtain lock
         buf = read B bytes
         release lock
         for i = 1 to N:
             reads[i] = parse_fastq(buf)
             advance buf to next read
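     Below is a minimal C++ sketch of B-parsing following the pseudocode
     above; the Read struct and the parse_fastq helper are simplified
     illustrations, not the aligners' actual code. The critical section
     only copies N raw 4-line FASTQ records into private buffers; parsing
     happens after the lock is released.

         #include <istream>
         #include <mutex>
         #include <string>
         #include <vector>

         struct Read { std::string name, seq, qual; };  // simplified record

         // Simplified parser: split a buffered 4-line FASTQ record into
         // fields (@name / sequence / '+' line / qualities).
         Read parse_fastq(const std::string& buf) {
             Read r;
             size_t p0 = buf.find('\n');
             size_t p1 = buf.find('\n', p0 + 1);
             size_t p2 = buf.find('\n', p1 + 1);
             r.name = buf.substr(1, p0 - 1);
             r.seq  = buf.substr(p0 + 1, p1 - p0 - 1);
             r.qual = buf.substr(p2 + 1, buf.size() - p2 - 2);
             return r;
         }

         std::mutex in_lock;

         // Fetch N raw records under the lock; parse after releasing it.
         std::vector<Read> next_batch(std::istream& in, size_t N) {
             std::vector<std::string> bufs;
             {
                 std::lock_guard<std::mutex> g(in_lock);  // raw I/O only
                 std::string line;
                 for (size_t i = 0; i < N; ++i) {
                     std::string buf;
                     for (int nl = 0; nl < 4 && std::getline(in, line); ++nl)
                         buf += line + '\n';   // accumulate the 4-line record
                     if (buf.empty()) break;   // end of input
                     bufs.push_back(std::move(buf));
                 }
             }                                 // lock released here
             std::vector<Read> reads;
             for (const std::string& b : bufs)
                 reads.push_back(parse_fastq(b));  // deferred parsing, unlocked
             return reads;
         }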
 24. Parsing comparison (unpaired)

     [Plots: thread time vs. # threads on KNL for Bowtie, Bowtie 2, and
     HISAT; series: Original (O), Deferred (D), Batch deferred (B), MP
     baseline]
     • B-parsing (batch deferred) outperforms D-parsing (deferred), which
       outperforms O-parsing
     • The MP baseline is still best, though MT with B-parsing already
       scales comparably for Bowtie 2
     Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling
     read aligners to hundreds of threads on general-purpose processors.
     bioRxiv, 205328.
 25. Parsing strategies (same slide as 23, shown again)
 26. Blocked FASTQ

     [Diagram: byte offsets of FASTQ records in standard vs. blocked
     layouts of the end-1 and end-2 FASTQ files, with B = 64 and N = 2]
     • Use padding (the filled rectangles in the diagram) to prevent
       FASTQ records from spanning certain fixed boundaries
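     As an illustration only, here is a sketch of writing a file in a
     blocked layout of this kind; the write_blocked_fastq helper and the
     choice of newline bytes as padding are assumptions, not the exact
     scheme from the preprint. The point is that every group of N records
     starts exactly at a multiple of B bytes, so a reader can grab any
     B-byte block and know it begins at a record boundary.

         #include <fstream>
         #include <stdexcept>
         #include <string>
         #include <vector>

         // records: raw 4-line FASTQ strings, each ending in '\n'.
         void write_blocked_fastq(std::ofstream& out,
                                  const std::vector<std::string>& records,
                                  size_t B, size_t N) {
             std::string block;
             for (size_t i = 0; i < records.size(); ++i) {
                 block += records[i];
                 if ((i + 1) % N == 0 || i + 1 == records.size()) {
                     if (block.size() > B)
                         throw std::runtime_error("N records exceed block size B");
                     // Pad to the fixed boundary (padding byte is an assumption).
                     block.resize(B, '\n');
                     out << block;
                     block.clear();
                 }
             }
         }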
 27. Parsing strategies (same slide as 23, shown again)
 28. Final comparison

     Compare:
     • B-parsing
     • Block deferred (L) parsing with 1 output file
     • Block deferred (L) parsing with 16 output files (to address the
       output bottleneck)
     • MP baseline: multiple processes, 16 threads each
     • For Bowtie 2: BWA-MEM v0.7.16a
     [Plots: thread time vs. # threads; series: Batch (B), Block (L) with
     1 output, Block (L) with 16 outputs, MP baseline, BWA-MEM]
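     A minimal sketch of the multiple-output-files idea, assuming a
     simple stripe-per-thread scheme (the StripedOutput class and file
     names are hypothetical, not the tools' actual implementation):
     giving each of K output files its own lock spreads contention that a
     single SAM output lock would concentrate.

         #include <fstream>
         #include <mutex>
         #include <string>
         #include <vector>

         class StripedOutput {
             std::vector<std::ofstream> files;
             std::vector<std::mutex> locks;
         public:
             explicit StripedOutput(size_t K) : files(K), locks(K) {
                 for (size_t i = 0; i < K; ++i)
                     files[i].open("out." + std::to_string(i) + ".sam");
             }
             // Each thread writes to the stripe matching its own id, so
             // with K >= thread count there is effectively no contention.
             void write(size_t thread_id, const std::string& sam_line) {
                 size_t k = thread_id % files.size();
                 std::lock_guard<std::mutex> g(locks[k]);
                 files[k] << sam_line << '\n';
             }
         };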
 29. Final comparison

     [Plots: thread time vs. # threads on KNL for Bowtie, Bowtie 2, and
     HISAT; series: Batch (B), Block (L) with 1 output, Block (L) with 16
     outputs, MP baseline, BWA-MEM]
     • Block parsing competes very favorably with the MP baseline --
       finally, pure MT is competing with MP
     • For Bowtie 2, all tested modes scaled better than BWA-MEM
     Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling
     read aligners to hundreds of threads on general-purpose processors.
     bioRxiv, 205328.
 30. Final comparison

     [Plots: thread time vs. # threads for Bowtie, Bowtie 2, and HISAT on
     Skylake, Broadwell, and KNL; series as on the previous slide]
     • Similar results for Broadwell & Skylake
     • Bowtie 2 memory scaling is also superior to BWA-MEM's (results in
       the preprint)
     Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling
     read aligners to hundreds of threads on general-purpose processors.
     bioRxiv, 205328.
 31. Changes to tools

     All evaluated modes are available from public repos
     • See Supplementary Note 1 in the preprint
     B-parsing with the TBB queueing lock is now the default in Bowtie &
     Bowtie 2
     • Block (L) parsing can't be the default because typical FASTQ isn't
       padded
     Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling
     read aligners to hundreds of threads on general-purpose processors.
     bioRxiv, 205328.
 32. Conclusions

     • Pure MT can often scale at least as well as MP+MT, unlocking the
       full spectrum of MP/MT trade-offs
     • Thread scaling and file formats are linked; our (least) favorite
       file formats could be better designed with an eye to thread
       scaling
     • The gains described here can generalize to other embarrassingly
       parallel software
     • MT scaling should be a first-class concern when describing and
       evaluating upstream genomics tools in the literature
 33. Preprint

     Scaling read aligners to hundreds of threads on general-purpose
     processors
     Ben Langmead (1,2,*), Christopher Wilks (1,2), Valentin Antonescu
     (1), and Rone Charles (1)
     (1) Department of Computer Science, Johns Hopkins University
     (2) Center for Computational Biology, Johns Hopkins University
     * Correspondence to: [email protected]
     February 4, 2018

     Abstract: General-purpose processors can now contain many dozens of
     processor cores and support hundreds of simultaneous threads of
     execution. To make best use of these threads, genomics software must
     contend with new and subtle computer architecture issues. We discuss
     some of these and propose methods for improving thread scaling in
     tools that analyze each read independently, such as read aligners.
     We implement these methods in new versions of Bowtie, Bowtie 2 and
     HISAT. We greatly improve thread scaling in many scenarios,
     including on the recent Intel Xeon Phi architecture. We also
     highlight how bottlenecks are exacerbated by variable-record-length
     file formats like FASTQ and suggest changes that enable superior
     scaling.

     1 Introduction (excerpt): General-purpose processors are now capable
     of running hundreds of threads of execution simultaneously in
     parallel. Intel's Xeon Phi "Knight's Landing" architecture supports
     256–288 simultaneous threads across 64–72 physical processor cores
     [1, 2]. With severe physical limits on clock speed [3], future
     architectures will likely support more simultaneous threads rather
     than faster in...

     Langmead, B., Wilks, C., Antonescu, V., & Charles, R. (2017). Scaling
     read aligners to hundreds of threads on general-purpose processors.
     bioRxiv, 205328.
 34. Thanks

     Valentin Antonescu, Daehwan Kim, Chris Wilks, Rone Charles

     Funding:
     • Intel Parallel Computing Center 2014-2016
     • NIH R01GM118568
     • XSEDE allocation TG-CIE170020