Scalable analysis of many sequencing datasets at once

Ben Langmead, Assistant Professor, JHU Computer Science
langmead-lab.org, @BenLangmead
University of Utah, November 11, 2016

Langmead lab
• Efficiency: Bowtie, Bowtie 2, Lighter, Arioc, HISAT
• Scalability: Rail-RNA, Boiler, Rail-dbGaP
• Resources: Intropolis, recount, Snaptron

Sequence Read Archive (SRA) growth
[Figure: SRA size in terabases over time; series: open access, total; 1 Pbp level marked]
3 -> 6 Pbp in ~18 months
Source: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement

Sequence Read Archive (SRA) growth, continued
6.15 -> 8.72 Pbp in October (plot series: open access, total)
Source: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement

MapReduce, Elastic MapReduce, Spot Marketplace

Abhinav Nellore, JHU & OHSU; Jeff Leek, JHU
Website: http://rail.bio, Paper: http://bit.ly/rail-aa

Spliced RNA-seq aligner for analyzing many samples at once
• From reads to alignments, coverage vectors & junctions
• Aggregate across samples to borrow strength and eliminate redundant work
• Annotation agnostic: let data, not annotation, prune the junction space
Website: http://rail.bio, Paper: http://bit.ly/rail-aa
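
To make "coverage vectors" concrete, here is a minimal sketch of how per-base coverage can be derived from alignment intervals. It uses a difference array followed by a running sum, which is a generic technique chosen for illustration and not necessarily how Rail-RNA computes coverage. The function name, the half-open interval convention, and the toy alignments are all assumptions.

```python
from itertools import accumulate


def coverage_vector(alignments, ref_length):
    """Return per-base coverage for half-open [start, end) alignment intervals."""
    # Difference array: +1 where an alignment starts, -1 just past where it ends.
    diffs = [0] * (ref_length + 1)
    for start, end in alignments:
        diffs[start] += 1
        diffs[end] -= 1
    # A running sum of the differences gives the coverage at each base.
    return list(accumulate(diffs[:ref_length]))


# Hypothetical toy alignments on a 10-base reference.
toy_alignments = [(0, 4), (2, 6), (5, 9)]
coverage = coverage_vector(toy_alignments, 10)
print(coverage)
```

A spliced read would simply contribute one interval per aligned exonic segment, so the intron it spans receives no coverage.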

Pass 1: align to genome, make junction calls
Pass 2: re-align to genome with putative junctions
[Figure labels: Reads, Ref, Readlets]

[Figure: grid of samples (Sample 1 to Sample 5) by candidate junctions (Candidate Junction 1 to 3)]
Aggregating across samples adds a dimension to junction call confidence
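
One simple way to picture the extra dimension is to count, for each candidate junction, the number of distinct samples in which it was called, and keep only junctions above a threshold. The sketch below assumes a threshold-based filter; the slides do not specify Rail-RNA's actual filtering rule, and the function name, the `min_samples` parameter, and the toy calls are hypothetical.

```python
from collections import defaultdict


def junctions_by_sample_support(calls, min_samples):
    """Keep junctions called in at least `min_samples` distinct samples."""
    samples_per_junction = defaultdict(set)
    for sample, junction in calls:
        samples_per_junction[junction].add(sample)
    return {
        junction: len(samples)
        for junction, samples in samples_per_junction.items()
        if len(samples) >= min_samples
    }


# Hypothetical calls: (sample id, (chromosome, intron start, intron end)).
toy_calls = [
    ("S1", ("chr1", 100, 200)),
    ("S2", ("chr1", 100, 200)),
    ("S3", ("chr1", 100, 200)),
    ("S1", ("chr1", 300, 450)),
    ("S4", ("chr2", 50, 90)),
    ("S5", ("chr2", 50, 90)),
]
supported = junctions_by_sample_support(toy_calls, min_samples=2)
print(supported)
```

In this toy input the junction seen only in S1 is dropped, while the two junctions seen in several samples are kept along with their support counts.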

Rail-RNA: design
[Workflow diagram; steps as labeled on the slide]
• Preprocess
• Aggregate duplicate reads
• Split into readlets
• Aggregate duplicate readlets
• Correlation clustering for readlet alignments
• Call splice junction
• Merge exon differentials
• Compile sample coverages
• Write bigWigs
• Write normalization factors
• Write spliced alignment BAMs
• Write junction & indel BEDs
Alignment steps (aligner labels on the slide, in order: Bowtie 2, Bowtie, Bowtie):
• Align reads end-to-end to genome
• Align readlets to genome
• Align readlets to junction co-occurrence index
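
The "aggregate duplicate reads", "split into readlets", and "aggregate duplicate readlets" steps can be sketched as follows. This is an illustration only: the readlet length, the stride, the rule for covering the tail of a read, and both function names are assumptions, not Rail-RNA's actual parameters or code.

```python
from collections import Counter


def aggregate_duplicates(sequences):
    """Collapse identical sequences into a sequence -> count mapping."""
    return Counter(sequences)


def split_into_readlets(read, readlet_len, stride):
    """Return (offset, readlet) pairs covering the read, including its tail."""
    if len(read) <= readlet_len:
        return [(0, read)]
    offsets = list(range(0, len(read) - readlet_len + 1, stride))
    last = len(read) - readlet_len
    if offsets[-1] != last:
        offsets.append(last)  # make sure the read's final bases are covered
    return [(off, read[off:off + readlet_len]) for off in offsets]


# Hypothetical toy reads; the first two are exact duplicates.
toy_reads = ["ACGTACGTAC", "ACGTACGTAC", "TTGACCAGTA"]
unique_reads = aggregate_duplicates(toy_reads)

# Each unique readlet carries the number of reads it came from,
# so it only needs to be aligned once.
readlets = Counter()
for read, count in unique_reads.items():
    for _, readlet in split_into_readlets(read, readlet_len=4, stride=3):
        readlets[readlet] += count
print(readlets)
```

Collapsing duplicates before alignment is one way the redundant work mentioned earlier can be avoided: a sequence seen many times is aligned once and its count is carried along.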

[Figure: log coverage tracks for Sample 1, Sample 2, Sample 3]

Marginal cost of analyzing 1 additional sample decreases as we add more samples
Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

Website: http://rail.bio
Paper: http://bit.ly/rail-aa
Repo: https://github.com/nellore/rail
Chat: https://gitter.im/nellore/rail
dbGaP website: http://docs.rail.bio/dbgap/
dbGaP paper: http://bit.ly/rail_dbgap
Abhinav Nellore, JHU & OHSU; Jeff Leek, JHU

What is the annotation tradeoff?
• Some new tools start with annotation; no attempt to discover junctions / isoforms
• Major projects (GTEx, GEUVADIS, TCGA) quantitate directly from annotated transcripts
• What is the nature of this tradeoff? How complete are the annotations?

What is the annotation tradeoff?
• Analyzed ~50,000 human RNA-seq samples with Rail-RNA; about 150 Tbp
• Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
• ~$1.40 / sample (compare to sequencing costs)
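
As a rough back-of-the-envelope check, the figures on this slide imply a total compute cost and an average sample size. The sketch below treats the approximate values (~50,000 samples, about 150 Tbp, ~$1.40 per sample) as if they were exact, which is an assumption; the variable names are illustrative.

```python
# Approximate figures from the slide, treated as exact for illustration.
samples = 50_000
total_bases = 150e12        # about 150 Tbp
cost_per_sample = 1.40      # US dollars

total_cost = samples * cost_per_sample           # roughly $70,000 overall
bases_per_sample = total_bases / samples         # roughly 3 Gbp per sample
cost_per_gbp = total_cost / (total_bases / 1e9)  # dollars per gigabase analyzed

print(f"total cost: ${total_cost:,.0f}")
print(f"mean sample size: {bases_per_sample / 1e9:.1f} Gbp")
print(f"cost per Gbp: ${cost_per_gbp:.2f}")
```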

[Figure, panel a: junction (jx) count J versus minimum number S of samples in which jx is called; junction classes: novel, alternative donor/acceptor, exon skip, fully annotated; labeled values: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%. Panels b and c: further views with a samples axis.]
Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.
Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega

[Figure: % of called junctions that are annotated (GENCODE v19), plotted across samples]
• For ~2.5% of samples, <50% of junction calls are annotated
• Median fraction of junction calls that are annotated: ~80%
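
The two statistics on this slide (the median annotated fraction and the share of samples below 50%) can be computed along these lines. This is a sketch under assumed inputs: junctions are represented as opaque identifiers, the annotation is a plain set, and the sample names and data are hypothetical, not values from the study.

```python
from statistics import median


def annotated_fraction(called, annotated):
    """Fraction of one sample's called junctions that appear in the annotation."""
    if not called:
        return None  # undefined for a sample with no junction calls
    return len(called & annotated) / len(called)


# Hypothetical annotation and per-sample junction calls.
toy_annotation = {"j1", "j2", "j3", "j4"}
toy_samples = {
    "A": {"j1", "j2", "j3", "j4", "n1"},
    "B": {"j1", "n1", "n2"},
    "C": {"j1", "j2", "j3", "j4"},
}

fractions = {
    name: annotated_fraction(called, toy_annotation)
    for name, called in toy_samples.items()
}
median_fraction = median(fractions.values())
mostly_unannotated = sorted(name for name, f in fractions.items() if f < 0.5)
print(fractions, median_fraction, mostly_unannotated)
```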

Djebali, Sarah, et al. "Landscape of transcription in human cells."
Nature 489.7414 (2012): 101-108.

RNA-seq & annotation
[Workflow diagram; boxes as labeled on the slide]
• Spliced alignment
• Isoform assembly
• Isoform quantitation
• Count overlaps w/ annotated features
• Differential gene / exon expression
Diagram annotations: (often with annotation), (quasi-, pseudo-)

RNA-seq: a third way
• Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
• Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
See also: GEUVADIS analysis in sec 2.4: http://bit.ly/rail-aa
Collado-Torres L, et al. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2016 Sep 29.

RNA-seq: a third way, continued
[Diagram adds the label "bigWigs" alongside Rail-RNA (spliced alignment) and derfinder (differentially expressed region finder)]

Resources
• Intropolis (in press, Genome Biology)
  Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.
• Snaptron (in preparation)
• recount (in revision)
  Collado-Torres L, et al. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv (2016): 068478.

Intropolis
Site: http://intropolis.rail.bio, preprint: http://j.mp/rail-sra-pre
Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.
Abhinav Nellore, JHU & OHSU

Intropolis
• Discovery of novel splicing events has leveled off
Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.

Intropolis
[Figure: PCA plot, PC1 vs. PC2; groups labeled ABRF, SEQC, GEU]
Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.

Snaptron
• Users pose flexible queries about splicing
• Query planner delegates to appropriate systems (SQLite, Tabix, Lucene) and indexes (R-tree, B-tree, inverted full text)
[Diagram: a query enters the Snaptron query planner, which consults three data stores/indexes: Lucene/document inverted index, SQLite/B-tree index, Tabix/R-tree index. Labeled flows: sample filter, filtered samples, region limited, region limited & filtered, filtered region. Records: junction records, sample metadata records. Example inverted index of sample metadata terms: "Brain" -> samples 1,2,3,6; "Liver" -> samples 4,6,9,11.]
Chris Wilks
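
The idea of combining a sample-metadata filter with a region-limited junction lookup can be sketched in a few lines. This is not Snaptron's API or implementation: an in-memory dictionary stands in for the inverted full-text index, a linear scan stands in for the R-tree/B-tree region lookup, and the function names and junction records are hypothetical. The term-to-sample mapping reuses the "Brain" and "Liver" example from the slide.

```python
from collections import defaultdict


def build_inverted_index(sample_metadata):
    """Map each lowercase metadata term to the set of sample ids containing it."""
    index = defaultdict(set)
    for sample_id, text in sample_metadata.items():
        for term in text.lower().split():
            index[term].add(sample_id)
    return index


def query_junctions(junctions, index, chrom, start, end, term):
    """Junctions inside [start, end] on `chrom`, limited to samples matching `term`."""
    wanted = index.get(term.lower(), set())
    hits = []
    for j in junctions:
        in_region = j["chrom"] == chrom and j["start"] >= start and j["end"] <= end
        shared = sorted(wanted & set(j["samples"]))
        if in_region and shared:
            hits.append((j["chrom"], j["start"], j["end"], shared))
    return hits


# Sample metadata chosen to reproduce the slide's example term lists.
toy_metadata = {1: "brain", 2: "brain", 3: "brain", 4: "liver",
                6: "brain liver", 9: "liver", 11: "liver"}
# Hypothetical junction records with the samples in which each was found.
toy_junctions = [
    {"chrom": "chr1", "start": 100, "end": 200, "samples": [1, 4, 6]},
    {"chrom": "chr1", "start": 500, "end": 900, "samples": [2, 3]},
    {"chrom": "chr2", "start": 100, "end": 200, "samples": [6, 9]},
]

index = build_inverted_index(toy_metadata)
brain_hits = query_junctions(toy_junctions, index, "chr1", 0, 1000, "Brain")
liver_hits = query_junctions(toy_junctions, index, "chr1", 0, 300, "Liver")
print(brain_hits)
print(liver_hits)
```

A real system would replace the linear scan with an interval index so that region queries do not touch every junction record.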

Snaptron
• Web service and UI currently available, preprint soon
• Example: two unannotated junctions on either side of an exonized repetitive element discovered by colleague Sarven Sabunciyan
[Figure: panels B and C]
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

recount
• Provides expression summaries at levels of genes, junctions, exons and coverage vectors
[Figure labels: Junctions, Genes, Coverage, Exons]
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

recount: expression data for ~70,000 human samples
• GTEx N=9,962
• TCGA N=10,327
• SRA N=49,848
[Diagram: samples x expression estimates (gene, exon, junctions, ERs); samples x phenotypes (?)]
Questions:
1. Have "novel" isoforms ever been seen previously? In what tissue? At what levels?
2. What regions of the human genome are transcribed in humans and in what tissues?
3. Do the same genes escape X inactivation across all tissues?
4. What expression changes occur as we age?
5. ...
Biological phenotypes: sex, age, tissue
recount: A large-scale resource of analysis-ready RNA-seq expression data. Leonardo Collado-Torres, Abhinav Nellore, Kai Kammers, Shannon E Ellis, Margaret A Taub, Kasper D Hansen, Andrew E Jaffe, Ben Langmead, Jeffrey Leek
Slide courtesy of Shannon Ellis
Abhinav Nellore, Leo Collado Torres

recount studies
• Tissue meta-analysis: compare colon & blood tissues from SRA, do same from GTEx, compare differential expression rankings
• Compare our gene expression measurements with those from GTEx project; high concordance
• Compare differential expression results when analysis is performed at the level of gene, exon, junction or DER
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.
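
One common way to compare two differential expression rankings is to measure how much their top-k lists overlap as k grows. The slide does not say which concordance measure was used, so the sketch below is an assumption chosen for illustration; the function name and gene identifiers are hypothetical.

```python
def concordance_at_top(ranking_a, ranking_b, k):
    """Fraction of the top-k entries of ranking_a that are also in the top k of ranking_b."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


# Hypothetical gene rankings, most differentially expressed first.
sra_ranking = ["g1", "g2", "g3", "g4", "g5"]
gtex_ranking = ["g2", "g1", "g5", "g3", "g4"]

concordance = {k: concordance_at_top(sra_ranking, gtex_ranking, k) for k in (1, 2, 3, 5)}
print(concordance)
```

Values near 1 across a range of k indicate that the two analyses agree on which genes are most differentially expressed, even if the exact order differs. The function assumes both rankings contain at least k entries.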

recount studies
https://rpubs.com/crazyhottommy/heatmap_demystified

recount
• Shiny-app front-end: https://jhubiostatistics.shinyapps.io/recount/
• SciServer Compute lets users work with locally-hosted data in Jupyter notebooks: http://compute.sciserver.org/dashboard/
• Bioconductor 3.4 package: https://www.bioconductor.org/packages/recount/
• Preprint: http://bit.ly/recount_pre
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
• NIH R01GM118568
• NSF CAREER IIS-1349906
• Sloan Research Fellowship
• IDIES Seed Funding program
• Amazon Web Services
langmead-lab.org, @BenLangmead
Thank you: IDIES Seed funding, SciServer, SciServer Compute
Also for DERfinder: Rafa Irizarry, Sarven Sabunciyan, Mike Love
