Summarizing tens of thousands of RNA-seq samples: themes and lessons

Ben Langmead, Assistant Professor, JHU Computer Science
[email protected], langmead-lab.org, @BenLangmead
Banff, March 30, 2017

Themes
• Public data is available & valuable but hard to use
• Scalable software benefits from big resources & data
• Strategically ignoring gene annotations can yield clearer results

Sequence Read Archive (SRA) growth
[Figure: SRA size over time, open access vs. total; log axis from 10 Tbp to 10 Pbp]
• 4.5 -> 9 Pbp in ~10 months
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement

Abhinav Nellore (OHSU), Jeff Leek (JHU)
Website: http://rail.bio
Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

Spliced RNA-seq aligner for analyzing many samples at once
• Aggregate across samples to borrow strength and eliminate redundant alignment work
• Let data prune false junction calls, not annotation
• Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for
• Runs easily on commercial AWS cloud, other clusters

Aggregating across samples adds a dimension to junction call confidence
[Figure: grid of candidate junctions (CJ 1 to CJ 3) against samples (S 1 to S 5)]
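
One way to picture the idea above is a filter that keeps a candidate junction only if enough samples support it, or if one sample supports it strongly. The following is a minimal sketch under stated assumptions: the function name, the input layout, and both thresholds are hypothetical and chosen for illustration, not taken from Rail-RNA.

```python
from collections import defaultdict


def filter_junctions(calls, n_samples,
                     min_sample_fraction=0.05,
                     min_reads_in_one_sample=5):
    """Keep junctions supported across samples or strongly within one sample.

    calls: iterable of (sample_id, junction_id, read_count) tuples.
    Both thresholds are illustrative assumptions.
    """
    samples_with = defaultdict(set)   # junction -> samples where it was called
    max_reads = defaultdict(int)      # junction -> best single-sample support
    for sample_id, junction, reads in calls:
        samples_with[junction].add(sample_id)
        max_reads[junction] = max(max_reads[junction], reads)

    kept = set()
    for junction, samples in samples_with.items():
        widespread = len(samples) / n_samples >= min_sample_fraction
        strong = max_reads[junction] >= min_reads_in_one_sample
        if widespread or strong:
            kept.add(junction)
    return kept


# Toy data: 5 samples, 3 candidate junctions.
toy_calls = [
    ("S1", "CJ1", 1), ("S2", "CJ1", 1), ("S3", "CJ1", 1), ("S5", "CJ1", 1),
    ("S2", "CJ2", 1),                  # weak support, one sample only
    ("S4", "CJ3", 8),                  # one sample, but many reads
]
kept = filter_junctions(toy_calls, n_samples=5, min_sample_fraction=0.4)
print(sorted(kept))
```

In this toy case CJ1 survives because it recurs across samples, CJ3 survives on read depth in a single sample, and CJ2 is pruned. No annotation is consulted at any point.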

Rail-RNA design
[Figure: workflow diagram; box labels as extracted, not necessarily in execution order]
Processing steps:
• Preprocess
• Aggregate duplicate reads
• Split into readlets
• Aggregate duplicate readlets
• Correlation clustering for readlet alignments
• Call splice junction
• Merge exon differentials
• Compile sample coverages
Outputs:
• Write bigWigs
• Write normalization factors
• Write spliced alignment BAMs
• Write junction & indel BEDs
Alignment steps:
• Align reads end-to-end to genome
• Align readlets to genome
• Align readlets to junction co-occurrence index
Aligner labels in the figure, in order: Bowtie 2, Bowtie, Bowtie
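
Two of the boxes in the design, "Aggregate duplicate reads" and "Split into readlets", can be sketched in a few lines. This is only an illustration of the general technique: the function names, the readlet length of 25, and the stride of 12 are assumptions made for this sketch, not Rail-RNA's actual code or settings.

```python
from collections import Counter


def aggregate_duplicates(reads_by_sample):
    """Collapse identical read sequences across all samples.

    Returns a Counter mapping sequence -> total occurrences, so each
    distinct sequence only needs to be aligned once.
    """
    counts = Counter()
    for reads in reads_by_sample.values():
        counts.update(reads)
    return counts


def split_into_readlets(read, readlet_len=25, stride=12):
    """Split a read into overlapping fixed-length segments (readlets).

    Returns (offset, sequence) pairs. A final readlet is anchored at the
    read's end so the last bases are always covered.
    """
    if len(read) <= readlet_len:
        return [(0, read)]
    last = len(read) - readlet_len
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)
    return [(s, read[s:s + readlet_len]) for s in starts]


toy_reads = {
    "sample1": ["ACGT" * 10, "ACGT" * 10, "TTGCA" * 8],
    "sample2": ["ACGT" * 10],
}
unique = aggregate_duplicates(toy_reads)
print(len(unique), "distinct reads out of", sum(unique.values()))
for seq in unique:
    print([offset for offset, _ in split_into_readlets(seq)])
```

Deduplicating before alignment is where pooling samples saves work: a sequence seen in many samples is aligned once and its result is shared.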

[Figure: log coverage tracks for Sample 1, Sample 2, Sample 3]

Better-than-linear scaling
• Marginal cost of analyzing 1 additional sample decreases as we add more samples
Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

Intropolis
Website: http://intropolis.rail.bio
Paper: http://bit.ly/intropolis
Abhi Nellore (OHSU), Jeff Leek (JHU)

RNA-seq & annotation
[Figure: workflow diagram built up over three slides; box labels as extracted]
• Reads
• Spliced alignment
• Isoform assembly
• Isoform quantification
• Count overlaps w/ annotated features
• Differential gene / exon expression
• (quasi-, pseudo-)

Good proxy?
[Figure: diagram with labels Reads, Isoform quantification, Transcripts in sample, Transcripts in annotation]

Compiling Intropolis
• Analyzed ~21,500 human RNA-seq samples with Rail-RNA; about 62 Tbp
• Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
• ~$0.72 / sample (compare to sequencing costs)
[Figure: matrix of junctions (jxs) by samples]
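
The figures on this slide imply a rough total compute bill. The short calculation below just multiplies them out; since both the sample count and the per-sample cost are approximate on the slide, the results are approximate too. The variable names are chosen for this sketch.

```python
# Approximate figures quoted on the slide.
n_samples = 21_500
cost_per_sample_usd = 0.72
total_tbp = 62

total_cost_usd = n_samples * cost_per_sample_usd
cost_per_tbp_usd = total_cost_usd / total_tbp

print(f"Approximate total compute cost: ${total_cost_usd:,.0f}")
print(f"Approximate cost per Tbp analyzed: ${cost_per_tbp_usd:,.0f}")
```

That puts the whole run at roughly $15,500, or about $250 per Tbp.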

[Figure, panels a to c: junction (jx) count J against the minimum number S of samples in which the jx is called, broken down into fully annotated, exon skip, alternative donor/acceptor, and novel junctions. Values labeled on the plot: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%]
Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega

[Figure: % of called junctions that are annotated (GENCODE v19), per sample]
• For ~2.5% of samples, <50% of junction calls are annotated
• Median fraction of junction calls that are annotated: ~80%
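
The per-sample statistic on this slide can be computed with plain set operations. The sketch below uses made-up toy junctions and a hypothetical helper name; it shows the calculation only and does not reproduce the reported numbers.

```python
from statistics import median


def annotated_fraction(called, annotated):
    """Fraction of a sample's called junctions that appear in the annotation."""
    if not called:
        return None
    return len(called & annotated) / len(called)


# Toy junctions as (chromosome, start, end) tuples; all values are made up.
annotation = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90)}
calls_by_sample = {
    "S1": {("chr1", 100, 200), ("chr1", 300, 400)},
    "S2": {("chr1", 100, 200), ("chr1", 120, 200)},
    "S3": {("chr2", 50, 90), ("chr2", 60, 90),
           ("chr2", 70, 90), ("chr2", 80, 90)},
}

fractions = {s: annotated_fraction(c, annotation)
             for s, c in calls_by_sample.items()}
mostly_unannotated = [s for s, f in fractions.items() if f < 0.5]

print(fractions)
print("median:", median(fractions.values()))
print("samples with <50% annotated calls:", mostly_unannotated)
```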

A third way
• Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
• bigWigs
• Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2017 Jan 25;45(2):e9.
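
To make "region-based, annotation-agnostic" concrete, here is a simplified sketch of finding expressed regions directly from coverage vectors: average coverage across samples, then report contiguous runs of bases above a cutoff. derfinder itself is an R/Bioconductor package; this Python function, its name, and the cutoff are assumptions for illustration and not derfinder's API.

```python
def expressed_regions(coverage_by_sample, cutoff):
    """Return half-open (start, end) intervals where mean coverage > cutoff.

    coverage_by_sample: list of equal-length per-base coverage vectors,
    one per sample. No gene annotation is used.
    """
    n = len(coverage_by_sample)
    length = len(coverage_by_sample[0])
    mean_cov = [sum(sample[i] for sample in coverage_by_sample) / n
                for i in range(length)]

    regions = []
    start = None
    for i, cov in enumerate(mean_cov):
        if cov > cutoff:
            if start is None:
                start = i                    # a region opens here
        elif start is not None:
            regions.append((start, i))       # region closes before base i
            start = None
    if start is not None:                    # region runs to the end
        regions.append((start, length))
    return regions


toy_coverage = [
    [0, 0, 5, 6, 7, 0, 0, 3, 4, 0],
    [0, 1, 5, 4, 9, 0, 0, 5, 2, 0],
]
print(expressed_regions(toy_coverage, cutoff=2))
```

The regions found this way can then be tested for differential expression across conditions, whether or not they overlap a known gene.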

Boiler: RNA-seq alignment compression
Jacob Pritt
• As big as bigWigs & 1-2 orders of magnitude smaller than sorted BAMs
• Usable with Cufflinks, StringTie
[Figure: paired reads summarized as coverage, length tallies, and co-occurrence patterns]
Pritt J, Langmead B. Boiler: lossy compression of RNA-seq alignments using coverage vectors. Nucleic Acids Res. 2016 Sep 19;44(16):e133.

Snaptron & recount2
• Where to start for our summaries of public human RNA-seq; both include:
  • ~50K SRA samples
  • ~10K GTEx samples
  • ~10K TCGA samples
• Snaptron: pose sophisticated queries re: splicing, quick responses, no downloading

Snaptron
Chris Wilks
Query planner delegates to appropriate systems (sqlite, tabix, lucene) and indexes (R-tree, B-tree, inverted full text)
[Figure: a query enters the Snaptron query planner, which routes it to the Tabix/R-tree index (region limits on junction records), the SQLite/B-tree index (filters on junction records), and the Lucene/inverted document index (sample metadata filters). Inverted index example: "Brain" -> samples 1,2,3,6; "Liver" -> samples 4,6,9,11]
Wilks C, Gaddipati P, Nellore A, Langmead B. "Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples." bioRxiv (2017): 097881.
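
The combination of a region limit with a sample-metadata filter can be illustrated with a toy in-memory version. This is a sketch under stated assumptions: the class and function names and the junction coordinates are hypothetical, a linear scan stands in for the Tabix/R-tree lookup, and a plain dictionary stands in for the Lucene inverted index. Only the "Brain"/"Liver" term-to-sample mapping comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    chrom: str
    start: int
    end: int
    samples: frozenset   # IDs of samples in which the junction was called


def build_inverted_index(metadata):
    """Map each lowercase metadata term to the set of sample IDs carrying it."""
    index = {}
    for sample_id, terms in metadata.items():
        for term in terms:
            index.setdefault(term.lower(), set()).add(sample_id)
    return index


def query(junctions, index, chrom, start, end, term=None):
    """Region-limit the junctions, then optionally filter by a metadata term."""
    hits = [j for j in junctions
            if j.chrom == chrom and j.start >= start and j.end <= end]
    if term is None:
        return hits
    wanted = index.get(term.lower(), set())
    return [j for j in hits if j.samples & wanted]


# Term-to-sample mapping from the slide's example; coordinates are made up.
metadata = {1: ["Brain"], 2: ["Brain"], 3: ["Brain"], 6: ["Brain", "Liver"],
            4: ["Liver"], 9: ["Liver"], 11: ["Liver"]}
index = build_inverted_index(metadata)
junctions = [
    Junction("chr2", 1000, 2000, frozenset({1, 2})),
    Junction("chr2", 1500, 2500, frozenset({4, 9})),
    Junction("chr2", 9000, 9500, frozenset({6})),
]
brain_hits = query(junctions, index, "chr2", 0, 3000, term="Brain")
print(brain_hits)
```

A real planner would choose which index to consult first based on how selective each part of the query is; the sketch always applies the region limit first.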

Snaptron
Example queries:
• Junctions in ALK gene
• # times junction occurs in each of 50,000 SRA samples
• Tissue specificity of junction in GTEx data
• Samples ranked according to how overrepresented one splicing pattern is relative to another
http://snaptron.cs.jhu.edu
Wilks C, Gaddipati P, Nellore A, Langmead B. "Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples." bioRxiv (2017): 097881.

Thank you
Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
Also for DERfinder: Rafa Irizarry, Sarven Sabunciyan, Mike Love
• NIH R01GM118568
• NSF CAREER IIS-1349906
• Sloan Research Fellowship
• IDIES Seed Funding program
• SciServer, SciServer Compute
• Amazon Web Services
langmead-lab.org, @BenLangmead
