PSU 2017: Unlocking sequence data archives with scalable software and resources

Ben Langmead, Assistant Professor, JHU Computer Science
[email protected], langmead-lab.org, @BenLangmead
Penn State, March 1, 2017

Themes
• Public data is available & valuable but hard to use
• Scalable software benefits from big resources & data
• Strategically ignoring gene annotations can yield clearer results

Sequence Read Archive (SRA) growth
• 4.5 -> 9 Pbp in ~10 months
[Figure: SRA size over time; series: Total and Open access; axis ticks at 10 Tbp, 100 Tbp, 1 Pbp, 10 Pbp]
Source: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement

Abhinav Nellore (OHSU), Jeff Leek (JHU)
Website: http://rail.bio
Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

Spliced RNA-seq aligner for analyzing many samples at once
• Aggregate across samples to borrow strength and eliminate redundant alignment work
• Let data prune false junction calls, not annotation
• Concise outputs: junctions, junction evidence, coverage vectors; alignments only when asked
• Runs easily on commercial AWS cloud, other clusters
Website: http://rail.bio
Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

• Pass 1: align to genome, make junction calls
• Pass 2: re-align to genome with putative junctions
[Figure labels: Reads, Ref, Readlets]

Aggregating across samples adds a dimension to junction call confidence
[Figure labels: Sample 1 to Sample 5; Candidate Junction 1 to Candidate Junction 3]
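
A minimal sketch of the idea on this slide: the number of samples that support a candidate junction is treated as extra evidence for it. The function name, the toy calls, and the `min_samples` threshold are hypothetical; the criteria Rail-RNA actually applies are not given on the slide.

```python
from collections import defaultdict

def filter_junctions(calls_by_sample, min_samples=2):
    """Keep candidate junctions called in at least `min_samples` samples.

    calls_by_sample maps a sample name to the set of junctions called in it;
    a junction is a (chrom, start, end) tuple. All names are illustrative.
    """
    support = defaultdict(set)
    for sample, junctions in calls_by_sample.items():
        for jx in junctions:
            support[jx].add(sample)
    # Report each surviving junction with its number of supporting samples.
    return {jx: len(samples) for jx, samples in support.items()
            if len(samples) >= min_samples}

# Toy input: three candidate junctions across five samples.
cj1, cj2, cj3 = ("chr1", 100, 200), ("chr1", 300, 450), ("chr2", 50, 90)
calls = {
    "S1": {cj1, cj2},
    "S2": {cj1},
    "S3": {cj1, cj3},
    "S4": {cj1, cj2},
    "S5": {cj1},
}
kept = filter_junctions(calls, min_samples=2)
```

Here `cj3` is seen in a single sample and is dropped, while `cj1` and `cj2` are kept because more than one sample supports them.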

Rail-RNA design
[Workflow diagram; steps shown:]
• Preprocess
• Aggregate duplicate reads
• Split into readlets
• Aggregate duplicate readlets
• Correlation clustering for readlet alignments
• Call splice junction
• Merge exon differentials
• Compile sample coverages
• Write bigWigs
• Write normalization factors
• Write spliced alignment BAMs
• Write junction & indel BEDs
• Align reads end-to-end to genome (Bowtie 2)
• Align readlets to genome (Bowtie)
• Align readlets to junction co-occurrence index (Bowtie)

[Figure: log coverage for Sample 1, Sample 2, Sample 3]

Better-than-linear scaling
• Marginal cost of analyzing 1 additional sample decreases as we add more samples
Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

Website: http://rail.bio
Paper: http://bit.ly/rail-aa
Repo: https://github.com/nellore/rail
Chat: https://gitter.im/nellore/rail
dbGaP
Website: http://docs.rail.bio/dbgap/
Paper: http://bit.ly/rail_dbgap
Abhinav Nellore (OHSU), Jeff Leek (JHU)

Intropolis
Website: http://intropolis.rail.bio
Paper: http://bit.ly/intropolis
Abhi Nellore (OHSU), Jeff Leek (JHU)

RNA-seq & annotation
[Diagram labels: Reads; Spliced alignment; Isoform assembly; Isoform quantification; (quasi-, pseudo-); Count overlaps w/ annotated features; Differential gene / exon expression]

Good proxy?
[Diagram labels: Transcripts in sample; Transcripts in annotation; Reads; Isoform quantification]

Compiling Intropolis
• Analyzed ~21,500 human RNA-seq samples with Rail-RNA; about 62 Tbp
• Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
• ~$0.72 / sample (compare to sequencing costs)
[Figure labels: jxs, samples]
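
As a rough worked example, the figures on this slide imply the following totals. The inputs are the slide's approximate values, so the results are order-of-magnitude estimates only.

```python
samples = 21_500          # approximate sample count from the slide
cost_per_sample = 0.72    # approximate USD per sample from the slide
total_bp = 62e12          # about 62 Tbp from the slide

total_cost = samples * cost_per_sample      # estimated compute cost in USD
gbp_per_sample = total_bp / samples / 1e9   # mean sequence per sample in Gbp

print(f"~${total_cost:,.0f} total, ~{gbp_per_sample:.1f} Gbp per sample")
```

That is on the order of $15,000 of compute for the whole compilation, at roughly 3 Gbp per sample.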

[Figure, panel a: junction (jx) count J versus minimum number S of samples in which the jx is called; categories: Fully annotated, Exon skip, Alternative donor/acceptor, Novel; values marked on the plot: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%. Panels b and c: additional views, with an axis labeled "samples".]
Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
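
The curve in panel a can be computed from per-junction sample counts. This is a minimal sketch under that assumption; the function name and the toy counts are hypothetical and are not data from the paper.

```python
import bisect

def junctions_at_least(sample_counts, thresholds):
    """For each threshold S, count junctions called in at least S samples."""
    ordered = sorted(sample_counts)
    n = len(ordered)
    # bisect_left returns how many junctions have a count below S.
    return [n - bisect.bisect_left(ordered, s) for s in thresholds]

# Hypothetical number of samples in which each of nine junctions was called.
counts = [1, 1, 2, 5, 5, 40, 1000, 1000, 8000]
curve = junctions_at_least(counts, thresholds=[1, 2, 1000, 9000])
```

Running the same count separately on annotated and unannotated junctions would give the per-category fractions that the plot reports at each S.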

[Figure: % of called junctions that are annotated (GENCODE v19), plotted across samples]
• For ~2.5% of samples, <50% of junction calls are annotated
• Median fraction of junction calls that are annotated: ~80%

RNA-seq & annotation
[Diagram labels: Spliced alignment; Isoform assembly; Isoform quantitation; (often with annotation); (quasi-, pseudo-); Count overlaps w/ annotated features; Differential gene / exon expression]

A third way
• Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
• Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2017 Jan 25;45(2):e9.

A third way
Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2017 Jan 25;45(2):e9.

A third way
• Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
• Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
[Diagram label: bigWigs]
Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2017 Jan 25;45(2):e9.
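
To illustrate what "region-based, annotation-agnostic" can mean in practice, here is a minimal sketch that finds contiguous stretches where mean coverage across samples exceeds a cutoff. derfinder is an R/Bioconductor package and this is not its API; the function name, the cutoff, and the toy coverage values are assumptions made for illustration.

```python
def expressed_regions(mean_coverage, cutoff):
    """Return half-open (start, end) runs where mean coverage exceeds cutoff."""
    regions, start = [], None
    for pos, cov in enumerate(mean_coverage):
        if cov > cutoff and start is None:
            start = pos                      # a run begins here
        elif cov <= cutoff and start is not None:
            regions.append((start, pos))     # the run ended at the previous base
            start = None
    if start is not None:
        regions.append((start, len(mean_coverage)))
    return regions

# Hypothetical per-base coverage for three samples over eight bases.
samples = [
    [0, 4, 6, 5, 0, 0, 3, 9],
    [0, 6, 8, 7, 1, 0, 2, 7],
    [0, 5, 7, 6, 0, 0, 1, 8],
]
mean_cov = [sum(col) / len(col) for col in zip(*samples)]
regions = expressed_regions(mean_cov, cutoff=2.5)
```

No gene model is consulted at any point: the regions come from the coverage alone and could then be tested for differential expression.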

Boiler: RNA-seq alignment compression
• As big as bigWigs & 1-2 orders of magnitude smaller than sorted BAMs
• Usable with Cufflinks, StringTie
[Figure labels: F, R1, R2; Coverage; Length tallies; Co-occurrence patterns]
Jacob Pritt
Pritt J, Langmead B. Boiler: lossy compression of RNA-seq alignments using coverage vectors. Nucleic Acids Res. 2016 Sep 19;44(16):e133.
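
A minimal sketch of the kind of summary named in the figure labels: alignments reduced to a coverage vector plus a tally of lengths. This is not Boiler's file format or algorithm; the function name and toy intervals are hypothetical, and co-occurrence patterns are left out.

```python
from collections import Counter

def summarize_alignments(alignments, ref_length):
    """Reduce alignments to a coverage vector and a tally of read lengths.

    alignments is a list of half-open (start, end) intervals on one reference.
    """
    diff = [0] * (ref_length + 1)
    lengths = Counter()
    for start, end in alignments:
        diff[start] += 1      # coverage rises where an alignment starts
        diff[end] -= 1        # and falls just past where it ends
        lengths[end - start] += 1
    coverage, depth = [], 0
    for delta in diff[:ref_length]:
        depth += delta
        coverage.append(depth)
    return coverage, lengths

coverage, lengths = summarize_alignments(
    [(0, 4), (2, 6), (2, 6), (5, 8)], ref_length=10)
```

The summary is lossy in the sense the citation describes: several different sets of alignments can produce the same coverage vector and length tally.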

Snaptron & recount2
• Best places to start if you're interested in our summaries of public human RNA-seq; both include:
  • ~50K SRA samples
  • ~10K GTEx samples
  • ~10K TCGA samples
• Snaptron: pose sophisticated queries re: splicing, with quick responses, no need to download data
• recount2: reanalyze starting from our gene-, exon-, junction- or coverage-level summaries

Snaptron
• Query planner delegates to appropriate systems (sqlite, tabix, lucene) and indexes (R-tree, B-tree, inverted full text)
[Diagram: a Query goes to the Snaptron Query Planner, which uses three data stores/indexes (Tabix/R-tree index, Lucene/inverted document index, SQLite/B-tree index) to apply Region Limited, Filtered Region, and Sample Filter steps; the Output is Junction Records and Sample Metadata Records. Example inverted index of sample metadata terms to samples: "Brain" -> 1,2,3,6; "Liver" -> 4,6,9,11]
Chris Wilks
Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. bioRxiv (2017): 097881.
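
A toy version of the delegation described above, assuming a region lookup and a metadata-term lookup whose results are then combined. A plain dict stands in for the inverted text index and a linear scan stands in for the region index; the junction records and the `query` function are hypothetical, and only the "Brain" and "Liver" postings come from the slide.

```python
# Inverted index from metadata term to sample IDs (values taken from the slide).
term_index = {"Brain": {1, 2, 3, 6}, "Liver": {4, 6, 9, 11}}

# Hypothetical junction records: (chrom, start, end, samples containing it).
junctions = [
    ("chr2", 100, 500, {1, 4, 7}),
    ("chr2", 800, 1200, {2, 6}),
    ("chr2", 5000, 5600, {3, 9}),
    ("chr3", 100, 500, {6, 11}),
]

def query(region, term=None):
    """Return junctions overlapping `region`, optionally limited by sample term."""
    chrom, lo, hi = region
    # Region step: keep junctions on the chromosome that overlap [lo, hi].
    hits = [j for j in junctions if j[0] == chrom and j[1] <= hi and j[2] >= lo]
    if term is None:
        return hits
    # Sample-filter step: keep only samples whose metadata matches the term.
    wanted = term_index.get(term, set())
    return [(c, s, e, ids & wanted) for c, s, e, ids in hits if ids & wanted]

brain_hits = query(("chr2", 1, 2000), term="Brain")
```

A real planner would choose which index to consult first based on the query; the sketch always applies the region step first.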

Snaptron
Example queries:
• Junctions in ALK gene
• # times junction occurs in each of 50,000 SRA samples
• Tissue specificity of junction in GTEx data
• Samples ranked according to how overrepresented one splicing pattern is relative to another
http://snaptron.cs.jhu.edu
Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. bioRxiv (2017): 097881.
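
The last example query, ranking samples by how overrepresented one splicing pattern is relative to another, could be scored as sketched below. The normalized difference (B - A) / (A + B + 1) is one plausible choice; whether it matches the score Snaptron itself uses is not stated on this slide, and the function name and counts are hypothetical.

```python
def rank_samples(counts_a, counts_b):
    """Rank samples by how strongly pattern B outweighs pattern A.

    counts_a and counts_b map sample ID to read counts for each pattern.
    Score = (B - A) / (A + B + 1); the +1 damps scores for low-coverage samples.
    """
    scores = {}
    for s in set(counts_a) | set(counts_b):
        a, b = counts_a.get(s, 0), counts_b.get(s, 0)
        scores[s] = (b - a) / (a + b + 1)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_samples({"s1": 10, "s2": 0, "s3": 5},
                      {"s1": 0, "s2": 9, "s3": 5})
```

Samples that use only pattern B score near +1, samples that use only pattern A score near -1, and balanced samples score near 0.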

recount2
• Summaries over the 70K samples at levels of genes, junctions, exons and coverage vectors
[Figure labels: Junctions, Genes, Coverage, Exons]
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

recount2
• Shiny-app front-end: https://jhubiostatistics.shinyapps.io/recount/
• Bioconductor 3.4 package: https://www.bioconductor.org/packages/recount/
• Preprint: http://bit.ly/recount_pre
Abhinav Nellore, Leo Collado Torres
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

Intropolis
• Discovery of novel splicing events has leveled off
• Good time to put effort into a more complete annotation
Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.

Intropolis
[Figure: PC1 versus PC2, with points labeled ABRF, SEQC, and GEU]
Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.

Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
Also for DERfinder: Rafa Irizarry, Sarven Sabunciyan, Mike Love
Thank you:
• NIH R01GM118568
• NSF CAREER IIS-1349906
• Sloan Research Fellowship
• IDIES Seed Funding program
• Amazon Web Services
• SciServer, SciServer Compute
langmead-lab.org, @BenLangmead
