Slide 1

Slide 1 text

Unlocking sequence data archives with scalable software and resources
Ben Langmead, Assistant Professor, JHU Computer Science
[email protected], langmead-lab.org, @BenLangmead
Penn State, March 1, 2017

Slide 2

Slide 2 text

Themes
• Public data is available & valuable but hard to use
• Scalable software benefits from big resources & data
• Strategically ignoring gene annotations can yield clearer results

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Sequence Read Archive (SRA) growth
• 4.5 -> 9 Pbp in ~10 months
[Figure: SRA size over time; series: Total, Open access; axis ticks: 10 Tbp, 100 Tbp, 1 Pbp, 10 Pbp]
Source: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement
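As a rough worked example of what a 4.5 -> 9 Pbp increase in ~10 months implies, the sketch below assumes steady exponential growth over that interval. That is an assumption for illustration; the slide gives only the two endpoints.

```python
import math

start_pbp = 4.5   # archive size at the start of the interval (from the slide)
end_pbp = 9.0     # archive size at the end of the interval (from the slide)
months = 10       # approximate interval length (from the slide)

# Assumption: growth is exponential across the interval.
monthly_factor = (end_pbp / start_pbp) ** (1 / months)
doubling_time_months = math.log(2) / math.log(monthly_factor)
annual_factor = monthly_factor ** 12

print(f"doubling time: {doubling_time_months:.1f} months")
print(f"implied annual growth factor: {annual_factor:.2f}x")
```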

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Abhinav Nellore, OHSU
Jeff Leek, JHU
Website: http://rail.bio
Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

Slide 9

Slide 9 text

Rail-RNA: spliced RNA-seq aligner for analyzing many samples at once
• Aggregate across samples to borrow strength and eliminate redundant alignment work
• Let data prune false junction calls, not annotation
• Concise outputs: junctions, junction evidence, coverage vectors; alignments only when asked
• Runs easily on commercial AWS cloud, other clusters
Website: http://rail.bio

Slide 10

Slide 10 text

Pass 1: align to genome, make junction calls
Pass 2: re-align to genome with putative junctions
[Figure labels: Reads, Ref, Readlets]

Slide 11

Slide 11 text

Aggregating across samples adds a dimension to junction call confidence
[Figure: Samples 1-5 by Candidate Junctions (CJ) 1-3]
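One way to picture how cross-sample evidence can raise or lower confidence in a candidate junction is the minimal sketch below. The function name and both thresholds are hypothetical, chosen only for illustration; they are not taken from the slides and should not be read as Rail-RNA's actual filter.

```python
from collections import defaultdict

def filter_junctions(calls, min_samples=2, min_reads_one_sample=5):
    """Keep junctions with enough cross-sample or within-sample support.

    calls: iterable of (sample_id, junction, read_count) tuples.
    A junction is kept if it is called in at least `min_samples` samples,
    or is supported by at least `min_reads_one_sample` reads in one sample.
    Both thresholds are hypothetical.
    """
    samples_with = defaultdict(set)   # junction -> samples where it is called
    max_reads = defaultdict(int)      # junction -> best single-sample support
    for sample_id, junction, reads in calls:
        if reads > 0:
            samples_with[junction].add(sample_id)
            max_reads[junction] = max(max_reads[junction], reads)
    return {
        j for j in samples_with
        if len(samples_with[j]) >= min_samples
        or max_reads[j] >= min_reads_one_sample
    }

# Made-up calls for three candidate junctions across five samples.
calls = [
    ("S1", "CJ1", 3), ("S2", "CJ1", 1), ("S3", "CJ1", 2),
    ("S4", "CJ2", 1),
    ("S5", "CJ3", 9),
]
kept = filter_junctions(calls)
```

Here CJ1 is kept because several samples agree, CJ3 because one sample supports it strongly, and CJ2 is dropped as a weakly supported singleton.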

Slide 12

Slide 12 text

Rail-RNA design
[Workflow diagram; step labels listed in extraction order]
• Preprocess
• Aggregate duplicate reads
• Split into readlets
• Aggregate duplicate readlets
• Correlation clustering for readlet alignments
• Call splice junction
• Merge exon differentials
• Compile sample coverages
• Write bigWigs
• Write normalization factors
• Write spliced alignment BAMs
• Write junction & indel BEDs
• Align reads end-to-end to genome
• Align readlets to genome
• Align readlets to junction co-occurrence index
Aligner labels: Bowtie 2, Bowtie, Bowtie
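The "split into readlets" and "aggregate duplicate readlets" steps can be sketched roughly as follows. The function names, the readlet size, and the stride are hypothetical values picked for illustration, not Rail-RNA's parameters or code.

```python
from collections import Counter

def readlets(read, size=25, stride=4):
    """Return overlapping substrings ("readlets") of a read.

    `size` and `stride` are hypothetical defaults. A final readlet is added
    when needed so that the end of the read is always covered.
    """
    if len(read) <= size:
        return [read]
    last_start = len(read) - size
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [read[s:s + size] for s in starts]

def aggregate_readlets(reads, **kwargs):
    """Count identical readlets across reads so each distinct one is aligned once."""
    counts = Counter()
    for read in reads:
        counts.update(readlets(read, **kwargs))
    return counts

# Two identical toy reads: duplicate readlets collapse into counts.
counts = aggregate_readlets(["ACGTACGTAC", "ACGTACGTAC"], size=4, stride=3)
```

Collapsing duplicates before alignment is what lets the work per distinct sequence be shared across reads and samples.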

Slide 13

Slide 13 text

[Figure: log coverage for Samples 1-3]

Slide 14

Slide 14 text

Better-than-linear scaling
Marginal cost of analyzing 1 additional sample decreases as we add more samples

Slide 15

Slide 15 text

Website: http://rail.bio
Paper: http://bit.ly/rail-aa
Repo: https://github.com/nellore/rail
Chat: https://gitter.im/nellore/rail
dbGaP
Website: http://docs.rail.bio/dbgap/
Paper: http://bit.ly/rail_dbgap
Abhinav Nellore, OHSU
Jeff Leek, JHU

Slide 16

Slide 16 text

Intropolis
Website: http://intropolis.rail.bio
Paper: http://bit.ly/intropolis
Abhi Nellore, OHSU
Jeff Leek, JHU

Slide 17

Slide 17 text

RNA-seq & annotation
[Diagram labels: Spliced alignment; Isoform assembly; Isoform quantification; Count overlaps w/ annotated features; Differential gene / exon expression; (quasi-, pseudo-); Reads]

Slide 18

Slide 18 text

Good proxy?
[Diagram labels: Transcripts in sample; Isoform quantification; Reads; Transcripts in annotation]

Slide 19

Slide 19 text

Compiling Intropolis
• Analyzed ~21,500 human RNA-seq samples with Rail-RNA; about 62 Tbp
• Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)

Slide 20

Slide 20 text

Compiling Intropolis
• Analyzed ~21,500 human RNA-seq samples with Rail-RNA; about 62 Tbp
• Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
• ~$0.72 / sample (compare to sequencing costs)
[Figure: junctions (jxs) by samples]
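For a sense of scale, the approximate figures on this slide can be combined into a back-of-the-envelope total. All inputs are the slide's rounded values, so the results are rough estimates rather than reported costs.

```python
samples = 21_500         # approximate number of samples (from the slide)
cost_per_sample = 0.72   # approximate USD per sample (from the slide)
total_tbp = 62           # approximate terabases analyzed (from the slide)

total_cost = samples * cost_per_sample    # estimated total compute cost, USD
cost_per_tbp = total_cost / total_tbp     # estimated USD per terabase

print(f"estimated total: ~${total_cost:,.0f}")
print(f"estimated cost per Tbp: ~${cost_per_tbp:,.0f}")
```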

Slide 22

Slide 22 text

[Figure, panels a-c: junction (jx) count J vs. minimum number S of samples in which jx is called; categories: Novel, Alternative donor/acceptor, Exon skip, Fully annotated; callouts: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%]
Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
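The curve relating junction count J to the minimum number S of samples in which a junction is called can be computed from per-junction sample counts. The sketch below is a minimal illustration with made-up counts; the function name is hypothetical and this is not the code used for the figure.

```python
from bisect import bisect_left

def junctions_called_in_at_least(sample_counts, thresholds):
    """Return {S: J(S)}, where J(S) is the number of junctions called in >= S samples.

    sample_counts: for each junction, the number of samples in which it is called.
    thresholds: the values of S to evaluate.
    """
    ordered = sorted(sample_counts)
    n = len(ordered)
    # bisect_left gives how many junctions have a count strictly below S.
    return {s: n - bisect_left(ordered, s) for s in thresholds}

# Toy data: six junctions and the number of samples each appears in.
toy_counts = [1, 1, 2, 5, 5, 9]
curve = junctions_called_in_at_least(toy_counts, [1, 2, 5, 10])
```

Running the same function separately on annotated and unannotated junctions would give the per-category breakdown at each S.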

Slide 23

Slide 23 text

[Figure: % of called junctions that are annotated (GENCODE v19), per sample]
• For ~2.5% of samples, <50% of junction calls are annotated
• Median fraction of junction calls that are annotated: ~80%

Slide 24

Slide 24 text

RNA-seq & annotation
[Diagram labels: Spliced alignment; Isoform assembly; Isoform quantitation; Count overlaps w/ annotated features; Differential gene / exon expression; (often with annotation); (quasi-, pseudo-)]

Slide 25

Slide 25 text

A third way
• Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
• Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2017 Jan 25;45(2):e9.

Slide 26

Slide 26 text

A third way

Slide 27

Slide 27 text

A third way
• Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
• Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
[Diagram label: bigWigs]
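A region-based, annotation-agnostic analysis starts from coverage rather than from gene models. The sketch below shows one simple version of that idea: average per-base coverage across samples, then report contiguous runs above a cutoff. It is plain Python for illustration, not derfinder's API, and the function name and cutoff are assumptions.

```python
def expressed_regions(coverage_by_sample, cutoff):
    """Find contiguous regions whose mean coverage across samples exceeds `cutoff`.

    coverage_by_sample: list of equal-length per-base coverage lists, one per sample.
    Returns half-open (start, end) intervals in base coordinates.
    """
    n_samples = len(coverage_by_sample)
    n_bases = len(coverage_by_sample[0])
    mean_cov = [
        sum(sample[i] for sample in coverage_by_sample) / n_samples
        for i in range(n_bases)
    ]
    regions, start = [], None
    for i, cov in enumerate(mean_cov):
        if cov > cutoff and start is None:
            start = i                       # a region opens here
        elif cov <= cutoff and start is not None:
            regions.append((start, i))      # the region closes before base i
            start = None
    if start is not None:
        regions.append((start, n_bases))    # region runs to the end
    return regions

# Toy coverage vectors for two samples over six bases.
toy_coverage = [
    [0, 4, 6, 0, 0, 8],
    [0, 2, 4, 0, 2, 6],
]
regions = expressed_regions(toy_coverage, cutoff=2)
```

No annotation is consulted at any point; regions are defined entirely by where the data show expression.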

Slide 28

Slide 28 text

Boiler: RNA-seq alignment compression
• As big as bigWigs & 1-2 orders of magnitude smaller than sorted BAMs
• Usable with Cufflinks, StringTie
[Figure labels: F, R1, R2; Coverage; Length tallies; Co-occurrence patterns]
Jacob Pritt
Pritt J, Langmead B. Boiler: lossy compression of RNA-seq alignments using coverage vectors. Nucleic Acids Res. 2016 Sep 19;44(16):e133.
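To illustrate why summaries such as a coverage vector and length tallies are much smaller than the alignments themselves, the sketch below reduces a set of alignment intervals to those two summaries. It is a simplified illustration with a hypothetical function name; it ignores splicing and pairing and does not reflect Boiler's actual format or algorithm.

```python
from collections import Counter

def summarize_alignments(intervals, ref_length):
    """Reduce alignments to a coverage vector and a tally of alignment lengths.

    intervals: half-open (start, end) reference coordinates of unspliced alignments.
    Individual read positions are discarded, which is what makes the summary lossy.
    """
    diff = [0] * (ref_length + 1)   # +1 at each start, -1 at each end
    lengths = Counter()
    for start, end in intervals:
        diff[start] += 1
        diff[end] -= 1
        lengths[end - start] += 1
    coverage, depth = [], 0
    for delta in diff[:ref_length]:
        depth += delta              # running sum recovers per-base depth
        coverage.append(depth)
    return coverage, lengths

coverage, lengths = summarize_alignments([(0, 3), (1, 4), (2, 5)], ref_length=6)
```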

Slide 29

Slide 29 text

Snaptron & recount2
• Best places to start if you’re interested in our summaries of public human RNA-seq; both include:
  • ~50K SRA samples
  • ~10K GTEx samples
  • ~10K TCGA samples
• Snaptron: pose sophisticated queries re: splicing, with quick responses, no need to download data
• recount2: reanalyze starting from our gene-, exon-, junction- or coverage-level summaries

Slide 30

Slide 30 text

Snaptron
Query planner delegates to appropriate systems (sqlite, tabix, lucene) and indexes (R-tree, B-tree, inverted full text)
Chris Wilks
[Architecture diagram; columns: Query, Data Store/Index, Output; components: Snaptron Query Planner, Sample Filter, Tabix/R-tree Index, Lucene/Inverted Document Index, SQLite/B-tree Index; records: Junction Records, Sample Metadata Records; query types: Region, Limited Region, Filtered Region, Limited & Filtered Region, Filtered Samples; example inverted index of sample metadata terms: "Brain": samples 1,2,3,6; "Liver": samples 4,6,9,11]
Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. bioRxiv (2017): 097881.
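The sample-filter step can be pictured with a toy inverted index. In the sketch below, the term-to-sample mapping uses the "Brain" and "Liver" example values shown on the slide; the junction records and the function name are hypothetical, and real Snaptron delegates this work to Lucene, Tabix, and SQLite rather than in-memory Python structures.

```python
# Inverted index: metadata term -> IDs of samples whose metadata contains it
# (values taken from the slide's example).
term_to_samples = {
    "Brain": {1, 2, 3, 6},
    "Liver": {4, 6, 9, 11},
}

# Hypothetical junction records: (junction ID, samples in which it was found).
junction_records = [
    ("jx_a", {1, 4, 7}),
    ("jx_b", {9, 11}),
    ("jx_c", {5, 8}),
]

def junctions_for_term(term):
    """Return junctions found in any sample matching `term`, with the matching samples."""
    samples = term_to_samples.get(term, set())
    return [
        (jx, found & samples)
        for jx, found in junction_records
        if found & samples
    ]

brain_hits = junctions_for_term("Brain")
liver_hits = junctions_for_term("Liver")
```

Looking up the term first yields a small set of sample IDs, so junction records only need to be intersected with that set instead of being scanned against full metadata.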

Slide 31

Slide 31 text

Snaptron
Example queries:
• Junctions in ALK gene
• # times junction occurs in each of 50,000 SRA samples
• Tissue specificity of junction in GTEx data
• Samples ranked according to how overrepresented one splicing pattern is relative to another
http://snaptron.cs.jhu.edu

Slide 32

Slide 32 text

recount2
• Summaries over the 70K samples at levels of genes, junctions, exons and coverage vectors
[Figure labels: Junctions, Genes, Coverage, Exons]
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

Slide 33

Slide 33 text

recount2
• Shiny-app front-end: https://jhubiostatistics.shinyapps.io/recount/
• Bioconductor 3.4 package: https://www.bioconductor.org/packages/recount/
• Preprint: http://bit.ly/recount_pre
Abhinav Nellore, Leo Collado Torres

Slide 34

Slide 34 text

Intropolis
• Discovery of novel splicing events has leveled off
• Good time to put effort into a more complete annotation

Slide 35

Slide 35 text

Intropolis
[Figure: PC1 vs. PC2 scatter; groups: ABRF, SEQC, GEU]

Slide 36

Slide 36 text

Thank you:
Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
Also for DERfinder: Rafa Irizarry, Sarven Sabunciyan, Mike Love
• NIH R01GM118568
• NSF CAREER IIS-1349906
• Sloan Research Fellowship
• IDIES Seed Funding program
• Amazon Web Services
• SciServer Compute
langmead-lab.org, @BenLangmead