RNA-seq read alignment to individualized genomes

RNA-‐seq alignment to individualized genomes. Steve Munger
@stevemunger Post-‐doctoral Fellow Gary Churchill Group The Jackson Laboratory Bar Harbor, Maine USA Duke Molecular Physiology InsNtute NGS Interest Group Slides are posted on www.speakerdeck.com/stevemunger

Alignment 101
ACATGCTGCGGA ACATGCTGCGGA 100bp Read Chr 1 Chr 2 Chr 3

The perfect read: 1 read = 1 unique alignment.
ACATGCTGCGGA ACATGCTGCGGA 100bp Read ✓ Chr 1 Chr 2 Chr 3

Some reads will align equally well to mulNple locaNons.
“MulNreads” ACATGCTGCGGA ACATGCTGCGGA ACATGCTGCGGA ACATGCTGCGGA 100bp Read ✓ ✗ ✗ 1 read 3 valid alignments Only 1 alignment is correct

The worst case scenario.
ACATGCTGCGGA ACATGCTGCGGC ACATGCTGCGGA ACATGCTGCGGA 100bp Read ✗ ✗ 1 read 2 valid alignments Neither is correct

Individual geneNc variaNon may aﬀect read alignment.
ACATGCTGCGGA ACATGCTGCGGC ACATGCTGCGGA ATATGCTGCGGA 100bp Read ✗ ✗

Aligning Billions of Short Sequence Reads Gene A
Gene B Aligners: BowNe, GSNAP, BWA, MAQ, BLAT + 100 more Designed to align the short reads fast, but not accurate

Gene A Gene B •  Gene family
(orthologous/paralogous) •  Low-‐complexity sequence •  AlternaNvely spliced isoforms •  Pseudogenes Aligning Sequence Reads •  Polymorphisms •  Indels •  Structural Variants •  Reference sequence Errors

Align to Genome or Transcriptome? Genome Transcriptome

Exon 2 Exon 3 Exon 1 Aligning
to Reference Genome: Exon First Alignment

Exon 2 Exon 3 Exon 1 Exon
First Alignment: Pseudogene problem Exon 2 Exon 3 Exon 1 Processed Pseudo-‐gene

Advantages: Can align novel isoforms. Disadvantages: Diﬃcult, Spurious alignments, spliced alignment, gene families, pseudogenes

Exon 2 Exon 3 Exon 1 Isoform
1 Isoform 2 Isoform 3 Reference Transcriptome Gene 1 Exon 2 Exon 3 Exon 1 Isoform 1 Isoform 2 Isoform 3 Gene 2 Exon 4

Advantages: Easy, Focused to the part of the genome that is known to be transcribed. Disadvantages: Reads that come from novel isoforms may not align at all or may be misafributed to a known isoform. Advantages: Can align novel isoforms. Disadvantages: Diﬃcult, Spurious alignments, spliced alignment, gene families, pseudo genes

Befer Approach: Aligning to Transcriptome and Genome Align to
Transcriptome First Advantages: relaNvely simpler, overcomes the pseudo-‐gene and novel isoform problems Align the remaining reads to Genome next RUM, TopHat2, STAR

Hybrid Aligners – Best of both worlds?

Conclusions •  There is no perfect aligner. Pick one
well-‐suited to your applicaNon. –  E.g. Want to idenNfy novel exons? Don’t align only to the known set of isoforms. •  Visually inspect the resulNng alignments. Seing a parameter a lifle too liberal or conservaNve can have a huge effect on alignment. •  Consider running the same fastq files through mulNple alignment pipelines specific to each applicaNon. –  Gene expression -‐> BowNe to transcriptome –  Exon discovery -‐> RUM or other hybrid mapper –  Variant detecNon -‐> GSNAP, GATK, Samtools mpileup •  If your species has not been sequenced, use a de novo assembly method. Can also use the genome of a related species as a scaffold.

18M 18M 4M 4M 4M 4M 7M More geneNc diversity
= More phenotypic diversity = More analyNcal complexity C57BL/6J Brynn Voy

The CollaboraNve Cross: A large panel of recombinant inbred
lines derived from eight inbred founder strains. CC001– 98% Homozygous

The Diversity Outbred (DO) heterogeneous stock. Collaborative Cross Funnel
Diversity Outbred … G2:F4-F12 mice from 144 different funnels Random Outbreeding

7mo F 65g Chow Karen Svenson & Lisa Somes

The complex genomes of DO mice. A/J A
BL6 B 129 C NOD D NZO E CAST F PWK G WSB H 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 A A A A A A A A B B B B B B B C C C C C C D D D D D E E E E F F F G G H A B C D E F G H B C D E F G H C D E F G H D E F G H E F G H F G H G H H 36 Diplotype States 8 Founder Haplotypes

High-density Array Genotyping

Gai et al. 2014 HMM to infer 36-‐state diplotypes.

Reconstructed DO Genome A/J C57BL/6J 129S1/SvImJ NOD/ShiLtJ NZO/LtJ CAST/EiJ PWK/PhJ
WSB/EiJ

ProporNon of Founder Alleles in DO Svenson et.al. Genetics,
2011

RNA-‐seq in the DO: Enter at your own risk.

Current RNA-‐seq Analysis: One Common Reference Genome and Two
Separate Analysis Pipelines 1)  One Pipeline for quanNfying gene expression •  For example: TopHat/Cuﬄinks 2)  Second pipeline for allele-‐speciﬁc gene expression •  Using single reference genome and known SNPs

How does geneNc variaNon aﬀect alignment of RNA-‐seq reads?
CAST/EiJ C57BL/6J ≈ ≠

Based on known gene annotaNons, we expect that >50%
of 100bp CAST reads will have at least one SNP that diﬀers from the reference. Sanger Mouse Genomes Project

100bp SE Reads from CAST liver Compare alignment results
to ground truth Align to CAST Pseudotranscriptome 5’-‐ATCGGCGTCTTACATTAGCTCAAGGGTGCC-‐3’ 5’-‐ATCGGCGTCTTGCTCAAGGGTGCC-‐3’ Align to B6 Transcriptome 5’-‐ATCGGCGTCTTACATTAGCTCAAGGGTGCC-‐3’ To what degree do these diﬀerences aﬀect alignment of RNA-‐Seq reads and gene abundance esNmates? Simulated reads Real data

Simulated CAST reads map more accurately and uniquely to
the CAST transcriptome. 458,297 out of ~10M reads improve by alignment to CAST 10,533 reads improve by alignment to the reference.

Gene-‐level abundance esNmates are improved by alignment to CAST
transcriptome. RSEM, Li and Dewey 2010

What about real CAST data? One CAST RNA-‐seq sample

For 2,984 genes, abundance esNmates diﬀer by >
10% by alignment approach alone.

For these genes in the simulated data, 2,242 –
CAST alignment gave befer esNmate 439 – REF alignment gave befer esNmate 71 – CAST esNmate = REF esNmate 232 – No results in simulaNon One real CAST sample 2,984 genes diﬀer by > 10% by alignment alone.

DO genomes are diploid with 28 possible heterozygous and
8 homozygous genotype states at every locus. A/J A BL6 B 129 C NOD D NZO E CAST F PWK G WSB H 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 A A A A A A A A B B B B B B B C C C C C C D D D D D E E E E F F F G G H A B C D E F G H B C D E F G H C D E F G H D E F G H E F G H F G H G H H Founder strains – 8 possible genotypes Diversity Outcross – 36 Possible Genotype states

Building an individualized diploid transcriptome. F F
D D D D D C C A A A A A A A G G E E E E E E H H H H G G G G B B B B B A A H H H H F F F F F F E E C C C D D Psuedo-‐phased Chromosomes Two Haploid Chromosomes Genotype DNA @ 8-‐80k informaNve SNPs FG FG DG DG DB DB DB CB CB AA AA AH AH AH AH AF GF GF EF EF EF EE EE EC HC HC HD HD Infer 36-‐state genotypes Munger et al., 2014 Gai et al., 2014

Munger et al. 2014 Gai et al. 2014
ConstrucNng individualized diploid transcriptomes for RNA-‐seq alignment with Seqnature.

Coming soon – EMASE (N. Raghupathy) POPULASE
(KB Choi) Seqnature Munger et al. 2014 Gai et al. 2014

Align each read once to both alleles. CCTCAGGGTCAACTCTAAGGAAGT GATCAGGGTCAACTCTAAGGAAGT
ENSMUST0000000001 E L ENSMUST0000000001 H R Transcript ID Founder Strain A-‐H (NZO) Chr side AGGGTCAACTCTAAGG Read 1 AGGGTCAACTCTAAGG AGGGTCAACTCTAAGG ENSMUST0000000001 E L Count weight = 0.5 ENSMUST0000000001 H R Count weight = 0.5

Align each read once to both alleles. CCTCAGGGTCAACTCTAAGGAAGT GATCAGGGTCAACTCTAAGGAAGT
ENSMUST0000000001 E L ENSMUST0000000001 H R GATCAGGGTCAACTCT Read 1 GATCAGGGTCAACTCT ENSMUST0000000001 E L Count weight = 0 ENSMUST0000000001 H R Count weight = 1 Direct output of allelic abundance esNmates

100bp SE Reads from DO liver Compare Results
Align to diploid DO transcriptome 5’-‐ATCGGCGTCTTGCTCAAGGGTGCC-‐3’ 5’-‐ATCGGCGTCTTGATCTTGGGTGCC-‐3’ Align to REF Transcriptome 5’-‐ATCGGCGTCTTACATTAGCTCAAGGGTGCC-‐3’ SimulaNons! Simulated reads Real data

Reads map more accurately and uniquely by alignment to
DO transcriptome. Munger et al. 2014

Gene-‐level abundance esNmates are improved by alignment to individualized
DO transcriptome.

Alignment to individualized diploid transcriptomes yields direct, accurate esNmates
of allelic expression.

What about real DO data?

For 714 genes, abundance esNmates diﬀer by >
10% by alignment approach alone.

One real DO sample 714 genes diﬀer by >
10% by alignment alone. For these genes in the simulated data, 432 – DO alignment gave befer esNmate 124 – REF alignment gave befer esNmate 16 – DO esNmate = REF esNmate 142 – No results in simulaNon

Every DO sample will have a unique gene set that
is sensiNve to alignment errors from reference alignment…

Does this mafer to humans? Gene expression from
Individualized Genome Gene expression from Reference Genome Narayanan Raghupathy

Individualized RNA-‐seq analysis in 50 African Individuals Gene
expression from Individualized Genome Gene expression from Reference Genome

High Fat Chow Diversity Outbred Mice
GeneraNons G4-‐G8 ~200-‐300 recombinaNons wean Phenotyping Livers Livers Livers Livers N= 73 N= 70 N= 68 N= 66 Sac @ 26 weeks

Analysis Pipeline ~ 30 million SE 100bp reads
Yfg 1. Align reads to transcriptome. Yfg Yfg Yfg Mouse 1 Mouse 2 Mouse 3 x 272 mice RSEM (Li and Dewey 2010) 2. EsNmate gene and isoform expression. 3. Map expression QTL

How does individualized alignment aﬀect eQTL? Yfg Yfg
Yfg Mouse 1 Mouse 2 Mouse 3 x 453mice EsNmate gene and isoform expression. 3. Map expression QTL

Alignment to individualized transcriptomes results in fewer spurious liver
eQTL. Rps12-ps2 Aligned to NCBIm37 Aligned to DO IRGs

Hebp1 Aligned to NCBIM37 Aligned to DO IRGs Alignment to
individualized transcriptomes reveals signiﬁcant local eQTLs for 2,000+ genes.

Munger et al. 2014 Are these unmasked local eQTLs
real? CC/DO Founder Strain samples

The founder origin of each allele is tagged and provides
direct esNmates of allele speciﬁc expression. The local eQTL for Gm12976 is cis-‐acNng. DO samples N=277 N=554 allele-‐speciﬁc esNmates.

Allele speciﬁc expression is the rule rather than the excepNon
in geneNcally diverse individuals.

Goal: Make data accessible and useful. Example: Dynamic eQTL
Viewer. hfp://cgd.jax.org/apps/eqtlviewer-‐beta/ Karl Broman hfp://kbroman.github.io/qtlcharts/

MulNread alignment circle plot.

Summary: Tools to embrace geneNc diversity in RNA-‐ seq analysis
•  Seqnature: Building personal diploid genomes using known geneNc variaNon. Can be applied to human data. •  EMASE: Accurate and simultaneous quanNtaNon of gene expression and allele-‐speciﬁc expression •  Diversity Outbred mice are ideal for high-‐resoluNon mapping of expression QTL. r/DOQTL: comprehensive DO analysis tools. •  We have developed methodology for dealing with exploiGng the high geneNc diversity in this populaNon.

•  ApplicaNons •  Human data -‐ HapMap african
populaNon genomics. •  SEG-‐Pas mouse spleen infected with Y. pes*s – IdenNfy reads of bacterial origin. •  PaNent Derived Xenogra}s – Parse gra} v host-‐derived RNA. •  ChIP-‐seq in F1 or HS mice – Characterize allele-‐speciﬁc binding. •  Bulk Segregant Analysis of pooled RNA-‐seq – Map mutaNons/ QTL. Summary: Tools to Embrace GeneNc Diversity in RNA-‐seq Analysis

Acknowledgements

Thank you! •  DOQTL -‐ hfp://cgd.jax.org/apps/doqtl/DOQTL.shtml •  Seqnature
-‐ hfp://cgd.jax.org/tools/Seqnature.shtml •  EMASE – Coming Soon! •  eQTL Viewer -‐ hfp://cgd.jax.org/apps/eqtlviewer-‐beta/ •  AieWeb (DO founder expression viewer) –  hfp://cgd.jax.org/aie/rnaseq/ •  Sanger Mouse Genomes Project (Keane and others) –  hfp://www.sanger.ac.uk/sanger/Mouse_SnpViewer/rel-‐1303 •  r/qtlcharts (K Broman) -‐ hfp://kbroman.github.io/qtlcharts/

Resources Individualized Genomes -‐ Seqnature
hfps://github.com/jaxcs/Seqnature Aligner –  BowGe hfp://bowNe-‐bio.sourceforge.net/ –  BowNe 2 hfp://bowNe-‐bio.sourceforge.net/bowNe2/index.shtml –  GSNAP hfp://research-‐pub.gene.com/gmap/ –  Tophat hfp://tophat.cbcb.umd.edu/ –  STAR hfps://code.google.com/p/rna-‐star/ Transcript Abundance / DifferenNal Expression –  Cufflinks hfp://cufflinks.cbcb.umd.edu/ –  IsoEM hfp://dna.engr.uconn.edu/?page_id=105 –  RSEM hfp://deweylab.biostat.wisc.edu/rsem/ –  EMASE (Coming Soon!) –  Miso hfp://genes.mit.edu/burgelab/miso/ –  Cuffdiff hfp://cufflinks.cbcb.umd.edu/ –  DESeq hfp://www-‐huber.embl.de/users/anders/DESeq/ –  edgeR hfp://bioconductor.org/packages/release/bioc/html/edgeR.html Denovo Assembly –  Trinity hfp://trinityrnaseq.sourceforge.net/

DO allele esNmates in genes with local eQTL strong
correlated to gene esNmates from the Founder strains. Genes with local eQTL Genes without local eQTL

The diﬀerence between Allele Speciﬁc Expression

And Imprin*ng in the DO.

RNA-seq read alignment to individualized genomes

RNA-seq read alignment to individualized genomes

More Decks by Steve Munger

Featured

Transcript