Seminar at Sanger Institute 10-5-2017

Harnessing genetic diversity to discover protein regulatory networks
Steven Munger
The Jackson Laboratory, Bar Harbor, ME USA
@stevemunger

The problem – and the power – of genetic diversity in genomics studies

Until recently, our understanding of gene regulation stopped at the transcript.
[Diagram: the central dogma, with labels DNA, RNA, Protein, replication, transcription, translation]

[Diagram: the path from DNA through pre-mRNA and mRNA to Protein, annotated with regulatory processes: replication, epigenetic modification, splicing, A-to-I editing, polyuridylation, RNA interference (miRNA, siRNA), stoichiometric buffering, phosphorylation, acetylation, methylation, hydroxylation, ubiquitination, amidation, sulfation, N-linked glycosylation, O-linked glycosylation… and more!]
Goal: Expanding our understanding of gene regulation to the proteome.
How does genetic variation affect transcript and protein abundance?

“Next Generation” genetic models: The mouse Collaborative Cross and Diversity Outbred stock
Founder strains: CAST, 129S1, WSB, NZO, A/J, B6, PWK, NOD

Diversity Outbred (DO) mice: A reservoir of natural genetic perturbations.
- 40M+ SNPs
- 2M+ indels
- Balanced population structure
- Each individual unique
- 400+ recombinations in each animal
- High heterozygosity

Diversity Outbred (DO) Heterogeneous Stock
A natural reservoir of genetic perturbations

DO mice are genetically and phenotypically diverse
[Figure: body weight (gm) by date, 7/11/2014 to 8/20/2014, in two panels: female DO mice and male DO mice. Credit: Alan Attie & Mark Keller]

Diversity Outbred mice exhibit phenotypes far exceeding the range observed
in the founder strains.

Some combinations of genetic variants produce very long-lived mice.

How does genetic variation influence transcript and protein abundance?
[Workflow: 192 DO livers. Transcripts -> short reads (RNA-Seq) -> eQTL mapping -> eQTL. Proteins -> peptides (MS/MS) -> pQTL mapping -> pQTL. Compare eQTL and pQTL?]
Munger et al. 2014; Chick*, Munger* et al. Nature, 2016

Challenge: Every DO mouse is a unique diploid combination of
10M+ SNPs and 500K+ indels.

Alignment 101
[Figure: a 100bp read (ACATGCTGCGGA) and a matching reference sequence, with Chr 1, Chr 2, and Chr 3 as candidate locations]

The perfect read: 1 read = 1 unique alignment.
[Figure: the 100bp read (ACATGCTGCGGA) aligns to a single location (✓) among Chr 1, Chr 2, and Chr 3]

Some reads will align equally well to multiple locations: “multireads”.
[Figure: the 100bp read (ACATGCTGCGGA) matches three locations, marked ✓ ✗ ✗]
1 read, 3 valid alignments. Only 1 alignment is correct.
Read “mappability”: www.genetics.org/content/198/1/59
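The multiread problem can be illustrated with a toy exact-match search. This is only a sketch: the three short "chromosome" strings and the `find_alignments` helper are invented for illustration, and real aligners use indexed, mismatch-tolerant search rather than exact string matching.

```python
def find_alignments(read, reference):
    """Return (sequence name, 0-based offset) for every exact match of read."""
    hits = []
    for name, seq in reference.items():
        start = seq.find(read)
        while start != -1:
            hits.append((name, start))
            start = seq.find(read, start + 1)
    return hits


# Hypothetical reference: the slide's 12-base sequence embedded at three loci.
reference = {
    "chr1": "TTGACATGCTGCGGATTC",
    "chr2": "GGACATGCTGCGGAAAC",
    "chr3": "CCCACATGCTGCGGAGT",
}

multiread_hits = find_alignments("ACATGCTGCGGA", reference)  # ambiguous read
unique_hits = find_alignments("TTGACATG", reference)         # unambiguous read

print(len(multiread_hits), "valid alignments for the multiread")
print(len(unique_hits), "valid alignment for the unique read")
```

The search alone cannot say which of the three hits is the true origin of the multiread; that ambiguity is what the slide's ✓ ✗ ✗ marks describe.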

How does genetic variation affect alignment of RNA-seq reads?
Start with a simple comparison of two inbred strains.
[Figure: CAST/EiJ and C57BL/6J, labeled ≈ and ≠]

Based on known gene annotations, we expect that >50% of 100bp CAST reads will have at least one SNP that differs from the reference.
Sanger Mouse Genomes Project – Thank you thank you thank you Thomas and colleagues

To what degree do these differences affect alignment of RNA-Seq reads and gene abundance estimates?
[Design: 100bp SE reads from CAST liver (simulated reads and real data) are aligned to the CAST pseudotranscriptome and to the B6 transcriptome, and the alignment results are compared to ground truth]
[Example sequences shown: 5’-ATCGGCGTCTTACATTAGCTCAAGGGTGCC-3’ and 5’-ATCGGCGTCTTGCTCAAGGGTGCC-3’]

Simulated CAST reads map more accurately and uniquely to the CAST transcriptome.
- 458,297 out of ~10M reads improve by alignment to CAST.
- 10,533 reads improve by alignment to the reference.

Gene-level abundance estimates are improved by alignment to CAST transcriptome.
RSEM, Li and Dewey 2010

What about real CAST data? One CAST RNA-seq sample

For 2,984 genes, abundance estimates differ by > 10% by alignment approach alone.

One real CAST sample: 2,984 genes differ by > 10% by alignment alone.
For these genes in the simulated data:
- 2,242: CAST alignment gave better estimate
- 439: REF alignment gave better estimate
- 71: CAST estimate = REF estimate
- 232: No results in simulation
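As a quick check on the breakdown above, the four categories can be summed and converted to shares of the 2,984 genes. The counts are copied from the slide; the percentages are derived here and do not appear in the talk.

```python
# Counts copied from the slide (simulated data, genes that differ by > 10%).
outcomes = {
    "CAST alignment better": 2242,
    "REF alignment better": 439,
    "CAST estimate = REF estimate": 71,
    "no results in simulation": 232,
}

total = sum(outcomes.values())
shares = {label: 100 * count / total for label, count in outcomes.items()}

print("total genes:", total)
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

The categories account for all 2,984 genes, and alignment to CAST gives the better estimate for roughly three quarters of them.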

Every DO sample will have a unique gene set that
is sensitive to alignment errors from reference alignment…

Solution: Construct individualized diploid transcriptomes for RNA-seq alignment with Seqnature.
Munger et al. 2014; Gatti et al. 2014

Seqnature
Munger et al. 2014; Gatti et al. 2014; Choi, Raghupathy et al., Submitted

Analysis Pipeline
~30 million SE 100bp reads
[Figure: expression of a gene (Yfg) in Mouse 1, Mouse 2, Mouse 3, … x 272 mice]
1. Align reads to transcriptome.
2. Estimate gene and isoform expression. RSEM (Li and Dewey 2010)
3. Map expression QTL.
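Step 3 can be sketched with a single-marker regression. The slides do not state the mapping model, so this is a deliberately simplified assumption: an additive allele count at one marker instead of founder haplotypes, no covariates, and made-up expression values. The `lod_score` helper is hypothetical; it uses the standard relation LOD = (n/2) · log10(RSS_null / RSS_marker).

```python
import math


def lod_score(genotypes, expression):
    """LOD for a single-marker additive regression of expression on genotype.

    LOD = (n / 2) * log10(RSS_null / RSS_marker), where the null model
    is intercept-only.
    """
    n = len(expression)
    mean_g = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, expression))
    rss_null = sum((y - mean_y) ** 2 for y in expression)
    rss_marker = rss_null - sxy ** 2 / sxx  # residual SS after fitting the slope
    return (n / 2) * math.log10(rss_null / rss_marker)


# Made-up data: allele counts (0, 1, 2) at one marker for 12 animals.
genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2, 1, 0, 2]
linked_gene = [5.1, 4.9, 5.3, 6.0, 6.2, 5.8, 7.1, 6.9, 7.2, 6.1, 5.0, 7.0]
unlinked_gene = [5.1, 7.0, 6.9, 6.0, 6.2, 5.8, 4.9, 7.2, 7.1, 6.1, 5.3, 5.0]

lod_linked = lod_score(genotypes, linked_gene)
lod_unlinked = lod_score(genotypes, unlinked_gene)
print(f"linked gene LOD:   {lod_linked:.2f}")
print(f"unlinked gene LOD: {lod_unlinked:.2f}")
```

In a real scan this statistic would be computed at every marker for every gene, with significance thresholds set by permutation or a similar procedure.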

Alignment to individualized transcriptomes results in fewer spurious liver eQTL.
[Figure: Rps12-ps2, aligned to NCBIm37 vs. aligned to DO IRGs]
Lesson 1: One false read alignment can cause two false positive genetic associations.

Alignment to individualized transcriptomes unmasks significant local eQTLs for 2,000+ genes.
[Figure: Hebp1, aligned to NCBIm37 vs. aligned to DO IRGs]
Lesson 2: Alignment of all samples to a single reference genome obscures a huge amount of real regulatory variation.

Are these unmasked local eQTLs real? Yes.
[Figure: CC/DO Founder Strain samples]
Munger et al. 2014

The founder origin of each allele provides direct estimates of allele specific expression.
Only alleles derived from 129S1 express Gm12976 in the DO population.

Lesson 3: Allele specific expression is the rule rather than
the exception in genetically diverse individuals.

Lesson 4: Most expressed genes have eQTL (>75%). The DO is a reservoir of genetic perturbations.
[Figure: eQTL location plotted against gene location, chromosomes 1-19 and X on both axes]

How does genetic variation affect protein abundance?
[Workflow: 192 DO livers. Transcripts -> short reads (RNA-Seq) -> eQTL mapping -> eQTL. Proteins -> peptides (MS/MS) -> pQTL mapping -> pQTL. Compare eQTL and pQTL?]
Munger et al. 2014; Chick*, Munger* et al. Nature 2016

Global, multiplexed, and quantitative: A new era in proteomics

An unprecedented view of protein regulation. 2,866 pQTL detected for 2,552 proteins.
[Venn diagram, total eQTL and pQTL counts: 2306, 1152, 1400; N=6707; p < 2.2e-16; FDR < 0.1]

80% of proteins with local pQTL have concordant local eQTL
[Venn diagram, local eQTL and pQTL counts: 1819, 344, 1392]
[Model diagram labels: QTL, RNA, Protein; cis, cis]

20% of proteins with local pQTL lack concordant local eQTL
[Venn diagram, local eQTL and pQTL counts: 1819, 344, 1392]
[Model diagram labels: QTL, RNA, Protein; cis]
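The 80% and 20% figures can be recovered from the three Venn counts. The assignment of counts to regions is an assumption made here (1,819 local eQTL only, 344 local pQTL only, 1,392 both); the slide text does not label them, but this reading reproduces the stated percentages.

```python
# Assumed reading of the Venn diagram counts on the slide.
local_eqtl_only = 1819
local_pqtl_only = 344
both = 1392

proteins_with_local_pqtl = local_pqtl_only + both
concordant_pct = 100 * both / proteins_with_local_pqtl
discordant_pct = 100 * local_pqtl_only / proteins_with_local_pqtl

print(f"proteins with local pQTL: {proteins_with_local_pqtl}")
print(f"with concordant local eQTL: {concordant_pct:.1f}%")
print(f"without concordant local eQTL: {discordant_pct:.1f}%")
```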

25% of expressed proteins appear buffered from local transcriptional variation
[Venn diagram, local eQTL and pQTL counts: 1819, 344, 1392]
[Model diagram labels: QTL, RNA, Protein; cis]

1,130 distant pQTL indicate extensive trans regulation of protein abundance.

Only 9 out of 1,130 distant pQTL have concordant distant eQTL.
[Venn diagram, distant eQTL and pQTL counts: 915, 1039, 9; FDR < 0.1]
[Model diagram labels: QTL, RNA, Protein; cis, trans]

What post-transcriptional mechanism is acting in trans to control these proteins?
[Venn diagram, distant eQTL and pQTL counts: 915, 1039, 9]
[Model diagram labels: QTL, RNA, Protein; trans]

Searching for protein and transcript mediators of distant pQTL -> Mediation Analysis
[Diagram: a QTL acts in trans on the target protein; candidate causal intermediates are RNAs and proteins under cis control of the QTL]
Models compared:
- Target Protein ~ pQTL_distant
- Target Protein ~ pQTL_distant + Mediator_Protein (x 8000 proteins)
- Target Protein ~ pQTL_distant + Mediator_RNA (x 21000 transcripts)
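The logic of the mediation models can be sketched on simulated data: if a candidate mediator carries the QTL's effect, the QTL coefficient for the target should collapse once the mediator is added to the model. Everything below is an assumption for illustration (simulated genotypes, effect sizes, and the helper names); the slides do not specify the test statistic, and this sketch simply inspects the regression coefficient.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def slope(x, y):
    """Least-squares slope of y on x (model includes an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after regressing out x (model includes an intercept)."""
    b, mx, my = slope(x, y), mean(x), mean(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def qtl_effect_given(mediator, genotype, target):
    """QTL coefficient in the model target ~ genotype + mediator.

    Uses the Frisch-Waugh-Lovell identity: regress the mediator out of both
    genotype and target, then regress residual on residual.
    """
    return slope(residuals(mediator, genotype), residuals(mediator, target))


rng = random.Random(0)
n = 500
genotype = [rng.choice([0, 1, 2]) for _ in range(n)]     # allele count at the QTL
mediator = [g + rng.gauss(0, 0.5) for g in genotype]     # controlled by the QTL
target = [0.8 * m + rng.gauss(0, 0.5) for m in mediator] # controlled by the mediator
bystander = [rng.gauss(0, 1) for _ in range(n)]          # unrelated protein

effect_alone = slope(genotype, target)
effect_with_mediator = qtl_effect_given(mediator, genotype, target)
effect_with_bystander = qtl_effect_given(bystander, genotype, target)

print(f"target ~ QTL:             {effect_alone:.2f}")
print(f"target ~ QTL + mediator:  {effect_with_mediator:.2f}")
print(f"target ~ QTL + bystander: {effect_with_bystander:.2f}")
```

Scanning every measured protein and transcript as the conditioning variable, as the slide describes, would flag the true intermediate as the one candidate that removes the QTL effect while unrelated candidates leave it intact.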

Mediation analysis reveals causal intermediates.
[Diagram: a distant pQTL (pQTLD) acts in trans on the target, TMEM68; labels: Tmem68, TMEM68, 13, 3, cis, trans]

Mediation analysis reveals causal intermediates.
[Diagram: Nnt and NNT added as the intermediate between the distant pQTL and the target, TMEM68; labels: Tmem68, TMEM68, Nnt, NNT, 13, 3, cis, trans]

Low abundance of NNT in C57BL/6J drives low abundance of TMEM68.
- 43,102 SNPs in region
- 3 candidate SNPs
- 1 short deletion
- 1 long deletion of exons 7-11
- Nnt eQTL: B6 alleles do not express Nnt

Protein complex members are tightly coregulated, with one member adopting the “regulatory” role.
Chaperonin containing TCP1 complex

CCT2 Mediation Analysis

Cct6a: Low expression of Cct6a in NOD/ShiLtJ drives low expression of the CCT complex

Stoichiometric buffering of protein abundance
[Diagram: CCT complex subunits TCP1, CCT2, CCT3, CCT4, CCT5, CCT6A, CCT7, and CCT8 drawn in varying copy numbers, with CCT6A in the fewest copies; the assembled complex is labeled “Stable”]

Mediation identifies known and novel protein interactions

Mediation reveals higher order protein networks
[Network figure, endosome: edges labeled cis, trans, eQTL, pQTL, and co-regulated.
Complexes named: Wash Complex, Ccc Complex, Exocyst Complex, Arp2/3 Complex.
Nodes: Wash, Kiaa1033, Kiaa0196, Fam21, Zw10, Vcp, Ccdc43, llph, Spg20, Fam45a, Ccdc22, Gaa, Atg16l1, Rufy1, Ccdc93, Commd10, Commd9, 9030624J02Rik, Commd5, Commd7, Commd3, Commd2, Commd4, Dscr3, H2-Q10, Commd1, Pum1, 1110004F10Rik, Exoc6, Exoc2, Exoc7, Exoc8, Exoc5, Exoc4, Exoc1, Ttc39b, Arpc3, Gckr, Arpc5, Actr3, Arpc4, Arpc2, Actr2, Rala, Coro1b, Exoc3]

Natural genetic perturbations + Mediation analysis = Predictive protein network

In Progress: Using genetic diversity to identify kinases for specific phosphorylation sites.
Liver Phospho-Proteome
Kinase <-> phospho site identification by mediation

Collaborative Cross strains can be used to validate predictions from the DO and build new models.
CC001: 98% homozygous

Accurate prediction of protein abundance in Founder and Collaborative Cross Strains.
Chick*, Munger* et al. 2016

Looking ahead: Pathway-centered predictive genomics
Example: Drug metabolism pathways are enriched for genes with significant liver pQTL.
[Figure: Tamoxifen]

Predict and test CC strain crosses that will produce progeny with compromised drug metabolism.
Pathway-centered prediction (Toy Example)

CC Strain  Cyp3a13  Cyp3a16  Cyp2d10  Cyp2d22  Fmo1  Fmo5  Prediction
CC001      ++       +        -        +++      -     +     Highest
CC002      -        +        +        -        -     +     Medium
CC003      -        -        -        +        +     +     Medium
CC004      -        +        ---      +        --    -     Lowest
CC005      +        --       ++       -        +     -     Medium
CC006      +        -        -        -        -     +     Low
CC007      +        -        +        -        +     +     High

Test!
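One way to read the toy table is that the Prediction column tracks the net count of + and - symbols across the six genes. That scoring rule is an assumption made here, not something the slide states, but it reproduces the ordering shown (CC001 highest, CC004 lowest, the three "Medium" strains tied).

```python
GENES = ["Cyp3a13", "Cyp3a16", "Cyp2d10", "Cyp2d22", "Fmo1", "Fmo5"]

# Symbols copied from the toy table, one list per CC strain, in GENES order.
table = {
    "CC001": ["++", "+", "-", "+++", "-", "+"],
    "CC002": ["-", "+", "+", "-", "-", "+"],
    "CC003": ["-", "-", "-", "+", "+", "+"],
    "CC004": ["-", "+", "---", "+", "--", "-"],
    "CC005": ["+", "--", "++", "-", "+", "-"],
    "CC006": ["+", "-", "-", "-", "-", "+"],
    "CC007": ["+", "-", "+", "-", "+", "+"],
}


def net_score(symbols):
    """Assumed scoring rule: each '+' adds one, each '-' subtracts one."""
    return sum(s.count("+") - s.count("-") for s in symbols)


scores = {strain: net_score(symbols) for strain, symbols in table.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for strain in ranked:
    print(strain, scores[strain])
```

A real prediction would weight each gene by its estimated allele effect and its role in the pathway rather than counting symbols equally.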

Conclusions
• Most genetic variation that affects transcript abundance does not affect protein abundance.
  – For local genetic variation that does affect protein abundance, 80% act proximally on transcription (standard model).
• 99+% of distant pQTL act on the target protein’s abundance independent of the target’s transcript abundance.
• Mediation analysis identifies 700 RNA/protein causal intermediates of distant pQTL and infers >5000 protein interactions.
• Stoichiometric buffering is a common post-translational mechanism governing protein abundance of binding partners and complex members.
• We can apply our new understanding of the genome-proteome map in DO mice to tune output of liver pathways.

Acknowledgments

Slides can be downloaded: https://speakerdeck.com/stevemunger/seminar-at-sanger-institute-10-5-2017
