Gene expression profiling workshop

Gene expression profiling Nick Haining DFCI | Broad Institute Imm306QC
April 26, 2013

Learning Objectives
1.  Mechanics of gene expression profiling
2.  Experimental design considerations
3.  Basic analysis of gene expression profiling data

Overview
1.  Rationale
2.  Experimental design
3.  Analysis approaches
4.  Case examples
5.  Gene expression analysis workshop

Why do gene expression profiling?

mRNA abundance correlates with protein abundance (sort of)
Marguerat et al, Cell, 2012: “…Copy numbers of mRNAs and corresponding proteins were highly correlated…However, the ratios between protein and corresponding mRNA copy numbers spanned over three orders of magnitude, ranging from 14 to 61,060….”

Uses of gene expression profiling
1.  Parts list
2.  Patterns
3.  Wiring diagrams

Parts list Wherry, Immunity 2007

Parts list
•  IL7R in effector CD8 T cell differentiation (Kaech, Nat Immunol, 2003)
•  PD-1 in T cell exhaustion (Barber, Nature, 2005)
•  BATF in T cell exhaustion (Quigley, Nat Med, 2010)

Finding Patterns Golub, Science, 1999

Finding Patterns Cohen, Nat Immunol, 2012

Finding patterns Nakaya, Nat Immunol, 2011

Wiring diagrams Basso, Nat genetics, 2005

Wiring diagrams Amit, Science, 2009

Experimental Design
•  What platform for gene measurement should I use?
•  How many samples/replicates should I measure?
•  How many cells will I need?

Platforms

Number of transcripts assayed (scale runs from "Handful" to "All annotated genes")
•  qRT-PCR: ~10
•  Fluidigm: 96
•  Nanostring: ~800
•  Affymetrix / Illumina: 47,000
•  RNA-Seq: all RNA species (?)

Transcripts on Affy/Illumina arrays
Shares shown on the slide, in slide order: 61%, 23%, 4%, 5%, 7%
Categories, in slide order:
•  Coding transcript, well-established annotation
•  Coding transcript, provisional annotation
•  Non-coding transcript, well-established annotation
•  Non-coding transcript, provisional annotation
•  mRNA sequences that align to EST clusters

Fluidigm
•  Fast
•  Reliable technology
•  Sensitive
•  High-throughput
•  Max of 96 genes
•  PCR bias?
•  Fewer cores available
•  Relatively expensive

Nanostring
•  No PCR amplification
•  Fast
•  Straightforward analysis
•  Long lead-time for probe design
•  Relatively expensive for larger panels
•  Fewer cores available

Affymetrix array
•  Industry standard
•  Loads of analysis tools
•  Lots of reference data
•  Available in most cores
•  Small input protocols well developed
•  Can’t measure what you don’t know
•  Won’t be industry standard for much longer
•  Smaller dynamic range than PCR/Nanostring

Illumina BeadArray
•  Industry standard
•  Cheaper than Affy
•  Loads of analysis tools
•  Lots of reference data
•  Available in many cores
•  Can’t measure what you don’t know
•  Smaller dynamic range than PCR/Nanostring
•  Slightly noisier data than Affy
•  Longevity?

RNA-seq (digital gene expression)
•  Can identify all transcripts
•  Likely to be industry standard in near future
•  May be cheaper
•  Data pipelines aren’t turn-key
•  More variability in rare transcript quantification
•  Small input protocols are in development

Cost
Platform     Genes     $ per sample   $ per gene   Min. cost
Fluidigm     96        22             0.22         2000 (96)
Nanostring   ~100      ~100           1            100 (1)
Affymetrix   20,000    500            0.025        500 (1)
Illumina     20,000    250            0.0125       250 (1)
RNA-seq      20,000    200            0.01         2000 (~10)
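The per-gene figures on the cost slide appear to be the per-sample price divided by the number of genes measured. A minimal sketch of that arithmetic follows; the dictionary layout and function names are illustrative, and reading the parenthesised number in "Min. cost" as a minimum batch of samples is an assumption, not something the slide states.

```python
# Prices and gene counts copied from the cost slide; layout is illustrative.
PLATFORMS = {
    "Fluidigm": {"genes": 96, "usd_per_sample": 22},
    "Affymetrix": {"genes": 20_000, "usd_per_sample": 500},
    "Illumina": {"genes": 20_000, "usd_per_sample": 250},
    "RNA-seq": {"genes": 20_000, "usd_per_sample": 200},
}


def usd_per_gene(platform):
    """Per-sample price divided by the number of genes measured."""
    entry = PLATFORMS[platform]
    return entry["usd_per_sample"] / entry["genes"]


def batch_cost(platform, n_samples):
    """Total price for a batch of samples (assumed linear in sample count)."""
    return PLATFORMS[platform]["usd_per_sample"] * n_samples


if __name__ == "__main__":
    for name in PLATFORMS:
        print(name, round(usd_per_gene(name), 4))
```

Under this reading, ten RNA-seq samples at $200 each give the $2000 minimum shown for RNA-seq, and 96 Fluidigm samples at $22 each come to roughly the $2000 shown for Fluidigm.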

Replicates

Number of replicates

Cell number

~10pg ~1pg

Input RNA amount
•  Fluidigm: 1 pg
•  Affy: 1 µg
•  Illumina: 100 ng
•  RNA-seq (DGE): 100 ng

Analysis

Before you begin
•  Normalization
•  Log transformation
•  Collapse probesets

Data normalization
•  RMA (robust multi-array average)
•  Makes each array comparable to the next
•  Won’t completely get rid of batch effects
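One ingredient of RMA that makes arrays comparable is quantile normalization, which forces every array to share the same distribution of values. The sketch below shows only that step, on plain Python lists; full RMA also includes background correction and probe-level summarization, which are not attempted here. The function name and the tie handling (ties broken by position) are assumptions made for illustration.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of values per array).

    Every array must have the same number of probes. Each value is replaced
    by the mean, across arrays, of the values sharing its rank.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the r-th smallest value across all arrays.
    rank_means = [
        sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest value in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After normalization every array contains the same set of values, only in a different probe order, which is why this step cannot by itself remove batch effects that change the ordering of probes.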

Log transformation

Collapse probesets (Affy, Illumina)
•  Genes are represented by more than one probe
•  The maximum value from each set of probes is selected
Example (Affymetrix U133A 2.0): EGFR is covered by probesets 211607_x_at, 210984_x_at, 201983_s_at, 211550_at, 1565484_x_at, 211551_at, 201984_s_at and 1565483_at; the maximum value is used.
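A minimal sketch of the collapsing step, assuming the maximum is taken separately in each sample (the slide does not say whether the maximum is per sample or per probe overall). The function name and the expression values are invented for illustration; only the probeset IDs and the EGFR mapping come from the slide.

```python
def collapse_probesets(expression, probe_to_gene):
    """Collapse probe-level rows to gene-level rows by per-sample maximum.

    expression: {probe_id: [value in sample 1, value in sample 2, ...]}
    probe_to_gene: {probe_id: gene_symbol}; unmapped probes are skipped.
    """
    collapsed = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        if gene not in collapsed:
            collapsed[gene] = list(values)
        else:
            collapsed[gene] = [max(a, b) for a, b in zip(collapsed[gene], values)]
    return collapsed


if __name__ == "__main__":
    # Invented log-scale values for three of the EGFR probesets.
    expression = {
        "201983_s_at": [7.1, 6.8, 9.0],
        "201984_s_at": [6.5, 7.2, 8.1],
        "211607_x_at": [4.0, 4.1, 3.9],
    }
    mapping = {probe: "EGFR" for probe in expression}
    print(collapse_probesets(expression, mapping))
```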

General types of analysis Supervised analysis

General types of analysis Unsupervised analysis (clustering)

Supervised analysis: Differential expression
•  Given phenotypically distinct classes, find “markers” that distinguish these classes from one another
Classes in the example figure: B Cells, Monocytes, mDC, pDC

Hierarchy of difficulty (adapted from P. Tamayo)
Degree of Difficulty: I (lowest) to IV (highest)
Problem                      Gene Markers   Error     Example
I.   Tissue or Cell Type     ~1000-2000     ~0%       T cells vs. Monocytes
II.  Morphological Type      ~200-500       ~0-5%     Naive vs. memory T cell
III. Morphological Subtype   ~50-100        ~0-15%    Effector Mem. vs. Effector memory (RA); Multiclass Classification
IV.  Treatment Outcome       ~1-20          ~5-50%    Vaccine response; Drug Sensitivity

Marker Selection Process
•  Inputs: dataset; phenotype/class labels
•  Compute score: t-test, SNR, etc.
•  Measure significance: permutation test
•  Output: score, measure of significance, ranked gene list

Ranking differential expression
Signal to noise ratio: $\mathrm{SNR} = (\bar{x}_A - \bar{x}_B) / (\sigma_A + \sigma_B)$, where $\bar{x}$ and $\sigma$ are the mean and standard deviation of a gene's expression within class A or class B.
[Figure: three example genes, expression (0 to 20,000) plotted across samples]
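A minimal sketch of ranking genes by the signal to noise ratio, assuming the usual definition: difference of the two class means divided by the sum of the two class standard deviations. The function names, the "A"/"B" label convention and the use of the sample standard deviation are assumptions made for illustration.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """(mean_A - mean_B) / (sd_A + sd_B) for one gene.

    Needs at least two samples per class and non-zero total spread.
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def rank_genes(expression, labels):
    """Return (gene, score) pairs sorted from most A-high to most B-high.

    expression: {gene: [value per sample]}; labels: "A" or "B" per sample.
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == "A"]
        b = [v for v, lab in zip(values, labels) if lab == "B"]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    labels = ["A", "A", "A", "B", "B", "B"]
    toy = {  # invented values
        "gene_up_in_A": [9, 10, 11, 1, 2, 3],
        "gene_flat": [5, 6, 4, 6, 4, 5],
        "gene_up_in_B": [1, 2, 3, 9, 10, 11],
    }
    for gene, score in rank_genes(toy, labels):
        print(gene, round(score, 2))
```

Genes at the top of the list are candidate markers of class A and genes at the bottom are candidate markers of class B.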

Permutation test and P-value
Aim: Determine the significance of gene’s statistical score
•  “True” classes: known class A samples and known class B samples give the “real” score for this gene
•  Permutation 1, Permutation 2, … Permutation n: shuffle the Class A / Class B labels and recompute the score
•  Generates a “null distribution” of scores for this gene
•  Compare with “real” score for this gene
[Figure: grid of example expression values under the true and permuted class labels]
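A minimal sketch of a label-permutation test for a single gene. It uses the difference of class means as the score to keep the example short (the slides also mention the t-test and SNR), and a two-sided p-value with a +1 correction; both choices, the function names and the toy values are assumptions for illustration.

```python
import random


def mean_difference(values, labels):
    """Score for one gene: mean of class A minus mean of class B."""
    a = [v for v, lab in zip(values, labels) if lab == "A"]
    b = [v for v, lab in zip(values, labels) if lab == "B"]
    return sum(a) / len(a) - sum(b) / len(b)


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Two-sided p-value from shuffling the class labels.

    The shuffled scores form the null distribution; the p-value is the
    fraction of them at least as extreme as the real score.
    """
    rng = random.Random(seed)
    observed = mean_difference(values, labels)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(values, shuffled)) >= abs(observed):
            extreme += 1
    # +1 in numerator and denominator avoids reporting a p-value of zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    labels = ["A"] * 4 + ["B"] * 4
    print(permutation_p_value([10, 11, 12, 13, 1, 2, 3, 4], labels))
```

With only four samples per class there are 70 distinct ways to split the labels, so the smallest achievable two-sided p-value is about 2/70 no matter how strong the difference is.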

Multiple Testing Procedures
•  False Discovery Rate (FDR): percent of false positives among all genes called differentially expressed
•  Multiple-testing correction only controls false positives (type 1 error); reducing false negatives (type 2 error) requires more samples
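The slide does not name a specific FDR procedure. As an illustration, here is a minimal sketch of the Benjamini-Hochberg adjustment, one widely used way to turn per-gene p-values into FDR estimates (q-values); the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Gene indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Calling genes with an adjusted value below 0.05 as differentially expressed then aims for no more than about 5% false positives among the genes called.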

Effect of Sample Size
•  Generate a 10,000x100 matrix from a Gaussian (mean=0, SD=0.5)
•  Pick n columns (6, 14, 30, 100)
•  Assign sample labels yellow and green
•  Select top 25 markers for yellow, top 25 markers for green
With small sample size it is easy to find genes correlated with phenotype
[Figure: top markers, Yellow vs. Green, for 6 samples, 14 samples, 30 samples and 100 samples]
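The simulation on this slide can be sketched as follows. The code draws n columns of pure Gaussian noise directly rather than subsetting a fixed 10,000x100 matrix (equivalent for independent draws), splits them evenly into yellow and green, and reports how large the yellow/green difference looks for the top 25 markers in each direction. The function name, the even split and the use of mean difference as the marker score are assumptions.

```python
import random


def mean_top_marker_difference(n_samples, n_genes=10_000, n_top=25, sd=0.5, seed=0):
    """Average |yellow mean - green mean| over the selected top markers.

    All data are noise, so any apparent difference is a false discovery.
    """
    rng = random.Random(seed)
    half = n_samples // 2
    diffs = []
    for _ in range(n_genes):
        row = [rng.gauss(0.0, sd) for _ in range(n_samples)]
        yellow, green = row[:half], row[half:]
        diffs.append(sum(yellow) / len(yellow) - sum(green) / len(green))
    diffs.sort()
    # Lowest diffs are the "green" markers, highest are the "yellow" markers.
    selected = diffs[:n_top] + diffs[-n_top:]
    return sum(abs(d) for d in selected) / len(selected)


if __name__ == "__main__":
    for n in (6, 14, 30, 100):
        print(n, round(mean_top_marker_difference(n), 3))
```

The apparent effect size of the "markers" shrinks as n grows, even though no gene is truly associated with the labels.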

Gene set enrichment analysis
Measuring signatures rather than genes
[Figure axes: Expression in YFV Effectors; Expression in Naive CD8 T cells]

Gene set enrichment analysis
Panels: Enriched in Cell Type A; Enriched in Cell Type B; No Enrichment
Subramanian et al. PNAS, 2005; Haining & Wherry. Immunity 2010

Enrichment: KS-score
Panels: Enriched Gene Set; Un-enriched Gene Set (axes: Gene List Order Index; Enrichment Score S, with Max. Enrichment Score ES marked)
•  Every hit goes up by 1/NH
•  Every miss goes down by 1/NM
•  The maximum height provides the enrichment score
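A minimal sketch of the running-sum score described on this slide: walk down the ranked gene list, step up by 1/NH at each gene in the set (NH = number of hits) and down by 1/NM at each gene outside it (NM = number of misses). This is the unweighted Kolmogorov-Smirnov form; the published GSEA method weights the upward steps by each gene's ranking score, which is not done here. Returning the signed maximum deviation from zero, and the function name, are assumptions.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted KS-style enrichment score for one gene set.

    ranked_genes: genes ordered from most up in class A to most up in class B.
    Returns the running-sum value that lies furthest from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e", "f"]
    print(enrichment_score(ranked, {"a", "b"}))
    print(enrichment_score(ranked, {"e", "f"}))
```

A set concentrated at the top of the list scores near +1, a set concentrated at the bottom scores near -1, and a set scattered through the list stays near 0. Significance would still need a permutation test on the score.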

TBX21, EOMES
[Figure axes: Expression in YFV Effectors; Expression in Naive CD8 T cells]
Rama Akondy

E2F3, E2F2, E2F8
[Figure axes: Expression in YFV Effectors; Expression in Naive CD8 T cells]

E2F Target Genes
[Figure axes: Expression in YFV Effectors; Expression in Naive CD8 T cells]

FDR=0.01

Signatures are portable
Panel labels: Oncogenic KRAS; CD8; CD4; B Cell; Ras Signature #1; Ras Signature #2; Lung Tumors
Gene list (printed twice on the slide): S100A4 CD58 C1ORF24 ANXA1 SMAD3 TOX CLIC1 ANXA2P2 GLIPR1 KLF6 FAS AIM2 WEE1 ATP2B4 GARNL4 ITGB1 PHACTR2 KLF10 LGALS3 CRIP1 CMRF-35H AHNAK IL2RB EPHA4 TNFRSF1B OPTN CASP1 CYB561 CD63 ADAM19 SLAMF1 C8ORF70 C11ORF17 NRIP1 PECAM1 CYORF14 PTK2 AIF1 SELL STMN1 SCML2 SERPINE2 KBTBD11 C5ORF13 SATB1 GAS2 ZNF516 TBXA2R BACH2 NBEA GAL3ST4 SCML1 PTPRK POP5 LOC282997 CCR7

MSigDB
•  Collection of ~8,000 signatures
http://www.broadinstitute.org/gsea/msigdb/index.jsp

Clustering
•  Hierarchical clustering
•  Principal components analysis
•  K-means clustering

Hierarchical Clustering
[Figure: five points (1 to 5) and the corresponding dendrogram; the vertical axis is the distance between joined clusters]

HC example Yeoh, Cancer Cell, 2002

Principal Components Analysis
•  Reduces high dimensional data (like microarrays) to artificial dimensions of greatest variation
•  Useful since phenotypic differences often are captured along a PC
•  Allows objects (samples) to be clustered together in small number of dimensions
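To make "dimensions of greatest variation" concrete, here is a toy sketch that finds the first principal component by power iteration on the covariance matrix and projects each sample onto it. It is meant for a handful of features only; real expression matrices with thousands of genes are normally handled with an SVD routine from a numerical library. The function name, the starting vector and the iteration count are assumptions, and the method would fail if the starting vector happened to be exactly orthogonal to the first component or if the data had no variance.

```python
import math


def first_principal_component(samples, n_iter=200):
    """Return (direction, scores) for the first principal component.

    samples: list of rows, one per sample, each a list of feature values.
    direction: unit vector along the axis of greatest variance.
    scores: position of each (mean-centered) sample along that axis.
    """
    n = len(samples)
    d = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    # Sample covariance matrix of the features.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores


if __name__ == "__main__":
    # Invented data: the second feature is always twice the first.
    direction, scores = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
    print(direction, scores)
```

Plotting samples by their scores on the first two or three components is the usual way to see whether phenotypic groups separate.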

Simplified Example Haining, Immunity 2012

Immunity 2012

k-means clustering

k-means example Best, Nat Immunol, 2013

Case examples

Case example #1
You're doing a rotation in a lab and are staining a population of T cells from a well characterized mouse model for flow cytometry. You accidentally grab the wrong vial of antibody for your stains. When you flow the cells, you discover that a subset of your population of interest stains with this novel marker. Subsequent experiments confirm the finding and show that this novel subset has unique functional properties. You want to use gene expression profiling to characterize this novel subset of cells.

Case example #2
You're studying the immune response to a new vaccine in samples from a clinical trial. A well-characterized cohort of human subjects is vaccinated with the same vaccine, but unexpectedly the antibody response to the vaccine varies enormously across the cohort. Your project is to identify novel correlates of the antibody response using gene expression profiling of PBMC samples.

Case example #3
Your lab studies the differentiation of cell-type A into cell-type B. In a small-molecule screen, you have identified a compound that appears to induce the differentiation of cell-type A. The readout of the screen was upregulation of a cell-surface molecule characteristic of cell-type B. You now want to use gene expression profiling to determine whether the compound induces broader transcriptional changes associated with cell-type B.

Gene expression profiling workshop
