Aug 2009: “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.”
- Hal Varian, chief economist at Google
(2 min YouTube video)
http://www.amstat.org/newsroom/pressreleases/2015-StatsFastestGrowingSTEMDegree.pdf
Data from the National Center for Education Statistics; analysis by the ASA
Preparing students: undergraduate and graduate
http://magazine.amstat.org/blog/2013/05/01/stats-degrees/
Data source: NCES Digest of Education Statistics
Graduated from high school
Completed two REUs in Biostatistics
Graduated from LSU (BS, Mathematics); started a PhD in Statistics
Discuss these three questions and illustrate how statistics can help
Focus on DNA methylation data
(But these challenges are common to other areas of genomics)
Morgan et al. (1999). Nature Genetics 23: 314-8
http://epigenome.eu/en/2,48,873
Bradbury (2003). PLoS Biology 1: e82
Measuring DNA Methylation
Bock (2012). Nature Reviews Genetics 13: 705-719
Problem: Which CpGs are differentially methylated between two groups?
Some proposed statistical solutions:
At each CpG, test if there is a difference using e.g. t-test, F-test or linear regression
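The per-CpG testing idea can be sketched in a few lines. This is a minimal illustration, not a method taken from the slides: the methylation values are made up, the function names are hypothetical, and the p-value comes from an exact permutation test on the Welch t statistic (swapped in for the t-distribution lookup so that only the Python standard library is needed). The exact enumeration is practical only for small groups.

```python
from itertools import combinations
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic (unequal variances allowed)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    if se2 == 0:
        return 0.0 if mean(a) == mean(b) else float("inf")
    return (mean(a) - mean(b)) / se2 ** 0.5


def permutation_pvalue(a, b):
    """Exact two-sided permutation p-value for |t| (small samples only)."""
    pooled = list(a) + list(b)
    observed = abs(welch_t(a, b))
    n_total, n_a = len(pooled), len(a)
    hits = total = 0
    # Enumerate every way of relabelling n_a of the samples as group "a".
    for idx in combinations(range(n_total), n_a):
        chosen = set(idx)
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(n_total) if i not in chosen]
        total += 1
        if abs(welch_t(grp_a, grp_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical methylation levels (beta values) at two CpGs.
cpg_a_cases = [0.81, 0.78, 0.85, 0.80, 0.83]
cpg_a_controls = [0.42, 0.47, 0.40, 0.45, 0.44]
cpg_b_cases = [0.52, 0.61, 0.48, 0.57, 0.55]
cpg_b_controls = [0.50, 0.58, 0.53, 0.49, 0.60]

p_a = permutation_pvalue(cpg_a_cases, cpg_a_controls)
p_b = permutation_pvalue(cpg_b_cases, cpg_b_controls)
print(f"CpG A: p = {p_a:.4f}")
print(f"CpG B: p = {p_b:.4f}")
```

With real data the same loop would run once per CpG, which is why multiple testing becomes a concern as soon as more than a handful of sites are tested.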
t-test for Differential Methylation
[Figure: CpG #1, methylation level (0.00 to 1.00) by status (Case vs Control)]
p-value = 0.034 < 0.05
CpG is differentially methylated
t-test for Differential Methylation
[Figure: CpG #2, methylation level (0.00 to 1.00) by status (Case vs Control)]
p-value = 0.343 > 0.05
CpG is not differentially methylated
What about neighboring CpGs?
Problem: If one CpG is methylated, would a nearby CpG also be methylated?
Some proposed solutions:
(1) Can we find two or more runs of differentially methylated CpGs?
• If p-value < 0.05 for CpG #1, #2, #3, etc…
• Caution: multiple testing
(2) Can we smooth across CpGs and find genomic regions that are differentially methylated?
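Both proposals can be sketched roughly as follows. Everything here is an assumption made for illustration: the p-values are invented, the Benjamini-Hochberg adjustment is one common answer to the multiple-testing caution (the slides do not name a specific correction), CpGs are treated as equally spaced neighbors, and the moving average stands in for whatever smoother a real analysis would use.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_runs(flags, min_length=2):
    """Return (start, end) index pairs of consecutive True runs, end inclusive."""
    runs, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(flags) - start >= min_length:
        runs.append((start, len(flags) - 1))
    return runs


def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Hypothetical per-CpG p-values, ordered by genomic position.
pvalues = [0.001, 0.004, 0.003, 0.40, 0.75, 0.002, 0.60, 0.90]
flags = [p < 0.05 for p in bh_adjust(pvalues)]
runs = significant_runs(flags)
print(runs)

# Hypothetical case-minus-control methylation differences along the same CpGs.
differences = [0.30, 0.35, 0.28, 0.02, -0.01, 0.25, 0.03, 0.00]
smoothed = moving_average(differences)
print([round(x, 3) for x in smoothed])
```

An isolated significant CpG (index 5 in the example) does not form a run, which is the point of looking at neighbors rather than single sites.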
Technical vs Biological Variation
• Raw genomics data contains biases and unwanted technical variation
– e.g. sequencing technology, batch effects
– Can cause perceived differences between samples, irrespective of the biological variation
• Changes in experimental conditions can be confused with biological variability
– Can lead to false discoveries (e.g. finding DMRs)
Quantile Normalization
• Most widely used multi-sample normalization
• Originally developed for gene expression microarrays
• Now applied to
– Genotyping arrays, RNA-Sequencing, DNA methylation, ChIP-Sequencing & Brain imaging
Can be very helpful in eliminating unwanted variation, e.g. "batch effects" (good), but has the potential to wash out true biological variation (bad)
How does it work?
Quantile normalization is a non-linear transformation that replaces each intensity score with the mean of the features with the same rank from each array
Step 1. Raw data (rows = features, columns = samples)

     2     4     4     5
     5    14     4     7
     4     8     6     9
     3     8     5     8
     3     9     3     5

Step 2. Order values within each sample (or column)

     2     4     3     5
     3     8     4     5
     3     8     4     7
     4     9     5     8
     5    14     6     9

Step 3. Average across rows and substitute value with average

   3.5   3.5   3.5   3.5
   5.0   5.0   5.0   5.0
   5.5   5.5   5.5   5.5
   6.5   6.5   6.5   6.5
   8.5   8.5   8.5   8.5

Step 4. Re-order averaged values in original order

   3.5   3.5   5.0   5.0
   8.5   8.5   5.5   5.5
   6.5   5.0   8.5   8.5
   5.0   5.5   6.5   6.5
   5.5   6.5   3.5   3.5
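The procedure described above (sort each sample, average across samples at each rank, put the averages back in the original order) can be sketched as follows. The function name is hypothetical and the sketch makes one assumption the slides do not state: tied values within a sample are ranked by order of appearance, which is only one of several conventions, so tied entries can come out differently from a hand-worked table.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns (one per sample)."""
    n = len(columns[0])
    # Step 1: for each column, find the row order that sorts its values.
    orders = [sorted(range(n), key=lambda i, col=col: col[i]) for col in columns]
    # Step 2: average the sorted values across samples at each rank.
    rank_means = [
        sum(col[order[r]] for col, order in zip(columns, orders)) / len(columns)
        for r in range(n)
    ]
    # Step 3: put the rank means back in each column's original row order.
    normalized = []
    for order in orders:
        new_col = [0.0] * n
        for rank, row in enumerate(order):
            new_col[row] = rank_means[rank]
        normalized.append(new_col)
    return rank_means, normalized


# Four samples (columns) measured on five features (rows).
samples = [
    [2, 5, 4, 3, 3],
    [4, 14, 8, 8, 9],
    [4, 4, 6, 5, 3],
    [5, 7, 9, 8, 5],
]
rank_means, normalized = quantile_normalize(samples)
print(rank_means)
for column in normalized:
    print(column)
```

After normalization every sample contains exactly the same set of values, only in a different order. That is what removes global differences between samples, whether they are technical or biological.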
Back to motivating example (quantile normalized)
Should we use quantile normalization? Will we remove important biological variation?
[Figure: DNA methylation (450K arrays), beta values from 0.0 to 1.0]
quantro: Test for global changes between groups
quantro
• R/Bioconductor package to test for the assumptions of quantile normalization
[Figure: density of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]
Main idea:
• Compare variability within groups to variability between groups
• If variability between groups > variability within groups, then there may be global changes across groups
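The comparison of between-group and within-group variability can be illustrated with a one-way ANOVA F ratio computed on a single summary per sample. This is a simplified sketch and should not be read as quantro's actual test statistic: the choice of the median as the per-sample summary, the function name, and the toy data are all assumptions made for illustration.

```python
from statistics import mean, median


def between_within_ratio(groups):
    """One-way ANOVA F ratio on per-sample medians.

    `groups` maps a group label to a list of samples; each sample is a
    list of measurements.
    """
    summaries = {g: [median(s) for s in samples] for g, samples in groups.items()}
    all_values = [v for vals in summaries.values() for v in vals]
    grand = mean(all_values)
    k, n = len(summaries), len(all_values)
    # Variability of group means around the grand mean.
    ss_between = sum(len(v) * (mean(v) - grand) ** 2 for v in summaries.values())
    # Variability of samples around their own group mean.
    ss_within = sum((x - mean(v)) ** 2 for v in summaries.values() for x in v)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical data: two groups of three samples each.
similar = {
    "A": [[1, 2, 3], [2, 3, 4], [1, 3, 5]],
    "B": [[2, 2, 4], [1, 4, 4], [2, 3, 3]],
}
shifted = {
    "A": [[1, 2, 3], [2, 3, 4], [1, 3, 5]],
    "B": [[7, 8, 9], [8, 9, 10], [7, 9, 11]],
}
ratio_similar = between_within_ratio(similar)
ratio_shifted = between_within_ratio(shifted)
print(ratio_similar, ratio_shifted)
```

A ratio well above 1 (the shifted example) suggests global differences between groups. A ratio near or below 1 (the similar example) suggests the groups share a common distribution, which is the setting where quantile normalization is appropriate.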
Targeted vs Global changes
[Figure: density plots. A: rlogTransformation counts for GG (n=18), AG (n=32), AA (n=15). B: log2 PM values for Nonsmoker (n=15), Smoker (n=15), Asthmatic (n=15). C: log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]

Observed variation: Small variability within groups, small variability across groups (targeted changes)
  Reason? Small technical variability; no global changes
  What to do? Use quantile normalization (but not necessary)

Observed variation: Large variability within groups, small variability across groups (targeted changes)
  Reason? Large technical variability or batch effects within groups; no global changes
  What to do? Use quantile normalization

Observed variation: Small variability within groups, large variability across groups (global changes)
  Reason? Global technical variability or batch effects across groups
  What to do? Use quantile normalization
  Reason? Global biological variability across groups
  What to do? Do not use quantile normalization
  Raw data alone cannot detect the difference

quantro will detect global differences due to both technical and biological variation
Final thoughts
• Statistics matters in the analysis of any data!
• Statistics can help identify relevant biological variation in genomics data
– Differences in CpGs
– Smoothing across genomic regions
• Statistics can help eliminate unwanted technical variation in genomics data
– “Batch effects”