Dissertation Defense: "Encoding Alignments for Classification Problems"

Learning signals in genomic sequence alignments for identification of functional
elements

Signals for identifying functional elements
▪ Sequence constraint:
  ▪ Specific sequence patterns (“motifs”)
  ▪ Base composition
▪ Evolutionary constraint:
  ▪ Mutation events occur randomly, but selection determines whether events are tolerated, resulting in a different pattern of change in functional regions

Sequence Alignment
Original:      CTCCCAGCTGCCC
Substitutions: CTCCCGGCAGCCC
Insertion:     CTCCCAGAGAGCTGCCC
Deletion:      CTAGAGAGCTGCCC
Aligned:
CTCCCGG----CAGCCC
CTCCCAGAGAGCTGCCC
CT---AGAGAGCTGCCC
▪ Gap symbols (“-”) indicate insertions and deletions
▪ Columns contain orthologous bases

Encoding alignments
Find a mapping from alignment columns into a smaller alphabet that maintains the “right” information for some classification problem
CTCCCAGCTGCCCAGTGCCGCCTCTTTTT
CTCCTAGCTG-CCAGCATCTCCCGTTTTT
CTCCCAGCTGCCCTGCGCCTCCTCTTTTT
↓
13111021321110232112113133333
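The idea of an encoding can be illustrated with a minimal Python sketch. The helper names (`columns`, `encode`) and the hand-written `column_to_symbol` table are hypothetical; in ESPERR the grouping of columns is learned rather than specified by hand, and the symbols below do not reproduce the encoding shown on the slide.

```python
# Rows of a toy alignment (equal length, "-" marks a gap).
rows = [
    "CTCCCAG",
    "CTCCTAG",
    "CTCCCAG",
]

def columns(rows):
    """Return the alignment columns as tuples, one base per species."""
    return list(zip(*rows))

def encode(rows, column_to_symbol, unknown=0):
    """Map each column to a symbol from a smaller alphabet."""
    return [column_to_symbol.get(col, unknown) for col in columns(rows)]

# Hypothetical mapping, for illustration only.
column_to_symbol = {
    ("C", "C", "C"): 1,
    ("T", "T", "T"): 3,
    ("A", "A", "A"): 2,
    ("G", "G", "G"): 2,
}

encoded = encode(rows, column_to_symbol)
```

Columns that are not in the table (here the column with a C/T substitution) fall back to the `unknown` symbol, which is a choice made for this sketch only.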

ESPERR (Evolutionary and Sequence Pattern Extraction through Reduced Representation)

What’s new about ESPERR?
▪ Replaced “seeded clustering” with a new agglomerative approach (allows us to scale to many more species)
▪ Improved handling of missing data (now used for all applications)
▪ It finally has a name!

ESPERR Overview
▪ Represent alignment columns as probability distributions
▪ Create an initial grouping of columns using an agglomeration procedure that combines evolutionary similarity with frequency distribution
▪ Search for the final grouping of columns using an iterative procedure guided by classification performance

Ancestral probability distributions
Ancestral probability distribution inferred with Felsenstein’s algorithm. For an ancestor y with children x1 and x2 on branches of length t1 and t2:

$$L(y \mid a) = \Big[\sum_{b} L(x_1 \mid b)\, p(a \to b;\, t_1)\Big] \Big[\sum_{c} L(x_2 \mid c)\, p(a \to c;\, t_2)\Big]$$

where the transition probabilities p(a → b; t) are determined by a rate matrix Q.
[Figure: alignment columns and a tree with leaves x1, x2 and ancestor y. Stage 1, first step: represent alignment columns as ancestral probability distributions over A, C, G, T, -]
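The recursion above can be sketched for a single ancestor with two leaves. This is a minimal illustration, not the ESPERR implementation: it swaps in an equal-rates model over the five states so that the transition probabilities have a closed form, whereas the deck uses an HKY matrix augmented with gaps. The function names and the rate `mu` are assumptions.

```python
import math

STATES = "ACGT-"

def transition_prob(a, b, t, mu=1.0):
    """p(a -> b; t) under an equal-rates model (an assumption, not HKY)."""
    n = len(STATES)
    decay = math.exp(-n * mu * t)
    if a == b:
        return 1.0 / n + (1.0 - 1.0 / n) * decay
    return (1.0 - decay) / n

def leaf_likelihood(observed):
    """L(x | b) at a leaf: 1 for the observed state, 0 otherwise."""
    return {s: 1.0 if s == observed else 0.0 for s in STATES}

def parent_likelihood(left, t1, right, t2):
    """L(y|a) = [sum_b L(x1|b) p(a->b;t1)] * [sum_c L(x2|c) p(a->c;t2)]."""
    result = {}
    for a in STATES:
        from_left = sum(left[b] * transition_prob(a, b, t1) for b in STATES)
        from_right = sum(right[c] * transition_prob(a, c, t2) for c in STATES)
        result[a] = from_left * from_right
    return result

def posterior(likelihood):
    """Normalize under a uniform prior to get an ancestral distribution."""
    total = sum(likelihood.values())
    return {s: v / total for s, v in likelihood.items()}

# Column with "A" in one species and "G" in the other.
ancestral = posterior(parent_likelihood(leaf_likelihood("A"), 0.1,
                                        leaf_likelihood("G"), 0.1))
```

With equal branch lengths the ancestral distribution puts equal weight on the two observed bases and less on every other state, which is the behavior one would expect from the recursion.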

Ancestral probability distributions
Continuous time Markov model of substitution, HKY matrix augmented to handle gaps:
We assume a continuous time Markov process along each branch of the phylogenetic tree. The rate matrix Q specifies the instantaneous rate of each substitution event, and the rates in Q are expressed through a smaller number of parameters. In particular, we use the parameterization provided by the HKY model of Hasegawa et al., which consists of equilibrium probabilities for each base (π_A, π_C, π_G, π_T) and the ratio between the rates of transitions and transversions (κ). We extend the model to accommodate gaps as if they were a fifth nucleotide, introducing an equilibrium probability (π_Gap) and a rate ratio of gaps to transversions (σ), which gives the rate matrix (rows and columns ordered A, C, G, T, Gap):

$$Q = \begin{pmatrix}
- & \pi_C & \kappa\pi_G & \pi_T & \sigma\pi_{Gap} \\
\pi_A & - & \pi_G & \kappa\pi_T & \sigma\pi_{Gap} \\
\kappa\pi_A & \pi_C & - & \pi_T & \sigma\pi_{Gap} \\
\pi_A & \kappa\pi_C & \pi_G & - & \sigma\pi_{Gap} \\
\sigma\pi_A & \sigma\pi_C & \sigma\pi_G & \sigma\pi_T & -
\end{pmatrix}$$

The parameters of Q are estimated using the Expectation Maximization algorithm implemented in the PHAST software package (Siepel and Haussler), generally given a tree topology and a sample of genome-wide alignments.
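As a rough sketch of how the gap-augmented HKY rate matrix described on this slide could be assembled: the function name, the parameter names `kappa` (transition/transversion ratio) and `sigma` (gap/transversion ratio), and the numeric values are all chosen here for illustration. The diagonal is filled so that each row sums to zero, which is the usual convention for rate matrices.

```python
STATES = ["A", "C", "G", "T", "Gap"]
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def rate_matrix(pi, kappa, sigma):
    """Build Q as a nested dict; pi maps each state to its equilibrium probability."""
    q = {}
    for a in STATES:
        q[a] = {}
        for b in STATES:
            if a == b:
                continue
            if a == "Gap" or b == "Gap":
                ratio = sigma      # any change to or from a gap
            elif (a, b) in TRANSITIONS:
                ratio = kappa      # purine<->purine or pyrimidine<->pyrimidine
            else:
                ratio = 1.0        # transversion
            q[a][b] = ratio * pi[b]
        # Diagonal entry makes the row sum to zero.
        q[a][a] = -sum(q[a].values())
    return q

# Illustrative values only; the deck estimates these with EM in PHAST.
pi = {"A": 0.25, "C": 0.2, "G": 0.2, "T": 0.25, "Gap": 0.1}
Q = rate_matrix(pi, kappa=2.0, sigma=0.5)
```

Because every off-diagonal entry has the form ratio × π_b with a symmetric ratio, the matrix built this way satisfies detailed balance with respect to π.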

Ancestral probability distributions
▪ Alignment and synteny annotation used to separate real gaps from missing data
▪ Leaves with missing data are eliminated from the inference
▪ Amount of missing data allowed is limited
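The missing-data rules on this slide might be applied to a column-by-column filter along the following lines. The `*` marker for missing data, the function name, and the `max_missing` threshold are assumptions made for this sketch; `-` is kept as a real gap.

```python
MISSING = "*"  # assumed marker for missing data; "-" remains a real gap

def usable_columns(rows, max_missing):
    """Yield (index, observed) for columns with at most max_missing missing leaves."""
    for i, column in enumerate(zip(*rows)):
        if column.count(MISSING) > max_missing:
            continue  # too much missing data: skip the column entirely
        # Keep only leaves that carry data; the inference would ignore the rest.
        observed = {species: base
                    for species, base in enumerate(column)
                    if base != MISSING}
        yield i, observed

rows = ["AGGA", "A-*A", "A**A", "C***"]
kept = list(usable_columns(rows, max_missing=1))
```

Here the second and third columns are dropped because more than one species is missing, while the last column is kept with its missing leaf removed.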

Clustering spatially and distributionally
Consider the observed column frequencies as a discrete distribution over the probability simplex, and find a distribution on a smaller number of points that preserves:
▪ spatial structure: merge only neighboring points
▪ distributional structure: select mergers that maximize mutual information
[Figure: scatter plots of alignment columns on the simplex over A, C, G, T, - (one axis labeled MDS2). Stage 1, second step: create initial grouping (encoding) based on evolutionary similarity and frequency distribution; colored circles represent groups of columns from clustering]

An agglomerative algorithm
▪ Initialize clusters: each contains one point
▪ Centroid of a cluster is the average of all points in the cluster, weighted by their probabilities
▪ Distance / linkage is Euclidean between centroids
▪ Iteration:
  ▪ Consider each point and its nearest neighbor
  ▪ Compute entropy if that pair were merged, keep the merger that maximizes entropy
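One iteration of the procedure listed above can be sketched as follows. This is an interpretation under stated assumptions: "entropy" is taken to be the Shannon entropy of the total probability mass per cluster after the merger, and the function names and toy data are hypothetical.

```python
import math

def centroid(cluster, points, probs):
    """Probability-weighted average of the points in a cluster."""
    mass = sum(probs[i] for i in cluster)
    dims = len(points[cluster[0]])
    return tuple(sum(probs[i] * points[i][d] for i in cluster) / mass
                 for d in range(dims))

def entropy(masses):
    """Shannon entropy (in nats) of the cluster mass distribution."""
    return -sum(m * math.log(m) for m in masses if m > 0)

def merge_once(clusters, points, probs):
    """Merge the nearest-neighbor pair whose merger leaves the highest entropy."""
    centers = [centroid(c, points, probs) for c in clusters]
    masses = [sum(probs[i] for i in c) for c in clusters]
    best = None
    for i in range(len(clusters)):
        # Nearest neighbor of cluster i by Euclidean distance between centroids.
        j = min((k for k in range(len(clusters)) if k != i),
                key=lambda k: math.dist(centers[i], centers[k]))
        merged = [m for k, m in enumerate(masses) if k not in (i, j)]
        merged.append(masses[i] + masses[j])
        score = entropy(merged)
        if best is None or score > best[0]:
            best = (score, i, j)
    _, i, j = best
    kept = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return kept + [clusters[i] + clusters[j]]

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
probs = [0.4, 0.4, 0.1, 0.1]
clusters = [[i] for i in range(len(points))]
clusters = merge_once(clusters, points, probs)
```

Restricting candidates to nearest neighbors preserves spatial structure, while the entropy criterion favors merging low-mass clusters, so frequent columns keep their own symbols longer.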

Searching for encodings
▪ (1) Initialize with the encoding determined by agglomeration
▪ (2) Produce candidate encodings using a random set of mergers and expansions from the current encoding
▪ (3) Evaluate candidates using cross validation on the training data (for example 62%, 58%, 65%)
▪ (4) Accept the encoding with the best performance
▪ (5) Continue iterating until convergence (1,000 iterations with no performance improvement)
[Figure: Stage 2: search for best encoding based on classification rate; panels (1) to (5) correspond to the steps above, and colored circles represent groups of columns from clustering]
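The search loop can be sketched in a few lines of Python. Everything here is a simplification: an encoding is a list assigning a symbol to each initial group, the merit function is a toy stand-in for the cross-validated classification rate, and a candidate is accepted only when it improves on the best found so far. The names `merge`, `expand`, `search`, and `toy_merit` are hypothetical.

```python
import random

def merge(assignment, rng):
    """Merger: relabel every group carrying one symbol with another symbol."""
    symbols = sorted(set(assignment))
    if len(symbols) < 2:
        return list(assignment)
    a, b = rng.sample(symbols, 2)
    return [b if s == a else s for s in assignment]

def expand(assignment, rng):
    """Expansion: move one group to a brand-new symbol."""
    out = list(assignment)
    out[rng.randrange(len(out))] = max(out) + 1
    return out

def search(initial, merit, n_candidates=3, patience=1000, seed=0):
    """Keep the best candidate each round; stop after `patience` stale rounds."""
    rng = random.Random(seed)
    best, best_merit, stale = list(initial), merit(initial), 0
    while stale < patience:
        candidates = [rng.choice((merge, expand))(best, rng)
                      for _ in range(n_candidates)]
        top = max(candidates, key=merit)
        if merit(top) > best_merit:
            best, best_merit, stale = top, merit(top), 0
        else:
            stale += 1
    return best, best_merit

# Stand-in merit: agreement with a hidden "true" grouping of five groups.
target = [0, 0, 1, 1, 2]

def toy_merit(assignment):
    pairs = [(i, j) for i in range(len(target)) for j in range(i)]
    agree = sum((assignment[i] == assignment[j]) == (target[i] == target[j])
                for i, j in pairs)
    return agree / len(pairs)

best, score = search([0] * 5, toy_merit, patience=200)
```

In the real procedure each merit evaluation is a full cross-validation run, which is why the number of candidates per iteration matters for running time.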

Heuristics to improve the search
▪ Due to a preference for smaller models the search can get stuck in local minima: we force several consecutive “expansion” steps if there is no improvement in performance for 20 iterations
▪ Given the large space of possible alphabets the search process can get lost: we restart from the best alphabet found so far if there is no improvement in performance for 50 iterations
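The two heuristics can be read as a rule that picks the next move from the number of iterations since the last improvement. The sketch below is one possible reading: the function name, the move labels, and `expansion_steps=3` (the slide only says "several") are assumptions, and how the counter is reset after a restart is left to the caller.

```python
def next_move(stall, expand_after=20, expansion_steps=3, restart_after=50):
    """Choose the next search move from the count of iterations without improvement."""
    if stall >= restart_after:
        return "restart_from_best"
    if expand_after <= stall < expand_after + expansion_steps:
        return "force_expansion"
    return "random_merge_or_expand"

moves = [next_move(s) for s in (0, 19, 20, 22, 23, 49, 50)]
```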

Search convergence behavior
[Figure: merit by iteration for a single run; maximum merit by iteration for 10 runs]
Stop the search if there is no improvement in the maximum after 1,000 iterations; the search usually converges within ~10,000 iterations.

Variable order Markov models
▪ Allows parsimonious models that can still capture some long words
▪ Model size is constrained while alphabet size changes
[Figure: number of parameters (0 to 120,000) versus alphabet size (10 to 50) for fixed and variable order models, RP data]
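To see why a fixed-order model grows so quickly with alphabet size, a short worked calculation helps. It assumes a fixed order of 2, which the slide does not state; that choice is made here only because it gives counts on the scale of the plot's axis. The function name is hypothetical.

```python
def fixed_order_parameters(alphabet_size, order):
    """Free parameters: one distribution (size - 1 free values) per context."""
    return alphabet_size ** order * (alphabet_size - 1)

counts = {n: fixed_order_parameters(n, order=2) for n in (10, 20, 30, 40, 50)}
```

Under this assumption the count rises from 900 parameters at 10 symbols to 122,500 at 50 symbols, whereas a variable order model can cap the number of contexts it keeps.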

Applications

ESPERR Regulatory Potential Scores

Example: Seven-species RP scores
▪ Regulatory Potential (RP) scores discriminate “known regulatory” from “neutral” regions
▪ Seven-species alignments: human, chimpanzee, macaque, mouse, rat, dog, cow
▪ Training data sets of ~31,000 bases each (no more than three missing species allowed in a column)
▪ 17-symbol final alphabet
▪ Cross validation success rate (leave-one-out) of ~94%, a substantial improvement over ~82% for three-species RP

[Figure: cumulative distribution of scores for the regulatory training set, exons, bulk, and the AR training set; a second panel with a sensitivity axis]

Example: Seven-species RP
▪ Additional evaluation using HBBC
▪ 23 experimentally confirmed regulatory elements in this region
▪ ROC plot considers the sensitivity/specificity of several scores for discriminating these elements

[Figure: genome browser view of chr11:5,255,000-5,270,000 with tracks: Compilation of Landmarks from Locus Experts (HBE1_PRA, HBE1_NRA, HS1, HS2_pos, HS2_neg, HS3, HS3.1, HS3.2, HS4, HS5); phastCons; Vertebrate Multiz Alignment & Conservation (chimp, rhesus, mouse, rat, dog, cow); RepeatMasker; Human/Mouse/Rat RP Scores (Kolbe et al. model); ESPERR Regulatory Potential (7 species)]

Types of signals captured by RP

Decomposing RP signals
▪ PCA applied to word frequencies in the training data indicates that much of the variability is explained by a few dominant components, but a substantial amount is also spread over many weaker components
[Figure: variance explained by principal components 1 to 25]

Decomposing RP signals
▪ Two likely signals (GC content and conservation) account for ~68% of the variability (F is the residual)
▪ The strongest component that has high correlation with RP also has high correlation with conservation and GC content
▪ RP also shows substantial correlation with many of the weaker components, which are less exclusively dominated by the strong conservation and GC content signals
[Figure: variance explained by principal components 1 to 25 (top) and correlation of RP, GC, conservation, and F with each principal component (bottom)]

Decomposing RP signals
▪ Correlation of individual words with RP component signals
▪ GC and conservation: a few strong words
▪ F: fewer outliers, more variety of words
[Figure: distributions of the correlation of word frequency with each signal (Conservation, GC, F)]
Most highly correlated words:
http://www.bx.psu.edu/~james/rp_2006/logos_7way/high_cons.html
http://www.bx.psu.edu/~james/rp_2006/logos_7way/high_gc.html
http://www.bx.psu.edu/~james/rp_2006/logos_7way/high_F.ht

RP weak signals and distal elements
▪ Distal elements using ENCODE data
  ▪ Defined by a sequence-specific ChIP-chip hit
  ▪ Supported by secondary evidence
  ▪ More than 2.5 kb away from a TSS
▪ Elements with high GC / conservation are associated with predicted promoters and other promoter-like characteristics
▪ Elements with high F show much less association with these characteristics: they are more likely to be distal

ENCODE DNaseI hypersensitive sites

Discriminating DNaseI hypersensitive sites
▪ DNaseI hypersensitivity is a reliable marker for regulatory function
▪ ENCODE comprehensively tested 1% of the genome for this feature
▪ We derive from their data a stringent set of positive and negative examples

Discriminating DNaseI hypersensitive sites
▪ Previous work (Noble et al. 2005) found that a linear SVM trained on word frequencies could discriminate such sites; however, on this dataset their approach achieves only 60% success (leave-one-out cross validation)
▪ ESPERR identifies an 18-symbol encoding that achieves a success rate of ~80%

Highly conserved developmental enhancers

Example: Developmental Enhancers
▪ VISTA Enhancer Browser: 253 highly conserved regions tested for consistent developmental enhancer activity, for example:

Example: Developmental Enhancers
▪ Training set containing 108 validated elements (~143 kb) and 138 non-validated (~165 kb); all tested regions are highly conserved
▪ Alignments of human, mouse, opossum, chicken, frog, zebrafish and pufferfish
▪ ESPERR identifies an encoding to 15 symbols that achieves ~83% cross validation success rate

Promoter prediction with ESPERR

Promoter activity for 1% of the human genome
▪ Cooper et al. (2006) tested promoter activity at the 5’ ends of aligned cDNAs in the ENCODE regions
▪ Most regions were tested in 16 cell lines; from these we derive three training sets:
  ▪ “Ubiquitous”: positive in all 16 cell lines (106)
  ▪ “Specific”: positive in 1 to 5 cell lines (130)
  ▪ “Negative”: negative in all 16 cell lines (123)

Various signals discriminate promoters
[Figure: distributions for the Negative, Specific, and Ubiquitous sets of four quantities: GC content, CpG density, non-coding MCS overlap, and non-coding average phastCons]

ESPERR compared to various signals
▪ We evaluated the ability to discriminate each pair of training sets using:
  ▪ ESPERR with five-species alignments (human, chimpanzee, mouse, rat, dog)
  ▪ Naive Bayes classification using four other quantities: GC content, CpG content, MCS overlap, and phastCons
▪ Leave-one-out cross validation used for all cases
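The evaluation protocol above can be sketched with a small Gaussian naive Bayes classifier and a leave-one-out loop. This is an illustration under assumptions: the slide does not say which naive Bayes variant was used, the function names are hypothetical, and the feature values are made up.

```python
import math

def fit(rows, labels):
    """Per-class prior plus per-feature mean and variance (Gaussian naive Bayes)."""
    model = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        stats = []
        for values in zip(*members):
            mean = sum(values) / len(values)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in values) / len(values) + 1e-6
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model

def predict(model, row):
    """Return the class with the highest log posterior."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

def leave_one_out(rows, labels):
    """Fraction of examples classified correctly when each is held out in turn."""
    hits = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        hits += predict(fit(train_rows, train_labels), rows[i]) == labels[i]
    return hits / len(rows)

# Made-up (GC content, CpG density) values, for illustration only.
rows = [(0.65, 0.12), (0.70, 0.10), (0.68, 0.11),
        (0.40, 0.02), (0.45, 0.03), (0.42, 0.01)]
labels = ["ubiquitous"] * 3 + ["negative"] * 3
rate = leave_one_out(rows, labels)
```

Leave-one-out is a natural fit for training sets of this size (roughly a hundred elements per class), since every element is used for testing exactly once.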

[…] alignments of chimpanzee (panTro…), mouse (mm…), rat (rn…), and dog (canFam…) to the human genome. Positions overlapping coding sequences were eliminated from the training data. We allowed any column in the training data with at most two missing species to be used. Handling missing data in this way allowed us to cover most potential promoter regions; however, a small number of training regions […]

Datasets                  phastCons   MCS overlap   GC       CpG      ESPERR
Ubiquitous vs. Negative   54.15%      61.14%        80.79%   90.83%   96.31%
Ubiquitous vs. Specific   46.19%      53.81%        64.41%   90.68%   96.21%
Specific vs. Negative     52.96%      60.08%        63.24%   58.50%   83.81%

Table: Pair-wise classification success rates using quantities computed from genomic sequence (GC content and CpG density), alignments (phastCons and MCS overlap), and ESPERR.

Extending ESPERR to multi-way classification
▪ The ESPERR procedure is classifier agnostic: all that is needed is to change the underlying classifier
▪ Rather than a log-odds score, we train a VOMM for each class and assign elements to the model under which they have the highest probability
▪ Figure of merit remains the fraction of elements correctly classified under cross validation
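The "one model per class, pick the most probable" rule can be sketched as below. To keep the example short it substitutes a first-order Markov model with add-one smoothing for the variable order Markov model (VOMM) named on the slide; the function names and the toy encoded sequences are hypothetical.

```python
import math
from collections import defaultdict

def train(sequences, alphabet, order=1):
    """Count context -> next-symbol transitions over encoded sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[tuple(seq[i - order:i])][seq[i]] += 1
    return {"counts": counts, "alphabet": alphabet, "order": order}

def log_prob(model, seq):
    """Log probability of seq with add-one smoothing over the alphabet."""
    order, size = model["order"], len(model["alphabet"])
    total = 0.0
    for i in range(order, len(seq)):
        context = model["counts"].get(tuple(seq[i - order:i]), {})
        seen = context.get(seq[i], 0)
        total += math.log((seen + 1) / (sum(context.values()) + size))
    return total

def classify(models, seq):
    """Assign seq to the class whose model gives it the highest probability."""
    return max(models, key=lambda name: log_prob(models[name], seq))

alphabet = [0, 1, 2]
models = {
    "ubiquitous": train([[0, 1, 0, 1, 0, 1]], alphabet),
    "specific": train([[1, 2, 1, 2, 1, 2]], alphabet),
    "negative": train([[2, 2, 2, 2, 2, 2]], alphabet),
}
label = classify(models, [0, 1, 0, 1])
```

With two classes, comparing the two log probabilities is equivalent to thresholding a log-odds score at zero, so this rule generalizes the pairwise case.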

Other multi-way classification approaches
▪ We compare ESPERR with several other multi-way classifiers using various combinations of signals:
  ▪ LDA (linear discriminant analysis)
  ▪ Classification trees (RPART)
  ▪ SVM (various kernels)

[…] of our predictions are within […] bp of a RefSeq annotated start site. Predicted specific promoters coincide less frequently ([…]), consistent with lower quality […]

Method (predictors)            Performance
LDA (MCS)                      39.83%
LDA (phastCons)                34.09%
LDA (GC)                       48.60%
LDA (GC, CpG)                  65.46%
LDA (MCS, GC, CpG)             66.85%
LDA (phastCons, GC, CpG)       65.06%
Tree (GC, CpG)                 57.94%
Tree (phastCons, GC, CpG)      63.07%
Tree (MCS, GC, CpG)            63.23%
SVM (MCS, GC, CpG)             63.83%
ESPERR                         80.98%

Table: Multi-way classification success rates using several machine learning methods and predictors: linear discriminant analysis (LDA), classification trees (Tree), support vector machines (SVM), and ESPERR.

Promoter prediction genome wide
▪ Following the same approach as Cooper et al. (based on 5’ ends of cDNA alignments) we identified 79,616 potential promoter regions genome wide
▪ Using the encoding determined by ESPERR we predicted whether each potential promoter would be ubiquitously expressed (~19k), specifically expressed (~23k), or not active (~29k)
▪ We find ~6,000 gene models (clusters of overlapping cDNA alignments) with both a specific and a ubiquitous promoter

ZFPM1 (encodes FOG-1)
Specific promoter in CpG island, correctly classified
[Figure: UCSC genome browser snapshot of promoter predictions in the neighborhood of ZFPM1 (chr16:87,050,000-87,120,000). Tracks: predicted widely expressed promoters (ubiq_proms), predicted tissue specific promoters (spec_proms), predicted non-promoter 5-prime ends (neg_proms), possible promoters not tested due to lack of alignments (na_proms), UCSC Known Genes, human mRNAs from GenBank, GNF Expression Atlas 2, CpG Islands, 5-Way Regulatory Potential (human, chimp, dog, mouse, rat)]

POU2F1 (encodes OCT1)
Ubiquitous and specific promoters
[Figure: UCSC genome browser snapshot of promoter predictions in the neighborhood of POU2F1 (chr1:163,950,000-164,100,000). Tracks: predicted widely expressed promoters (ubiq_proms), predicted tissue specific promoters (spec_proms), UCSC Known Genes, human mRNAs from GenBank, GNF Expression Atlas 2, CpG Islands, 5-Way Regulatory Potential (human, chimp, dog, mouse, rat)]

Conclusions and future directions

ESPERR
▪ ESPERR effectively identifies encodings with good performance on a variety of problems
▪ Can capture combinations of many signals
▪ Can be used with different underlying classifiers for pairwise and multi-way classification

Future Directions
▪ Integrating other sources of data, particularly high throughput experimental assays for protein binding (like ChIP-chip)
▪ Interpreting encodings and understanding the specific signals captured for a given problem
▪ Better modeling of indels
▪ Moving RP beyond mammals
▪ Other classifiers... gene prediction

Acknowledgments
▪ My co-authors:
  ▪ “ESPERR: Learning strong and weak signals in genomic sequence alignments to identify functional elements”, written with Svitlana Tyekucheva, David King, Ross Hardison, Webb Miller and Francesca Chiaromonte
  ▪ “Leveraging ENCODE data to predict widely expressed and tissue-specific transcriptional promoters in the human genome”, written with Nathan Trinklein, Ross Hardison, Webb Miller and Francesca Chiaromonte
▪ ENCODE consortium, the CCGB
