Slide 1

Discriminative Embeddings of Latent Variable Models for Structured Data
Hanjun Dai, Bo Dai, Le Song

Presentation by Breandan Considine
McGill University
[email protected]
March 12, 2020

Slide 2

What is a kernel?

A feature map transforms the input space to a feature space:

    \varphi : \text{Input space } \mathbb{R}^n \to \text{Feature space } \mathbb{R}^m    (1)

A kernel function k is a real-valued function of two inputs:

    k : \Omega \times \Omega \to \mathbb{R}    (2)

Kernel functions generalize the notion of inner products to feature maps:

    k(x, y) = \varphi(x)^\top \varphi(y)    (3)

This gives us \varphi(x)^\top \varphi(y) without directly computing \varphi(x) or \varphi(y).

Slide 3

What is a kernel?

Consider the univariate polynomial regression algorithm:

    \hat{f}(x; \beta) = \beta^\top \varphi(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_m x^m = \sum_{j=0}^{m} \beta_j x^j    (4)

where \varphi(x) = [1, x, x^2, x^3, \ldots, x^m]. We seek \beta minimizing the squared error:

    \beta^* = \operatorname{argmin}_\beta \|Y - \hat{f}(X; \beta)\|^2    (5)

We can solve for \beta^* using the normal equation or gradient descent:

    \beta^* = (X^\top X)^{-1} X^\top Y    (6)
    \beta \leftarrow \beta - \alpha \nabla_\beta \|Y - \hat{f}(X; \beta)\|^2    (7)

What happens if we want to approximate a multivariate polynomial?

    z(x, y) = 1 + \beta_x x + \beta_y y + \beta_{xy} xy + \beta_{x^2} x^2 + \beta_{y^2} y^2 + \beta_{xy^2} xy^2 + \ldots    (8)
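
To make (4)–(6) concrete, here is a minimal sketch of polynomial regression solved with the normal equation; the data, degree, and variable names are illustrative assumptions, not from the slides.

```python
import numpy as np

def phi(x, m):
    # Feature map phi(x) = [1, x, x^2, ..., x^m], applied row-wise.
    return np.vander(x, m + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 1 - 2 * x + 3 * x**3 + 0.1 * rng.standard_normal(50)  # noisy cubic

X = phi(x, 3)
beta = np.linalg.solve(X.T @ X, X.T @ y)  # normal equation (6)
print(np.round(beta, 1))  # roughly [1, -2, 0, 3]
```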

Slide 4

What is a kernel?

Consider the polynomial kernel k(x, y) = (1 + x^\top y)^2 with x, y \in \mathbb{R}^2:

    k(x, y) = (1 + x^\top y)^2 = (1 + x_1 y_1 + x_2 y_2)^2    (9)
            = 1 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2    (10)

This gives us the same result as computing the 6-dimensional feature map:

    k(x, y) = \varphi(x)^\top \varphi(y)    (11)
            = [1, x_1^2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2] \, [1, y_1^2, y_2^2, \sqrt{2} y_1, \sqrt{2} y_2, \sqrt{2} y_1 y_2]^\top    (12)

but does not require computing \varphi(x) or \varphi(y).
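
The identity in (9)–(12) is easy to verify numerically; a quick sketch (the test vectors are arbitrary):

```python
import numpy as np

def k(x, y):
    # Polynomial kernel (9): no feature map is ever materialized.
    return (1 + x @ y) ** 2

def phi(v):
    # Explicit 6-dimensional feature map from (12).
    s = np.sqrt(2)
    return np.array([1, v[0]**2, v[1]**2, s*v[0], s*v[1], s*v[0]*v[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.isclose(k(x, y), phi(x) @ phi(y)))  # True: both equal 4.0
```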

Slide 5

Examples of common kernels

Popular kernels:

    Polynomial:    k(x, y) := (x^\top y + r)^n,                  x, y \in \mathbb{R}^d, n \in \mathbb{N}, r \geq 0
    Laplacian:     k(x, y) := \exp(-\|x - y\| / \sigma),         x, y \in \mathbb{R}^d, \sigma > 0
    Gaussian RBF:  k(x, y) := \exp(-\|x - y\|^2 / (2\sigma^2)),  x, y \in \mathbb{R}^d, \sigma > 0

Popular graph kernels:

    Random walk (RW):        k_\times(G, H) := \sum_{i,j=1}^{|V_\times|} \big[\sum_{n=1}^{\infty} \lambda^n A_\times^n\big]_{ij} = e^\top (I - \lambda A_\times)^{-1} e    O(n^6)
    Shortest path (SP):      k_{SP}(G, H) := \sum_{s_1 \in SD(G)} \sum_{s_2 \in SD(H)} k(s_1, s_2)    O(n^4)
    Weisfeiler-Lehman (WL):  l^{(i)}(v) := \deg(v) for i = 1, and HASH(\{\{l^{(i-1)}(u) : u \in N(v)\}\}) for i > 1;
                             k_{WL}(G, H) := \langle \psi_{WL}(G), \psi_{WL}(H) \rangle    O(hm)

https://people.mpi-inf.mpg.de/~mehlhorn/ftp/genWLpaper.pdf
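
The three vector kernels in the table translate directly to code; a minimal sketch (default hyperparameters are arbitrary):

```python
import numpy as np

def polynomial(x, y, r=1.0, n=2):
    return (x @ y + r) ** n  # (x^T y + r)^n

def laplacian(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / sigma)

def gaussian_rbf(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y)**2 / (2 * sigma**2))
```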

Slide 6

Positive definite kernels

Positive definite matrix: A symmetric matrix K \in \mathbb{R}^{N \times N} is positive definite if x^\top K x > 0 for all x \in \mathbb{R}^N \setminus \{0\}.

Positive definite kernel: A symmetric kernel k is called positive definite on \Omega if its associated kernel matrix K = [k(x_i, x_j)]_{i,j=1}^{N} is positive definite for all N \in \mathbb{N} and all \{x_1, \ldots, x_N\} \subset \Omega.

http://www.math.iit.edu/~fass/PDKernels.pdf
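
One way to build intuition is to construct a kernel matrix from sample points and inspect its spectrum; a sketch with the Gaussian RBF kernel (the sample and σ are arbitrary, and this is a spot check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))  # 20 distinct points in R^3

sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq / 2)  # Gaussian RBF kernel matrix, sigma = 1

# For distinct points the RBF kernel matrix is strictly positive definite,
# so every eigenvalue should be positive (possibly tiny) up to float error.
print(np.linalg.eigvalsh(K).min() > 0)
```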

Slide 7

What is an inner product space?

Linear function: Let X be a vector space over \mathbb{R}. A function f : X \to \mathbb{R} is linear iff f(\alpha x) = \alpha f(x) and f(x + z) = f(x) + f(z) for all \alpha \in \mathbb{R}, x, z \in X.

Inner product space: X is an inner product space if there exists a symmetric bilinear map \langle \cdot, \cdot \rangle : X \times X \to \mathbb{R} such that \langle x, x \rangle > 0 for all x \neq 0 (i.e. the map is positive definite).

Cauchy-Schwarz inequality: If X is an inner product space, then \forall u, v \in X, |\langle u, v \rangle|^2 \leq \langle u, u \rangle \cdot \langle v, v \rangle.

Examples:

    Scalar product:      \langle x, y \rangle := xy
    Vector dot product:  \langle x, y \rangle := x^\top y
    Random variables:    \langle X, Y \rangle := E(XY)
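
Cauchy-Schwarz can be spot-checked for the dot product on random vectors (a sketch; the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    # |<u, v>|^2 <= <u, u> <v, v>, with a little slack for float error.
    assert (u @ v)**2 <= (u @ u) * (v @ v) + 1e-9
```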

Slide 8

What is a Hilbert space?

Let d : X \times X \to \mathbb{R}_{\geq 0} be a metric on the space X.

Cauchy sequence: A sequence \{x_n\} is called a Cauchy sequence if \forall \varepsilon > 0, \exists N \in \mathbb{N} such that \forall n, m \geq N, d(x_n, x_m) \leq \varepsilon.

Completeness: X is called complete if every Cauchy sequence converges to a point in X.

Separability: X is called separable if there exists a sequence \{x_n\}_{n=1}^{\infty} \subset X such that every nonempty open subset of X contains at least one element of the sequence.

Hilbert space: A Hilbert space H is an inner product space that is complete and separable.

Slide 9

Properties of Hilbert Spaces

Hilbert space inner products are kernels: The inner product \langle \cdot, \cdot \rangle_H : H \times H \to \mathbb{R} is a positive definite kernel, since for any c_1, \ldots, c_n \in \mathbb{R} and x_1, \ldots, x_n \in H:

    \sum_{i,j=1}^{n} c_i c_j \langle x_i, x_j \rangle_H = \Big\langle \sum_{i=1}^{n} c_i x_i, \sum_{j=1}^{n} c_j x_j \Big\rangle_H = \Big\| \sum_{i=1}^{n} c_i x_i \Big\|_H^2 \geq 0

Reproducing Kernel Hilbert Space (RKHS): Any continuous, symmetric, positive definite kernel k : X \times X \to \mathbb{R} has a corresponding Hilbert space, which induces a feature map \varphi : X \to H satisfying k(x, y) = \langle \varphi(x), \varphi(y) \rangle_H.

http://jmlr.csail.mit.edu/papers/volume11/vishwanathan10a/vishwanathan10a.pdf
https://marcocuturi.net/Papers/pdk_in_ml.pdf

Slide 10

Hilbert Space Embedding of Distributions

Maps distributions into potentially infinite-dimensional feature spaces:

    \mu_X := E_X[\varphi(X)] = \int_X \varphi(x) p(x) \, dx : \mathcal{P} \to \mathcal{F}    (13)

By choosing the right kernel, we can make this mapping injective. Then functions and operators on distributions have counterparts acting directly on their embeddings:

    f(p(x)) = \tilde{f}(\mu_X),            f : \mathcal{P} \to \mathbb{R}             (14)
    T \circ p(x) = \tilde{T} \circ \mu_X,  \tilde{T} : \mathcal{F} \to \mathbb{R}^d    (15)
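
In practice \mu_X is estimated by the empirical mean (1/n) \sum_i \varphi(x_i), and the distance between two embeddings (the maximum mean discrepancy, not named on the slide) is computable from kernel evaluations alone; a minimal sketch with an RBF kernel (sample sizes and σ are arbitrary):

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    # Gaussian RBF kernel matrix between two sample sets.
    d = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    # Biased estimate of the squared distance between mean embeddings:
    # ||mu_X - mu_Y||^2 = E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')].
    return rbf(X, X, sigma).mean() - 2 * rbf(X, Y, sigma).mean() + rbf(Y, Y, sigma).mean()

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 1))
B = rng.standard_normal((200, 1))        # same distribution as A
C = rng.standard_normal((200, 1)) + 2.0  # shifted distribution
print(mmd2(A, B) < mmd2(A, C))  # True: A and C embed farther apart
```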

Slide 12

Belief Networks

A belief network is a distribution of the form:

    P(x_1, \ldots, x_D) = \prod_{i=1}^{D} P(x_i \mid \mathrm{pa}(x_i))    (19)

Two three-node examples:

    Collider x \to z \leftarrow y:      P(X, Y \mid Z) \propto P(Z \mid X, Y) P(X) P(Y)
    Common cause x \leftarrow z \to y:  P(X, Y \mid Z) = P(X \mid Z) P(Y \mid Z)
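
The factorization (19) is mechanical to evaluate; a minimal sketch for the common-cause network z → x, z → y with made-up conditional probability tables:

```python
# P(z), P(x | z), P(y | z) for binary variables; all numbers are illustrative.
P_z = {0: 0.6, 1: 0.4}
P_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # P_x[z][x]
P_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}  # P_y[z][y]

def joint(z, x, y):
    # Factorization (19): each factor conditions only on the node's parents.
    return P_z[z] * P_x[z][x] * P_y[z][y]

# Sanity check: the joint sums to 1 over all eight assignments.
total = sum(joint(z, x, y) for z in (0, 1) for x in (0, 1) for y in (0, 1))
print(abs(total - 1.0) < 1e-12)  # True
```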

Slide 13

Latent Variable Models

Slide 14

Embedded mean field
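
As a rough sketch of the idea (the ReLU nonlinearity, weight shapes, and fixed iteration count below are simplifications of the parameterization in Dai et al.), structure2vec's embedded mean field recomputes each node's embedding from its own features and its neighbors' current embeddings:

```python
import numpy as np

def embedded_mean_field(adj, feats, W1, W2, T=4):
    # adj: (n, n) adjacency matrix; feats: (n, d) node features;
    # W1: (p, d), W2: (p, p) learned weights; T: fixed-point iterations.
    mu = np.zeros((adj.shape[0], W1.shape[0]))
    for _ in range(T):
        # mu_i <- relu(W1 x_i + W2 * sum_{j in N(i)} mu_j)
        mu = np.maximum(0, feats @ W1.T + adj @ mu @ W2.T)
    return mu  # per-node embeddings; pool (e.g. sum) for a graph embedding
```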

Slide 15

Embedded loopy belief propagation

Slide 16

Discriminative Embedding

Slide 17

Graph Dataset Results

Slide 18

Harvard Clean Energy Project (CEP)

Slide 19

CEP Results

Slide 20

Resources

Dai et al., Discriminative Embeddings of Latent Variable Models
Cristianini and Shawe-Taylor, Kernel Methods for Pattern Analysis
Kriege et al., Survey on Graph Kernels
Panangaden, Notes on Metric Spaces
Fasshauer, Positive Definite Kernels: Past, Present and Future
Cuturi, Positive Definite Kernels in Machine Learning
Gormley and Eisner, Structured Belief Propagation for NLP
Forsyth, Mean Field Inference
Tseng, Probabilistic Graphical Models
Görtler et al., A Visual Exploration of Gaussian Processes