Latent Variable Modelling with Hyperbolic Normalizing Flows
Avishek Joey Bose, Ariella Smofsky, Renjie Liao, Prakash Panangaden, William L. Hamilton. ICML 2020 (arXiv v3, 16 Jun 2020).

Abstract: The choice of approximate posterior distributions plays a central role in stochastic variational inference (SVI). One effective solution is the use of normalizing flows to construct flexible posterior distributions. However, one key limitation of existing normalizing flows is that they are restricted to Euclidean space and are ill-equipped to model data with an underlying hierarchical structure. To address this fundamental limitation, we present the first extension of normalizing flows to hyperbolic spaces. We first elevate normalizing flows to hyperbolic spaces using coupling transforms defined on the tangent bundle, termed Tangent Coupling (TC). We further introduce Wrapped Hyperboloid Coupling (WHC), a fully invertible and learnable transformation that explicitly utilizes the geometric structure of hyperbolic spaces, allowing for expressive posteriors while being efficient to sample from. We demonstrate the efficacy of our novel normalizing flow over hyperbolic spaces [...]

Figure 1. The shortest path between a given pair of node embeddings in R^2 and in hyperbolic space, as modelled by the Lorentz model H^2_K and the Poincaré disk P^2_K. Unlike in Euclidean space, distances between points grow exponentially as you move away from the origin in hyperbolic space, so the shortest path between two points passes through a common parent node, i.e. the origin, giving rise to hierarchical and tree-like structure.
Flow-based models provide a general way of constructing flexible probability distributions over continuous random variables. Let x be a D-dimensional real vector, and suppose we would like to define a joint distribution over x. The main idea of flow-based modeling is to express x as a transformation T of a real vector u sampled from p_u(u):

    x = T(u)  where  u ~ p_u(u).    (1)

We refer to p_u(u) as the base distribution of the flow-based model.^1 The transformation T and the base distribution p_u(u) can have parameters of their own (denote them as φ and ψ respectively); this induces a family of distributions over x parameterized by {φ, ψ}.

The defining property of flow-based models is that the transformation T must be invertible and both T and T^{-1} must be differentiable. Such transformations are known as diffeomorphisms and require that u be D-dimensional as well (Milnor and Weaver, 1997). Under these conditions, the density of x is well-defined and can be obtained by a change of variables (Rudin, 2006; Bogachev, 2007):

    p_x(x) = p_u(u) |det J_T(u)|^{-1}  where  u = T^{-1}(x).    (2)

Equivalently, we can also write p_x(x) in terms of the Jacobian of T^{-1}:

    p_x(x) = p_u(T^{-1}(x)) |det J_{T^{-1}}(x)|.    (3)

The Jacobian J_T(u) is the D × D matrix of all partial derivatives of T, given by:

    J_T(u) = [ ∂T_1/∂u_1 ... ∂T_1/∂u_D
                   ...
               ∂T_D/∂u_1 ... ∂T_D/∂u_D ].    (4)
In practice, we typically construct a flow-based model by implementing T (or T^{-1}) with a neural network and taking p_u(u) to be a simple density such as a multivariate normal.
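The change-of-variables formula above can be verified numerically. The sketch below (our own toy example, not from the text) uses an invertible affine map T(u) = A u + b with a standard-normal base, evaluating log p_x(x) via eq. (3):

```python
import numpy as np

# Toy check of the change-of-variables formula for T(u) = A u + b
# with a standard-normal base density p_u. A, b, and x are arbitrary.
rng = np.random.default_rng(0)
D = 3
A = np.tril(rng.normal(size=(D, D))) + 2.0 * np.eye(D)  # invertible by construction
b = rng.normal(size=D)

def log_pu(u):
    # log density of a D-dimensional standard normal
    return -0.5 * (u @ u + D * np.log(2 * np.pi))

def log_px(x):
    # eq. (3): log p_x(x) = log p_u(T^{-1}(x)) + log|det J_{T^{-1}}(x)|
    u = np.linalg.solve(A, x - b)        # T^{-1}(x)
    _, logdet = np.linalg.slogdet(A)     # |det J_T| is constant for an affine map
    return log_pu(u) - logdet

x = rng.normal(size=D)
print(log_px(x))
```

Since x = A u + b with u standard normal, p_x is the Gaussian N(b, A A^T), which gives an independent way to check the value.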
Intuitively, |det J_T(u)| quantifies the relative change of volume of a small neighbourhood around u due to T. Roughly speaking, if an infinitesimally small neighbourhood du around u is mapped to an infinitesimally small neighbourhood dx around x = T(u), then |det J_T(u)| is equal to the volume of dx divided by the volume of du. Since the probability mass in dx must equal the probability mass in du, the density at x is smaller than the density at u if du is expanded, and is larger if du is contracted.

An important property of invertible and differentiable transformations is that they are composable. Given two such transformations T_1 and T_2, their composition T_2 ∘ T_1 is also invertible and differentiable. Its inverse and Jacobian determinant are given by:

    (T_2 ∘ T_1)^{-1} = T_1^{-1} ∘ T_2^{-1}    (5)
    det J_{T_2 ∘ T_1}(u) = det J_{T_2}(T_1(u)) · det J_{T_1}(u).    (6)

In consequence, we can build complex transformations by composing multiple instances of simpler transformations, without compromising the requirements of invertibility and differentiability, and hence without losing the ability to calculate the density p_x(x).

1. Some papers refer to p_u(u) as the 'prior' and to u as the 'latent variable'. We believe that this terminology is not as well-suited for normalizing flows as it is for latent-variable models. Upon observing x, the corresponding u = T^{-1}(x) is uniquely determined and thus no longer 'latent'.
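The composition rules in eqs. (5) and (6) are easy to check numerically on a pair of scalar maps. Here we pick two toy invertible functions of our own choosing, T_1(u) = u^3 and T_2(z) = exp(z):

```python
import numpy as np

# Toy 1-D maps (our own choices) illustrating eqs. (5)-(6).
T1 = lambda u: u**3
J1 = lambda u: 3 * u**2          # det J_{T1} in 1-D is just the derivative
T2 = lambda z: np.exp(z)
J2 = lambda z: np.exp(z)

u = 1.3
comp = lambda u: T2(T1(u))

# eq. (6): the Jacobian determinants multiply along the composition
det_via_chain = J2(T1(u)) * J1(u)
eps = 1e-6
det_numeric = (comp(u + eps) - comp(u - eps)) / (2 * eps)
print(det_via_chain, det_numeric)

# eq. (5): the inverse of the composition applies the inverses in reverse order
inv = lambda x: np.cbrt(np.log(x))   # T1^{-1} ∘ T2^{-1}
print(inv(comp(u)))
```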
Figure 1: Example of a 4-step flow transforming samples from a standard-normal base density to a cross-shaped target density.

In practice, it is common to chain together multiple transformations T_1, ..., T_K to obtain T = T_K ∘ ··· ∘ T_1, where each T_k transforms z_{k-1} into z_k, assuming z_0 = u and z_K = x. Hence, the term 'flow' refers to the trajectory that a collection of samples from p_u(u) follow as they are gradually transformed by the sequence of transformations T_1, ..., T_K. The term 'normalizing' refers to the fact that the inverse flow through T_K^{-1}, ..., T_1^{-1} takes a collection of samples from p_x(x) and transforms them (in a sense, 'normalizes' them) into a collection of samples from a prescribed density p_u(u) (which is often taken to be a multivariate normal). Figure 1 illustrates a flow (K = 4) transforming a standard-normal base distribution to a cross-shaped target density.

In terms of functionality, a flow-based model provides two operations: sampling from the model using eq. 1, and evaluating the model's density using eq. 3. These two operations have different computational requirements. Sampling from the model requires the ability to sample from p_u(u) and to compute the forward transformation T; evaluating the density requires the inverse transformation T^{-1} and its Jacobian determinant.

• 'Flow': the trajectory along which samples change under the successive transformations of variables.
• 'Normalizing': the inverse flow maps a complex distribution back to the base distribution; a Gaussian is commonly used as the base.
Using flows for modeling and inference

Similarly to fitting any probabilistic model, fitting a flow-based model p_x(x; θ) to a target distribution p*_x(x) can be done by minimizing some divergence or discrepancy between them. This minimization is performed with respect to the model's parameters θ = {φ, ψ}, where φ are the parameters of T and ψ are the parameters of p_u(u). In the following sections, we discuss a number of divergences for fitting flow-based models, with a particular focus on the Kullback-Leibler (KL) divergence as it is one of the most popular choices.

3.1 Forward KL divergence and maximum likelihood estimation

The forward KL divergence between the target distribution p*_x(x) and the flow-based model p_x(x; θ) can be written as follows:

    L(θ) = D_KL[ p*_x(x) ‖ p_x(x; θ) ]
         = -E_{p*_x(x)}[ log p_x(x; θ) ] + const.
         = -E_{p*_x(x)}[ log p_u(T^{-1}(x; φ); ψ) + log |det J_{T^{-1}}(x; φ)| ] + const.    (12)

The forward KL divergence is well-suited for situations in which we have samples from the target distribution (or the ability to generate them), but we cannot necessarily evaluate the target density p*_x(x). Assuming we have a set of samples {x_n}_{n=1}^N from p*_x(x), we can estimate the expectation over p*_x(x) by Monte Carlo as follows:

    L(θ) ≈ -(1/N) Σ_{n=1}^N [ log p_u(T^{-1}(x_n; φ); ψ) + log |det J_{T^{-1}}(x_n; φ)| ] + const.    (13)

Minimizing the above Monte Carlo approximation of the KL divergence is equivalent to fitting the flow-based model to the samples {x_n}_{n=1}^N by maximum likelihood estimation.
In practice, we typically optimize the parameters θ iteratively with stochastic gradient-based methods.
We can obtain an unbiased estimate of the gradient of the KL divergence with respect to the parameters as follows:

    ∇_φ L(θ) ≈ -(1/N) Σ_{n=1}^N [ ∇_φ log p_u(T^{-1}(x_n; φ); ψ) + ∇_φ log |det J_{T^{-1}}(x_n; φ)| ]    (14)
    ∇_ψ L(θ) ≈ -(1/N) Σ_{n=1}^N ∇_ψ log p_u(T^{-1}(x_n; φ); ψ).    (15)

The update with respect to ψ may also be done in closed form if p_u(u; ψ) admits closed-form maximum likelihood estimates.
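The maximum-likelihood fitting described above can be sketched end to end in one dimension. The example below (entirely our own construction) fits a 1-D affine flow T(u) = s·u + m with a standard-normal base to data by gradient descent on the Monte Carlo negative log-likelihood of eq. (13); for this flow the MLE is simply the sample mean and standard deviation, which gives a check on the result:

```python
import numpy as np

# Minimal sketch of forward-KL / maximum-likelihood fitting for a 1-D
# affine flow. Data, learning rate, and iteration count are arbitrary.
rng = np.random.default_rng(1)
data = 2.0 + 0.5 * rng.normal(size=5000)   # samples from the "target"

m, log_s = 0.0, 0.0                         # flow parameters: shift, log-scale
lr = 0.05
for _ in range(5000):
    u = (data - m) * np.exp(-log_s)         # T^{-1}(x) = (x - m) / s
    # gradients of mean[0.5*u^2 + log_s] (the NLL up to a constant)
    grad_m = np.mean(-u) * np.exp(-log_s)
    grad_log_s = np.mean(-u**2) + 1.0
    m -= lr * grad_m
    log_s -= lr * grad_log_s

print(m, np.exp(log_s))   # should approach the sample mean and std
```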
Autoregressive flows

We saw that, under reasonable conditions, we can transform any distribution into a uniform distribution in (0, 1)^D using maps with a triangular Jacobian. Autoregressive flows are a direct implementation of this construction, specifying f to have the following form (as described by e.g. Huang et al., 2018; Jaini et al., 2019):

    z'_i = τ(z_i; h_i)  where  h_i = c_i(z_{<i}),    (28)

where τ is termed the transformer and c_i the i-th conditioner. The transformer is a strictly monotonic function of z_i (and therefore invertible), is parameterized by h_i, and specifies how the flow acts on z_i in order to output z'_i. The conditioner determines the parameters of the transformer, and in turn, can modify the transformer's behavior. The conditioner does not need to be a bijection. Its one constraint is that the i-th conditioner can take as input only variables with dimension indices less than i. The parameters φ of f are typically the parameters of the conditioner (not shown above for notational simplicity), but sometimes the transformer has its own parameters too (in addition to h_i).

It is easy to check that the above construction is invertible for any choice of τ and c_i as long as the transformer is invertible. Given z', we can compute z iteratively as follows:

    z_i = τ^{-1}(z'_i; h_i)  where  h_i = c_i(z_{<i}).    (29)

In the forward computation, each h_i and therefore each z'_i can be computed independently, in any order or in parallel. In the inverse computation, however, all z_{<i} need to have been computed before z_i, so that z_{<i} is available to the conditioner for computing h_i.

Since the derivative of z'_i with respect to z_j is zero whenever j > i, the Jacobian of f can be written in the following form:

    J_f(z) = [ ∂τ/∂z_1 (z_1; h_1)                 0
                         ...
               L(z)                ∂τ/∂z_D (z_D; h_D) ].    (30)
The Jacobian is a lower-triangular matrix whose diagonal elements are the derivatives of the transformer for each of the D elements of z. Since the determinant of any triangular matrix is equal to the product of its diagonal elements, the log-absolute-determinant of J_f(z) can be calculated in O(D) time as follows:

    log |det J_f(z)| = log | Π_{i=1}^D ∂τ/∂z_i (z_i; h_i) | = Σ_{i=1}^D log | ∂τ/∂z_i (z_i; h_i) |.    (31)

The lower-triangular part of the Jacobian (denoted here by L(z)) is irrelevant. The derivatives of the transformer can be computed either analytically or via automatic differentiation.
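The O(D) log-determinant of eq. (31) can be demonstrated with a small autoregressive map. The conditioner below is a toy of our own (scale and shift from the running sum of earlier dimensions), paired with an affine transformer; the fast log-determinant is cross-checked against a full numerical Jacobian:

```python
import numpy as np

D = 4
z = np.array([0.5, -1.0, 2.0, 0.3])

def conditioner(z_prefix):
    # toy c_i of our own choosing: h_i = (alpha_i, beta_i) from z_{<i}
    s = z_prefix.sum()
    return np.exp(0.1 * s), 0.5 * s      # alpha_i > 0 guarantees invertibility

def f(zz):
    out = np.empty(D)
    for i in range(D):
        a, b = conditioner(zz[:i])
        out[i] = a * zz[i] + b           # z'_i = tau(z_i; h_i)
    return out

# eq. (31): sum of log diagonal derivatives, computed in O(D)
alphas = np.array([conditioner(z[:i])[0] for i in range(D)])
log_det_fast = np.sum(np.log(np.abs(alphas)))

# cross-check against the full D x D numerical Jacobian (O(D^3) route)
eps = 1e-6
J = np.empty((D, D))
for j in range(D):
    e = np.zeros(D)
    e[j] = eps
    J[:, j] = (f(z + e) - f(z - e)) / (2 * eps)
print(log_det_fast, np.linalg.slogdet(J)[1])
```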
Transformer and conditioner

• The conditioner can be any suitable neural network, but a separate network per conditioner means far too many parameters; approaches such as sharing an RNN across conditioners have therefore been proposed.

In the following paragraphs, we will describe various choices for the implementation of the transformer and the conditioner, as well as discuss their expressivity and computational efficiency.

Affine autoregressive flows. A simple choice of transformer is the class of affine functions:

    τ(z_i; h_i) = α_i z_i + β_i  where  h_i = {α_i, β_i}.    (32)

This can be thought of as a location-scale transformation, where α_i controls the scale and β_i controls the location. Invertibility is guaranteed if α_i ≠ 0, and this can be easily achieved by e.g. taking α_i = exp(α̃_i), where α̃_i is an unconstrained parameter (in which case α_i > 0). Affine transformers are used in models such as inverse autoregressive flow (IAF) (Kingma et al., 2016), masked autoregressive flow (MAF) (Papamakarios et al., 2017), and Glow (Kingma and Dhariwal, 2018).

Non-affine neural transformers. Non-affine transformers can be constructed from simple components based on the observation that conic combinations as well as compositions of monotonic functions are also monotonic. Given monotonic functions τ_1, ..., τ_K of a real variable z, the following functions are also monotonic:

• Conic combination: τ(z) = Σ_{k=1}^K w_k τ_k(z), where w_k > 0 for all k.
• Composition: τ(z) = τ_K ∘ ··· ∘ τ_1(z).

For example, a non-affine neural transformer can be constructed using a conic combination of monotonically increasing activation functions σ(·) (such as the logistic sigmoid, tanh, leaky ReLU, and others):

    τ(z_i; h_i) = w_{i0} + Σ_{k=1}^K w_{ik} σ(α_{ik} z_i + β_{ik})  where  h_i = {w_{i0}, ..., w_{iK}, α_{ik}, β_{ik}},    (34)

provided α_{ik} > 0 and w_{ik} > 0 for all k ≥ 1. Clearly, the above construction corresponds to a monotonic single-layer perceptron. By repeatedly combining and composing monotonic activation functions, we can construct a multi-layer perceptron that is monotonic, provided that all its weights are strictly positive.
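The conic-combination transformer of eq. (34) can be sketched directly; the parameter values below are arbitrary illustrations. Because such a transformer has no analytic inverse, we also show inversion by bisection, exploiting monotonicity:

```python
import numpy as np

# A monotonic single-layer-perceptron transformer in the style of eq. (34),
# with arbitrary illustrative parameters (w_k > 0, alpha_k > 0).
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

w0 = -0.3
w = np.array([0.7, 1.2, 0.4])         # positive mixture weights
alpha = np.array([0.5, 2.0, 1.0])     # positive slopes
beta = np.array([0.0, -1.0, 2.0])

def tau(z):
    return w0 + np.sum(w * sigmoid(alpha * z + beta))

# monotonicity on a grid implies invertibility on that range
zs = np.linspace(-5, 5, 1001)
vals = np.array([tau(z) for z in zs])
print(bool(np.all(np.diff(vals) > 0)))

def tau_inv(y, lo=-50.0, hi=50.0, iters=100):
    # iterative inversion by bisection, as analytic inversion is unavailable
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tau(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(tau_inv(tau(0.7)))
```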
Non-affine neural transformers such as the above can represent any monotonic function arbitrarily well, which follows directly from the universal-approximation capabilities of multi-layer perceptrons (see e.g. Huang et al., 2018, for details). The derivatives of neural transformers needed for the computation of the Jacobian determinant are in principle analytically obtainable, but more commonly they are computed via backpropagation. A drawback of neural transformers is that in general they cannot be inverted analytically, and can be inverted only iteratively, e.g. using bisection search. Variants of non-affine neural transformers have been used in models such as neural autoregressive flow (NAF) (Huang et al., 2018).

Spline-based transformers are evaluated or inverted by first locating the relevant segment (which can be done in O(log K) time using binary search) and then evaluating or inverting that segment, which is assumed to be analytically tractable. By increasing the number of segments K, a spline-based transformer can be made arbitrarily flexible.

3.1.2 Implementing the conditioner

Moving on to the conditioner, c_i(z_{<i}) can be any function of z_{<i}, meaning that each conditioner can, in principle, be implemented as an arbitrary neural network with input z_{<i} and output h_i. However, a naïve implementation in which each c_i(z_{<i}) is a separate neural network would scale poorly with the dimensionality D, requiring D forward propagations of a vector of average size D/2, in addition to the cost of storing and learning the parameters of D independent networks. In fact, early work on flow precursors (Chen and Gopinath, 2000) dismissed the autoregressive approach as prohibitively expensive.

Nonetheless, this problem can be effectively addressed in practice by sharing parameters across the conditioners c_i(z_{<i}), or even by combining the conditioners into a single network. In the following paragraphs, we will discuss some practical implementations of the conditioner that allow it to scale to high dimensions.
Recurrent autoregressive flows. One way to share parameters across conditioners is by using a recurrent neural network (RNN). The i-th conditioner is implemented as:

    h_i = c(s_i)  where  s_1 = initial state,  s_i = RNN(z_{i-1}, s_{i-1}) for i > 1.    (38)
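A minimal hand-rolled version of eq. (38) makes the parameter sharing concrete: one set of RNN weights serves all D conditioners, at the cost of strictly sequential computation. All weights below are toy values of our own choosing:

```python
import numpy as np

# Sketch of an RNN conditioner (eq. 38) driving an affine transformer.
# Weights are arbitrary; a real model would learn them.
rng = np.random.default_rng(0)
D, H = 5, 8
Wz = rng.normal(size=(H,)) * 0.1      # input weights for scalar z_{i-1}
Ws = rng.normal(size=(H, H)) * 0.1    # recurrent weights
Wh = rng.normal(size=(2, H)) * 0.1    # readout: h_i = (log alpha_i, beta_i)

z = rng.normal(size=D)
s = np.zeros(H)                        # s_1 = initial state
z_out = np.empty(D)
for i in range(D):                     # note: inherently sequential in i
    if i > 0:
        s = np.tanh(Wz * z[i - 1] + Ws @ s)   # s_i = RNN(z_{i-1}, s_{i-1})
    log_a, b = Wh @ s                          # h_i = c(s_i)
    z_out[i] = np.exp(log_a) * z[i] + b        # affine transformer
print(z_out)
```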
Other flows

Residual flows. Consider a class of invertible transformations of the general form:

    z' = z + g_φ(z),    (54)

where g_φ is a neural network that outputs a D-dimensional translation vector, and φ are its parameters. This structure bears a strong similarity to residual networks, so we use the term residual flow to refer to a normalizing flow built from such transformations. Residual transformations are not always invertible, but can be made invertible if g_φ is constrained appropriately. In what follows, we discuss two general approaches to designing invertible residual transformations: the first is based on contractive maps, and the second is based on the matrix determinant lemma.

Contractive residual flows. The residual transformation is guaranteed to be invertible if g_φ can be made contractive with respect to some distance function (Behrmann et al., 2019; Chen et al., 2019).

Residual flows based on the matrix determinant lemma. The matrix determinant lemma states that, for an invertible D × D matrix A and D × M matrices V and W,

    det(A + V W^T) = det(I_M + W^T A^{-1} V) det(A).

If the determinant and inverse of A are tractable and M is less than D, the lemma can provide a computationally efficient way to compute the determinant of A + V W^T. For example, if A is diagonal, computing the left-hand side costs O(D³ + D²M), whereas computing the right-hand side costs O(M³ + DM²), which is preferable if M < D. In the context of flows, the matrix determinant lemma can be used to efficiently compute the Jacobian determinant. In this section, we will discuss examples of residual flows that are specifically designed such that application of the matrix determinant lemma leads to efficient Jacobian-determinant computation.

Planar flow. One early example is the planar flow (Rezende and Mohamed, 2015), where the function g_φ is a one-layer neural network with a single hidden unit:

    z' = z + v σ(w^T z + b).    (62)

The parameters of the planar flow are v ∈ R^D, w ∈ R^D and b ∈ R, and σ is a differentiable activation function such as the hyperbolic tangent. This flow can be interpreted as expanding/contracting the space in the direction perpendicular to the hyperplane w^T z + b = 0. The Jacobian of the transformation is given by:

    J_f(z) = I + σ'(w^T z + b) v w^T,    (63)

where σ' is the derivative of the activation function. The Jacobian has the form of the identity matrix plus a rank-1 update.
Using the matrix determinant lemma, the Jacobian determinant can be computed in O(D) time as follows:

    det J_f(z) = 1 + σ'(w^T z + b) w^T v.    (64)

In general, the planar flow is not invertible for all values of v and w. However, assuming that σ' is positive everywhere and bounded from above (which is the case if σ is the hyperbolic tangent, for example), a sufficient condition for invertibility is w^T v > -1 / sup_x σ'(x), since it ensures that det J_f(z) is non-zero everywhere.

Sylvester flows. van den Berg et al. (2018) proposed the parameterization V = QU and W = QL, where Q is a D × M matrix whose columns are an orthonormal set of vectors (this requires M ≤ D), U is M × M upper triangular, and L is M × M lower triangular. Since Q^T Q = I and the product of upper-triangular matrices is also upper triangular, the Jacobian determinant becomes:

    det J_f(z) = det( I + S(z) L^T U ) = Π_{i=1}^M (1 + S_ii(z) L_ii U_ii).    (68)

Similarly to planar flows, Sylvester flows are not invertible for all values of their parameters. Assuming σ' is positive everywhere and bounded from above, a sufficient condition for invertibility is L_ii U_ii > -1 / sup_x σ'(x) for all i ∈ {1, ..., M}, since it ensures that det J_f(z) is non-zero everywhere.

Radial flow. Radial flows (Tabak and Turner, 2013; Rezende and Mohamed, 2015) take the following form:

    z' = z + β/(α + r(z)) (z - z_0)  where  r(z) = ‖z - z_0‖.    (69)

The parameters of the flow are α ∈ (0, +∞), β ∈ R and z_0 ∈ R^D, and ‖·‖ is the Euclidean norm. The above transformation can be thought of as a radial contraction/expansion centered at z_0. The Jacobian can be written as follows:

    J_f(z) = (1 + β/(α + r(z))) I - β/(r(z)(α + r(z))²) (z - z_0)(z - z_0)^T,    (70)

which is a diagonal matrix plus a rank-1 update. Applying the matrix determinant lemma and rearranging, we get the following expression for the Jacobian determinant, which can be computed in O(D):

    det J_f(z) = (1 + αβ/(α + r(z))²) (1 + β/(α + r(z)))^{D-1}.    (71)

The radial flow is not invertible for all values of β. A sufficient condition for invertibility is β > -α, since it ensures that det J_f(z) is non-zero everywhere.
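The O(D) determinant of eq. (64) agrees exactly with a brute-force O(D³) determinant of the full Jacobian in eq. (63); the parameters below are arbitrary illustrations:

```python
import numpy as np

# Planar-flow Jacobian determinant: full O(D^3) route vs. the O(D)
# matrix-determinant-lemma shortcut. Parameters are arbitrary.
rng = np.random.default_rng(2)
D = 6
v = rng.normal(size=D)
w = rng.normal(size=D)
b = 0.3
z = rng.normal(size=D)

sig_p = lambda a: 1.0 - np.tanh(a) ** 2        # sigma' for sigma = tanh

# full Jacobian, eq. (63): I + sigma'(w^T z + b) v w^T
J = np.eye(D) + sig_p(w @ z + b) * np.outer(v, w)
det_full = np.linalg.det(J)                     # O(D^3)
det_lemma = 1.0 + sig_p(w @ z + b) * (w @ v)    # O(D), eq. (64)
print(det_full, det_lemma)
```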
In summary, planar, Sylvester and radial flows all have Jacobian determinants computable in O(D) time.

Continuous-time flows. So far we have considered constructing flows by parameterizing a one-step transformation, several of which are then composed to create a flow of K discrete steps. An alternative strategy is to construct flows in continuous time by parameterizing the flow's infinitesimal dynamics, and then integrating to find the corresponding transformation. In other words, we construct the flow by defining an ordinary differential equation that describes the flow's evolution in time. We call these 'continuous-time' flows, with 'time' referring to a real-valued scalar variable analogous to the number of discrete steps; we call this scalar 'time' as it determines how long the dynamics are run. In this section, we describe this class of continuous-time flows and summarize numerical methods for their implementation.

Let z_t denote the flow's state at time t (or 'step' t, thinking in the discrete setting). Time t is assumed to run continuously from t_0 to t_1, such that z_{t_0} = u and z_{t_1} = x. A continuous-time flow is constructed by parameterizing the time derivative of z_t with a neural network g_φ with parameters φ, yielding the following ordinary differential equation (ODE):

    dz_t/dt = g_φ(t, z_t).    (74)

The neural network g_φ takes as inputs both the time t and the flow's state z_t, and outputs the time derivative of z_t at time t. The only requirements for g_φ are that it be uniformly Lipschitz continuous in z_t (meaning that there is a single Lipschitz constant that works for all t) and continuous in t (Chen et al., 2018).
From Picard's existence theorem, it follows that satisfying these requirements ensures that the above ODE has a unique solution (Coddington and Levinson, 1955). Many neural-network layers meet these requirements (Gouk et al., 2018), and unlike the architectures described in section 3, which require careful structural assumptions to ensure invertibility and tractability of their Jacobian determinant, g_φ has no such requirements. To compute the transformation x = T(u), we need to run the dynamics forward in time by integrating:

    x = z_{t_1} = u + ∫_{t_0}^{t_1} g_φ(t, z_t) dt.    (75)
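The integral in eq. (75) is computed numerically in practice. The sketch below uses a plain Euler scheme on a toy linear vector field of our own choosing (a real implementation would use an adaptive ODE solver); for this field the exact solution is known, which makes the result easy to check:

```python
import numpy as np

# Euler integration of dz/dt = g(t, z) from t0 to t1, per eq. (75).
# g is a toy linear field; its exact flow is z(t1) = u * exp(-0.5*(t1-t0)).
def g(t, z):
    return -0.5 * z                      # contracts toward the origin

def integrate(u, t0=0.0, t1=1.0, steps=1000):
    z, dt = u.copy(), (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        z = z + dt * g(t, z)             # one Euler step
        t += dt
    return z

u = np.array([2.0, -1.0])
x = integrate(u)
print(x)                                 # ≈ u * exp(-0.5)
```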
Figure 1: Geodesics and distances in the Poincaré disk. As x and y move towards the outside of the disk (i.e., ‖x‖, ‖y‖ → 1), the distance d_H(x, y) approaches d_H(x, O) + d_H(O, y). (Curves: graph, hyperbolic, and Euclidean distance ratios.)

In the Poincaré disk, distances are not preserved, but are given by

    d_H(x, y) = acosh( 1 + 2 ‖x - y‖² / ((1 - ‖x‖²)(1 - ‖y‖²)) ).

There are some potentially unexpected consequences of this formula, and a simple example gives intuition about the technical property that allows hyperbolic space to embed trees. Consider three points: the origin 0, and points x and y with ‖x‖ = ‖y‖ = t for some t > 0. As shown on the right of Figure 1, as t → 1 (i.e., the points move to the outside of the disk), in flat Euclidean space the ratio d_E(x, y) / (d_E(x, 0) + d_E(0, y)) is constant with respect to t (blue curve). In contrast, the ratio d_H(x, y) / (d_H(x, 0) + d_H(0, y)) approaches 1, or, equivalently, the distance d_H(x, y) approaches d_H(x, 0) + d_H(0, y) (red and pink curves). That is, the shortest path between x and y is almost the same as the path through the origin. This is analogous to the property of trees, in which the shortest path between two sibling nodes is the path through their parent. This tree-like nature of hyperbolic space is the key property exploited by embeddings. Moreover, this property holds for arbitrarily small angles between x and y.

Lines and geodesics. There are two types of geodesics (shortest paths) in the Poincaré disk model of hyperbolic space: segments of circles that are orthogonal to the disk surface, and disk diameters [3].

• Embedding: d(x, y) ≈ d(x, O) + d(y, O). (Christopher De Sa et al., ICML)
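The distance formula and the limiting ratio described above can be reproduced directly. The angle and radii below are our own illustrative choices:

```python
import numpy as np

# Poincaré-disk distance, and the ratio d_H(x,y) / (d_H(x,0) + d_H(0,y))
# approaching 1 as the points move toward the boundary of the disk.
def d_poincare(x, y):
    num = 2 * np.sum((x - y) ** 2)
    den = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + num / den)

theta = 0.3                               # small fixed angle between x and y
for t in [0.5, 0.9, 0.99, 0.999]:
    x = t * np.array([1.0, 0.0])
    y = t * np.array([np.cos(theta), np.sin(theta)])
    o = np.zeros(2)
    ratio = d_poincare(x, y) / (d_poincare(x, o) + d_poincare(o, y))
    print(t, ratio)                       # ratio grows toward 1 with t
```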
Hyunghoon Cho, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139. hhcho@mit.edu
Benjamin DeMeo, Department of Biomedical Informatics, Harvard University, Cambridge, MA 02138. bdemeo@g.harvard.edu
Jian Peng, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801. jianpeng@illinois.edu
Bonnie Berger, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139. bab@mit.edu

Abstract: Representing data in hyperbolic space can effectively capture latent hierarchical relationships. With the goal of enabling accurate classification of points in hyperbolic space while respecting their hyperbolic geometry, we introduce hyperbolic SVM, a hyperbolic formulation of support vector machine classifiers, and elucidate through new theoretical work its connection to the Euclidean counterpart. We demonstrate the performance improvement of hyperbolic SVM for multi-class prediction tasks on real-world complex networks as well as simulated datasets.

arXiv:1806.00437v1 [cs.LG] 1 Jun 2018 (AISTATS)

• Ordinary soft-margin linear SVM
• Soft-margin linear SVM in hyperbolic space

For correct classifications, we increase our confidence, and for incorrect classifications, we penalize the error. Maximum-margin learning of the optimal decision rule h*, which provides the foundation for support vector machines, can now be formalized as

    h* = argmax_{h ∈ H} min_{j ∈ [m]} γ(h, (x^(j), y^(j))),    (8)

where H is the set of candidate decision rules that we consider. Let the data space X be R^n, let d be the Euclidean distance function, and consider only linear classifiers, i.e., H = {h(x; w) : w ∈ R^n}, where

    h(x; w) = 1 if w^T x > 0, and -1 otherwise.    (9)

It can be shown that the max-margin problem given in Eq.
8 becomes equivalent to solving the following convex optimization problem:

    minimize_{w ∈ R^n}  (1/2) ‖w‖²    (10)
    subject to  y^(j) (w^T x^(j)) ≥ 1,  ∀ j ∈ [m].    (11)

The resulting algorithm that solves this problem (via its dual) is known as support vector machines (SVM). Introducing a relaxation of the separability constraints gives the more commonly used soft-margin variant of SVM:

    minimize_{w ∈ R^n}  (1/2) ‖w‖² + C Σ_{j=1}^m max(0, 1 - y^(j) (w^T x^(j))).    (12)

The analogous hard-margin problem in hyperbolic space, with w ∗ x denoting the Minkowski inner product, is:

    minimize_{w ∈ R^{n+1}}  -(1/2) (w ∗ w)    (16)
    subject to  y^(j) (w ∗ x^(j)) ≥ 1,  ∀ j ∈ [m],    (17)
                w ∗ w < 0.    (18)

The proof of Theorem 2 is exactly analogous to the Euclidean version, and is provided in the Supplementary Information. Our result suggests that, despite the apparent complexity of hyperbolic distance calculation, the optimal (linear) maximum-margin classifiers in hyperbolic space can be identified via a relatively simple optimization problem that closely resembles the Euclidean version of SVM, where Euclidean inner products are replaced with Minkowski inner products. Note that if we restrict H to decision functions where w_0 = 0, then our formulation coincides with Euclidean SVM. Thus, Euclidean SVM can be viewed as a special case of our formulation in which the first coordinate (corresponding to the time axis in Minkowski spacetime) is neglected.

Unlike Euclidean SVM, however, our optimization problem has a non-convex objective as well as a non-convex constraint. Yet, if we restrict our attention to non-trivial, finite-sized problems where it is necessary and sufficient to consider only the set of w for which at least one data point lies on either side of the decision boundary, then the negative-norm constraint can be replaced with a convex alternative that intuitively maps out the convex hull of the given data points in the ambient Euclidean space of L^n. Finally, the soft-margin formulation of hyperbolic SVM can be derived by relaxing the separability constraints as in the Euclidean case.
Instead of imposing a linear penalty on misclassification errors, which in the Euclidean case has an intuitive interpretation as being proportional to the minimum Euclidean distance to the correct classification, we impose a penalty proportional to the hyperbolic distance to the correct classification. Analogous to the Euclidean case, we fix the scale of the penalty so that the margin of the closest correctly classified point to the decision boundary is set to sinh⁻¹(1). This leads to the optimization problem

    minimize_{w ∈ R^{n+1}}  -(1/2) (w ∗ w) + C Σ_{j=1}^m max(0, sinh⁻¹(1) - sinh⁻¹(y^(j) (w ∗ x^(j)))),    (19)
    subject to  w ∗ w < 0.    (20)

In all our experiments in the following section, we consider the simplest approach of solving the above formulation of hyperbolic SVM via projected gradient descent. The initial w is determined [...]

• Bonnie Berger is among the paper's authors!
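The pieces of the objective in eqs. (19)-(20) can be sketched directly: the Minkowski inner product, the hyperboloid constraint on data points, and the sinh⁻¹-based hinge penalty. The points and w below are arbitrary illustrations, not from the paper:

```python
import numpy as np

# Minkowski inner product: <u, v>_* = -u_0 v_0 + sum_{i>=1} u_i v_i
def minkowski(u, v):
    return -u[0] * v[0] + u[1:] @ v[1:]

def hyperboloid(p):
    # lift p in R^n onto the hyperboloid x_0 = sqrt(1 + |p|^2), so <x,x>_* = -1
    return np.concatenate(([np.sqrt(1.0 + p @ p)], p))

C = 1.0
X = np.stack([hyperboloid(np.array([a, b]))
              for a, b in [(0.5, 0.2), (-0.4, 0.1), (1.0, -0.3)]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.5, 1.0, -0.2])          # chosen so that w * w < 0 (eq. 20)

margins = np.array([minkowski(w, x) for x in X])
# soft-margin penalty of eq. (19), with arcsinh = sinh^{-1}
penalty = np.maximum(0.0, np.arcsinh(1.0) - np.arcsinh(y * margins)).sum()
objective = -0.5 * minkowski(w, w) + C * penalty
print(objective)
```

A full solver would additionally project w back into the feasible region w ∗ w < 0 after each gradient step.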
Figure 2: Multi-class classification of Gaussian mixtures in hyperbolic space. (a) Two-fold cross-validation results (macro-AUPR, hyperbolic SVM vs. Euclidean SVM) for 100 simulated Gaussian mixture datasets with 4 randomly positioned components and 100 points sampled from each component. Each dot represents the average performance over 5 trials. Vertical and horizontal lines represent standard deviations. Example decision hyperplanes for hyperbolic and Euclidean SVMs are shown in (b) and (c), respectively, using the Poincaré disk model. The color of each decision boundary denotes which component is being discriminated from the rest.
Figure 2: (a) One-dimensional Lorentz model H¹ (red) and its tangent space T_μH¹ (blue). (b) Parallel transport carries v ∈ T_{μ₀} (green) to u ∈ T_μ (blue) while preserving ‖·‖_L. (c) The exponential map projects u ∈ T_μ (blue) to z ∈ Hⁿ (red). The distance between μ and exp_μ(u), measured on the surface of Hⁿ, coincides with ‖u‖_L.

To formalize this sequence of operations, we first define the tangent space on hyperbolic space, as well as the way to transport the tangent space and the way to project a vector in the tangent space to the surface. The transportation of the tangent vector requires parallel transport, and the projection of the tangent vector to the surface requires the definition of the exponential map.

Tangent space of hyperbolic space. We use T_μHⁿ to denote the tangent space of Hⁿ at μ (Figure 2(a)). Representing T_μHⁿ as a set of vectors in the ambient space ℝⁿ⁺¹ into which Hⁿ is embedded, it can be characterized as the set of points satisfying the orthogonality relation with respect to the Lorentzian product:

  T_μHⁿ := {u : ⟨u, μ⟩_L = 0}.  (2)

• For a point μ on the hyperboloid, the tangent space is the set of vectors satisfying this orthogonality relation.

T_μHⁿ can literally be thought of as the tangent space of the forward hyperboloid sheet at μ. Note that T_{μ₀}Hⁿ consists of v ∈ ℝⁿ⁺¹ with v₀ = 0, and ‖v‖_L := √⟨v, v⟩_L = ‖v‖₂.

Parallel transport and inverse parallel transport. Next, for an arbitrary pair of points μ, ν ∈ Hⁿ, the parallel transport from ν to μ is defined as a map PT_{ν→μ} from T_νHⁿ to T_μHⁿ that carries a vector in T_νHⁿ along the geodesic from ν to μ in a parallel manner without changing its metric tensor. In other words, if PT is the parallel transport on hyperbolic space, then ⟨PT_{ν→μ}(v), PT_{ν→μ}(v′)⟩_L = ⟨v, v′⟩_L.

• Parallel transport moves a vector on T_νHⁿ to T_μHⁿ along the geodesic (the "straight line" in curved space) from ν to μ, preserving its metric.

The explicit formula for the parallel transport on the Lorentz model (Figure 2(b)) is given by:

  PT_{ν→μ}(v) = v + (⟨μ − αν, v⟩_L / (α + 1)) (ν + μ),  (3)

where α = −⟨ν, μ⟩_L. The inverse parallel transport PT⁻¹_{ν→μ} simply carries the vector in T_μHⁿ back to T_νHⁿ along the geodesic. That is,

  v = PT⁻¹_{ν→μ}(u) = PT_{μ→ν}(u).  (4)

Exponential map and inverse exponential map. Finally, we describe a function that maps a vector in a tangent space to the surface. According to the basic theory of differential geometry, every u ∈ T_μHⁿ determines a unique maximal geodesic γ_μ : [0, 1] → Hⁿ with γ_μ(0) = μ and γ̇_μ(0) = u. The exponential map exp_μ : T_μHⁿ → Hⁿ is defined by exp_μ(u) = γ_μ(1), and we can use this map to project a vector u in T_μHⁿ onto Hⁿ in such a way that the distance from μ to the destination of the map coincides with ‖u‖_L, the metric norm of u. For hyperbolic space, this map (Figure 2(c)) is given by

  z = exp_μ(u) = cosh(‖u‖_L) μ + sinh(‖u‖_L) u/‖u‖_L.  (5)

As we can confirm with straightforward computation, this exponential map is norm preserving in the sense that d_ℓ(μ, exp_μ(u)) = arccosh(−⟨μ, exp_μ(u)⟩_L) = ‖u‖_L.

Now, in order to evaluate the density of a point on hyperbolic space, we need to be able to map the point back to the tangent space, on which the distribution is initially defined. We therefore also need to compute the inverse of the exponential map, which is called the logarithm map. Solving eq. (5) for u, we obtain the inverse exponential map as

  u = exp⁻¹_μ(z) = (arccosh(α)/√(α² − 1)) (z − αμ),  (6)

where α = −⟨μ, z⟩_L. See Appendix A.1 for further details.

• The logarithm map is the inverse of the exponential map, often written log. The key point is that all of these quantities can be computed analytically.

3. Pseudo-Hyperbolic Gaussian
3.1. Construction
Finally, we are ready to provide the construction of our wrapped Gaussian distribution G(μ, Σ) on hyperbolic space, with μ ∈ Hⁿ and positive definite Σ. In the language of differential geometry, our strategy can be re-described as follows:
1. Sample a vector ṽ from the Gaussian distribution N(0, Σ) defined over ℝⁿ.
2. Interpret ṽ as an element of T_{μ₀}Hⁿ ⊂ ℝⁿ⁺¹ by rewriting ṽ as v = [0, ṽ].

(Nagano et al., ICML 2019)
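The exponential map (5) and its inverse (6) are simple enough to verify numerically. Below is a minimal NumPy sketch (the function names are ours) checking that exp⁻¹_μ ∘ exp_μ is the identity and that the map is norm preserving, using the Lorentzian product ⟨u, v⟩_L = −u₀v₀ + Σᵢ uᵢvᵢ.

```python
import numpy as np

def lorentz(u, v):
    """Lorentzian inner product <u, v>_L = -u0*v0 + sum_i ui*vi."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def exp_map(mu, u):
    """Eq. (5): z = cosh(||u||_L) mu + sinh(||u||_L) u / ||u||_L."""
    r = np.sqrt(lorentz(u, u))
    return np.cosh(r) * mu + np.sinh(r) * u / r

def log_map(mu, z):
    """Eq. (6): u = arccosh(a)/sqrt(a^2 - 1) (z - a*mu), with a = -<mu, z>_L."""
    alpha = -lorentz(mu, z)
    return np.arccosh(alpha) / np.sqrt(alpha**2 - 1.0) * (z - alpha * mu)

mu0 = np.array([1.0, 0.0, 0.0])   # hyperboloid origin
u = np.array([0.0, 0.3, -0.4])    # tangent vector at mu0 (first coordinate 0)
z = exp_map(mu0, u)
```

Both checks below follow directly from cosh² − sinh² = 1; numerically they hold to machine precision.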
[Excerpt from a related discussion of Riemannian normal distributions:] As the curvature tends to 0, one should recover the vanilla normal distribution. There are several generalisations of the normal distribution, which have different theoretical and computational properties. The property that Pennec (2006) takes for granted is the maximization of entropy given a mean and a covariance matrix, yielding in the isotropic setting

  N^R_M(z | μ, σ²) = (1/Z^R) exp(−d_M(μ, z)² / (2σ²)),  (5)

where d_M is the Riemannian distance on the manifold induced by the metric tensor. Such a distribution — referred to as the Riemannian Normal distribution — is used by Said et al. (2014), or by Hauberg (2018) in the hypersphere S^d. Sampling from such distributions and computing the normalising constant — especially in the anisotropic setting — is […]

A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning

Algorithm 1 Sampling on hyperbolic space
  Input: parameters μ ∈ Hⁿ, Σ
  Output: z ∈ Hⁿ
  Require: μ₀ = (1, 0, ⋯, 0)ᵀ ∈ Hⁿ
  Sample ṽ ∼ N(0, Σ) ∈ ℝⁿ
  v = [0, ṽ] ∈ T_{μ₀}Hⁿ
  Move v to u = PT_{μ₀→μ}(v) ∈ T_μHⁿ by eq. (3)
  Map u to z = exp_μ(u) ∈ Hⁿ by eq. (5)

3. Parallel transport the vector v to u ∈ T_μHⁿ ⊂ ℝⁿ⁺¹ along the geodesic from μ₀ to μ.
4. Map u to Hⁿ by exp_μ.

(Nagano et al., ICML 2019)
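Algorithm 1 can be sketched end-to-end from the explicit formulas for parallel transport and the exponential map. A minimal NumPy version follows; the helper names are ours, and this is an illustration rather than the authors' code.

```python
import numpy as np

def lorentz(u, v):
    """Lorentzian inner product <u, v>_L = -u0*v0 + sum_i ui*vi."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def parallel_transport(nu, mu, v):
    """Eq. (3): carry v from T_nu H^n to T_mu H^n, with alpha = -<nu, mu>_L."""
    alpha = -lorentz(nu, mu)
    return v + lorentz(mu - alpha * nu, v) / (alpha + 1.0) * (nu + mu)

def exp_map(mu, u):
    """Eq. (5): project a tangent vector at mu onto the hyperboloid."""
    r = np.sqrt(lorentz(u, u))
    return np.cosh(r) * mu + np.sinh(r) * u / r

def sample_wrapped_normal(mu, Sigma, rng):
    """Algorithm 1: sample z ~ G(mu, Sigma) on H^n."""
    n = len(mu) - 1
    mu0 = np.zeros(n + 1); mu0[0] = 1.0
    v_tilde = rng.multivariate_normal(np.zeros(n), Sigma)  # step 1: N(0, Sigma) on R^n
    v = np.concatenate(([0.0], v_tilde))                   # step 2: v in T_mu0 H^n
    u = parallel_transport(mu0, mu, v)                     # step 3: move to T_mu H^n
    return exp_map(mu, u)                                  # step 4: project onto H^n
```

Evaluating the density of a sample additionally requires the log-determinant of this map's Jacobian, which the paper derives in closed form.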
• Project y to the tangent plane at μ₀, parallel transport it to the tangent plane at x, and then map it onto the hyperbolic space by the exponential map.
• Subtraction (⊖) can be defined analogously.
• These operations do not satisfy associativity, so they do not form a group; the resulting structure is called a gyrogroup.

Möbius addition:

  x ⊕ y = [(1 + 2c⟨x, y⟩ + c‖y‖²) x + (1 − c‖x‖²) y] / [1 + 2c⟨x, y⟩ + c²‖x‖²‖y‖²]

What kind of operation is this? As c → 0 it reduces to x + y, and the following properties can be verified immediately:

  x ⊕ 0 = 0 ⊕ x = x,  (−x) ⊕ x = 0.

• I could not find literature directly relating Möbius addition to Riemannian geometry, but recent work comes close to spelling out its meaning:

  x ⊕ y = exp_x ∘ PT_{μ₀→x} ∘ log_{μ₀}(y).

[NeurIPS 2018] Hyperbolic Neural Networks
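The stated properties are easy to verify numerically. A short NumPy sketch of Möbius addition (the function name is ours):

```python
import numpy as np

def mobius_add(x, y, c):
    """Moebius addition on the Poincare ball with curvature -c."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

x = np.array([0.3, 0.1])
y = np.array([-0.2, 0.4])
```

Non-associativity can be checked the same way: `mobius_add(mobius_add(x, y, 1.0), x, 1.0)` generally differs from `mobius_add(x, mobius_add(y, x, 1.0), 1.0)`, which is exactly the gyrogroup behaviour noted above.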
• c denotes the curvature. So far we have taken the surface to be ⟨x, x⟩_L = −1; the same argument goes through with ⟨x, x⟩_L = −1/c, and that generalization is taken into account here.

Hyperbolic RNN. A simple RNN can be defined by h_{t+1} = φ(Wh_t + Ux_t + b), where φ is a pointwise non-linearity, typically tanh, sigmoid, ReLU, etc. This formula can be naturally generalized to hyperbolic space as follows. For parameters W ∈ M_{m,n}(ℝ), U ∈ M_{m,d}(ℝ), b ∈ D^m_c, we define:

  h_{t+1} = φ^{⊗c}(W ⊗_c h_t ⊕_c U ⊗_c x_t ⊕_c b),  h_t ∈ Dⁿ_c, x_t ∈ D^d_c.  (29)

Note that if the inputs x_t are Euclidean, one can write x̃_t := exp^c₀(x_t) and use the above formula, since W ⊗_c h_t ⊕_c P^c_{0→W ⊗_c h_t}(Ux_t) = W ⊗_c h_t ⊕_c exp^c₀(Ux_t) = W ⊗_c h_t ⊕_c U ⊗_c x̃_t.

GRU architecture. One can also adapt the GRU architecture:

  r_t = σ(W^r h_{t−1} + U^r x_t + b^r),  z_t = σ(W^z h_{t−1} + U^z x_t + b^z),
  h̃_t = φ(W(r_t ⊙ h_{t−1}) + Ux_t + b),  h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,  (30)

where ⊙ denotes the pointwise product. First, how should we adapt the pointwise multiplication by a gate? The Möbius version (see Eq. (26)) can be naturally extended: f^{⊗c} : (h, h′) ∈ Dⁿ × D^p ↦ exp^c₀(f(log^c₀(h), log^c₀(h′))).

(Octavian-Eugen Ganea et al., NIPS 2018)
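One hyperbolic RNN step, eq. (29), can be sketched by composing exp₀/log₀ at the ball's origin with Euclidean operations. This is a toy sketch with c = 1 and fixed toy weights; the names `mobius_matvec` and `rnn_step` are ours, and the Möbius matrix-vector product is realized via the exp–log construction described above (W ⊗_c h = exp^c₀(W log^c₀(h))).

```python
import numpy as np

def mobius_add(x, y, c):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)

def log0(x, c):
    """Log map at the origin of the Poincare ball."""
    nx = np.linalg.norm(x)
    return np.arctanh(np.sqrt(c) * nx) * x / (np.sqrt(c) * nx)

def exp0(v, c):
    """Exp map at the origin of the Poincare ball."""
    nv = np.linalg.norm(v)
    return np.tanh(np.sqrt(c) * nv) * v / (np.sqrt(c) * nv)

def mobius_matvec(M, x, c):
    return exp0(M @ log0(x, c), c)

def rnn_step(W, U, b, h, x, c=1.0):
    """Eq. (29): h' = phi^{(x)c}(W (x)c h (+)c U (x)c x (+)c b), phi = tanh."""
    pre = mobius_add(mobius_add(mobius_matvec(W, h, c), mobius_matvec(U, x, c), c), b, c)
    return exp0(np.tanh(log0(pre, c)), c)

W = np.array([[0.2, -0.1], [0.05, 0.3]])
U = np.array([[0.1, 0.0], [-0.2, 0.15]])
b = np.array([0.01, -0.02])
h_next = rnn_step(W, U, b, np.array([0.1, 0.2]), np.array([-0.3, 0.05]))
```

Because the final operation is exp₀, the new hidden state is guaranteed to stay inside the ball, so the recurrence can be iterated safely.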
MIXED-CURVATURE REPRESENTATIONS IN PRODUCTS OF MODEL SPACES
Albert Gu, Frederic Sala, Beliz Gunel & Christopher Ré
Computer Science Department, Stanford University, Stanford, CA 94305
{albertgu,fredsala,bgunel}@stanford.edu, chrismre@cs.stanford.edu

ABSTRACT
The quality of the representations achieved by embeddings is determined by how well the geometry of the embedding space matches the structure of the data. Euclidean space has been the workhorse for embeddings; recently hyperbolic and spherical spaces have gained popularity due to their ability to better embed new types of structured data — such as hierarchical data — but most data is not structured so uniformly. We address this problem by proposing learning embeddings in a product manifold combining multiple copies of these model spaces (spherical, hyperbolic, Euclidean), providing a space of heterogeneous curvature suitable for a wide variety of structures. We introduce a heuristic to estimate the sectional curvature of graph data and directly determine an appropriate signature — the number of component spaces and their dimensions — of the product manifold. Empirically, we jointly learn the curvature and the embedding in the product space via Riemannian optimization. We discuss how to define and compute intrinsic quantities such as means — a challenging notion for product manifolds — and provably learnable optimization functions. On a range of datasets and reconstruction tasks, our product space embeddings outperform single Euclidean or hyperbolic spaces used in previous works, reducing distortion by 32.55% on a Facebook social network dataset. We learn word embeddings and find that a product of hyperbolic spaces in 50 dimensions consistently improves on baseline Euclidean and hyperbolic embeddings, by 2.6 points in Spearman rank correlation on similarity tasks and 3.4 points on analogy accuracy.

1 INTRODUCTION (ICLR 2019)

• Hyperbolic (negative-curvature) embeddings are not always optimal.
• Depending on the data, a spherical, Euclidean, or mixed-curvature embedding may be the best fit.
• The authors build an embedding algorithm for such mixed (product) spaces.
Figure 1: Three component spaces: sphere S², Euclidean plane E², and hyperboloid H². Thick lines are geodesics; these get closer in positively curved (K = +1) space S², remain equidistant in flat (K = 0) space E², and get farther apart in negatively curved (K = −1) space H².

We propose embedding into product spaces in which each component has constant curvature. As we show, this allows us to capture a wider range of curvatures than traditional embeddings, while retaining the ability to globally optimize and operate on the resulting embeddings. Specifically, we form a Riemannian product manifold combining hyperbolic, spherical, and Euclidean components and equip it with a decomposable Riemannian metric. While each component space in the product has constant curvature (positive for spherical, negative for hyperbolic, and zero for Euclidean), the […]

(Published as a conference paper at ICLR 2019)

Figure 3: Geodesic triangles in differently curved spaces: compared to Euclidean geometry, in which the median satisfies the parallelogram law (Center), the median am is longer in cycle-like positively curved space (Left), and shorter in tree-like negatively curved space (Right). The relative length of am can be used as a heuristic to estimate discrete curvature.

3.2 ESTIMATING THE SIGNATURE

• Cycle-like graphs → sphere, grid-like graphs → Euclidean plane, tree-like graphs → hyperbolic surface seem to be the respective natural fits.
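The decomposable product metric means distances combine component-wise: d((x₁, x₂), (y₁, y₂))² = d₁(x₁, y₁)² + d₂(x₂, y₂)². A minimal sketch for a product of a unit sphere and a hyperboloid (the function names are ours, and curvatures are fixed to ±1 for simplicity):

```python
import numpy as np

def d_sphere(x, y):
    """Geodesic distance on the unit sphere S^2 (K = +1)."""
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

def d_hyper(x, y):
    """Geodesic distance on the hyperboloid H^2 (K = -1): arccosh(-<x, y>_L)."""
    return np.arccosh(np.clip(x[0] * y[0] - np.dot(x[1:], y[1:]), 1.0, None))

def d_product(x, y):
    """Distance in S^2 x H^2 under the decomposable product metric."""
    return np.hypot(d_sphere(x[0], y[0]), d_hyper(x[1], y[1]))
```

With learned, non-unit curvatures each component distance is rescaled accordingly, but the Pythagorean combination stays the same.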
[Figure caption fragment: neither hyperbolic nor spherical space alone is suitable for G, but the product space achieves low distortion. Note the decomposition into tree and cycle.]

The loss depends on the hyperbolic distance d_H (for which the gradient […]), which is continuously differentiable (Sala et al., 2018). The loss function can be optimized through standard Riemannian optimization methods such as RSGD (Bonnabel, 2013) and RSVRG (Zhang et al., 2016). We write down the procedure for product spaces in Algorithm 1. This proceeds by first computing the gradient with respect to the ambient space of the embedding (Step 4), and then deriving the Riemannian gradient by applying the Riemannian correction (multiply by […]).

Table 1: Matching geometries: average distortion on canonical graphs (cycle, tree, ring of trees) with 40 nodes, comparing four spaces with total dimension 3. The best distortion is achieved by the space with matching geometry.

                  Cycle              Tree               Ring of Trees
                  |V|=40, |E|=40     |V|=40, |E|=39     |V|=40, |E|=40
  (E3)¹           0.1064             0.1483             0.0997
  (H3)¹           0.1638             0.0321             0.0774
  (S3)¹           0.0007             0.1605             0.1106
  (H2)¹ × (S1)¹   0.1108             0.0538             0.0616

[…] doubling the number of factors. These models include products consisting of only a constant-curvature base space, ranging to various combinations of S^{d/2}_2, H^{d/2}_2 comprising factors of dimension 2. For a given signature, the curvatures are initialized to the appropriate value in {−1, 0, 1} and then learned using the technique in Section 3.1. We additionally compare to the output of Algorithms 2 and 3 for heuristically selecting a combination of spaces in which to embed these datasets.

Quality. We focus on the average distortion — which our loss function (2) optimizes — as our main metric for reconstruction, and additionally report the mAP metric for the unweighted graphs. As expected, for the synthetic graphs (tree, cycle, ring of trees), the matching geometries (hyperbolic, spherical, product of hyperbolic and spherical) yield the best distortion (Table 1). Next, we report in Table 2 the quality of embedding different graphs across a variety of allocations of spaces, with total dimension d = 10, following previous work (Nickel & Kiela, 2018).
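Table 1's metric, average distortion, is straightforward to compute. A toy sketch (the helper is ours) showing that a cycle embeds into a circle, i.e. into S¹, with zero distortion, consistent with the (S3)¹ row being best on the cycle:

```python
import numpy as np
from itertools import combinations

def avg_distortion(d_emb, d_graph):
    """Average relative distortion |d_emb - d_graph| / d_graph over node pairs."""
    pairs = combinations(range(len(d_graph)), 2)
    return np.mean([abs(d_emb[i][j] - d_graph[i][j]) / d_graph[i][j] for i, j in pairs])

# 4-cycle: shortest-path distances vs. geodesic distances on a circle of circumference 4
n = 4
d_graph = [[min(abs(i - j), n - abs(i - j)) for j in range(n)] for i in range(n)]
r = n / (2 * np.pi)                             # radius chosen so the circumference is n
ang = [2 * np.pi * i / n for i in range(n)]
d_emb = [[r * min(abs(a - b), 2 * np.pi - abs(a - b)) for b in ang] for a in ang]
```

The same helper applied to a tree embedded on the circle would give a strictly positive distortion, which is the qualitative content of Table 1.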
We confirm that the structure of each graph informs the best allocation of spaces. In particular, the cities graph — which has intrinsic structure close to S² — embeds well into any space with a spherical component, and the tree-like Ph.D.s graph embeds well into hyperbolic products. We emphasize that even for such data […]

• The results came out as expected.
• The authors then embed real data as well, but report only quantitative metrics, without any particular interpretation.
• It would be interesting if a relationship could be established between the best embedding space and the interpretation of the data.
• "Mixed-curvature Variational Autoencoders"
• AISTATS: "Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces", "Hyperbolic Manifold Regression"
• ICML: "Latent Variable Modelling with Hyperbolic Normalizing Flows", "Constant Curvature Graph Convolutional Networks"
ARTICLE — Poincaré maps for analyzing complex hierarchies in single-cell data
Anna Klimovskaia, David Lopez-Paz, Léon Bottou & Maximilian Nickel
https://doi.org/10.1038/s41467-020-16822-4 (OPEN)

The need to understand cell developmental processes spawned a plethora of computational methods for discovering hierarchies from scRNAseq data. However, existing techniques are based on Euclidean geometry, a suboptimal choice for modeling complex cell trajectories with multiple branches. To overcome this fundamental representation issue we propose Poincaré maps, a method that harnesses the power of hyperbolic geometry in the realm of single-cell data analysis. Often understood as a continuous extension of trees, hyperbolic geometry enables the embedding of complex hierarchical data in only two dimensions while preserving the pairwise distances between points in the hierarchy. This enables the use of our embeddings in a wide variety of downstream data analysis tasks, such as visualization, clustering, lineage detection and pseudotime inference. When compared to existing methods — unable to address all these important tasks using a single embedding — Poincaré maps produce state-of-the-art two-dimensional representations of cell trajectories on multiple scRNAseq datasets.

• A paper that embeds single-cell RNAseq data into the Poincaré disk based on gene-expression levels.

[Excerpt from the results:] … the hierarchy would lead to a contradiction with the physical direction of time. By virtue of the Poincaré visualization, we reassigned the root of the developmental process to the furthest PS cell not belonging to the "mesodermal" cluster. We picked a root cell from PS so as to ease clustering by angle for lineage detection. More specifically, we chose the most "exterior" cell from the PS cluster, by visual inspection. Given our reassigned root, we separate the dataset into five potential lineages (see "Methods"), to find the asynchrony in the developmental process in terms of marker expressions (Fig. 4b). Analysis of the composition of cells belonging to each lineage (Fig. 4c) indicates that erythroid cells […]

Figure 3 legend, (a) Cell types: Pharynx, Body wall muscle, Glia, Neuron, Muscle, Marginal cell, Gland, Intestinal valve, Ciliated amphid neuron, Ciliated non-amphid neuron, Hypodermis, Seam cells, Excretory cell, Coelomocyte, Z1-Z4, Germline, Intestine, Unannotated neurons.
Fig. 3 Analysis of the C. elegans cell atlas. (a) Poincaré map (without rotation) on a 40,000-cell dataset; the parameters used for embedding are (k = 15, σ = 2.0, γ = 3.0). Main cell types are annotated with text; mature cell types lie towards the border of the disk. Two subpopulations of germline cells […] Poincaré maps with respect to a randomly picked root cell from one of the sub-populations […] cell types of the early-age embryo. The red line is the average pseudotime distance for a given […]

(NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-16822-4)

• Additionally, there are presentations on phylogenetic-tree embeddings at IIBMP; much anticipated. (Nature Communications)
• Here s and t are neural nets T_oH^d_K → T_oH^{n−d}_K, and σ is a pointwise non-linearity.

[Bose et al., "Latent Variable Modelling with Hyperbolic Normalizing Flows":] At its core, the RealNVP flow (Dinh et al., 2017) uses a computationally symmetric transformation (the affine coupling layer), which has the benefit of being fast to evaluate and invert due to its lower-triangular Jacobian, whose determinant is cheap to compute. Operationally, the coupling layer is implemented using a binary mask, and partitions some input x̃ into two sets, where the first set, x̃₁ := x̃_{1:d}, is transformed elementwise independently of other dimensions. The second set, x̃₂ := x̃_{d+1:n}, is also transformed elementwise but in a way that depends on the first set (see Appendix B.2 for more details). Since all coupling layer operations occur at T_oHⁿ_K, we term this form of coupling Tangent Coupling (TC).

Figure 2: Comparison of density estimation in hyperbolic space for a 2D wrapped Gaussian (WG) and a mixture of wrapped Gaussians (MWG) on P²₁. Densities are visualized in the Poincaré disk. Additional qualitative results can be found in Appendix F.

Thus the overall transformation due to one layer of our TC flow is a composition of a logarithmic map, an affine coupling defined on T_oHⁿ_K, and an exponential map:

  f̃_TC(x̃) = { z̃₁ = x̃₁ ;  z̃₂ = x̃₂ ⊙ σ(s(x̃₁)) + t(x̃₁) },
  f_TC(x) = exp^K_o( f̃_TC( log^K_o(x) ) ),  (11)

where x̃ = log^K_o(x) is a point on T_oHⁿ_K, and σ is a pointwise non-linearity such as the exponential function. Functions s and t are parameterized scale and translation functions implemented as neural nets from T_oH^d_K → T_oH^{n−d}_K. One important detail is that arbitrary operations on a tangent vector v ∈ T_oHⁿ_K may transport the resultant vector outside the tangent space, hampering subsequent operations. To avoid this we can keep the first dimension fixed at v₀ = 0 to ensure we remain in T_oHⁿ_K.

Similar to the Euclidean RealNVP, we need an efficient expression for the Jacobian determinant of f_TC:

  det(∂y/∂x) = ( R sinh(‖z̃‖_L/R) / ‖z̃‖_L )^{n−1} · ∏_{i=d+1}^{n} σ(s(x̃₁))ᵢ · ( R sinh(‖log^K_o(x)‖_L/R) / ‖log^K_o(x)‖_L )^{1−n},  (12)

where z̃ = f̃_TC(x̃) and f̃_TC is as defined above.

Proof Sketch. Here we only provide a sketch of the proof; details can be found in Appendix C. First, observe that the overall transformation is a valid composition of functions: y := exp^K_o ∘ f̃_TC ∘ log^K_o(x). Thus, the overall determinant can be computed by the chain rule and the identity det(∂y/∂x) = det(∂exp^K_o(z)/∂z) · det(∂f(x̃)/∂x̃) · det(∂log^K_o(x)/∂x). Tackling each function in the composition individually, […]
w 0 O ͰܭՄ ﬁne at ace. any an- ges old ne on nd al., lly has wer ute. g a ets, ise := hat ls). rm important detail is that arbitrary operations on a tangent vector v 2 ToHn K may transport the resultant vector outside the tangent space, hampering subsequent operations. To avoid this we can keep the ﬁrst dimension ﬁxed at v0 = 0 to ensure we remain in ToHn K . Similar to the Euclidean RealNVP, we need an efﬁcient expression for the Jacobian determinant of fT C. Proposition 1. The Jacobian determinant of a single T C layer in equation 11 is: det ⇣@y @x ⌘ = ⇣R sinh(||z||L R ) ||z||L ⌘n 1 ⇥ n Y i=d+1 (s(˜ x1))i ⇥ ⇣R sinh(|| logK o (x)||L R ) || logK o (x)||L ⌘1 n (12) where, z = ˜ fT C(˜ x) and ˜ fT C is as deﬁned above. Proof Sketch. Here we only provide a sketch of the proof and details can be found in Appendix C. First, observe that the overall transformation is a valid composition of func- tions: y := expK o ˜ fT C logK o (x). Thus, the overall determinant can be computed by chain rule and the identity, ⇣ ⌘ ⇣ ⌘ ⇣ ⌘ ⇣ ⌘
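To make the forward pass of Eq. 11 concrete, the following is a minimal sketch of a single TC layer on the Lorentz model, fixing $R = 1$ (curvature $-1$) and using toy linear callables in place of the neural nets $s$ and $t$. The function names (`exp_o`, `log_o`, `tangent_coupling`) are ours, not the paper's.

```python
import numpy as np

def exp_o(v):
    """Exponential map at the origin o = (1, 0, ..., 0) of the Lorentz model
    with R = 1.  v is a tangent vector at o, i.e. v[0] == 0."""
    o = np.zeros_like(v)
    o[0] = 1.0
    r = np.linalg.norm(v[1:])  # Lorentz norm equals the Euclidean norm when v[0] == 0
    if r < 1e-12:
        return o
    return np.cosh(r) * o + np.sinh(r) * v / r

def log_o(x):
    """Logarithmic map at the origin (inverse of exp_o)."""
    o = np.zeros_like(x)
    o[0] = 1.0
    u = x - x[0] * o           # project out the time component; u[0] == 0
    un = np.linalg.norm(u)
    if un < 1e-12:
        return np.zeros_like(x)
    return np.arccosh(x[0]) * u / un

def tangent_coupling(x, s, t, d):
    """One TC layer (Eq. 11): logmap -> affine coupling in the tangent space
    at the origin -> expmap.  sigma is taken to be exp, as the text suggests.
    The 0-th (time) coordinate stays pinned at 0 so we remain in the tangent space."""
    xt = log_o(x)
    x1, x2 = xt[1:1 + d], xt[1 + d:]
    z2 = x2 * np.exp(s(x1)) + t(x1)
    return exp_o(np.concatenate(([0.0], x1, z2)))
```

In a real flow, `s` and `t` would be neural networks mapping the first block to scale and translation for the second; here any callables of matching output shape work.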
transport a point to the tangent space. We employ the coupling strategy previously discussed and partition our input vector into two components: $\tilde{x}_1 := \tilde{x}_{1:d}$ and $\tilde{x}_2 := \tilde{x}_{d+1:n}$. Let $\tilde{x} = \log^K_o(x)$ be the point on $\mathcal{T}_o\mathbb{H}^n_K$ after the logarithmic map. The remainder of the WHC layer can be defined as follows:

$$
\tilde{f}^{\mathrm{WHC}}(\tilde{x}) =
\begin{cases}
\tilde{z}_1 = \tilde{x}_1 \\
\tilde{z}_2 = \log^K_o\Big(\exp^K_{t(\tilde{x}_1)}\big(\mathrm{PT}_{o \to t(\tilde{x}_1)}(v)\big)\Big) \\
v = \tilde{x}_2 \odot \sigma(s(\tilde{x}_1))
\end{cases}
\qquad
f^{\mathrm{WHC}}(x) = \exp^K_o\!\big(\tilde{f}^{\mathrm{WHC}}(\log^K_o(x))\big).
\tag{13}
$$

Functions $s : \mathcal{T}_o\mathbb{H}^d_K \to \mathcal{T}_o\mathbb{H}^{n-d}_K$ and $t : \mathcal{T}_o\mathbb{H}^d_K \to \mathbb{H}^n_K$ are taken to be arbitrary neural nets, but the role of $t$ when compared to TC is vastly different. In particular, the generalization of translation on Riemannian manifolds can be viewed as parallel transport to a different tangent space. Consequently, in Eq. 13, the function $t$ predicts a point on the manifold that we wish to parallel transport to. This greatly increases the flexibility as we are no longer confined to the tangent space at the origin. The logarithmic map is then used to ensure that both $\tilde{z}_1$ and $\tilde{z}_2$ are in the same tangent space before the final exponential map that projects the point to the manifold.

One important consideration in the construction of $t$ is that it should only parallel transport functions of $\tilde{x}_2$. However, the output of $t$ is a point on $\mathbb{H}^n_K$, and without care this can involve elements in $\tilde{x}_1$. To prevent such a scenario we construct the output of $t = [t_0, 0, \dots, 0, t_{d+1}, \dots, t_n]$, where the elements $t_{d+1:n}$ are used to determine the value of $t_0$ using Eq. 5, such that $t$ is a point on the manifold and every remaining index is set to zero. Such a construction ensures that only components of functions of $\tilde{x}_2$ are parallel transported, as desired. Figure 3 illustrates the transformation performed by the WHC layer.
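A minimal sketch of the two geometric ingredients specific to WHC, again fixing $R = 1$: parallel transport from the origin's tangent space, and the masked construction of the target point $t$. The names `pt_o_to_y` and `lift_t` are ours; `pt_o_to_y` uses the standard closed-form transport on the hyperboloid, and `lift_t` mirrors $t = [t_0, 0, \dots, 0, t_{d+1}, \dots, t_n]$ with $t_0$ fixed by the manifold constraint.

```python
import numpy as np

def minkowski(u, v):
    """Lorentz (Minkowski) inner product <u, v>_L = -u0*v0 + sum_i ui*vi."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def pt_o_to_y(v, y):
    """Parallel transport of v in T_oH^n to T_yH^n along the geodesic (R = 1)."""
    o = np.zeros_like(y)
    o[0] = 1.0
    return v + minkowski(y, v) / (1.0 - minkowski(o, y)) * (o + y)

def lift_t(t_free, d, n):
    """Build the WHC target t = [t0, 0, ..., 0, t_{d+1}, ..., t_n]: zeros on the
    x1 block, free values on the x2 block, and t0 chosen so t is on the manifold."""
    t = np.zeros(n + 1)
    t[1 + d:] = t_free
    t[0] = np.sqrt(1.0 + np.sum(t_free ** 2))  # hyperboloid constraint with R = 1
    return t
```

Parallel transport is a linear isometry, so the transported vector lands in the tangent space at the target and keeps its Lorentz norm, which is what makes its Jacobian contribution benign in Proposition 2.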
• Annotation (translated): I can see what the equation means, but personally the argument for why this variant should work better does not fully convince me. Sorry.

To compute the Jacobian determinant of the full transformation in Eq. 13, we proceed by analyzing the effect of WHC on valid orthonormal bases w.r.t. the Lorentz inner product for the tangent space at the origin. We state our main result here and provide a sketch of the proof, while the entire proof can be found in Appendix D.

Proposition 2. The Jacobian determinant of the function $\tilde{f}^{\mathrm{WHC}}$ in Equation 13 is:

$$
\det\Big(\frac{\partial y}{\partial x}\Big)
= \prod_{i=d+1}^{n} \sigma\big(s(\tilde{x}_1)\big)_i
\times \bigg(\frac{R\sinh\big(\frac{\|q\|_{\mathcal{L}}}{R}\big)}{\|q\|_{\mathcal{L}}}\bigg)^{l}
\times \bigg(\frac{R\sinh\big(\frac{\|\log^K_o(\hat{q})\|_{\mathcal{L}}}{R}\big)}{\|\log^K_o(\hat{q})\|_{\mathcal{L}}}\bigg)^{-l}
\times \bigg(\frac{R\sinh\big(\frac{\|\tilde{z}\|_{\mathcal{L}}}{R}\big)}{\|\tilde{z}\|_{\mathcal{L}}}\bigg)^{n-1}
\times \bigg(\frac{R\sinh\big(\frac{\|\log^K_o(x)\|_{\mathcal{L}}}{R}\big)}{\|\log^K_o(x)\|_{\mathcal{L}}}\bigg)^{1-n},
\tag{15}
$$

where $\tilde{z} = \mathrm{concat}(\tilde{z}_1, \tilde{z}_2)$, the constant $l = n - d$, $\sigma$ is a non-linearity, $q = \mathrm{PT}_{o \to t(\tilde{x}_1)}(v)$ and $\hat{q} = \exp^K_t(q)$.

Proof Sketch. We first note that the exponential and logarithmic maps applied at the beginning and end of the WHC layer can be dealt with by appealing to the chain rule and the known Jacobian determinants for these functions as used in Proposition 1. Thus, what remains is the term $\det\big(\frac{\partial \tilde{z}}{\partial \tilde{x}}\big)$. To evaluate this term we rely on the following Lemma. […] The overall cost of a WHC layer is O(n), the same as TC, with the added cost of the two new maps that act on a subspace of basis elements.

4. Experiments

We evaluate our TC-flow and WHC-flow on structured density estimation, graph reconstruction, and graph generation.² Throughout our experiments we rely on three main baselines. In Euclidean space, we use Gaussian latent variables and affine coupling flows, denoted N and NC, respectively. In hyperbolic space, we use Wrapped Normal latent variables and the analogous baseline (Nagano et al., 2019). Hyperbolic parameters are defined on tangent spaces and trained with conventional optimizers (Kingma & Ba, 2014). Following previous work, we treat the curvature K as a learnable parameter with a warmup of 10 epochs, and we clamp the maximum norm of vectors before any logarithmic or exponential map (Skopek et al., 2019). Appendix E contains details on model architectures and implementation details.

4.1. Structured Density Estimation

We first consider structured density estimation in a canonical VAE setting (Kingma & Welling, 2014), …
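The two training stabilizers mentioned above (max-norm clamping before exponential/logarithmic maps, and a curvature warmup) can be sketched as follows. The clamp threshold `MAX_NORM` and the exact schedule are assumptions for illustration; the paper only states that a clamp and a 10-epoch warmup are used.

```python
import numpy as np

MAX_NORM = 40.0      # assumed threshold: large enough to be rarely active,
                     # small enough to keep cosh/sinh from overflowing
WARMUP_EPOCHS = 10   # warmup length stated in the text

def clamp_norm(v, max_norm=MAX_NORM):
    """Rescale the spatial part of a tangent vector so its norm stays bounded,
    guarding the cosh/sinh terms inside the exponential map."""
    n = np.linalg.norm(v[1:])
    if n > max_norm:
        v = v.copy()
        v[1:] *= max_norm / n
    return v

def curvature(epoch, k_learned, k_init=1.0):
    """Hold curvature fixed at k_init during warmup, then use the learned value
    (a sketch of the schedule described in the text)."""
    return k_init if epoch < WARMUP_EPOCHS else k_learned
```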
Model    BDP-2        BDP-4        BDP-6
N-VAE    -55.4±0.2    -55.2±0.3    -56.1±0.2
H-VAE    -54.9±0.3    -55.4±0.2    -58.0±0.2
NC       -55.4±0.4    -54.7±0.1    -55.2±0.3
TC       -54.9±0.1    -55.4±0.1    -57.5±0.2
WHC      -55.1±0.4    -55.2±0.2    -56.9±0.4

Table 1. Test log-likelihood on the Binary Diffusion Process versus latent dimension. All normalizing flows use 2 coupling layers.

Model    MNIST-2       MNIST-4       MNIST-6
N-VAE    -139.5±1.0    -115.6±0.2    -100.0±0.02
H-VAE    *             -113.7±0.9    -99.8±0.2
NC       -139.2±0.4    -115.2±0.6    -98.7±0.3
TC       *             -112.5±0.2    -99.3±0.2
WHC      -136.5±2.1    -112.8±0.5    -99.4±0.2

Table 2. Test log-likelihood on MNIST averaged over 5 runs versus latent dimension. * indicates numerically unstable settings.

4.2. Graph Reconstruction

We evaluate the practical utility of our hyperbolic flows by conducting experiments on the task of link prediction using graph neural networks (GNNs) (Scarselli et al., 2008) as an inference model. Given a simple graph $G = (\mathcal{V}, A, X)$, defined by a set of nodes $\mathcal{V}$, an adjacency matrix $A \in \mathbb{Z}^{|\mathcal{V}| \times |\mathcal{V}|}$, and a node feature matrix $X \in \mathbb{R}^{|\mathcal{V}| \times n}$, …

• Annotation (translated): Though it was not stated explicitly, a normalizing flow by itself has no dimensionality-reduction effect, so I believe the evaluation is of the flow combined with a VAE.
• Annotation (translated): Honestly, this result does not seem very compelling.
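For intuition on the link-prediction task, node pairs are typically scored by their geodesic distance in the embedding space. Below is a hypothetical distance-based (Fermi-Dirac) edge decoder on the Lorentz model with $R = 1$; the paper does not specify this exact decoder, and `r` and `temp` are made-up hyperparameters.

```python
import numpy as np

def lift(s):
    """Lift a Euclidean vector onto the hyperboloid: x0 = sqrt(1 + ||s||^2)."""
    s = np.asarray(s, dtype=float)
    return np.concatenate(([np.sqrt(1.0 + np.sum(s ** 2))], s))

def lorentz_dist(u, v):
    """Geodesic distance on the hyperboloid (R = 1): arccosh(-<u, v>_L)."""
    ip = -u[0] * v[0] + np.dot(u[1:], v[1:])
    return np.arccosh(np.maximum(-ip, 1.0))  # clamp guards against round-off

def edge_prob(u, v, r=2.0, temp=1.0):
    """Fermi-Dirac decoder: closer embeddings -> higher edge probability."""
    return 1.0 / (np.exp((lorentz_dist(u, v) - r) / temp) + 1.0)
```

AUC and AP for Tables like the one below are then computed by ranking held-out positive edges against sampled non-edges by these probabilities.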
the hyperbolic WHC flow. Similar to the structured density estimation setting, the performance gains of WHC are best observed in low-dimensional latent spaces.

Model    Dis-I AUC    Dis-I AP     Dis-II AUC    Dis-II AP
N-VAE    0.90±0.01    0.92±0.01    0.92±0.01     0.91±0.01
H-VAE    0.91±5e-3    0.92±5e-3    0.92±4e-3     0.91±0.01
NC       0.92±0.01    0.93±0.01    0.95±4e-3     0.93±0.01
TC       0.93±0.01    0.93±0.01    0.96±0.01     0.95±0.01
WHC      0.93±0.01    0.94±0.01    0.96±0.01     0.96±0.01

Table 3. Test AUC and Test AP on graph embeddings, where Dis-I has latent dimension 6 and Dis-II has latent dimension 2.

4.3. Graph Generation

Finally, we explore the utility of our hyperbolic flows for generating hierarchical structures. As a synthetic testbed, we construct datasets containing uniformly random trees as well as uniformly random lobster graphs (Golomb, 1996), where each graph contains between 20 to 100 nodes. Unlike prior work on graph generation, our datasets are designed to …, thus enabling us to test the utility of … models. We then train a generative model …
Figure 4. Selected qualitative results on graph generation for lobster and random tree graphs.

Model    Accuracy     Avg. Clust.    Avg. GC.
NC       56.6±5.5     40.9±42.7      0.34±0.10
TC       32.1±1.9     98.3±89.5      0.25±0.12
WHC      62.1±10.9    21.1±13.4      0.13±0.07

Table 4. Generation statistics on random trees over 5 runs.

5. Related Work

Hyperbolic Geometry in Machine Learning: The intersection of hyperbolic geometry and machine learning has recently risen to prominence (Dhingra et al., 2018; Tay et al., 2018; Law et al., 2019; Khrulkov et al., 2019; Ovinnikov, 2019). Early prior work proposed to embed data into the Poincaré ball model (Nickel & Kiela, 2017; Chamberlain