Paper introduction: Latent Variable Modelling with Hyperbolic Normalizing Flows (ICML 2020)

Tsukasa Fukunaga (福永 津嵩), Assistant Professor, Hagiya Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo

Paper title page:
Latent Variable Modelling with Hyperbolic Normalizing Flows
Avishek Joey Bose, Ariella Smofsky, Renjie Liao, Prakash Panangaden, William L. Hamilton
arXiv: …36v3 [cs.LG], 16 Jun 2020 (ICML 2020)

Abstract: The choice of approximate posterior distributions plays a central role in stochastic variational inference (SVI). One effective solution is the use of normalizing flows to construct flexible posterior distributions. However, one key limitation of existing normalizing flows is that they are restricted to the Euclidean space and are ill-equipped to model data with an underlying hierarchical structure. To address this fundamental limitation, we present the first extension of normalizing flows to hyperbolic spaces. We first elevate normalizing flows to hyperbolic spaces using coupling transforms defined on the tangent bundle, termed Tangent Coupling (TC). We further introduce Wrapped Hyperboloid Coupling (WHC), a fully invertible and learnable transformation that explicitly utilizes the geometric structure of hyperbolic spaces, allowing for expressive posteriors while being efficient to sample from. We demonstrate the efficacy of our novel normalizing flow over hy…

Figure 1. The shortest path between a given pair of node embeddings in $\mathbb{R}^2$ and hyperbolic space as modelled by the Lorentz model $\mathbb{H}^2_K$ and Poincaré disk $\mathbb{P}^2_K$. Unlike Euclidean space, distances between points grow exponentially as you move away from the origin in hyperbolic space, and thus the shortest paths between points in hyperbolic space go through a common parent node (i.e. the origin), giving rise to hierarchical and tree-like structure.

Today's outline
• What is a normalizing flow?
• What is hyperbolic geometry?
• Introduction to "Latent Variable Modelling with Hyperbolic Normalizing Flows" (ICML 2020)

Part 1: What is a normalizing flow?
This part follows "Normalizing Flows for Probabilistic Modeling and Inference" (arXiv); the figures are also taken from that paper.

Basics of normalizing flows
• A normalizing flow constructs a complex probability distribution by repeatedly transforming a simple one.
• In a flow-based model, the transformation T is required to be invertible and differentiable. The density of x then follows from the change-of-variables formula, eq. (2) in the excerpt.

Excerpt from the reference paper:

Normalizing flows provide a general way of constructing flexible probability distributions over continuous random variables. Let x be a D-dimensional real vector, and suppose we would like to define a joint distribution over x. The main idea of flow-based modeling is to express x as a transformation T of a real vector u sampled from $p_u(u)$:

$$x = T(u) \quad \text{where} \quad u \sim p_u(u). \tag{1}$$

We refer to $p_u(u)$ as the base distribution of the flow-based model. The transformation T and the base distribution $p_u(u)$ can have parameters of their own (denote them as $\phi$ and $\psi$ respectively); this induces a family of distributions over x parameterized by $\{\phi, \psi\}$.

The defining property of flow-based models is that the transformation T must be invertible and both T and $T^{-1}$ must be differentiable. Such transformations are known as diffeomorphisms and require that u be D-dimensional as well (Milnor and Weaver, 1997). Under these conditions, the density of x is well-defined and can be obtained by a change of variables (…, 2006; Bogachev, 2007):

$$p_x(x) = p_u(u)\,\left|\det J_T(u)\right|^{-1} \quad \text{where} \quad u = T^{-1}(x). \tag{2}$$

Equivalently, we can also write $p_x(x)$ in terms of the Jacobian of $T^{-1}$:

$$p_x(x) = p_u\!\left(T^{-1}(x)\right)\,\left|\det J_{T^{-1}}(x)\right|. \tag{3}$$

The Jacobian $J_T(u)$ is the $D \times D$ matrix of all partial derivatives of T given by:

$$J_T(u) = \begin{bmatrix} \frac{\partial T_1}{\partial u_1} & \cdots & \frac{\partial T_1}{\partial u_D} \\ \vdots & \ddots & \vdots \\ \frac{\partial T_D}{\partial u_1} & \cdots & \frac{\partial T_D}{\partial u_D} \end{bmatrix}. \tag{4}$$

In practice, we usually construct a flow-based model by implementing T (or $T^{-1}$) with a neural network and taking $p_u(u)$ to be a simple density such as a multivariate normal.
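As a small illustration of the change-of-variables formula (2), the sketch below pushes a standard normal through a 1-D affine map and evaluates the resulting density. The map `T(u) = a*u + b`, its default parameters, and all function names are assumptions made for this example, not something taken from the slides or the paper.

```python
import math

def std_normal_pdf(u):
    """Base density p_u(u): standard normal in one dimension."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def T(u, a=2.0, b=1.0):
    """Hypothetical invertible, differentiable map x = a*u + b (a != 0)."""
    return a * u + b

def T_inv(x, a=2.0, b=1.0):
    return (x - b) / a

def flow_density(x, a=2.0, b=1.0):
    """p_x(x) = p_u(u) * |det J_T(u)|^{-1} with u = T^{-1}(x), as in eq. (2)."""
    u = T_inv(x, a, b)
    abs_det_jac = abs(a)  # the Jacobian of a 1-D affine map is the scalar a
    return std_normal_pdf(u) / abs_det_jac

def normal_pdf(x, mean, std):
    """Closed-form N(mean, std^2) density, used only as a cross-check."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
```

For this particular map the flow density should coincide with the density of N(b, a^2), which is what `normal_pdf` lets you verify.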

Stacking transformations
• If a transformation T is invertible and differentiable, a composition of such transformations is again invertible and differentiable. That is, for transformations $T_1$ and $T_2$, the inverse and the Jacobian determinant of the composition $T_2 \circ T_1$ are given by eqs. (5) and (6) in the excerpt.
• Stacking transformations in this way makes it possible to represent complex distributions.
• If sampling from the base distribution and computing T are both easy, then sampling from the transformed, complex distribution is easy as well.

Excerpt from the reference paper:

$|\det J_T(u)|$ quantifies the relative change of volume of a small neighbourhood around u due to T. Roughly speaking, if an infinitesimally small neighbourhood du around u is mapped to an infinitesimally small neighbourhood dx around x = T(u), $|\det J_T(u)|$ is equal to the volume of dx divided by the volume of du. Since the probability mass in dx must equal the probability mass in du, the density at x is smaller than the density at u if du is expanded, and is larger if du is contracted.

An important property of invertible and differentiable transformations is that they are composable. Given two such transformations $T_1$ and $T_2$, their composition $T_2 \circ T_1$ is also invertible and differentiable. Its inverse and Jacobian determinant are given by:

$$(T_2 \circ T_1)^{-1} = T_1^{-1} \circ T_2^{-1} \tag{5}$$

$$\det J_{T_2 \circ T_1}(u) = \det J_{T_2}(T_1(u)) \cdot \det J_{T_1}(u). \tag{6}$$

In consequence, we can build complex transformations by composing multiple instances of simpler transformations, without compromising the requirements of invertibility and differentiability, and hence without losing the ability to calculate the density $p_x(x)$.

Footnote 1: Some papers refer to $p_u(u)$ as the 'prior' and to u as the 'latent variable'. We believe that this terminology is not as well-suited for normalizing flows as it is for latent-variable models. Upon observing x, the corresponding $u = T^{-1}(x)$ is uniquely determined and thus no longer 'latent'.
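The composition rules (5) and (6) can be sketched in a few lines. The example below represents each 1-D transformation as a hypothetical triple `(forward, inverse, log_abs_det)` and composes two of them; the representation and the two example maps (an affine map and `sinh`) are illustrative assumptions only.

```python
import math

def affine(a, b):
    """x = a*u + b in 1-D; returns (forward, inverse, log|det J|)."""
    forward = lambda u: a * u + b
    inverse = lambda x: (x - b) / a
    log_abs_det = lambda u: math.log(abs(a))
    return forward, inverse, log_abs_det

def sinh_map():
    """x = sinh(u); its derivative cosh(u) is positive, so it is invertible."""
    return math.sinh, math.asinh, lambda u: math.log(math.cosh(u))

def compose(t1, t2):
    """Build T2 o T1 from two (forward, inverse, log|det J|) triples."""
    f1, inv1, ld1 = t1
    f2, inv2, ld2 = t2
    forward = lambda u: f2(f1(u))
    inverse = lambda x: inv1(inv2(x))             # eq. (5): T1^{-1} o T2^{-1}
    log_abs_det = lambda u: ld1(u) + ld2(f1(u))   # eq. (6), in log space
    return forward, inverse, log_abs_det
```

Working with log-determinants turns the product in eq. (6) into a sum, which is how stacked flows are usually accumulated in practice.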

Concrete example

[Figure: scatter plots of the samples after each step of the flow; both axes range from -4 to 4.]
Figure 1: Example of a 4-step flow transforming samples from a standard-normal base density to a cross-shaped target density.

Excerpt from the reference paper:

In practice, it is common to chain together multiple transformations $T_1, \ldots, T_K$ to obtain $T = T_K \circ \cdots \circ T_1$, where each $T_k$ transforms $z_{k-1}$ into $z_k$, assuming $z_0 = u$ and $z_K = x$. Hence, the term 'flow' refers to the trajectory that a collection of samples from $p_u(u)$ follow as they are gradually transformed by the sequence of transformations $T_1, \ldots, T_K$. The term 'normalizing' refers to the fact that the inverse flow through $T_K^{-1}, \ldots, T_1^{-1}$ takes a collection of samples from $p_x(x)$ and transforms them (in a sense, 'normalizes' them) into a collection of samples from a prescribed density $p_u(u)$ (which is often taken to be a multivariate normal). Figure 1 illustrates a flow (K = 4) transforming a standard-normal base distribution to a cross-shaped target density.

In terms of functionality, a flow-based model provides two operations: sampling from the model using eq. 1, and evaluating the model's density using eq. 3. These two operations have different computational requirements. Sampling from the model requires the ability to …

• "Flow": the trajectory along which the distribution of the samples changes under a sequence of transformations.
• "Normalizing": the complex distribution is transformed into the base distribution by the inverse flow. A Gaussian is widely used as the base distribution.

Parameter estimation
• Estimate the model parameters so that the flow-based model fits the target distribution $p_x^*(x)$, for example by minimizing the KL divergence. ($\phi$ denotes the parameters of T and $\psi$ the parameters of the base distribution.)
• Solve this with SGD.

Excerpt from the reference paper:

2.3 Using flows for modeling and inference

Similarly to fitting any probabilistic model, fitting a flow-based model $p_x(x; \theta)$ to a target distribution $p_x^*(x)$ can be done by minimizing some divergence or discrepancy between them. This minimization is performed with respect to the model's parameters $\theta = \{\phi, \psi\}$, where $\phi$ are the parameters of T and $\psi$ are the parameters of $p_u(u)$. In the following sections, we discuss a number of divergences for fitting flow-based models, with a particular focus on the Kullback-Leibler (KL) divergence as it is one of the most popular choices.

2.3.1 Forward KL divergence and maximum likelihood estimation

The forward KL divergence between the target distribution $p_x^*(x)$ and the flow-based model $p_x(x; \theta)$ can be written as follows:

$$\begin{aligned} \mathcal{L}(\theta) &= D_{\mathrm{KL}}\left[\, p_x^*(x) \,\|\, p_x(x; \theta) \,\right] \\ &= -\mathbb{E}_{p_x^*(x)}\left[ \log p_x(x; \theta) \right] + \text{const.} \\ &= -\mathbb{E}_{p_x^*(x)}\left[ \log p_u\!\left(T^{-1}(x; \phi); \psi\right) + \log \left|\det J_{T^{-1}}(x; \phi)\right| \right] + \text{const.} \end{aligned} \tag{12}$$

The forward KL divergence is well-suited for situations in which we have samples from the target distribution (or the ability to generate them), but we cannot necessarily evaluate the target density $p_x^*(x)$. Assuming we have a set of samples $\{x_n\}_{n=1}^N$ from $p_x^*(x)$, we can estimate the expectation over $p_x^*(x)$ by Monte Carlo as follows:

$$\mathcal{L}(\theta) \approx -\frac{1}{N} \sum_{n=1}^{N} \left[ \log p_u\!\left(T^{-1}(x_n; \phi); \psi\right) + \log \left|\det J_{T^{-1}}(x_n; \phi)\right| \right] + \text{const.} \tag{13}$$

Minimizing the above Monte Carlo approximation of the KL divergence is equivalent to fitting the flow-based model to the samples $\{x_n\}_{n=1}^N$ by maximum likelihood estimation.

In practice, we typically optimize the parameters $\theta$ iteratively with stochastic gradient-based methods. We can obtain an unbiased estimate of the gradient of the KL divergence with respect to the parameters as follows:

$$\nabla_\phi \mathcal{L}(\theta) \approx -\frac{1}{N} \sum_{n=1}^{N} \left[ \nabla_\phi \log p_u\!\left(T^{-1}(x_n; \phi); \psi\right) + \nabla_\phi \log \left|\det J_{T^{-1}}(x_n; \phi)\right| \right] \tag{14}$$

$$\nabla_\psi \mathcal{L}(\theta) \approx -\frac{1}{N} \sum_{n=1}^{N} \nabla_\psi \log p_u\!\left(T^{-1}(x_n; \phi); \psi\right). \tag{15}$$

The update with respect to $\psi$ may also be done in closed form if $p_u(u; \psi)$ admits closed-form …
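To make eqs. (13) and (14) concrete, here is a minimal sketch that fits a 1-D affine flow `x = exp(s)*u + b` with a fixed standard-normal base to samples by gradient descent on the Monte Carlo negative log-likelihood. The model, the synthetic data, the step size, and the use of full-batch rather than stochastic gradients (chosen so the run is deterministic) are all assumptions of this example.

```python
import math
import random

def nll_and_grads(data, b, s):
    """Mean negative log-likelihood of x = exp(s)*u + b with u ~ N(0, 1),
    together with its gradients with respect to b and s (cf. eqs. 13, 14)."""
    n = len(data)
    loss = grad_b = grad_s = 0.0
    for x in data:
        u = (x - b) * math.exp(-s)  # u = T^{-1}(x)
        # -log p_u(u) - log|det J_{T^{-1}}(x)|, where log|det J_{T^{-1}}| = -s
        loss += 0.5 * u * u + 0.5 * math.log(2.0 * math.pi) + s
        grad_b += -u * math.exp(-s)
        grad_s += 1.0 - u * u
    return loss / n, grad_b / n, grad_s / n

def fit(data, steps=2000, lr=0.1):
    """Plain gradient descent on the flow parameters (b, s)."""
    b, s = 0.0, 0.0
    for _ in range(steps):
        _, g_b, g_s = nll_and_grads(data, b, s)
        b -= lr * g_b
        s -= lr * g_s
    return b, s

rng = random.Random(0)
data = [rng.gauss(3.0, 2.0) for _ in range(200)]  # synthetic target samples
b_hat, s_hat = fit(data)
```

Because the flow is affine and the base is Gaussian, the maximum-likelihood solution is known in closed form: `b_hat` should approach the sample mean and `exp(s_hat)` the sample standard deviation, which gives a simple sanity check.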

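How should the transformation be designed?
• The transformation must be invertible and differentiable; in addition, its Jacobian determinant should be fast to compute.
• In general, computing the Jacobian determinant costs $O(D^3)$, but a variety of flows for which it can be computed in $O(D)$ have been proposed. (D is the dimensionality of the distribution.)
• In almost all cases the transformation is built from neural networks.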

Autoregressive flow
• An autoregressive flow transforms the i-th variable as in eq. (28) of the excerpt.
• In other words, the transformation of the i-th variable uses only the variables whose index is smaller than i.
• As a result, the Jacobian is a lower-triangular matrix.
• Hence its determinant is the product of the diagonal elements, and the log-determinant is the sum of their logarithms.

Excerpt from the reference paper:

Autoregressive flows

[…] we saw that, under reasonable conditions, we can transform any distribution […] into a uniform distribution in $(0, 1)^D$ using maps with a triangular Jacobian. Autoregressive flows are a direct implementation of this construction, specifying $f_\phi$ to have the following form (as described by e.g. Huang et al., 2018; Jaini et al., 2019):

$$z'_i = \tau(z_i; h_i) \quad \text{where} \quad h_i = c_i(z_{<i}), \tag{28}$$

where $\tau$ is termed the transformer and $c_i$ the i-th conditioner. The transformer is a strictly monotonic function of $z_i$ (and therefore invertible), is parameterized by $h_i$, and specifies how the flow acts on $z_i$ in order to output $z'_i$. The conditioner determines the parameters of the transformer, and in turn, can modify the transformer's behavior. The conditioner does not need to be a bijection. Its one constraint is that the i-th conditioner can take as input only the variables with dimension indices less than i. The parameters $\phi$ of $f_\phi$ are typically the parameters of the conditioner (not shown above for notational simplicity), but sometimes the transformer has its own parameters too (in addition to $h_i$).

It is easy to check that the above construction is invertible for any choice of $\tau$ and $c_i$ as long as the transformer is invertible. Given $z'$, we can compute z iteratively as follows:

$$z_i = \tau^{-1}(z'_i; h_i) \quad \text{where} \quad h_i = c_i(z_{<i}). \tag{29}$$

In the forward computation, each $h_i$ and therefore each $z'_i$ can be computed independently in any order or in parallel. In the inverse computation however, all $z_{<i}$ need to have been computed before $z_i$, so that $z_{<i}$ is available to the conditioner for computing $h_i$.

[Figure labels: Transformer, Conditioner]

[…] the derivative of $z'_i$ with respect to $z_j$ is zero whenever j > i. Hence, the Jacobian of $f_\phi$ can be written in the following form:

$$J_{f_\phi}(z) = \begin{bmatrix} \frac{\partial \tau}{\partial z_1}(z_1; h_1) & & 0 \\ & \ddots & \\ L(z) & & \frac{\partial \tau}{\partial z_D}(z_D; h_D) \end{bmatrix}. \tag{30}$$

The Jacobian is a lower-triangular matrix whose diagonal elements are the derivatives of the transformer for each of the D elements of z. Since the determinant of any triangular matrix is equal to the product of its diagonal elements, the log-absolute-determinant of $J_{f_\phi}(z)$ can be calculated in O(D) time as follows:

$$\log \left|\det J_{f_\phi}(z)\right| = \log \left| \prod_{i=1}^{D} \frac{\partial \tau}{\partial z_i}(z_i; h_i) \right| = \sum_{i=1}^{D} \log \left| \frac{\partial \tau}{\partial z_i}(z_i; h_i) \right|.$$

The lower-triangular part of the Jacobian, denoted here by L(z), is irrelevant. The derivatives of the transformer can be computed either analytically or via automatic differentiation …
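The structure of eqs. (28) to (30) can be sketched with a toy affine autoregressive flow. The conditioner below is a made-up function of the prefix `z_{<i}` (any function of the prefix would do), and the transformer is `z'_i = exp(log_alpha_i) * z_i + beta_i`; both choices and all names are assumptions for illustration.

```python
import math

def conditioner(prefix):
    """Hypothetical c_i: maps z_{<i} to h_i = (log_alpha_i, beta_i)."""
    s = sum(prefix)
    return math.tanh(s), 0.5 * s

def forward(z):
    """Apply z'_i = exp(log_alpha_i) * z_i + beta_i and accumulate log|det J|."""
    z_out, log_det = [], 0.0
    for i, z_i in enumerate(z):
        log_alpha, beta = conditioner(z[:i])  # depends only on z_{<i}
        z_out.append(math.exp(log_alpha) * z_i + beta)
        log_det += log_alpha                  # log of the i-th diagonal entry
    return z_out, log_det

def inverse(z_out):
    """Recover z one coordinate at a time, as in eq. (29)."""
    z = []
    for z_out_i in z_out:
        log_alpha, beta = conditioner(z)      # z currently holds z_{<i}
        z.append((z_out_i - beta) * math.exp(-log_alpha))
    return z
```

Note the asymmetry: every iteration of `forward` depends only on the input `z`, so the iterations could run in parallel, whereas `inverse` is inherently sequential because each step needs the coordinates recovered before it.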

Transformer and conditioner
• Affine transformer (eq. 32): simple, but its expressive power may not be very high.
• Neural transformer (eq. 34): a perceptron, where $\sigma$ is a suitable activation function.
• The conditioner can be any suitable neural network, but using a separate network for each conditioner leads to too many parameters, so methods such as sharing an RNN have been proposed.

Excerpt from the reference paper:

[…] In the following paragraphs, we will describe various choices for the implementation of the transformer, as well as discuss their expressivity and computational efficiency.

Affine autoregressive flows. A simple choice of transformer is the class of affine functions:

$$\tau(z_i; h_i) = \alpha_i z_i + \beta_i \quad \text{where} \quad h_i = \{\alpha_i, \beta_i\}. \tag{32}$$

The above can be thought of as a location-scale transformation, where $\alpha_i$ controls the scale and $\beta_i$ controls the location. Invertibility is guaranteed if $\alpha_i \neq 0$, and this can be easily achieved by e.g. taking $\alpha_i = \exp \tilde{\alpha}_i$, where $\tilde{\alpha}_i$ is an unconstrained parameter (in which case …

… inverse autoregressive flow (IAF) (Kingma et al., 2016), masked autoregressive flow (MAF) (Papamakarios et al., 2017), and Glow (Kingma and Dhariwal, 2018).

Non-affine neural transformers. Non-affine transformers can be constructed from simple components based on the observation that conic combinations as well as compositions of monotonic functions are also monotonic. Given monotonic functions $\tau_1, \ldots, \tau_K$ of a real variable z, the following functions are also monotonic:
• Conic combination: $\tau(z) = \sum_{k=1}^{K} w_k \tau_k(z)$, where $w_k > 0$ for all k.
• Composition: $\tau(z) = \tau_K \circ \cdots \circ \tau_1(z)$.

For example, a non-affine neural transformer can be constructed using a conic combination of monotonically increasing activation functions $\sigma(\cdot)$ (such as logistic sigmoid, tanh, leaky ReLU, and others):

$$\tau(z_i; h_i) = w_{i0} + \sum_{k=1}^{K} w_{ik}\, \sigma(\alpha_{ik} z_i + \beta_{ik}) \quad \text{where} \quad h_i = \{w_{i0}, \ldots, w_{iK}, \alpha_{ik}, \beta_{ik}\}, \tag{34}$$

provided $\alpha_{ik} > 0$ and $w_{ik} > 0$ for all $k \geq 1$. Clearly, the above construction corresponds to a monotonic single-layer perceptron. By repeatedly combining and composing monotonic activation functions, we can construct a multi-layer perceptron that is monotonic, provided that all its weights are strictly positive.

Non-affine neural transformers such as the above can represent any monotonic function arbitrarily well, which follows directly from the universal-approximation capabilities of multi-layer perceptrons (see e.g. Huang et al., 2018, for details). The derivatives of neural transformers needed for the computation of the Jacobian determinant are in principle analytically obtainable, but more commonly they are computed via backpropagation. A drawback of neural transformers is that in general they cannot be inverted analytically, and can be inverted only iteratively e.g. using bisection search. Variants of non-affine neural transformers have been used in models such as neural autoregressive flow (NAF) (Huang et al., 2018), …

… segment (which can be done in O(log K) time using binary search) and then evaluating or inverting that segment, which is assumed to be analytically tractable. By increasing the number of segments K, a spline-based transformer can be made arbitrarily flexible.

3.1.2 Implementing the conditioner

Moving on to the conditioner, $c_i(z_{<i})$ can be any function of $z_{<i}$, meaning that each conditioner can, in principle, be implemented as an arbitrary neural network with input $z_{<i}$ and output $h_i$. However, a naïve implementation in which each $c_i(z_{<i})$ is a separate neural network would scale poorly with the dimensionality D, requiring D forward propagations of a vector of average size D/2. This is in addition to the cost of storing and learning the parameters of D independent networks. In fact, early work on flow precursors (Chen and Gopinath, 2000) dismissed the autoregressive approach as prohibitively expensive.

Nonetheless, this problem can be effectively addressed in practice by sharing parameters across the conditioners $c_i(z_{<i})$, or even by combining the conditioners into a single network. In the following paragraphs, we will discuss some practical implementations of the conditioner that allow it to scale to high dimensions.

Recurrent autoregressive flows. One way to share parameters across conditioners is by using a recurrent neural network (RNN). The i-th conditioner is implemented as:

$$h_i = c(s_i) \quad \text{where} \quad s_1 = \text{initial state}, \quad s_i = \mathrm{RNN}(z_{i-1}, s_{i-1}) \ \text{for } i > 1. \tag{38}$$
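The remark that neural transformers have no analytic inverse and must be inverted iteratively can be illustrated as follows. The sketch implements the conic combination of eq. (34) with tanh as the activation and inverts it by bisection; the parameter values, the search bracket, and the function names are assumptions for this example. With tanh the image of the transformer is a bounded interval, so the inversion only makes sense for targets inside it.

```python
import math

def transformer(z, w0, w, alpha, beta):
    """tau(z) = w0 + sum_k w_k * tanh(alpha_k * z + beta_k), with w_k, alpha_k > 0."""
    return w0 + sum(wk * math.tanh(ak * z + bk)
                    for wk, ak, bk in zip(w, alpha, beta))

def invert_by_bisection(y, params, lo=-20.0, hi=20.0, iters=100):
    """Solve tau(z) = y for z, relying only on tau being strictly increasing.
    Assumes the solution lies inside the bracket [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if transformer(mid, *params) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical parameters h_i = {w0, w_k, alpha_k, beta_k} with K = 2 units.
params = (0.1, [0.5, 1.5], [1.0, 2.0], [0.0, -1.0])
y = transformer(0.4, *params)
z_recovered = invert_by_bisection(y, params)
```

Each bisection step halves the bracket, so the cost of inversion is a fixed number of forward evaluations per coordinate, in contrast to the single evaluation needed for an affine transformer.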

Other flows
• A residual flow represents the flow as in eq. (54) of the excerpt.
• Among these, the planar flow (eq. 62) and the radial flow (eq. 69), among others, have a Jacobian determinant that can be computed in O(D) (although, depending on the parameters, they may not be invertible).
• There are also methods that treat the flow as a continuous-time transformation instead of the application of discrete transformations.

Excerpt from the reference paper:

[…] a class of invertible transformations of the general form:

$$z' = z + g_\phi(z), \tag{54}$$

where $g_\phi$ is a neural network that outputs a D-dimensional translation vector, and $\phi$ are its parameters. This structure bears a strong similarity to residual networks […]; we use the term residual flow to refer to a normalizing flow composed of such transformations. Residual transformations are not always invertible, but can be made invertible if $g_\phi$ is constrained appropriately. In what follows, we discuss two general approaches to designing invertible residual transformations: the first is based on contractive maps, and the second is based on the matrix determinant lemma.

Contractive residual flows. A residual transformation is guaranteed to be invertible if $g_\phi$ can be made contractive with respect to some distance function (Behrmann et al., 2019; Chen et al., 2019). In general, a …

If the determinant and inverse of A are tractable and M is less than D, the matrix determinant lemma can provide a computationally efficient way to compute the determinant of $A + VW^\top$. For example, if A is diagonal, computing the left-hand side costs $O(D^3 + D^2 M)$, whereas computing the right-hand side costs $O(M^3 + DM^2)$, which is preferable if M < D. In the context of flows, the matrix determinant lemma can be used to efficiently compute the Jacobian determinant. In this section, we will discuss examples of residual flows that are specifically designed such that application of the matrix determinant lemma leads to efficient Jacobian-determinant computation.

Planar flow. One early example is the planar flow (Rezende and Mohamed, 2015), where the function $g_\phi$ is a one-layer neural network with a single hidden unit:

$$z' = z + v\,\sigma(w^\top z + b). \tag{62}$$

The parameters of the planar flow are $v \in \mathbb{R}^D$, $w \in \mathbb{R}^D$ and $b \in \mathbb{R}$, and $\sigma$ is a differentiable activation function such as the hyperbolic tangent. This flow can be interpreted as expanding/contracting the space in the direction perpendicular to the hyperplane $w^\top z + b = 0$. The Jacobian of the transformation is given by:

$$J_{f_\phi}(z) = I + \sigma'(w^\top z + b)\, v w^\top, \tag{63}$$

where $\sigma'$ is the derivative of the activation function. The Jacobian has the form of a diagonal matrix plus a rank-1 update. Using the matrix determinant lemma, the Jacobian determinant can be computed in time O(D) as follows:

$$\det J_{f_\phi}(z) = 1 + \sigma'(w^\top z + b)\, w^\top v. \tag{64}$$

In general, the planar flow is not invertible for all values of v and w. However, assuming that $\sigma'$ is positive everywhere and bounded from above (which is the case if $\sigma$ is the hyperbolic tangent, for example), a sufficient condition for invertibility is $w^\top v > -\frac{1}{\sup_x \sigma'(x)}$, since it …

[Sylvester flow] van den Berg et al. (2018) proposed the parameterization V = QU and W = QL, where Q is a $D \times M$ matrix whose columns are an orthonormal set of vectors (this requires $M \leq D$), U is $M \times M$ upper triangular, and L is $M \times M$ lower triangular. Since $Q^\top Q = I$ and the product of upper-triangular matrices is also upper triangular, the Jacobian determinant becomes:

$$\det J_{f_\phi}(z) = \det\left(I + S(z) L^\top U\right) = \prod_{i=1}^{D} \left(1 + S_{ii}(z) L_{ii} U_{ii}\right). \tag{68}$$

Similar to planar flows, Sylvester flows are not invertible for all values of their parameters. Assuming $\sigma'$ is positive everywhere and bounded from above, a sufficient condition for invertibility is $L_{ii} U_{ii} > -\frac{1}{\sup_x \sigma'(x)}$ for all $i \in \{1, \ldots, D\}$, since it ensures $\det J_{f_\phi}(z)$ is non-zero everywhere.

Radial flow. Radial flows (Tabak and Turner, 2013; Rezende and Mohamed, 2015) take the following form:

$$z' = z + \frac{\beta}{\alpha + r(z)} (z - z_0) \quad \text{where} \quad r(z) = \|z - z_0\|. \tag{69}$$

The parameters of the flow are $\alpha \in (0, +\infty)$, $\beta \in \mathbb{R}$ and $z_0 \in \mathbb{R}^D$, and $\|\cdot\|$ is the Euclidean norm. The above transformation can be thought of as a contraction/expansion radially with center $z_0$. The Jacobian can be written as follows:

$$J_{f_\phi}(z) = \left(1 + \frac{\beta}{\alpha + r(z)}\right) I - \frac{\beta}{r(z)\,(\alpha + r(z))^2} (z - z_0)(z - z_0)^\top, \tag{70}$$

which is a diagonal matrix plus a rank-1 update. Applying the matrix determinant lemma and rearranging, we get the following expression for the Jacobian determinant, which can be computed in O(D):

$$\det J_{f_\phi}(z) = \left(1 + \frac{\alpha\beta}{(\alpha + r(z))^2}\right) \left(1 + \frac{\beta}{\alpha + r(z)}\right)^{D-1}. \tag{71}$$

The radial flow is not invertible for all values of $\beta$. A sufficient condition for invertibility is $\beta > -\alpha$, since it ensures that $\det J_{f_\phi}(z)$ is non-zero everywhere.

In summary, planar, Sylvester and radial flows have O(D) Jacobian determinants, and can …

[Continuous-time flows] […] we considered constructing flows by parameterizing a one-step transformation […], several of which are then composed to create a flow of K […]. An alternative strategy is to construct flows in continuous time by parameterizing the flow's infinitesimal dynamics, and then integrating to find the corresponding transformation. In other words, we construct the flow by defining an ordinary differential equation (ODE) that describes the flow's evolution in time. We call these 'continuous-time' flows as they evolve according to a real-valued scalar variable analogous to the number of steps. We call this scalar 'time' as it determines how long the dynamics are run. […] we will describe this class of continuous-time flows and summarize numerical tools necessary for their implementation.

Let $z_t$ denote the flow's state at time t (or 'step' t, thinking in the discrete setting). Time t is assumed to run continuously from $t_0$ to $t_1$, such that $z_{t_0} = u$ and $z_{t_1} = x$. A continuous-time flow is constructed by parameterizing the time derivative of $z_t$ with a neural network $g_\phi$ with parameters $\phi$, yielding the following ordinary differential equation (ODE):

$$\frac{dz_t}{dt} = g_\phi(t, z_t). \tag{74}$$

The neural network $g_\phi$ takes as inputs both the time t and the flow's state $z_t$, and outputs the time derivative of $z_t$ at time t. The only requirements for $g_\phi$ are that it be uniformly Lipschitz continuous in $z_t$ (meaning that there is a single Lipschitz constant that works for all t) and continuous in t (Chen et al., 2018). From Picard's existence theorem, it follows that satisfying these requirements ensures that the above ODE has a unique solution (Coddington and Levinson, 1955). Many neural-network layers meet these requirements (Gouk et al., 2018), and unlike the neural architectures described in section 3 that require careful structural assumptions to ensure invertibility and tractability of their Jacobian determinant, $g_\phi$ has no such requirements.

To compute the transformation x = T(u), we need to run the dynamics forward in time by integrating:

$$x = z_{t_1} = u + \int_{t=t_0}^{t_1} g_\phi(t, z_t)\, dt. \tag{75}$$
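As a minimal sketch of the planar flow (eqs. 62 and 64), the code below applies `z' = z + v * tanh(w.z + b)` and evaluates the O(D) determinant given by the matrix determinant lemma. The choice of tanh as the activation, the parameter values used when trying it out, and the function names are assumptions for this illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def planar_forward(z, v, w, b):
    """z' = z + v * tanh(w^T z + b), the planar flow of eq. (62) with sigma = tanh."""
    a = math.tanh(dot(w, z) + b)
    return [z_i + v_i * a for z_i, v_i in zip(z, v)]

def planar_det_jacobian(z, v, w, b):
    """det J = 1 + sigma'(w^T z + b) * w^T v, computed in O(D) time (eq. 64)."""
    a = math.tanh(dot(w, z) + b)
    sigma_prime = 1.0 - a * a  # derivative of tanh
    return 1.0 + sigma_prime * dot(w, v)

def is_invertible_sufficient(v, w):
    """Sufficient condition w^T v > -1 / sup sigma'; for tanh, sup sigma' = 1."""
    return dot(w, v) > -1.0
```

The determinant never forms the D x D Jacobian: it needs only two dot products, which is the point of building the update as a rank-1 perturbation of the identity.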

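Stereographic projection of the hyperboloid
• Project the hyperboloid Q onto the open disk D.
• By the same computation as for the sphere, an analogous relation holds.
• However, the constraint on the sphere and the constraint on the hyperboloid differ.
• For the sphere: the closer a point on the sphere is to S, the closer its image is to O; the farther it is from S, the closer its image is to the circumference. The hyperboloid behaves in the opposite way. Is this related to curvature?
• Note: as $t \to \infty$, $|u|^2 \to 1$, so the image is the open disk of radius 1.

[Figure labels: Lorentz model; Poincaré disk]

These slides are adapted from slides by Prof. Matsumoto of Nagasaki University (SlideShare: www.slideshare.net/HirotakaMatsumoto/…).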

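Minkowski inner product
• In the Lorentz model, the inner product is defined not as the usual inner product but as the Minkowski inner product. For $x = (x_1, x_2, t_1)$ and $y = (y_1, y_2, t_2)$:

$$\langle x, y \rangle_L = x_1 y_1 + x_2 y_2 - t_1 t_2$$

• With this inner product, the distance becomes:

$$d(x, y)_L = \operatorname{arccosh}\left(-\langle x, y \rangle_L\right)$$

The minus sign is needed because points on the hyperboloid satisfy $\langle x, x \rangle_L = -1$ and $\langle x, y \rangle_L \leq -1$, so the argument of arccosh is at least 1.
• The field that studies the properties of figures in such a space under such a metric is called hyperbolic geometry.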
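The Minkowski inner product and the resulting distance can be sketched as below for the 2-dimensional Lorentz model. The example assumes the convention in which points lie on the upper sheet of the hyperboloid with $\langle p, p \rangle_L = -1$, so that the distance is $\operatorname{arccosh}(-\langle x, y \rangle_L)$; the helper `lift_to_hyperboloid` and the other names are introduced only for this illustration.

```python
import math

def minkowski_inner(x, y):
    """<x, y>_L = x1*y1 + x2*y2 - t1*t2 for points given as (x1, x2, t)."""
    return x[0] * y[0] + x[1] * y[1] - x[2] * y[2]

def lift_to_hyperboloid(x1, x2):
    """Place (x1, x2) on the upper sheet, where <p, p>_L = -1 and t > 0."""
    return (x1, x2, math.sqrt(1.0 + x1 * x1 + x2 * x2))

def lorentz_distance(x, y):
    """d(x, y) = arccosh(-<x, y>_L) for points on the hyperboloid."""
    # The clamp guards against arguments slightly below 1 caused by rounding.
    return math.acosh(max(1.0, -minkowski_inner(x, y)))

origin = lift_to_hyperboloid(0.0, 0.0)          # the point (0, 0, 1)
p = lift_to_hyperboloid(math.sinh(1.5), 0.0)    # hyperbolic distance 1.5 from origin
d = lorentz_distance(origin, p)
```

The point `(sinh r, 0, cosh r)` sits at hyperbolic distance `r` from `(0, 0, 1)`, which makes it a convenient check of the implementation.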

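Digression: why did people in the past think about such things?
• The founder of ordinary plane geometry is Euclid (3rd century BC), who is also the great figure who established the backbone of modern mathematics: proving theorems from a set of axioms.
• Among the axioms of plane geometry, the following one is called the parallel postulate:
"Given a line and a point not on that line, at most one line that passes through the point and is parallel to the given line (that is, does not intersect it) can be drawn in the given plane." (from Wikipedia)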

• The objection raised by others:
"It seems correct, but isn't it rather complicated? Isn't it a theorem rather than an axiom? Can't it be proved from the other axioms?"

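• In the 19th century, hyperbolic geometry was proposed by Bolyai and Lobachevsky, and the objection was settled in the negative.
• Hyperbolic geometry satisfies all of Euclid's other axioms but does not satisfy the parallel postulate. (In other words, the parallel postulate cannot be derived from the other axioms.)

[Figure: Poincaré disk. The figure is taken from the following website: www.fse.sci.waseda.ac.jp, "幾何学が与えてくれるもの（数学科）" ("What geometry gives us", Department of Mathematics).]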

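Digression: applications to the real world
• Until the 19th century hyperbolic geometry remained a matter of purely mathematical interest, but the theory of relativity, proposed in the 20th century, is built on hyperbolic geometry (and its generalization, Riemannian geometry).
• In the last few years, research applying hyperbolic geometry to machine learning has become very active, so it may come to occupy an important place in machine learning as well (or it may not).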

Embedding into hyperbolic space
• Words and graphs are embedded into hyperbolic space (the Poincaré ball model).
• The resulting embeddings automatically reflect hierarchical structure.
• High performance is obtained even with a low embedding dimension d.
• Published at NIPS.
• In graph representation learning (Node2Vec), embedding into hyperbolic space instead of Euclidean space improves performance dramatically when the graph is tree-like.

Table 1: Experimental results on the transitive closure of the WORDNET noun hierarchy. Highlighted cells indicate the best Euclidean embeddings as well as the Poincaré embeddings which achieve equal or better results. Bold numbers indicate absolute best results.

Dimensionality                     5       10       20       50      100      200
WORDNET Reconstruction
  Euclidean      Rank         3542.3   2286.9   1685.9   1281.7   1187.3   1157.3
                 MAP           0.024    0.059    0.087    0.140    0.162    0.168
  Translational  Rank          205.9    179.4     95.3     92.8     92.7     91.0
                 MAP           0.517    0.503    0.563    0.566    0.562    0.565
  Poincaré       Rank            4.9     4.02     3.84     3.98      3.9     3.83
                 MAP           0.823    0.851    0.855     0.86    0.857     0.87
WORDNET Link Pred. (remaining rows cut off)
  Euclidean      Rank         3311.1   2199.5    952.3    351.4    190.7     81.5
                 MAP           0.024    0.059    0.176    0.286    0.428    0.490

Why is hyperbolic space a good fit for tree-like graphs?
• Define the distance ratio as $\frac{d(x, y)}{d(x, O) + d(y, O)}$.
• The Euclidean distance ratio is always constant, whereas the hyperbolic distance ratio approaches the graph distance ratio as the points approach the circumference.
• This makes embeddings that preserve graph distances possible.

[Figure: a tree with nodes x, y and root O, its embedding in the disk, and a plot comparing the graph, hyperbolic and Euclidean distance ratios.]
Figure 1: Geodesics and distances in the Poincaré disk. As x and y move towards the outside of the disk (i.e., $\|x\|, \|y\| \to 1$), the distance $d_H(x, y)$ approaches $d_H(x, O) + d_H(O, y)$.

Excerpt (Christopher De Sa et al., ICML):

[…] are not preserved, but are given by

$$d_H(x, y) = \operatorname{acosh}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right).$$

There are some potentially unexpected consequences of this formula, and a simple example gives intuition about a technical property that allows hyperbolic space to embed trees. Consider three points: the origin 0, and points x and y with $\|x\| = \|y\| = t$ for some t > 0. As shown on the right of Figure 1, as $t \to 1$ (i.e., the points move towards the outside of the disk), in flat Euclidean space, the ratio $\frac{d_E(x, y)}{d_E(x, 0) + d_E(0, y)}$ is constant with respect to t (blue curve). In contrast, the ratio $\frac{d_H(x, y)}{d_H(x, 0) + d_H(0, y)}$ approaches 1, or, equivalently, the distance $d_H(x, y)$ approaches $d_H(x, 0) + d_H(0, y)$ (red and pink curves). That is, the shortest path between x and y is almost the same as the path through the origin. This is analogous to the property of trees in which the shortest path between two sibling nodes is the path through their parent. This tree-like nature of hyperbolic space is the key property exploited by embeddings. Moreover, this property holds for arbitrarily small angles between x and y.

Lines and geodesics. There are two types of geodesics (shortest paths) in the Poincaré disk model of hyperbolic space: segments of circles that are orthogonal to the disk surface, and disk diameters [3]. Our algorithms and proofs ma…
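The behaviour of the distance ratio can be checked numerically. The sketch below evaluates $d(x, y) / (d(x, O) + d(O, y))$ for two points at radius t separated by a right angle, once with the Euclidean distance and once with the Poincaré-disk distance quoted above. The specific point configuration and the function names are assumptions chosen for this illustration.

```python
import math

def poincare_distance(x, y):
    """d_H(x, y) = acosh(1 + 2*|x - y|^2 / ((1 - |x|^2) * (1 - |y|^2)))."""
    sq = lambda v: sum(c * c for c in v)
    diff = [a - b for a, b in zip(x, y)]
    return math.acosh(1.0 + 2.0 * sq(diff) / ((1.0 - sq(x)) * (1.0 - sq(y))))

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_ratio(dist, t):
    """Ratio d(x, y) / (d(x, O) + d(O, y)) for |x| = |y| = t at a right angle."""
    x, y, origin = (t, 0.0), (0.0, t), (0.0, 0.0)
    return dist(x, y) / (dist(x, origin) + dist(origin, y))

euclid = [distance_ratio(euclidean_distance, t) for t in (0.5, 0.9, 0.999)]
hyper = [distance_ratio(poincare_distance, t) for t in (0.5, 0.9, 0.999)]
```

The Euclidean ratio stays at sqrt(2)/2 for every t, while the hyperbolic ratio increases towards 1 as t approaches 1, which is the tree-like behaviour described in the excerpt.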

Hyperbolic space and support vector machines
• A support vector machine that takes data points embedded in hyperbolic space as input.

Paper: "Large-Margin Classification in Hyperbolic Space", Hyunghoon Cho (Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology), Benjamin DeMeo (Department of Biomedical Informatics, Harvard University), Jian Peng (Department of Computer Science, University of Illinois at Urbana-Champaign), Bonnie Berger (Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology). arXiv:1806.00437v1 [cs.LG], 1 Jun 2018; AISTATS 2019.

Abstract: Representing data in hyperbolic space can effectively capture latent hierarchical relationships. With the goal of enabling accurate classification of points in hyperbolic space while respecting their hyperbolic geometry, we introduce hyperbolic SVM, a hyperbolic formulation of support vector machine classifiers, and elucidate through new theoretical work its connection to the Euclidean counterpart. We demonstrate the performance improvement of hyperbolic SVM for multi-class prediction tasks on real-world complex networks as well as simulated datasets.

• Ordinary soft-margin linear SVM
Max-margin learning picks, from the set $\mathcal{H}$ of candidate decision rules, the rule whose smallest margin over the $m$ training examples $(x^{(j)}, y^{(j)})$ is largest (eq. 8 of the paper). Let the data space be $\mathbb{R}^n$ with the Euclidean distance and consider only linear rules $\mathcal{H} = \{h(x; w) : w \in \mathbb{R}^n\}$, where
$$h(x; w) = \begin{cases} 1 & w^T x > 0, \\ -1 & \text{otherwise.} \end{cases} \quad (9)$$
The max-margin problem then becomes equivalent to the convex optimization problem
$$\min_{w \in \mathbb{R}^n} \ \tfrac{1}{2}\|w\|^2 \quad (10)$$
$$\text{subject to } y^{(j)}(w^T x^{(j)}) \ge 1, \ \forall j \in [m]. \quad (11)$$
The algorithm that solves this problem (via its dual) is known as support vector machines. Introducing a relaxation for the separability constraints gives the more commonly used soft-margin variant of SVM:
$$\min_{w \in \mathbb{R}^n} \ \tfrac{1}{2}\|w\|^2 + C \sum_{j=1}^{m} \max\left(0, 1 - y^{(j)}(w^T x^{(j)})\right). \quad (12)$$

• Soft-margin linear SVM in hyperbolic space
With $*$ denoting the Minkowski inner product, the hard-margin problem reads
$$\min_{w \in \mathbb{R}^{n+1}} \ -\tfrac{1}{2}\, w * w \quad (16)$$
$$\text{subject to } y^{(j)}(w * x^{(j)}) \ge 1, \ \forall j \in [m], \quad (17)$$
$$w * w < 0. \quad (18)$$
The proof of Theorem 2 is exactly analogous to the Euclidean version, and is provided in the Supplementary Information. Our result suggests that despite the apparent complexity of hyperbolic distance calculation, the optimal (linear) maximum margin classifiers in the hyperbolic space can be identified via a relatively simple optimization problem that closely resembles the Euclidean version of SVM, where Euclidean inner products are replaced with Minkowski inner products.

Note that if we restrict $\mathcal{H}$ to decision functions where $w_0 = 0$, then our formulation coincides with Euclidean SVM. Thus, Euclidean SVM can be viewed as a special case of our formulation where the first coordinate (which corresponds to the time axis in Minkowski spacetime) is neglected. Unlike Euclidean SVM, however, our optimization problem has a non-convex objective as well as a non-convex constraint. Yet, if we restrict our attention to non-trivial, finite-sized problems where it is necessary and sufficient to consider only the set of $w$ for which at least one data point lies on either side of the decision boundary, then the negative norm constraint can be replaced with a convex alternative that intuitively maps out the convex hull of given data points in the ambient Euclidean space of $\mathbb{L}^n$.

Finally, the soft-margin formulation of hyperbolic SVM can be derived by relaxing the separability constraints as in the Euclidean case. Instead of imposing a linear penalty on misclassification errors, which has an intuitive interpretation as being proportional to the minimum Euclidean distance to the correct classification in the Euclidean case, we impose a penalty proportional to the hyperbolic distance to the correct classification. Analogous to the Euclidean case, we fix the scale of penalty so that the margin of the closest point to the decision boundary (that is correctly classified) is set to $\sinh^{-1}(1)$. This leads to the optimization problem
$$\min_{w \in \mathbb{R}^{n+1}} \ -\tfrac{1}{2}\, w * w + C \sum_{j=1}^{m} \max\left(0, \sinh^{-1}(1) - \sinh^{-1}\left(y^{(j)}(w * x^{(j)})\right)\right), \quad (19)$$
$$\text{subject to } w * w < 0. \quad (20)$$
In all our experiments in the following section, we consider the simplest approach of solving the above formulation of hyperbolic SVM via projected gradient descent. The initial $w$ is determined …

The authors of the paper include Bonnie Berger!
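To make the soft-margin objective (19) with constraint (20) concrete, here is a minimal sketch that only evaluates the objective for a given $w$; it does not implement the projected gradient descent solver. The sign convention of the Minkowski product (first coordinate time-like, so that data points satisfy $x * x = 1$) is an assumption, since the excerpt does not spell it out, and the helper `lift_to_hyperboloid` is hypothetical.

```python
import math

def minkowski(u, v):
    """Minkowski product u * v; the first coordinate is taken as time-like
    (assumed sign convention)."""
    return u[0] * v[0] - sum(a * b for a, b in zip(u[1:], v[1:]))

def lift_to_hyperboloid(p):
    """Hypothetical helper: lift a Euclidean point p onto the sheet
    x * x = 1 with x0 > 0."""
    return (math.sqrt(1.0 + sum(c * c for c in p)),) + tuple(p)

def hsvm_objective(w, xs, ys, C=1.0):
    """Value of the soft-margin hyperbolic SVM objective for weight vector w."""
    ww = minkowski(w, w)
    if ww >= 0.0:
        # The constraint w * w < 0 must hold for a valid decision boundary.
        raise ValueError("constraint violated: w * w must be negative")
    margin = math.asinh(1.0)
    penalty = sum(
        max(0.0, margin - math.asinh(y * minkowski(w, x)))
        for x, y in zip(xs, ys))
    return -0.5 * ww + C * penalty

if __name__ == "__main__":
    xs = [lift_to_hyperboloid((2.0, 0.0)), lift_to_hyperboloid((-2.0, 0.0))]
    w = (0.0, -2.0, 0.0)
    print(hsvm_objective(w, xs, [1, -1]))   # both points outside the margin
    print(hsvm_objective(w, xs, [-1, 1]))   # both points misclassified
```

Because the penalty passes through $\sinh^{-1}$, the objective is not convex in $w$, which is why the choice of the initial $w$ matters for a gradient-based solver.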

Results on synthetic datasets
• Consider a hyperbolic Gaussian mixture and generate datasets from it.
• Hyperbolic SVM performs better than Euclidean SVM.

Figure 2 (Cho et al.): Multi-class classification of Gaussian mixtures in hyperbolic space. (a) Two-fold cross validation results for 100 simulated Gaussian mixture datasets with 4 randomly positioned components and 100 points sampled from each component. Each dot represents the average performance over 5 trials. Vertical and horizontal lines represent standard deviations. Example decision hyperplanes for hyperbolic and Euclidean SVMs are shown in (b) and (c), respectively, using the Poincaré disk model. Color of each decision boundary denotes which component is being discriminated from the rest. [Panel (a) axes: macro-AUPR of Euclidean SVM against macro-AUPR of hyperbolic SVM, both from 0.5 to 1.]

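Loss function of HSVM
• The hinge function used in an ordinary SVM is convex, but the loss of the hyperbolic SVM is plainly non-convex.
• Global optimization is therefore difficult and the solver falls into local optima.
  → In the prior work, the value learned by an ordinary SVM is first used as the initial value for the hyperbolic SVM.
[Figure: loss curves $f_1$, $f_2$, $f_3$ plotted against $x \in [-4, 4]$ (vertical axis 0 to 6), comparing ESVM and HSVM.]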

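Trying out an implementation
• Compared LIBLINEAR with my own implementation, hsvm. LIBLINEAR requires scaling, so the data are scaled.
• The dataset is the one used in the original paper: the task is to decide whether data points drawn from a hyperbolic Gaussian mixture can be classified.
• The evaluation metric is accuracy.
• Prepared […] datasets of training data and test data.
• The fraction of positives in each dataset is […].
• The hyperparameter C is fixed at […].
• In this implementation, the initial value was drawn […] times and the best run was adopted.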

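Results of the implementation
[Figure: results for hyperbolic SVM and LIBLINEAR.]
• Hyperbolic SVM performed very well.
• The results even look better than in the original paper. Whether this comes from the different evaluation metric, from not tuning the parameter C, or from the different initialization has not been verified.
• SVM has many extensions, and most of them can probably be carried over; whether that is needed is unclear.
• The paper also proposes a kernel SVM, though it is built on the Minkowski inner product.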

Tangent space, parallel transport, exponential map, logarithmic map
Source: Nagano et al., "A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning", ICML 2019

Figure 2 (Nagano et al.): (a) One-dimensional Lorentz model $\mathbb{H}^1$ (red) and its tangent space $T_\mu\mathbb{H}^1$ (blue). (b) Parallel transport carries $v \in T_{\mu_0}$ (green) to $u \in T_\mu$ (blue) while preserving $\|\cdot\|_{\mathcal{L}}$. (c) Exponential map projects $u \in T_\mu$ (blue) to $z \in \mathbb{H}^n$ (red). The distance between $\mu$ and $\exp_\mu(u)$, measured on the surface of $\mathbb{H}^n$, coincides with $\|u\|_{\mathcal{L}}$.

• Tangent space: for a point $\mu$ on hyperbolic space, the set of points satisfying the orthogonality relation with respect to the Lorentzian product,
$$T_\mu\mathbb{H}^n := \{u : \langle u, \mu \rangle_{\mathcal{L}} = 0\}. \quad (2)$$
  Representing $T_\mu\mathbb{H}^n$ as a set of vectors in the ambient space $\mathbb{R}^{n+1}$ into which $\mathbb{H}^n$ is embedded, it can be thought of literally as the tangent space of the forward hyperboloid sheet at $\mu$. Note that $T_{\mu_0}\mathbb{H}^n$ consists of $v \in \mathbb{R}^{n+1}$ with $v_0 = 0$, and there $\|v\|_{\mathcal{L}} := \sqrt{\langle v, v \rangle_{\mathcal{L}}} = \|v\|_2$.
• Parallel transport: the operation that moves a vector $v \in T_\nu\mathbb{H}^n$ along the geodesic from $\nu$ to $\mu$ (the counterpart of a straight line in hyperbolic space) into $T_\mu\mathbb{H}^n$, with the requirement that the metric is preserved. That is,
$$\langle PT_{\nu \to \mu}(v), PT_{\nu \to \mu}(v') \rangle_{\mathcal{L}} = \langle v, v' \rangle_{\mathcal{L}}.$$

Tangent space, parallel transport, exponential map, logarithmic map
• Parallel transport on the Lorentz model (Figure 2(b)) is given by
$$PT_{\nu \to \mu}(v) = v + \frac{\langle \mu - \alpha\nu, v \rangle_{\mathcal{L}}}{\alpha + 1}(\nu + \mu), \quad (3)$$
  where $\alpha = -\langle \nu, \mu \rangle_{\mathcal{L}}$. The inverse parallel transport simply carries the vector in $T_\mu\mathbb{H}^n$ back to $T_\nu\mathbb{H}^n$ along the geodesic:
$$v = PT^{-1}_{\nu \to \mu}(u) = PT_{\mu \to \nu}(u). \quad (4)$$
• Exponential map: the operation that maps a point $u$ of $T_\mu\mathbb{H}^n$ onto $\mathbb{H}^n$. Every $u \in T_\mu\mathbb{H}^n$ determines a unique maximal geodesic $\gamma_\mu : [0, 1] \to \mathbb{H}^n$ with $\gamma_\mu(0) = \mu$ and $\dot\gamma_\mu(0) = u$, and $\exp_\mu(u) = \gamma_\mu(1)$, so that the distance from $\mu$ to the destination coincides with $\|u\|_{\mathcal{L}}$. For hyperbolic space this map (Figure 2(c)) is
$$z = \exp_\mu(u) = \cosh(\|u\|_{\mathcal{L}})\,\mu + \sinh(\|u\|_{\mathcal{L}})\,\frac{u}{\|u\|_{\mathcal{L}}}, \quad (5)$$
  and it is norm preserving in the sense that $d_\ell(\mu, \exp_\mu(u)) = \operatorname{arccosh}(-\langle \mu, \exp_\mu(u) \rangle_{\mathcal{L}}) = \|u\|_{\mathcal{L}}$.
• Logarithmic map: the inverse of the exponential map, often written $\log$. It is needed to map a point of hyperbolic space back to the tangent space, on which the distribution is initially defined, in order to evaluate a density. Solving (5) for $u$ gives
$$u = \exp^{-1}_\mu(z) = \frac{\operatorname{arccosh}(\alpha)}{\sqrt{\alpha^2 - 1}}(z - \alpha\mu), \quad (6)$$
  where $\alpha = -\langle \mu, z \rangle_{\mathcal{L}}$. See Appendix A.1 of the paper for further details.
• The important point is that all of these computations can be carried out analytically.

With these tools the paper constructs the pseudo-hyperbolic (wrapped) Gaussian $\mathcal{G}(\mu, \Sigma)$ on hyperbolic space, with $\mu \in \mathbb{H}^n$ and positive definite $\Sigma$: 1. sample a vector $\tilde v$ from the Gaussian distribution $\mathcal{N}(0, \Sigma)$ defined over $\mathbb{R}^n$; 2. interpret $\tilde v$ as an element of $T_{\mu_0}\mathbb{H}^n \subset \mathbb{R}^{n+1}$ by rewriting $\tilde v$ as $v = [0, \tilde v]$.
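The closed forms for parallel transport, the exponential map and the logarithmic map translate almost line by line into code. The sketch below uses plain Python lists and assumes the usual Lorentzian product $\langle u, v \rangle_{\mathcal{L}} = -u_0 v_0 + \sum_{i \ge 1} u_i v_i$, which is not written out on the slide; all function names are illustrative.

```python
import math

def lorentz_inner(u, v):
    """Lorentzian product: minus sign on the first (time-like) coordinate."""
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))

def lorentz_norm(u):
    """Norm of a tangent vector; clamped at zero against rounding error."""
    return math.sqrt(max(lorentz_inner(u, u), 0.0))

def parallel_transport(v, nu, mu):
    """Move v from the tangent space at nu to the tangent space at mu."""
    alpha = -lorentz_inner(nu, mu)
    shifted = [m - alpha * n for m, n in zip(mu, nu)]
    coef = lorentz_inner(shifted, v) / (alpha + 1.0)
    return [vi + coef * (n + m) for vi, n, m in zip(v, nu, mu)]

def exp_map(mu, u):
    """Project the tangent vector u at mu onto the hyperboloid."""
    r = lorentz_norm(u)
    if r < 1e-12:
        return list(mu)
    return [math.cosh(r) * m + math.sinh(r) * ui / r for m, ui in zip(mu, u)]

def log_map(mu, z):
    """Inverse of exp_map: the tangent vector at mu that points to z."""
    alpha = -lorentz_inner(mu, z)
    if alpha <= 1.0 + 1e-12:
        return [0.0] * len(mu)
    scale = math.acosh(alpha) / math.sqrt(alpha * alpha - 1.0)
    return [scale * (zi - alpha * m) for zi, m in zip(z, mu)]

if __name__ == "__main__":
    mu0 = [1.0, 0.0, 0.0]                      # origin of the hyperboloid
    mu = exp_map(mu0, [0.0, 0.3, -0.4])
    u = parallel_transport([0.0, 1.0, 2.0], mu0, mu)
    z = exp_map(mu, u)
    print(lorentz_inner(z, z))                 # close to -1: z is on the sheet
    print(log_map(mu, z), u)                   # the two vectors should agree
```

Every step is a handful of arithmetic operations, which is the practical meaning of the remark that these maps can be computed analytically.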

Probability distributions on hyperbolic space
• Riemannian normal distribution
  Derived by the maximum entropy principle, but the normalizing constant is hard to compute.
  Paper excerpt (left and right edges cropped): … The property that Pennec (2006) takes for granted is the maximization … and a covariance matrix, yielding in the isotropic setting
$$\mathcal{N}^{R}_{\mathcal{M}}(z \mid \mu, \sigma^2) = \frac{1}{Z^{R}} \exp\left(-\frac{d_{\mathcal{M}}(\mu, z)^2}{2\sigma^2}\right),$$
  where $d_{\mathcal{M}}$ is the Riemannian distance on the manifold induced by the metric tensor. Such a distribution, referred to as the Riemannian normal distribution, is used by Said et al. (2014) …, or by Hauberg (2018) on the hypersphere $S^d$. Sampling from such distributions and computing the normalising constant, especially in the anisotropic setting, is …
• Wrapped normal distribution
  An alternative proposed so that sampling becomes easy.

Algorithm 1 (Nagano et al., ICML 2019): Sampling on hyperbolic space
  Input: parameters $\mu \in \mathbb{H}^n$, $\Sigma$
  Output: $z \in \mathbb{H}^n$
  Require: $\mu_0 = (1, 0, \cdots, 0)^\top \in \mathbb{H}^n$
  1. Sample $\tilde v \sim \mathcal{N}(0, \Sigma) \in \mathbb{R}^n$.
  2. Set $v = [0, \tilde v] \in T_{\mu_0}\mathbb{H}^n$.
  3. Parallel transport $v$ to $u = PT_{\mu_0 \to \mu}(v) \in T_\mu\mathbb{H}^n \subset \mathbb{R}^{n+1}$ along the geodesic from $\mu_0$ to $\mu$.
  4. Map $u$ to $z = \exp_\mu(u) \in \mathbb{H}^n$.
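The four sampling steps of the wrapped normal distribution can be sketched as follows. To stay within the standard library the sketch assumes an isotropic covariance $\Sigma = \sigma^2 I$, which is a simplification of the general $\Sigma$ in Algorithm 1, and it assumes the Lorentzian product $-u_0 v_0 + \sum_{i \ge 1} u_i v_i$; the function names are illustrative.

```python
import math
import random

def lorentz_inner(u, v):
    """Lorentzian product: minus sign on the first (time-like) coordinate."""
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))

def parallel_transport(v, nu, mu):
    """Move v from the tangent space at nu to the tangent space at mu."""
    alpha = -lorentz_inner(nu, mu)
    shifted = [m - alpha * n for m, n in zip(mu, nu)]
    coef = lorentz_inner(shifted, v) / (alpha + 1.0)
    return [vi + coef * (n + m) for vi, n, m in zip(v, nu, mu)]

def exp_map(mu, u):
    """Project the tangent vector u at mu onto the hyperboloid."""
    r = math.sqrt(max(lorentz_inner(u, u), 0.0))
    if r < 1e-12:
        return list(mu)
    return [math.cosh(r) * m + math.sinh(r) * ui / r for m, ui in zip(mu, u)]

def sample_wrapped_normal(mu, sigma, rng):
    """Draw one point on H^n around mu (isotropic covariance assumed)."""
    n = len(mu) - 1
    origin = [1.0] + [0.0] * n
    v_tilde = [rng.gauss(0.0, sigma) for _ in range(n)]   # step 1
    v = [0.0] + v_tilde                                   # step 2
    u = parallel_transport(v, origin, mu)                 # step 3
    return exp_map(mu, u)                                 # step 4

if __name__ == "__main__":
    rng = random.Random(0)
    mu = exp_map([1.0, 0.0, 0.0], [0.0, 0.5, 0.2])
    for _ in range(3):
        print(sample_wrapped_normal(mu, 0.5, rng))
```

Since each step is an explicit differentiable map of the Gaussian noise, the sample can be reparameterized, which is what makes this distribution convenient for gradient-based learning.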

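Möbius addition, subtraction and multiplication
• On hyperbolic space, addition cannot be carried out in the usual way.
• Therefore the following Möbius addition is used:
$$x \oplus_c y = \frac{(1 + 2c\langle x, y \rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y \rangle + c^2\|x\|^2\|y\|^2}$$
• In other words: map $y$ by the logarithmic map to the tangent plane at the origin $\mu_0$
  → parallel transport the result to the tangent plane at $x$
  → return to hyperbolic space by the exponential map.
• Subtraction and multiplication can be defined in the same way. Note that these operations do not satisfy associativity, so they do not form a group. As an algebraic structure they form what is called a gyrogroup.

What kind of operation is Möbius addition?
• Setting $c = 0$ gives $x + y$.
• The following properties can be checked by substitution:
$$x \oplus 0 = 0 \oplus x = x, \qquad (-x) \oplus x = 0$$
• I could not find a reference that directly states the relation between Möbius addition and Riemannian geometry.

Meaning of Möbius addition
$$x \oplus y = \frac{(1 + 2c\langle x, y \rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y \rangle + c^2\|x\|^2\|y\|^2} = \exp_x \circ PT_{\mu_0 \to x} \circ \log_{\mu_0}\, y$$
[Diagram: $y$ is mapped to $\log_{\mu_0} y$, transported to $PT_{\mu_0 \to x} \circ \log_{\mu_0} y$ at $x$, and mapped to $x \oplus y$.]
[NeurIPS 2018] "Hyperbolic Neural Networks" comes close to stating this, but …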
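The stated properties of Möbius addition are easy to confirm numerically. The following minimal sketch implements the closed-form expression for points of the Poincaré ball given as Python lists; the function name `mobius_add` is illustrative.

```python
import math

def mobius_add(x, y, c=1.0):
    """Möbius addition of x and y in the Poincaré ball of curvature parameter c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [((1.0 + 2.0 * c * xy + c * y2) * a + (1.0 - c * x2) * b) / denom
            for a, b in zip(x, y)]

if __name__ == "__main__":
    x, y, z = [0.5, 0.0], [0.0, 0.5], [-0.3, 0.2]
    print(mobius_add(x, y, c=0.0))                  # plain vector addition
    print(mobius_add([-a for a in x], x))           # the zero vector
    # The two bracketings differ: the operation is not associative.
    print(mobius_add(mobius_add(x, y), z))
    print(mobius_add(x, mobius_add(y, z)))
    print(math.hypot(*mobius_add(x, y)))            # stays inside the unit ball
```

The last two bracketings give different points, which is the failure of associativity that makes the structure a gyrogroup rather than a group.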

Building neural networks
• Basically, it suffices to replace operations defined on Euclidean space, such as addition, with their Möbius counterparts.
• Ordinary RNN ($\varphi$ is the activation function, a pointwise nonlinearity, typically tanh, sigmoid, ReLU, etc.):
$$h_{t+1} = \varphi(W h_t + U x_t + b)$$
• Hyperbolic RNN: for parameters $W \in \mathcal{M}_{m,n}(\mathbb{R})$, $U \in \mathcal{M}_{m,d}(\mathbb{R})$, $b \in \mathbb{D}^m_c$,
$$h_{t+1} = \varphi^{\otimes_c}(W \otimes_c h_t \oplus_c U \otimes_c x_t \oplus_c b), \quad h_t \in \mathbb{D}^n_c, \ x_t \in \mathbb{D}^d_c. \quad (29)$$
  If the inputs $x_t$ are Euclidean, one can write $\tilde x_t := \exp^c_0(x_t)$ and use the above formula, since
$$\exp^c_{W \otimes_c h_t}\left(P^c_{0 \to W \otimes_c h_t}(U x_t)\right) = W \otimes_c h_t \oplus_c \exp^c_0(U x_t) = W \otimes_c h_t \oplus_c U \otimes_c \tilde x_t.$$
• GRU architecture. One can also adapt the GRU architecture:
$$r_t = \sigma(W^r h_{t-1} + U^r x_t + b^r), \quad z_t = \sigma(W^z h_{t-1} + U^z x_t + b^z),$$
$$\tilde h_t = \varphi(W(r_t \odot h_{t-1}) + U x_t + b), \quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t, \quad (30)$$
  where $\odot$ denotes the pointwise product. First, how should we adapt the pointwise multiplication by a … the Möbius version (see Eq. (26)) can be naturally extended: $(h, h') \in \mathbb{D}^n \times \mathbb{D}^p \mapsto \exp^c(f(\log^c(h), \log^c(h')))$. …
• $c$ denotes the curvature. So far the hyperboloid was taken to be the set with $\langle x, x \rangle_{\mathcal{L}} = -1$, but the same argument goes through with $\langle x, x \rangle_{\mathcal{L}} = -1/c$, and the paper takes this into account.
Source: Octavian-Eugen Ganea et al., NIPS 2018
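A single step of the hyperbolic RNN in (29) can be sketched as below. Several ingredients are assumptions that do not appear on the slide: the closed forms used for $\exp^c_0$, $\log^c_0$ and Möbius matrix-vector multiplication are written from the definitions in the Hyperbolic Neural Networks paper as recalled here, the two Möbius additions are evaluated left to right, and the activation is applied as $\exp^c_0 \circ \varphi \circ \log^c_0$. All names are illustrative.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def mobius_add(x, y, c):
    """Möbius addition in the Poincaré ball with curvature parameter c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [((1.0 + 2.0 * c * xy + c * y2) * a + (1.0 - c * x2) * b) / denom
            for a, b in zip(x, y)]

def exp0(v, c):
    """Exponential map at the origin of the ball (assumed closed form)."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    s = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [s * a for a in v]

def log0(y, c):
    """Logarithmic map at the origin of the ball (assumed closed form)."""
    n = norm(y)
    if n == 0.0:
        return list(y)
    s = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [s * a for a in y]

def mobius_matvec(M, x, c):
    """Möbius matrix-vector product, equal to exp0(M log0(x))."""
    Mx = [sum(m * a for m, a in zip(row, x)) for row in M]
    nx, nMx = norm(x), norm(Mx)
    if nx == 0.0 or nMx == 0.0:
        return [0.0] * len(Mx)
    s = math.tanh(nMx / nx * math.atanh(math.sqrt(c) * nx)) / (math.sqrt(c) * nMx)
    return [s * a for a in Mx]

def hyperbolic_rnn_step(W, U, b, h, x, c, phi=math.tanh):
    """One update h_{t+1} from h_t and x_t, all points lying in the ball."""
    pre = mobius_add(
        mobius_add(mobius_matvec(W, h, c), mobius_matvec(U, x, c), c), b, c)
    return exp0([phi(a) for a in log0(pre, c)], c)

if __name__ == "__main__":
    W = [[0.5, -0.2], [0.1, 0.3]]
    U = [[0.4], [-0.6]]
    h_next = hyperbolic_rnn_step(W, U, [0.05, 0.05], [0.1, -0.2], [0.3], c=1.0)
    print(h_next, norm(h_next))   # the new state remains inside the unit ball
```

The sketch keeps the Euclidean parameters $W$ and $U$ as ordinary matrices and only the bias and the states as points of the ball, mirroring the parameter types listed for (29).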

Mixed-curvature spaces
Published as a conference paper at ICLR 2019
LEARNING MIXED-CURVATURE REPRESENTATIONS IN PRODUCTS OF MODEL SPACES
Albert Gu, Frederic Sala, Beliz Gunel & Christopher Ré, Computer Science Department, Stanford University, Stanford, CA 94305

Abstract: The quality of the representations achieved by embeddings is determined by how well the geometry of the embedding space matches the structure of the data. Euclidean space has been the workhorse for embeddings; recently hyperbolic and spherical spaces have gained popularity due to their ability to better embed new types of structured data (such as hierarchical data), but most data is not structured so uniformly. We address this problem by proposing learning embeddings in a product manifold combining multiple copies of these model spaces (spherical, hyperbolic, Euclidean), providing a space of heterogeneous curvature suitable for a wide variety of structures. We introduce a heuristic to estimate the sectional curvature of graph data and directly determine an appropriate signature (the number of component spaces and their dimensions) of the product manifold. Empirically, we jointly learn the curvature and the embedding in the product space via Riemannian optimization. We discuss how to define and compute intrinsic quantities such as means (a challenging notion for product manifolds) and provably learnable optimization functions. On a range of datasets and reconstruction tasks, our product space embeddings outperform single Euclidean or hyperbolic spaces used in previous works, reducing distortion by 32.55% on a Facebook social network dataset. We learn word embeddings and find that a product of hyperbolic spaces in 50 dimensions consistently improves on baseline Euclidean and hyperbolic embeddings, by 2.6 points in Spearman rank correlation on similarity tasks and 3.4 points on analogy accuracy.

• Embedding into hyperbolic space (negative curvature) is not always optimal.
• Embedding into a sphere, a Euclidean space, or a mixture of these spaces may be the best choice.
• The authors built an algorithm for embedding into such mixed spaces.

Various spaces and the triangles in them
Figure 1 (Gu et al., ICLR 2019): Three component spaces: sphere $S^2$, Euclidean plane $E^2$, and hyperboloid $H^2$. Thick lines are geodesics; these get closer in positively curved ($K = +1$) space $S^2$, remain equidistant in flat ($K = 0$) space $E^2$, and get farther apart in negatively curved ($K = -1$) space $H^2$.

We propose embedding into product spaces in which each component has constant curvature. As we show, this allows us to capture a wider range of curvatures than traditional embeddings, while retaining the ability to globally optimize and operate on the resulting embeddings. Specifically, we form a Riemannian product manifold combining hyperbolic, spherical, and Euclidean components and equip it with a decomposable Riemannian metric. While each component space in the product has constant curvature (positive for spherical, negative for hyperbolic, and zero for Euclidean), the …

Figure 3 (Gu et al., ICLR 2019): Geodesic triangles in differently curved spaces: compared to Euclidean geometry, in which it satisfies the parallelogram law (Center), the median $am$ is longer in cycle-like positively curved space (Left), and shorter in tree-like negatively curved space (Right). The relative length of $am$ can be used as a heuristic to estimate discrete curvature. [Each panel shows a triangle with vertices $a$, $b$, $c$ and the midpoint $m$ of $bc$.]

• Cycle-like graphs seem suited to the sphere, grid-like graphs to the Euclidean plane, and tree-like graphs to the hyperboloid.
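The role of the median $am$ can be illustrated on tiny examples. In Euclidean geometry the parallelogram law gives $d(a,m)^2 = \tfrac{1}{2}(d(a,b)^2 + d(a,c)^2) - \tfrac{1}{4} d(b,c)^2$, so the deviation from this identity is zero for flat triangles. The sketch below evaluates that deviation from four distances; it is only an illustration of the idea and is not claimed to be the exact estimator of the paper, and the name `median_deviation` is hypothetical.

```python
import math

def median_deviation(d_am, d_ab, d_ac, d_bc):
    """Deviation from the parallelogram law for a triangle a, b, c with m the
    midpoint of bc: positive if the median is longer than in flat space,
    negative if it is shorter."""
    return d_am ** 2 + d_bc ** 2 / 4.0 - (d_ab ** 2 + d_ac ** 2) / 2.0

if __name__ == "__main__":
    # Flat case: Euclidean triangle, m is the midpoint of segment bc.
    a, b, c, m = (1.0, 3.0), (-2.0, 0.0), (4.0, 1.0), (1.0, 0.5)
    print(median_deviation(math.dist(a, m), math.dist(a, b),
                           math.dist(a, c), math.dist(b, c)))

    # Cycle-like case: 4-cycle 0-1-2-3-0 with a=0, b=1, c=3, m=2 (graph distances).
    print(median_deviation(d_am=2, d_ab=1, d_ac=1, d_bc=2))

    # Tree-like case: star with centre m and leaves a, b, c (graph distances).
    print(median_deviation(d_am=1, d_ab=2, d_ac=2, d_bc=2))
```

The sign pattern (zero, positive, negative) matches the three panels of the geodesic-triangle figure: flat, cycle-like and tree-like.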

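Which space suits this graph?
[Figure: graph $G$ with a cycle in the middle and trees attached on the outside.]
• We would like to embed the cycle-like part in the middle into a sphere and the tree-like parts on the outside into a hyperboloid.
• So let us represent it in the product space of a spherical space and a hyperbolic space (a mixed-curvature space)!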

Results on synthetic datasets
[Figure (Gu et al., ICLR 2019): graph $G$ and its decomposition into $G_{\text{tree}}$ and $G_{\text{cycle}}$. Caption, partly cropped: … Subscripts $i$ index components in the … neither hyperbolic nor spherical space is suitable for $G$, but … distortion. Note the decomposition into tree and cycle.]

Paper excerpt (left edge cropped): … depends on hyperbolic distance $d_H$ (for which the gradient …) which is continuously differentiable (Sala et al., 2018). … can be optimized through standard Riemannian optimization … (…, 2013) and RSVRG (Zhang et al., 2016). We write down … spaces in Algorithm 1. This proceeds by first computing … with respect to the ambient space of the embedding (Step 4), and … gradient by applying the Riemannian correction (multiply by …

Table 1 (Gu et al., ICLR 2019): Matching geometries: average distortion on canonical graphs (tree, cycle, ring of trees) with 40 nodes, comparing four spaces with total dimension 3. The best distortion is achieved by the space with matching geometry.

Space                     Cycle              Tree               Ring of Trees
                          |V|=40, |E|=40     |V|=40, |E|=39     |V|=40, |E|=40
(E^3)^1                   0.1064             0.1483             0.0997
(H^3)^1                   0.1638             0.0321             0.0774
(S^3)^1                   0.0007             0.1605             0.1106
(H^2)^1 x (S^1)^1         0.1108             0.0538             0.0616

Paper excerpt (right edge cropped): … doubling the number of factors. These models include the products consisting of only a constant curvature base space, ranging to various combinations of $S^{d/2}_2$, $H^{d/2}_2$ comprising factors of dimension 2. For a given signature, the curvatures are initialized to the appropriate value in … and then learned using the technique in Section 3.1. We additionally compare to the output of Algorithms 2, 3 for heuristically selecting a combination of spaces in which to embed these datasets.

Quality: We focus on the average distortion, which our loss function (2) optimizes, as our main metric for reconstruction, and additionally report the mAP metric for the unweighted graphs. As expected, for the synthetic graphs (tree, cycle, ring of trees), the matching geometries (hyperbolic, spherical, product of hyperbolic and spherical) yield the best distortion (Table 1). Next, we report in Table 2 the quality of embedding different graphs across a variety of allocations of spaces, … total dimension $d = 10$ following previous work (Nickel & Kiela, 2018). We confirm that the structure of each graph informs the best allocation of spaces. In particular, the cities graph, which has intrinsic structure close to $S^2$, embeds well into any space with a spherical component, and the tree-like Ph.D.s graph embeds well into hyperbolic products. We emphasize that even for such …

• The results came out as expected.
• After this the paper also embeds real data, but only quantitative metrics are reported and there is no particular interpretation.
• It would be interesting if the best embedding space could be related to an interpretation of the data.

Recent papers on hyperbolic geometry
• NIPS 2019
  "Numerically Accurate Hyperbolic Embeddings Using Tiling-Based Models"
  "Hyperbolic Graph Convolutional Neural Networks"
  "Hyperbolic Graph Neural Networks"
  "Multi-relational Poincaré Graph Embeddings"
  "Continuous Hierarchical Representations with Poincaré Variational Auto-Encoders"
• ICLR 2020
  "Mixed-curvature Variational Autoencoders"
• AISTATS 2020
  "Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces"
  "Hyperbolic Manifold Regression"
• ICML 2020
  "Latent Variable Modelling with Hyperbolic Normalizing Flows"
  "Constant Curvature Graph Convolutional Networks"

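For those who want to study hyperbolic geometry in more depth
• 「曲がった空間の幾何学」 (Geometry of Curved Spaces), by Reiko Miyaoka
  The explanations are intuitive and easy to follow. For the mathematics proper, turn to another book.
• 「双曲幾何」 (Hyperbolic Geometry), by Kenji Fukaya
  A proper mathematics text, yet very easy to follow.
• 「多様体の基礎」 (Foundations of Manifolds), by Yukio Matsumoto
  Very easy to follow, but for now it may not be necessary to understand this book (it may become necessary later).
• There is no book that directly treats the relation between machine learning and hyperbolic geometry. Go to the papers.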

Bioinformatics analysis and hyperbolic space
Anna Klimovskaia, David Lopez-Paz, Léon Bottou & Maximilian Nickel, "Poincaré maps for analyzing complex hierarchies in single-cell data", Nature Communications. https://doi.org/10.1038/s41467-020-16822-4
Abstract: The need to understand cell developmental processes spawned a plethora of computational methods for discovering hierarchies from scRNAseq data. However, existing techniques are based on Euclidean geometry, a suboptimal choice for modeling complex cell trajectories with multiple branches. To overcome this fundamental representation issue we propose Poincaré maps, a method that harness the power of hyperbolic geometry into the realm of single-cell data analysis. Often understood as a continuous extension of trees, hyperbolic geometry enables the embedding of complex hierarchical data in only two dimensions while preserving the pairwise distances between points in the hierarchy. This enables the use of our embeddings in a wide variety of downstream data analysis tasks, such as visualization, clustering, lineage detection and pseudotime inference. When compared to existing methods (unable to address all these important tasks using a single embedding), Poincaré maps produce state-of-the-art two-dimensional representations of cell trajectories on multiple scRNAseq datasets.
• A paper that embeds single-cell RNA-seq data into the Poincaré disk based on gene expression levels.
[Figure: page crop from the paper including Fig. 3, "Analysis of C. elegans cell atlas"; panel a is a Poincaré map with cells annotated by type (pharynx, body wall muscle, glia, neuron, germline, intestine, and others), with embedding hyperparameters k = 15, σ = 2.0, γ = 3.0.]
• In addition, there will be a presentation at IIBMP on embedding phylogenetic trees. Stay tuned.

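Overview
• Normalizing flows are a powerful way to transform distributions, but existing ones work only in Euclidean space. The goal is a normalizing flow on hyperbolic space.
• A wrapped normal distribution (or similar) can serve as the base distribution, so the key question is how to construct the flow itself.
• The paper proposes two flows, Tangent Coupling (TC) and Wrapped Hyperboloid Coupling (WHC), which make normalizing flows usable in hyperbolic space as well.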
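As a rough illustration of the base distribution mentioned above, a wrapped normal centred at the origin can be sampled by drawing a Gaussian vector in the tangent space and pushing it through the exponential map. The sketch below assumes the Lorentz (hyperboloid) model with curvature K = -1; the function names are invented for this note and are not taken from the paper's code.

```python
import numpy as np

def lorentz_inner(u, v):
    """Lorentz inner product <u, v>_L = -u_0 v_0 + sum_i u_i v_i (last axis)."""
    return -u[..., 0] * v[..., 0] + np.sum(u[..., 1:] * v[..., 1:], axis=-1)

def exp_map_origin(v_tan):
    """Exponential map at the origin o = (1, 0, ..., 0), assuming K = -1.

    v_tan has shape (..., n + 1) with v_tan[..., 0] == 0 (tangent at o).
    """
    spatial = v_tan[..., 1:]
    r = np.linalg.norm(spatial, axis=-1, keepdims=True)
    # sinh(r) / r tends to 1 as r -> 0, so guard the division.
    safe_r = np.where(r > 1e-12, r, 1.0)
    scale = np.where(r > 1e-12, np.sinh(safe_r) / safe_r, 1.0)
    return np.concatenate([np.cosh(r), scale * spatial], axis=-1)

def sample_wrapped_normal_at_origin(rng, num_samples, n, std=1.0):
    """Draw v ~ N(0, std^2 I) in R^n, embed it in T_o H^n, and apply exp_o."""
    v = rng.normal(scale=std, size=(num_samples, n))
    v_tan = np.concatenate([np.zeros((num_samples, 1)), v], axis=-1)
    return exp_map_origin(v_tan)

rng = np.random.default_rng(0)
samples = sample_wrapped_normal_at_origin(rng, num_samples=5, n=2, std=0.5)
# Every sample should satisfy the hyperboloid constraint <x, x>_L = -1.
constraint = lorentz_inner(samples, samples)
```

A wrapped normal with a mean away from the origin would additionally need a parallel transport step before the exponential map; that is left out here to keep the sketch short.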

Tangent coupling
• First, split the input $\tilde{x}$ into two parts, $\tilde{x}_1 := \tilde{x}_{1:d}$ and $\tilde{x}_2 := \tilde{x}_{d+1:n}$.
• In Euclidean space the RealNVP flow (affine coupling layer) is defined as
  $$z_1 = x_1, \qquad z_2 = x_2 \odot \sigma(s(x_1)) + t(x_1).$$
• Sandwiching this between the logarithmic map and the exponential map gives the hyperbolic version (K is the curvature):
  $$\tilde{f}^{TC}(\tilde{x}) = \begin{cases} \tilde{z}_1 = \tilde{x}_1 \\ \tilde{z}_2 = \tilde{x}_2 \odot \sigma(s(\tilde{x}_1)) + t(\tilde{x}_1) \end{cases} \qquad f^{TC}(x) = \exp^K_o\big(\tilde{f}^{TC}(\log^K_o(x))\big) \qquad (11)$$
• s and t are neural networks $T_o\mathbb{H}^d_K \to T_o\mathbb{H}^{n-d}_K$; σ is a nonlinear function.
Excerpt from the paper shown on the slide:
  At its core, the RealNVP flow uses a computationally symmetric transformation (affine coupling layer) which has the benefit of being fast to evaluate and invert due to its lower triangular Jacobian, whose determinant is cheap to compute. Operationally, the coupling layer is implemented using a binary mask, and partitions some input $\tilde{x}$ into two sets, where the first set, $\tilde{x}_1 := \tilde{x}_{1:d}$, is transformed elementwise independently of other dimensions. The second set, $\tilde{x}_2 := \tilde{x}_{d+1:n}$, is also transformed elementwise but in a way that depends on the first set (see Appendix B.2 for more details). Since all coupling layer operations occur at $T_o\mathbb{H}^n_K$ we term this form of coupling as Tangent Coupling (TC). Thus the overall transformation due to one layer of our TC flow is a composition of a logarithmic map, affine coupling defined on $T_o\mathbb{H}^n_K$, and an exponential map (Eq. 11), where $\tilde{x} = \log^K_o(x)$ is a point on $T_o\mathbb{H}^n_K$, and σ is a pointwise non-linearity such as the exponential function. Functions s and t are parameterized scale and translation functions implemented as neural nets from $T_o\mathbb{H}^d_K \to T_o\mathbb{H}^{n-d}_K$. One important detail is that arbitrary operations on a tangent vector $v \in T_o\mathbb{H}^n_K$ may transport the resultant vector outside the tangent space, hampering subsequent operations. To avoid this we can keep the first dimension fixed at $v_0 = 0$ to ensure we remain in $T_o\mathbb{H}^n_K$.
[Figure 2 of the paper is partly visible in the crop; its caption is reproduced on the "Transformed distributions" slide.]
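The composition in Eq. 11 (logarithmic map, affine coupling in the tangent space at the origin, exponential map) can be sketched in a few lines. This is only an illustration under stated assumptions: curvature K = -1, σ taken to be the exponential function, and `s` and `t` replaced by toy fixed linear maps (`make_toy_nets` is a hypothetical helper, not the paper's networks). Tangent vectors at the origin are stored by their spatial part only, which keeps the first coordinate at v_0 = 0 automatically.

```python
import numpy as np

def exp_map_origin(u):
    """exp_o for K = -1; u in R^n is the spatial part of a tangent vector at o."""
    r = np.linalg.norm(u, axis=-1, keepdims=True)
    safe_r = np.where(r > 1e-12, r, 1.0)
    scale = np.where(r > 1e-12, np.sinh(safe_r) / safe_r, 1.0)
    return np.concatenate([np.cosh(r), scale * u], axis=-1)

def log_map_origin(x):
    """Inverse of exp_map_origin; returns the spatial part in R^n."""
    xs = x[..., 1:]
    norm = np.linalg.norm(xs, axis=-1, keepdims=True)
    r = np.arccosh(np.clip(x[..., :1], 1.0, None))
    safe_norm = np.where(norm > 1e-12, norm, 1.0)
    scale = np.where(norm > 1e-12, r / safe_norm, 1.0)
    return scale * xs

def make_toy_nets(rng, d, n):
    """Hypothetical stand-ins for the scale and translation networks."""
    w_s = rng.normal(scale=0.3, size=(d, n - d))
    w_t = rng.normal(scale=0.3, size=(d, n - d))
    return (lambda u1: np.tanh(u1 @ w_s)), (lambda u1: u1 @ w_t)

def tc_forward(x, d, s, t):
    u = log_map_origin(x)                       # move to T_o H^n
    u1, u2 = u[..., :d], u[..., d:]
    z2 = u2 * np.exp(s(u1)) + t(u1)             # affine coupling, sigma = exp
    return exp_map_origin(np.concatenate([u1, z2], axis=-1))

def tc_inverse(y, d, s, t):
    z = log_map_origin(y)
    z1, z2 = z[..., :d], z[..., d:]
    u2 = (z2 - t(z1)) * np.exp(-s(z1))          # undo shift, then undo scale
    return exp_map_origin(np.concatenate([z1, u2], axis=-1))

rng = np.random.default_rng(0)
n, d = 4, 2
s, t = make_toy_nets(rng, d, n)
x = exp_map_origin(rng.normal(scale=0.5, size=(8, n)))   # points on H^4
y = tc_forward(x, d, s, t)
x_rec = tc_inverse(y, d, s, t)                            # should recover x
```

Because the first block passes through unchanged, the inverse can recompute `s` and `t` from it, which is the same trick that makes Euclidean RealNVP cheap to invert.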

Tangent coupling
• Even with this construction, the Jacobian determinant can be computed quickly:
  $$\det\left(\frac{\partial y}{\partial x}\right) = \left(\frac{R\,\sinh\!\big(\|z\|_{\mathcal{L}}/R\big)}{\|z\|_{\mathcal{L}}}\right)^{n-1} \times \prod_{i=d+1}^{n} \sigma(s(\tilde{x}_1))_i \times \left(\frac{R\,\sinh\!\big(\|\log^K_o(x)\|_{\mathcal{L}}/R\big)}{\|\log^K_o(x)\|_{\mathcal{L}}}\right)^{1-n} \qquad (12)$$
  where $z = \tilde{f}^{TC}(\tilde{x})$ and $\tilde{f}^{TC}$ is as defined in Eq. 11.
• It can be computed in O(n).
Excerpt from the paper shown on the slide:
  Similar to the Euclidean RealNVP, we need an efficient expression for the Jacobian determinant of $f^{TC}$.
  Proposition 1. The Jacobian determinant of a single TC layer in equation 11 is given by Eq. 12.
  Proof Sketch. Here we only provide a sketch of the proof and details can be found in Appendix C. First, observe that the overall transformation is a valid composition of functions: $y := \exp^K_o \circ \tilde{f}^{TC} \circ \log^K_o(x)$. Thus, the overall determinant can be computed by chain rule and the identity,
  $$\det\left(\frac{\partial y}{\partial x}\right) = \det\left(\frac{\partial \exp^K_o(z)}{\partial z}\right) \cdot \det\left(\frac{\partial \tilde{f}^{TC}(\tilde{x})}{\partial \tilde{x}}\right) \cdot \det\left(\frac{\partial \log^K_o(x)}{\partial x}\right).$$
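To make the O(n) claim concrete, the log of the determinant in Eq. 12 can be evaluated with a couple of vector norms and a sum. The sketch below assumes K = -1 (so R = 1), σ = exp (so log σ(s) = s), and tangent vectors stored by their spatial part, in which case the Lorentz norm reduces to the Euclidean norm. `tc_log_det`, `s_fn` and `t_fn` are illustrative names, not the authors' API.

```python
import numpy as np

def log_sinh_ratio(r):
    """log(sinh(r) / r), using the limit value 0 at r = 0."""
    safe_r = np.where(r > 1e-12, r, 1.0)
    return np.where(r > 1e-12, np.log(np.sinh(safe_r) / safe_r), 0.0)

def tc_log_det(u, d, s_fn, t_fn):
    """Log of Eq. 12 for one TC layer, assuming K = -1 and sigma = exp.

    u is the spatial part of log_o(x), shape (batch, n).
    """
    n = u.shape[-1]
    u1, u2 = u[..., :d], u[..., d:]
    s_out = s_fn(u1)
    z = np.concatenate([u1, u2 * np.exp(s_out) + t_fn(u1)], axis=-1)
    norm_z = np.linalg.norm(z, axis=-1)
    norm_u = np.linalg.norm(u, axis=-1)
    return ((n - 1) * log_sinh_ratio(norm_z)     # exponential-map term
            + np.sum(s_out, axis=-1)             # coupling term: sum_i log sigma(s)_i
            + (1 - n) * log_sinh_ratio(norm_u))  # logarithmic-map term

rng = np.random.default_rng(0)
n, d = 4, 2
u = rng.normal(size=(8, n))
zeros = lambda u1: np.zeros((u1.shape[0], n - d))
half = lambda u1: np.full((u1.shape[0], n - d), 0.5)

identity_log_det = tc_log_det(u, d, zeros, zeros)  # identity layer: log det = 0
scaled_log_det = tc_log_det(u, d, half, zeros)     # pure expansion: log det > 0
```

With the identity layer the first and last factors of Eq. 12 cancel exactly, which is a convenient sanity check when implementing the formula.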

Wrapped Hyperboloid Coupling (WHC)
• Tangent coupling uses only the tangent plane at the origin.
  → Is it not ill-suited to regions far from the origin?
  → Hence the proposal of Wrapped Hyperboloid Coupling:
  $$\tilde{f}^{WHC}(\tilde{x}) = \begin{cases} \tilde{z}_1 = \tilde{x}_1 \\ \tilde{z}_2 = \log^K_o\Big(\exp^K_{t(\tilde{x}_1)}\big(PT_{o \to t(\tilde{x}_1)}(v)\big)\Big) \end{cases} \qquad v = \tilde{x}_2 \odot \sigma(s(\tilde{x}_1))$$
  $$f^{WHC}(x) = \exp^K_o\big(\tilde{f}^{WHC}(\log^K_o(x))\big) \qquad (13)$$
• t is a neural network $T_o\mathbb{H}^d_K \to \mathbb{H}^n_K$ whose output has the form $t = [t_0, 0, \dots, 0, t_{d+1}, \dots, t_n]$. $t_0$ is determined from the other values so that t lies on the Lorentz hyperboloid.
Excerpt from the paper shown on the slide:
  We employ the coupling strategy previously discussed and partition our input vector into two components: $\tilde{x}_1 := \tilde{x}_{1:d}$ and $\tilde{x}_2 := \tilde{x}_{d+1:n}$. Let $\tilde{x} = \log^K_o(x)$ be the point on $T_o\mathbb{H}^n_K$ after the logarithmic map. The remainder of the WHC layer can be defined as in Eq. 13. Functions $s : T_o\mathbb{H}^d_K \to T_o\mathbb{H}^{n-d}_K$ and $t : T_o\mathbb{H}^d_K \to \mathbb{H}^n_K$ are taken to be arbitrary neural nets, but the role of t when compared to TC is vastly different. In particular, the generalization of translation on Riemannian manifolds can be viewed as parallel transport to a different tangent space. Consequently, in Eq. 13, the function t predicts a point on the manifold that we wish to parallel transport to. This greatly increases the flexibility as we are no longer confined to the tangent space at the origin. The logarithmic map is then used to ensure that both $\tilde{z}_1$ and $\tilde{z}_2$ are in the same tangent space before the final exponential map that projects the point to the manifold.
  One important consideration in the construction of t is that it should only parallel transport functions of $\tilde{x}_2$. However, the output of t is a point on $\mathbb{H}^n_K$ and without care this can involve elements in $\tilde{x}_1$. To prevent such a scenario we construct the output of $t = [t_0, 0, \dots, 0, t_{d+1}, \dots, t_n]$ where elements $t_{d+1:n}$ are used to determine the value of $t_0$ using Eq. 5, such that it is a point on the manifold and every remaining index is set to zero. Such a construction ensures that only components of any function of $\tilde{x}_2$ are parallel transported as desired. Figure 3 illustrates the transformation performed by the WHC layer.
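The forward pass of Eq. 13 can be sketched as follows, again as an illustration rather than the authors' implementation. Assumptions: curvature K = -1, σ = exp, and toy linear maps `w_s`, `w_t` (hypothetical) in place of the neural networks. The parallel transport from the origin uses the standard hyperboloid formula $PT_{o \to t}(v) = v + \frac{\langle t, v \rangle_{\mathcal{L}}}{1 + t_0}(o + t)$, and $t_0$ is obtained from the hyperboloid constraint $-t_0^2 + \|t_{d+1:n}\|^2 = -1$.

```python
import numpy as np

def lorentz_inner(a, b):
    """<a, b>_L with the last axis kept, for broadcasting."""
    return (-a[..., :1] * b[..., :1]
            + np.sum(a[..., 1:] * b[..., 1:], axis=-1, keepdims=True))

def sinh_over(r):
    safe_r = np.where(r > 1e-12, r, 1.0)
    return np.where(r > 1e-12, np.sinh(safe_r) / safe_r, 1.0)

def exp_map(p, v):
    """exp_p(v) on the hyperboloid (K = -1) for v tangent at p."""
    r = np.sqrt(np.clip(lorentz_inner(v, v), 0.0, None))
    return np.cosh(r) * p + sinh_over(r) * v

def log_map_origin(x):
    """log_o(x) as an ambient vector whose first coordinate is 0."""
    xs = x[..., 1:]
    norm = np.linalg.norm(xs, axis=-1, keepdims=True)
    r = np.arccosh(np.clip(x[..., :1], 1.0, None))
    safe_norm = np.where(norm > 1e-12, norm, 1.0)
    scale = np.where(norm > 1e-12, r / safe_norm, 1.0)
    return np.concatenate([np.zeros_like(r), scale * xs], axis=-1)

def whc_forward(x, d, w_s, w_t):
    n = x.shape[-1] - 1
    origin = np.zeros(n + 1)
    origin[0] = 1.0
    u = log_map_origin(x)[..., 1:]
    u1, u2 = u[..., :d], u[..., d:]
    v = u2 * np.exp(np.tanh(u1 @ w_s))                  # v = x2 * sigma(s(x1))
    zeros_d = np.zeros_like(u1)
    v_amb = np.concatenate([np.zeros_like(u[..., :1]), zeros_d, v], axis=-1)
    t_rest = u1 @ w_t                                   # predicts t_{d+1:n}
    t0 = np.sqrt(1.0 + np.sum(t_rest ** 2, axis=-1, keepdims=True))
    t = np.concatenate([t0, zeros_d, t_rest], axis=-1)  # point on the manifold
    q = v_amb + lorentz_inner(t, v_amb) / (1.0 + t0) * (origin + t)
    q_hat = exp_map(t, q)                               # wrap around t
    z2 = log_map_origin(q_hat)[..., 1 + d:]             # back to T_o H^n
    z = np.concatenate([np.zeros_like(t0), u1, z2], axis=-1)
    return exp_map(origin, z)

rng = np.random.default_rng(0)
n, d = 4, 2
w_s = rng.normal(scale=0.3, size=(d, n - d))
w_t = rng.normal(scale=0.3, size=(d, n - d))
tangent = np.concatenate([np.zeros((8, 1)), rng.normal(scale=0.5, size=(8, n))], axis=-1)
x = exp_map(np.eye(n + 1)[0], tangent)
u = log_map_origin(x)[..., 1:]
y = whc_forward(x, d, w_s, w_t)
y_no_shift = whc_forward(x, d, w_s, np.zeros_like(w_t))  # t = o: pure scaling
```

When `w_t` is zero the predicted point t is the origin itself, the parallel transport is the identity, and the layer collapses to a scaling of the second block, which gives a simple check of the construction.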

Wrapped Hyperboloid Coupling (WHC)
• The Jacobian determinant can likewise be computed efficiently:
  $$\det\left(\frac{\partial y}{\partial x}\right) = \prod_{i=d+1}^{n} \sigma(s(\tilde{x}_1))_i \times \left(\frac{R\,\sinh\!\big(\|q\|_{\mathcal{L}}/R\big)}{\|q\|_{\mathcal{L}}}\right)^{l} \times \left(\frac{R\,\sinh\!\big(\|\log^K_o(\hat{q})\|_{\mathcal{L}}/R\big)}{\|\log^K_o(\hat{q})\|_{\mathcal{L}}}\right)^{-l} \times \left(\frac{R\,\sinh\!\big(\|\tilde{z}\|_{\mathcal{L}}/R\big)}{\|\tilde{z}\|_{\mathcal{L}}}\right)^{n-1} \times \left(\frac{R\,\sinh\!\big(\|\log^K_o(x)\|_{\mathcal{L}}/R\big)}{\|\log^K_o(x)\|_{\mathcal{L}}}\right)^{1-n} \qquad (15)$$
  where $\tilde{z} = \mathrm{concat}(\tilde{z}_1, \tilde{z}_2)$, the constant $l = n - d$, σ is a non-linearity, $q = PT_{o \to t(\tilde{x}_1)}(v)$ and $\hat{q} = \exp^K_t(q)$.
• I understand what the formulas say, but the argument for why this variant should work better has not really clicked for me. Sorry.
Excerpt from the paper shown on the slide:
  [To compute the Jacobian determinant] of the full transformation in Eq. 13 we proceed by analyzing the effect of WHC on valid orthonormal bases w.r.t. the Lorentz inner product for the tangent space at the origin. We state our main result here and provide a sketch of the proof, while the entire proof can be found in Appendix D.
  Proposition 2. The Jacobian determinant of the function $\tilde{f}^{WHC}$ in equation 13 is given by Eq. 15.
  Proof Sketch. We first note that the exponential and logarithmic maps applied at the beginning and end of the WHC can be dealt with by appealing to the chain rule and the known Jacobian determinants for these functions as used in Proposition 1. Thus, what remains is the following term: $\det\left(\frac{\partial \tilde{z}}{\partial \tilde{x}}\right)$. To evaluate this term we rely on the following Lemma.
[The crop also shows the truncated opening of Section 4, Experiments: the TC and WHC flows are evaluated on structured density estimation (Section 4.1), graph reconstruction and graph generation.]

Transformed distributions
[Figure 2 of the paper. Comparison of density estimation in hyperbolic space for 2D wrapped Gaussian (WG) and mixture of wrapped Gaussians (MWG) on $\mathbb{P}^2_1$. Densities are visualized in the Poincaré disk. Panels show the target density next to the learned flows, with the proposed ones marked "(Ours)".]

Accuracy of posterior estimation

Table 1. Test log-likelihood on Binary Diffusion Process (BDP) versus latent dimension. All normalizing flows use 2 coupling layers.
Model    BDP-2         BDP-4         BDP-6
N-VAE    -55.4±0.2     -55.2±0.3     -56.1±0.2
H-VAE    -54.9±0.3     -55.4±0.2     -58.0±0.2
NC       -55.4±0.4     -54.7±0.1     -55.2±0.3
TC       -54.9±0.1     -55.4±0.1     -57.5±0.2
WHC      -55.1±0.4     -55.2±0.2     -56.9±0.4

Table 2. Test log-likelihood on MNIST averaged over 5 runs versus latent dimension. * indicates numerically unstable settings.
Model    MNIST 2       MNIST 4       MNIST 6
N-VAE    -139.5±1.0    -115.6±0.2    -100.0±0.02
H-VAE    *             -113.7±0.9    -99.8±0.2
NC       -139.2±0.4    -115.2±0.6    -98.7±0.3
TC       *             -112.5±0.2    -99.3±0.2
WHC      -136.5±2.1    -112.8±0.5    -99.4±0.2

• The paper does not state it explicitly, but since a normalizing flow by itself does not reduce dimensionality, I believe these numbers come from using the flow in combination with a VAE.
• Honestly, these results do not show much of a benefit.

Accuracy of graph reconstruction
• AP is Average Precision.
  Dis-I → a network of relationships between diseases and genes
  Dis-II → a network constructed from an SIR model (sorry, I do not fully understand this one)
• Here WHC was slightly better.

Table 3. Test AUC and Test AP on Graph Embeddings where Dis-I has latent dimension 6 and Dis-II has latent dimension 2.
Model    Dis-I AUC    Dis-I AP     Dis-II AUC   Dis-II AP
N-VAE    0.90±0.01    0.92±0.01    0.92±0.01    0.91±0.01
H-VAE    0.91±5e-3    0.92±5e-3    0.92±4e-3    0.91±0.01
NC       0.92±0.01    0.93±0.01    0.95±4e-3    0.93±0.01
TC       0.93±0.01    0.93±0.01    0.96±0.01    0.95±0.01
WHC      0.93±0.01    0.94±0.01    0.96±0.01    0.96±0.01

Excerpt from the paper shown on the slide:
  Similar to the structured density estimation setting, the performance gains of WHC are best observed in low-dimensional latent spaces.
  4.3. Graph Generation. Finally, we explore the utility of our hyperbolic flows for generating hierarchical structures. As a synthetic testbed, we construct datasets containing uniformly random trees as well as uniformly random lobster graphs (Golomb, 1996), where each graph contains between 20 to 100 nodes.

Evaluation of graph generation
• The flows are combined with Graph Normalizing Flows to evaluate whether graphs similar to the input can be generated.
• TC and WHC produce graphs closer to the originals, so they are a good fit when tree-like graphs are wanted.
• The paper also reports a quantitative evaluation, but I could not work out how to read it, so I skip it here.
[Figure 4 of the paper. Selected qualitative results on graph generation for lobster and random tree graph. Rows recoverable from the crop: Train, then one row per flow; columns: Lobster, Random Tree.]

Table 4. Generation statistics on random trees over 5 runs.
Model    Accuracy     Avg. Clust.   Avg. GC.
NC       56.6±5.5     40.9±42.7     0.34±0.10
TC       32.1±1.9     98.3±89.5     0.25±0.12
WHC      62.1±10.9    21.1±13.4     0.13±0.07

Excerpt from the paper shown on the slide:
  5. Related Work. Hyperbolic Geometry in Machine Learning: The intersection of hyperbolic geometry and machine learning has recently risen to prominence (Dhingra et al., 2018; Tay et al., 2018; Law et al., 2019; Khrulkov et al., 2019; Ovinnikov, 2019).

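Summary
• Normalizing flow
  → A method that builds a complex probability distribution by repeatedly transforming a simple one.
• Machine learning in hyperbolic space
  → Well suited to representing tree-like structure; developing machine learning methods on this basis has recently become popular.
• Hyperbolic normalizing flow
  → A normalizing flow on hyperbolic space. As in the Euclidean case, combining it with a VAE or similar model increases the model's expressive power.
  (The description of the experimental setup was insufficient, and I could not really follow the evaluation methodology.)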


Transcript