MIXED-CURVATURE REPRESENTATIONS IN PRODUCTS OF MODEL SPACES

Albert Gu, Frederic Sala, Beliz Gunel & Christopher Ré
Computer Science Department, Stanford University, Stanford, CA 94305
{albertgu,fredsala,bgunel}@stanford.edu, chrismre@cs.stanford.edu

ABSTRACT

The quality of the representations achieved by embeddings is determined by how well the geometry of the embedding space matches the structure of the data. Euclidean space has been the workhorse for embeddings; recently hyperbolic and spherical spaces have gained popularity due to their ability to better embed new types of structured data—such as hierarchical data—but most data is not structured so uniformly. We address this problem by proposing learning embeddings in a product manifold combining multiple copies of these model spaces (spherical, hyperbolic, Euclidean), providing a space of heterogeneous curvature suitable for a wide variety of structures. We introduce a heuristic to estimate the sectional cur…

- Hyperbolic (negative-curvature) embeddings are not always the best choice.
- Is a spherical, Euclidean, or mixed-space embedding optimal instead?
- The authors built an embedding algorithm for such mixed (product) spaces.
Figure 1: Three component spaces: sphere S^2, Euclidean plane E^2, and hyperboloid H^2. Thick lines are geodesics; these get closer in positively curved (K = +1) space S^2, remain equidistant in flat (K = 0) space E^2, and get farther apart in negatively curved (K = -1) space H^2.

We propose embedding into product spaces in which each component has constant curvature. As we show, this allows us to capture a wider range of curvatures than traditional embeddings, while retaining the ability to globally optimize and operate on the resulting embeddings. Specifically, we form a Riemannian product manifold combining hyperbolic, spherical, and Euclidean components and equip it with a decomposable Riemannian metric. While each component space in the product has constant curvature (positive for spherical, negative for hyperbolic, and zero for Euclidean), the …

Published as a conference paper at ICLR 2019

Figure 3: Geodesic triangles in differently curved spaces: compared to Euclidean geometry, in which the median am satisfies the parallelogram law (Center), am is longer in cycle-like positively curved space (Left) and shorter in tree-like negatively curved space (Right). The relative length of am can be used as a heuristic to estimate discrete curvature.

3.2 ESTIMATING THE SIGNATURE

- Cycle-like graphs → sphere, grid-like graphs → Euclidean plane, tree-like graphs → hyperbolic plane: each geometry appears suited to the corresponding structure.
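The median comparison in Figure 3 can be made quantitative. A minimal sketch (the exact normalization in the paper's heuristic may differ; the function below measures the parallelogram-law defect of the median, so flat geometry gives 0, cycle-like geometry positive values, and tree-like geometry negative values):

```python
def triangle_curvature(d, a, b, c, m):
    """Deviation of the median length d(a, m) from its flat-space value.

    By the parallelogram law, in Euclidean space the midpoint m of (b, c)
    satisfies d(a, m)^2 = (d(a, b)^2 + d(a, c)^2) / 2 - d(b, c)^2 / 4.
    A positive return value suggests spherical (cycle-like) geometry,
    a negative one hyperbolic (tree-like) geometry.
    """
    return (d(a, m) ** 2 + d(b, c) ** 2 / 4
            - (d(a, b) ** 2 + d(a, c) ** 2) / 2) / (2 * d(a, m))

# Toy graph metrics: a path graph behaves like flat space, a cycle like
# positively curved space.
d_path = lambda u, v: abs(u - v)                      # path 0-1-2-3-4
d_cycle = lambda u, v: min(abs(u - v), 6 - abs(u - v))  # 6-cycle

print(triangle_curvature(d_path, 0, 2, 4, 3))   # 0.0: flat
print(triangle_curvature(d_cycle, 0, 2, 4, 3))  # 1.0: positively curved
```

Here node m = 3 is the graph midpoint of (b, c) = (2, 4) in both graphs; only the triangle's apex a = 0 sees different geometry.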
in the Appendix.

3.1 OPTIMIZATION & COMPONENT CURVATURES

To compute embeddings, we optimize the placement of points through an auxiliary loss function. Given graph distances {d_G(X_i, X_j)}_ij, our loss function of choice is

    L(x) = Σ_{1≤i<j≤n} | (d_P(x_i, x_j) / d_G(X_i, X_j))^2 - 1 |,    (2)

We write E for our Euclidean embedding space component to distinguish it from R, since our models of spherical geometry also use R as an ambient space. We refer to each S^{s_i}, H^{h_i}, E^e as components or factors. We refer to the decomposition, e.g., (H^2)^2 = H^2 × H^2, as the signature. For convenience, let M_1, …, M_{m+n+1} refer to the factors in the product.

Distances on P. As discussed in Section 2, the product P is a Riemannian manifold defined by the structure of its components. For p, q ∈ P, we write d_{M_i}(p, q) for the distance d_{M_i} restricted to the appropriate components of p and q in the product. In particular, the squared distance in the product decomposes via (1). In other words, d_P is simply the ℓ2 norm of the component distances d_{M_i}. We note that P can also be equipped with different distances (ignoring the Riemannian structure), leading to a different embedding space. Without the underlying manifold structure, we cannot freely operate on the embedded points, such as taking geodesics and means, but some simple applications only interact through distances. For such settings, we consider the ℓ1 distance

    d_{P,ℓ1}(p, q) = Σ_{i=1}^{m} d_{S_i}(p, q) + Σ_{i=1}^{n} d_{H_i}(p, q) + d_E(p, q)

and the min distance

    d_{P,min}(p, q) = min { d_{S_1}(p, q), …, d_{H_1}(p, q), …, d_E(p, q) }.

These distances provide simple and interpretable embedding spaces using P, enabling us to introduce combinatorial constructions that allow for embeddings without the need for optimization. We give an example below and discuss further in the Appendix.
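The three combination rules (ℓ2 as in Eq. (1), ℓ1, and min) are easy to state in code. A minimal sketch with points given in ambient coordinates (unit vectors for the sphere, hyperboloid vectors with Minkowski norm -1 for hyperbolic space); the factor layout is illustrative:

```python
import math

def sphere_dist(x, y):
    """Geodesic distance on the unit sphere (points as ambient unit vectors)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, y))))
    return math.acos(dot)

def hyperboloid_dist(x, y):
    """Geodesic distance on the hyperboloid model, <u,v>_M = -u0*v0 + <u',v'>."""
    mink = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    return math.acosh(max(1.0, -mink))

def euclidean_dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def product_dist(p, q, dists):
    """l2 combination of per-factor distances, as in Eq. (1)."""
    return math.sqrt(sum(d(pi, qi) ** 2 for d, pi, qi in zip(dists, p, q)))

def l1_dist(p, q, dists):
    return sum(d(pi, qi) for d, pi, qi in zip(dists, p, q))

def min_dist(p, q, dists):
    return min(d(pi, qi) for d, pi, qi in zip(dists, p, q))

def loss(xs, graph_dist, dists):
    """Eq. (2): sum over pairs of |(d_P / d_G)^2 - 1|."""
    n = len(xs)
    return sum(abs((product_dist(xs[i], xs[j], dists) / graph_dist[i][j]) ** 2 - 1)
               for i in range(n) for j in range(i + 1, n))

# A point of S^2 x H^2 x E^2 is a tuple of factor coordinates.
dists = [sphere_dist, hyperboloid_dist, euclidean_dist]
p = ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0])
q = ([0.0, 1.0, 0.0], [math.sqrt(2.0), 1.0, 0.0], [3.0, 4.0])
print(product_dist(p, q, dists), l1_dist(p, q, dists), min_dist(p, q, dists))
```

Note how the min distance picks out whichever factor happens to be closest, which is what makes the combinatorial constructions in the Appendix possible.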
We then focus on the Riemannian distance, which allows Riemannian optimization directly on the manifold and enables full use of the manifold structure in generic downstream applications.

Example. Consider the graph G shown on the right of Figure 2. This graph has a backbone cycle with 9 nodes, each attached to a tree; such topologies are common in networking. If a single edge (a, b) is removed from the cycle, the result is a tree embeddable arbitrarily well into hyperbolic space (Sala et al., 2018). However, a and b (and their subtrees) would then incur an additional distance of 8 - 1 = 7, being forced to go the other way around the cycle. But using the ℓ1 distance, we can embed G_tree into H^2 and G_cycle into S^1, yielding arbitrarily low distortion for G. We give the full details and another combinatorial construction for the min distance in the Appendix.

Algorithm 1 RSGD in products
1: Input: loss function L : P → R
2: Initialize x(0) ∈ P randomly
3: for t = 0, …, T - 1 do
4:   h ← -∇L(x(t))              // gradient in the ambient space
5:   for i = 1, …, m do          // spherical factors
6:     v_i ← proj^S_{x_i(t)}(h_i)
7:   for i = m + 1, …, m + n do  // hyperbolic factors
8:     v_i ← proj^H_{x_i(t)}(h_i)
9:     v_i ← J v_i
10:  v_{m+n+1} ← h_{m+n+1}       // Euclidean factor
11:  for i = 1, …, m + n + 1 do
12:    x_i(t+1) ← Exp_{x_i(t)}(v_i)
13: return x(T)

(Figure 2, right: the graph G and its decomposition into G_tree and G_cycle.)

- d_P is the distance in the embedding space; d_G is the distance on the graph.
- Each component space is embedded independently, which keeps the construction simple.
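Algorithm 1 can be sketched concretely. A minimal sketch with plain-Python vectors; the hyperboloid convention here is <u,v>_M = -u0*v0 + Σ_i u_i*v_i with <x,x>_M = -1 for points, and the relative order of the correction J and the tangent projection follows the standard hyperboloid RSGD recipe, which may differ cosmetically from the paper's listing:

```python
import math

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def scale(u, c): return [c * a for a in u]
def add(u, v): return [a + b for a, b in zip(u, v)]
def norm(u): return math.sqrt(dot(u, u))

def mink(u, v):
    """Minkowski inner product on the ambient space of the hyperboloid."""
    return -u[0] * v[0] + dot(u[1:], v[1:])

def proj_sphere(x, h):
    """Project an ambient vector h onto the tangent space of the sphere at x."""
    return add(h, scale(x, -dot(x, h)))

def exp_sphere(x, v):
    """Exponential map on the unit sphere (v tangent at x)."""
    t = norm(v)
    if t < 1e-12: return x
    return add(scale(x, math.cos(t)), scale(v, math.sin(t) / t))

def proj_hyperboloid(x, h):
    """Riemannian correction J (flip the time coordinate), then tangent projection."""
    jh = [-h[0]] + h[1:]
    return add(jh, scale(x, mink(x, jh)))

def exp_hyperboloid(x, v):
    t = math.sqrt(max(mink(v, v), 0.0))
    if t < 1e-12: return x
    return add(scale(x, math.cosh(t)), scale(v, math.sinh(t) / t))

def rsgd_step(x, grads, factor_types, lr=0.1):
    """One step of Algorithm 1 over a list of factor points."""
    out = []
    for xi, hi, kind in zip(x, grads, factor_types):
        hi = scale(hi, -lr)                 # descent direction in ambient space
        if kind == "S":
            out.append(exp_sphere(xi, proj_sphere(xi, hi)))
        elif kind == "H":
            out.append(exp_hyperboloid(xi, proj_hyperboloid(xi, hi)))
        else:                               # Euclidean factor: plain gradient step
            out.append(add(xi, hi))
    return out

# One step on a point of S^2 x H^2 x E^2 with a toy ambient gradient.
x = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0]]
grads = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0]]
x_new = rsgd_step(x, grads, ["S", "H", "E"], lr=0.1)
print(x_new[0])  # stays on the unit sphere
print(x_new[1])  # stays on the hyperboloid (<x,x>_M = -1)
```

The projection-then-Exp structure is what keeps every factor on its manifold after each update, so no retraction or renormalization pass is needed.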
…index components in the …

…neither hyperbolic nor spherical space is suitable for G, but … distortion. Note the decomposition into tree and cycle.

…depends on the hyperbolic distance d_H (for which the gradient …), which is continuously differentiable (Sala et al., 2018). The loss function can be optimized through standard Riemannian optimization techniques such as RSGD (Bonnabel, 2013) and RSVRG (Zhang et al., 2016). We write down the procedure for product spaces in Algorithm 1. This proceeds by first computing the gradient with respect to the ambient space of the embedding (Step 4), and then obtaining the Riemannian gradient by applying the Riemannian correction (multiply by J; Step 9) …

Table 1: Matching geometries: average distortion on canonical graphs (cycle, tree, ring of trees) with 40 nodes, comparing four spaces with total dimension 3. The best distortion is achieved by the space with matching geometry.

                Cycle             Tree              Ring of Trees
                |V|=40, |E|=40    |V|=40, |E|=39    |V|=40, |E|=40
(E^3)^1         0.1064            0.1483            0.0997
(H^3)^1         0.1638            0.0321            0.0774
(S^3)^1         0.0007            0.1605            0.1106
(H^2)^1 × (S^1)^1   0.1108        0.0538            0.0616

…doubling the number of factors. These models include the products consisting of only a constant-curvature base space, ranging to various combinations of S^2 and H^2 comprising factors of dimension 2. For a given signature, the curvatures are initialized to the appropriate value in {-1, 0, 1} and then learned using the technique in Section 3.1. We additionally compare to the output of Algorithms 2 and 3 for heuristically selecting a combination of spaces in which to embed these datasets.

Quality. We focus on the average distortion—which our loss function (2) optimizes—as our main metric for reconstruction, and additionally report the mAP metric for the unweighted graphs. As expected, for the synthetic graphs (tree, cycle, ring of trees), the matching geometries (hyperbolic, spherical, product of hyperbolic and spherical) yield the best distortion (Table 1). Next, we report in Table 2 the quality of embedding different graphs across a variety of allocations of spaces, with total dimension d = 10, following previous work (Nickel & Kiela, 2018).
We confirm that the structure of each graph informs the best allocation of spaces. In particular, the cities graph—which has intrinsic structure close to S^2—embeds well into any space with a spherical component, and the tree-like Ph.D.s graph embeds well into hyperbolic products. We emphasize that even for such data…

- Evaluation metric: distortion.
- The results are as expected.

…in Spearman rank correlation on a word similarity task using the WS-353 corpus. Our results and initial exploration suggest that mixed product spaces are a promising area for future study.

2 PRELIMINARIES & BACKGROUND

Embeddings. For metric spaces U, V equipped with distances d_U, d_V, an embedding is a mapping f : U → V. The quality of an embedding is measured by various fidelity measures. A standard measure is average distortion D_avg. The distortion of a pair of points a, b is |d_V(f(a), f(b)) - d_U(a, b)| / d_U(a, b), and D_avg is the average over all pairs of points. Distortion is a global metric; it considers the explicit value of all distances. At the other end of the global-local spectrum of fidelity measures is mean average precision (mAP), which applies to unweighted graphs. Let G = (V, E) be a graph and node a ∈ V have neighborhood N_a = {b_1, …, b_deg(a)}, where deg(a) is the degree of a. In the embedding f, define R_{a,b_i} to be the smallest ball around f(a) that contains b_i (that is, R_{a,b_i} is the smallest set of nearest points required to retrieve the i-th neighbor of a in f). Then mAP(f) = (1/|V|) Σ_{a∈V} (1/deg(a)) Σ_{i=1}^{|N_a|} |N_a ∩ R_{a,b_i}| / |R_{a,b_i}|.
Table 2: Embeddings using d = 10 total dimensions, with varying allocations of spaces and dimensions. Our loss function (2) targets distortion, and for each dataset the best model reflects the structure of the data. Even on near-perfectly spherical or hierarchical data, products of S (resp. H) perform no worse than the single copy.

                        Cities     CS PhDs            Power              Facebook
                        |V|=312    |V|=1025, |E|=1043 |V|=4941, |E|=6594 |V|=4039, |E|=88234
                        Davg       Davg     mAP       Davg     mAP       Davg     mAP
E^10                    0.0735     0.0543   0.8691    0.0917   0.8860    0.0653   0.5801
H^10                    0.0932     0.0502   0.9310    0.0388   0.8442    0.0596   0.7824
S^10                    0.0598     0.0569   0.8329    0.0500   0.7952    0.0661   0.5562
(H^5)^2                 0.0756     0.0382   0.9628    0.0365   0.8605    0.0430   0.7742
(S^5)^2                 0.0593     0.0579   0.7940    0.0471   0.8059    0.0658   0.5728
H^5 × S^5               0.0622     0.0509   0.9141    0.0323   0.8850    0.0402   0.7414
(H^2)^5                 0.0687     0.0357   0.9694    0.0396   0.8739    0.0525   0.7519
(S^2)^5                 0.0638     0.0570   0.8334    0.0483   0.8818    0.0631   0.5808
(H^2)^2 × E^2 × (S^2)^2 0.0765     0.0391   0.8672    0.0380   0.8152    0.0474   0.5951
Best model              S^5_1.0 × S^5_1.1   H^2_.3 × H^2_.6 × H^2_1.5 × (H^2_1.2)^2   H^5_3.4 × S^5_12.6   H^5_0.3 × S^5_3.5
Davg improvement
over single space       0.8%       28.89%             16.75%             32.55%

Table 3: Heuristic allocation: estimated signatures for embedding the unweighted graphs from Table 2 into two factors, using Algorithms 2 and 3 to match the empirical distribution of graph curvature. The resulting curvature signs agree with the results from Table 2 for choosing among two-component spaces.

                      CS PhDs                 Power                   Facebook
Estimated signature   H^5_1.3 × H^5_0.2       H^5_1.8 × S^5_1.7       H^5_0.9 × S^5_1.6

- Evaluation metrics: distortion and mAP.
- Not only the component spaces but also their curvatures are optimized.
- By distortion, the mixed product space does well on every dataset, but by mAP this is not the case.
Note that mAP does not track explicit distances; it is a ranking-based measure for local neighborhoods. Observe that mAP(f) ≤ 1 (higher is better) while D_avg ≥ 0 (lower is better).

Riemannian Manifolds. We briefly review some notions from manifolds and Riemannian geometry…
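Both fidelity measures can be computed directly from pairwise distances. A minimal sketch (the data layout, distance dictionaries and per-node distance rankings, is an assumption for illustration):

```python
def avg_distortion(d_emb, d_true, pairs):
    """Average distortion D_avg over the given pairs (lower is better)."""
    return sum(abs(d_emb[p] - d_true[p]) / d_true[p] for p in pairs) / len(pairs)

def mean_average_precision(neighbors, ranked):
    """mAP for an embedding of an unweighted graph (higher is better).

    neighbors[a] : set of graph neighbors of node a
    ranked[a]    : all other nodes sorted by embedding distance from a
    The ball R_{a,b_i} is the prefix of ranked[a] up to and including b_i.
    """
    total = 0.0
    for a, nbrs in neighbors.items():
        score = 0.0
        for b in nbrs:
            k = ranked[a].index(b) + 1      # size of the smallest ball containing b
            ball = set(ranked[a][:k])
            score += len(nbrs & ball) / k
        total += score / len(nbrs)
    return total / len(neighbors)

# Path graph 0-1-2 with a distance-preserving embedding: distortion 0, mAP 1.
pairs = [(0, 1), (0, 2), (1, 2)]
d_true = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 1.0}
neighbors = {0: {1}, 1: {0, 2}, 2: {1}}
ranked = {0: [1, 2], 1: [0, 2], 2: [1, 0]}
print(avg_distortion(d_true, d_true, pairs))     # 0.0
print(mean_average_precision(neighbors, ranked)) # 1.0
```

This makes the global/local contrast concrete: distortion penalizes every misplaced distance, while mAP only cares whether true neighbors rank ahead of non-neighbors.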
Large-Margin Classification in Hyperbolic Space

Hyunghoon Cho
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
hhcho@mit.edu

Benjamin DeMeo
Department of Biomedical Informatics, Harvard University, Cambridge, MA 02138
bdemeo@g.harvard.edu

Jian Peng
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801
jianpeng@illinois.edu

et al.
- The paper states that this was solved by gradient descent; there was no discussion of the dual formulation.

    minimize_{w ∈ R^n}  (1/2) ||w||^2                                  (10)
    subject to  y^(j) (w^T x^(j)) ≥ 1,  ∀ j ∈ [m]                      (11)

The learning algorithm that solves this problem (via its dual) is known as support vector machines (SVM). Introducing a relaxation for the separability constraints gives a more commonly used soft-margin variant of SVM:

    minimize_{w ∈ R^n}  (1/2) ||w||^2 + C Σ_{j=1}^{m} max(0, 1 - y^(j)(w^T x^(j)))    (12)

…where the first coordinate (corresponding to the time axis in Minkowski spacetime) is neglected. Unlike Euclidean SVM, however, our optimization problem has a non-convex objective as well as a non-convex constraint. Yet, if we restrict our attention to non-trivial, finite-sized problems, where it is necessary and sufficient to consider only the set of w for which at least one data point lies on either side of the decision boundary, then the negative-norm constraint can be replaced with a convex alternative that intuitively maps out the convex hull of the given data points in the ambient Euclidean space of L^n. Finally, the soft-margin formulation of hyperbolic SVM can be derived by relaxing the separability constraints as in the Euclidean case. Instead of imposing a linear penalty on misclassification errors, which in the Euclidean case has an intuitive interpretation as being proportional to the minimum Euclidean distance to the correct classification, we impose a penalty proportional to the hyperbolic distance to the correct classification. Analogous to the Euclidean case, we fix the scale of the penalty so that the margin of the closest correctly classified point to the decision boundary is set to sinh^{-1}(1). This leads to the optimization problem

    minimize_{w ∈ R^{n+1}}  -(1/2) w ∗ w + C Σ_{j=1}^{m} max(0, sinh^{-1}(1) - sinh^{-1}(y^(j)(w ∗ x^(j)))),    (19)
    subject to  w ∗ w < 0.                                             (20)

In all our experiments in the following section, we consider the simplest approach of solving the above formulation of hyperbolic SVM via projected gradient descent.
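The objective (19) can be evaluated directly. A minimal sketch; the sign convention for the Minkowski product ∗ is chosen here so that the feasibility constraint (20) reads w ∗ w < 0, and the paper's own convention may differ:

```python
import math

def mink(u, v):
    """Minkowski product w * x with signature (+, -, ..., -), so that the
    feasibility constraint of Eq. (20) reads mink(w, w) < 0."""
    return u[0] * v[0] - sum(a * b for a, b in zip(u[1:], v[1:]))

def hsvm_objective(w, X, y, C=1.0):
    """Soft-margin hyperbolic SVM objective, Eq. (19); valid for mink(w, w) < 0."""
    m0 = math.asinh(1.0)                         # target margin sinh^{-1}(1)
    hinge = sum(max(0.0, m0 - math.asinh(yj * mink(w, xj)))
                for xj, yj in zip(X, y))
    return -0.5 * mink(w, w) + C * hinge

def boundary_distance(w, x):
    """Eq. (15): signed hyperbolic distance from x to the decision boundary of w."""
    return math.asinh(mink(w, x) / math.sqrt(-mink(w, w)))

# w = (0, 1, 0) is feasible (w * w = -1); x = (sqrt(2), 1, 0) lies on the
# hyperboloid (x * x = 1) and sits exactly at the target margin for y = -1.
w = [0.0, 1.0, 0.0]
x = [math.sqrt(2.0), 1.0, 0.0]
print(hsvm_objective(w, [x], [-1]))   # 0.5: regularizer only, zero hinge loss
```

A projected gradient step would follow each descent update by rescaling w back into the feasible set w ∗ w < 0 when the constraint is violated.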
The initial w is determined based on the solution w0 of a soft-margin SVM in the ambient Euclidean space of the hyperboloid model, so that w ∗ x = (w0)^T x for all x. This provides a good initialization for the optimization and has the additional benefit of improving the stability of the algorithm in the presence of potentially many local optima.

5 Experimental Results

    sinh^{-1}( (w ∗ x) / √(-w ∗ w) ).    (15)

…As a result, we reduce the problem of calculating the minimum hyperbolic distance to the decision boundary to a Euclidean geometry problem, by mapping the decision boundary and the point to the Poincaré half-space model, in which the decision boundary is characterized as a hemisphere. A full proof of Theorem 1 is provided in the Supplementary Information. Using this result, one can apply a sequence of transformations to the max-margin classification problem for the hyperbolic setting to obtain the following result.
Figure 2: Multi-class classification of Gaussian mixtures in hyperbolic space. (a) Two-fold cross-validation results (macro-AUPR, Euclidean SVM vs. hyperbolic SVM) for 100 simulated Gaussian mixture datasets with 4 randomly positioned components and 100 points sampled from each component. Each dot represents the average performance over 5 trials; vertical and horizontal lines represent standard deviations. Example decision hyperplanes for hyperbolic and Euclidean SVMs are shown in (b) and (c), respectively, using the Poincaré disk model. The color of each decision boundary denotes which component is being discriminated from the rest.
Classifier, embedding (dim)       karate        polbooks      football      polblogs
Hyperbolic SVM, hyperbolic (2)    0.86 ± 0.03   0.73 ± 0.04   0.24 ± 0.03   0.93 ± 0.01
Euclidean SVM, hyperbolic (2)     0.86 ± 0.03   0.66 ± 0.02   0.21 ± 0.01   0.93 ± 0.01
Euclidean SVM, Euclidean (2)      0.47 ± 0.07   0.34 ± 0.03   0.09 ± 0.01   0.60 ± 0.09
Euclidean SVM, Euclidean (5)      0.55 ± 0.08   0.35 ± 0.03   0.10 ± 0.01   0.69 ± 0.04
Euclidean SVM, Euclidean (10)     0.50 ± 0.08   0.36 ± 0.03   0.10 ± 0.01   0.72 ± 0.04
Euclidean SVM, Euclidean (25)     0.50 ± 0.09   0.37 ± 0.04   0.11 ± 0.02   0.80 ± 0.03

Table 1: Node classification performance on four real-world network datasets. We performed two-fold cross-validation experiments on the four real-world network datasets described in the main text. For all four datasets, hyperbolic SVM matched or outperformed Euclidean SVM (the datasets where the performance of the two methods was comparable, karate and polblogs, contained only two well-separated classes). Methods are evaluated by macro-averaged area under the precision-recall curve. Mean performance over 5 cross-validation trials over 5 different embeddings for each dataset is shown, each followed by the standard deviation. The best performance on each dataset is shown in boldface.

…hyperbolic spaces, our formulation naturally extends to higher-dimensional hyperbolic spaces, which may be of interest in future applications. More broadly, our work belongs to a growing body of literature that aims to develop learning algorithms that directly operate over a Riemannian manifold [17, 18]. Linear hyperplane-based…
- Below I pasted recent papers from high-impact-factor venues.
- A common pattern: representation learning followed by supervised learning, used to demonstrate an advantage.

ARTICLES  https://doi.org/10.1038/s41592-019-0616-3

Learning representations of microbe–metabolite interactions
James T. Morton, Alexander A. Aksenov, Louis Felix Nothias, James R. Foulds, Robert A. Quinn, Michelle H. Badri, Tami L. Swenson, Marc W. Van Goethem, Trent R. Northen, Yoshiki Vazquez-Baeza, Mingxun Wang, Nicholas A. Bokulich, Aaron Watters, Se Jin Song, Richard Bonneau, Pieter C. Dorrestein and Rob Knight

Integrating multiomics datasets is critical for microbiome research; however, inferring interactions across omics datasets has multiple statistical challenges.

Knowledge gained by integrating complementary omics data will lead to improved detection of microbial products and optimized culturing conditions for uncharacterized microorganisms [1]. Previous work has been able to predict metabolite abundance profiles from microbe abundance profiles [2, 3]. However, because conventional correlation techniques have unacceptably high false-discovery rates, finding meaningful relationships between genes within complex microbiomes and their products in the metabolome is challenging. Although there has been a widespread effort to develop mul…, it remains unclear whether these models can obtain individual microbe–metabolite interactions. Pearson's and Spearman's correlations assume independence between interactions, simplifying the estimation procedure by reducing it to a combination of independent two-dimensional problems. However, many studies have shown that the simplifications in these methods are not statistically valid for compositional data, a fact first recognized by Pearson in 1895 and followed up in numerous studies [13-17]. This problem is further complicated because both microbiome [17] and mass spectrometry [18-21] datasets…
We solve this problem by using neural networks (https://github.com/biocore/mmvec) to estimate the conditional probability that each molecule is present given the presence of a specific microorganism. We show, with known environmental (desert soil biocrust wetting) and clinical (cystic fibrosis lung) examples, our ability to recover microbe–metabolite relationships, and demonstrate how the method can discover relationships between microbially produced metabolites and inflammatory bowel disease.

(Nature Methods)
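The core estimator described here can be sketched as a pair of embedding tables and a softmax over metabolites. Everything below (class name, dimensions, initialization) is illustrative rather than the actual mmvec implementation:

```python
import math
import random

def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

class CoOccurrenceModel:
    """Minimal mmvec-style sketch: p(metabolite j | microbe i) comes from the
    inner product of a microbe embedding and a metabolite embedding, pushed
    through a softmax. Training (not shown) would fit U and V to observed
    microbe-metabolite co-occurrence counts."""

    def __init__(self, n_microbes, n_metabolites, dim=3, seed=0):
        rng = random.Random(seed)
        self.U = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_microbes)]
        self.V = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_metabolites)]

    def cond_prob(self, i):
        """Conditional distribution over all metabolites given microbe i."""
        scores = [sum(u * v for u, v in zip(self.U[i], vj)) for vj in self.V]
        return softmax(scores)

model = CoOccurrenceModel(n_microbes=3, n_metabolites=5)
print(model.cond_prob(0))  # a 5-way probability distribution summing to 1
```

The softmax output is what makes the estimates interpretable as conditional probabilities rather than raw correlations, which is the paper's answer to the compositionality problem.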
…ontology-based annotations

Fatima Zohra Smaili, Xin Gao* and Robert Hoehndorf*
Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
*To whom correspondence should be addressed.

Bioinformatics, 34, 2018, i52-i60, doi: 10.1093/bioinformatics/bty259 (ISMB 2018)

Abstract
Motivation: Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations make them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications.
Results: We propose the Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies. Our method can be applied to a wide range of bioinformatics research problems such as similarity-based prediction of interactions between proteins, classification of interaction types using supervised learning, or clustering. To evaluate Onto2Vec, we use the gene ontology (GO) and jointly produce dense vector representations of proteins, the GO classes to which they are annotated, and the axioms in GO that constrain these classes. First, we demonstrate that Onto2Vec-generated feature vectors can significantly im…

…classes used in the annotation in a single representation. Trivially, since Onto2Vec can generate representations of single classes in an ontology, an entity annotated with n classes, C1, …
…, Cn, can be represented as a (linear) combination of the vector representations of these classes. For example, if an entity e is annotated with C1 and …

…semantic similarity. As a first experiment, we evaluated the accuracy of Onto2Vec in predicting protein–protein interactions. For this purpose, we generated several representations of proteins: first, we used Onto2Vec to learn representations of proteins jointly with representations of GO classes by adding proteins and their annotations to …

Fig. 1. Onto2Vec workflow. The blue-shaded part illustrates the steps to obtain vector representations for classes from the ontology. The purple-shaded part shows the steps to obtain vector representations of ontology classes and the entities annotated to these classes.

…similarity). Table 3 summarizes the results. While Resnik semantic similarity and Onto2Vec similarity cannot distinguish between different types of interaction, we find that the supervised models, in particular the multiclass SVM and ANN, are capable, when using Onto2Vec vector representations, of distinguishing between different …

Fig. 3. ROC curves for PPI prediction for the supervised learning methods, in addition …

Table 2. Spearman correlation coefficients between STRING confidence scores and PPI prediction scores of different prediction methods

                  Yeast     Human
Resnik            0.1107    0.1151
Onto2Vec          0.1067    0.1099
Binary GO         0.1021    0.1031
Onto2Vec LR       0.1424    0.1453
Onto2Vec SVM      0.2245    0.2621
Onto2Vec NN       0.2516    0.2951
Binary GO LR      0.1121    0.1208
Binary GO SVM     0.1363    0.1592
Binary GO NN      0.1243    0.1616

Note: The highest absolute correlation across all methods is highlighted in bold.

- Spearman correlation coefficients for PPI prediction.
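The "linear combination" of class vectors mentioned above can be as simple as averaging. A sketch with made-up two-dimensional class vectors (real Onto2Vec vectors come from word2vec trained over ontology axioms; the averaging choice and the GO identifiers here are illustrative):

```python
def entity_vector(class_vectors, annotations):
    """Represent an entity as the mean of its annotated class vectors
    (one simple choice of linear combination; an assumption for this sketch)."""
    vecs = [class_vectors[c] for c in annotations]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# Hypothetical class vectors: entities sharing annotations come out similar.
go = {"GO:1": [1.0, 0.0], "GO:2": [0.8, 0.6], "GO:3": [0.0, 1.0]}
p1 = entity_vector(go, ["GO:1", "GO:2"])
p2 = entity_vector(go, ["GO:2"])
p3 = entity_vector(go, ["GO:3"])
print(cosine(p1, p2) > cosine(p1, p3))  # True: shared annotation, higher similarity
```

This is exactly the property the PPI experiments rely on: proteins with overlapping GO annotations end up close in the embedding space.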
Gene2vec: distributed representation of genes based on co-expression

Jingcheng Du†, Peilin Jia†, Yulin Dai, Cui Tao, Zhongming Zhao‡ and Degui Zhi*‡
From The International Conference on Intelligent Biology and Medicine (ICIBM) 2018, Los Angeles, CA, USA, 10-12 June 2018

Abstract
Background: Existing functional descriptions of genes are categorical, discrete, and mostly produced through a manual process. In this work, we explore the idea of gene embedding, a distributed representation of genes, in the spirit of word embedding.
Results: In a purely data-driven fashion, we trained a 200-dimension vector representation of all human genes, using gene co-expression patterns in 984 datasets from the GEO database. These vectors capture functional relatedness of genes in terms of recovering known pathways: the average inner product (similarity) of genes within a pathway is 1.52X greater than that of random genes. Using t-SNE, we produced a gene co-expression map that shows local concentrations of tissue-specific genes. We also illustrated the usefulness of the embedded gene vectors, laden with rich information on gene co-expression patterns, in tasks such as gene-gene interaction prediction.
Conclusions: We proposed a machine learning method that utilizes transcriptome-wide gene co-expression to generate a distributed representation of genes. We further demonstrated the utility of our distribution by predicting gene-gene interaction based solely on gene names. The distributed representation of genes could be useful for more bioinformatics applications.
Keywords: Distributed representation, Gene2Vec, Gene co-expression, Embedding, Word2vec, Gene-gene interaction

Background
Genes, discrete segments of the genome that are transcribed, are basic building blocks of molecular biological systems. Although almost all transcripts in the human genome have been identified, functional annotation of … The challenge of creating a quantitative semantic representation of discrete units of a complex system is not unique to gene systems.
For a long time, creating a quantitative representation of words had been challenging for linguistic modeling. Hinton proposed the pio…

Du et al. BMC Genomics 2019, 20(Suppl 1):82, https://doi.org/10.1186/s12864-018-5370-x

…based on which we explored the distribution of all human genes from our results (Fig. 3). A direct visualization of the gene distribution revealed that the majority of genes formed one single cloud, while several isolated groups of genes were scattered around. We extracted these gene islands and found they were mainly non-protein-coding genes. Island 2 was significantly populated with snoRNA genes (pink dots, p = 1.07 × 10^-72, Fisher's exact test). Island 4, located at the very right of the plot, mainly contains human cDNA/PAC clone genes. microRNA genes (cyan dots) were mainly distributed in island 2 (p = 3.99 × 10^-19), island 4 (p = 3.51 × 10^-73), and island 5 (p = 2.64 × 10^-41). A group of ncRNAs which start with "LOC" and are often uncharacterized split the whole distribution into a left panel and a right panel (red dots, Fig. 3). In the left panel, we observed a cluster of open reading frames (yellow dots, Fig. 3) in the human genome.

Tissue-specific genes form spatial patterns in the gene embedding
We mapped genes with z-scores representing their tissue-specific expression onto the gene co-expression map. We observed clear clusters in several tissues such as blood, skin, spleen, and lung (Fig. 4 and Additional file 1). Genes with high tissue specificity in blood highlighted two distant clusters. This is likely because blood samples are relatively more widely used in gene expression studies, so blood-specific genes and their relationships are better represented in our map. Tissues that are biologically relevant showed similar patterns. For example, tissues of the female reproductive system presented graded and similar patterns, including breast, ovary, and uterus.
In these tissues, genes located in the bottom part of the map in general showed increased tissue specificity compared to genes located in the top part of the map (Fig. 4 and Additional file 1).

        1       2       3       4       5       6       7       8       9       10
50      1.428   1.444   1.467   1.470   1.487   1.465   1.473   1.479   1.475   1.462
100     1.415   1.467   1.488   1.491   1.498   1.501   1.519   1.486   1.480   1.490
200     1.403   1.463   1.491   1.498   1.495   1.482   1.470   1.488   1.521   1.509
300     1.392   1.443   1.472   1.473   1.473   1.509   1.474   1.513   1.479   1.480

Bold numbers denote the largest number in each row.

Fig. 3. Gene co-expression map generated from the embedding reveals clusters of functionally related genes. F1 and F2 are the first and second dimensions of t-SNE. Red: LOC non-coding genes; cyan: microRNA; pink: small nucleolar RNA (snoRNA); yellow: uncharacterized ORFs.

(BMC Genomics)
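The 1.52X within-pathway similarity ratio quoted in the abstract is a simple statistic to reproduce. A sketch on hypothetical 2-d gene vectors (the paper compares against random gene pairs; here the set of all pairs serves as a deterministic background, and the gene names are made up):

```python
import itertools

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def avg_inner_product(vectors, pairs):
    return sum(inner(vectors[u], vectors[v]) for u, v in pairs) / len(pairs)

def pathway_ratio(vectors, pathway_genes):
    """Mean within-pathway inner product divided by the mean over all gene
    pairs (a deterministic stand-in for the paper's random-pair baseline)."""
    within = list(itertools.combinations(pathway_genes, 2))
    background = list(itertools.combinations(vectors, 2))
    return avg_inner_product(vectors, within) / avg_inner_product(vectors, background)

# Hypothetical gene vectors: g1-g3 share a pathway and point the same way.
genes = {
    "g1": [1.0, 0.0], "g2": [0.9, 0.1], "g3": [1.0, 0.1],
    "g4": [-1.0, 0.0], "g5": [0.0, -1.0], "g6": [0.5, -0.8],
}
print(pathway_ratio(genes, ["g1", "g2", "g3"]) > 1.0)  # True: co-pathway genes more similar
```

A ratio well above 1, as in the paper's table above, indicates that the embedding geometry has absorbed pathway membership from co-expression alone.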