MIXED-CURVATURE REPRESENTATIONS IN PRODUCTS OF MODEL SPACES

Albert Gu, Frederic Sala, Beliz Gunel & Christopher Ré
Computer Science Department, Stanford University, Stanford, CA 94305
{albertgu,fredsala,bgunel}@stanford.edu, chrismre@cs.stanford.edu

ABSTRACT

The quality of the representations achieved by embeddings is determined by how well the geometry of the embedding space matches the structure of the data. Euclidean space has been the workhorse for embeddings; recently hyperbolic and spherical spaces have gained popularity due to their ability to better embed new types of structured data—such as hierarchical data—but most data is not structured so uniformly. We address this problem by proposing learning embeddings in a product manifold combining multiple copies of these model spaces (spherical, hyperbolic, Euclidean), providing a space of heterogeneous curvature suitable for a wide variety of structures. We introduce a heuristic to estimate the sectional cur…

- Hyperbolic (negative-curvature) embeddings are not always the best choice.
- Is a spherical, Euclidean, or mixed-space embedding optimal instead?
- The authors built an embedding algorithm for such mixed (product) spaces.
Figure 1: Three component spaces: sphere S^2, Euclidean plane E^2, and hyperboloid H^2. Thick lines are geodesics; these get closer in positively curved (K = +1) space S^2, remain equidistant in flat (K = 0) space E^2, and get farther apart in negatively curved (K = -1) space H^2.

We propose embedding into product spaces in which each component has constant curvature. As we show, this allows us to capture a wider range of curvatures than traditional embeddings, while retaining the ability to globally optimize and operate on the resulting embeddings. Specifically, we form a Riemannian product manifold combining hyperbolic, spherical, and Euclidean components and equip it with a decomposable Riemannian metric. While each component space in the product has constant curvature (positive for spherical, negative for hyperbolic, and zero for Euclidean), the …

Published as a conference paper at ICLR 2019

Figure 3: Geodesic triangles in differently curved spaces: compared to Euclidean geometry, in which the median am satisfies the parallelogram law (Center), am is longer in cycle-like positively curved space (Left) and shorter in tree-like negatively curved space (Right). The relative length of am can be used as a heuristic to estimate discrete curvature.

3.2 ESTIMATING THE SIGNATURE

- Cycle-like graphs → sphere, grid-like graphs → Euclidean plane, tree-like graphs → hyperbolic plane: each geometry appears suited to the corresponding structure.
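The median comparison in Figure 3 can be made quantitative. A minimal sketch (the exact normalization in the paper's heuristic may differ; the function below measures the parallelogram-law defect of the median, so flat geometry gives 0, cycle-like geometry positive values, and tree-like geometry negative values):

```python
def triangle_curvature(d, a, b, c, m):
    """Deviation of the median length d(a, m) from its flat-space value.

    By the parallelogram law, in Euclidean space the midpoint m of (b, c)
    satisfies d(a, m)^2 = (d(a, b)^2 + d(a, c)^2) / 2 - d(b, c)^2 / 4.
    A positive return value suggests spherical (cycle-like) geometry,
    a negative one hyperbolic (tree-like) geometry.
    """
    return (d(a, m) ** 2 + d(b, c) ** 2 / 4
            - (d(a, b) ** 2 + d(a, c) ** 2) / 2) / (2 * d(a, m))

# Toy graph metrics: a path graph behaves like flat space, a cycle like
# positively curved space.
d_path = lambda u, v: abs(u - v)                      # path 0-1-2-3-4
d_cycle = lambda u, v: min(abs(u - v), 6 - abs(u - v))  # 6-cycle

print(triangle_curvature(d_path, 0, 2, 4, 3))   # 0.0: flat
print(triangle_curvature(d_cycle, 0, 2, 4, 3))  # 1.0: positively curved
```

Here node m = 3 is the graph midpoint of (b, c) = (2, 4) in both graphs; only the triangle's apex a = 0 sees different geometry.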
in the Appendix.

3.1 OPTIMIZATION & COMPONENT CURVATURES

To compute embeddings, we optimize the placement of points through an auxiliary loss function. Given graph distances {d_G(X_i, X_j)}_ij, our loss function of choice is

    L(x) = Σ_{1≤i<j≤n} | (d_P(x_i, x_j) / d_G(X_i, X_j))^2 - 1 |,    (2)

We write E for our Euclidean embedding space component to distinguish it from R, since our models of spherical geometry also use R as an ambient space. We refer to each S^{s_i}, H^{h_i}, E^e as components or factors. We refer to the decomposition, e.g., (H^2)^2 = H^2 × H^2, as the signature. For convenience, let M_1, …, M_{m+n+1} refer to the factors in the product.

Distances on P. As discussed in Section 2, the product P is a Riemannian manifold defined by the structure of its components. For p, q ∈ P, we write d_{M_i}(p, q) for the distance d_{M_i} restricted to the appropriate components of p and q in the product. In particular, the squared distance in the product decomposes via (1). In other words, d_P is simply the ℓ2 norm of the component distances d_{M_i}. We note that P can also be equipped with different distances (ignoring the Riemannian structure), leading to a different embedding space. Without the underlying manifold structure, we cannot freely operate on the embedded points, such as taking geodesics and means, but some simple applications only interact through distances. For such settings, we consider the ℓ1 distance

    d_{P,ℓ1}(p, q) = Σ_{i=1}^{m} d_{S_i}(p, q) + Σ_{i=1}^{n} d_{H_i}(p, q) + d_E(p, q)

and the min distance

    d_{P,min}(p, q) = min { d_{S_1}(p, q), …, d_{H_1}(p, q), …, d_E(p, q) }.

These distances provide simple and interpretable embedding spaces using P, enabling us to introduce combinatorial constructions that allow for embeddings without the need for optimization. We give an example below and discuss further in the Appendix.
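The three combination rules (ℓ2 as in Eq. (1), ℓ1, and min) are easy to state in code. A minimal sketch with points given in ambient coordinates (unit vectors for the sphere, hyperboloid vectors with Minkowski norm -1 for hyperbolic space); the factor layout is illustrative:

```python
import math

def sphere_dist(x, y):
    """Geodesic distance on the unit sphere (points as ambient unit vectors)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, y))))
    return math.acos(dot)

def hyperboloid_dist(x, y):
    """Geodesic distance on the hyperboloid model, <u,v>_M = -u0*v0 + <u',v'>."""
    mink = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    return math.acosh(max(1.0, -mink))

def euclidean_dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def product_dist(p, q, dists):
    """l2 combination of per-factor distances, as in Eq. (1)."""
    return math.sqrt(sum(d(pi, qi) ** 2 for d, pi, qi in zip(dists, p, q)))

def l1_dist(p, q, dists):
    return sum(d(pi, qi) for d, pi, qi in zip(dists, p, q))

def min_dist(p, q, dists):
    return min(d(pi, qi) for d, pi, qi in zip(dists, p, q))

def loss(xs, graph_dist, dists):
    """Eq. (2): sum over pairs of |(d_P / d_G)^2 - 1|."""
    n = len(xs)
    return sum(abs((product_dist(xs[i], xs[j], dists) / graph_dist[i][j]) ** 2 - 1)
               for i in range(n) for j in range(i + 1, n))

# A point of S^2 x H^2 x E^2 is a tuple of factor coordinates.
dists = [sphere_dist, hyperboloid_dist, euclidean_dist]
p = ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0])
q = ([0.0, 1.0, 0.0], [math.sqrt(2.0), 1.0, 0.0], [3.0, 4.0])
print(product_dist(p, q, dists), l1_dist(p, q, dists), min_dist(p, q, dists))
```

Note how the min distance picks out whichever factor happens to be closest, which is what makes the combinatorial constructions in the Appendix possible.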
We then focus on the Riemannian distance, which allows Riemannian optimization directly on the manifold and enables full use of the manifold structure in generic downstream applications.

Example. Consider the graph G shown on the right of Figure 2. This graph has a backbone cycle with 9 nodes, each attached to a tree; such topologies are common in networking. If a single edge (a, b) is removed from the cycle, the result is a tree embeddable arbitrarily well into hyperbolic space (Sala et al., 2018). However, a and b (and their subtrees) would then incur an additional distance of 8 - 1 = 7, being forced to go the other way around the cycle. But using the ℓ1 distance, we can embed G_tree into H^2 and G_cycle into S^1, yielding arbitrarily low distortion for G. We give the full details and another combinatorial construction for the min distance in the Appendix.

Algorithm 1 RSGD in products
1: Input: loss function L : P → R
2: Initialize x(0) ∈ P randomly
3: for t = 0, …, T - 1 do
4:   h ← -∇L(x(t))              // gradient in the ambient space
5:   for i = 1, …, m do          // spherical factors
6:     v_i ← proj^S_{x_i(t)}(h_i)
7:   for i = m + 1, …, m + n do  // hyperbolic factors
8:     v_i ← proj^H_{x_i(t)}(h_i)
9:     v_i ← J v_i
10:  v_{m+n+1} ← h_{m+n+1}       // Euclidean factor
11:  for i = 1, …, m + n + 1 do
12:    x_i(t+1) ← Exp_{x_i(t)}(v_i)
13: return x(T)

(Figure 2, right: the graph G and its decomposition into G_tree and G_cycle.)

- d_P is the distance in the embedding space; d_G is the distance on the graph.
- Each component space is embedded independently, which keeps the construction simple.
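Algorithm 1 can be sketched concretely. A minimal sketch with plain-Python vectors; the hyperboloid convention here is <u,v>_M = -u0*v0 + Σ_i u_i*v_i with <x,x>_M = -1 for points, and the relative order of the correction J and the tangent projection follows the standard hyperboloid RSGD recipe, which may differ cosmetically from the paper's listing:

```python
import math

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def scale(u, c): return [c * a for a in u]
def add(u, v): return [a + b for a, b in zip(u, v)]
def norm(u): return math.sqrt(dot(u, u))

def mink(u, v):
    """Minkowski inner product on the ambient space of the hyperboloid."""
    return -u[0] * v[0] + dot(u[1:], v[1:])

def proj_sphere(x, h):
    """Project an ambient vector h onto the tangent space of the sphere at x."""
    return add(h, scale(x, -dot(x, h)))

def exp_sphere(x, v):
    """Exponential map on the unit sphere (v tangent at x)."""
    t = norm(v)
    if t < 1e-12: return x
    return add(scale(x, math.cos(t)), scale(v, math.sin(t) / t))

def proj_hyperboloid(x, h):
    """Riemannian correction J (flip the time coordinate), then tangent projection."""
    jh = [-h[0]] + h[1:]
    return add(jh, scale(x, mink(x, jh)))

def exp_hyperboloid(x, v):
    t = math.sqrt(max(mink(v, v), 0.0))
    if t < 1e-12: return x
    return add(scale(x, math.cosh(t)), scale(v, math.sinh(t) / t))

def rsgd_step(x, grads, factor_types, lr=0.1):
    """One step of Algorithm 1 over a list of factor points."""
    out = []
    for xi, hi, kind in zip(x, grads, factor_types):
        hi = scale(hi, -lr)                 # descent direction in ambient space
        if kind == "S":
            out.append(exp_sphere(xi, proj_sphere(xi, hi)))
        elif kind == "H":
            out.append(exp_hyperboloid(xi, proj_hyperboloid(xi, hi)))
        else:                               # Euclidean factor: plain gradient step
            out.append(add(xi, hi))
    return out

# One step on a point of S^2 x H^2 x E^2 with a toy ambient gradient.
x = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0]]
grads = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0]]
x_new = rsgd_step(x, grads, ["S", "H", "E"], lr=0.1)
print(x_new[0])  # stays on the unit sphere
print(x_new[1])  # stays on the hyperboloid (<x,x>_M = -1)
```

The projection-then-Exp structure is what keeps every factor on its manifold after each update, so no retraction or renormalization pass is needed.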
…index components in the …

…neither hyperbolic nor spherical space is suitable for G, but … distortion. Note the decomposition into tree and cycle.

…depends on the hyperbolic distance d_H (for which the gradient …), which is continuously differentiable (Sala et al., 2018). The loss function can be optimized through standard Riemannian optimization techniques such as RSGD (Bonnabel, 2013) and RSVRG (Zhang et al., 2016). We write down the procedure for product spaces in Algorithm 1. This proceeds by first computing the gradient with respect to the ambient space of the embedding (Step 4), and then obtaining the Riemannian gradient by applying the Riemannian correction (multiply by J; Step 9) …

Table 1: Matching geometries: average distortion on canonical graphs (cycle, tree, ring of trees) with 40 nodes, comparing four spaces with total dimension 3. The best distortion is achieved by the space with matching geometry.

                Cycle             Tree              Ring of Trees
                |V|=40, |E|=40    |V|=40, |E|=39    |V|=40, |E|=40
(E^3)^1         0.1064            0.1483            0.0997
(H^3)^1         0.1638            0.0321            0.0774
(S^3)^1         0.0007            0.1605            0.1106
(H^2)^1 × (S^1)^1   0.1108        0.0538            0.0616

…doubling the number of factors. These models include the products consisting of only a constant-curvature base space, ranging to various combinations of S^2 and H^2 comprising factors of dimension 2. For a given signature, the curvatures are initialized to the appropriate value in {-1, 0, 1} and then learned using the technique in Section 3.1. We additionally compare to the output of Algorithms 2 and 3 for heuristically selecting a combination of spaces in which to embed these datasets.

Quality. We focus on the average distortion—which our loss function (2) optimizes—as our main metric for reconstruction, and additionally report the mAP metric for the unweighted graphs. As expected, for the synthetic graphs (tree, cycle, ring of trees), the matching geometries (hyperbolic, spherical, product of hyperbolic and spherical) yield the best distortion (Table 1). Next, we report in Table 2 the quality of embedding different graphs across a variety of allocations of spaces, with total dimension d = 10, following previous work (Nickel & Kiela, 2018).
We confirm that the structure of each graph informs the best allocation of spaces. In particular, the cities graph—which has intrinsic structure close to S^2—embeds well into any space with a spherical component, and the tree-like Ph.D.s graph embeds well into hyperbolic products. We emphasize that even for such data…

- Evaluation metric: distortion.
- The results are as expected.

…in Spearman rank correlation on a word similarity task using the WS-353 corpus. Our results and initial exploration suggest that mixed product spaces are a promising area for future study.

2 PRELIMINARIES & BACKGROUND

Embeddings. For metric spaces U, V equipped with distances d_U, d_V, an embedding is a mapping f : U → V. The quality of an embedding is measured by various fidelity measures. A standard measure is average distortion D_avg. The distortion of a pair of points a, b is |d_V(f(a), f(b)) - d_U(a, b)| / d_U(a, b), and D_avg is the average over all pairs of points. Distortion is a global metric; it considers the explicit value of all distances. At the other end of the global-local spectrum of fidelity measures is mean average precision (mAP), which applies to unweighted graphs. Let G = (V, E) be a graph and node a ∈ V have neighborhood N_a = {b_1, …, b_deg(a)}, where deg(a) is the degree of a. In the embedding f, define R_{a,b_i} to be the smallest ball around f(a) that contains b_i (that is, R_{a,b_i} is the smallest set of nearest points required to retrieve the i-th neighbor of a in f). Then mAP(f) = (1/|V|) Σ_{a∈V} (1/deg(a)) Σ_{i=1}^{|N_a|} |N_a ∩ R_{a,b_i}| / |R_{a,b_i}|.
Table 2: Embeddings using d = 10 total dimensions, with varying allocations of spaces and dimensions. Our loss function (2) targets distortion, and for each dataset the best model reflects the structure of the data. Even on near-perfectly spherical or hierarchical data, products of S (resp. H) perform no worse than the single copy.

                        Cities     CS PhDs            Power              Facebook
                        |V|=312    |V|=1025, |E|=1043 |V|=4941, |E|=6594 |V|=4039, |E|=88234
                        Davg       Davg     mAP       Davg     mAP       Davg     mAP
E^10                    0.0735     0.0543   0.8691    0.0917   0.8860    0.0653   0.5801
H^10                    0.0932     0.0502   0.9310    0.0388   0.8442    0.0596   0.7824
S^10                    0.0598     0.0569   0.8329    0.0500   0.7952    0.0661   0.5562
(H^5)^2                 0.0756     0.0382   0.9628    0.0365   0.8605    0.0430   0.7742
(S^5)^2                 0.0593     0.0579   0.7940    0.0471   0.8059    0.0658   0.5728
H^5 × S^5               0.0622     0.0509   0.9141    0.0323   0.8850    0.0402   0.7414
(H^2)^5                 0.0687     0.0357   0.9694    0.0396   0.8739    0.0525   0.7519
(S^2)^5                 0.0638     0.0570   0.8334    0.0483   0.8818    0.0631   0.5808
(H^2)^2 × E^2 × (S^2)^2 0.0765     0.0391   0.8672    0.0380   0.8152    0.0474   0.5951
Best model              S^5_1.0 × S^5_1.1   H^2_.3 × H^2_.6 × H^2_1.5 × (H^2_1.2)^2   H^5_3.4 × S^5_12.6   H^5_0.3 × S^5_3.5
Davg improvement
over single space       0.8%       28.89%             16.75%             32.55%

Table 3: Heuristic allocation: estimated signatures for embedding the unweighted graphs from Table 2 into two factors, using Algorithms 2 and 3 to match the empirical distribution of graph curvature. The resulting curvature signs agree with the results from Table 2 for choosing among two-component spaces.

                      CS PhDs                 Power                   Facebook
Estimated signature   H^5_1.3 × H^5_0.2       H^5_1.8 × S^5_1.7       H^5_0.9 × S^5_1.6

- Evaluation metrics: distortion and mAP.
- Not only the component spaces but also their curvatures are optimized.
- By distortion, the mixed product space does well on every dataset, but by mAP this is not the case.
Note that mAP does not track explicit distances; it is a ranking-based measure for local neighborhoods. Observe that mAP(f) ≤ 1 (higher is better) while D_avg ≥ 0 (lower is better).

Riemannian Manifolds. We briefly review some notions from manifolds and Riemannian geometry…
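Both fidelity measures can be computed directly from pairwise distances. A minimal sketch (the data layout, distance dictionaries and per-node distance rankings, is an assumption for illustration):

```python
def avg_distortion(d_emb, d_true, pairs):
    """Average distortion D_avg over the given pairs (lower is better)."""
    return sum(abs(d_emb[p] - d_true[p]) / d_true[p] for p in pairs) / len(pairs)

def mean_average_precision(neighbors, ranked):
    """mAP for an embedding of an unweighted graph (higher is better).

    neighbors[a] : set of graph neighbors of node a
    ranked[a]    : all other nodes sorted by embedding distance from a
    The ball R_{a,b_i} is the prefix of ranked[a] up to and including b_i.
    """
    total = 0.0
    for a, nbrs in neighbors.items():
        score = 0.0
        for b in nbrs:
            k = ranked[a].index(b) + 1      # size of the smallest ball containing b
            ball = set(ranked[a][:k])
            score += len(nbrs & ball) / k
        total += score / len(nbrs)
    return total / len(neighbors)

# Path graph 0-1-2 with a distance-preserving embedding: distortion 0, mAP 1.
pairs = [(0, 1), (0, 2), (1, 2)]
d_true = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 1.0}
neighbors = {0: {1}, 1: {0, 2}, 2: {1}}
ranked = {0: [1, 2], 1: [0, 2], 2: [1, 0]}
print(avg_distortion(d_true, d_true, pairs))     # 0.0
print(mean_average_precision(neighbors, ranked)) # 1.0
```

This makes the global/local contrast concrete: distortion penalizes every misplaced distance, while mAP only cares whether true neighbors rank ahead of non-neighbors.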
Large-Margin Classification in Hyperbolic Space

Hyunghoon Cho
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
hhcho@mit.edu

Benjamin DeMeo
Department of Biomedical Informatics, Harvard University, Cambridge, MA 02138
bdemeo@g.harvard.edu

Jian Peng
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801
jianpeng@illinois.edu

et al.
- The paper states that this was solved by gradient descent; there was no discussion of the dual formulation.

    minimize_{w ∈ R^n}  (1/2) ||w||^2                                  (10)
    subject to  y^(j) (w^T x^(j)) ≥ 1,  ∀ j ∈ [m]                      (11)

The learning algorithm that solves this problem (via its dual) is known as support vector machines (SVM). Introducing a relaxation for the separability constraints gives a more commonly used soft-margin variant of SVM:

    minimize_{w ∈ R^n}  (1/2) ||w||^2 + C Σ_{j=1}^{m} max(0, 1 - y^(j)(w^T x^(j)))    (12)

…where the first coordinate (corresponding to the time axis in Minkowski spacetime) is neglected. Unlike Euclidean SVM, however, our optimization problem has a non-convex objective as well as a non-convex constraint. Yet, if we restrict our attention to non-trivial, finite-sized problems, where it is necessary and sufficient to consider only the set of w for which at least one data point lies on either side of the decision boundary, then the negative-norm constraint can be replaced with a convex alternative that intuitively maps out the convex hull of the given data points in the ambient Euclidean space of L^n. Finally, the soft-margin formulation of hyperbolic SVM can be derived by relaxing the separability constraints as in the Euclidean case. Instead of imposing a linear penalty on misclassification errors, which in the Euclidean case has an intuitive interpretation as being proportional to the minimum Euclidean distance to the correct classification, we impose a penalty proportional to the hyperbolic distance to the correct classification. Analogous to the Euclidean case, we fix the scale of the penalty so that the margin of the closest correctly classified point to the decision boundary is set to sinh^{-1}(1). This leads to the optimization problem

    minimize_{w ∈ R^{n+1}}  -(1/2) w ∗ w + C Σ_{j=1}^{m} max(0, sinh^{-1}(1) - sinh^{-1}(y^(j)(w ∗ x^(j)))),    (19)
    subject to  w ∗ w < 0.                                             (20)

In all our experiments in the following section, we consider the simplest approach of solving the above formulation of hyperbolic SVM via projected gradient descent.
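The objective (19) can be evaluated directly. A minimal sketch; the sign convention for the Minkowski product ∗ is chosen here so that the feasibility constraint (20) reads w ∗ w < 0, and the paper's own convention may differ:

```python
import math

def mink(u, v):
    """Minkowski product w * x with signature (+, -, ..., -), so that the
    feasibility constraint of Eq. (20) reads mink(w, w) < 0."""
    return u[0] * v[0] - sum(a * b for a, b in zip(u[1:], v[1:]))

def hsvm_objective(w, X, y, C=1.0):
    """Soft-margin hyperbolic SVM objective, Eq. (19); valid for mink(w, w) < 0."""
    m0 = math.asinh(1.0)                         # target margin sinh^{-1}(1)
    hinge = sum(max(0.0, m0 - math.asinh(yj * mink(w, xj)))
                for xj, yj in zip(X, y))
    return -0.5 * mink(w, w) + C * hinge

def boundary_distance(w, x):
    """Eq. (15): signed hyperbolic distance from x to the decision boundary of w."""
    return math.asinh(mink(w, x) / math.sqrt(-mink(w, w)))

# w = (0, 1, 0) is feasible (w * w = -1); x = (sqrt(2), 1, 0) lies on the
# hyperboloid (x * x = 1) and sits exactly at the target margin for y = -1.
w = [0.0, 1.0, 0.0]
x = [math.sqrt(2.0), 1.0, 0.0]
print(hsvm_objective(w, [x], [-1]))   # 0.5: regularizer only, zero hinge loss
```

A projected gradient step would follow each descent update by rescaling w back into the feasible set w ∗ w < 0 when the constraint is violated.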
The initial w is determined based on the solution w0 of a soft-margin SVM in the ambient Euclidean space of the hyperboloid model, so that w ∗ x = (w0)^T x for all x. This provides a good initialization for the optimization and has the additional benefit of improving the stability of the algorithm in the presence of potentially many local optima.

5 Experimental Results

    sinh^{-1}( (w ∗ x) / √(-w ∗ w) ).    (15)

…As a result, we reduce the problem of calculating the minimum hyperbolic distance to the decision boundary to a Euclidean geometry problem, by mapping the decision boundary and the point to the Poincaré half-space model, in which the decision boundary is characterized as a hemisphere. A full proof of Theorem 1 is provided in the Supplementary Information. Using this result, one can apply a sequence of transformations to the max-margin classification problem for the hyperbolic setting to obtain the following result.
Figure 2: Multi-class classification of Gaussian mixtures in hyperbolic space. (a) Two-fold cross-validation results (macro-AUPR, Euclidean SVM vs. hyperbolic SVM) for 100 simulated Gaussian mixture datasets with 4 randomly positioned components and 100 points sampled from each component. Each dot represents the average performance over 5 trials; vertical and horizontal lines represent standard deviations. Example decision hyperplanes for hyperbolic and Euclidean SVMs are shown in (b) and (c), respectively, using the Poincaré disk model. The color of each decision boundary denotes which component is being discriminated from the rest.
Classifier, embedding (dim)       karate        polbooks      football      polblogs
Hyperbolic SVM, hyperbolic (2)    0.86 ± 0.03   0.73 ± 0.04   0.24 ± 0.03   0.93 ± 0.01
Euclidean SVM, hyperbolic (2)     0.86 ± 0.03   0.66 ± 0.02   0.21 ± 0.01   0.93 ± 0.01
Euclidean SVM, Euclidean (2)      0.47 ± 0.07   0.34 ± 0.03   0.09 ± 0.01   0.60 ± 0.09
Euclidean SVM, Euclidean (5)      0.55 ± 0.08   0.35 ± 0.03   0.10 ± 0.01   0.69 ± 0.04
Euclidean SVM, Euclidean (10)     0.50 ± 0.08   0.36 ± 0.03   0.10 ± 0.01   0.72 ± 0.04
Euclidean SVM, Euclidean (25)     0.50 ± 0.09   0.37 ± 0.04   0.11 ± 0.02   0.80 ± 0.03

Table 1: Node classification performance on four real-world network datasets. We performed two-fold cross-validation experiments on the four real-world network datasets described in the main text. For all four datasets, hyperbolic SVM matched or outperformed Euclidean SVM (the datasets where the performance of the two methods was comparable, karate and polblogs, contained only two well-separated classes). Methods are evaluated by macro-averaged area under the precision-recall curve. Mean performance over 5 cross-validation trials over 5 different embeddings for each dataset is shown, each followed by the standard deviation. The best performance on each dataset is shown in boldface.

…hyperbolic spaces, our formulation naturally extends to higher-dimensional hyperbolic spaces, which may be of interest in future applications. More broadly, our work belongs to a growing body of literature that aims to develop learning algorithms that directly operate over a Riemannian manifold [17, 18]. Linear hyperplane-based…
- Below I pasted recent papers from high-impact-factor venues.
- A common pattern: representation learning followed by supervised learning, used to demonstrate an advantage.

ARTICLES  https://doi.org/10.1038/s41592-019-0616-3

Learning representations of microbe–metabolite interactions
James T. Morton, Alexander A. Aksenov, Louis Felix Nothias, James R. Foulds, Robert A. Quinn, Michelle H. Badri, Tami L. Swenson, Marc W. Van Goethem, Trent R. Northen, Yoshiki Vazquez-Baeza, Mingxun Wang, Nicholas A. Bokulich, Aaron Watters, Se Jin Song, Richard Bonneau, Pieter C. Dorrestein and Rob Knight

Integrating multiomics datasets is critical for microbiome research; however, inferring interactions across omics datasets has multiple statistical challenges.

Knowledge gained by integrating complementary omics data will lead to improved detection of microbial products and optimized culturing conditions for uncharacterized microorganisms [1]. Previous work has been able to predict metabolite abundance profiles from microbe abundance profiles [2, 3]. However, because conventional correlation techniques have unacceptably high false-discovery rates, finding meaningful relationships between genes within complex microbiomes and their products in the metabolome is challenging. Although there has been a widespread effort to develop mul…, it remains unclear whether these models can obtain individual microbe–metabolite interactions. Pearson's and Spearman's correlations assume independence between interactions, simplifying the estimation procedure by reducing it to a combination of independent two-dimensional problems. However, many studies have shown that the simplifications in these methods are not statistically valid for compositional data, a fact first recognized by Pearson in 1895 and followed up in numerous studies [13-17]. This problem is further complicated because both microbiome [17] and mass spectrometry [18-21] datasets…
We solve this problem by using neural networks (https://github.com/biocore/mmvec) to estimate the conditional probability that each molecule is present given the presence of a specific microorganism. We show, with known environmental (desert soil biocrust wetting) and clinical (cystic fibrosis lung) examples, our ability to recover microbe–metabolite relationships, and demonstrate how the method can discover relationships between microbially produced metabolites and inflammatory bowel disease.

(Nature Methods)
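The core estimator described here can be sketched as a pair of embedding tables and a softmax over metabolites. Everything below (class name, dimensions, initialization) is illustrative rather than the actual mmvec implementation:

```python
import math
import random

def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

class CoOccurrenceModel:
    """Minimal mmvec-style sketch: p(metabolite j | microbe i) comes from the
    inner product of a microbe embedding and a metabolite embedding, pushed
    through a softmax. Training (not shown) would fit U and V to observed
    microbe-metabolite co-occurrence counts."""

    def __init__(self, n_microbes, n_metabolites, dim=3, seed=0):
        rng = random.Random(seed)
        self.U = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_microbes)]
        self.V = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_metabolites)]

    def cond_prob(self, i):
        """Conditional distribution over all metabolites given microbe i."""
        scores = [sum(u * v for u, v in zip(self.U[i], vj)) for vj in self.V]
        return softmax(scores)

model = CoOccurrenceModel(n_microbes=3, n_metabolites=5)
print(model.cond_prob(0))  # a 5-way probability distribution summing to 1
```

The softmax output is what makes the estimates interpretable as conditional probabilities rather than raw correlations, which is the paper's answer to the compositionality problem.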
…ontology-based annotations

Fatima Zohra Smaili, Xin Gao* and Robert Hoehndorf*
Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
*To whom correspondence should be addressed.

Bioinformatics, 34, 2018, i52-i60, doi: 10.1093/bioinformatics/bty259 (ISMB 2018)

Abstract
Motivation: Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations make them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications.
Results: We propose the Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies. Our method can be applied to a wide range of bioinformatics research problems such as similarity-based prediction of interactions between proteins, classification of interaction types using supervised learning, or clustering. To evaluate Onto2Vec, we use the gene ontology (GO) and jointly produce dense vector representations of proteins, the GO classes to which they are annotated, and the axioms in GO that constrain these classes. First, we demonstrate that Onto2Vec-generated feature vectors can significantly im…

…classes used in the annotation in a single representation. Trivially, since Onto2Vec can generate representations of single classes in an ontology, an entity annotated with n classes, C1, …
…, Cn, can be represented as a (linear) combination of the vector representations of these classes. For example, if an entity e is annotated with C1 and …

…semantic similarity. As a first experiment, we evaluated the accuracy of Onto2Vec in predicting protein–protein interactions. For this purpose, we generated several representations of proteins: first, we used Onto2Vec to learn representations of proteins jointly with representations of GO classes by adding proteins and their annotations to …

Fig. 1. Onto2Vec workflow. The blue-shaded part illustrates the steps to obtain vector representations for classes from the ontology. The purple-shaded part shows the steps to obtain vector representations of ontology classes and the entities annotated to these classes.

…similarity). Table 3 summarizes the results. While Resnik semantic similarity and Onto2Vec similarity cannot distinguish between different types of interaction, we find that the supervised models, in particular the multiclass SVM and ANN, are capable, when using Onto2Vec vector representations, of distinguishing between different …

Fig. 3. ROC curves for PPI prediction for the supervised learning methods, in addition …

Table 2. Spearman correlation coefficients between STRING confidence scores and PPI prediction scores of different prediction methods

                  Yeast     Human
Resnik            0.1107    0.1151
Onto2Vec          0.1067    0.1099
Binary GO         0.1021    0.1031
Onto2Vec LR       0.1424    0.1453
Onto2Vec SVM      0.2245    0.2621
Onto2Vec NN       0.2516    0.2951
Binary GO LR      0.1121    0.1208
Binary GO SVM     0.1363    0.1592
Binary GO NN      0.1243    0.1616

Note: The highest absolute correlation across all methods is highlighted in bold.

- Spearman correlation coefficients for PPI prediction.
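The "linear combination" of class vectors mentioned above can be as simple as averaging. A sketch with made-up two-dimensional class vectors (real Onto2Vec vectors come from word2vec trained over ontology axioms; the averaging choice and the GO identifiers here are illustrative):

```python
def entity_vector(class_vectors, annotations):
    """Represent an entity as the mean of its annotated class vectors
    (one simple choice of linear combination; an assumption for this sketch)."""
    vecs = [class_vectors[c] for c in annotations]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# Hypothetical class vectors: entities sharing annotations come out similar.
go = {"GO:1": [1.0, 0.0], "GO:2": [0.8, 0.6], "GO:3": [0.0, 1.0]}
p1 = entity_vector(go, ["GO:1", "GO:2"])
p2 = entity_vector(go, ["GO:2"])
p3 = entity_vector(go, ["GO:3"])
print(cosine(p1, p2) > cosine(p1, p3))  # True: shared annotation, higher similarity
```

This is exactly the property the PPI experiments rely on: proteins with overlapping GO annotations end up close in the embedding space.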
Gene2vec: distributed representation of genes based on co-expression

Jingcheng Du†, Peilin Jia†, Yulin Dai, Cui Tao, Zhongming Zhao‡ and Degui Zhi*‡
From The International Conference on Intelligent Biology and Medicine (ICIBM) 2018, Los Angeles, CA, USA, 10-12 June 2018

Abstract
Background: Existing functional descriptions of genes are categorical, discrete, and mostly produced through a manual process. In this work, we explore the idea of gene embedding, a distributed representation of genes, in the spirit of word embedding.
Results: In a purely data-driven fashion, we trained a 200-dimension vector representation of all human genes, using gene co-expression patterns in 984 datasets from the GEO database. These vectors capture functional relatedness of genes in terms of recovering known pathways: the average inner product (similarity) of genes within a pathway is 1.52X greater than that of random genes. Using t-SNE, we produced a gene co-expression map that shows local concentrations of tissue-specific genes. We also illustrated the usefulness of the embedded gene vectors, laden with rich information on gene co-expression patterns, in tasks such as gene-gene interaction prediction.
Conclusions: We proposed a machine learning method that utilizes transcriptome-wide gene co-expression to generate a distributed representation of genes. We further demonstrated the utility of our distribution by predicting gene-gene interaction based solely on gene names. The distributed representation of genes could be useful for more bioinformatics applications.
Keywords: Distributed representation, Gene2Vec, Gene co-expression, Embedding, Word2vec, Gene-gene interaction

Background
Genes, discrete segments of the genome that are transcribed, are basic building blocks of molecular biological systems. Although almost all transcripts in the human genome have been identified, functional annotation of … The challenge of creating a quantitative semantic representation of discrete units of a complex system is not unique to gene systems.
For a long time, creating a quantitative representation of words had been challenging for linguistic modeling. Hinton proposed the pio…

Du et al. BMC Genomics 2019, 20(Suppl 1):82, https://doi.org/10.1186/s12864-018-5370-x

…based on which we explored the distribution of all human genes from our results (Fig. 3). A direct visualization of the gene distribution revealed that the majority of genes formed one single cloud, while several isolated groups of genes were scattered around. We extracted these gene islands and found they were mainly non-protein-coding genes. Island 2 was significantly populated with snoRNA genes (pink dots, p = 1.07 × 10^-72, Fisher's exact test). Island 4, located at the very right of the plot, mainly contains human cDNA/PAC clone genes. microRNA genes (cyan dots) were mainly distributed in island 2 (p = 3.99 × 10^-19), island 4 (p = 3.51 × 10^-73), and island 5 (p = 2.64 × 10^-41). A group of ncRNAs which start with "LOC" and are often uncharacterized split the whole distribution into a left panel and a right panel (red dots, Fig. 3). In the left panel, we observed a cluster of open reading frames (yellow dots, Fig. 3) in the human genome.

Tissue-specific genes form spatial patterns in the gene embedding
We mapped genes with z-scores representing their tissue-specific expression onto the gene co-expression map. We observed clear clusters in several tissues such as blood, skin, spleen, and lung (Fig. 4 and Additional file 1). Genes with high tissue specificity in blood highlighted two distant clusters. This is likely because blood samples are relatively more widely used in gene expression studies, so blood-specific genes and their relationships are better represented in our map. Tissues that are biologically relevant showed similar patterns. For example, tissues of the female reproductive system presented graded and similar patterns, including breast, ovary, and uterus.
In these tissues, genes located in the bottom part of the map in general showed increased tissue specificity compared to genes located in the top part of the map (Fig. 4 and Additional file 1).

        1       2       3       4       5       6       7       8       9       10
50      1.428   1.444   1.467   1.470   1.487   1.465   1.473   1.479   1.475   1.462
100     1.415   1.467   1.488   1.491   1.498   1.501   1.519   1.486   1.480   1.490
200     1.403   1.463   1.491   1.498   1.495   1.482   1.470   1.488   1.521   1.509
300     1.392   1.443   1.472   1.473   1.473   1.509   1.474   1.513   1.479   1.480

Bold numbers denote the largest number in each row.

Fig. 3. Gene co-expression map generated from the embedding reveals clusters of functionally related genes. F1 and F2 are the first and second dimensions of t-SNE. Red: LOC non-coding genes; cyan: microRNA; pink: small nucleolar RNA (snoRNA); yellow: uncharacterized ORFs.

(BMC Genomics)
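The 1.52X within-pathway similarity ratio quoted in the abstract is a simple statistic to reproduce. A sketch on hypothetical 2-d gene vectors (the paper compares against random gene pairs; here the set of all pairs serves as a deterministic background, and the gene names are made up):

```python
import itertools

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def avg_inner_product(vectors, pairs):
    return sum(inner(vectors[u], vectors[v]) for u, v in pairs) / len(pairs)

def pathway_ratio(vectors, pathway_genes):
    """Mean within-pathway inner product divided by the mean over all gene
    pairs (a deterministic stand-in for the paper's random-pair baseline)."""
    within = list(itertools.combinations(pathway_genes, 2))
    background = list(itertools.combinations(vectors, 2))
    return avg_inner_product(vectors, within) / avg_inner_product(vectors, background)

# Hypothetical gene vectors: g1-g3 share a pathway and point the same way.
genes = {
    "g1": [1.0, 0.0], "g2": [0.9, 0.1], "g3": [1.0, 0.1],
    "g4": [-1.0, 0.0], "g5": [0.0, -1.0], "g6": [0.5, -0.8],
}
print(pathway_ratio(genes, ["g1", "g2", "g3"]) > 1.0)  # True: co-pathway genes more similar
```

A ratio well above 1, as in the paper's table above, indicates that the embedding geometry has absorbed pathway membership from co-expression alone.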