Neural Processes

Figure 1. Neural process model. (a) Graphical model of a neural process. x and y correspond to the data where y = f(x). C and T are the number of context points and target points respectively, and z is the global latent variable. A grey background indicates that the variable is observed. (b) Diagram of our neural process implementation. Variables in circles correspond to the variables of the graphical model in (a).

Specifically, given a set of observations O, a CNP is a conditional stochastic process $Q_\theta$ that defines distributions over $f(x)$ for inputs $x \in T$. $\theta$ is the real vector of all parameters defining $Q$. Inheriting from the properties of stochastic processes, we assume that $Q_\theta$ is invariant to permutations of $O$ and $T$: if $O'$, $T'$ are permutations of $O$ and $T$, respectively, then $Q_\theta(f(T) \mid O, T) = Q_\theta(f(T') \mid O, T') = Q_\theta(f(T) \mid O', T)$. In this work we generally enforce permutation invariance with respect to $T$ by assuming a factored structure. Specifically, we consider $Q_\theta$'s that factor $Q_\theta(f(T) \mid O, T) = \prod_{x \in T} Q_\theta(f(x) \mid O, x)$. In the absence of assumptions on the output space $Y$, this is the easiest way to ensure a valid stochastic process. Still, this framework can be extended to non-factored distributions; we consider such a model in the experimental section.

The defining characteristic of a CNP is that it conditions on $O$ via an embedding of fixed dimensionality. In more detail, we use the following architecture:

$$r_i = h_\theta(x_i, y_i) \quad \forall (x_i, y_i) \in O \quad (1)$$
$$r = r_1 \oplus r_2 \oplus \ldots \oplus r_{n-1} \oplus r_n \quad (2)$$
$$\phi_i = g_\theta(x_i, r) \quad \forall x_i \in T \quad (3)$$

where $h_\theta : X \times Y \to \mathbb{R}^d$ and $g_\theta : X \times \mathbb{R}^d \to \mathbb{R}^e$ are neural networks, $\oplus$ is a commutative operation that takes elements in $\mathbb{R}^d$ and maps them into a single element of $\mathbb{R}^d$, and $\phi_i$ are the parameters for $Q_\theta(f(x_i) \mid O, x_i) = Q(f(x_i) \mid \phi_i)$.
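As a concrete illustration of equations (1)–(3), here is a minimal NumPy sketch of the CNP forward pass. The single affine layers are hypothetical stand-ins for the trained networks $h_\theta$ and $g_\theta$, and the mean plays the role of the commutative operation $\oplus$:

```python
# Sketch of the CNP forward pass in equations (1)-(3), using NumPy.
# h_theta and g_theta stand in for small MLPs; here they are random
# affine maps purely for illustration (hypothetical parameters).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # dimension of the representation r_i

# Hypothetical "networks": single affine layers instead of trained MLPs.
W_h = rng.normal(size=(d, 2))           # h_theta : (x_i, y_i) -> r_i in R^d
W_g = rng.normal(size=(2, d + 1))       # g_theta : (x_i, r) -> phi_i = (mu_i, log sigma_i)

def h(x, y):
    """Encoder, eq. (1): r_i = h_theta(x_i, y_i)."""
    return np.tanh(W_h @ np.array([x, y]))

def cnp_predict(context, targets):
    """Aggregate (eq. 2) with a commutative mean, then decode (eq. 3)."""
    rs = np.stack([h(x, y) for x, y in context])
    r = rs.mean(axis=0)                 # one fixed-size summary of O
    phis = []
    for x in targets:
        mu, log_sigma = W_g @ np.concatenate([[x], r])
        phis.append((mu, np.exp(log_sigma)))   # (mean, std) of N(mu_i, sigma_i^2)
    return phis

context = [(0.0, 0.1), (0.5, 0.4), (1.0, 0.9)]  # observed (x_i, y_i) pairs
phis = cnp_predict(context, targets=[0.25, 0.75])
assert len(phis) == 2 and all(sigma > 0 for _, sigma in phis)
```

Note that the cost is one encoder pass per context point and one decoder pass per target, giving the $O(n + m)$ scaling discussed below.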
Depending on the task, the model learns to parametrize a different output distribution. This architecture ensures permutation invariance and $O(n + m)$ scaling for conditional prediction. We note that, since $r_1 \oplus \ldots \oplus r_n$ can be computed in $O(1)$ from $r_1 \oplus \ldots \oplus r_{n-1}$, this architecture supports streaming observations with minimal overhead.

For regression tasks we use $\phi_i$ to parametrize the mean and variance $\phi_i = (\mu_i, \sigma_i^2)$ of a Gaussian distribution $\mathcal{N}(\mu_i, \sigma_i^2)$ for every $x_i \in T$. For classification tasks $\phi_i$ parametrizes the class probabilities of a categorical distribution. The model is trained to predict all observations given a randomly chosen subset of them as context; thus, the targets $Q_\theta$ is scored on include both observed and unobserved values. In practice, we take Monte Carlo estimates of the gradient of this loss by sampling $f$ and $N$.

This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, we emphasize that the $Q_\theta$ are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.

In summary:

1. A CNP is a conditional distribution over functions trained to model the empirical conditional distributions of functions $f \sim P$.
2. A CNP is permutation invariant in $O$ and $T$.
3. A CNP is scalable, achieving a running-time complexity of $O(n + m)$ for making $m$ predictions with $n$ observations.

Within this specification of the model there are still some aspects that can be modified to suit specific requirements. The exact implementation of $h$, for example, can be adapted to the data type. For low-dimensional data the encoder can be implemented as an MLP, whereas for inputs with larger dimensions and spatial correlations it can also include convolutions. Finally, in the setup described the model is not able to produce any coherent samples, as it learns to model only a factored prediction of the means and the variances, disregarding the covariance between target points.
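One Monte Carlo training step of the kind described above can be sketched as follows. The sine family is a hypothetical stand-in for draws $f \sim P$, and the constant $(\mu, \sigma)$ is a placeholder for the per-target outputs $\phi_i = (\mu_i, \sigma_i)$ that a trained CNP would emit:

```python
# One Monte Carlo training step: sample a function f (here from a toy
# family of sines, a hypothetical stand-in for f ~ P), sample a context
# size N, and score Gaussian predictions on all points, observed and
# unobserved alike.
import numpy as np

rng = np.random.default_rng(4)

def gaussian_nll(y, mu, sigma):
    """Elementwise negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + 0.5 * ((y - mu) / sigma) ** 2

n = 20
xs = np.sort(rng.uniform(0.0, 1.0, size=n))              # one draw f ~ P ...
ys = np.sin(2 * np.pi * xs + rng.uniform(0, 2 * np.pi))  # ... evaluated at n points

N = rng.integers(1, n)                 # random context size N in {1, ..., n-1}
context = list(zip(xs[:N], ys[:N]))    # the model would condition on these

# Placeholder predictions; a trained CNP would emit these via g(x_i, r).
mu, sigma = np.zeros(n), np.ones(n)
loss = gaussian_nll(ys, mu, sigma).mean()
assert np.isfinite(loss)
```

A gradient step on `loss` with respect to the encoder and decoder parameters, averaged over many sampled $(f, N)$ pairs, is what the text refers to as the Monte Carlo estimate of the gradient.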
The lack of coherent samples is a result of this particular implementation of the model. One way to obtain coherent samples is to introduce a latent variable that we can sample from. We carry out some proof-of-concept experiments on such a model in section 4.2.3.
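A minimal sketch of that latent-variable extension, with hypothetical toy weights in place of trained networks: sampling a single global $z$ once per function draw makes all target predictions within the draw mutually coherent, while different draws yield different functions.

```python
# A latent-variable extension for coherent samples: the context is
# summarised into r as before, but predictions are decoded from a global
# latent z ~ N(mu(r), sigma^2 I) sampled once per function draw, so all
# target points within a draw share the same z. All weights below are
# hypothetical toy stand-ins for trained networks.
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_h = rng.normal(size=(d, 2))        # toy encoder h(x_i, y_i) -> r_i
W_mu = rng.normal(size=(d, d))       # r -> mu(r)
W_dec = rng.normal(size=(1, d + 1))  # decoder g(z, x) -> prediction

def encode(context):
    """Mean of the per-pair encodings (the commutative aggregation)."""
    return np.stack([np.tanh(W_h @ np.array(p)) for p in context]).mean(axis=0)

def sample_function(context, xs, rng, sigma=0.1):
    """One coherent draw: a single z is shared by every target in xs."""
    z = W_mu @ encode(context) + sigma * rng.normal(size=d)
    return np.array([(W_dec @ np.concatenate([z, [x]]))[0] for x in xs])

context = [(0.0, 0.1), (1.0, 0.9)]
xs = np.linspace(0.0, 1.0, 5)
draw1 = sample_function(context, xs, rng)
draw2 = sample_function(context, xs, rng)   # a different z, a different function
assert draw1.shape == (5,) and not np.allclose(draw1, draw2)
```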
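The $O(1)$ streaming property noted earlier can be made concrete with the mean as the commutative operation $\oplus$: the summary of $n$ encodings is a constant-time update of the summary of $n - 1$. A sketch:

```python
# Streaming aggregation: with the mean as the commutative operation, the
# summary r of n encodings is an O(1) update of the summary of n - 1,
# so newly arriving observations are folded in at constant cost.
import numpy as np

def update_mean(r, n, r_new):
    """Fold one new encoding r_new into the running mean r of n points."""
    return r + (r_new - r) / (n + 1)

rng = np.random.default_rng(2)
encodings = rng.normal(size=(10, 4))   # stand-ins for r_i = h(x_i, y_i)

r, n = np.zeros(4), 0
for r_i in encodings:                  # observations arriving one at a time
    r = update_mean(r, n, r_i)
    n += 1

# The streamed summary matches the batch mean over all encodings.
assert np.allclose(r, encodings.mean(axis=0))
```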
As such this prior is equivalent to a less informed posterior of the underlying function. This formulation makes it clear that the posterior given a subset of the context points will serve as the prior when additional context points are included. By using this setup, and training with different sizes of context, we encourage the learned model to be flexible with regards to the number and position of the context points.

2.4. The Neural process model

In our implementation of NPs we accommodate two additional desiderata: invariance to the order of context points and computational efficiency. The resulting model can be boiled down to three core components (see Figure 1b):

• An encoder $h$ from input space into representation space that takes in pairs of $(x, y)_i$ context values and produces a representation $r_i = h((x, y)_i)$ for each of the pairs. We parameterise $h$ as a neural network.

• An aggregator $a$ that summarises the encoded inputs. We are interested in obtaining a single order-invariant global representation $r$ that parameterises the latent distribution $z \sim \mathcal{N}(\mu(r), I\sigma(r))$. The simplest operation that ensures order-invariance and works well in practice is the mean function $r = a(r_i) = \frac{1}{n}\sum_{i=1}^{n} r_i$. Crucially, the aggregator reduces the runtime to $O(n + m)$, where $n$ and $m$ are the number of context and target points respectively.

• A conditional decoder $g$ that takes as input the sampled global latent variable $z$ as well as the new target locations $x_T$, and outputs the predictions $\hat{y}_T$ for the corresponding values of $f(x_T) = y_T$.

3. Related work

3.1. Conditional neural processes
Neural Processes (NPs) are a generalisation of Conditional Neural Processes (CNPs). CNPs share a large part of the motivation behind neural processes, but lack a latent variable that allows for global sampling (see Figure 2c for a diagram of the model). As a result, CNPs are unable to produce different function samples for the same context data, which can be important if model uncertainty is desirable. It is worth mentioning that the original CNP formulation did include experiments with a latent variable in addition to the deterministic connections. However, given the deterministic connections to the predicted variables, the role of the global latent variable is not clear. In contrast, NPs constitute a more clear-cut generalisation of the original deterministic CNP, with stronger parallels to other latent variable models and approximate Bayesian methods. These parallels allow us to compare our model to a wide range of related research areas in the following sections.

Finally, NPs and CNPs themselves can be seen as generalizations of recently published generative query networks (GQN), which apply a similar training procedure to predict new viewpoints in 3D scenes given some context observations (Eslami et al., 2018). Consistent GQN (CGQN) is an extension of GQN that focuses on generating consistent samples and is thus also closely related to NPs (Kumar et al., 2018).

3.2. Gaussian processes

We start by considering models that, like NPs, lie on the spectrum between neural networks (NNs) and Gaussian processes (GPs). Algorithms on the NN end of the spectrum fit a single function that they learn from a very large amount of data directly. GPs, on the other hand, can represent a distribution over a family of functions, which is constrained by an assumption on the functional form of the covariance between two points.

Scattered across this spectrum, we can place recent research that has combined ideas from Bayesian non-parametrics with neural networks. Methods like (Calandra et al., 2016; Huang et al., 2015) remain fairly close to the GPs, but…
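The order-invariance desideratum handled by the mean aggregator can be checked directly; the encoder weights here are hypothetical stand-ins for a trained $h$:

```python
# Order invariance of the mean aggregator: permuting the context points
# leaves the global representation r unchanged. The encoder weights are
# hypothetical stand-ins for a trained h.
import numpy as np

rng = np.random.default_rng(3)
W_h = rng.normal(size=(8, 2))          # toy encoder h((x, y)_i) -> r_i

def aggregate(context):
    """r = a(r_i) = (1/n) sum_i h((x, y)_i): an order-invariant summary."""
    return np.stack([np.tanh(W_h @ np.array(p)) for p in context]).mean(axis=0)

context = [(0.0, 0.2), (0.5, 0.5), (1.0, 0.8)]
shuffled = [context[2], context[0], context[1]]

# Any permutation of the context yields the same representation r.
assert np.allclose(aggregate(context), aggregate(shuffled))
```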
Figure 2. Graphical models of related models (a–c), among them the neural statistician and the conditional neural process (c), and of the neural process (d). Gray shading indicates the variable is observed. C stands for context variables and T for target variables, i.e. the variables to predict given C.