Figure 1. Neural process model. (a) Graphical model of a neural process. x and y correspond to the data where y = f(x). C and T are the number of context points and target points respectively, and z is the global latent variable. A grey background indicates that the variable is observed. (b) Diagram of our neural process implementation. Variables in circles correspond to the variables of the graphical model in (a); the unbound letters h, a and g denote the encoder, aggregator and decoder modules, covering both the generation and inference paths.
Specifically, given a set of observations O, a CNP is a conditional stochastic process Q_θ that defines distributions over f(x) for inputs x ∈ T. θ is the real vector of all parameters defining Q. Inheriting from the properties of stochastic processes, we assume that Q_θ is invariant to permutations of O and T: if O′ and T′ are permutations of O and T respectively, then Q_θ(f(T) | O, T) = Q_θ(f(T′) | O, T′) = Q_θ(f(T) | O′, T). In this work we generally enforce permutation invariance with respect to T by assuming a factored structure. Specifically, we consider Q_θ that factor as Q_θ(f(T) | O, T) = ∏_{x ∈ T} Q_θ(f(x) | O, x). In the absence of assumptions on the output space Y, this is the easiest way to ensure a valid stochastic process. Still, the framework can be extended to non-factored distributions; we consider such a model in the experimental section.
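Concretely, under this factored form the joint log-probability of the targets is just a sum of per-target terms. Below is a toy sketch of this sum, assuming Gaussian per-target predictions; all values and names are illustrative, not outputs of any trained model.

```python
import torch

# Factored predictive: log Q(f(T) | O, T) = sum over x in T of log Q(f(x) | O, x).
mu = torch.tensor([0.0, 1.0, -0.5])       # per-target means (toy values)
sigma = torch.tensor([0.2, 0.3, 0.1])     # per-target standard deviations (toy values)
y = torch.tensor([0.1, 0.8, -0.4])        # observed target values (toy values)
log_q = torch.distributions.Normal(mu, sigma).log_prob(y).sum()
```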
The defining characteristic of a CNP is that it conditions on O via an embedding of fixed dimensionality. In more detail, we use the following architecture:

r_i = h_θ(x_i, y_i)    ∀ (x_i, y_i) ∈ O    (1)
r = r_1 ⊕ r_2 ⊕ … ⊕ r_{n−1} ⊕ r_n    (2)
φ_i = g_θ(x_i, r)    ∀ x_i ∈ T    (3)

where h_θ : X × Y → R^d and g_θ : X × R^d → R^e are neural networks, ⊕ is a commutative operation that takes elements in R^d and maps them into a single element of R^d, and the φ_i are parameters for Q_θ(f(x_i) | O, x_i) = Q(f(x_i) | φ_i). Depending on the task, the model learns to parametrize a different output distribution. This architecture ensures permutation invariance and O(n + m) scaling for conditional prediction. We note that, since r_1 ⊕ … ⊕ r_n can be computed in O(1) from r_1 ⊕ … ⊕ r_{n−1}, this architecture supports streaming observations with minimal overhead.
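To make equations (1)–(3) concrete, here is a minimal sketch of a CNP for 1-d regression, assuming PyTorch and using the mean as the commutative operation ⊕. The layer sizes, the softplus bound on the predicted standard deviation, and all names are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class CNP(nn.Module):
    """Minimal conditional neural process: encoder h, mean aggregation, decoder g."""

    def __init__(self, x_dim=1, y_dim=1, r_dim=128):
        super().__init__()
        # h_theta: (x_i, y_i) -> r_i, equation (1)
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim, 128), nn.ReLU(), nn.Linear(128, r_dim))
        # g_theta: (x_i, r) -> phi_i = (mu_i, sigma_i), equation (3)
        self.decoder = nn.Sequential(
            nn.Linear(x_dim + r_dim, 128), nn.ReLU(), nn.Linear(128, 2 * y_dim))

    def forward(self, x_ctx, y_ctx, x_tgt):
        r_i = self.encoder(torch.cat([x_ctx, y_ctx], dim=-1))  # (n, r_dim)
        r = r_i.mean(dim=0, keepdim=True)                      # equation (2): commutative op = mean
        r = r.expand(x_tgt.shape[0], -1)                       # same r for every target point
        out = self.decoder(torch.cat([x_tgt, r], dim=-1))
        mu, raw_sigma = out.chunk(2, dim=-1)
        sigma = 0.1 + 0.9 * nn.functional.softplus(raw_sigma)  # keep the predicted std positive
        return mu, sigma
```

With n context points and m target points this amounts to one encoder pass per context pair and one decoder pass per target, matching the O(n + m) scaling noted above.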
For regression tasks we use φ_i to parametrize the mean and variance, φ_i = (µ_i, σ²_i), of a Gaussian distribution N(µ_i, σ²_i) for every x_i ∈ T. For classification tasks φ_i parametrizes the class probabilities of a categorical distribution. The model is trained to predict a sampled function f ∼ P at all of its points given a random number N of them as observations; thus, the targets it scores Q_θ on include both observed and unobserved values. In practice, we take Monte Carlo estimates of the gradient of this loss by sampling f and N.
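As a rough illustration of this training scheme, the sketch below (continuing the hypothetical CNP class above) samples one function, conditions on a random number N of its points, and takes a Monte Carlo gradient step on the Gaussian negative log-likelihood of all points. The `sample_task` data sampler is an assumed helper, not something defined in the paper.

```python
import torch


def cnp_train_step(model, optimiser, sample_task):
    """One Monte Carlo step: sample f ~ P and a context size N, then score all targets."""
    x, y = sample_task()                       # all points of one function f ~ P, shapes (n, d)
    n = x.shape[0]
    N = torch.randint(1, n, (1,)).item()       # random number of observed context points
    mu, sigma = model(x[:N], y[:N], x)         # condition on the first N points, predict all n
    nll = -torch.distributions.Normal(mu, sigma).log_prob(y).mean()
    optimiser.zero_grad()
    nll.backward()                             # Monte Carlo estimate of the gradient of the loss
    optimiser.step()
    return nll.item()
```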
This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, we emphasize that the Q_θ are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that. In summary:

1. A CNP is a conditional distribution over functions trained to model the empirical conditional distributions of functions f ∼ P.
2. A CNP is permutation invariant in O and T.
3. A CNP is scalable, achieving a running time complexity of O(n + m) for making m predictions with n observations.
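The scalability and streaming properties summarised above rely on the aggregation in equation (2) being updatable in constant time. A small sketch of such an update follows, assuming ⊕ is implemented as a running mean; shapes and names are illustrative.

```python
import torch


def update_aggregate(r, n, r_new):
    """O(1) update of the mean aggregate r over n embeddings when a new r_new arrives."""
    return (n * r + r_new) / (n + 1), n + 1


r, n = torch.zeros(128), 0               # empty aggregate
for r_i in torch.randn(5, 128):          # five streamed context embeddings (toy data)
    r, n = update_aggregate(r, n, r_i)
```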
Within this specification of the model there are still some
aspects that can be modified to suit specific requirements.
The exact implementation of h, for example, can be adapted
to the data type. For low-dimensional data the encoder can be implemented as an MLP, whereas for inputs with larger dimensions and spatial correlations it can also include convolutions. Finally, in the setup described the model is not able to produce coherent samples, as it learns only a factored prediction of the means and variances, disregarding the covariance between target points; this is a result of this particular implementation of the model. One way to obtain coherent samples is to introduce a latent variable that we can sample from. We carry out some proof-of-concept experiments on such a model in Section 4.2.3.
In this training setup the prior is conditioned on the context, and as such it is equivalent to a less informed posterior of the underlying function. This formulation makes it clear that the posterior given a subset of the context points will serve as the prior when additional context points are included. By using this setup, and training with different context sizes, we encourage the learned model to be flexible with regard to the number and position of the context points.
2.4. The Neural process model
In our implementation of NPs we accommodate two additional desiderata: invariance to the order of the context points and computational efficiency. The resulting model can be boiled down to three core components (see Figure 1b and the sketch after this list):

• An encoder h from input space into representation space that takes in pairs (x, y)_i of context values and produces a representation r_i = h((x, y)_i) for each of the pairs. We parameterise h as a neural network.

• An aggregator a that summarises the encoded inputs. We are interested in obtaining a single order-invariant global representation r that parameterises the latent distribution z ∼ N(µ(r), Iσ(r)). The simplest operation that ensures order-invariance and works well in practice is the mean function r = a(r_i) = (1/n) Σ_{i=1}^{n} r_i. Crucially, the aggregator reduces the runtime to O(n + m), where n and m are the number of context and target points respectively.

• A conditional decoder g that takes as input the sampled global latent variable z as well as the new target locations x_T, and outputs the predictions ŷ_T for the corresponding values of f(x_T) = y_T.
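A minimal sketch of these three components is given below, assuming PyTorch; the layer sizes, the softplus parameterisation of σ(r), and the single latent sample are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class NeuralProcess(nn.Module):
    """Encoder h, mean aggregator a, and conditional decoder g with a global latent z."""

    def __init__(self, x_dim=1, y_dim=1, r_dim=128, z_dim=64):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(x_dim + y_dim, 128), nn.ReLU(),
                               nn.Linear(128, r_dim))                    # encoder
        self.mu_z = nn.Linear(r_dim, z_dim)                              # mu(r)
        self.sigma_z = nn.Linear(r_dim, z_dim)                           # sigma(r)
        self.g = nn.Sequential(nn.Linear(x_dim + z_dim, 128), nn.ReLU(),
                               nn.Linear(128, y_dim))                    # decoder

    def forward(self, x_ctx, y_ctx, x_tgt):
        r_i = self.h(torch.cat([x_ctx, y_ctx], dim=-1))                  # per-pair representations
        r = r_i.mean(dim=0)                                              # aggregator a: order-invariant mean
        mu, sigma = self.mu_z(r), nn.functional.softplus(self.sigma_z(r))
        z = mu + sigma * torch.randn_like(sigma)                         # sample z ~ N(mu(r), I sigma(r))
        z = z.expand(x_tgt.shape[0], -1)                                 # same global z for all targets
        return self.g(torch.cat([x_tgt, z], dim=-1))                     # predictions y_hat_T
```

Sampling z more than once for the same context yields different, globally coherent function samples, which is exactly the property the deterministic CNP described above lacks.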
3. Related work
3.1. Conditional neural processes
Neural Processes (NPs) are a generalisation of Conditional Neural Processes (CNPs). CNPs share a large part of the motivation behind neural processes, but lack a latent variable that allows for global sampling (see Figure 2c for a diagram of the model). As a result, CNPs are unable to produce different function samples for the same context data, which can be important if modelling this uncertainty is desirable. It is worth mentioning that the original CNP formulation did include experiments with a latent variable in addition to the deterministic connections. However, given the deterministic connections to the predicted variables, the role of the global latent variable is not clear. In contrast, NPs constitute a more clear-cut generalisation of the original deterministic CNP, with stronger parallels to other latent variable models and approximate Bayesian methods. These parallels allow us to compare our model to a wide range of related research areas in the following sections.

Finally, NPs and CNPs themselves can be seen as generalizations of the recently published generative query networks (GQN), which apply a similar training procedure to predicting new viewpoints in 3D scenes given some context observations (Eslami et al., 2018). Consistent GQN (CGQN) is an extension of GQN that focuses on generating consistent samples and is thus also closely related to NPs (Kumar et al., 2018).
3.2. Gaussian processes
We start by considering models that, like NPs, lie on the spectrum between neural networks (NNs) and Gaussian processes (GPs). Algorithms on the NN end of the spectrum fit a single function that they learn from a very large amount of data directly. GPs, on the other hand, can represent a distribution over a family of functions, which is constrained by an assumption on the functional form of the covariance between two points.

Scattered across this spectrum, we can place recent research that has combined ideas from Bayesian non-parametrics with neural networks. Methods like (Calandra et al., 2016; Huang et al., 2015) remain fairly close to the GPs, but incorporate neural networks to process the inputs.
Figure 2. Graphical models of related models (a–c), including the neural statistician and the conditional neural process (c), and of the neural process (d). Gray shading indicates the variable is observed. C stands for context variables and T for target variables, i.e. the variables to predict given C.