DEEP LEARNING JP [DL Papers] Conditional Neural Processes (ICML2018) Neural Processes (ICML2018Workshop) Kazuki Fujikawa, DeNA

Deep Kernel Learning

Deep neural networks excel at function approximation, yet they are typically trained from scratch for each new function. On the other hand, Bayesian methods, such as Gaussian Processes (GPs), exploit prior knowledge to quickly infer the shape of a new function at test time. Yet GPs are computationally expensive, and it can be hard to design appropriate priors.

Starting from a base kernel $k(x_i, x_j \mid \theta)$ with hyperparameters $\theta$, we transform the inputs (predictors) $x$ as $k(x_i, x_j \mid \theta) \to k(g(x_i, w), g(x_j, w) \mid \theta, w)$, where $g(x, w)$ is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights $w$.

For kernel learning, we use the chain rule to compute derivatives of the log marginal likelihood with respect to the deep kernel hyperparameters.
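
The construction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the RBF base kernel, the two-layer tanh network standing in for $g(x, w)$, and every function name here are assumptions made for the example. The log marginal likelihood is the objective whose gradients would reach $w$ through the chain rule; in practice an automatic differentiation framework would supply those gradients.

```python
import numpy as np

def g(x, w):
    """Illustrative stand-in for g(x, w): a two-layer tanh MLP (an assumption)."""
    W1, b1, W2, b2 = w
    return np.tanh(x @ W1 + b1) @ W2 + b2

def rbf(a, b, lengthscale):
    """Base kernel k(a, b | theta), here an RBF with theta = lengthscale."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def deep_kernel(xa, xb, w, lengthscale):
    """k(g(xa, w), g(xb, w) | theta, w): the base kernel applied to transformed inputs."""
    return rbf(g(xa, w), g(xb, w), lengthscale)

def log_marginal_likelihood(X, y, w, lengthscale, noise):
    """GP log marginal likelihood under the deep kernel plus observation noise."""
    n = X.shape[0]
    K = deep_kernel(X, X, w, lengthscale) + noise**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)

# Toy data and randomly initialised weights, purely for illustration.
rng = np.random.default_rng(0)
w = (rng.normal(size=(3, 8)), np.zeros(8), rng.normal(size=(8, 2)), np.zeros(2))
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
K = deep_kernel(X, X, w, lengthscale=1.0)
lml = log_marginal_likelihood(X, y, w, lengthscale=1.0, noise=0.1)
```

Because the network only re-maps the inputs, the resulting Gram matrix stays symmetric and positive semi-definite for any weights $w$.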

Review of Gaussian Processes and Single-Layer Neural Networks

We briefly review the correspondence between single-hidden-layer neural networks and GPs (Neal (1994a;b); Williams (1997)). The $i$th component of the network output, $z^1_i$, is computed as $z^1_i(x) = b^1_i + \sum_{j=1}^{N_1} W^1_{ij} x^1_j(x)$, where $x^1_j(x) = \phi\big(b^0_j + \sum_{k=1}^{d_{in}} W^0_{jk} x_k\big)$.

Because the weight and bias parameters are taken to be i.i.d., the post-activations $x^1_j$, $x^1_{j'}$ are independent for $j \neq j'$. Moreover, since $z^1_i(x)$ is a sum of i.i.d. terms, it follows from the Central Limit Theorem that in the limit of infinite width $N_1 \to \infty$, $z^1_i(x)$ will be Gaussian distributed.
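
The Gaussian limit can be checked empirically with a small Monte Carlo sketch. Everything specific here is an assumption for illustration: the tanh nonlinearity, the values of `SIGMA_W` and `SIGMA_B`, and the variance scaling $W^1_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N_1)$, $b \sim \mathcal{N}(0, \sigma_b^2)$, which the text above does not state explicitly. Under that scaling the output variance should be $\sigma_b^2 + \sigma_w^2 \, \mathbb{E}[\phi^2]$.

```python
import numpy as np

SIGMA_W, SIGMA_B = 1.0, 0.5  # assumed prior scales, chosen only for this example

def sample_outputs(x, n_hidden, n_nets, seed=0):
    """Draw n_nets random single-hidden-layer networks and evaluate z^1_i(x) for each."""
    rng = np.random.default_rng(seed)
    d_in = x.shape[0]
    W0 = rng.normal(0.0, SIGMA_W / np.sqrt(d_in), size=(n_nets, n_hidden, d_in))
    b0 = rng.normal(0.0, SIGMA_B, size=(n_nets, n_hidden))
    post = np.tanh(W0 @ x + b0)  # post-activations x^1_j(x), shape (n_nets, n_hidden)
    W1 = rng.normal(0.0, SIGMA_W / np.sqrt(n_hidden), size=(n_nets, n_hidden))
    b1 = rng.normal(0.0, SIGMA_B, size=n_nets)
    z = b1 + np.sum(W1 * post, axis=1)  # one output sample per network
    return z, post

x = np.array([1.0, -0.5, 2.0])
z, post = sample_outputs(x, n_hidden=200, n_nets=4000)

# Variance predicted by the zero-mean prior: sigma_b^2 + sigma_w^2 * E[phi^2].
predicted_var = SIGMA_B**2 + SIGMA_W**2 * np.mean(post**2)
empirical_var = np.var(z)
```

Across the sampled networks, `empirical_var` should land close to `predicted_var`, and a histogram of `z` should look Gaussian.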

Likewise, from the multidimensional Central Limit Theorem, any finite collection of $\{z^1_i(x^{\alpha=1}), \dots, z^1_i(x^{\alpha=k})\}$ will have a joint multivariate Gaussian distribution, which is exactly the definition of a Gaussian process. Therefore we conclude that $z^1_i \sim \mathcal{GP}(\mu^1, K^1)$, a GP with mean $\mu^1$ and covariance $K^1$, which are themselves independent of $i$. Because the parameters have zero mean, we have that $\mu^1(x) = \mathbb{E}[z^1_i(x)] = 0$ and $K^1(x, x') \equiv \mathbb{E}[z^1_i(x) z^1_i(x')] = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}[x^1_i(x) x^1_i(x')] \equiv \sigma_b^2 + \sigma_w^2 \, C(x, x')$.

Gaussian Processes and Deep Neural Networks

The arguments of the previous section can be extended to deeper layers by induction. We proceed by taking the hidden layer widths to be infinite in succession ($N_1 \to \infty$, $N_2 \to \infty$, etc.) to guarantee that the input to the layer under consideration is already governed by a GP.

Suppose that $z^{l-1}_j$ is a GP, identical and independent for every $j$. After $l-1$ steps, the network computes $z^l_i(x) = b^l_i + \sum_{j=1}^{N_l} W^l_{ij} x^l_j(x)$, where $x^l_j(x) = \phi(z^{l-1}_j(x))$.

As before, $z^l_i(x)$ is a sum of i.i.d. random terms so that, as $N_l \to \infty$, any finite collection will have a joint multivariate Gaussian distribution and $z^l_i \sim \mathcal{GP}(0, K^l)$. The covariance is $K^l(x, x') \equiv \mathbb{E}[z^l_i(x) z^l_i(x')] = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}[\phi(z^{l-1}_i(x)) \, \phi(z^{l-1}_i(x'))]$.
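
For a concrete nonlinearity the layer-to-layer covariance update can be written out. The sketch below assumes $\phi$ is ReLU, for which the Gaussian expectation has a known closed form (the arc-cosine kernel), and assumes the base case $K^0(x, x') = \sigma_b^2 + \sigma_w^2 \, x \cdot x' / d_{in}$. The function names and the values of `sigma_w` and `sigma_b` are illustrative, not from the source.

```python
import numpy as np

def relu_expectation(k_xy, k_xx, k_yy):
    """E[relu(u) relu(v)] for zero-mean Gaussian (u, v) with the given (co)variances."""
    norm = np.sqrt(k_xx * k_yy)
    cos_t = np.clip(k_xy / norm, -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def nngp_kernel(X, depth, sigma_w=1.4, sigma_b=0.3):
    """Apply the covariance recursion `depth` times, starting from the assumed base case."""
    d_in = X.shape[1]
    K = sigma_b**2 + sigma_w**2 * (X @ X.T) / d_in
    for _ in range(depth):
        diag = np.diag(K)
        # The update needs only K(x, x'), K(x, x) and K(x', x') from the previous layer.
        K = sigma_b**2 + sigma_w**2 * relu_expectation(K, diag[:, None], diag[None, :])
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
K0 = nngp_kernel(X, depth=0)
K3 = nngp_kernel(X, depth=3)
```

Each update reads only three previous-layer entries per input pair, so the whole computation is a deterministic map on the Gram matrix and needs no sampling.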

By induction, the expectation is over the GP governing $z^{l-1}_i$, but this is equivalent to integrating against the joint distribution of only $z^{l-1}_i(x)$ and $z^{l-1}_i(x')$. The latter is described by a zero-mean, two-dimensional Gaussian whose covariance matrix has distinct entries $K^{l-1}(x, x')$, $K^{l-1}(x, x)$, and $K^{l-1}(x', x')$. We introduce the shorthand $K^l(x, x') = \sigma_b^2 + \sigma_w^2 \, F\big(K^{l-1}(x, x'), K^{l-1}(x, x), K^{l-1}(x', x')\big)$ to emphasize the recursive relationship between $K^l$ and $K^{l-1}$ via a deterministic function $F$ whose form depends only on the nonlinearity $\phi$.

MNIST Image Completion

As shown in Figure 3a, the model learns to make good predictions of the underlying digit even for a small number of context points. Crucially, when conditioned only on one non-informative context point (e.g. a black pixel on the edge), the model's prediction corresponds to the average over all MNIST digits. As the number of context points increases, the predictions become more similar to the underlying ground truth.

It is worth mentioning that even with a complete set of observations the model does not achieve pixel-perfect reconstruction, as we have a bottleneck at the representation level. Since this implementation of CNP returns factored outputs, the best prediction it can produce given limited context information is to average over all possible predictions that agree with the context.

An important aspect of the model is its ability to estimate the uncertainty of the prediction. As shown in the bottom row of Figure 3a, as we add more observations, the variance shifts from being almost uniformly spread over the digit positions to being localized around areas that are specific to the underlying digit, specifically its edges. Being able to model the uncertainty given some context can be helpful for many tasks. One example is active exploration.
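
As a rough sketch of how predicted uncertainty could drive active exploration, the next pixel to observe can be chosen as the unobserved location with the highest predicted variance. The function `next_pixel` and the toy arrays are hypothetical; the variance map would come from the model's output for the current context.

```python
import numpy as np

def next_pixel(variance, observed):
    """Return (row, col) of the unobserved pixel with the largest predicted variance."""
    masked = np.where(observed, -np.inf, variance)  # exclude pixels already in the context
    flat_index = int(np.argmax(masked))
    return tuple(int(i) for i in np.unravel_index(flat_index, variance.shape))

# Toy 3x3 variance map; the centre pixel is already observed.
variance = np.array([[0.1, 0.2, 0.1],
                     [0.3, 0.9, 0.4],
                     [0.1, 0.2, 0.8]])
observed = np.zeros((3, 3), dtype=bool)
observed[1, 1] = True
choice = next_pixel(variance, observed)
```

Here the centre pixel has the largest variance but is masked out, so the selection falls on the bottom-right pixel.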

An important aspect of CNPs, demonstrated in Figure 5, is their flexibility not only in the number of observations and targets they receive but also with regard to their input values. It is interesting to compare this property to GPs on the one hand and to trained generative models on the other.

Consider conditioning the model on one half of the image, for example. This forces the model to not only predict pixel values but also handle arbitrary conditioning patterns.

Latent Variable Model

We apply this model to MNIST and CelebA (Figure 6). We use the same models as before, but we concatenate the representation $r$ to a vector of latent variables $z$ of size 64 (for CelebA we use bigger models where the sizes of $r$ and $z$ are 1024 and 128 respectively). For both the prior and posterior models, we use three-layer MLPs and average their outputs.

We emphasize that the difference between the prior and posterior is that the prior only sees the observed pixels, while the posterior sees both the observed and the target pixels. When sampling from this model with a small number of observed pixels, we get coherent samples and we see that the variability of the datasets is captured. As the model is conditioned on more and more observations, the variability of the samples drops and they eventually converge to a single possibility.

Classification on Omniglot

Finally, we apply the model to one-shot classification using the Omniglot dataset (Lake et al., 2015). This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples and as such this dataset is particularly suitable for few-shot learning algorithms.

We generate 5-way and 20-way classification tasks. Results show CNP achieves 95.3% (1-shot) and 98.5% (5-shot) accuracy on 5-way tasks, and 89.9% (1-shot) and 96.8% (5-shot) on 20-way tasks, using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of O(n + m) at test time as opposed to O(nm).

Neural Processes Model

The defining characteristic of a CNP is that it conditions on O via an embedding of fixed dimensionality. In more detail, we use the following architecture:

$r_i = h_\theta(x_i, y_i) \quad \forall (x_i, y_i) \in O$
$r = r_1 \oplus r_2 \oplus \dots \oplus r_{n-1} \oplus r_n$
$\phi_i = g_\theta(x_i, r) \quad \forall (x_i) \in T$

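where $h_\theta: X \times Y \to \mathbb{R}^d$ and $g_\theta: X \times \mathbb{R}^d \to \mathbb{R}^e$ are neural networks, $\oplus$ is a commutative operation that takes elements in $\mathbb{R}^d$ and maps them into a single element of $\mathbb{R}^d$, and $\phi_i$ are parameters for $Q_\theta(f(x_i) \mid O, x_i)$.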
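
The three equations can be sketched as a forward pass with untrained random weights. Several choices are assumptions made for illustration: the mean as the commutative operation $\oplus$, small ReLU MLPs for $h_\theta$ and $g_\theta$, and $\phi_i = (\mu_i, \sigma_i)$ with a softplus keeping $\sigma_i$ positive. The helper names `init_mlp`, `mlp` and `cnp_forward` are hypothetical.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random (untrained) weights for an MLP with the given layer sizes."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """ReLU MLP with a linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

def cnp_forward(h_params, g_params, x_ctx, y_ctx, x_tgt):
    """Encode each (x_i, y_i) in O, aggregate to r, decode every target x_i in T."""
    r_i = mlp(h_params, np.concatenate([x_ctx, y_ctx], axis=1))  # (n, d)
    r = r_i.mean(axis=0)                                         # (d,), order-independent
    r_rep = np.broadcast_to(r, (x_tgt.shape[0], r.shape[0]))     # share r across targets
    out = mlp(g_params, np.concatenate([x_tgt, r_rep], axis=1))  # (m, 2)
    mu = out[:, 0]
    sigma = 0.1 + 0.9 * np.log1p(np.exp(out[:, 1]))              # softplus keeps sigma > 0
    return mu, sigma

rng = np.random.default_rng(0)
d = 16
h_params = init_mlp(rng, [2, 32, d])      # h_theta: (x, y) -> R^d, with scalar x and y
g_params = init_mlp(rng, [1 + d, 32, 2])  # g_theta: (x, r) -> (mean, raw scale)
x_ctx, y_ctx = rng.normal(size=(5, 1)), rng.normal(size=(5, 1))
x_tgt = np.linspace(-1.0, 1.0, 7)[:, None]
mu, sigma = cnp_forward(h_params, g_params, x_ctx, y_ctx, x_tgt)
```

The encoder runs once per context point and the decoder once per target, which is where the O(n + m) cost comes from. Reordering the context rows leaves `mu` and `sigma` unchanged because the mean ignores order.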

This architecture ensures permutation invariance and O(n + m) scaling for conditional prediction. We note that, since $r_1 \oplus \dots \oplus r_n$ can be computed in O(1) from $r_1 \oplus \dots \oplus r_{n-1}$, this architecture supports streaming observations with minimal overhead.
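
A minimal sketch of the streaming update, assuming the aggregation is a mean: the running representation is refreshed in constant time per new observation, without revisiting earlier ones. The class name `StreamingAggregator` is hypothetical.

```python
import numpy as np

class StreamingAggregator:
    """Keeps the mean of all embeddings r_i seen so far, updated in O(1) per observation."""

    def __init__(self, dim):
        self.n = 0
        self.r = np.zeros(dim)

    def update(self, r_i):
        # Incremental mean: r_n = r_{n-1} + (r_i - r_{n-1}) / n
        self.n += 1
        self.r = self.r + (r_i - self.r) / self.n
        return self.r

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 4))  # stand-ins for encoder outputs r_1..r_10
agg = StreamingAggregator(dim=4)
for r_i in embeddings:
    r_stream = agg.update(r_i)
```

After the loop, `r_stream` matches the batch mean of all ten embeddings, so predictions made from it agree with recomputing the aggregate from scratch.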