Work done by S. M. Ali Eslami, Danilo J. Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu and Demis Hassabis.
Models
• Discriminative -> learn P(y | x)
• Generative -> learn P(x | y), e.g. learn the features of each class when y = malignant or benign. Also learns the class prior P(y).
• The two are linked by Bayes' rule (below).
Slide credit: Andrew Ng, Stanford OpenClassroom
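This is how a generative model is turned into a classifier: combine the learned likelihood P(x | y) with the class prior P(y) via Bayes' rule:

```latex
P(y \mid x) \;=\; \frac{P(x \mid y)\,P(y)}{P(x)} \;\propto\; P(x \mid y)\,P(y)
```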
Network Architecture
• Given the query viewpoint (Vq) and the scene representation (r), the generator (g) defines the distribution from which images can be sampled.
• One possible network applies a sequence of computational cores that take (Vq) and (r) as input.
• Each core is a convolutional LSTM with a skip connection to the output canvas (see the sketch below).
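A minimal sketch of such a core, assuming PyTorch; `GeneratorCore`, `unroll` and all layer sizes are illustrative, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

class GeneratorCore(nn.Module):
    """One computational core: a convolutional LSTM cell. Its input x is the
    query viewpoint Vq (broadcast over space) concatenated channel-wise with
    the representation r (and, in the full model, a latent sample z)."""

    def __init__(self, in_ch: int, hid_ch: int = 128, k: int = 5):
        super().__init__()
        # A single convolution computes all four LSTM gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def unroll(cores, upsample, x, h, c, u):
    """Apply the sequence of cores; each adds an upsampled contribution to a
    shared output canvas u (the skip connection), which is later decoded
    into the parameters of the image distribution."""
    for core in cores:
        h, c = core(x, h, c)
        u = u + upsample(h)
    return u
```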
Effect of (g) on Model Performance
• Training objective: reconstruction likelihood + regularisation (the variational bound shown below).
• Deeper models achieve higher likelihood, and not sharing weights across cores improves performance.
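In the paper's notation (x^q: target image at query viewpoint v^q, r: scene representation, z: latent variables), the objective has the standard conditional variational lower bound form, with the two terms above visible directly:

```latex
\log p_\theta\!\left(x^{q} \mid v^{q}, r\right)
\;\ge\;
\underbrace{\mathbb{E}_{q_\phi\left(z \mid x^{q},\, v^{q},\, r\right)}
  \left[\log p_\theta\!\left(x^{q} \mid z, v^{q}, r\right)\right]}_{\text{reconstruction likelihood}}
\;-\;
\underbrace{\mathrm{KL}\!\left(q_\phi\left(z \mid x^{q}, v^{q}, r\right)
  \,\middle\|\, p_\theta\!\left(z \mid v^{q}, r\right)\right)}_{\text{regularisation}}
```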
Algebra
• Suggests compositionality of shapes, colours and positions.
• Arithmetic can be performed directly in representation space (r).
• Samples are then drawn from (g), conditioned on the new (r) (see the sketch below).
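A sketch of this "scene algebra", assuming trained networks f (representation) and g (generator) with the interfaces shown; the handles and `g.sample` are hypothetical, not the paper's API:

```python
# Hypothetical handles to a trained GQN: f maps (image, viewpoint) -> r,
# g.sample(r, v_q) renders an image of the scene from query viewpoint v_q.
def scene_algebra(f, g, img_red_sphere, img_red_cube, img_blue_cube, v, v_q):
    """Swap an attribute by vector arithmetic in representation space:
    r(red sphere) - r(red cube) + r(blue cube) ~ r(blue sphere)."""
    r = f(img_red_sphere, v) - f(img_red_cube, v) + f(img_blue_cube, v)
    return g.sample(r, v_q)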
Scenes with Multiple Objects
• (g) is capable of predicting images from arbitrary viewpoints.
• This implies that (f) captures object identities, counts, positions and colours, as well as the position of the light and the colours of the walls and floor.
Control of a Robotic Arm
• A 9-joint robotic arm (Jaco) and a target object in a randomised room.
• RL task: the hand must reach the target and remain close to it. Reward: a decreasing function of the distance to the target (one possible form is sketched below).
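The paper only states that the reward decreases with distance; one illustrative choice satisfying that description:

```python
import math

def reward(dist: float, scale: float = 1.0) -> float:
    """Illustrative reward: largest when the hand is at the target,
    decaying smoothly with hand-target distance."""
    return math.exp(-dist / scale)
```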
Control of a Robotic Arm (cont.)
• Two networks, used in two stages:
  • Pre-train GQN on scenes containing the Jaco arm.
  • Use the frozen representation network (f) to train an RL agent (sketched below).
• (r) has much lower dimensionality than the input images.
• Policy learning is substantially more robust and data-efficient: ~4 times fewer interactions with the environment than standard methods.
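A sketch of the second stage, assuming PyTorch; the pretrained representation network f is frozen and only a small policy head is trained by the RL algorithm (`PolicyHead` and all sizes are illustrative):

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    """Small trainable head mapping the compact representation r to actions."""
    def __init__(self, r_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(r_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, r):
        return self.net(r)

def make_agent(f: nn.Module, r_dim: int, n_actions: int):
    for p in f.parameters():
        p.requires_grad_(False)      # (f) stays fixed during RL
    policy = PolicyHead(r_dim, n_actions)
    # Only the policy head's (far fewer) parameters are optimised.
    optimiser = torch.optim.Adam(policy.parameters(), lr=1e-4)
    return policy, optimiser
```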
Maze Environments (Partially Observed)
• 7x7 grid mazes generated with the OpenGL-based DeepMind Lab game engine.
• (g) is capable of predicting the top-down view from only a handful of first-person observations (aggregated as in the sketch below).
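The scene representation is built by summing the per-observation codes, which makes (r) invariant to the order and number of first-person observations; a two-line sketch, with f the (assumed pretrained) representation network:

```python
def aggregate(f, observations):
    """Element-wise sum of codes f(image, viewpoint) over all observations."""
    return sum(f(img, v) for img, v in observations)
```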
Shepard-Metzler Objects Environment
• Randomly generated shapes (similar to 3D Tetris pieces).
• (g) could infer the object's shape even from a single image.
• Capable of re-rendering from any viewpoint with high (near-indistinguishable) accuracy.
• Under heavy occlusion, (g) generated one of the many shapes consistent with the observed portion of the image (see the sampling sketch below).
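Because (g) is generative, this uncertainty can be probed simply by drawing several samples; each render differs in the unobserved part of the shape (`g.sample` is the same hypothetical interface as in the algebra sketch above):

```python
def plausible_completions(g, r, v_q, n: int = 5):
    """Draw n renders from the query viewpoint; under heavy occlusion they
    show different shapes, each consistent with the observed portion."""
    return [g.sample(r, v_q) for _ in range(n)]
```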
SfM vs GQN
• Structure-from-motion and other multiple-view-geometry techniques produce point clouds, meshes, collections of pre-defined primitives, ... (3D Scanning Lecture)
• GQN learns a representational space that can express the presence of textures, parts, objects, lights and scenes at a suitably high level of abstraction.
• GQN enables task-specific fine-tuning of the representation itself.
GQN vs Other Learning-based Methods
• Other neural approaches (autoencoders, etc.) focus on regularities of colours and patches in image space but fail to achieve high-level representations.
• GQN can account for uncertainty in scenes with heavy occlusion.
• GQN is not tied to a particular choice of generation architecture.
Restrictions
• The resulting representations are not directly interpretable.
• Experiments were limited to synthetic environments, because of:
  • the need for controlled analysis;
  • the limited availability of suitable real-world datasets.
• Total scene understanding involves more than just 3D geometry.
Conclusions
• A single architecture to perceive, interpret and represent synthetic scenes without human labelling.
• Representations adapt to capture the details of the environment.
• No problem-specific engineering of generators.
• Paves the way towards fully unsupervised scene understanding, planning and behaviour.
Resources
• DeepMind Blog, June 2018
• Science, Vol. 360, Issue 6394, pp. 1204-1210
• Open Access Version
• Datasets used in the experiments
• Related video
• Detailed pseudo-code is provided in the Supplementary Materials.
• DeepMind has filed a U.K. patent application (GP-201495-00-PCT) related to this work.