
Attention is not quite all you need


While working on MacGraph, a neural network that answers questions using knowledge graphs, we came across a problem: how to tell whether something is not present in a list.

We (Octavian.AI) have developed a novel and robust neural-network solution for detecting whether an item is present in a list, built as an extension of the popular attention mechanism.


Ashwath Salimath

January 05, 2019


Transcript

  1. Attention in MacGraph • To answer natural-language questions about a graph with natural-language answers, it’s necessary to transform the input question into a graph state that leads to the correct answer, and then to extract the answer information from that graph state and transform it into the desired answer.
  2. • Our solution for transforming between natural language and graph state is to use attention. • For more detail, see our other article, Graphs and neural networks: Reading node properties.
  3. • Attention cells are radically different from dense layers: they work with lists of information, extracting individual elements depending on their content or location. • These properties make attention cells great for selecting from the lists of nodes and edges that make up a graph.
  4. • Whilst working on MacGraph, a neural network that answers questions using knowledge graphs, we came across a problem: how to tell if something’s not present in a list. • I will talk about our solution, the focus signal, and show how it performs reliably across a range of different datasets and architectures.
  5. • In traditional programming, it’s easy to tell if something is not in a list, e.g. by running a for loop over the items (see the sketch below). • Neural networks are composed of differentiable functions so that they can be trained using gradient descent. • Equality operators, for loops and if conditions, the standard pieces of traditional programming used to solve this task, do not work well in neural networks.
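
    For contrast, here is the traditional check in Python; contains is an illustrative name, not anything from MacGraph:

        def contains(items, target):
            # Classic list membership: loop and compare with an equality operator.
            for item in items:
                if item == target:  # equality is a step function: no useful gradient
                    return True
            return False
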
  6. • An equality operator is essentially a step function, which has zero gradient almost everywhere and therefore breaks gradient-descent back-propagation (demonstrated below). • If conditions generally use a boolean signal to switch branches, which again is often the output of a problematic step function. • While loops can be inefficient on GPUs, and are sometimes not even usable, as neural network libraries often require all data to have the same dimensions (e.g. TensorFlow executes a statically defined tensor graph).
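
    A quick way to see the step-function problem, assuming TensorFlow 2: the equality op has no registered gradient, so nothing flows back to the inputs.

        import tensorflow as tf

        x = tf.Variable([1.0, 2.0, 3.0])
        target = tf.constant(2.0)

        with tf.GradientTape() as tape:
            # Equality is a step function: its output jumps between 0 and 1,
            # so its derivative is zero (or undefined) everywhere.
            match = tf.cast(tf.equal(x, target), tf.float32)
            loss = tf.reduce_sum(match)

        print(tape.gradient(loss, x))  # None: no gradient flows through equality
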
  7. • A popular neural-network technique for working with lists of items (e.g. translating sentences by treating them as lists of words) is to apply “attention”: a function in which a learned “query” describing what the network is looking for is compared to each item in the list, and a weighted sum of the items similar to the query is output.
  8. 1. The query is dot-producted with each item in the list to compute a “score”; this is done in parallel for all items. 2. The scores are passed through softmax, transforming them into a list that sums to 1.0 and can be used as a probability distribution. 3. Finally, a weighted sum of the items is calculated, weighting each item by its score. (The whole computation is sketched in code below.)
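
    Those three steps fit in a few lines of NumPy. This is a generic sketch of content-based attention, not MacGraph’s exact cell:

        import numpy as np

        def attention(items, query):
            # items: (n, d) array of item vectors; query: (d,) vector.
            # 1. Dot-product the query with every item in parallel -> raw scores, shape (n,)
            scores = items @ query
            # 2. Softmax turns the scores into a distribution that sums to 1.0
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            # 3. Weighted sum of the items, each weighted by its score
            return weights @ items
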
  9. Attention has been very successful • It’s fast and simple to implement. • Compared to a recurrent neural network (e.g. an LSTM), it is much better able to refer to past values in the input sequence. • Many tasks can be solved by rearranging and combining list elements to form a new list.
  10. Despite attention’s versatility and success, it has a deficiency that plagued our work on graph question answering: attention does not tell us if an item is present in a list.
  11. This first happened when we attempted to answer questions like “Is there a station called London Bridge?” and “Is Trafalgar Square station adjacent to Waterloo station?”.
  12. • This happens because attention returns a weighted sum of the list. • If the query matches (i.e. scores highly against) one item in the list, the output will be almost exactly that item’s value. • If the query does not match any items, a sum of all the items in the list is returned. • Based on attention’s output alone, the rest of the network cannot easily differentiate between those two situations (see the toy demonstration below).
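
    A toy demonstration of the deficiency, reusing the attention sketch above with one-hot items (the factor of 10 just saturates the softmax):

        items = np.eye(5)[:4]          # four orthogonal one-hot items of length 5

        present = 10.0 * np.eye(5)[1]  # strongly matches item 1
        absent  = 10.0 * np.eye(5)[4]  # matches no item in the list

        print(attention(items, present))  # ~[0, 1, 0, 0, 0]: essentially item 1
        print(attention(items, absent))   # [0.25, 0.25, 0.25, 0.25, 0]: a blend of every item

    The second output is just another plausible-looking vector, so downstream layers have no reliable way to read “nothing matched” from it.
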
  13. The simple solution we propose is to output a scalar aggregate of the raw item-query scores (i.e. the scores before softmax). This signal will be low if no items are similar to the query, and high if many items are.
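
    A minimal sketch of the proposal, extending the attention function above. The slides leave the aggregation open (“a scalar aggregate”); a sum of the raw scores is one plausible choice, assumed here:

        def attention_with_focus(items, query):
            scores = items @ query                 # raw, pre-softmax scores
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            output = weights @ items
            # Focus signal: a scalar aggregate of the raw scores. Low when no
            # item resembles the query, high when one or more items do.
            focus = scores.sum()                   # sum is our assumed aggregate
            return output, focus

    On the toy data above this returns a focus of 10.0 for the present query and 0.0 for the absent one, giving the rest of the network an unambiguous existence cue.
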
  14. Having found that the focus signal was essential for MacGraph to succeed on some tasks, we tested the concept on a range of datasets and model architectures.
  15. We constructed a network that takes a list of items and a desired item (the “query”), and outputs whether that item was in the list.
  16. • The network takes the inputs, performs attention (optionally with our focus signal), transforms the outputs through a couple of residual layers, then outputs a binary distribution over whether the item was found (a sketch follows below). • The loss is calculated using softmax cross entropy and the network is trained using the Adam optimizer.
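
    A minimal Keras sketch of such a test network; the layer widths, residual-block shape and focus aggregation are our assumptions, not Octavian’s published code:

        import tensorflow as tf

        def build_model(n_items, d, use_focus=True, hidden=64):
            items = tf.keras.Input(shape=(n_items, d))
            query = tf.keras.Input(shape=(d,))

            # Attention: raw scores -> softmax weights -> weighted sum of items
            scores = tf.keras.layers.Lambda(
                lambda t: tf.einsum("bnd,bd->bn", t[0], t[1]))([items, query])
            weights = tf.keras.layers.Softmax()(scores)
            attended = tf.keras.layers.Lambda(
                lambda t: tf.einsum("bn,bnd->bd", t[0], t[1]))([weights, items])

            x = attended
            if use_focus:
                # Optional focus signal: scalar aggregate of the raw scores
                focus = tf.keras.layers.Lambda(
                    lambda s: tf.reduce_sum(s, axis=1, keepdims=True))(scores)
                x = tf.keras.layers.Concatenate()([x, focus])

            # A couple of residual layers
            h = tf.keras.layers.Dense(hidden, activation="relu")(x)
            for _ in range(2):
                h = tf.keras.layers.Add()(
                    [h, tf.keras.layers.Dense(hidden, activation="relu")(h)])

            logits = tf.keras.layers.Dense(2)(h)  # "in list" vs "not in list"
            model = tf.keras.Model([items, query], logits)
            model.compile(optimizer="adam",
                          loss=tf.keras.losses.SparseCategoricalCrossentropy(
                              from_logits=True))
            return model
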
  17. • Each dataset is a set of examples; each example contains the input features (a list of items and a desired item) and the ground-truth output (whether the item is in the list). • An item is an N-dimensional vector of floating-point numbers. • Each dataset was constructed so that 100% accuracy is possible. • Each dataset has balanced answer classes (i.e. an equal number of True and False answers).
  18. We tested on three different datasets, each with a different source of items (a generation sketch follows below): • Orthogonal one-hot vectors of length 12. • Many-hot vectors (i.e. random strings of 1.0s and 0.0s) of length 12. • Word2vec vectors of length 300.
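
    As one concrete case, here is how the one-hot examples might be constructed; the list length of 6 is our assumption (the slides only fix the vector length at 12):

        import numpy as np

        rng = np.random.default_rng(0)

        def onehot_example(n_items=6, d=12):
            # One balanced example: (items, query, label)
            chosen = rng.choice(d, size=n_items, replace=False)
            items = np.eye(d)[chosen]                  # orthogonal one-hot items
            if rng.random() < 0.5:                     # balanced True/False labels
                query, label = items[rng.integers(n_items)], 1  # query is in the list
            else:
                other = rng.choice(np.setdiff1d(np.arange(d), chosen))
                query, label = np.eye(d)[other], 0              # query is not in the list
            return items, query, label
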
  19. • For our word2vec dataset, we used pre-calculated word2vec embeddings trained on Google News. • The dataset has item lists of up to length 77, each representing a shuffled, randomly chosen sentence from a 300-line Wikipedia article. Each desired item is a randomly chosen word2vec vector from the words within the article.
  20. • We’ve shown that the “focus signal” is a robust mechanism for detecting the existence of items in a list, an important operation for machine reasoning and question answering. • We hope that it will help other teams tackle the challenge of item existence.