
Attention is not quite all you need


While working on MacGraph, a neural network that answers questions using knowledge graphs, we came across a problem: how to tell whether something is not present in a list.

We (Octavian.AI) have developed a novel and robust neural-network solution for detecting whether an item is present in a list, built as an extension of the popular attention mechanism.


Ashwath Salimath

January 05, 2019


Transcript

  1. Attention in MacGraph • To answer natural-language questions about a graph with natural-language answers, it’s necessary to transform the input question into a graph state that leads to the correct answer, and then to extract the answer information from that graph state and transform it into the desired answer.
  2. • Our solution for transforming between natural language and graph state is to use attention. • For more detail, see our other article, Graphs and neural networks: Reading node properties.
  3. • Attention cells are radically different from dense layers: they work with lists of information, extracting individual elements depending on their content or location. • These properties make attention cells great for selecting from the lists of nodes and edges that make up a graph.
  4. • Whilst working on MacGraph, a neural network that answers questions using knowledge graphs, we came across a problem: how to tell if something’s not present in a list. • I will talk about our solution, the focus signal, and show how it performs reliably across a range of different datasets and architectures.
  5. • In traditional programming, it’s easy to tell if something is not in a list, e.g. by running a for loop over the items (see the sketch below). • Neural networks are composed of differentiable functions so that they can be trained using gradient descent. • Equality operators, for loops and if conditions, the standard pieces of traditional programming used to solve this task, do not work well in neural networks.
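
    For contrast, here is the traditional check in Python; contains is an illustrative name, not anything from MacGraph:

        def contains(items, target):
            # Classic list membership: loop and compare with an equality operator.
            for item in items:
                if item == target:  # equality is a step function: no useful gradient
                    return True
            return False
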
  6. • An equality operator is essentially a step function, which has zero gradient almost everywhere and therefore breaks gradient-descent back-propagation (demonstrated below). • If conditions generally use a boolean signal to switch branches, which again is often the output of a problematic step function. • While loops can be inefficient on GPUs, and are sometimes not even usable, as neural network libraries often require all data to have the same dimensions (e.g. TensorFlow executes a statically defined tensor graph).
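
    A quick way to see the step-function problem, assuming TensorFlow 2: the equality op has no registered gradient, so nothing flows back to the inputs.

        import tensorflow as tf

        x = tf.Variable([1.0, 2.0, 3.0])
        target = tf.constant(2.0)

        with tf.GradientTape() as tape:
            # Equality is a step function: its output jumps between 0 and 1,
            # so its derivative is zero (or undefined) everywhere.
            match = tf.cast(tf.equal(x, target), tf.float32)
            loss = tf.reduce_sum(match)

        print(tape.gradient(loss, x))  # None: no gradient flows through equality
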
  7. • A popular neural-network technique for working with lists of items (e.g. translating sentences by treating them as lists of words) is to apply “attention”: a function in which a learned “query” describing what the network is looking for is compared to each item in the list, and a weighted sum of the items similar to the query is output.
  8. 1. The query is dot-producted with each item in the list to compute a “score”; this is done in parallel for all items. 2. The scores are passed through softmax, transforming them into a list that sums to 1.0 and can be used as a probability distribution. 3. Finally, a weighted sum of the items is calculated, weighting each item by its score. (The whole computation is sketched in code below.)
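
    Those three steps fit in a few lines of NumPy. This is a generic sketch of content-based attention, not MacGraph’s exact cell:

        import numpy as np

        def attention(items, query):
            # items: (n, d) array of item vectors; query: (d,) vector.
            # 1. Dot-product the query with every item in parallel -> raw scores, shape (n,)
            scores = items @ query
            # 2. Softmax turns the scores into a distribution that sums to 1.0
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            # 3. Weighted sum of the items, each weighted by its score
            return weights @ items
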
  9. Attention has been very successful • It’s fast and simple to implement. • Compared to a recurrent neural network (e.g. an LSTM), it is much better able to refer to past values in the input sequence. • Many tasks can be solved by rearranging and combining list elements to form a new list.
  10. Despite attention’s versatility and success, it has a deficiency that plagued our work on graph question answering: attention does not tell us if an item is present in a list.
  11. This first happened when we attempted to answer questions like “Is there a station called London Bridge?” and “Is Trafalgar Square station adjacent to Waterloo station?”.
  12. • This happens because attention returns a weighted sum of the list. • If the query matches (i.e. scores highly against) one item in the list, the output will be almost exactly that item’s value. • If the query does not match any items, a sum of all the items in the list is returned. • Based on attention’s output alone, the rest of the network cannot easily differentiate between those two situations (see the toy demonstration below).
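
    A toy demonstration of the deficiency, reusing the attention sketch above with one-hot items (the factor of 10 just saturates the softmax):

        items = np.eye(5)[:4]          # four orthogonal one-hot items of length 5

        present = 10.0 * np.eye(5)[1]  # strongly matches item 1
        absent  = 10.0 * np.eye(5)[4]  # matches no item in the list

        print(attention(items, present))  # ~[0, 1, 0, 0, 0]: essentially item 1
        print(attention(items, absent))   # [0.25, 0.25, 0.25, 0.25, 0]: a blend of every item

    The second output is just another plausible-looking vector, so downstream layers have no reliable way to read “nothing matched” from it.
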
  13. The simple solution we propose is to output a scalar aggregate of the raw item-query scores (i.e. the scores before softmax). This signal will be low if no items are similar to the query, and high if many items are.
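
    A minimal sketch of the proposal, extending the attention function above. The slides leave the aggregation open (“a scalar aggregate”); a sum of the raw scores is one plausible choice, assumed here:

        def attention_with_focus(items, query):
            scores = items @ query                 # raw, pre-softmax scores
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            output = weights @ items
            # Focus signal: a scalar aggregate of the raw scores. Low when no
            # item resembles the query, high when one or more items do.
            focus = scores.sum()                   # sum is our assumed aggregate
            return output, focus

    On the toy data above this returns a focus of 10.0 for the present query and 0.0 for the absent one, giving the rest of the network an unambiguous existence cue.
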
  14. Having found that the focus signal was essential for MacGraph to succeed on some tasks, we tested the concept on a range of datasets and model architectures.
  15. We constructed a network that takes a list of items and a desired item (the “query”), and outputs whether that item was in the list.
  16. • The network takes the inputs, performs attention (optionally with our focus signal), transforms the outputs through a couple of residual layers, then outputs a binary distribution over whether the item was found (a sketch follows below). • The loss is calculated using softmax cross entropy and the network is trained using the Adam optimizer.
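
    A minimal Keras sketch of such a test network; the layer widths, residual-block shape and focus aggregation are our assumptions, not Octavian’s published code:

        import tensorflow as tf

        def build_model(n_items, d, use_focus=True, hidden=64):
            items = tf.keras.Input(shape=(n_items, d))
            query = tf.keras.Input(shape=(d,))

            # Attention: raw scores -> softmax weights -> weighted sum of items
            scores = tf.keras.layers.Lambda(
                lambda t: tf.einsum("bnd,bd->bn", t[0], t[1]))([items, query])
            weights = tf.keras.layers.Softmax()(scores)
            attended = tf.keras.layers.Lambda(
                lambda t: tf.einsum("bn,bnd->bd", t[0], t[1]))([weights, items])

            x = attended
            if use_focus:
                # Optional focus signal: scalar aggregate of the raw scores
                focus = tf.keras.layers.Lambda(
                    lambda s: tf.reduce_sum(s, axis=1, keepdims=True))(scores)
                x = tf.keras.layers.Concatenate()([x, focus])

            # A couple of residual layers
            h = tf.keras.layers.Dense(hidden, activation="relu")(x)
            for _ in range(2):
                h = tf.keras.layers.Add()(
                    [h, tf.keras.layers.Dense(hidden, activation="relu")(h)])

            logits = tf.keras.layers.Dense(2)(h)  # "in list" vs "not in list"
            model = tf.keras.Model([items, query], logits)
            model.compile(optimizer="adam",
                          loss=tf.keras.losses.SparseCategoricalCrossentropy(
                              from_logits=True))
            return model
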
  17. • Each dataset is a set of examples; each example contains the input features (a list of items and a desired item) and the ground-truth output (whether the item is in the list). • An item is an N-dimensional vector of floating-point numbers. • Each dataset was constructed so that 100% accuracy is possible. • Each dataset has balanced answer classes (i.e. an equal number of True and False answers).
  18. We tested on three different datasets, each with a different source of items (a generation sketch follows below): • Orthogonal one-hot vectors of length 12. • Many-hot vectors (i.e. random strings of 1.0s and 0.0s) of length 12. • Word2vec vectors of length 300.
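
    As one concrete case, here is how the one-hot examples might be constructed; the list length of 6 is our assumption (the slides only fix the vector length at 12):

        import numpy as np

        rng = np.random.default_rng(0)

        def onehot_example(n_items=6, d=12):
            # One balanced example: (items, query, label)
            chosen = rng.choice(d, size=n_items, replace=False)
            items = np.eye(d)[chosen]                  # orthogonal one-hot items
            if rng.random() < 0.5:                     # balanced True/False labels
                query, label = items[rng.integers(n_items)], 1  # query is in the list
            else:
                other = rng.choice(np.setdiff1d(np.arange(d), chosen))
                query, label = np.eye(d)[other], 0              # query is not in the list
            return items, query, label
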
  19. • For our word2vec dataset, we used pre-calculated word2vec embeddings trained on Google News. • The dataset has item lists of up to length 77, each representing a shuffled, randomly chosen sentence from a 300-line Wikipedia article. Each desired item is a randomly chosen word2vec vector from the words within the article.
  20. • We’ve shown that the “focus signal” is a robust mechanism for detecting the existence of items in a list, an important operation for machine reasoning and question answering. • We hope that it will help other teams tackle the challenge of item existence.