Padmaja Bhagwat - Listen, Attend, and Walk: Interpreting natural language navigational instructions

Imagine you have an appointment in a large building you do not know. Your host sent instructions describing how to reach their office. Though the instructions were fairly clear, in a few places, such as at the end, you had to infer what to do. How does a _robot (agent)_ interpret an instruction in the environment to infer the correct course of action? Enabling harmonious _Human-Robot Interaction_ is of primary importance if robots are to work seamlessly alongside people.

Dealing with natural language instructions is hard for two main reasons: first, humans know how to interpret natural language through prior experience, but agents do not; second, natural language instructions are inherently ambiguous. This talk is about how deep learning models were used to solve the complex and ambiguous problem of converting a natural language instruction into its corresponding action sequence.

Following verbal route instructions requires knowledge of language, space, action and perception. In this talk I shall be presenting a neural sequence-to-sequence model for direction following, a task that is essential to realizing effective autonomous agents.

At a high level, a sequence-to-sequence model is an end-to-end model made up of two recurrent neural networks:

- **Encoder** - which takes the model’s input sequence as input and encodes it into a fixed-size context vector.
- **Decoder** - which uses the context vector from above as a seed from which to generate an output sequence.

For this reason, sequence-to-sequence models are often referred to as _encoder-decoder_ models. The alignment-based encoder-decoder model translates natural language instructions into the corresponding action sequences. This model does not assume any prior linguistic knowledge: syntactic, semantic or lexical. The model learns the meaning of every word, including object names, verbs and spatial relations, as well as the syntax and the compositional semantics of the language, on its own.
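To make the encoder-decoder pattern concrete, here is a minimal PyTorch sketch. It is illustrative only and not the talk's actual model: the class name, layer sizes and the greedy unrolling loop are assumptions; only the vocabulary size of 524 and the small discrete action set are taken from later slides.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch: encode an instruction into a fixed-size
    context vector, then unroll a decoder to emit one action score vector per step."""
    def __init__(self, vocab_size, action_size, hidden_size=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.LSTM(hidden_size, hidden_size)   # reads the instruction
        self.decoder = nn.LSTM(hidden_size, hidden_size)   # generates the output sequence
        self.out = nn.Linear(hidden_size, action_size)

    def forward(self, instruction_ids, max_actions=10):
        # Encode: the final hidden state acts as the fixed-size context vector.
        embedded = self.embedding(instruction_ids).unsqueeze(1)  # (seq_len, batch=1, hidden)
        _, context = self.encoder(embedded)
        # Decode: seed the decoder with the context and unroll a few steps.
        step_input = torch.zeros(1, 1, self.embedding.embedding_dim)
        hidden = context
        action_scores = []
        for _ in range(max_actions):
            output, hidden = self.decoder(step_input, hidden)
            action_scores.append(self.out(output[0]))
            step_input = output
        return torch.stack(action_scores)

# Hypothetical usage: 524 is the vocabulary size mentioned later in the slides;
# 4 actions (forward / left / right / stop) is an assumption.
model = Seq2Seq(vocab_size=524, action_size=4)
scores = model(torch.tensor([5, 17, 42]))   # a three-word instruction as word indices
```

The talk's actual model replaces the single fixed context vector with an attention-based (alignment) mechanism, described in the transcript below.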

In this talk, the steps involved in pre-processing the data, training the model, testing the model, and finally simulating it in the virtual environment will be discussed. This talk will also cover some of the challenges and trade-offs made while designing the model.

https://us.pycon.org/2018/schedule/presentation/132/

PyCon 2018

May 11, 2018

Transcript

  1. Ambiguous headlines such as "Boy paralyzed after tumor fights back to gain black belt" and "Girl Hit By Car In Hospital" show how easily natural language can be misread. For instance...
  2. Introduction. Instruction: "take a left onto the red brick and go a ways down until you come to the section with the butterflies on the wall."
    Map of the "l" virtual environment: http://www.cs.utexas.edu/users/ml/clamp/navigation/
    Path: [(23, 23, 90), (23, 23, 0), (23, 22, 0), (23, 21, 0), (23, 20, 0), (23, 19, 0)]
    Action Sequence: [1, 0, 0, 0, 0, 3]
    Input = Instruction + Initial position + Map (world state); Output = Action sequence (a sketch of this path-to-actions mapping follows below).
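    How the pose path above corresponds to the action sequence [1, 0, 0, 0, 0, 3] can be reproduced with a small sketch. The action encoding below is an assumption inferred from this one example (the transcript never states it), and `path_to_actions` is a hypothetical helper, not the talk's code:

    ```python
    # Assumed encoding (inferred from the example, not stated in the transcript):
    # 0 = move forward, 1 = turn right, 2 = turn left, 3 = stop.
    FORWARD, RIGHT, LEFT, STOP = 0, 1, 2, 3

    def path_to_actions(path):
        """Turn a list of (x, y, heading) poses into a discrete action sequence."""
        actions = []
        for (x0, y0, h0), (x1, y1, h1) in zip(path, path[1:]):
            if (x0, y0) != (x1, y1):
                actions.append(FORWARD)        # position changed, heading unchanged
            elif (h0 - h1) % 360 == 90:
                actions.append(RIGHT)          # heading rotated 90 degrees one way
            else:
                actions.append(LEFT)           # heading rotated the other way
        actions.append(STOP)                   # sequences end with an explicit stop
        return actions

    path = [(23, 23, 90), (23, 23, 0), (23, 22, 0), (23, 21, 0), (23, 20, 0), (23, 19, 0)]
    print(path_to_actions(path))               # [1, 0, 0, 0, 0, 3]
    ```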
  3. Convert the NL instruction to a vector:
    • One-hot encoding
    • Word2Vec
    Example: "Take the pink path to the red brick intersection" (an index/one-hot sketch follows below)
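    A minimal sketch of the indexing/one-hot step for the example sentence. The toy vocabulary here is built from this one sentence only; the actual model uses a corpus-wide vocabulary (the later slides mention input_size = 524):

    ```python
    import torch

    instruction = "take the pink path to the red brick intersection"
    # Toy vocabulary built from this sentence alone (illustrative only).
    vocab = {word: idx for idx, word in enumerate(sorted(set(instruction.split())))}

    indices = [vocab[word] for word in instruction.split()]
    print(indices)          # [5, 6, 3, 2, 7, 6, 4, 0, 1]

    # One-hot encoding: one vector of vocabulary size per word, with a single 1.
    one_hot = torch.nn.functional.one_hot(torch.tensor(indices), num_classes=len(vocab))
    print(one_hot.shape)    # torch.Size([9, 8])
    ```

    In practice the indices are fed to an nn.Embedding layer (as in the encoder code later in the transcript) rather than materialized as one-hot vectors.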
  4. Encoder: implemented using a bidirectional LSTM.
    Natural language instruction: x_{1:N} = (x_1, x_2, …, x_N)
    Hidden annotations: h_{1:N} = (h_1, h_2, …, h_N)
    E.g.: "Turn left after taking right"
    https://towardsdatascience.com/understanding-bidirectional-rnn-in-pytorch-5bd25a5dd66
  5. Encoder (continued): h_j is calculated as follows (a reconstruction of the update is given below):
    • h_j summarizes the words up to and including x_j
    • the forward and backward annotations are concatenated
    Notation: T_e - affine transformation; σ - logistic sigmoid; i_e - input gate of the LSTM; f_e - forget gate of the LSTM; o_e - output gate of the LSTM
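    The slide's equation image is not part of the transcript; the following is a hedged reconstruction of the standard LSTM update that the listed symbols refer to, with the bidirectional concatenation on the last line:

    ```latex
    \begin{aligned}
    \begin{pmatrix} i^e_j \\ f^e_j \\ o^e_j \\ g^e_j \end{pmatrix}
      &= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
         T_e \begin{pmatrix} x_j \\ h^e_{j-1} \end{pmatrix} \\
    c^e_j &= f^e_j \odot c^e_{j-1} + i^e_j \odot g^e_j \\
    h^e_j &= o^e_j \odot \tanh(c^e_j) \\
    h_j &= \big[\overrightarrow{h^e_j} \,;\, \overleftarrow{h^e_j}\big]
          \quad \text{(forward and backward annotations concatenated)}
    \end{aligned}
    ```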
  6. Encoder

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, input_size, hidden_size, bidirectionality=False):
            super(Encoder, self).__init__()
            # input_size = 524; hidden_size = 128
            self.hidden_size = hidden_size
            if bidirectionality:
                # each direction gets half of the hidden units
                self.hidden_size2 = hidden_size // 2
            else:
                self.hidden_size2 = hidden_size
            self.embedding = nn.Embedding(input_size, hidden_size)
            self.lstm = nn.LSTM(hidden_size, self.hidden_size2,
                                bidirectional=bidirectionality)
  7. Encoder (continued)

    # forward() method of the Encoder class from the previous slide
    def forward(self, input, hidden):
        lstm_input = self.embedding(input).view(1, 1, -1)
        output, hidden = self.lstm(lstm_input, hidden)
        return output, hidden
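    A hypothetical usage sketch for the encoder above (it assumes the classes and imports from these slides; the sizes follow the slide's comment, and the zero-initialised hidden state is an assumption):

    ```python
    encoder = Encoder(input_size=524, hidden_size=128, bidirectionality=True)

    instruction = torch.tensor([5, 6, 3, 2, 7, 6, 4, 0, 1])   # word indices (slide 3 example)
    # (num_layers * num_directions, batch, hidden_size2) for a 1-layer bidirectional LSTM
    hidden = (torch.zeros(2, 1, encoder.hidden_size2),
              torch.zeros(2, 1, encoder.hidden_size2))

    # Collect one annotation h_j per word; in the full pipeline this is padded to MAX_LENGTH.
    encoder_outputs = torch.zeros(len(instruction), encoder.hidden_size)
    for ei in range(len(instruction)):
        output, hidden = encoder(instruction[ei], hidden)
        encoder_outputs[ei] = output[0, 0]   # forward and backward halves concatenated
    ```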
  8. Multi-level Aligner: the context vector z_t is computed from the word vector (x_j), the hidden annotation (h_j) and the decoder's previous hidden state (s_{t-1}).
    E.g.: "Take the pink path to the red brick intersection"
  9. Multi-level Aligner (continued): the weight α_tj associated with each pair (x_j, h_j), where
    s_{t-1} - decoder hidden state at time t-1
    x_j - input instruction word, j ∈ {1, 2, 3, …, N}
    h_j - hidden annotation
    v, W, U, V - learned parameters
    (a reconstruction of the aligner equations follows below)
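    The slide's equations are not preserved in the transcript; the following is a hedged reconstruction consistent with the parameters listed above (one plausible assignment of W, U, V to s_{t-1}, h_j and x_j):

    ```latex
    \begin{aligned}
    \beta_{tj} &= v^{\top} \tanh\!\big(W s_{t-1} + U h_j + V x_j\big) \\
    \alpha_{tj} &= \frac{\exp(\beta_{tj})}{\sum_{k=1}^{N} \exp(\beta_{tk})} \\
    z_t &= \sum_{j=1}^{N} \alpha_{tj}\,\big[x_j \,;\, h_j\big]
    \end{aligned}
    ```

    Attending over the concatenation [x_j; h_j] rather than h_j alone is what makes the aligner "multi-level": the decoder can look back at both the raw word vectors and the encoder annotations.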
  10. Decoder: implemented using an LSTM. The context vector (z_t), the world state (y_t) and the decoder's previous hidden state (s_{t-1}) together produce the next action (a_t).
    http://www.stratio.com/blog/deep-learning-3-recurrent-neural-networks-lstm/
  11. Decoder (continued): conditional probability distribution over the next action (a reconstruction is given below), where
    E - embedding matrix
    L_0, L_s, L_z - parameters to be learned
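    Again the equation itself is missing from the transcript; a hedged reconstruction consistent with the parameters above (and with the log-softmax over q_t in the decoder code later) is:

    ```latex
    \begin{aligned}
    q_t &= L_0\big(E\,y_t + L_s\,s_t + L_z\,z_t\big) \\
    P(a_t \mid a_{1:t-1},\, y_{1:t},\, x_{1:N}) &= \operatorname{softmax}(q_t)
    \end{aligned}
    ```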
  12. Attention Decoder

    MAX_LENGTH = 46

    class AttentionDecoderRNN(nn.Module):
        def __init__(self, input_size, hidden_size, world_state_size,
                     output_size, max_length=MAX_LENGTH):
            """ Initializing layers """
            super(AttentionDecoderRNN, self).__init__()
            self.hidden_size = hidden_size    # used in forward()
            self.max_length = max_length      # used in forward()
            self.embedding = nn.Embedding(input_size, hidden_size)
            self.lstm = nn.LSTM(hidden_size, hidden_size)
            self.input_hidden_combine = nn.Linear(hidden_size * 2, hidden_size)
            self.transform_beta = nn.Linear(hidden_size, 1)
            self.decoder_input = nn.Linear(hidden_size * 3, hidden_size)
            self.linear = nn.Linear(hidden_size * 2, hidden_size)
            self.out = nn.Linear(hidden_size, output_size)
            self.dense = nn.Linear(world_state_size, hidden_size)
  13. Attention Decoder (continued)

        # requires: import torch.nn.functional as F; from torch.autograd import Variable
        def forward(self, input, world_state, hidden, encoder_outputs):
            """ embedding the input sentence """
            embed = self.embedding(input)
            embedded = Variable(torch.zeros(self.max_length, self.hidden_size))
            for idx, e in enumerate(embed):
                embedded[idx] = e

            """ calculating beta """
            scope_attr = self.input_hidden_combine(torch.cat((embedded, encoder_outputs), 1))
            beta_inprocess = scope_attr + hidden[0][0]
            beta = F.tanh(beta_inprocess)
            beta = self.transform_beta(beta)
  14. Attention Decoder (continued)

            """ calculating alpha """
            attn_weights = F.softmax(beta, dim=0)

            """ calculating context vector """
            # transpose so the batched matmul sees (1, 1, N) x (1, N, hidden)
            zt = torch.bmm(attn_weights.t().unsqueeze(0), scope_attr.unsqueeze(0))

            """ calculating decoder output """
            combined_input = torch.cat((world_state, hidden[0][0], zt[0]), 1)
            input_to_decoder = self.decoder_input(combined_input).unsqueeze(0)
            output, hidden = self.lstm(input_to_decoder, hidden)
            output_ctx_combine = self.linear(torch.cat((output[0], zt[0]), 1))
            qt = self.out(world_state + output_ctx_combine)

            """ calculating probability distribution """
            output = F.log_softmax(qt, dim=1)
            return output, hidden, attn_weights
  15. Train the model

    STOP = 3

    def train(idx_data, map_name, input_variable, target_variable,
              action_seq, encoder, decoder):
        # (slide excerpt: encoder_hidden, encoder_outputs, decoder_hidden, loss,
        #  criterion, pos_curr, input_length and action_length are initialised elsewhere)
        world_state = target_variable[0]   # initialize the world state
        decoder_input = input_variable     # initialize decoder input

        # encode the whole instruction first
        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)

        # then decode one action at a time
        for di in range(action_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, world_state, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.data.topk(1)
            ni = topi[0][0]
            pos_curr = run_model.take_one_step(pos_curr, ni)
            world_state = run_model.get_feat_current_position(pos_curr, map_name)
            loss += criterion(decoder_output, action_seq[di])
            if ni == STOP:
                break
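    The transcript does not show how train() is wired up; the following setup is a hypothetical sketch (the loss choice pairs with the decoder's log_softmax output, but the optimiser, learning rate, world-state size and action count are all assumptions):

    ```python
    import torch.nn as nn
    import torch.optim as optim

    WORLD_STATE_SIZE = 78   # placeholder: depends on how the map features are encoded
    NUM_ACTIONS = 4         # assumed action set: forward / left / right / stop

    encoder = Encoder(input_size=524, hidden_size=128, bidirectionality=True)
    decoder = AttentionDecoderRNN(input_size=524, hidden_size=128,
                                  world_state_size=WORLD_STATE_SIZE,
                                  output_size=NUM_ACTIONS)

    criterion = nn.NLLLoss()                 # matches F.log_softmax in the decoder
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=1e-3)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=1e-3)

    # One update: run train(...) on an (instruction, path) pair, then
    # loss.backward(); encoder_optimizer.step(); decoder_optimizer.step()
    ```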
  16. Conclusion and future work
    • Our system is able to generate action sequences corresponding to novel navigational instructions.
    • The proposed approach is limited to pre-processed textual input.
    • Future work: integrating computer vision with NLU to build a real-time application.
  17. Python libraries used
    • NumPy and SciPy for mathematical computations on tensors
    • PyTorch for building dynamic graphs
    • PyGame for visualizing the virtual environment
    • Matplotlib for visualizing the attention weights
  18. Thank you
    Team: Manisha Jhawar, Nitya C K, Padmaja V Bhagwat
    Guide: Prof. Ananthanarayana VS
    https://github.com/PadmajaVB