Slide 1

Listen, Attend and Walk: Interpreting Natural Language Navigational Instructions
Padmaja V Bhagwat
https://padmajavb.github.io/

Slide 2

#NITK #Bachelors #MachineLearning #ArtificialNeuralNetworks @padmaja_bhagwat

Slide 3

Natural Language Processing: why is it hard?

Slide 4

For instance, consider these ambiguous headlines:
• "Boy paralyzed after tumor fights back to gain black belt"
• "Girl Hit By Car In Hospital"

Slide 5

Introduction
Instruction: "take a left onto the red brick and go a ways down until you come to the section with the butterflies on the wall."
Map of the "L" virtual environment: http://www.cs.utexas.edu/users/ml/clamp/navigation/
Path: [(23, 23, 90), (23, 23, 0), (23, 22, 0), (23, 21, 0), (23, 20, 0), (23, 19, 0)]
Action sequence: [1, 0, 0, 0, 0, 3]
Input = Instruction + Initial position + Map (world state)
Output = Action sequence

Slide 6

Steps involved:
1. Get data
2. Transform
3. Get vectors
4. Build the model
5. Train
6. Test and simulate

Slide 7

Get data
SAIL route instruction dataset: http://www.cs.utexas.edu/users/ml/clamp/navigation/

Slide 8

Transform
• Remove sentences with invalid action sequences
• Remove stop-words (a minimal sketch follows below)
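A minimal sketch of this transform step, assuming (sentence, action-sequence) pairs as input; the stop-word list and the validity check here are illustrative placeholders, not necessarily the ones used in the project:

STOP_WORDS = {"the", "a", "an", "of", "and"}   # illustrative list only

def transform(pairs):
    """Drop pairs with an invalid action sequence, then strip stop-words."""
    cleaned = []
    for sentence, actions in pairs:
        if not actions:                # placeholder validity check
            continue
        words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
        cleaned.append((words, actions))
    return cleaned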

Slide 9

Get vectors
Convert the NL instruction to vectors:
• One-hot encoding
• Word2Vec
Eg: "Take the pink path to the red brick intersection" (one-hot sketch below)
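As a concrete illustration of one-hot encoding the example sentence (the toy vocabulary and its indices are made up for the sketch):

import torch

sentence = "Take the pink path to the red brick intersection".lower().split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}   # toy vocabulary
one_hot = torch.zeros(len(sentence), len(vocab))
for pos, word in enumerate(sentence):
    one_hot[pos, vocab[word]] = 1.0   # a single 1 at the word's vocab index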

Slide 10

What model to use?

Slide 11

Encoder-Decoder Model
https://www.researchgate.net/figure/A-high-level-view-of-the-encoder-decoder-architecture-The-direction-of-arrows-show-the_fig21_324706603

Slide 12

Overall Architecture
https://arxiv.org/pdf/1506.04089.pdf

Slide 13

Encoder
Implemented using a bidirectional LSTM.
Natural language instruction: x_{1:N} = (x_1, x_2, ..., x_N)
Hidden annotations: h_{1:N} = (h_1, h_2, ..., h_N)
Eg: "Turn left after taking right"
https://towardsdatascience.com/understanding-bidirectional-rnn-in-pytorch-5bd25a5dd66

Slide 14

Encoder
• h_j summarizes the words up to and including x_j
• The forward and backward annotations are concatenated
h_j is calculated as follows (the standard LSTM updates, reconstructed from the symbol definitions below):
  (i_j^e, f_j^e, o_j^e, g_j^e) = (σ, σ, σ, tanh) ∘ T^e(x_j, h_{j-1})
  c_j^e = f_j^e ⊙ c_{j-1}^e + i_j^e ⊙ g_j^e
  h_j = o_j^e ⊙ tanh(c_j^e)
where
  T^e - affine transformation
  σ - logistic sigmoid
  i^e - input gate of the LSTM
  f^e - forget gate of the LSTM
  o^e - output gate of the LSTM
  g^e - candidate (input modulation) term

Slide 15

Encoder

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, bidirectionality=False):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        # With a bidirectional LSTM, give each direction half the units so the
        # concatenated annotation still has hidden_size dimensions.
        lstm_units = hidden_size // 2 if bidirectionality else hidden_size
        # input_size = 524; hidden_size = 128
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, lstm_units,
                            bidirectional=bidirectionality)

Slide 16

Encoder

    def forward(self, input, hidden):
        # embed one word index, reshaped to (seq_len=1, batch=1, features)
        lstm_input = self.embedding(input).view(1, 1, -1)
        output, hidden = self.lstm(lstm_input, hidden)
        return output, hidden
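A hedged usage sketch of the two snippets above: encoding a toy instruction word by word and collecting the annotations the aligner will attend over. The zero initial hidden state and the toy token ids are assumptions; the shapes follow from hidden_size = 128 with 64 units per direction:

encoder = Encoder(input_size=524, hidden_size=128, bidirectionality=True)
# (h0, c0) for 2 directions x 1 layer, batch of 1, 64 units per direction
hidden = (torch.zeros(2, 1, 64), torch.zeros(2, 1, 64))
token_ids = torch.tensor([12, 7, 45])            # toy word indices
encoder_outputs = torch.zeros(len(token_ids), 128)
for ei, tok in enumerate(token_ids):
    output, hidden = encoder(tok, hidden)
    encoder_outputs[ei] = output[0, 0]           # annotation h_j for word j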

Slide 17

Multi-level Aligner
Word vector (x_j) + Hidden annotation (h_j) + Decoder's prev. hidden state (s_{t-1}) → Context vector (z_t)
The context vector z_t is computed as follows:
  z_t = Σ_j α_{t,j} · [x_j, h_j]
Feeding the word vector x_j into the context directly, alongside its annotation h_j, is what makes the aligner "multi-level".
Eg: "Take the pink path to the red brick intersection"

Slide 18

Multi-level Aligner
The weight α_{t,j} associated with each pair (x_j, h_j) is computed as (reconstructed from the symbols below):
  β_{t,j} = v · tanh(W s_{t-1} + U h_j + V x_j)
  α_{t,j} = exp(β_{t,j}) / Σ_k exp(β_{t,k})
where
  s_{t-1} - decoder hidden state at time t-1
  x_j - input instruction word, j ∈ {1, 2, ..., N}
  h_j - hidden annotation
  v, W, U, V - learned parameters
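A minimal PyTorch sketch of that computation; the dimensions, the random inputs, and the use of nn.Linear for v, W, U, V are assumptions for illustration:

import torch
import torch.nn as nn

hidden_size, N = 128, 9                        # assumed sizes (N words)
v = nn.Linear(hidden_size, 1, bias=False)      # v
W = nn.Linear(hidden_size, hidden_size)        # applied to s_{t-1}
U = nn.Linear(hidden_size, hidden_size)        # applied to h_j
V = nn.Linear(hidden_size, hidden_size)        # applied to x_j

s_prev = torch.zeros(1, hidden_size)           # decoder state s_{t-1}
x = torch.randn(N, hidden_size)                # word vectors x_j
h = torch.randn(N, hidden_size)                # annotations h_j

beta = v(torch.tanh(W(s_prev) + U(h) + V(x)))  # (N, 1) alignment scores
alpha = torch.softmax(beta, dim=0)             # weights sum to 1 over words
z_t = (alpha * torch.cat((x, h), dim=1)).sum(0)  # context vector z_t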

Slide 19

Decoder
Implemented using an LSTM.
Context vector (z_t) + World state (y_t) + Decoder's prev. hidden state (s_{t-1}) → Action (a_t)
http://www.stratio.com/blog/deep-learning-3-recurrent-neural-networks-lstm/

Slide 20

Decoder
Conditional probability distribution over the next action (reconstructed following the paper linked on the Overall Architecture slide):
  P(a_t | a_{1:t-1}, y_{1:t}, x_{1:N}) = softmax(q_t), with q_t = L_0 (E y_t + L_s s_t + L_z z_t)
where
  E - embedding matrix
  L_0, L_s, L_z - parameters to be learned
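A toy sketch mapping that output layer onto tensor ops; all sizes, and the use of nn.Linear for E, L_0, L_s, L_z, are assumptions for the example:

import torch
import torch.nn as nn

hidden, n_actions, ws_dim = 128, 4, 78     # assumed sizes
E = nn.Linear(ws_dim, hidden, bias=False)  # embeds the world state y_t
L_s = nn.Linear(hidden, hidden)            # acts on decoder state s_t
L_z = nn.Linear(2 * hidden, hidden)        # acts on context z_t
L_0 = nn.Linear(hidden, n_actions)         # maps to action scores q_t

y_t = torch.randn(1, ws_dim)
s_t = torch.randn(1, hidden)
z_t = torch.randn(1, 2 * hidden)

q_t = L_0(E(y_t) + L_s(s_t) + L_z(z_t))
p_t = torch.softmax(q_t, dim=1)            # distribution over next action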

Slide 21

Attention Decoder

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

MAX_LENGTH = 46

class AttentionDecoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, world_state_size,
                 output_size, max_length=MAX_LENGTH):
        # initializing layers
        super(AttentionDecoderRNN, self).__init__()
        self.hidden_size = hidden_size   # used later in forward()
        self.max_length = max_length     # used later in forward()
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)
        self.input_hidden_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.transform_beta = nn.Linear(hidden_size, 1)
        self.decoder_input = nn.Linear(hidden_size * 3, hidden_size)
        self.linear = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.dense = nn.Linear(world_state_size, hidden_size)

Slide 22

Attention Decoder

    def forward(self, input, world_state, hidden, encoder_outputs):
        # embedding the input sentence, zero-padded to max_length
        embed = self.embedding(input)
        embedded = Variable(torch.zeros(self.max_length, self.hidden_size))
        for idx, e in enumerate(embed):
            embedded[idx] = e

        # calculating beta (alignment scores)
        scope_attr = self.input_hidden_combine(
            torch.cat((embedded, encoder_outputs), 1))
        beta_inprocess = scope_attr + hidden[0][0]   # add decoder state s_{t-1}
        beta = torch.tanh(beta_inprocess)
        beta = self.transform_beta(beta)             # shape (max_length, 1)

Slide 23

Attention Decoder

        # calculating alpha (attention weights over the instruction words)
        attn_weights = F.softmax(beta, dim=0)

        # calculating context vector z_t as the weighted sum of annotations
        zt = torch.bmm(attn_weights.transpose(0, 1).unsqueeze(0),
                       scope_attr.unsqueeze(0))

        # calculating decoder output
        ws = self.dense(world_state)   # project world state y_t to hidden size
        combined_input = torch.cat((ws, hidden[0][0], zt[0]), 1)
        input_to_decoder = self.decoder_input(combined_input).unsqueeze(0)
        output, hidden = self.lstm(input_to_decoder, hidden)
        output_ctx_combine = self.linear(torch.cat((output[0], zt[0]), 1))
        qt = self.out(ws + output_ctx_combine)

        # calculating probability distribution over actions
        output = F.log_softmax(qt, dim=1)
        return output, hidden, attn_weights

Slide 24

Train the model
Maps used for training: Grid and Jelly

Slide 25

Train the model

STOP = 3

def train(idx_data, map_name, input_variable, target_variable,
          action_seq, encoder, decoder):
    # ... (setup of hidden states, encoder_outputs, loss, etc. omitted on the slide)

    # run the encoder over the whole instruction first
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_variable[ei],
                                                 encoder_hidden)

    world_state = target_variable[0]   # initialize the world state
    decoder_input = input_variable     # initialize decoder input

    for di in range(action_length):
        decoder_output, decoder_hidden, decoder_attention = AttentionDecoder(
            decoder_input, world_state, decoder_hidden, encoder_outputs)
        # greedy choice of the next action
        topv, topi = decoder_output.data.topk(1)
        ni = topi[0][0]
        # take the action in the virtual world and observe the new state
        pos_curr = run_model.take_one_step(pos_curr, ni)
        world_state = run_model.get_feat_current_position(pos_curr, map_name)
        loss += criterion(decoder_output, action_seq[di])
        if ni == STOP:                 # action 3 terminates the sequence
            break

Slide 26

Training vs. validation error (plot)
Loss function: negative log likelihood
Optimizer: stochastic gradient descent
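A minimal setup sketch for this pairing; since the decoder emits log-probabilities via log_softmax, NLLLoss is the matching criterion (the learning rate is an assumption, and encoder/decoder are the modules from the earlier slides):

import torch.nn as nn
import torch.optim as optim

criterion = nn.NLLLoss()   # expects log-probabilities, as from log_softmax
encoder_optimizer = optim.SGD(encoder.parameters(), lr=0.01)   # lr assumed
decoder_optimizer = optim.SGD(decoder.parameters(), lr=0.01)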

Slide 27

Test the model
Held-out map: L

Slide 28

Conclusion and future work
• Our system successfully generates action sequences for novel navigational instructions
• The proposed approach is limited to pre-processed textual input
• Future work: integrating computer vision with NLU to build real-time applications

Slide 29

Python libraries used
• NumPy and SciPy for mathematical computations on tensors
• PyTorch for building dynamic computation graphs
• PyGame for visualizing the virtual environment
• Matplotlib for visualizing the attention weights

Slide 30

GitHub
https://github.com/PadmajaVB/listen-attend-and-walk

Slide 31

Thank you
Team: Manisha Jhawar, Nitya C K, Padmaja V Bhagwat
Guide: Prof. Ananthanarayana VS
https://github.com/PadmajaVB