Slide 17
Slide 17 text
• Achieved by changing the memory-state update process of the LSTM
– ON-LSTM memory update (a minimal code sketch of these steps follows the list)
① Derive the regions to be erased / written: the master forget gate $\tilde{f}_t$ and the master input gate $\tilde{i}_t$
② Derive $\omega_t$, the overlap between $\tilde{f}_t$ and $\tilde{i}_t$
③ Derive $\hat{f}_t$ (the ON-LSTM forget gate) from $\tilde{f}_t$, $\omega_t$, and $f_t$ (the standard LSTM forget gate)
④ Derive $\hat{i}_t$ (the ON-LSTM input gate) from $\tilde{i}_t$, $\omega_t$, and $i_t$ (the standard LSTM input gate)
⑤ Update the memory by weighting the past information $c_{t-1}$ and the new information $\hat{c}_t$ with $\hat{f}_t$ and $\hat{i}_t$
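To make the five steps concrete, here is a minimal NumPy sketch of the ON-LSTM memory update. The parameter layout (weight dictionaries keyed by gate name), the dimensions, and the helper names are illustrative assumptions, not the authors' implementation; the exact equations are given in the paper excerpt below.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cumax(x):
    # cumulative sum of a softmax: values rise monotonically from ~0 to 1
    e = np.exp(x - x.max())
    return np.cumsum(e / e.sum())

def on_lstm_memory_update(x_t, h_prev, c_prev, W, U, b):
    """One cell-state update following steps 1-5 (illustrative parameter names)."""
    # step 1: master forget gate (region to erase) and master input gate (region to write)
    f_master = cumax(W["fm"] @ x_t + U["fm"] @ h_prev + b["fm"])
    i_master = 1.0 - cumax(W["im"] @ x_t + U["im"] @ h_prev + b["im"])
    # standard LSTM forget/input gates and candidate cell state
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    # step 2: overlap of the two master gates
    omega = f_master * i_master
    # step 3: ON-LSTM forget gate
    f_hat = f_t * omega + (f_master - omega)
    # step 4: ON-LSTM input gate
    i_hat = i_t * omega + (i_master - omega)
    # step 5: weight past memory c_{t-1} and new information c^_t
    return f_hat * c_prev + i_hat * c_hat

# usage sketch: input size 3, hidden size 5, random toy parameters
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(5, 3)) for k in ["fm", "im", "f", "i", "c"]}
U = {k: rng.normal(size=(5, 5)) for k in ["fm", "im", "f", "i", "c"]}
b = {k: np.zeros(5) for k in ["fm", "im", "f", "i", "c"]}
c_t = on_lstm_memory_update(rng.normal(size=3), np.zeros(5), np.zeros(5), W, U, b)
```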
Ideally, g should take the form of a discrete variable. Unfortunately, computing gradients when a discrete variable is included in the computation graph is not trivial (Schulman et al., 2015), so in practice we use a continuous relaxation by computing the quantity $p(d \le k)$, obtained by taking a cumulative sum of the softmax. As $g_k$ is binary, this is equivalent to computing $\mathbb{E}[g_k]$. Hence, $\hat{g} = \mathbb{E}[g]$.

3.2 STRUCTURED GATING MECHANISM

Based on the cumax() function, we introduce a master forget gate $\tilde{f}_t$ and a master input gate $\tilde{i}_t$:

$$\tilde{f}_t = \mathrm{cumax}(W_{\tilde{f}} x_t + U_{\tilde{f}} h_{t-1} + b_{\tilde{f}}) \qquad (9)$$
$$\tilde{i}_t = 1 - \mathrm{cumax}(W_{\tilde{i}} x_t + U_{\tilde{i}} h_{t-1} + b_{\tilde{i}}) \qquad (10)$$

Following the properties of the cumax() activation, the values in the master forget gate are monotonically increasing from 0 to 1, and those in the master input gate are monotonically decreasing from 1 to 0. These gates serve as high-level control for the update operations of cell states. Using the master gates, we define a new update rule:

$$\omega_t = \tilde{f}_t \circ \tilde{i}_t \qquad (11)$$
$$\hat{f}_t = f_t \circ \omega_t + (\tilde{f}_t - \omega_t) = \tilde{f}_t \circ (f_t \circ \tilde{i}_t + 1 - \tilde{i}_t) \qquad (12)$$
$$\hat{i}_t = i_t \circ \omega_t + (\tilde{i}_t - \omega_t) = \tilde{i}_t \circ (i_t \circ \tilde{f}_t + 1 - \tilde{f}_t) \qquad (13)$$
$$c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t \qquad (14)$$

In order to explain the intuition behind the new update rule, we assume that the master gates are binary:

• The master forget gate $\tilde{f}_t$ controls the erasing behavior of the model. Suppose $\tilde{f}_t = (0, \ldots, 0, 1, \ldots, 1)$ and the split point is $d^f_t$. Given Eq. (12) and (14), the information stored in the first $d^f_t$ neurons of the previous cell state $c_{t-1}$ will be completely erased. In a parse tree (e.g. Figure 2(a)), this operation is akin to closing previous constituents. A large number of zeroed neurons, i.e. a large $d^f_t$, represents the end of a high-level constituent in the parse tree, as most of the information in the state will be discarded. Conversely, a small $d^f_t$ represents the end of a lower-level constituent, as most of the high-level information is kept.
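A quick numeric check of this binary intuition (the gate values below are arbitrary toy numbers, chosen only to exercise Eqs. (11)–(13)):

```python
import numpy as np

# toy binary master gates: erase the first 2 neurons, write into the first 3
f_master = np.array([0., 0., 1., 1., 1.])   # split point d_f = 2
i_master = np.array([1., 1., 1., 0., 0.])   # split point d_i = 3
f_t = np.array([0.9, 0.8, 0.7, 0.6, 0.5])   # arbitrary standard forget gate
i_t = np.array([0.1, 0.2, 0.3, 0.4, 0.5])   # arbitrary standard input gate

omega = f_master * i_master                 # Eq. (11): overlap = neuron 3 only
f_hat = f_t * omega + (f_master - omega)    # Eq. (12)
i_hat = i_t * omega + (i_master - omega)    # Eq. (13)

print(f_hat)  # [0.  0.  0.7 1.  1. ]: first d_f neurons of c_{t-1} are fully erased,
              # the overlap falls back to f_t, and the rest is kept unchanged
print(i_hat)  # [1.  1.  0.3 0.  0. ]: the non-overlapping write region takes c^_t in full,
              # the overlap falls back to i_t, and the rest receives no new input
```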
Figure 2: Correspondences between a constituency parse tree and the hidden states of the proposed ON-LSTM. A sequence of tokens S = (x1, x2, x3) and its corresponding constituency tree are illustrated in (a). We provide a block view of the tree structure in (b), where both S and VP nodes span more than one time step. The representation for high-ranking nodes should be relatively consistent across multiple time steps. (c) Visualization of the update frequency of groups of hidden state neurons: at each time step, given the input word, dark grey blocks are completely updated while light grey blocks are partially updated. The three groups of neurons have different update frequencies. Higher groups update less frequently while lower groups are more frequently updated.
High-ranking neurons store long-term or global information that will last anywhere from several time steps to the entire sentence, representing nodes near the root of the tree. Low-ranking neurons encode short-term or local information that only lasts one or a few time steps, representing smaller constituents, as shown in Figure 2(b). The differentiation between high-ranking and low-ranking neurons is learnt in a completely data-driven fashion by controlling the update frequency of single neurons: to erase (or update) high-ranking neurons, the model should first erase (or update) all lower-ranking neurons. In other words, some neurons always update more (or less) frequently than the others, and that order is pre-determined as part of the model architecture.
3 ON-LSTM

In this section, we present a new RNN unit, ON-LSTM ("ordered neurons LSTM"). The new model uses an architecture similar to the standard LSTM, reported below:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (1)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (2)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (3)$$
$$\hat{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (4)$$
$$h_t = o_t \circ \tanh(c_t) \qquad (5)$$

The difference with the LSTM is that we replace the update function for the cell state $c_t$ with a new function that will be explained in the following sections. The forget gates $f_t$ and input gates $i_t$ are used to control the erasing and writing operations on the cell state, as before.
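The unchanged LSTM part of the cell could be sketched as follows (again a hedged illustration with assumed parameter names; only the cell-state update $c_t$ is replaced by the structured rule of Eqs. (11)–(14)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_gates(x_t, h_prev, W, U, b):
    """Eqs. (1)-(4): standard forget, input, output gates and candidate cell state."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    return f_t, i_t, o_t, c_hat

# Eq. (5): the hidden state is computed exactly as in the standard LSTM,
# once c_t has been obtained from the structured update of Eqs. (11)-(14):
#   h_t = o_t * np.tanh(c_t)
```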
We can compute the probability of the k-th value in g being 1 by evaluating the probability of the disjunction of any of the values before the k-th being the split point, that is $d \le k = (d = 0) \vee (d = 1) \vee \cdots \vee (d = k)$. Since the categories are mutually exclusive, we can do this by computing the cumulative distribution function:

$$p(g_k = 1) = p(d \le k) = \sum_{i \le k} p(d = i) \qquad (8)$$
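In code, this relaxation is just a cumulative sum over a softmax; the sketch below (toy logits, assumed helper names) checks that the cumax values coincide with the CDF of Eq. (8):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cumax(x):
    # continuous relaxation: entry k equals p(d <= k) = E[g_k]
    return np.cumsum(softmax(x))

logits = np.array([1.0, 3.0, 0.5, -1.0])   # toy pre-activations
p_d = softmax(logits)                      # p(d = i)
g_hat = cumax(logits)                      # p(g_k = 1) = p(d <= k), Eq. (8)

# the cumulative sum matches the explicit CDF and rises monotonically to 1
assert np.allclose(g_hat, [p_d[: k + 1].sum() for k in range(len(p_d))])
print(np.round(g_hat, 3))                  # roughly [0.11  0.919 0.985 1.   ]
```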