attention weights from the previous time-step into account (Chorowski et al., 2015; Luong et al., 2015). Shen et al. (2018a) reduce complexity by performing attention within blocks of the input sequence and Shen et al. (2017; 2018b) perform more fine-grained attention over each feature.
Our experiments show that lightweight convolutions perform competitively to strong self-attention
results and that dynamic convolutions can perform even better. On WMT English-German transla-
tion dynamic convolutions achieve a new state of the art of 29.7 BLEU, on WMT English-French
they match the best reported result in the literature, and on IWSLT German-English dynamic convo-
lutions outperform self-attention by 0.8 BLEU. Dynamic convolutions achieve 20% faster runtime
than a highly-optimized self-attention baseline. For language modeling on the Billion word bench-
mark dynamic convolutions perform as well as or better than self-attention and on CNN-DailyMail
abstractive document summarization we outperform a strong self-attention model.
2 BACKGROUND
We first outline sequence to sequence learning and self-attention. Our work builds on non-separable
convolutions as well as depthwise separable convolutions.
Sequence to sequence learning maps a source sequence to a target sequence via two separate
networks such as in machine translation (Sutskever et al., 2014). The encoder network computes
representations for the source sequence such as an English sentence and the decoder network au-
toregressively generates a target sequence based on the encoder output.
The self-attention module of Vaswani et al. (2017) applies three projections to the input $X \in \mathbb{R}^{n \times d}$ to obtain key (K), query (Q), and value (V) representations, where $n$ is the number of time steps and $d$ the input/output dimension (Figure 2a). It also defines a number of heads $H$, where each head can learn separate attention weights over $d_k$ features and attend to different positions. The module computes dot-products between key/query pairs, scales them to stabilize training, and then softmax-normalizes the result. Finally, it computes a weighted sum using the output of the value projection (V):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
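As a minimal sketch of this computation (a single head in plain NumPy, not the authors' implementation):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) key/query dot-products
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key positions
    return weights @ V                            # weighted sum of the values

# Example: n = 5 time steps, d_k = 8 features for this head.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out = attention(Q, K, V)
assert out.shape == (5, 8)
```

Note the quadratic (n, n) score matrix: every time step attends to every other, which is the cost that the convolutions below avoid.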
Depthwise convolutions perform a convolution independently over every channel. The number of parameters can be reduced from $d^2 k$ to $dk$, where $k$ is the kernel width. The output $O \in \mathbb{R}^{n \times d}$ of a depthwise convolution with weight $W \in \mathbb{R}^{d \times k}$ for element $i$ and output dimension $c$ is defined as:

$$O_{i,c} = \mathrm{DepthwiseConv}(X, W_{c,:}, i, c) = \sum_{j=1}^{k} W_{c,j} \cdot X_{\left(i + j - \left\lceil \frac{k+1}{2} \right\rceil\right),c}$$
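A minimal NumPy sketch of this definition; zero padding outside the sequence borders is an assumption of the sketch, not stated in the excerpt:

```python
import numpy as np

def depthwise_conv(X, W):
    """Depthwise convolution: channel c of X is convolved with its own
    length-k kernel W[c]. X: (n, d), W: (d, k) -> O: (n, d)."""
    n, d = X.shape
    k = W.shape[1]
    center = (k + 2) // 2                # integer form of ceil((k + 1) / 2)
    O = np.zeros((n, d))
    for i in range(n):
        for j in range(1, k + 1):        # j runs 1..k as in the sum above
            t = i + j - center           # input position for this kernel tap
            if 0 <= t < n:               # zero-pad outside the sequence (assumption)
                O[i] += W[:, j - 1] * X[t]
    return O
```

For k = 1 the kernel is centered on the current element, so the output is just a per-channel scaling of the input.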
3 LIGHTWEIGHT CONVOLUTIONS
In this section, we introduce LightConv, a depthwise convolution which shares certain output chan-
nels and whose weights are normalized across the temporal dimension using a softmax. Compared to
Under review as a conference paper at ICLR 2019

[Figure 2 diagrams: (a) Self-attention, (b) Lightweight convolution, (c) Dynamic convolution.]
Figure 2: Illustration of self-attention, lightweight convolutions and dynamic convolutions.
self-attention, LightConv has a fixed context window and it determines the importance of context el-
ements with a set of weights that do not change over time steps. We will show that models equipped
with lightweight convolutions show better generalization compared to regular convolutions and that
they can be competitive to state-of-the-art self-attention models (§6). This is surprising because the
common belief is that content-based self-attention mechanisms are crucial to obtaining state-of-the-
art results in natural language processing applications. Furthermore, the low computational profile
of LightConv enables us to formulate efficient dynamic convolutions (§4).
LightConv computes the following for the i-th element in the sequence and output channel c:
$$\mathrm{LightConv}(X, W_{\left\lceil \frac{cH}{d} \right\rceil,:}, i, c) = \mathrm{DepthwiseConv}\!\left(X, \mathrm{softmax}\!\left(W_{\left\lceil \frac{cH}{d} \right\rceil,:}\right), i, c\right)$$
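A minimal NumPy sketch of this operation, combining the $\lceil cH/d \rceil$ head mapping with the softmax normalization over the temporal dimension (zero padding at the borders is an assumption of the sketch):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def light_conv(X, W, H):
    """LightConv sketch. X: (n, d); W: (H, k) holds only H distinct kernels
    (weight sharing). Channel c (1-indexed) uses kernel ceil(c * H / d),
    softmax-normalized across its k temporal positions."""
    n, d = X.shape
    k = W.shape[1]
    Wn = softmax(W, axis=1)          # normalize each kernel over its k taps
    center = (k + 2) // 2            # ceil((k + 1) / 2)
    O = np.zeros((n, d))
    for c in range(d):               # c is 0-indexed here
        h = ((c + 1) * H - 1) // d   # 0-indexed ceil((c + 1) * H / d) - 1
        for i in range(n):
            for j in range(1, k + 1):
                t = i + j - center
                if 0 <= t < n:       # zero padding (assumption)
                    O[i, c] += Wn[h, j - 1] * X[t, c]
    return O
```

With k = 1 the softmax-normalized kernel is identically 1, so the operation reduces to the identity, which is a quick sanity check.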
Weight sharing. We tie the parameters of every $\frac{d}{H}$ subsequent channels, which reduces the number of parameters by a factor of $\frac{d}{H}$. As an illustration, a regular convolution requires 7,340,032 ($d^2 \times k$) weights for $d = 1024$ and $k = 7$, a depthwise separable convolution has 7,168 weights ($d \times k$), and with weight sharing, $H = 16$, we have only 112 ($H \times k$) weights. We will see that this vast reduction in the number of parameters is crucial to make dynamic convolutions possible on current hardware.
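The parameter counts above can be checked directly:

```python
# Parameter counts for d = 1024, k = 7, H = 16, as in the illustration.
d, k, H = 1024, 7, 16
regular   = d * d * k   # standard convolution: every output channel sees every input channel
depthwise = d * k       # depthwise: one length-k kernel per channel
shared    = H * k       # LightConv with weight sharing: only H distinct kernels
assert (regular, depthwise, shared) == (7_340_032, 7_168, 112)
```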
4 DYNAMIC CONVOLUTIONS
A dynamic convolution has kernels that vary over time as a learned function of the individual time steps. A dynamic version of standard convolutions would be impractical for current GPUs due to their large memory requirements. We address this problem by building on LightConv which drastically reduces the number of parameters (§3).
DynamicConv takes the same form as LightConv but uses a time-step dependent kernel that is computed using a function $f : \mathbb{R}^d \rightarrow \mathbb{R}^{H \times k}$:

$$\mathrm{DynamicConv}(X, i, c) = \mathrm{LightConv}(X, f(X_i)_{h,:}, i, c)$$

We model $f$ with a simple linear module with learned weights $W^Q \in \mathbb{R}^{H \times k \times d}$, i.e., $f(X_i) = \sum_{c=1}^{d} W^Q_{h,j,c} X_{i,c}$.
Similar to self-attention, DynamicConv changes the weights assigned to context elements over time. However, the weights of DynamicConv do not depend on the entire context, they are a function of the current time-step only. Self-attention requires a quadratic number of operations in the sequence length.
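A minimal NumPy sketch of DynamicConv; softmax normalization of the predicted kernels, weight sharing across channels, and zero padding at the borders are carried over from LightConv and are assumptions of this sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_conv(X, WQ):
    """DynamicConv sketch: the kernel at position i is a linear function of
    the current input X_i only. X: (n, d); WQ: (H, k, d)."""
    n, d = X.shape
    H, k, _ = WQ.shape
    center = (k + 2) // 2                 # ceil((k + 1) / 2)
    O = np.zeros((n, d))
    for i in range(n):
        # f(X_i): an (H, k) kernel predicted from the current time step alone,
        # then softmax-normalized over the k taps (assumption of this sketch).
        Wi = softmax(WQ @ X[i], axis=1)
        for c in range(d):
            h = ((c + 1) * H - 1) // d    # head assigned to channel c (weight sharing)
            for j in range(1, k + 1):
                t = i + j - center
                if 0 <= t < n:            # zero padding (assumption)
                    O[i, c] += Wi[h, j - 1] * X[t, c]
    return O
```

Unlike self-attention, the loop over time steps does linear work in the sequence length for fixed k: each position predicts one small (H, k) kernel from its own input rather than scoring every pair of positions.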