attention weights from the previous time-step into account (Chorowski et al., 2015; Luong et al., 2015). Shen et al. (2018a) reduce complexity by performing attention within blocks of the input sequence and Shen et al. (2017; 2018b) perform more fine-grained attention over each feature.

Our experiments show that lightweight convolutions perform competitively with strong self-attention results and that dynamic convolutions can perform even better. On WMT English-German translation, dynamic convolutions achieve a new state of the art of 29.7 BLEU; on WMT English-French they match the best reported result in the literature; and on IWSLT German-English, dynamic convolutions outperform self-attention by 0.8 BLEU. Dynamic convolutions achieve 20% faster runtime than a highly-optimized self-attention baseline. For language modeling on the Billion Word benchmark, dynamic convolutions perform as well as or better than self-attention, and on CNN-DailyMail abstractive document summarization we outperform a strong self-attention model.

2 BACKGROUND

We first outline sequence to sequence learning and self-attention. Our work builds on non-separable convolutions as well as depthwise separable convolutions.

Sequence to sequence learning maps a source sequence to a target sequence via two separate networks, such as in machine translation (Sutskever et al., 2014). The encoder network computes representations for the source sequence, such as an English sentence, and the decoder network autoregressively generates a target sequence based on the encoder output.

The self-attention module of Vaswani et al. (2017) applies three projections to the input $X \in \mathbb{R}^{n \times d}$ to obtain key (K), query (Q), and value (V) representations, where $n$ is the number of time steps and $d$ the input/output dimension (Figure 2a). It also defines a number of heads $H$, where each head can learn separate attention weights over $d_k$ features and attend to different positions. The module computes dot-products between key/query pairs, scales them to stabilize training, and then softmax-normalizes the result. Finally, it computes a weighted sum using the output of the value projection (V):

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Depthwise convolutions perform a convolution independently over every channel. The number of parameters can be reduced from $d^2 k$ to $dk$, where $k$ is the kernel width. The output $O \in \mathbb{R}^{n \times d}$ of a depthwise convolution with weight $W \in \mathbb{R}^{d \times k}$ for element $i$ and output dimension $c$ is defined as:

$$O_{i,c} = \text{DepthwiseConv}(X, W_{c,:}, i, c) = \sum_{j=1}^{k} W_{c,j} \cdot X_{\left(i + j - \left\lceil \frac{k+1}{2} \right\rceil\right),\, c}$$
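To make these two building blocks concrete, the following sketch implements single-head scaled dot-product attention and a depthwise convolution over a sequence of length $n$ with $d$ channels. It is a minimal illustration of the equations above, not the implementation used in the paper; the choice of PyTorch, the helper names, and the assumption of an odd kernel width $k$ are ours.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for a single head.
    Q, K, V: tensors of shape (n, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n, n) scaled dot-products
    return F.softmax(scores, dim=-1) @ V               # weighted sum of the values

def depthwise_conv(X, W):
    """Depthwise convolution: one kernel per channel, d*k parameters in total.
    X: (n, d) input sequence; W: (d, k) kernels; k is assumed odd so the
    window is centered on position i as in the summation above."""
    d, k = W.shape
    x = X.t().unsqueeze(0)   # (1, d, n) layout expected by conv1d
    w = W.unsqueeze(1)       # (d, 1, k); groups=d applies each kernel to its own channel
    return F.conv1d(x, w, padding=(k - 1) // 2, groups=d).squeeze(0).t()  # (n, d)
```

The depthwise_conv helper is reused in the LightConv sketch below.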
3 LIGHTWEIGHT CONVOLUTIONS

Figure 2: Illustration of self-attention, lightweight convolutions and dynamic convolutions. [(a) Self-attention; (b) Lightweight convolution; (c) Dynamic convolution.]

In this section, we introduce LightConv, a depthwise convolution which shares certain output channels and whose weights are normalized across the temporal dimension using a softmax. Compared to self-attention, LightConv has a fixed context window and it determines the importance of context elements with a set of weights that do not change over time steps. We will show that models equipped with lightweight convolutions show better generalization compared to regular convolutions and that they can be competitive with state-of-the-art self-attention models (§6). This is surprising because the common belief is that content-based self-attention mechanisms are crucial to obtaining state-of-the-art results in natural language processing applications. Furthermore, the low computational profile of LightConv enables us to formulate efficient dynamic convolutions (§4).

LightConv computes the following for the $i$-th element in the sequence and output channel $c$:

$$\text{LightConv}(X, W_{\left\lceil \frac{cH}{d} \right\rceil,:}, i, c) = \text{DepthwiseConv}\left(X, \text{softmax}\left(W_{\left\lceil \frac{cH}{d} \right\rceil,:}\right), i, c\right)$$

Weight sharing. We tie the parameters of every subsequent block of $d/H$ channels, which reduces the number of parameters by a factor of $d/H$. As an illustration, a regular convolution requires 7,340,032 ($d^2 \times k$) weights for $d = 1024$ and $k = 7$, a depthwise separable convolution has 7,168 weights ($d \times k$), and with weight sharing, $H = 16$, we have only 112 ($H \times k$) weights.
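The following sketch, under the same assumptions as before (PyTorch, odd $k$, and $d$ divisible by $H$), spells out LightConv: the $H$ raw kernels are softmax-normalized over the kernel width and each kernel is shared by a block of $d/H$ consecutive channels, which realizes the $\lceil cH/d \rceil$ indexing above. It reuses the depthwise_conv helper from the previous sketch.

```python
import torch.nn.functional as F  # depthwise_conv is defined in the previous sketch

def light_conv(X, W):
    """LightConv: softmax-normalized, weight-shared depthwise convolution.
    X: (n, d) input sequence; W: (H, k) raw kernels -- only H*k parameters."""
    d = X.size(1)
    H = W.size(0)
    W_norm = F.softmax(W, dim=-1)                     # normalize each kernel over its width k
    W_full = W_norm.repeat_interleave(d // H, dim=0)  # (d, k): d/H consecutive channels per kernel
    return depthwise_conv(X, W_full)                  # (n, d)
```

With $d = 1024$, $H = 16$ and $k = 7$, `W` holds exactly the 112 weights counted above; in the full model the convolution additionally sits between linear projections with a GLU, as in Figure 2b.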
We will see that this vast reduction in the number of parameters is crucial to make dynamic convolutions possible on current hardware.

4 DYNAMIC CONVOLUTIONS

A dynamic convolution has kernels that vary over time as a learned function of the individual time steps. A dynamic version of standard convolutions would be impractical for current GPUs due to their large memory requirements. We address this problem by building on LightConv, which drastically reduces the number of parameters (§3).

DynamicConv takes the same form as LightConv but uses a time-step dependent kernel that is computed using a function $f: \mathbb{R}^d \to \mathbb{R}^{H \times k}$:

$$\text{DynamicConv}(X, i, c) = \text{LightConv}(X, f(X_i)_{h,:}, i, c)$$

We model $f$ with a simple linear module with learned weights $W^Q \in \mathbb{R}^{H \times k \times d}$, i.e., $f(X_i) = \sum_{c=1}^{d} W^Q_{h,j,c} X_{i,c}$.
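A direct, unbatched reading of this definition is sketched below under the same assumptions as before: a linear map produces an $H \times k$ kernel from $X_i$ at every position, the kernel is softmax-normalized and expanded by weight sharing exactly as in LightConv, and the convolution at position $i$ uses that position's own kernel. The explicit loop is for clarity only and is not how an efficient implementation would be written.

```python
import torch
import torch.nn.functional as F

def dynamic_conv(X, W_q):
    """DynamicConv: LightConv with a kernel predicted from X_i at every position.
    X: (n, d) input sequence; W_q: (H, k, d) weights of the linear module f."""
    n, d = X.shape
    H, k, _ = W_q.shape
    pad = (k - 1) // 2                           # odd k: window centered on position i
    X_pad = F.pad(X, (0, 0, pad, pad))           # pad the time dimension on both sides
    rows = []
    for i in range(n):                           # loop over positions, for clarity only
        kernel = F.softmax(torch.einsum('hkd,d->hk', W_q, X[i]), dim=-1)  # f(X_i), normalized over k
        kernel = kernel.repeat_interleave(d // H, dim=0)                  # (d, k) via weight sharing
        window = X_pad[i:i + k]                                           # (k, d) context around i
        rows.append((kernel * window.t()).sum(dim=1))                     # depthwise output at position i
    return torch.stack(rows)                                              # (n, d)
```

For example, with $d = 1024$, $H = 16$ and $k = 7$, $f$ only has to produce $16 \times 7 = 112$ kernel values per position, which is what makes a time-varying kernel affordable.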
Similar to self-attention, DynamicConv changes the weights assigned to context elements over time. However, the weights of DynamicConv do not depend on the entire context; they are a function of the current time-step only. Self-attention requires a quadratic number of operations in the sequence length to compute the attention weights, whereas the dynamic kernels of DynamicConv can be computed in time that grows linearly with the sequence length.
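As a back-of-the-envelope account of this last point (a sketch in the paper's notation, not a measured comparison), the cost of producing the context weights for one layer and one sequence of length $n$ scales as:

```latex
\underbrace{O(n^2 d)}_{\text{self-attention: } QK^{T} \in \mathbb{R}^{n \times n}}
\qquad \text{vs.} \qquad
\underbrace{O(n H k d)}_{\text{DynamicConv: } n \text{ kernels } f(X_i) \in \mathbb{R}^{H \times k}}
```

The attention matrix grows quadratically with the sequence length, while the number of dynamic kernel entries, $nHk$, and the cost of producing them through $f$ grow linearly.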