Slide 1

What Context Features Can Transformer Language Models Use?
Joe O'Connor and Jacob Andreas, ACL 2021
Presenter: Yuri Murayama (Ochanomizu University)
2021/09/17, 13th Advanced NLP Study Group (最先端NLP勉強会)
Advance votes: 4

Slide 2

Research Question

John went to the library to check out a book.   p(book | context)
• Count-based LMs: 10-20 tokens [Brown 2011]
• RNNs: ~200 tokens [Khandelwal+ 2018]
• Transformer LMs: 1,000+ tokens [Beltagy+ 2020]
Why is longer context better? In other words, what does long context actually provide?

Slide 3

What information in the context is useful?

In 2000, producer David Heyman asked Radcliffe to audition for the role of Harry Potter for the film adaptation of Harry Potter and the Philosopher's Stone, the best-selling book by British author J.K. Rowling. Rowling had been searching for an unknown British actor to personify the character, and the movie's director Chris Columbus recalled thinking, "This is what I want. This is Harry Potter", after he saw a video of the young actor in David Copperfield. Eight months later, and after several auditions, Radcliffe was selected to play the part. Rowling also endorsed the selection saying, "I don't think Chris Columbus could have found a better Harry."

Assuming that only named-entity information is used from the context far away from the target:
p(Harry | full context) ≈ p(Harry | named-entity-only context + ordinary context)

Slide 4

What information in the context is useful?

Assuming that only named-entity information is used from the context far away from the target:
p(Harry | full context) ≈ p(Harry | named-entity-only context + ordinary context)
If the difference in information is small, the assumption holds.
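The named-entity-only ablation can be sketched as follows. This is a toy version: `entity_tokens` is a hypothetical precomputed set of entity tokens (the actual setup would use a real NER tagger over spans, not a token set).

```python
def entities_only(tokens, entity_tokens, pad="<pad>"):
    """Keep only tokens that belong to a named entity; replace every
    other token with a padding symbol. A toy sketch of the
    named-entity-only context ablation (entity tokens assumed given)."""
    return [t if t in entity_tokens else pad for t in tokens]
```

Applied to the distant context, this produces the "named-entity-only context" whose likelihood is compared against the full context.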

Slide 5

Ablated Information

• ablated information
• ablated likelihood
• Intuitively, A(f, k) measures what fraction of the information added by the k tokens is lost when ablation f is applied to those k tokens
• Close to 0: no information is lost; close to 1: all of it is lost

Slide 6

Ablated Information

• ablated likelihood: log p(x | f(c_n)), the target's likelihood when ablation f is applied to the distant k tokens of the n-token context
• ablated information:
  A(f, k) = (log p(x | c_n) − log p(x | f(c_n))) / (log p(x | c_n) − log p(x | c_{n−k}))
  where c_n is the full n-token context and c_{n−k} is the context with the distant k tokens dropped
• Intuitively, A(f, k) measures what fraction of the information added by the k tokens is lost when ablation f is applied to those k tokens
• Close to 0: no information is lost; close to 1: all of it is lost
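Given the three log-likelihoods above, the metric reduces to a one-line ratio. A minimal sketch (argument names are mine, not the paper's):

```python
def ablated_information(logp_full, logp_short, logp_ablated):
    """Fraction of the information added by the distant k tokens that is
    lost when ablation f is applied to them.

    logp_full    = log p(x | c_n)       (full n-token context)
    logp_short   = log p(x | c_{n-k})   (distant k tokens dropped)
    logp_ablated = log p(x | f(c_n))    (ablation f applied to the k tokens)
    """
    gain = logp_full - logp_short    # information the k tokens add
    loss = logp_full - logp_ablated  # information the ablation destroys
    return loss / gain
```

A value of 0 means f(c_n) is as informative as the full context; a value of 1 means it is no better than dropping the k tokens entirely.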

Slide 7

Experimental Setup

GPT-2 [Radford+ 2019] trained on the WikiText-103 dataset [Merity+ 2016]
• roughly 100 training runs
[Figure: the Transformer LM reads an ablated distant context followed by an intact "ordinary context" of the nearest 512 tokens. The ablated span is the preceding 512 tokens (long-range, 512+512 window) or 256 tokens (mid-range, 512+256 window). Example shown: the Harry Potter passage with everything but named entities (2000, David Heyman, Radcliffe, Harry Potter, Harry Potter and the Philosopher's Stone, British, J.K. Rowling, …) removed from the distant part.]
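The window construction above can be sketched as follows, assuming tokens are plain Python lists and `ablate` is any token-list-to-token-list function (function and argument names are illustrative):

```python
def build_window(tokens, ablate, scope="long-range"):
    """Assemble one evaluation window following the slide's setup:
    the 512 tokens nearest the target are kept intact ("ordinary
    context"); the ablation is applied to the k tokens before them
    (k=512 for long-range, k=256 for mid-range, giving 512+512 and
    512+256 windows respectively)."""
    k = 512 if scope == "long-range" else 256
    ordinary = tokens[-512:]                   # kept intact
    distant = ablate(tokens[-(512 + k):-512])  # ablated context
    return distant + ordinary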

Slide 8

Does order matter?

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.

Slide 9

Does order matter?

Quite destructive

Slide 10

Does order matter?

Slide 11

Does order matter?

As long as local co-occurrence relations are preserved, correct word order matters little
• dog bites man ≈ man bites dog
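One way to implement such a local-order-preserving shuffle is to permute the order of consecutive n-grams while keeping each n-gram intact. This is a sketch; the paper's exact shuffling variants differ in detail:

```python
import random

def shuffle_ngrams(tokens, n, seed=0):
    """Cut the context into consecutive n-grams and shuffle the order
    of the n-grams, keeping each one intact: local co-occurrences
    (e.g. 'dog bites man') survive, global word order does not."""
    rng = random.Random(seed)
    grams = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng.shuffle(grams)
    return [t for g in grams for t in g]
```

Shuffling individual tokens instead (n=1) destroys local co-occurrence as well, which is the "quite destructive" case on the earlier slide.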

Slide 12

Does order matter?

Replace the entire input with the preceding 512 tokens from the same document (topically similar)

Slide 13

Does order matter?

• More than half of the information is lost
• So long context is not merely providing topic information?

Slide 14

Do all words matter?

• Keeping only named entities is not enough
• Nouns provide almost all of the useful information
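The part-of-speech ablations behind this finding can be sketched as follows. Tags are assumed to be given (the real setup would run an external POS tagger); the tag names are illustrative:

```python
def keep_pos(tagged, keep=("NOUN",), pad="<unk>"):
    """POS-based ablation: keep tokens whose tag is in `keep`, replace
    the rest with a placeholder token. With keep=("NOUN",) this is the
    nouns-only context; other tag sets give the other ablations."""
    return [tok if pos in keep else pad for tok, pos in tagged]
```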

Slide 15

Summary

• Investigated how transformer models use the information in long-range context
• Useful information is mainly carried by content words and local co-occurrence relations
• The benefit of long context cannot be explained by topic or named entities alone
• Replacing low-information words in the context (e.g. padding tokens) with high-information words (e.g. nouns+verbs) did not improve results