Paper introduction: What Context Features Can Transformer Language Models Use?

What Context Features Can Transformer Language Models Use?
Joe O'Connor and Jacob Andreas, ACL 2021
Presenter: Yuri Murayama (Ochanomizu University), 2021/09/17
13th 最先端NLP勉強会 (State-of-the-Art NLP Study Group); 4 votes in the advance poll

Research Question
Example: John went to the library to check out a book. p(book | context)
• Count-based LMs: 10-20 tokens [Brown 2011]
• RNNs: ~200 tokens [Khandelwal+ 2018]
• Transformer LMs: 1,000+ tokens [Beltagy+ 2020]
Why is a longer context better? Put differently, what does a long context actually provide?

Which information in the context is useful?
In 2000, producer David Heyman asked Radcliffe to audition for the role of Harry Potter for the film adaptation of Harry Potter and the Philosopher's Stone, the best-selling book by British author J.K. Rowling. Rowling had been searching for an unknown British actor to personify the character, and the movie's director Chris Columbus recalled thinking, "This is what I want. This is Harry Potter", after he saw a video of the young actor in David Copperfield. Eight months later, and after several auditions, Radcliffe was selected to play the part. Rowling also endorsed the selection saying, "I don't think Chris Columbus could have found a better Harry."
Suppose that, in the context far from the target, only named-entity information is used. Then:
p(Harry | full context) ≈ p(Harry | named-entity-only context + ordinary context)

Which information in the context is useful? (continued)
(The example passage from the previous slide is shown again.)
Suppose that, in the context far from the target, only named-entity information is used. Then:
p(Harry | full context) ≈ p(Harry | named-entity-only context + ordinary context)
If the difference in information between the two sides is small, the assumption holds.

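Ablated Information
• ablated information
• ablated likelihood
• Intuitively, A(f, k) measures, relative to the information added by k extra context tokens, the fraction that is lost when ablation f is applied to those k tokens
• Close to 0: no information is lost; close to 1: all of the information is lost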
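The verbal description of A(f, k) can be read as a ratio of loss differences. The sketch below is an assumption reconstructed from that description only, since the slide's own formulas did not survive in the transcript. The function name, the argument names, and the numeric values are hypothetical.

```python
def ablated_information(loss_ablated: float,
                        loss_full: float,
                        loss_truncated: float) -> float:
    """Fraction of the information added by k extra context tokens that is
    lost when an ablation f is applied to those k tokens.

    loss_full:      average negative log-likelihood with n intact context tokens
    loss_truncated: same, with only the nearest n - k context tokens
    loss_ablated:   same, with n - k intact tokens plus k ablated tokens
    """
    # Information the k extra tokens contribute when left intact.
    gain = loss_truncated - loss_full
    if gain <= 0:
        raise ValueError("the extra k tokens add no information; ratio undefined")
    # Share of that gain which disappears under the ablation.
    return (loss_ablated - loss_full) / gain


# Hypothetical loss values, for illustration only.
print(ablated_information(loss_ablated=3.05, loss_full=3.00, loss_truncated=3.20))
```

Under this reading, a value near 0 means the ablated context is almost as useful as the intact one, and a value near 1 means it is no better than dropping the k tokens entirely.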

Ablated Information (continued)
[Diagram labels: n, n-k, k]

Experimental setup
• GPT-2 [Radford+ 2019] trained on the WikiText-103 dataset [Merity+ 2016]
• Roughly 100 training runs
[Figure: a Transformer LM receives an ablated context followed by an ordinary context. In the example, the ablated part of the passage keeps only its named entities, "2000 David Heyman Radcliffe Harry Potter Harry Potter and the Philosopher's Stone British J.K. Rowling.", and the rest of the passage is left intact. Figure labels: ordinary context 512; long-range 512+512; mid-range 512+256.]
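The layout in the figure, an ablated context placed in front of an ordinary context, can be sketched as follows. This is a minimal illustration, not the authors' code: `build_context`, its parameters, and the padding ablation are hypothetical, and the small sizes in the usage example stand in for the token counts shown on the slide.

```python
from typing import Callable, List, Sequence

Ablation = Callable[[Sequence[str]], List[str]]


def build_context(tokens: Sequence[str],
                  target_index: int,
                  n_ordinary: int,
                  k_ablated: int,
                  ablate: Ablation) -> List[str]:
    """Return the model input used to predict tokens[target_index].

    The n_ordinary tokens nearest to the target are kept intact; the
    k_ablated tokens before them are passed through `ablate`.
    """
    near_start = max(0, target_index - n_ordinary)
    far_start = max(0, near_start - k_ablated)
    far = tokens[far_start:near_start]      # distant span to be ablated
    near = tokens[near_start:target_index]  # nearby span left untouched
    return list(ablate(far)) + list(near)


# Toy usage: replace every distant token with a padding symbol.
doc = [str(i) for i in range(10)]
print(build_context(doc, target_index=8, n_ordinary=3, k_ablated=4,
                    ablate=lambda span: ["<pad>"] * len(span)))
```

Passing the ablation in as a function keeps the context layout separate from the choice of ablation, so the same helper could be reused for shuffling or word-filtering ablations.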

Does order matter?
Example: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.

Does order matter?
• Quite destructive

Does order matter?
[Figure only]

Does order matter?
• If local co-occurrence relations are preserved, the correct word order matters little
• dog bites man ≈ man bites dog
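To make the contrast between destroying all order and preserving local co-occurrence concrete, here is a small sketch of three shuffling ablations. The function names and the fixed-size n-gram chunking are assumptions made for illustration; the paper's exact procedures may differ.

```python
import random
from typing import List, Sequence


def _chunks(tokens: Sequence[str], n: int) -> List[List[str]]:
    """Split tokens into consecutive chunks of length n (the last may be shorter)."""
    return [list(tokens[i:i + n]) for i in range(0, len(tokens), n)]


def shuffle_all(tokens: Sequence[str], rng: random.Random) -> List[str]:
    """Destroy all ordering: permute every token."""
    out = list(tokens)
    rng.shuffle(out)
    return out


def shuffle_within_ngrams(tokens: Sequence[str], n: int, rng: random.Random) -> List[str]:
    """Keep each n-gram in place but permute the tokens inside it."""
    out: List[str] = []
    for chunk in _chunks(tokens, n):
        rng.shuffle(chunk)
        out.extend(chunk)
    return out


def shuffle_ngrams_globally(tokens: Sequence[str], n: int, rng: random.Random) -> List[str]:
    """Keep each n-gram intact but permute the order of the n-grams."""
    chunks = _chunks(tokens, n)
    rng.shuffle(chunks)
    return [tok for chunk in chunks for tok in chunk]


sentence = "Pierre Vinken , 61 years old , will join the board".split()
rng = random.Random(0)
print(shuffle_all(sentence, rng))
print(shuffle_within_ngrams(sentence, 3, rng))
print(shuffle_ngrams_globally(sentence, 3, rng))
```

The last two variants keep the same words near each other, which is the kind of local co-occurrence the slide refers to, while the first removes it.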

Does order matter?
• Replace the entire input with the 512 tokens that immediately precede it in the same document (topically similar)
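A minimal sketch of this replacement, assuming the document is available as a flat token list. The helper name and its arguments are hypothetical, and the toy sizes stand in for the 512 tokens on the slide.

```python
from typing import List, Sequence


def replace_with_preceding(tokens: Sequence[str], span_start: int, span_len: int) -> List[str]:
    """Return the span_len tokens that immediately precede
    tokens[span_start:span_start + span_len] in the same document."""
    if span_start < span_len:
        raise ValueError("not enough earlier tokens in the document")
    return list(tokens[span_start - span_len:span_start])


# Toy usage: a 3-token span starting at index 5 is replaced by tokens 2..4.
doc = [f"t{i}" for i in range(10)]
print(replace_with_preceding(doc, span_start=5, span_len=3))
```

The replacement text comes from the same document, so it shares the topic but contains none of the original tokens' specific content.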

Does order matter?
• More than half of the information is lost
• So the long context is not simply providing topic information?

Do all words matter?
• Keeping only the named entities is not enough
• Nouns provide almost all of the useful information
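Word-deletion ablations of this kind can be sketched as a filter over tagged tokens. The example below assumes tokens already carry coarse part-of-speech tags; the tag names, the hand-tagged sentence, and the `keep_tags` helper are hypothetical and only illustrate the idea.

```python
from typing import List, Optional, Sequence, Set, Tuple

TaggedToken = Tuple[str, str]  # (token, coarse part-of-speech tag)


def keep_tags(tagged: Sequence[TaggedToken],
              keep: Set[str],
              pad: Optional[str] = None) -> List[str]:
    """Keep tokens whose tag is in `keep`.

    Other tokens are dropped, or replaced by `pad` when one is given
    (which preserves the original length of the context).
    """
    out: List[str] = []
    for token, tag in tagged:
        if tag in keep:
            out.append(token)
        elif pad is not None:
            out.append(pad)
    return out


# Hand-tagged toy example; the tags are assigned manually for illustration.
tagged = [("Rowling", "PROPN"), ("endorsed", "VERB"),
          ("the", "DET"), ("selection", "NOUN")]
print(keep_tags(tagged, keep={"NOUN", "PROPN"}))
print(keep_tags(tagged, keep={"NOUN", "PROPN"}, pad="<pad>"))
```

Changing the `keep` set gives the other conditions, for example proper nouns only as a rough stand-in for named entities, or nouns plus verbs.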

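Summary
• Investigated how transformer models use the information in long-range context
• The useful information is carried mainly by content words and local co-occurrence relations
• The benefit of long context cannot be explained by topic or named entities alone
• Replacing low-information words in the context (e.g., padding tokens) with high-information words (e.g., nouns+verbs) did not improve the results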
