
Paper Introduction: What Context Features Can Transformer Language Models Use?

yuri
September 09, 2021

Transcript

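  1. What Context Features Can Transformer Language Models Use?

    Joe O’Connor and Jacob Andreas, ACL 2021
    Presenter: Yuri Murayama (村山友理, Ochanomizu University), 2021/09/17
    13th State-of-the-Art NLP Study Group (第13回最先端NLP勉強会); 4 votes in the advance poll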
  2. Research Question

    John went to the library to check out a book. p(book | context)
    • Count-based LMs: 10-20 tokens [Brown 2011]
    • RNNs: ~200 tokens [Khandelwal+ 2018]
    • Transformer LMs: 1,000+ tokens [Beltagy+ 2020]
    Why is a longer context better? Equivalently, what does a longer context provide?
  3. What information in the context is useful?

    In 2000, producer David Heyman asked Radcliffe to audition for the role of Harry Potter for the film adaptation of Harry Potter and the Philosopher’s Stone, the best-selling book by British author J.K. Rowling. Rowling had been searching for an unknown British actor to personify the character, and the movie’s director Chris Columbus recalled thinking, “This is what I want. This is Harry Potter”, after he saw a video of the young actor in David Copperfield. Eight months later, and after several auditions, Radcliffe was selected to play the part. Rowling also endorsed the selection saying, “I don’t think Chris Columbus could have found a better Harry.”

    Suppose that, in context far from the target, only named-entity information is used. Then:
    p(Harry | full context) ≈ p(Harry | named-entity-only context + ordinary context)
  4. What information in the context is useful?

    Suppose that, in context far from the target, only named-entity information is used. Then:
    p(Harry | full context) ≈ p(Harry | named-entity-only context + ordinary context)
    If the difference in information between the two sides is small, the assumption holds.
    (Example passage as on slide 3.)
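  5. Ablated Information

    • ablated information
    • ablated likelihood
    • Intuitively, A(f, k) measures, relative to the information added by k extra tokens, the fraction that is lost when the ablation f is applied to those k tokens
    • Close to 0: no information is lost; close to 1: all of the information is lost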
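  6. Ablated Information

    • ablated information
    • ablated likelihood
    • Intuitively, A(f, k) measures, relative to the information added by k extra tokens, the fraction that is lost when the ablation f is applied to those k tokens
    • Close to 0: no information is lost; close to 1: all of the information is lost
    [Figure: context spans labeled n, n-k, and k]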
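
The intuition on slides 5 and 6 can be sketched as a ratio of loss differences. The formula images did not survive in this transcript, so the form below is an assumption that merely matches the verbal description: the numerator is the loss increase caused by the ablation, and the denominator is the loss reduction that the k extra tokens provide. The function and parameter names are hypothetical.

```python
def ablated_information(loss_ablated: float, loss_full: float, loss_short: float) -> float:
    """Fraction of the information added by k extra tokens that an ablation removes.

    Assumed inputs (average negative log-likelihoods, lower is better):
      loss_full    -- full context of n tokens
      loss_short   -- only the nearest n-k tokens
      loss_ablated -- n tokens, with the ablation f applied to the k distant ones
    """
    gained = loss_short - loss_full   # information the k extra tokens add
    if gained <= 0:
        raise ValueError("the k extra tokens add no information; the ratio is undefined")
    lost = loss_ablated - loss_full   # information removed by the ablation
    return lost / gained


# An ablation that changes nothing scores 0; one that is as bad as
# dropping the k tokens entirely scores 1.
print(ablated_information(3.0, 3.0, 3.5))
print(ablated_information(3.5, 3.0, 3.5))
```

Under this reading, values outside the range 0 to 1 are possible: a negative value would mean the ablated context did better than the full one.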
  7. Experimental setup

    GPT-2 [Radford+ 2019] trained on the WikiText-103 dataset [Merity+ 2016]
    • roughly 100 training runs

    [Figure: input to the Transformer LM, an ablated context followed by an ordinary context. Labels: 512, 512+512, 512+256, long-range, mid-range.]
    Ablated context: 2000 David Heyman Radcliffe Harry Potter Harry Potter and the Philosopher’s Stone British J.K. Rowling.
    Ordinary context: Rowling had been searching for an unknown British actor to personify the character, and the movie’s director Chris Columbus recalled thinking, “This is what I want. This is Harry Potter”, after he saw a video of the young actor in David Copperfield. Eight months later, and after several auditions, Radcliffe was selected to play the part. Rowling also endorsed the selection saying, “I don’t think Chris Columbus could have found a better Harry.”
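
The input layout on this slide (a distant ablated span followed by an intact ordinary span) can be sketched as follows. This is a minimal illustration on plain token lists, not the authors' pipeline; `build_ablated_input`, `ordinary_len`, and `ablate` are hypothetical names, and the exact span lengths used for the mid-range and long-range conditions are not filled in here.

```python
from typing import Callable, List, Sequence


def build_ablated_input(
    context: Sequence[str],
    ordinary_len: int,
    ablate: Callable[[List[str]], List[str]],
) -> List[str]:
    """Apply `ablate` to the distant tokens and keep the last `ordinary_len` tokens intact."""
    if not 0 <= ordinary_len <= len(context):
        raise ValueError("ordinary_len must be between 0 and len(context)")
    split = len(context) - ordinary_len
    distant = list(context[:split])   # far from the target: ablated
    nearby = list(context[split:])    # close to the target: left as-is
    return ablate(distant) + nearby


# Toy ablation (an assumption for illustration): keep capitalised tokens only.
tokens = "producer David Heyman asked Radcliffe to audition for the role".split()
print(build_ablated_input(tokens, 4, lambda ts: [t for t in ts if t[0].isupper()]))
```

Passing the ablation as a function keeps the split logic separate from the choice of ablation, so the same helper can be reused for shuffling or word-deletion variants.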
  8. Does order matter?

    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
  9. Does order matter?

    Quite destructive

  10. Does order matter?

  11. Does order matter?

    As long as local co-occurrence relations are preserved, the correct word order matters little
    • dog bites man ≈ man bites dog
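
The contrast between destroying all order and preserving local co-occurrence can be illustrated with two shuffling functions. The specific ablations compared on slides 9 to 11 are not recoverable from this transcript, so the fixed-size window shuffle below is only an illustrative stand-in; the function names and the `window` parameter are assumptions.

```python
import random
from typing import List, Sequence


def shuffle_all(tokens: Sequence[str], rng: random.Random) -> List[str]:
    """Shuffle every token globally, destroying local co-occurrence."""
    out = list(tokens)
    rng.shuffle(out)
    return out


def shuffle_within_windows(tokens: Sequence[str], window: int, rng: random.Random) -> List[str]:
    """Shuffle tokens only inside consecutive windows, so nearby words stay nearby."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out: List[str] = []
    for start in range(0, len(tokens), window):
        chunk = list(tokens[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return out


rng = random.Random(0)  # fixed seed so runs are repeatable
sentence = "dog bites man near the old park".split()
print(shuffle_within_windows(sentence, 3, rng))
print(shuffle_all(sentence, rng))
```

With a window of 3, "dog", "bites", and "man" can swap places with each other but never move away from one another, which is the sense in which local co-occurrence is kept.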
  12. Does order matter?

    Replace the entire input with the immediately preceding 512 tokens from the same document (topically similar)

  13. Does order matter?

    • More than half of the information is lost
    • So the long context is not simply supplying topic information?
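
The replacement described on slide 12 can be sketched on a token list. This assumes that "input" there refers to the distant span that the other ablations modify, and that the ordinary context near the target is kept; the function name and its parameters are hypothetical, and the slide's 512-token span is shrunk to a toy size in the usage line.

```python
from typing import List, Sequence


def replace_with_preceding(
    document: Sequence[int],
    span_start: int,
    span_len: int,
    ordinary_len: int,
) -> List[int]:
    """Swap the distant span for the `span_len` tokens just before it, then append the ordinary context."""
    span_end = span_start + span_len
    if span_start < span_len or span_end + ordinary_len > len(document):
        raise ValueError("document is too short for the requested spans")
    replacement = list(document[span_start - span_len:span_start])  # same document, earlier text
    ordinary = list(document[span_end:span_end + ordinary_len])     # untouched nearby context
    return replacement + ordinary


# Toy document of 12 token ids: the span [4, 8) is replaced by [0, 4).
print(replace_with_preceding(list(range(12)), 4, 4, 3))
```

The replacement text comes from the same document, so it stays topically close while sharing no specific content with the span it replaces.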

  14. Do all words matter?

    • Keeping only the named entities is not sufficient
    • Nouns provide almost all of the useful information
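
A word-deletion ablation of this kind can be sketched as a filter over part-of-speech tags. The example assumes the tokens have already been tagged by some external tagger, and it uses Universal POS tag names (NOUN, PROPN, VERB, DET) purely for illustration; `keep_tags` is a hypothetical helper, not the authors' code.

```python
from typing import Iterable, List, Set, Tuple


def keep_tags(tagged: Iterable[Tuple[str, str]], keep: Set[str]) -> List[str]:
    """Keep only the tokens whose tag is in `keep`, preserving their original order."""
    return [token for token, tag in tagged if tag in keep]


# Hand-tagged toy input (the tags are assumptions, not tagger output).
tagged = [
    ("Rowling", "PROPN"),
    ("endorsed", "VERB"),
    ("the", "DET"),
    ("selection", "NOUN"),
]
print(keep_tags(tagged, {"NOUN", "PROPN"}))          # nouns only
print(keep_tags(tagged, {"NOUN", "PROPN", "VERB"}))  # nouns and verbs
```

Changing the `keep` set is enough to move between variants such as nouns only or nouns plus verbs.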

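  15. Summary

    • Investigated how transformer models use the information in long-range context
    • The useful information lies mainly in content words and local co-occurrence relations
    • The benefit of long context cannot be explained by topic or named entities alone
    • Replacing low-information words in the context (e.g. padding tokens) with high-information words (e.g. nouns+verbs) did not improve the results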