
Large-Scale Syntactic Language Modeling with Treelets


At the 4th 最先端NLP勉強会 (Cutting-Edge NLP Study Group), I presented Adam Pauls and Dan Klein. Large-Scale Syntactic Language Modeling with Treelets. ACL 2012.


Mamoru Komachi

August 31, 2012

Transcript

  1. Large-Scale Syntactic Language Modeling with Treelets. Adam Pauls and Dan Klein (ACL 2012).
     Presented by Mamoru Komachi at the 4th 最先端NLP勉強会 (Cutting-Edge NLP Study Group), 2012/08/31.
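  2. N-gram LMs cannot handle long-distance dependencies.
     Advantages of n-gram language models: simple to implement; robust in practice; fine even at large scale.
     Drawback of n-gram language models: they cannot capture long-distance dependencies.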
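  3. We propose a syntactic language model that scales and is simple to implement.
     It is a generative language model: conditioned on treelets of the parse tree; scales to large data; about as simple to implement as an n-gram language model.
     It does not need massively parallel computation (simple is best).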
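  4. We evaluate the treelet language model on a variety of tasks and settings.
     Comparison with prior work: it outperforms n-gram language models and other generative syntactic language models that use tree structure; although built from positive examples only, it performs on par with discriminative models specialized for each task.
     Tasks compared: perplexity; classifying pseudo-negative versus positive examples; classifying machine translation output versus references.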
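  5. 2. The treelet language model generates the tree from left to right.
     Assigning a probability to a tree: T = constituency tree (e.g.
     r = P → C_1 … C_d; P = parent symbol of rule r; C = children; h = conditioning context (already generated).
     For a PCFG, h = P: each rule is conditioned on the parent node only.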
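The slide's formula is not in the transcript. A form consistent with the definitions above, stated here as an assumption rather than a quotation, takes one factor per rule r in T, each conditioned on its context h:

```latex
p(T) = \prod_{r \in T} p\left(C_1^d \mid h\right), \qquad r = P \rightarrow C_1 \cdots C_d
```

Setting h = P in every factor gives the ordinary PCFG tree probability.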
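  6. The nice properties of the n-gram language model that everyone knows.
     An n-gram language model assigns probabilities based on observed n-gram frequencies, and backs off to a smaller context for n-grams it has never observed.
     This kind of smoothing is robust and simple to implement.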
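The backoff idea on this slide can be sketched in a few lines of Python. This is a minimal illustration using the simple "stupid backoff" scheme (unnormalized scores with a fixed penalty), not the smoothing used in the paper; the function names, the toy corpus, and the penalty `alpha` are all assumptions made for the example.

```python
from collections import Counter

def train_counts(tokens, max_order=3):
    """Count every n-gram of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_score(counts, total, context, word, alpha=0.4):
    """Relative frequency if the n-gram was seen, else retry with a shorter context.

    alpha is a hypothetical fixed penalty applied once per backoff step.
    """
    context = tuple(context)
    penalty = 1.0
    while True:
        ngram = context + (word,)
        if counts[ngram] > 0:
            denom = counts[context] if context else total
            return penalty * counts[ngram] / denom
        if not context:
            return 0.0  # unseen word: this toy version reserves no mass for it
        context = context[1:]  # drop the most distant context word
        penalty *= alpha

tokens = "the cat sat on the mat".split()
counts = train_counts(tokens)
total = len(tokens)
seen = backoff_score(counts, total, ("the",), "cat")    # bigram was observed
unseen = backoff_score(counts, total, ("sat",), "mat")  # falls back to the unigram
```

Here `seen` uses the observed bigram count directly, while `unseen` drops the context and pays the penalty once.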
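  7. The treelet language model is smoothed in the same way.
     How should we smooth? Use the parent (P) as context, in addition to r', the rule that generated the parent.
     We want the context to be long in order to capture dependencies, but also short so that probabilities can be estimated from data.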
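  8. Three benefits of conditioning on the rule that generated the parent.
     Taking both P and its parent P' into account is more predictive than P alone (Johnson, 1998).
     Positional differences can be taken into account. E.g. subject and object noun phrases have different distributions (Klein and Manning, 2003) → the position of the noun relative to the verb is a good indicator for telling them apart.
     When generating a word, we can condition on the siblings of the preterminal → this makes it possible to take the verb's case frame into account.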
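  9. 2.1 The treelet language model also takes local context into account.
     Ideally we would replace the n-gram language model with a top-down, PCFG-style model… → in practice, n-gram information is important for prediction.
     To take left-to-right context into account, the previous two words are added to the context → this captures collocations and lexical correlations.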
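  10. 2.2 Backoff is handled separately for terminals and non-terminals.
     Non-terminals: $p(C_1^d \mid P, P', r') \to p(C_1^d \mid P, P') \to p(C_1^d \mid P)$, and finally $\lambda \prod_{i=1}^{d} p(C_i \mid C_{i-3}^{i-1}, P) + (1-\lambda) \prod_{i=1}^{d} p(C_i \mid C_{i-3}^{i-1})$.
     Terminals: a separate chain.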
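The last stage of the non-terminal backoff interpolates two products over the children of a rule: one conditioned on up to three previous siblings plus the parent, one on the previous siblings only. A minimal sketch, assuming the two conditional distributions are supplied as plain Python callables (`prob_with_parent` and `prob_without_parent` are hypothetical names, and the constant toy values stand in for estimated probabilities):

```python
def sibling_product(children, prob, parent=None, order=4):
    """Product over children of prob(C_i, previous siblings, parent).

    order=4 means each child sees at most three previous siblings.
    """
    total = 1.0
    for i, child in enumerate(children):
        history = tuple(children[max(0, i - (order - 1)):i])
        total *= prob(child, history, parent)
    return total

def interpolated_rule_prob(children, parent, prob_with_parent,
                           prob_without_parent, lam=0.5):
    """lam * prod p(C_i | history, P) + (1 - lam) * prod p(C_i | history)."""
    with_parent = sibling_product(children, prob_with_parent, parent)
    without_parent = sibling_product(children, prob_without_parent)
    return lam * with_parent + (1.0 - lam) * without_parent

# Toy stand-ins for the two estimated distributions (assumed values).
def prob_with_parent(child, history, parent):
    return 0.5

def prob_without_parent(child, history, parent):
    return 0.25

p_rule = interpolated_rule_prob(["NP", "VP"], "S",
                                prob_with_parent, prob_without_parent)
```

With two children the result is 0.5 · 0.5² + 0.5 · 0.25².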
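  11. 2.3 The treelet LM needs only four probability distributions!
     Just as when building an n-gram language model, count treelet frequencies and compute the distributions listed below.
     Where do the counts come from? The hand-annotated Penn Treebank corpus → high quality, but small. Parse trees generated automatically with a parser → they contain errors, but that is not a problem, because we are interested in the generated sentences, not in the parse trees themselves.
     The four distributions: $p(C_1^d \mid P, P', r')$; $p(w \mid P, R, r', w_{-1}, w_{-2})$; $p(C_i \mid C_{i-n+1}^{i-1}, P)$; $p(C_i \mid C_{i-n+1}^{i-1})$.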
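Counting treelet events is much like counting n-grams. The sketch below collects counts for the rule distribution only (children given P, P', and r'), assuming trees are nested tuples `(label, child, ...)` with words as plain strings under preterminals; the tree encoding and function names are assumptions for illustration, not the authors' code.

```python
from collections import Counter

def rule_of(node):
    """Return the rule at a node as (parent_label, tuple_of_child_labels)."""
    label, *children = node
    return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

def count_rule_events(tree):
    """Count events (children, P, P', r') over all non-preterminal nodes."""
    counts = Counter()

    def visit(node, parent_label, generating_rule):
        # parent_label is P' and generating_rule is r' for this node.
        label, *children = node
        if all(isinstance(c, str) for c in children):
            return  # preterminal: word generation uses a separate distribution
        rule = rule_of(node)
        counts[(rule[1], label, parent_label, generating_rule)] += 1
        for child in children:
            if not isinstance(child, str):
                visit(child, label, rule)

    visit(tree, None, None)
    return counts

tree = ("S",
        ("NP", ("DT", "the"), ("NN", "cat")),
        ("VP", ("VBD", "slept")))
events = count_rule_events(tree)
```

Dividing each count by the count of its context would give the unsmoothed relative-frequency estimate; the backoff contexts are obtained by dropping r' and then P' from the key.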
  12. 3 Nine transformation rules for capturing dependencies.
     The nine rules are applied in order to transform the constituency tree.

  13. Temporal NPs.
     Following Klein and Manning (2003), temporal expressions are marked, e.g. today → NNT, months → NNTS.
  14. Head Annotations.
     When the head is a closed-class word, the non-terminal and the preterminal are marked, e.g. VP-VB^S.
  15. NP Flattening.
     NPs dominated by another NP are deleted, unless the child NPs are in coordination or apposition. E.g. a noun phrase modified by a prepositional phrase.

  16. Number Annotations.
     Numbers are split into five classes: CD-YR, CD-NM, CD-DC, CD-MX, CD-AL. E.g. CD-DC is a number containing a decimal point.
  17. SBAR Flattening.
     S nodes dominated by an SBAR are deleted → when the subject or object is missing, the S directly under an SBAR has a distinctive distribution.
  18. VP Flattening.
     A VP that directly dominates another VP is deleted, e.g. for "will be going" only the VP of "going" is kept.
  19. Gapped Sentence Annotation.
     Following Collins (1999) and Klein and Manning (2003), nodes with an empty subject are marked → such nodes show up because the trees are parsed automatically.
  20. Parent Annotation.
     Verb phrases are annotated with their parent's symbol → this allows conditioning on the grandparent. A VP directly under an SBAR often has no object.
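As a rough illustration of parent annotation, the sketch below relabels every VP with its parent's symbol. It assumes trees are nested tuples `(label, child, ...)` with words as strings, and it borrows the `^` separator from the VP-VB^S example on slide 14; both choices are assumptions, not the paper's implementation.

```python
def annotate_vp_parents(node, parent_label=None):
    """Return a copy of the tree with each VP relabeled as VP^<parent label>."""
    if isinstance(node, str):
        return node  # a word: nothing to annotate
    label, *children = node
    if label == "VP" and parent_label is not None:
        new_label = f"{label}^{parent_label}"
    else:
        new_label = label
    # Children see the original (unannotated) label as their parent.
    return (new_label, *(annotate_vp_parents(c, label) for c in children))

sbar = ("SBAR", ("WHNP", ("WDT", "that")), ("VP", ("VBD", "slept")))
annotated = annotate_vp_parents(sbar)
```

After this step a rule expanding `VP^SBAR` carries the grandparent-level information in its own parent symbol.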

  21. Unary Deletion.
     Rules with only a single child are deleted, except at the root and for preterminal productions → most unary rules just get in the way.
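A small sketch of unary deletion on nested-tuple trees `(label, child, ...)`. The slide does not say which symbol of a unary chain survives; this version keeps the child (the bottom-most symbol), which is an assumption, as are the tree encoding and the function name.

```python
def delete_unaries(node, is_root=True):
    """Splice out unary nodes, except the root and preterminal (tag -> word) rules."""
    if isinstance(node, str):
        return node  # a word
    label, *children = node
    children = [delete_unaries(c, False) for c in children]
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    if len(children) == 1 and not is_root and not is_preterminal:
        return children[0]  # drop this node and keep its only child
    return (label, *children)

tree = ("ROOT",
        ("S",
         ("NP", ("NP", ("NNS", "dogs"))),
         ("VP", ("VBP", "bark"))))
flat = delete_unaries(tree)
```

Note that under this reading NP → NNS and VP → VBP are themselves unary rules, so only the tags remain under S.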

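  22. The computational cost is high, but it is not a problem in practice.
     Computing the probability of an L-word sentence requires summing over all possible parse trees. Formulated as a PCFG, this is O(L^3). In practice the model would be integrated into a decoder with pruning → outside the scope of the paper, but not difficult.
     The experiments use an existing parser and compute probabilities over the 1000-best parse trees. Using only the 1-best tree gives the same task accuracy, but overestimates perplexity. The bottleneck is the parser's processing time.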
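Why the 1-best tree overestimates perplexity can be seen in a short sketch: the sentence probability is a sum over parses, so keeping only the best parse can only lower it. The log-probabilities below are made-up numbers and the function names are assumptions for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def perplexity(total_log_prob, num_words):
    """Per-word perplexity from a natural-log sentence probability."""
    return math.exp(-total_log_prob / num_words)

kbest_log_probs = [-20.0, -21.0, -23.0]  # hypothetical log p(T, w) for each parse
num_words = 5

one_best = max(kbest_log_probs)        # probability mass of the best parse only
k_best = logsumexp(kbest_log_probs)    # mass summed over the k-best list

ppl_one_best = perplexity(one_best, num_words)
ppl_k_best = perplexity(k_best, num_words)
```

Since `k_best >= one_best`, `ppl_k_best` is never larger than `ppl_one_best`; a k-best list is itself still a lower bound on the full sum over all parses.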
  23. Compare the generated sentences of 15-20 words.
     Because this is a generative model, it can generate sentences like these.

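  24. Perplexity is also lower than that of existing generative models.
     Evaluated on section 0 of the WSJ.
     Treelet = the proposed method.
     Treelet-Trans = a PCFG over trees after applying the proposed transformation rules.
     Treelet-Rule = Treelet with the lexical context removed.
     5-gram = a 5-gram language model with Kneser-Ney (KN) smoothing.
     PCFG-LA = the Berkeley Parser run in language-model mode.
     HeadLex = a head-lexicalized method similar to Model 1 of (Collins, 1999).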
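  25. The treelet LM also performs well compared with discriminative models.
     Task: classify correct sentences versus pseudo-negative examples generated from a trigram model (Okanohara and Tsujii, 2007).
     BLLIP = the corpus of (Post, 2011). 1B = PTB + BLLIP + Gigaword. LSVM = Latent SVM (Cherry and Quirk, 2008). TSG = Tree Substitution Grammar. Rerank = reranking features from (Charniak and Johnson, 2005).
     Is Treelet-Rule better because the pseudo-negative examples are generated from a 3-gram model?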
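  26. The treelet language model is also effective for machine translation.
     Task: classify English output from Moses (French to English, German to English) and Joshua (Chinese to English) versus reference sentences.
     The language models are trained on the 1B corpus and the English side of the parallel corpus.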
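  27. Summary.
     A simple syntactic language model was proposed.
     It can be estimated from large-scale data with existing smoothing methods for n-grams.
     On a variety of tasks it outperforms other generative language models and performs about as well as discriminative methods.
     It is also simple to implement!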
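  28. Q&A (1)
     Q: Shindo-san's symbol refinement is theoretically cleaner and does not require hand-crafting anything like black magic; how does this work differ from it? (Mochihashi)
     → It is simple to implement and scales to large data. (Komachi)
     Q: Then why is it fast? (Mochihashi)
     → Perhaps because parsing time is not counted as part of language model construction? (Matsubara)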
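  29. Q&A (2)
     Q: I would think some unary rules, such as those involving auxiliary verbs, are meaningful; is it really all right to remove them? (Aizawa)
     → The paper does not say that all of them are meaningless; the rule was probably introduced because it improved evaluations such as perplexity. (Komachi)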

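  30. Q&A (3)
     Q: When computing with treelets, I think the derivation is sometimes not unique; how is that handled? (Mochihashi)
     → To obtain the total probability one would presumably use something like inside-outside, but I do not recall it being stated explicitly. In the experiments, counts were computed from the 1,000-best parse trees. (Komachi)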