Large-Scale Syntactic Language Modeling with Treelets

At the 4th Saisentan NLP Benkyokai (最先端NLP勉強会, a study group on cutting-edge NLP), I presented: Adam Pauls and Dan Klein. Large-Scale Syntactic Language Modeling with Treelets. ACL 2012.

Mamoru Komachi

August 31, 2012

Transcript

  1. 1.

    Large-Scale Syntactic Language Modeling with Treelets
    Adam Pauls and Dan Klein (ACL 2012)
    Presented by Mamoru Komachi at the 4th Saisentan NLP Benkyokai (最先端NLP勉強会), 2012/08/31
  2. 2.

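    N-gram LMs cannot handle long-distance dependencies
    - Advantages of n-gram language models
      - Simple to implement
      - Robust in practice
      - Work fine at large scale
    - Disadvantage of n-gram language models
      - Cannot capture long-distance dependencies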
  3. 3.

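    We propose a syntactic language model that scales and is simple to implement
    - A generative language model
      - Conditioned on treelets in the parse tree
      - Scales to large data
      - About as simple to implement as an n-gram language model
    - No need for massively parallel computation (simple is good)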
  4. 4.

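    We evaluate the treelet language model on a variety of tasks and settings
    - Comparison with prior work
      - Outperforms n-gram language models and other generative syntactic language models based on tree structure
      - Although built from positive examples only, performs on par with discriminative models specialized for each task
    - Comparison across tasks
      - Perplexity
      - Classifying pseudo-negative examples versus positive examples
      - Classifying machine translation output versus reference translations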
  5. 5.

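    2. The treelet language model generates a tree from left to right
    - Assigning a probability to a tree:
      - T = constituency tree (e.g.
      - r = P → C1 … Cd
      - P = parent symbol of rule r
      - C = children
      - h = conditioning context (already generated)
      - For a PCFG, the rule is conditioned only on h = P (the parent node)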
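The formula announced on this slide is not in the transcript. A hedged reconstruction from the symbols it defines, assuming each rule r in T contributes one factor conditioned on its context h:

```latex
p(T) = \prod_{r \in T} p(C_1^d \mid h),
\qquad r = P \rightarrow C_1 \cdots C_d

% PCFG special case stated on the slide: h = P
p_{\mathrm{PCFG}}(T) = \prod_{r \in T} p(C_1^d \mid P)
```

Here $C_1^d$ abbreviates the child sequence $C_1 \cdots C_d$, the same notation used for the backoff distributions on the later slides.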
  6. 7.

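    The treelet language model is smoothed as well
    - How should it be smoothed?
      - Use the parent (P) as context, in addition to r', the rule that generated the parent.
    - To capture dependencies we want a long context, but to estimate probabilities from data we also want a short one.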
  7. 8.

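    Three benefits of using the rule that generated the parent as context
    - Considering both P and its parent P' is more predictive than P alone (Johnson, 1998).
    - Differences due to position can be taken into account. E.g. subject and object noun phrases have different distributions (Klein and Manning, 2003) → the position of the noun relative to the verb is a good indicator for telling them apart.
    - When generating a word, we can condition on the siblings of the preterminal → this makes it possible to take the verb's subcategorization (case) frame into account.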
  8. 9.

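    2.1 The treelet language model also takes local context into account
    - Ideally we would replace the n-gram language model with a top-down, PCFG-style model … → in practice, n-gram information is important for prediction.
    - To account for left-to-right context, the two preceding words are added to the context → this captures collocations and lexical correlations.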
  9. 10.

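    2.2 Backoff is handled separately for terminals and non-terminals
    - Non-terminals →
      $p(C_1^d \mid P, P', r') \rightarrow p(C_1^d \mid P, P') \rightarrow p(C_1^d \mid P)$
      $\lambda \prod_{i=1}^{d} p(C_i \mid C_{i-3}^{i-1}, P) + (1 - \lambda) \prod_{i=1}^{d} p(C_i \mid C_{i-3}^{i-1})$
    - Terminals →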
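The interpolated term on this slide mixes two 4-gram models over child symbols, one conditioned on the parent P and one not. The sketch below is a minimal illustration, assuming unsmoothed relative-frequency estimates, a hypothetical `<s>` padding symbol, no end-of-sequence symbol, and an arbitrary λ of 0.5; none of these choices come from the slides.

```python
from collections import Counter

PAD = "<s>"  # hypothetical start-padding symbol, not from the slides


def train(rules, order, with_parent):
    """Count contexts and (context, symbol) events over child sequences."""
    ctx_counts, event_counts = Counter(), Counter()
    for parent, children in rules:
        padded = [PAD] * (order - 1) + list(children)
        for i in range(order - 1, len(padded)):
            ctx = tuple(padded[i - order + 1:i])
            if with_parent:
                ctx = (parent,) + ctx
            ctx_counts[ctx] += 1
            event_counts[ctx + (padded[i],)] += 1
    return ctx_counts, event_counts


def seq_prob(parent, children, model, order, with_parent):
    """Product over i of p(C_i | previous order-1 symbols [, P]), by relative frequency."""
    ctx_counts, event_counts = model
    padded = [PAD] * (order - 1) + list(children)
    prob = 1.0
    for i in range(order - 1, len(padded)):
        ctx = tuple(padded[i - order + 1:i])
        if with_parent:
            ctx = (parent,) + ctx
        if ctx_counts[ctx] == 0:
            return 0.0  # unseen context; a real model would smooth here
        prob *= event_counts[ctx + (padded[i],)] / ctx_counts[ctx]
    return prob


def mixture_prob(parent, children, with_p, without_p, lam=0.5, order=4):
    """lam * prod p(C_i | ctx, P) + (1 - lam) * prod p(C_i | ctx)."""
    return (lam * seq_prob(parent, children, with_p, order, True)
            + (1 - lam) * seq_prob(parent, children, without_p, order, False))


rules = [("NP", ["DT", "NN"]), ("NP", ["DT", "JJ", "NN"]), ("VP", ["VBD", "NP"])]
with_p = train(rules, 4, True)
without_p = train(rules, 4, False)
p = mixture_prob("NP", ["DT", "NN"], with_p, without_p)
```

With these three toy rules, the parent-conditioned product is 1 × 1/2 and the unconditioned one is 2/3 × 1/2, so the mixture with λ = 0.5 is 5/12.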
  10. 11.

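    2.3 The treelet LM needs only four probability distributions!
    - Just as when building an n-gram language model, count treelet frequencies and compute the following distributions:
      - $p(C_1^d \mid P, P', r')$
      - $p(w \mid P, R, r', w_{-1}, w_{-2})$
      - $p(C_i \mid C_{i-n+1}^{i-1}, P)$
      - $p(C_i \mid C_{i-n+1}^{i-1})$
    - Where do the counts come from?
      - The hand-annotated Penn Treebank corpus → high quality, but small
      - Parse trees produced automatically by a parser → these contain errors, but that is not a problem because we are interested in the generated sentences, not in the parse trees themselves.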
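Counting treelet frequencies can be sketched in a few lines. The example below is only an illustration, assuming trees are nested tuples of the form `(label, child, ...)` with preterminals written as `(tag, word)`; it collects counts for the first distribution, $p(C_1^d \mid P, P', r')$, and the helper names are hypothetical.

```python
from collections import Counter


def is_preterminal(node):
    """A preterminal is (tag, word): one child, and that child is a string."""
    return len(node) == 2 and isinstance(node[1], str)


def child_labels(node):
    """Labels C_1 ... C_d of a non-terminal node's children."""
    return tuple(child[0] for child in node[1:])


def count_rule_events(tree, counts=None):
    """Count (context, children) events with context = (P, P', r')."""
    counts = Counter() if counts is None else counts

    def visit(node, parent_label, parent_rule):
        if is_preterminal(node):
            return  # word generation is a separate distribution
        label = node[0]
        rule = (label, child_labels(node))
        counts[((label, parent_label, parent_rule), rule[1])] += 1
        for child in node[1:]:
            # For the child, this node is P' and this rule is r'.
            visit(child, label, rule)

    visit(tree, None, None)
    return counts


tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VBD", "barked")))
counts = count_rule_events(tree)
s_rule = ("S", ("NP", "VP"))
np_event = (("NP", "S", s_rule), ("DT", "NN"))
```

Dividing each event count by the total count of its context gives a relative-frequency estimate; a real model would apply smoothing and backoff on top of these counts.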
  11. 16.

    Number Annotations
    - Numbers are split into five classes: CD-YR, CD-NM, CD-DC, CD-MX, CD-AL. E.g. CD-DC: numbers containing a decimal point.
  12. 19.

    Gapped Sentence Annotation
    - Following Collins (1999) and Klein and Manning (2003), nodes with an empty subject are marked. → Such nodes show up because the trees are parsed automatically.
  13. 22.

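    The computational cost is high, but it is not a problem in practice
    - Computing the probability of a sentence of L words requires summing over all possible parse trees.
      - O(L^3) when formulated as a PCFG
      - In practice, the model would be integrated into a decoder with pruning.
      - → Outside the scope of this paper, but not difficult
    - The experiments here use an existing parser and compute the probability over the 1000-best parse trees.
      - Using only the 1-best tree does not change task accuracy, but it overestimates perplexity.
      - The bottleneck is the parser's processing time.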
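The effect of the k-best approximation on perplexity can be illustrated with a short sketch. It assumes each tree's joint log-probability is already available; the function names and the toy numbers are hypothetical, not taken from the paper.

```python
import math


def log_sum_exp(log_values):
    """Numerically stable log of a sum of exponentials."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values))


def sentence_log_prob(tree_log_probs, k=None):
    """Approximate log p(sentence) by summing over the k best trees (all if k is None)."""
    best = sorted(tree_log_probs, reverse=True)[:k]
    return log_sum_exp(best)


def perplexity(sentence_log_probs, num_words):
    """exp of the negative average log-probability per word."""
    return math.exp(-sum(sentence_log_probs) / num_words)


# Toy joint probabilities p(sentence, tree) for three parses of one 5-word sentence.
tree_log_probs = [math.log(0.004), math.log(0.002), math.log(0.001)]
lp_all = sentence_log_prob(tree_log_probs)        # sums all three trees
lp_one = sentence_log_prob(tree_log_probs, k=1)   # 1-best only
ppl_all = perplexity([lp_all], 5)
ppl_one = perplexity([lp_one], 5)
```

Because the 1-best sum leaves out probability mass from the other trees, `lp_one` is lower than `lp_all` and `ppl_one` is higher than `ppl_all`, which is the overestimate mentioned on the slide.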
  14. 24.

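    Perplexity is also lower than that of existing generative models
    - Evaluated on section 0 of the WSJ
      - Treelet = the proposed method
      - Treelet-Trans = a PCFG on trees after applying the proposed transformation rules
      - Treelet-Rule = Treelet with the lexical context removed
      - 5-gram = a 5-gram language model with KN smoothing
      - PCFG-LA = the Berkeley Parser run in language-model mode
      - HeadLex = a head-lexicalized model similar to Model 1 of (Collins, 1999)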
  15. 25.

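    The treelet LM performs well even compared with discriminative models
    - Task: classifying pseudo-negative examples generated from a trigram model (Okanohara and Tsujii, 2007) versus correct sentences
      - BLLIP = the corpus of (Post, 2011)
      - 1B = PTB + BLLIP + Gigaword
      - LSVM = Latent SVM (Cherry and Quirk, 2008)
      - TSG = Tree Substitution Grammar
      - Rerank = reranking features from (Charniak and Johnson, 2005)
    - Is Treelet-Rule better because the pseudo-negative examples are generated from a 3-gram model?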
  16. 26.

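    The treelet language model is also effective for machine translation
    - Task: classifying English sentences output by Moses (French to English, German to English) and Joshua (Chinese to English) versus reference sentences
    - The language models are trained on the 1B corpus and the English side of the parallel corpus.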
  17. 30.

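    Q&A (3)
    - When computing with treelets, I would expect cases where the derivation is not unique. How is that handled? (Mochihashi)
      → I think something like inside-outside would be needed to obtain the total probability, but I do not recall it being stated explicitly. In the experiments, they computed counts using the 1,000-best parse trees. (Komachi)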