Large-Scale Syntactic Language Modeling with Treelets

At the 4th 最先端NLP勉強会 (Cutting-Edge NLP Study Group), I presented: Adam Pauls and Dan Klein. Large-Scale Syntactic Language Modeling with Treelets. ACL 2012.

Mamoru Komachi

August 31, 2012

Transcript

  1. Large-Scale Syntactic Language Modeling with Treelets

    Adam Pauls and Dan Klein (ACL 2012). Presented by Mamoru Komachi at the 4th 最先端NLP勉強会 (Cutting-Edge NLP Study Group), 2012/08/31
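  2. N-gram LMs cannot handle long-distance dependencies

    - Advantages of n-gram language models
      - Simple to implement
      - Robust in practice
      - Work fine at large scale
    - Disadvantage of n-gram language models
      - Cannot capture long-distance dependencies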
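  3. We propose a syntactic language model that scales and is simple to implement

    - Generative language model
      - Conditioned on treelets in the parse tree
      - Scales to large data
      - About as simple to implement as an n-gram language model
    - No need for heavily parallel computation (simplicity wins)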
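  4. We evaluate the treelet language model on a range of tasks and settings

    - Comparison with prior work
      - Outperforms n-gram language models and other generative syntactic language models based on tree structures
      - Although built from positive examples only, performs on par with discriminative models specialized for each task
    - Comparison across tasks
      - Perplexity
      - Classifying pseudo-negative examples vs. positive examples
      - Classifying machine translation output vs. reference translations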
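  5. 2. The treelet language model generates a tree from left to right

    - Assigning a probability to a tree:
      - $T$ = constituency tree (e.g. …)
      - $r = P \to C_1 \dots C_d$, where $P$ is the parent symbol of rule $r$ and $C$ are its children
      - $h$ = conditioning context (already generated)
      - For a PCFG, $h = P$: the rule is conditioned on the parent node only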
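The tree probability on slide 5 can be illustrated for the PCFG special case, where the context is just the parent symbol. The sketch below assumes the tree probability factorizes over rules as $p(T) = \prod_{r \in T} p(C_1^d \mid h)$ with $h = P$; that formula did not survive in the transcript, so treat it as an assumption consistent with the definitions of $r$, $P$, $C$, and $h$. The grammar, the probabilities, and the names `RULE_PROB` and `tree_log_prob` are made up for illustration.

```python
import math

# Toy rule probabilities p(C_1..C_d | P). All numbers are invented for illustration.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("dogs",)): 0.5,
    ("NP", ("cats",)): 0.5,
    ("VP", ("V", "NP")): 1.0,
    ("V", ("chase",)): 1.0,
}


def tree_log_prob(tree):
    """Sum log p(children | parent) over every rule in the tree.

    A tree is a tuple (label, child, child, ...); a leaf is a plain string.
    """
    if isinstance(tree, str):
        return 0.0  # a word is accounted for by the rule of its preterminal
    label, *children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    log_p = math.log(RULE_PROB[(label, child_labels)])
    return log_p + sum(tree_log_prob(c) for c in children)


tree = ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats")))
prob = math.exp(tree_log_prob(tree))  # 1.0 * 0.5 * 1.0 * 1.0 * 0.5
```

A treelet model would replace the lookup key `(label, child_labels)` with a richer context that also includes the parent's parent and the rule that generated the parent.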
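  6. The treelet language model likewise uses smoothing

    - How should it be smoothed? Use the parent ($P$) as context, in addition to $r'$, the rule that generated the parent.
    - To capture dependencies we want a long context, but to estimate probabilities from data we also want a short one.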
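  7. Three benefits of using the rule that generated the parent as context

    - Considering both $P$ and its parent $P'$ is more predictive than $P$ alone (Johnson, 1998).
    - Positional differences can be taken into account. E.g. subject and object noun phrases have different distributions (Klein and Manning, 2003) → the position of a noun relative to the verb is a good indicator for telling them apart.
    - When generating a word, we can condition on the siblings of its preterminal → this makes it possible to account for the verb's case frame.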
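  8. 2.1 The treelet language model also takes local context into account

    - Ideally we would replace the n-gram language model with a top-down, PCFG-style model … → in practice, n-gram information is important for prediction
    - To account for left-to-right context, the previous two words are added to the context → this captures collocations and lexical correlations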
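  9. 2.2 Backoff is handled separately for terminals and nonterminals

    - Nonterminals → $p(C_1^d \mid P, P', r') \to p(C_1^d \mid P, P') \to p(C_1^d \mid P) \to \lambda \prod_{i=1}^{d} p(C_i \mid C_{i-3}^{i-1}, P) + (1-\lambda) \prod_{i=1}^{d} p(C_i \mid C_{i-3}^{i-1})$
    - Terminals → [backoff chain not preserved in the transcript]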
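The nonterminal backoff on slide 9 can be sketched in two parts: the order in which contexts are tried, and the final interpolated product over children. This is only a sketch. The function names, the constant toy probabilities, and `lam=0.5` are assumptions for illustration, and the sketch does not implement whatever smoothing the paper actually uses between levels.

```python
def backoff_contexts(parent, grandparent, parent_rule):
    """Contexts for predicting a rule's children, most specific first:
    (P, P', r') -> (P, P') -> (P)."""
    return [(parent, grandparent, parent_rule), (parent, grandparent), (parent,)]


def child_ngram_mix(children, p_with_parent, p_without_parent, parent, lam=0.5):
    """lam * prod_i p(C_i | C_{i-3}^{i-1}, P) + (1 - lam) * prod_i p(C_i | C_{i-3}^{i-1})."""
    with_parent = 1.0
    without_parent = 1.0
    for i, child in enumerate(children):
        history = tuple(children[max(0, i - 3):i])  # up to three previous siblings
        with_parent *= p_with_parent(child, history, parent)
        without_parent *= p_without_parent(child, history)
    return lam * with_parent + (1.0 - lam) * without_parent


contexts = backoff_contexts("NP", "S", ("S", ("NP", "VP")))

# Constant toy distributions, hypothetical stand-ins for estimated 4-gram models.
mix = child_ngram_mix(
    ("DT", "NN"),
    p_with_parent=lambda child, history, parent: 0.5,
    p_without_parent=lambda child, history: 0.25,
    parent="NP",
    lam=0.5,
)
```

With two children, the mix is 0.5 × 0.5² + 0.5 × 0.25², so the parent-conditioned product dominates whenever it assigns the children higher probability.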
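  10. 2.3 The treelet LM needs only four probability distributions!

    - As when building an n-gram language model, count treelet frequencies and compute these distributions:
      - $p(C_1^d \mid P, P', r')$
      - $p(w \mid P, R, r', w_{-1}, w_{-2})$
      - $p(C_i \mid C_{i-n+1}^{i-1}, P)$
      - $p(C_i \mid C_{i-n+1}^{i-1})$
    - Where do the counts come from?
      - The hand-annotated Penn Treebank corpus → high quality, but small
      - Parse trees generated automatically with a parser → they contain errors, but that is not a problem because we care about the generated sentences, not the parse trees themselves.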
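The counting step on slide 10 can be sketched as a single tree traversal that records each rule together with its parent symbol, its grandparent symbol, and the rule that generated the parent. This is a minimal sketch under simplifying assumptions: trees are nested tuples, words are treated as ordinary rule yields (the slide lists a separate word distribution), and the estimate is an unsmoothed relative frequency. The names `count_rules` and `relative_freq` are hypothetical.

```python
from collections import Counter


def count_rules(tree, parent_label=None, parent_rule=None, counts=None):
    """Count events (P, P', r', children) in a tree.

    A tree is a tuple (label, child, child, ...); a leaf is a plain string.
    """
    if counts is None:
        counts = Counter()
    if isinstance(tree, str):
        return counts
    label, *children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, parent_label, parent_rule, child_labels)] += 1
    rule = (label, child_labels)  # becomes r' for the children
    for child in children:
        count_rules(child, label, rule, counts)
    return counts


def relative_freq(counts, context, children):
    """Unsmoothed estimate of p(children | P, P', r')."""
    total = sum(n for (p, pp, rp, _), n in counts.items() if (p, pp, rp) == context)
    return counts[context + (children,)] / total if total else 0.0


trees = [
    ("S", ("NP", "dogs"), ("VP", ("V", "bark"))),
    ("S", ("NP", "cats"), ("VP", ("V", "bark"))),
]
counts = Counter()
for t in trees:
    count_rules(t, counts=counts)

np_context = ("NP", "S", ("S", ("NP", "VP")))
p_dogs = relative_freq(counts, np_context, ("dogs",))
```

The same loop works whether the trees come from a treebank or from an automatic parser; only the source of `trees` changes.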
  11. Number Annotations

    - Numbers are split into five classes: CD-YR, CD-NM, CD-DC, CD-MX, and CD-AL. E.g. CD-DC: numbers containing a decimal point.
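The number annotation on slide 11 amounts to a small token classifier. The slide defines only CD-DC (numbers containing a decimal point). The rules for the other four tags below are assumptions guessed from the tag names (four-digit year, all-numeric, mixed letters and digits, all-alphabetic), not definitions taken from the slide, and `number_class` is a hypothetical name.

```python
import re


def number_class(token):
    """Map a number token to one of the five CD-* classes.

    Only the CD-DC rule comes from the slide; the rest are assumed.
    """
    if re.fullmatch(r"\d{4}", token):
        return "CD-YR"  # assumed: four digits, usually a year
    if re.fullmatch(r"\d+", token):
        return "CD-NM"  # assumed: entirely numeric
    if re.fullmatch(r"\d*\.\d+", token):
        return "CD-DC"  # from the slide: contains a decimal point
    if token.isalpha():
        return "CD-AL"  # assumed: entirely alphabetic, e.g. "twelve"
    return "CD-MX"      # assumed: mix of letters, digits, or other symbols


labels = [number_class(t) for t in ["1999", "42", "12.5", "twelve", "3rd"]]
```

Refining the preterminal this way lets the word distribution treat years, decimals, and spelled-out numbers differently.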
  12. Gapped Sentence Annotation

    - Following Collins (1999) and Klein and Manning (2003), nodes that have an empty subject are marked. → Such nodes arise because the trees are parsed automatically.
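  13. The computational cost is high, but it is not a problem in practice

    - Computing the probability of an L-word sentence requires summing over all possible parse trees
      - Formulated as a PCFG, this is O(L^3)
      - In practice, the model is integrated into a decoder and pruned
      - → Outside the scope of the paper, but not difficult
    - In these experiments, an existing parser is used and probabilities are computed over the 1000-best parse trees
      - Using only the 1-best tree does not change task accuracy, but it overestimates perplexity
      - The bottleneck is the parser's processing time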
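The n-best approximation on slide 13 can be sketched as a log-space sum over the candidate trees for a sentence. Because any subset of trees carries less probability mass than the full sum, using only the 1-best tree lowers the sentence probability and therefore raises the measured perplexity. The tree probabilities and the word count below are made up, and normalizing by the number of words is an assumption about how perplexity is computed.

```python
import math


def log_sum_exp(log_values):
    """Numerically stable log(sum(exp(v) for v in log_values))."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values))


def sentence_log_prob(tree_log_probs):
    """Approximate log p(sentence) by summing over the available parse trees."""
    return log_sum_exp(tree_log_probs)


def perplexity(sentence_log_probs, num_words):
    """exp of the negative average log-probability per word."""
    return math.exp(-sum(sentence_log_probs) / num_words)


# Hypothetical joint probabilities p(tree, sentence) for three candidate parses.
nbest = [math.log(0.02), math.log(0.01), math.log(0.005)]
num_words = 4

ppl_nbest = perplexity([sentence_log_prob(nbest)], num_words)
ppl_1best = perplexity([sentence_log_prob(nbest[:1])], num_words)
```

Here `ppl_1best` is larger than `ppl_nbest`, which matches the slide's remark that 1-best scoring overestimates perplexity.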
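  14. Perplexity is also lower than that of existing generative models

    - Evaluated on WSJ section 0
    - Treelet = the proposed method
    - Treelet-Trans = a PCFG over trees after applying the proposed transformation rules
    - Treelet-Rule = Treelet with the lexical context removed
    - 5-gram = 5-gram language model with KN smoothing
    - PCFG-LA = the Berkeley Parser run in language-model mode
    - HeadLex = head lexicalization similar to Model 1 of (Collins, 1999)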
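  15. The treelet LM performs well even compared with discriminative models

    - Task: classifying pseudo-negative examples generated from a trigram model (Okanohara and Tsujii, 2007) vs. correct sentences
    - BLLIP = the corpus of (Post, 2011)
    - 1B = PTB + BLLIP + Gigaword
    - LSVM = Latent SVM (Cherry and Quirk, 2008)
    - TSG = Tree Substitution Grammar
    - Rerank = reranking features from (Charniak and Johnson, 2005)
    - Is Treelet-Rule better because the pseudo-negative examples are generated from a 3-gram model?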
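  16. The treelet language model is also effective for machine translation

    - Task: classifying English output from Moses (French to English, German to English) and Joshua (Chinese to English) vs. reference sentences
    - The language model is trained on the 1B corpus and the English side of the parallel corpus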
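  17. Q&A (3)

    - Q (Mochihashi, 持橋): When computing with treelets, I think there are cases where the derivation is not unique. How is that handled?
    - A (Komachi, 小町): To get the total probability I would expect something like inside-outside to be used, but I don't recall it being stated explicitly. In the experiments, frequencies were computed using the 1,000-best parse trees.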