The Morphological Analyzer Sudachi as Open Source / WAP NLP Tech Talk #4

Sorami Hisamoto

November 26, 2021

Transcript

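 1. Announcement ②: https://jedworkshop.github.io/jed…
    Call for presentations opening soon
    Invited talk: Taichi Kakinuma (attorney, STORIA Law Office)
    • Specializes in startup law and in data and AI law.
    • Currently supports many AI startups in a range of fields (healthcare, manufacturing, platform businesses, and others) as their legal advisor.
    • Member of the Ministry of Economy, Trade and Industry study group on the "AI and Data Contract Guidelines"; expert committee member of the Japan Deep Learning Association (JDLA); winner of the Grand Prix in the IP expert category of the IP BASE AWARD.
    Reference: "Natural-language AI services and copyright infringement" | STORIA Law Office (blog post)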
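 2. What is Sudachi?
    • A "morphological analyzer" and a "dictionary"
    • Both the code and the dictionary are open
    • Apache-2.0: commercial use, modification, redistribution, and so on are all OK!
    Sudachi Family
    • Sudachi ☕, SudachiDict 📚
    • SudachiPy 🐍, sudachi.rs 🦀
    • elasticsearch-sudachi, chiVe, SudachiTra, ViSudachi, …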
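 3. OSS from a for-profit company
    • Works Applications backs Sudachi fully, as a company
    • Maintenance is planned for at least the next … years (continuous dictionary updates are especially important)
    A benevolent dictator for life? (Benevolent Dictator For Life)
    Kazuma Takaoka (@kazuma-t, github.com/kazuma-t)
    • Former maintainer of ChaSen (茶筌)
    • Has been involved in developing multiple morphological analyzers (!)
    • Corpus development, kana-kanji conversion, text mining, …
    Inside the company there are many other experts as well, including the author of Juman++ and a creator of the ATOK dictionary.
    We want to move forward together with the wider community rather than stay closed inside the company.
    → Even if you are not from Tokushima!
    And so: open source
    I want to see a bustling bazaar, not a cathedral (my personal opinion).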
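 4. Peeking inside the code and the dictionary
    • A morphological analyzer is a treasure trove of computer science
    • Memory, character encodings, data structures, shortest-path search on graphs, …
    • Small enough for one person to understand in full (unlike the Linux kernel)
    A reasonable place to start tracing the code (Java and Python share roughly the same structure):
    src/main/.../SudachiCommandLine.java
    sudachipy/command_line.py
    Rough flow: create a Dictionary → create a Tokenizer → analyze the text → output the result
    The structure of the dictionary binary is also summarized in "Implementing natural language processing tools in Rust: the morphological analyzer sudachi.rs" (Qiita)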
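The rough flow on this slide (Dictionary → Tokenizer → analyze → output), and the shortest-path search it lists among the computer science topics, can be sketched in a few lines of Python. This is a toy illustration and not Sudachi's code: the `ToyDictionary` and `ToyTokenizer` classes, the word costs, and the flat unknown-character penalty are all assumptions invented for the example, and connection costs between adjacent words are left out.

```python
from dataclasses import dataclass


@dataclass
class ToyDictionary:
    """Hypothetical in-memory dictionary: surface form -> word cost."""
    entries: dict

    def lookup(self, text, start):
        """Yield (surface, cost) for every entry that matches text at `start`."""
        for surface, cost in self.entries.items():
            if text.startswith(surface, start):
                yield surface, cost


class ToyTokenizer:
    # Assumed flat penalty for treating a single character as an unknown word.
    UNKNOWN_COST = 100

    def __init__(self, dictionary):
        self.dictionary = dictionary

    def tokenize(self, text):
        n = len(text)
        # best[i] = (lowest total cost of a path ending at position i, back pointer)
        best = [(float("inf"), None)] * (n + 1)
        best[0] = (0, None)
        for i in range(n):
            cost_so_far = best[i][0]
            candidates = list(self.dictionary.lookup(text, i))
            candidates.append((text[i], self.UNKNOWN_COST))  # fallback edge
            for surface, cost in candidates:
                j = i + len(surface)
                if cost_so_far + cost < best[j][0]:
                    best[j] = (cost_so_far + cost, (i, surface))
        # Follow the back pointers from the end of the text to recover the path.
        tokens = []
        pos = n
        while pos > 0:
            prev, surface = best[pos][1]
            tokens.append(surface)
            pos = prev
        return tokens[::-1]


# 1. create a dictionary, 2. create a tokenizer, 3. analyze, 4. output
dictionary = ToyDictionary(
    {"東京": 10, "京都": 10, "東京都": 15, "東": 20, "都": 20, "に": 5, "行く": 10}
)
tokenizer = ToyTokenizer(dictionary)
print(" / ".join(tokenizer.tokenize("東京都に行く")))
```

Because every edge points forward in the text, the lattice is a directed acyclic graph, so a single left-to-right pass is enough to find the cheapest path.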
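 5. References that help when reading a morphological analyzer
    • Taku Kudo, 『形態素解析の理論と実装』 (Theory and Implementation of Morphological Analysis), Kindai Kagaku Sha
    • Hiroyuki Tokunaga, 『日本語入力を支える技術』 (The Technology Behind Japanese Input), Gijutsu-Hyoron Sha
    • "A peek behind the scenes of Japanese morphological analysis! How MeCab performs morphological analysis", Cookpad Developer Blog (by @a_bicky)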
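 6. Write some code, send a pull request
    For example, try building a plugin…
    github.com/WorksApplications/Sudachi/blob/develop/docs/tutorial_plugin.md
    Plugin types
    • Input text modification (InputText)
      • Character normalization (following the definitions in rewrite.def)
      • Prolonged sound mark normalization ("ゴー〜〜ル" → "ゴール")
    • Unknown word handling (OOV)
    • Word connection handling (ConnectCost)
      • Prohibiting connections between parts of speech
    • Output path modification (PathRewrite)
      • Merging unknown katakana words
      • Merging numerals (example: an amount in 円 and 銭)
    For instance, "a plugin that infers which part of the text is a person's name from honorifics and the surrounding context" (I would love to have one)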
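To make the prolonged sound mark example concrete, here is a minimal Python sketch of the normalization idea ("ゴー〜〜ル" → "ゴール"). It is not the Sudachi plugin itself: the function name, the set of characters treated as prolonged sound marks, and the rule of collapsing only runs of two or more marks are assumptions chosen for illustration.

```python
import re

# Assumed set of characters treated as prolonged sound marks (hypothetical).
PROLONGED_SOUND_MARKS = "ー〜～〰"
# Every run of marks is replaced with this single character.
REPLACEMENT = "ー"

# A run of two or more consecutive marks from the set above.
_RUN = re.compile("[" + re.escape(PROLONGED_SOUND_MARKS) + "]{2,}")


def normalize_prolonged_sound_marks(text: str) -> str:
    """Collapse each run of two or more prolonged sound marks into one 'ー'."""
    return _RUN.sub(REPLACEMENT, text)


print(normalize_prolonged_sound_marks("ゴー〜〜ル"))
```

A real input-text plugin would also need to remember how positions in the rewritten text map back to the original input, which this sketch leaves out.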
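 7. For-profit companies presumably have all sorts of motives
    • It is not that unusual; there are plenty of examples
      • TensorFlow (Google), React (Meta), TypeScript (Microsoft), …
    • All kinds of reasons
      • Build what you need yourselves and, while you are at it, publish it
      • Lead the technology stack in-house instead of leaving it to other companies
      • Promote it for PR and hiring
      • Get help from people outside, …
    • One reason: contributing to the community
      • Everyone benefits from OSS
      • Sudachi, too, stands on the shoulders of giants: ChaSen, MeCab, Kuromoji, UniDic, NEologd, …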
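 8. Legalscape sponsors "法律" (law)
    Through the Unicode Consortium's Adopt a Character program, Legalscape sponsors the two characters 法 (U+6CD5) and 律 (U+5F8B).
    Source: "Legalscape sponsors 法律" | Legalscape | note
    "With the spread of Unicode, much of the world's text data has come to be represented in Unicode-derived character encodings such as the UTF family, and we feel that people are now troubled less often both by garbled characters (mojibake) and by the mechanical processing of text data. It is no exaggeration to call Unicode one element of the 'infrastructure' that underpins today's web technology."
    https://twitter.com/komiya_atsushi/status/…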
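As a small aside on the code points quoted on this slide, the following Python sketch prints each sponsored character with its code point and encoded bytes. The helper name `describe` is hypothetical, and showing UTF-8 is simply one choice of encoding for the illustration.

```python
def describe(ch: str) -> tuple:
    """Return the 'U+XXXX' code point and the UTF-8 bytes (as hex) of one character."""
    code_point = f"U+{ord(ch):04X}"
    utf8_hex = ch.encode("utf-8").hex(" ").upper()
    return code_point, utf8_hex


for ch in "法律":
    code_point, utf8_hex = describe(ch)
    print(ch, code_point, utf8_hex)
# 法 U+6CD5 E6 B3 95
# 律 U+5F8B E5 BE 8B
```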
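 9. Balancing OSS and business: the Mapbox example
    • A map services company
    • Made enormous contributions through OSS
    • Mapbox GL JS
    • → A later version went non-OSS
    "History shows that in an open source project, unless a 'fair' relationship can be built in which the largest contributor is also the largest beneficiary, the project will sooner or later either collapse or leave the community (for example, by going non-OSS)."
    "For management, pressure from investors is probably far greater than the community's reaction to such a departure, and some would also argue that, when targeting enterprise customers, intellectual property rights should be under complete control. In that situation it is quite hard to explain, in commercial terms, the benefit of continuing to offer the software under an OSS license."
    "This move away from OSS is very regrettable, but I am sincerely grateful for Mapbox's tremendous contributions to the geospatial community so far and, as one member of the colorful long tail, I want to cheer on its future success."
    Source: "Mapbox GL JSの非OSS化について" (On Mapbox GL JS going non-OSS), 横浜スローライフ: My slow life in Yokohama (emphasis added)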
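 10. On the other hand, there are examples like GitLab
    • An OSS project that began in Ukraine
    • Became a company and, years later, headed for an IPO
    • Now on pace to reach a trillion-yen-scale market capitalization
    • Uses the stock company as a means to an end
    "To sum up: a project that a Ukrainian developer started as open source quickly attracted collaborators from around the world; a tweet led to co-founding a company with a Dutch partner; and after joining a prestigious Silicon Valley accelerator, it achieved the kind of growth that leads to a major exit. GitLab's valuation in its last funding round was $…B (¥… hundred million), and after the IPO it has grown to the point where it looks set to reach the trillion-yen mark, so-called decacorn territory."
    "'Open source and the stock company' can look like oil and water when you consider that the open source movement began as an antithesis to capitalism. Yet stock companies are delivering results in healthcare, welfare, education, agriculture, and other fields. I do not see why open source alone should be different. When you organize development activity, achieve sustainable growth, and aim to maximize impact, the stock company is a powerful vehicle. GitLab's listing seems to show exactly that."
    Source: "GitLab, a personal project from Ukraine, heads for a trillion-yen-scale IPO, and the lessons it offers" | Coral Capital (emphasis added)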
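 11. The utopia found in work that is like play
    "As a student I read Marcuse, and recently I reread him for the first time in a long while. Parts that used to be so difficult I could make nothing of them now made perfect sense, and it was so interesting that I ended up staying up late. Marcuse says: 'Do not seek utopia outside of labor. Seek utopia within labor.' Labor you merely put up with is alienated labor, and of course there is no utopia in it. But there is another kind: labor that is like play. Labor you do freely, on your own initiative, not because someone forces you; labor that satisfies a desire rather than suppressing one; labor in which you can take joy in discovering your own abilities. He says utopia lies in that kind of free labor, where you cannot tell whether it is play or work.
    Reading that cleared things up for me. Unlike most people, I could never bring myself to work earnestly and hard at a job I disliked, and I secretly carried a complex about it, wondering whether I was lazy by nature. Reading this convinced me that I was right."
    Brushing on lacquer with a practiced hand, Yutaka Inamoto said this and laughed without a care.
    Yutaka Inamoto, lacquerer at Oak Village: "The green life of men who, in a remote, overgrown village in Takayama, Hida, talk things over with an older brother and friends and build a village for making wooden furniture."
    From Takashi Tachibana, 『青春漂流』 (Kodansha Bunko), a documentary work based on interviews with young people in various occupations (emphasis added)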
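 12. Just for Fun: let's do fun things
    • 『それがぼくには楽しかったから』, Shogakukan Production
    • Original: "Just for Fun: The Story of an Accidental Revolutionary" by Linus Torvalds and David Diamond; translated by Jun Kazami, supervised by Hiroshi Nakajima
    As a result,
    • your life might change?
    • it might end up benefiting everyone?
    Open source is fun!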