Slide 1

Slide 1 text

Related Phrase Extraction with Maximal Substrings and Its Applications
id:takuya-a
@takuya_a
Tech study session, 2017-01-26

Slide 2

Slide 2 text

id:takuya-a
Application engineer
Joined the company in April 2015
Interests
• Information retrieval
• Natural language processing
• Machine learning
OSS work
• kuromoji.js and others

Slide 3

Slide 3 text

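Automatic keyphrase extraction
• An NLP task that automatically selects phrases from a document's body that are both important and on topic (Turney 2000)
Example text (http://anond.hatelabo.jp/20160215171759):
  何なんだよ日本。一億総活躍社会じゃねーのかよ。昨日見事に保育園落ちたわ。
  ("What the hell, Japan. Weren't we supposed to be a society where all 100 million people play an active role? Yesterday my child got flatly rejected from daycare.")
• Named entity recognition (NER)
  • Also extracts things that are unrelated to the topic
• Entity linking
  • Links keyphrases to entities (e.g. Wikipedia titles)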

Slide 4

Slide 4 text

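Automatic keyphrase extraction
• Single words (morphemes) are too fine-grained a unit
• Morphological analysis alone sometimes fails to produce meaningful chunks
  • Person name: 上_白石_萌_音 (Kamishiraishi Mone)
  • Place name: 烏丸_御池 (Karasuma Oike)
  • Title of a work: あの_日_見_た_花_の_名前_を_僕達_は_まだ_知ら_ない_。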

Slide 5

Slide 5 text

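Automatic keyphrase extraction
• A highly practical NLP task
• Any service that handles text has plenty of uses for it
  • Bookmark, Blog, Kakuyomu, BS Hatena, ...
• Handling words in units closer to their meaning enables "deeper" analysis
  • (Fine-grained units are sometimes sufficient, so it depends on the use case)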

Slide 6

Slide 6 text

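Automatic keyphrase extraction
• Hard to call it a solved problem
• Even the best-performing methods, on English, score low (Hasan & Ng 2014)
  • Precision: 27.2 ~ 35.0%
  • Recall: 27.8 ~ 66.0%
  • F-score: 27.5 ~ 45.7%
  • (Varies quite a bit by corpus)
• Japanese is likely to be even harder
We want to aim for accuracy that holds up in a product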

Slide 7

Slide 7 text

Well-known APIs
• Yahoo! JAPAN Text Analysis Web API
  • Keyphrase Extraction API
• Microsoft Cognitive Services
  • Text Analytics API
• There may be others

Slide 8

Slide 8 text

Yahoo! Keyphrase Extraction API
• http://developer.yahoo.co.jp/webapi/jlp/keyphrase/v1/extract.html

Slide 9

Slide 9 text

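Yahoo! Keyphrase Extraction API
• Tried on the "rejected from daycare" post
"保育園" (daycare): 100,
"児童手当" (child allowance): 79,
"賄賂" (bribes): 76,
"ゴマン" (fragment of "tons of people"): 63,
"半分位クビ" (fire about half): 58,
"子供産むやつなんかい" (sentence fragment: "anyone who would have kids"): 55,
"エンブレム" (emblem): 55,
"ウチワ" (paper fans): 54,
"国会議員" (Diet members): 48,
"少子化" (declining birthrate): 47,
"オリンピック" (Olympics): 45,
"デザイナー" (designer): 44,
"日本" (Japan): 42,
"ムシ" (fragment): 41,
"財源" (funding source): 40,
"児童手当20万" (child allowance of 200,000): 39,
"無償" (free of charge): 36,
"一億総活躍社会" (society where all 100 million play an active role): 35,
"費用全て" (all costs): 34,
"税金" (taxes): 33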

Slide 10

Slide 10 text

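Yahoo! Keyphrase Extraction API
Input text (from https://ja.wikipedia.org/wiki/あの日見た花の名前を僕達はまだ知らない。):
"『あの日見た花の名前を僕達はまだ知らない。』 is a Japanese TV anime produced by A-1 Pictures. It aired from April to June 2011 in Fuji TV's noitaminA block and elsewhere, 11 episodes in total. It is abbreviated as 「あの花」 or 「あのはな」. It was adapted into a manga in 2012, and a theatrical film was released on August 31, 2013."
• The title is not extracted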

Slide 11

Slide 11 text

Yahoo! Keyphrase Extraction API

Slide 12

Slide 12 text

MS Cognitive Services • https://text-analytics-demo.azurewebsites.net/

Slide 13

Slide 13 text

http://anond.hatelabo.jp/20160215171759

Slide 14

Slide 14 text

MS Cognitive Services
Input: https://ja.wikipedia.org/wiki/あの日見た花の名前を僕達はまだ知らない。
• 「あの日見た…」 is not extracted here either

Slide 15

Slide 15 text

MS Cognitive Services

Slide 16

Slide 16 text

Implementing keyphrase extraction in-house

Slide 17

Slide 17 text

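Implementing keyphrase extraction
• We still want to implement it ourselves
  • We want to tune it for each use case
• Hatena Bookmark's large-scale text data is available
  • Every document is already in Elasticsearch!!
• A job for string algorithms
  • Get realistic accuracy in realistic computation time
  • Keep costs down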

Slide 18

Slide 18 text

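Cases where computational cost becomes a problem
1. Handling very large texts
2. Extracting keyphrases from an entire document set
  • A document set narrowed down by some criterion
  • Example: compute keyphrases from the entries of the last 3 days
  • Example: compute keyphrases from the search results for a keyword (e.g. あの花)
Because this can be applied to the result of any Elasticsearch query, it has many applications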

Slide 19

Slide 19 text

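Keyphrase extraction pipeline
1. Candidate keyphrase extraction
  • Taking every substring as a candidate gives on the order of n^2 candidates
    • n start positions x n lengths
  • We need a way to cover the correct keyphrases while keeping the computational cost down
  • We want the later pipeline stages to run fast
2. Keyphrase scoring
  • Assign a score to each candidate keyphrase
  • Select the top-scoring keyphrases by a score threshold, a count limit, etc.
(For details, see Hasan & Ng 2014, Section 3, "Keyphrase Extraction Approaches")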

Slide 20

Slide 20 text

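Phase 1: Candidate keyphrase extraction
• Naive approach: considering every substring is O(n^2)
  • In practice you end up cutting off at around 5 tokens
  • Long phrases cannot be captured
• Conventional approach: heuristic, rule-based
  1. Filter out stop words using a dictionary (Liu+ 2009)
  2. Select sequences that match part-of-speech patterns (Mihalcea & Tarau 2004, Wan & Xiao 2008, Liu+ 2009)
  3. Select sequences that match lexico-syntactic patterns (Nguyen and Phan 2009)
  4. Select sequences that match substrings (n-grams) of Wikipedia titles (Grineva+ 2009)
(For details, see Hasan & Ng 2014, Section 3.1, "Selecting Candidate Words and Phrases")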
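As a rough illustration of the naive approach with a length cutoff, the sketch below counts every token n-gram up to a fixed length. The function name `ngram_candidates`, the `max_len=5` and `min_count=2` defaults, and the toy input are assumptions made for this example, not something taken from the talk.

```python
from collections import Counter

def ngram_candidates(tokens, max_len=5, min_count=2):
    """Count every token n-gram of length 1..max_len and keep the frequent ones."""
    counts = Counter()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            counts[tuple(tokens[start:start + length])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy input: the 14-token title repeated twice (hypothetical data).
tokens = "あの 日 見 た 花 の 名前 を 僕達 は まだ 知ら ない 。".split()
candidates = ngram_candidates(tokens * 2, max_len=5)
```

With the cutoff at 5 tokens, the full 14-token title can never appear among the candidates even though it occurs twice, which is the limitation the slide points out.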

Slide 21

Slide 21 text

Challenges in enumerating candidate keyphrases
• Accuracy does not improve unless parts of speech are examined at a fine granularity

Slide 22

Slide 22 text

Guess-the-part-of-speech game
• Noun? Verb? Particle? Auxiliary verb? Adverb? Adnominal?
• 「そう」: ???
• 「いわく」: ???
• 「ご覧」: ???
• 「きらびやか」: ???

Slide 23

Slide 23 text

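Guess-the-part-of-speech game
• Noun? Verb? Particle? Auxiliary verb? Adverb? Adnominal?
• 「そう」: noun
• 「いわく」: noun
• 「ご覧」: noun
• 「きらびやか」: noun
All of them are nouns!
Source: それはたぶんあなたの欲しかった名詞ではない ("That is probably not the noun you wanted") - 押してダメならふて寝しろ http://ikawaha.hateblo.jp/entry/2016/05/20/155504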

Slide 24

Slide 24 text

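Challenges in enumerating candidate keyphrases
• Accuracy does not improve unless parts of speech are examined at a fine granularity
  • 「そう」「いわく」「ご覧」「きらびやか」
  • According to IPADic, these are all nouns
• Including part-of-speech subcategories in the patterns leaves less data per pattern
  • The curse of dimensionality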

Slide 25

Slide 25 text

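Challenges in enumerating candidate keyphrases (2)
• Keeping up with the constant stream of new words and new patterns is hard
  • 君_の_名_は_。
  • Keep updating the dictionary with weekly or daily batches??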

Slide 26

Slide 26 text

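Challenges in enumerating candidate keyphrases: summary
• Maintaining dictionaries and rules is hard
• Parts of speech have to be examined finely, but
  • the finer you look, the less data you have
• We want the method to adapt to the domain from the text collection itself
• In this phase, coverage (recall) matters most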

Slide 27

Slide 27 text

Candidate keyphrase extraction with maximal substrings

Slide 28

Slide 28 text

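Candidate keyphrase extraction with maximal substrings
• We want to enumerate frequent phrases without missing any
  • In this phase (phase 1), coverage (recall) matters most
  • Filtering happens later, in phase 2
• By considering a "representative" for each group of substrings, we reduce the candidates while preserving coverage
  • Look at the containment relation between substrings in terms of their occurrence positions
    • Positions count as the same if they coincide after shifting by the difference in string length (explained later)
  • Group all substrings by this containment relation
  • The longest substring in each group is a maximal substring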

Slide 29

Slide 29 text

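Maximal substring example (1)
Yabre-Kabre
• The only maximal substring is abre
• Every substring that occurs at least twice is contained in abre
  • { a, b, r, e, ab, br, re, abr, bre, abre }
  • These substrings all belong to the same group
  • The longest one, abre, is the maximal substring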

Slide 32

Slide 32 text

Maximal substring example (2)
abracadabra
• The maximal substrings are abra and a
• a also occurs outside abra, so it forms a separate group

Slide 34

Slide 34 text

Maximal substring example (2)
abracadabra
• The maximal substrings are abra and a
• a also occurs outside abra, so it forms a separate group
• ab belongs to the same group as abra

Slide 35

Slide 35 text

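Maximal substring example (3)
shimobayashi
• The maximal substrings are shi and a
• Shifting i by two characters makes its occurrence positions coincide with those of shi, so i is in the same group as shi
akawakami
• The maximal substrings are aka and a
• The a that appears a second time inside aka has different occurrence positions, so a forms its own group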
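The three examples can be checked with a brute-force sketch. It uses an equivalent formulation of the grouping described in the slides: a substring that occurs at least twice is maximal when extending it by one character to the left or right always lowers its occurrence count. The function name `maximal_substrings` is an assumption for illustration, and this version is far too slow for real text; it is only meant for verifying small inputs.

```python
from collections import Counter

def maximal_substrings(text, min_count=2):
    """Brute-force maximal substrings: {substring: occurrence count}."""
    n = len(text)
    # Count every occurrence of every non-empty substring (overlaps included).
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    alphabet = set(text)
    result = {}
    for sub, c in counts.items():
        if c < min_count:
            continue
        # All one-character extensions to the left and to the right.
        extended = [ch + sub for ch in alphabet] + [sub + ch for ch in alphabet]
        # Maximal if no extension occurs as often as the substring itself.
        if all(counts[e] < c for e in extended):
            result[sub] = c
    return result

print(maximal_substrings("Yabre-Kabre"))
print(maximal_substrings("abracadabra"))
print(maximal_substrings("shimobayashi"))
print(maximal_substrings("akawakami"))
```

The returned dictionary also carries the occurrence counts, which is what makes frequency-based filtering possible later on.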

Slide 36

Slide 36 text

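Enumerating candidate keyphrases with maximal substrings
• Maximal substrings let us enumerate frequent strings of arbitrary length
  • Maximal substrings are also called ∞-grams
• The enumeration also tells us how many times each maximal substring occurs
  • Candidates can be filtered by occurrence count
• Internal nodes of the suffix tree correspond to maximal substrings
  • However, not every node is a maximal substring (explained later)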

Slide 37

Slide 37 text

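Suffix tree [岡野原 & 辻井 08]
• A Patricia trie of all suffixes of the text
• Example: the suffix tree of abracadabra$
岡野原 大輔 (D. Okanohara), 辻井 潤一 (J. Tsujii). "全ての部分文字列を考慮した文書分類" (Text categorization considering all substrings), NL187 自然言語処理研究会 2008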

Slide 38

Slide 38 text

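Suffix tree
• Looking at the BWT tells us whether a node is a maximal substring [1]
  • BWT here means the character that precedes each suffix
  • When the BWT characters of the suffixes under a suffix-tree node consist of two or more distinct characters, that node is a maximal substring
[1] 極大部分文字列 - アスペ日記 http://d.hatena.ne.jp/takeda25/20101202/1291269994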

Slide 39

Slide 39 text

Enhanced suffix array (ESA)
• A data structure that supports suffix-tree operations with the same computational complexity
  • e.g. enumerating the nodes of the suffix tree
• Enhanced suffix array (ESA) = suffix array (SA) + longest common prefix array (LCP)
• 9n bytes for a text of length n [1]
  • More compact than a suffix tree (20n bytes or more) [2]
[1] D. Okanohara and J. Tsujii. 2009. Text Categorization with All Substring Features. In the SIAM International Conference on Data Mining (SDM).
[2] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algs, 2:53–86.
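To make the two arrays concrete, here is a minimal sketch that builds a suffix array and its LCP array. The suffix array is built by plain sorting, which is an assumption made to keep the example short and is not linear time; the LCP array uses Kasai's algorithm. The function names are chosen for this example only.

```python
def suffix_array(text):
    """Suffix array by naive sorting (a stand-in for a linear-time builder)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """Kasai's algorithm: lcp[r] is the length of the common prefix of the
    suffixes at ranks r-1 and r in the suffix array; lcp[0] is 0."""
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):  # visit suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # the suffix just before suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h-1 characters
        else:
            h = 0
    return lcp

sa = suffix_array("banana")
lcp = lcp_array("banana", sa)
print(sa, lcp)
```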

Slide 40

Slide 40 text

Enumerating maximal substrings
1. Build the ESA (SA + LCP) with the SA-IS algorithm and Kasai's algorithm
2. Check the suffixes at which the BWT changes [2]
3. Enumerate internal nodes using the LCP
  • While doing so, check the BWT and enumerate only the maximal substrings
• Internal nodes of the suffix tree can be enumerated in time linear in the text length T [1]
• Checking for BWT changes is also possible in linear time
[1] T. Kasai, G. Lee, H. Arimura, S. Arikawa and K. Park "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications", CPM 2001
[2] 極大部分文字列 - アスペ日記 http://d.hatena.ne.jp/takeda25/20101202/1291269994
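The three steps can be sketched as follows. This is an illustrative version under stated assumptions: the suffix array and LCP array are built naively here (standing in for SA-IS and Kasai's algorithm), the start of the text is treated as a unique preceding character, and the name `maximal_substrings_esa` is made up for this example. Internal nodes are enumerated as LCP intervals with a stack, and a prefix count of BWT changes answers "does the BWT vary inside this interval?" in constant time.

```python
def maximal_substrings_esa(text, min_count=2):
    """Maximal substrings via SA + LCP + BWT check: {substring: count}."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive stand-in for SA-IS

    def common_prefix_len(i, j):
        h = 0
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        return h

    # lcp[r]: common prefix length of the suffixes at ranks r-1 and r
    # (computed naively here; Kasai's algorithm would do this in linear time).
    lcp = [0] + [common_prefix_len(sa[r - 1], sa[r]) for r in range(1, n)]
    # bwt[r]: character preceding the suffix at rank r (None = start of text).
    bwt = [text[i - 1] if i > 0 else None for i in sa]
    # changes[r]: number of ranks k in 1..r where bwt[k] != bwt[k-1].
    changes = [0] * n
    for r in range(1, n):
        changes[r] = changes[r - 1] + (bwt[r] != bwt[r - 1])

    result = {}
    stack = [(0, 0)]  # open LCP intervals as (depth, leftmost rank)
    for r in range(1, n + 1):
        cur = lcp[r] if r < n else 0  # a final 0 closes every open interval
        left = r - 1
        while stack[-1][0] > cur:
            depth, left = stack.pop()  # interval [left, r-1] is an internal node
            right = r - 1
            count = right - left + 1
            # Keep the node only if the preceding characters are not all equal.
            if count >= min_count and changes[right] > changes[left]:
                result[text[sa[left]:sa[left] + depth]] = count
        if stack[-1][0] < cur:
            stack.append((cur, left))
    return result

print(maximal_substrings_esa("abracadabra"))
```

For abracadabra, the internal nodes bra and ra are rejected because every occurrence is preceded by the same character (a and b respectively), which leaves abra and a.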

Slide 41

Slide 41 text

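esaxx
• A C++ library that enumerates the internal nodes of a suffix tree
  • Builds an enhanced suffix array (ESA)
• The check for whether a node is a maximal substring has to be implemented yourself, using [1] as a reference
• https://code.google.com/archive/p/esaxx/
[1] 極大部分文字列 - アスペ日記 http://d.hatena.ne.jp/takeda25/20101202/1291269994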

Slide 42

Slide 42 text

Phase 2: Keyphrase scoring

Slide 43

Slide 43 text

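Phase 2: Keyphrase scoring
• How should a phrase's score (weight) be computed?
• There are many ways to weight single words
  • TF-IDF
  • JLH score
  • Mutual information
  • Chi-square value
• Naive approaches
  1. Sum the weights of the words in the phrase
  2. Compute a weight for the phrase itself (the same way as for a word)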
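A minimal sketch of naive approach 1, summing per-word weights, might look like the following. It uses a smoothed IDF as the word weight purely for illustration; the smoothing, the function names, and the document-frequency numbers are assumptions, not values from the talk.

```python
import math

def idf(term, doc_freq, n_docs):
    """Smoothed inverse document frequency (one of several common variants)."""
    return math.log(n_docs / (1 + doc_freq.get(term, 0)))

def phrase_weight_sum(tokens, doc_freq, n_docs):
    """Naive approach 1: the phrase weight is the sum of its word weights."""
    return sum(idf(t, doc_freq, n_docs) for t in tokens)

# Hypothetical document frequencies in a 1000-document collection.
doc_freq = {"新海": 10, "誠": 8, "の": 900}
name_weight = phrase_weight_sum(["新海", "誠"], doc_freq, 1000)
noisy_weight = phrase_weight_sum(["の", "新海"], doc_freq, 1000)
```

One property of summing is that longer phrases tend to accumulate larger weights, so approach 2 (weighting the phrase as a single unit) avoids that length bias.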

Slide 44

Slide 44 text

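Experiment
1. Search Elasticsearch with a keyword to obtain a document set
  - e.g. 「あの花」, 「君の名は」, 「任天堂」
2. Fetch the body text of each document
  - Only the first 300 characters this time
3. Run morphological analysis with MeCab
  - Use morphemes rather than characters as the basic unit (noise reduction)
4. Compute maximal substrings to enumerate candidate phrases
  - Only those occurring at least 5 times
5. Score the candidate phrases with the chi-square value
  - Uses Elasticsearch phrase search
  - Can be computed from the following statistics
    - Total number of documents
    - Number of documents that match the keyword
    - Among those, the number of documents containing the candidate phrase
6. Return the Top-K maximal substrings by score
  - 500 this time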
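The chi-square score in step 5 can be sketched from document counts as below. This assumes the standard 2x2 contingency table (keyword hit or not, phrase present or not), which needs the phrase's document count both in the whole collection and within the keyword hits; the function name, argument names, and numbers are hypothetical.

```python
def chi_square(n_total, n_keyword, n_phrase, n_both):
    """Chi-square statistic of a 2x2 table built from document counts.

    n_total:   documents in the whole collection
    n_keyword: documents matching the keyword
    n_phrase:  documents containing the phrase (whole collection)
    n_both:    documents matching the keyword AND containing the phrase
    """
    a = n_both                   # keyword hit, phrase present
    b = n_keyword - n_both       # keyword hit, phrase absent
    c = n_phrase - n_both        # no keyword hit, phrase present
    d = n_total - n_keyword - c  # no keyword hit, phrase absent
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denom

# A phrase concentrated in the keyword's results scores high...
print(chi_square(1000, 100, 50, 40))
# ...while one spread evenly over the collection scores zero.
print(chi_square(1000, 100, 50, 5))
```

Note that the statistic is symmetric: a phrase that is unusually rare in the keyword's results also scores high, so a check such as `a * d > b * c` may be needed if only over-represented phrases are wanted.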

Slide 45

Slide 45 text

Results for 「あの花」
• Top 20
142684.106  ゆき あつ
135512.226  長井 龍 雪
121007.208  アニメ 「 あの 日 見 た 花 の 名前 を
119079.563  「 めん ま 」
118675.949  [full-width space] あの 花
118675.949  『 あの 花
118675.949  「 あの 花 」
118675.949  『 あの 花 』
118675.949  「 あの 花
94760.745  「 あの 日 見 た 花 の 名前 を

Slide 46

Slide 46 text

Results for 「あの花」
86305.143  「 あの 日 見 た 花 の 名前 を 僕達 は まだ 知ら ない
86305.143  あの 日 見 た 花 の 名前 を 僕達 は まだ 知ら ない 。
86305.143  あの 日 見 た 花 の 名前 を 僕達 は まだ 知ら ない
86305.143  『 あの 日 見 た 花 の 名前 を 僕達 は まだ 知ら ない
55090.753  田中 将 賀
38692.776  岡田 麿 里
35751.909  長井
29098.469  「 心 が 叫び た がっ てる ん だ 。
29098.469  『 心 が 叫び た がっ てる ん だ 。 』
29098.469  「 心 が 叫び た がっ てる ん だ 。 」

Slide 47

Slide 47 text

Results for 「君の名は」
• Top 80
390586.996  瀧 と 三 葉
126664.722  」 「 君 の 名 は 。 」
123792.563  映画 「 君 の 名 は 。 」
123792.563  映画 『 君 の 名 は 。 』
109792.731  新海
106256.894  、 新海
106256.894  の 新海
106256.894  。 新海
104401.937  。 新海 誠 監督
103768.965  の 新海 誠
103768.965  た 新海 誠

Slide 48

Slide 48 text

ʮ܅ͷ໊͸ʯʹର͢Δ݁Ռ 83881.345 ɻ ʮ ܅ ͷ ໊ ͸ ɻ 83881.345 ʮ ܅ ͷ ໊ ͸ ɻ 83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ ͷ 83881.345 ɻ ʰ ܅ ͷ ໊ ͸ 83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ ͕ ɺ 83881.345 ʰ ܅ ͷ ໊ ͸ ɻ ʱ 83881.345 ɺ ʮ ܅ ͷ ໊ ͸ ɻ ʯ 79635.412 ͷ ʮ ܅ ͷ ໊ ͸ ɻ 79635.412 ͨ ɻ ʮ ܅ ͷ ໊ ͸ 79635.412 ͨ ʰ ܅ ͷ ໊ ͸ 83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ ͸ 83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ Λ 83881.345 ɻ ʮ ܅ ͷ ໊ ͸ 83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ 83881.345 ɻ ʰ ܅ ͷ ໊ ͸ ɻ ʱ 83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ ͕ 83881.345 ܅ ͷ ໊ ͸ 83881.345 ɻ ʰ ܅ ͷ ໊ ͸ ɻ 83881.345 ɺ ܅ ͷ ໊ ͸ 83881.345 ɺ ܅ ͷ ໊ ͸ ɻ 83881.345 ܅ ͷ ໊ ͸ ʁ 83881.345 ʮ ܅ ͷ ໊ ͸ 83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ ͸ ɺ 83881.345 ɻ ܅ ͷ ໊ ͸ 83881.345 ʮ ܅ ͷ ໊ ͸ ʯ ͱ

Slide 49

Slide 49 text

Results for 「君の名は」
68454.408  、 三葉
66923.168  糸 守
52875.263  「 君 の 名 は 」
48620.102  三葉
40170.705  映画 『 君 の 名 は 。 』 で ヒロイン の
38947.540  宮 水
38243.976  三 葉 は
37993.993  新海 誠
29207.758  「 君 の 名 は 。 」 を 見
29207.758  「 君 の 名 は 。 」 を 見 て
22217.179  新海 誠 監督 最新 作
22161.230  新海 誠 の
21938.554  「 前前 前世
21155.167  新海 誠 作品
19854.987  新海 監督
19222.009  宮 水 三 葉
17993.221  立花 瀧

Slide 50

Slide 50 text

Results for 「君の名は」
17612.465  「 君 の 名 は 。 」 を 観
13315.859  「 スパークル
12497.399  新海 誠 監督 最新 作 『 君 の 名 は 。 』
12043.648  糸 守 町
11079.889  上 白石
9984.243  「 秒速 5 センチメートル 」
6407.954  新海 誠 監督 の アニメ 映画
5858.545  新海 誠 監督
5662.690  「 シン ・ ゴジラ 」
5472.189  シン ・ ゴジラ
3677.951  瀧 と
3454.078  入れ替わり

Slide 51

Slide 51 text

Results for 「君の名は」
3407.773  「 秒速
2634.372  。 「 君 の
2634.372  君 の
2634.372  。 君
2634.372  。 君 の
2492.065  「 秒速 5 センチメートル
2036.764  上 白石 萌 音
1511.662  の アニメ 映画
1511.662  の アニメ 映画 「
1456.590  「 君 の
1274.915  2016 - 08 -
1222.903  億 円 を 突破
1134.913  の 大 ヒット
1129.179  前前 前世

Slide 52

Slide 52 text

Results for 「任天堂」
• Top 50
956576.307  … 任天堂
608643.451  は 、 任天堂 は
473775.714  岩田 聡
464686.595  岩田 聡 社長 が
458018.217  し 、 任天堂
458018.217  任天堂 の
458018.217  。 任天堂 、
458018.217  まし た 。 任天堂
458018.217  任天堂 を
458018.217  任天堂 *
458018.217  に 任天堂
458018.217  た 任天堂
458018.217  。 任天堂 は
404220.379  Ｗｉｉ

Slide 53

Slide 53 text

Results for 「任天堂」
386131.784  岩田 社長 の
385544.541  は 、 任天堂 が
314103.697  岩田 聡 社長
309061.454  マリオ 」
275932.322  岩田 社長
271259.444  amiibo を
229298.880  岩田 社長 が
226600.614  です が 、 任天堂
219915.400  し た 。 岩田 社長
217359.056  。 [full-width space] 岩田 社長 は
217359.056  。 [full-width space] 岩田 社長
217359.056  。 岩田 社長
205515.499  ファミリー コンピュータ
169882.335  て いる 。 任天堂
167779.063  New ニンテンドー 3 DS

Slide 54

Slide 54 text

Results for 「任天堂」
166283.753  マリオ の
162672.396  型 ゲーム
143406.225  Miitomo
139477.360  マリオ メーカー
133729.277  、 任天堂
130604.759  株主 総会 を 欠席
124955.944  任天堂 の 岩田 聡
124945.709  マリオ カート
119639.084  。 「 マリオ
112837.829  据え置き 型 ゲーム 機
99299.695  ミートモ
89169.832  ３ ＤＳ
83948.098  米国 任天堂 の

Slide 55

Slide 55 text

Results for 「任天堂」
78462.915  「 マリオ 」
67171.541  年末 商戦
66552.483  New ニンテンドー 3 DS /
66088.000  Mii
64830.347  据え置き 型
63918.676  「 ゼルダ の 伝説 」
61309.998  君島 氏
60804.122  宮本 氏
58416.532  大 乱闘 スマッシュブラザーズ

Slide 56

Slide 56 text

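Discussion
• The score and the quality of a phrase seem to be loosely correlated
  • There are no good phrases near the bottom of the ranking
  • On the other hand, some good phrases sit in the middle
• Symbols introduce a fair amount of noise
  • Filtering out phrases that start with a symbol or a particle looks like it would help
• Are TF-IDF, the chi-square value, and similar measures meaningful for phrases as well?
  • Better-performing phrase scoring methods may already have been studied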
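The filter suggested above could be sketched like this. Symbols are detected through Unicode general categories, while the particle check uses a small hand-written set; that set, `LEADING_STOP_TOKENS`, and the function names are hypothetical stand-ins for this example (a real pipeline would more likely use the part-of-speech tags from the morphological analyzer).

```python
import unicodedata

# Hypothetical list of tokens that should not start a phrase (particles etc.).
LEADING_STOP_TOKENS = {"の", "は", "が", "を", "に", "で", "と", "た", "し"}

def is_symbol(token):
    """True if every character is punctuation, a symbol, or a separator."""
    return all(unicodedata.category(ch)[0] in "PSZ" for ch in token)

def keep_phrase(tokens):
    """Drop phrases that are empty or start with a symbol or a listed particle."""
    if not tokens:
        return False
    first = tokens[0]
    return not (is_symbol(first) or first in LEADING_STOP_TOKENS)

phrases = [["新海", "誠", "監督"], ["。", "新海", "誠"], ["の", "新海", "誠"]]
kept = [p for p in phrases if keep_phrase(p)]
```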

Slide 57

Slide 57 text

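Survey of scoring methods
• Survey paper on conventional methods: [Hasan & Ng 2014]
  • The state of the art in keyphrase extraction (as of 2014)
• There are supervised and unsupervised methods
• Supervised methods do not necessarily perform better
  • On 3 of 4 datasets, the SOTA is unsupervised [Hasan & Ng 2014]
• Supervised methods are a lot of work: preparing training data, managing models, and so on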

Slide 58

Slide 58 text

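Conventional methods (unsupervised)
1. Graph-based ranking
  • TextRank
2. Topic-based clustering
  1. KeyCluster
  2. Topical PageRank (TPR)
  3. Community Cluster
3. Language-model-based
Methods 1 and 2 look expensive to train
1. Graph-based: the graph is built over pairs of words, so the cost is quadratic
2. Topic-based: running a topic model is heavy
3. Language model: it only counts words, so the cost is linear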

Slide 59

Slide 59 text

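Language-model-based phrase scoring
• Scores phrases by the difference between language models (= probabilistic models)
  • Measured with the Kullback-Leibler divergence
• Not tried yet, so this is left for another occasion
Takashi Tomokiyo and Matthew Hurst. "A Language Model Approach to Keyphrase Extraction"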
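As a rough sketch of the idea, the pointwise Kullback-Leibler term p log(p / q) can compare a phrase's probability p under a foreground model (the documents of interest) with its probability q under a background model. The add-one smoothing, the use of simple relative frequencies as the "language models", and all names and counts below are assumptions for illustration; they are not the cited paper's full method.

```python
import math

def pointwise_kl(p, q):
    """Contribution of one event to KL(p || q)."""
    return p * math.log(p / q)

def informativeness(phrase, fg_counts, bg_counts, smoothing=1.0):
    """Score a phrase by how much more likely it is in the foreground corpus."""
    vocab = set(fg_counts) | set(bg_counts) | {phrase}
    fg_total = sum(fg_counts.values()) + smoothing * len(vocab)
    bg_total = sum(bg_counts.values()) + smoothing * len(vocab)
    p = (fg_counts.get(phrase, 0) + smoothing) / fg_total
    q = (bg_counts.get(phrase, 0) + smoothing) / bg_total
    return pointwise_kl(p, q)

# Hypothetical phrase counts in the foreground and background corpora.
fg = {"新海 誠": 30, "映画": 70}
bg = {"新海 誠": 1, "映画": 99}
print(informativeness("新海 誠", fg, bg))
print(informativeness("映画", fg, bg))
```

A phrase that is relatively more frequent in the foreground gets a positive score, and one that is relatively less frequent gets a negative score.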

Slide 60

Slide 60 text

References
• [Turney 00] Peter D. Turney. "Learning algorithms for keyphrase extraction", Information Retrieval 2.4 (2000): 303-336
  • https://arxiv.org/pdf/cs/0212020.pdf
• [Hasan & Ng 14] Kazi Saidul Hasan and Vincent Ng. "Automatic Keyphrase Extraction: A Survey of the State of the Art", Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pages 1262-1273
  • https://www.aclweb.org/anthology/P/P14/P14-1119.xhtml
  • A systematic review paper on automatic keyphrase extraction
• [Liu+ 09] Z. Liu, P. Li, Y. Zheng and M. Sun. "Clustering to find exemplar terms for keyphrase extraction", 2009, pp. 257–266
  • When building candidate keyphrases, filters out stop words using a stop-word dictionary

Slide 61

Slide 61 text

References
• [岡野原 & 辻井 08] 岡野原 大輔 (D. Okanohara), 辻井 潤一 (J. Tsujii). "全ての部分文字列を考慮した文書分類" (Text categorization considering all substrings), NL187 自然言語処理研究会 2008
  • http://ci.nii.ac.jp/naid/110006980330
• [Okanohara & Tsujii 09] D. Okanohara and J. Tsujii. "Text Categorization with All Substring Features", In the SIAM International Conference on Data Mining (SDM) 2009
  • http://epubs.siam.org/doi/abs/10.1137/1.9781611972795.72
• [Abouelhoda+ 04] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. "Replacing suffix trees with enhanced suffix arrays", J. Discrete Algs 2004, 2:53–86.
  • https://pdfs.semanticscholar.org/4ca9/ea95a0a9846965e86619e646d9ca36930c18.pdf
• [Kasai+ CPM 01] T. Kasai, G. Lee, H. Arimura, S. Arikawa and K. Park "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications", CPM 2001