Slide 1

Slide 1 text

Improving Full-Text Search Accuracy in Hatena Bookmark
Hatena Engineer Seminar #5
Takuya Asano (id:takuya-a, @takuya_a)

Slide 2

Slide 2 text

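id:takuya-a
Platform & Ad Tech team. Joined in April 2015.
Interests • Information retrieval • Natural language processing • Machine learning
OSS activity: develops JavaScript libraries such as kuromoji.js
(My icon has changed.)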

Slide 3

Slide 3 text

Enter a search keyword

Slide 4

Slide 4 text

This talk covers body-text search only

Slide 5

Slide 5 text

Results can be ordered newest-first or by popularity

Slide 6

Slide 6 text

The problem with body-text search:
accuracy is poor

Slide 7

Slide 7 text

Before the improvement

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

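Problem: accuracy of body-text search
Even at a glance, the results contain a lot of search noise. For example:
• The entry's topic has nothing to do with Kyoto
• Body-text extraction errors (e.g. "Kyoto" appears only in a related-article title)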

Slide 11

Slide 11 text

Goal: a search for "Kyoto" (京都) should return "Kyoto-like" entries

Slide 12

Slide 12 text

So what does "Kyoto-like" mean in the first place?

Slide 13

Slide 13 text

What is "Kyoto-like"?
• Imagine the search needs and use cases
• What does a person who types the query "Kyoto" want to do?

Slide 14

Slide 14 text

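What is "Kyoto-like"? Search needs (imagined)
• Learn about famous places, sights, and place names in Kyoto
• Learn about lodging that is unique to Kyoto
• Learn about food in Kyoto
• Learn about news from Kyoto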

Slide 15

Slide 15 text

But with only the single keyword "Kyoto"

Slide 16

Slide 16 text

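Problem 1: Usability of query writing
Composing a query is hard for end users
• They cannot think of suitable keywords
• They do not know much about the subject in the first place
• Thinking up and typing a query is tedious (especially on smartphones)
Users want to type just "Kyoto" and have the system do the right thing
(in practice, single-keyword searches are overwhelmingly common)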

Slide 17

Slide 17 text

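Problem 2: Search accuracy in newest-first order
In full-text search, accuracy is hard to achieve in any order other than score order.
Ordinary full-text search ranks by a score that accounts for things like term occurrence counts (= frequency).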

Slide 18

Slide 18 text

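What is accuracy?
Two evaluation metrics that are fundamental to many tasks
• Precision = how little search noise there is
• Recall = how few relevant results are missed
The two trade off against each other
• Trying to raise precision lowers recall
• Trying to raise recall lowers precision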

Slide 19

Slide 19 text

What is accuracy? Extreme examples
• Return only the single entry you are most confident about → precision is 100%, but recall is low
• Return every entry → recall is 100%, but precision is low
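The two extreme cases can be checked with a few lines of Python. This is only an illustrative sketch: the ten-entry corpus and the set of relevant entries are made-up assumptions, not data from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical corpus of 10 entries, 4 of which are relevant to the query.
all_entries = [f"e{i}" for i in range(10)]
relevant = ["e0", "e1", "e2", "e3"]

# Extreme 1: return one entry we are confident about.
print(precision_recall(["e0"], relevant))       # (1.0, 0.25)
# Extreme 2: return every entry.
print(precision_recall(all_entries, relevant))  # (0.4, 1.0)
```

The first call has perfect precision but finds only a quarter of the relevant entries; the second finds all of them at the cost of precision.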

Slide 20

Slide 20 text

Problem 3: Balancing precision and recall
Low search noise (precision) is the priority,
but few missed results (recall) matters too

Slide 21

Slide 21 text

Summary of the problems
1. Usability of query writing
2. Search accuracy in newest-first order
3. Balancing precision and recall

Slide 22

Slide 22 text

Done!

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

http://b.hatena.ne.jp/search/text?q=京都
Kyoto sightseeing, food information, and news keep coming up!

Slide 26

Slide 26 text

http://b.hatena.ne.jp/search/text?q=コーヒー
Information and news about coffee keep coming up!

Slide 27

Slide 27 text

http://b.hatena.ne.jp/search/text?q=機械学習
Information and news about machine learning keep coming up!

Slide 28

Slide 28 text

How was this achieved?

Slide 29

Slide 29 text

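Raise both precision and recall
(and do so with a single keyword, in newest-first order)
A straightforward approach cannot do this: even with a carefully tuned scoring function, ordinary keyword search has its limits.
From the single keyword "Kyoto", we need to represent the concept of "Kyoto-likeness".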

Slide 30

Slide 30 text

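Idea: use Hatena Bookmark tags
Use the metadata that has accumulated in Hatena Bookmark over 10 years
• Hatena Bookmark tags = human-made "ground truth" data
• Entries tagged 京都 (Kyoto) should be "Kyoto-like" entries!
 → Extract the concept of "Kyoto-likeness" from those entries
 → Search using the extracted concept
(Even without tag data, the idea in this talk should be applicable in many ways)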

Slide 31

Slide 31 text

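Concept search
Search engines that use concept search
• Autonomy (Hewlett-Packard)
• GETA (NII)
• ConceptBase (JustSystems)
• Others, such as patent search engines
"Concept search (also called natural-sentence search, natural-language search, similar-document search, or associative search) is an automated information retrieval technique used to search accumulated unstructured data (electronic archives, e-mail, scientific literature, and so on) for information that is conceptually similar to the search query."
Quoted from http://ja.wikipedia.org/wiki/概念検索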

Slide 32

Slide 32 text

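What is a concept?
In concept search, a concept is a set of keywords
• A concept is considered to be made up of a collection of related keywords
• It is an approximation by a set of multiple related keywords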

Slide 33

Slide 33 text

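How concept search works
The three key algorithms:
1. Related keyword extraction: extract keywords that are related to (co-occur with) the original query
2. Query expansion: add the related keywords to the query
3. Concept search: run a full-text search with the expanded query
Once the related keywords are extracted, all that remains is to add them to the query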

Slide 34

Slide 34 text

Related keyword extraction
by feature term selection

Slide 35

Slide 35 text

Results

Slide 36

Slide 36 text

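Related keywords for 京都 (Kyoto)
1 祇園 (Gion), 2 寺 (temple), 3 075, 4 神社 (shrine), 5 桜 (cherry blossom), 6 京 (Kyo), 7 近鉄 (Kintetsu), 8 小路 (alley), 9 紅葉 (autumn leaves), 10 美術館 (art museum), 11 上京 (Kamigyo), 12 伝統 (tradition), 13 烏丸 (Karasuma), 14 通 (street), 15 町家 (machiya townhouse), 16 エリア (area), 17 旅館 (ryokan inn), 18 まち (town), 19 博物館 (museum), 20 展 (exhibition), 21 観光 (sightseeing), 22 河原町 (Kawaramachi), 23 kyoto, 24 カフェ (cafe), 25 中京 (Nakagyo)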

Slide 37

Slide 37 text

Synonyms of Kyoto (same top-25 keyword list as Slide 36)

Slide 38

Slide 38 text

Place names in Kyoto (hyponyms) (same top-25 keyword list as Slide 36)

Slide 39

Slide 39 text

Words associated with Kyoto (same top-25 keyword list as Slide 36)
Annotation on the slide: Kyoto's area code

Slide 40

Slide 40 text

Kyoto history and sights (same top-25 keyword list as Slide 36)

Slide 41

Slide 41 text

Kyoto sightseeing, lodging, and transportation (same top-25 keyword list as Slide 36)

Slide 42

Slide 42 text

Kyoto food information (same top-25 keyword list as Slide 36)

Slide 43

Slide 43 text

Effects of query expansion with related keywords

Slide 44

Slide 44 text

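Related keywords for パナソニック (Panasonic)
1 lumix, 2 dmc, 3 viera, 4 津賀 (Tsuga), 5 diga, 6 プラズマ (plasma), 7 pdp, 8 パナソニック (Panasonic), 9 三洋 (Sanyo), 10 btob, 11 有機 (organic), 12 三洋電機 (Sanyo Electric), 13 松下 (Matsushita), 14 大坪 (Otsubo), 15 社名 (company name), 16 el, 17 電工 (Denko), 18 電池 (battery), 19 ces, 20 電機 (electric), 21 avc, 22 sanyo, 23 モバイルコミュニケーションズ (Mobile Communications), 24 パネル (panel), 25 液晶 (LCD)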

Slide 45

Slide 45 text

Effect of query expansion:
synonyms are acquired automatically (same top-25 keyword list as Slide 44)

Slide 46

Slide 46 text

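Related keywords for スプラトゥーン (Splatoon)
1 ブキ (weapon), 2 スプラトゥーン (Splatoon), 3 splatoon, 4 シュータ (shooter), 5 amiibo, 6 wiiu, 7 ナワバリバトル (Turf War), 8 インク (ink), 9 わかば (Wakaba), 10 tps, 11 試射 (test firing), 12 塗る (to paint), 13 シオカラーズ (Squid Sisters), 14 チャージャ (charger), 15 fps, 16 ローラ (roller), 17 pic, 18 立ち回り (positioning), 19 イカ (squid), 20 ランク (rank), 21 マップ (map), 22 対戦 (versus match), 23 エイム (aim), 24 マリオ (Mario), 25 敵 (enemy)
シューター -> シュータ and ローラー -> ローラ are the result of stemming in the Elasticsearch analyzer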

Slide 47

Slide 47 text

Effect of query expansion:
synonyms are acquired even for newly coined words (same top-25 keyword list as Slide 46)
Conversely, extracting related keywords for the alphabetic query "splatoon" yields the katakana term スプラトゥーン!

Slide 48

Slide 48 text

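Related keywords for ユニクロ (UNIQLO)
1 しまむら (Shimamura), 2 柳井 (Yanai), 3 ユニ (uni), 4 ヒートテック (Heattech), 5 クロ (qlo), 6 uniqlock, 7 uniqlo, 8 無印 (Muji), 9 良品 (Ryohin), 10 衣料 (clothing), 11 旗艦 (flagship), 12 チノ (chino), 13 ダサい (unfashionable), 14 ファーストリテイリング (Fast Retailing), 15 残業 (overtime), 16 セータ (sweater), 17 ジーンズ (jeans), 18 tシャツ (T-shirt), 19 店長 (store manager), 20 アウタ (outerwear), 21 離職 (job turnover), 22 服 (clothes), 23 品質 (quality), 24 百貨店 (department store), 25 フリース (fleece)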

Slide 49

Slide 49 text

Effect of query expansion:
compensating for morphological analysis errors (same top-25 keyword list as Slide 48)
• Morphological analysis errors can be compensated for (ユニクロ is not in the dictionary)
• Morphological analysis results are unstable: even if the term was split as ユニ/クロ when indexed and the query ユニクロ becomes a single token, the search still works
• Patterns that the morphological analyzer handles poorly are learned and compensated for automatically

Slide 50

Slide 50 text

Limits of query expansion with related keywords

Slide 51

Slide 51 text

Not every search need can be covered (same top-25 keyword list as Slide 36)
• Useless for someone who wants to know "the weather in Kyoto"
• Because of time and space cost constraints, the query cannot be expanded without limit

Slide 52

Slide 52 text

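Summary so far
• Related keyword extraction appears to be working well
• Query expansion can assist users in writing queries
• Query expansion also has limits
• Still, the benefits are large, such as synonym expansion and compensation for morphological analysis errors
How were the related keywords extracted?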

Slide 53

Slide 53 text

Related keyword extraction
by feature term selection
(algorithm part)

Slide 54

Slide 54 text

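What is a related keyword?
A related keyword has the following properties (or so we assumed)
• It is a term that represents the content of documents written about the concept (= a feature term)
• It appears in multiple documents written about the concept
First, consider extracting feature terms from a single document (entry)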

Slide 55

Slide 55 text

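Policy for feature term extraction
Do not use machine learning (e.g. learning to rank)
• Explaining it to stakeholders and interpreting results becomes difficult
• It is hard to combine with hand-written rules and heuristics
• Little data can be prepared, so the risk of overfitting is high
Rely on classical information retrieval (IR) techniques
• Use the term statistics available from Elasticsearch's Term Vectors API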

Slide 56

Slide 56 text

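Elasticsearch Term Vectors API
Term statistics available from Term Vectors
Term = a unit produced by the Elasticsearch analyzer (note that it does not always match what we would call a word)
• term_freq: number of occurrences (frequency) of the term in the entry
• doc_freq: number of entries in the whole index that contain the term
• ttf: total number of occurrences (frequency) of the term across the whole index
• doc_count: total number of entries in the index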
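As a rough sketch of how these four statistics might be read in client code: the request body below uses the documented `term_statistics` and `field_statistics` options, and the parser walks a response shaped like the API's `term_vectors` output. The field name `body` and all the numbers in the sample are assumptions for illustration; no network call is made.

```python
def term_vectors_request(field):
    """Request body asking for per-term and per-field statistics."""
    return {
        "fields": [field],
        "term_statistics": True,   # adds doc_freq and ttf for each term
        "field_statistics": True,  # adds doc_count for the field
        "positions": False,
        "offsets": False,
    }


def extract_term_stats(response, field):
    """Flatten one Term Vectors response into (doc_count, {term: stats})."""
    vector = response["term_vectors"][field]
    doc_count = vector["field_statistics"]["doc_count"]
    terms = {
        term: {key: info[key] for key in ("term_freq", "doc_freq", "ttf")}
        for term, info in vector["terms"].items()
    }
    return doc_count, terms


# Hand-written sample response; the values are made up.
sample = {
    "term_vectors": {
        "body": {
            "field_statistics": {"sum_doc_freq": 120, "doc_count": 10, "sum_ttf": 300},
            "terms": {
                "kyoto": {"doc_freq": 4, "ttf": 9, "term_freq": 2},
                "temple": {"doc_freq": 2, "ttf": 3, "term_freq": 1},
            },
        }
    }
}

doc_count, terms = extract_term_stats(sample, "body")
print(doc_count, terms["kyoto"])
```

Note that `doc_count` in the field statistics counts documents that have the field, which stands in for the total number of entries when every entry has a body.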

Slide 57

Slide 57 text

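Flow of related keyword extraction with Elasticsearch
1. (Tag search) Use the input query to narrow down entries with a Filtered Query against the tag field
2. (Fetch statistics) Get term statistics with the Term Vectors API (strictly speaking, the Multi Term Vectors API)
3. (Feature term extraction) For each document, compute term importance on the server and extract the Top-25 feature terms
4. (Related keyword extraction) Drop terms that appear in only a few documents, then extract the Top-25 terms with the highest scores
(Once the related keywords are extracted, all that remains is to add them to the query)
How should term importance be computed?
Tip: use a binary heap to speed up the Top-K computation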
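Steps 3 and 4 and the binary-heap tip can be sketched as follows. This is not the talk's implementation: summing per-document scores to rank terms across documents and the `min_docs` cutoff are assumptions, since the slides do not say how scores are merged. Python's `heapq.nlargest` keeps a heap of size k internally, which avoids sorting the full term list.

```python
import heapq
from collections import defaultdict


def top_k_terms(scores, k=25):
    """Step 3: the k highest-scoring (term, score) pairs of one document."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])


def related_keywords(per_doc_scores, k=25, min_docs=2):
    """Step 4: merge per-document feature terms into related keywords."""
    total = defaultdict(float)  # summed score per term (assumed merge rule)
    seen_in = defaultdict(int)  # number of documents featuring the term
    for scores in per_doc_scores:
        for term, score in top_k_terms(scores, k):
            total[term] += score
            seen_in[term] += 1
    # Drop terms that are feature terms in too few documents.
    frequent = {t: s for t, s in total.items() if seen_in[t] >= min_docs}
    return [term for term, _ in top_k_terms(frequent, k)]


# Hypothetical importance scores for three tagged entries.
docs = [
    {"gion": 3.0, "temple": 2.0, "foo": 9.0},
    {"gion": 2.5, "temple": 2.5},
    {"temple": 1.0, "bar": 8.0},
]
print(related_keywords(docs, k=2, min_docs=2))  # ['gion', 'temple']
```

In the example, "foo" and "bar" score highly in one entry each but are dropped because they do not recur across entries.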

Slide 58

Slide 58 text

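Algorithm for computing feature terms
In information retrieval, two term importance (weighting) measures are the de facto standard
• TF-IDF (the product of TF, term frequency, and IDF, inverse document frequency)
• BM25 (based on a probabilistic model; also takes document length into account)
TF-IDF was adopted: BM25 is more expensive to compute and requires tuning two parameters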

Slide 59

Slide 59 text

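TF-IDF
The TF-IDF formula (several variations exist)
• TF: term frequency fi,j → higher for terms that occur many times in the document
• IDF: inverse of the document frequency ni (N is the number of documents) → higher for rare terms
• fi,j: number of occurrences (frequency) of the term in entry j = term_freq
• N: total number of entries in the index = doc_count
• ni: number of entries in the whole index that contain the term = doc_freq
TF-IDF alone does not work well
• Numbers with little meaning, such as "1"
• English stop words, such as "to" (Japanese stop words are removed by an Elasticsearch filter)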
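The formula image itself is not part of this transcript. As a minimal sketch, assuming the common variant tf-idf = f_{i,j} · log(N / n_i) with a natural logarithm (the slide notes that several variations exist, so the exact form used in the talk may differ), the score can be computed directly from the Term Vectors statistics:

```python
import math


def tf_idf(term_freq, doc_freq, doc_count):
    """tf-idf = f_ij * log(N / n_i); one common variant, assumed here."""
    return term_freq * math.log(doc_count / doc_freq)


# Hypothetical statistics: a term occurring 3 times in the entry
# and present in 10 of 1000 indexed entries.
print(tf_idf(term_freq=3, doc_freq=10, doc_count=1000))
# A term present in every entry gets a score of zero.
print(tf_idf(term_freq=5, doc_freq=1000, doc_count=1000))
```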

Slide 60

Slide 60 text

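Dynamic stop word list
Preparing and maintaining a stop word dictionary in advance only goes so far (Japanese, English, ...)
Compute stop words dynamically from term statistics
(this adapts flexibly to whatever entries are in the index)
• IDF: the classical technique
• RIDF: a technique that focuses on the difference between function words and content words
• Gain: a technique that focuses on information gain
Terms for which these measures fall below a threshold are excluded as stop words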

Slide 61

Slide 61 text

IDF
Inverse Document Frequency
Terms that appear in almost every document get a low score
• N: total number of entries in the index = doc_count
• ni: number of entries in the whole index that contain the term = doc_freq

Slide 62

Slide 62 text

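RIDF
Residual IDF
Church, K. W. and Gale, W. A. (1995a). "Inverse Document Frequency (IDF): A Measure of Deviation from Poisson." In Proc. of the 3rd Workshop on Very Large Corpora, pp. 121–130.
• N: total number of entries in the index = doc_count
• ni: number of entries in the whole index that contain the term = doc_freq
• Fi: total number of occurrences of the term in the whole index = ttf
Model the number of occurrences of a term within one document with a Poisson distribution
• Function words: spread out over many documents (uniformly distributed)
• Content words: concentrated in a few documents (unevenly distributed)
RIDF = the difference between the estimated IDF and the actual IDF
The parameter (= expected value) λi of the Poisson distribution P(k; λi) is estimated as the total number of occurrences of the term (Fi) / the number of documents (N), assuming the term is uniformly distributed
P(0; λi) is the probability that the term never occurs, so 1 - P(0; λi) is the probability that it occurs at least once
→ the denominator N (1 - P(0; λi)) is an estimate of the document frequency ni
→ the right-hand side is an estimate of IDFi
RIDF is low: small difference from the estimate
RIDF is high: large difference from the estimate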
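The description above translates into a few lines of Python. This is a sketch under assumptions: it uses the natural logarithm and takes RIDF as actual IDF minus Poisson-estimated IDF, which matches the slide's statement that unevenly distributed content words score high. The `is_stop_word` helper, its parameter names, and the rule that failing either threshold marks a term as a stop word are hypothetical illustrations of the dynamic stop word list.

```python
import math


def idf(doc_freq, doc_count):
    """Actual IDF: log(N / n_i)."""
    return math.log(doc_count / doc_freq)


def ridf(doc_freq, ttf, doc_count):
    """Residual IDF: actual IDF minus the IDF expected under a Poisson model."""
    lam = ttf / doc_count                                   # lambda_i = F_i / N
    expected_doc_freq = doc_count * (1.0 - math.exp(-lam))  # N * (1 - P(0; lambda_i))
    expected_idf = math.log(doc_count / expected_doc_freq)
    return idf(doc_freq, doc_count) - expected_idf


def is_stop_word(doc_freq, ttf, doc_count, idf_min, ridf_min):
    """Hypothetical rule: a term is a stop word if either measure is too low."""
    return (idf(doc_freq, doc_count) < idf_min
            or ridf(doc_freq, ttf, doc_count) < ridf_min)


# Two made-up terms with the same total count (5000) in 10000 entries.
print(ridf(doc_freq=3935, ttf=5000, doc_count=10000))  # spread out: close to 0
print(ridf(doc_freq=500, ttf=5000, doc_count=10000))   # concentrated: clearly positive
```

The first term behaves as the Poisson model predicts (a function word); the second is packed into far fewer entries than predicted, so its RIDF is large.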

Slide 63

Slide 63 text

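Gain
The score is low both for extremely high-frequency terms (appearing in almost every entry) and for extremely low-frequency terms (appearing in only a handful of entries)
Result: it had no effect, because high-frequency terms are already filtered out by Elasticsearch
• N: total number of entries in the index = doc_count
• ni: number of entries in the whole index that contain the term = doc_freq
Papineni, K. (2001). "Why Inverse Document Frequency?" In Proc. of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), pp. 25–32.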

Slide 64

Slide 64 text

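Summary: feature term extraction
• Explained the feature term extraction algorithm
 • TF-IDF
• A dynamic stop word list removes the need to maintain a stop word dictionary
 • IDF
 • RIDF
 • Gain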

Slide 65

Slide 65 text

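Evaluating and optimizing feature term selection
• Evaluate on the Top-50
• Treat it as binary classification: relevant (+1) or not relevant (-1)
• Prepare ground-truth data (about 500 items)
Example labels for an entry about art: いく ("to go") -1, ゴッホ (Van Gogh) +1, あんまり ("not really") -1, ...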

Slide 66

Slide 66 text

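Evaluation metrics: P@n / AP / MAP, and optimization by MAP
• P@n (Precision at n): precision over the top n ranks
• AP (Average Precision): P@n averaged over ranks up to n
• MAP (Mean Average Precision): AP averaged over all entries
This time, MAP was used for both evaluation and optimization
A MAP above 90% was achieved (on the training data)
Max MAP : 0.9173
-----------------------
IDF threshold : 6.0
RIDF threshold : 0.55
Gain threshold : 0.0
In real work, do use cross-validation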
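The three metrics can be sketched as below, using the +1 / -1 labels from the ground-truth data. One assumption to note: AP here follows the standard textbook definition (P@k averaged over the ranks k that hold a relevant term), which may differ in detail from the one-line definition on the slide. The two label sequences are made up.

```python
def precision_at(labels, n):
    """P@n: fraction of the first n ranked terms labelled relevant (+1)."""
    top = labels[:n]
    if not top:
        return 0.0
    return sum(1 for label in top if label == 1) / len(top)


def average_precision(labels):
    """AP: mean of P@k over the ranks k where a relevant term appears."""
    precisions = [
        precision_at(labels, rank + 1)
        for rank, label in enumerate(labels)
        if label == 1
    ]
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """MAP: AP averaged over all entries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical ranked feature terms for two entries.
entry_a = [1, -1, 1]   # AP = (1/1 + 2/3) / 2
entry_b = [-1, 1]      # AP = (1/2) / 1
print(mean_average_precision([entry_a, entry_b]))
```

Optimizing the thresholds then amounts to searching for the IDF, RIDF, and Gain cutoffs that maximize this MAP value on the labelled data.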

Slide 67

Slide 67 text

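Summary: evaluation and optimization
• Introduced three evaluation metrics (P@n / AP / MAP)
• Optimized the parameters to maximize the evaluation metric (MAP)
• Achieved high accuracy by combining three scoring functions (TF-IDF, IDF, RIDF)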

Slide 68

Slide 68 text

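Review: outline of concept search
1. Related keyword extraction: extract keywords that are related to (co-occur with) the original query
2. Query expansion: add the related keywords to the query
3. Concept search: run a full-text search with the expanded query
Once the related keywords are extracted, all that remains is to add them to the query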

Slide 69

Slide 69 text

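Query expansion & concept search with Elasticsearch
• The original keyword "must be contained" (must clause of a Bool Query)
• Related keywords only add to the score (should clause of a Bool Query)
• A threshold on the score (specify min_score in the query)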
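A search body following these three points might look like the sketch below. The field name `body`, the use of `match` clauses, and the threshold value are assumptions for illustration; the talk does not show the actual query. The function only builds the request body and does not contact a cluster.

```python
import json


def build_expanded_query(keyword, related, field="body", min_score=1.0):
    """Bool query: original keyword in must, related keywords in should."""
    return {
        "min_score": min_score,  # drop hits whose score is below the threshold
        "query": {
            "bool": {
                # The original keyword has to match.
                "must": [{"match": {field: keyword}}],
                # Related keywords are optional and only raise the score.
                "should": [{"match": {field: term}} for term in related],
            }
        },
    }


body = build_expanded_query("京都", ["祇園", "寺", "神社"], min_score=2.0)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Because the related keywords sit in `should`, entries that mention only the original keyword in passing score low and can be cut by `min_score`, while a `sort` clause on a date field (whose name depends on the index mapping) could be added to the same body for newest-first ordering.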

Slide 70

Slide 70 text

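Outline of concept search
1. Related keyword extraction: extract keywords that are related to (co-occur with) the original query
2. Query expansion: add the related keywords to the query
3. Concept search: run a full-text search with the expanded query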

Slide 71

Slide 71 text

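Overall summary
• Raised precision through query expansion based on related keyword extraction
• Even a single input keyword is supplemented with Hatena Bookmark tag information
• High precision even in newest-first order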

Slide 72

Slide 72 text

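The future of Hatena Bookmark full-text search
• Evaluate search accuracy and optimize parameters
 • Improve recall while keeping precision at a high level
• Add fields and dictionaries for analysis to Elasticsearch
 • Reduce analysis errors for further accuracy gains
 • Consider a neologism dictionary such as neologd
• Related keyword extraction has many promising applications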

Slide 73

Slide 73 text

References
• Variations of TF-IDF, BM25, feature selection, information gain, pseudo-relevance feedback, learning to rank, etc.: R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
• Recall, precision, P@n, AP, MAP, Top-K with a binary heap, etc.: Büttcher, S., Clarke, C., Cormack, G. V. Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press, 2010.
• Poisson distribution model, IDF, RIDF, etc.: Manning, C. D., & Schütze, H. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
• Dynamic stop word list using IDF and RIDF: Amati, G., Carpineto, C., Romano, G. (Eds.). Advances in Information Retrieval, 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2-5, 2007, Proceedings. Lecture Notes in Computer Science, Springer, Volume 4425, 2007.
• Index term weighting using IDF and RIDF: 北, 津田, 獅々堀. 情報検索アルゴリズム (Information Retrieval Algorithms), 共立出版 (Kyoritsu Shuppan), 2002.