
Speeding Up Internal Wiki Search with Elasticsearch

Material for a KMC regular-meeting lecture

nonylene

June 05, 2017

Transcript

  1. Table of contents: 1. About PukiWiki 2. How to make it faster 3. Introducing Elasticsearch 4. Searching with Elasticsearch 5. Heineken (React) 6. Results and impressions
  2. Pain points of PukiWiki at KMC
    • The version is old: 1.4.8_alpha2 (2006)
    • It has been modified in small ways, for example to post to Slack
  3. PukiWiki data
    • All PukiWiki data is stored as text files
    $ ls /…/pukiwiki/wiki/
    28A1A6A1FEA1A629.txt
    31B6A6A5D0A5EAA5B1A1BCA5C9A5BFA5EFA1BCA5C7A5A3A5D5A5A7A5F3A5B9A5B2A1BCA5E0.txt
    323034384149A5B3A5F3A5C6A5B9A5C8.txt
    3332A5ADA5C3A5C1A5F3C0B0C8F7B7D7B2E8.txt
    …
    Note: each file name is the hex representation of the byte sequence obtained by encoding the page title in euc-jp.
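The file-naming rule on this slide can be sketched in a few lines of Python. This is only an illustration of the rule as stated (hex of the EUC-JP bytes of the title); the function names are hypothetical and are not taken from PukiWiki itself.

```python
def title_to_filename(title: str) -> str:
    """Hex-encode the EUC-JP bytes of a page title and append '.txt'."""
    return title.encode("euc_jp").hex().upper() + ".txt"


def filename_to_title(filename: str) -> str:
    """Recover the page title from a hex-encoded wiki file name."""
    stem = filename[:-4] if filename.endswith(".txt") else filename
    return bytes.fromhex(stem).decode("euc_jp")


# One of the file names from the listing above.
print(filename_to_title("323034384149A5B3A5F3A5C6A5B9A5C8.txt"))
```

Decoding the third file name from the listing gives the title "2048AIコンテスト" ("2048 AI contest").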
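  4. PukiWiki data
    • All PukiWiki data is stored as text files
    $ nkf /…/pukiwiki/wiki/BFB7B4BFA5B3A5F3A5D132303137.txt
    [[Welcome party for new members]]
    ~&size(25){Come on, everybody join in};
    *Table of contents [#jfaa7b62]
    #contents
    ~New students of course, but senior students and alumni too: let's all enjoy it together!
    …
    Note: the file is euc-jp, so it is converted with nkf here.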
  5. Why PukiWiki search is slow
    • Whenever PukiWiki runs a search…
    • PHP reads the contents of every Wiki file, every time (see do_search in lib/func.php)
    • It then runs a full-text scan over the strings it has read
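To make the cost concrete, here is a minimal Python sketch of the same idea: read every file and scan it on every query. It is not a port of PukiWiki's `do_search`; the function name and directory layout are assumptions for illustration only.

```python
from pathlib import Path


def naive_search(wiki_dir: str, keyword: str) -> list[str]:
    """Return the names of wiki files whose text contains keyword.

    Every call reads and decodes every file, so the cost grows with the
    total size of the wiki rather than with the number of hits.
    """
    hits = []
    for path in sorted(Path(wiki_dir).glob("*.txt")):
        # PukiWiki pages are EUC-JP text; undecodable bytes are replaced.
        text = path.read_bytes().decode("euc_jp", errors="replace")
        if keyword in text:
            hits.append(path.name)
    return hits
```

Nothing is remembered between queries, which is exactly what an index avoids.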
  6. Analysis in Elasticsearch
    • An Analyzer is made up of:
    • Char Filter: normalization and similar
    • Tokenizer: splits the text (this is the important part)
    • Token Filter: normalization and similar
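  7. Char Filter
    • Absorbs differences between characters
    • ｓ (zenkaku, full-width) → s (hankaku, half-width)
    • ㌶ → ヘクタール (hectare)
    • Removes unneeded characters before the text goes into the Tokenizer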
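  8. Char Filter
    • ICU Analysis Plugin
    • An official plugin
    • Normalizes text nicely
    • Full-width → half-width, decomposition of symbols, uppercase → lowercase, etc.
    • It does not unify kanji variants such as 渡邊 → 渡辺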
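The kind of normalization described here can be tried out with Python's standard library. This is only an approximation: it uses `unicodedata` NFKC plus case folding as a stand-in for the ICU plugin's `icu_normalizer` (whose default form is `nfkc_cf`), and the function name is hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC compatibility normalization, then case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


print(normalize("ｓ"))      # full-width Latin letter
print(normalize("㌶"))      # squared katakana symbol
print(normalize("ＫＭＣ"))  # full-width uppercase letters
```

Like the plugin, this leaves kanji variants such as 渡邊 and 渡辺 distinct, because they are different characters rather than compatibility forms of each other.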
  9. Tokenizer
    • Splits text into sensible pieces
    • Records the position of each piece ( Index )
    → The location of a search term can be found immediately
    → Faster than a full-text scan!
  10. Tokenizer
    • Index: [('こんにちは', 0), ('KMC', 6), ('Hello!', 10)]
    • Search for 'KMC' → we immediately know "it is at position 6"
  11. Tokenizer
    • Index: [('こんにちは', 0), ('KMC', 6), ('Hello!', 10)]
    • Search for 'KM' → "no such thing"
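The two slides above can be reproduced with a tiny positional index. This is a sketch that assumes the text is split on whitespace (the slides do not say which tokenizer produced the example); the function names are hypothetical.

```python
import re
from collections import defaultdict


def build_word_index(text: str) -> dict[str, list[int]]:
    """Map each whitespace-separated token to its start offsets."""
    index = defaultdict(list)
    for match in re.finditer(r"\S+", text):
        index[match.group()].append(match.start())
    return dict(index)


def lookup(index: dict[str, list[int]], term: str) -> list[int]:
    """Return the offsets of an exact token, or an empty list."""
    return index.get(term, [])


index = build_word_index("こんにちは KMC Hello!")
print(lookup(index, "KMC"))  # exact token: found without scanning the text
print(lookup(index, "KM"))   # partial token: not in the index
```

A lookup is a single dictionary access, but only whole tokens can be found, which is the limitation the next slides address.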
  12. Tokenizer
    • Method ①: split every n characters ( N-Gram )
    • "こんにちは、今日もいい天気ですね" →
    Index: [('こん', 0), ('んに', 1), ('にち', 2), ('ちは', 3), … , ('すね', 14)]
  13. Tokenizer
    • Index: [('こん', 0), ('んに', 1), ('にち', 2), ('ちは', 3), … , ('すね', 14)]
    • Search for 'こんにち' → 'こん' hits at position 0 → the grams that follow also match → we immediately know "it is at position 0"
  14. Tokenizer
    • Method ①: record every n characters ( N-Gram )
    • Advantages
    • Always hits (nothing is missed), but only when the query is at least as long as the gram size
    • Simple
  15. Tokenizer
    • Method ①: record every n characters ( N-Gram )
    • Disadvantages
    • The Index tends to become bloated
    • Tends to match unwanted things
    • Example: 'い天' → 'いい天気'
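The N-Gram method from the preceding slides can be sketched as a bigram index over the example sentence. This is an illustrative toy, not how Elasticsearch stores its index; the function names and the default n = 2 are assumptions matching the slide's example.

```python
from collections import defaultdict


def build_ngram_index(text: str, n: int = 2) -> dict[str, set[int]]:
    """Record the start position of every n-character window."""
    index = defaultdict(set)
    for pos in range(len(text) - n + 1):
        index[text[pos:pos + n]].add(pos)
    return dict(index)


def search(index: dict[str, set[int]], query: str, n: int = 2) -> list[int]:
    """Find positions where every gram of query appears consecutively."""
    if len(query) < n:
        return []  # queries shorter than the gram size cannot be matched
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    hits = []
    for pos in sorted(index.get(grams[0], ())):
        # Check that each following gram sits at the next offset.
        if all(pos + i in index.get(g, ()) for i, g in enumerate(grams)):
            hits.append(pos)
    return hits


index = build_ngram_index("こんにちは、今日もいい天気ですね")
print(search(index, "こんにち"))  # matches at the start of the sentence
print(search(index, "い天"))      # also matches inside 'いい天気'
```

The second query shows the drawback from the slide: a fragment that spans a word boundary still matches.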
  16. Token Filter
    • A filter applied to the tokens after tokenization
    • Unifying synonyms, -ed / -s forms, etc.
    • Upper and lower case are sometimes unified here as well
    • Not used in Heineken
  17. Elasticsearch data structure
    [Diagram: the Cluster "KMC" contains the Indexes "hoge" and "piyo"; each Index contains Types (page, relation, item, …); each Type contains Fields (title, body, modified; from, to; price, name, desc, available; …). The right-hand side of the diagram is cut off.]
  18. Elasticsearch data structure
    • Define an Index and a Type
    • Set Fields on the Type (Analyzer, Datatype, etc.)
    • Set the Type on the Index (mapping)
    [Diagram: Index: pukiwiki → Type: page → Fields: title, body, modified]
  19. Analyzer definition
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "jp_analyzer": {
              "tokenizer": "jp_tokenizer",
              "char_filter": [ "html_strip", "icu_normalizer" ],
              …
            }
          },
          "tokenizer": {
            "jp_tokenizer": {
              "type": "ngram",
              … ,
              "token_chars": [ "letter", "digit", "symbol", "punctuation" ]
            }
          }
        }
      },
      …
    }
  20. Mapping definition
    {
      … ,
      "mappings": {
        "page": {
          …
          "properties": {
            "title": { … },
            "title_url_encoded": { … },
            "body": {
              "type": "text",
              "analyzer": "jp_analyzer",
              "term_vector": "with_positions_offsets"
            },
            "modified": {
              "type": "date",
              "format": "strict_date_optional_time||epoch_millis"
            }
          }
        }
      }
    }
    [Diagram: Index: pukiwiki → Type: page → Fields: title, body, modified, title_url_encoded]
  21. Configuring Elasticsearch
    • Elasticsearch is RESTful
    • Basically everything is exchanged as JSON
    • Starting Elasticsearch brings up a web server
    • Send the JSON Index definition to it, from Python or similar, to configure it
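As a rough sketch of "sending the Index definition from Python", the snippet below builds an HTTP PUT request with the standard library. Assumptions: the node listens on `http://localhost:9200` (Elasticsearch's default port), the index is named `pukiwiki` as in the slides, and only the settings that are actually visible on the slides are included; everything the slides elide with "…" is left out, so this is not the full definition used by the author. The `page` mapping type reflects the Elasticsearch version of the time; recent versions no longer use mapping types.

```python
import json
import urllib.request

# Only the parts shown on the slides; elided ("…") settings are omitted.
INDEX_DEFINITION = {
    "settings": {"analysis": {
        "analyzer": {"jp_analyzer": {
            "tokenizer": "jp_tokenizer",
            "char_filter": ["html_strip", "icu_normalizer"],
        }},
        "tokenizer": {"jp_tokenizer": {
            "type": "ngram",
            "token_chars": ["letter", "digit", "symbol", "punctuation"],
        }},
    }},
    "mappings": {"page": {"properties": {
        "body": {
            "type": "text",
            "analyzer": "jp_analyzer",
            "term_vector": "with_positions_offsets",
        },
        "modified": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis",
        },
    }}},
}


def build_create_index_request(base_url: str, index: str,
                               definition: dict) -> urllib.request.Request:
    """Build (but do not send) the PUT request that creates an index."""
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request(
    "http://localhost:9200", "pukiwiki", INDEX_DEFINITION)
# Sending requires a running node with the ICU plugin installed:
# urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the definition easy to inspect without a running cluster.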
  22. Elasticsearch Cluster (for reference)
    [Diagram: the Cluster "KMC" has the Nodes "foo" and "bar"; the Shards hoge1 and hoge3 and the Replicas hoge1, hoge3, piyo1 and piyo2 are spread across the Nodes. The right-hand side of the diagram is cut off.]
  23. Heineken-crawler
    • A PukiWiki crawler written in Python3
    • Converts the character encoding to UTF-8
    • Extracts and converts the title
    • Sends the PukiWiki data to Elasticsearch
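The crawler steps listed above might look roughly like the following. This is not Heineken-crawler's actual code: the function name is hypothetical, and taking `modified` from the file's modification time is an assumption (the slides do not say where the timestamp comes from).

```python
from datetime import datetime, timezone
from pathlib import Path


def page_to_document(path: str) -> dict:
    """Turn one PukiWiki page file into a JSON-ready document."""
    page = Path(path)
    # The file name is the hex of the EUC-JP encoded title.
    title = bytes.fromhex(page.stem).decode("euc_jp")
    # Python strings are Unicode, so the body is serialized as UTF-8 later.
    body = page.read_bytes().decode("euc_jp", errors="replace")
    # Assumption: use the file's mtime as the last-modified timestamp.
    modified = datetime.fromtimestamp(
        page.stat().st_mtime, tz=timezone.utc).isoformat()
    return {"title": title, "body": body, "modified": modified}
```

Each resulting dictionary can then be serialized with `json.dumps` and sent to Elasticsearch over HTTP.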
  24. Scoring in Heineken
    • ② Weight by last-modified time
    • Tuned by intuition
    • origin -> now
    • offset -> 150 days
    • scale -> 500 days
    • decay -> 0.75
    https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
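To see what these parameters do, the decay can be computed by hand. The slide does not say which decay function (gauss, exp or linear) Heineken uses; the sketch below assumes `gauss` and follows the formula given in the linked function score documentation. The function name is hypothetical.

```python
import math


def gauss_decay(age_days: float, offset: float = 150.0,
                scale: float = 500.0, decay: float = 0.75) -> float:
    """Gaussian decay multiplier for a page modified age_days ago.

    Assumes origin = now, so the distance from origin is the page's age.
    """
    distance = max(0.0, abs(age_days) - offset)
    # sigma^2 is chosen so that the value equals `decay`
    # at a distance of offset + scale from the origin.
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))


for age in (0, 150, 650, 2000):
    print(age, round(gauss_decay(age), 3))
```

Pages modified within the last 150 days keep their full score, and a page 650 days old (offset + scale) is multiplied by 0.75.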
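  25. Scoring in Heineken
    • ③ Weight by how short the title is
    • The + 1 is there to handle one-character titles
    • log: chosen on a hunch
    • sqrt: chosen on a hunch
    $\text{score} \times \dfrac{1}{\sqrt{\log(\text{title.length} + 1)}}$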
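The title-length factor can be evaluated directly. The slide does not state the base of the logarithm, so the natural log is assumed here; the function name is hypothetical.

```python
import math


def title_length_weight(title: str) -> float:
    """Multiplier 1 / sqrt(log(len(title) + 1)); shorter titles weigh more."""
    if not title:
        raise ValueError("title must not be empty")
    # Without the + 1, a one-character title would give log(1) = 0
    # and the division would fail.
    return 1.0 / math.sqrt(math.log(len(title) + 1))


for title in ("a", "KMC", "Elasticsearch"):
    print(title, round(title_length_weight(title), 3))
```

The weight falls slowly as titles get longer, so a short title that matches the query is ranked above a long title that merely contains it.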
  26. Other features
    • Can extract a suitable excerpt around the search terms
    • Can also insert HTML tags for highlighting
    "fields": { "body": { "pre_tags": ["<mark>"], "post_tags": ["</mark>"], "fragment_size": 220, …, } }
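For context, the `fields` fragment above sits inside the `highlight` section of a search request. The sketch below builds such a request body in Python; the `match_phrase` query on `body` is an assumption for illustration (the slides do not show Heineken's query), and the settings elided with "…" are omitted.

```python
import json


def build_search_body(keyword: str) -> dict:
    """Build a search request body that asks for highlighted excerpts."""
    return {
        # Assumption: a simple phrase query; the real query is not shown.
        "query": {"match_phrase": {"body": keyword}},
        "highlight": {
            "fields": {
                "body": {
                    "pre_tags": ["<mark>"],
                    "post_tags": ["</mark>"],
                    "fragment_size": 220,
                }
            }
        },
    }


print(json.dumps(build_search_body("KMC"), ensure_ascii=False, indent=2))
```

Each hit in the response then carries a `highlight` section with excerpts in which the matched text is wrapped in `<mark>` tags.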
  27. Heineken app implementation
    • Elasticsearch is RESTful
    • Everything comes back as JSON
    → No data is needed other than Elasticsearch
    → Everything can be done in JavaScript
  28. React overview
    • A JS library for building UIs
    • Made by Facebook
    • Used in many places
    • Each part of the UI is built as a Component
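  29. React overview - Virtual DOM
    • The Virtual DOM holds a virtual copy of the DOM
    • When data changes, the Virtual DOM is changed
    • The difference from the real DOM is then applied
    • Efficient, because changes to the DOM are kept small
    Details: http://qiita.com/mizchi/items/4d25bc26def1719d52e6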
  30. React overview - JSX
    • JSX lets you write HTML inside JS
    • Convenient, because it can be written like a template engine
    const element = ( <h1> Hello, {username}! </h1> );
  31. React development environment
    • create-react-app
    • An easy React setup tool made by Facebook
    • Development environment
    • Build environment
    • Test environment
    https://github.com/facebookincubator/create-react-app
  32. React development environment
    • ES6 ( ECMAScript 6 )
    • The new JavaScript standard
    • class / Arrow function / const / Promise, etc.
    Details: https://www.slideshare.net/1000ch/begin-ecmascript6
  33. Other parts
    • Bootstrap
    • A CSS / JS framework made by Twitter
    • Member roster, "kami-e" (great artwork) uploader, etc.
  34. Heineken development
    1. Create a skeleton with create-react-app
    2. Build Components as needed
    • Call the Elasticsearch API and render the results
    3. Compile with Babel and Webpack
    4. Put the result on the server
  35. 1 / 500 (compared with our previous setup)
    • 25000 ms -> 50 ms
    [Bar chart, axis from 0 to 30,000: PukiWiki 25,000 vs. Heineken 50]
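  36. Impressions
    • Elasticsearch is amazing
    • Things just turn out well
    • React is convenient, and ES6 is fun
    • Not having to run anything on the server is easy
    • I am glad I was able to do what I wanted to do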