Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Поиск? Sphinx!

Поиск? Sphinx!

Sphinx считается одним из самых быстрых и гибких поисковых движков на рынке, но не является "коробочным" решением, чем отпугивает многих разработчиков. Я расскажу как быстро поднять полнотекстовый поиск для своего проекта на базе Sphinx, почему он крут и какие существуют интеграционные решения для Python.

Roman Zaiev

June 08, 2013
Tweet

More Decks by Roman Zaiev

Other Decks in Programming

Transcript

  1. полнотекстовый поиск фильтрация сортировка группировка сниппеты работа с существующими документами

    широкий набор атрибутов быстрая индексация гибкий язык запросов морфологический анализатор управление релевантностью масштабируемость скорость качество стоимость Обычные требования к поиску
  2. ПОИСК до 250 запросов в секунду на каждое ядро с

    1 000 000 документов * * зависит от размера индекса
  3. до 256 полей для индексации на один индекс до 32-х

    атрибутов различных типов * хватит на все случаи жизни
  4. $ apt-get install sphinx $ brew install sphinx Ubuntu OS

    X $ ./configure --with-pgsql $ make $ make install DIY
  5. sphinx.conf | connection source common { type = pgsql sql_host

    = localhost sql_user = sphinx sql_pass = sphinx sql_db = megaportal }
  6. sphinx.conf | source description source company: common { sql_query =\

    SELECT company.id, \ company.name, \ company.date_created \ FROM company sql_field_string = name sql_attr_timestamp = date_created }
  7. sphinx.conf | source description source company: common { sql_query =\

    SELECT company.id, \ company.name, \ company.date_created \ FROM company sql_field_string = name sql_attr_timestamp = date_created } выборка атрибуты
  8. sphinx.conf | index description index common { type = plain

    morphology = stem_en, stem_ru min_word_len = 2 charset_type = utf-8 html_strip = 1 html_remove_elements = script html_index_attrs = img=alt,title; a=title; min_stemming_len = 3 min_infix_len = 3 enable_star = 1 ... }
  9. sphinx.conf | index description index common { type = plain

    morphology = stem_en, stem_ru min_word_len = 2 charset_type = utf-8 html_strip = 1 html_remove_elements = script html_index_attrs = img=alt,title; a=title; min_stemming_len = 3 min_infix_len = 3 enable_star = 1 ... } «джентельменский набор» десятки других опций
  10. sphinx.conf | index description index company: common { source =

    company path = /path/to/index_files/megaportal }
  11. sphinx.conf | indexer tuning searchd { listen = 9306:mysql41 log

    = /path/to/logs/searchd.log query_log = /path/to/logs/query.log pid_file = /path/to/pid/searchd.pid }
  12. sphinx.conf | fatality source common { type = pgsql sql_host

    = localhost sql_user = sphinx sql_pass = sphinx sql_db = megaportal } source company: common { sql_query =\ SELECT company.id, \ company.name, \ company.date_created \ FROM company sql_field_string = name sql_attr_timestamp = date_created } index common { type = plain morphology = stem_en, stem_ru min_word_len = 2 charset_type = utf-8 html_strip = 1 html_remove_elements = script html_index_attrs = img=alt,title; a=title; min_stemming_len = 3 min_infix_len = 3 enable_star = 1 } index company: common { source = company path = /path/to/index_files/megaportal } indexer { mem_limit = 512M } searchd { listen = 9306:mysql41 log = /path/to/logs/searchd.log query_log = /path/to/logs/query.log pid_file = /path/to/pid/searchd.pid } описание документов описание индексов indexer + searchd
  13. Sphinx 2.0.6-release (r3473) Copyright (c) 2001-2012, Andrew Aksyonoff Copyright (c)

    2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com) using config file '/path/to/sphinx.conf'... indexing index 'common'... ERROR: index 'common': no valid sources configured; skipping. indexing index 'company'... collected 1066244 docs, 28.7 MB sorted 2.3 Mhits, 100.0% done total 1066244 docs, 28735925 bytes total 11.452 sec, 2509034 bytes/sec, 93097.50 docs/sec total 60 reads, 0.019 sec, 495.0 kb/call avg, 0.3 msec/call avg total 127 writes, 0.053 sec, 471.7 kb/call avg, 0.4 msec/call avg rotating indices: successfully sent SIGHUP to searchd (pid=56606).
  14. $ mysql -h 0 -P 9306 Welcome to the MySQL

    monitor. Commands end with ; or \g. Your MySQL connection id is 1 Server version: 2.0.6-release (r3473) Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql>
  15. mysql> select * from company where match('тнк'); +--------+--------+--------------+----------+ | id

    | status | date_created | owner_id | +--------+--------+--------------+----------+ | 5015 | 6 | 2008 | 5019 | | 25502 | 3 | 2009 | 25507 | | 39771 | 6 | 2009 | 39776 | | 152307 | 1 | 2010 | 152380 | | 183905 | 3 | 2010 | 184097 | | 194302 | 6 | 2010 | 194517 | | 218982 | 1 | 2011 | 219439 | | 235881 | 3 | 2011 | 236408 | | 287319 | 3 | 2011 | 288131 | | 338476 | 3 | 2011 | 339574 | | 340073 | 6 | 2011 | 341177 | | 471410 | 2 | 2012 | 473498 | | 513023 | 0 | 2012 | 515276 | | 768093 | 1 | 2012 | 770983 | | 823359 | 6 | 2012 | 826706 | | 915374 | 3 | 2013 | 919765 | +--------+--------+--------------+----------+ 16 rows in set (0.00 sec)
  16. mysql> select * from company where match('тнк') and date_created between

    2010 and 2012; +--------+--------+--------------+----------+ | id | status | date_created | owner_id | +--------+--------+--------------+----------+ | 152307 | 1 | 2010 | 152380 | | 183905 | 3 | 2010 | 184097 | | 194302 | 6 | 2010 | 194517 | | 218982 | 1 | 2011 | 219439 | | 235881 | 3 | 2011 | 236408 | | 287319 | 3 | 2011 | 288131 | | 338476 | 3 | 2011 | 339574 | | 340073 | 6 | 2011 | 341177 | | 471410 | 2 | 2012 | 473498 | | 513023 | 0 | 2012 | 515276 | | 768093 | 1 | 2012 | 770983 | | 823359 | 6 | 2012 | 826706 | +--------+--------+--------------+----------+ 12 rows in set (0.01 sec)
  17. mysql> SHOW META; +---------------+--------+ | Variable_name | Value | +---------------+--------+

    | total | 6 | | total_found | 6 | | time | 0.000 | | keyword[0] | тнк | | docs[0] | 16 | | hits[0] | 16 | +---------------+--------+ 6 rows in set (0.00 sec) mysql> DESC company; +--------------+-----------+ | Field | Type | +--------------+-----------+ | id | integer | | name | field | | status | uint | | date_created | timestamp | | owner_id | uint | +--------------+-----------+ 5 rows in set (0.00 sec)
  18. «Порционный» запрос source company: common { sql_query_range =\ SELECT MIN(id),

    MAX(id) from company sql_range_step = 10000 sql_query =\ SELECT company.id, \ company.name, \ company.date_created \ FROM company \ WHERE company.id BETWEEN $start AND $end sql_field_string = name sql_attr_timestamp = date_created }
  19. Расширяем «охват» индекса source company: common { ... sql_query =\

    SELECT company.id, \ company.name, \ company.date_created, \ "user".email as owner_email \ FROM company \ LEFT JOIN "user" \ ON company.owner_id = "user".id \ WHERE company.id BETWEEN $start AND $end sql_field_string = name sql_field_string = owner_email sql_attr_timestamp = date_created }
  20. One-to-many? M2M? MVA! source company: common { ... sql_field_string =

    name sql_field_string = owner_email sql_attr_timestamp = date_created sql_attr_multi =\ uint products from ranged-query; \ SELECT company_id, id FROM product \ WHERE id >= $start AND id <= $end; \ SELECT MIN(id), MAX(id) FROM product }
  21. mysql> select name, products from company where match('тнк') and products

    in (109503, 1123362); +------------------------------+----------+ | name | products | +------------------------------+----------+ | ТНК-Транс | 109503 | | ООО «ТНК-Транс» | 1123362 | +------------------------------+----------+ 2 rows in set (0.00 sec)
  22. source company: common { sql_query_pre = SET @maxts:=(SELECT NOW()) sql_query

    = SELECT company.id, company.name FROM company \ WHERE company.created_at < @maxts ... sql_query_post =\ REPLACE INTO search_deltacounters \ VALUES (@id, 'company_tmp', @maxts) sql_query_post_index =\ DELETE FROM search_deltacounters \ WHERE tablename='company' sql_query_post_index =\ UPDATE search_deltacounters \ SET tablename='company' WHERE tablename='company_tmp' } Модифицируем основной индекс
  23. source company_delta: company { sql_query_pre =\ SET @maxts=(SELECT maxts FROM

    search_deltacounters \ WHERE tablename='company') sql_query = SELECT company.id, company.name FROM company \ WHERE company.created_at >= @maxts ... sql_query_post = sql_query_post_index = } Создаём дельту index company_delta: common { source = company_delta path = /path/to/indexes/megaportal }
  24. mysql> select * from company, company_delta where match('тнк'); +--------+--------+--------------+----------+ |

    id | status | date_created | owner_id | +--------+--------+--------------+----------+ | 5015 | 6 | 2008 | 5019 | | 25502 | 3 | 2009 | 25507 | | 39771 | 6 | 2009 | 39776 | | 152307 | 1 | 2010 | 152380 | | 183905 | 3 | 2010 | 184097 | | 194302 | 6 | 2010 | 194517 | | 218982 | 1 | 2011 | 219439 | | 235881 | 3 | 2011 | 236408 | | 287319 | 3 | 2011 | 288131 | | 338476 | 3 | 2011 | 339574 | | 340073 | 6 | 2011 | 341177 | | 471410 | 2 | 2012 | 473498 | | 513023 | 0 | 2012 | 515276 | | 768093 | 1 | 2012 | 770983 | | 823359 | 6 | 2012 | 826706 | | 915374 | 3 | 2013 | 919765 | +--------+--------+--------------+----------+ 16 rows in set (0.00 sec)
  25. :(

  26. sphinxit | config class SearchConfig(object): DEBUG = True WITH_META =

    True WITH_STATUS = True SEARCHD_CONNECTION = { 'host': '127.0.0.1', 'port': 9306, }
  27. sphinxit | usage from sphinxit.core.processor import Search, Snippet company_search =

    ( Search( indexes=['company'], config=SearchConfig ) .match('ТНК') ) SELECT * FROM company WHERE MATCH('ТНК')
  28. sphinxit | usage search = Search(['company'], config=SearchConfig) search = (

    search .match('ТНК') .select('id', 'name') .options( ranker='proximity', max_matches=100, ) .order_by('name', 'desc') ) SELECT id, name FROM company WHERE MATCH('ТНК') ORDER BY name DESC OPTION max_matches=100, ranker=proximity
  29. sphinxit | usage search = Search(['company'], config=SearchConfig) search = (

    search .match('ТНК') .filter(date_created__lte=datetime.date.today()) ) results = search.ask() SELECT * FROM company WHERE MATCH('ТНК') AND date_created<=1370552400
  30. sphinxit | result { u'result': [ { 'date_created': 2008L, 'owner_email':

    u'[email protected]', 'products': u'2,5', 'id': 5015L, 'name': u'\u0422\u041d\u041a', }, { 'date_created': 2009L, 'owner_email': u'[email protected]', 'products': u'2,5', 'id': 25502L, 'name': u'\u0422\u041d\u041a \u0418\u043d\u0442\u0435\u0440\u043', } ], ... u'meta': { u'total': u'16', u'total_found': u'16', u'docs[0]': u'16', u'time': u'0.000', u'hits[0]': u'16', u'keyword[0]': u'\u0442\u043d\u043a' } }
  31. sphinxit | update syntax search = Search(['company'], config=SearchConfig) search =

    ( search .match('ТНК') .update(products=(5,2)) .filter(id__gt=1) ) UPDATE company SET products=(5,2) WHERE MATCH('ТНК') AND id>1
  32. sphinxit | snippets syntax snippets = ( Snippet(index='company', config=SearchConfig) .for_query("Me

    amore") .from_data("amore", "amore mia") ) CALL SNIPPETS ( ('amore', 'me amore'), 'company', 'Me amore' ); { u'result': [ {'snippet': u'<b>amore</b>'}, {'snippet': u'<b>amore</b> mia'} ] }
  33. sphinxit | snippets syntax snippets = ( Snippet(index='company', config=SearchConfig) .for_query("Me

    amore") .from_data("amore mia") .options( before_match='<strong>', after_match='</strong>', ) ) CALL SNIPPETS ( 'amore mia', 'company', 'Me amore', '<strong>' AS before_match, '</strong>' AS after_match )