Introduction to Sphinx

Introduction to Sphinx: full-text search on steroids
February 5, 2013, Percona MySQL University, Montevideo

DIEGO SAPRIZA
Senior Software Engineer, Case Design Inc.
@AV4TAr
http://about.me/diego.sapriza

•  MyISAM full-text
•  InnoDB full-text (MySQL v5.6)
•  Apache Solr (Java / web services)
•  Sphinx Search (C++)

1.  MyISAM table
2.  column with a full-text index
3.  millions of rows :(
4.  we set up a slave and we're done!!!
5.  slave lag :(

SPHINX?

•  full-text search engine
•  SQL paradigm
•  high performance

why?
•  very fast searches
•  scalability
•  indexes more than text (numeric attributes for filtering)
•  can stand in for MySQL when sorting and grouping
•  a choice of operators and "matching modes"

and more
•  faceted search
•  morphology
•  geo-distance
•  HTML stripper
•  …

two programs
•  indexer: queries the database and builds the indexes.
•  searchd: uses the indexes to answer client queries.

client programs
–  native API: PHP, Python, Ruby, Java…
–  SphinxSE, a MySQL storage engine

data sources
•  MySQL, PostgreSQL, XML pipes (MongoDB, ...), ODBC

indexes
•  morphology, stopwords, wordforms, exceptions, charset, HTML strip…

types
•  disk based, real time, distributed.

single host

data source (single host)
•  attributes: filtering and grouping
•  full text

local index (single host)

indexing

# ./indexer test1

# ./indexer user_timelines --rotate
Sphinx 2.0.3-release (r3043)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/sphinx/etc/sphinx.conf'...
indexing index 'user_timelines'...
collected 1303297 docs, 4631.5 MB
sorted 769.8 Mhits, 100.0% done
total 1303297 docs, 4631519329 bytes
total 1463.481 sec, 3164727 bytes/sec, 890.54 docs/sec
total 1665 reads, 62.531 sec, 1639.9 kb/call avg, 37.5 msec/call avg
total 5302 writes, 12.536 sec, 1022.3 kb/call avg, 2.3 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=22994).

indexing main: ~24 minutes, 4 GB.

mysql> CREATE FULLTEXT INDEX bft ON links_metadata (body);
Query OK, 773125 rows affected (39 min 24.10 sec)
Records: 773125  Duplicates: 0  Warnings: 0

MySQL MyISAM

search!

<?php
require_once '../lib/sphinxapi.php';
require_once '../lib/sphinx_help.php';

// Match documents that contain any of the query words.
$mode = SPH_MATCH_ANY;

$cl = new SphinxClient();
$cl->SetServer('127.0.0.1', 9312);
$cl->SetArrayResult(true);
$cl->SetWeights(array(100, 1));
$cl->SetMatchMode($mode);

// Search the 'main' index and print the result.
$res = $cl->Query('rojo azul', 'main');
show_result($res, $cl);

# ./search -i main 'azul|rojo'

# mysql -P9306 --protocol=tcp --prompt='sphinxQL> '
sphinxQL> select * from main WHERE MATCH('azul|rojo');

distributed

distributed index

•  remote indexes are queried in parallel
•  a single result set is returned
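The fan-out and merge described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how searchd is implemented: the `SHARDS` data, `search_shard` and `search_distributed` are hypothetical names, and each shard is an in-memory dictionary standing in for a remote index.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each maps a keyword to (doc_id, weight) matches.
SHARDS = [
    {"superbowl": [(101, 4675), (102, 4601)]},
    {"superbowl": [(201, 4673)]},
    {"superbowl": [(301, 4690), (302, 4500)]},
]


def search_shard(shard, keyword):
    """Stand-in for a query sent to one remote index."""
    return shard.get(keyword, [])


def search_distributed(keyword, limit=20):
    """Query every shard in parallel and merge into one ranked result."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: search_shard(s, keyword), SHARDS)
        merged = [match for part in partials for match in part]
    # Highest weight first; ties broken by document id for a stable order.
    merged.sort(key=lambda m: (-m[1], m[0]))
    return merged[:limit]


if __name__ == "__main__":
    for doc_id, weight in search_distributed("superbowl"):
        print(doc_id, weight)
```

The caller sees one ranked list and does not need to know how many shards answered.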

What do I index? Everything, or only what is new? Real time?

the delta, you must use.

the "delta"

# ./indexer delta_user_timelines --rotate
Sphinx 2.0.3-release (r3043)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/sphinx/etc/sphinx.conf'...
indexing index 'delta_user_timelines'...
collected 21350 docs, 75.6 MB
sorted 12.3 Mhits, 100.0% done
total 21350 docs, 75559310 bytes
total 12.586 sec, 6003402 bytes/sec, 1696.31 docs/sec
total 4 reads, 0.043 sec, 10717.1 kb/call avg, 10.7 msec/call avg
total 95 writes, 0.155 sec, 916.9 kb/call avg, 1.6 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=22994).

indexing delta
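The main + delta split can be modelled with plain Python dictionaries. This is a minimal sketch under stated assumptions: `documents` stands in for the database table, `counter` for the `sph_counter` row used in this deck's configuration, and `build_main`, `build_delta` and `search` are hypothetical helpers rather than Sphinx calls.

```python
# In-memory stand-ins for the database table and the sph_counter row.
documents = {1: "rojo", 2: "azul", 3: "rojo azul"}
counter = {"max_doc_id": 0}


def build_main():
    """Full reindex: record the current max id, index everything up to it."""
    counter["max_doc_id"] = max(documents, default=0)
    return {i: t for i, t in documents.items() if i <= counter["max_doc_id"]}


def build_delta():
    """Index only the documents added after the last main build."""
    return {i: t for i, t in documents.items() if i > counter["max_doc_id"]}


def search(keyword, *indexes):
    """Query several indexes and return the matching ids in ascending order."""
    hits = set()
    for index in indexes:
        hits.update(i for i, text in index.items() if keyword in text.split())
    return sorted(hits)


if __name__ == "__main__":
    main = build_main()
    documents[4] = "verde rojo"  # a new row arrives after the main build
    delta = build_delta()
    print(search("rojo", main, delta))
```

The delta build touches only the rows above the recorded id, which is why it is so much cheaper than a full reindex; searching both indexes together still covers every document.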

we reindex everything

indexing time

Watch out for disk space!!!

TIP: sphinx.conf.php

#!/usr/bin/php
<?php for ($i = 1; $i <= 4; $i++) { ?>
source chunk<?= $i ?> {
    sql_host = localhost
    sql_user = sphinx_usr
    sql_pass = ****
    sql_db   = dbchunk<?= $i ?>

    . . .
}
<?php } // end source loop ?>
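The same trick works in any scripting language. Below is a rough Python equivalent of the PHP loop, assuming (as the tip does) that the configuration file starts with a shebang line and that its standard output is read as the configuration. `SOURCE_TEMPLATE` and `render_sources` are illustrative names; the settings are the ones shown on the slide, with the rest elided in the same way.

```python
#!/usr/bin/env python3
# Template for one source block; doubled braces are literal braces.
SOURCE_TEMPLATE = """source chunk{n}
{{
    sql_host = localhost
    sql_user = sphinx_usr
    sql_pass = ****
    sql_db   = dbchunk{n}
    # . . . remaining settings elided, as on the slide
}}
"""


def render_sources(count=4):
    """Return one source block per database chunk, numbered from 1."""
    return "\n".join(SOURCE_TEMPLATE.format(n=i) for i in range(1, count + 1))


if __name__ == "__main__":
    print(render_sources())
```

Generating the blocks keeps the chunk definitions identical except for the number, so adding a fifth chunk is a one-character change.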

Statistics
src: http://www.percona.tv/percona-webinars/full-text-search-throwdown

solution          build   insert   storage   query
LIKE expression   0       0        0         49k~399k ms    SQL
FULLTEXT MyISAM   31:18   33:28    2382MB    16~200ms       MySQL
FULLTEXT InnoDB   25:57   55:46    ?         350~740ms      MySQL 5.6
Apache Solr       n/a     14:28    2766MB    79ms           Java
Sphinx Search     n/a     8:20     3487MB    13ms           C++

facts
•  standalone
•  not specific to MySQL
•  does not update its indexes on its own
•  Sphinx returns only ids
•  can sort by relevance
•  exact search / boolean search / …
•  APIs in several languages
•  implements the MySQL protocol
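Because the search result carries ids (plus any indexed attributes) rather than the documents themselves, the application usually does a second lookup in the database. A minimal sketch of that step follows, using SQLite as a stand-in for MySQL so it runs anywhere; `fetch_in_rank_order`, the `title` column and the sample rows are hypothetical.

```python
import sqlite3


def fetch_in_rank_order(conn, ranked_ids):
    """Load rows for the ids returned by search, preserving their ranking."""
    if not ranked_ids:
        return []
    placeholders = ",".join("?" for _ in ranked_ids)
    rows = conn.execute(
        f"SELECT id, title FROM links_metadata WHERE id IN ({placeholders})",
        ranked_ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database returns rows in arbitrary order; restore the search ranking.
    return [by_id[i] for i in ranked_ids if i in by_id]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links_metadata (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO links_metadata VALUES (?, ?)",
        [(1, "uno"), (2, "dos"), (3, "tres")],
    )
    # Hypothetical ids, in the order a search engine might rank them.
    print(fetch_in_rank_order(conn, [3, 1]))
```

Ids that no longer exist in the database are skipped, which can happen when the index is older than the table.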

Questions?
d.sapriza@case-inc.com
@AV4TAr •  http://about.me/diego.sapriza

Montevideo MySQL Meetup
http://meetup.uy/mysql
2nd Thursday of every month, 19:00 to 21:00

cta1sfter:/srv/sphinx/bin# mysql -P9306 --protocol=tcp --prompt='sphinxQL> '
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.0.3-release (r3043)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

sphinxQL> SELECT * from user_timelines WHERE MATCH ('superbowl');
+-----------+--------+------------+-----------+----------+--------+-----------+---------------+
| id        | weight | twitter_id | tweets_id | link_id  | tld_id | extracted | created_stamp |
+-----------+--------+------------+-----------+----------+--------+-----------+---------------+
| 109531197 |   4675 |   24488771 |  57371370 | 35471785 | 132427 |         1 |    1359858567 |
| 109492540 |   4673 |   56690354 |  57351558 | 35459063 |    685 |         1 |    1359843568 |
| 109493484 |   4673 |   24488771 |  57351953 | 35459063 |    685 |         1 |    1359843239 |
| 109496715 |   4673 |   24488771 |  57353282 | 35459063 |    685 |         1 |    1359843352 |
| 109496743 |   4673 |   24488771 |  57353292 | 35459063 |    685 |         1 |    1359843241 |
| 109496779 |   4673 |   24488771 |  57353305 | 35459063 |    685 |         1 |    1359842932 |
...
+-----------+--------+------------+-----------+----------+--------+-----------+---------------+
20 rows in set (0.04 sec)

sphinxQL> show meta;
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| total         | 1000      |
| total_found   | 6302      |
| time          | 0.034     |
| keyword[0]    | superbowl |
| docs[0]       | 6302      |
| hits[0]       | 12189     |
+---------------+-----------+
6 rows in set (0.00 sec)

source user_timelines : base
{
    sql_query_pre = SELECT @tt_id:=id FROM `tweets_timelines` \
        WHERE `created` <= DATE_SUB(CURDATE(),INTERVAL 8 DAY) \
        ORDER BY created DESC LIMIT 1
    sql_query_pre = REPLACE INTO sph_counter SET counter_id = "user_timelines", \
        modif=NOW(), \
        max_doc_id = ( SELECT MAX(id) max FROM tweets_timelines), \
        last_doc_id = max_doc_id

    sql_query = SELECT tt.id, tt.twitter_id, tt.tweets_id, lm.id AS link_id, \
        lm.expanded_link, lm.title, lm.description, lm.body, lm.tld_id, lm.extracted, \
        UNIX_TIMESTAMP(tt.created) AS created_stamp \
        FROM links_metadata lm, tweets_timelines tt \
        WHERE tt.id >= @tt_id AND lm.extracted = 1 AND tt.links_id = lm.id AND \
        tt.id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id="user_timelines")

    sql_attr_uint = twitter_id
    sql_attr_uint = tweets_id
    sql_attr_uint = link_id
    sql_attr_uint = tld_id
    sql_attr_timestamp = created_stamp
    sql_attr_uint = extracted
}

index user_timelines
{
    source = user_timelines
    html_strip = 1
    html_remove_elements = a, img
    path = /sphinx/data/user_timelines_index
    docinfo = extern
    charset_type = utf-8
}

source delta_user_timelines : user_timelines
{
    sql_query_pre = SET NAMES utf8
    sql_query_pre = SELECT @tt_id:=id FROM `tweets_timelines` \
        WHERE `created` <= DATE_SUB(CURDATE(),INTERVAL 8 DAY) \
        ORDER BY created DESC LIMIT 1
    sql_query_pre = SELECT @max:=max(tt.id) \
        FROM links_metadata lm, tweets_timelines tt \
        WHERE lm.extracted = 1 AND tt.links_id = lm.id

    sql_query = SELECT tt.id, tt.twitter_id, tt.tweets_id, lm.id AS link_id, \
        lm.expanded_link, lm.title, lm.description, lm.body, lm.tld_id, lm.extracted, \
        UNIX_TIMESTAMP(tt.created) AS created_stamp \
        FROM links_metadata lm, tweets_timelines tt \
        WHERE tt.id >= @tt_id AND lm.extracted = 1 AND tt.links_id = lm.id AND \
        tt.id > ( SELECT max_doc_id FROM sph_counter WHERE counter_id="user_timelines" )

    sql_query_post = UPDATE sph_counter SET last_doc_id=@max WHERE counter_id="user_timelines"
}

index delta_user_timelines : user_timelines
{
    source = delta_user_timelines
    html_strip = 1
    html_remove_elements = a, img
    path = /sphinx/data/delta_user_timelines_index
    docinfo = extern
    charset_type = utf-8
}

links
•  Introduction to Search with Sphinx
   –  http://shop.oreilly.com/product/9780596809539.do
•  http://sphinxsearch.com/
•  Engine comparison
   –  http://www.percona.tv/percona-webinars/full-text-search-throwdown
•  http://www.percona.com/files/presentations/opensql2008_sphinx.pdf
•  http://mysqlperformanceblog.com/tag/sphinx/
•  http://av4tar.blogspot.com
•  http://av4tar.github.com/Mysql-Meetup-sphinxsearch/
