Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Make It Searchable (Midwest PHP 2018)

Make It Searchable (Midwest PHP 2018)

With data available on the internet doubling at an amazing rate, the modern web is increasingly focused on search. There are many ways to implement search in your PHP application. Some of them are downright terrible, while others are truly amazing. This talk will introduce you to the basics of various search engines like Solr, Sphinx, and ElasticSearch. We’ll learn strengths and weaknesses of each as well as how to integrate them into your existing PHP application. Every application has differing and often very niche needs for search, so our goal is to show you what’s available and help you make an educated selection for which type of search is right for your application.

Josh Butts

March 09, 2018
Tweet

More Decks by Josh Butts

Other Decks in Programming

Transcript

  1. About Me • SVP of Engineering,
 Ziff Davis • Austin

    PHP Organizer • github.com/jimbojsb • @jimbojsb 2
  2. Search - Who cares? • In 2018, users expect search

    • The expect it to be as good as Google • Hopefully you’re creating enough data to warrant it • It always gets kicked to the next sprint 3
  3. Search Use Cases- Wildcards • Yes, this is sometimes ok!

    • Not always a good user experience, but it’s easy • Probably going to do this on varchar(255) or shorter 5
  4. Search Use Cases - Full Text • Probably good for

    TEXT fields • Articles, blog posts, item descriptions, etc • Usually language-aware, understands word boundaries • Possibly NLP components as well 6
  5. Search Use Cases - Faceted • Table stakes for retail

    e-commerce • Left-nav with categories & attributes • Checkboxes refine the list • Parenthetical counts of options 7
  6. Filtering vs Faceting • Filtering - reduce the set of

    results, excluding items that don’t match certain criteria • Faceting - Special kind of filtering, primarily for driving UI. Pre-indexed and counted • The UI we are accustomed to often is a combination of these things 8
  7. A Quick Note • Most of these examples are based

    on the Sakila free data set from MySQL 11
  8. I have no time, budget, plan or cares • We’ve

    all been there • Just throw a LIKE clause on there and call it good enough • Just so we’re clear, the database’s implementation of this is literally the worst 12
  9. Like, just use LIKE, right? 13 SELECT * FROM film

    WHERE title LIKE '%cat%' OR description LIKE ‘%cat%'
  10. So how bad is it really? 14 EXPLAIN SELECT *

    FROM film WHERE title LIKE '%cat%' OR description LIKE '%cat%'
  11. Surely we can do better than that • This will

    mostly likely produce erroneous results • As your data grows, the time to search grows linearly • Must check every character of every row and field you want to search 15
  12. MySQL Full Text Indices • Supports simple, NLP and boolean

    modes • Requires you to commit up front to the columns • Works on CHAR, VARCHAR, TEXT 16
  13. New-to-you SQL syntax 17 SELECT * FROM film WHERE MATCH

    (title,description) AGAINST (‘cat') SELECT * FROM film WHERE MATCH (title,description) AGAINST ('+drama -cat' IN BOOLEAN MODE) ALTER TABLE film ADD FULLTEXT INDEX description (title, description)
  14. So how much better is it? • I cheated and

    grabbed a big data set from work… • 2s - LIKE ‘%samsung%’ • 20ms - MATCH AGAINST(‘samsung’) • 4s - LIKE ‘%samsung%’ AND NOT LIKE ‘%tv%’ • 9ms - MATCH AGAINST(‘+samsung -tv IN BOOLEAN MODE’) 18
  15. So - Talk over? • MySQL Full Text clearly works

    well • It’s reasonably fast and it uses tools you know • Remember there are other search use cases • MySql Full Text isn’t all that configurable in the grand scheme of things 19
  16. Sphinx • Sphinx is a standalone full-text indexer that speaks

    to MySQL or Postgres • Far more tunable to your data set than MySQL • Performance may be orders of magnitude faster 20
  17. How do I Sphinx it up? • Install Sphinx server

    (sphinxsearch) • Write a Sphinx config • Index your data (real time or crontab) • Connect PHP to Sphinx • Profit? 21
  18. Sphinx Config 22 source sakila_films { type = mysql sql_host

    = mysql sql_user = sakila sql_db = sakila sql_pass = sakila sql_query = SELECT film_id, title, rating, description FROM film sql_attr_string = rating sql_attr_multi = uint actors from query; SELECT film_id, actor_id FROM film_actor } index films { source = sakila_films path = /var/idx/sphinx/sakila_films morphology = stem_en }
  19. Run the Indexer • Sphinx indexer is a separate binary

    • indexer --all --rotate • Run this on a schedule 23 Sphinx 2.1.9-id64-release (rel21-r4761) Copyright (c) 2001-2014, Andrew Aksyonoff Copyright (c) 2008-2014, Sphinx Technologies Inc (http://sphinxsearch.com) using config file '/usr/local/etc/sphinx.conf'... indexing index 'films'... collected 1000 docs, 0.1 MB collected 5462 attr values sorted 0.0 Mvalues, 100.0% done sorted 0.0 Mhits, 100.0% done total 1000 docs, 93842 bytes total 0.086 sec, 1087115 bytes/sec, 11584.53 docs/sec
  20. Test With CLI Tools 24 bash-4.1# search cat Sphinx 2.1.9-id64-release

    (rel21-r4761) Copyright (c) 2001-2014, Andrew Aksyonoff Copyright (c) 2008-2014, Sphinx Technologies Inc (http://sphinxsearch.com) using config file '/usr/local/etc/sphinx.conf'... index 'films': query 'cat ': returned 71 matches of 71 total in 0.003 sec displaying matches: 1. document=464, weight=1616, rating=PG-13, actors=(76,198) 2. document=11, weight=1584, rating=G, actors=(40,81,90,174) 3. document=15, weight=1584, rating=NC-17, actors=(36,69,105,117,164,170) 4. document=53, weight=1584, rating=NC-17, actors=(6,30,47,70,102,149,156) 5. document=55, weight=1584, rating=G, actors=(66,118,166,173,177) 6. document=56, weight=1584, rating=G, actors=(4,22,33,43,129,136,137,168) 7. document=70, weight=1584, rating=NC-17, actors=(6,104)
  21. Connect PHP to Sphinx 25 /** gigablah/sphinx **/ $client =

    new \Sphinx\SphinxClient(); $client->setServer("localhost", "9312"); $client->setLimits(0, 5); $client->setMatchMode(\Sphinx\SphinxClient::SPH_MATCH_ALL); $results = $client->query("saga", "films"); var_dump($results); ["matches"]=> array(5) { [7]=> array(2) { ["weight"]=> string(1) "1" ["attrs"]=> array(2) { ["rating"]=> string(5) "PG-13" ["actors"]=> array(5) { [0]=> int(99) [1]=> int(133) [2]=> int(162) [3]=> int(170) [4]=> int(185) } } }
  22. But my data…I needed that 26 $filmIds = array_keys($results["matches"]); $placeholders

    = implode(",", array_fill(0, count($filmIds), “?")); $films = $db->fetchObjects("SELECT * FROM film WHERE film_id IN ($placeholders)", $filmIds); foreach ($films as $film) { echo $film->title . " (" . $results["matches"][$film->film_id]["attrs"]["rating"] . ")\n"; } AIRPLANE SIERRA (PG-13) ALASKA PHANTOM (PG) ARMY FLINTSTONES (R) BEAR GRACELAND (R) BERETS AGENT (PG-13)
  23. Sphinx SQL • Sphinx also has SphinxSQL • Speaks MySQL

    4.1 protocol • You can query with this instead of the native API, using DB tools you already know • You can “INSERT” into the indexes real- time 27
  24. So I should use Sphinx then… • Depends on how

    big your data set is • “Samsung” search in 1ms • In production with 10M rows, it’s way faster • FTI’s can chew RAM and slow down your database inserts dramatically • Scale sphinx separately as needed 28
  25. Ok, maybe Solr is right for you • Standalone full-text

    search, written in Java • XML and stuff • Starting to support facets and other advanced search tools • If you’re indexing long-form text (academia, newspaper articles, etc), this is what you want • solarium/solarium as your client 30
  26. Sidebar - Lucene rules the search world • Apache Lucene

    powers Solr • Solr is really just Lucene in a daemon • You can use Lucene with raw files on disk directly from PHP… 31
  27. What’s that shiny thing over there? • You probably thought

    we’d talk about Elastic Search first • Spoiler alert, Elastic is also powered by Lucene • Indexes JSON documents • Rest API so you can talk to it from most anywhere 35
  28. Why is it called elastic? • Solr, Sphinx, etc work

    really well on one big server, but are harder to scale horizontally • Elastic is designed to be distributed on multi-node clusters • Elastic is fairly easy to use in a multi-tenant environment 36
  29. This could go very poorly… • You need to pay

    attention to the clustering dynamics of Elastic • The dreaded split brain • It’s complicated enough that AWS offers a managed version 37
  30. Wait, is this a document database? • Yeah, kinda •

    Can index arbitrarily deep JSON documents • That JSON may also include a giant blob of text for FTS • Proper support for filtering and faceting on numerical and string values • Note - PHP clients generally use associative arrays and not raw JSON 38
  31. Getting content in to Elastic Search 39 $clientBuilder = Elasticsearch\ClientBuilder::create();

    $clientBuilder->setHosts(["127.0.0.1"]); $client = $clientBuilder->build(); $indexParams = [ "index" => "films", "type" => "film", ]; foreach ($films as $film) { $docParams = [ "id" => $film["film_id"], "body" => $film ]; $client->index(array_merge($indexParams, $docParams)); }
  32. Yet another way to search 40 $clientBuilder = Elasticsearch\ClientBuilder::create(); $clientBuilder->setHosts(["127.0.0.1"]);

    $client = $clientBuilder->build(); $params = [ 'index' => 'films', 'type' => 'film', 'body' => [ 'query' => [ 'match' => [ 'title' => 'airplane' ] ] ] ]; $results = $client->search($params);
  33. 41 Yet another way to search array(3) { ["total"]=> int(2)

    ["max_score"]=> float(4.9511213) ["hits"]=> array(2) { [0]=> array(5) { ["_index"]=> string(5) "films" ["_type"]=> string(4) "film" ["_id"]=> string(3) "711" ["_score"]=> float(4.9511213) ["_source"]=> array(13) { ["film_id"]=> string(3) "711" ["title"]=> string(15) "RAGING AIRPLANE" ["description"]=> string(107) "A Astounding Display of a Secret Agent And a Technical Writer who must Escape a Mad Scientist in A Jet Boat" ["release_year"]=> string(4) "2006" ["language_id"]=> string(1) "1" ["original_language_id"]=>
  34. • Elastic Search called facets aggregations • This is a

    whole other talk unto itself • Aggregations are extremely powerful compared to what is available in other search products • These get rolled up into one result but are multiple parallels queries 42 Ok but you promised facets
  35. Ok but you promised facets 43 $params = [ 'index'

    => 'films', 'type' => 'film', 'body' => [ 'query' => [ 'multi_match' => [ 'query' => 'cat', 'fields' => ['title', 'description'] ] ], 'aggs' => [ 'ratings_count' => [ 'value_count' => [ 'field' => 'rating' ] ] ] ] ]; $results = $client->search($params);
  36. Listen, can I just pay someone to do this? •

    Consumers of my software do not care about these tech details • I don’t have the time or the manpower to add yet another thing to my tech stack • I need this to work, but I just don’t find it interesting enough to master 44
  37. Disclaimer - They are not paying me • Wicked fast

    • Great APIs / developer experience • Tight integration with Laravel Scout • Really nice admin interface • $$$ 46
  38. Another code snippet? I sense a pattern. 47 <?php require_once

    __DIR__ . '/vendor/autoload.php'; $pdo = new \PDO("mysql:dbname=sakila;host=127.0.0.1", "sakila", "sakila"); $db = new Aura\Sql\ExtendedPdo($pdo); $films = $db->fetchAssoc("SELECT * FROM film"); $client = new \AlgoliaSearch\Client('...applicationid...', '...apikey...'); $index = $client->initIndex('films'); $index->addObjects($films, "film_id");
  39. Now that the files are in the computer… 51 <?php

    require_once __DIR__ . '/vendor/autoload.php'; $client = new \AlgoliaSearch\Client('...applicationid...', '...apikey...'); $index = $client->initIndex('films'); $results = $index->search("cat", [ "hitsPerPage" => 10 ]); foreach ($results["hits"] as $film) { echo $film["title"] . " (" . $film["rating"] . ")\n"; }
  40. Search things by facet 54 var_dump($index->searchForFacetValues("rating", "R")); array(3) { ["facetHits"]=>

    array(1) { [0]=> array(3) { ["value"]=> string(1) "R" ["highlighted"]=> string(10) "<em>R</em>" ["count"]=> int(195) } } ["exhaustiveFacetsCount"]=> bool(true) ["processingTimeMS"]=> int(1) }
  41. Searching by facets - gotcha 56 $pdo = new \PDO("mysql:dbname=sakila;host=127.0.0.1",

    "sakila", "sakila", [PDO::ATTR_EMULATE_PREPARES => false]);
  42. What you really want 59 array(11) { ["hits"]=> array(20) {

    [0]=> array(15) { ["film_id"]=> int(128) ["title"]=> string(13) "CATCH AMISTAD” // snipped // ["facets"]=> array(1) { ["rating"]=> array(5) { ["PG-13"]=> int(19) ["PG"]=> int(15) ["G"]=> int(13) ["R"]=> int(13) ["NC-17"]=> int(12) } } $index = $client->initIndex('films'); $results = $index->search("cat", [ "facets" => ['rating'] ]); var_dump($results);
  43. So now what? • Figure out what your use case

    is • Start small, scale up • Take these slides and go do your own research 61