Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Make It Searchable (Midwest PHP 2018)

Make It Searchable (Midwest PHP 2018)

With data available on the internet doubling at an amazing rate, the modern web is increasingly focused on search. There are many ways to implement search in your PHP application. Some of them are downright terrible, while others are truly amazing. This talk will introduce you to the basics of various search engines like Solr, Sphinx, and ElasticSearch. We’ll learn strengths and weaknesses of each as well as how to integrate them into your existing PHP application. Every application has differing and often very niche needs for search, so our goal is to show you what’s available and help you make an educated selection for which type of search is right for your application.

Josh Butts

March 09, 2018
Tweet

More Decks by Josh Butts

Other Decks in Programming

Transcript

  1. Make It Searchable
    Josh Butts
    Midwest PHP 2018

    View Slide

  2. About Me
    • SVP of Engineering,

    Ziff Davis
    • Austin PHP Organizer
    • github.com/jimbojsb
    • @jimbojsb
    2

    View Slide

  3. Search - Who cares?
    • In 2018, users expect search
    • The expect it to be as good as Google
    • Hopefully you’re creating enough data to
    warrant it
    • It always gets kicked to the next sprint
    3

    View Slide

  4. Search Use Cases

    View Slide

  5. Search Use Cases- Wildcards
    • Yes, this is sometimes ok!
    • Not always a good user experience, but it’s
    easy
    • Probably going to do this on varchar(255)
    or shorter
    5

    View Slide

  6. Search Use Cases - Full Text
    • Probably good for TEXT fields
    • Articles, blog posts, item descriptions, etc
    • Usually language-aware, understands word
    boundaries
    • Possibly NLP components as well
    6

    View Slide

  7. Search Use Cases - Faceted
    • Table stakes for retail e-commerce
    • Left-nav with categories & attributes
    • Checkboxes refine the list
    • Parenthetical counts of options
    7

    View Slide

  8. Filtering vs Faceting
    • Filtering - reduce the set of results,
    excluding items that don’t match certain
    criteria
    • Faceting - Special kind of filtering, primarily
    for driving UI. Pre-indexed and counted
    • The UI we are accustomed to often is a
    combination of these things
    8

    View Slide

  9. Faceted Search
    9

    View Slide

  10. Lets get started

    View Slide

  11. A Quick Note
    • Most of these examples are based on the
    Sakila free data set from MySQL
    11

    View Slide

  12. I have no time, budget, plan or cares
    • We’ve all been there
    • Just throw a LIKE clause on there and call it
    good enough
    • Just so we’re clear, the database’s
    implementation of this is literally the worst
    12

    View Slide

  13. Like, just use LIKE, right?
    13
    SELECT * FROM film WHERE title LIKE '%cat%' OR description LIKE ‘%cat%'

    View Slide

  14. So how bad is it really?
    14
    EXPLAIN SELECT * FROM film WHERE title LIKE '%cat%' OR description LIKE '%cat%'

    View Slide

  15. Surely we can do better than that
    • This will mostly likely produce erroneous
    results
    • As your data grows, the time to search
    grows linearly
    • Must check every character of every row
    and field you want to search
    15

    View Slide

  16. MySQL Full Text Indices
    • Supports simple, NLP and boolean modes
    • Requires you to commit up front to the
    columns
    • Works on CHAR, VARCHAR, TEXT
    16

    View Slide

  17. New-to-you SQL syntax
    17
    SELECT * FROM film WHERE MATCH (title,description) AGAINST (‘cat')
    SELECT * FROM film WHERE MATCH (title,description) AGAINST ('+drama -cat' IN BOOLEAN MODE)
    ALTER TABLE film ADD FULLTEXT INDEX description (title, description)

    View Slide

  18. So how much better is it?
    • I cheated and grabbed a big data set from
    work…
    • 2s - LIKE ‘%samsung%’
    • 20ms - MATCH AGAINST(‘samsung’)
    • 4s - LIKE ‘%samsung%’ AND NOT LIKE ‘%tv%’
    • 9ms - MATCH AGAINST(‘+samsung -tv IN BOOLEAN MODE’)
    18

    View Slide

  19. So - Talk over?
    • MySQL Full Text clearly works well
    • It’s reasonably fast and it uses tools you
    know
    • Remember there are other search use
    cases
    • MySql Full Text isn’t all that configurable in
    the grand scheme of things
    19

    View Slide

  20. Sphinx
    • Sphinx is a standalone full-text indexer that
    speaks to MySQL or Postgres
    • Far more tunable to your data set than
    MySQL
    • Performance may be orders of magnitude
    faster
    20

    View Slide

  21. How do I Sphinx it up?
    • Install Sphinx server (sphinxsearch)
    • Write a Sphinx config
    • Index your data (real time or crontab)
    • Connect PHP to Sphinx
    • Profit?
    21

    View Slide

  22. Sphinx Config
    22
    source sakila_films
    {
    type = mysql
    sql_host = mysql
    sql_user = sakila
    sql_db = sakila
    sql_pass = sakila
    sql_query = SELECT film_id, title, rating, description FROM film
    sql_attr_string = rating
    sql_attr_multi = uint actors from query; SELECT film_id, actor_id FROM film_actor
    }
    index films
    {
    source = sakila_films
    path = /var/idx/sphinx/sakila_films
    morphology = stem_en
    }

    View Slide

  23. Run the Indexer
    • Sphinx indexer is a separate binary
    • indexer --all --rotate
    • Run this on a schedule
    23
    Sphinx 2.1.9-id64-release (rel21-r4761)
    Copyright (c) 2001-2014, Andrew Aksyonoff
    Copyright (c) 2008-2014, Sphinx Technologies Inc (http://sphinxsearch.com)
    using config file '/usr/local/etc/sphinx.conf'...
    indexing index 'films'...
    collected 1000 docs, 0.1 MB
    collected 5462 attr values
    sorted 0.0 Mvalues, 100.0% done
    sorted 0.0 Mhits, 100.0% done
    total 1000 docs, 93842 bytes
    total 0.086 sec, 1087115 bytes/sec, 11584.53 docs/sec

    View Slide

  24. Test With CLI Tools
    24
    bash-4.1# search cat
    Sphinx 2.1.9-id64-release (rel21-r4761)
    Copyright (c) 2001-2014, Andrew Aksyonoff
    Copyright (c) 2008-2014, Sphinx Technologies Inc (http://sphinxsearch.com)
    using config file '/usr/local/etc/sphinx.conf'...
    index 'films': query 'cat ': returned 71 matches of 71 total in 0.003 sec
    displaying matches:
    1. document=464, weight=1616, rating=PG-13, actors=(76,198)
    2. document=11, weight=1584, rating=G, actors=(40,81,90,174)
    3. document=15, weight=1584, rating=NC-17, actors=(36,69,105,117,164,170)
    4. document=53, weight=1584, rating=NC-17, actors=(6,30,47,70,102,149,156)
    5. document=55, weight=1584, rating=G, actors=(66,118,166,173,177)
    6. document=56, weight=1584, rating=G, actors=(4,22,33,43,129,136,137,168)
    7. document=70, weight=1584, rating=NC-17, actors=(6,104)

    View Slide

  25. Connect PHP to Sphinx
    25
    /** gigablah/sphinx **/
    $client = new \Sphinx\SphinxClient();
    $client->setServer("localhost", "9312");
    $client->setLimits(0, 5);
    $client->setMatchMode(\Sphinx\SphinxClient::SPH_MATCH_ALL);
    $results = $client->query("saga", "films");
    var_dump($results);
    ["matches"]=>
    array(5) {
    [7]=>
    array(2) {
    ["weight"]=>
    string(1) "1"
    ["attrs"]=>
    array(2) {
    ["rating"]=>
    string(5) "PG-13"
    ["actors"]=>
    array(5) {
    [0]=>
    int(99)
    [1]=>
    int(133)
    [2]=>
    int(162)
    [3]=>
    int(170)
    [4]=>
    int(185)
    }
    }
    }

    View Slide

  26. But my data…I needed that
    26
    $filmIds = array_keys($results["matches"]);
    $placeholders = implode(",", array_fill(0, count($filmIds), “?"));
    $films = $db->fetchObjects("SELECT * FROM film WHERE film_id IN ($placeholders)", $filmIds);
    foreach ($films as $film) {
    echo $film->title . " (" . $results["matches"][$film->film_id]["attrs"]["rating"] . ")\n";
    }
    AIRPLANE SIERRA (PG-13)
    ALASKA PHANTOM (PG)
    ARMY FLINTSTONES (R)
    BEAR GRACELAND (R)
    BERETS AGENT (PG-13)

    View Slide

  27. Sphinx SQL
    • Sphinx also has SphinxSQL
    • Speaks MySQL 4.1 protocol
    • You can query with this instead of the
    native API, using DB tools you already
    know
    • You can “INSERT” into the indexes real-
    time
    27

    View Slide

  28. So I should use Sphinx then…
    • Depends on how big your data set is
    • “Samsung” search in 1ms
    • In production with 10M rows, it’s way faster
    • FTI’s can chew RAM and slow down your
    database inserts dramatically
    • Scale sphinx separately as needed
    28

    View Slide

  29. So I like running
    servers, what else
    have you got?

    View Slide

  30. Ok, maybe Solr is right for you
    • Standalone full-text search, written in Java
    • XML and stuff
    • Starting to support facets and other
    advanced search tools
    • If you’re indexing long-form text (academia,
    newspaper articles, etc), this is what you
    want
    • solarium/solarium as your client
    30

    View Slide

  31. Sidebar - Lucene rules the search world
    • Apache Lucene powers Solr
    • Solr is really just Lucene in a daemon
    • You can use Lucene with raw files on disk
    directly from PHP…
    31

    View Slide

  32. Unfortunately we
    don’t have time for a
    Solr demo

    View Slide

  33. solrtutorial.com

    View Slide

  34. There was some
    mention of facets?

    View Slide

  35. What’s that shiny thing over there?
    • You probably thought we’d talk about
    Elastic Search first
    • Spoiler alert, Elastic is also powered by
    Lucene
    • Indexes JSON documents
    • Rest API so you can talk to it from most
    anywhere
    35

    View Slide

  36. Why is it called elastic?
    • Solr, Sphinx, etc work really well on one big
    server, but are harder to scale horizontally
    • Elastic is designed to be distributed on
    multi-node clusters
    • Elastic is fairly easy to use in a multi-tenant
    environment
    36

    View Slide

  37. This could go very poorly…
    • You need to pay attention to the clustering
    dynamics of Elastic
    • The dreaded split brain
    • It’s complicated enough that AWS offers a
    managed version
    37

    View Slide

  38. Wait, is this a document database?
    • Yeah, kinda
    • Can index arbitrarily deep JSON documents
    • That JSON may also include a giant blob of
    text for FTS
    • Proper support for filtering and faceting on
    numerical and string values
    • Note - PHP clients generally use associative
    arrays and not raw JSON
    38

    View Slide

  39. Getting content in to Elastic Search
    39
    $clientBuilder = Elasticsearch\ClientBuilder::create();
    $clientBuilder->setHosts(["127.0.0.1"]);
    $client = $clientBuilder->build();
    $indexParams = [
    "index" => "films",
    "type" => "film",
    ];
    foreach ($films as $film) {
    $docParams = [
    "id" => $film["film_id"],
    "body" => $film
    ];
    $client->index(array_merge($indexParams, $docParams));
    }

    View Slide

  40. Yet another way to search
    40
    $clientBuilder = Elasticsearch\ClientBuilder::create();
    $clientBuilder->setHosts(["127.0.0.1"]);
    $client = $clientBuilder->build();
    $params = [
    'index' => 'films',
    'type' => 'film',
    'body' => [
    'query' => [
    'match' => [
    'title' => 'airplane'
    ]
    ]
    ]
    ];
    $results = $client->search($params);

    View Slide

  41. 41
    Yet another way to search
    array(3) {
    ["total"]=>
    int(2)
    ["max_score"]=>
    float(4.9511213)
    ["hits"]=>
    array(2) {
    [0]=>
    array(5) {
    ["_index"]=>
    string(5) "films"
    ["_type"]=>
    string(4) "film"
    ["_id"]=>
    string(3) "711"
    ["_score"]=>
    float(4.9511213)
    ["_source"]=>
    array(13) {
    ["film_id"]=>
    string(3) "711"
    ["title"]=>
    string(15) "RAGING AIRPLANE"
    ["description"]=>
    string(107) "A Astounding Display of a Secret Agent And a Technical Writer who must Escape a Mad Scientist
    in A Jet Boat"
    ["release_year"]=>
    string(4) "2006"
    ["language_id"]=>
    string(1) "1"
    ["original_language_id"]=>

    View Slide

  42. • Elastic Search called facets aggregations
    • This is a whole other talk unto itself
    • Aggregations are extremely powerful
    compared to what is available in other
    search products
    • These get rolled up into one result but are
    multiple parallels queries
    42
    Ok but you promised facets

    View Slide

  43. Ok but you promised facets
    43
    $params = [
    'index' => 'films',
    'type' => 'film',
    'body' => [
    'query' => [
    'multi_match' => [
    'query' => 'cat',
    'fields' => ['title', 'description']
    ]
    ],
    'aggs' => [
    'ratings_count' => [
    'value_count' => [
    'field' => 'rating'
    ]
    ]
    ]
    ]
    ];
    $results = $client->search($params);

    View Slide

  44. Listen, can I just pay someone to do this?
    • Consumers of my software do not care
    about these tech details
    • I don’t have the time or the manpower to
    add yet another thing to my tech stack
    • I need this to work, but I just don’t find it
    interesting enough to master
    44

    View Slide

  45. Yes, yes you can
    45

    View Slide

  46. Disclaimer - They are not paying me
    • Wicked fast
    • Great APIs / developer experience
    • Tight integration with Laravel Scout
    • Really nice admin interface
    • $$$
    46

    View Slide

  47. Another code snippet? I sense a pattern.
    47
    require_once __DIR__ . '/vendor/autoload.php';
    $pdo = new \PDO("mysql:dbname=sakila;host=127.0.0.1", "sakila", "sakila");
    $db = new Aura\Sql\ExtendedPdo($pdo);
    $films = $db->fetchAssoc("SELECT * FROM film");
    $client = new \AlgoliaSearch\Client('...applicationid...', '...apikey...');
    $index = $client->initIndex('films');
    $index->addObjects($films, "film_id");

    View Slide

  48. Algolia Dashboard
    48

    View Slide

  49. Browse your data
    49

    View Slide

  50. Browse your data
    50

    View Slide

  51. Now that the files are in the computer…
    51
    require_once __DIR__ . '/vendor/autoload.php';
    $client = new \AlgoliaSearch\Client('...applicationid...', '...apikey...');
    $index = $client->initIndex('films');
    $results = $index->search("cat", [
    "hitsPerPage" => 10
    ]);
    foreach ($results["hits"] as $film) {
    echo $film["title"] . " (" . $film["rating"] . ")\n";
    }

    View Slide

  52. Algolia does facets too
    52
    $index = $client->initIndex('films');
    $index->setSettings(["attributesForFaceting" => ["searchable(rating)", "searchable(length)"]]);

    View Slide

  53. Algolia does facets too
    53

    View Slide

  54. Search things by facet
    54
    var_dump($index->searchForFacetValues("rating", "R"));
    array(3) {
    ["facetHits"]=>
    array(1) {
    [0]=>
    array(3) {
    ["value"]=>
    string(1) "R"
    ["highlighted"]=>
    string(10) "R"
    ["count"]=>
    int(195)
    }
    }
    ["exhaustiveFacetsCount"]=>
    bool(true)
    ["processingTimeMS"]=>
    int(1)
    }

    View Slide

  55. Searching by facets - gotcha
    55

    View Slide

  56. Searching by facets - gotcha
    56
    $pdo = new \PDO("mysql:dbname=sakila;host=127.0.0.1", "sakila", "sakila", [PDO::ATTR_EMULATE_PREPARES => false]);

    View Slide

  57. Searching by facets - gotcha
    57

    View Slide

  58. Also Filtering
    58
    $index = $client->initIndex('films');
    $results = $index->search("cat", [
    "filters" => "length > 100 AND rating:R"
    ]);

    View Slide

  59. What you really want
    59
    array(11) {
    ["hits"]=>
    array(20) {
    [0]=>
    array(15) {
    ["film_id"]=>
    int(128)
    ["title"]=>
    string(13) "CATCH AMISTAD”
    // snipped //
    ["facets"]=>
    array(1) {
    ["rating"]=>
    array(5) {
    ["PG-13"]=>
    int(19)
    ["PG"]=>
    int(15)
    ["G"]=>
    int(13)
    ["R"]=>
    int(13)
    ["NC-17"]=>
    int(12)
    }
    }
    $index = $client->initIndex('films');
    $results = $index->search("cat", [
    "facets" => ['rating']
    ]);
    var_dump($results);

    View Slide

  60. Anyone else’s
    head spinning?

    View Slide

  61. So now what?
    • Figure out what your use case is
    • Start small, scale up
    • Take these slides and go do your own
    research
    61

    View Slide

  62. We’re Hiring - NYC, Austin, Montreal
    62
    And many other properties

    View Slide

  63. Questions?
    https://joind.in/talk/6da8f

    View Slide