Slide 1

Slide 1 text

Make It Searchable Josh Butts Midwest PHP 2018

Slide 2

Slide 2 text

About Me • SVP of Engineering,
 Ziff Davis • Austin PHP Organizer • github.com/jimbojsb • @jimbojsb 2

Slide 3

Slide 3 text

Search - Who cares? • In 2018, users expect search • The expect it to be as good as Google • Hopefully you’re creating enough data to warrant it • It always gets kicked to the next sprint 3

Slide 4

Slide 4 text

Search Use Cases

Slide 5

Slide 5 text

Search Use Cases- Wildcards • Yes, this is sometimes ok! • Not always a good user experience, but it’s easy • Probably going to do this on varchar(255) or shorter 5

Slide 6

Slide 6 text

Search Use Cases - Full Text • Probably good for TEXT fields • Articles, blog posts, item descriptions, etc • Usually language-aware, understands word boundaries • Possibly NLP components as well 6

Slide 7

Slide 7 text

Search Use Cases - Faceted • Table stakes for retail e-commerce • Left-nav with categories & attributes • Checkboxes refine the list • Parenthetical counts of options 7

Slide 8

Slide 8 text

Filtering vs Faceting • Filtering - reduce the set of results, excluding items that don’t match certain criteria • Faceting - Special kind of filtering, primarily for driving UI. Pre-indexed and counted • The UI we are accustomed to often is a combination of these things 8

Slide 9

Slide 9 text

Faceted Search 9

Slide 10

Slide 10 text

Lets get started

Slide 11

Slide 11 text

A Quick Note • Most of these examples are based on the Sakila free data set from MySQL 11

Slide 12

Slide 12 text

I have no time, budget, plan or cares • We’ve all been there • Just throw a LIKE clause on there and call it good enough • Just so we’re clear, the database’s implementation of this is literally the worst 12

Slide 13

Slide 13 text

Like, just use LIKE, right? 13 SELECT * FROM film WHERE title LIKE '%cat%' OR description LIKE ‘%cat%'

Slide 14

Slide 14 text

So how bad is it really? 14 EXPLAIN SELECT * FROM film WHERE title LIKE '%cat%' OR description LIKE '%cat%'

Slide 15

Slide 15 text

Surely we can do better than that • This will mostly likely produce erroneous results • As your data grows, the time to search grows linearly • Must check every character of every row and field you want to search 15

Slide 16

Slide 16 text

MySQL Full Text Indices • Supports simple, NLP and boolean modes • Requires you to commit up front to the columns • Works on CHAR, VARCHAR, TEXT 16

Slide 17

Slide 17 text

New-to-you SQL syntax 17 SELECT * FROM film WHERE MATCH (title,description) AGAINST (‘cat') SELECT * FROM film WHERE MATCH (title,description) AGAINST ('+drama -cat' IN BOOLEAN MODE) ALTER TABLE film ADD FULLTEXT INDEX description (title, description)

Slide 18

Slide 18 text

So how much better is it? • I cheated and grabbed a big data set from work… • 2s - LIKE ‘%samsung%’ • 20ms - MATCH AGAINST(‘samsung’) • 4s - LIKE ‘%samsung%’ AND NOT LIKE ‘%tv%’ • 9ms - MATCH AGAINST(‘+samsung -tv IN BOOLEAN MODE’) 18

Slide 19

Slide 19 text

So - Talk over? • MySQL Full Text clearly works well • It’s reasonably fast and it uses tools you know • Remember there are other search use cases • MySql Full Text isn’t all that configurable in the grand scheme of things 19

Slide 20

Slide 20 text

Sphinx • Sphinx is a standalone full-text indexer that speaks to MySQL or Postgres • Far more tunable to your data set than MySQL • Performance may be orders of magnitude faster 20

Slide 21

Slide 21 text

How do I Sphinx it up? • Install Sphinx server (sphinxsearch) • Write a Sphinx config • Index your data (real time or crontab) • Connect PHP to Sphinx • Profit? 21

Slide 22

Slide 22 text

Sphinx Config 22 source sakila_films { type = mysql sql_host = mysql sql_user = sakila sql_db = sakila sql_pass = sakila sql_query = SELECT film_id, title, rating, description FROM film sql_attr_string = rating sql_attr_multi = uint actors from query; SELECT film_id, actor_id FROM film_actor } index films { source = sakila_films path = /var/idx/sphinx/sakila_films morphology = stem_en }

Slide 23

Slide 23 text

Run the Indexer • Sphinx indexer is a separate binary • indexer --all --rotate • Run this on a schedule 23 Sphinx 2.1.9-id64-release (rel21-r4761) Copyright (c) 2001-2014, Andrew Aksyonoff Copyright (c) 2008-2014, Sphinx Technologies Inc (http://sphinxsearch.com) using config file '/usr/local/etc/sphinx.conf'... indexing index 'films'... collected 1000 docs, 0.1 MB collected 5462 attr values sorted 0.0 Mvalues, 100.0% done sorted 0.0 Mhits, 100.0% done total 1000 docs, 93842 bytes total 0.086 sec, 1087115 bytes/sec, 11584.53 docs/sec

Slide 24

Slide 24 text

Test With CLI Tools 24 bash-4.1# search cat Sphinx 2.1.9-id64-release (rel21-r4761) Copyright (c) 2001-2014, Andrew Aksyonoff Copyright (c) 2008-2014, Sphinx Technologies Inc (http://sphinxsearch.com) using config file '/usr/local/etc/sphinx.conf'... index 'films': query 'cat ': returned 71 matches of 71 total in 0.003 sec displaying matches: 1. document=464, weight=1616, rating=PG-13, actors=(76,198) 2. document=11, weight=1584, rating=G, actors=(40,81,90,174) 3. document=15, weight=1584, rating=NC-17, actors=(36,69,105,117,164,170) 4. document=53, weight=1584, rating=NC-17, actors=(6,30,47,70,102,149,156) 5. document=55, weight=1584, rating=G, actors=(66,118,166,173,177) 6. document=56, weight=1584, rating=G, actors=(4,22,33,43,129,136,137,168) 7. document=70, weight=1584, rating=NC-17, actors=(6,104)

Slide 25

Slide 25 text

Connect PHP to Sphinx 25 /** gigablah/sphinx **/ $client = new \Sphinx\SphinxClient(); $client->setServer("localhost", "9312"); $client->setLimits(0, 5); $client->setMatchMode(\Sphinx\SphinxClient::SPH_MATCH_ALL); $results = $client->query("saga", "films"); var_dump($results); ["matches"]=> array(5) { [7]=> array(2) { ["weight"]=> string(1) "1" ["attrs"]=> array(2) { ["rating"]=> string(5) "PG-13" ["actors"]=> array(5) { [0]=> int(99) [1]=> int(133) [2]=> int(162) [3]=> int(170) [4]=> int(185) } } }

Slide 26

Slide 26 text

But my data…I needed that 26 $filmIds = array_keys($results["matches"]); $placeholders = implode(",", array_fill(0, count($filmIds), “?")); $films = $db->fetchObjects("SELECT * FROM film WHERE film_id IN ($placeholders)", $filmIds); foreach ($films as $film) { echo $film->title . " (" . $results["matches"][$film->film_id]["attrs"]["rating"] . ")\n"; } AIRPLANE SIERRA (PG-13) ALASKA PHANTOM (PG) ARMY FLINTSTONES (R) BEAR GRACELAND (R) BERETS AGENT (PG-13)

Slide 27

Slide 27 text

Sphinx SQL • Sphinx also has SphinxSQL • Speaks MySQL 4.1 protocol • You can query with this instead of the native API, using DB tools you already know • You can “INSERT” into the indexes real- time 27

Slide 28

Slide 28 text

So I should use Sphinx then… • Depends on how big your data set is • “Samsung” search in 1ms • In production with 10M rows, it’s way faster • FTI’s can chew RAM and slow down your database inserts dramatically • Scale sphinx separately as needed 28

Slide 29

Slide 29 text

So I like running servers, what else have you got?

Slide 30

Slide 30 text

Ok, maybe Solr is right for you • Standalone full-text search, written in Java • XML and stuff • Starting to support facets and other advanced search tools • If you’re indexing long-form text (academia, newspaper articles, etc), this is what you want • solarium/solarium as your client 30

Slide 31

Slide 31 text

Sidebar - Lucene rules the search world • Apache Lucene powers Solr • Solr is really just Lucene in a daemon • You can use Lucene with raw files on disk directly from PHP… 31

Slide 32

Slide 32 text

Unfortunately we don’t have time for a Solr demo

Slide 33

Slide 33 text

solrtutorial.com

Slide 34

Slide 34 text

There was some mention of facets?

Slide 35

Slide 35 text

What’s that shiny thing over there? • You probably thought we’d talk about Elastic Search first • Spoiler alert, Elastic is also powered by Lucene • Indexes JSON documents • Rest API so you can talk to it from most anywhere 35

Slide 36

Slide 36 text

Why is it called elastic? • Solr, Sphinx, etc work really well on one big server, but are harder to scale horizontally • Elastic is designed to be distributed on multi-node clusters • Elastic is fairly easy to use in a multi-tenant environment 36

Slide 37

Slide 37 text

This could go very poorly… • You need to pay attention to the clustering dynamics of Elastic • The dreaded split brain • It’s complicated enough that AWS offers a managed version 37

Slide 38

Slide 38 text

Wait, is this a document database? • Yeah, kinda • Can index arbitrarily deep JSON documents • That JSON may also include a giant blob of text for FTS • Proper support for filtering and faceting on numerical and string values • Note - PHP clients generally use associative arrays and not raw JSON 38

Slide 39

Slide 39 text

Getting content in to Elastic Search 39 $clientBuilder = Elasticsearch\ClientBuilder::create(); $clientBuilder->setHosts(["127.0.0.1"]); $client = $clientBuilder->build(); $indexParams = [ "index" => "films", "type" => "film", ]; foreach ($films as $film) { $docParams = [ "id" => $film["film_id"], "body" => $film ]; $client->index(array_merge($indexParams, $docParams)); }

Slide 40

Slide 40 text

Yet another way to search 40 $clientBuilder = Elasticsearch\ClientBuilder::create(); $clientBuilder->setHosts(["127.0.0.1"]); $client = $clientBuilder->build(); $params = [ 'index' => 'films', 'type' => 'film', 'body' => [ 'query' => [ 'match' => [ 'title' => 'airplane' ] ] ] ]; $results = $client->search($params);

Slide 41

Slide 41 text

41 Yet another way to search array(3) { ["total"]=> int(2) ["max_score"]=> float(4.9511213) ["hits"]=> array(2) { [0]=> array(5) { ["_index"]=> string(5) "films" ["_type"]=> string(4) "film" ["_id"]=> string(3) "711" ["_score"]=> float(4.9511213) ["_source"]=> array(13) { ["film_id"]=> string(3) "711" ["title"]=> string(15) "RAGING AIRPLANE" ["description"]=> string(107) "A Astounding Display of a Secret Agent And a Technical Writer who must Escape a Mad Scientist in A Jet Boat" ["release_year"]=> string(4) "2006" ["language_id"]=> string(1) "1" ["original_language_id"]=>

Slide 42

Slide 42 text

• Elastic Search called facets aggregations • This is a whole other talk unto itself • Aggregations are extremely powerful compared to what is available in other search products • These get rolled up into one result but are multiple parallels queries 42 Ok but you promised facets

Slide 43

Slide 43 text

Ok but you promised facets 43 $params = [ 'index' => 'films', 'type' => 'film', 'body' => [ 'query' => [ 'multi_match' => [ 'query' => 'cat', 'fields' => ['title', 'description'] ] ], 'aggs' => [ 'ratings_count' => [ 'value_count' => [ 'field' => 'rating' ] ] ] ] ]; $results = $client->search($params);

Slide 44

Slide 44 text

Listen, can I just pay someone to do this? • Consumers of my software do not care about these tech details • I don’t have the time or the manpower to add yet another thing to my tech stack • I need this to work, but I just don’t find it interesting enough to master 44

Slide 45

Slide 45 text

Yes, yes you can 45

Slide 46

Slide 46 text

Disclaimer - They are not paying me • Wicked fast • Great APIs / developer experience • Tight integration with Laravel Scout • Really nice admin interface • $$$ 46

Slide 47

Slide 47 text

Another code snippet? I sense a pattern. 47 fetchAssoc("SELECT * FROM film"); $client = new \AlgoliaSearch\Client('...applicationid...', '...apikey...'); $index = $client->initIndex('films'); $index->addObjects($films, "film_id");

Slide 48

Slide 48 text

Algolia Dashboard 48

Slide 49

Slide 49 text

Browse your data 49

Slide 50

Slide 50 text

Browse your data 50

Slide 51

Slide 51 text

Now that the files are in the computer… 51 initIndex('films'); $results = $index->search("cat", [ "hitsPerPage" => 10 ]); foreach ($results["hits"] as $film) { echo $film["title"] . " (" . $film["rating"] . ")\n"; }

Slide 52

Slide 52 text

Algolia does facets too 52 $index = $client->initIndex('films'); $index->setSettings(["attributesForFaceting" => ["searchable(rating)", "searchable(length)"]]);

Slide 53

Slide 53 text

Algolia does facets too 53

Slide 54

Slide 54 text

Search things by facet 54 var_dump($index->searchForFacetValues("rating", "R")); array(3) { ["facetHits"]=> array(1) { [0]=> array(3) { ["value"]=> string(1) "R" ["highlighted"]=> string(10) "R" ["count"]=> int(195) } } ["exhaustiveFacetsCount"]=> bool(true) ["processingTimeMS"]=> int(1) }

Slide 55

Slide 55 text

Searching by facets - gotcha 55

Slide 56

Slide 56 text

Searching by facets - gotcha 56 $pdo = new \PDO("mysql:dbname=sakila;host=127.0.0.1", "sakila", "sakila", [PDO::ATTR_EMULATE_PREPARES => false]);

Slide 57

Slide 57 text

Searching by facets - gotcha 57

Slide 58

Slide 58 text

Also Filtering 58 $index = $client->initIndex('films'); $results = $index->search("cat", [ "filters" => "length > 100 AND rating:R" ]);

Slide 59

Slide 59 text

What you really want 59 array(11) { ["hits"]=> array(20) { [0]=> array(15) { ["film_id"]=> int(128) ["title"]=> string(13) "CATCH AMISTAD” // snipped // ["facets"]=> array(1) { ["rating"]=> array(5) { ["PG-13"]=> int(19) ["PG"]=> int(15) ["G"]=> int(13) ["R"]=> int(13) ["NC-17"]=> int(12) } } $index = $client->initIndex('films'); $results = $index->search("cat", [ "facets" => ['rating'] ]); var_dump($results);

Slide 60

Slide 60 text

Anyone else’s head spinning?

Slide 61

Slide 61 text

So now what? • Figure out what your use case is • Start small, scale up • Take these slides and go do your own research 61

Slide 62

Slide 62 text

We’re Hiring - NYC, Austin, Montreal 62 And many other properties

Slide 63

Slide 63 text

Questions? https://joind.in/talk/6da8f