Slide 1

Slide 1 text

Introduction to Searching with Solr Krzysztof Skarbek October 30, 2015

Slide 2

Slide 2 text

Searching

Slide 3

Slide 3 text

Searching ● Most of us start browsing the Internet with a search engine – that's how natural searching is

Slide 4

Slide 4 text

Searching ● Most of us start browsing the Internet with a search engine – that's how natural searching is ● Unless your site has very little information, or everything is reachable with a few clicks, searching becomes a vital feature

Slide 5

Slide 5 text

Searching ● Poor man's solution:

Slide 6

Slide 6 text

Searching ● Poor man's solution:

Slide 7

Slide 7 text

Searching ● Poor man's solution: ● You have to expose the data to Google

Slide 8

Slide 8 text

Searching ● Poor man's solution: ● You have to expose the data to Google ● Guess what? Ads

Slide 9

Slide 9 text

Searching in examples

Slide 10

Slide 10 text

Searching in examples ● We'll demonstrate and compare SQL database searches and searching with Solr

Slide 11

Slide 11 text

Searching in examples ● We'll demonstrate and compare SQL database searches and searching with Solr ● We'll use the same example database in all cases

Slide 12

Slide 12 text

Example database

Slide 13

Slide 13 text

Example database ● Extracted from a real-life web application

Slide 14

Slide 14 text

Example database ● Extracted from a real-life web application ● Contains blog posts and culinary recipes scraped from cooking blogs

Slide 15

Slide 15 text

Example database ● Extracted from a real-life web application ● Contains blog posts and culinary recipes scraped from cooking blogs ● 22 tables, 69300 records in posts table, 550 MB of textual data

Slide 16

Slide 16 text

Example database blogs id name url ... blogs id name url ...

Slide 17

Slide 17 text

Example database posts id blog_id title body ... posts id blog_id title body ... blogs id name url ... blogs id name url ...

Slide 18

Slide 18 text

Example database posts id blog_id title body ... posts id blog_id title body ... recipes id post_id title ... recipes id post_id title ... blogs id name url ... blogs id name url ...

Slide 19

Slide 19 text

Example database posts id blog_id title body ... posts id blog_id title body ... recipes id post_id title ... recipes id post_id title ... ingredients id name ... ingredients id name ... recipes_ingredients recipe_id ingredient_id amount ... recipes_ingredients recipe_id ingredient_id amount ... blogs id name url ... blogs id name url ...

Slide 20

Slide 20 text

Naïve approach with SQL LIKE

Slide 21

Slide 21 text

Naïve approach with SQL LIKE ● SELECT posts.id FROM recipes INNER JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE posts.title LIKE '%cake%' OR posts.body LIKE '%cake%' OR recipes.title LIKE '%cake%';

Slide 22

Slide 22 text

Naïve approach with SQL LIKE ● SELECT posts.id FROM recipes INNER JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE posts.title LIKE '%cake%' OR posts.body LIKE '%cake%' OR recipes.title LIKE '%cake%'; 949 rows in set (10.07 sec)

Slide 23

Slide 23 text

Naïve approach with SQL LIKE ● Slow

Slide 24

Slide 24 text

Naïve approach with SQL LIKE ● Slow ● Very bad effect on database performance

Slide 25

Slide 25 text

Naïve approach with SQL LIKE ● Slow ● Very bad effect on database performance ● Problems with whitespace/punctuation (e.g.: LIKE '% cake %' against “cake!”)

Slide 26

Slide 26 text

Naïve approach with SQL LIKE ● Slow ● Very bad effect on database performance ● Problems with whitespace/punctuation (e.g.: LIKE '% cake %' against “cake!”) ● Really, a bad idea

Slide 27

Slide 27 text

Searching features built into RDBMS ● MySQL – full text search indexes

Slide 28

Slide 28 text

Searching features built into RDBMS ● MySQL – full text search indexes CREATE TABLE posts ( id INT NOT NULL AUTO INCREMENT, title VARCHAR(200), body TEXT, ... FULLTEXT (title, body), ...

Slide 29

Slide 29 text

Searching features built into RDBMS SELECT posts.id FROM recipes INNER JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE MATCH (posts.title, posts.body) AGAINST ('cake') OR MATCH (recipes.title) AGAINST ('cake');

Slide 30

Slide 30 text

Searching features built into RDBMS SELECT posts.id FROM recipes INNER JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE MATCH (posts.title, posts.body) AGAINST ('cake') OR MATCH (recipes.title) AGAINST ('cake'); 1042 rows in set (0.22 sec)

Slide 31

Slide 31 text

Searching features built into RDBMS ● Much faster

Slide 32

Slide 32 text

Searching features built into RDBMS ● Much faster ● But still not blazingly fast

Slide 33

Slide 33 text

Searching features built into RDBMS ● Much faster ● But still not blazingly fast ● Used to only be available for MyISAM tables

Slide 34

Slide 34 text

Searching features built into RDBMS ● Much faster ● But still not blazingly fast ● Used to only be available for MyISAM tables (not anymore with MySQL 5.6)

Slide 35

Slide 35 text

Searching features built into RDBMS ● Similar features in other RDBMS: – PostgreSQL: Full Text Searching – MSSQL: Full-Text Search – Oracle: Oracle Text – SQLite: FTS5 extension

Slide 36

Slide 36 text

Enter Solr

Slide 37

Slide 37 text

Enter Solr ● Based on Lucene, a high-performance, full- featured text search engine library

Slide 38

Slide 38 text

Enter Solr ● Based on Lucene, a high-performance, full- featured text search engine library ● Notable features:

Slide 39

Slide 39 text

Enter Solr ● Based on Lucene, a high-performance, full- featured text search engine library ● Notable features: – ranked searching

Slide 40

Slide 40 text

Enter Solr ● Based on Lucene, a high-performance, full- featured text search engine library ● Notable features: – ranked searching – flexible faceting, highlighting, joins and result grouping

Slide 41

Slide 41 text

Enter Solr ● Based on Lucene, a high-performance, full- featured text search engine library ● Notable features: – ranked searching – flexible faceting, highlighting, joins and result grouping – allows simultaneous update and searching

Slide 42

Slide 42 text

Enter Solr ● Based on Lucene, a high-performance, full- featured text search engine library ● Notable features: – ranked searching – flexible faceting, highlighting, joins and result grouping – allows simultaneous update and searching – support for filters

Slide 43

Slide 43 text

How Solr works

Slide 44

Slide 44 text

How Solr works ● Importing data from a source (e.g. SQL database)

Slide 45

Slide 45 text

How Solr works ● Importing data from a source (e.g. SQL database) – full import

Slide 46

Slide 46 text

How Solr works ● Importing data from a source (e.g. SQL database) – full import – delta import

Slide 47

Slide 47 text

How Solr works ● Importing data from a source (e.g. SQL database) – full import – delta import ● Indexing imported data (with index filters)

Slide 48

Slide 48 text

How Solr works ● Importing data from a source (e.g. SQL database) – full import – delta import ● Indexing imported data (with index filters) ● Searching (with query filters)

Slide 49

Slide 49 text

Solr configuration

Slide 50

Slide 50 text

Solr configuration ● Brace yourselves, XML is coming

Slide 51

Slide 51 text

Solr configuration ...

Slide 52

Slide 52 text

Solr configuration

Slide 53

Slide 53 text

Solr configuration deltaQuery=" SELECT r.id AS id, r.post_id AS post_id, r.title AS recipe_title, ... FROM recipes AS r INNER JOIN posts AS p ON p.id = r.post_id INNER JOIN blogs AS b ON b.id = p.blog_id WHERE p.added_on > '${dataimporter.last_index_time}' ORDER BY posted_on DESC "

Slide 54

Slide 54 text

Solr configuration ...

Slide 55

Slide 55 text

Solr in action

Slide 56

Slide 56 text

Solr in action ● Delta import: curl http://host:port/solr/database/dataimport \ --data 'command=delta-import'

Slide 57

Slide 57 text

Solr in action ● Delta import: ● Search: curl http://host:port/solr/database/dataimport \ --data 'command=delta-import' curl http://host:port/solr/database/select? q=post_body%3Acake

Slide 58

Slide 58 text

Solr in action { "responseHeader": { ... "QTime": 2, "params": { ... "q": "post_body:cake", ... } }, "response": { "numFound": 1009, ...

Slide 59

Slide 59 text

Solr in action { "responseHeader": { ... "QTime": 2, "params": { ... "q": "post_body:cake", ... } }, "response": { "numFound": 1009, ... Blazingly fast

Slide 60

Slide 60 text

Using Solr in application code

Slide 61

Slide 61 text

Using Solr in application code ● As simple as making an HTTP request: $phrase = 'cake'; $url = 'http://host/solr/database/select?q=' . urlencode('post_body:' . $phrase); $data = json_decode(file_get_contents($url), TRUE);

Slide 62

Slide 62 text

Using Solr in application code ● There are client libraries/language bindings available for all popular languages

Slide 63

Slide 63 text

Filters

Slide 64

Slide 64 text

Filters ● Transform and normalize the search data

Slide 65

Slide 65 text

Filters ● Transform and normalize the search data ● Examples:

Slide 66

Slide 66 text

Filters ● Transform and normalize the search data ● Examples: – LowerCaseFilterFactory – lower case tokens

Slide 67

Slide 67 text

Filters ● Transform and normalize the search data ● Examples: – LowerCaseFilterFactory – lower case tokens – ASCIIFoldingFilterFactory – converts >127 unicode characters into their ASCII equivalents

Slide 68

Slide 68 text

Filters ● Transform and normalize the search data ● Examples: – LowerCaseFilterFactory – lower case tokens – ASCIIFoldingFilterFactory – converts >127 unicode characters into their ASCII equivalents – External: MorfologikFilterFactory – filter to use with Morfologik stemmer

Slide 69

Slide 69 text

¿Questions?

Slide 70

Slide 70 text

Resources ● Solr: http://lucene.apache.org/solr/ ● Lucene: https://lucene.apache.org/core/

Slide 71

Slide 71 text

Thank You!