Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Introduction to Searching with Solr

Sidnet
October 30, 2015

Introduction to Searching with Solr

Sidnet

October 30, 2015
Tweet

More Decks by Sidnet

Other Decks in Programming

Transcript

  1. Searching • Most of us start browsing the Internet with

    a search engine – that's how natural searching is
  2. Searching • Most of us start browsing the Internet with

    a search engine – that's how natural searching is • Unless your site has very little information, or everything is reachable with a few clicks, searching becomes a vital feature
  3. Searching • Poor man's solution: • You have to expose

    the data to Google • Guess what? Ads
  4. Searching in examples • We'll demonstrate and compare SQL database

    searches and searching with Solr • We'll use the same example database in all cases
  5. Example database • Extracted from a real-life web application •

    Contains blog posts and culinary recipes scraped from cooking blogs
  6. Example database • Extracted from a real-life web application •

    Contains blog posts and culinary recipes scraped from cooking blogs • 22 tables, 69300 records in posts table, 550 MB of textual data
  7. Example database posts id blog_id title body ... posts id

    blog_id title body ... blogs id name url ... blogs id name url ...
  8. Example database posts id blog_id title body ... posts id

    blog_id title body ... recipes id post_id title ... recipes id post_id title ... blogs id name url ... blogs id name url ...
  9. Example database posts id blog_id title body ... posts id

    blog_id title body ... recipes id post_id title ... recipes id post_id title ... ingredients id name ... ingredients id name ... recipes_ingredients recipe_id ingredient_id amount ... recipes_ingredients recipe_id ingredient_id amount ... blogs id name url ... blogs id name url ...
  10. Naïve approach with SQL LIKE • SELECT posts.id FROM recipes

    INNER JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE posts.title LIKE '%cake%' OR posts.body LIKE '%cake%' OR recipes.title LIKE '%cake%';
  11. Naïve approach with SQL LIKE • SELECT posts.id FROM recipes

    INNER JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE posts.title LIKE '%cake%' OR posts.body LIKE '%cake%' OR recipes.title LIKE '%cake%'; 949 rows in set (10.07 sec)
  12. Naïve approach with SQL LIKE • Slow • Very bad

    effect on database performance • Problems with whitespace/punctuation (e.g.: LIKE '% cake %' against “cake!”)
  13. Naïve approach with SQL LIKE • Slow • Very bad

    effect on database performance • Problems with whitespace/punctuation (e.g.: LIKE '% cake %' against “cake!”) • Really, a bad idea
  14. Searching features built into RDBMS • MySQL – full text

    search indexes CREATE TABLE posts ( id INT NOT NULL AUTO INCREMENT, title VARCHAR(200), body TEXT, ... FULLTEXT (title, body), ...
  15. Searching features built into RDBMS SELECT posts.id FROM recipes INNER

    JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE MATCH (posts.title, posts.body) AGAINST ('cake') OR MATCH (recipes.title) AGAINST ('cake');
  16. Searching features built into RDBMS SELECT posts.id FROM recipes INNER

    JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE MATCH (posts.title, posts.body) AGAINST ('cake') OR MATCH (recipes.title) AGAINST ('cake'); 1042 rows in set (0.22 sec)
  17. Searching features built into RDBMS • Much faster • But

    still not blazingly fast • Used to only be available for MyISAM tables
  18. Searching features built into RDBMS • Much faster • But

    still not blazingly fast • Used to only be available for MyISAM tables (not anymore with MySQL 5.6)
  19. Searching features built into RDBMS • Similar features in other

    RDBMS: – PostgreSQL: Full Text Searching – MSSQL: Full-Text Search – Oracle: Oracle Text – SQLite: FTS5 extension
  20. Enter Solr • Based on Lucene, a high-performance, full- featured

    text search engine library • Notable features:
  21. Enter Solr • Based on Lucene, a high-performance, full- featured

    text search engine library • Notable features: – ranked searching
  22. Enter Solr • Based on Lucene, a high-performance, full- featured

    text search engine library • Notable features: – ranked searching – flexible faceting, highlighting, joins and result grouping
  23. Enter Solr • Based on Lucene, a high-performance, full- featured

    text search engine library • Notable features: – ranked searching – flexible faceting, highlighting, joins and result grouping – allows simultaneous update and searching
  24. Enter Solr • Based on Lucene, a high-performance, full- featured

    text search engine library • Notable features: – ranked searching – flexible faceting, highlighting, joins and result grouping – allows simultaneous update and searching – support for filters
  25. How Solr works • Importing data from a source (e.g.

    SQL database) – full import – delta import
  26. How Solr works • Importing data from a source (e.g.

    SQL database) – full import – delta import • Indexing imported data (with index filters)
  27. How Solr works • Importing data from a source (e.g.

    SQL database) – full import – delta import • Indexing imported data (with index filters) • Searching (with query filters)
  28. Solr configuration <document name="blogs_solr"> <entity name="item" query=" SELECT r.id AS

    id, r.post_id AS post_id, r.title AS recipe_title, ... FROM recipes AS r INNER JOIN posts AS p ON p.id = r.post_id INNER JOIN blogs AS b ON b.id = p.blog_id ORDER BY posted_on DESC "
  29. Solr configuration deltaQuery=" SELECT r.id AS id, r.post_id AS post_id,

    r.title AS recipe_title, ... FROM recipes AS r INNER JOIN posts AS p ON p.id = r.post_id INNER JOIN blogs AS b ON b.id = p.blog_id WHERE p.added_on > '${dataimporter.last_index_time}' ORDER BY posted_on DESC "
  30. Solr configuration <entity name=”item”> <field column="id" name="id" /> <field column="post_id"

    name="post_id" /> <field column="recipe_title" name="recipe_title" /> ... <entity name="ingredients" query="SELECT i.name AS ingredient_name FROM ingredients AS i ... WHERE ri.recipe_id = '${item.id}'"> <field column="ingredient_name" name="ingredients" />
  31. Solr in action • Delta import: • Search: curl http://host:port/solr/database/dataimport

    \ --data 'command=delta-import' curl http://host:port/solr/database/select? q=post_body%3Acake
  32. Solr in action { "responseHeader": { ... "QTime": 2, "params":

    { ... "q": "post_body:cake", ... } }, "response": { "numFound": 1009, ...
  33. Solr in action { "responseHeader": { ... "QTime": 2, "params":

    { ... "q": "post_body:cake", ... } }, "response": { "numFound": 1009, ... Blazingly fast
  34. Using Solr in application code • As simple as making

    an HTTP request: $phrase = 'cake'; $url = 'http://host/solr/database/select?q=' . urlencode('post_body:' . $phrase); $data = json_decode(file_get_contents($url), TRUE);
  35. Using Solr in application code • There are client libraries/language

    bindings available for all popular languages
  36. Filters • Transform and normalize the search data • Examples:

    – LowerCaseFilterFactory – lower case tokens
  37. Filters • Transform and normalize the search data • Examples:

    – LowerCaseFilterFactory – lower case tokens – ASCIIFoldingFilterFactory – converts >127 unicode characters into their ASCII equivalents
  38. Filters • Transform and normalize the search data • Examples:

    – LowerCaseFilterFactory – lower case tokens – ASCIIFoldingFilterFactory – converts >127 unicode characters into their ASCII equivalents – External: MorfologikFilterFactory – filter to use with Morfologik stemmer