Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Introduction to Searching with Solr

Avatar for Sidnet Sidnet
October 30, 2015

Introduction to Searching with Solr

Avatar for Sidnet

Sidnet

October 30, 2015
Tweet

More Decks by Sidnet

Other Decks in Programming

Transcript

  1. Searching • Most of us start browsing the Internet with

    a search engine – that's how natural searching is
  2. Searching • Most of us start browsing the Internet with

    a search engine – that's how natural searching is • Unless your site has very little information, or everything is reachable with a few clicks, searching becomes a vital feature
  3. Searching • Poor man's solution: • You have to expose

    the data to Google • Guess what? Ads
  4. Searching in examples • We'll demonstrate and compare SQL database

    searches and searching with Solr • We'll use the same example database in all cases
  5. Example database • Extracted from a real-life web application •

    Contains blog posts and culinary recipes scraped from cooking blogs
  6. Example database • Extracted from a real-life web application •

    Contains blog posts and culinary recipes scraped from cooking blogs • 22 tables, 69300 records in posts table, 550 MB of textual data
  7. Example database posts id blog_id title body ... posts id

    blog_id title body ... blogs id name url ... blogs id name url ...
  8. Example database posts id blog_id title body ... posts id

    blog_id title body ... recipes id post_id title ... recipes id post_id title ... blogs id name url ... blogs id name url ...
  9. Example database posts id blog_id title body ... posts id

    blog_id title body ... recipes id post_id title ... recipes id post_id title ... ingredients id name ... ingredients id name ... recipes_ingredients recipe_id ingredient_id amount ... recipes_ingredients recipe_id ingredient_id amount ... blogs id name url ... blogs id name url ...
  10. Naïve approach with SQL LIKE • SELECT posts.id FROM recipes

    INNER JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE posts.title LIKE '%cake%' OR posts.body LIKE '%cake%' OR recipes.title LIKE '%cake%';
  11. Naïve approach with SQL LIKE • SELECT posts.id FROM recipes

    INNER JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE posts.title LIKE '%cake%' OR posts.body LIKE '%cake%' OR recipes.title LIKE '%cake%'; 949 rows in set (10.07 sec)
  12. Naïve approach with SQL LIKE • Slow • Very bad

    effect on database performance • Problems with whitespace/punctuation (e.g.: LIKE '% cake %' against “cake!”)
  13. Naïve approach with SQL LIKE • Slow • Very bad

    effect on database performance • Problems with whitespace/punctuation (e.g.: LIKE '% cake %' against “cake!”) • Really, a bad idea
  14. Searching features built into RDBMS • MySQL – full text

    search indexes CREATE TABLE posts ( id INT NOT NULL AUTO INCREMENT, title VARCHAR(200), body TEXT, ... FULLTEXT (title, body), ...
  15. Searching features built into RDBMS SELECT posts.id FROM recipes INNER

    JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE MATCH (posts.title, posts.body) AGAINST ('cake') OR MATCH (recipes.title) AGAINST ('cake');
  16. Searching features built into RDBMS SELECT posts.id FROM recipes INNER

    JOIN posts ON posts.id = recipes.post_id INNER JOIN blogs ON blogs.id = posts.blog_id WHERE MATCH (posts.title, posts.body) AGAINST ('cake') OR MATCH (recipes.title) AGAINST ('cake'); 1042 rows in set (0.22 sec)
  17. Searching features built into RDBMS • Much faster • But

    still not blazingly fast • Used to only be available for MyISAM tables
  18. Searching features built into RDBMS • Much faster • But

    still not blazingly fast • Used to only be available for MyISAM tables (not anymore with MySQL 5.6)
  19. Searching features built into RDBMS • Similar features in other

    RDBMS: – PostgreSQL: Full Text Searching – MSSQL: Full-Text Search – Oracle: Oracle Text – SQLite: FTS5 extension
  20. Enter Solr • Based on Lucene, a high-performance, full- featured

    text search engine library • Notable features:
  21. Enter Solr • Based on Lucene, a high-performance, full- featured

    text search engine library • Notable features: – ranked searching
  22. Enter Solr • Based on Lucene, a high-performance, full- featured

    text search engine library • Notable features: – ranked searching – flexible faceting, highlighting, joins and result grouping
  23. Enter Solr • Based on Lucene, a high-performance, full- featured

    text search engine library • Notable features: – ranked searching – flexible faceting, highlighting, joins and result grouping – allows simultaneous update and searching
  24. Enter Solr • Based on Lucene, a high-performance, full- featured

    text search engine library • Notable features: – ranked searching – flexible faceting, highlighting, joins and result grouping – allows simultaneous update and searching – support for filters
  25. How Solr works • Importing data from a source (e.g.

    SQL database) – full import – delta import
  26. How Solr works • Importing data from a source (e.g.

    SQL database) – full import – delta import • Indexing imported data (with index filters)
  27. How Solr works • Importing data from a source (e.g.

    SQL database) – full import – delta import • Indexing imported data (with index filters) • Searching (with query filters)
  28. Solr configuration <document name="blogs_solr"> <entity name="item" query=" SELECT r.id AS

    id, r.post_id AS post_id, r.title AS recipe_title, ... FROM recipes AS r INNER JOIN posts AS p ON p.id = r.post_id INNER JOIN blogs AS b ON b.id = p.blog_id ORDER BY posted_on DESC "
  29. Solr configuration deltaQuery=" SELECT r.id AS id, r.post_id AS post_id,

    r.title AS recipe_title, ... FROM recipes AS r INNER JOIN posts AS p ON p.id = r.post_id INNER JOIN blogs AS b ON b.id = p.blog_id WHERE p.added_on > '${dataimporter.last_index_time}' ORDER BY posted_on DESC "
  30. Solr configuration <entity name=”item”> <field column="id" name="id" /> <field column="post_id"

    name="post_id" /> <field column="recipe_title" name="recipe_title" /> ... <entity name="ingredients" query="SELECT i.name AS ingredient_name FROM ingredients AS i ... WHERE ri.recipe_id = '${item.id}'"> <field column="ingredient_name" name="ingredients" />
  31. Solr in action • Delta import: • Search: curl http://host:port/solr/database/dataimport

    \ --data 'command=delta-import' curl http://host:port/solr/database/select? q=post_body%3Acake
  32. Solr in action { "responseHeader": { ... "QTime": 2, "params":

    { ... "q": "post_body:cake", ... } }, "response": { "numFound": 1009, ...
  33. Solr in action { "responseHeader": { ... "QTime": 2, "params":

    { ... "q": "post_body:cake", ... } }, "response": { "numFound": 1009, ... Blazingly fast
  34. Using Solr in application code • As simple as making

    an HTTP request: $phrase = 'cake'; $url = 'http://host/solr/database/select?q=' . urlencode('post_body:' . $phrase); $data = json_decode(file_get_contents($url), TRUE);
  35. Using Solr in application code • There are client libraries/language

    bindings available for all popular languages
  36. Filters • Transform and normalize the search data • Examples:

    – LowerCaseFilterFactory – lower case tokens
  37. Filters • Transform and normalize the search data • Examples:

    – LowerCaseFilterFactory – lower case tokens – ASCIIFoldingFilterFactory – converts >127 unicode characters into their ASCII equivalents
  38. Filters • Transform and normalize the search data • Examples:

    – LowerCaseFilterFactory – lower case tokens – ASCIIFoldingFilterFactory – converts >127 unicode characters into their ASCII equivalents – External: MorfologikFilterFactory – filter to use with Morfologik stemmer