Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Full-text search (with emphasis on Japanese)

nz
April 30, 2012

Full-text search (with emphasis on Japanese)

This short presentation at @herokujp on 20 Apr 2012 takes a look at the basics of full-text search. Why is a full-text search index so much faster than searching in SQL?

nz

April 30, 2012
Tweet

Other Decks in Programming

Transcript

  1. id title … 1 hello, world! 2 hello, ౦ژ! 3

    ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  2. id title LIKE "hello"; … 1 hello, world! 2 hello,

    ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  3. id title LIKE "hello"; … 1 hello, world! 2 hello,

    ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  4. id title LIKE "hello"; … 1 hello, world! 2 hello,

    ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  5. id title LIKE "hello"; … 1 hello, world! 2 hello,

    ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  6. id title LIKE "hello"; … 1 hello, world! 2 hello,

    ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  7. id title LIKE "hello"; … 1 hello, world! 2 hello,

    ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  8. id title LIKE "hello"; … 1 hello, world! 2 hello,

    ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  9. SQL LIKE O(N) = SLOW ☹ SQL like is SLOW

    - You are scanning through your entire database table and checking each record. Monday, April 30, 12
  10. hello Search • hello, world! • hello, ౦ژ! however, you

    do get results, and so that's as far as some go. but there is another reason why SQL LIKE is a bad idea… Monday, April 30, 12
  11. hello world Search Let's pretend we are the customer or

    a user, and start entering more queries. This should work, right? Monday, April 30, 12
  12. id title LIKE "hello world"; … 1 hello, world! 2

    hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ here's the same table again Monday, April 30, 12
  13. id title LIKE "hello world"; … 1 hello, world! 2

    hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ uh-oh… That didn't work, because SQL like is basically just testing the exact equality of bytes here. That comma breaks our query. Monday, April 30, 12
  14. id title LIKE "hello world"; … 1 hello, world! 2

    hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ And it gets worse: we're not done with our search yet. We still have to check the rest of the table! Again, we're back to being slow. Monday, April 30, 12
  15. id title LIKE "hello world"; … 1 hello, world! 2

    hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  16. id title LIKE "hello world"; … 1 hello, world! 2

    hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  17. id title LIKE "hello world"; … 1 hello, world! 2

    hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  18. id title LIKE "hello world"; … 1 hello, world! 2

    hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12
  19. hello world Search No results found to match your query.

    ʹҰக͢Δ৘ใ͸ݟ͔ͭΓ·ͤΜͰͨ͠ɻ We just scanned through our entire table of data without getting any useful results, and the user has no idea why. And it gets worse. Monday, April 30, 12
  20. hello ౦ژ Search No results found to match your query.

    ʹҰக͢Δ৘ใ͸ݟ͔ͭΓ·ͤΜͰͨ͠ɻ It's easy to make up queries here that seem like they should match something, but they don't. Monday, April 30, 12
  21. ࡍۭ͜Μʹͪ͸ Search No results found to match your query. ʹҰக͢Δ৘ใ͸ݟ͔ͭΓ·ͤΜͰͨ͠ɻ

    Ultimately, if you want flexible searches, you need to parse your queries and combine the results of multiple searches. But that is slow! Monday, April 30, 12
  22. QUERY PARSING Flexible queries combine the results of many separate

    queries. Required and optional terms Flexible order of terms But many slow searches is even slower! Users expect flexible queries, but making queries flexible will make a slow search much slower. Monday, April 30, 12
  23. STEP ONE: MAKE IT FAST So let's revisit the slow

    search problem and try to make it faster Monday, April 30, 12
  24. id title … 1 hello, world! 2 hello, ౦ژ! 3

    ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ This is our original data. It's stored sensibly enough for SQL, but it's not really optimized for searching. Let's improve it. Monday, April 30, 12
  25. id term 1 hello 1 world 2 hello 2 ౦ژ

    This is one step in a better direction. Separate each term, and maintain its association with the original record that it appears in. Monday, April 30, 12
  26. id term 1 hello 2 hello 1 world 2 ౦ژ

    This means we can do something clever, like sort by term instead, which will let us run a faster binary search. This hypothetical index looks a bit like a normal database index. Monday, April 30, 12
  27. PROBLEM: WHAT IS A TERM? But we have a problem.

    How do you decide what makes a "term"? This is easy in English, where words are separated by whitespace and punctuation. But languages like Japanese don't use whitespace, and have relatively little punctuation. Monday, April 30, 12
  28. N-GRAM One approach is to split the text into "n-grams"

    We can take the original text and break it into two-character "pairs" Monday, April 30, 12
  29. γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχ χΞ Ξϓ ϓϩ …

    ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢ ੢ࠃ ࠃࡍ ࡍۭ ۭߓ The results would look something like this. It's not a very good technique, but it's better than nothing. Monday, April 30, 12
  30. N-GRAM Generates too many “terms” Terms don't preserve meaning Bad

    for index size and relevancy We can do better! The problem: 1. it generates a lot of terms 2. many of these "terms" don't have meaning 3. bad for index size and performance 4. bad for relevancy we can do better! Monday, April 30, 12
  31. KUROMOJI NEW IN LUCENE 3.6.0 Happily, Lucene 3.6.0 was released

    one week ago with an EXCELLENT Japanese morphological analyzer package called Kuromoji Monday, April 30, 12
  32. γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχΞ ϓϩδΣΫτ Ϛωʔδϟ ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢

    ࠃࡍ ۭߓ We can see right away—if you read Japanese—that we get much better terms from this kind of analysis. Since we are confident that we can tokenize Japanese, let's continue building our hypothetical index Monday, April 30, 12
  33. term id hello 1 hello 2 world 1 ౦ژ 2

    ౦ژ 3 ͜Μʹͪ͸ 3 ͜Μʹͪ͸ 4 γχΞ 4 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 Our table now includes a few of the tokenized Japanese terms Monday, April 30, 12
  34. term id hello 1 hello 2 world 1 ͔Β 5

    ͔Β 6 ͜Μʹͪ͸ 3 ͜Μʹͪ͸ 4 ͠ 5 ͠ 6 γχΞ 4 ͨ 5 ͨ 6 term id ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5 ग़ൃ 6 ࠃࡍ 5 ࠃࡍ 6 ੒ా 6 ౦ژ 2 ౦ژ 3 ۭߓ 5 ۭߓ 6 ؔ੢ 5 When we finish tokenizing all the text, we end up with a table that looks something like this. This is progress—but we can do better. Monday, April 30, 12
  35. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 This table here maintains one entry per term, associated with a set of IDs for the records that are included. Monday, April 30, 12
  36. LUCENE “INVERSE INDEX” We have been building a structure that

    is similar to the "inverse index" built by Lucene. Lucene is a library that specializes in creating and maintaining efficient data structures for your index. Monday, April 30, 12
  37. ۭߓ Search Let's try a few more searches against this

    new data structure to compare it to our earlier slow searches Monday, April 30, 12
  38. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Because we now have a sorted list of each term, we can perform a binary search. Monday, April 30, 12
  39. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 We check the middle Monday, April 30, 12
  40. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Check the middle again Monday, April 30, 12
  41. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 And check the middle again. Three operations to find our matching records, from a list of 15 terms! Monday, April 30, 12
  42. id title … 1 hello, world! 2 hello, ౦ژ! 3

    ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Now that we have the matching IDs, it is a simple matter for SQL to fetch the matching rows Monday, April 30, 12
  43. FAST! O(LOG N) So the bottom line is that an

    inverse index is very fast! We can take advantage of this speed for better queries Monday, April 30, 12
  44. ۭߓͷϚωʔδϟ Search A good search engine processes your query into

    tokens, the same as it does your data, and runs a separate "query" for each term Monday, April 30, 12
  45. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ۭߓ Let's find documents matching our first term Monday, April 30, 12
  46. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ۭߓ Monday, April 30, 12
  47. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ۭߓ Monday, April 30, 12
  48. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ۭߓ Monday, April 30, 12
  49. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ͷ Now let's look for the second term Monday, April 30, 12
  50. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ͷ Monday, April 30, 12
  51. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ͷ Monday, April 30, 12
  52. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ͷ Monday, April 30, 12
  53. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ͷ We didn't find it, but that's okay Monday, April 30, 12
  54. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Ϛωʔδϟ Now let's look for documents matching the third term Monday, April 30, 12
  55. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Ϛωʔδϟ Monday, April 30, 12
  56. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Ϛωʔδϟ Monday, April 30, 12
  57. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Ϛωʔδϟ Monday, April 30, 12
  58. term id hello 1, 2 world 1 ͔Β 5, 6

    ͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Ϛωʔδϟ We can combine the matched documents in many different ways using set theory Monday, April 30, 12
  59. id title … 1 hello, world! 2 hello, ౦ژ! 3

    ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ in this case, let's just fetch all the documents that match any of the terms Monday, April 30, 12
  60. REVIEW Using an index is FASTER Using an index is

    more FLEXIBLE Lucene creates and manages efficient index structures Monday, April 30, 12
  61. SOLR, ELASTICSEARCH HTTP interface to Lucene. Scale separately from your

    application. Use with any language or framework. Abstract away low-level Lucene implementation details. Monday, April 30, 12
  62. Solr ElasticSearch Created in 2004. Created in 2010. Well-established and

    widely adopted. Growing quickly with early- adopters. Pre-RESTful API design. Modern RESTful JSON. XML configuration files. Minimal JSON/YAML configuration. Distribution & real-time a work in progress Distributed & real-time by design. More features. More minimalist. Many developers. Just one “benevolent dictator.” Monday, April 30, 12