Full-text search (with emphasis on Japanese)

FULL-TEXT SEARCH Nick Zadrozny websolr.com :: bonsai.io Monday, April 30,
12

Search Monday, April 30, 12

id title … 1 hello, world! 2 hello, ౦ژ! 3
౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12

hello Search Monday, April 30, 12

SQL LIKE TITLE LIKE "HELLO"; Monday, April 30, 12

id title LIKE "hello"; … 1 hello, world! 2 hello,
౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12

SQL LIKE O(N) = SLOW ☹ SQL like is SLOW
- You are scanning through your entire database table and checking each record. Monday, April 30, 12

hello Search • hello, world! • hello, ౦ژ! however, you
do get results, and so that's as far as some go. but there is another reason why SQL LIKE is a bad idea… Monday, April 30, 12

hello world Search Let's pretend we are the customer or
a user, and start entering more queries. This should work, right? Monday, April 30, 12

id title LIKE "hello world"; … 1 hello, world! 2
hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ here's the same table again Monday, April 30, 12

hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ uh-oh… That didn't work, because SQL like is basically just testing the exact equality of bytes here. That comma breaks our query. Monday, April 30, 12

hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ And it gets worse: we're not done with our search yet. We still have to check the rest of the table! Again, we're back to being slow. Monday, April 30, 12

hello, ౦ژ! 3 ౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Monday, April 30, 12

hello world Search No results found to match your query.
ʹҰக͢Δ৘ใ͸ݟ͔ͭΓ·ͤΜͰͨ͠ɻ We just scanned through our entire table of data without getting any useful results, and the user has no idea why. And it gets worse. Monday, April 30, 12

hello ౦ژ Search No results found to match your query.
ʹҰக͢Δ৘ใ͸ݟ͔ͭΓ·ͤΜͰͨ͠ɻ It's easy to make up queries here that seem like they should match something, but they don't. Monday, April 30, 12

ࡍۭ͜Μʹͪ͸ Search No results found to match your query. ʹҰக͢Δ৘ใ͸ݟ͔ͭΓ·ͤΜͰͨ͠ɻ
Ultimately, if you want flexible searches, you need to parse your queries and combine the results of multiple searches. But that is slow! Monday, April 30, 12

QUERY PARSING Flexible queries combine the results of many separate
queries. Required and optional terms Flexible order of terms But many slow searches is even slower! Users expect flexible queries, but making queries flexible will make a slow search much slower. Monday, April 30, 12

STEP ONE: MAKE IT FAST So let's revisit the slow
search problem and try to make it faster Monday, April 30, 12

౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ This is our original data. It's stored sensibly enough for SQL, but it's not really optimized for searching. Let's improve it. Monday, April 30, 12

id term 1 hello 1 world 2 hello 2 ౦ژ
This is one step in a better direction. Separate each term, and maintain its association with the original record that it appears in. Monday, April 30, 12

id term 1 hello 2 hello 1 world 2 ౦ژ
This means we can do something clever, like sort by term instead, which will let us run a faster binary search. This hypothetical index looks a bit like a normal database index. Monday, April 30, 12

PROBLEM: WHAT IS A TERM? But we have a problem.
How do you decide what makes a "term"? This is easy in English, where words are separated by whitespace and punctuation. But languages like Japanese don't use whitespace, and have relatively little punctuation. Monday, April 30, 12

N-GRAM One approach is to split the text into "n-grams"
We can take the original text and break it into two-character "pairs" Monday, April 30, 12

γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχ χΞ Ξϓ ϓϩ …
ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢ ੢ࠃ ࠃࡍ ࡍۭ ۭߓ The results would look something like this. It's not a very good technique, but it's better than nothing. Monday, April 30, 12

N-GRAM Generates too many “terms” Terms don't preserve meaning Bad
for index size and relevancy We can do better! The problem: 1. it generates a lot of terms 2. many of these "terms" don't have meaning 3. bad for index size and performance 4. bad for relevancy we can do better! Monday, April 30, 12

MORPHOLOGICAL ANALYSIS Morphological analysis uses a dictionary and statistical modeling
of the language to identify terms Monday, April 30, 12

KUROMOJI NEW IN LUCENE 3.6.0 Happily, Lucene 3.6.0 was released
one week ago with an EXCELLENT Japanese morphological analyzer package called Kuromoji Monday, April 30, 12

γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχΞϓϩδΣΫτϚωʔδϟʔ γχΞ ϓϩδΣΫτ Ϛωʔδϟ ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢ࠃࡍۭߓ ؔ੢
ࠃࡍ ۭߓ We can see right away—if you read Japanese—that we get much better terms from this kind of analysis. Since we are confident that we can tokenize Japanese, let's continue building our hypothetical index Monday, April 30, 12

term id hello 1 hello 2 world 1 ౦ژ 2
౦ژ 3 ͜Μʹͪ͸ 3 ͜Μʹͪ͸ 4 γχΞ 4 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 Our table now includes a few of the tokenized Japanese terms Monday, April 30, 12

term id hello 1 hello 2 world 1 ͔Β 5
͔Β 6 ͜Μʹͪ͸ 3 ͜Μʹͪ͸ 4 ͠ 5 ͠ 6 γχΞ 4 ͨ 5 ͨ 6 term id ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5 ग़ൃ 6 ࠃࡍ 5 ࠃࡍ 6 ੒ా 6 ౦ژ 2 ౦ژ 3 ۭߓ 5 ۭߓ 6 ؔ੢ 5 When we finish tokenizing all the text, we end up with a table that looks something like this. This is progress—but we can do better. Monday, April 30, 12

term id hello 1, 2 world 1 ͔Β 5, 6
͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 This table here maintains one entry per term, associated with a set of IDs for the records that are included. Monday, April 30, 12

LUCENE “INVERSE INDEX” We have been building a structure that
is similar to the "inverse index" built by Lucene. Lucene is a library that specializes in creating and maintaining efficient data structures for your index. Monday, April 30, 12

ۭߓ Search Let's try a few more searches against this
new data structure to compare it to our earlier slow searches Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Because we now have a sorted list of each term, we can perform a binary search. Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 We check the middle Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Check the middle again Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 And check the middle again. Three operations to find our matching records, from a list of 15 terms! Monday, April 30, 12

౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ Now that we have the matching IDs, it is a simple matter for SQL to fetch the matching rows Monday, April 30, 12

ۭߓ Search • ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ • ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ And we have our
search results, in much less time Monday, April 30, 12

FAST! O(LOG N) So the bottom line is that an
inverse index is very fast! We can take advantage of this speed for better queries Monday, April 30, 12

ۭߓͷϚωʔδϟ Search A good search engine processes your query into
tokens, the same as it does your data, and runs a separate "query" for each term Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ۭߓ Let's find documents matching our first term Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ۭߓ Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ͷ Now let's look for the second term Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ͷ Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 ͷ We didn't find it, but that's okay Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Ϛωʔδϟ Now let's look for documents matching the third term Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Ϛωʔδϟ Monday, April 30, 12

͜Μʹͪ͸ 3, 4 ͠ 5, 6 γχΞ 4 ͨ 5, 6 ϓϩδΣΫτ 4 Ϛωʔδϟ 4 ग़ൃ 5, 6 ࠃࡍ 5, 6 ੒ా 6 ౦ژ 2, 3 ۭߓ 5, 6 ؔ੢ 5 Ϛωʔδϟ We can combine the matched documents in many different ways using set theory Monday, April 30, 12

౦ژ͜Μʹͪ͸ʂ 4 ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ 5 ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ 6 ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ in this case, let's just fetch all the documents that match any of the terms Monday, April 30, 12

ۭߓͷϚωʔδϟ Search • ͜Μʹͪ͸γχΞϓϩδΣΫτϚωʔδϟʔ • ؔ੢ࠃࡍۭߓ͔Βग़ൃͨ͠ • ੒ాࠃࡍۭߓ͔Βग़ൃͨ͠ and we
have our results! Monday, April 30, 12

REVIEW Using an index is FASTER Using an index is
more FLEXIBLE Lucene creates and manages eﬃcient index structures Monday, April 30, 12

USING LUCENE: SOLR, ELASTICSEARCH Monday, April 30, 12

SOLR, ELASTICSEARCH HTTP interface to Lucene. Scale separately from your
application. Use with any language or framework. Abstract away low-level Lucene implementation details. Monday, April 30, 12

Solr ElasticSearch Created in 2004. Created in 2010. Well-established and
widely adopted. Growing quickly with early- adopters. Pre-RESTful API design. Modern RESTful JSON. XML configuration files. Minimal JSON/YAML configuration. Distribution & real-time a work in progress Distributed & real-time by design. More features. More minimalist. Many developers. Just one “benevolent dictator.” Monday, April 30, 12

https://devcenter.heroku.com/articles/websolr https://devcenter.heroku.com/articles/bonsai GIVE IT A TRY! Monday, April 30, 12

Full-text search (with emphasis on Japanese)

Full-text search (with emphasis on Japanese)

Other Decks in Programming

Featured

Transcript