Upgrade to Pro — share decks privately, control downloads, hide ads and more …

SQLiteで高速全文検索〜日本語編〜

A003bc0b49c7e5adf746b6d73cb57956?s=47 shoby
August 20, 2014

 SQLiteで高速全文検索〜日本語編〜

【第8回】potatotips (iOS/Android開発Tips共有会)

A003bc0b49c7e5adf746b6d73cb57956?s=128

shoby

August 20, 2014
Tweet

Transcript

  1. 42-JUFͰ ߴ଎શจݕࡧ ʙ೔ຊޠฤʙ @shobyshoby

  2. ࣗݾ঺հ pixivͰΞϓϦͭͬͯ͘·͢

  3. ຊ୊

  4. iOSΞϓϦͰ SQLiteΛ࢖ͬͯ ೔ຊޠΛ ߴ଎ʹશจݕࡧ͍ͨ͠

  5. ֓ཁ

  6. 42-JUFͷશจݕࡧ֦ு ೔ຊޠͷτʔΫϯ෼ׂ ݕࡧॱҐ σϞ

  7. 42-JUFͷ શจݕࡧ֦ு

  8. SQLite FTS3 and FTS4 Extensions http://www.sqlite.org/fts3.html

  9. iOS 6 SDK͔Β σϑΥϧτͰ࢖͑Δ

  10. جຊతͳ࢖͍ํ

  11. CREATE VIRTUAL TABLE article USING fts4 (title, body); FTS4 ςʔϒϧ

  12. INSERT INSERT INTO article(title, body) VALUES ('Apple', 'I have iPhone4

    and iPhone5s');
  13. جຊతͳQuery SELECT * FROM article WHERE body MATCH 'iPhone5s'; term

    match SELECT * FROM article WHERE body MATCH 'iPhone*'; term-prefix match
  14. phrase match SELECT * FROM article WHERE body MATCH 'have

    iPhone*'; ◦I have iPhone5s ×I have Android and iPhone5s
  15. ಛघͳQuery SELECT * FROM article WHERE article MATCH 'iPhone*'; all

    columns SELECT * FROM article WHERE article MATCH 'title:Apple iPhone*'; specified column
  16. ؆୯ʂ

  17. ※ͨͩ͠ӳޠʹݶΔ

  18. ೔ຊޠͷ৔߹ɺ FTSͷTokenizer͕ ͏·͘tokenʹ෼͚ΒΕͳ͍

  19. FTSͷTokenizer͕ ͏·͘ѻ͑Δܗࣜʹ Ճ޻ͯ͠σʔλΛೖΕΑ͏ʂ

  20. ೔ຊޠͷ τʔΫϯ෼ׂ

  21. FTSͷςʔϒϧʹ͸ ൒֯εϖʔεͰ ෼ׂͨ͠σʔλΛೖΕΔ

  22. CREATE TABLE article (id, title, body); CREATE VIRTUAL TABLE articleTokens

    USING fts4 (articleId, titleTokens, bodyTokens); ݩσʔλΛೖΕΔςʔϒϧ FTSςʔϒϧ
  23. INSERT͍ͨ͠σʔλ INSERT INTO article(id, title, body) VALUES (1, 'Ξοϓϧ', 'ࢲ͸ΞΠϑΥʔϯ4ͱΞΠϑΥʔϯ5sΛ͍࣋ͬͯ·

    ͢'); ! INSERT INTO articleTokens(articleId, titleTokens, bodyTokens) VALUES (1, 'Ξοϓϧ', 'ࢲ͸ ΞΠϑΥʔϯ4 ͱ ΞΠϑΥʔϯ5s Λ ࣋ͬ ͍ͯ·͢');
  24. ౤͍͛ͨSELECT SELECT * FROM article JOIN ( SELECT articleId FROM

    articleTokens WHERE bodyTokens MATCH 'ΞΠϑΥʔϯ*' LIMIT 100; ) AS result ON article.id = result.articleId ORDER BY article.id;
  25. τʔΫϯ෼ׂ͢Δ

  26. CFStringTokenizer

  27. ಺෦ͰMeCabΛ࢖ͬͯ ܗଶૉղੳͯ͠ΔΒ͍͠ http://stackoverflow.com/questions/8280824/how-to-use- cfstringtokenizer-with-chinese-and-japanese

  28. - (NSArray *)tokenArrayWithString:(NSString *)string { NSLocale *locale = [[NSLocale alloc]

    initWithLocaleIdentifier:@"ja"]; CFRange range = CFRangeMake(0, CFStringGetLength((CFStringRef)string)); CFStringTokenizerRef tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault, (CFStringRef)string, range, kCFStringTokenizerUnitWordBoundary, (CFLocaleRef)locale); NSMutableArray *tokenArray = [NSMutableArray array]; while(CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) { CFRange tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer); if(range.location != kCFNotFound) { NSString *token = [string substringWithRange:NSMakeRange(tokenRange.location, tokenRange.length)]; [tokenArray addObject:token]; } } CFRelease(tokenizer); return tokenArray; }
  29. INSERT

  30. tokenArrayΛ ൒֯εϖʔεͰ࿈݁ͯ͠ INSERT͢Δ [tokenArray componentsJoinedByString:@" "];

  31. ݕࡧ

  32. ֤Tokenʹ * Λ௥Ճ͠ ൒֯εϖʔεͰ࿈݁ͯ͠ QueryΛ૊ΈཱͯΔ NSMutableArray *searchTokens = [NSMutableArray arrayWithCapacity:tokenArray.count];

    for (NSString *token in tokenArray) { [searchTokens addObject:[token stringByAppendingString:@"*"]]; } [searchTokens componentsJoinedByString:@" "];
  33. ݕࡧॱҐ

  34. Ϛον౓͕ߴ͍ॱʹ ݁ՌΛฦ͍ͨ͠

  35. SQLite͸ ಠࣗͷSQLؔ਺Λ ఆٛͰ͖Δ

  36. rank()ͱ͍͏ SQLؔ਺͕ ͋ͬͨͱ͢Δͱ…

  37. ౤͍͛ͨSELECT SELECT * FROM article JOIN ( SELECT articleId, rank(matchinfo(articleTokens,

    'pcnalx')) AS rank FROM articleTokens WHERE bodyTokens MATCH 'ΞΠϑΥʔϯ*' LIMIT 100; ) AS result ON article.id = result.articleId ORDER BY rank DESC;
  38. match_info() ! Ϛον৘ใΛฦ͢
 FTSͷؔ਺ http://www.sqlite.org/fts3.html#matchinfo

  39. match_info()͔Β ϚονείΞΛฦ͢ SQLؔ਺Λ࡞Ζ͏ʂ

  40. CݴޠͰ…

  41. ϚονείΞͷ ΞϧΰϦζϜ

  42. Okapi BM25 http://en.wikipedia.org/wiki/Okapi_BM25

  43. sqlite-okapi-bm25 https://github.com/rads/sqlite-okapi-bm25

  44. ंྠͷ࠶ൃ໌͸ ආ͚ΒΕͨ ※஫ҙ ୯Ұͷcolumnʹ͔͠ରԠ͍ͯ͠ͳ͍ͷͰ ඞཁͳΒFork͢Δ

  45. σϞ

  46. ·ͱΊ

  47. 42-JUFͷ'54Λ࢖͏ͱɺߴ଎ʹ શจݕࡧͰ͖Δ ೔ຊޠ͸ࣄલʹ5PLFO෼ׂ͕ඞཁ ݕࡧॱҐ͸ಠࣗ42-ؔ਺Λ࡞ͬͯ ରԠ͢Δ

  48. ͓͠·͍