In Memory Full Text Search in Kotlin

Learn how full text search works and how to implement it in Kotlin.

Kshitij Chauhan

January 26, 2022

Transcript

  1. Situation

    You want to allow your users to search the episodes QUICKLY and THOROUGHLY
  2. Naive Search?

    • For every query the user enters, you search the entire catalog of episodes
    • For every episode, you match the search query with the title and the show notes
  3. Naive Search: Defining Episodes

        data class Episode(
            val id: Int,
            val title: String,
            val showNotes: String,
        )
  4. Naive Search: Matching Episodes

        searchBox.onQuery { query ->
            val searchResults = episodes.filter { episode ->
                episode.title.contains(query) || episode.showNotes.contains(query)
            }
            showResults(searchResults)
        }
  5. Naive Search: Problems

    • Slow - You have to search every episode for every query
    • Inaccurate - You have limited options to check if a string matches your query
    • Limited - No easy way to order search results according to relevance
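To make the "Inaccurate" point concrete, here is a minimal sketch of the naive approach run against a tiny catalog. The `naiveSearch` function and the `sampleEpisodes` data are hypothetical names made up for this illustration; the show notes are shortened.

```kotlin
data class Episode(val id: Int, val title: String, val showNotes: String)

// Hypothetical sample data, for illustration only.
val sampleEpisodes = listOf(
    Episode(1, "Skull Soup", "Conan chats with Phil"),
    Episode(2, "Bowen Yang", "Comedian Bowen Yang feels dissociative"),
)

// The naive approach: scan every episode for every query,
// using a plain case-sensitive substring check.
fun naiveSearch(episodes: List<Episode>, query: String): List<Episode> =
    episodes.filter { it.title.contains(query) || it.showNotes.contains(query) }
```

With this data, searching for "Conan" finds episode 1, but "conan" (different case) and "chatting" (different word form) find nothing.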
  6. Full Text Search (FTS) lets you look at all the words in every stored document* to match search criteria

    *A document is anything with text, like a Podcast Episode’s details
  7. Inverted Index

    • Contains information about every word stored in every document
    • It’s “inverted” because instead of storing which text each document contains, it stores which documents each piece of text appears in

        Map<Episode, List<String>> ❌
        Map<String, List<Episode>> ✅
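The difference between the two map shapes can be sketched with plain document ids standing in for `Episode` objects. The names `forwardIndex` and `invert` and the sample words are assumptions made for this illustration, not part of the talk.

```kotlin
// Forward index: document id -> words it contains.
// Hypothetical sample data, for illustration only.
val forwardIndex: Map<Int, List<String>> = mapOf(
    1 to listOf("conan", "talks", "iceland"),
    2 to listOf("conan", "chats", "soup"),
)

// Inverts the mapping: word -> ids of the documents that contain it.
fun invert(forward: Map<Int, List<String>>): Map<String, List<Int>> {
    val inverted = mutableMapOf<String, MutableList<Int>>()
    for ((docId, words) in forward) {
        // distinct() so a repeated word lists the document only once
        for (word in words.distinct()) {
            inverted.getOrPut(word) { mutableListOf() }.add(docId)
        }
    }
    return inverted
}
```

Looking up a word in the inverted map is a single key lookup, while the forward map would need a scan over every document.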
  8. Inverted Index

    • You need to build it before you can use it
    • Building the index takes time, but it speeds up search queries dramatically
    • Clever text processing allows the index to find search results accurately
    • Cleverer data structures allow the index to be memory efficient
    • Clevererer text analysis allows the index to rank search results effectively

    TL;DR: The Inverted Index is 💫 Magic 💫
  9. Step 1: Extract Text

        val episodes: List<Episode> = …
        val episodeData = mutableMapOf<Episode, String>()
        for (episode in episodes) {
            val text = buildString {
                append(episode.title)
                append(' ') // keep the title's last word apart from the show notes
                append(episode.showNotes)
            }
            episodeData[episode] = text
        }
  10. Step 1: Extract Text

        listOf(
            "Patton Oswalt & Meredith Salenger Comedian Comedian Patton Oswalt feels...",
            "Dating Your Family In Iceland Conan talks with Bjarki from Reykjavik...",
            "Bowen Yang Comedian Bowen Yang feels dissociative about being Conan...",
            "The Fiddler and The Ski Mall Conan speaks with Jesse at the World...",
            "Zach Galifianakis Returns Comedian Zach Galifianakis feels…sincere...",
            "Skull Soup Conan chats with Phil in Fall River, MA about working…",
            ...
        )
  11. Step 2: Process Text

    • Users don’t search for sentences, they search for words
    • It makes sense to break input text into words, or “tokens”
    • Tokens can be indexed by the Inverted Index
    • But tokens contain noise, and must be cleaned before indexing
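One way to handle the "noise" is to treat punctuation as a separator instead of splitting on spaces only. The sketch below is an assumption about what cleaning could look like, not the talk's code (the slides split on `" "`); the `tokenize` name is hypothetical.

```kotlin
// Splits text into lowercase tokens.
// Assumption: any run of characters that are not letters or digits
// separates two tokens, so punctuation never ends up inside a token.
fun tokenize(text: String): List<String> =
    text.lowercase()
        .split(Regex("[^\\p{L}\\p{N}]+"))
        .filter { it.isNotEmpty() } // drop empty strings at the edges
```

For example, `tokenize("Fall River, MA")` yields `river` rather than `river,`. A known trade-off of this rule is that contractions such as "don't" are split in two.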
  12. Step 2: Process Text

        val tokensByDoc = mutableMapOf<Episode, List<String>>()
        for ((episode, text) in episodeData) {
            val tokens = text.split(" ")
            tokensByDoc[episode] = tokens
        }

    "Rashida sits down with Conan to talk"
  13. Step 2: Process Text

        for ((episode, text) in episodeData) {
            val tokens = text
                .lowercase()
                .split(" ")
            tokensByDoc[episode] = tokens
        }

    Lowercasing the data allows queries to be case-insensitive
  14. Step 3: Reduce Text

    • Users don’t always enter exact search queries
    • If they search for “sit”, they expect to see results for “sits” and “sitting” too
    • “Stemming” is a technique that reduces each word to its root form

    “After 25 years at the Late Night desk Conan realized that the only people at his holiday party are the men and women who work for him. Over the years and despite thousands of interviews Conan has never made a real and lasting friendship with any of his celebrity guests. So he started a podcast to do just that.”
  15. Step 3: Reduce Text

        fun stem(token: String): String {
            // Complex stemming logic 🤓
        }

        val stemmedTokensByDoc = tokensByDoc
            .mapValues { (_, tokens) -> tokens.map { token -> stem(token) } }
  16. Step 3: Reduce Text

    We can use an existing implementation of the Porter Stemming algorithm. And thank the author who wrote it 🙌
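To get a feel for what a stemmer does, here is a deliberately naive suffix stripper. It is NOT the Porter algorithm and not the talk's `stem` function; the `naiveStem` name, the handled suffixes and the length thresholds are all assumptions chosen to keep the sketch short.

```kotlin
// A toy stemmer: strips a few common English suffixes.
// Assumption: only "-ing", "-ed" and plural "-s" are handled, and very
// short words are left alone so that "ring" does not become "r".
fun naiveStem(token: String): String {
    val stripped = when {
        token.length >= 6 && token.endsWith("ing") -> token.dropLast(3)
        token.length >= 5 && token.endsWith("ed") -> token.dropLast(2)
        token.length >= 4 && token.endsWith("s") && !token.endsWith("ss") ->
            return token.dropLast(1)
        else -> return token
    }
    // Undo consonant doubling left behind by "-ing"/"-ed": "sitt" -> "sit".
    // Like Porter, keep a doubled l, s or z ("fall", "miss").
    val last = stripped.last()
    val doubled = stripped.length >= 2 &&
        stripped[stripped.length - 2] == last &&
        last !in "aeioulsz"
    return if (doubled) stripped.dropLast(1) else stripped
}
```

With these rules "sit", "sits" and "sitting" all reduce to "sit". Real text needs far more rules than this, which is why reusing a tested Porter implementation is the better choice.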
  17. Step 4: Build the Index

        val index: MutableMap<String, MutableList<Episode>> = mutableMapOf()
        stemmedTokensByDoc.forEach { (episode, tokens) ->
            for (token in tokens) {
                val episodesWithToken = index[token] ?: mutableListOf()
                episodesWithToken.add(episode)
                index[token] = episodesWithToken
            }
        }
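The same idea can be wrapped in a function. This sketch is a variation, not the slide's code: it stores a `Set` per token so that an episode is not listed twice when a token occurs more than once in it, and it uses `getOrPut` for the lookup-or-create step. The `buildIndex` name is hypothetical.

```kotlin
data class Episode(val id: Int, val title: String, val showNotes: String)

// Builds the inverted index: token -> episodes containing that token.
// Assumption: tokens were already lowercased and stemmed by the caller.
fun buildIndex(tokensByDoc: Map<Episode, List<String>>): Map<String, Set<Episode>> {
    val index = mutableMapOf<String, MutableSet<Episode>>()
    for ((episode, tokens) in tokensByDoc) {
        for (token in tokens) {
            // Create the set on first sight of the token, then add the episode.
            index.getOrPut(token) { mutableSetOf() }.add(episode)
        }
    }
    return index
}
```

Using a set here means the search step would no longer need to remove duplicates that come from a single token.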
  20. FTS Search: Searching the Index

        searchBox.onQuery { query ->
            val queryTokens = query
                .lowercase()
                .split(" ")
                .map { stem(it) }
            val searchResults = queryTokens
                .mapNotNull { token -> index[token] }
                .flatten()
                .distinctBy { it.id }
            showResults(searchResults)
        }
  21. FTS Search: Searching the Index

        .lowercase()
        .split(" ")
        .map { stem(it) }

    👈 Tokenise the search query
  22. FTS Search: Searching the Index

    Find all episodes matching a query token 👇

        .mapNotNull { token -> index[token] }
  23. FTS Search: Searching the Index

    Combine results for all tokens into a single list, and remove duplicates 👇

        .flatten()
        .distinctBy { it.id }
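The search in the slides returns episodes that match any of the query tokens. If you would rather require every token to match, one possible variation is to intersect the per-token results. This is a sketch under assumptions: the `searchAll` name is hypothetical, and document ids stand in for `Episode` objects.

```kotlin
// Returns the ids of documents that contain every query token (AND semantics).
// Assumption: queryTokens went through the same processing as the indexed text.
fun searchAll(index: Map<String, Set<Int>>, queryTokens: List<String>): Set<Int> {
    if (queryTokens.isEmpty()) return emptySet()
    return queryTokens
        .map { token -> index[token].orEmpty() } // unknown token -> empty set
        .reduce { acc, ids -> acc intersect ids }
}
```

A single unknown token empties the whole result, which is stricter than the "any token" behaviour, so which one fits depends on how forgiving you want the search box to be.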
  24. Meet Lucilla!

    github.com/haroldadmin/lucilla

    Lucilla is a fast, efficient and customisable in-memory Full Text Search library for Kotlin

    - Memory efficient FTS Index
    - Customisable Text Processing Pipeline
    - TF-IDF based search result ranking
    - Dead simple, concise API
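For readers unfamiliar with TF-IDF, here is a sketch of one common textbook variant. It is not Lucilla's implementation, and the `tfIdfScores` name is hypothetical. It assumes term frequency is the token's share of the document and inverse document frequency is `ln(N / df)`, where `N` is the number of documents and `df` is how many of them contain the token.

```kotlin
import kotlin.math.ln

// Scores every document against the query tokens.
// docs maps a document id to its (already processed) tokens.
fun tfIdfScores(docs: Map<Int, List<String>>, queryTokens: List<String>): Map<Int, Double> {
    val totalDocs = docs.size.toDouble()
    return docs.mapValues { (_, tokens) ->
        queryTokens.distinct().sumOf { token ->
            // Number of documents that contain the token at least once.
            val docFrequency = docs.values.count { token in it }
            if (docFrequency == 0 || tokens.isEmpty()) {
                0.0
            } else {
                val tf = tokens.count { it == token } / tokens.size.toDouble()
                tf * ln(totalDocs / docFrequency)
            }
        }
    }
}
```

A token that appears in every document gets an IDF of `ln(1) = 0`, so it adds nothing to the score, while rarer tokens count for more. Sorting documents by descending score gives a relevance order.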