In Memory Full Text Search in Kotlin

Kshitij Chauhan In-Memory Full Text Search with Kotlin Android Worldwide
@haroldadmin January 25th, 2022

Kshitij @haroldadmin

…why though? - You (probably)

Structure of the Talk Open Monologue  Conan O’Brien  Maybe some
F1 references

Situation You’re building a Podcast app

Situation Your server sent you a list of FIVE HUNDRED
EPISODES of a Podcast

Situation Each episode contains a TITLE and lengthy SHOW NOTES

Situation You want to allow your users to search the
episodes QUICKLY, and THOROUGHLY

What do you do?

Naive Search?
• For every query the user enters, you search the entire catalog of episodes
• For every episode, you match the search query with the title and the show notes

Naive Search: Defining Episodes
    data class Episode(
        val id: Int,
        val title: String,
        val showNotes: String,
    )

Naive Search: Fetching Queries
    val episodes: List<Episode> = dataSource.episodes
    searchBox.onQuery { query ->
        // Process Query
    }

Naive Search: Matching Episodes
    searchBox.onQuery { query ->
        val searchResults = episodes.filter { episode ->
            episode.title.contains(query) || episode.showNotes.contains(query)
        }
        showResults(searchResults)
    }


Naive Search: Problems
• Slow - You have to search every episode for every query
• Inaccurate - You have limited options to check if a string matches your query
• Limited - No easy way to order search results according to relevance
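To make the "Inaccurate" point concrete, here is a minimal, self-contained sketch of the naive search. The sample episodes are made up for illustration and are not real show notes.

```kotlin
data class Episode(val id: Int, val title: String, val showNotes: String)

// Naive search: scan every episode and do a plain substring check on both fields.
fun naiveSearch(episodes: List<Episode>, query: String): List<Episode> =
    episodes.filter { episode ->
        episode.title.contains(query) || episode.showNotes.contains(query)
    }

// Hypothetical sample data, invented for this sketch.
val sampleEpisodes = listOf(
    Episode(1, "Skull Soup", "Conan chats with Phil about working"),
    Episode(2, "Bowen Yang", "Bowen is sitting down with Conan to talk"),
)
```

With this data, searching for "Conan" finds both episodes, but "conan" finds nothing because `contains` is case-sensitive by default, and "sits" finds nothing even though episode 2 contains "sitting".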

How Can We Improve This?

Enter: ✨ Full Text Search ✨

Full Text Search (FTS) lets you look at all the words in every stored document* to match the search criteria
*A document is anything with text, like a Podcast Episode’s details

It’s faster, more accurate, and more flexible than naive search

But how does it work?

🪄 Inverted Index 🪄 The Magic Behind FTS

Inverted Index
• Contains information about every word stored in every document
• It’s “inverted” because instead of storing which document contains what text, it stores what text is present in which document
    Map<Episode, List<String>> ❌
    Map<String, List<Episode>> ✅
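The inversion itself can be sketched in a few lines. This is an illustration only: it uses plain Int ids instead of Episode objects, and the sample words are invented.

```kotlin
// Forward view: document id -> words it contains (hypothetical sample data).
val forwardIndex: Map<Int, List<String>> = mapOf(
    1 to listOf("conan", "talks", "soup"),
    2 to listOf("conan", "chats"),
)

// Inverted view: word -> ids of the documents that contain it.
fun invert(forward: Map<Int, List<String>>): Map<String, Set<Int>> {
    val inverted = mutableMapOf<String, MutableSet<Int>>()
    for ((docId, words) in forward) {
        for (word in words) {
            inverted.getOrPut(word) { mutableSetOf() }.add(docId)
        }
    }
    return inverted
}
```

Looking up "conan" in the inverted map returns documents 1 and 2 with a single map access, instead of a scan over every document.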

Inverted Index
• You need to build it before you can use it
• Building the index takes time, but it speeds up search queries dramatically
• Clever text processing allows the index to find search results accurately
• Cleverer data structures allow the index to be memory efficient
• Clevererer text analysis allows the index to rank search results effectively
TL;DR: The Inverted Index is 💫 Magic 💫

How Do We Build The Index?

Extract Text Step One

Step 1: Extract Text
    val episodes: List<Episode> = …
    val episodeData = mutableMapOf<Episode, String>()
    for (episode in episodes) {
        val text = buildString {
            append(episode.title)
            append(' ') // keeps the title and show notes from running together
            append(episode.showNotes)
        }
        episodeData[episode] = text
    }

Extract Text
    append(episode.title)     // “Down To The Cockaroo”
    append(episode.showNotes) // “Conan talks to tatt…”

Step 1: Extract Text
    listOf(
        "Patton Oswalt & Meredith Salenger Comedian Comedian Patton Oswalt feels...",
        "Dating Your Family In Iceland Conan talks with Bjarki from Reykjavik...",
        "Bowen Yang Comedian Bowen Yang feels dissociative about being Conan...",
        "The Fiddler and The Ski Mall Conan speaks with Jesse at the World...",
        "Zach Galifianakis Returns Comedian Zach Galifianakis feels…sincere...",
        "Skull Soup Conan chats with Phil in Fall River, MA about working…",
        ...
    )

Process Text Step Two

Step 2: Process Text
• Users don’t search for sentences, they search for words
• Makes sense to break input text into words, or “tokens”
• Tokens can be indexed by the Inverted Index
• But tokens contain noise, and must be cleaned before indexing

Step 2: Process Text
    val tokensByDoc = mutableMapOf<Episode, List<String>>()
    // Destructure each map entry into the episode and its extracted text
    for ((episode, data) in episodeData) {
        // e.g. data = "Rashida sits down with Conan to talk"
        val tokens = data.split(" ")
        tokensByDoc[episode] = tokens
    }


Process Text
    val tokens = data
        .lowercase() // Lowercasing data to allow queries to be case-insensitive
        .split(" ")
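Splitting on spaces leaves punctuation glued to tokens ("chats," or "phil!"), which is one kind of the "noise" mentioned earlier. A minimal sketch of a slightly cleaner tokenizer follows. It assumes a token is a run of letters or digits, which is a simplification chosen for this example and not necessarily what the talk or Lucilla does.

```kotlin
// Assumption for this sketch: anything that is not a letter or digit separates tokens.
val nonWordChars = Regex("[^\\p{L}\\p{N}]+")

fun tokenize(text: String): List<String> =
    text.lowercase()
        .split(nonWordChars)
        // Leading or trailing separators produce empty strings, so drop them.
        .filter { it.isNotEmpty() }
```

Note that this rule also splits on apostrophes, so "O’Brien" becomes the two tokens "o" and "brien". Whether that is acceptable depends on your data.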

Reduce Text Step Three

Step 3: Reduce Text
• Users don’t always enter exact search queries
• If they search for “sit”, they expect to see results for “sits” and “sitting” too
• “Stemming” is a technique that reduces each word to its root form, or “stem”
“After 25 years at the Late Night desk Conan realized that the only people at his holiday party are the men and women who work for him. Over the years and despite thousands of interviews Conan has never made a real and lasting friendship with any of his celebrity guests. So he started a podcast to do just that.”

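Step 3: Reduce Text
    fun stem(token: String): String {
        // Complex stemming logic 🤓
        TODO("plug in a stemmer implementation here")
    }
    // Named stemmedTokensByDoc to match the name used when building the index
    val stemmedTokensByDoc = tokensByDoc
        .mapValues { (_, tokens) ->
            tokens.map { token -> stem(token) }
        }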

Reduce Text
We can use an existing implementation of the Porter stemming algorithm, and thank the author who wrote it 🙌
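To get a feel for what a stemmer does, here is a deliberately tiny suffix-stripping sketch. It is NOT the Porter algorithm and gets plenty of English words wrong. The function name `toyStem` and its rules are invented for this example.

```kotlin
// Toy stemmer for illustration only. Not the Porter stemming algorithm.
fun toyStem(token: String): String {
    val suffixes = listOf("ing", "ed", "es", "s")
    for (suffix in suffixes) {
        // Require at least three remaining characters so short words like "is" or "red" survive.
        if (token.endsWith(suffix) && token.length - suffix.length >= 3) {
            val stem = token.removeSuffix(suffix)
            val n = stem.length
            // Collapse a doubled final consonant, e.g. "sitt" -> "sit".
            val doubledConsonant = stem[n - 1] == stem[n - 2] && stem[n - 1] !in "aeiou"
            return if (doubledConsonant) stem.dropLast(1) else stem
        }
    }
    return token
}
```

With these rules "sit", "sits" and "sitting" all reduce to "sit", which is the behaviour the slide asks for. A real stemmer handles many more cases, so prefer an existing Porter implementation in practice.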

Build the Index Step Four

Step 4: Build the Index
    val index: MutableMap<String, MutableList<Episode>> = mutableMapOf()
    stemmedTokensByDoc.forEach { (episode, tokens) ->
        for (token in tokens) {
            val episodesWithToken = index[token] ?: mutableListOf()
            episodesWithToken.add(episode)
            index[token] = episodesWithToken
        }
    }
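The same idea can be written a bit more compactly with `getOrPut`, and with a set per token so an episode that repeats a word is stored only once. This is a sketch under simplifying assumptions: tokenisation is just lowercase plus split on spaces, and the `stem` parameter defaults to doing nothing.

```kotlin
data class Episode(val id: Int, val title: String, val showNotes: String)

fun buildIndex(
    episodes: List<Episode>,
    stem: (String) -> String = { it }, // placeholder: identity instead of a real stemmer
): Map<String, Set<Episode>> {
    val index = mutableMapOf<String, MutableSet<Episode>>()
    for (episode in episodes) {
        val text = "${episode.title} ${episode.showNotes}"
        val tokens = text.lowercase().split(" ").filter { it.isNotBlank() }
        for (token in tokens) {
            // Create the set on first sight of a token, then add the episode to it.
            index.getOrPut(stem(token)) { mutableSetOf() }.add(episode)
        }
    }
    return index
}
```

Using a set trades a little memory for not having to de-duplicate at query time. Either choice works with the search step.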


Searching the Index Step Last (I swear)

FTS Search: Searching the Index
    searchBox.onQuery { query ->
        val queryTokens = query
            .lowercase()
            .split(" ")
            .map { stem(it) }
        val searchResults = queryTokens
            .mapNotNull { token -> index[token] }
            .flatten()
            .distinctBy { it.id }
        showResults(searchResults)
    }

FTS Search: Searching the Index
    .lowercase()
    .split(" ")
    .map { stem(it) } // 👈 Tokenise the search query

FTS Search: Searching the Index
    // Find all episodes matching a query token 👇
    .mapNotNull { token -> index[token] }

FTS Search: Searching the Index
    // Combine results for all tokens into a single list, and remove duplicates 👇
    .flatten()
    .distinctBy { it.id }
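Pulled out of the UI callback, the lookup can be sketched as a plain function that is easy to test. This version assumes the index maps tokens to Int document ids, and it ignores blank tokens produced by extra spaces. The names `searchIndex` and `stem` are chosen for this example.

```kotlin
fun searchIndex(
    index: Map<String, Set<Int>>,
    query: String,
    stem: (String) -> String = { it }, // placeholder: identity instead of a real stemmer
): List<Int> =
    query.lowercase()
        .split(" ")
        .filter { it.isNotBlank() }
        .map(stem)
        // Union of matches: a document is returned if it contains ANY query token.
        .flatMap { token -> index[token].orEmpty() }
        .distinct()
```

The important detail is that the query goes through the same processing as the indexed text. If the two pipelines differ, tokens will not line up and matches are silently lost.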

🎉 And you’re done! 🎉

Whoa, that was cool! 🤯

🥳 It was! 🥳

But it’s complicated ☹

Put this in a library, please? 😬

…Okay

Meet Lucilla!
github.com/haroldadmin/lucilla
Lucilla is a fast, efficient and customisable in-memory Full Text Search library for Kotlin
- Memory efficient FTS Index
- Customisable Text Processing Pipeline
- TF-IDF based search result ranking
- Dead simple, concise API

What about ranking search results?

Levenshtein distance, TF-IDF scores, etc.
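As a rough illustration of TF-IDF ranking, here is a minimal sketch using one common variant of the formula. It is not taken from Lucilla, whose implementation may differ. The function name and the Int document ids are assumptions made for this example.

```kotlin
import kotlin.math.ln

// One common TF-IDF variant:
//   tf(t, d) = occurrences of t in d / number of tokens in d
//   idf(t)   = ln(total documents / documents containing t)
// docs maps a document id to its already-processed tokens.
fun rankByTfIdf(docs: Map<Int, List<String>>, queryTokens: List<String>): List<Pair<Int, Double>> {
    val terms = queryTokens.distinct()
    val idf: Map<String, Double> = terms.associateWith { term ->
        val docsWithTerm = docs.values.count { tokens -> term in tokens }
        if (docsWithTerm == 0) 0.0 else ln(docs.size.toDouble() / docsWithTerm)
    }
    return docs
        .filterValues { it.isNotEmpty() }
        .map { (docId, tokens) ->
            val score = terms.sumOf { term ->
                val tf = tokens.count { it == term }.toDouble() / tokens.size
                tf * idf.getValue(term)
            }
            docId to score
        }
        .filter { (_, score) -> score > 0.0 }
        .sortedByDescending { (_, score) -> score }
}
```

Rare words score higher than common ones, so a document that mentions an unusual query word several times floats to the top. With this variant, a word that appears in every document gets an IDF of zero and contributes nothing to the score.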

Why not use SQLite’s built-in FTS?

It’s not suitable for trivial/ephemeral data, and it’s much more complicated

What else can we do with FTS?

Fuzzy Searching, Field boosts, Richer querying capability, Concurrent Text Processing
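Fuzzy searching can be sketched with the Levenshtein (edit) distance: instead of requiring an exact token match, accept indexed tokens within a small number of edits of the query token. The helper `fuzzyMatches` and its default `maxDistance` are assumptions for this example, not part of any library.

```kotlin
// Levenshtein distance via dynamic programming, keeping only two rows in memory.
fun levenshtein(a: String, b: String): Int {
    var previous = IntArray(b.length + 1) { it }
    for (i in 1..a.length) {
        val current = IntArray(b.length + 1)
        current[0] = i
        for (j in 1..b.length) {
            val substitution = previous[j - 1] + if (a[i - 1] == b[j - 1]) 0 else 1
            current[j] = minOf(substitution, previous[j] + 1, current[j - 1] + 1)
        }
        previous = current
    }
    return previous[b.length]
}

// Indexed tokens that are at most maxDistance edits away from the query token.
fun fuzzyMatches(indexedTokens: Set<String>, queryToken: String, maxDistance: Int = 1): List<String> =
    indexedTokens.filter { levenshtein(it, queryToken) <= maxDistance }
```

A typo like "conen" then still matches the indexed token "conan". This sketch compares the query against every indexed token, which is fine for small indexes but would need a smarter structure for large ones.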

Kshitij https://haroldadmin.com @haroldadmin @haroldadmin That’s all, folks Lucilla: github.com/haroldadmin/lucilla
