Slide 1

Slide 1 text

In-Memory Full Text Search with Kotlin
Kshitij Chauhan (@haroldadmin)
Android Worldwide, January 25th, 2022

Slide 2

Slide 2 text

Kshitij @haroldadmin

Slide 3

Slide 3 text

…why though? - You (probably)

Slide 4

Slide 4 text

Structure of the Talk
• Open Monologue
• Conan O’Brien
• Maybe some F1 references

Slide 5

Slide 5 text

Situation
You’re building a Podcast app

Slide 6

Slide 6 text

Situation
Your server sent you a list of FIVE HUNDRED EPISODES of a Podcast

Slide 7

Slide 7 text

Situation
Each episode contains a TITLE and lengthy SHOW NOTES

Slide 8

Slide 8 text

Situation
You want to allow your users to search the episodes QUICKLY and THOROUGHLY

Slide 9

Slide 9 text

What do you do?

Slide 10

Slide 10 text

Naive Search?
• For every query the user enters, you search the entire catalog of episodes
• For every episode, you match the search query against the title and the show notes

Slide 11

Slide 11 text

Naive Search
Defining Episodes

    data class Episode(
        val id: Int,
        val title: String,
        val showNotes: String,
    )

Slide 12

Slide 12 text

Naive Search
Fetching Queries

    val episodes: List<Episode> = dataSource.episodes

    searchBox.onQuery { query ->
        // Process Query
    }

Slide 13

Slide 13 text

Naive Search
Matching Episodes

    searchBox.onQuery { query ->
        val searchResults = episodes.filter { episode ->
            episode.title.contains(query) ||
                episode.showNotes.contains(query)
        }
        showResults(searchResults)
    }

Slide 14

Slide 14 text

Naive Search
Matching Episodes

    searchBox.onQuery { query ->
        ...
                episode.showNotes.contains(query)
        ...
    }

Slide 15

Slide 15 text

Naive Search
Problems
• Slow - You have to search every episode for every query
• Inaccurate - You have limited options to check if a string matches your query
• Limited - No easy way to order search results according to relevance
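To make the "Inaccurate" problem concrete, here is a small sketch of the naive approach. The `Episode` class and the `contains` matching come from the slides; the `naiveSearch` helper and the sample episodes are made up for illustration.

```kotlin
data class Episode(
    val id: Int,
    val title: String,
    val showNotes: String,
)

// Hypothetical helper wrapping the filter shown on the earlier slide.
fun naiveSearch(episodes: List<Episode>, query: String): List<Episode> =
    episodes.filter { episode ->
        episode.title.contains(query) || episode.showNotes.contains(query)
    }

// Made-up sample data, loosely based on the episode titles used in this talk.
val sampleEpisodes = listOf(
    Episode(1, "Skull Soup", "Conan chats with Phil"),
    Episode(2, "Bowen Yang", "Comedian Bowen Yang sits down with Conan"),
)
```

With this data, a query of "conan" finds nothing because `contains` is case-sensitive, "sitting" misses the episode that says "sits", and "hat" wrongly matches episode 1 because it is a substring of "chats".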

Slide 16

Slide 16 text

How Can We Improve This?

Slide 17

Slide 17 text

Enter: ✨ Full Text Search ✨

Slide 18

Slide 18 text

Full Text Search (FTS) lets you look at all the words in every stored document* to match search criteria
*A document is anything with text, like a Podcast Episode’s details

Slide 19

Slide 19 text

It’s faster, more accurate, and more flexible than naive search

Slide 20

Slide 20 text

But how does it work?

Slide 21

Slide 21 text

The Magic Behind FTS: 🪄 Inverted Index 🪄

Slide 22

Slide 22 text

Inverted Index
• Contains information about every word stored in every document
• It’s “inverted” because instead of storing which document contains what text, it stores what text is present in which document

Map<Document, List<String>> ❌
Map<String, List<Document>> ✅
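The inversion can be sketched in a few lines. This is only an illustration of the idea: the document ids, the sample words and the `invert` function are assumptions, not code from the talk.

```kotlin
// Forward mapping: document id -> words it contains (made-up sample data).
val forwardIndex: Map<Int, List<String>> = mapOf(
    1 to listOf("conan", "talks", "iceland"),
    2 to listOf("conan", "chats", "soup"),
)

// Inverted mapping: word -> ids of the documents that contain it.
fun invert(forward: Map<Int, List<String>>): Map<String, Set<Int>> {
    val inverted = mutableMapOf<String, MutableSet<Int>>()
    for ((docId, words) in forward) {
        for (word in words) {
            inverted.getOrPut(word) { mutableSetOf() }.add(docId)
        }
    }
    return inverted
}
```

Looking up a word in the inverted map is a single hash lookup, instead of a scan over every document.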

Slide 23

Slide 23 text

Inverted Index
• You need to build it before you can use it
• Building the index takes time, but it speeds up search queries dramatically
• Clever text processing allows the index to find search results accurately
• Cleverer data structures allow the index to be memory efficient
• Clevererer text analysis allows the index to rank search results effectively
TL;DR: The Inverted Index is 💫 Magic 💫

Slide 24

Slide 24 text

How Do We Build The Index?

Slide 25

Slide 25 text

Step One: Extract Text

Slide 26

Slide 26 text

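Step 1
Extract Text

    val episodes: List<Episode> = …
    val episodeData = mutableMapOf<Episode, String>()
    for (episode in episodes) {
        val text = buildString {
            // appendLine keeps the title and the show notes from running together
            appendLine(episode.title)
            appendLine(episode.showNotes)
        }
        episodeData[episode] = text
    }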

Slide 27

Slide 27 text

Extract Text

    val episodes: List<Episode> = …
    ...
            appendLine(episode.title)     // “Down To The Cockaroo”
            appendLine(episode.showNotes) // “Conan talks to tatt…”
        }

Slide 28

Slide 28 text

Step 1
Extract Text

    listOf(
        "Patton Oswalt & Meredith Salenger Comedian Comedian Patton Oswalt feels...",
        "Dating Your Family In Iceland Conan talks with Bjarki from Reykjavik...",
        "Bowen Yang Comedian Bowen Yang feels dissociative about being Conan...",
        "The Fiddler and The Ski Mall Conan speaks with Jesse at the World...",
        "Zach Galifianakis Returns Comedian Zach Galifianakis feels…sincere...",
        "Skull Soup Conan chats with Phil in Fall River, MA about working…",
        ...
    )

Slide 29

Slide 29 text

Step Two: Process Text

Slide 30

Slide 30 text

Step 2
Process Text
• Users don’t search for sentences, they search for words
• It makes sense to break input text into words, or “tokens”
• Tokens can be indexed by the Inverted Index
• But tokens contain noise, and must be cleaned before indexing

Slide 31

Slide 31 text

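Step 2
Process Text

    val tokensByDoc = mutableMapOf<Episode, List<String>>()
    for ((episode, data) in episodeData) {
        // Split on any whitespace, including the line breaks added in Step 1
        val tokens = data.trim().split(Regex("\\s+"))
        tokensByDoc[episode] = tokens
    }

    "Rashida sits down with Conan to talk"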
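Splitting alone still leaves the "noise" mentioned in Step 2, such as punctuation stuck to words and mixed case. A minimal sketch of a cleaning tokenizer might look like the following. The `tokenize` name and its exact cleaning rules are assumptions for illustration, not the talk's or Lucilla's implementation.

```kotlin
// Hypothetical tokenizer: lowercases the text, splits on whitespace, and
// strips leading/trailing characters that are not letters or digits.
fun tokenize(text: String): List<String> =
    text.lowercase()
        .split(Regex("\\s+"))
        .map { token -> token.trim { !it.isLetterOrDigit() } }
        .filter { it.isNotEmpty() }
```

Under these rules "Conan," and "conan" become the same token, and a stray "&" disappears entirely. Punctuation inside a word (as in "o’brien") is left untouched.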

Slide 32

Slide 32 text

Process Text

    "Rashida sits down with Conan to talk"
    val tokens = data.trim().split(Regex("\\s+"))

Slide 33

Slide 33 text

Process Text

    "Rashida sits down with Conan to talk"
    val tokens = data
        .lowercase()
        .trim()
        .split(Regex("\\s+"))

Lowercasing data to allow queries to be case-insensitive

Slide 34

Slide 34 text

Step Three: Reduce Text

Slide 35

Slide 35 text

Step 3
Reduce Text
• Users don’t always enter exact search queries
• If they search for “sit”, they expect to see results for “sits” and “sitting” too
• “Stemming” is a technique that reduces a word to its root form, or “stem”

“After 25 years at the Late Night desk Conan realized that the only people at his holiday party are the men and women who work for him. Over the years and despite thousands of interviews Conan has never made a real and lasting friendship with any of his celebrity guests. So he started a podcast to do just that.”

Slide 36

Slide 36 text

Step 3
Reduce Text

    fun stem(token: String): String {
        // Complex stemming logic 🤓
        TODO("Plug in a stemmer implementation")
    }

    val stemmedTokensByDoc = tokensByDoc
        .mapValues { (_, tokens) ->
            tokens.map { token -> stem(token) }
        }

Slide 37

Slide 37 text

Reduce Text

    fun stem(token: String): String {
        // Complex stemming logic 🤓
    }

We can use an existing implementation of the Porter Stemming algorithm. And thank the author who wrote it 🙌
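Purely to show what a stemmer does to “sit”, “sits” and “sitting”, here is a toy suffix stripper. It is NOT the Porter algorithm, and `naiveStem` is a hypothetical name; a real app should use an existing Porter implementation as suggested above.

```kotlin
// Toy stemmer for illustration only. This is NOT the Porter algorithm.
// It strips one common suffix and collapses a doubled final consonant.
fun naiveStem(token: String): String {
    for (suffix in listOf("ing", "ed", "s")) {
        // Require at least three remaining characters so short words stay intact.
        if (token.endsWith(suffix) && token.length - suffix.length >= 3) {
            val stem = token.removeSuffix(suffix)
            val endsWithDoubledConsonant = stem.length >= 2 &&
                stem[stem.length - 1] == stem[stem.length - 2] &&
                stem.last() !in "aeiou"
            return if (endsWithDoubledConsonant) stem.dropLast(1) else stem
        }
    }
    return token
}
```

This handles the “sit” family, but it mangles plenty of ordinary words (for example, “this” becomes “thi”), which is exactly why a proper stemming algorithm is worth borrowing.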

Slide 38

Slide 38 text

Step Four: Build the Index

Slide 39

Slide 39 text

Step 4
Build the Index

    val index: MutableMap<String, MutableList<Episode>> = mutableMapOf()
    stemmedTokensByDoc.forEach { (episode, tokens) ->
        for (token in tokens) {
            val episodesWithToken = index[token] ?: mutableListOf()
            episodesWithToken.add(episode)
            index[token] = episodesWithToken
        }
    }
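Note that the loop above adds an episode once per occurrence of a token, so a word repeated in the show notes produces repeated entries. One possible variation, sketched here under the assumption that episodes are identified by their `Int` id, is to store an occurrence count per episode instead. The `buildCountingIndex` name is made up for this example.

```kotlin
// token -> (episode id -> number of times the token occurs in that episode)
fun buildCountingIndex(
    tokensByEpisodeId: Map<Int, List<String>>,
): Map<String, Map<Int, Int>> {
    val index = mutableMapOf<String, MutableMap<Int, Int>>()
    for ((episodeId, tokens) in tokensByEpisodeId) {
        for (token in tokens) {
            val counts = index.getOrPut(token) { mutableMapOf() }
            counts[episodeId] = (counts[episodeId] ?: 0) + 1
        }
    }
    return index
}
```

Each episode then appears at most once per token, and the counts can later feed a ranking scheme that uses term frequency.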

Slide 40

Slide 40 text

Step 4
Build the Index

    val index: MutableMap<String, MutableList<Episode>> = mutableMapOf()
    stemmedTokensByDoc.forEach { (episode, tokens) ->

Slide 42

Slide 42 text

Step Last (I swear): Searching the Index

Slide 43

Slide 43 text

FTS Search
Searching the Index

    searchBox.onQuery { query ->
        val queryTokens = query
            .lowercase()
            .split(" ")
            .map { stem(it) }

        val searchResults = queryTokens
            .mapNotNull { token -> index[token] }
            .flatten()
            .distinctBy { it.id }

        showResults(searchResults)
    }
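Putting the steps together, a compact end-to-end sketch could look like this. It is a simplification under stated assumptions: stemming is left out to keep it short, and the `tokens`, `buildIndex` and `search` functions are hypothetical names rather than the talk's or Lucilla's API.

```kotlin
data class Episode(val id: Int, val title: String, val showNotes: String)

// Simplified text processing: lowercase and split on whitespace (no stemming).
fun tokens(text: String): List<String> =
    text.lowercase().split(Regex("\\s+")).filter { it.isNotEmpty() }

// Build the inverted index: token -> episodes containing it.
fun buildIndex(episodes: List<Episode>): Map<String, List<Episode>> {
    val index = mutableMapOf<String, MutableList<Episode>>()
    for (episode in episodes) {
        for (token in tokens(episode.title + " " + episode.showNotes)) {
            index.getOrPut(token) { mutableListOf() }.add(episode)
        }
    }
    return index
}

// Process the query the same way as the documents, then look up each token.
fun search(index: Map<String, List<Episode>>, query: String): List<Episode> =
    tokens(query)
        .mapNotNull { token -> index[token] }
        .flatten()
        .distinctBy { it.id }
```

The important design point is that the query goes through the same text processing as the indexed documents. Also note that results are the union of matches: an episode is returned if it contains any of the query tokens.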

Slide 44

Slide 44 text

FTS Search
Searching the Index

    searchBox.onQuery { query ->
        val queryTokens = query
            .lowercase()
            .split(" ")
            .map { stem(it) }

👈 Tokenise the search query

Slide 45

Slide 45 text

FTS Search
Searching the Index

Find all episodes matching a query token 👇

    searchBox.onQuery { query ->
        ...
            .mapNotNull { token -> index[token] }

Slide 46

Slide 46 text

FTS Search
Searching the Index

Combine results for all tokens into a single list, and remove duplicates 👇

    searchBox.onQuery { query ->
        ...
            .flatten()
            .distinctBy { it.id }

Slide 47

Slide 47 text

🎉 And you’re done! 🎉

Slide 48

Slide 48 text

Whoa, that was cool! 🤯

Slide 49

Slide 49 text

🥳 It was! 🥳

Slide 50

Slide 50 text

But it’s complicated ☹

Slide 52

Slide 52 text

Put this in a library, please? 😬

Slide 53

Slide 53 text

…Okay

Slide 54

Slide 54 text

Meet Lucilla!
github.com/haroldadmin/lucilla
Lucilla is a fast, efficient and customisable in-memory Full Text Search library for Kotlin
- Memory efficient FTS Index
- Customisable Text Processing Pipeline
- TF-IDF based search result ranking
- Dead simple, concise API

Slide 55

Slide 55 text

What about ranking search results?

Slide 56

Slide 56 text

Levenshtein distance, TF-IDF scores, etc.
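As a rough idea of how TF-IDF scoring works, here is a sketch of one common variant: term frequency (share of the document's tokens that are the term) multiplied by inverse document frequency (natural log of total documents over documents containing the term). The `tfIdf` function and this particular weighting are assumptions; the slides do not say which formula Lucilla uses.

```kotlin
// One common TF-IDF variant; other weightings exist.
// doc is the tokenised document, corpus is the list of all tokenised documents.
fun tfIdf(term: String, doc: List<String>, corpus: List<List<String>>): Double {
    if (doc.isEmpty()) return 0.0
    // Term frequency: how much of this document is made up of the term.
    val tf = doc.count { it == term }.toDouble() / doc.size
    // Document frequency: how many documents contain the term at all.
    val docsWithTerm = corpus.count { term in it }
    if (docsWithTerm == 0) return 0.0
    // Rare terms get a high IDF, terms present everywhere get zero.
    val idf = kotlin.math.ln(corpus.size.toDouble() / docsWithTerm)
    return tf * idf
}
```

In a catalog where almost every episode mentions "conan", that token contributes very little to the score, while a rare token like "soup" pushes its episode towards the top of the results.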

Slide 57

Slide 57 text

Why not use SQLite’s built-in FTS?

Slide 58

Slide 58 text

Not suitable for trivial/ephemeral data; much more complicated

Slide 59

Slide 59 text

What else can we do with FTS?

Slide 60

Slide 60 text

Fuzzy Searching, Field boosts, Richer querying capability, Concurrent Text Processing
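Fuzzy searching is commonly built on an edit-distance measure such as the Levenshtein distance mentioned earlier. The sketch below is the textbook dynamic-programming version plus a hypothetical `fuzzyCandidates` helper; it is not taken from Lucilla, and scanning every index token like this would be too slow for a large index.

```kotlin
// Levenshtein distance: minimum number of single-character insertions,
// deletions and substitutions needed to turn a into b.
fun levenshtein(a: String, b: String): Int {
    // previous[j] = distance between the processed prefix of a and the first j chars of b
    var previous = IntArray(b.length + 1) { it }
    for (i in 1..a.length) {
        val current = IntArray(b.length + 1)
        current[0] = i
        for (j in 1..b.length) {
            val substitutionCost = if (a[i - 1] == b[j - 1]) 0 else 1
            current[j] = minOf(
                previous[j] + 1,                   // delete a character from a
                current[j - 1] + 1,                // insert a character into a
                previous[j - 1] + substitutionCost // substitute (or keep) a character
            )
        }
        previous = current
    }
    return previous[b.length]
}

// Hypothetical helper: index tokens within maxDistance edits of the query token.
fun fuzzyCandidates(
    queryToken: String,
    indexTokens: Collection<String>,
    maxDistance: Int = 1,
): List<String> = indexTokens.filter { levenshtein(queryToken, it) <= maxDistance }
```

A misspelled query token such as "conen" can then be expanded to the nearby index token "conan" before the usual index lookup.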

Slide 61

Slide 61 text

That’s all, folks
Kshitij
https://haroldadmin.com
@haroldadmin
Lucilla: github.com/haroldadmin/lucilla