
Advanced Search

KMKLabs
July 31, 2015


Transcript

  1. Previous Version
     ✤ Storing
       ✤ Stored “as is”
     ✤ Querying
       ✤ Exact word matching
       ✤ Cut off irrelevant results (using PostgreSQL ts_rank)
       ✤ Sorted by publish date
  2. Problem 1 - Dirty Query
     ✤ Solution:
       ✤ Stripping tags and special characters
       ✤ Removing stopwords
       ✤ Sastrawi (github.com/sastrawi/sastrawi)
     ✤ Example: “query yang baik!” -> “query baik”
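The cleaning steps above can be sketched as follows. This is a minimal Python illustration, not the production code: the deck uses the PHP Sastrawi library, and the tiny stopword set here is only a stand-in for its dictionary.

```python
import re

# Illustrative stand-in for Sastrawi's Indonesian stopword dictionary.
STOPWORDS = {"yang", "dan", "di", "ke", "dari"}

def clean_query(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)    # strip tags
    text = re.sub(r"[^\w\s]", " ", text)   # strip special characters
    words = text.lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)

print(clean_query("query yang baik!"))  # -> "query baik"
```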
  3. Problem 3 - Hostname
     ✤ bukalapak.com & rumah.com?
     ✤ Unique words vs common words
     ✤ Solution:
       ✤ Regex for word classification
       ✤ Use Sastrawi dictionary to identify common words
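One way to sketch this classification step in Python: tokenize the hostname with a regex, then label each token as common or unique by a dictionary lookup. The word set below is a made-up stand-in for the Sastrawi dictionary, and real word segmentation of fused names like “bukalapak” is out of scope here.

```python
import re

# Stand-in for the Sastrawi dictionary of common Indonesian words.
COMMON_WORDS = {"rumah", "toko", "buka", "lapak"}

def classify_hostname(hostname: str) -> dict:
    """Split a hostname into word-like tokens and label each one."""
    name = hostname.split(".")[0]          # drop the TLD
    tokens = re.findall(r"[a-z]+", name.lower())
    return {t: ("common" if t in COMMON_WORDS else "unique") for t in tokens}

print(classify_hostname("rumah.com"))      # rumah is a common word
print(classify_hostname("bukalapak.com"))  # bukalapak is treated as unique
```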
  4. What’s Next?
     ✤ Word prediction / auto-complete
     ✤ Semantic awareness
       ✤ “rumah sakit”, “kamar mandi”, “kereta api”
     ✤ Auto-correct
  5. Facts
     ✤ Most searches are for brands and names
     ✤ Lots and lots of typos
       ✤ park chanyeol vs park chan-yeol
       ✤ agnes monica vs agnez monica
       ✤ angeline vs engeline vs anjelin vs enjelin
       ✤ ferrari vs ferarri
  6. N-gram Algorithm
     ✤ Language model to predict the next item (character/word) in a sequence
     ✤ Can be used for auto-complete (character n-gram) and semantic awareness (word n-gram)
  7. Character N-gram (Auto-complete)
     ✤ Oversimplified example (N=3):
       ✤ Kebakaran -> keb, eba, bak, aka, kar, ara, ran
       ✤ Kebanjiran -> keb, eba, ban, anj, nji, jir, ira, ran
       ✤ Kebangkitan -> keb, eba, ban, ang, ngk, gki, kit, ita, tan
       ✤ User types “Keba” -> Kebakaran / Kebanjiran / Kebangkitan
       ✤ User types “Keban” -> Kebanjiran / Kebangkitan
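The matching above can be sketched by counting shared trigrams between the typed prefix and each candidate word. This is a simplified Python illustration of the idea, not the deck's actual implementation:

```python
def char_ngrams(word: str, n: int = 3) -> list:
    """All overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def suggest(prefix: str, vocabulary: list, n: int = 3) -> list:
    """Rank vocabulary words by how many n-grams they share with the prefix."""
    target = set(char_ngrams(prefix.lower(), n))
    scored = [(len(target & set(char_ngrams(w, n))), w) for w in vocabulary]
    return [w for score, w in sorted(scored, reverse=True) if score > 0]

words = ["kebakaran", "kebanjiran", "kebangkitan"]
print(char_ngrams("kebakaran"))  # keb, eba, bak, aka, kar, ara, ran
print(suggest("keban", words))   # kebanjiran/kebangkitan outrank kebakaran
```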
  8. Character N-gram (Auto-complete)
     ✤ Determining the number N
       ✤ Smaller N = slower processing
         ✤ Kebakaran (N=2) -> ke, eb, ba, ak, ka, ar, ra, an
       ✤ Larger N = needs longer samples
         ✤ Kebakaran (N=4) -> keba, ebak, baka, akar, kara, aran
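The trade-off is easy to see by generating the grams for different N (a small Python illustration of the slide's example):

```python
def char_ngrams(word: str, n: int) -> list:
    """All overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Smaller n -> more, less selective grams to match against;
# larger n -> fewer grams, so longer samples are needed.
print(char_ngrams("kebakaran", 2))  # ke, eb, ba, ak, ka, ar, ra, an
print(char_ngrams("kebakaran", 4))  # keba, ebak, baka, akar, kara, aran
```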
  9. Pros & Cons of Auto-complete
     ✤ Pros:
       ✤ Can significantly reduce typos in queries
     ✤ Cons:
       ✤ Too many queries to the db
       ✤ Huge amount of vectors
       ✤ Auto-correct might be a better option
  10. Word N-gram (Semantic Awareness)
      ✤ Oversimplified example (N=2):
        ✤ Article 1: “Memasak nasi goreng ayam” -> memasak nasi, nasi goreng, goreng ayam
        ✤ Article 2: “Memasak nasi goreng sapi” -> memasak nasi, nasi goreng, goreng sapi
        ✤ Article 3: “Ayam goreng dan nasi bakar” -> ayam goreng, goreng dan, dan nasi, nasi bakar
        ✤ User searches “nasi goreng ayam” -> Rank: Article 1, Article 2, Article 3
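The ranking above can be sketched as counting word bigrams shared between the query and each article. A minimal Python illustration (the article titles and the overlap-count scoring are simplifications of the slide's example):

```python
def word_ngrams(text: str, n: int = 2) -> set:
    """All overlapping word n-grams of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def rank_articles(query: str, articles: dict) -> list:
    """Order article titles by shared word bigrams with the query."""
    q = word_ngrams(query)
    return sorted(articles,
                  key=lambda t: len(q & word_ngrams(articles[t])),
                  reverse=True)

articles = {
    "Article 1": "Memasak nasi goreng ayam",
    "Article 2": "Memasak nasi goreng sapi",
    "Article 3": "Ayam goreng dan nasi bakar",
}
print(rank_articles("nasi goreng ayam", articles))
# -> ['Article 1', 'Article 2', 'Article 3']
```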
  11. Pros & Cons of Semantic Awareness
      ✤ Pros:
        ✤ Doesn’t require a special index/vector; can be done on-the-fly
        ✤ More relevant results
      ✤ Cons:
        ✤ Publish date ordering is more suitable for a news website (Maybe? Needs research)
  12. Levenshtein Distance
      ✤ Minimum number of single-character edits required to change one word into another
      ✤ Widely used for auto-correct
      ✤ Example: distance between “angeline” and “enjelin” is 3
        ✤ Change ‘a’ to ‘e’ -> engeline
        ✤ Change ‘g’ to ‘j’ -> enjeline
        ✤ Remove ‘e’ -> enjelin
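The deck later points to PHP's built-in `levenshtein()`; as a language-neutral illustration, here is the classic dynamic-programming version of the same distance in Python:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("angeline", "enjelin"))  # -> 3, matching the slide
```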
  13. Auto-correct Implementation
      ✤ Have a dictionary consisting of all available vectors
        ✤ Store frequently searched words only
      ✤ Check every word in the user query
        ✤ If a word doesn’t exist in the dictionary, correct it!
        ✤ Find the word in the dictionary with the smallest distance
      ✤ Return search results based on the corrected query
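The steps above can be sketched end to end in Python. The dictionary contents are made up for illustration; a real system would populate it from frequently searched words, as the slide says:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Illustrative dictionary of frequently searched words.
DICTIONARY = {"agnez", "monica", "ferrari", "park", "chanyeol"}

def autocorrect(query: str) -> str:
    corrected = []
    for word in query.lower().split():
        if word in DICTIONARY:
            corrected.append(word)  # known word: keep as-is
        else:                       # unknown word: nearest dictionary entry
            corrected.append(min(DICTIONARY, key=lambda w: levenshtein(word, w)))
    return " ".join(corrected)

print(autocorrect("agnes monica"))  # -> "agnez monica"
print(autocorrect("ferarri"))       # -> "ferrari"
```

The distance computation is the expensive part, which is why the deck recommends applying it selectively, only to words missing from the dictionary.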
  14. Pros & Cons of Auto-correct
      ✤ Pros:
        ✤ Built-in PHP function: int levenshtein ( string $str1, string $str2 )
        ✤ Can be done using a cache/dictionary instead of a db vector
      ✤ Cons:
        ✤ Expensive; needs to be done selectively