
Advanced Search

KMKLabs
July 31, 2015


Transcript

  1. Previous Version
     ✤ Storing
       ✤ Stored “as is”
     ✤ Querying
       ✤ Exact word matching
       ✤ Cut off irrelevant results (using PostgreSQL ts_rank)
       ✤ Sorted by publish date
  2. Problem 1 - Dirty Query
     ✤ Solution:
       ✤ Strip tags and special characters
       ✤ Remove stopwords using Sastrawi (github.com/sastrawi/sastrawi)
     ✤ Example: “query yang baik!” -> “query baik”
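The cleaning step above can be sketched in Python. This is illustrative only: the deck uses the Sastrawi PHP library, and the `STOPWORDS` set here is a tiny stand-in for Sastrawi's full Indonesian stopword list.

```python
import re

# Tiny stand-in stopword list; the deck uses Sastrawi's full
# Indonesian stopword list (github.com/sastrawi/sastrawi).
STOPWORDS = {"yang", "dan", "di", "ke"}

def clean_query(query):
    # Strip HTML tags, then drop anything that is not a letter,
    # digit, or whitespace.
    text = re.sub(r"<[^>]+>", " ", query)
    text = re.sub(r"[^\w\s]", " ", text)
    # Remove stopwords.
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words)

print(clean_query("query yang baik!"))  # -> "query baik"
```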
  3. Problem 3 - Hostname
     ✤ bukalapak.com & rumah.com?
     ✤ Unique words vs common words
     ✤ Solution:
       ✤ Regex for word classification
       ✤ Use the Sastrawi dictionary to identify common words
  4. What’s Next?
     ✤ Word prediction / auto-complete
     ✤ Semantic awareness
       ✤ “rumah sakit”, “kamar mandi”, “kereta api”
     ✤ Auto-correct
  5. Facts
     ✤ Most searches are brands and names
     ✤ Lots and lots of typos
       ✤ park chanyeol vs park chan-yeol
       ✤ agnes monica vs agnez monica
       ✤ angeline vs engeline vs anjelin vs enjelin
       ✤ ferrari vs ferarri
  6. N-gram Algorithm
     ✤ A language model that predicts the next item (character/word) in a sequence
     ✤ Can be used for auto-complete (character n-gram) and semantic awareness (word n-gram)
  7. Character N-gram (Auto-complete)
     ✤ Oversimplified example (N=3):
       ✤ Kebakaran -> keb, eba, bak, aka, kar, ara, ran
       ✤ Kebanjiran -> keb, eba, ban, anj, nji, jir, ira, ran
       ✤ Kebangkitan -> keb, eba, ban, ang, ngk, gki, kit, ita, tan
     ✤ User types “Keba” -> Kebakaran / Kebanjiran / Kebangkitan
     ✤ User types “Keban” -> Kebanjiran / Kebangkitan
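The trigram matching above can be sketched as follows. This is a self-contained illustration, not the deck's implementation: `VOCAB`, the in-memory index, and the `suggest` helper are assumptions for the example.

```python
def char_ngrams(word, n=3):
    """All overlapping character n-grams of a word."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Toy vocabulary from the slide (illustrative assumption).
VOCAB = ["Kebakaran", "Kebanjiran", "Kebangkitan"]

# Index: trigram -> set of words containing it.
index = {}
for w in VOCAB:
    for g in char_ngrams(w):
        index.setdefault(g, set()).add(w)

def suggest(prefix):
    """Words containing every trigram of the typed prefix."""
    candidates = set(VOCAB)
    for g in char_ngrams(prefix):
        candidates &= index.get(g, set())
    return sorted(candidates)

print(suggest("Keba"))   # all three words share "keb" and "eba"
print(suggest("Keban"))  # only two words contain "ban"
```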
  8. Character N-gram (Auto-complete)
     ✤ Determining the value of N
       ✤ Smaller N = slower processing
         ✤ Kebakaran (N=2) -> ke, eb, ba, ak, ka, ar, ra, an
       ✤ Larger N = needs longer samples
         ✤ Kebakaran (N=4) -> keba, ebak, baka, akar, kara, aran
  9. Pros & Cons of Auto-complete
     ✤ Pros:
       ✤ Can significantly reduce typos in queries
     ✤ Cons:
       ✤ Too many queries to the db
       ✤ Huge number of vectors
       ✤ Auto-correct might be a better option
  10. Word N-gram (Semantic Awareness)
     ✤ Oversimplified example (N=2):
       ✤ Article 1: “Memasak nasi goreng ayam”
         ✤ memasak nasi, nasi goreng, goreng ayam
       ✤ Article 2: “Memasak nasi goreng sapi”
         ✤ memasak nasi, nasi goreng, goreng sapi
       ✤ Article 3: “Ayam goreng dan nasi bakar”
         ✤ ayam goreng, goreng dan, dan nasi, nasi bakar
     ✤ User searches “nasi goreng ayam” -> Rank: Article 1, Article 2, Article 3
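The bigram-overlap ranking above can be sketched like this. The `articles` dict and the scoring function (count of shared bigrams) are illustrative assumptions matching the slide's example, not the deck's production code.

```python
def word_ngrams(text, n=2):
    """Set of overlapping word n-grams of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Toy corpus from the slide (illustrative assumption).
articles = {
    "Article 1": "Memasak nasi goreng ayam",
    "Article 2": "Memasak nasi goreng sapi",
    "Article 3": "Ayam goreng dan nasi bakar",
}

def rank(query):
    """Rank articles by the number of bigrams shared with the query."""
    q = word_ngrams(query)
    scores = {name: len(q & word_ngrams(text))
              for name, text in articles.items()}
    return sorted(scores, key=lambda name: -scores[name])

print(rank("nasi goreng ayam"))  # -> ['Article 1', 'Article 2', 'Article 3']
```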
  11. Pros & Cons of Semantic Awareness
     ✤ Pros:
       ✤ Doesn’t require a special index/vector; can be done on-the-fly
       ✤ More relevant results
     ✤ Cons:
       ✤ Publish date ordering is more suitable for a news website (maybe? needs research)
  12. Levenshtein Distance
     ✤ The number of changes required to transform one word into another
     ✤ Widely used for auto-correct
     ✤ Example: the distance between “angeline” and “enjelin” is 3
       ✤ Change ‘a’ to ‘e’ -> engeline
       ✤ Change ‘g’ to ‘j’ -> enjeline
       ✤ Remove the trailing ‘e’ -> enjelin
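The slide's example can be checked with the classic dynamic-programming algorithm. PHP ships this as a built-in (`levenshtein()`, noted on the last slide); the sketch below is a Python equivalent for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("angeline", "enjelin"))  # -> 3
```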
  13. Auto-correct Implementation
     ✤ Have a dictionary consisting of all available vectors
       ✤ Store frequently searched words only
     ✤ Check every word in the user query
     ✤ If a word doesn’t exist in the dictionary, correct it!
       ✤ Find the dictionary word with the smallest distance
     ✤ Return search results based on the corrected query
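The steps above can be sketched end-to-end. The `DICTIONARY` contents and the nearest-word policy are illustrative assumptions; a real deployment would use the frequently-searched-words dictionary the slide describes, and would likely cap the acceptable distance.

```python
from functools import lru_cache

# Hypothetical dictionary of frequently searched words (assumption).
DICTIONARY = {"park", "chanyeol", "agnez", "monica", "ferrari"}

def levenshtein(a, b):
    """Edit distance via memoized recursion (compact sketch)."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        return min(d(i - 1, j) + 1,                        # deletion
                   d(i, j - 1) + 1,                        # insertion
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))  # substitution
    return d(len(a), len(b))

def autocorrect(query):
    """Replace each out-of-dictionary word with its nearest dictionary word."""
    corrected = []
    for word in query.lower().split():
        if word in DICTIONARY:
            corrected.append(word)
        else:
            corrected.append(min(DICTIONARY,
                                 key=lambda w: levenshtein(word, w)))
    return " ".join(corrected)

print(autocorrect("park chanyeol ferarri"))  # -> "park chanyeol ferrari"
```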
  14. Pros & Cons of Auto-correct
     ✤ Pros:
       ✤ Built-in PHP function: int levenshtein ( string $str1, string $str2 )
       ✤ Can be done using a cache/dictionary instead of a db vector
     ✤ Cons:
       ✤ Expensive; needs to be done selectively