
Advanced Search

KMKLabs
July 31, 2015


Transcript

  1. Previous Version
     ✤ Storing
       ✤ Stored “as is”
     ✤ Querying
       ✤ Exact word matching
       ✤ Cut off irrelevant results (using PostgreSQL ts_rank)
       ✤ Sorted by publish date
  2. Problem 1 - Dirty Query
     ✤ Solution:
       ✤ Stripping tags and special characters
       ✤ Removing stopwords
       ✤ Sastrawi (github.com/sastrawi/sastrawi)
     ✤ Example: “query yang baik!” -> “query baik”
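The cleaning steps above can be sketched as follows. This is a minimal Python illustration, not the production code: the deck uses the PHP Sastrawi library, and the tiny stopword set here is only a stand-in for its dictionary.

```python
import re

# Illustrative stand-in for Sastrawi's Indonesian stopword dictionary.
STOPWORDS = {"yang", "dan", "di", "ke", "dari"}

def clean_query(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)    # strip tags
    text = re.sub(r"[^\w\s]", " ", text)   # strip special characters
    words = text.lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)

print(clean_query("query yang baik!"))  # -> "query baik"
```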
  3. Problem 3 - Hostname
     ✤ bukalapak.com & rumah.com?
     ✤ Unique words vs common words
     ✤ Solution:
       ✤ Regex for word classification
       ✤ Use Sastrawi dictionary to identify common words
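One way to sketch this classification step in Python: tokenize the hostname with a regex, then label each token as common or unique by a dictionary lookup. The word set below is a made-up stand-in for the Sastrawi dictionary, and real word segmentation of fused names like “bukalapak” is out of scope here.

```python
import re

# Stand-in for the Sastrawi dictionary of common Indonesian words.
COMMON_WORDS = {"rumah", "toko", "buka", "lapak"}

def classify_hostname(hostname: str) -> dict:
    """Split a hostname into word-like tokens and label each one."""
    name = hostname.split(".")[0]          # drop the TLD
    tokens = re.findall(r"[a-z]+", name.lower())
    return {t: ("common" if t in COMMON_WORDS else "unique") for t in tokens}

print(classify_hostname("rumah.com"))      # rumah is a common word
print(classify_hostname("bukalapak.com"))  # bukalapak is treated as unique
```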
  4. What’s Next?
     ✤ Word prediction / auto-complete
     ✤ Semantic awareness
       ✤ “rumah sakit”, “kamar mandi”, “kereta api”
     ✤ Auto-correct
  5. Facts
     ✤ Most searches are for brands and names
     ✤ Lots and lots of typos
       ✤ park chanyeol vs park chan-yeol
       ✤ agnes monica vs agnez monica
       ✤ angeline vs engeline vs anjelin vs enjelin
       ✤ ferrari vs ferarri
  6. N-gram Algorithm
     ✤ Language model to predict the next item (character/word) in a sequence
     ✤ Can be used for auto-complete (character n-gram) and semantic awareness (word n-gram)
  7. Character N-gram (Auto-complete)
     ✤ Oversimplified example (N=3):
       ✤ Kebakaran -> keb, eba, bak, aka, kar, ara, ran
       ✤ Kebanjiran -> keb, eba, ban, anj, nji, jir, ira, ran
       ✤ Kebangkitan -> keb, eba, ban, ang, ngk, gki, kit, ita, tan
       ✤ User types “Keba” -> Kebakaran / Kebanjiran / Kebangkitan
       ✤ User types “Keban” -> Kebanjiran / Kebangkitan
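The matching above can be sketched by counting shared trigrams between the typed prefix and each candidate word. This is a simplified Python illustration of the idea, not the deck's actual implementation:

```python
def char_ngrams(word: str, n: int = 3) -> list:
    """All overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def suggest(prefix: str, vocabulary: list, n: int = 3) -> list:
    """Rank vocabulary words by how many n-grams they share with the prefix."""
    target = set(char_ngrams(prefix.lower(), n))
    scored = [(len(target & set(char_ngrams(w, n))), w) for w in vocabulary]
    return [w for score, w in sorted(scored, reverse=True) if score > 0]

words = ["kebakaran", "kebanjiran", "kebangkitan"]
print(char_ngrams("kebakaran"))  # keb, eba, bak, aka, kar, ara, ran
print(suggest("keban", words))   # kebanjiran/kebangkitan outrank kebakaran
```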
  8. Character N-gram (Auto-complete)
     ✤ Determining the number N
       ✤ Smaller N = slower processing
         ✤ Kebakaran (N=2) -> ke, eb, ba, ak, ka, ar, ra, an
       ✤ Larger N = needs longer samples
         ✤ Kebakaran (N=4) -> keba, ebak, baka, akar, kara, aran
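The trade-off is easy to see by generating the grams for different N (a small Python illustration of the slide's example):

```python
def char_ngrams(word: str, n: int) -> list:
    """All overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Smaller n -> more, less selective grams to match against;
# larger n -> fewer grams, so longer samples are needed.
print(char_ngrams("kebakaran", 2))  # ke, eb, ba, ak, ka, ar, ra, an
print(char_ngrams("kebakaran", 4))  # keba, ebak, baka, akar, kara, aran
```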
  9. Pros & Cons of Auto-complete
     ✤ Pros:
       ✤ Can significantly reduce typos in queries
     ✤ Cons:
       ✤ Too many queries to the db
       ✤ Huge amount of vectors
       ✤ Auto-correct might be a better option
  10. Word N-gram (Semantic Awareness)
      ✤ Oversimplified example (N=2):
        ✤ Article 1: “Memasak nasi goreng ayam” -> memasak nasi, nasi goreng, goreng ayam
        ✤ Article 2: “Memasak nasi goreng sapi” -> memasak nasi, nasi goreng, goreng sapi
        ✤ Article 3: “Ayam goreng dan nasi bakar” -> ayam goreng, goreng dan, dan nasi, nasi bakar
        ✤ User searches “nasi goreng ayam” -> Rank: Article 1, Article 2, Article 3
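The ranking above can be sketched as counting word bigrams shared between the query and each article. A minimal Python illustration (the article titles and the overlap-count scoring are simplifications of the slide's example):

```python
def word_ngrams(text: str, n: int = 2) -> set:
    """All overlapping word n-grams of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def rank_articles(query: str, articles: dict) -> list:
    """Order article titles by shared word bigrams with the query."""
    q = word_ngrams(query)
    return sorted(articles,
                  key=lambda t: len(q & word_ngrams(articles[t])),
                  reverse=True)

articles = {
    "Article 1": "Memasak nasi goreng ayam",
    "Article 2": "Memasak nasi goreng sapi",
    "Article 3": "Ayam goreng dan nasi bakar",
}
print(rank_articles("nasi goreng ayam", articles))
# -> ['Article 1', 'Article 2', 'Article 3']
```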
  11. Pros & Cons of Semantic Awareness
      ✤ Pros:
        ✤ Doesn’t require a special index/vector; can be done on-the-fly
        ✤ More relevant results
      ✤ Cons:
        ✤ Publish date ordering is more suitable for a news website (Maybe? Needs research)
  12. Levenshtein Distance
      ✤ Minimum number of single-character edits required to change one word into another
      ✤ Widely used for auto-correct
      ✤ Example: distance between “angeline” and “enjelin” is 3
        ✤ Change ‘a’ to ‘e’ -> engeline
        ✤ Change ‘g’ to ‘j’ -> enjeline
        ✤ Remove ‘e’ -> enjelin
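The deck later points to PHP's built-in `levenshtein()`; as a language-neutral illustration, here is the classic dynamic-programming version of the same distance in Python:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("angeline", "enjelin"))  # -> 3, matching the slide
```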
  13. Auto-correct Implementation
      ✤ Have a dictionary consisting of all available vectors
        ✤ Store frequently searched words only
      ✤ Check every word in the user query
        ✤ If a word doesn’t exist in the dictionary, correct it!
        ✤ Find the word in the dictionary with the smallest distance
      ✤ Return search results based on the corrected query
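The steps above can be sketched end to end in Python. The dictionary contents are made up for illustration; a real system would populate it from frequently searched words, as the slide says:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Illustrative dictionary of frequently searched words.
DICTIONARY = {"agnez", "monica", "ferrari", "park", "chanyeol"}

def autocorrect(query: str) -> str:
    corrected = []
    for word in query.lower().split():
        if word in DICTIONARY:
            corrected.append(word)  # known word: keep as-is
        else:                       # unknown word: nearest dictionary entry
            corrected.append(min(DICTIONARY, key=lambda w: levenshtein(word, w)))
    return " ".join(corrected)

print(autocorrect("agnes monica"))  # -> "agnez monica"
print(autocorrect("ferarri"))       # -> "ferrari"
```

The distance computation is the expensive part, which is why the deck recommends applying it selectively, only to words missing from the dictionary.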
  14. Pros & Cons of Auto-correct
      ✤ Pros:
        ✤ Built-in PHP function: int levenshtein ( string $str1, string $str2 )
        ✤ Can be done using a cache/dictionary instead of a db vector
      ✤ Cons:
        ✤ Expensive; needs to be done selectively