PWLB Geocoding Algorithm Based on Comparative Measure

Shravanthi
November 07, 2015

Presentation slides for a Papers We Love Bangalore talk.
Link to paper: http://thesis.eur.nl/pub/14891/Ranzijn.pdf

Transcript

  1. Sections
    1. About Geocoding
    2. Some of the similarity measures
    3. Comparing different measures
    4. Proposed Approach
  2. Geocoding
    Geocoding is most commonly considered to be the process of converting a locational description, such as a street address, into some form of geographic representation, such as geographic coordinates (latitude and longitude).
  3. Common Problems
    • Misspellings
    • Alternative spellings
    • Abbreviations
    • Different order of words
    • Addition/omission of words
  4. Similarity Measure Algorithms
    • Edit Distance/Levenshtein: The Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
    Example: the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
    1. kitten → sitten (substitution of "s" for "k")
    2. sitten → sittin (substitution of "i" for "e")
    3. sittin → sitting (insertion of "g" at the end)
    ❖ It is always at least the difference of the sizes of the two strings.
    ❖ It is at most the length of the longer string.
    ❖ It is zero if and only if the strings are equal.
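The definition above can be sketched with the usual dynamic-programming recurrence. This is a minimal illustration, not code from the paper or the talk; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # iterate over the longer string, keep rows short

    # previous[j] = distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

The two nested loops are what give the O(|s1|*|s2|) running time; only two rows of the table are kept in memory at a time.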
  5. Contd..
    • Jaro-Winkler: The Jaro-Winkler distance is a measure of similarity between two strings. It is best suited for short strings such as person names. A score of 0 equates to no similarity and 1 is an exact match.
    Example: given string s1 = MARTHA and string s2 = MARHTA:
    m (number of matching characters) = 6
    t (number of transpositions required) = 1
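To show how m and t turn into a score, here is a minimal sketch of the commonly published Jaro and Jaro-Winkler formulas. The slide does not give the formula, so the prefix scale of 0.1 and the maximum prefix length of 4 are assumptions (Winkler's usual defaults), and the function names are chosen for this example.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: (m/|s1| + m/|s2| + (m - t)/m) / 3."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # t = half the number of matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings sharing a common prefix (assumed defaults)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scale * (1 - score)


print(round(jaro("MARTHA", "MARHTA"), 4), round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

For MARTHA and MARHTA this gives m = 6 and t = 1, so the Jaro score is (1 + 1 + 5/6) / 3 ≈ 0.9444, and the shared prefix "MAR" raises the Jaro-Winkler score to about 0.9611.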
  6. Contd..
    • Q-Grams: The q-gram distance measure counts the q-grams (substrings of length q) that differ between two strings. The more q-grams two strings have in common, the more closely related they are.
    Variants:
    1. Jaccard Similarity
    2. Dice Coefficient
    3. Overlap Coefficient
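The three variants differ only in how the number of shared q-grams is normalised. The sketch below uses the standard set-based definitions; treating q-grams as a set (ignoring repeats) and not padding the string ends are simplifying assumptions made here, not details taken from the slides.

```python
def qgrams(text: str, q: int = 3) -> set:
    """Set of all substrings of length q (empty if the text is shorter than q)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """2 * |A ∩ B| / (|A| + |B|)"""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|)"""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Bigrams (q = 2) of two short words that share only "ht".
x, y = qgrams("night", 2), qgrams("nacht", 2)
print(jaccard(x, y), dice(x, y), overlap(x, y))
```

With "night" and "nacht" each word has four bigrams and one is shared, so Jaccard is 1/7, Dice is 2/8 and Overlap is 1/4.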
  7. Contd..
    A few more:
    1. Term frequency - inverse document frequency (using cosine similarity)
    2. Recursive matching scheme
    3. Phonetic
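As a rough illustration of the first item, the sketch below builds word-level TF-IDF vectors and compares them with cosine similarity. The whitespace tokenisation, the unsmoothed idf = log(N / df) weighting and the sample addresses are all assumptions made for this example; the slides do not specify them.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {token: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            token: (count / len(tokens)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical address strings, used only to exercise the functions.
docs = ["main street north", "north main street", "park avenue"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because the vectors ignore word order, the first two strings score as identical, which is why this family of measures is often considered for the "different order of words" problem.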
  8. Summary (Similarity Measure: Matching Quality; Time Complexity)
    • Levenshtein distance: suitable for name matching and error correction; O(|s1|*|s2|)
    • Jaro-Winkler: suitable for short names; a change in character position reduces matches; O(|s1|+|s2|)
    • Q-Grams: does not work well when typographical errors exist; O(|s1|+|s2|)
  9. Proposed Approach
    1. Preprocess the data depending on inputs
    2. Perform zip code exact matching
    3. Apply the trigram Dice similarity measure
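The three steps can be sketched end to end as follows. This is one possible reading of the slide, not the paper's implementation: the preprocessing rules (lower-casing, stripping punctuation), the `(street, zip_code)` record layout, the 0.5 acceptance threshold and the sample data are all hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Step 1 (assumed rules): lower-case, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())


def trigrams(text: str) -> set:
    """Set of all substrings of length 3."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def dice(a: set, b: set) -> float:
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def best_match(query_street, query_zip, candidates, threshold=0.5):
    """Return (record, score) for the best candidate, or (None, score) if too weak."""
    # Step 2: keep only records whose zip code matches exactly.
    same_zip = [record for record in candidates if record[1] == query_zip]
    # Step 3: rank the remaining records by trigram Dice similarity.
    query_grams = trigrams(normalize(query_street))
    best, best_score = None, 0.0
    for street, zip_code in same_zip:
        score = dice(query_grams, trigrams(normalize(street)))
        if score > best_score:
            best, best_score = (street, zip_code), score
    if best_score < threshold:
        return None, best_score
    return best, best_score


# Hypothetical reference records: (street, zip_code)
records = [
    ("Main Street", "560001"),
    ("Maine Avenue", "560001"),
    ("Main Street", "560002"),
]
print(best_match("Main Stret", "560001", records))
```

Filtering on the zip code first keeps the candidate set small, so the more expensive similarity computation only runs on records that could plausibly match.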