PWLB Geocoding Algorithm Based on Comparative Measure

A Geocoding Algorithm Based on Comparative Study of Address Matching
Techniques

Sections
1. About Geocoding
2. Some of the similarity measures
3. Comparing different measures
4. Proposed Approach

Geocoding
Geocoding is most commonly considered to be the process of converting a locational description, such as a street address, into some form of geographic representation, such as geographic coordinates (latitude and longitude).

Common Problems
• Misspellings
• Alternative spellings
• Abbreviations
• Different order of words
• Addition/omission of words

Similarity Measure Algorithms
• Edit Distance/Levenshtein: The Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
Example: The Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
1. kitten → sitten (substitution of "s" for "k")
2. sitten → sittin (substitution of "i" for "e")
3. sittin → sitting (insertion of "g" at the end)
Properties:
❖ It is always at least the difference of the sizes of the two strings.
❖ It is at most the length of the longer string.
❖ It is zero if and only if the strings are equal.
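The definition above can be sketched with the standard dynamic-programming table, which is where the O(|s1|*|s2|) cost comes from. This is a minimal illustration, not the implementation used in the deck; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string in the inner loop
    # previous[j] = distance between the empty prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance between s1[:i] and the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 by c2 (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory stays proportional to the shorter string.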

Contd..
• Jaro Winkler: Jaro–Winkler distance is a measure of similarity between two strings. It is best suited for short strings such as person names. A score of 0 equates to no similarity and 1 is an exact match.
Example: Given string s1 = MARTHA and string s2 = MARHTA:
m (number of matching characters) = 6
t (number of transpositions required) = 1
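To show how m and t turn into a score, here is a minimal sketch of the commonly used Jaro and Jaro–Winkler formulas. The scaling factor p = 0.1 and the prefix cap of 4 are the usual defaults and are assumptions here, since the deck does not state them; the function names are illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 0.0 means no similarity, 1.0 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that appear in a different order.
    a = [c for c, flag in zip(s1, matched1) if flag]
    b = [c for c, flag in zip(s2, matched2) if flag]
    t = sum(x != y for x, y in zip(a, b)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings that share a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1, s2):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

With m = 6 and t = 1 the Jaro score is (6/6 + 6/6 + 5/6) / 3 = 17/18, and the shared prefix "MAR" lifts it to about 0.961.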

Contd..
• Q-Grams: A q-gram is a substring of length q. The q-gram distance sums, over all q-grams, the difference between the number of times each q-gram occurs in the two strings. The more q-grams two strings have in common, the closer they are.
Variants:
1. Jaccard Similarity
2. Dice Coefficient
3. Overlap Coefficient
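The q-gram distance and the three variants can be sketched as follows. This is an illustrative sketch under two assumptions not stated in the deck: the strings are not padded, and the three coefficients are computed on sets of distinct q-grams (multiset versions also exist). All function names are chosen for the example.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> Counter:
    """Count every overlapping substring of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(s1: str, s2: str, q: int = 2) -> int:
    """Sum of the differences in occurrence counts over all q-grams."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    return sum(abs(a[g] - b[g]) for g in set(a) | set(b))


def qgram_similarities(s1: str, s2: str, q: int = 2) -> dict:
    """Jaccard, Dice and overlap coefficients on the sets of distinct q-grams."""
    a, b = set(qgrams(s1, q)), set(qgrams(s2, q))
    if not a or not b:
        return {"jaccard": 0.0, "dice": 0.0, "overlap": 0.0}
    common = len(a & b)
    return {
        "jaccard": common / len(a | b),
        "dice": 2 * common / (len(a) + len(b)),
        "overlap": common / min(len(a), len(b)),
    }


# "night" -> ni, ig, gh, ht and "nacht" -> na, ac, ch, ht share only "ht".
print(qgram_distance("night", "nacht"))      # 6
print(qgram_similarities("night", "nacht"))  # jaccard 1/7, dice 0.25, overlap 0.25
```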

Contd..
Listing a few more:
1. Term frequency - inverse document frequency (using cosine similarity)
2. Recursive matching scheme
3. Phonetic

Summary
Similarity Measure | Matching Quality | Time complexity
Levenshtein distance | Suitable for name matching and for correcting errors | O(|s1|*|s2|)
Jaro Winkler | Suitable for short names; a change in character position reduces matches | O(|s1|+|s2|)
Q-Grams | Does not work well when typographical errors exist | O(|s1|+|s2|)

Proposed Approach
1. Preprocess the data depending on inputs
2. Perform zip code exact matching
3. Apply trigram Dice similarity measure
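The three steps above could be wired together roughly as below. This is a sketch under stated assumptions, not the deck's actual implementation: the abbreviation table, the preprocessing rules, the record layout (address, zip, coordinates) and all function names are hypothetical, and the reference records use made-up placeholder coordinates.

```python
import re

# Hypothetical abbreviation table; a real one would depend on the input data.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def preprocess(address: str) -> str:
    """Step 1: lowercase, strip punctuation, expand known abbreviations."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def trigrams(text: str) -> set:
    """Set of distinct overlapping 3-character substrings."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def dice(a: str, b: str) -> float:
    """Step 3: Dice coefficient on trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def geocode(address: str, zip_code: str, reference: list):
    """Return (coordinates, score) of the best match, or None if the zip is unknown."""
    query = preprocess(address)
    # Step 2: keep only reference records whose zip code matches exactly.
    candidates = [rec for rec in reference if rec[1] == zip_code]
    if not candidates:
        return None
    best = max(candidates, key=lambda rec: dice(query, preprocess(rec[0])))
    return best[2], dice(query, preprocess(best[0]))


# Placeholder reference data: (address, zip, (lat, lon)); values are invented.
REFERENCE = [
    ("221 Baker Street", "11111", (10.0, 20.0)),
    ("12 Main Street", "11111", (30.0, 40.0)),
    ("12 Main Street", "22222", (50.0, 60.0)),
]

print(geocode("12 Mian St", "11111", REFERENCE))
```

Filtering on the exact zip code first shrinks the candidate set, so the trigram comparison only runs against addresses in the same area; the misspelled "Mian" still lands on the Main Street record for that zip.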

References
1. https://www.cs.helsinki.fi/u/ukkonen/TCS92.pdf
2. ftp://ftp.cis.upenn.edu/pub/mbgreen/papers/ton98.pdf
3. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.4119&rep=rep1&type=pdf

Shravanthi
