Missspell Detection

bk
February 10, 2020

Transcript

  1. Problem: a flea-market listing titled "GUCCHI Tote Bag Black Leather"

    Penalties for misspelling a brand name:
    • Listing cancellation
    • Lower seller rating
    • Account suspension
  2. What I built

    Listing title list ("GUCCHI Tote Bag Black Leather", ...) → split into words → listing word list (GUCCHI, Tote, Bag, Black, Leather)
  3. What I built

    Listing title list ("GUCCHI Tote Bag Black Leather", ...) → split into words → listing word list (GUCCHI, Tote, Bag, Black, Leather)
    Correct brand-name list (GUCCI, VUITTON, ...)
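  4. What I built

    Listing title list ("GUCCHI Tote Bag Black Leather", ...) → split into words → listing word list (GUCCHI, Tote, Bag, Black, Leather)
    Correct brand-name list (GUCCI, VUITTON, ...)
    Brute-force comparison of every listing word against every correct brand name → output similar words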
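  5. What I built

    Listing title list ("GUCCHI Tote Bag Black Leather", ...) → split into words → listing word list (GUCCHI, Tote, Bag, Black, Leather)
    Correct brand-name list (GUCCI, VUITTON, ...)
    Brute-force comparison of every listing word against every correct brand name → output similar words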
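The pipeline on these slides (split titles into words, compare every word with every correct brand name, output the similar ones) could be sketched roughly as follows. The function names, the 0.8 threshold, and the upper-casing step are assumptions made for illustration, and `difflib.SequenceMatcher` is only a standard-library stand-in for the similarity measure; the deck itself goes on to use Levenshtein and Jaro-Winkler distance.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the metric used in the deck."""
    return SequenceMatcher(None, a, b).ratio()


def find_misspellings(titles, brands, threshold=0.8):
    """Return (word, brand) pairs where a word resembles, but is not, a brand name."""
    brand_set = {brand.upper() for brand in brands}
    hits = []
    for title in titles:
        for word in title.upper().split():       # split the title into words
            if word in brand_set:                # correctly spelled brand: skip
                continue
            for brand in sorted(brand_set):      # brute force over every brand
                if similarity(word, brand) >= threshold:
                    hits.append((word, brand))
    return hits


titles = ["GUCCHI Tote Bag Black Leather"]
brands = ["GUCCI", "VUITTON"]
print(find_misspellings(titles, brands))  # [('GUCCHI', 'GUCCI')]
```

Keeping the similarity function separate makes it easy to swap in either of the distances described on the following slides.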
  6. Edit distance (Levenshtein distance)

    Original string: G U T T I
    Compared string: G U C C I
    Two substitutions (T → C, T → C): number of operations = distance = 2
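  7. Edit distance (Levenshtein distance)

    | Original string | Compared string | Edit method  | Edit count (distance) |
    |-----------------|-----------------|--------------|-----------------------|
    | GUTTI           | GUCCI           | Substitution | 2                     |
    | GUCCHI          | GUCCI           | Deletion     | 1                     |
    | GUCI            | GUCCI           | Insertion    | 1                     |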
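As a minimal sketch of how this edit count can be computed, the standard dynamic-programming formulation of Levenshtein distance is shown below. The function name is illustrative; the deck itself uses the python-Levenshtein library rather than hand-written code.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of substitutions, deletions and insertions turning source into target."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


for original in ("GUTTI", "GUCCHI", "GUCI"):
    print(original, levenshtein(original, "GUCCI"))
```

The three calls correspond to the substitution, deletion, and insertion rows of the table.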
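  8. Edit distance (Jaro-Winkler distance)

    $D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right)$
    $|s_1|, |s_2|$: lengths of the strings
    $m$: number of matching characters within the window
    $t$: number of substitutions among the matching characters
    Jaro distance: measures the degree of partial match between two strings. A larger value means the strings are closer.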
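  9. Edit distance (Jaro-Winkler distance), with the $m$ terms highlighted

    $D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right)$
    $|s_1|, |s_2|$: lengths of the strings
    $m$: number of matching characters within the window
    $t$: number of substitutions among the matching characters
    Jaro distance: measures the degree of partial match between two strings. A larger value means the strings are closer.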
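  10. Edit distance (Jaro-Winkler distance): $m$, the number of matching characters within the window

    Window size: $\frac{\max(|s_1|, |s_2|)}{2} - 1$
    Original string: GCCUHI → length 6
    Compared string: GUCCI → length 5
    $\frac{\max(6, 5)}{2} - 1 = 2$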
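  11. Edit distance (Jaro-Winkler distance): $m$, the number of matching characters within the window

    Original string: G C C U H I
    Compared string: G U C C I
    For each character, search for a matching character within the window and count it if one is found.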
  12. Edit distance (Jaro-Winkler distance): $m$, the number of matching characters within the window

    Original string: G C C U H I
    Compared string: G U C C I
    $m = 5$
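The window search can be sketched as below. The helper names are illustrative, the integer division for odd lengths is an assumption (the slide only shows an even-length case), and letting each character of the compared string match at most once follows the usual definition of Jaro matching.

```python
def match_window(s1: str, s2: str) -> int:
    """Window size max(|s1|, |s2|) / 2 - 1, using integer division and never below 0."""
    return max(max(len(s1), len(s2)) // 2 - 1, 0)


def matched_characters(s1: str, s2: str):
    """Return the matched characters of s1 and of s2, each in its own original order."""
    window = match_window(s1, s2)
    used = [False] * len(s2)  # marks characters of s2 that are already matched
    matched1 = []
    for i, char in enumerate(s1):
        start = max(0, i - window)
        stop = min(len(s2), i + window + 1)
        for j in range(start, stop):
            if not used[j] and s2[j] == char:
                used[j] = True
                matched1.append(char)
                break
    matched2 = [char for char, flag in zip(s2, used) if flag]
    return "".join(matched1), "".join(matched2)


print(match_window("GCCUHI", "GUCCI"))        # 2
print(matched_characters("GCCUHI", "GUCCI"))  # ('GCCUI', 'GUCCI')
```

The length of either returned string is $m$; here H finds no partner inside the window, so $m = 5$.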
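  13. Edit distance (Jaro-Winkler distance), with the $t$ term highlighted

    $D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right)$
    $|s_1|, |s_2|$: lengths of the strings
    $m$: number of matching characters within the window
    $t$: number of substitutions among the matching characters
    Measures the degree of partial match between two strings. A larger value means the strings are closer.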
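  14. Edit distance (Jaro-Winkler distance): $t$, the number of substitutions among the matching characters

    Original string: G C C U H I
    Compared string: G U C C I
    Extract the matching characters:
    Original string: G C C U I
    Compared string: G U C C I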
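  15. Edit distance (Jaro-Winkler distance): $t$, the number of substitutions among the matching characters

    Original string: G C C U I
    Compared string: G U C C I
    How many characters must be substituted to make the two strings identical?
    $t = 2$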
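Counting $t$ amounts to comparing the two extracted sequences position by position. A minimal sketch, with an illustrative function name:

```python
def count_substitutions(matched_original: str, matched_compared: str) -> int:
    """Number of positions where the two matched sequences disagree (t on the slides)."""
    return sum(a != b for a, b in zip(matched_original, matched_compared))


# Positions 2 and 4 differ (C vs U, U vs C); positions 1, 3 and 5 agree.
print(count_substitutions("GCCUI", "GUCCI"))  # 2
```

Many references define the transposition count as half of this value; the deck keeps $t$ as the raw count and divides by 2 inside the formula, which gives the same result.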
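  16. Edit distance (Jaro-Winkler distance)

    $D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right)$
    $|s_1|, |s_2|$: lengths of the strings
    $m$: number of matching characters within the window
    $t$: number of substitutions among the matching characters
    $D_j = \frac{1}{3}\left(\frac{5}{6} + \frac{5}{5} + \frac{5 - \frac{2}{2}}{5}\right) = \frac{79}{90} = 0.8777\ldots$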
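  17. Edit distance (Jaro-Winkler distance)

    $D_{jw} = D_j + l \cdot \frac{1}{10} \cdot (1 - D_j)$
    $D_j$: Jaro distance
    $l$: number of matching characters counted from the start of the strings ($l \le 4$)
    Jaro-Winkler distance: gives extra weight to matches in the first few characters.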
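  18. Edit distance (Jaro-Winkler distance), with the $l$ term highlighted

    $D_{jw} = D_j + l \cdot \frac{1}{10} \cdot (1 - D_j)$
    $D_j$: Jaro distance
    $l$: number of matching characters counted from the start of the strings ($l \le 4$)
    Jaro-Winkler distance: gives extra weight to matches in the first few characters.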
  19. Edit distance (Jaro-Winkler distance): $l$, the number of matching characters counted from the start ($l \le 4$)

    Original string: G C C U H I
    Compared string: G U C C I
    $l = 1$
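  20. Edit distance (Jaro-Winkler distance)

    $D_{jw} = D_j + l \cdot \frac{1}{10} \cdot (1 - D_j)$
    $D_j$: Jaro distance
    $l$: number of matching characters counted from the start of the strings ($l \le 4$)
    $D_{jw} = \frac{79}{90} + 1 \cdot \frac{1}{10} \cdot \left(1 - \frac{79}{90}\right) = \frac{801}{900} = 0.89$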
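Putting the pieces together, the whole calculation can be sketched in plain Python. This is a minimal sketch that follows the formulas on the slides, not the code of the jaro-winkler-distance library used in the deck; the function and parameter names are illustrative, and the integer division in the window size is an assumption for odd lengths.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro value as defined on the slides: larger means closer."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)  # characters of s2 that are already matched
    matched1 = []
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used[j] and s2[j] == char:
                used[j] = True
                matched1.append(char)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    matched2 = [char for char, flag in zip(s2, used) if flag]
    t = sum(a != b for a, b in zip(matched1, matched2))  # out-of-order matches
    return (m / len(s1) + m / len(s2) + (m - t / 2) / m) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro value boosted by the length l of the common prefix (l <= max_prefix)."""
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return dj + prefix * scaling * (1 - dj)


print(round(jaro("GCCUHI", "GUCCI"), 4))          # 0.8778
print(round(jaro_winkler("GCCUHI", "GUCCI"), 2))  # 0.89
```

For GCCUHI and GUCCI this reproduces $\frac{79}{90}$ and $\frac{801}{900}$ from the worked example.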
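  21. Edit distance

    | Original string         | Compared string | Levenshtein* | Jaro-Winkler** |
    |-------------------------|-----------------|--------------|----------------|
    | GUCCHI                  | GUCCI           | 1            | 0.97           |
    | GUTTI                   | GUCCI           | 2            | 0.79           |
    | GCCUHI                  | GUCCI           | 3            | 0.89           |
    | グッチ裕三 (Gucci Yuzo) | GUCCI           | 5            | 0.00           |

    Jaro-Winkler rewards the presence of matching characters.
    The ranking by closeness differs between Levenshtein and Jaro-Winkler.
    * https://github.com/ztane/python-Levenshtein/
    ** https://github.com/nap/jaro-winkler-distance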
  22. References

    • "Quantifying the similarity of two strings: an explanation of Levenshtein distance and Jaro-Winkler distance", 人工知能であそぶ, http://nkdkccmbr.hateblo.jp/entry/2016/08/18/102727
    • "Edit distance (Levenshtein Distance)", naoyaのはてなダイアリー, https://naoya-2.hatenadiary.org/entry/20090329/1238307757
    • "String similarity evaluation: Levenshtein distance / Jaro-Winkler distance", 人工知能してみる, http://grahamian.hatenablog.com/entry/word_similarity
    • Yaoshu Wang, Jianbin Qin, and Wei Wang: Efficient Approximate Entity Matching Using Jaro-Winkler Distance, University of New South Wales, http://qinjianbin.com/files/wise2017-wang.pdf