
Missspell Detection

bk
February 10, 2020

Transcript

  1. Problem: a flea-market listing titled "GUCCHI Tote Bag Black Leather"

    Penalties for misspelled brand names: • listing cancellation • lower seller rating • account suspension
  2. What I built

    Listing title list (GUCCHI Tote Bag Black Leather, ...) → split into words → listing word list (GUCCHI, Tote, Bag, Black, Leather)
  3. What I built

    Listing title list (GUCCHI Tote Bag Black Leather, ...) → split into words → listing word list (GUCCHI, Tote, Bag, Black, Leather). Correct brand name list: GUCCI, VUITTON, ...
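  4. What I built

    Listing title list (GUCCHI Tote Bag Black Leather, ...) → split into words → listing word list (GUCCHI, Tote, Bag, Black, Leather). Correct brand name list: GUCCI, VUITTON, ... Compare every listing word against every correct brand name (brute force) and output the similar words.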
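  5. What I built

    Listing title list (GUCCHI Tote Bag Black Leather, ...) → split into words → listing word list (GUCCHI, Tote, Bag, Black, Leather). Correct brand name list: GUCCI, VUITTON, ... Compare every listing word against every correct brand name (brute force) and output the similar words.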
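The flow on these slides (split each listing title into words, compare every word against every correct brand name, output the similar ones) can be sketched roughly as follows. This is only an illustration, not the author's implementation: the function name `find_misspellings`, the 0.8 cutoff, and the use of the standard library's `difflib.get_close_matches` as a stand-in similarity measure are all assumptions made for this example. The deck itself goes on to use Levenshtein and Jaro-Winkler distances.

```python
from difflib import get_close_matches


def find_misspellings(titles, brands, cutoff=0.8):
    """Return (listing word, closest brand) pairs that look like misspellings.

    Hypothetical helper: difflib's ratio stands in for the deck's metrics.
    """
    known = {brand.upper() for brand in brands}
    candidates = sorted(known)
    suspects = []
    for title in titles:
        # Split the listing title into words.
        for word in title.upper().split():
            if word in known:
                continue  # spelled correctly, nothing to report
            # Compare the word against every correct brand name.
            match = get_close_matches(word, candidates, n=1, cutoff=cutoff)
            if match:
                suspects.append((word, match[0]))
    return suspects


if __name__ == "__main__":
    print(find_misspellings(["GUCCHI Tote Bag Black Leather"],
                            ["GUCCI", "VUITTON"]))
```

The cutoff decides how aggressive the detection is; a real deployment would presumably tune it against known misspellings.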
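  6. Edit distance (Levenshtein distance)

    Original string: G U T T I. String to compare: G U C C I. Two substitutions (T→C, T→C) are needed, so number of operations = distance = 2.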
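  7. Edit distance (Levenshtein distance)

    | Edit operation | Original string | String to compare | Edit count (distance) |
    |---|---|---|---|
    | Substitution | GUTTI | GUCCI | 2 |
    | Deletion | GUCCHI | GUCCI | 1 |
    | Insertion | GUCI | GUCCI | 1 |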
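The three edit operations in the table above can be checked with a short dynamic-programming sketch. The function name `levenshtein` and the two-row layout are choices made for this example; the deck's own numbers come from the python-Levenshtein library cited later, not from this code.

```python
def levenshtein(source, target):
    """Minimum number of single-character substitutions, deletions, and insertions."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for original in ("GUTTI", "GUCCHI", "GUCI"):
        print(original, "GUCCI", levenshtein(original, "GUCCI"))
```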
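  8. Edit distance (Jaro-Winkler distance)

    Jaro distance: measures the degree of partial agreement between two strings. A larger value means the strings are closer.

    $$D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right)$$

    $|s_1|, |s_2|$: lengths of the strings; $m$: number of matching characters within the window; $t$: number of matched characters that must be swapped.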
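  9. Edit distance (Jaro-Winkler distance)

    Jaro distance: measures the degree of partial agreement between two strings. A larger value means the strings are closer. (This slide highlights $m$.)

    $$D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right)$$

    $|s_1|, |s_2|$: lengths of the strings; $m$: number of matching characters within the window; $t$: number of matched characters that must be swapped.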
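  10. Edit distance (Jaro-Winkler distance)

    $m$: number of matching characters within the window. The window size is $\frac{\max(|s_1|, |s_2|)}{2} - 1$. Original string: GCCUHI → 6. String to compare: GUCCI → 5. $\frac{\max(6, 5)}{2} - 1 = 2$.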
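  11. Edit distance (Jaro-Winkler distance)

    $m$: number of matching characters within the window. Original string: G C C U H I. String to compare: G U C C I. Search for matching characters within the window, and count each match found.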
  12. Edit distance (Jaro-Winkler distance)

    $m$: number of matching characters within the window. Original string: G C C U H I. String to compare: G U C C I. $m = 5$.
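  13. Edit distance (Jaro-Winkler distance)

    Measures the degree of partial agreement between two strings. A larger value means the strings are closer. (This slide highlights $t$.)

    $$D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right)$$

    $|s_1|, |s_2|$: lengths of the strings; $m$: number of matching characters within the window; $t$: number of matched characters that must be swapped.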
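  14. Edit distance (Jaro-Winkler distance)

    $t$: number of matched characters that must be swapped. Original string: G C C U H I. String to compare: G U C C I. Extract the matched characters: original string G C C U I, string to compare G U C C I.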
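  15. Edit distance (Jaro-Winkler distance)

    $t$: number of matched characters that must be swapped. Matched characters of the original string: G C C U I. Matched characters of the string to compare: G U C C I. How many characters must be replaced to make the two sequences identical? $t = 2$.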
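  16. Edit distance (Jaro-Winkler distance)

    $$D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right) = \frac{1}{3}\left(\frac{5}{6} + \frac{5}{5} + \frac{5 - \frac{2}{2}}{5}\right) = \frac{79}{90} = 0.8777\ldots$$

    $|s_1|, |s_2|$: lengths of the strings; $m$: number of matching characters within the window; $t$: number of matched characters that must be swapped.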
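The walkthrough of the window, $m$, and $t$ can be reproduced with a small sketch. It follows the deck's convention, where $t$ counts the matched characters that are out of order and the formula divides it by 2. The function name `jaro`, the use of integer division for the window, and the use of `Fraction` to get the exact value 79/90 are assumptions made for this example, not the cited library's API.

```python
from fractions import Fraction


def jaro(s1, s2):
    """Jaro similarity using the deck's definition of m and t."""
    if not s1 or not s2:
        return Fraction(0)
    # Search window: max(|s1|, |s2|) / 2 - 1 (integer division assumed).
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matched1 = []
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True       # each character of s2 matches at most once
                matched1.append(ch)
                break
    m = len(matched1)
    if m == 0:
        return Fraction(0)
    # Matched characters of s2, in s2's own order.
    matched2 = [ch for ch, flag in zip(s2, used) if flag]
    # t: positions where the two matched sequences disagree.
    t = sum(a != b for a, b in zip(matched1, matched2))
    return (Fraction(m, len(s1)) + Fraction(m, len(s2))
            + Fraction(2 * m - t, 2 * m)) / 3


if __name__ == "__main__":
    print(jaro("GCCUHI", "GUCCI"))  # the worked example from the slides
```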
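  17. Edit distance (Jaro-Winkler distance)

    Jaro-Winkler distance: gives extra weight to matches in the first few characters.

    $$D_{jw} = D_j + l \cdot \frac{1}{10} \cdot (1 - D_j)$$

    $D_j$: Jaro distance; $l$: number of matching characters from the start of the strings ($l \le 4$).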
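  18. Edit distance (Jaro-Winkler distance)

    Jaro-Winkler distance: gives extra weight to matches in the first few characters. (This slide highlights $l$.)

    $$D_{jw} = D_j + l \cdot \frac{1}{10} \cdot (1 - D_j)$$

    $D_j$: Jaro distance; $l$: number of matching characters from the start of the strings ($l \le 4$).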
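  19. Edit distance (Jaro-Winkler distance)

    $l$: number of matching characters from the start of the strings ($l \le 4$). Original string: G C C U H I. String to compare: G U C C I. Only the leading G matches, so $l = 1$.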
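  20. Edit distance (Jaro-Winkler distance)

    $$D_{jw} = D_j + l \cdot \frac{1}{10} \cdot (1 - D_j) = \frac{79}{90} + 1 \cdot \frac{1}{10} \cdot \left(1 - \frac{79}{90}\right) = \frac{801}{900} = 0.89$$

    $D_j$: Jaro distance; $l$: number of matching characters from the start of the strings ($l \le 4$).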
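The prefix adjustment can be sketched on its own, taking an already computed Jaro value as input. The function name `jaro_winkler`, its parameters, and the use of `Fraction` for exact arithmetic are assumptions for this example. The scaling factor 1/10 and the cap of 4 on $l$ are the values given on the slides.

```python
from fractions import Fraction


def jaro_winkler(dj, s1, s2, scale=Fraction(1, 10), max_prefix=4):
    """Apply the Winkler prefix bonus to a Jaro value dj."""
    # l: number of matching characters from the start, capped at max_prefix.
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == max_prefix:
            break
        l += 1
    return dj + l * scale * (1 - dj)


if __name__ == "__main__":
    # Worked example from the slides: Dj = 79/90 and l = 1.
    print(float(jaro_winkler(Fraction(79, 90), "GCCUHI", "GUCCI")))
```

Because the bonus is proportional to $1 - D_j$, the result never exceeds 1, and strings that share a long prefix are pulled closer together than the plain Jaro value would suggest.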
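  21. Edit distance

    | Original string | String to compare | Levenshtein* | Jaro-Winkler** |
    |---|---|---|---|
    | GUCCHI | GUCCI | 1 | 0.97 |
    | GUTTI | GUCCI | 2 | 0.79 |
    | GCCUHI | GUCCI | 3 | 0.89 |
    | グッチ裕三 (Gucci Yuzo) | GUCCI | 5 | 0.00 |

    Jaro-Winkler rewards the presence of matching characters. Levenshtein and Jaro-Winkler rank the candidates in a different order of closeness.

    * https://github.com/ztane/python-Levenshtein/
    ** https://github.com/nap/jaro-winkler-distance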
  22. References

    • Quantifying the similarity of two strings: an explanation of Levenshtein distance and Jaro-Winkler distance, "Playing with AI" blog, http://nkdkccmbr.hateblo.jp/entry/2016/08/18/102727
    • Edit Distance (Levenshtein Distance), naoya's Hatena Diary, https://naoya-2.hatenadiary.org/entry/20090329/1238307757
    • Evaluating string similarity: Levenshtein distance / Jaro-Winkler distance, "Trying Out AI" blog, http://grahamian.hatenablog.com/entry/word_similarity
    • Yaoshu Wang, Jianbin Qin, and Wei Wang: Efficient Approximate Entity Matching Using Jaro-Winkler Distance, University of New South Wales, http://qinjianbin.com/files/wise2017-wang.pdf