
Missspell Detection

bk
February 10, 2020



Transcript

  1. Detecting misspelled strings with edit distance: Levenshtein distance and Jaro-Winkler distance

  2. Contents: 1. Problem (p. 3-10), 2. What I built (p. 11-16), 3. Edit distance (p. 17-39), 4. Results (p. 40-41), 5. Summary (p. 42), 6. References (p. 43)
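  3. Problem
  4. Problem: an inventory of brand-name goods
  5. Problem: flea, items listed by hand
  6. Problem: flea
  7. Problem: "GUCCI Tote Bag Black Leather" on flea; reducing the manual work of listing
  8. Problem: "GUCCHI Tote Bag Black Leather" on flea
  9. Problem: "GUCCHI Tote Bag Black Leather" on flea. Penalties for misspelling a brand name:
    • Listing cancelled
    • Seller rating lowered
    • Account suspended
  10. Problem: "Do something about it with AI". Python, natural language processing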

  11. What I built
  12. What I built. Input: a list of listing titles ("GUCCHI Tote Bag Black Leather", ...)
  13. What I built. The list of listing titles ("GUCCHI Tote Bag Black Leather", ...) is split into words, giving a list of listing words (GUCCHI, Tote, Bag, Black, Leather)
  14. What I built. The listing titles are split into a list of listing words (GUCCHI, Tote, Bag, Black, Leather); alongside it is a list of correct brand names (GUCCI, VUITTON, ...)
  15. What I built. The listing words (GUCCHI, Tote, Bag, Black, Leather) are compared against the correct brand names (GUCCI, VUITTON, ...) by brute force, and similar words are output
  16. What I built. Same flow as slide 15.
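
The flow on slides 12-16 (split titles into words, compare every word with every correct brand name, output the similar ones) could be sketched roughly as follows. This is only an illustration, not the author's code: `find_suspect_words`, `similarity`, the 0.8 threshold, and the case-folding are all assumptions, and `difflib.SequenceMatcher` from the standard library is used as a stand-in score rather than either of the edit distances covered later in the deck.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT one of the deck's edit distances."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def find_suspect_words(titles, brands, threshold=0.8):
    """Return (word, brand, score) for words close to, but not equal to, a brand.

    The threshold is a hypothetical value chosen for this example.
    """
    known = {b.upper() for b in brands}
    suspects = []
    for title in titles:
        for word in title.split():        # split each title into words
            if word.upper() in known:     # correctly spelled brand names pass
                continue
            for brand in brands:          # brute force over the brand list
                score = similarity(word, brand)
                if score >= threshold:
                    suspects.append((word, brand, round(score, 2)))
    return suspects


titles = ["GUCCHI Tote Bag Black Leather"]
brands = ["GUCCI", "VUITTON"]
print(find_suspect_words(titles, brands))
```

Keeping `similarity` as a separate function means the stand-in can later be swapped for a Levenshtein or Jaro-Winkler score without touching the loop.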
  17. Edit distance
  18. Edit distance: 1. Levenshtein distance, 2. Jaro-Winkler distance. Example pair: GUCCHI and GUCCI
  19. Edit distance: 1. Levenshtein distance, 2. Jaro-Winkler distance. Example pair: GUCCHI and GUCCI. First up: 1. Levenshtein distance
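  20. Edit distance (Levenshtein distance): take one string and a string to compare it with, and edit characters until the two match
  21. Edit distance (Levenshtein distance): take one string and a string to compare it with, and edit characters until the two match. The operations are substitution, deletion, and insertion; the number of operations is the distance
  22. Edit distance (Levenshtein distance), substitution: original string G U T T I, compared string G U C C I. Number of operations = distance = 2
  23. Edit distance (Levenshtein distance):
      Edit method    Original string   Compared string   Number of edits (distance)
      Substitution   GUTTI             GUCCI             2
      Deletion       GUCCHI            GUCCI             1
      Insertion      GUCI              GUCCI             1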
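
As a rough sketch of how this distance can be computed, the standard dynamic-programming formulation is shown below. It is a textbook version written for illustration (the function name `levenshtein` is an assumption), not the python-Levenshtein library the deck cites.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, deletions and insertions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("GUTTI", "GUCCI"))   # 2 (two substitutions)
print(levenshtein("GUCCHI", "GUCCI"))  # 1 (one deletion)
print(levenshtein("GUCI", "GUCCI"))    # 1 (one insertion)
```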
  24. Edit distance: 1. Levenshtein distance, 2. Jaro-Winkler distance. Example pair: GUCCHI and GUCCI
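  25. Edit distance (Jaro-Winkler distance). Jaro distance: measures how well two strings partially match; the larger the value, the closer the strings.
    $D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right)$
    • $|s_1|, |s_2|$: lengths of the two strings
    • $m$: number of matching characters within the window
    • $t$: number of substitutions among the matched characters
  26. Edit distance (Jaro-Winkler distance). Same formula and definitions as slide 25, with $m$ highlighted.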
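  27. Edit distance (Jaro-Winkler distance). $m$: number of matching characters within the window. Window size: $\frac{\max(|s_1|, |s_2|)}{2} - 1$. Original string: GCCUHI → length 6. Compared string: GUCCI → length 5. $\frac{\max(6, 5)}{2} - 1 = 2$
  28. Edit distance (Jaro-Winkler distance). $m$: number of matching characters within the window. Original string G C C U H I, compared string G U C C I. Search for matching characters within the window and count each one found
  29. Edit distance (Jaro-Winkler distance). $m$: number of matching characters within the window. Original string G C C U H I, compared string G U C C I. $m = 5$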
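  30. Edit distance (Jaro-Winkler distance). Measures how well two strings partially match; the larger the value, the closer the strings. Now highlighting $t$:
    $D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right)$
    • $|s_1|, |s_2|$: lengths of the two strings
    • $m$: number of matching characters within the window
    • $t$: number of substitutions among the matched characters
  31. Edit distance (Jaro-Winkler distance). $t$: number of substitutions among the matched characters. Original string G C C U H I, compared string G U C C I. Extracting the matched characters gives: original G C C U I, compared G U C C I
  32. Edit distance (Jaro-Winkler distance). $t$: number of substitutions among the matched characters, i.e. how many characters must be substituted to make the two strings identical. Original G C C U I, compared G U C C I. $t = 2$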
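  33. Edit distance (Jaro-Winkler distance).
    $D_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - \frac{t}{2}}{m}\right) = \frac{1}{3}\left(\frac{5}{6} + \frac{5}{5} + \frac{5 - \frac{2}{2}}{5}\right) = \frac{79}{90} = 0.8777\ldots$
    • $|s_1|, |s_2|$: lengths of the two strings
    • $m$: number of matching characters within the window
    • $t$: number of substitutions among the matched characters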
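
The calculation on slides 25-33 can be reproduced with a short sketch like the one below. It follows the slides' notation ($m$, $t$, $D_j$) and returns an exact fraction so the result can be compared with 79/90. The function name `jaro`, the floor division in the window size, clamping the window at 0, and the greedy left-to-right matching are assumptions taken from the usual formulation of the Jaro measure, not from the deck or the library it cites.

```python
from fractions import Fraction


def jaro(s1: str, s2: str) -> Fraction:
    """Jaro value D_j in the slides' notation (larger means closer)."""
    if not s1 or not s2:
        return Fraction(0)
    # Window size: max(|s1|, |s2|) / 2 - 1 (floor division and clamp are assumptions)
    window = max(0, max(len(s1), len(s2)) // 2 - 1)
    used = [False] * len(s2)      # characters of s2 that are already matched
    matched1 = []                 # matched characters, in s1 order
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    m = len(matched1)
    if m == 0:
        return Fraction(0)
    matched2 = [c for c, u in zip(s2, used) if u]   # matched characters, in s2 order
    # t: positions where the two matched sequences disagree
    t = sum(a != b for a, b in zip(matched1, matched2))
    # (2m - t) / (2m) is the same as (m - t/2) / m
    return Fraction(1, 3) * (
        Fraction(m, len(s1)) + Fraction(m, len(s2)) + Fraction(2 * m - t, 2 * m)
    )


print(jaro("GCCUHI", "GUCCI"))   # 79/90
```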
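  34. Edit distance (Jaro-Winkler distance). Jaro-Winkler distance: matches in the first few characters are given extra weight.
    $D_{jw} = D_j + l \cdot \frac{1}{10} \cdot (1 - D_j)$
    • $D_j$: Jaro distance
    • $l$: number of matching characters from the start of the strings ($l \le 4$)
  35. Edit distance (Jaro-Winkler distance). Same formula and definitions as slide 34, with $l$ highlighted.
  36. Edit distance (Jaro-Winkler distance). $l$: number of matching characters from the start of the strings ($l \le 4$). Original string G C C U H I, compared string G U C C I. $l = 1$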
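  37. Edit distance (Jaro-Winkler distance).
    $D_{jw} = D_j + l \cdot \frac{1}{10} \cdot (1 - D_j) = \frac{79}{90} + 1 \cdot \frac{1}{10} \cdot \left(1 - \frac{79}{90}\right) = \frac{801}{900} = 0.89$
    • $D_j$: Jaro distance
    • $l$: number of matching characters from the start of the strings ($l \le 4$)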
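
The Winkler adjustment on slides 34-37 is small enough to sketch on its own. The helper below takes an already computed Jaro value and applies the prefix bonus; the name `winkler_boost` and its parameters are assumptions made for this example, with `p` defaulting to the 1/10 scaling factor shown in the slides' formula.

```python
from fractions import Fraction


def winkler_boost(dj: Fraction, s1: str, s2: str,
                  p: Fraction = Fraction(1, 10)) -> Fraction:
    """Apply D_jw = D_j + l * p * (1 - D_j), where l is the shared prefix length (l <= 4)."""
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == 4:     # stop at the first mismatch or at the cap of 4
            break
        l += 1
    return dj + l * p * (1 - dj)


# Slide 37: D_j = 79/90 for GCCUHI vs GUCCI, and only the leading "G" matches (l = 1)
djw = winkler_boost(Fraction(79, 90), "GCCUHI", "GUCCI")
print(djw, float(djw))   # 89/100 0.89
```

Because 801/900 reduces to 89/100, the exact fraction confirms the 0.89 shown on the slide.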
  38. Edit distance. Levenshtein: the smaller the value, the closer. Jaro-Winkler: the larger the value, the closer.
      Original string   Compared string   Levenshtein*   Jaro-Winkler**
      GUCCHI            GUCCI             1              0.97
      GUTTI             GUCCI             2              0.79
      GCCUHI            GUCCI             3              0.89
      グッチ裕三        GUCCI             5              0.00
    * https://github.com/ztane/python-Levenshtein/
    ** https://github.com/nap/jaro-winkler-distance
  39. Edit distance (same table and sources as slide 38). Jaro-Winkler rewards the presence of matching characters. Levenshtein and Jaro-Winkler rank the candidates' closeness in different orders.
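  40. Results
  41. Results: the .py script seems to be running, so good enough. * https://github.com/bk-18/Misspelled-Brand-Name-Detector
  42. Summary
    • The problem: misspelled brand names at listing time
    • Detecting misspellings by brute-force comparison against a list
    • Levenshtein distance
    • Jaro-Winkler distance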

  43. References
    • "Quantifying the similarity of two strings: Levenshtein distance and Jaro-Winkler distance explained", 人工知能であそぶ, http://nkdkccmbr.hateblo.jp/entry/2016/08/18/102727
    • "Edit distance (Levenshtein Distance)", naoyaのはてなダイアリー, https://naoya-2.hatenadiary.org/entry/20090329/1238307757
    • "Evaluating string similarity: Levenshtein distance / Jaro-Winkler distance", 人工知能してみる, http://grahamian.hatenablog.com/entry/word_similarity
    • Yaoshu Wang, Jianbin Qin, and Wei Wang: Efficient Approximate Entity Matching Using Jaro-Winkler Distance, University of New South Wales, http://qinjianbin.com/files/wise2017-wang.pdf
  44. ENJOY! ENJAY! EMJOY! ENJOI! ENZYOI!