Similarity metrics for Japanese kanji

Lars Yencken
November 30, 2012

An overview of my PhD work on calculating the similarity between Japanese kanji, presented to the Melbourne Maths and Science meetup.

Transcript

  1. Similarity Metrics for Japanese Kanji. Lars Yencken / 99designs. Maths and Science Meetup, 30th Nov 2012
  2. Linguistics Computer Science Computational Linguistics

  3. Relative difficulty of languages

  4. DIFFICULTY OF LEARNING LANGUAGES. FOREIGN SERVICE INSTITUTE, US DEPARTMENT OF STATE
  5. DIFFICULTY OF LEARNING LANGUAGES. FOREIGN SERVICE INSTITUTE, US DEPARTMENT OF STATE

    Closely related to English (575-600 class hours): Afrikaans, Danish, Dutch, French, Italian, Norwegian, Portuguese, Romanian, Spanish, Swedish
  6. DIFFICULTY OF LEARNING LANGUAGES. FOREIGN SERVICE INSTITUTE, US DEPARTMENT OF STATE

    Closely related to English (575-600 class hours): Afrikaans, Danish, Dutch, French, Italian, Norwegian, Portuguese, Romanian, Spanish, Swedish

    Significant linguistic and/or cultural differences (1100 class hours): Albanian, Amharic, Armenian, Azerbaijani, Bengali, Bosnian, Bulgarian, Burmese, Croatian, Czech, Estonian, Finnish, Georgian, Greek, Hebrew, Hindi, Hungarian, Icelandic, Khmer, Lao, Latvian, Lithuanian, Macedonian, Mongolian, Nepali, Pashto, Persian (Dari, Farsi, Tajik), Polish, Russian, Serbian, Sinhalese, Slovak, Slovenian, Tagalog, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Zulu
  13. DIFFICULTY OF LEARNING LANGUAGES. FOREIGN SERVICE INSTITUTE, US DEPARTMENT OF STATE

    Closely related to English (575-600 class hours): Afrikaans, Danish, Dutch, French, Italian, Norwegian, Portuguese, Romanian, Spanish, Swedish

    Significant linguistic and/or cultural differences (1100 class hours): Albanian, Amharic, Armenian, Azerbaijani, Bengali, Bosnian, Bulgarian, Burmese, Croatian, Czech, Estonian, Finnish, Georgian, Greek, Hebrew, Hindi, Hungarian, Icelandic, Khmer, Lao, Latvian, Lithuanian, Macedonian, Mongolian, Nepali, Pashto, Persian (Dari, Farsi, Tajik), Polish, Russian, Serbian, Sinhalese, Slovak, Slovenian, Tagalog, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Zulu

    Exceptionally difficult for native English speakers (2200 class hours): Arabic, Cantonese, Mandarin, Japanese, Korean
  15. 持 /mo(tsu)/ "to carry"

  16. 持 挂 拝

  17. distance(持, 挂) = ???

  18. The space of kanji

  19. (no text on this slide)
  20. dog dough log

  21. 持 挂 拝 土

  22. Approaches

  23. Compare images

  24. 持挂

  25. (no text on this slide)
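
The image-comparison idea can be sketched as a pixel-wise distance between two renderings of the same size. This is only a minimal illustration: the function name `pixel_distance` and the 3x3 toy bitmaps are hypothetical, and the slides do not say which image distance was actually used.

```python
def pixel_distance(img_a, img_b):
    """Mean absolute difference between two equal-sized grayscale grids.

    Each image is a list of rows; each pixel is a number in [0, 1].
    Returns 0.0 for identical images and 1.0 for fully inverted ones.
    """
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = sum(
        abs(pa - pb)
        for row_a, row_b in zip(img_a, img_b)
        for pa, pb in zip(row_a, row_b)
    )
    n_pixels = sum(len(row) for row in img_a)
    return total / n_pixels


# Toy bitmaps (1 = ink, 0 = background). Hypothetical, not real glyph renderings.
glyph_a = [[1, 1, 1],
           [0, 1, 0],
           [0, 1, 0]]
glyph_b = [[1, 1, 1],
           [0, 1, 0],
           [1, 1, 1]]

print(pixel_distance(glyph_a, glyph_b))
```

In practice the two kanji would be rendered in the same font at the same size before comparison, so the result depends on that rendering choice.
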
  26. Compare components

  27. (kanji component breakdown; glyphs garbled in the transcript)
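
One simple way to compare components is set overlap. The sketch below uses Jaccard similarity, which is an assumption chosen for illustration rather than the measure from the talk. The `COMPONENTS` table is a small hand-written decomposition, not data from the slides.

```python
def jaccard_similarity(components_a, components_b):
    """Shared components divided by all components used by either kanji."""
    a, b = set(components_a), set(components_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hand-written decompositions (assumption, for illustration only):
# 持 = 扌 + 寺, 寺 = 土 + 寸; 挂 = 扌 + 圭, 圭 = 土 + 土.
COMPONENTS = {
    "持": {"扌", "土", "寸"},
    "挂": {"扌", "土"},
    "土": {"土"},
}

for other in ("挂", "土"):
    score = jaccard_similarity(COMPONENTS["持"], COMPONENTS[other])
    print("持", other, round(score, 3))
```

A set ignores how many times a component occurs and where it sits in the character, so 挂 (which contains 土 twice) looks the same as a kanji containing it once.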

  28. Compare strokes

  29. P R O S P E R I T Y

    P R O P E R T I E S
  30. P R O S P E R I T Y

    P R O P E R T I E S

    distance: 6
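
The distance on this slide can be reproduced with a standard dynamic-programming edit distance. The slide does not state its cost model, so the `sub_cost` parameter below is an assumption: a distance of 6 is what results when a substitution costs as much as a delete plus an insert (equivalently, when only insertions and deletions are allowed), while unit-cost substitution gives a smaller value for the same pair.

```python
def edit_distance(a, b, sub_cost=1):
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                # delete x
                curr[j - 1] + 1,                            # insert y
                prev[j - 1] + (0 if x == y else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("PROSPERITY", "PROPERTIES", sub_cost=2))  # 6
print(edit_distance("PROSPERITY", "PROPERTIES", sub_cost=1))  # 4
```
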
  31. 3, 11a, 2a, 2a

    3, 11a, 2a, 2a, 2a

    distance: 1
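
The same idea carries over from letters to strokes if each kanji is written as a sequence of stroke-type codes. The sketch below treats the codes from the slide as opaque tokens. The function name `stroke_edit_distance` is hypothetical, and unit costs for insert, delete and substitute are an assumption.

```python
def stroke_edit_distance(strokes_a, strokes_b):
    """Levenshtein distance over sequences of stroke-type codes (unit costs)."""
    prev = list(range(len(strokes_b) + 1))
    for i, sa in enumerate(strokes_a, start=1):
        curr = [i]
        for j, sb in enumerate(strokes_b, start=1):
            curr.append(min(
                prev[j] + 1,               # drop a stroke from the first kanji
                curr[j - 1] + 1,           # add a stroke from the second kanji
                prev[j - 1] + (sa != sb),  # keep or swap the stroke type
            ))
        prev = curr
    return prev[-1]


# Stroke sequences as shown on the slide.
first = ["3", "11a", "2a", "2a"]
second = ["3", "11a", "2a", "2a", "2a"]

print(stroke_edit_distance(first, second))  # 1
```
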
  32. Compare trees

  33. (tree diagram; no text recoverable from the transcript)
  34. (tree diagram; no text recoverable from the transcript)

    tree edit distance
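
Tree edit distance counts the node insertions, deletions and relabelings needed to turn one ordered tree into another. The sketch below uses the textbook forest recursion with memoisation, which is fine for small trees but is not an optimised algorithm. The tree encoding, the `node` helper and the example decompositions (using the Unicode layout characters ⿰ and ⿱ as inner labels) are assumptions for illustration, not the representation from the talk.

```python
from functools import lru_cache


def node(label, *children):
    """Build an immutable tree node: (label, tuple_of_children)."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between two ordered forests (tuples of trees), unit costs."""
    if not f1 and not f2:
        return 0
    if not f1:  # insert the rightmost root of f2; its children join the forest
        _, kids = f2[-1]
        return forest_distance((), f2[:-1] + kids) + 1
    if not f2:  # delete the rightmost root of f1
        _, kids = f1[-1]
        return forest_distance(f1[:-1] + kids, ()) + 1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        forest_distance(f1[:-1] + kids1, f2) + 1,  # delete rightmost root of f1
        forest_distance(f1, f2[:-1] + kids2) + 1,  # insert rightmost root of f2
        forest_distance(kids1, kids2)              # map the two roots onto each other
        + forest_distance(f1[:-1], f2[:-1])
        + (label1 != label2),                      # relabel cost if they differ
    )


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


# Hypothetical decomposition trees: 持 = ⿰(扌, ⿱(土, 寸)), 挂 = ⿰(扌, ⿱(土, 土)).
tree_mochi = node("⿰", node("扌"), node("⿱", node("土"), node("寸")))
tree_kake = node("⿰", node("扌"), node("⿱", node("土"), node("土")))

print(tree_edit_distance(tree_mochi, tree_kake))  # 1
```

Under this encoding the two kanji differ by a single relabeling (寸 to 土), so the distance is 1.
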
  35. So what works?

  36. (no text on this slide)
  37. (no text on this slide)
  38. Thanks!