Unicode Best Practices

Developing applications to handle the natural languages and written scripts of the world, or even a small handful of them, is an impressively large task. Fortunately, Unicode provides tools to do just that. It’s more than just a character set; it’s a collection of standards for working with the world’s textual data. The problem is that Unicode itself is complex!

This talk will help make supporting Unicode easier by providing some of the best practices for your projects—whether CPAN modules, RESTful services, or web applications. We’ll briefly review Unicode and then dive into best practices for handling Unicode text in the following areas:

◦  User experience
◦  Collation (comparison and sorting)
◦  Input, output, and logging
◦  Security considerations
◦  Debugging
◦  Testing (unit tests and QA)

Presented at:
◦  2013-06-19: Open Source Bridge 2013, Portland, OR

Speaker notes: http://opensourcebridge.org/wiki/2013/Unicode_Best_Practices
Example code in multiple languages: https://github.com/patch/unicode-programming

Nova Patch

June 19, 2013

Transcript

  1. Unicode Best Practices s///g nick patch @nickpatch 『 shutterstock 』

  2. UTF-8 encoded input

  3. UTF-8 encoded input ⇩ decode

  4. UTF-8 encoded input ⇩ decode ⇩ character string

  5. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack…

  6. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode

  7. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output
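A minimal Python 3 sketch of the flow on slides 2 to 7: decode at the input boundary, work on a character string, encode at the output boundary. The function name `shout` and the uppercase step are assumptions that stand in for the "hack… hack… hack…" part.

```python
def shout(raw: bytes) -> bytes:
    """Decode UTF-8 bytes, work on characters, then encode back to UTF-8."""
    text = raw.decode("utf-8")    # UTF-8 encoded input -> character string
    text = text.upper()           # hypothetical processing step on characters
    return text.encode("utf-8")   # character string -> UTF-8 encoded output


encoded_input = "naïve café".encode("utf-8")
encoded_output = shout(encoded_input)

# The byte length and the character length differ, which is why the
# processing step should never run on the undecoded bytes.
byte_count = len("café".encode("utf-8"))   # 5 bytes
char_count = len("café")                   # 4 characters

print(encoded_output.decode("utf-8"))
```

Keeping the decode and encode calls at the edges means everything in between can treat text as characters rather than bytes.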
  8. UTF-8 source code

  9. UTF-8 source code UTF-8 I/O

  10. Perl 5
      use utf8;

  11. Perl 5
      use utf8;
      # hack… hack… hack…

  12. Perl 5
      use utf8;
      # hack… hack… hack…
      =encoding UTF-8

  13. Perl 5
      use utf8;
      # hack… hack… hack…
      =encoding UTF-8
      docs… docs… docs…

  14. Python 2
      #-*- coding: UTF-8 -*-
      # hack… hack… hack…

  15. Python 3
      # hack… hack… hack…

  16. Ruby
      # encoding: UTF-8
      # hack… hack… hack…

  17. / (
            ия   # definite articles for nouns:
          | ът   # ∙ masculine
          | та   # ∙ feminine
          | то   # ∙ neutral
          | те   # ∙ plural
      ) $ /x
  18. / (
            آباد
          | باره
          | بندی
          | بندي
          | ترین
          | ریزی
          | ريزي
          | سازی
          | سازي
          | هایی
      ) $ /x
  19. func remove_kasra (word) {
          word.subst(/ \N{ARABIC KASRA} $ /x, "")
      }
  20. \d ১২৩… ໑໒໓…

  21. \d 123… ১২৩… ໑໒໓…

  22. \d 123… ১২৩… ໑໒໓…

  23. \d 123… ১২৩… ໑໒໓…

  24. [0-9] 123… ১২৩… ໑໒໓…
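The digit slides can be tried out in Python 3, where `\d` in a `str` pattern matches any Unicode decimal digit and `[0-9]` matches only ASCII digits. This is a small sketch using the standard `re` module; the sample list is an assumption built from the digits shown on the slides.

```python
import re

samples = ["123", "১২৩", "໑໒໓"]   # ASCII, Bengali, and Lao digits

unicode_digits = re.compile(r"\d+")    # any Unicode decimal digit
ascii_digits = re.compile(r"[0-9]+")   # ASCII digits only

unicode_hits = [bool(unicode_digits.fullmatch(s)) for s in samples]
ascii_hits = [bool(ascii_digits.fullmatch(s)) for s in samples]

print(unicode_hits)   # [True, True, True]
print(ascii_hits)     # [True, False, False]

# int() also accepts Unicode decimal digits, so validating input with \d
# and then converting it is consistent.
bengali_value = int("১২৩")
```

Which form is right depends on the input: `[0-9]` suits formats that are defined as ASCII, while `\d` suits text typed by people.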

  25. \w abc… 123… _ αβγ… ㄅㄆㄇ… أ ب ج …

  26. \w abc… 123… _ αβγ… ㄅㄆㄇ… أ ب ج …

  27. \w abc… 123… _ αβγ… ㄅㄆㄇ… أ ب ج …

  28. \w abc… 123… _ αβγ… ㄅㄆㄇ… أ ب ج …

  29. \w abc… 123… _ αβγ… ㄅㄆㄇ… أ ب ج …

  30. \w abc… 123… _ αβγ… ㄅㄆㄇ… أ ب ج …

  31. \w abc… 123… _ αβγ… ㄅㄆㄇ… أ ب ج …

  32. \b abc… 123… _ αβγ… ㄅㄆㄇ… أ ب ج …
  33. [A-Za-z0-9_] abc… 123… _
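The same contrast holds for word characters. In this Python 3 sketch, `\w` on a `str` pattern matches letters and digits from any script plus the underscore, while the explicit class or the `re.ASCII` flag restricts the match to ASCII. The sample text is an assumption assembled from the characters on the slides.

```python
import re

text = "abc 123 _ αβγ ㄅㄆㄇ أ ب ج"

unicode_words = re.findall(r"\w+", text)                  # Unicode-aware
ascii_words = re.findall(r"[A-Za-z0-9_]+", text)          # explicit ASCII class
ascii_flag_words = re.findall(r"\w+", text, flags=re.ASCII)  # \w limited to ASCII

print(unicode_words)
print(ascii_words)
```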

  34. \s

  35. \R

  36. \R LF (\n) CR (\r) FF (\f)

  37. \R LF (\n) CR (\r) FF (\f) CRLF (\r\n)

  38. \R LF (\n) CR (\r) FF (\f) CRLF (\r\n) NEL VT LS PS
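Python's standard `re` module has no `\R` escape, so the sketch below spells out an equivalent pattern by hand from the list on slide 38. The name `LINEBREAK` is an assumption; the CRLF alternative comes first so that the pair is consumed as a single line break.

```python
import re

# CRLF, then LF, VT, FF, CR, NEL (U+0085), LS (U+2028), PS (U+2029)
LINEBREAK = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

text = "one\ntwo\r\nthree\rfour\x85five\u2028six"
lines = LINEBREAK.split(text)

print(lines)   # ['one', 'two', 'three', 'four', 'five', 'six']
```

`str.splitlines()` is a built-in alternative, though it also splits on a few additional separators (FS, GS, RS) that are not in the list above.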
  39. .

  40. \X

  41. \X Spın̈al Tap

  42. \X Spın̈al Tap

  43. \X Spın̈al Tap n\N{COMBINING DIAERESIS}

  44. \X Spın̈al Tap n\N{COMBINING DIAERESIS} CRLF (\r\n)

  45. \X Spın̈al Tap n\N{COMBINING DIAERESIS} CRLF (\r\n)
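Python's standard `re` module does not support `\X`, so this sketch approximates it with `unicodedata`: a cluster is a base character plus any following combining marks, and CRLF counts as one cluster. It is a deliberate simplification, not the full grapheme cluster rules from Unicode Text Segmentation, and `simple_clusters` is a hypothetical helper name.

```python
import unicodedata
from typing import List


def simple_clusters(text: str) -> List[str]:
    """Split text into approximate grapheme clusters (simplified rules)."""
    clusters: List[str] = []
    i = 0
    while i < len(text):
        if text.startswith("\r\n", i):
            end = i + 2                      # CRLF is a single cluster
        else:
            end = i + 1
            # Attach following marks (categories Mn, Mc, Me) to the base
            while end < len(text) and unicodedata.category(text[end]).startswith("M"):
                end += 1
        clusters.append(text[i:end])
        i = end
    return clusters


name = "Sp\u0131n\u0308al Tap"   # dotless i, then n + COMBINING DIAERESIS

print(len(name))                    # 11 code points
print(len(simple_clusters(name)))   # 10 user-perceived characters
```

A production system would more likely rely on a library that implements the full segmentation rules; the point here is only that code points and user-perceived characters are counted differently.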

  46. \p{…}

  47. \p{ASCII}

  48. \P{ASCII}

  49. \p{General_Category=Letter}

  50. \p{Letter}

  51. \p{L}

  52. \pL

  53. L Letter, M Mark, N Number, P Punctuation, S Symbol, Z Separator, C Other

  54. S Symbol: Sm Math_Symbol, Sc Currency_Symbol, Sk Modifier_Symbol, So Other_Symbol
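Python's standard `re` module has no `\p{…}` syntax, but the general categories on slides 53 and 54 are available through `unicodedata.category`. This sketch shows the two-letter codes and a filter that plays the role of `\p{L}`; `letters_only` is a hypothetical helper name.

```python
import unicodedata


def letters_only(text: str) -> str:
    """Keep only characters whose general category starts with L (Letter)."""
    return "".join(ch for ch in text if unicodedata.category(ch).startswith("L"))


# One sample character per Symbol subcategory from slide 54
symbol_categories = {ch: unicodedata.category(ch) for ch in "+$^©"}

print(symbol_categories)            # {'+': 'Sm', '$': 'Sc', '^': 'Sk', '©': 'So'}
print(letters_only("Größe 123!"))   # Größe
```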

  55. \p{Script=Latin}

  56. \p{Latin}

  57. [\p{Hiragana} \p{Katakana} \p{Han} \p{Latin} \p{Common}]

  58. [\p{Hira} \p{Kana} \p{Hani} \p{Latn} \p{Common}]

  59. Arab Arabic, Beng Bengali, Deva Devanagari, Egyp Egyptian hieroglyphs, Ethi Ethiopic, Grek Greek, Hang Hangul, …

  60. return $word if $word =~ s{ зи $}{г}x
                   || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
                   || $word =~ s{ ци $}{к}x
                   || $word =~ s{ ( та | ища ) $}{}x;
  61. "Größe".lc == "größe"

  62. "Größe".lc == "größe"
      "Größe".uc == "GRÖSSE"

  63. "Größe".lc == "größe"
      "Größe".uc == "GRÖSSE"
      "Größe".lc != "Größe".uc.lc

  64. "Größe".fc == "GRÖSSE".fc
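The case slides translate directly to Python 3, where `str.lower`, `str.upper`, and `str.casefold` correspond to the `lc`, `uc`, and `fc` operations shown. A short sketch with assumed variable names:

```python
word = "Größe"

lowered = word.lower()               # "größe"
uppered = word.upper()               # "GRÖSSE" (ß uppercases to SS)
round_trip = word.upper().lower()    # "grösse", no longer equal to lowered

# Case folding is the operation intended for caseless comparison.
same_when_folded = word.casefold() == "GRÖSSE".casefold()

print(lowered, uppered, round_trip, same_when_folded)
```

Lowercasing both sides is not a reliable caseless comparison because case mappings do not always round-trip; folding both sides is.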

  65. "Größe".nfc == "Gro◌̈ße".nfc

  66. "Größe".nfc.fc == "GRO◌̈SSE".nfc.fc
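Slides 65 and 66 can be checked in Python 3 with `unicodedata.normalize`. In this sketch the precomposed and decomposed spellings are written with escapes so the difference is visible in the source, and `canonical_caseless` is a hypothetical helper that mirrors the `.nfc.fc` chain on the slide.

```python
import unicodedata

composed = "Gr\u00f6\u00dfe"      # ö as a single code point
decomposed = "Gro\u0308\u00dfe"   # o followed by COMBINING DIAERESIS


def canonical_caseless(text: str) -> str:
    """Normalize to NFC, then case fold, as on slide 66."""
    return unicodedata.normalize("NFC", text).casefold()


equal_raw = composed == decomposed
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
equal_folded = canonical_caseless(composed) == canonical_caseless("GRO\u0308SSE")

print(equal_raw, equal_nfc, equal_folded)   # False True True
```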

  67. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output

  68. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output

  69. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output

  70. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC

  71. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC

  72. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC

  73. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode

  74. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output
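A Python 3 sketch of the full pipeline on slide 74: decode, normalize to NFD, process, normalize to NFC, encode. The processing step here (dropping nonspacing marks) is only an assumed stand-in for "hack… hack… hack…", chosen because it is easy to write on decomposed text; `strip_marks` is a hypothetical name.

```python
import unicodedata


def strip_marks(raw: bytes) -> bytes:
    """Remove nonspacing marks from UTF-8 text using the NFD/NFC pipeline."""
    text = raw.decode("utf-8")                  # decode
    text = unicodedata.normalize("NFD", text)   # split base letters from marks
    text = "".join(                             # hypothetical processing step
        ch for ch in text if unicodedata.category(ch) != "Mn"
    )
    text = unicodedata.normalize("NFC", text)   # recompose for output
    return text.encode("utf-8")                 # encode


print(strip_marks("Nóirín".encode("utf-8")).decode("utf-8"))   # Noirin
```

Normalizing on the way in means the processing step sees one consistent form no matter how the input was typed, and normalizing on the way out gives downstream consumers a predictable form.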
  75. None
  76. None
  77. collator = Unicode::Collator.new
      countries = collator.sort(countries)

  78. collator = Unicode::Collator.new
      countries = collator.sort(countries)

  79. collator = Unicode::Collator.new(
          locale: "de" # German
      )
      de_words = collator.sort(de_words)

  80. collator = Unicode::Collator.new(
          level: 2 # ignore case
      )
      collator.eq("Größe", "GRO◌̈SSE")
  81. Testing: characters > 7F: Nóirín; characters > FF: 松本行弘; characters > FFFF: ⩨; grapheme clusters: Spın̈al Tap; foreign digits: ٤٠

  82. Testing: characters > 7F: Nóirín; characters > FF: 松本行弘; characters > FFFF: ⩨; grapheme clusters: Spın̈al Tap; foreign digits: ٤٠

  83. None

  84. Testing: characters > 7F: Nóirín; characters > FF: 松本行弘; characters > FFFF: ⩨; grapheme clusters: Spın̈al Tap; foreign digits: ٤٠

  85. Testing: characters > 7F: Nóirín; characters > FF: 松本行弘; characters > FFFF: ⩨; grapheme clusters: Spın̈al Tap; foreign digits: ٤٠

  86. Testing: characters > 7F: Nóirín; characters > FF: 松本行弘; characters > FFFF: ⩨; grapheme clusters: Spın̈al Tap; foreign digits: ٤٠

  87. Testing: characters > 7F: Nóirín; characters > FF: 松本行弘; characters > FFFF: ⩨; grapheme clusters: Spın̈al Tap; foreign digits: ٤٠
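One way to turn the test-data categories into fixtures is sketched below in Python 3. The dictionary keys and the `round_trips` helper are hypothetical, and the supplementary-plane sample (U+1D11E MUSICAL SYMBOL G CLEF) is an assumed stand-in rather than the character from the slide.

```python
samples = {
    "above_7F": "Nóirín",
    "above_FF": "松本行弘",
    "above_FFFF": "\U0001D11E",               # assumption: stand-in sample
    "grapheme_cluster": "Sp\u0131n\u0308al Tap",
    "non_ascii_digits": "٤٠",
}


def round_trips(text: str) -> bool:
    """Check that text survives a UTF-8 encode/decode cycle unchanged."""
    return text.encode("utf-8").decode("utf-8") == text


all_round_trip = all(round_trips(s) for s in samples.values())
highest = {name: max(ord(ch) for ch in s) for name, s in samples.items()}
digits_value = int(samples["non_ascii_digits"])

print(all_round_trip, digits_value)
```

Each sample exercises a different boundary: one-byte versus multi-byte encodings, the Basic Multilingual Plane versus supplementary planes, code points versus clusters, and digit handling.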
  88. use utf8;
      use open qw( :encoding(UTF-8) :std );
      use Test::More tests => 66;
      use Lingua::Stem::UniNE::CS qw( stem );

      is stem("zvířatech"), "zvíř", "rm -atech";
      is stem("zvířatům"), "zvíř", "rm -atům";
      is stem("zvířata"), "zvíř", "rm -ata";
      is stem("zvířaty"), "zvíř", "rm -aty";
      …
  89. @nickpatch

  90. None