Unicode Best Practices in Perl

Developing applications that handle the natural languages and written scripts of the world, or even a small handful of them, is an impressively large task. Fortunately, Unicode provides tools to do just that. It’s more than just a character set; it’s a collection of standards for working with the world’s textual data. The problem is that Unicode itself is complex!

This talk will help make supporting Unicode easier by providing some of the best practices for your projects—whether CPAN modules, RESTful services, or web applications. We’ll briefly review Unicode and then dive into best practices for handling Unicode text in the following areas:

◦  User experience
◦  Collation (comparison and sorting)
◦  Input, output, and logging
◦  Security considerations
◦  Debugging
◦  Testing (unit tests and QA)

Presented at:
◦  2013-06-04: YAPC::NA 2013, Austin, TX

Video: http://youtu.be/X2FQHUHjo8M

Nova Patch

June 04, 2013

Transcript

  1. Unicode Best Practices Nick Patch @nickpatch Shutterstock

  2. Perl

  3. "Perl has some of the best Unicode support today, especially with respect to regular expressions." (Benjamin Peterson, "The Guts of Unicode in Python", PyCon 2013)

  4. use v5.8;

  5. use v5.12;

  6. use v5.14;

  7. UTF-8 encoded input

  8. UTF-8 encoded input ⇩ decode

  9. UTF-8 encoded input ⇩ decode ⇩ character string

  10. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack…

  11. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode

  12. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output
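
The decode, process, encode flow on these slides might be sketched as follows. This is a minimal illustration using the core Encode module; the helper name `shout_utf8` and the sample input are assumptions made for the example and do not come from the talk.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper for illustration: UTF-8 bytes in, UTF-8 bytes out.
sub shout_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> character string
    $chars = uc $chars;                     # the processing step works on characters
    return encode('UTF-8', $chars);         # character string -> bytes
}

my $input  = "caf\xC3\xA9";                 # "caf" + U+00E9, as five UTF-8 bytes
my $chars  = decode('UTF-8', $input);       # four characters
my $output = shout_utf8($input);            # "CAF" + U+00C9, as five UTF-8 bytes
```

Here `length($input)` counts bytes while `length($chars)` counts characters, which is why all string work is best done between the decode and encode steps.
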
  13. use utf8;

  14. s/ / /g

  15. use utf8;
      $word =~ s{ (?:
          آباد | باره | بندی | بندي | ترین | ریزی | ریزي | سازی | سازي | هایی
      ) $}{}x;
  16. use utf8;
      $word =~ s{ (?:
            ия  # definite articles for nouns:
          | ът  # ∙ masculine
          | та  # ∙ feminine
          | то  # ∙ neuter
          | те  # ∙ plural
      ) $}{}x;
  17. =encoding UTF-8

  18. =encoding UTF-8

      =head1 NAME

      Lingua::Stem::UniNE - University of Neuchâtel stemmers

  19. use open qw( :encoding(UTF-8) :std );

  20. use utf8;
      use open qw( :encoding(UTF-8) :std );
      use Test::More tests => 66;
      use Lingua::Stem::UniNE::CS qw( stem );

      is stem('zvířatech'), 'zvíř', 'rm -atech';
      is stem('zvířatům'),  'zvíř', 'rm -atům';
      is stem('zvířata'),   'zvíř', 'rm -ata';
      is stem('zvířaty'),   'zvíř', 'rm -aty';
  21. decode_json($json)

  22. $res->decoded_content

  23. decode('UTF-8', $arg)

  24. use charnames ':full';

  25. use charnames ':full';

      sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \N{ARABIC KASRA} $}{}x;
          return $word;
      }

  26. use v5.16;

      sub remove_kasra {
          my ($word) = @_;
          $word =~ s{ \N{ARABIC KASRA} $}{}x;
          return $word;
      }
  27. \d

  28. \d 123…

  29. \d 123… … ১২৩

  30. \d 123… … ১২৩ … ໑໒໓

  31. [0-9] 123…
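
To make the `\d` versus `[0-9]` contrast concrete, here is a small sketch. The variable names are illustrative, and the `/a` modifier (available from Perl 5.14) is shown as one further option that these slides do not mention.

```perl
use strict;
use warnings;

my $ascii   = '123';
my $bengali = "\x{09E7}\x{09E8}\x{09E9}";   # BENGALI DIGIT ONE, TWO, THREE

# \d matches any Unicode decimal digit in a character string.
my $d_ascii     = $ascii   =~ /\A\d+\z/    ? 1 : 0;   # matches
my $d_bengali   = $bengali =~ /\A\d+\z/    ? 1 : 0;   # matches

# An explicit range, or \d under /a, restricts the match to ASCII digits.
my $cls_bengali = $bengali =~ /\A[0-9]+\z/ ? 1 : 0;   # does not match
my $a_bengali   = $bengali =~ /\A\d+\z/a   ? 1 : 0;   # does not match
```

Which behavior is right depends on the input: `\d` suits user-facing text, while `[0-9]` is the safer choice when the value is later treated as a number by code that only understands ASCII digits.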

  32. \w

  33. \w abc… 123… _

  34. \w abc… 123… _ αβγ… … ㄅㄆㄇ

  35. \w abc… 123… _ αβγ… … ㄅㄆㄇ … أ ب ج

  36. \b abc… 123… _ αβγ… … ㄅㄆㄇ … أ ب ج

  37. [A-Za-z0-9_] abc… 123… _

  38. \p{PerlWord} abc… 123… _

  39. \s

  40. \R

  41. \R LF (\n) CR (\r) FF (\f)

  42. \R LF (\n) CR (\r) FF (\f) CRLF (\r\n)

  43. \R LF (\n) CR (\r) FF (\f) CRLF (\r\n) NEL VT LS PS
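
As a rough illustration of `\R`, the sketch below splits a string that mixes several of the line terminators listed above. The sample text is made up for the example.

```perl
use strict;
use warnings;

# CRLF, LF, LINE SEPARATOR (U+2028), and NEL (U+0085) in one string.
my $text = "one\r\ntwo\nthree\x{2028}four\x{85}five";

my @lines     = split /\R/, $text;   # treats every terminator alike, CRLF as one unit
my @lf_only   = split /\n/, $text;   # only splits on LF and leaves a stray CR behind
```

Splitting on `\R` yields five clean lines, whereas splitting on `\n` yields three pieces, the first of which still ends in `\r`.
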
  44. .

  45. \X

  46. \X n ̈

  47. \X Spınal Tap ̈

  48. \X Spınal Tap ̈ n\N{COMBINING DIAERESIS}

  49. \X Spınal Tap ̈ n\N{COMBINING DIAERESIS} \r\n
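
A short sketch of what `\X` buys you, assuming Perl 5.12 or later, where `\X` matches an extended grapheme cluster. The sample word and variable names are illustrative.

```perl
use v5.12;
use warnings;

my $word = "Spin\x{0308}al";                  # n followed by COMBINING DIAERESIS

my $code_points = length $word;               # counts code points
my $graphemes   = () = $word =~ /\X/g;        # counts user-perceived characters

# Reversing by grapheme keeps the diaeresis attached to its base letter.
my $reversed = join '', reverse $word =~ /\X/g;

my $crlf_clusters = () = "\r\n" =~ /\X/g;     # CRLF is a single cluster
```

The same idea applies to truncation and cursor movement: operate on `\X` matches rather than on individual code points.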

  50. \p

  51. \p{ASCII}

  52. \P{ASCII}

  53. \p{General_Category=Letter}

  54. \p{Letter}

  55. \p{L}

  56. \pL

  57. L  Letter
      M  Mark
      N  Number
      P  Punctuation
      S  Symbol
      Z  Separator
      C  Other

  58. S   Symbol
      Sm  Math_Symbol
      Sc  Currency_Symbol
      Sk  Modifier_Symbol
      So  Other_Symbol

  59. \p{Script=Latin}

  60. \p{Latin}

  61. [\p{Hiragana} \p{Katakana} \p{Han} \p{Latin} \p{Common}]

  62. [\p{Hira} \p{Kana} \p{Hani} \p{Latn} \p{Common}]

  63. Arab  Arabic
      Beng  Bengali
      Deva  Devanagari
      Egyp  Egyptian hieroglyphs
      Ethi  Ethiopic
      Grek  Greek
      Hang  Hangul
      …
  64. return $word
          if $word =~ s{ зи $}{г}x
          || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
          || $word =~ s{ ци $}{к}x
          || $word =~ s{ (?: та | ища ) $}{}x;
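
Script properties are also handy outside of stemming. The sketch below is an assumed example, not taken from the slides: it uses `\p{Latin}` and `\p{Cyrl}` to spot a look-alike Cyrillic letter inside an otherwise Latin word, one of the security concerns the abstract alludes to.

```perl
use strict;
use warnings;

my $latin = 'paypal';
my $mixed = "p\x{0430}ypal";   # second letter is CYRILLIC SMALL LETTER A

my $latin_is_latin = $latin =~ /\A\p{Latin}+\z/ ? 1 : 0;   # all Latin
my $mixed_is_latin = $mixed =~ /\A\p{Latin}+\z/ ? 1 : 0;   # not all Latin
my $mixed_has_cyrl = $mixed =~ /\p{Cyrl}/       ? 1 : 0;   # contains Cyrillic
```

A real confusables check needs more than this, but a per-script test is a cheap first filter for identifiers such as usernames.
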
  65. lc('Größe') eq 'größe'

  66. lc('Größe') eq 'größe'
      uc('Größe') eq 'GRÖSSE'

  67. lc('Größe') eq 'größe'
      uc('Größe') eq 'GRÖSSE'
      lc('Größe') ne lc(uc('Größe'))

  68. use Unicode::CaseFold;
      fc('Größe') eq fc('GRÖSSE')

  69. use v5.16;
      fc('Größe') eq fc('GRÖSSE')
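
The case-folding slides can be tried out with the sketch below, which assumes Perl 5.16 or later for the built-in `fc`. The strings are written with escapes so that the example does not depend on the source file's encoding; the variable names are illustrative.

```perl
use v5.16;      # enables the fc feature and Unicode string semantics
use warnings;

my $mixed = "Gr\x{F6}\x{DF}e";   # "Grosse" with o-diaeresis and sharp s
my $upper = "GR\x{D6}SSE";       # the uppercase form, sharp s expanded to SS

my $uc_matches = uc($mixed) eq $upper         ? 1 : 0;   # uc expands sharp s
my $lc_differs = lc($mixed) ne lc($upper)     ? 1 : 0;   # lc cannot undo that
my $fc_equal   = fc($mixed) eq fc($upper)     ? 1 : 0;   # fc folds both alike
```

For case-insensitive comparison, then, `fc` on both sides is the reliable choice; `lc` and `uc` are meant for display.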

  70. use Unicode::Normalize; NFC('Größe') eq NFC('Gro◌̈ße')

  71. use v5.16; use Unicode::Normalize; NFC(fc('Größe')) eq NFC(fc('GRO◌̈SSE'))
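
A minimal sketch of the normalization comparison, again assuming Perl 5.16 or later and using escapes in place of literal characters. The variable names are illustrative.

```perl
use v5.16;
use warnings;
use Unicode::Normalize qw( NFC NFD );

my $composed   = "Gr\x{F6}\x{DF}e";       # o-diaeresis as a single code point
my $decomposed = "Gro\x{0308}\x{DF}e";    # o followed by COMBINING DIAERESIS

my $raw_equal = $composed eq $decomposed           ? 1 : 0;   # differ as raw strings
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;   # equal once normalized
my $nfd_len   = length(NFD($composed));                       # one extra code point

# Fold case first, then normalize, before comparing.
my $folded_equal =
    NFC(fc($composed)) eq NFC(fc("GRO\x{0308}SSE")) ? 1 : 0;
```

Both strings look identical on screen, so comparing them without normalization produces mismatches that are very hard to debug.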

  72. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output

  82. use Unicode::Collate;
      my $c = Unicode::Collate->new;
      @countries = $c->sort(@countries);

  83. use Unicode::Collate;
      my $c = Unicode::Collate->new(
          level => 2  # ignore case
      );
      $c->eq('Größe', 'GRO◌̈SSE')
  84. use Unicode::Collate::Locale;
      my $c = Unicode::Collate::Locale->new(
          locale => 'de'
      );
      @words_de = $c->sort(@words_de);
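
To see why collation matters, the sketch below compares Perl's built-in `sort`, which orders by code point, with `Unicode::Collate` using its default settings. The word list is invented for the example.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ('zebra', 'Apple', "\N{U+E9}clair", 'banana');   # U+E9 is e-acute

# Built-in sort compares code points: uppercase first, accented letters last.
my @by_code_point = sort @words;

# The default collator orders by letter first, then accents, then case.
my $collator = Unicode::Collate->new;
my @collated = $collator->sort(@words);
```

With code point order the accented word lands after "zebra", while the collator places it between "banana" and "zebra", where a reader would expect it. `Unicode::Collate::Locale` layers language-specific tailoring on top of the same interface.
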
  85. use Test::More;
      new_ok 'Text::CSV::Hashify' => [
          file     => $file,
          max_rows => '٤٠',
      ];
  86. @nickpatch