Unicode Best Practices in Perl

Developing applications to handle the natural languages and written scripts of the world, or even a small handful of them, is an impressively large task. Fortunately, Unicode provides tools to do just that. It’s more than just a character set; it’s a collection of standards for working with the world’s textual data. The problem is that Unicode itself is complex!

This talk will help make supporting Unicode easier by providing some of the best practices for your projects—whether CPAN modules, RESTful services, or web applications. We’ll briefly review Unicode and then dive into best practices for handling Unicode text in the following areas:

◦  User experience
◦  Collation (comparison and sorting)
◦  Input, output, and logging
◦  Security considerations
◦  Debugging
◦  Testing (unit tests and QA)

Presented at:
◦  2013-06-04: YAPC::NA 2013, Austin, TX

Video: http://youtu.be/X2FQHUHjo8M

Nova Patch

June 04, 2013

Transcript

  1. “Perl has some of the best Unicode support today, especially with respect to regular expressions.” Benjamin Peterson, The Guts of Unicode in Python, PyCon 2013
  2. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output
  3. use utf8;
     $word =~ s{
         (?: آباد | باره
           | بندی | بندي | ترین | ریزی | ریزي | سازی | سازي | هایی
         ) $
     }{}x;
  4. use utf8;
     $word =~ s{
         (?: ия  # definite articles for nouns:
           | ът  # ∙ masculine
           | та  # ∙ feminine
           | то  # ∙ neutral
           | те  # ∙ plural
         ) $
     }{}x;
  5. use utf8;
     use open qw( :encoding(UTF-8) :std );
     use Test::More tests => 66;
     use Lingua::Stem::UniNE::CS qw( stem );
     is stem('zvířatech'), 'zvíř', 'rm -atech';
     is stem('zvířatům'),  'zvíř', 'rm -atům';
     is stem('zvířata'),   'zvíř', 'rm -ata';
     is stem('zvířaty'),   'zvíř', 'rm -aty';
  6. use charnames ':full';
     sub remove_kasra {
         my ($word) = @_;
         $word =~ s{ \N{ARABIC KASRA} $}{}x;
         return $word;
     }
  7. use v5.16;
     sub remove_kasra {
         my ($word) = @_;
         $word =~ s{ \N{ARABIC KASRA} $}{}x;
         return $word;
     }
  8. \d

  9. \w

  10. \s

  11. \R

  12. .

  13. \X

  14. \p

  15. \pL

  16. return $word
          if $word =~ s{ зи $}{г}x
          || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
          || $word =~ s{ ци $}{к}x
          || $word =~ s{ (?: та | ища ) $}{}x;
  17. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output
  25. use Unicode::Collate;
      my $c = Unicode::Collate->new(
          level => 2  # ignore case
      );
      $c->eq('Größe', 'GRO◌̈SSE')
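
The decode ⇩ NFD ⇩ hack ⇩ NFC ⇩ encode flow and the level-2 collation from the slides above can be tied together in one short script. This is only a sketch under stated assumptions, not code from the talk: the sample word, the variable names, and the idea of counting graphemes as the "hack" step are illustrative. It uses only the core modules Encode, Unicode::Normalize, and Unicode::Collate.

```perl
use strict;
use warnings;
use utf8;                                   # string literals in this file are UTF-8
use Encode             qw( decode encode );
use Unicode::Normalize qw( NFD NFC );
use Unicode::Collate;

# Assumed input: the UTF-8 bytes for "Größe", standing in for data
# read from a file or socket.
my $input_bytes = encode('UTF-8', 'Größe');

# Decode bytes into a character string, then decompose for processing.
my $text = decode('UTF-8', $input_bytes);
my $nfd  = NFD($text);

# The "hack" step (illustrative): in NFD the "o" and its combining
# diaeresis are separate code points, but \X still sees one grapheme.
my $code_points = length $nfd;
my $graphemes   = () = $nfd =~ /\X/g;

# Recompose and encode on the way out.
my $output_bytes = encode('UTF-8', NFC($nfd));

# Level 2 compares base letters and accents but not case.
# Unicode::Collate normalizes its arguments by default, so precomposed
# and decomposed spellings of "ö" compare alike.
my $collator  = Unicode::Collate->new( level => 2 );
my $same_word = $collator->eq( 'Grösse', "GRO\x{308}SSE" );
```

Keeping the string in NFD while working on it and converting to NFC only at the output boundary means the processing code deals with one predictable form, while consumers receive the compact precomposed form.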