Unicode Best Practices

Developing applications to handle the natural languages and written scripts of the world, or even a small handful of them, is an impressively large task. Fortunately, Unicode provides tools to do just that. It’s more than just a character set; it’s a collection of standards for working with the world’s textual data. The problem is that Unicode itself is complex!

This talk will help make supporting Unicode easier by providing some of the best practices for your projects—whether CPAN modules, RESTful services, or web applications. We’ll briefly review Unicode and then dive into best practices for handling Unicode text in the following areas:

◦  User experience
◦  Collation (comparison and sorting)
◦  Input, output, and logging
◦  Security considerations
◦  Debugging
◦  Testing (unit tests and QA)

Presented at:
◦  2013-06-19: Open Source Bridge 2013, Portland, OR

Speaker notes: http://opensourcebridge.org/wiki/2013/Unicode_Best_Practices
Example code in multiple languages: https://github.com/patch/unicode-programming

Nova Patch

June 19, 2013

Transcript

  1. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ hack… hack… hack… ⇩ encode ⇩ UTF-8 encoded output
  2. / (
          ия    # definite articles for nouns:
        | ът    # ∙ masculine
        | та    # ∙ feminine
        | то    # ∙ neutral
        | те    # ∙ plural
     ) $ /x
  3. / ( آباد | باره | بندی | بندي | ترین | ریزی | ريزي | سازی | سازي | هایی ) $ /x
  4. .

  5. return $word
         if $word =~ s{ зи $}{г}x
         || $word =~ s{ е ( \p{Cyrl} ) и $}{я$1}x
         || $word =~ s{ ци $}{к}x
         || $word =~ s{ ( та | ища ) $}{}x;
  6–10. UTF-8 encoded input ⇩ decode ⇩ character string ⇩ NFD ⇨ hack… hack… hack… ⇨ NFC ⇩ encode ⇩ UTF-8 encoded output
  11–16. Testing
      characters > 7F: Nóirín
      characters > FF: 松本行弘
      characters > FFFF: ⩨
      grapheme clusters: Spın̈al Tap
      foreign digits: ٤٠
  17. use utf8;
      use open qw( :encoding(UTF-8) :std );
      use Test::More tests => 66;
      use Lingua::Stem::UniNE::CS qw( stem );
      is stem("zvířatech"), "zvíř", "rm -atech";
      is stem("zvířatům"),  "zvíř", "rm -atům";
      is stem("zvířata"),   "zvíř", "rm -ata";
      is stem("zvířaty"),   "zvíř", "rm -aty";
      …
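
The pipeline on the slides (decode, NFD, process, NFC, encode) can be sketched roughly as follows. The slides use Perl; this is a Python sketch using only the standard library, in the spirit of the multi-language example repository linked above. The function names `strip_suffix` and `process` and the one-item suffix list are hypothetical, chosen for illustration. The "hack" step here is a plain suffix removal, not the stemming algorithm from Lingua::Stem::UniNE.

```python
import unicodedata
from typing import Iterable


def strip_suffix(text: str, suffixes: Iterable[str]) -> str:
    """Hypothetical 'hack' step: remove the first matching suffix, if any."""
    for suffix in suffixes:
        if suffix and text.endswith(suffix):
            return text[: -len(suffix)]
    return text


def process(raw: bytes, suffixes: Iterable[str]) -> bytes:
    """Decode -> NFD -> process -> NFC -> encode, as on the pipeline slides."""
    # Decode at the input boundary: bytes in, character string out.
    # Malformed UTF-8 raises UnicodeDecodeError instead of passing through.
    text = raw.decode("utf-8")

    # Normalize to NFD so that canonically equivalent inputs (precomposed or
    # decomposed) look identical to the processing step.
    text = unicodedata.normalize("NFD", text)

    # The patterns must be in the same normalization form as the text.
    nfd_suffixes = [unicodedata.normalize("NFD", s) for s in suffixes]
    text = strip_suffix(text, nfd_suffixes)

    # Recompose to NFC before output, then encode at the output boundary.
    text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")


if __name__ == "__main__":
    # Same word, once precomposed and once with combining marks.
    composed = "zv\u00ed\u0159atech".encode("utf-8")
    decomposed = "zvi\u0301r\u030catech".encode("utf-8")
    print(process(composed, ["atech"]) == process(decomposed, ["atech"]))
```

Normalizing before processing means the two spellings of the input produce the same bytes on output, which is the point of placing NFD and NFC around the "hack" step.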