Fundamental Unicode in Perl

Nova Patch
February 12, 2013

Unicode defines all characters in common use throughout the world and standards for parsing, collating, and normalizing textual data. It provides the only official encodings for many protocols, languages, and serialization formats including JSON, YAML, and XML. Websites, applications, and services have to be developed with an understanding of Unicode and can greatly benefit from the features it provides. Fortunately, Perl is the premier language for Unicode programming and is rich in functionality. This talk will bring you up to speed on Unicode and its encodings and then dive into Perl hacking with Unicode.

Presented at:
◦  2012-04-14: DC–Baltimore Perl Workshop (DCBPW) 2012, Baltimore, MD
◦  2012-06-13: YAPC::NA 2012, Madison, WI
◦  2012-10-03: Shutterstock “Brown Bag Lunch” Tech Talk, New York, NY
◦  2013-02-12: New York Perl Mongers (NY.pm), New York, NY

Video: http://youtu.be/iZgqhVu72zc

Transcript

  1. What Is a Character?

    “The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape.” (The Unicode Consortium)
  2. What Is a Glyph?

    Glyphs are visual representations of characters. Fonts are collections of glyphs. There may be many different glyphs for the same character. This talk is not about fonts or glyphs.
  3. Character Set

    Many people use “character set” to mean one or more of these:

    ◦  Character Code
    ◦  Character Encoding
    ◦  Character Repertoire

    Which makes for a confusing situation.
  4. Character Code

    A defined mapping of characters to numbers (shown here in hexadecimal).

        A ⇒ 41
        B ⇒ 42
        C ⇒ 43

    Each value in a character code is called a code point.
  5. Character Encoding

    An algorithm to convert code points to a digital form for ease of transmitting or storing data.

        41 (A) ⇒ 1000001
        42 (B) ⇒ 1000010
        43 (C) ⇒ 1000011
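The two slides above can be reproduced with Perl's built-in `ord` and `chr`. This is only a minimal sketch: the `%code_point` hash and `$round_trip` variable are names chosen for illustration, and the binary output is the plain bit pattern of the number, not any particular encoding's wire format.

```perl
use strict;
use warnings;

# ord() maps a character to its code point; chr() is the inverse.
my %code_point = map { $_ => ord $_ } qw( A B C );

for my $char (sort keys %code_point) {
    # %X prints the code point in hexadecimal, %b in binary.
    printf "%s => %X => %b\n", $char, $code_point{$char}, $code_point{$char};
}

# Going the other way: code point 0x41 back to a character.
my $round_trip = chr 0x41;
print "0x41 => $round_trip\n";
```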
  6. Character Repertoire

    A character repertoire is a collection of distinct characters. Character codes, keyboards, and written languages all have well-defined character repertoires.
  7. Character Codes & Encodings

    ASCII
    ◦  character code: 128 code points
    ◦  character encoding: 7 bits each

    Latin 1 (ISO-8859-1)
    ◦  character code: 256 code points
    ◦  character encoding: 8 bits (1 byte) each
  8. Character Codes & Encodings

    Unicode
    ◦  character code: 1,114,112 code points, of which 1,112,064 can be encoded (the rest are reserved for surrogates); 110,000+ defined
    ◦  character encodings:
       UTF-8: 1 to 4 bytes each
       UTF-16: 2 or 4 bytes each
       UTF-32: 4 bytes each
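The byte counts above can be checked with the core `Encode` module. The sketch below is an illustration rather than part of the talk: it encodes the three code points used on the Code Points slide and counts the resulting bytes. The `%bytes` hash is a hypothetical name, and the `UTF-16BE` and `UTF-32BE` variants are chosen on the assumption that you want to measure the characters alone, since the plain `UTF-16` and `UTF-32` encoders prepend a byte order mark.

```perl
use strict;
use warnings;
use Encode qw( encode );

# The three code points shown on the Code Points slide.
my @code_points = ( 0x0041, 0x0ED3, 0x1F4A9 );

my %bytes;   # "U+XXXX" => { encoding name => byte count }
for my $cp (@code_points) {
    my $label = sprintf 'U+%04X', $cp;
    for my $encoding (qw( UTF-8 UTF-16BE UTF-32BE )) {
        # encode() returns a byte string, so length() counts bytes here.
        $bytes{$label}{$encoding} = length encode($encoding, chr $cp);
    }
    printf "%-8s UTF-8: %d  UTF-16: %d  UTF-32: %d\n",
        $label, @{ $bytes{$label} }{qw( UTF-8 UTF-16BE UTF-32BE )};
}
```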
  9. Code Points

        A    U+0041    LATIN CAPITAL LETTER A
        ໓    U+0ED3    LAO DIGIT THREE
        💩   U+1F4A9   PILE OF POO
  10. Code Points

    Some code points have precomposed diacritics.

        ȫ    U+022B    LATIN SMALL LETTER O WITH DIAERESIS AND MACRON
  11. Code Points

    Other characters must be composed from multiple code points using “combining characters.”

        n̈    U+006E    LATIN SMALL LETTER N
             U+0308    COMBINING DIAERESIS
  12. Grapheme Clusters

    Any series of code points that are composed into a single user-perceived character. Informally known as “graphemes.”

        A       (U+0041)
        n̥̈       (U+006E U+0308 U+0325)
        CRLF    (U+000D U+000A)
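The three examples above can be checked with the `\X` regex escape, which matches one extended grapheme cluster. The following is a rough sketch, assuming Perl 5.12 or later (where `\X` follows the Unicode definition of extended grapheme clusters); the `%example` and `%count` hashes are names made up for this illustration.

```perl
use v5.12;
use warnings;

# The three grapheme clusters from the slide, written as code points.
my %example = (
    'A'    => "\x{0041}",
    'n'    => "\x{006E}\x{0308}\x{0325}",
    'CRLF' => "\x{000D}\x{000A}",
);

my %count;   # name => [ code points, grapheme clusters ]
for my $name (sort keys %example) {
    my $code_points = length $example{$name};
    # A match in list context assigned to () yields the number of matches.
    my $clusters = () = $example{$name} =~ /\X/g;
    $count{$name} = [ $code_points, $clusters ];
    printf "%-4s code points: %d, grapheme clusters: %d\n",
        $name, $code_points, $clusters;
}
```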
  13. String constants ... TIMTOWTDI

        use charnames qw( :full );
        say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
  14. String constants ... TIMTOWTDI

        use charnames qw( :full );
        say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";

        use utf8;
        say '¡jalapeño!';
  15. I/O

        open my $fh, '<:encoding(UTF-8)', $filename;
        open my $fh, '>:encoding(UTF-8)', $filename;

        binmode $fh, ':encoding(UTF-8)';
        binmode STDIN, ':encoding(UTF-8)';
  16. I/O

        use open qw( :encoding(UTF-8) );
        open my $fh, '<', $filename;

        # :std for STDIN, STDOUT, STDERR
        use open qw( :encoding(UTF-8) :std );
  17. I/O

        use open qw( :encoding(UTF-8) );
        open my $fh, '<', $filename;

        # :std for STDIN, STDOUT, STDERR
        use open qw( :encoding(UTF-8) :std );

        # CPAN module to enable everything UTF-8
        use utf8::all;
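To see what the `:encoding(UTF-8)` layer does without touching the filesystem, the sketch below uses in-memory filehandles as a stand-in for real files. That substitution is an assumption made for the sake of a self-contained example, and the variable names (`$bytes`, `$line`, `$buffer`) are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical data: the UTF-8 bytes for U+044E U+0301, plus a newline.
my $bytes = "\xD1\x8E\xCC\x81\n";

# Reading: the layer decodes bytes into characters as they are read.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;

# Writing: the layer encodes characters back into bytes.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$out} $line, "\n";
close $out;

printf "read %d characters, wrote %d bytes\n", length $line, length $buffer;
```

With a real file, replace the scalar references with `$filename` as on the slides.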
  18. Explicit Encoding & Decoding

        use Encode;
        my $internal = decode('UTF-8', $input);
        my $output   = encode('UTF-8', $internal);
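A complete round trip might look like the sketch below. The `$input` bytes are a hypothetical stand-in for data arriving from a socket or file; everything else mirrors the slide.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical input: the UTF-8 bytes for "¡jalapeño!".
my $input = "\xC2\xA1jalape\xC3\xB1o!";

my $internal = decode('UTF-8', $input);      # bytes -> characters
my $output   = encode('UTF-8', $internal);   # characters -> bytes

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length $input, length $internal, length $output;
```

The decoded string is shorter than the input because `¡` and `ñ` each take two bytes in UTF-8 but count as one character once decoded.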
  19. String Length

    Let’s use this grapheme cluster as the string in our next example:

        ю́    U+044E    CYRILLIC SMALL LETTER YU
             U+0301    COMBINING ACUTE ACCENT
  20. String Length

        # UTF-8 encoded: D1 8E CC 81
        say length $encoded_grapheme;  # 4

        use Encode;

        # Unicode string: 044E 0301
        my $grapheme = decode('UTF-8', $encoded_grapheme);
        say length $grapheme;  # 2
  21. String Length

        # UTF-8 encoded: D1 8E CC 81
        say length $encoded_grapheme;  # 4

        use Encode;

        # Unicode string: 044E 0301
        my $grapheme = decode('UTF-8', $encoded_grapheme);
        say length $grapheme;  # 2

        my $length = () = $grapheme =~ /\X/g;
        say $length;  # 1
  22. String Length

        # sort of complex for a simple length, eh?
        my $length = () = $str =~ /\X/g;
        say $length;
  23. String Length

        # sort of complex for a simple length, eh?
        my $length = () = $str =~ /\X/g;
        say $length;

        # and tricky depending on the context
        say scalar( () = $str =~ /\X/g );
  24. String Length

        # sort of complex for a simple length, eh?
        my $length = () = $str =~ /\X/g;
        say $length;

        # and tricky depending on the context
        say scalar( () = $str =~ /\X/g );

        # a little better (start from a fresh counter)
        my $count = 0;
        $count++ while $str =~ /\X/g;
        say $count;
  25. String Length

        # an alternative approach
        use Unicode::GCString;
        say Unicode::GCString->new($str)->length;

        # and yet another (Warning: I wrote it!)
        use Unicode::Util qw( grapheme_length );
        say grapheme_length($str);
  26. Collation

    Perl provides a collation algorithm based on code points.

        @words = qw( Äpfel durian Xerxes );
        sort @words;  # Xerxes durian Äpfel
  27. Collation

    Perl provides a collation algorithm based on code points.

        @words = qw( Äpfel durian Xerxes );
        sort @words;  # Xerxes durian Äpfel

        sort { lc $a cmp lc $b } @words;  # durian Xerxes Äpfel
  28. Collation

    Unicode Collation Algorithm (UCA) provides collation based on natural language usage.

        use Unicode::Collate;
        my $collator = Unicode::Collate->new;
        $collator->sort(@words);  # Äpfel durian Xerxes
  29. Collation

    Unicode Collation Algorithm (UCA) provides collation based on natural language usage.

        $collator->sort(@names)
        $collator->cmp($a, $b)
        $collator->gt($x, $y)
        $collator->eq($foo, $bar)
  30. Collation

    UCA also provides locale-specific collations for different languages.

        use Unicode::Collate::Locale;

        my $kolator = Unicode::Collate::Locale->new(
            locale => 'pl'  # Polish
        );
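The three orderings from the Collation slides can be compared side by side. This is a sketch under a couple of assumptions: the source file is saved as UTF-8 (hence `use utf8`), and the default `Unicode::Collate` settings are used with no locale tailoring. The `@by_*` array names are illustrative.

```perl
use strict;
use warnings;
use utf8;                 # the source below contains a literal Ä
use Unicode::Collate;

my @words = qw( Äpfel durian Xerxes );

# Built-in sort compares code points: uppercase ASCII, lowercase ASCII, then Ä.
my @by_code_point = sort @words;

# Case-folding first fixes Xerxes, but Ä still sorts after every ASCII letter.
my @by_lowercase = sort { lc $a cmp lc $b } @words;

# The UCA treats Ä as a variant of A at the primary level.
my @by_uca = Unicode::Collate->new->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code point: @by_code_point\n";
print "lowercase:  @by_lowercase\n";
print "UCA:        @by_uca\n";
```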
  31. Normalization

    Unicode has 4 normalization forms. The most important are:

    ◦  NFD: Normalization Form Canonical Decomposition
    ◦  NFC: Normalization Form Canonical Composition
  32. Normalization

        use Unicode::Normalize;

        # NFD can be helpful on input
        $str = NFD($input);

        # NFC is recommended on output
        $output = NFC($str);
  33. Normalization

        UTF-8 encoded input
        ⇩ decode
        ⇩ NFD
        Perl Unicode string
        ⇩ NFC
        ⇩ encode
        UTF-8 encoded output
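The pipeline above can be written out end to end. The sketch below assumes a hypothetical one-character input, the UTF-8 bytes for precomposed `ñ` (U+00F1), and does no processing between the two normalization steps; a real program would do its work on `$str`.

```perl
use strict;
use warnings;
use Encode qw( decode encode );
use Unicode::Normalize qw( NFD NFC );

# Hypothetical input: UTF-8 bytes for precomposed U+00F1.
my $input = "\xC3\xB1";

# Input side: decode bytes to characters, then decompose.
my $str = NFD( decode('UTF-8', $input) );
my @decomposed = map { sprintf 'U+%04X', ord } split //, $str;

# Output side: recompose, then encode characters to bytes.
my $output = encode( 'UTF-8', NFC($str) );

print "decomposed: @decomposed\n";
printf "output bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $output;
```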
  34. Unicode Semantics

    By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this:
  35. Unicode Semantics

    By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this:

        utf8::upgrade($str);
  36. Unicode Semantics

    By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this:

        utf8::upgrade($str);

        use v5.12;
  37. Unicode Semantics

    By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this:

        utf8::upgrade($str);

        use v5.12;

        use feature 'unicode_strings';
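One way to observe the bug is with `lc` on a character in the Latin 1 range. The sketch below is illustrative only: it deliberately avoids `use v5.12` at file scope so that the default behavior is visible, assumes an ASCII platform with no `use locale` in effect, and uses variable names invented for this example.

```perl
use strict;
use warnings;

# U+00C9 (É) held in Perl's compact one-byte-per-character form.
my $native = "\xC9";

# Default semantics: characters 128-255 are treated as caseless.
my $bug = lc $native;

# Fix 1: upgrade the string's internal representation.
my $upgraded = $native;
utf8::upgrade($upgraded);
my $fixed_by_upgrade = lc $upgraded;

# Fix 2: enable Unicode string semantics in this lexical scope.
my $fixed_by_feature = do {
    use feature 'unicode_strings';
    lc $native;
};

printf "default: U+%04X, upgraded: U+%04X, feature: U+%04X\n",
    ord $bug, ord $fixed_by_upgrade, ord $fixed_by_feature;
```

Both fixes produce U+00E9 (é), while the default leaves the character unchanged.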
  38. UTF-8 vs. utf8 vs. :utf8

    You’ll see the “utf8” encoding used frequently in Perl. “utf8” follows the UTF-8 standard very loosely and allows many errors in your data without warnings. By default, use “UTF-8” instead.
  39. UTF-8 vs. utf8 vs. :utf8

        # utf8 is Perl's internal encoding form
        my $internal = decode('utf8', $input);

        # UTF-8 is the official UTF-8 encoding
        my $internal = decode('UTF-8', $input);
  40. UTF-8 vs. utf8 vs. :utf8

        # utf8 is Perl's internal encoding form
        my $internal = decode('utf8', $input);

        # UTF-8 is the official UTF-8 encoding
        my $internal = decode('UTF-8', $input);

        # insecure! no encoding validation at all
        open my $fh, '<:utf8', $filename;

        # proper UTF-8 validation
        open my $fh, '<:encoding(UTF-8)', $filename;
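The difference between the two spellings shows up in which encoding object `Encode` resolves them to, and in how the strict one treats data the standard forbids. The sketch below is a hedged illustration: the sample bytes (a UTF-8-style sequence for the surrogate U+D800) are a hypothetical bad input, and only the strict decoder's behavior is checked here.

```perl
use strict;
use warnings;
use Encode qw( decode find_encoding );

# Encode resolves the two spellings to two different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

# Hypothetical bad input: a byte sequence for the surrogate U+D800,
# which the UTF-8 standard does not allow.
my $bytes = "\xED\xA0\x80";

# FB_CROAK makes decode() die on invalid data. A true CHECK value may
# also modify the source string, so decode a copy.
my $copy = $bytes;
my $strict_ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;

print "'UTF-8' resolves to: $strict_name\n";
print "'utf8' resolves to:  $lax_name\n";
print 'strict decode of a surrogate: ',
    ( $strict_ok ? 'accepted' : 'rejected' ), "\n";
```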