Fundamental Unicode in Perl

Nova Patch
February 12, 2013

Unicode defines all characters in common use throughout the world and standards for parsing, collating, and normalizing textual data. It provides the only official encodings for many protocols, languages, and serialization formats including JSON, YAML, and XML. Websites, applications, and services have to be developed with an understanding of Unicode and can greatly benefit from the features it provides. Fortunately, Perl is the premier language for Unicode programming and is rich in functionality. This talk will bring you up to speed on Unicode and its encodings and then dive into Perl hacking with Unicode.

Presented at:
◦  2012-04-14: DC–Baltimore Perl Workshop (DCBPW) 2012, Baltimore, MD
◦  2012-06-13: YAPC::NA 2012, Madison, WI
◦  2012-10-03: Shutterstock “Brown Bag Lunch” Tech Talk, New York, NY
◦  2013-02-12: New York Perl Mongers (NY.pm), New York, NY

Video: http://youtu.be/iZgqhVu72zc

Transcript

  1. Fundamental Unicode Nick Patch

  2. What Is a Character?

    “The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape.” (The Unicode Consortium)
  3. What Is a Glyph?

    Glyphs are visual representations of characters. Fonts are collections of glyphs. There may be many different glyphs for the same character. This talk is not about fonts or glyphs.
  4. Letters: a b c π ث й

  5. Numbers: 1 2 3 ໓ ๓ ३

  6. Punctuation: . / ? 「 « » 」

  7. Symbols: ™ © ≠ ☺ ☠

  8. Control Characters

    CARRIAGE RETURN
    NO-BREAK SPACE
    COMBINING GRAPHEME JOINER
    RIGHT-TO-LEFT MARK
  9. Character Set

    Many people use “character set” to mean one or more of these:
    ◦  Character Code
    ◦  Character Encoding
    ◦  Character Repertoire
    This makes for a confusing situation.
  10. Character Code

    A defined mapping of characters to numbers (shown here in hexadecimal).
    A ⇒ 41
    B ⇒ 42
    C ⇒ 43
    Each value in a character code is called a code point.
  11. Character Encoding

    An algorithm to convert code points to a digital form for ease of transmitting or storing data.
    41 (A) ⇒ 1000001
    42 (B) ⇒ 1000010
    43 (C) ⇒ 1000011
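
The mapping on the two slides above can be reproduced with Perl built-ins. This is a minimal sketch, not part of the talk: `ord` returns a character's code point, and `sprintf` renders it in hexadecimal and as a 7-bit binary string. The variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'say';

# ord() gives the code point; %X and %07b format it as hex and 7-bit binary.
for my $char (qw( A B C )) {
    my $code_point = ord $char;
    say sprintf '%s => %X (hex) => %07b (binary)',
        $char, $code_point, $code_point;
}
```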
  12. Character Repertoire

    A character repertoire is a collection of distinct characters. Character codes, keyboards, and written languages all have well-defined character repertoires.
  14. Character Codes & Encodings

    ASCII
        character code: 128 code points
        character encoding: 7 bits each
    Latin 1 (ISO-8859-1)
        character code: 256 code points
        character encoding: 8 bits (1 byte) each
  16. Character Codes & Encodings

    Unicode (character code)
        1,112,064 code points (110,000+ defined)
    character encodings:
        UTF-8: 1 to 4 bytes each
        UTF-16: 2 or 4 bytes each
        UTF-32: 4 bytes each
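
To see those byte counts in practice, the following sketch encodes three code points with the core `Encode` module and records how many octets each encoding produces. It is an illustration rather than code from the talk. The big-endian names `UTF-16BE` and `UTF-32BE` are used on the assumption that a byte order mark should not be counted; the `%bytes_for` hash is a hypothetical helper.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw( encode );

# Octet counts per encoding for U+0041, U+0ED3, and U+1F4A9.
my %bytes_for;

for my $code_point (0x0041, 0x0ED3, 0x1F4A9) {
    my $char = chr $code_point;
    for my $encoding (qw( UTF-8 UTF-16BE UTF-32BE )) {
        my $octets = encode($encoding, $char);
        $bytes_for{$encoding}{$code_point} = length $octets;
        say sprintf 'U+%04X in %-8s takes %d byte(s)',
            $code_point, $encoding, length $octets;
    }
}
```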
  17. Code Points

    A U+0041 LATIN CAPITAL LETTER A
    ໓ U+0ED3 LAO DIGIT THREE
    💩 U+1F4A9 PILE OF POO
  18. Code Points

    Some code points have precomposed diacritics.
    ȫ U+022B LATIN SMALL LETTER O WITH DIAERESIS AND MACRON
  19. Code Points

    Other characters must be composed from multiple code points using “combining characters.”
    n̈ U+006E LATIN SMALL LETTER N
      U+0308 COMBINING DIAERESIS
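
The code point names on these slides can be looked up from Perl itself. A small sketch, assuming only the core `charnames` module: `charnames::viacode` returns the official name for a code point, and `length` shows that the composed n̈ is still two code points. The variable `$n_diaeresis` is illustrative.

```perl
use strict;
use warnings;
use feature 'say';
use charnames ();

# Look up the Unicode name of each code point.
for my $code_point (0x0041, 0x0ED3, 0x022B, 0x006E, 0x0308) {
    say sprintf 'U+%04X %s', $code_point, charnames::viacode($code_point);
}

# LATIN SMALL LETTER N followed by COMBINING DIAERESIS.
my $n_diaeresis = "n\x{0308}";
say 'code points in composed character: ', length $n_diaeresis;
```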
  20. Grapheme Clusters

    Any series of code points that are composed into a single user-perceived character. Informally known as “graphemes.”
    A (U+0041)
    n̥̈ (U+006E U+0308 U+0325)
    CRLF (U+000D U+000A)
  21. Time for some…

    🐪 U+1F42A DROMEDARY CAMEL

  23. String constants ... TIMTOWTDI

        # ¡jalapeño! (ñ is U+00F1; U+00D1 would be the capital Ñ)
        say "\x{A1}jalape\x{F1}o!";

        use v5.12;
        say "\N{U+00A1}jalape\N{U+00F1}o!";
  25. String constants ... TIMTOWTDI

        use charnames qw( :full );
        say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";

        use utf8;
        say '¡jalapeño!';
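
The four spellings shown on the string constant slides should all produce the same string. The sketch below checks that, assuming the source file is saved as UTF-8 (required for the `use utf8` literal) and using U+00F1 for ñ. The variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;
use feature 'say';
use charnames qw( :full );

my $hex_escapes  = "\x{A1}jalape\x{F1}o!";
my $code_points  = "\N{U+00A1}jalape\N{U+00F1}o!";
my $char_names   = "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
my $utf8_literal = '¡jalapeño!';

# eq compares character by character, so all four should agree.
my $all_equal = $hex_escapes eq $code_points
    && $code_points eq $char_names
    && $char_names  eq $utf8_literal;

binmode STDOUT, ':encoding(UTF-8)';
say $utf8_literal, ' has ', length $utf8_literal, ' characters';
say $all_equal ? 'all four spellings are equal' : 'spellings differ';
```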
  26. String constants ... POD

        =encoding UTF-8

        =head1 ¡jalapeño!

  27. I/O

    UTF-8 encoded input ⇩ decode ⇩ Perl Unicode string ⇩ encode ⇩ UTF-8 encoded output
  29. I/O

        open my $fh, '<:encoding(UTF-8)', $filename;
        open my $fh, '>:encoding(UTF-8)', $filename;

        binmode $fh, ':encoding(UTF-8)';
        binmode STDIN, ':encoding(UTF-8)';
  32. I/O

        use open qw( :encoding(UTF-8) );
        open my $fh, '<', $filename;

        # :std for STDIN, STDOUT, STDERR
        use open qw( :encoding(UTF-8) :std );

        # CPAN module to enable everything UTF-8
        use utf8::all;
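
A round trip through a file makes the decode and encode steps of the I/O slides visible. This sketch is an illustration under a few assumptions: it uses the core `File::Temp` module to get a scratch file, writes a two-code-point string through an `:encoding(UTF-8)` layer, then reads it back once as raw octets and once as decoded characters. All variable names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# CYRILLIC SMALL LETTER YU + COMBINING ACUTE ACCENT
my $text = "\x{044E}\x{0301}";

my ($tmp_fh, $filename) = tempfile( UNLINK => 1 );
close $tmp_fh;

# Encode on output.
open my $out, '>:encoding(UTF-8)', $filename
    or die "Cannot write $filename: $!";
print {$out} $text;
close $out;

# Read the raw octets, with no decoding.
open my $raw, '<:raw', $filename or die "Cannot read $filename: $!";
my $octets = do { local $/; <$raw> };
close $raw;

# Decode on input.
open my $in, '<:encoding(UTF-8)', $filename
    or die "Cannot read $filename: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "octets on disk: %d, characters after decoding: %d\n",
    length($octets), length($chars);
```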
  33. Explicit Encoding & Decoding

        use Encode;
        my $internal = decode('UTF-8', $input);
        my $output = encode('UTF-8', $internal);
  34. String Length

    Let’s use this grapheme cluster as the string in our next example:
    ю́
    U+044E CYRILLIC SMALL LETTER YU
    U+0301 COMBINING ACUTE ACCENT
  37. String Length

        # UTF-8 encoded: D1 8E CC 81
        say length $encoded_grapheme; # 4

        use Encode;
        # Unicode string: 044E 0301
        my $grapheme = decode('UTF-8', $encoded_grapheme);
        say length $grapheme; # 2

        my $length = () = $grapheme =~ /\X/g;
        say $length; # 1
  40. String Length

        # sort of complex for a simple length, eh?
        my $length = () = $str =~ /\X/g;
        say $length;

        # and tricky depending on the context
        say scalar( () = $str =~ /\X/g );

        # a little better (start from a fresh counter)
        my $count = 0;
        $count++ while $str =~ /\X/g;
        say $count;
  42. String Length

        # an alternative approach
        use Unicode::GCString;
        say Unicode::GCString->new($str)->length;

        # and yet another (Warning: I wrote it!)
        use Unicode::Util qw( grapheme_length );
        say grapheme_length($str);
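
The three notions of length from the String Length slides can be compared side by side using only core Perl. This is a self-contained sketch: the octets are written as escapes so it does not depend on the source file encoding, and the `$..._count` variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw( decode );

# UTF-8 octets for U+044E U+0301
my $encoded_grapheme = "\xD1\x8E\xCC\x81";
my $grapheme         = decode('UTF-8', $encoded_grapheme);

my $octet_count      = length $encoded_grapheme;      # encoded bytes
my $code_point_count = length $grapheme;              # decoded characters
my $grapheme_count   = () = $grapheme =~ /\X/g;       # grapheme clusters

say "octets: $octet_count";
say "code points: $code_point_count";
say "grapheme clusters: $grapheme_count";
```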
  43. Collation

    Standard ordering of strings for comparison and sorting.
        sort @names
        $a cmp $b
        $x gt $y
        $foo eq $bar
  46. Collation

    Perl provides a collation algorithm based on code points.
        @words = qw( Äpfel durian Xerxes );

        sort @words
        # Xerxes durian Äpfel

        sort { lc $a cmp lc $b } @words
        # durian Xerxes Äpfel
  48. Collation

    Unicode Collation Algorithm (UCA) provides collation based on natural language usage.
        use Unicode::Collate;
        my $collator = Unicode::Collate->new;
        $collator->sort(@words);
        # Äpfel durian Xerxes
  49. Collation

    Unicode Collation Algorithm (UCA) provides collation based on natural language usage.
        $collator->sort(@names)
        $collator->cmp($a, $b)
        $collator->gt($x, $y)
        $collator->eq($foo, $bar)
  51. Collation

    UCA also provides locale-specific collations for different languages.
        use Unicode::Collate::Locale;

        my $kolator = Unicode::Collate::Locale->new(
            locale => 'pl'  # Polish
        );
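
The three kinds of ordering from the Collation slides can be compared on one word list. This sketch is not from the talk: the word list and the choice of the Swedish (`sv`) locale are assumptions made for illustration, chosen because Swedish treats ä as a separate letter after z. The source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;
use feature 'say';
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my @words = qw( Zebra äpple apple );

# Built-in sort: plain code point order.
my @by_code_point = sort @words;

# Default UCA: accents are a secondary difference.
my @by_uca = Unicode::Collate->new->sort(@words);

# Swedish tailoring: ä sorts after z.
my @by_swedish = Unicode::Collate::Locale->new( locale => 'sv' )->sort(@words);

say "code point: @by_code_point";
say "UCA:        @by_uca";
say "Swedish:    @by_swedish";
```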
  52. Normalization

    Unicode has 4 normalization forms. The most important are:
    NFD: Normalization Form Canonical Decomposition
    NFC: Normalization Form Canonical Composition
  53. Normalization

        use Unicode::Normalize;

        # NFD can be helpful on input
        $str = NFD($input);

        # NFC is recommended on output
        $output = NFC($str);
  54. Normalization

    UTF-8 encoded input ⇩ decode ⇩ NFD ⇩ Perl Unicode string ⇩ NFC ⇩ encode ⇩ UTF-8 encoded output
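
A short sketch of what the two forms do to a single character, using the precomposed ȫ (U+022B) from the Code Points slides as an example. It relies only on the core `Unicode::Normalize` module; the variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'say';
use Unicode::Normalize qw( NFD NFC );

# LATIN SMALL LETTER O WITH DIAERESIS AND MACRON
my $precomposed = "\x{022B}";

# NFD splits it into a base letter plus combining marks.
my $decomposed = NFD($precomposed);
say join ' ', map { sprintf 'U+%04X', ord } split //, $decomposed;

# The two forms are not equal until they are normalized the same way.
my $equal_raw        = $precomposed eq $decomposed ? 1 : 0;
my $equal_normalized = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

say "equal as-is: $equal_raw, equal after NFC: $equal_normalized";
```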
  58. Unicode Semantics

    By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this:
        utf8::upgrade($str);
        use v5.12;
        use feature 'unicode_strings';
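
The bug is easiest to see with a character in the Latin 1 range. In this sketch, é (U+00E9) is written as a native 8-bit string and tested against `\w` three ways. It deliberately avoids `use v5.12` at the top, because that would enable `unicode_strings` for the whole file and hide the effect. The variable names are illustrative.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER E WITH ACUTE as a native 8-bit string
my $str = "\xE9";

# Without Unicode semantics, only ASCII counts as a word character.
my $matches_before = $str =~ /\w/ ? 1 : 0;

# Fix 1: upgrade the string's internal representation.
utf8::upgrade($str);
my $matches_after = $str =~ /\w/ ? 1 : 0;

# Fix 2: enable the feature lexically (use v5.12 does this too).
my $matches_with_feature = do {
    use feature 'unicode_strings';
    my $native = "\xE9";
    $native =~ /\w/ ? 1 : 0;
};

printf "native: %d, upgraded: %d, unicode_strings: %d\n",
    $matches_before, $matches_after, $matches_with_feature;
```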
  59. UTF-8 vs. utf8 vs. :utf8

    You’ll see the “utf8” encoding used frequently in Perl. “utf8” follows the UTF-8 standard very loosely and allows many errors in your data without warnings. By default, use “UTF-8” instead.
  61. UTF-8 vs. utf8 vs. :utf8

        # utf8 is Perl's internal encoding form
        my $internal = decode('utf8', $input);

        # UTF-8 is the official UTF-8 encoding
        my $internal = decode('UTF-8', $input);

        # insecure! no encoding validation at all
        open my $fh, '<:utf8', $filename;

        # proper UTF-8 validation
        open my $fh, '<:encoding(UTF-8)', $filename;
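
One way to make invalid input fail loudly is to pass `FB_CROAK` as the check argument to `decode`. The sketch below is an illustration, not the talk's code: `try_decode` is a hypothetical helper, and the sample octets are assumptions chosen to exercise it (a valid sequence, a byte that can never appear in UTF-8, and an encoded surrogate). What the lax `utf8` encoding accepts may vary by Encode version, so the code prints the outcome instead of assuming it.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw( decode FB_CROAK LEAVE_SRC );

# Hypothetical helper: report whether decoding succeeds under FB_CROAK.
sub try_decode {
    my ($encoding, $octets) = @_;
    my $decoded = eval { decode($encoding, $octets, FB_CROAK | LEAVE_SRC) };
    return defined $decoded ? 'ok' : 'rejected';
}

my %samples = (
    'valid U+044E'      => "\xD1\x8E",
    'invalid byte FF'   => "\xFF",
    'encoded surrogate' => "\xED\xA0\x80",
);

for my $label (sort keys %samples) {
    for my $encoding (qw( UTF-8 utf8 )) {
        say sprintf '%-18s as %-5s: %s',
            $label, $encoding, try_decode($encoding, $samples{$label});
    }
}
```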
  62. Questions?

    Slides will be posted to: @nickpatch