What is Unicode?

Kimtaro
December 15, 2016

A brief look at what Unicode is, what it contains and what we can do with it.

Presented at Web Platform London, December 14. https://www.meetup.com/WebPlatform-London/events/236089771/

Transcript

  1. Why we need Unicode
    What Unicode is and how it's organized
    Terminology
    Sorting
    Emoji
    Practical considerations in JavaScript, MySQL, regular expressions, and malformed text
    Helper data and libraries
  2. Why do we need Unicode?
    Reliably write, store and send text
    Covers most living languages, and many historical ones
    Does away with a multitude of standards
  3. What is Unicode?
    A numbered list of characters
    Ways to encode these numbers
    Information about the characters
    Algorithms for sorting, displaying, etc.
  4. A numbered list of characters
    0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
    304B;HIRAGANA LETTER KA;Lo;0;L;;;;;N;;;;;
    1F409;DRAGON;So;0;ON;;;;;N;;;;;
    http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
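The records above are semicolon-separated fields. As a minimal sketch, assuming the field positions of UnicodeData.txt (code point, name, general category, then bidirectional class at index 4 and simple lowercase mapping at index 13), they can be parsed like this. The `parse_record` helper and its dictionary keys are illustrative names, not part of the talk.

```python
import unicodedata

# The three records shown on the slide.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
304B;HIRAGANA LETTER KA;Lo;0;L;;;;;N;;;;;
1F409;DRAGON;So;0;ON;;;;;N;;;;;
"""


def parse_record(line):
    """Split one UnicodeData.txt record into a few named fields (hypothetical helper)."""
    fields = line.split(";")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "general_category": fields[2],
        "bidi_class": fields[4],
        "simple_lowercase": fields[13],  # empty string when there is no mapping
    }


records = [parse_record(line) for line in SAMPLE.splitlines()]

for record in records:
    ch = chr(record["code_point"])
    # Python's bundled copy of the character database should agree with the file.
    print(f"U+{record['code_point']:04X}", record["name"], unicodedata.category(ch))
```

Python's `unicodedata` module exposes the same data, so the parsed names can be cross-checked against `unicodedata.name`.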
  5. Ways to encode these numbers
    Character  Code point  UTF-8       UTF-16      UTF-32
    A          U+0041      0x41        0x0041      0x00000041
    か         U+304B      0xE3818B    0x304B      0x0000304B
    🐉         U+1F409     0xF09F9089  0xD83DDC09  0x0001F409
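The values on this slide can be reproduced with Python's built-in codecs. This is a small sketch; the `encodings` helper is a hypothetical name, and the big-endian codec variants are used so that no byte order mark is prepended.

```python
def encodings(ch):
    """Return the code point and the hex bytes of one character in three encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": ch.encode("utf-8").hex().upper(),
        "utf-16": ch.encode("utf-16-be").hex().upper(),
        "utf-32": ch.encode("utf-32-be").hex().upper(),
    }


for ch in ("A", "\u304B", "\U0001F409"):
    row = encodings(ch)
    print(row["code_point"], row["utf-8"], row["utf-16"], row["utf-32"])
```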
  6. Information about the characters
    0600..0604 ; Arabic # Cf [5] ARABIC NUMBER SIGN..ARABIC SIGN SAMVAT
    30FB;W # Po KATAKANA MIDDLE DOT
    1F1F8 1F1EA ; Emoji_Flag_Sequence # 6.0 [1] (🇸🇪) Flag for Sweden
    http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
    http://www.unicode.org/Public/9.0.0/ucd/EastAsianWidth.txt
    http://www.unicode.org/Public/emoji/3.0//emoji-sequences.txt
  7. Algorithms for sorting, displaying, etc.
    http://unicode.org/reports/
    Bidirectional Algorithm
    Line Breaking Algorithm
    Text Segmentation
    Collation Algorithm
    Regular Expressions
  8. Organization
    17 planes, with 65,536 code points each
    Blocks of related characters
    Modern scripts in Plane 0: BMP, 0000-FFFF
    Emoji in Plane 1: SMP, 10000-1FFFF
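Because each plane spans 0x10000 (65,536) code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with `plane_of` as an illustrative helper name:

```python
import sys

PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16


def plane_of(ch):
    """Return the plane number (0-16) that a single character belongs to."""
    return ord(ch) >> 16


print(plane_of("A"))            # Plane 0, the BMP
print(plane_of("\U0001F409"))   # Plane 1, the SMP
print(PLANE_SIZE * PLANE_COUNT) # size of the whole code space
```

The product of the two constants, 1,114,112, matches `sys.maxunicode + 1` in Python 3.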
  9. Glyph: Á Á
    Grapheme: Á
    Character sequence: A + ◌́
    Code points: U+0041 + U+0301; U+1F600, or U+D83D + U+DE00 as a UTF-16 surrogate pair
    Code units (UTF-8): 41 CC 81; F0 9F 98 80
  10. Character  Bytes  Code points  Glyphs
    $          1      1            1
    ¢          2      1            1
    €          3      1            1
    (emoji)    4      1            1
    ⛪         3      1            1
    (emoji)    4      1            1
    1⃣         4      2            1
    (emoji)    8      2            1
    (emoji)    11     3            1
    (emoji)    27     8            1
    https://twitter.com/FakeUnicode/status/717751277342490624
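The first two columns of this table are easy to compute: the byte count is the length of the UTF-8 encoding, and the code point count is the length of a Python 3 string. The sketch below uses an illustrative `sizes` helper; the Sweden flag sequence is borrowed from the emoji-sequences.txt slide as an assumed example of a two-code-point row. Counting glyphs needs grapheme cluster segmentation, which the Python standard library does not provide, so that column is not computed here.

```python
def sizes(text):
    """Return (UTF-8 byte count, code point count) for a string."""
    return len(text.encode("utf-8")), len(text)


samples = {
    "DOLLAR SIGN": "$",
    "CENT SIGN": "\u00A2",
    "EURO SIGN": "\u20AC",
    "CHURCH": "\u26EA",
    # Two regional indicator symbols that render as one flag (assumed example).
    "Flag for Sweden": "\U0001F1F8\U0001F1EA",
}

for label, text in samples.items():
    n_bytes, n_code_points = sizes(text)
    print(f"{label:16} {n_bytes:2} bytes, {n_code_points} code point(s)")
```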
  11. UTF-16
    BMP → one 16-bit code unit: U+0041 (A) → 0041
    Above BMP → two 16-bit code units: U+1F600 (😀) → D83D + DE00
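The two code units for a character above the BMP come from a fixed piece of arithmetic: subtract 0x10000, then split the remaining 20 bits into a high and a low half. A minimal sketch, where `to_utf16_code_units` is a hypothetical helper that assumes a valid, non-surrogate code point as input:

```python
def to_utf16_code_units(cp):
    """Return the UTF-16 code unit(s) for one code point as a tuple of ints."""
    if cp < 0x10000:
        return (cp,)                   # BMP: the code point is the code unit
    offset = cp - 0x10000              # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return (high, low)


for cp in (0x0041, 0x1F600):
    units = " ".join(f"{unit:04X}" for unit in to_utf16_code_units(cp))
    print(f"U+{cp:04X} -> {units}")
```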
  12. UTF-16
    Used by many systems and programming languages
    Byte order matters
    UTF-16 BE: FEFF 0041
    UTF-16 LE: FFFE 4100
    Byte Order Mark (0xFEFF, ZERO WIDTH NO-BREAK SPACE)
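The byte order mark lets a decoder work out which byte order was used. The sketch below builds both byte sequences from the slide with Python's `codecs` constants and shows that the generic `utf-16` codec reads the mark and then drops it.

```python
import codecs
import unicodedata

text = "A"

# Prepend the byte order mark explicitly so both byte orders are visible.
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(big.hex().upper())     # FEFF0041
print(little.hex().upper())  # FFFE4100

# The generic "utf-16" codec uses the mark to pick the byte order and removes it.
print(big.decode("utf-16") == little.decode("utf-16") == text)

# Decoding with an explicit byte order keeps the mark as an ordinary character.
print(unicodedata.name(big.decode("utf-16-be")[0]))
```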
  13. UTF-8
    This is the encoding you are looking for
    One to four 8-bit code units
    Easy to parse
    Byte order does not matter
  14. UTF-8
    U+0041   A   41
    U+00D6   Ö   C3 96
    U+304B   か  E3 81 8B
    U+1F600  😀  F0 9F 98 80
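The byte values above follow from the UTF-8 bit layout: the lead byte announces the length, and each continuation byte carries six payload bits behind a `10` prefix. A hand-rolled sketch of that layout follows; `utf8_encode` is an illustrative name, and the function assumes a valid code point and does not reject surrogates. In practice `str.encode("utf-8")` does this work.

```python
def utf8_encode(cp):
    """Encode one code point to UTF-8 bytes by hand (illustrative only)."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xD6, 0x304B, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex().upper())
```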
  15. UTF-32
    32-bit code units: U+0041 (A) → 00000041
    Space inefficient
    One code unit per code point
    BORING!
  16. Å Å Å
    U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
    U+212B ANGSTROM SIGN
    U+0041 LATIN CAPITAL LETTER A + U+030A COMBINING RING ABOVE
  17. Å Å Å
    U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
    U+212B ANGSTROM SIGN
    U+0041 LATIN CAPITAL LETTER A + U+030A COMBINING RING ABOVE
    Equivalence
  18. Å Å Å
    U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
    U+212B ANGSTROM SIGN
    U+0041 LATIN CAPITAL LETTER A + U+030A COMBINING RING ABOVE
    Normalization
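The three spellings of Å are canonically equivalent, yet they compare as different strings until they are normalized. A short sketch using Python's `unicodedata.normalize`, where the variable names are illustrative:

```python
import unicodedata

precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"      # ANGSTROM SIGN
combining = "A\u030A"    # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

variants = [precomposed, angstrom, combining]

# Before normalization the three strings are all different.
print(len(set(variants)))

# NFC composes where possible; NFD decomposes fully.
nfc = {unicodedata.normalize("NFC", v) for v in variants}
nfd = {unicodedata.normalize("NFD", v) for v in variants}
print(nfc == {precomposed}, nfd == {combining})
```

Normalizing both sides to the same form before comparing or storing text is a common way to make equivalent strings match.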
  19. Emoji
    2,198 emoji
    Most in SMP: two UTF-16 or four UTF-8 code units
    Separate data files
  20. ZWJ sequences (emoji examples not preserved in this transcript)
  21. Fitzpatrick scale
    Emoji Modifier Fitzpatrick Type-1-2
    Emoji Modifier Fitzpatrick Type-3
    Emoji Modifier Fitzpatrick Type-4
    Emoji Modifier Fitzpatrick Type-5
    Emoji Modifier Fitzpatrick Type-6
  22. Genders + skin tones
    RUNNER
    EMOJI MODIFIER FITZPATRICK TYPE-4
    ZERO WIDTH JOINER
    ♀ FEMALE SIGN
    VARIATION SELECTOR-16
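The five character names on this slide can be joined into one string in the order listed. A sketch using `unicodedata.lookup`, which needs a Python build whose Unicode data includes the emoji modifiers (Unicode 8.0 or later); whether the result renders as a single glyph depends on the font and platform.

```python
import unicodedata

NAMES = [
    "RUNNER",
    "EMOJI MODIFIER FITZPATRICK TYPE-4",
    "ZERO WIDTH JOINER",
    "FEMALE SIGN",
    "VARIATION SELECTOR-16",
]

# Look each character up by its Unicode name and join them in order.
sequence = "".join(unicodedata.lookup(name) for name in NAMES)

print(" ".join(f"U+{ord(ch):04X}" for ch in sequence))
print(len(sequence), "code points")
print(len(sequence.encode("utf-16-be")) // 2, "UTF-16 code units")
print(len(sequence.encode("utf-8")), "UTF-8 code units")
```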
  23. ES6

  24. ES6

  25. ES6

  26. ES6

  27. ES6

  28. ES6

  29. ES6

  30. ES6

  31. ES6

  32. Common Locale Data Repository
    Language and country data
    Formatting dates, numbers, currency
    Translations of scripts, languages, date units
    Script characters, sorting and transliteration rules
  33. International Components for Unicode
    C/C++ and Java libraries for Unicode handling
    Uses CLDR data
    Encoding conversion
    Sorting, formatting, normalization
    Time and calendar conversion
    Regular expressions
    Text segmentation