Unicode, JavaScript and the Emoji family

Stefan Judis
November 07, 2016

Transcript

  1. [...'👨‍👩‍👦'] = ['👨', '\u200D', '👩', '\u200D', '👦']

    Unicode, JavaScript and the Emoji family @stefanjudis
  2. Stefan Judis Frontend Developer, Occasional Teacher, Meetup Organizer ❤ Open

    Source, Performance and Accessibility ❤ @stefanjudis
  3. cssclass.es

  5. '👨'.length

  6. '👨'.length 2

  7. '👦🏽'.length

  8. '👦🏽'.length 4

  9. '👨‍👩‍👦'.length

  10. '👨‍👩‍👦'.length 8

  11. [...'👨‍👩‍👦']

  12. [...'👨‍👩‍👦'] ['👨', '\u200D', '👩', '\u200D', '👦'] length = 5

  13. Okay! What's going on here?

  14. It's all about Unicode

  15. UNICODE ...

    01 is an international encoding standard
    02 is a mapping from each letter, digit or symbol to a numeric value
    03 works across different platforms and programs
  16. UNICODE - overview - 1,114,112 code points

    usually formatted as hexadecimal numbers from U+0000 to U+10FFFF
  17. UNICODE - overview - 1,114,112 code points in 17 planes

    Basic Multilingual Plane                  U+0000 to U+FFFF      1 plane
    Supplementary Planes                      U+10000 to U+10FFFF   16 planes
      Supplementary Multilingual Plane        U+10000 to U+1FFFF    1 plane
      Supplementary Ideographic Plane         U+20000 to U+2FFFF    1 plane
      unassigned                              U+30000 to U+DFFFF    11 planes
      Supplementary Special-purpose Plane     U+E0000 to U+EFFFF    1 plane
      Supplementary Private Use Area Planes   U+F0000 to U+10FFFF   2 planes
  18. UNICODE - Basic Multilingual Plane - U+0000 to U+FFFF

    characters for almost all modern languages + a lot of symbols
  19. UNICODE - Supplementary Planes - U+10000 to U+10FFFF

    everything else
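Since every plane in the overview above spans 0x10000 code points, the plane of a code point can be found with a single division. The following is a minimal sketch; `planeOf` and `totalCodePoints` are hypothetical names chosen for illustration and are not part of the talk.

```javascript
// Hypothetical helper: returns the plane index (0 to 16) of a code point.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  // Each plane holds 0x10000 (65,536) code points.
  return Math.floor(codePoint / 0x10000);
}

// 17 planes of 65,536 code points each.
const totalCodePoints = 17 * 0x10000;

console.log(totalCodePoints);    // 1114112
console.log(planeOf(0x0041));    // 0  (Basic Multilingual Plane)
console.log(planeOf(0x1F468));   // 1  (Supplementary Multilingual Plane)
console.log(planeOf(0x10FFFF));  // 16 (last Private Use Area plane)
```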
  20. Emojis

  21. EMOJIS ...

    01 were initially used by Japanese mobile operators
    02 were added to Unicode v6 in October 2010
    03 are supported since OS X 10.7 (Lion) and Windows 8
  22. EMOJIS - overview -

    Emojis such as U+1F468 and U+1F466 are in the Supplementary Multilingual Plane (U+10000 to U+1FFFF)
  23. How many Emojis are out there? EMOJIS - overview -

  24. How many Emojis are out there? EMOJIS - overview -

    It depends how you count.
  25. Modifier Sequences Five modifiers for diversity U+1F3FB U+1F3FC U+1F3FD U+1F3FE

    U+1F3FF
  26. Modifier Sequences Five modifiers for diversity U+1F3FB U+1F3FC U+1F3FD U+1F3FE U+1F3FF

    👦 U+1F466 + 🏽 U+1F3FD = 👦🏽 ( 2 code points )
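The modifier sequence above can be reproduced with `String.fromCodePoint`. This is a small sketch using the two code points from the slide; the constant names are illustrative only.

```javascript
// Base emoji and modifier code points as listed on the slide.
const BOY = 0x1F466;
const MODIFIER = 0x1F3FD; // one of the five modifiers U+1F3FB to U+1F3FF

// A modifier sequence is simply the base emoji followed by the modifier.
const boyWithModifier = String.fromCodePoint(BOY, MODIFIER);

const codePointCount = [...boyWithModifier].length; // 2 code points
const codeUnitCount = boyWithModifier.length;       // 4 code units (2 surrogate pairs)

console.log(boyWithModifier, codePointCount, codeUnitCount);
```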
  27. EMOJIS - ZWJ sequences - ZERO WIDTH JOINER U+200D

    Indicator that a single glyph should be presented for a sequence of characters
  28. EMOJIS - ZWJ sequences - 👪 U+1F46A ( 1 code point )

  29. EMOJIS - ZWJ sequences - 👨‍👨‍👧

  30. EMOJIS - ZWJ sequences - 👨‍👨‍👧

    U+1F468 + ZWJ U+200D + U+1F468 + ZWJ U+200D + U+1F467 ( 5 code points )
  31. EMOJIS - ZWJ sequences -

    woman astronaut: 👩 + ZWJ + 🚀 ( 3 code points )
    man artist: 👨 + ZWJ + 🎨 ( 3 code points )
    man getting hair cut: 💇 + ZWJ + ♂ + U+FE0F ( 4 code points )
    woman mountain biking: 🚵 + ZWJ + ♀ + U+FE0F ( 4 code points )
  32. EMOJIS - ZWJ sequences - "David Bowie" - Singer -

    Apple, Google: 👨 + ZWJ + 🎤 and 👩 + ZWJ + 🎤
  33. EMOJIS - ZWJ sequences -

    "David Bowie" Emoji is not yet supported.
  34. EMOJIS - ZWJ sequences - Sequences degrade gracefully!

    '\u{1F468}\u{200D}\u{1F3A4}' "👨🎤"
    '\u{1F469}\u{200D}\u{1F3A4}' "👩🎤"
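Graceful degradation works because a ZWJ sequence is only a list of regular emojis glued together with U+200D. A minimal sketch of taking such a sequence apart follows; `splitZwjSequence` is a hypothetical helper name, and the input is the man singer sequence from the slide.

```javascript
const ZWJ = '\u{200D}';

// Hypothetical helper: splits a ZWJ sequence into the emojis it was built from.
function splitZwjSequence(sequence) {
  return sequence.split(ZWJ);
}

const manSinger = '\u{1F468}\u{200D}\u{1F3A4}';
const parts = splitZwjSequence(manSinger);

// Roughly what a platform without support for the sequence displays:
// the individual emojis next to each other.
const fallback = parts.join('');

console.log(parts.length);                  // 2
console.log([...manSinger].length);         // 3 code points, including the ZWJ
console.log(parts.join(ZWJ) === manSinger); // true
```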
  35. EMOJIS - flags -

    26 regional indicators (U+1F1E6 to U+1F1FF) used in pairs to represent regions
  36. EMOJIS - flags -

    26 regional indicators (U+1F1E6 to U+1F1FF) used in pairs to represent regions
    🇩🇪 U+1F1E9 U+1F1EA ( 2 code points )
    🇬🇧 U+1F1EC U+1F1E7 ( 2 code points )
    🇨🇽 U+1F1E8 U+1F1FD ( 2 code points )
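Because the 26 regional indicators are laid out in alphabetical order starting at U+1F1E6, a flag can be derived from a two-letter region code by offsetting each letter. The sketch below assumes that mapping; `flagFor` is a hypothetical helper, and whether a pair renders as a flag depends on the platform's font.

```javascript
const REGIONAL_INDICATOR_A = 0x1F1E6;

// Hypothetical helper: maps a two-letter region code such as 'DE'
// to the corresponding pair of regional indicator symbols.
function flagFor(regionCode) {
  if (!/^[A-Za-z]{2}$/.test(regionCode)) {
    throw new RangeError('Expected two letters A-Z, got: ' + regionCode);
  }
  const indicators = [...regionCode.toUpperCase()].map(
    letter => REGIONAL_INDICATOR_A + letter.charCodeAt(0) - 'A'.charCodeAt(0)
  );
  return String.fromCodePoint(...indicators);
}

const germany = flagFor('DE');

console.log(germany === '\u{1F1E9}\u{1F1EA}'); // true
console.log([...germany].length);              // 2 code points
console.log(germany.length);                   // 4 code units
```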
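  37. EMOJIS - flags - www.dwitter.net/d/2708 Dweet by @veubeke

    function u(t) {
      // x (2D canvas context) and t (elapsed seconds) are supplied by dwitter
      x.font='96px a'
      S=String.fromCodePoint
      W=e=>x.measureText(e).width
      i=t*4%257|0
      // draw the pair only if it measures narrower than two plain indicators, i.e. it renders as one flag
      W(S(F=0x1F1E6,F))>W(_=S(F+i%26,F+i/26|0))&&x.fillText(_,9,99)
    }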
  38. EMOJIS - overview - How many Emojis are out there?

    2198 (excluding incomplete singletons, excluding duplicates, including all combined sequences)
    unicode.org/reports/tr51/#Identification
  39. What about Unicode in JavaScript?

  40. JAVASCRIPT UTF-16, the string format used by JavaScript, uses a

    single 16-bit code unit to represent the most common characters. - string representation -
  41. 16-bit code unit 65536 code points JAVASCRIPT - string representation

    -
  42. JAVASCRIPT - characters with one code unit - \u0000 - \uFFFF can fit into 16bit

    ﾂ ('\uFF82'), private-use character U+F8FF ('\uF8FF'), ☃ ('\u2603'), ⛷ ('\u26F7')
  43. JAVASCRIPT - characters with one code unit - \u0000 - \uFFFF can fit into 16bit

    'ﾂ'.length, '\uF8FF'.length, '☃'.length and '⛷'.length are all 1
  44. How can we use code points out of the 16bit

    range? JAVASCRIPT - surrogate pairs -
  45. Surrogate Pairs JAVASCRIPT - surrogate pairs - 2048 surrogate code

    points included in the Basic Multilingual Plane
  46. Surrogate Pairs JAVASCRIPT - surrogate pairs - 2048 surrogate code

    points included in the Basic Multilingual Plane Leading/High Surrogates U+D800 to U+DBFF
  47. Surrogate Pairs JAVASCRIPT - surrogate pairs - 2048 surrogate code points included in the Basic Multilingual Plane

    Leading/High Surrogates: U+D800 to U+DBFF
    Trailing/Low Surrogates: U+DC00 to U+DFFF
  48. Surrogate Pairs JAVASCRIPT - surrogate pairs - 2048 surrogate code points included in the Basic Multilingual Plane

    Leading/High Surrogates: U+D800 to U+DBFF
    Trailing/Low Surrogates: U+DC00 to U+DFFF
    Formula to get code point:
    C = (H - 0xD800) * 0x400 + L - 0xDC00 + 0x10000
    C = (H - 55296) * 1024 + L - 56320 + 65536
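The formula above translates directly into code. The following sketch wraps it in a hypothetical helper, `codePointFromSurrogates`, with range checks added as an assumption about how one might guard the inputs.

```javascript
// Hypothetical helper implementing C = (H - 0xD800) * 0x400 + L - 0xDC00 + 0x10000.
function codePointFromSurrogates(high, low) {
  const isHigh = high >= 0xD800 && high <= 0xDBFF;
  const isLow = low >= 0xDC00 && low <= 0xDFFF;
  if (!isHigh || !isLow) {
    throw new RangeError('Expected a high surrogate followed by a low surrogate');
  }
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// The surrogate pair U+D83D U+DC68 decodes to U+1F468.
const decoded = codePointFromSurrogates(0xD83D, 0xDC68);

console.log(decoded);              // 128104
console.log(decoded.toString(16)); // 1f468
```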
  49. Surrogate Pairs JAVASCRIPT - surrogate pairs -

    👨 U+1F468 128104
    '👨'.length // 2
  50. Surrogate Pairs JAVASCRIPT - surrogate pairs -

    👨 U+1F468 128104
    '👨'.length // 2
    '👨'.charCodeAt(0) // 55357 (U+D83D)
  51. Surrogate Pairs JAVASCRIPT - surrogate pairs -

    👨 U+1F468 128104
    '👨'.length // 2
    '👨'.charCodeAt(0) // 55357 (U+D83D)
    '👨'.charCodeAt(1) // 56424 (U+DC68)
  52. Surrogate Pairs JAVASCRIPT - surrogate pairs -

    👨 U+1F468 128104
    '👨'.length // 2
    '👨'.charCodeAt(0) // 55357 (U+D83D)
    '👨'.charCodeAt(1) // 56424 (U+DC68)
    0x1F468 = (0xD83D - 0xD800) * 0x400 + 0xDC68 - 0xDC00 + 0x10000
    128104 = (55357 - 55296) * 1024 + 56424 - 56320 + 65536
  54. charCodeAt() vs codePointAt() JAVASCRIPT - surrogate pairs -

    👨 U+1F468 128104
    '👨'.codePointAt(0) // 128104 (U+1F468)
    '👨'.codePointAt(1) // 56424 (U+DC68)
    '👨'.charCodeAt(0) // 55357 (U+D83D)
    '👨'.charCodeAt(1) // 56424 (U+DC68)
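The difference between the two methods can be checked in a few lines. This sketch uses the same U+1F468 example; the variable names are illustrative.

```javascript
const manEmoji = '\u{1F468}';

// charCodeAt always returns a single 16-bit code unit.
const codeUnits = [manEmoji.charCodeAt(0), manEmoji.charCodeAt(1)];

// codePointAt combines a surrogate pair when the index points at its high surrogate.
// At index 1 it lands on the lone low surrogate and returns just that code unit.
const codePoints = [manEmoji.codePointAt(0), manEmoji.codePointAt(1)];

console.log(codeUnits.map(n => n.toString(16)));  // [ 'd83d', 'dc68' ]
console.log(codePoints.map(n => n.toString(16))); // [ '1f468', 'dc68' ]
```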
  56. JAVASCRIPT - surrogate pairs - U+1F468 128104

    simple Unicode escapes: '\uD83D\uDC68'
    Unicode code point escapes: '\u{1F468}'
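Both escape styles produce the same string, which a quick comparison can confirm. The sketch also includes `String.fromCodePoint` as a third way to build the value; the variable names are chosen for illustration.

```javascript
const viaSurrogateEscapes = '\uD83D\uDC68';           // simple Unicode escapes
const viaCodePointEscape = '\u{1F468}';               // Unicode code point escape
const viaFromCodePoint = String.fromCodePoint(0x1F468);

const allEqual =
  viaSurrogateEscapes === viaCodePointEscape &&
  viaCodePointEscape === viaFromCodePoint;

console.log(allEqual); // true
```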
  57. Okay, what's the deal?

  58. JAVASCRIPT - String.prototype.length - This property returns the number of

    code units in the string. String.prototype.length
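Since `length` counts code units, counting code points needs a little extra work. One possible approach, sketched below, walks the string with `codePointAt` and skips two code units whenever a surrogate pair is found; `countCodePoints` is a hypothetical helper name.

```javascript
// Hypothetical helper: counts code points instead of code units.
function countCodePoints(text) {
  let count = 0;
  for (let i = 0; i < text.length; count += 1) {
    // codePointAt reads a full surrogate pair when one starts at index i,
    // in which case the result is above 0xFFFF and occupies two code units.
    i += text.codePointAt(i) > 0xFFFF ? 2 : 1;
  }
  return count;
}

const sample = 'A\u{1F468}';

console.log(sample.length);           // 3 code units
console.log(countCodePoints(sample)); // 2 code points
```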
  59. - the spread operator - The spread operator works for

    every iterable object. [...'ABC'] JAVASCRIPT
  60. - the spread operator - The spread operator works for

    every iterable object. [...'ABC'] JAVASCRIPT > ''[Symbol.iterator] function [Symbol.iterator]() { [native code] }
  61. - the spread operator - [...] iterates over the code

    points of a String value, returning each code point as a String value. String.prototype [ @@iterator ]( ) JAVASCRIPT
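The effect of the string iterator is easiest to see next to a code-unit based split. A short sketch, with illustrative variable names:

```javascript
const text = 'A\u{1F468}B';

// split('') works on code units and tears the surrogate pair apart.
const byCodeUnit = text.split('');

// Spread and Array.from both use String.prototype[Symbol.iterator],
// which yields one string per code point.
const byCodePoint = [...text];
const alsoByCodePoint = Array.from(text);

// The iterator can also be driven by hand.
const iterator = text[Symbol.iterator]();
const first = iterator.next();
const second = iterator.next();

console.log(byCodeUnit.length);  // 4
console.log(byCodePoint.length); // 3
console.log(first.value, second.value === '\u{1F468}'); // A true
```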
  62. Let's go back to the examples

  63. '👨'.length 2

    1 code point but 2 code units (surrogate pair)
  64. '👦🏽'.length 4

    2 code points but 4 code units (2 surrogate pairs): 👦 + 🏽
  65. '👨‍👩‍👦'.length 8

    5 code points but 8 code units (3 surrogate pairs): 👨 ZWJ 👩 ZWJ 👦
  66. [...'👨‍👩‍👦'] ['👨', '\u200D', '👩', '\u200D', '👦']

    U+1F468 U+200D (ZWJ) U+1F469 U+200D (ZWJ) U+1F466
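To double-check the breakdown of the family emoji, the spread result can be printed in U+ notation. This is a minimal sketch; `toCodePointNotation` is a hypothetical helper, and the family string is written with escapes so it does not depend on font support.

```javascript
// Hypothetical helper: formats each code point of a string in U+XXXX notation.
function toCodePointNotation(text) {
  return [...text].map(
    symbol => 'U+' + symbol.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

const family = '\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}';
const notation = toCodePointNotation(family);

console.log(family.length);      // 8 code units
console.log(notation.length);    // 5 code points
console.log(notation.join(' ')); // U+1F468 U+200D U+1F469 U+200D U+1F466
```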
  67. Thanks! @stefanjudis Slides ctfl.io/javascript-emoji-family Article ctfl.io/emoji-prototype-dot-length