Unicode, JavaScript and the Emoji family

stefan judis
November 07, 2016

Transcript

  1. [...'👨‍👩‍👦'] = ['👨', '\u200D', '👩', '\u200D', '👦']
    Unicode, JavaScript and
    the Emoji family
    @stefanjudis

  2. Stefan Judis
    Frontend Developer, Occasional Teacher, Meetup Organizer
    ❤ Open Source, Performance and Accessibility ❤
    @stefanjudis


  3. cssclass.es


  5. '👨'.length

  6. '👨'.length
    2

  7. '👦🏽'.length

  8. '👦🏽'.length
    4

  9. '👨‍👩‍👦'.length

  10. '👨‍👩‍👦'.length
    8

  11. [...'👨‍👩‍👦']

  12. [...'👨‍👩‍👦']
    ['👨', '\u200D', '👩', '\u200D', '👦']
    length = 5

  13. Okay!
    What's going on here?


  14. It's all about
    Unicode


  15. UNICODE ...
    01 is an international encoding standard
    02 is a mapping from each letter, digit or symbol to a numeric value
    03 works across different platforms and programs

  16. UNICODE
    - overview -
    1,114,112 code points
    usually formatted as hexadecimal numbers from
    U+0000 to U+10FFFF

  17. UNICODE
    - overview -
    1,114,112 code points in 17 planes
    Basic Multilingual Plane (1 plane): U+0000 to U+FFFF
    Supplementary Planes (16 planes): U+10000 to U+10FFFF
      Supplementary Multilingual Plane (1 plane): U+10000 to U+1FFFF
      Supplementary Ideographic Plane (1 plane): U+20000 to U+2FFFF
      unassigned (11 planes): U+30000 to U+DFFFF
      Supplementary Special-purpose Plane (1 plane): U+E0000 to U+EFFFF
      Supplementary Private Use Area Planes (2 planes): U+F0000 to U+10FFFF
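
Since every plane spans 0x10000 code points, the plane a code point belongs to can be derived with one division. The following is a minimal sketch; `planeOf` is a hypothetical helper name introduced for illustration, not something from the talk.

```javascript
// Hypothetical helper (not from the talk): which of the 17 planes
// does a given code point belong to?
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  // Each plane holds 0x10000 (65,536) code points.
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf(0x0041));   // 0  (Basic Multilingual Plane)
console.log(planeOf(0x1F468));  // 1  (Supplementary Multilingual Plane)
console.log(planeOf(0x10FFFF)); // 16 (last Private Use Area plane)
console.log(17 * 0x10000);      // 1114112 code points in total
```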

  18. UNICODE
    - Basic Multilingual Plane -
    U+0000 to U+FFFF
    characters for almost all modern languages + a lot of symbols

  19. UNICODE
    - Supplementary Planes -
    U+10000 to U+10FFFF
    everything else

  20. Emojis


  21. EMOJIS ...
    01 were initially used by Japanese mobile operators
    02 were added to Unicode v6 in October 2010
    03 have been supported since OS X 10.7 (Lion) and Windows 8

  22. EMOJIS
    - overview -
    Emojis such as 👨 (U+1F468) and 👦 (U+1F466) are in the Supplementary Multilingual Plane
    (U+10000 to U+1FFFF)

  23. EMOJIS
    - overview -
    How many Emojis are out there?

  24. EMOJIS
    - overview -
    How many Emojis are out there?
    It depends on how you count.

  25. Modifier Sequences
    Five modifiers for diversity
    U+1F3FB
    U+1F3FC
    U+1F3FD
    U+1F3FE
    U+1F3FF


  26. Modifier Sequences
    Five modifiers for diversity
    U+1F3FB
    U+1F3FC
    U+1F3FD
    U+1F3FE
    U+1F3FF
    👦🏽 = 👦 (U+1F466) + 🏽 (U+1F3FD)
    ( 2 code points )
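
A modifier sequence is simply the base emoji followed by one of the five modifiers. A small sketch using the two code points from the slide (the variable names are illustrative only):

```javascript
// Base emoji (boy, U+1F466) followed by a skin tone modifier (U+1F3FD).
const boy = String.fromCodePoint(0x1F466);
const boyWithModifier = String.fromCodePoint(0x1F466, 0x1F3FD);

console.log(boy.length);                  // 2 code units
console.log(boyWithModifier.length);      // 4 code units
console.log([...boyWithModifier].length); // 2 code points
```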

  27. EMOJIS
    - ZWJ sequences -
    ZERO WIDTH JOINER
    U+200D
    Indicator that a single glyph should be presented
    for a sequence of characters

  28. EMOJIS
    - ZWJ sequences -
    👪 U+1F46A
    ( 1 code point )

  29. EMOJIS
    - ZWJ sequences -
    👨‍👨‍👧

  30. EMOJIS
    - ZWJ sequences -
    👨‍👨‍👧 = 👨 (U+1F468) + ZWJ (U+200D) + 👨 (U+1F468) + ZWJ (U+200D) + 👧 (U+1F467)
    ( 5 code points )
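
To make the ZWJ mechanics concrete, here is a minimal sketch that glues individual emoji together with U+200D. `joinWithZwj` is a hypothetical helper, and the order of the family members (man, man, girl) is an assumption based on the code points listed on the slide.

```javascript
const ZWJ = '\u200D'; // ZERO WIDTH JOINER

// Hypothetical helper: build a ZWJ sequence from a list of code points.
function joinWithZwj(...codePoints) {
  return codePoints.map((cp) => String.fromCodePoint(cp)).join(ZWJ);
}

// Assumed order: man (U+1F468), man (U+1F468), girl (U+1F467).
const family = joinWithZwj(0x1F468, 0x1F468, 0x1F467);

// List the code points the string is made of.
const familyCodePoints = [...family].map(
  (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase()
);

console.log(familyCodePoints); // [ 'U+1F468', 'U+200D', 'U+1F468', 'U+200D', 'U+1F467' ]
console.log(family.length);    // 8 code units
```

Whether the result is drawn as one glyph depends on the font and platform; the string itself is the same either way.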

  31. EMOJIS
    - ZWJ sequences -
    woman astronaut: 👩 (U+1F469) + ZWJ + 🚀 (U+1F680) ( 3 code points )
    man artist: 👨 (U+1F468) + ZWJ + 🎨 (U+1F3A8) ( 3 code points )
    man getting haircut: 💇 (U+1F487) + ZWJ + ♂ (U+2642) + U+FE0F ( 4 code points )
    woman mountain biking: 🚵 (U+1F6B5) + ZWJ + ♀ (U+2640) + U+FE0F ( 4 code points )

  32. EMOJIS
    - ZWJ sequences -
    "David Bowie"
    - Singer -
    👨 + ZWJ + 🎤
    Apple
    Google

  33. EMOJIS
    - ZWJ sequences -
    "David Bowie" Emoji is not yet supported.

  34. EMOJIS
    - ZWJ sequences -
    Sequences degrade gracefully!
    '\u{1F468}\u{200D}\u{1F3A4}'
    "👨🎤"
    '\u{1F469}\u{200D}\u{1F3A4}'
    "👩🎤"

  35. EMOJIS
    - flags -
    26 regional indicators used
    in pairs to represent regions
    U+1F1E6 ... U+1F1FF

  36. EMOJIS
    - flags -
    26 regional indicators used
    in pairs to represent regions
    U+1F1E6 ... U+1F1FF
    🇩🇪 = U+1F1E9 U+1F1EA ( 2 code points )
    🇬🇧 = U+1F1EC U+1F1E7 ( 2 code points )
    🇨🇽 = U+1F1E8 U+1F1FD ( 2 code points )
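
Because the 26 regional indicators run in alphabetical order from U+1F1E6, a two-letter region code can be mapped to its flag sequence arithmetically. This is only a sketch; `flagFor` is a hypothetical helper, and it does not check whether the two letters form a real region code.

```javascript
const REGIONAL_INDICATOR_A = 0x1F1E6; // regional indicator symbol letter A

// Hypothetical helper: map a two-letter code such as 'DE' to the pair of
// regional indicator code points that fonts may render as a flag.
function flagFor(regionCode) {
  if (!/^[A-Za-z]{2}$/.test(regionCode)) {
    throw new RangeError('Expected two ASCII letters, got: ' + regionCode);
  }
  const codePoints = [...regionCode.toUpperCase()].map(
    (letter) => REGIONAL_INDICATOR_A + letter.charCodeAt(0) - 'A'.charCodeAt(0)
  );
  return String.fromCodePoint(...codePoints);
}

const germany = flagFor('DE');
console.log([...germany].map((ch) => ch.codePointAt(0).toString(16))); // [ '1f1e9', '1f1ea' ]
console.log(germany.length);      // 4 code units
console.log([...germany].length); // 2 code points
```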

  37. EMOJIS
    - flags -
    www.dwitter.net/d/2708
    function() {
    x.font='96px a'
    S=String.fromCodePoint
    W=e=>x.measureText(e).width
    i=t*4%257|0
    W(S(F=0x1F1E6,F))>W(_=S(F+i%26,F+i/26|0))&&x.fillText(_,9,99)
    }
    Dweet by @veubeke


  38. How many Emojis are out there?
    EMOJIS
    - overview -
    2198
    unicode.org/reports/tr51/#Identification
    (excluding incomplete singletons)
    (excluding duplicates)
    (including all combined sequences)


  39. What about Unicode in JavaScript?

  40. JAVASCRIPT
    UTF-16, the string format used by JavaScript,
    uses a single 16-bit code unit
    to represent the most common characters.
    - string representation -


  41. 16-bit code unit
    65536 code points
    JAVASCRIPT
    - string representation -


  42. JAVASCRIPT
    - characters with one code unit -
    \u0000 - \uFFFF
    can fit into 16 bits
    ﾂ ('\uFF82')
    U+F8FF, private use ('\uF8FF')
    霱 ('\u9731')
    ⛷ ('\u26F7')

  43. JAVASCRIPT
    - characters with one code unit -
    \u0000 - \uFFFF
    can fit into 16 bits
    'ﾂ'.length
    '\uF8FF'.length
    '霱'.length
    '⛷'.length
    1
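
The claim is easy to check in a console. The sketch below uses the four escapes from the slides and, for contrast, one code point outside the Basic Multilingual Plane; the variable names are illustrative.

```javascript
// The four BMP characters from the slides, written as escapes.
const bmpSamples = ['\uFF82', '\uF8FF', '\u9731', '\u26F7'];
const bmpLengths = bmpSamples.map((ch) => ch.length);

console.log(2 ** 16);    // 65536 values fit into one 16-bit code unit
console.log(bmpLengths); // [ 1, 1, 1, 1 ]

// A code point above U+FFFF needs two code units.
const astral = '\u{1F468}';
console.log(astral.length); // 2
```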

  44. JAVASCRIPT
    - surrogate pairs -
    How can we use code points
    outside the 16-bit range?

  45. Surrogate Pairs
    JAVASCRIPT
    - surrogate pairs -
    2048 surrogate code points
    included in the Basic Multilingual Plane


  46. Surrogate Pairs
    JAVASCRIPT
    - surrogate pairs -
    2048 surrogate code points
    included in the Basic Multilingual Plane
    Leading/High Surrogates
    U+D800 to U+DBFF


  47. Surrogate Pairs
    JAVASCRIPT
    - surrogate pairs -
    2048 surrogate code points
    included in the Basic Multilingual Plane
    Leading/High Surrogates: U+D800 to U+DBFF
    Trailing/Low Surrogates: U+DC00 to U+DFFF

  48. Surrogate Pairs
    JAVASCRIPT
    - surrogate pairs -
    2048 surrogate code points
    included in the Basic Multilingual Plane
    Leading/High Surrogates: U+D800 to U+DBFF
    Trailing/Low Surrogates: U+DC00 to U+DFFF
    Formula to get code point
    C = (H - 0xD800) * 0x400 + L - 0xDC00 + 0x10000
    C = (H - 55296) * 1024 + L - 56320 + 65536
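
The formula translates directly into code. The sketch below implements it and its inverse; `toCodePoint` and `toSurrogatePair` are hypothetical helper names, and neither validates its input.

```javascript
// Combine a high (H) and low (L) surrogate into a code point (C).
function toCodePoint(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// Inverse direction: split a supplementary code point into [high, low].
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // upper 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // lower 10 bits
  return [high, low];
}

console.log(toCodePoint(0xD83D, 0xDC68).toString(16));            // '1f468'
console.log(toSurrogatePair(0x1F468).map((n) => n.toString(16))); // [ 'd83d', 'dc68' ]
```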

  49. Surrogate Pairs
    JAVASCRIPT
    - surrogate pairs -
    '👨'.length // 2
    U+1F468
    128104

  50. Surrogate Pairs
    JAVASCRIPT
    - surrogate pairs -
    '👨'.length // 2
    U+1F468
    128104
    '👨'.charCodeAt(0)
    U+D83D
    55357

  51. Surrogate Pairs
    JAVASCRIPT
    - surrogate pairs -
    '👨'.length // 2
    U+1F468
    128104
    '👨'.charCodeAt(0)
    U+D83D
    55357
    '👨'.charCodeAt(1)
    U+DC68
    56424

  52. Surrogate Pairs
    JAVASCRIPT
    - surrogate pairs -
    '👨'.length // 2
    U+1F468
    128104
    '👨'.charCodeAt(0)
    U+D83D
    55357
    '👨'.charCodeAt(1)
    U+DC68
    56424
    0x1F468 = (0xD83D - 0xD800) * 0x400 + 0xDC68 - 0xDC00 + 0x10000
    128104 = (55357 - 55296) * 1024 + 56424 - 56320 + 65536

  54. JAVASCRIPT
    - surrogate pairs -
    charCodeAt() vs codePointAt()
    👨
    U+1F468
    128104
    '👨'.codePointAt(0)
    U+1F468
    128104
    '👨'.codePointAt(1)
    U+DC68
    56424
    '👨'.charCodeAt(0)
    U+D83D
    55357
    '👨'.charCodeAt(1)
    U+DC68
    56424
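
The same comparison as runnable code, with the emoji written as an escape so the example does not depend on the file encoding:

```javascript
const man = '\u{1F468}';

// charCodeAt() always returns a single 16-bit code unit.
console.log(man.charCodeAt(0));  // 55357 (0xD83D, high surrogate)
console.log(man.charCodeAt(1));  // 56424 (0xDC68, low surrogate)

// codePointAt() combines a surrogate pair when the index points at its start.
console.log(man.codePointAt(0)); // 128104 (0x1F468)
// Index 1 points into the middle of the pair, so only the low surrogate comes back.
console.log(man.codePointAt(1)); // 56424 (0xDC68)
```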

  56. JAVASCRIPT
    - surrogate pairs -
    👨
    U+1F468
    128104
    simple Unicode escapes
    '\uD83D\uDC68'
    Unicode code point escapes
    '\u{1F468}'
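
Both escape styles produce the same string, as do the corresponding `String` factory functions. A quick check (variable names are illustrative):

```javascript
const viaSurrogates = '\uD83D\uDC68';              // simple Unicode escapes
const viaCodePoint = '\u{1F468}';                  // Unicode code point escape (ES2015)
const viaFromCodePoint = String.fromCodePoint(0x1F468);
const viaFromCharCode = String.fromCharCode(0xD83D, 0xDC68);

console.log(viaSurrogates === viaCodePoint);    // true
console.log(viaCodePoint === viaFromCodePoint); // true
console.log(viaCodePoint === viaFromCharCode);  // true
```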

  57. Okay, what's the deal?

  58. JAVASCRIPT
    - String.prototype.length -
    String.prototype.length
    This property returns
    the number of code units in the string.
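
If the number of code points is wanted instead of the number of code units, one option is to count what the string iterator yields. `countCodePoints` below is a hypothetical helper, shown only as a sketch; it counts code points, not user-perceived characters.

```javascript
// Hypothetical helper: count code points instead of code units.
function countCodePoints(str) {
  let count = 0;
  // for...of uses the string iterator, which steps through code points.
  for (const _codePoint of str) {
    count += 1;
  }
  return count;
}

// Man + ZWJ + woman + ZWJ + boy, written with escapes.
const familyString = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';

console.log(familyString.length);           // 8 code units
console.log(countCodePoints(familyString)); // 5 code points
console.log(countCodePoints('ABC'));        // 3
```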

  59. - the spread operator -
    The spread operator works for
    every iterable object.
    [...'ABC']
    JAVASCRIPT


  60. - the spread operator -
    The spread operator works for
    every iterable object.
    [...'ABC']
    JAVASCRIPT
    > ''[Symbol.iterator]
    function [Symbol.iterator]() { [native code] }


  61. JAVASCRIPT
    - the spread operator -
    String.prototype [ @@iterator ]( )
    [...] iterates over the code points of a String value,
    returning each code point as a String value.
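
Stepping through the iterator by hand shows the behavior described above, and contrasting it with `split('')` shows the difference to code unit based splitting. The sample string is an arbitrary choice for illustration.

```javascript
const text = '\u{1F468}A'; // one supplementary code point followed by 'A'

const iterator = text[Symbol.iterator]();
const first = iterator.next();
const second = iterator.next();
const third = iterator.next();

console.log(first.value.codePointAt(0).toString(16)); // '1f468' (the whole surrogate pair)
console.log(second.value);                            // 'A'
console.log(third.done);                              // true

console.log([...text].length);      // 2, spread uses the iterator (code points)
console.log(text.split('').length); // 3, split('') cuts between code units
```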

  62. Let's go back to the examples

  63. '👨'.length
    2
    1 code point
    but
    2 code units
    (surrogate pair)

  64. '👦🏽'.length
    4
    2 code points
    but
    4 code units
    (2 surrogate pairs)
    👦 + 🏽

  65. '👨‍👩‍👦'.length
    8
    5 code points
    but
    8 code units
    (3 surrogate pairs)
    👨 ZWJ 👩 ZWJ 👦

  66. [...'👨‍👩‍👦']
    ['👨', '\u200D', '👩', '\u200D', '👦']
    U+1F468, U+200D (ZWJ), U+1F469, U+200D (ZWJ), U+1F466

  67. Thanks!
    @stefanjudis
    Slides
    ctfl.io/javascript-emoji-family
    Article
    ctfl.io/emoji-prototype-dot-length
