Slide 1

Slide 1 text

@mathias JavaScript ❤️ Unicode

Slide 2

Slide 2 text

@mathias

Slide 3

Slide 3 text

JavaScript has a Unicode problem

Slide 4

Slide 4 text

Unicode

Slide 5

Slide 5 text

code point unique name symbol/glyph

Slide 6

Slide 6 text

A LATIN CAPITAL LETTER A U+0041

Slide 7

Slide 7 text

a LATIN SMALL LETTER A U+0061

Slide 8

Slide 8 text

© COPYRIGHT SIGN U+00A9

Slide 9

Slide 9 text

☃ SNOWMAN U+2603

Slide 10

Slide 10 text

💩 PILE OF POO U+1F4A9

Slide 11

Slide 11 text

U+000000 → U+10FFFF

Slide 12

Slide 12 text

(0x10FFFF + 1) code points
↓
17 planes, (0xFFFF + 1) code points each
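The plane arithmetic on this slide can be checked directly. This is a minimal sketch; `planeOf` is a hypothetical helper name that does not appear in the talk, and it uses zero-based plane indices (so the Basic Multilingual Plane is index 0).

```javascript
// Total number of code points and the size of one plane, as stated on the slide.
const TOTAL_CODE_POINTS = 0x10FFFF + 1;
const PLANE_SIZE = 0xFFFF + 1;

// Hypothetical helper: zero-based plane index for a given code point.
function planeOf(codePoint) {
  return Math.floor(codePoint / PLANE_SIZE);
}

console.log(TOTAL_CODE_POINTS / PLANE_SIZE); // 17
console.log(planeOf(0x0041));  // 0 (Basic Multilingual Plane)
console.log(planeOf(0x1F4A9)); // 1 (first astral plane)
```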

Slide 13

Slide 13 text

Unicode plane #1 U+0000 → U+FFFF Basic Multilingual Plane

Slide 14

Slide 14 text

Unicode planes #2-17
U+010000 → U+10FFFF
supplementary planes, a.k.a. astral planes

Slide 15

Slide 15 text

JavaScript

Slide 16

Slide 16 text

Hexadecimal escape sequences
>> '\x41\x42\x43'
'ABC'
>> '\x61\x62\x63'
'abc'
>> '\xA9 Caf\xE9 XYZ'
'© Café XYZ'
can be used for U+0000 → U+00FF

Slide 17

Slide 17 text

Unicode escape sequences
>> '\u0041\u0042\u0043'
'ABC'
>> 'I \u2661 JavaScript!'
'I ♡ JavaScript!'
can be used for U+0000 → U+FFFF

Slide 18

Slide 18 text

…what about astral code points?

Slide 19

Slide 19 text

…what about 💩?*
*…and other, equally important astral symbols

Slide 20

Slide 20 text

💩

Slide 21

Slide 21 text

Unicode code point escapes (ES6)
>> '\u{41}\u{42}\u{43}'
'ABC'
>> '\u{1F4A9}'
'💩' // U+1F4A9
can be used for U+000000 → U+10FFFF

Slide 22

Slide 22 text

Surrogate pairs
>> '\uD83D\uDCA9'
'💩' // U+1F4A9
can be used for U+010000 → U+10FFFF

Slide 23

Slide 23 text

Surrogate pairs
// for astral code points (> 0xFFFF)
function getSurrogates(codePoint) {
  var high = Math.floor((codePoint - 0x10000) / 0x400) + 0xD800;
  var low = (codePoint - 0x10000) % 0x400 + 0xDC00;
  return [ high, low ];
}
function getCodePoint(high, low) {
  var codePoint = (high - 0xD800) * 0x400 + low - 0xDC00 + 0x10000;
  return codePoint;
}
>> getSurrogates(0x1F4A9); // U+1F4A9 is 💩
[ 0xD83D, 0xDCA9 ]
>> getCodePoint(0xD83D, 0xDCA9);
0x1F4A9
mths.be/bed
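The surrogate formulas can be cross-checked against the ES6 built-ins `String.fromCodePoint` and `String.prototype.codePointAt`. This is a sketch only: `getSurrogatesBitwise` is a hypothetical name, and it swaps the division and modulo from the slide for the equivalent shift and mask.

```javascript
// Hypothetical bitwise variant of the slide's formula (valid for code points > 0xFFFF).
function getSurrogatesBitwise(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return [high, low];
}

const [high, low] = getSurrogatesBitwise(0x1F4A9);
const viaFormula = String.fromCharCode(high, low);
const viaBuiltin = String.fromCodePoint(0x1F4A9);

console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(viaFormula === viaBuiltin);           // true
console.log(viaBuiltin.codePointAt(0) === 0x1F4A9); // true
```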

Slide 24

Slide 24 text

JavaScript string length
>> 'A'.length // U+0041
1
>> 'A' == '\u0041'
true
>> 'B'.length // U+0042
1
>> 'B' == '\u0042'
true

Slide 25

Slide 25 text

String length ≠ char count
>> '𝐀'.length // U+1D400
2
>> '𝐀' == '\uD835\uDC00'
true
>> '𝐁'.length // U+1D401
2
>> '𝐁' == '\uD835\uDC01'
true

Slide 26

Slide 26 text

String length ≠ char count
>> '💩'.length // U+1F4A9
2
>> '💩' == '\uD83D\uDCA9'
true
insert obligatory “number two” joke here

Slide 27

Slide 27 text

Real-world example

Slide 28

Slide 28 text

Real-world example

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

String character count
function countSymbols(string) {
  return punycode.ucs2.decode(string).length;
}
>> countSymbols('A') // U+0041
1
>> countSymbols('𝐀') // U+1D400
1
>> countSymbols('💩') // U+1F4A9
1
mths.be/punycode

Slide 32

Slide 32 text

String character count (ES6)
function countSymbols(string) {
  return Array.from(string).length;
}
>> countSymbols('A') // U+0041
1
>> countSymbols('𝐀') // U+1D400
1
>> countSymbols('💩') // U+1F4A9
1

Slide 33

Slide 33 text

JavaScript escape sequences mths.be/bmf

Slide 34

Slide 34 text

If we’re being pedantic…
// it’s actually even more complicated:
>> 'mañana' == 'mañana' // U+00F1 vs. U+006E + U+0303
false

Slide 35

Slide 35 text

If we’re being pedantic…
// it’s actually even more complicated:
>> 'mañana' == 'mañana' // U+00F1 vs. U+006E + U+0303
false
>> 'ma\xF1ana' == 'man\u0303ana'
false
>> 'ma\xF1ana'.length
6
>> 'man\u0303ana'.length
7

Slide 36

Slide 36 text

Unicode normalization (ES6)
function countSymbolsPedantically(string) {
  // Unicode Normalization, NFC form:
  var normalized = string.normalize('NFC');
  // Account for astral symbols / surrogates:
  return Array.from(normalized).length;
}
>> countSymbolsPedantically('mañana') // U+00F1
6
>> countSymbolsPedantically('mañana') // U+006E + U+0303
6
git.io/unorm
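To make the effect of normalization concrete, here is a small sketch comparing the two spellings of "mañana" before and after `String.prototype.normalize`. The variable names are illustrative and not taken from the talk.

```javascript
// Precomposed ñ (U+00F1) versus n followed by a combining tilde (U+006E + U+0303).
const composed = 'ma\xF1ana';
const decomposed = 'man\u0303ana';

console.log(composed === decomposed); // false
console.log(composed.normalize('NFC') === decomposed.normalize('NFC')); // true

// NFC composes to the shorter form, NFD decomposes to the longer one.
console.log(decomposed.normalize('NFC').length); // 6
console.log(composed.normalize('NFD').length);   // 7
```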

Slide 37

Slide 37 text

Perfect? >> var zalgo = 'H ̹̙̦̮͉̩̗̗ ͧ̇̏̊̾ Eͨ͆͒̆ͮ̃ ͏̷̮̣̫̤̣ ̵̞̹̻ ̀̉̓ͬ͑͡ ͅ Cͯ̂͐ ͏̨̛͔̦̟͈̻ O ̜͎͍͙͚̬̝̣ ̽ͮ͐͗̀ͤ̍̀ ͢ M ̴̡̲̭͍͇̼̟̯̦ ̉̒͠ Ḛ̛̙̞̪̗ ͥ ͤͩ̾͑̔͐ ͅ Ṯ̴̷̷̗̼͍ ̿̿̓̽͐ H ̙̙ ̔̄ ͜ ';

Slide 38

Slide 38 text

Perfect? Nope. → can be ‘fixed’ using epic regex-fu >> var zalgo = 'H ̹̙̦̮͉̩̗̗ ͧ̇̏̊̾ Eͨ͆͒̆ͮ̃ ͏̷̮̣̫̤̣ ̵̞̹̻ ̀̉̓ͬ͑͡ ͅ Cͯ̂͐ ͏̨̛͔̦̟͈̻ O ̜͎͍͙͚̬̝̣ ̽ͮ͐͗̀ͤ̍̀ ͢ M ̴̡̲̭͍͇̼̟̯̦ ̉̒͠ Ḛ̛̙̞̪̗ ͥ ͤͩ̾͑̔͐ ͅ Ṯ̴̷̷̗̼͍ ̿̿̓̽͐ H ̙̙ ̔̄ ͜ '; ! >> countSymbolsPedantically(zalgo) 116 // not 9
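One possible form of that regex-fu, shown here as an assumption rather than the regex used in the talk: strip every run of combining marks that follows a base symbol, then count code points. `countSymbolsIgnoringMarks` is a hypothetical name, and the `\p{M}` property escape requires a later engine (ES2018) than the ES6 features shown so far.

```javascript
// Hypothetical helper: count symbols while ignoring trailing combining marks.
function countSymbolsIgnoringMarks(string) {
  // Normalize first so canonically equivalent input is treated the same way.
  const normalized = string.normalize('NFC');
  // Replace "base symbol + one or more combining marks" with just the base symbol.
  const stripped = normalized.replace(/(\P{M})\p{M}+/gu, '$1');
  // Count code points, so astral symbols count once.
  return Array.from(stripped).length;
}

console.log(countSymbolsIgnoringMarks('H\u0339\u0319E\u0368\u0346')); // 2
console.log(countSymbolsIgnoringMarks('man\u0303ana')); // 6
console.log(countSymbolsIgnoringMarks('💩')); // 1
```

This still ignores other grapheme-cluster rules (for example emoji joined with U+200D), so it is a partial fix at best.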

Slide 39

Slide 39 text

Reversing a string in JavaScript
// naive solution
function reverse(string) {
  return string.split('').reverse().join('');
}

Slide 40

Slide 40 text

Reversing a string in JavaScript
// naive solution
function reverse(string) {
  return string.split('').reverse().join('');
}
>> reverse('abc')
'cba'

Slide 41

Slide 41 text

Reversing a string in JavaScript
// naive solution
function reverse(string) {
  return string.split('').reverse().join('');
}
>> reverse('abc')
'cba'
>> reverse('mañana') // U+00F1
'anañam'

Slide 42

Slide 42 text

Reversing a string in JavaScript
// naive solution
function reverse(string) {
  return string.split('').reverse().join('');
}
>> reverse('abc')
'cba'
>> reverse('mañana') // U+00F1
'anañam'
>> reverse('mañana') // U+006E + U+0303
'anãnam'

Slide 43

Slide 43 text

Reversing a string in JavaScript
// naive solution
function reverse(string) {
  return string.split('').reverse().join('');
}
>> reverse('abc')
'cba'
>> reverse('mañana') // U+00F1
'anañam'
>> reverse('mañana') // U+006E + U+0303
'anãnam'
>> reverse('💩') // U+1F4A9
'��' // '\uDCA9\uD83D', the surrogate pair for 💩, in the wrong order
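A partial fix, sketched under the assumption that ES6 string iteration is available: reversing by code point instead of by code unit keeps surrogate pairs intact. `reverseCodePoints` is a hypothetical name; it does not handle combining marks, so the decomposed "mañana" still comes out wrong.

```javascript
// Hypothetical helper: reverse a string by code points rather than code units.
function reverseCodePoints(string) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(string).reverse().join('');
}

console.log(reverseCodePoints('ab💩') === '💩ba'); // true

// Combining marks are still separated from their base letter:
// the tilde ends up on the `a` instead of the `n`.
console.log(reverseCodePoints('man\u0303ana') === 'ana\u0303nam'); // true
```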

Slide 44

Slide 44 text

“I put my thang down, flip it, and reverse it” — Missy ‘Misdemeanor’ Elliott, 2002

Slide 45

Slide 45 text

Reversing a string in JavaScript
// Using the Esrever library
var reverse = esrever.reverse;
>> reverse('abc')
'cba'
>> reverse('mañana') // U+00F1
'anañam'
>> reverse('mañana') // U+006E + U+0303
'anañam'
>> reverse('💩') // U+1F4A9
'💩'
mths.be/esrever

Slide 46

Slide 46 text

This behavior affects other string methods, too.

Slide 47

Slide 47 text

String.fromCharCode()
>> String.fromCharCode(0x0041) // U+0041
'A' // U+0041
>> String.fromCharCode(0x1F4A9) // U+1F4A9
'\uF4A9' // U+F4A9
only works as you’d expect for U+0000 → U+FFFF
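The U+F4A9 result comes from `String.fromCharCode` keeping only the low 16 bits of its argument. A short sketch of that truncation, with illustrative variable names:

```javascript
// String.fromCharCode converts each argument to an unsigned 16-bit integer,
// so 0x1F4A9 silently becomes 0xF4A9.
const truncated = String.fromCharCode(0x1F4A9);

console.log(truncated === '\uF4A9'); // true
console.log(truncated.charCodeAt(0) === (0x1F4A9 & 0xFFFF)); // true

// ES6 String.fromCodePoint produces the full surrogate pair instead.
const full = String.fromCodePoint(0x1F4A9);
console.log(full === '\uD83D\uDCA9'); // true
```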

Slide 48

Slide 48 text

String.fromCharCode()
→ use surrogate pairs for astral symbols:
>> String.fromCharCode(0xD83D, 0xDCA9)
'💩' // U+1F4A9

Slide 49

Slide 49 text

String.fromCharCode()
→ use surrogate pairs for astral symbols:
>> String.fromCharCode(0xD83D, 0xDCA9)
'💩' // U+1F4A9
→ or just use Punycode.js:
>> punycode.ucs2.encode([ 0x1F4A9 ])
'💩' // U+1F4A9

Slide 50

Slide 50 text

String.fromCodePoint() (ES6)
>> String.fromCodePoint(0x1F4A9)
'💩'
can be used for U+000000 → U+10FFFF
mths.be/fromcodepoint

Slide 51

Slide 51 text

String#charAt()
>> '💩'.charAt(0) // U+1F4A9
'\uD83D' // U+D83D

Slide 52

Slide 52 text

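String#at() (ES7 proposal)
>> '💩'.at(0) // U+1F4A9
'💩' // U+1F4A9
mths.be/at
(the String#at() later standardized in ES2022 indexes code units instead and returns '\uD83D' here)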

Slide 53

Slide 53 text

String#charCodeAt()
>> '💩'.charCodeAt(0) // U+1F4A9
0xD83D

Slide 54

Slide 54 text

String#codePointAt() (ES6)
>> '💩'.codePointAt(0) // U+1F4A9
0x1F4A9
mths.be/codepointat

Slide 55

Slide 55 text

Iterate over all symbols in a string
function getSymbols(string) {
  var length = string.length;
  var index = -1;
  var output = [];
  var character;
  var charCode;
  while (++index < length) {
    character = string.charAt(index);
    charCode = character.charCodeAt(0);
    if (charCode >= 0xD800 && charCode <= 0xDBFF) {
      // note: this doesn’t account for lone high surrogates
      output.push(character + string.charAt(++index));
    } else {
      output.push(character);
    }
  }
  return output;
}
var symbols = getSymbols('💩');
symbols.forEach(function(symbol) {
  assert(symbol == '💩');
});

Slide 56

Slide 56 text

Iterate over all symbols in a string (ES6)
for (let symbol of '💩') {
  assert(symbol == '💩');
}
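Building on the ES6 string iterator, the following sketch lists the code point of every symbol in a string. `codePointsOf` is a hypothetical helper name, and the `U+XXXX` formatting is just one way to print the values.

```javascript
// Hypothetical helper: format each symbol's code point as U+XXXX.
function codePointsOf(string) {
  const output = [];
  // for...of walks the string by code point, not by code unit.
  for (const symbol of string) {
    const hex = symbol.codePointAt(0).toString(16).toUpperCase();
    output.push('U+' + hex.padStart(4, '0'));
  }
  return output;
}

console.log(codePointsOf('a💩')); // [ 'U+0061', 'U+1F4A9' ]
```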

Slide 57

Slide 57 text

More string madness
• String#substring
• String#slice
• …anything that involves strings
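As an illustration of how `String#slice` is affected, here is a sketch that cuts a surrogate pair in half and then avoids doing so by slicing an array of code points. `sliceSymbols` is a hypothetical name, not an API from the talk.

```javascript
const text = 'I💩JS';

// Indices count code units, so this cuts the surrogate pair in half
// and leaves a lone high surrogate at the end.
const broken = text.slice(0, 2);
console.log(broken === 'I\uD83D'); // true

// Hypothetical helper: slice by code points instead of code units.
function sliceSymbols(string, start, end) {
  return Array.from(string).slice(start, end).join('');
}

console.log(sliceSymbols(text, 0, 2) === 'I💩'); // true
```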

Slide 58

Slide 58 text

Regular expressions
>> /foo.bar/.test('foo💩bar')
false

Slide 59

Slide 59 text

Match any Unicode symbol
>> /^.$/.test('💩')
false // doesn’t match line breaks, either

Slide 60

Slide 60 text

Match any Unicode symbol
>> /^.$/.test('💩')
false // doesn’t match line breaks, either
>> /^[\s\S]$/.test('💩')
false // matches line breaks, but still doesn’t match whole astral symbols

Slide 61

Slide 61 text

Match any Unicode symbol
>> /^.$/.test('💩')
false // doesn’t match line breaks, either
>> /^[\s\S]$/.test('💩')
false // matches line breaks, but still doesn’t match whole astral symbols
>> /^(?:[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])$/.test('💩')
true // wtf

Slide 62

Slide 62 text

Create Unicode-aware regular expressions
>> regenerate().addRange(0x0, 0x10FFFF).toString()
mths.be/regenerate

Slide 63

Slide 63 text

Create Unicode-aware regular expressions
>> regenerate().addRange(0x0, 0x10FFFF).toString()
'[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
mths.be/regenerate

Slide 64

Slide 64 text

Create Unicode-aware regular expressions
>> regenerate().addRange(0x0, 0x10FFFF).toString()
'[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
>> regenerate()
…… .addRange(0x000000, 0x10FFFF) // add all Unicode code points
mths.be/regenerate

Slide 65

Slide 65 text

Create Unicode-aware regular expressions
>> regenerate().addRange(0x0, 0x10FFFF).toString()
'[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
>> regenerate()
…… .addRange(0x000000, 0x10FFFF) // add all Unicode code points
…… .removeRange('A', 'z') // remove all symbols from `A` to `z`
mths.be/regenerate

Slide 66

Slide 66 text

Create Unicode-aware regular expressions
>> regenerate().addRange(0x0, 0x10FFFF).toString()
'[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
>> regenerate()
…… .addRange(0x000000, 0x10FFFF) // add all Unicode code points
…… .removeRange('A', 'z') // remove all symbols from `A` to `z`
…… .remove('💩') // remove U+1F4A9 PILE OF POO
mths.be/regenerate

Slide 67

Slide 67 text

Create Unicode-aware regular expressions
>> regenerate().addRange(0x0, 0x10FFFF).toString()
'[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
>> regenerate()
…… .addRange(0x000000, 0x10FFFF) // add all Unicode code points
…… .removeRange('A', 'z') // remove all symbols from `A` to `z`
…… .remove('💩') // remove U+1F4A9 PILE OF POO
…… .toString();
mths.be/regenerate

Slide 68

Slide 68 text

Create Unicode-aware regular expressions
>> regenerate().addRange(0x0, 0x10FFFF).toString()
'[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
>> regenerate()
…… .addRange(0x000000, 0x10FFFF) // add all Unicode code points
…… .removeRange('A', 'z') // remove all symbols from `A` to `z`
…… .remove('💩') // remove U+1F4A9 PILE OF POO
…… .toString();
'[\0-\x1F\x21-\x40\x7B-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
mths.be/regenerate

Slide 69

Slide 69 text

Create Unicode-aware regular expressions
>> var regenerate = require('regenerate');
>> var symbols = require('unicode-6.3.0/scripts/Greek/symbols');
>> var set = regenerate(symbols);
>> set.toString();
'[\u0370-\u0373\u0375-\u0377\u037A-\u037D\u0384\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03FF\u1D26-\u1D2A\u1D5D-\u1D61\u1D66-\u1D6A\u1DBF\u1F00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FC4\u1FC6-\u1FD3\u1FD6-\u1FDB\u1FDD-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFE\u2126]|\uD800[\uDD40-\uDD8A]|\uD834[\uDE00-\uDE45]'
mths.be/regenerate
mths.be/node-unicode-data

Slide 70

Slide 70 text

Create Unicode-aware regular expressions
>> var regenerate = require('regenerate');
>> var symbols = require('unicode-7.0.0/scripts/Greek/symbols');
>> var set = regenerate(symbols);
>> set.toString();
'[\u0370-\u0373\u0375-\u0377\u037A-\u037D\u037F\u0384\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03FF\u1D26-\u1D2A\u1D5D-\u1D61\u1D66-\u1D6A\u1DBF\u1F00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FC4\u1FC6-\u1FD3\u1FD6-\u1FDB\u1FDD-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFE\u2126\uAB65]|\uD800[\uDD40-\uDD8C\uDDA0]|\uD834[\uDE00-\uDE45]'
mths.be/regenerate
mths.be/node-unicode-data

Slide 71

Slide 71 text

Regular expressions
>> /foo.bar/.test('foo💩bar')
false
>> /foo.bar/u.test('foo💩bar') // ES6
true
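The `u` flag changes what `.` consumes: a whole code point rather than a single code unit. The sketch below also uses the `s` (dotAll) flag, which was added after ES6 (in ES2018) and is not mentioned in the talk, to cover line breaks as well.

```javascript
// Without `u`, the dot matches one code unit, i.e. half of the surrogate pair.
const withoutU = /^.$/.test('💩');
// With `u`, the dot matches the whole astral symbol.
const withU = /^.$/u.test('💩');
// The dot still skips line breaks unless the `s` flag is added.
const newlineWithU = /^.$/u.test('\n');
const newlineWithSU = /^.$/su.test('\n');

console.log(withoutU, withU);             // false true
console.log(newlineWithU, newlineWithSU); // false true
console.log('a💩b'.match(/./gu).length);  // 3
```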

Slide 72

Slide 72 text

Regex character classes
>> /[a-c]/
// matches:
// U+0061 LATIN SMALL LETTER A
// U+0062 LATIN SMALL LETTER B
// U+0063 LATIN SMALL LETTER C
>> /^[a-c]$/.test('a')
true
>> /^[a-c]$/.test('b')
true
>> /^[a-c]$/.test('c')
true

Slide 73

Slide 73 text

Regex character classes
>> /[💩-💫]/
// matches:
// U+1F4A9 PILE OF POO
// U+1F4AA FLEXED BICEPS
// U+1F4AB DIZZY SYMBOL
>> /^[💩-💫]$/.test('💩')
true
>> /^[💩-💫]$/.test('💪')
true
>> /^[💩-💫]$/.test('💫')
true

Slide 74

Slide 74 text

Regex character classes ✘
>> /[💩-💫]/
// matches:
// U+1F4A9 PILE OF POO
// U+1F4AA FLEXED BICEPS
// U+1F4AB DIZZY SYMBOL
>> /^[💩-💫]$/.test('💩')
true
>> /^[💩-💫]$/.test('💪')
true
>> /^[💩-💫]$/.test('💫')
true

Slide 75

Slide 75 text

Regex character classes
>> /[💩-💫]/
SyntaxError: Invalid regular expression: Range out of order in character class

Slide 76

Slide 76 text

Regex character classes
>> /[💩-💫]/
SyntaxError: Invalid regular expression: Range out of order in character class
>> /[\uD83D\uDCA9-\uD83D\uDCAB]/

Slide 77

Slide 77 text

Regex character classes
>> /[💩-💫]/
SyntaxError: Invalid regular expression: Range out of order in character class
>> /[\uD83D\uDCA9-\uD83D\uDCAB]/

Slide 78

Slide 78 text

Regex character classes (ES6) ✔
>> /[💩-💫]/u
// matches:
// U+1F4A9 PILE OF POO
// U+1F4AA FLEXED BICEPS
// U+1F4AB DIZZY SYMBOL
>> /^[💩-💫]$/u.test('💩')
true
>> /^[💩-💫]$/u.test('💪')
true
>> /^[💩-💫]$/u.test('💫')
true
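The same astral range can be written with code point escapes, which avoids pasting literal symbols into the source. This is a sketch with illustrative names; U+1F4AC SPEECH BALLOON is used only as an out-of-range counterexample.

```javascript
// Range from U+1F4A9 to U+1F4AB, written with ES6 code point escapes.
const astralRange = /^[\u{1F4A9}-\u{1F4AB}]$/u;

console.log(astralRange.test('💩')); // true  (U+1F4A9)
console.log(astralRange.test('💪')); // true  (U+1F4AA)
console.log(astralRange.test('💫')); // true  (U+1F4AB)
console.log(astralRange.test('💬')); // false (U+1F4AC, just outside the range)
```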

Slide 79

Slide 79 text

Regex character classes
>> regenerate().addRange('💩', '💫').toString()
'\uD83D[\uDCA9-\uDCAB]'
>> /^\uD83D[\uDCA9-\uDCAB]$/.test('💩')
true
>> /^\uD83D[\uDCA9-\uDCAB]$/.test('💪')
true
>> /^\uD83D[\uDCA9-\uDCAB]$/.test('💫')
true
mths.be/regenerate

Slide 80

Slide 80 text

Unicode regex transpiler mths.be/regexpu

Slide 81

Slide 81 text

Unicode regex transpiler mths.be/regexpu

Slide 82

Slide 82 text

Unicode regex transpiler mths.be/regexpu

Slide 83

Slide 83 text

JavaScript has a Unicode problem

Slide 84

Slide 84 text

No content

Slide 85

Slide 85 text

💩 The Pile of Poo Test™

Slide 86

Slide 86 text

Thanks! Questions? → @mathias mths.be/jsu