Slide 1

Slide 1 text

Regex and Unicode

Slide 2

Slide 2 text

Name @brn
Occupation Frontend Developer, Product Owner
Company Cyberagent Adtech Studio, AI Messenger
OSS Contributor of V
About httpinfobnch

Slide 3

Slide 3 text

RegExp

Slide 4

Slide 4 text

RegExp in JavaScript "abc".match(/abc|efg/) // => ["abc"]

Slide 5

Slide 5 text

Unicode

Slide 6

Slide 6 text

Unicode encoding forms: UCS-4/UTF-32, UCS-2/UTF-16, UTF-8

Slide 7

Slide 7 text

Unicode in JavaScript In JavaScript, string values are treated as sequences of UTF-16 code units.
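As a rough illustration of what "UTF-16 code units" means in practice, the sketch below compares the code unit count with the code point count of a string. The helper name describeString is made up for this example; length, charCodeAt, codePointAt and Array.from are standard JavaScript.

```javascript
// Hypothetical helper: reports how a string looks at the code unit level
// and at the code point level.
function describeString(str) {
  return {
    codeUnits: str.length,                            // counts UTF-16 code units
    codePoints: Array.from(str).length,               // string iteration walks code points
    firstUnit: str.charCodeAt(0).toString(16),        // first 16-bit code unit
    firstCodePoint: str.codePointAt(0).toString(16),  // first full code point
  };
}

console.log(describeString("a"));
// { codeUnits: 1, codePoints: 1, firstUnit: '61', firstCodePoint: '61' }

console.log(describeString("\u{2000B}"));
// { codeUnits: 2, codePoints: 1, firstUnit: 'd840', firstCodePoint: '2000b' }
```
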

Slide 8

Slide 8 text

Surrogate Pairs? UTF-16 covers U+0000 to U+10FFFF, but a single 16-bit code unit cannot express code points at or above U+10000. Unicode solves this by reserving the surrogate code points (U+D800 to U+DFFF), which are never assigned to characters in any Unicode encoding form. UTF-16 uses this surrogate range to express a code point at or above U+10000 as a combination of two code units: a high surrogate followed by a low surrogate.

Slide 9

Slide 9 text

Surrogate Pair 𠀋 (U+2000B) is encoded as the pair U+D840 U+DC0B
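The pair for U+2000B can be derived by hand with the standard UTF-16 arithmetic. The following is a minimal sketch of that calculation; the function name toSurrogatePair is an assumption made for this example, not a built-in.

```javascript
// Hypothetical helper: splits a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  const offset = codePoint - 0x10000;       // 20-bit value
  const high = 0xD800 + (offset >> 10);     // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x2000B);
console.log(high.toString(16), low.toString(16)); // d840 dc0b

// Joining the two code units gives back the original character.
console.log(String.fromCharCode(high, low) === "\u{2000B}"); // true
```
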

Slide 10

Slide 10 text

Surrogate Pair "𠀋".match(/./) // => ["\uD840"] (a lone high surrogate, typically displayed as '?')

Slide 11

Slide 11 text

What happened? RegExp in JavaScript treats string values the same way the language does: as sequences of UTF-16 code units. So the dot (.) matches one code unit, not one character, when the string contains a surrogate pair. So how can we handle surrogate pairs in RegExp easily?
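A small sketch of this behavior, using U+2000B as the sample character (the variable names are only for illustration):

```javascript
const kanji = "\u{2000B}"; // stored as the two code units D840 DC0B

// Without the u flag, "." consumes exactly one code unit.
const firstMatch = kanji.match(/./)[0];
console.log(firstMatch.length);                     // 1
console.log(firstMatch.charCodeAt(0).toString(16)); // d840

// The whole character therefore needs two dots, not one.
console.log(/^.$/.test(kanji));  // false
console.log(/^..$/.test(kanji)); // true
```
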

Slide 12

Slide 12 text

Unicode Flag

Slide 13

Slide 13 text

Unicode Flag JavaScript now has a unicode RegExp flag (written as 'u'). If this flag is specified, some RegExp behaviors change.

Slide 14

Slide 14 text

Surrogate Pair with Unicode flag "𠀋".match(/./u) // => ['𠀋']

Slide 15

Slide 15 text

Unicode Flag If the unicode flag is specified, RegExp treats a surrogate pair as one single character.
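To make the difference concrete, here is a short comparison of the same patterns with and without the flag. The helper countDotMatches is a name invented for this sketch; the \u{...} escape and code point ranges in character classes are standard features enabled by the u flag.

```javascript
const sample = "a\u{2000B}b"; // 3 characters, 4 UTF-16 code units

// Hypothetical helper: counts how many times "." matches under the given flags.
function countDotMatches(str, flags) {
  return (str.match(new RegExp(".", flags)) || []).length;
}

console.log(countDotMatches(sample, "g"));  // 4 (one match per code unit)
console.log(countDotMatches(sample, "gu")); // 3 (one match per code point)

// With the u flag, a single dot covers the whole surrogate pair.
console.log(/^.$/u.test("\u{2000B}")); // true

// The u flag also enables \u{...} escapes and ranges above U+FFFF.
console.log(/\u{2000B}/u.test(sample));                      // true
console.log(/^[\u{20000}-\u{2A6DF}]$/u.test("\u{2000B}"));   // true
```
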

Slide 16

Slide 16 text

Unicode Property Every Unicode code point has properties, such as a General_Category value like "Cc" (control) or "Cf" (format). RegExp can use these properties as matching patterns.

Slide 17

Slide 17 text

Unicode Property (requires the u flag) "a1".match(/\p{General_Category=Decimal_Number}/u) // => ["1"] "\u0100a".match(/\p{ASCII}/u) // => ["a"]
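A slightly larger sketch of property escapes, assuming an engine that supports them (they were standardized in ES2018). The helper name extractDigits is hypothetical. The last two lines show why the u flag matters: without it, \p is read as a plain "p" followed by literal braces.

```javascript
// Hypothetical helper: collects every decimal digit, in any script.
function extractDigits(str) {
  return str.match(/\p{General_Category=Decimal_Number}/gu) || [];
}

// "\u0663" is ARABIC-INDIC DIGIT THREE, which is also a Decimal_Number.
console.log(extractDigits("a1\u0663")); // [ '1', '٣' ]

// Binary properties such as ASCII take no value.
console.log("\u0100a".match(/\p{ASCII}/u)[0]); // a

// Without the u flag the pattern is matched literally.
console.log(/\p{ASCII}/.test("a"));        // false
console.log(/\p{ASCII}/.test("p{ASCII}")); // true
```
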

Slide 18

Slide 18 text

Unicode Property Syntax: /\p{PropertyName=CategoryType}/u or /\p{CategoryType}/u
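The two syntax forms can be checked against each other. In this sketch, the three patterns are assumed to be equivalent spellings of the same General_Category value (long name, gc alias, and the value-only shorthand), and \P is the negated form.

```javascript
// Three spellings of "uppercase letter".
const longForm = /^\p{General_Category=Uppercase_Letter}$/u;
const aliasForm = /^\p{gc=Lu}$/u;
const shortForm = /^\p{Lu}$/u;

for (const re of [longForm, aliasForm, shortForm]) {
  console.log(re.test("A"), re.test("a")); // true false
}

// \P{...} (capital P) matches everything the property does NOT match.
console.log(/^\P{Lu}$/u.test("a")); // true

// Other properties use the PropertyName=Value form as well.
console.log(/^\p{Script=Greek}$/u.test("\u03B1")); // true (GREEK SMALL LETTER ALPHA)
```
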

Slide 19

Slide 19 text

Thank you for your time.