RegExp and Unicode in JavaScript

About the treatment of RegExp and Unicode in JavaScript.

Transcript

  1. Regex and Unicode

  2. Name
    @brn

    Occupation
    Frontend Developer, Product Owner
    Company
    Cyberagent Adtech Studio AI Messenger
    OSS
    Contributor of V8
    About
    http://info.b6n.ch

  3. RegExp

  4. RegExp in JavaScript
    "abc".match(/abc|efg/)
    // => ["abc"]

  5. Unicode

  6. Unicode
    UCS4/UTF-32
    UCS2/UTF-16
    UTF-8
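
A rough sketch of how the three encoding forms differ in size for the same text. The helper name `encodedSizes` is made up for this illustration, and it assumes a runtime where `TextEncoder` is available as a global (modern browsers and recent Node.js).

```javascript
// Number of bytes needed to store a string in each encoding form.
// `encodedSizes` is a hypothetical helper, not part of any library.
function encodedSizes(str) {
  const codePoints = [...str].length; // string iteration walks code points
  return {
    utf8: new TextEncoder().encode(str).length, // TextEncoder always emits UTF-8
    utf16: str.length * 2, // .length counts 16-bit code units
    utf32: codePoints * 4, // one 32-bit unit per code point
  };
}

const sizeOfLatin = encodedSizes("a"); // U+0061
const sizeOfHiragana = encodedSizes("\u3042"); // U+3042
const sizeOfSupplementary = encodedSizes("\u{2000B}"); // U+2000B

console.log(sizeOfLatin, sizeOfHiragana, sizeOfSupplementary);
```

UTF-32 is fixed-width, while UTF-8 and UTF-16 grow with the code point value.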

  7. Unicode in javascript
    In JavaScript, string values are treated as sequences of UTF-16 code units.
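
A small sketch of what this means in practice, using U+2000B as an example of a character outside the 16-bit range. The variable names are chosen only for this illustration.

```javascript
const latin = "a"; // U+0061, fits in one code unit
const supplementary = "\u{2000B}"; // U+2000B, needs two code units

const latinUnits = latin.length; // 1
const supplementaryUnits = supplementary.length; // 2: length counts code units
const supplementaryCodePoints = [...supplementary].length; // 1: iteration walks code points
const firstHalf = supplementary[0]; // indexing returns a single code unit

console.log(latinUnits, supplementaryUnits, supplementaryCodePoints);
console.log(firstHalf === "\uD840");
```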

  8. Surrogate Pairs?
    UTF-16 can represent U+0000 to U+10FFFF, but a single 16-bit code unit cannot
    express a code point of U+10000 or above.
    Unicode solves this by reserving surrogate code points (U+D800 to U+DFFF).
    Surrogate code points are reserved: no character is ever assigned to them, in
    any Unicode encoding form.
    UTF-16 uses this surrogate range to express code points of U+10000 or above
    by combining two code units.
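
The combination step can be sketched as plain arithmetic. The function names `toSurrogatePair` and `fromSurrogatePair` are hypothetical helpers written for this illustration; in real code the built-in `String.fromCodePoint` and `codePointAt` do the same work.

```javascript
// Split a code point in U+10000..U+10FFFF into a high and a low surrogate.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000; // 20-bit value
  const high = 0xD800 + (offset >> 10); // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Reverse direction: rebuild the code point from the two surrogates.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const pair = toSurrogatePair(0x2000B);
const roundTrip = fromSurrogatePair(pair[0], pair[1]);

console.log(pair.map((unit) => unit.toString(16)), roundTrip.toString(16));
```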

  9. Surrogate Pair

    𠀋 (U+2000B)
    UTF-16: U+D840 U+DC0B
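
The pair on this slide can be checked with built-in string methods. This is only a quick verification sketch; the variable names are illustrative.

```javascript
const kanji = String.fromCodePoint(0x2000B);

// List every UTF-16 code unit of the string in U+XXXX notation.
const units = Array.from({ length: kanji.length }, (_, i) =>
  "U+" + kanji.charCodeAt(i).toString(16).toUpperCase()
);

// Writing the two surrogates by hand yields the same string.
const sameString = kanji === "\uD840\uDC0B";

console.log(units, sameString);
```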

  10. Surrogate Pair
    "𠀋".match(/./)
    // => ["\uD840"] (a lone high surrogate, usually displayed as "?")

  11. What happened?
    RegExp in JavaScript treats string values the same way the language does: as
    sequences of UTF-16 code units.
    So the any-character pattern (.) matches one code unit, not one character, when
    the string contains a surrogate pair.
    So how can we handle surrogate pairs with RegExp easily?
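
A short sketch that makes the problem visible, assuming a string holding only U+2000B. Without any flag, one dot is not enough to cover the whole character.

```javascript
const text = "\u{2000B}"; // one character, two code units

const oneDot = /^.$/.test(text); // false: a single dot covers only one code unit
const twoDots = /^..$/.test(text); // true: it takes two dots to span the pair
const firstMatch = text.match(/./)[0]; // the lone high surrogate
const matchCount = text.match(/./g).length; // 2 matches for 1 character

console.log(oneDot, twoDots, matchCount);
```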

  12. Unicode Flag

  13. Unicode Flag
    JavaScript now has a Unicode RegExp flag, written 'u'.
    When this flag is specified, some RegExp behaviors change.
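
A minimal sketch of setting and inspecting the flag, plus one example of a behavior that changes. The `\u{...}` escape is used here as that example; it is only one of several differences and is picked for illustration.

```javascript
const literalForm = /./u;
const constructorForm = new RegExp(".", "u");

const flagIsSet = literalForm.unicode; // true
const flagIsUnset = /./.unicode; // false
const flagString = constructorForm.flags; // "u"

// With the flag, \u{61} is a code point escape for "a".
const braceEscapeWithFlag = /^\u{61}$/u.test("a");
// Without it, the same source text means the letter "u" repeated 61 times.
const braceEscapeWithoutFlag = /^\u{61}$/.test("a");

console.log(flagIsSet, flagIsUnset, flagString);
console.log(braceEscapeWithFlag, braceEscapeWithoutFlag);
```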

  14. Surrogate Pair with Unicode flag
    "𠀋".match(/./u)
    // => ["𠀋"]

  15. Unicode Flag
    When the Unicode flag is specified, RegExp treats a surrogate pair as a single
    character.
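
A sketch comparing the same patterns with and without the flag on a mixed string. The sample string and variable names are made up for this illustration.

```javascript
const sample = "a\u{2000B}b"; // three characters, four code units

const withoutFlag = sample.match(/./g).length; // 4: each surrogate matches separately
const withFlag = sample.match(/./gu).length; // 3: the pair counts as one character

// Quantifiers and character classes also work per character with the flag.
const twoCharacters = /^.{2}$/u.test("\u{2000B}\u{2000B}");
const inRange = /^[\u{20000}-\u{2FFFF}]$/u.test("\u{2000B}");

console.log(withoutFlag, withFlag, twoCharacters, inRange);
```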

  16. Unicode Property
    Every Unicode code point has a general category such as “Cc” or “Cf”.
    RegExp can use these categories as matching patterns.

  17. Unicode Property
    "a1".match(/\p{General_Category=Decimal_Number}/u)
    // => ["1"]
    "\u0100a".match(/\p{ASCII}/u)
    // => ["a"]

  18. Unicode Property
    /\p{ PropertyName '=' CategoryType | CategoryType }/u
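
A few variations on this syntax, as a sketch. It assumes an engine that supports Unicode property escapes (ES2018 or later), and the `u` flag is required in every case. The short aliases `gc` and `Nd` stand for General_Category and Decimal_Number.

```javascript
// Name=value form, long and short spellings.
const longForm = "a1".match(/\p{General_Category=Decimal_Number}/u);
const shortForm = "a1".match(/\p{gc=Nd}/u);

// Value-only form: a bare general category value.
const valueOnly = "a1".match(/\p{Nd}/u);

// Binary property, and its negation with an uppercase \P.
const asciiOnly = "\u0100a".match(/\p{ASCII}/u);
const nonAscii = "\u0100a".match(/\P{ASCII}/u);

// Another name=value pair, using the Script property.
const hiragana = "a\u3042".match(/\p{Script=Hiragana}/u);

console.log(longForm[0], shortForm[0], valueOnly[0]);
console.log(asciiOnly[0], nonAscii[0], hiragana[0]);
```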

  19. Thank you for your time.
