RegExp and Unicode in JavaScript

How JavaScript regular expressions (RegExp) handle Unicode.


  1. Regex and Unicode


  3. RegExp

  4. RegExp in JavaScript: "abc".match(/abc|efg/) // => ["abc"]

  5. Unicode

  6. Unicode encoding forms: UCS-4/UTF-32, UCS-2/UTF-16, UTF-8

  7. Unicode in JavaScript: string values are treated as sequences of
     UTF-16 code units.
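
A minimal sketch of what storing strings as UTF-16 means in practice. The sample strings `bmp` and `astral` are chosen here for illustration and are not from the slides; the calls used (`length`, `charCodeAt`, `codePointAt`, string iteration) are standard built-ins.

```javascript
// Illustrative sample strings (assumptions, not from the slides).
const bmp = "a";            // U+0061, fits in one 16-bit unit
const astral = "\u{2000B}"; // U+2000B, needs two 16-bit units

// length counts 16-bit units, not characters.
console.log(bmp.length);    // 1
console.log(astral.length); // 2

// charCodeAt reads a single 16-bit unit; codePointAt decodes the pair.
console.log(astral.charCodeAt(0).toString(16));  // "d840"
console.log(astral.codePointAt(0).toString(16)); // "2000b"

// String iteration walks by code point, so the spread has one element.
console.log([...astral].length); // 1
```
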
  8. Surrogate Pairs? UTF-16 covers U+0000 to U+10FFFF, but a single 16-bit
     code unit cannot express a code point of U+10000 or above. Unicode solves
     this with surrogate code points (U+D800 to U+DFFF), which are reserved and
     never assigned to a character in any Unicode encoding form. UTF-16 uses
     this surrogate range to express code points of U+10000 and above by
     combining two code units.
  9. Surrogate Pair: 𠀋 (U+2000B) is encoded as U+D840 U+DC0B
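
The arithmetic behind that pair can be sketched as follows. `toSurrogatePair` is a hypothetical helper name introduced for this example, not a built-in; it follows the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves.

```javascript
// toSurrogatePair is a hypothetical helper (not part of JavaScript).
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x2000B);
console.log(high.toString(16), low.toString(16));             // d840 dc0b
console.log(String.fromCharCode(high, low) === "\u{2000B}");  // true
```
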

  10. Surrogate Pair: "𠀋".match(/./) // => ["\uD840"] (a lone high surrogate, usually displayed as "?")

  11. What happened? RegExp in JavaScript treats a string the same way
      JavaScript itself does: as a sequence of UTF-16 code units. So any (.)
      matches one code unit, not one character, when the string contains a
      surrogate pair. So how can we handle surrogate pairs in RegExp easily?
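
A short sketch of that behavior without any flag. The variable names `text` and `firstMatch` are chosen here for illustration; the string is U+2000B written as an escape so the example does not depend on file encoding.

```javascript
// U+2000B, stored as the two 16-bit units D840 DC0B.
const text = "\u{2000B}";

// Without the u flag, "." consumes a single 16-bit unit.
const firstMatch = text.match(/./)[0];
console.log(firstMatch.length);        // 1
console.log(firstMatch === "\uD840");  // true: only the high surrogate

// The whole string therefore looks like two "characters" to the pattern.
console.log(/^.$/.test(text));   // false
console.log(/^..$/.test(text));  // true
```
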
  12. Unicode Flag

  13. Unicode Flag: JavaScript now has a unicode RegExp flag (written 'u').
      When this flag is specified, some RegExp behaviors change.
  14. Surrogate Pair with Unicode flag: "𠀋".match(/./u) // => ["𠀋"]

  15. Unicode Flag: when the unicode flag is specified, RegExp treats a
      surrogate pair as one single character.
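
A minimal sketch of the difference the flag makes, assuming an engine with ES2015 RegExp support. The variable name `text` is illustrative. The `\u{...}` code point escape inside a pattern is one of the behaviors that is only available with the flag.

```javascript
// U+2000B, stored as a surrogate pair.
const text = "\u{2000B}";

// With the u flag, "." consumes the whole pair as one character.
console.log(text.match(/./u)[0] === text); // true
console.log(/^.$/u.test(text));            // true
console.log(/^.$/.test(text));             // false (no flag, for comparison)

// Code point escapes in the pattern work under the u flag.
console.log(/^\u{2000B}$/u.test(text));    // true
```
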
  16. Unicode Property: every Unicode code point has a General_Category value
      such as “Cc”, “Cf”, and so on, so RegExp can use these categories as
      matching patterns.
  17. Unicode Property (requires the 'u' flag):
      "a1".match(/\p{General_Category=Decimal_Number}/u) // => ["1"]
      "\u0100a".match(/\p{ASCII}/u) // => ["a"]

  18. Unicode Property syntax: /\p{PropertyName=CategoryType}/u or /\p{CategoryType}/u
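
Both forms of the syntax can be tried side by side. This is a sketch that assumes an engine supporting Unicode property escapes (ES2018 or later) and uses the u flag, which the standardized feature requires. The names `digitLong` and `digitShort` are illustrative.

```javascript
// Long form: PropertyName=Value.
const digitLong = /\p{General_Category=Decimal_Number}/u;
// Short form: a General_Category value on its own (Nd = Decimal_Number).
const digitShort = /\p{Nd}/u;

console.log("a1".match(digitLong)[0]);  // "1"
console.log("a1".match(digitShort)[0]); // "1"

// A binary property such as ASCII is written without "=".
console.log("\u0100a".match(/\p{ASCII}/u)[0]); // "a"

// Without the u flag, \p is just the letter "p", so nothing here matches.
console.log("a1".match(/\p{Nd}/)); // null
```
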

  19. Thank you for your time.