RegExp and Unicode in JavaScript

How RegExp handles Unicode in JavaScript.


  1. Surrogate Pairs? UTF-16 can represent U+0000 to U+10FFFF, but a single 16-bit code unit cannot express a code point above U+FFFF. Unicode solves this by reserving surrogate code points (U+D800 to U+DFFF). Surrogate code points are code points that are not assigned to any character in any Unicode encoding form. UTF-16 uses this surrogate range to express code points from U+10000 upward by combining two code units, a high surrogate followed by a low surrogate.
  2. What happens? RegExp in JavaScript treats a string value the same way JavaScript itself does, as a sequence of UTF-16 code units. So if the string contains a surrogate pair, any (.) matches one code unit, not one character. How can we handle surrogate pairs easily with RegExp?
  3. Unicode Flag: JavaScript now has a unicode RegExp flag (written ‘u’). When this flag is specified, some RegExp behaviors change; for example, matching works on code points, so any (.) matches a whole surrogate pair.
  4. Unicode Property: Every Unicode code point has a general category such as “Cc” or “Cf”, so a RegExp with the ‘u’ flag can use these categories as matching patterns, for example \p{Cc}.
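
The four slides above can be tried out in a few lines. The following is a minimal sketch rather than code from the talk: the names `face` and `toSurrogatePair` are hypothetical, U+1F600 is just one convenient code point above U+FFFF, and the property escapes assume an engine that supports ES2018.

```javascript
// Illustrative sketch; `face` and `toSurrogatePair` are hypothetical names.

// U+1F600 is above U+FFFF, so UTF-16 stores it as a surrogate pair.
const face = "\u{1F600}";

// Split a supplementary code point (U+10000..U+10FFFF) into two code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // 20-bit offset into the supplementary range
  const high = 0xd800 + (offset >> 10);   // upper 10 bits -> high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // lower 10 bits -> low surrogate
  return [high, low];
}

console.log(toSurrogatePair(0x1f600).map((u) => u.toString(16)).join(" ")); // d83d de00
console.log(face.length);                      // 2 (code units, not characters)
console.log(face.codePointAt(0).toString(16)); // 1f600

// Without the u flag, "." matches a single code unit.
console.log(/^.$/.test(face));  // false
console.log(/^..$/.test(face)); // true

// With the u flag, "." matches a whole code point.
console.log(/^.$/u.test(face));        // true
console.log(face.match(/./gu).length); // 1

// Property escapes (they require the u flag) match by general category.
console.log(/\p{Cc}/u.test("\u0000")); // true: NULL is a control character (Cc)
console.log(/\p{Cf}/u.test("\u00AD")); // true: SOFT HYPHEN is a format character (Cf)
console.log(/\p{Cc}/u.test("A"));      // false
```

Writing a pattern such as /\p{Cc}/ without the ‘u’ flag does not raise an error in most engines; it silently matches something else, so the flag is easy to forget.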