Unicode Regular Expressions

Introductory lecture on Unicode Regular Expressions. Starts with ASCII text then progresses to Unicode text. Slide 33 has Egyptian Hieroglyphs😀

André Schappo

October 04, 2017

Transcript

  1. Unicode Regular Expressions

    André Schappo
    weibo.com/andreschappo twitter.com/andreschappo schappo.blogspot.co.uk

  2. Data Validation

    Regular Expressions are also written as RegExp or regex.
    Regular Expressions are very useful for validating user input.
    Regular Expressions can replace hundreds of lines of code.
    We will look at Regular Expressions usage:
      on Linux using a terminal app
      client side with JavaScript
      server side with PHP
  3. regex meta characters

    Meta Character   Function
    .                matched with any single character
    *                preceding item matched 0 or more times
    ?                preceding item matched 0 or 1 times
    +                preceding item matched 1 or more times
    {n}              preceding item matched n times
    {n,}             preceding item matched n or more times
  4. regex meta characters

    Meta Character   Function
    {,m}             preceding item matched at most m times
    {n,m}            preceding item matched n through m times
    [...]            matched with any single character in brackets, e.g. [abcd] [小中大]
    [^...]           matched with any single character not in brackets, e.g. [^༂༆༊]
    [.-.]            matched with any single character in range, e.g. [0-9]
    ^                matched with start of string/line
  5. regex meta characters

    Meta Character   Function
    $                matched with end of string/line
    (...)            group
    |                or
    \                escape
    \s               white space (SPACE, IDEOGRAPHIC SPACE, EM QUAD, EN SPACE, THREE-PER-EM SPACE ...)
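
The meta characters listed on slides 3 to 5 can be tried out quickly in JavaScript. This is a minimal sketch: the helper name `matches` and the sample strings are illustrative assumptions, not taken from the slides.

```javascript
// Hypothetical helper: true if the pattern finds a match in s.
function matches(pattern, s) {
  return pattern.test(s);
}

console.log(matches(/^a.c$/, 'abc'));          // true:  . matches any single character
console.log(matches(/^ab*c$/, 'ac'));          // true:  * allows zero occurrences of b
console.log(matches(/^ab?c$/, 'abbc'));        // false: ? allows at most one b
console.log(matches(/^ab+c$/, 'abbc'));        // true:  + needs one or more b
console.log(matches(/^[0-9]{2,3}$/, '1234'));  // false: {n,m} bounds the repetition
console.log(matches(/^[^༂༆༊]$/, 'x'));         // true:  x is not one of the listed characters
console.log(matches(/^(cat|dog)s$/, 'dogs'));  // true:  grouping with alternation
console.log(matches(/^1\+1$/, '1+1'));         // true:  \ escapes the + meta character
console.log(matches(/^\s$/, '\u3000'));        // true:  \s matches IDEOGRAPHIC SPACE
```
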
  6. egrep

    grep is a unix/linux command for finding patterns in text data.
    grep provides basic regexp/regex meta characters.
    egrep extends the set of regexp/regex meta characters.
    I will use egrep interactively to demonstrate the use of regex on sci-project.
  7. egrep

    connecting to sci-project
      OSX: use terminal, ssh login-id@sci-project
      Windows: use PuTTY or similar terminal app
    interactive usage from command line
      egrep 'regex'
    see man egrep
  8. egrep - regex examples

    match using regex '.*'
      string: absolutely anything ✔
    match using regex 'a.*t'
      string: 72auytqw ✔
      string: qwerty ✘
    match using regex '^a.*t$'
      string: azxcvbt ✔
      string: azxcvbtm ✘
  9. egrep - regex examples

    match using regex '^[a-zA-Z]$'
      string: B ✔
      string: Bb ✘
    match using regex '^[a-zA-Z]+$'
      string: ahYgx ✔
      string: B8 ✘
    match using regex '^[a-zA-Z]{5}$'
      string: mnOdm ✔
      string: Tda ✘
  10. egrep - regex examples

    match using regex '^0[0-9]{10}$'
      string: 01234567890 ✔
      string: 01234 567890 ✘
    match using regex '0[0-9]{4}\s*[0-9]{6}$'
      string: 01234 567890 ✔
      string: 1234 567890 ✘
    match using regex '^0([0-79]*8){3,}[0-79]*$'
      string: 01834867880 ✔
      string: 01234567890 ✘
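
The same three phone number patterns can be checked outside egrep. The sketch below writes them as JavaScript regex literals and runs them against the strings from the slide; the constant names are assumptions chosen for illustration.

```javascript
// The three patterns from the slide as JavaScript regex literals.
// The constant names are hypothetical.
const elevenDigits = /^0[0-9]{10}$/;              // leading 0, then exactly 10 digits
const optionalGap = /0[0-9]{4}\s*[0-9]{6}$/;      // 0 + 4 digits, optional white space, 6 digits
const threeEights = /^0([0-79]*8){3,}[0-79]*$/;   // leading 0, then at least three 8s

console.log(elevenDigits.test('01234567890'));   // true
console.log(elevenDigits.test('01234 567890'));  // false: the space is not a digit
console.log(optionalGap.test('01234 567890'));   // true
console.log(optionalGap.test('1234 567890'));    // false: no 0 followed by four digits
console.log(threeEights.test('01834867880'));    // true: four 8s
console.log(threeEights.test('01234567890'));    // false: only one 8
```
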
  11. egrep - regex examples

    match using regex '^[一二三四五六七八九]+$'
      string: 五七八 ✔
      string: 小山 ✘
    match using regex '^人+鸭人+$'
      string: 人人人人人人人人人鸭人人人人人人人人人人人人 ✔
      string: 人人人人人鸡人人人人人人人人人人人人人人人 ✘
    match using regex '^[]+[]+$'
      string: ✔
      string: ✘
    see edition.cnn.com/travel/article/hong-kong-giant-duck
  12. Nottingham Arboretum

  13. egrep - regex examples

    match using regex '^鸭+人鸭+$'
      string: 鸭鸭鸭鸭鸭人鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭 ✔
      string: 鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡人鸡鸡鸡鸡鸡鸡 ✘
    match using regex '^+[]+$'
      string: ✔
      string: ✘
  14. grep -P (PCRE)

    PCRE: Perl Compatible Regular Expressions
    Counting in Unicode codepoints
      grep -P '^.{2}$' will match with flags
      grep -P '^.{3}$' will match with keycap digits
      grep -P '^.{7}$' will match with full family emoji
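
The code point counts behind these three patterns can be reproduced in JavaScript, swapped in here for grep -P. This is a sketch under assumptions: the helper `codePoints` and the three sample emoji sequences are chosen for illustration and are not necessarily the ones shown on the slide.

```javascript
// Count Unicode code points: string iteration walks code points, not UTF-16 code units.
function codePoints(s) {
  return [...s].length;
}

// Sample sequences (assumed for illustration).
const flag = '\u{1F1EC}\u{1F1E7}';   // two REGIONAL INDICATOR symbols, G and B
const keycap = '7\uFE0F\u20E3';      // digit + VARIATION SELECTOR-16 + COMBINING ENCLOSING KEYCAP
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}'; // 4 people joined by 3 ZWJ

console.log(codePoints(flag), /^.{2}$/u.test(flag));      // 2 true
console.log(codePoints(keycap), /^.{3}$/u.test(keycap));  // 3 true
console.log(codePoints(family), /^.{7}$/u.test(family));  // 7 true

// Without the u flag, . counts UTF-16 code units, so the flag is 4 units long.
console.log(/^.{2}$/.test(flag));                         // false
```
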
  15. regex online

    modifiers
      g - global match
      i - case insensitive match
      u - unicode
    regex101.com
    On regex101, /\X/gu works for keycap digits but not for flags or family group emoji.
  16. regex in JavaScript

    JavaScript RegEx Object
      /pattern/modifiers
      /ABC/
      /AbC/i
      /АБВ/ (Cyrillic)
      /АбВ/i (Cyrillic)
  17. regex in JavaScript

    RegEx Object Method
    regex.test(string): test for match in string, returns true or false

      function latin(s){
        return (/ABC/i.test(s));
      }
      function cyrillic(s){
        return (/^АБВ$/i.test(s));
      }
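
A short usage sketch for the two functions above. The input strings are assumptions chosen to show two things: the i modifier also folds case for Cyrillic letters, and only the anchored pattern rejects surrounding text.

```javascript
// The two functions from the slide.
function latin(s) {
  return /ABC/i.test(s);
}
function cyrillic(s) {
  return /^АБВ$/i.test(s);
}

console.log(latin('xxabcxx'));     // true:  unanchored, so a match anywhere is enough
console.log(cyrillic('абв'));      // true:  i folds Cyrillic case as well
console.log(cyrillic('xxабвxx'));  // false: ^ and $ require the whole string to match
console.log(cyrillic('ABC'));      // false: Latin letters are different characters
```
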
  18. regex in JavaScript

    You have already seen the ASCII letter function

      function letter(c){
        //returns true if c is ascii letter, else false
        return ((c>='a'&&c<='z')||(c>='A'&&c<='Z'));
      }

    Let's rewrite the function using regex

      function letter(c){
        //returns true if c is ascii letter, else false
        return (/[a-zA-Z]/.test(c)); //or /[A-Z]/i.test(c)
      }
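
One caveat worth illustrating: the regex version is unanchored, so it answers "does the input contain an ASCII letter" rather than "is the input a single ASCII letter". The sketch below compares it with an anchored variant; `letterStrict` is a hypothetical name that does not appear in the slides.

```javascript
// As on the slide: true if the input contains an ASCII letter anywhere.
function letter(c) {
  return /[a-zA-Z]/.test(c);
}

// Hypothetical anchored variant: true only for exactly one ASCII letter.
function letterStrict(c) {
  return /^[a-zA-Z]$/.test(c);
}

console.log(letter('a'));          // true
console.log(letter('1a2'));        // true:  a letter occurs somewhere in the string
console.log(letterStrict('a'));    // true
console.log(letterStrict('1a2'));  // false: the whole string must be one letter
console.log(letterStrict('é'));    // false: not an ASCII letter
```
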
  19. regex in JavaScript

    Let's build a function to validate a Thai private car registration number entered into a form by a user.

      function car(reg){
        //true if valid reg number, else false
        return (/^[1-9]?[กขงจฉธฐพภวษศสชฌฎญฆ]{2}[       ]*[1-9][0-9]{0,3}$/.test(reg));
      }
  20. regex in JavaScript

    function to validate a Javanese decimal digit

      function javaneseDigit(d){
        //true if javanese digit, else false
        return (/[꧐-꧙]/.test(d));
      }
  21. regex in JavaScript

    function to validate an ascii email address

      function email(e){
        //true if valid ascii email address, else false
        return (/^[A-Z0-9]+\.([A-Z]\.)*[A-Z0-9]+@([A-Z0-9]+\.)+[A-Z]{2,}$/i.test(e));
      }
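
To see what this pattern accepts, here is a small usage sketch. The addresses are made up for illustration (example.com is a reserved documentation domain). Note that the pattern requires at least one dot in the local part, so a plain `name@domain` address is rejected.

```javascript
// The ASCII email validator from the slide.
function email(e) {
  //true if valid ascii email address, else false
  return /^[A-Z0-9]+\.([A-Z]\.)*[A-Z0-9]+@([A-Z0-9]+\.)+[A-Z]{2,}$/i.test(e);
}

console.log(email('andre.schappo@example.com'));  // true
console.log(email('a.b.c@example.co.uk'));        // true:  single-letter middle parts are allowed
console.log(email('andre@example.com'));          // false: no dot in the local part
console.log(email('andré.schappo@example.com'));  // false: é is outside [A-Z0-9]
```
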
  22. regex in JavaScript

    function to validate a Korean email address
    Evaluation of Websites for Acceptance of Email Addresses
    see usag.tech/wp-content/uploads/2017/09/UASG-Report-UASG017.pdf

      function email(e){
        //true if valid email address, else false
        return (/^[가-힣0-9]+\.[가-힣0-9]+@([가-힣0-9]+\.)+한국$/.test(e));
      }
  23. regex in JavaScript

    twitter use regex in their JavaScript tweet text parsing code, e.g. validation of TLDs (Top Level Domains)
    see github.com/twitter/twitter-text/blob/master/js/twitter-text.js
    Note the use of the RegExp object constructor: a regex can be built from quoted strings, so it can be split over multiple lines, making for a more easily readable regex.
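
The idea of building a regex from quoted strings can be sketched with the ASCII email pattern from slide 21. This is not Twitter's code: the variable names are assumptions, and the only API used is the standard `RegExp(pattern, flags)` constructor.

```javascript
// Build the pattern from named string pieces, one per line.
// Inside a string literal every regex backslash has to be doubled.
const localPart = '[A-Z0-9]+\\.([A-Z]\\.)*[A-Z0-9]+';  // e.g. andre.schappo
const domainLabels = '([A-Z0-9]+\\.)+';                // e.g. example.
const topLevelDomain = '[A-Z]{2,}';                    // e.g. com

const emailRegex = new RegExp(
  '^' + localPart + '@' + domainLabels + topLevelDomain + '$',
  'i'  // case insensitive, as in the regex literal version
);

console.log(emailRegex.test('andre.schappo@example.com'));  // true
console.log(emailRegex.test('andre@example.com'));          // false: no dot in the local part
```

Naming the pieces documents what each part of the pattern is for, at the cost of the doubled backslashes.
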
  24. regex in JavaScript

    twitter use JavaScript regex to validate hashtags
    see stackoverflow.com/questions/8451846/actual-twitter-format-for-hashtags-not-your-regex-not-his-code-the-actual/
    Note: no SMP chars (U+10000-U+1FFFF), hence no emoji hashtags, yet it does include SIP chars (U+20000-U+2FFFF)
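
In current JavaScript, code points above U+FFFF can be written in a regex with the `\u{...}` escape, which requires the u flag. The sketch below tests the two planes mentioned in the note. It is not Twitter's hashtag code; the names `smp` and `sip` are illustrative assumptions.

```javascript
// Character classes for two supplementary planes; \u{...} needs the u flag.
const smp = /[\u{10000}-\u{1FFFF}]/u;  // Supplementary Multilingual Plane
const sip = /[\u{20000}-\u{2FFFF}]/u;  // Supplementary Ideographic Plane

console.log(smp.test('\u{1F600}'));  // true:  U+1F600 GRINNING FACE is in the SMP
console.log(sip.test('\u{1F600}'));  // false
console.log(sip.test('\u{20000}'));  // true:  U+20000 is a CJK Extension B ideograph
console.log(smp.test('A'));          // false: U+0041 is in the Basic Multilingual Plane
```
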
  25. regex in JavaScript

    RegEx Object exec() Method
    regex.exec(string): test for match in string, returns the match as an array (matched text at index 0), or null if there is no match
    see w3schools.com/jsref/jsref_regexp_exec.asp

      function cnum(s){
        //returns first chinese number in string
        return (/[一二三四五六七八九][零一二三四五六七八九]*/.exec(s));
      }
  26. regex in JavaScript

    String match() Method
    string.match(RegExp): returns all matches (as array) if g flag is used
    see w3schools.com/jsref/jsref_match.asp

      function cnum(s){
        return (s.match(/[一二三四五六七八九][零一二三四五六七八九]*/g));
      }

    cnum('小山一二三北京四五南京') returns ["一二三","四五"]
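
The difference between exec() and match() with the g flag can be seen side by side. In this sketch the names `firstChineseNumber` and `allChineseNumbers` are assumptions; the pattern and the sample string are the ones used on the slides.

```javascript
// exec(): first match only, as an array carrying the match position, or null.
function firstChineseNumber(s) {
  return /[一二三四五六七八九][零一二三四五六七八九]*/.exec(s);
}

// match() with the g flag: every match as a plain array of strings, or null.
function allChineseNumbers(s) {
  return s.match(/[一二三四五六七八九][零一二三四五六七八九]*/g);
}

const sample = '小山一二三北京四五南京';
const first = firstChineseNumber(sample);

console.log(first[0], first.index);       // 一二三 2
console.log(firstChineseNumber('北京'));   // null: no Chinese number in the string
console.log(allChineseNumbers(sample));   // [ '一二三', '四五' ]
```
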
  27. regex in JavaScript

    String replace() Method
    string.replace(regex,replacement): returns a string with all (if g flag is used) matches replaced with replacement
    a programming challenge ➜ jsfiddle.net/coas/wda45gLp
    see w3schools.com/jsref/jsref_replace.asp
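
A minimal sketch of replace() with and without the g flag, reusing the Chinese digits from the earlier slides. It is unrelated to the linked programming challenge; the sample string and the digit mapping are assumptions for illustration.

```javascript
const sample = '小山一二三北京四五南京';

// Without g only the first match is replaced.
const firstOnly = sample.replace(/[一二三四五六七八九]/, '#');

// With g every match is replaced.
const everyMatch = sample.replace(/[一二三四五六七八九]/g, '#');

// The replacement can also be a function: map each Chinese digit to an ASCII digit.
const chineseDigits = '零一二三四五六七八九';
const asAscii = sample.replace(/[零一二三四五六七八九]/g,
  (d) => String(chineseDigits.indexOf(d)));

console.log(firstOnly);   // 小山#二三北京四五南京
console.log(everyMatch);  // 小山###北京##南京
console.log(asAscii);     // 小山123北京45南京
```
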
  28. PCRE

    PCRE: Perl Compatible Regular Expressions
    PCRE has the aforementioned JavaScript RegExp constructs plus additional RegExp constructs to process Unicode character properties.
    We will look at the use of PCRE in PHP
    see pcre.org
  29. PHP PCRE

    PHP PCRE functions include:
      preg_match
      preg_replace
    see php.net/manual/en/book.pcre.php
  30. PHP PCRE

    Unicode character properties include the General Category
    Every Unicode character is assigned to a single General Category
      Lu: Letter uppercase
      Ll: Letter lowercase
      Lo: Letter other
      Nd: Number decimal
      Sm: Symbol mathematical
      Sc: Symbol currency
      ...
    see codepoints.net/search?gc=Sm
  31. PHP PCRE

    A Unicode character can belong to a human language Script
      Balinese
      Cyrillic
      Han
      Latin
      Mongolian
      Thai
      Tibetan
    see unicode.org/Public/UCD/latest/ucd/Scripts.txt
  32. PHP PCRE

    the property matching construct is
      \p{property}: match with char having property
      \P{property}: do not match with char having property
      \p{Lu}{3}: match with 3 uppercase letters
      \p{Sm}+: match with 1 or more maths symbols
      \p{Devanagari}: match with a single devanagari char (devanagari is used for writing Hindi)
    php.net/manual/en/regexp.reference.unicode.php
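
The same property constructs are also available in JavaScript (ES2018 and later) when the u flag is set, so they can be tried without PHP. This sketch swaps JavaScript in for PCRE. One syntax difference: JavaScript requires scripts to be written as `\p{Script=Devanagari}`, whereas PCRE accepts `\p{Devanagari}`. The sample strings are assumptions.

```javascript
// General Category properties work with the short names used on the slide.
console.log(/^\p{Lu}{3}$/u.test('ABC'));  // true:  3 uppercase Latin letters
console.log(/^\p{Lu}{3}$/u.test('АБВ'));  // true:  3 uppercase Cyrillic letters
console.log(/^\p{Lu}{3}$/u.test('AbC'));  // false: b is Ll, not Lu
console.log(/^\p{Sm}+$/u.test('+=≠'));    // true:  all three are maths symbols

// \P{...} is the negation: characters that do NOT have the property.
console.log(/^\P{Lu}+$/u.test('abc'));    // true
console.log(/^\P{Lu}+$/u.test('aBc'));    // false

// Scripts need the Script= prefix in JavaScript.
console.log(/^\p{Script=Devanagari}$/u.test('ह'));  // true
console.log(/^\p{Script=Devanagari}$/u.test('h'));  // false
```
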
  33. PHP PCRE

    use the u modifier flag, which directs preg to use Unicode UTF-8

      <?php
      $s="";
      echo preg_match("/^\p{Egyptian_Hieroglyphs}+$/u",$s)
      ?>
  34. PHP PCRE

    match Han (Chinese, Japanese, Korean - CJK)

      $s="小山";
      preg_match("/^\p{Han}+$/u",$s);

    add Latin, which includes the English alphabet
    see en.wikipedia.org/wiki/Latin_script_in_Unicode

      $s="André is 小山";
      preg_match("/^(\p{Han}|\p{Latin}|\s)+$/u",$s);

    add Common, which includes punctuation

      $s="Andréʼs adopted Chinese name is 小山!";
      preg_match("/^(\p{Han}|\p{Latin}|\p{Common})+$/u",$s);
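
For comparison, the three script checks can be sketched in JavaScript, swapped in here for PHP. JavaScript needs the `Script=` prefix and the u flag. The regex constant names are assumptions, and the é is written as the precomposed character U+00E9 so that it belongs to the Latin script.

```javascript
// JavaScript counterparts of the three PHP patterns above (names are hypothetical).
const hanOnly = /^\p{Script=Han}+$/u;
const hanLatinSpace = /^(\p{Script=Han}|\p{Script=Latin}|\s)+$/u;
const hanLatinCommon = /^(\p{Script=Han}|\p{Script=Latin}|\p{Script=Common})+$/u;

const shortName = '小山';
const phrase = 'Andr\u00E9 is 小山';
const sentence = "Andr\u00E9's adopted Chinese name is 小山!";

console.log(hanOnly.test(shortName));        // true
console.log(hanOnly.test(phrase));           // false: Latin letters and spaces present
console.log(hanLatinSpace.test(phrase));     // true
console.log(hanLatinSpace.test(sentence));   // false: ' and ! are neither letters nor white space
console.log(hanLatinCommon.test(sentence));  // true:  Common covers space, ' and !
```
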
  35. PHP PCRE

    match Number decimal (Nd)

      $s="123";
      preg_match("/^\p{Nd}+$/u",$s);

    and again, match Number decimal
    see codepoints.net/search?gc=Nd

      $s="123१२३๑๒๓᭑᭒᭓";
      preg_match("/^\p{Nd}+$/u",$s);
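
The Number decimal check behaves the same way in JavaScript, used here in place of PHP. The sketch also shows a point that is easy to miss: in JavaScript `\d` stays ASCII-only even with the u flag, so `\p{Nd}` is needed to accept digits from other scripts. The constant names are assumptions.

```javascript
const asciiDigits = '123';
const mixedDigits = '123१२३๑๒๓᭑᭒᭓';  // ASCII, Devanagari, Thai and Balinese digits

console.log(/^\p{Nd}+$/u.test(asciiDigits));  // true
console.log(/^\p{Nd}+$/u.test(mixedDigits));  // true:  every character has General Category Nd
console.log(/^[0-9]+$/.test(mixedDigits));    // false: only ASCII digits are in the range
console.log(/^\d+$/u.test(mixedDigits));      // false: \d means [0-9] in JavaScript
```
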
  36. PHP PCRE

    Let's visit sci-project with terminal, and revisit \X

    character sequences without U+200D ZERO WIDTH JOINER

      php70 -r 'echo preg_match("/^\X{1}$/u","H")."\n";' #returns 1(true)
      php70 -r 'echo preg_match("/^\X{1}$/u","9")."\n";' #returns 1(true)
      php70 -r 'echo preg_match("/^\X{1}$/u","☃")."\n";' #returns 1(true)

    character sequences with U+200D ZERO WIDTH JOINER(s)

      php70 -r 'echo preg_match("/^\X{1}$/u","F")."\n";' #returns 0(false)
      php70 -r 'echo preg_match("/^\X{1}$/u","J")."\n";' #returns 0(false)
  37. PCRE2

    PCRE2 successfully matches with all Grapheme Clusters
    pcre2grep uses PCRE2
    pcre2grep -u '^\X{1}$' successfully matches with H 9 ☃ F J
    schappo.blogspot.co.uk/2017/12/computer-science-internationalization_18.html
  38. PHP PCRE

    Also see:
    see regular-expressions.info/refunicode.html
    see regular-expressions.info/unicode.html
    see schappo.blogspot.co.uk/2015/12/unicode-regular-expressions.html
    see schappo.blogspot.co.uk/2016/08/internationalizing-regular-expressions.html
  39. The End
    Fini