Unicode Regular Expressions

1 Unicode Regular Expressions
André Schappo
weibo.com/andreschappo
twitter.com/andreschappo
schappo.blogspot.co.uk

2 Data Validation
Regular Expressions (also written as RegExp or regex)
  Regular Expressions are very useful for validating user input
  Regular Expressions can replace hundreds of lines of code
We will look at Regular Expression usage
  on Linux using a terminal app
  client side with JavaScript
  server side with PHP

3 regex meta characters
Meta Character   Function
.                match with any single character
*                preceding item matched 0 or more times
?                preceding item matched 0 or 1 times
+                preceding item matched 1 or more times
{n}              preceding item matched n times
{n,}             preceding item matched n or more times

4 regex meta characters
Meta Character   Function
{,m}             preceding item matched at most m times
{n,m}            preceding item matched n through m times
[...]            matched with any single character in brackets, e.g. [abcd] [小中大]
[^...]           matched with any single character not in brackets, e.g. [^༂༆༊]
[.-.]            matched with any single character in range, e.g. [0-9]
^                matched with start of string/line

5 regex meta characters
Meta Character   Function
$                matched with end of string/line
(...)            group
|                or
\                escape
\s               white space (SPACE, IDEOGRAPHIC SPACE, EM QUAD, EN SPACE, THREE-PER-EM SPACE ...)
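
The meta characters in the tables above can be tried out in JavaScript. This is a minimal sketch: the sample patterns and strings are made up for illustration, and because JavaScript does not treat {,m} as a quantifier, {0,m} is used in its place.

```javascript
// Each entry: [pattern, sample string, whether a match is expected].
// All samples are illustrative and not taken from the slides.
const checks = [
  [/^ab*c$/, 'ac', true],           // * : preceding item 0 or more times
  [/^ab?c$/, 'abbc', false],        // ? : preceding item 0 or 1 times
  [/^ab+c$/, 'abbc', true],         // + : preceding item 1 or more times
  [/^a{2}$/, 'aa', true],           // {n} : exactly n times
  [/^a{2,}$/, 'a', false],          // {n,} : n or more times
  [/^a{0,2}$/, 'aaa', false],       // {0,m} : at most m times
  [/^[小中大]$/, '中', true],        // [...] : any single char in brackets
  [/^[^0-9]$/, '7', false],         // [^...] : any single char not in brackets
  [/^(cat|dog)s?$/, 'dogs', true],  // (...) group and | or
  [/^a\.b$/, 'axb', false],         // \ escape: \. is a literal full stop
  [/^a\sb$/, 'a\u3000b', true],     // \s matches IDEOGRAPHIC SPACE (U+3000)
];

// Run every pattern against its sample string.
const results = checks.map(([re, s]) => re.test(s));
```

Comparing results[i] with the third element of checks[i] shows whether each pattern behaves as the table describes.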

6 egrep
grep is a Unix/Linux command for finding patterns in text data
grep provides basic regexp/regex meta characters
egrep extends the set of regexp/regex meta characters
I will use egrep interactively to demonstrate the use of regex on sci-project

7 egrep
connecting to sci-project
  OS X: use terminal, ssh login-id@sci-project
  Windows: use PuTTY or a similar terminal app
interactive usage from the command line
  egrep 'regex'
See man egrep

8 egrep - regex examples
match using regex '.*'
  string: absolutely anything ✔
match using regex 'a.*t'
  string: 72auytqw ✔
  string: qwerty ✘
match using regex '^a.*t$'
  string: azxcvbt ✔
  string: azxcvbtm ✘

9 egrep - regex examples
match using regex '^[a-zA-Z]$'
  string: B ✔
  string: Bb ✘
match using regex '^[a-zA-Z]+$'
  string: ahYgx ✔
  string: B8 ✘
match using regex '^[a-zA-Z]{5}$'
  string: mnOdm ✔
  string: Tda ✘

10 egrep - regex examples
match using regex '^0[0-9]{10}$'
  string: 01234567890 ✔
  string: 01234 567890 ✘
match using regex '0[0-9]{4}\s*[0-9]{6}$'
  string: 01234 567890 ✔
  string: 1234 567890 ✘
match using regex '^0([0-79]*8){3,}[0-79]*$'
  string: 01834867880 ✔
  string: 01234567890 ✘
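
The three phone number patterns above carry over to JavaScript unchanged. A minimal sketch, using the slide's own test strings; the constant names are hypothetical, and the second pattern is anchored with ^ here, which is an assumption since the slide shows it without a leading anchor.

```javascript
// 0 followed by exactly ten more digits, no spaces allowed.
const elevenDigits = /^0[0-9]{10}$/;

// 0 plus four digits, optional white space, then six digits.
// The leading ^ is an assumption; the slide's version omits it.
const spaced = /^0[0-9]{4}\s*[0-9]{6}$/;

// Starts with 0 and contains at least three 8s:
// [0-79] is any digit except 8, so each group ends at the next 8.
const threeEights = /^0([0-79]*8){3,}[0-79]*$/;

const phoneResults = {
  plain: elevenDigits.test('01234567890'),
  plainWithSpace: elevenDigits.test('01234 567890'),
  spacedOk: spaced.test('01234 567890'),
  spacedNoZero: spaced.test('1234 567890'),
  eightsOk: threeEights.test('01834867880'),
  eightsTooFew: threeEights.test('01234567890'),
};
```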

11 egrep - regex examples
match using regex '^[一二三四五六七八九]+$'
  string: 五七八 ✔
  string: 小山 ✘
match using regex '^人+鸭人+$'
  string: 人人人人人人人人人鸭人人人人人人人人人人人人 ✔
  string: 人人人人人鸡人人人人人人人人人人人人人人人 ✘
match using regex '^[]+[]+$'
  string: ✔
  string: ✘
See edition.cnn.com/travel/article/hong-kong-giant-duck

12 Nottingham Arboretum

13 egrep - regex examples
match using regex '^鸭+人鸭+$'
  string: 鸭鸭鸭鸭鸭人鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭 ✔
  string: 鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡人鸡鸡鸡鸡鸡鸡 ✘
match using regex '^+[]+$'
  string: ✔
  string: ✘

14 grep -P (PCRE)
PCRE: Perl Compatible Regular Expressions
Counting in Unicode code points
  grep -P '^.{2}$' will match flag emoji (two regional indicator code points)
  grep -P '^.{3}$' will match keycap digits (digit, variation selector, combining keycap)
  grep -P '^.{7}$' will match full family emoji (four people joined by three zero width joiners)
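
The same code point counts can be checked in JavaScript, where the u flag makes . match a whole code point rather than a UTF-16 code unit. A minimal sketch; the particular flag, keycap and family emoji below are assumptions chosen for illustration, written as escapes so they survive copying.

```javascript
// Sample emoji, written as code point escapes (illustrative choices).
const flag = '\u{1F1EC}\u{1F1E7}';   // REGIONAL INDICATOR G + B
const keycap = '1\u{FE0F}\u{20E3}';  // digit + VS16 + COMBINING ENCLOSING KEYCAP
const family =
  '\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}'; // 4 people, 3 ZWJ

// Spreading a string iterates by code point, not by UTF-16 code unit.
const codePoints = (s) => [...s].length;

const counts = {
  flag: codePoints(flag),
  keycap: codePoints(keycap),
  family: codePoints(family),
};

// With the u flag, . consumes one code point per match step.
const matches = {
  flag: /^.{2}$/u.test(flag),
  keycap: /^.{3}$/u.test(keycap),
  family: /^.{7}$/u.test(family),
  flagWithoutU: /^.{2}$/.test(flag), // without u, the flag is 4 code units
};
```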

15 regex online
modifiers
  g - global match
  i - case insensitive match
  u - unicode
regex101.com
On regex101, /\X/gu works for keycap digits but not for flag or family group emoji

16 regex in JavaScript
JavaScript RegExp Object
  /pattern/modifiers
  /ABC/
  /AbC/i
  /АБВ/ (Cyrillic)
  /АбВ/i (Cyrillic)

17 regex in JavaScript
RegExp Object Method
regex.test(string): test for a match in string, returns true or false
  function latin(s){
    return (/ABC/i.test(s));
  }
  function cyrillic(s){
    return (/^АБВ$/i.test(s));
  }
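
A short usage sketch for the two functions above. The Cyrillic letters are written as escapes (U+0410 to U+0412 for the capitals, U+0430 to U+0432 for the lower case letters) so they cannot be confused with the look-alike Latin letters; the sample strings are made up.

```javascript
function latin(s) {
  // Unanchored and case insensitive: true if s contains abc in any case.
  return /ABC/i.test(s);
}

function cyrillic(s) {
  // Anchored: s must be exactly the three Cyrillic letters, in any case.
  return /^\u0410\u0411\u0412$/i.test(s);
}

const testResults = {
  latinInside: latin('xxabcxx'),
  latinMissing: latin('xyz'),
  cyrillicLower: cyrillic('\u0430\u0431\u0432'),
  cyrillicExtra: cyrillic('x\u0430\u0431\u0432'),
  cyrillicGivenLatin: cyrillic('ABC'), // Latin ABC is not Cyrillic АБВ
};
```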

18 regex in JavaScript
You have already seen the ASCII letter function
  function letter(c){
    //returns true if c is ascii letter, else false
    return ((c>='a'&&c<='z')||(c>='A'&&c<='Z'));
  }
Letʼs rewrite the function using regex
  function letter(c){
    //returns true if c is ascii letter, else false
    return (/[a-zA-Z]/.test(c)); //or /[A-Z]/i.test(c)
  }
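
The regex version above is unanchored, so it returns true for any string that contains an ASCII letter anywhere. If the argument might be longer than one character, anchoring the pattern is one way to tighten it. A minimal sketch; the name isAsciiLetter is hypothetical.

```javascript
function isAsciiLetter(c) {
  // ^ and $ require the whole string to be exactly one ASCII letter.
  return /^[a-zA-Z]$/.test(c);
}

const letterResults = {
  lower: isAsciiLetter('q'),
  upper: isAsciiLetter('Q'),
  digit: isAsciiLetter('7'),
  twoLetters: isAsciiLetter('ab'),
  accented: isAsciiLetter('\u00E9'),    // é is a letter, but not ASCII
  unanchored: /[a-zA-Z]/.test('1a2'),   // unanchored pattern, for comparison
};
```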

19 regex in JavaScript
Letʼs build a function to validate a Thai private car registration number entered into a form by a user.
  function car(reg){
    //true if valid reg number, else false
    return (/^[1-9]?[กขงจฉธฐพภวษศสชฌฎญฆ]{2}[ 　]*[1-9][0-9]{0,3}$/.test(reg));
  }

20 regex in JavaScript
function to validate a Javanese decimal digit
  function javaneseDigit(d){
    //true if javanese digit, else false
    return (/[꧐-꧙]/.test(d));
  }

21 regex in JavaScript
function to validate an ascii email address
  function email(e){
    //true if valid ascii email address, else false
    return (/^[A-Z0-9]+\.([A-Z]\.)*[A-Z0-9]+@([A-Z0-9]+\.)+[A-Z]{2,}$/i.test(e));
  }
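
A usage sketch for the email function above shows what the pattern accepts. Note that it requires at least one full stop in the part before the @, so an address without one is rejected. The addresses are made up and use the reserved example.com domain.

```javascript
function email(e) {
  // local part: name, full stop, optional single-letter initials, name
  // domain: one or more dot-terminated labels, then a TLD of 2+ letters
  return /^[A-Z0-9]+\.([A-Z]\.)*[A-Z0-9]+@([A-Z0-9]+\.)+[A-Z]{2,}$/i.test(e);
}

const emailResults = {
  firstLast: email('andre.schappo@example.com'),
  withInitial: email('a.b.c@example.co.uk'),
  noDotInLocalPart: email('andre@example.com'),
  noTld: email('first.last@example'),
};
```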

22 regex in JavaScript
function to validate a Korean email address
  function email(e){
    //true if valid email address, else false
    return (/^[가-힣0-9]+\.[가-힣0-9]+@([가-힣0-9]+\.)+한국$/.test(e));
  }
Evaluation of Websites for Acceptance of Email Addresses
See usag.tech/wp-content/uploads/2017/09/UASG-Report-UASG017.pdf

23 regex in JavaScript
twitter use regex in their JavaScript tweet text parsing code, e.g. for validating TLDs (Top Level Domains)
See github.com/twitter/twitter-text/blob/master/js/twitter-text.js
Note the use of the RegExp object constructor
  regex can be built as quoted strings
  consequently a regex can be split over multiple lines, making it more easily readable
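
Building a regex from quoted strings with the RegExp constructor might look like the sketch below. The TLD list and the pattern are hypothetical and are not taken from twitter-text; they only illustrate splitting a pattern over several commented lines. Backslashes must be doubled inside string literals.

```javascript
// Hypothetical TLD list, for illustration only.
const tlds = ['com', 'org', 'uk', '中国'];

const domainPattern = new RegExp(
  '^' +
  '([a-z0-9]+\\.)+' +            // one or more dot-terminated labels
  '(' + tlds.join('|') + ')' +   // followed by one of the listed TLDs
  '$',
  'i'                            // modifiers go in the second argument
);

const domainResults = {
  com: domainPattern.test('example.com'),
  coUk: domainPattern.test('schappo.blogspot.co.uk'),
  idn: domainPattern.test('example.中国'),
  unlisted: domainPattern.test('example.xyz'),
};
```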

24 regex in JavaScript
twitter use JavaScript regex to validate hashtags
See stackoverflow.com/questions/8451846/actual-twitter-format-for-hashtags-not-your-regex-not-his-code-the-actual/
Note: no SMP chars (U+10000 to U+1FFFF), hence no emoji hashtags, and yet it does include SIP chars (U+20000 to U+2FFFF)

25 regex in JavaScript
RegExp Object exec() Method
regex.exec(string): test for a match in string, returns the match as an array (matched text at index 0), or null if there is no match
See w3schools.com/jsref/jsref_regexp_exec.asp
  function cnum(s){
    //returns match array for first chinese number in string, or null
    return (/[一二三四五六七八九][零一二三四五六七八九]*/.exec(s));
  }
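
Because exec() returns an array-like match object rather than a plain string, callers usually unpack it. A minimal sketch, assuming the caller wants the matched text and its position; the name firstChineseNumber and the sample strings are hypothetical.

```javascript
// A non-zero Chinese digit followed by any number of Chinese digits.
const chineseNumber = /[一二三四五六七八九][零一二三四五六七八九]*/;

function firstChineseNumber(s) {
  const m = chineseNumber.exec(s);
  // exec() returns null when nothing matches; otherwise m[0] is the
  // matched text and m.index is where the match starts in s.
  return m === null ? null : { text: m[0], index: m.index };
}

const found = firstChineseNumber('小山一二三北京四五南京');
const notFound = firstChineseNumber('no numbers here');
```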

26 regex in JavaScript
String match() Method
string.match(RegExp): returns all matches (as an array) if the g flag is used
See w3schools.com/jsref/jsref_match.asp
  function cnum(s){
    return (s.match(/[一二三四五六七八九][零一二三四五六七八九]*/g));
  }
  cnum('小山一二三北京四五南京') returns ["一二三","四五"]

27 regex in JavaScript
String replace() Method
string.replace(regex,replacement): returns a string with all (if the g flag is used) matches replaced with replacement
a programming challenge ➜ jsfiddle.net/coas/wda45gLp
See w3schools.com/jsref/jsref_replace.asp
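
The effect of the g flag on replace() can be seen in a short sketch. The strings are made up for illustration; the last line also shows that groups captured with (...) can be referred to as $1, $2, ... in the replacement.

```javascript
const text = 'a-b-c';

// Without g only the first match is replaced.
const firstOnly = text.replace(/-/, '+');

// With g every match is replaced.
const all = text.replace(/-/g, '+');

// Captured groups can be reordered in the replacement string.
const reordered = '2017-12-18'.replace(/(\d+)-(\d+)-(\d+)/, '$3/$2/$1');
```

replace() never modifies the original string, so text is still 'a-b-c' afterwards.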

28 PCRE
PCRE: Perl Compatible Regular Expressions
PCRE has the aforementioned JavaScript RegExp constructs plus additional RegExp constructs to process Unicode character properties
We will look at the use of PCRE in PHP
See pcre.org

29 PHP PCRE
PHP PCRE functions include
  preg_match
  preg_replace
See php.net/manual/en/book.pcre.php

30 PHP PCRE
Unicode character properties include the General Category
Every Unicode character is assigned to a single General Category
  Lu   Letter uppercase
  Ll   Letter lowercase
  Lo   Letter other
  Nd   Number decimal
  Sm   Symbol mathematical
  Sc   Symbol currency
  ...
See codepoints.net/search?gc=Sm

31 PHP PCRE
A Unicode character can belong to a human language Script
  Balinese
  Cyrillic
  Han
  Latin
  Mongolian
  Thai
  Tibetan
See unicode.org/Public/UCD/latest/ucd/Scripts.txt

32 PHP PCRE
the property matching construct is
  \p{property}     match with a char having the property
  \P{property}     match with a char not having the property
  \p{Lu}{3}        match with 3 uppercase letters
  \p{Sm}+          match with 1 or more maths symbols
  \p{Devanagari}   match with a single Devanagari char (Devanagari is used for writing Hindi)
See php.net/manual/en/regexp.reference.unicode.php
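
Recent JavaScript engines (ES2018 onwards) also support \p{...} when the u flag is given, with one difference from PCRE: a script has to be written as \p{Script=Devanagari} rather than \p{Devanagari}. A minimal sketch with made-up sample strings, written as escapes where the characters are easy to confuse.

```javascript
const threeUpper = /^\p{Lu}{3}$/u;               // 3 uppercase letters
const maths = /^\p{Sm}+$/u;                      // 1 or more maths symbols
const devanagari = /^\p{Script=Devanagari}$/u;   // a single Devanagari char
const notUpper = /^\P{Lu}$/u;                    // a char that is not uppercase

const propertyResults = {
  // Latin À, Cyrillic Б and Greek Ω are all General Category Lu.
  mixedUpper: threeUpper.test('\u00C0\u0411\u03A9'),
  lowerCase: threeUpper.test('abc'),
  plusLessEquals: maths.test('+<='),
  hyphen: maths.test('+-'),                      // HYPHEN-MINUS is Pd, not Sm
  ka: devanagari.test('\u0915'),                 // DEVANAGARI LETTER KA
  latinA: devanagari.test('a'),
  negatedOnLower: notUpper.test('a'),
  negatedOnUpper: notUpper.test('A'),
};
```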

33 PHP PCRE
use the u modifier flag, which directs preg to use Unicode UTF-8
  <?php
  $s="";
  echo preg_match("/^\p{Egyptian_Hieroglyphs}+$/u",$s)
  ?>

34 PHP PCRE
match Han (Chinese, Japanese, Korean - CJK)
  $s="小山";
  preg_match("/^\p{Han}+$/u",$s);
add Latin, which includes the English alphabet
  $s="André is 小山";
  preg_match("/^(\p{Han}|\p{Latin}|\s)+$/u",$s);
add Common, which includes punctuation
  $s="Andréʼs adopted Chinese name is 小山!";
  preg_match("/^(\p{Han}|\p{Latin}|\p{Common})+$/u",$s);
See en.wikipedia.org/wiki/Latin_script_in_Unicode

35 PHP PCRE
match Number decimal (Nd)
  $s="123";
  preg_match("/^\p{Nd}+$/u",$s);
and again, match Number decimal
  $s="123१२३๑๒๓᭑᭒᭓";
  preg_match("/^\p{Nd}+$/u",$s);
See codepoints.net/search?gc=Nd
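
The same Nd check can be sketched in JavaScript with the u flag. The digits are written as escapes (Devanagari U+0967 to U+0969, Thai U+0E51 to U+0E53); the non-matching samples are assumptions added to show that Nd covers decimal digits only, not every numeric character.

```javascript
const decimalDigits = /^\p{Nd}+$/u;

const ndResults = {
  ascii: decimalDigits.test('123'),
  // ASCII, Devanagari and Thai digits mixed in one string.
  mixed: decimalDigits.test('123\u0967\u0968\u0969\u0E51\u0E52\u0E53'),
  withLetter: decimalDigits.test('12a'),
  romanNumeral: decimalDigits.test('\u2167'),  // Ⅷ is Nl (Number letter)
  fraction: decimalDigits.test('\u00BD'),      // ½ is No (Number other)
};
```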

36 PHP PCRE
Letʼs visit sci-project with terminal, and revisit \X
character sequences without U+200D ZERO WIDTH JOINER
  php70 -r 'echo preg_match("/^\X{1}$/u","H")."\n";' #returns 1(true)
  php70 -r 'echo preg_match("/^\X{1}$/u","9")."\n";' #returns 1(true)
  php70 -r 'echo preg_match("/^\X{1}$/u","☃")."\n";' #returns 1(true)
character sequences with U+200D ZERO WIDTH JOINER(s)
  php70 -r 'echo preg_match("/^\X{1}$/u","F")."\n";' #returns 0(false)
  php70 -r 'echo preg_match("/^\X{1}$/u","J")."\n";' #returns 0(false)

37 PCRE2
PCRE2 successfully matches with all Grapheme Clusters
pcre2grep uses PCRE2
  pcre2grep -u '^\X{1}$'
successfully matches with H 9 ☃ F J
See schappo.blogspot.co.uk/2017/12/computer-science-internationalization_18.html
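
JavaScript regex has no \X, but grapheme clusters can be counted with Intl.Segmenter where the runtime provides it (an assumption about the environment; older engines lack it). A minimal sketch; graphemeCount is a hypothetical helper and the sample emoji are illustrative choices written as escapes.

```javascript
function graphemeCount(s) {
  // Segment by grapheme cluster using the runtime's default locale.
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return [...segmenter.segment(s)].length;
}

// Sample sequences without and with U+200D ZERO WIDTH JOINER.
const samples = {
  letter: 'H',
  flag: '\u{1F1EC}\u{1F1E7}',
  keycap: '1\u{FE0F}\u{20E3}',
  family: '\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}',
  twoLetters: 'ab',
};

const clusterCounts = Object.fromEntries(
  Object.entries(samples).map(([name, s]) => [name, graphemeCount(s)])
);
```

A string is a single grapheme cluster, the equivalent of matching '^\X{1}$', when its count is 1.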

38 PHP PCRE
Also see
  regular-expressions.info/refunicode.html
  regular-expressions.info/unicode.html
  schappo.blogspot.co.uk/2015/12/unicode-regular-expressions.html
  schappo.blogspot.co.uk/2016/08/internationalizing-regular-expressions.html

39 The End Fini
