Slide 1

Slide 1 text

Unicode Regular Expressions
André Schappo
weibo.com/andreschappo
twitter.com/andreschappo
schappo.blogspot.co.uk

Slide 2

Slide 2 text

Data Validation
Regular expressions (also written RegExp or regex)
  Regular expressions are very useful for validating user input
  Regular expressions can replace hundreds of lines of code
We will look at regular expression usage
  on Linux, using a terminal app
  client side, with JavaScript
  server side, with PHP

Slide 3

Slide 3 text

regex meta characters
Meta character   Function
.                matches any single character
*                preceding item matched 0 or more times
?                preceding item matched 0 or 1 times
+                preceding item matched 1 or more times
{n}              preceding item matched exactly n times
{n,}             preceding item matched n or more times

Slide 4

Slide 4 text

regex meta characters
Meta character   Function
{,m}             preceding item matched at most m times
{n,m}            preceding item matched n through m times
[...]            matches any single character in the brackets, e.g. [abcd] [小中大]
[^...]           matches any single character not in the brackets, e.g. [^༂༆༊]
[.-.]            matches any single character in the range, e.g. [0-9]
^                matches the start of the string/line

Slide 5

Slide 5 text

regex meta characters
Meta character   Function
$                matches the end of the string/line
(...)            group
|                or
\                escape
\s               white space (SPACE, IDEOGRAPHIC SPACE, EM QUAD, EN SPACE, THREE-PER-EM SPACE ...)
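The meta characters listed on the last three slides can be tried out in JavaScript, which is used later in the deck. This is a minimal sketch; the patterns and input strings are made up for illustration and do not come from the slides. One difference to be aware of: JavaScript does not treat {,m} as a quantifier, so {0,m} is the safer spelling there.

```javascript
// Each row: [pattern, input, expected result of pattern.test(input)].
// All patterns and inputs here are illustrative assumptions.
const metaCases = [
  [/^a.c$/, "abc", true],            // . matches any single character
  [/^ab*c$/, "ac", true],            // * allows zero occurrences of b
  [/^ab+c$/, "ac", false],           // + needs at least one b
  [/^ab?c$/, "abbc", false],         // ? allows at most one b
  [/^[0-9]{2,3}$/, "1234", false],   // {n,m} bounds the repeat count
  [/^[^0-9]+$/, "abc", true],        // [^...] is a negated class
  [/^(cat|dog)s$/, "dogs", true],    // grouping with alternation
  [/^a\.c$/, "abc", false],          // \ escapes the dot
  [/^a\sb$/, "a\u3000b", true],      // \s matches IDEOGRAPHIC SPACE (U+3000)
];

// Returns true for a row when the regex behaves as the row claims.
function checkMetaCases(rows) {
  return rows.map(([re, input, expected]) => re.test(input) === expected);
}

console.log(checkMetaCases(metaCases));
```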

Slide 6

Slide 6 text

egrep
grep is a Unix/Linux command for finding patterns in text data
grep provides the basic regex meta characters
egrep extends the set of regex meta characters
I will use egrep interactively on sci-project to demonstrate the use of regex

Slide 7

Slide 7 text

egrep
connecting to sci-project
  OSX: use Terminal, ssh login-id@sci-project
  Windows: use PuTTY or a similar terminal app
interactive usage from the command line
  egrep 'regex'
見 man egrep

Slide 8

Slide 8 text

egrep - regex examples
match using regex '.*'
  string: absolutely anything ✔
match using regex 'a.*t'
  string: 72auytqw ✔
  string: qwerty ✘
match using regex '^a.*t$'
  string: azxcvbt ✔
  string: azxcvbtm ✘

Slide 9

Slide 9 text

egrep - regex examples
match using regex '^[a-zA-Z]$'
  string: B ✔
  string: Bb ✘
match using regex '^[a-zA-Z]+$'
  string: ahYgx ✔
  string: B8 ✘
match using regex '^[a-zA-Z]{5}$'
  string: mnOdm ✔
  string: Tda ✘

Slide 10

Slide 10 text

egrep - regex examples
match using regex '^0[0-9]{10}$'
  string: 01234567890 ✔
  string: 01234 567890 ✘
match using regex '0[0-9]{4}\s*[0-9]{6}$'
  string: 01234 567890 ✔
  string: 1234 567890 ✘
match using regex '^0([0-79]*8){3,}[0-79]*$'
  string: 01834867880 ✔
  string: 01234567890 ✘
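The egrep examples on the last three slides use only constructs that JavaScript also understands, so they can be re-checked without a terminal. The sketch below is not how the slides run them (the slides use egrep on sci-project); it simply feeds a selection of the same patterns and strings to the JavaScript RegExp constructor. The helper name checkExamples is an assumption.

```javascript
// [pattern, string, expected match] taken from the egrep slides.
const egrepExamples = [
  ["a.*t", "72auytqw", true],
  ["a.*t", "qwerty", false],
  ["^a.*t$", "azxcvbt", true],
  ["^a.*t$", "azxcvbtm", false],
  ["^[a-zA-Z]{5}$", "mnOdm", true],
  ["^[a-zA-Z]{5}$", "Tda", false],
  ["^0[0-9]{10}$", "01234567890", true],
  ["^0[0-9]{10}$", "01234 567890", false],
  ["^0([0-79]*8){3,}[0-79]*$", "01834867880", true],  // at least three 8s
  ["^0([0-79]*8){3,}[0-79]*$", "01234567890", false], // only one 8
];

// Hypothetical helper: true for each row whose result agrees with the slide.
function checkExamples(rows) {
  return rows.map(([pattern, s, expected]) => new RegExp(pattern).test(s) === expected);
}

console.log(checkExamples(egrepExamples));
```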

Slide 11

Slide 11 text

egrep - regex examples
match using regex '^[一二三四五六七八九]+$'
  string: 五七八 ✔
  string: 小山 ✘
match using regex '^人+鸭人+$'
  string: 人人人人人人人人人鸭人人人人人人人人人人人人 ✔
  string: 人人人人人鸡人人人人人人人人人人人人人人人 ✘
match using regex '^[]+[]+$'
  string: ✔
  string: ✘
見 edition.cnn.com/travel/article/hong-kong-giant-duck
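Character classes and quantifiers work the same way for Han characters as for ASCII letters. As a hedged illustration, the two Chinese patterns from this slide can be run in JavaScript; the shortened test strings are assumptions made to keep the example small.

```javascript
// Chinese numerals one to nine, one or more times, and nothing else.
const numeralsOnly = /^[一二三四五六七八九]+$/;

// One duck (鸭) with at least one person (人) on each side.
const duckAmongPeople = /^人+鸭人+$/;

console.log(numeralsOnly.test("五七八"));          // numerals only
console.log(numeralsOnly.test("小山"));            // not numerals
console.log(duckAmongPeople.test("人人人鸭人人")); // duck in the crowd
console.log(duckAmongPeople.test("人人人鸡人人")); // a chicken (鸡), not a duck
```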

Slide 12

Slide 12 text

Nottingham Arboretum

Slide 13

Slide 13 text

egrep - regex examples
match using regex '^鸭+人鸭+$'
  string: 鸭鸭鸭鸭鸭人鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭 ✔
  string: 鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡人鸡鸡鸡鸡鸡鸡 ✘
match using regex '^+[]+$'
  string: ✔
  string: ✘

Slide 14

Slide 14 text

grep -P (PCRE)
PCRE: Perl Compatible Regular Expressions
Counting in Unicode codepoints
  grep -P '^.{2}$' will match flag emoji (two regional indicator codepoints)
  grep -P '^.{3}$' will match keycap digit emoji (digit, U+FE0F VARIATION SELECTOR-16, U+20E3 COMBINING ENCLOSING KEYCAP)
  grep -P '^.{7}$' will match full family emoji (four people joined by three U+200D ZERO WIDTH JOINERs)
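The same codepoint counts can be reproduced in JavaScript, as long as the u flag is used so that . matches a whole codepoint rather than a UTF-16 code unit. This is a sketch only: the particular flag, keycap and family sequences are assumptions chosen for illustration, written as escapes so that the codepoints are visible.

```javascript
// Illustrative sequences (assumptions, not taken from the slide).
const flag = "\u{1F1EC}\u{1F1E7}";  // two regional indicator symbols
const keycap = "1\uFE0F\u20E3";     // digit + variation selector-16 + enclosing keycap
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"; // 4 people, 3 ZWJ

// Array.from iterates a string by codepoint, not by UTF-16 code unit.
function codePointCount(s) {
  return Array.from(s).length;
}

console.log(codePointCount(flag), codePointCount(keycap), codePointCount(family));
console.log(/^.{2}$/u.test(flag));   // counts codepoints because of the u flag
console.log(/^.{2}$/.test(flag));    // without u, the flag is four code units
```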

Slide 15

Slide 15 text

regex online
modifiers
  g - global match
  i - case insensitive match
  u - unicode
regex101.com
On regex101, /\X/gu works for keycap digits but not for flag or family group emoji

Slide 16

Slide 16 text

regex in JavaScript
JavaScript RegExp object: /pattern/modifiers
  /ABC/
  /AbC/i
  /АБВ/ (Cyrillic)
  /АбВ/i (Cyrillic)

Slide 17

Slide 17 text

regex in JavaScript
RegExp object method
regex.test(string): tests for a match in string, returns true or false
function latin(s){
  return (/ABC/i.test(s));
}
function cyrillic(s){
  return (/^АБВ$/i.test(s));
}
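A short usage sketch for the two functions shows what the anchors and the i modifier change. The Cyrillic letters are written as \u escapes here purely so that they cannot be confused with the look-alike Latin letters; the input strings are assumptions.

```javascript
function latin(s) {
  return /ABC/i.test(s);                  // unanchored: matches anywhere in s
}
function cyrillic(s) {
  return /^\u0410\u0411\u0412$/i.test(s); // anchored: s must be exactly АБВ, any case
}

console.log(latin("xxabcxx"));               // true, abc found inside the string
console.log(cyrillic("\u0430\u0431\u0432")); // true, lowercase абв matches with i
console.log(cyrillic("x\u0410\u0411\u0412")); // false, anchors reject the extra x
console.log(latin("\u0410\u0411\u0412"));    // false, Cyrillic АБВ is not Latin ABC
```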

Slide 18

Slide 18 text

regex in JavaScript
You have already seen the ASCII letter function
function letter(c){
  //returns true if c is ascii letter, else false
  return ((c>='a'&&c<='z')||(c>='A'&&c<='Z'));
}
Let's rewrite the function using regex
function letter(c){
  //returns true if c is ascii letter, else false
  return (/[a-zA-Z]/.test(c)); //or /[A-Z]/i.test(c)
}
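The regex version behaves as described when c is a single character. If a longer string might be passed in, the unanchored pattern returns true as soon as any letter is found. A possible stricter variant, offered as an assumption rather than as part of the slide, anchors the pattern; the names letterLoose and letterStrict are hypothetical.

```javascript
// The slide's regex: true if the string contains an ASCII letter anywhere.
function letterLoose(c) {
  return /[a-zA-Z]/.test(c);
}

// Hypothetical stricter variant: true only for exactly one ASCII letter.
function letterStrict(c) {
  return /^[a-zA-Z]$/.test(c);
}

console.log(letterLoose("x"), letterStrict("x"));     // both accept a single letter
console.log(letterLoose("7x7"), letterStrict("7x7")); // only the loose version accepts
```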

Slide 19

Slide 19 text

regex in JavaScript
Let's build a function to validate a Thai private car registration number entered into a form by a user.
function car(reg){
  //true if valid reg number, else false
  return (/^[1-9]?[กขงจฉธฐพภวษศสชฌฎญฆ]{2}[       ]*[1-9][0-9]{0,3}$/.test(reg));
}

Slide 20

Slide 20 text

regex in JavaScript
function to validate a Javanese decimal digit
function javaneseDigit(d){
  //true if javanese digit, else false
  return (/[꧐-꧙]/.test(d));
}

Slide 21

Slide 21 text

regex in JavaScript
function to validate an ASCII email address
function email(e){
  //true if valid ascii email address, else false
  return (/^[A-Z0-9]+\.([A-Z]\.)*[A-Z0-9]+@([A-Z0-9]+\.)+[A-Z]{2,}$/i.test(e));
}
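Running the function on a few addresses makes the shape of the pattern clearer: the local part must contain a dot (with optional single-letter initials in between), and the domain needs at least one dot before a top level domain of two or more letters. The sample addresses are made up for illustration.

```javascript
function email(e) {
  // true if e fits the slide's ASCII email pattern, else false
  return /^[A-Z0-9]+\.([A-Z]\.)*[A-Z0-9]+@([A-Z0-9]+\.)+[A-Z]{2,}$/i.test(e);
}

// Illustrative inputs (assumptions).
const emailSamples = [
  "andre.schappo@example.com", // first.last form
  "a.b.c@example.co.uk",       // middle initial, two domain labels
  "andre@example.com",         // no dot in the local part
  "andre.schappo@example",     // no top level domain
];

console.log(emailSamples.map(email));
```

The third sample shows that this pattern is deliberately narrower than email in general, since many real addresses have no dot before the @.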

Slide 22

Slide 22 text

regex in JavaScript
function to validate a Korean email address
function email(e){
  //true if valid email address, else false
  return (/^[가-힣0-9]+\.[가-힣0-9]+@([가-힣0-9]+\.)+한국$/.test(e));
}
Evaluation of Websites for Acceptance of Email Addresses
見 usag.tech/wp-content/uploads/2017/09/UASG-Report-UASG017.pdf

Slide 23

Slide 23 text

regex in JavaScript
Twitter uses regex in its JavaScript tweet text parsing code, e.g. for validating TLDs (Top Level Domains)
見 github.com/twitter/twitter-text/blob/master/js/twitter-text.js
Note the use of the RegExp object constructor
  a regex can be built from quoted strings
  consequently a regex can be split over multiple lines, making it easier to read
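A minimal sketch of the idea follows. It is not Twitter's code: the TLD list and the names SAMPLE_TLDS and tldPattern are assumptions for illustration. Note that backslashes have to be doubled inside quoted strings passed to the RegExp constructor.

```javascript
// Hypothetical, heavily shortened TLD list (the real list is far longer).
const SAMPLE_TLDS = ["com", "org", "net", "uk"];

// The pattern is assembled from quoted strings, one piece per line.
const tldPattern = new RegExp(
  "\\." +                              // a literal dot before the TLD
  "(" + SAMPLE_TLDS.join("|") + ")" +  // any one of the listed TLDs
  "$",                                 // at the end of the string
  "i"                                  // case insensitive
);

console.log(tldPattern.test("example.COM"));
console.log(tldPattern.test("example.co.uk"));
console.log(tldPattern.test("example.xyz"));
```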

Slide 24

Slide 24 text

regex in JavaScript
Twitter uses JavaScript regex to validate hashtags
見 stackoverflow.com/questions/8451846/actual-twitter-format-for-hashtags-not-your-regex-not-his-code-the-actual/
Note: the regex has no SMP characters (U+10000-U+1FFFF), hence no emoji hashtags, yet it does include SIP characters (U+20000-U+2FFFF)
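For reference, characters beyond the Basic Multilingual Plane can be tested in modern JavaScript with the u flag and \u{...} escapes. This sketch is not taken from Twitter's hashtag code; the pattern names are assumptions.

```javascript
// SMP: Supplementary Multilingual Plane, where most emoji live.
const smpChar = /^[\u{10000}-\u{1FFFF}]$/u;
// SIP: Supplementary Ideographic Plane, rarer Han characters.
const sipChar = /^[\u{20000}-\u{2FFFF}]$/u;

const grinningFace = "\u{1F600}"; // an emoji in the SMP
const rareHan = "\u{20000}";      // first codepoint of the SIP

console.log(smpChar.test(grinningFace), sipChar.test(grinningFace));
console.log(smpChar.test(rareHan), sipChar.test(rareHan));
console.log(/^.$/.test(grinningFace)); // false: without u, . sees two code units
```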

Slide 25

Slide 25 text

regex in JavaScript
RegExp object exec() method
regex.exec(string): tests for a match in string, returns a result array whose first element is the matched text, or null if there is no match
見 w3schools.com/jsref/jsref_regexp_exec.asp
function cnum(s){
  //returns first chinese number in string
  return (/[一二三四五六七八九][零一二三四五六七八九]*/.exec(s));
}
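A short usage sketch shows what exec() hands back. The input strings are assumptions; the function is the one from the slide.

```javascript
function cnum(s) {
  // result array for the first Chinese number in s, or null if none
  return /[一二三四五六七八九][零一二三四五六七八九]*/.exec(s);
}

const hit = cnum("小山一二三北京四五");
console.log(hit[0]);       // the matched text
console.log(hit.index);    // where the match starts in the input
console.log(cnum("北京")); // null when nothing matches
```

Because exec() can return null, callers usually check the result before reading hit[0].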

Slide 26

Slide 26 text

regex in JavaScript
String match() method
string.match(RegExp): returns all matches (as an array) if the g flag is used
見 w3schools.com/jsref/jsref_match.asp
function cnum(s){
  return (s.match(/[一二三四五六七八九][零一二三四五六七八九]*/g));
}
cnum('小山一二三北京四五南京') returns ["一二三","四五"]

Slide 27

Slide 27 text

regex in JavaScript
String replace() method
string.replace(regex,replacement): returns a string with the first match replaced with replacement, or all matches if the g flag is used
見 w3schools.com/jsref/jsref_replace.asp
a programming challenge ➜ jsfiddle.net/coas/wda45gLp
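As an illustration of replace() with and without the g flag, the sketch below converts Chinese digits to ASCII digits one character at a time. It is not a solution to the linked challenge; the names CHINESE_DIGITS and toAsciiDigits are assumptions, and the replacement is given as a function, which replace() also accepts.

```javascript
// Position in this string is the digit's value: 零 is 0, 一 is 1, and so on.
const CHINESE_DIGITS = "零一二三四五六七八九";

// Hypothetical helper: swaps each Chinese digit for its ASCII digit.
function toAsciiDigits(s) {
  return s.replace(/[零一二三四五六七八九]/g, (ch) => String(CHINESE_DIGITS.indexOf(ch)));
}

console.log(toAsciiDigits("小山一二三北京四五")); // every digit replaced (g flag)
console.log("一二".replace(/[一二]/, "X"));       // no g flag: only the first match
```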

Slide 28

Slide 28 text

PCRE
PCRE: Perl Compatible Regular Expressions
PCRE has the aforementioned JavaScript RegExp constructs plus additional constructs for processing Unicode character properties
We will look at the use of PCRE in PHP
見 pcre.org

Slide 29

Slide 29 text

PHP PCRE
PHP PCRE functions include
  preg_match
  preg_replace
見 php.net/manual/en/book.pcre.php

Slide 30

Slide 30 text

PHP PCRE
Unicode character properties include the General Category
Every Unicode character is assigned to a single General Category
  Lu - Letter uppercase
  Ll - Letter lowercase
  Lo - Letter other
  Nd - Number decimal
  Sm - Symbol mathematical
  Sc - Symbol currency
  ...
見 codepoints.net/search?gc=Sm

Slide 31

Slide 31 text

PHP PCRE
A Unicode character can belong to a human language Script
  Balinese
  Cyrillic
  Han
  Latin
  Mongolian
  Thai
  Tibetan
見 unicode.org/Public/UCD/latest/ucd/Scripts.txt

Slide 32

Slide 32 text

PHP PCRE
the property matching construct is
  \p{property} - matches a character that has the property
  \P{property} - matches a character that does not have the property
examples
  \p{Lu}{3} - matches 3 uppercase letters
  \p{Sm}+ - matches 1 or more maths symbols
  \p{Devanagari} - matches a single Devanagari character (Devanagari is used for writing Hindi)
見 php.net/manual/en/regexp.reference.unicode.php
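The slides use PHP for property matching. For comparison with the earlier client-side examples, JavaScript (ES2018 and later) offers similar property escapes when the u flag is set, although scripts are written as \p{Script=Name} rather than \p{Name}. The following is a sketch with made-up inputs, not the PHP shown on the slides.

```javascript
const threeUpper = /^\p{Lu}{3}$/u;               // exactly 3 uppercase letters
const mathSymbols = /^\p{Sm}+$/u;                // 1 or more maths symbols
const devanagari = /^\p{Script=Devanagari}$/u;   // a single Devanagari character
const noUpper = /^\P{Lu}+$/u;                    // only characters that are NOT uppercase letters

console.log(threeUpper.test("ABC"), threeUpper.test("\u0410\u0411\u0412")); // Latin and Cyrillic
console.log(threeUpper.test("AbC"));
console.log(mathSymbols.test("+<="), mathSymbols.test("-")); // hyphen-minus is punctuation
console.log(devanagari.test("\u0915"), devanagari.test("k")); // U+0915 DEVANAGARI LETTER KA
console.log(noUpper.test("abc"), noUpper.test("aBc"));
```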

Slide 33

Slide 33 text

PHP PCRE
use the u modifier flag, which directs the preg functions to treat the pattern and subject as Unicode UTF-8

Slide 34

Slide 34 text

PHP PCRE
match Han (Chinese, Japanese, Korean - CJK)
  $s="小山";
  preg_match("/^\p{Han}+$/u",$s);
add Latin, which includes the English alphabet
  $s="André is 小山";
  preg_match("/^(\p{Han}|\p{Latin}|\s)+$/u",$s);
add Common, which includes punctuation
  $s="André's adopted Chinese name is 小山!";
  preg_match("/^(\p{Han}|\p{Latin}|\p{Common})+$/u",$s);
見 en.wikipedia.org/wiki/Latin_script_in_Unicode
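The same three steps can be sketched in JavaScript, under the assumption that an ES2018 or later engine is available; this is a client-side counterpart, not the PHP from the slide. The é is written as the precomposed \u00E9 so that it counts as a single Latin character.

```javascript
const hanOnly = /^\p{Script=Han}+$/u;
const hanLatinSpace = /^(\p{Script=Han}|\p{Script=Latin}|\s)+$/u;
const hanLatinCommon = /^(\p{Script=Han}|\p{Script=Latin}|\p{Script=Common})+$/u;

const sentence = "Andr\u00E9's adopted Chinese name is 小山!";

console.log(hanOnly.test("小山"));
console.log(hanLatinSpace.test("Andr\u00E9 is 小山"));
console.log(hanLatinSpace.test(sentence));  // the apostrophe and ! are neither Han, Latin nor space
console.log(hanLatinCommon.test(sentence)); // Common covers spaces and punctuation
```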

Slide 35

Slide 35 text

PHP PCRE
match Number decimal (Nd)
  $s="123";
  preg_match("/^\p{Nd}+$/u",$s);
and again, match Number decimal
  $s="123१२३๑๒๓᭑᭒᭓";
  preg_match("/^\p{Nd}+$/u",$s);
見 codepoints.net/search?gc=Nd
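For comparison, a JavaScript sketch of the same check, assuming an engine with Unicode property escapes. The non-ASCII digits are written as escapes so that the scripts are explicit; the counterexamples are assumptions added to show what Nd excludes.

```javascript
const decimalDigits = /^\p{Nd}+$/u;

// ASCII, Devanagari, Thai and Balinese digits 1 2 3.
const mixedDigits = "123\u0967\u0968\u0969\u0E51\u0E52\u0E53\u1B51\u1B52\u1B53";

console.log(decimalDigits.test("123"));
console.log(decimalDigits.test(mixedDigits));
console.log(decimalDigits.test("一二三")); // Han numerals are letters (Lo), not Nd
console.log(decimalDigits.test("\u2167")); // ROMAN NUMERAL EIGHT is a letter number (Nl)
```

A pattern like this accepts digits from any script, so a validator that needs a single script's digits would still use an explicit range.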

Slide 36

Slide 36 text

PHP PCRE
Let's visit sci-project with a terminal, and revisit \X
  character sequences without U+200D ZERO WIDTH JOINER
  character sequences with U+200D ZERO WIDTH JOINER(s)
php70 -r 'echo preg_match("/^\X{1}$/u","H")."\n";' #returns 1 (true)
php70 -r 'echo preg_match("/^\X{1}$/u","9")."\n";' #returns 1 (true)
php70 -r 'echo preg_match("/^\X{1}$/u","☃")."\n";' #returns 1 (true)
php70 -r 'echo preg_match("/^\X{1}$/u","F")."\n";' #returns 0 (false)
php70 -r 'echo preg_match("/^\X{1}$/u","J")."\n";' #returns 0 (false)

Slide 37

Slide 37 text

PCRE2
PCRE2 successfully matches all Grapheme Clusters
pcre2grep uses PCRE2
pcre2grep -u '^\X{1}$' successfully matches H 9 ☃ F J
見 schappo.blogspot.co.uk/2017/12/computer-science-internationalization_18.html
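JavaScript regex has no \X, but grapheme clusters can be counted on the client side with Intl.Segmenter where the runtime provides it (recent browsers and Node.js releases do). This is a sketch under that assumption; the function name graphemeCount and the sample sequences are illustrative and not from the slides.

```javascript
// Hypothetical helper: number of grapheme clusters in s.
function graphemeCount(s) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return Array.from(segmenter.segment(s)).length;
}

const flagEmoji = "\u{1F1EC}\u{1F1E7}";  // 2 codepoints
const keycapEmoji = "1\uFE0F\u20E3";     // 3 codepoints
const familyEmoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"; // 7 codepoints

// Each sequence is expected to be a single grapheme cluster.
console.log(graphemeCount(flagEmoji), graphemeCount(keycapEmoji), graphemeCount(familyEmoji));
console.log(graphemeCount("ab")); // two separate clusters
```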

Slide 38

Slide 38 text

PHP PCRE
Also see
  見 regular-expressions.info/refunicode.html
  見 regular-expressions.info/unicode.html
  見 schappo.blogspot.co.uk/2015/12/unicode-regular-expressions.html
  見 schappo.blogspot.co.uk/2016/08/internationalizing-regular-expressions.html

Slide 39

Slide 39 text

The End
Fini