2 Data Validation
  Regular Expressions (also written RegExp or regex)
  Regular expressions are very useful for validating user input
  Regular expressions can replace hundreds of lines of code
  We will look at regular expression usage
    on Linux, using a terminal app
    client side, with JavaScript
    server side, with PHP
3 regex meta characters
  Meta Character   Function
  .                matches any single character
  *                preceding item matched 0 or more times
  ?                preceding item matched 0 or 1 times
  +                preceding item matched 1 or more times
  {n}              preceding item matched n times
  {n,}             preceding item matched n or more times
4 regex meta characters
  Meta Character   Function
  {,m}             preceding item matched at most m times
  {n,m}            preceding item matched n through m times
  [...]            matches any single character in the brackets, e.g. [abcd] [小中大]
  [^...]           matches any single character not in the brackets, e.g. [^༂༆༊]
  [.-.]            matches any single character in the range, e.g. [0-9]
  ^                matches the start of the string/line
5 regex meta characters
  Meta Character   Function
  $                matches the end of the string/line
  (...)            group
  |                or
  \                escape
  \s               white space (SPACE, IDEOGRAPHIC SPACE, EM QUAD, EN SPACE, THREE-PER-EM SPACE ...)
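The meta characters in the table can be tried out quickly in JavaScript. The sketch below is only illustrative: the sample patterns, the input strings and the helper name checkSamples are assumptions of mine, not taken from the slides. The {,m} form is left out because JavaScript, unlike egrep, does not treat it as a quantifier.

```javascript
// Each entry: [pattern, input, expected result of pattern.test(input)].
// All patterns and inputs are made-up samples for illustration.
const samples = [
  [/^a.c$/, 'abc', true],            // .  matches any single character
  [/^ab*c$/, 'ac', true],            // *  zero or more of the preceding item
  [/^ab?c$/, 'abbc', false],         // ?  zero or one of the preceding item
  [/^ab+c$/, 'ac', false],           // +  one or more of the preceding item
  [/^[0-9]{2,4}$/, '12345', false],  // {n,m} between n and m times
  [/^[^0-9]$/, 'x', true],           // [^...] any character not in the brackets
  [/^(cat|dog)s?$/, 'dogs', true],   // (...) group, | alternation
  [/^1\.5$/, '1x5', false],          // \  escapes the meta character .
  [/^a\sb$/, 'a\u3000b', true],      // \s also matches IDEOGRAPHIC SPACE
];

// Returns true only if every sample behaves as listed above.
function checkSamples(list) {
  return list.every(([re, input, expected]) => re.test(input) === expected);
}

console.log(checkSamples(samples));
```

Anchoring every pattern with ^ and $ makes each one describe the whole input, which is the usual shape for validation.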
6 egrep
  grep is a Unix/Linux command for finding patterns in text data
  grep provides the basic regex meta characters
  egrep (equivalent to grep -E) extends the set of regex meta characters
  I will use egrep interactively to demonstrate the use of regex on sci-project
7 egrep
  connecting to sci-project
    OSX: use Terminal: ssh [email protected]
    Windows: use PuTTY or a similar terminal app
  interactive usage from the command line
    egrep 'regex'
  See man egrep
9 egrep - regex examples
  match using regex '^[a-zA-Z]$'
    string: B       ✔
    string: Bb      ✘
  match using regex '^[a-zA-Z]+$'
    string: ahYgx   ✔
    string: B8      ✘
  match using regex '^[a-zA-Z]{5}$'
    string: mnOdm   ✔
    string: Tda     ✘
14 grep -P (PCRE)
  PCRE: Perl Compatible Regular Expressions
  Counting in Unicode codepoints
    grep -P '^.{2}$' will match flag emoji (two regional indicator symbols)
    grep -P '^.{3}$' will match keycap digit emoji (digit, variation selector, combining enclosing keycap)
    grep -P '^.{7}$' will match full family emoji (four people joined by three zero width joiners)
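The same codepoint counting can be reproduced in JavaScript, where the u flag makes . consume one whole codepoint. This is a sketch under assumptions: the particular emoji (a Thai flag, keycap 1, a man-woman-girl-boy family) and the helper name codePoints are my own choices for illustration.

```javascript
// Build the emoji from code points so each sequence is explicit.
const flag = String.fromCodePoint(0x1F1F9, 0x1F1ED);        // regional indicators T + H
const keycap = String.fromCodePoint(0x31, 0xFE0F, 0x20E3);  // digit 1 + VS16 + enclosing keycap
const family = String.fromCodePoint(
  0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467, 0x200D, 0x1F466 // man ZWJ woman ZWJ girl ZWJ boy
);

// Count code points (not UTF-16 code units) by iterating the string.
function codePoints(s) {
  return [...s].length;
}

console.log(codePoints(flag), /^.{2}$/u.test(flag));
console.log(codePoints(keycap), /^.{3}$/u.test(keycap));
console.log(codePoints(family), /^.{7}$/u.test(family));

// Without the u flag, . counts UTF-16 code units, so the flag is 4 units long.
console.log(flag.length, /^.{2}$/.test(flag));
```

Without the u flag a JavaScript regex sees surrogate pairs as two separate units, which is why the last line no longer matches {2}.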
15 regex online
  modifiers
    g - global match
    i - case insensitive match
    u - unicode
  regex101.com
  On regex101, /\X/gu works for keycap digits but not for flag or family group emoji
17 regex in JavaScript
  RegExp object test() method
  regex.test(string): tests for a match in string, returns true or false
    function latin(s){
      return (/ABC/i.test(s));
    }
    function cyrillic(s){
      return (/^АБВ$/i.test(s));
    }
18 regex in JavaScript
  You have already seen the ASCII letter function
    function letter(c){
      //returns true if c is ascii letter, else false
      return ((c>='a'&&c<='z')||(c>='A'&&c<='Z'));
    }
  Let's rewrite the function using regex
    function letter(c){
      //returns true if c is a single ascii letter, else false
      return (/^[a-zA-Z]$/.test(c)); //or /^[A-Z]$/i.test(c)
    }
19 regex in JavaScript
  Let's build a function to validate a Thai private car registration number entered into a form by a user.
    function car(reg){
      //true if valid reg number, else false
      return (/^[1-9]?[กขงจฉธฐพภวษศสชฌฎญฆ]{2}[ ]*[1-9][0-9]{0,3}$/.test(reg));
    }
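To see how the registration pattern behaves, it helps to run it over a few inputs. The sketch below assumes the digit quantifier is meant to be {0,3} (one to four digits with no leading zero); the sample registration strings are invented for illustration and are not real plates.

```javascript
// Pattern as on the slide, assuming [0-9]{0,3} for the trailing digits.
function car(reg) {
  // optional leading digit, two Thai consonants, optional spaces, 1 to 4 digits
  return /^[1-9]?[กขงจฉธฐพภวษศสชฌฎญฆ]{2}[ ]*[1-9][0-9]{0,3}$/.test(reg);
}

// Invented sample inputs, one per rule being exercised.
const carSamples = [
  'กข 1234',   // two consonants, space, four digits
  '1กข 99',    // optional leading digit present
  'กข 0123',   // number starts with 0
  'กข 12345',  // five digits
  'AB 1234',   // Latin letters instead of Thai consonants
];

for (const reg of carSamples) {
  console.log(reg, car(reg));
}
```

The first two samples are accepted and the last three are rejected, each for the reason given in its comment.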
20 regex in JavaScript
  function to validate a Javanese decimal digit
    function javaneseDigit(d){
      //true if d is a single javanese digit, else false
      return (/^[꧐-꧙]$/.test(d));
    }
21 regex in JavaScript
  function to validate an ascii email address
    function email(e){
      //true if valid ascii email address, else false
      return (/^[A-Z0-9]+\.([A-Z]\.)*[A-Z0-9]+@([A-Z0-9]+\.)+[A-Z]{2,}$/i.test(e));
    }
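A few sample addresses show what this pattern accepts. The sketch assumes the middle of the pattern reads [A-Z0-9]+@ and that the final quantifier is [A-Z]{2,}; the addresses and the example.com style domains are placeholders chosen for illustration.

```javascript
// Assumed reading of the slide's pattern: name "." optional initials "." name "@" domain
function email(e) {
  return /^[A-Z0-9]+\.([A-Z]\.)*[A-Z0-9]+@([A-Z0-9]+\.)+[A-Z]{2,}$/i.test(e);
}

// Placeholder addresses, one per rule being exercised.
const emailSamples = [
  'john.smith@example.com',       // first.last form
  'j.a.smith@mail.example.org',   // single-letter initials between the dots
  'johnsmith@example.com',        // no dot in the local part
  'john.smith@example',           // no top level domain
  'john.smith@example.c',         // top level domain shorter than 2 letters
];

for (const e of emailSamples) {
  console.log(e, email(e));
}
```

Note that under this reading the local part must contain at least one dot, so the pattern fits a first.last naming scheme rather than ASCII email addresses in general.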
22 regex in JavaScript
  function to validate a Korean email address
  Evaluation of Websites for Acceptance of Email Addresses
  See usag.tech/wp-content/uploads/2017/09/UASG-Report-UASG017.pdf
    function email(e){
      //true if valid email address, else false
      return (/^[가-힣0-9]+\.[가-힣0-9]+@([가-힣0-9]+\.)+한국$/.test(e));
    }
23 regex in JavaScript
  Twitter uses regex in its JavaScript tweet text parsing code, e.g. for validating TLDs (Top Level Domains)
  See github.com/twitter/twitter-text/blob/master/js/twitter-text.js
  Note the use of the RegExp object constructor
    a regex can be built from quoted strings
    consequently a regex can be split over multiple lines, making it easier to read
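Building a regex from quoted strings with the RegExp constructor might look like the sketch below. It is not Twitter's code: the names label, tlds, domainRegex and isListedDomain, and the short TLD list, are hypothetical and only show the technique.

```javascript
// Hypothetical pieces of a domain-name pattern, kept as separate strings.
const label = '[a-z0-9]+';
const tlds = ['com', 'org', 'net', 'th'];  // illustrative list only

// Inside a string literal the backslash itself must be escaped: '\\.' yields \.
const domainRegex = new RegExp(
  '^(' + label + '\\.)+' +       // one or more dot-terminated labels
  '(' + tlds.join('|') + ')$',   // followed by one of the listed TLDs
  'i'                            // case insensitive
);

function isListedDomain(d) {
  return domainRegex.test(d);
}

console.log(domainRegex.source);
console.log(isListedDomain('Example.COM'), isListedDomain('example.xyz'));
```

Splitting the pattern this way lets each part carry its own comment, and a long alternation such as a TLD list can be kept as an array.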
24 regex in JavaScript
  Twitter uses JavaScript regex to validate hashtags
  See stackoverflow.com/questions/8451846/actual-twitter-format-for-hashtags-not-your-regex-not-his-code-the-actual/
  Note: no SMP (Supplementary Multilingual Plane) chars (U+10000-U+1FFFF), hence no emoji hashtags, and yet it does include SIP (Supplementary Ideographic Plane) chars (U+20000-U+2FFFF)
25 regex in JavaScript
  RegExp object exec() method
  regex.exec(string): tests for a match in string, returns an array whose first element is the matched text, or null if there is no match
  See w3schools.com/jsref/jsref_regexp_exec.asp
    function cnum(s){
      //returns first chinese number in string, or null if there is none
      var m = /[一二三四五六七八九][零一二三四五六七八九]*/.exec(s);
      return (m ? m[0] : null);
    }
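The value returned by exec() carries more than the matched text. The sketch below shows the shape of the result; the helper name firstChineseNumber and the sample string are assumptions for illustration.

```javascript
const chineseNumber = /[一二三四五六七八九][零一二三四五六七八九]*/;

// Hypothetical helper: returns the first match and its position, or null.
function firstChineseNumber(s) {
  const m = chineseNumber.exec(s);  // match array, or null when nothing matches
  if (m === null) return null;
  return { text: m[0], index: m.index };  // m[0] is the matched text
}

console.log(firstChineseNumber('小山一二三北京四五南京'));
console.log(firstChineseNumber('北京'));
```

Checking for null before reading m[0] matters in validation code, because exec() returns null rather than an empty array when the input does not match.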
26 regex in JavaScript
  String match() method
  string.match(RegExp): returns all matches (as an array) if the g flag is used
  See w3schools.com/jsref/jsref_match.asp
    function cnum(s){
      return (s.match(/[一二三四五六七八九][零一二三四五六七八九]*/g));
    }
  cnum('小山一二三北京四五南京') returns ["一二三","四五"]
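The effect of the g flag on match() is easiest to see side by side. In this sketch the variable names and the choice to build both regexes from one pattern string are mine.

```javascript
const text = '小山一二三北京四五南京';
const pattern = '[一二三四五六七八九][零一二三四五六七八九]*';

// With g: an array of every matched substring.
const all = text.match(new RegExp(pattern, 'g'));

// Without g: only the first match, in the same form exec() returns (with .index).
const first = text.match(new RegExp(pattern));

// No match at all: null, not an empty array.
const none = '北京'.match(new RegExp(pattern, 'g'));

console.log(all, first[0], first.index, none);
```

Because a failed match gives null, code that loops over the result usually needs a fallback such as (s.match(re) || []).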
27 regex in JavaScript
  String replace() method
  string.replace(regex,replacement): returns a string with the first match (or all matches, if the g flag is used) replaced with replacement
  a programming challenge ➜ jsfiddle.net/coas/wda45gLp
  w3schools.com/jsref/jsref_replace.asp
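A short sketch of replace() with and without the g flag, plus a function as the replacement. This is not the jsfiddle challenge; the helper toAsciiDigits is a made-up example that maps each Chinese digit character to the ASCII digit in the same position (a character-by-character substitution, not a conversion of number words such as 十 or 百).

```javascript
// Without g only the first match is replaced; with g every match is.
const once = 'a-b-c'.replace(/-/, '+');
const every = 'a-b-c'.replace(/-/g, '+');

// Hypothetical helper: the replacement can be a function of the matched text.
const chineseDigits = '零一二三四五六七八九';
function toAsciiDigits(s) {
  return s.replace(/[零一二三四五六七八九]/g, function (ch) {
    return String(chineseDigits.indexOf(ch));  // position in the list is the digit value
  });
}

console.log(once, every, toAsciiDigits('电话 一二三四'));
```

replace() never modifies the original string; it returns a new one, so the result has to be assigned or returned.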
28 PCRE
  PCRE: Perl Compatible Regular Expressions
  PCRE has the aforementioned JavaScript RegExp constructs plus additional RegExp constructs to process Unicode character properties
  We will look at the use of PCRE in PHP
  See pcre.org
30 PHP PCRE
  Unicode character properties include the General Category
  Every Unicode character is assigned to a single General Category
    Lu - Letter uppercase
    Ll - Letter lowercase
    Lo - Letter other
    Nd - Number decimal
    Sm - Symbol mathematical
    Sc - Symbol currency
    ...
  See codepoints.net/search?gc=Sm
31 PHP PCRE
  A Unicode character can belong to a human language Script
    Balinese
    Cyrillic
    Han
    Latin
    Mongolian
    Thai
    Tibetan
  See unicode.org/Public/UCD/latest/ucd/Scripts.txt
32 PHP PCRE
  the property matching construct is
    \p{property}     matches a char that has the property
    \P{property}     matches a char that does not have the property
    \p{Lu}{3}        matches 3 uppercase letters
    \p{Sm}+          matches 1 or more maths symbols
    \p{Devanagari}   matches a single Devanagari char (Devanagari is used for writing Hindi)
  php.net/manual/en/regexp.reference.unicode.php
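For comparison with the client-side material earlier, modern JavaScript engines accept very similar property escapes when the u flag is set. This sketch is a JavaScript counterpart, not the PHP usage: note that JavaScript needs the Script= prefix for scripts, and the constant names and sample strings are my own.

```javascript
// General Category values can be written directly; scripts need Script=.
const threeUpper = /^\p{Lu}{3}$/u;                // exactly 3 uppercase letters
const maths = /^\p{Sm}+$/u;                       // 1 or more maths symbols
const devanagari = /^\p{Script=Devanagari}$/u;    // a single Devanagari char
const notLetter = /^\P{L}$/u;                     // a single char that is not a letter

console.log(threeUpper.test('ABC'), threeUpper.test('АБВ'), threeUpper.test('AbC'));
console.log(maths.test('+=∑'), maths.test('+a'));
console.log(devanagari.test('क'), devanagari.test('k'));
console.log(notLetter.test('7'), notLetter.test('a'));
```

The second threeUpper test uses Cyrillic capitals, which shows that \p{Lu} is about the character property rather than about any one alphabet.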
34 PHP PCRE
  match Han (Chinese, Japanese, Korean - CJK)
    $s="小山";
    preg_match("/^\p{Han}+$/u",$s);
  add Latin, which includes the English alphabet
  See en.wikipedia.org/wiki/Latin_script_in_Unicode
    $s="André is 小山";
    preg_match("/^(\p{Han}|\p{Latin}|\s)+$/u",$s);
  add Common, which includes punctuation
    $s="Andréʼs adopted Chinese name is 小山!";
    preg_match("/^(\p{Han}|\p{Latin}|\p{Common})+$/u",$s);
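The three steps (Han only, then Latin and white space, then Common) can also be tried in JavaScript, which makes it easy to see which inputs each pattern rejects. This is a sketch of a JavaScript counterpart, not the PHP code itself: JavaScript requires the Script= prefix, the constant names are hypothetical, and é is written as the precomposed U+00E9 and the apostrophe as an ASCII ' on purpose.

```javascript
const hanOnly = /^\p{Script=Han}+$/u;
const hanLatinSpace = /^(\p{Script=Han}|\p{Script=Latin}|\s)+$/u;
const hanLatinCommon = /^(\p{Script=Han}|\p{Script=Latin}|\p{Script=Common})+$/u;

const s1 = '小山';
const s2 = 'Andr\u00E9 is 小山';                            // precomposed e-acute is Latin
const s3 = "Andr\u00E9's adopted Chinese name is 小山!";    // ' and ! are Common

console.log(hanOnly.test(s1), hanOnly.test(s2));
console.log(hanLatinSpace.test(s2), hanLatinSpace.test(s3));
console.log(hanLatinCommon.test(s3));
```

Each wider pattern accepts the string the narrower one rejects: s2 fails hanOnly because of the Latin letters and spaces, and s3 fails hanLatinSpace because of the punctuation.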
35 PHP PCRE
  match Number decimal (Nd)
    $s="123";
    preg_match("/^\p{Nd}+$/u",$s);
  and again: match Number decimal
  See codepoints.net/search?gc=Nd
    $s="123१२३๑๒๓᭑᭒᭓";
    preg_match("/^\p{Nd}+$/u",$s);