Unicode Regular Expressions

Introductory lecture on Unicode Regular Expressions. Starts with ASCII text then progresses to Unicode text. Slide 33 has Egyptian Hieroglyphs😀

André Schappo

October 04, 2017

Transcript

  1. weibo.com/andreschappo
    twitter.com/andreschappo
    schappo.blogspot.co.uk
    Unicode Regular Expressions
    1
    André Schappo

  2. 2
    Data Validation
    Regular Expressions - also written as RegExp, regex
    Regular Expressions are very useful for validating user input
    Regular Expressions can replace hundreds of lines of code
    We will look at Regular Expressions usage
    on Linux using a terminal app
    client side with JavaScript
    server side with PHP

  3. 3
    regex meta characters
    Meta
    Character
    Function
    . match with any single character
    * preceding item matched 0 or more times
    ? preceding item matched 0 or 1 times
    + preceding item matched 1 or more times
    {n} preceding item matched n times
    {n,} preceding item matched n or more times

  4. 4
    regex meta characters
    Meta
    Character
    Function
    {,m} preceding item matched at most m times
    {n,m} preceding item matched n thru m times
    [...] matched with any single char in brackets [abcd] [小中大]
    [^...] matched with any single character not in brackets [^༂༆༊]
    [.-.] matched with any single character in range [0-9]
    ^ matched with start of string/line

  5. 5
    regex meta characters
    Meta
    Character
    Function
    $ matched with end of string/line
    (...) group
    | or
    \ escape
    \s white space (SPACE, IDEOGRAPHIC SPACE, EM QUAD, EN
    SPACE, THREE-PER-EM SPACE ...)
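
The meta characters in the tables above can be tried out in JavaScript, which the later slides use. This is a minimal sketch, not part of the original deck: the sample strings and the `failures` helper are made up for illustration. One assumption to note: `{,m}` is the egrep spelling, and JavaScript does not treat it as a quantifier, so `{0,m}` is used here instead.

```javascript
// Each entry: [regex, a string that should match, a string that should not].
const metaExamples = [
  [/^a.c$/, 'abc', 'ac'],            // .      any single character
  [/^ab*c$/, 'ac', 'adc'],           // *      0 or more times
  [/^ab?c$/, 'abc', 'abbc'],         // ?      0 or 1 times
  [/^ab+c$/, 'abbc', 'ac'],          // +      1 or more times
  [/^a{3}$/, 'aaa', 'aa'],           // {n}    exactly n times
  [/^a{2,}$/, 'aaaa', 'a'],          // {n,}   n or more times
  [/^a{0,2}$/, 'aa', 'aaa'],         // {0,m}  at most m times
  [/^a{2,3}$/, 'aaa', 'aaaa'],       // {n,m}  n through m times
  [/^[小中大]$/, '中', '山'],          // [...]  any single char in brackets
  [/^[^0-9]$/, 'x', '7'],            // [^...] any single char not in brackets
  [/^[0-9]$/, '5', 'x'],             // [.-.]  any single char in range
  [/^(cat|dog)$/, 'dog', 'cow'],     // (...) group, | or
  [/^a\.b$/, 'a.b', 'axb'],          // \      escape
  [/^a\sb$/, 'a\u3000b', 'ab'],      // \s     white space, here IDEOGRAPHIC SPACE
];

// Returns the entries that do not behave as the comments claim.
function failures() {
  return metaExamples.filter(
    ([regex, good, bad]) => !regex.test(good) || regex.test(bad)
  );
}
```

Calling `failures()` should give an empty array if every row behaves as described.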

  6. 6
    egrep
    grep is a unix/linux command for finding patterns in text data
    grep provides basic regexp/regex meta characters
    egrep extends the set of regexp/regex meta characters
    I will use egrep interactively to demonstrate the use of regex on
    sci-project

  7. 7
    egrep
    connecting to sci-project
    OSX
    use terminal — ssh [email protected]
    Windows
    use PuTTY or similar terminal app
    interactive usage from command line
    egrep 'regex'
    see man egrep

  8. 8
    egrep - regex examples
    match using regex '.*'
    string: absolutely anything ✔
    match using regex 'a.*t'
    string: 72auytqw ✔
    string: qwerty ✘
    match using regex '^a.*t$'
    string: azxcvbt ✔
    string: azxcvbtm ✘

  9. 9
    egrep - regex examples
    match using regex '^[a-zA-Z]$'
    string: B ✔
    string: Bb ✘
    match using regex '^[a-zA-Z]+$'
    string: ahYgx ✔
    string: B8 ✘
    match using regex '^[a-zA-Z]{5}$'
    string: mnOdm ✔
    string: Tda ✘

  10. 10
    egrep - regex examples
    match using regex '^0[0-9]{10}$'
    string: 01234567890 ✔
    string: 01234 567890 ✘
    match using regex '^0[0-9]{4}\s*[0-9]{6}$'
    string: 01234 567890 ✔
    string: 1234 567890 ✘
    match using regex '^0([0-79]*8){3,}[0-79]*$'
    string: 01834867880 ✔
    string: 01234567890 ✘
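
The three telephone-number patterns on this slide can also be checked outside egrep. The sketch below reuses them as JavaScript regex literals; the constant names are hypothetical, and the second pattern is written with a leading `^` on the assumption that the whole string is meant to be validated.

```javascript
// 0 followed by exactly ten more digits, no spaces allowed.
const elevenDigits = /^0[0-9]{10}$/;

// 0 plus four digits, optional white space, then six digits.
const spacedNumber = /^0[0-9]{4}\s*[0-9]{6}$/;

// Starts with 0 and contains at least three 8s:
// [0-79] is "any digit except 8", so each group ends at the next 8.
const atLeastThreeEights = /^0([0-79]*8){3,}[0-79]*$/;

function check(regex, s) {
  return regex.test(s) ? '✔' : '✘';
}
```

For example, `check(atLeastThreeEights, '01834867880')` gives `'✔'` because the string holds four 8s, while `'01234567890'` holds only one.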

  11. 11
    egrep - regex examples
    match using regex '^[一二三四五六七八九]+$'
    string: 五七八 ✔
    string: 小山 ✘
    match using regex '^人+鸭人+$'
    string: 人人人人人人人人人鸭人人人人人人人人人人人人 ✔
    string: 人人人人人鸡人人人人人人人人人人人人人人人 ✘
    match using regex '^[]+[]+$'
    string: ✔
    string: ✘
    see edition.cnn.com/travel/article/hong-kong-giant-duck

  12. 12
    Nottingham Arboretum

  13. 13
    egrep - regex examples
    match using regex '^鸭+人鸭+$'
    string: 鸭鸭鸭鸭鸭人鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭鸭 ✔
    string: 鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡鸡人鸡鸡鸡鸡鸡鸡 ✘
    match using regex '^+[]+$'
    string: ✔
    string: ✘

  14. 14
    grep -P (PCRE)
    PCRE — Perl Compatible Regular Expressions
    Counting in Unicode codepoints
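    grep -P '^.{2}$'
    will match with flag emoji (2 codepoints each: a pair of regional indicator symbols)
    grep -P '^.{3}$'
    will match with keycap digit emoji (3 codepoints each: digit, U+FE0F, U+20E3)
    grep -P '^.{7}$'
    will match with full family emoji (7 codepoints each: 4 people joined by 3 U+200D ZERO WIDTH JOINERs)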
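
The same codepoint counting can be reproduced in JavaScript with the `u` flag, which makes `.` consume one codepoint instead of one UTF-16 code unit. This is only a sketch: the three sample emoji are written as escapes and are my own choice of examples, not the ones on the slide.

```javascript
// Sample sequences, written as escapes so the codepoints are visible.
const flag = '\u{1F1EC}\u{1F1E7}';            // two regional indicators (G, B)
const keycap = '7\uFE0F\u20E3';               // digit + variation selector + keycap
const family =
  '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}'; // 4 people + 3 ZWJ

// Array.from iterates a string by codepoint, not by UTF-16 code unit.
function codePointCount(s) {
  return Array.from(s).length;
}

const matchesTwo = /^.{2}$/u.test(flag);      // counts codepoints because of u
const matchesThree = /^.{3}$/u.test(keycap);
const matchesSeven = /^.{7}$/u.test(family);

// Without the u flag the flag emoji is four code units, so .{2} fails.
const matchesTwoWithoutU = /^.{2}$/.test(flag);
```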

  15. 15
    regex online
    modifiers
    g - global match
    i - case insensitive match
    u - unicode
    regex101.com
    regex101 /\X/gu works for keycap digits but not flags or family
    groups emoji

  16. 16
    regex in JavaScript
    JavaScript RegEx Object
    /pattern/modifiers
    /ABC/
    /AbC/i
    /АБВ/ (Cyrillic)
    /АбВ/i (Cyrillic)

  17. 17
    regex in JavaScript
    RegEx Object Method
    regex.test(string) — test for match in string, returns true or false
    function latin(s){
    return (/ABC/i.test(s));
    }
    function cyrillic(s){
    return (/^АБВ$/i.test(s));
    }
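
A short usage sketch for the two functions above may help show what the anchors and the `i` modifier change. The functions are repeated so the block runs on its own; the sample inputs are assumptions chosen for illustration.

```javascript
function latin(s) {
  // Unanchored: true if "abc" appears anywhere in s, in any letter case.
  return /ABC/i.test(s);
}

function cyrillic(s) {
  // Anchored: true only if s is exactly the three Cyrillic letters, any case.
  return /^АБВ$/i.test(s);
}

const results = {
  latinInside: latin('xxaBcxx'),      // contains aBc
  latinMissing: latin('ab c'),        // space breaks the sequence
  cyrillicLower: cyrillic('абв'),     // lowercase Cyrillic
  cyrillicExtra: cyrillic('абвг'),    // extra letter, rejected by $
  lookalike: cyrillic('ABB'),         // Latin lookalikes are different characters
};
```

The last case is worth noticing: Latin `A` and Cyrillic `А` look the same but are different characters, so the Cyrillic pattern rejects the Latin string.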

  18. 18
    regex in JavaScript
    You have already seen the ASCII letter function
    Letʼs rewrite the function using regex
    function letter(c){
    //returns true if c is ascii letter, else false
    return ((c>='a'&&c<='z')||(c>='A'&&c<='Z'));
    }
    function letter(c){
    //returns true if c is ascii letter, else false
    return (/[a-zA-Z]/.test(c));
    //or /[A-Z]/i.test(c)
    }
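
One difference between the two versions is worth a sketch. The comparison version is meant for a single character, while the unanchored regex returns true for any string that contains a letter. Assuming the intent is "c is exactly one ASCII letter", anchoring the pattern gives that behaviour; the function names below are hypothetical.

```javascript
// Unanchored, as on the slide: true if any ASCII letter occurs in c.
function containsLetter(c) {
  return /[a-zA-Z]/.test(c);
}

// Anchored variant: true only if c is exactly one ASCII letter.
function isSingleLetter(c) {
  return /^[a-zA-Z]$/.test(c);
}
```

With these, `containsLetter('1a')` is true but `isSingleLetter('1a')` is false, and `isSingleLetter('é')` is false because é is outside the ASCII range.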

  19. 19
    regex in JavaScript
    Letʼs build a function to validate a Thai private car registration
    number entered into a form by a user.
    function car(reg){
    //true if valid reg number, else false
    return (/^[1-9]?[กขงจฉธฐพภวษศสชฌฎญฆ]{2}[       ]*[1-9][0-9]{0,3}$/.test(reg));
    }
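
The pattern can be read as four parts: an optional leading digit 1-9, exactly two Thai letters from the permitted set, optional spacing, and a number from 1 to 9999 with no leading zero. The sketch below builds the same shape from named pieces. It is an illustration under assumptions: `\s*` stands in for the slide's explicit list of space characters, and the names and sample plates are invented.

```javascript
// Letters copied from the slide's character class.
const thaiLetters = 'กขงจฉธฐพภวษศสชฌฎญฆ';

const carRegistration = new RegExp(
  '^' +
  '[1-9]?' +                    // optional leading digit
  '[' + thaiLetters + ']{2}' +  // exactly two permitted Thai letters
  '\\s*' +                      // optional white space (assumption, see above)
  '[1-9][0-9]{0,3}' +           // 1 to 9999, no leading zero
  '$'
);

function car(reg) {
  // true if reg has the shape of a valid registration number, else false
  return carRegistration.test(reg);
}
```

For instance, `car('1กข 1234')` should be accepted, while `car('กข 0123')` is rejected because the number starts with 0.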

  20. 20
    regex in JavaScript
    function to validate a Javanese decimal digit
    function javaneseDigit(d){
    //true if javanese digit, else false
    return (/[꧐-꧙]/.test(d));
    }
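
As a possible alternative to a literal range, JavaScript engines that support Unicode property escapes (ES2018, with the `u` flag) can ask for "a decimal digit whose script is Javanese". The sketch below shows both approaches side by side; the function names are hypothetical and the property-based version assumes such an engine.

```javascript
// Range version, anchored so that d must be exactly one character.
// U+A9D0..U+A9D9 are JAVANESE DIGIT ZERO..NINE.
function javaneseDigit(d) {
  return /^[\uA9D0-\uA9D9]$/.test(d);
}

// Property version: General Category Nd and Script Javanese.
function javaneseDigitByProperty(d) {
  return /^\p{Nd}$/u.test(d) && /^\p{Script=Javanese}$/u.test(d);
}

// Numeric value of a Javanese digit, or NaN for anything else.
function javaneseDigitValue(d) {
  return javaneseDigit(d) ? d.codePointAt(0) - 0xA9D0 : NaN;
}
```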

  21. 21
    regex in JavaScript
    function to validate an ascii email address
    function email(e){
    //true if valid ascii email address, else false
    return (/^[A-Z0-9]+\.([A-Z]\.)*[A-Z0-9]+@([A-Z0-9]+\.)+[A-Z]{2,}$/i.test(e));
    }

  22. 22
    regex in JavaScript
    function to validate a Korean email address
    Evaluation of Websites for Acceptance of Email Addresses
    see usag.tech/wp-content/uploads/2017/09/UASG-Report-UASG017.pdf
    function email(e){
    //true if valid email address, else false
    return (/^[가-힣0-9]+\.[가-힣0-9]+@([가-힣0-9]+\.)+한국$/.test(e));
    }

  23. 23
    regex in JavaScript
    twitter use regex in their JavaScript tweet text parsing code
    eg validation TLDs (Top Level Domains)
    see github.com/twitter/twitter-text/blob/master/js/twitter-text.js
    Note use of RegExp object constructor
    regex can be built as quoted strings
    consequently regex can be split over multiple lines, making for a
    more easily readable regex
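
To illustrate the point about the RegExp constructor, here is a minimal sketch that builds a regex from quoted strings spread over several commented lines. It is not Twitter's code: the simplified ASCII email shape, the `label` piece and the variable names are all assumptions made for the example. Note that backslashes must be doubled inside string literals.

```javascript
// One run of ASCII letters or digits; reused in several places below.
const label = '[A-Z0-9]+';

const emailPattern = new RegExp(
  '^' +
  label + '(\\.' + label + ')*' +  // local part: labels separated by dots
  '@' +
  '(' + label + '\\.)+' +          // one or more domain labels, each ending in a dot
  '[A-Z]{2,}' +                    // top level domain: two or more letters
  '$',
  'i'                              // modifiers go in the second argument
);

function looksLikeEmail(e) {
  return emailPattern.test(e);
}
```

Splitting the pattern this way lets each part carry its own comment, which a single-line regex literal cannot do.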

  24. 24
    regex in JavaScript
    twitter use JavaScript regex to validate hashtags
    see stackoverflow.com/questions/8451846/actual-twitter-format-for-hashtags-not-your-regex-not-his-code-the-actual/
    Note: no SMP chars (\u10000-\u1FFFF) hence no emoji
    hashtags
    but yet it does include SIP chars (\u20000-\u2FFFF)
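
Characters outside the Basic Multilingual Plane, such as SMP emoji and SIP ideographs, need extra care in JavaScript because strings are sequences of UTF-16 code units. The following sketch, which is my own illustration and not taken from the Twitter code, shows how the `u` flag changes matching for such characters.

```javascript
const grin = '\u{1F600}';        // SMP: GRINNING FACE
const ideograph = '\u{20000}';   // SIP: first CJK Extension B ideograph

// Each supplementary character occupies two UTF-16 code units.
const codeUnits = [grin.length, ideograph.length];

// Without u, "." matches one code unit, so a lone emoji is not "one character".
const singleWithoutU = /^.$/.test(grin);

// With u, "." matches one codepoint and \u{...} ranges are allowed.
const singleWithU = /^.$/u.test(grin);

function isEmoticon(ch) {
  // Emoticons block, U+1F600..U+1F64F
  return /^[\u{1F600}-\u{1F64F}]$/u.test(ch);
}

function isSipChar(ch) {
  // Supplementary Ideographic Plane, U+20000..U+2FFFF
  return /^[\u{20000}-\u{2FFFF}]$/u.test(ch);
}
```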

  25. 25
    regex in JavaScript
    RegEx Object exec() Method
    regex.exec(string): tests for a match in string; returns an array whose first element is the matched text, or null if there is no match
    see w3schools.com/jsref/jsref_regexp_exec.asp
    function cnum(s){
    //returns first chinese number in string
    return (/[一二三四五六七八九][零一二三四五六七八九]*/.exec(s));
    }
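
Because `exec()` returns an array (or `null`) and not a plain string, callers usually unpack it. The sketch below is one way to do that, assuming the same Chinese-number pattern; `firstChineseNumber` and the shape of its return value are hypothetical.

```javascript
const chineseNumber = /[一二三四五六七八九][零一二三四五六七八九]*/;

// Returns the first Chinese number and where it starts, or null if none.
function firstChineseNumber(s) {
  const m = chineseNumber.exec(s);
  if (m === null) {
    return null;                           // no match at all
  }
  return { text: m[0], index: m.index };   // m[0] is the matched text
}
```

For the string `'小山一二三北京'` this yields the text `'一二三'` at index 2, and for a string with no Chinese digits it yields `null`.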

  26. 26
    regex in JavaScript
    String match() Method
    string.match(RegExp) — returns all matches (as array) if g flag is
    used
    see w3schools.com/jsref/jsref_match.asp
    function cnum(s){
    return (s.match(/[一二三四五六七八九][零一二三四五六七八九]*/g));
    }
    cnum('小山一二三北京四五南京') returns ["一二三","四五"]

  27. 27
    regex in JavaScript
    String replace() Method
    string.replace(regex,replacement) — returns a string with all (if g
    flag is used) matches replaced with replacement
    a programming challenge ➜ jsfiddle.net/coas/wda45gLp
    w3schools.com/jsref/jsref_replace.asp
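
As an illustration of `replace()` with and without the `g` flag, the sketch below converts Chinese digits to ASCII digits using a replacement function. It is not a solution to the linked programming challenge; the digit table and function names are assumptions made for this example.

```javascript
// Position in this string is the numeric value of each digit.
const chineseDigits = '零一二三四五六七八九';

// With g: every Chinese digit is replaced.
function toAsciiDigits(s) {
  return s.replace(/[零一二三四五六七八九]/g,
    (ch) => String(chineseDigits.indexOf(ch)));
}

// Without g: only the first Chinese digit is replaced.
function firstToAsciiDigit(s) {
  return s.replace(/[零一二三四五六七八九]/,
    (ch) => String(chineseDigits.indexOf(ch)));
}
```

So `toAsciiDigits('北京四五')` gives `'北京45'`, while `firstToAsciiDigit('北京四五')` gives `'北京4五'`.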

  28. 28
    PCRE
    PCRE: Perl Compatible Regular Expressions
    PCRE has aforementioned JavaScript RegExp constructs +
    additional RegExp constructs to process Unicode character
    properties
    We will look at the use of PCRE in PHP
    see pcre.org

  29. 29
    PHP PCRE
    PHP PCRE functions include —
    preg_match
    preg_replace
    see php.net/manual/en/book.pcre.php

  30. 30
    PHP PCRE
    Unicode character properties include the General Category
    Every Unicode character is assigned to a single General
    Category
    Lu — Letter uppercase
    Ll — Letter lowercase
    Lo — Letter other
    Nd — Number decimal
    Sm — Symbol mathematical
    Sc — Symbol currency
    ...
    see codepoints.net/search?gc=Sm

  31. 31
    PHP PCRE
    A Unicode character can belong to a human language Script
    Balinese
    Cyrillic
    Han
    Latin
    Mongolian
    Thai
    Tibetan
    see unicode.org/Public/UCD/latest/ucd/Scripts.txt

  32. 32
    PHP PCRE
    the property matching construct is
    \p{property} — match with char having property
    \P{property} — do not match with char having property
    \p{Lu}{3} — match with 3 uppercase letters
    \p{Sm}+ — match with 1 or more maths symbols
    \p{Devanagari} — match with a single devanagari char
    (devanagari is used for writing Hindi)
    php.net/manual/en/regexp.reference.unicode.php
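
For comparison with the PCRE constructs above, newer JavaScript engines (ES2018 onwards) accept similar property escapes when the `u` flag is set. The sketch below is an aside under that assumption; one difference is that a script is written as `\p{Script=Devanagari}` in JavaScript. The sample strings are my own.

```javascript
// \p{Lu}{3}: exactly three uppercase letters, from any script.
const threeUppercase = /^\p{Lu}{3}$/u;

// \p{Sm}+: one or more mathematical symbols.
const mathSymbols = /^\p{Sm}+$/u;

// A single Devanagari character; JavaScript needs the Script= prefix.
const devanagariChar = /^\p{Script=Devanagari}$/u;

// \P{...} is the negation: a single character that is not an uppercase letter.
const notUppercase = /^\P{Lu}$/u;

const checks = [
  threeUppercase.test('ABC'),      // Latin uppercase
  threeUppercase.test('АБВ'),      // Cyrillic uppercase
  mathSymbols.test('+=\u2211'),    // plus, equals, n-ary summation
  devanagariChar.test('\u0915'),   // DEVANAGARI LETTER KA
  notUppercase.test('a'),
];
```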

  33. 33
    PHP PCRE
    use the u modifier flag which directs preg to use Unicode UTF-8
    <?php
    $s="";
    echo preg_match("/^\p{Egyptian_Hieroglyphs}+$/u",$s);
    ?>

  34. 34
    PHP PCRE
    match Han (Chinese, Japanese, Korean - CJK)
    add Latin, which includes the English alphabet
    add Common, which includes punctuation
    see en.wikipedia.org/wiki/Latin_script_in_Unicode
    $s="小山";
    preg_match("/^\p{Han}+$/u",$s);
    $s="André is 小山";
    preg_match("/^(\p{Han}|\p{Latin}|\s)+$/u",$s);
    $s="Andréʼs adopted Chinese name is 小山!";
    preg_match("/^(\p{Han}|\p{Latin}|\p{Common})+$/u",$s);
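
The same three steps can be sketched in JavaScript for engines that support Unicode property escapes. This is a rough equivalent, not the author's code: the function names are hypothetical, and scripts are written with the `Script=` prefix that JavaScript requires.

```javascript
// Han only.
function onlyHan(s) {
  return /^\p{Script=Han}+$/u.test(s);
}

// Han, Latin (which includes é) and white space.
function hanLatinSpace(s) {
  return /^(\p{Script=Han}|\p{Script=Latin}|\s)+$/u.test(s);
}

// Han, Latin and Common (punctuation, spaces, digits ...).
function hanLatinCommon(s) {
  return /^(\p{Script=Han}|\p{Script=Latin}|\p{Script=Common})+$/u.test(s);
}
```

One caveat: if é is stored as `e` followed by U+0301 COMBINING ACUTE ACCENT, the combining mark has the script Inherited, so none of these patterns accept it unless the text is normalized first (for example with `s.normalize('NFC')`).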

  35. 35
    PHP PCRE
    match Number decimal (Nd)
    and again — match Number decimal
    see codepoints.net/search?gc=Nd
    $s="123";
    preg_match("/^\p{Nd}+$/u",$s);
    $s="123१२३๑๒๓᭑᭒᭓";
    preg_match("/^\p{Nd}+$/u",$s);
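
A JavaScript sketch of the same idea, assuming an engine with Unicode property escapes, makes the contrast with `\d` visible: in JavaScript `\d` only matches ASCII 0-9, while `\p{Nd}` accepts decimal digits from every script. The function names and the mixed sample string are made up for illustration.

```javascript
// Any decimal digits, from any script.
function allDecimalDigits(s) {
  return /^\p{Nd}+$/u.test(s);
}

// ASCII digits only; \d means [0-9] in JavaScript.
function asciiDigitsOnly(s) {
  return /^\d+$/.test(s);
}

// ASCII, Devanagari and Thai digits for 1 2 3.
const mixed = '123\u0967\u0968\u0969\u0E51\u0E52\u0E53';
```

Here `allDecimalDigits(mixed)` is true and `asciiDigitsOnly(mixed)` is false, which matters when a form should accept only digits that the server side can later parse.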

  36. 36
    PHP PCRE
    Letʼs visit sci-project with terminal, and revisit \X
    character sequences without U+200D ZERO WIDTH JOINER
    character sequences with U+200D ZERO WIDTH JOINER(s)
    php70 -r 'echo preg_match("/^\X{1}$/u","H")."\n";' #returns 1(true)
    php70 -r 'echo preg_match("/^\X{1}$/u","9")."\n";' #returns 1(true)
    php70 -r 'echo preg_match("/^\X{1}$/u","☃")."\n";' #returns 1(true)
    php70 -r 'echo preg_match("/^\X{1}$/u","F")."\n";' #returns 0(false)
    php70 -r 'echo preg_match("/^\X{1}$/u","J")."\n";' #returns 0(false)
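
JavaScript regular expressions have no `\X`, but a similar "is this one grapheme cluster?" check can be sketched with `Intl.Segmenter`. This is an aside that assumes a runtime providing `Intl.Segmenter` (recent browsers and Node.js releases); the function names and the sample sequences are my own.

```javascript
// Counts user-perceived characters (extended grapheme clusters).
function graphemeCount(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(s)).length;
}

function isSingleGrapheme(s) {
  return graphemeCount(s) === 1;
}

const samples = {
  snowman: '\u2603',                      // one codepoint
  keycap: '9\uFE0F\u20E3',                // no ZERO WIDTH JOINER
  flag: '\u{1F1EC}\u{1F1E7}',             // no ZERO WIDTH JOINER
  family:                                 // joined with U+200D ZERO WIDTH JOINERs
    '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}',
};
```

Each of the four samples should count as a single grapheme cluster, including the family sequence joined with U+200D.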

  37. 37
    PCRE2
    PCRE2 successfully matches with all Grapheme Clusters
    pcre2grep uses PCRE2
    pcre2grep -u '^\X{1}$' successfully matches with

    H 9 ☃ F J
    schappo.blogspot.co.uk/2017/12/computer-science-internationalization_18.html

  38. 38
    PHP PCRE
    Also see —
    see regular-expressions.info/refunicode.html
    see regular-expressions.info/unicode.html
    see schappo.blogspot.co.uk/2015/12/unicode-regular-expressions.html
    see schappo.blogspot.co.uk/2016/08/internationalizing-regular-expressions.html

  39. 39
    The End
    Fini
