Unicode Regular Expression Engines

Nova Patch
November 04, 2014

Regular expression engines in most modern programming languages and libraries have been rapidly adding Unicode features in recent years. At Shutterstock, as at most other companies, we use a variety of programming languages, so it's important to know each language's strengths, weaknesses, and differences.

This presentation reviews Unicode regex features and compares support for these features in many popular engines as of November 2014. Features discussed include escape sequences, character properties, character classes, grapheme clusters, boundary anchors, and line breaks. Languages with core regex engines including Perl, Python, Java, and JavaScript are compared along with the PCRE, .NET, Onigmo, and ICU libraries, as well as languages that use them like Ruby and PHP.

Presented at:
◦ 2014-11-04: Internationalization & Unicode Conference 38 (IUC38), Santa Clara, CA

Transcript

  1. Unicode Regular Expression Engines
    Internationalization & Unicode Conference, November 4, 2014 #IUC38
  2. Unicode Regular Expression Engines Nick Patch @nickpatch Shutterstock

  3. Shutterstock Is Multilingual
    Český Magyar Türkçe Dansk Nederlands Русский Deutsch Norsk ไทย English Polski 한국어 Español Português 中文 Français Suomi 日本語 Italiano Svenska
  4. Shutterstock Is Multilingual
    C♯ PHP Java Python JavaScript Ruby Perl SQL … and Regular Expressions!
  5. This talk is about …
    1. Real-world Unicode regular expressions
    2. Production regular expression engines
    3. Modern programming language support
  6. This talk is not about …
    1. UTS #18 specification
    2. Historic regex engines
    3. Future regex development
    4. Being a reference guide
  7. Languages
    Perl 5.20          2014-05-27  U6.3
    Python 3.4         2014-03-16  U6.3
    Ruby 2.1.4         2014-10-27  U7.0
    Java 8             2014-03-18  U6.2
    JavaScript 1.8.5   2010-07-27  U3.0
    PHP 5.6            2014-08-28  U6.3
  8. Languages with built-in regex engines
    Perl 5.20          2014-05-27  U6.3
    Python 3.4         2014-03-16  U6.3
    Ruby 2.1.4         2014-10-27  U7.0
    Java 8             2014-03-18  U6.2
    JavaScript 1.8.5   2010-07-27  U3.0
    PHP 5.6            2014-08-28  U6.3
  9. Libraries
    PCRE 8.36          2014-09-26  U7.0
    .NET 4.5           2012-08-15  U5.0/6.0
    Oniguruma 5.9.5    2013-10-21  U??
    Onigmo 5.15        2014-07-18  U7.0
    ICU4C 54           2014-10-01  U7.0
  10. PCRE Perl Compatible Regular Expressions PHP R Erlang Elixir

  11. .NET Framework Visual Basic C♯ F♯ PowerShell

  12. Oniguruma PHP 5.0+ multibyte strings Ruby 1.9

  13. Onigmo fork of Oniguruma Ruby 2.0+

  14. Code point matcher . match a code point except newline (by default)
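
As a small illustration of the code point matcher, here is a minimal sketch using Python's standard `re` module (one of the engines compared in this talk). The sample string is an assumption chosen for the example: an `n` followed by COMBINING DIAERESIS and a newline. It shows that `.` consumes one code point at a time rather than one user-perceived character, and that the newline is only matched when the dot-all option is enabled.

```python
import re

# 'n' + COMBINING DIAERESIS + newline: three code points, one visible letter.
text = "n\u0308\n"

# Default behavior: '.' matches each code point separately and skips '\n'.
default_matches = re.findall(r".", text)

# With re.DOTALL, '.' matches the newline as well.
dotall_matches = re.findall(r".", text, flags=re.DOTALL)

print(len(default_matches), len(dotall_matches))
```

The base letter and its combining mark come back as two separate matches, which is the gap the grapheme cluster matcher on the next slide is meant to close.
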
  15. Grapheme cluster matcher \X
    Spın̈al Tap n\N{COMBINING DIAERESIS} 각 ก ந षि CRLF (\r\n)
  16. \X grapheme cluster support
    Extended grapheme clusters: Perl, PCRE, ICU4C
    Legacy grapheme clusters: Onigmo
    Unsupported: Python, Java, JavaScript, .NET, Oniguruma
  17. \X alternatives Legacy grapheme cluster (?>\PM\pM*)
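
Python's `re` module supports neither `\X` nor `\p`, so the legacy pattern `(?>\PM\pM*)` cannot be written there directly. The following is a rough sketch of the same idea (a non-mark followed by any number of marks) built on the standard `unicodedata` module. The helper names `is_mark` and `legacy_clusters` are hypothetical, and unlike the regex, this version also starts a cluster on a stray leading mark instead of skipping it.

```python
import unicodedata


def is_mark(ch):
    """Approximate \\pM: General Category M (Mn, Mc, or Me)."""
    return unicodedata.category(ch).startswith("M")


def legacy_clusters(text):
    """Split text into a base code point plus its trailing marks.

    Like the legacy pattern, this does not keep CRLF or Hangul jamo
    sequences together; only extended grapheme clusters do that.
    """
    clusters = []
    for ch in text:
        if clusters and is_mark(ch):
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


title = "Sp\u0131n\u0308al Tap"  # contains 'n' + COMBINING DIAERESIS
print(len(title), len(legacy_clusters(title)))
```

Counting clusters instead of code points gives the length a reader would expect for the sample string.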

  18. \X alternatives Extended grapheme cluster
    (?:(?:\u000D\u000A)|(?:[\u0E40\u0E41\u0E42\u0E43\u0E44\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4\uAAB5\uAAB6\uAAB9\uAABB\uAABC]*(?:[\u1100-\u115F\uA960-\uA97C]+|([\u1100-\u115F\uA960-\uA97C]*((?:[[\u1160-\u11A2\uD7B0-\uD7C6][\uAC00\uAC1C\uAC38]][\u1160-\u11A2\uD7B0-\uD7C6]*|[\uAC01\uAC02\uAC03\uAC04])[\u11A8-\u11F9\uD7CB-\uD7FB]*))|[\u11A8-\u11F9\uD7CB-\uD7FB]+|[^[\p{Zl}\p{Zp}\p{Cc}\p{Cf}&&[^\u000D\u000A\u200C\u200D]]\u000D\u000A])[[\p{Mn}\p{Me}\u200C\u200D\u0488\u0489\u20DD\u20DE\u20DF\u20E0\u20E2\u20E3\u20E4\uA670\uA671\uA672\uFF9E\uFF9F][\p{Mc}\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3]]*)|(?s:.))
  19. Property matchers \p{…}

  20. Property matchers General Category \p{General_Category=Letter}

  21. Property matchers General Category \p{gc=L} abbreviated

  22. Property matchers General Category \p{Letter} implicit category

  23. Property matchers General Category \p{L} implicit category + abbreviated

  24. Property matchers General Category \pL optional braces when single character

  25. Property matchers General Category \PL negation

  26. Property matchers General Category
    L Letter, M Mark, N Number, P Punctuation, S Symbol, Z Separator, C Other
  27. Property matchers General Category
    S Symbol, Sm Math_Symbol, Sc Currency_Symbol, Sk Modifier_Symbol, So Other_Symbol
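
For an engine without `\p{…}`, such as Python's `re`, General Category tests can be approximated outside the regex. This is a minimal sketch using the standard `unicodedata.category` function, which returns the two-letter abbreviations listed above. The helpers `has_category` and `strip_category` are hypothetical names introduced only for this example.

```python
import unicodedata


def has_category(ch, gc):
    """Rough stand-in for \\p{gc}.

    A one-letter gc such as "S" matches the whole major class;
    a two-letter gc such as "Sc" matches that subcategory only.
    """
    return unicodedata.category(ch).startswith(gc)


def strip_category(text, gc):
    """Remove every code point in the given category, like s/\\p{gc}//g."""
    return "".join(ch for ch in text if not has_category(ch, gc))


print(has_category("\u20ac", "Sc"))   # EURO SIGN against Currency_Symbol
print(strip_category("a+b=\u20ac1", "S"))
```

Negation, as in `\PL`, is simply `not has_category(ch, "L")`.
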
  28. Property matchers Script \p{Script=Latin}

  29. Property matchers Script \p{sc=Latin} abbreviated

  30. Property matchers Script \p{Latin} implicit script

  31. Property matchers Script [\p{Hiragana} \p{Katakana} \p{Han} \p{Latin} \p{Common}]

  32. Property matchers Script [\p{Hira} \p{Kana} \p{Hani} \p{Latn} \p{Zyyy}]

  33. Property matchers Script
    Arab Arabic, Beng Bengali, Deva Devanagari, Egyp Egyptian hieroglyphs, Ethi Ethiopic, Grek Greek, Hang Hangul
  34. Property matchers Script s/ е ( \p{Cyrl} ) и $/я$1/x

  35. Property matchers Others
    \p{Numeric_Value=10}
    \p{East_Asian_Width=Fullwidth}
    \p{Script=Cyrillic}
    ✓ \p{Script_Extensions=Cyrillic}
    ✗ \p{Block=Cyrillic}
    ✗ \p{Block=Cyrillic_Extended_A}
  36. Property matchers Others
    \p{nv=10}
    \p{ea=F}
    \p{Cyrl}
    ✓ \p{scx=Cyrl}
    ✗ \p{blk=Cyrillic}
    ✗ \p{blk=Cyrillic_Ext_A}
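
Two of these properties, Numeric_Value and East_Asian_Width, can be checked in Python without regex support through the standard `unicodedata` module. The sketch below assumes that approach, and the helper names are hypothetical. As far as I know, `unicodedata` does not expose Script, Script_Extensions, or Block, so those are left out.

```python
import unicodedata


def numeric_value_is(ch, value):
    """Rough stand-in for \\p{Numeric_Value=value}."""
    # The second argument is returned for code points with no numeric value.
    return unicodedata.numeric(ch, None) == value


def is_fullwidth(ch):
    """Rough stand-in for \\p{East_Asian_Width=Fullwidth}."""
    return unicodedata.east_asian_width(ch) == "F"


print(numeric_value_is("\u2169", 10))  # ROMAN NUMERAL TEN
print(is_fullwidth("\uff21"))          # FULLWIDTH LATIN CAPITAL LETTER A
```
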
  37. Property matchers Binary \p{White_Space=Yes} \p{Hex_Digit=Yes} \p{Variation_Selector=Yes} \p{Deprecated=Yes}

  38. Property matchers Binary \p{White_Space} \p{Hex_Digit} \p{Variation_Selector} \p{Deprecated}

  39. Property matchers Binary \p{WSpace} \p{Hex} \p{VS} \p{Dep}

  40. Property matchers Binary \p{White_Space=No} \p{Hex_Digit=No} \p{Variation_Selector=No} \p{Deprecated=No}

  41. Property matchers Binary \P{White_Space} \P{Hex_Digit} \P{Variation_Selector} \P{Deprecated}

  42. Property matchers Binary \P{WSpace} \P{Hex} \P{VS} \P{Dep}

  43. \p property support
    ICU4C: full support
    Perl: full support + Perl extensions
    Java: GC, Script, Binary, Block + Java ext.
    Onigmo: GC, Script, Binary, Block, Age
    PCRE: GC, Script
    Oniguruma: GC, Script
    .NET: GC, Block
    Unsupported: Python, JavaScript
  44. Unicodified escape sequences
    \s  \p{White_Space}
    \d  \p{gc=Decimal_Number}
    \w  [ \p{alpha} \p{gc=Mark} \p{digit} \p{gc=Connector_Punctuation} \p{Join_Control} ]
    \b  \w-based boundaries
  45. Unicodified escape sequences
    Unicode default (with ASCII option): ICU4C, Perl, Python, Java, .NET
    Unicode default (for Unicode encodings only): Oniguruma, Onigmo
    ASCII default (with Unicode option): PCRE
    Partial support: JavaScript (\s only!)
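
To make "Unicode default (with ASCII option)" concrete, here is a minimal sketch for Python 3's `re` module, where `str` patterns use Unicode semantics for `\w` and `\d` unless the `re.ASCII` flag is given. The sample text is an assumption picked for the example.

```python
import re

# 'café' followed by ARABIC-INDIC DIGIT THREE.
text = "caf\u00e9 \u0663"

# Unicode semantics (the default for str patterns in Python 3).
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d", text)

# ASCII-only semantics via the re.ASCII option.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)

print(unicode_words, ascii_words)
print(unicode_digits, ascii_digits)
```

With the ASCII option the accented letter ends the word early and the Arabic-Indic digit is no longer treated as a digit.
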
  46. Linebreak matcher \R
    LF (\n), CR (\r), FF (\f), CRLF (\r\n), NEL, VT, LS, PS
  47. \R linebreak support
    Supported: Perl, Java, PCRE, Onigmo
    Unsupported: Python, JavaScript, .NET, Oniguruma, ICU4C
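
For an engine without `\R`, one possible workaround is to spell out the linebreak set from the previous slide by hand. The sketch below does this for Python's `re` module. The name `LINEBREAK` is hypothetical, and the pattern is an approximation rather than an exact equivalent of any engine's `\R`. CRLF is listed first so that the pair is consumed as a single break.

```python
import re

# CRLF first, then LF, VT, FF, CR, NEL, LS, PS as single code points.
LINEBREAK = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

sample = "a\r\nb\nc\u2028d\x85e"
lines = LINEBREAK.split(sample)

print(lines)
```
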
  48. Nick Patch @nickpatch Shutterstock