Unicode Regular Expression Engines

Nova Patch
November 04, 2014

Regular expression engines in most modern programming languages and libraries have been rapidly adding Unicode features in recent years. At Shutterstock, as at most other companies, we use a variety of programming languages, so it's important to know each language's strengths, weaknesses, and differences.

This presentation reviews Unicode regex features and compares support for these features in many popular engines as of November 2014. Features discussed include escape sequences, character properties, character classes, grapheme clusters, boundary anchors, and line breaks. Languages with core regex engines including Perl, Python, Java, and JavaScript are compared along with the PCRE, .NET, Onigmo, and ICU libraries, as well as languages that use them like Ruby and PHP.

Presented at:
◦ 2014-11-04: Internationalization & Unicode Conference 38 (IUC38), Santa Clara, CA

Transcript

  1. Shutterstock Is Multilingual
    Český, Magyar, Türkçe, Dansk, Nederlands, Русский, Deutsch, Norsk, ไทย, English, Polski, 한국어, Español, Português, 中文, Français, Suomi, 日本語, Italiano, Svenska
  2. This talk is about …
    1. Real-world Unicode regular expressions
    2. Production regular expression engines
    3. Modern programming language support
  3. This talk is not about …
    1. UTS #18 specification
    2. Historic regex engines
    3. Future regex development
    4. Being a reference guide
  4. Languages
    Perl 5.20, 2014-05-27, Unicode 6.3
    Python 3.4, 2014-03-16, Unicode 6.3
    Ruby 2.1.4, 2014-10-27, Unicode 7.0
    Java 8, 2014-03-18, Unicode 6.2
    JavaScript 1.8.5, 2010-07-27, Unicode 3.0
    PHP 5.6, 2014-08-28, Unicode 6.3
  5. Languages with built-in regex engines
    Perl 5.20, 2014-05-27, Unicode 6.3
    Python 3.4, 2014-03-16, Unicode 6.3
    Ruby 2.1.4, 2014-10-27, Unicode 7.0
    Java 8, 2014-03-18, Unicode 6.2
    JavaScript 1.8.5, 2010-07-27, Unicode 3.0
    PHP 5.6, 2014-08-28, Unicode 6.3
  6. Libraries
    PCRE 8.36, 2014-09-26, Unicode 7.0
    .NET 4.5, 2012-08-15, Unicode 5.0/6.0
    Oniguruma 5.9.5, 2013-10-21, Unicode ??
    Onigmo 5.15, 2014-07-18, Unicode 7.0
    ICU4C 54, 2014-10-01, Unicode 7.0
  7. \X grapheme cluster support
    Extended grapheme clusters: Perl, PCRE, ICU4C
    Legacy grapheme clusters: Onigmo
    Unsupported: Python, Java, JavaScript, .NET, Oniguruma
  8. \X alternatives
    Extended grapheme cluster:
    (?:(?:\u000D\u000A)|(?:[\u0E40\u0E41\u0E42\u0E43\u0E44\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4\uAAB5\uAAB6\uAAB9\uAABB\uAABC]*(?:[\u1100-\u115F\uA960-\uA97C]+|([\u1100-\u115F\uA960-\uA97C]*((?:[[\u1160-\u11A2\uD7B0-\uD7C6][\uAC00\uAC1C\uAC38]][\u1160-\u11A2\uD7B0-\uD7C6]*|[\uAC01\uAC02\uAC03\uAC04])[\u11A8-\u11F9\uD7CB-\uD7FB]*))|[\u11A8-\u11F9\uD7CB-\uD7FB]+|[^[\p{Zl}\p{Zp}\p{Cc}\p{Cf}&&[^\u000D\u000A\u200C\u200D]]\u000D\u000A])[[\p{Mn}\p{Me}\u200C\u200D\u0488\u0489\u20DD\u20DE\u20DF\u20E0\u20E2\u20E3\u20E4\uA670\uA671\uA672\uFF9E\uFF9F][\p{Mc}\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3]]*)|(?s:.))
  9. Property matchers: General Category
    L: Letter
    M: Mark
    N: Number
    P: Punctuation
    S: Symbol
    Z: Separator
    C: Other
  10. Property matchers: Script
    Arab: Arabic
    Beng: Bengali
    Deva: Devanagari
    Egyp: Egyptian hieroglyphs
    Ethi: Ethiopic
    Grek: Greek
    Hang: Hangul
  11. \p property support
    ICU4C: full support
    Perl: full support + Perl extensions
    Java: GC, Script, Binary, Block + Java ext.
    Onigmo: GC, Script, Binary, Block, Age
    PCRE: GC, Script
    Oniguruma: GC, Script
    .NET: GC, Block
    Unsupported: Python, JavaScript
  12. Unicodified escape sequences
    \s: \p{White_Space}
    \d: \p{gc=Decimal_Number}
    \w: [\p{alpha}\p{gc=Mark}\p{digit}\p{gc=Connector_Punctuation}\p{Join_Control}]
    \b: \w-based boundaries
  13. Unicodified escape sequences
    Unicode default (with ASCII option): ICU4C, Perl, Python, Java, .NET
    Unicode default (for Unicode encodings only): Oniguruma, Onigmo
    ASCII default (with Unicode option): PCRE
    Partial support: JavaScript (\s only!)
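Slide 13 lists Python among the engines whose shorthand escapes are Unicode-aware by default, with an ASCII option. The following is a minimal sketch of that behavior using Python 3's standard `re` module and its `re.ASCII` flag. The sample string and variable names are illustrative and do not come from the slides.

```python
import re

# Sample text (illustrative): Latin letters with diacritics plus
# two ARABIC-INDIC digits (U+0664, U+0662).
text = "na\u00EFve \u0664\u0662 caf\u00E9"

# For str patterns, Python 3 treats \w and \d as Unicode-aware by default.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d+", text)

# The re.ASCII flag restricts the same escapes to the ASCII range.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)
```

With the ASCII option, the accented letters split the words and the non-ASCII digits no longer match `\d`, which is the practical difference between the two defaults on the slide.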
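Slide 11 lists Python's built-in engine as having no `\p` property support. One possible workaround, sketched here under that assumption, is to test General Category values (slide 9) through the standard `unicodedata` module instead of through the regex. The helper names `supports_p_escape` and `find_runs` are hypothetical and are not part of any library.

```python
import re
import unicodedata


def supports_p_escape():
    """Report whether the stdlib re module accepts \\p{L} (hypothetical helper)."""
    try:
        re.compile(r"\p{L}")
    except re.error:
        return False
    return True


def find_runs(text, prefix):
    """Return maximal runs of characters whose General Category starts with
    `prefix`, e.g. "L" for Letter or "N" for Number (hypothetical helper)."""
    runs, current = [], []
    for ch in text:
        if unicodedata.category(ch).startswith(prefix):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


letters = find_runs("abc 123 \u0394\u03B5", "L")
numbers = find_runs("abc 123 \u0394\u03B5", "N")
```

This covers General Category only. Script, Block, and binary properties are not exposed by `unicodedata` in the same way, so they would need another data source.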
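Slide 7 lists Python among the engines without `\X`. As a rough illustration of what grapheme cluster matching has to do, the sketch below groups each base character with the combining marks that follow it and keeps CR LF together. It is a deliberate simplification, not the extended grapheme cluster definition: it ignores Hangul syllable sequences, prepending characters, and the other cases the slide 8 expression handles. The function name `approx_clusters` is hypothetical.

```python
import unicodedata

LINE_BREAKS = ("\r", "\n", "\r\n")


def approx_clusters(text):
    """Split text into approximate grapheme clusters (simplified sketch)."""
    clusters = []
    for ch in text:
        if clusters and clusters[-1] == "\r" and ch == "\n":
            # Keep CR LF together as a single cluster.
            clusters[-1] += ch
        elif (clusters
              and unicodedata.category(ch).startswith("M")
              and clusters[-1] not in LINE_BREAKS):
            # Attach a combining mark to the preceding cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "e" + COMBINING ACUTE ACCENT is two code points but one cluster here.
sample = approx_clusters("e\u0301a\r\nb")
```

Counting `len(sample)` instead of `len("e\u0301a\r\nb")` gives a count closer to what a user would perceive as characters.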