Slide 1

Slide 1 text

Unicode Regular Expression Engines Internationalization & Unicode Conference November 4, 2014 #IUC38

Slide 2

Slide 2 text

Unicode Regular Expression Engines Nick Patch @nickpatch Shutterstock

Slide 3

Slide 3 text

Shutterstock Is Multilingual Český Magyar Türkçe Dansk Nederlands Русский Deutsch Norsk ไทย English Polski 한국어 Español Português 中文 Français Suomi 日本語 Italiano Svenska

Slide 4

Slide 4 text

Shutterstock Is Multilingual C♯ PHP Java Python JavaScript Ruby Perl SQL … and Regular Expressions!

Slide 5

Slide 5 text

This talk is about …
1. Real-world Unicode regular expressions
2. Production regular expression engines
3. Modern programming language support

Slide 6

Slide 6 text

This talk is not about …
1. UTS #18 specification
2. Historic regex engines
3. Future regex development
4. Being a reference guide

Slide 7

Slide 7 text

Languages
Language    Version  Released    Unicode
Perl        5.20     2014-05-27  U6.3
Python      3.4      2014-03-16  U6.3
Ruby        2.1.4    2014-10-27  U7.0
Java        8        2014-03-18  U6.2
JavaScript  1.8.5    2010-07-27  U3.0
PHP         5.6      2014-08-28  U6.3

Slide 8

Slide 8 text

Languages with built-in regex engines
Language    Version  Released    Unicode
Perl        5.20     2014-05-27  U6.3
Python      3.4      2014-03-16  U6.3
Ruby        2.1.4    2014-10-27  U7.0
Java        8        2014-03-18  U6.2
JavaScript  1.8.5    2010-07-27  U3.0
PHP         5.6      2014-08-28  U6.3

Slide 9

Slide 9 text

Libraries
Library    Version  Released    Unicode
PCRE       8.36     2014-09-26  U7.0
.NET       4.5      2012-08-15  U5.0/6.0
Oniguruma  5.9.5    2013-10-21  U??
Onigmo     5.15     2014-07-18  U7.0
ICU4C      54       2014-10-01  U7.0

Slide 10

Slide 10 text

PCRE Perl Compatible Regular Expressions PHP R Erlang Elixir

Slide 11

Slide 11 text

.NET Framework Visual Basic C♯ F♯ PowerShell

Slide 12

Slide 12 text

Oniguruma PHP 5.0+ multibyte strings Ruby 1.9

Slide 13

Slide 13 text

Onigmo fork of Oniguruma Ruby 2.0+

Slide 14

Slide 14 text

Code point matcher . match a code point except newline (by default)
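
The behaviour on this slide can be checked with a minimal sketch in Python's standard `re` module (one of the engines surveyed later in the talk). The sample string is an illustrative assumption, not taken from the slides.

```python
import re

text = "n\u0308"  # LATIN SMALL LETTER N + COMBINING DIAERESIS

# "." consumes one code point at a time, so the base letter and the
# combining mark come back as two separate matches.
code_points = re.findall(r".", text)

# By default "." refuses to match a newline; re.DOTALL lifts that restriction.
newline_default = re.fullmatch(r".", "\n")
newline_dotall = re.fullmatch(r".", "\n", flags=re.DOTALL)

print(len(code_points), newline_default is None, newline_dotall is not None)
```

Because `.` works on code points, a user-perceived character such as n̈ is split in two, which is the motivation for the grapheme cluster matcher on the next slide.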

Slide 15

Slide 15 text

Grapheme cluster matcher \X Spın̈al Tap n\N{COMBINING DIAERESIS} 각 ก ந िष CRLF (\r\n)

Slide 16

Slide 16 text

\X grapheme cluster support
Extended grapheme clusters: Perl, PCRE, ICU4C
Legacy grapheme clusters: Onigmo
Unsupported: Python, Java, JavaScript, .NET, Oniguruma

Slide 17

Slide 17 text

\X alternatives Legacy grapheme cluster (?>\PM\pM*)
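
Python's standard `re` module has no `\p` support, so the pattern above cannot be used there directly. The following is a rough sketch of the same idea (a non-mark followed by any number of marks) built on `unicodedata`; the function name `legacy_grapheme_clusters` is hypothetical. One difference from the regex, stated as an assumption of this sketch: a mark with no preceding base is kept as its own cluster rather than skipped.

```python
import unicodedata

def legacy_grapheme_clusters(text):
    # Approximates (?>\PM\pM*): a base code point plus trailing marks.
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch   # attach the mark to the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# "Spın̈al Tap" written with an explicit combining diaeresis
clusters = legacy_grapheme_clusters("Sp\u0131n\u0308al Tap")
print(len(clusters), clusters[3] == "n\u0308")
```

Like the legacy pattern itself, this does not treat CRLF or Hangul syllable sequences as single clusters; that is what the extended definition adds.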

Slide 18

Slide 18 text

\X alternatives Extended grapheme cluster (?:(?:\u000D\u000A)|(?:[\u0E40\u0E41\u0E42 \u0E43\u0E44\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4 \uAAB5\uAAB6\uAAB9\uAABB\uAABC]*(?:[\u1100- \u115F\uA960-\uA97C]+|([\u1100-\u115F\uA960- \uA97C]*((?:[[\u1160-\u11A2\uD7B0-\uD7C6] [\uAC00\uAC1C\uAC38]][\u1160-\u11A2\uD7B0- \uD7C6]*|[\uAC01\uAC02\uAC03\uAC04])[\u11A8- \u11F9\uD7CB-\uD7FB]*))|[\u11A8-\u11F9\uD7CB- \uD7FB]+|[^[\p{Zl}\p{Zp}\p{Cc}\p{Cf}&& [^\u000D\u000A\u200C\u200D]]\u000D\u000A]) [[\p{Mn}\p{Me}\u200C\u200D\u0488\u0489\u20DD \u20DE\u20DF\u20E0\u20E2\u20E3\u20E4\uA670 \uA671\uA672\uFF9E\uFF9F][\p{Mc}\u0E30\u0E32 \u0E33\u0E45\u0EB0\u0EB2\u0EB3]]*)|(?s:.))

Slide 19

Slide 19 text

Property matchers \p{…}

Slide 20

Slide 20 text

Property matchers General Category \p{General_Category=Letter}

Slide 21

Slide 21 text

Property matchers General Category \p{gc=L} abbreviated

Slide 22

Slide 22 text

Property matchers General Category \p{Letter} implicit category

Slide 23

Slide 23 text

Property matchers General Category \p{L} implicit category + abbreviated

Slide 24

Slide 24 text

Property matchers General Category \pL optional braces when single character

Slide 25

Slide 25 text

Property matchers General Category \PL negation

Slide 26

Slide 26 text

Property matchers General Category L Letter M Mark N Number P Punctuation S Symbol Z Separator C Other

Slide 27

Slide 27 text

Property matchers General Category S Symbol Sm Math_Symbol Sc Currency_Symbol Sk Modifier_Symbol So Other_Symbol
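
In an engine without `\p{…}`, the General Category values above can still be queried per code point. This is a small sketch using Python's `unicodedata.category`; the helper names `has_general_category` and `find_all` and the sample string are illustrative assumptions.

```python
import unicodedata

def has_general_category(ch, gc):
    # gc is a one-letter class ("L", "S") or a two-letter value ("Sc").
    # A one-letter class matches every subcategory that starts with it.
    return unicodedata.category(ch).startswith(gc)

def find_all(text, gc):
    return [ch for ch in text if has_general_category(ch, gc)]

sample = "a + b = \u20ac5 ^ \u00a9"    # "a + b = €5 ^ ©"

symbols = find_all(sample, "S")    # like \p{S}
currency = find_all(sample, "Sc")  # like \p{Sc}
letters = find_all(sample, "L")    # like \p{L}
non_letters = [ch for ch in sample if not has_general_category(ch, "L")]  # like \PL
```

Here `symbols` picks up characters from all four subcategories (Sm, Sc, Sk, So), while `currency` keeps only the euro sign.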

Slide 28

Slide 28 text

Property matchers Script \p{Script=Latin}

Slide 29

Slide 29 text

Property matchers Script \p{sc=Latin} abbreviated

Slide 30

Slide 30 text

Property matchers Script \p{Latin} implicit script

Slide 31

Slide 31 text

Property matchers Script [\p{Hiragana} \p{Katakana} \p{Han} \p{Latin} \p{Common}]

Slide 32

Slide 32 text

Property matchers Script [\p{Hira} \p{Kana} \p{Hani} \p{Latn} \p{Zyyy}]

Slide 33

Slide 33 text

Property matchers Script Arab Arabic Beng Bengali Deva Devanagari Egyp Egyptian hieroglyphs Ethi Ethiopic Grek Greek Hang Hangul

Slide 34

Slide 34 text

Property matchers Script s/ е ( \p{Cyrl} ) и $/я$1/x

Slide 35

Slide 35 text

Property matchers Others \p{Numeric_Value=10} \p{East_Asian_Width=Fullwidth} \p{Script=Cyrillic} ✓ \p{Script_Extensions=Cyrillic} ✗ \p{Block=Cyrillic} ✗ \p{Block=Cyrillic_Extended_A}

Slide 36

Slide 36 text

Property matchers Others \p{nv=10} \p{ea=F} \p{Cyrl} ✓ \p{scx=Cyrl} ✗ \p{blk=Cyrillic} ✗ \p{blk=Cyrillic_Ext_A}
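
Two of the properties on this slide, Numeric_Value and East_Asian_Width, are exposed by Python's `unicodedata` module even though its `re` engine cannot match on them. A minimal sketch follows; the helper names and sample strings are assumptions made for illustration.

```python
import unicodedata

def has_numeric_value(ch, value):
    # Passing a default avoids the ValueError raised for non-numeric characters.
    return unicodedata.numeric(ch, None) == value

def is_fullwidth(ch):
    # "F" is the East_Asian_Width value Fullwidth.
    return unicodedata.east_asian_width(ch) == "F"

# Roughly \p{nv=10}: CIRCLED NUMBER TEN and ROMAN NUMERAL TEN qualify,
# the ASCII digits "1" and "0" do not.
tens = [ch for ch in "10 \u2469 \u2169" if has_numeric_value(ch, 10)]

# Roughly \p{ea=F}: FULLWIDTH LATIN CAPITAL LETTER A and FULLWIDTH DIGIT ONE.
wide = [ch for ch in "A\uff21 1\uff11" if is_fullwidth(ch)]
```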

Slide 37

Slide 37 text

Property matchers Binary \p{White_Space=Yes} \p{Hex_Digit=Yes} \p{Variation_Selector=Yes} \p{Deprecated=Yes}

Slide 38

Slide 38 text

Property matchers Binary \p{White_Space} \p{Hex_Digit} \p{Variation_Selector} \p{Deprecated}

Slide 39

Slide 39 text

Property matchers Binary \p{WSpace} \p{Hex} \p{VS} \p{Dep}

Slide 40

Slide 40 text

Property matchers Binary \p{White_Space=No} \p{Hex_Digit=No} \p{Variation_Selector=No} \p{Deprecated=No}

Slide 41

Slide 41 text

Property matchers Binary \P{White_Space} \P{Hex_Digit} \P{Variation_Selector} \P{Deprecated}

Slide 42

Slide 42 text

Property matchers Binary \P{WSpace} \P{Hex} \P{VS} \P{Dep}

Slide 43

Slide 43 text

\p property support
ICU4C      full support
Perl       full support + Perl extensions
Java       GC, Script, Binary, Block + Java ext.
Onigmo     GC, Script, Binary, Block, Age
PCRE       GC, Script
Oniguruma  GC, Script
.NET       GC, Block
Unsupported: Python, JavaScript

Slide 44

Slide 44 text

Unicodified escape sequences
\s  \p{White_Space}
\d  \p{gc=Decimal_Number}
\w  [ \p{alpha} \p{gc=Mark} \p{digit} \p{gc=Connector_Punctuation} \p{Join_Control} ]
\b  \w-based boundaries

Slide 45

Slide 45 text

Unicodified escape sequences
Unicode default (with ASCII option): ICU4C, Perl, Python, Java, .NET
Unicode default (for Unicode encodings only): Oniguruma, Onigmo
ASCII default (with Unicode option): PCRE
Partial support: JavaScript (\s only!)
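
As an illustration of the "Unicode default (with ASCII option)" group, here is a short sketch for Python 3, where `str` patterns are Unicode-aware unless `re.ASCII` is passed. The sample characters are assumptions chosen for the example.

```python
import re

word = "Русский"
digit = "\u0663"  # ARABIC-INDIC DIGIT THREE (General Category Nd)
space = "\u2003"  # EM SPACE

def matches(pattern, text, flags=0):
    return re.fullmatch(pattern, text, flags=flags) is not None

# Default behaviour: \w, \d and \s follow Unicode semantics.
unicode_results = (
    matches(r"\w+", word),
    matches(r"\d", digit),
    matches(r"\s", space),
)

# With re.ASCII the same escapes fall back to their ASCII-only meaning.
ascii_results = (
    matches(r"\w+", word, re.ASCII),
    matches(r"\d", digit, re.ASCII),
    matches(r"\s", space, re.ASCII),
)

print(unicode_results, ascii_results)
```

Python's definition of `\w` is narrower than the set given on the previous slide, so results for marks and join controls may differ from engines that follow that definition.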

Slide 46

Slide 46 text

Linebreak matcher \R LF (\n) CR (\r) FF (\f) CRLF (\r\n) NEL VT LS PS

Slide 47

Slide 47 text

\R linebreak support Perl Java PCRE Onigmo Unsupported Python, JavaScript, .NET, Oniguruma, ICU4C
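
For an engine in the unsupported group, `\R` can be approximated with an explicit alternation over the sequences listed on the previous slide. A minimal sketch in Python follows; the names `LINEBREAK` and `split_lines` are hypothetical. Perl-style `\R` is atomic, which this plain alternation is not, so the two can differ when the pattern that follows forces backtracking into a CRLF pair.

```python
import re

# CRLF is listed first so that "\r\n" is consumed as one linebreak.
# The class covers LF, VT, FF, CR, NEL, LS and PS.
LINEBREAK = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_lines(text):
    return LINEBREAK.split(text)

lines = split_lines("one\r\ntwo\nthree\u2028four\x85five")
print(lines)
```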

Slide 48

Slide 48 text

Nick Patch @nickpatch Shutterstock