Slide 1

Slide 1 text

Regexp in Android and Java Keishin Yokomaku @ Drivemode, Inc. potatotips #22

Slide 2

Slide 2 text

@KeithYokoma
• Keishin Yokomaku at Drivemode, Inc.
• Work
    • Android apps
    • Android Training and its publication
• Like
    • Bicycle, Photography, Tumblr and Motorsport

Slide 3

Slide 3 text

Pattern.compile()

Slide 4

Slide 4 text

Android is not Java
• Android and Java each ship their own implementation of the regexp API.
• The same regexp can give different results.
• Write once, not run anywhere (ˑ ‿ ˑ)

Slide 5

Slide 5 text

What’s the matter?

Slide 6

Slide 6 text

Test env vs. runtime
• If your tests run on the JVM (e.g. with Robolectric)…
    • Some patterns pass the tests but won’t work at runtime.
    • Some patterns work at runtime but won’t pass the tests.

Slide 7

Slide 7 text

ʉ\ʊ(π)ʊ/ʉ

Slide 8

Slide 8 text

Differences in detail

Slide 9

Slide 9 text

Supported flags
• Java
    • All flags defined in Pattern are supported.
• Android
    • Only CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE and UNIX_LINES are supported.
    • If any other flag is set, a RuntimeException is thrown: UnsupportedOperationException for CANON_EQ, IllegalArgumentException for anything else.
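One way to catch an unsupported flag in JVM-side tests is to check the flags against the Android-supported set before compiling. The following is a minimal sketch under that assumption; `AndroidSafePattern`, `unsupportedFlags` and `compile` are hypothetical names, not part of any library, and the supported set is taken from the list above.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a library API): rejects flags that the slide
// lists as unsupported on Android, even when running on a desktop JVM.
class AndroidSafePattern {
    // Flags listed as supported by Android's Pattern.
    static final int ANDROID_SUPPORTED_FLAGS =
            Pattern.CASE_INSENSITIVE | Pattern.COMMENTS | Pattern.DOTALL | Pattern.LITERAL
                    | Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.UNIX_LINES;

    // Returns the bits of `flags` that fall outside the supported set (0 if none).
    static int unsupportedFlags(int flags) {
        return flags & ~ANDROID_SUPPORTED_FLAGS;
    }

    // Compiles the regex only if every flag is in the supported set.
    static Pattern compile(String regex, int flags) {
        int unsupported = unsupportedFlags(flags);
        if (unsupported != 0) {
            throw new IllegalArgumentException("Flags not supported on Android: " + unsupported);
        }
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        // A supported flag compiles as usual.
        System.out.println(compile("abc", Pattern.CASE_INSENSITIVE).matcher("ABC").matches());
        // CANON_EQ is valid on the JVM, but the helper rejects it up front.
        try {
            compile("abc", Pattern.CANON_EQ);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Routing every `Pattern.compile(regex, flags)` call through a wrapper like this makes a Robolectric or plain JUnit run fail in the same situations where the device would throw, at least as far as flags are concerned.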

Slide 10

Slide 10 text

Android Pattern

    public class Pattern {
        private Pattern(String pattern, int flags) throws PatternSyntaxException {
            if ((flags & CANON_EQ) != 0) {
                throw new UnsupportedOperationException("CANON_EQ flag not supported");
            }
            int supportedFlags = CASE_INSENSITIVE | COMMENTS | DOTALL | LITERAL
                    | MULTILINE | UNICODE_CASE | UNIX_LINES;
            if ((flags & ~supportedFlags) != 0) {
                throw new IllegalArgumentException("Unsupported flags: " + (flags & ~supportedFlags));
            }
            this.pattern = pattern;
            this.flags = flags;
            compile();
        }
    }

Slide 13

Slide 13 text

Java Pattern

    public class Pattern {
        private void compile() {
            if (has(CANON_EQ) && !has(LITERAL)) {
                normalize();
            } else {
                normalizedPattern = pattern;
            }
            patternLength = normalizedPattern.length();

            // Copy pattern to int array for convenience
            // Use double zero to terminate pattern
            temp = new int[patternLength + 2];

            hasSupplementary = false;
            int c, count = 0;
            // Convert all chars into code points
            for (int x = 0; x < patternLength; x += Character.charCount(c)) {
                c = normalizedPattern.codePointAt(x);
                if (isSupplementary(c)) {
                    hasSupplementary = true;
                }
                temp[count++] = c;
            }

            patternLength = count; // patternLength now in code points

            // ……
        }
    }

Slide 14

Slide 14 text

Character class
• Java
    • Predefined character classes such as \d match only ASCII (single-byte) characters by default.
• Android
    • They match both ASCII and non-ASCII (multi-byte) characters, e.g. full-width digits.
• Details here: http://bit.ly/1R73wkM
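To make the difference concrete, here is a small sketch of how the shorthand `\d` behaves on a desktop JVM, next to two explicit classes that say which behavior is wanted. The class name `DigitClasses` and its constants are made up for illustration. Spelling out `[0-9]` or `\p{Nd}` avoids depending on what the engine decides `\d` means.

```java
import java.util.regex.Pattern;

// Illustrative only: compares the \d shorthand with explicit digit classes.
class DigitClasses {
    // U+FF11 FULLWIDTH DIGIT ONE, a non-ASCII decimal digit.
    static final String FULLWIDTH_ONE = "\uFF11";

    // Shorthand: ASCII-only on the JVM by default.
    static final Pattern SHORTHAND = Pattern.compile("\\d");
    // Explicit ASCII digits only.
    static final Pattern ASCII_DIGIT = Pattern.compile("[0-9]");
    // Explicit Unicode decimal digits (general category Nd).
    static final Pattern UNICODE_DIGIT = Pattern.compile("\\p{Nd}");

    // True if the whole input matches the pattern.
    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("\\d      vs ASCII 1:      " + matches(SHORTHAND, "1"));
        System.out.println("\\d      vs full-width 1: " + matches(SHORTHAND, FULLWIDTH_ONE));
        System.out.println("[0-9]   vs full-width 1: " + matches(ASCII_DIGIT, FULLWIDTH_ONE));
        System.out.println("\\p{Nd}  vs full-width 1: " + matches(UNICODE_DIGIT, FULLWIDTH_ONE));
    }
}
```

On the JVM the second and third lines report `false` and the last reports `true`. According to the slide above, the `\d` case is where an Android device would disagree with a JVM test run.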

Slide 15

Slide 15 text

Regular expression engines
• Java
    • java.util.regex engine
    • Conforms to Unicode Technical Standard #18 Level 1, plus RL2.1 "Canonical Equivalents".
• Android
    • ICU (International Components for Unicode) engine
    • Conforms to Unicode Technical Standard #18 Level 1, plus Default Word Boundaries and Name Properties from Level 2.

Slide 16

Slide 16 text

Canonical Equivalents
• Canonically equivalent code point sequences are assumed to have the same appearance and meaning when printed or displayed.
• e.g. “ü” (U+00FC) and “u” followed by a combining diaeresis (U+0075 U+0308) are canonically equivalent.
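A short sketch of what this means for matching. On a desktop JVM the `CANON_EQ` flag makes the two spellings of “ü” match each other. Since Android rejects that flag, one possible workaround (an assumption of this sketch, not something the slides prescribe) is to normalize both sides to NFC with `java.text.Normalizer` before matching. `CanonicalEquivalence` and its methods are hypothetical names.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Illustrative only: two ways to treat canonically equivalent text as equal.
class CanonicalEquivalence {
    static final String COMPOSED = "\u00FC";    // U+00FC, precomposed u with diaeresis
    static final String DECOMPOSED = "u\u0308"; // U+0075 followed by U+0308 combining diaeresis

    // JVM only: relies on the CANON_EQ flag, which Android's Pattern rejects.
    static boolean matchesWithCanonEq(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ).matcher(input).matches();
    }

    // Flag-free alternative: normalize both sides to NFC, then match the text literally.
    static boolean matchesNormalized(String literal, String input) {
        String normalizedLiteral = Normalizer.normalize(literal, Normalizer.Form.NFC);
        String normalizedInput = Normalizer.normalize(input, Normalizer.Form.NFC);
        return Pattern.compile(Pattern.quote(normalizedLiteral)).matcher(normalizedInput).matches();
    }

    public static void main(String[] args) {
        System.out.println("no flag:    " + Pattern.matches(DECOMPOSED, COMPOSED));
        System.out.println("CANON_EQ:   " + matchesWithCanonEq(DECOMPOSED, COMPOSED));
        System.out.println("normalized: " + matchesNormalized(DECOMPOSED, COMPOSED));
    }
}
```

The normalization approach only covers literal text as written here. A pattern with character classes or quantifiers would need more care, because normalizing the regex source itself can change its meaning.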

Slide 17

Slide 17 text

Android is not Java ʉ\ʊ(π)ʊ/ʉ

Slide 18

Slide 18 text

Regexp in Android and Java Keishin Yokomaku @ Drivemode, Inc. potatotips #22