Regular expression engines in most modern programming languages and libraries have been rapidly adding Unicode features in recent years. At Shutterstock, as at most companies, we use a variety of programming languages, so it's important to know each language's strengths and weaknesses and how the languages differ.
This presentation reviews Unicode regex features and compares how well many popular engines supported them as of November 2014. Features discussed include escape sequences, character properties, character classes, grapheme clusters, boundary anchors, and line breaks. The comparison covers languages with built-in regex engines (Perl, Python, Java, and JavaScript), the PCRE, .NET, Onigmo, and ICU libraries, and languages that rely on those libraries, such as Ruby and PHP.
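As a small illustration of the kind of behavior being compared, here is a minimal sketch using Python's standard `re` module. It touches three of the features listed above: code point escape sequences, Unicode-aware character classes, and the gap between code points and grapheme clusters. The helper names (`words`, `dot_matches_whole`) and the sample strings are hypothetical, chosen for illustration and not taken from the presentation; other engines may behave differently.

```python
import re
import unicodedata

# Illustrative helpers only; names and sample strings are assumptions.

def words(text, ascii_only=False):
    """Return runs of \\w characters; re.ASCII restricts \\w to [a-zA-Z0-9_]."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\w+", text, flags)

def dot_matches_whole(text):
    """True if a single '.' (one code point) matches the entire string."""
    return re.fullmatch(r".", text, re.DOTALL) is not None

# Escape sequences: \u00e9 in the pattern denotes U+00E9 (e with acute accent).
precomposed = "\u00e9"
escape_ok = re.fullmatch(r"\u00e9", precomposed) is not None

# Character classes: for str patterns, \w is Unicode-aware by default.
unicode_words = words("na\u00efve caf\u00e9")
ascii_words = words("na\u00efve caf\u00e9", ascii_only=True)

# Grapheme clusters: "e" + U+0301 (combining acute) displays as one character
# but is two code points, so a single '.' does not cover it.
decomposed = "e\u0301"
dot_on_decomposed = dot_matches_whole(decomposed)
dot_on_precomposed = dot_matches_whole(precomposed)

# Normalizing to NFC composes the pair into the single code point U+00E9.
normalized = unicodedata.normalize("NFC", decomposed)
```

In this sketch, normalizing input before matching is one way to make the two spellings of the accented letter compare equal; an engine with grapheme cluster support offers another route.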
Presented at:
◦ 2014-11-04: Internationalization & Unicode Conference 38 (IUC38), Santa Clara, CA