Dan Kogai
March 19, 2023

my$talk=qr{((?:ir)?reg(?:ular )?exp(?:ressions?)?)}ig;

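Regular expressions: one of the reasons Perl is so loved and (unfairly) so hated. Today, however, Perl is far from the only language that carries regular expressions as a sublanguage. In this talk I will (once again) type, er, talk about this most popular language-within-a-language for as long as time allows.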

Transcript

  1. Table of Contents
    • regexp? what is it?
    • $supported_by ~~ @most_major_languages;
    • but how (much)?\?
    • Unicode support?
    • assertions?
    • modifiers?
    • Irregular expressions
    • qr{([A-Za-z_]\w*\s*(\(((?:(?>[^()]+)|(?2))*)\)))}
    • use CPAN;
    • Regexp::Assemble;
    • Regexp::Common;
    • (ir)?regular questions (?:from|by) the audience
  2. regexp? what is it?
    ChatGPT says
    • A regular expression, commonly abbreviated as "regex," is a sequence of characters that define a search pattern. It is a powerful tool used in computer programming to manipulate and search for strings of text.
    • Regular expressions are used in various programming languages, text editors, and other applications that involve searching or manipulating text. With regular expressions, you can define a pattern that matches a specific sequence of characters within a string, such as finding all instances of a certain word, or validating the format of an email address.
    • The syntax of regular expressions varies slightly among different programming languages, but the basic concepts are the same. A regular expression consists of a combination of normal characters and special characters, such as metacharacters and quantifiers, which have specific meanings and are used to define the search pattern.
    • Regular expressions can be quite powerful, but they can also be complex to write and understand. However, once you have a good grasp of regular expressions, they can be an extremely useful tool in your programming toolbox.
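  3. regexp? what is it?
    ChatGPT says (the same question, answered in Japanese)
    • A regular expression is a sequence of characters that defines a search pattern for strings. In computer programming it is a powerful tool used to manipulate and search text.
    • A regular expression can define a pattern that matches a specific sequence of characters within the string being searched. For example, it can find every instance of a particular word or validate the format of an email address.
    • The syntax of regular expressions differs slightly between programming languages, but the basic concepts are the same. A regular expression is built from ordinary characters combined with special characters such as metacharacters and quantifiers, each of which has a specific meaning.
    • Regular expressions are a very powerful tool, but they can be hard to write and to understand. Once you understand them well, however, they become a very useful tool in your programming toolbox.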
  4. regexp? what is it?
    In our language
    • 0 or more of… (quantifier)
    • '' # empty string
    • 'string' # any string
    • '(?:string|文字列)' # any alternation of strings
    • That's it!
    • ? # {0,1}
    • + # {1,}
    • [0-9] # (?:0|1|2|3|4|5|6|7|8|9)
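The claim that `?`, `+` and character classes are only shorthand can be checked mechanically. The following is a minimal sketch in Python rather than Perl; the names `PAIRS`, `SAMPLES` and `agree` are made up for this illustration, and the sample list is far from exhaustive.

```python
import re

# Each pair: (sugared pattern, the same pattern spelled with the primitives
# from the slide: empty string, literal strings, alternation, "0 or more").
PAIRS = [
    (r"ab?c", r"a(?:|b)c"),                   # ?  means "empty string or b"
    (r"ab+c", r"ab(?:b)*c"),                  # +  means "one b, then 0 or more b"
    (r"[0-9]", r"(?:0|1|2|3|4|5|6|7|8|9)"),   # a class is an alternation
]

SAMPLES = ["ac", "abc", "abbc", "abbbc", "0", "7", "x", ""]


def agree(sugared, plain, samples=SAMPLES):
    """Return True if both patterns accept exactly the same samples."""
    return all(
        bool(re.fullmatch(sugared, s)) == bool(re.fullmatch(plain, s))
        for s in samples
    )


for sugared, plain in PAIRS:
    print(sugared, "==", plain, "->", agree(sugared, plain))
```

Agreement on a handful of samples is evidence, not a proof of equivalence.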
  5. regexp? what is it?
    ChatGPT answers -- correctly!
    • This is a regular expression pattern that matches valid UTF-8 encoded characters. UTF-8 is a variable-length character encoding standard that can represent all possible Unicode code points using one to four 8-bit bytes.
  6. regexp? what is it?
    Asked ChatGPT: Give me a regexp that matches a valid UTF-8 string
      /^(?:
        [\x09\x0A\x0D\x20-\x7E]            # ASCII
       |[\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
       |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
       |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
       |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
       |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
      )*$/x
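To try the pattern outside Perl, it can be transcribed into a Python bytes regex. This is a sketch, not part of the talk: `UTF8` and `is_valid_utf8` are names chosen here, and `fullmatch` stands in for the `^…$` anchors.

```python
import re

# The pattern from the slide as a bytes regex; re.VERBOSE plays the role
# of Perl's /x, so the layout and the comments can stay.
UTF8 = re.compile(rb"""
    (?:
        [\x09\x0A\x0D\x20-\x7E]            # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*
""", re.VERBOSE)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string matches the slide's pattern."""
    return UTF8.fullmatch(data) is not None


print(is_valid_utf8("Perl 正規表現 🐪".encode("utf-8")))  # well-formed input
print(is_valid_utf8(b"\xc0\x80"))                        # overlong encoding
print(is_valid_utf8(b"\xed\xa0\x80"))                    # encoded surrogate
```

Note that the ASCII branch only admits tab, LF, CR and printable characters, so input containing other control bytes such as NUL is rejected by this pattern even though it is well-formed UTF-8.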
  7. regexp? what is it?
    Asked ChatGPT: Give me a regexp that matches C99 floating-point literals; you forgot hexadecimal notations
      /^[-+]?((\d+\.\d*|\.\d+|\d+)([eE][-+]?\d+)?|[0-9a-fA-F]+\.[0-9a-fA-F]*([pP][-+]?\d+)?)[fFlL]?$/
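An answer like this is easy to probe before trusting it. The sketch below transcribes the quoted pattern into Python (the names `QUOTED` and `accepts` are invented here) and feeds it a few literals. One observation, stated with care: C99 hexadecimal floating constants begin with a `0x` or `0X` prefix, and the pattern as quoted has no branch for that prefix.

```python
import re

# The pattern exactly as quoted on the slide, with line wraps removed.
# re.ASCII keeps \d restricted to 0-9, as the slide presumably intends.
QUOTED = re.compile(
    r"^[-+]?((\d+\.\d*|\.\d+|\d+)([eE][-+]?\d+)?|"
    r"[0-9a-fA-F]+\.[0-9a-fA-F]*([pP][-+]?\d+)?)[fFlL]?$",
    re.ASCII,
)


def accepts(literal: str) -> bool:
    """Return True if the quoted pattern matches the literal."""
    return QUOTED.match(literal) is not None


for literal in ["1.5e10f", ".5", "0x1.8p3", "1.8p3"]:
    print(literal, accepts(literal))
```

In this check the decimal forms are accepted, `0x1.8p3` is rejected, and the prefix-less `1.8p3` is accepted, so the hexadecimal half deserves a second look before use.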
  8. Irregular expressions
    /^(11+?)\1+$/ # is this a regular expression?
      $ seq 2 100 | perl -nlE 'say $_ if (1x$_) !~ /^(11+?)\1+$/'
  9. Irregular expressions
    /^(11+?)\1+$/ # is this a regular expression?
      $ seq 2 100 | perl -nlE 'say $_ if (1x$_) !~ /^(11+?)\1+$/'
      2
      3
      5
      7
      …
      79
      83
      89
      97
  10. Irregular expressions
    /^(11+?)\1+$/ # is NOT EXACTLY a regular expression!
    • The problem is \1
    • It is the result of the preceding capture
    • In other words, this expression is self-modifying.
    • So it is not mathematically a regular expression
    • Regexp ≠ Regular Expression
    • Regexp ⊇ Regular Expression
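The same backreference trick runs unchanged in any backtracking engine that supports `\1`. As a sketch, here is the one-liner from the slides restated in Python; `COMPOSITE`, `is_prime` and `primes` are names chosen for this illustration.

```python
import re

# (11+?) captures a run of k >= 2 ones; \1+ demands that the rest of the
# string is one or more further copies of that run. A unary number n
# therefore matches exactly when n = k * m with k >= 2 and m >= 2.
COMPOSITE = re.compile(r"^(11+?)\1+$")


def is_prime(n: int) -> bool:
    """n >= 2 is prime when '1' * n does NOT match the composite pattern."""
    return n >= 2 and COMPOSITE.match("1" * n) is None


primes = [n for n in range(2, 101) if is_prime(n)]
print(primes)
```

The backreference is what lets the pattern compare two parts of the input for equality, which no finite automaton can do for runs of unbounded length.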
  11. Irregular expressions
    ChatGPT agrees
    Unfortunately, it's not possible to use a regular expression to sieve prime numbers. Regular expressions are a pattern matching tool, and they're limited in their ability to perform mathematical operations like checking for prime numbers.
    Prime number sieves typically involve iterating through a range of numbers and checking each one for divisibility by smaller numbers. This requires programmatic logic and looping constructs, which are beyond the scope of what regular expressions can do.
    That being said, regular expressions can be useful in certain cases for detecting patterns in prime numbers or other mathematical sequences. For example, you could use a regular expression to match the pattern of the prime numbers in a particular range, but you would still need additional logic to actually sieve the primes.
  12. Irregular expressions
    qr{([A-Za-z_]\w*\s*(\(((?:(?>[^()]+)|(?2))*)\)))}
    • Q: Can a regular expression match nested parentheses?
    • A: No. But some regex engines allow you to do that.
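  13. Irregular expressions
    qr{([A-Za-z_]\w*\s*(\(((?:(?>[^()]+)|(?2))*)\)))}
      #!/usr/bin/env perl
      use strict;
      use warnings;
      use feature ':all';
      # (?2) recurses into capture group 2, the parenthesized argument list
      my $re = qr{([A-Za-z_]\w*\s*(\(((?:(?>[^()]+)|(?2))*)\)))};
      my $str = '$result = a(b(c),d(e,f(g,g,g)))';
      $str =~ $re;
      say $1;  # the whole call
      say $2;  # the argument list, with its outer parentheses
      say $3;  # the argument list, without them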
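Python's standard `re` module has no `(?2)`-style recursion, which is a handy way to see what the recursion buys. The sketch below swaps in a different technique: a regex finds the `name(` head, and an explicit depth counter does the balancing that the recursive subpattern does in Perl. `HEAD` and `first_call` are hypothetical names, and the function mirrors the three captures only for well-formed input.

```python
import re

# An identifier, optional whitespace, and the opening parenthesis.
HEAD = re.compile(r"[A-Za-z_]\w*\s*\(")


def first_call(text):
    """Return (whole call, parenthesized part, inner part) or None."""
    for m in HEAD.finditer(text):
        open_at = m.end() - 1          # index of the opening "("
        depth = 0
        for i in range(open_at, len(text)):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:         # the matching ")" has been reached
                    return (
                        text[m.start():i + 1],
                        text[open_at:i + 1],
                        text[open_at + 1:i],
                    )
    return None


print(first_call("$result = a(b(c),d(e,f(g,g,g)))"))
```

The counter is the piece of state a true regular expression cannot keep, which is why engines need an extension such as recursion to match nested parentheses.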
  14. Unicode Support
    What is a character?
    • String is /.*/ but . =
    • [\x00-\xff] # legacy world of bytes
    • [\u0000-\uFFFF] # prematurely modern
    • [\u{0000}-\u{10FFFF}] # correctly modern
  15. Unicode Support
    What is a character?
    • String is /.*/ but . =
    • [\x00-\xff] # Perl < 5.7
    • [\u0000-\uFFFF] # Java(Script)?, Python2, …
    • [\u{0000}-\u{10FFFF}] # Perl, Ruby, Python3, …
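The three definitions give three different lengths for the same text. A small sketch in Python 3, whose strings are sequences of code points; the function name `sizes` is made up for this example.

```python
def sizes(text: str) -> dict:
    """Count one string under the three definitions of "character"."""
    return {
        "bytes (UTF-8)": len(text.encode("utf-8")),               # [\x00-\xff]
        "UTF-16 code units": len(text.encode("utf-16-le")) // 2,  # [\u0000-\uFFFF]
        "code points": len(text),                                 # [\u{0000}-\u{10FFFF}]
    }


print(sizes("\U0001F42A"))  # the camel, U+1F42A, outside the BMP
print(sizes("\u2699"))      # the gear, U+2699, inside the BMP
```

Only characters above U+FFFF separate the second definition from the third, which is why the difference tends to surface with emoji.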
  16. Unicode Support?
    What will the following say?
      $ perl -Mutf8 -MData::Dumper -E \
          'my@m=("🦏🐪🐘🐍💎⚙" =~ /(.)/g); say Dumper([@m])'
  17. Unicode Support?
    What will the following say?
      $ perl -Mutf8 -MData::Dumper -E \
          'my@m=("🦏🐪🐘🐍💎⚙" =~ /(.)/g); say Dumper([@m])'
      $VAR1 = [
        "\x{1f98f}",
        "\x{1f42a}",
        "\x{1f418}",
        "\x{1f40d}",
        "\x{1f48e}",
        "\x{2699}"
      ];
  18. Unicode Support?
    What will the following say?
      $ node -e \
          'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/g))'
  19. Unicode Support?
    What will the following say?
      $ node -e \
          'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/g))'
      [ '', '', '', '', '', '', '', '', '', '', '⚙' ]
    (the ten entries before '⚙' are lone surrogate halves, two per emoji, which do not display)
  20. Unicode Support?
    What will the following say?
      $ node -e \
          'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/ug))'
      [ '🦏', '🐪', '🐘', '🐍', '💎', '⚙' ]
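The eleven-versus-six split can be reproduced without Node by looking at the UTF-16 code units directly. A sketch in Python; `ANIMALS`, `code_points` and `utf16_units` are names invented here, and the string is spelled with escapes so that the code points are explicit.

```python
import re
import struct

# The six characters from the slides: five emoji above U+FFFF plus U+2699.
ANIMALS = "\U0001F98F\U0001F42A\U0001F418\U0001F40D\U0001F48E\u2699"


def code_points(text):
    """What /(.)/ug sees: one match per code point."""
    return re.findall(r"(.)", text, re.DOTALL)


def utf16_units(text):
    """What /(.)/g without the u flag sees: one match per 16-bit unit."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


print(len(code_points(ANIMALS)))
print([hex(unit) for unit in utf16_units(ANIMALS)])
```

Each of the five emoji contributes a high surrogate (0xD800 to 0xDBFF) and a low surrogate (0xDC00 to 0xDFFF); on its own, neither half is a printable character.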
  21. Unicode Support?
    What will the following say?
      $ perl -Mutf8 -MData::Dumper -E \
          'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
  22. Unicode Support?
    What will the following say?
      $ perl -Mutf8 -MData::Dumper -E \
          'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
      $VAR1 = [
        "\x{1f1ef}",
        "\x{1f1f5}",
        "\x{1f1fa}",
        "\x{1f1e6}"
      ];
  23. Unicode Support?
    What will the following say?
      $ perl -Mutf8 -MData::Dumper -E \
          'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
      $VAR1 = [
        "\x{1f1ef}", # REGIONAL INDICATOR SYMBOL LETTER J
        "\x{1f1f5}", # REGIONAL INDICATOR SYMBOL LETTER P
        "\x{1f1fa}", # REGIONAL INDICATOR SYMBOL LETTER U
        "\x{1f1e6}"  # REGIONAL INDICATOR SYMBOL LETTER A
      ];
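The annotated dump can be regenerated from the Unicode character database. A sketch using Python's standard `unicodedata` module; `FLAGS` and `describe` are names chosen for this example.

```python
import re
import unicodedata

# The two flags from the slides, spelled as four regional indicator symbols.
FLAGS = "\U0001F1EF\U0001F1F5\U0001F1FA\U0001F1E6"


def describe(text):
    """Pair every code point matched by (.) with its Unicode name."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch))
        for ch in re.findall(r"(.)", text, re.DOTALL)
    ]


for code, name in describe(FLAGS):
    print(code, name)
```

A flag is therefore two code points that a renderer draws as one glyph, and a code-point-level `.` has no notion of that pairing.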
  24. Unicode Support?
    What will the following say?
      $ perl -Mutf8 -MData::Dumper -E \
          'my@m=("🇯🇵🇺🇦" =~ /(\X)/g); say Dumper([@m])'
      $VAR1 = [
        "\x{1f1ef}\x{1f1f5}",
        "\x{1f1fa}\x{1f1e6}"
      ];
  25. Unicode Support?
    What will the following say?
      $ node -e \
          'console.log("🇯🇵🇺🇦".match(/(.)/ug))'
  26. Unicode Support?
    What will the following say?
      $ node -e \
          'console.log("🇯🇵🇺🇦".match(/(.)/ug))'
      [ '🇯', '🇵', '🇺', '🇦' ]
  27. Unicode Support?
    What will the following say?
      $ node -e \
          'console.log("🇯🇵🇺🇦".match(/(\X)/ug))'
      🙅 [ '🇯🇵','🇺🇦' ]
      🙆 SyntaxError: Invalid regular expression: /(\X)/: Invalid escape
          at [eval]:1:24
          at Script.runInThisContext (node:vm:129:12)
          at Object.runInThisContext (node:vm:305:38)
          at node:internal/process/execution:75:19
          at [eval]-wrapper:6:22
          at evalScript (node:internal/process/execution:74:60)
          at node:internal/main/eval_string:27:3
  28. Unicode Support?
    FYI: A workaround for modern JS
      $ node -e 'segmenter = new Intl.Segmenter();
          segment = [...segmenter.segment("🇯🇵🇺🇦")].map(v=>v.segment);
          console.log(segment)'
      🙆 [ '🇯🇵','🇺🇦' ]
      // cf. https://developer.mozilla.org/ja/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter
      // unsupported by Firefox (as of Mar 2023)
  29. Unicode Support?
    Grapheme Cluster
    • Defined in:
      • https://unicode.org/reports/tr29/
    • \X is supported by:
      • 🐘 PHP
      • 🐪 Perl
      • 💎 Ruby
    • Not yet supported by:
      • 🦏 JavaScript (Intl.Segmenter where available)
      • 🐍 Python
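Where `\X` is missing, the flag case alone can be approximated by hand. The sketch below is a deliberately narrow stand-in written for Python's `re`: it pairs regional indicator symbols and treats every other code point as its own cluster. `FLAG_OR_CHAR` and `rough_clusters` are hypothetical names, and this is not an implementation of the TR29 rules (combining marks, ZWJ sequences, Hangul and so on are ignored).

```python
import re

# Two consecutive regional indicators (U+1F1E6..U+1F1FF) form one flag;
# anything else falls through to a single code point.
FLAG_OR_CHAR = re.compile(r"[\U0001F1E6-\U0001F1FF]{2}|.", re.DOTALL)


def rough_clusters(text):
    """Split text into flag pairs and single code points."""
    return FLAG_OR_CHAR.findall(text)


print(rough_clusters("\U0001F1EF\U0001F1F5\U0001F1FA\U0001F1E6"))
```

For real text, a segmenter that follows TR29, such as the `Intl.Segmenter` workaround on the previous slide, is the safer choice.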
  30. use CPAN
    Regexp::Common
      $ perl -MRegexp::Common -E 'say $RE{net}{IPv6}'
      (?:(?|(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4})|(?::(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:):

    (?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F] {1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:)(?:)(?:)(?:):(?: [0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:)(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:)(?:)(?:)(?:)(?:)(?:):)|(?:(?:[0-9a-fA-F]{1,4}):(?:)(?:):(?:[0-9a- fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a- fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?: [0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F] {1,4}):(?:)(?:)(?:)(?:)(?:)(?:)(?:):)|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F] {1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA- F]{1,4}):(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a- fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:)(?:)(?:):)|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA- F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a- 
fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:) (?:)(?:)(?:):)|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a- fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a- fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:):)|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:): (?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:):)|(?:(?:[0-9a-fA-F] {1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:):)))
  31. use CPAN
    Regexp::Assemble
      $ egrep '^.{5}$' /usr/share/dict/words \
        | perl -MRegexp::Assemble -nl \
          -E 'BEGIN{$ra=Regexp::Assemble->new}' \
          -E '$ra->add($_);' \
          -E 'END{say $ra->re}'
      # cf. https://www.nytimes.com/games/wordle/index.html
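The idea of folding a word list into one compact pattern can be sketched with a prefix tree. The code below is an illustration in Python and makes no claim to mirror how Regexp::Assemble works internally; `assemble` is a name invented here, and the three sample words stand in for the dictionary file.

```python
import re


def assemble(words):
    """Build one regex that matches exactly the given words, sharing prefixes."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker

    def emit(node):
        ends_here = "" in node
        branches = [
            re.escape(ch) + emit(child)
            for ch, child in sorted(node.items())
            if ch != ""
        ]
        if not branches:
            return ""
        if len(branches) == 1 and not ends_here:
            return branches[0]
        group = "(?:" + "|".join(branches) + ")"
        # If a word may also end at this node, the continuation is optional.
        return group + "?" if ends_here else group

    return emit(trie)


pattern = assemble(["crane", "crate", "slate"])
print(pattern)
```

Sharing prefixes keeps the pattern short and lets the engine reject a non-member after a few characters instead of trying every word in turn.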
  32. Wrap↑
    • (?:ir)?regular expressions
      • Regexp ≠ Regular Expression
      • Regexp ⊇ Regular Expression
    • Definition of characters
      • [\x00-\xff]
      • [\u0000-\uFFFF]
      • [\u{0000}-\u{10FFFF}]
    • (?:un)?availability of \X
    • Using perl? use CPAN!
    • Ask ChatGPT :-?