Slide 1

Slide 1 text

@dankogai my$talk=qr{((?:ir)?reg(?:ular )?exp(?:ressions?)?)}ig;

Slide 2

Slide 2 text

Table of Contents • regexp? what is it? • $supported_by ~~ @most_major_languages; • but how (much)?\? • Unicode support? • assertions? • modifiers? • Irregular expressions • qr{([A-Za-z_]\w*\s*(\(((?:(?>[^()]+)|(?2))*)\)))} • use CPAN; • Regexp::Assemble; • Regexp::Common; • (ir)?regular questions (?:from|by) the audience

Slide 3

Slide 3 text

regexp? what is it? ChatGPT says
• A regular expression, commonly abbreviated as "regex," is a sequence of characters that defines a search pattern. It is a powerful tool used in computer programming to manipulate and search for strings of text.
• Regular expressions are used in various programming languages, text editors, and other applications that involve searching or manipulating text. With regular expressions, you can define a pattern that matches a specific sequence of characters within a string, such as finding all instances of a certain word, or validating the format of an email address.
• The syntax of regular expressions varies slightly among different programming languages, but the basic concepts are the same. A regular expression consists of a combination of normal characters and special characters, such as metacharacters and quantifiers, which have specific meanings and are used to define the search pattern.
• Regular expressions can be quite powerful, but they can also be complex to write and understand. However, once you have a good grasp of regular expressions, they can be an extremely useful tool in your programming toolbox.

Slide 4

Slide 4 text

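regexp? what is it? ChatGPT says (Japanese answer, translated)
• A regular expression (seiki hyōgen) is a sequence of characters that defines a search pattern for strings. In computer programming, it is a powerful tool used to manipulate and search text.
• A regular expression lets you define a pattern that matches a specific sequence of characters within the target string. For example, you can find every instance of a particular word or validate the format of an email address.
• The syntax of regular expressions differs slightly between programming languages, but the basic concepts are the same. A regular expression is built from a combination of ordinary characters and special characters such as metacharacters and quantifiers, each of which has a specific meaning.
• Regular expressions are a very powerful tool, but they can be hard to write and understand. Once you understand them well, however, they become a very useful tool in your programming toolbox.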

Slide 5

Slide 5 text

regexp? what is it? In our language • 0 or more of… (quantifier) • '' # empty string • 'string' # any string • '(?:string|文字列)' # any alternation of strings • That's it! • ? # {0,1} • + # {1,} • [0-9] # (?:0|1|2|3|4|5|6|7|8|9)
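One way to convince yourself that the shorthand forms are only sugar: the sketch below checks that each shorthand and its long form accept exactly the same inputs. The sample strings are picked here for illustration and are not from the talk.

```javascript
// Each pair is [shorthand, long form]; both should accept the same strings.
const pairs = [
  [/^a?$/, /^a{0,1}$/],
  [/^a*$/, /^a{0,}$/],
  [/^a+$/, /^a{1,}$/],
  [/^[0-9]$/, /^(?:0|1|2|3|4|5|6|7|8|9)$/],
];

// Illustrative inputs only.
const samples = ["", "a", "aa", "aaa", "0", "7", "42", "b"];

const equivalent = pairs.map(([sugar, plain]) =>
  samples.every((s) => sugar.test(s) === plain.test(s))
);

console.log(equivalent); // [ true, true, true, true ]
```

Agreement on a handful of samples is of course not a proof, only a sanity check.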

Slide 6

Slide 6 text

regexp? what is it? ((?:ir)?reg(?:ular )?exp(?:ressions?)?) Visualized by: regexper.com

Slide 7

Slide 7 text

regexp? what is it?
(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})
Excerpt from: https://www.w3.org/International/questions/qa-forms-utf-8
Visualized by: regexper.com
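To try the pattern on real bytes, here is a rough JavaScript sketch. It assumes we anchor the excerpt with `^(?:…)*$` and map each byte to one character so that `\xHH` escapes line up with byte values; `isValidUtf8` is a name made up for this example.

```javascript
// Alternatives copied from the excerpt, one per UTF-8 sequence shape.
const parts = [
  String.raw`[\x00-\x7F]`,                       // ASCII
  String.raw`[\xC2-\xDF][\x80-\xBF]`,            // non-overlong 2-byte
  String.raw`\xE0[\xA0-\xBF][\x80-\xBF]`,        // 3-byte, excluding overlongs
  String.raw`[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}`, // straight 3-byte
  String.raw`\xED[\x80-\x9F][\x80-\xBF]`,        // 3-byte, excluding surrogates
  String.raw`\xF0[\x90-\xBF][\x80-\xBF]{2}`,     // planes 1-3
  String.raw`[\xF1-\xF3][\x80-\xBF]{3}`,         // planes 4-15
  String.raw`\xF4[\x80-\x8F][\x80-\xBF]{2}`,     // plane 16
];
const UTF8 = new RegExp(`^(?:${parts.join("|")})*$`);

// Hypothetical helper: turn each byte into one char (0..255), then match.
function isValidUtf8(bytes) {
  let s = "";
  for (const b of bytes) s += String.fromCharCode(b);
  return UTF8.test(s);
}

console.log(isValidUtf8(new TextEncoder().encode("\u{1F42A}"))); // true  (U+1F42A, 4 bytes)
console.log(isValidUtf8([0xc0, 0x80]));                          // false (overlong NUL)
console.log(isValidUtf8([0xed, 0xa0, 0x80]));                    // false (surrogate U+D800)
```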

Slide 8

Slide 8 text

regexp? what is it? ChatGPT answers -- correctly! • This is a regular expression pattern that matches valid UTF-8 encoded characters. UTF-8 is a variable-length character encoding standard that can represent all possible Unicode code points using one to four 8-bit bytes.

Slide 9

Slide 9 text

regexp? what is it? Asked ChatGPT: Give me a regexp that matches a valid UTF-8 string
/^(?:
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$/x

Slide 10

Slide 10 text

regexp? what is it?
(?:[+-]?)(?:0x[0-9a-fA-F]+(?:\.[0-9a-fA-F]+)?(?:[pP][+-]?[0-9]+)|(?:[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?|0(?:\.0+|(?:\.0+)?(?:[eE][+-]?[0-9]+))|(?:[Nn]a[Nn]|[Ii]nf(?:inity)?))
Excerpt from: https://github.com/dankogai/js-sion/blob/main/sion.ts
Visualized by: regexper.com
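A small sketch for poking at this pattern from JavaScript. The `^…$` anchors and the sample strings are additions made here, not part of the excerpt, and the variable name `floatLiteral` is invented for the example.

```javascript
// The excerpt, split over several raw strings for readability and anchored.
const floatLiteral = new RegExp(
  "^" +
    String.raw`(?:[+-]?)(?:0x[0-9a-fA-F]+(?:\.[0-9a-fA-F]+)?(?:[pP][+-]?[0-9]+)` +
    String.raw`|(?:[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?` +
    String.raw`|0(?:\.0+|(?:\.0+)?(?:[eE][+-]?[0-9]+))` +
    String.raw`|(?:[Nn]a[Nn]|[Ii]nf(?:inity)?))` +
    "$"
);

// Illustrative inputs only.
for (const s of ["0x1.8p3", "1.5e10", "0.0", "-inf", "NaN", "0x1.8", "abc"]) {
  console.log(s, floatLiteral.test(s));
}
```

With these inputs the first five should be accepted and the last two rejected; `0x1.8` fails because the hexadecimal branch makes the `p` exponent mandatory.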

Slide 11

Slide 11 text

regexp? what is it? Asked ChatGPT: Give me a regexp that matches C99 floating-point literals; you forgot hexadecimal notations
/^[-+]?((\d+\.\d*|\.\d+|\d+)([eE][-+]?\d+)?|[0-9a-fA-F]+\.[0-9a-fA-F]*([pP][-+]?\d+)?)[fFlL]?$/
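Running ChatGPT's answer shows why it still falls short: the hexadecimal branch has no `0x` prefix. A quick check, with test strings chosen here for illustration:

```javascript
// ChatGPT's pattern from the slide, unchanged.
const chatgpt =
  /^[-+]?((\d+\.\d*|\.\d+|\d+)([eE][-+]?\d+)?|[0-9a-fA-F]+\.[0-9a-fA-F]*([pP][-+]?\d+)?)[fFlL]?$/;

console.log(chatgpt.test("1.5e10"));  // true:  plain decimal literal
console.log(chatgpt.test("0x1.8p3")); // false: a valid C99 hex float is rejected
console.log(chatgpt.test("1.8p3"));   // true:  accepted, although it is not valid C99
```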

Slide 12

Slide 12 text

Irregular expressions
/^(11+?)\1+$/ # is this a regular expression?
$ seq 2 100 | perl -nlE 'say $_ if (1x$_) !~ /^(11+?)\1+$/'

Slide 13

Slide 13 text

Irregular expressions
/^(11+?)\1+$/ # is this a regular expression?
$ seq 2 100 | perl -nlE 'say $_ if (1x$_) !~ /^(11+?)\1+$/'
2
3
5
7
…
79
83
89
97

Slide 14

Slide 14 text

Irregular expressions /^(11+?)\1+$/ # is NOT EXACTLY a regular expression! • The problem is \1 • It is the result of the preceding capture • In other words, this expression is self-modifying: what it matches depends on what it has already captured. • So it is not mathematically a regular expression • Regexp ≠ Regular Expression • Regexp ⊇ Regular Expression
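The same trick works in any engine with backreferences. A JavaScript sketch, where the unary encoding via `"1".repeat(n)` plays the role of Perl's `1x$_`:

```javascript
// Matches a run of n ones iff n = k * m with k >= 2 (the captured block)
// and m >= 2 (the block plus at least one repeat via \1+), i.e. n is composite.
const composite = /^(11+?)\1+$/;

// 0 and 1 do not match either, so exclude them explicitly.
const isPrime = (n) => n >= 2 && !composite.test("1".repeat(n));

const primes = [];
for (let n = 2; n <= 30; n++) {
  if (isPrime(n)) primes.push(n);
}
console.log(primes.join(" ")); // 2 3 5 7 11 13 17 19 23 29
```

The backreference makes the engine try every block length in turn, which is trial division in disguise.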

Slide 15

Slide 15 text

Irregular expressions ChatGPT agrees Unfortunately, it's not possible to use a regular expression to sieve prime numbers. Regular expressions are a pattern matching tool, and they're limited in their ability to perform mathematical operations like checking for prime numbers. Prime number sieves typically involve iterating through a range of numbers and checking each one for divisibility by smaller numbers. This requires programmatic logic and looping constructs, which are beyond the scope of what regular expressions can do. That being said, regular expressions can be useful in certain cases for detecting patterns in prime numbers or other mathematical sequences. For example, you could use a regular expression to match the pattern of the prime numbers in a particular range, but you would still need additional logic to actually sieve the primes.

Slide 16

Slide 16 text

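Irregular expressions ChatGPT agrees (Japanese answer, translated) You cannot determine whether a number is prime with a regular expression. Regular expressions are a tool for pattern matching on strings and cannot perform mathematical computations such as primality testing. To test for primes, you need to use a programming language.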

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Irregular expressions qr{([A-Za-z_]\w*\s*(\(((?:(?>[^()]+)|(?2))*)\)))} • Q: Can a regular expression match nested parentheses? • A: No. But some regex engines allow you to do that.

Slide 19

Slide 19 text

Irregular expressions qr{([A-Za-z_]\w*\s*(\(((?:(?>[^()]+)|(?2))*)\)))}
my $re = qr{
  (
    [A-Za-z_]\w*\s*
    (
      \(
      (
        (?:
          (?>[^()]+)
          |
          (?2)
        )*
      )
      \)
    )
  )
}x;

Slide 20

Slide 20 text

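Irregular expressions qr{([A-Za-z_]\w*\s*(\(((?:(?>[^()]+)|(?2))*)\)))}
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':all';
# (?2) recurses into capture group 2, the parenthesized argument list
my $re = qr{([A-Za-z_]\w*\s*(\(((?:(?>[^()]+)|(?2))*)\)))};
my $str = '$result = a(b(c),d(e,f(g,g,g)))';
$str =~ $re;
say $1; # the whole call: a(b(c),d(e,f(g,g,g)))
say $2; # the argument list with its parentheses
say $3; # the argument list without the outer parentheses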
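JavaScript's RegExp has no `(?2)`-style recursion, so there the nesting has to be tracked by hand. The sketch below is not a port of the Perl pattern, just a depth counter that yields the same three pieces for this input; `matchCall` and its field names are made up for the example.

```javascript
// Find the first identifier followed by a balanced (...) group.
function matchCall(str) {
  const head = /[A-Za-z_]\w*\s*\(/g; // identifier, optional space, "("
  let m;
  while ((m = head.exec(str)) !== null) {
    const open = m.index + m[0].length - 1; // index of the "("
    let depth = 0;
    for (let i = open; i < str.length; i++) {
      if (str[i] === "(") {
        depth++;
      } else if (str[i] === ")" && --depth === 0) {
        return {
          call: str.slice(m.index, i + 1), // like $1
          parens: str.slice(open, i + 1),  // like $2
          inner: str.slice(open + 1, i),   // like $3
        };
      }
    }
    head.lastIndex = m.index + 1; // unbalanced here: try a later start
  }
  return null;
}

const r = matchCall("$result = a(b(c),d(e,f(g,g,g)))");
console.log(r.call);   // a(b(c),d(e,f(g,g,g)))
console.log(r.parens); // (b(c),d(e,f(g,g,g)))
console.log(r.inner);  // b(c),d(e,f(g,g,g))
```

The explicit counter is exactly the piece of state a finite automaton cannot keep, which is why a true regular expression cannot do this.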

Slide 21

Slide 21 text

Unicode Support What is a character? • String is /.*/ but . = • [\x00-\xff] # legacy world of bytes • [\u0000-\uFFFF] # prematurely modern • [\u{0000}-\u{10FFFF}] # correctly modern

Slide 22

Slide 22 text

Unicode Support What is a character? • String is /.*/ but . = • [\x00-\xff] # Perl < 5.7 • [\u0000-\uFFFF] # Java(Script)?, Python2, … • [\u{0000}-\u{10FFFF}] # Perl, Ruby, Python3, …
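The three definitions give three different lengths for the same string. A small JavaScript illustration, using a sample character chosen here:

```javascript
const camel = "\u{1F42A}"; // U+1F42A, outside the Basic Multilingual Plane

const bytes = new TextEncoder().encode(camel).length; // UTF-8 bytes
const units = camel.length;                           // UTF-16 code units
const points = [...camel].length;                     // Unicode code points

console.log(bytes, units, points); // 4 2 1

// "." follows the same split: one code unit by default, one code point with /u.
console.log(/^.$/.test(camel), /^.$/u.test(camel)); // false true
```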

Slide 23

Slide 23 text

Unicode Support? What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🦏🐪🐘🐍💎⚙" =~ /(.)/g); say Dumper([@m])'

Slide 24

Slide 24 text

Unicode Support? What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🦏🐪🐘🐍💎⚙" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
  "\x{1f98f}",
  "\x{1f42a}",
  "\x{1f418}",
  "\x{1f40d}",
  "\x{1f48e}",
  "\x{2699}"
];

Slide 25

Slide 25 text

Unicode Support? What will the following say?
$ node -e \
  'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/g))'

Slide 26

Slide 26 text

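Unicode Support? What will the following say?
$ node -e \
  'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/g))'
[ '', '', '', '', '', '', '', '', '', '', '⚙' ]
(each '' stands for a lone UTF-16 surrogate, half of an emoji, which cannot be displayed on its own)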

Slide 27

Slide 27 text

Unicode Support? What will the following say?
$ node -e \
  'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/ug))'
[ '🦏', '🐪', '🐘', '🐍', '💎', '⚙' ]

Slide 28

Slide 28 text

Unicode Support? What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'

Slide 29

Slide 29 text

Unicode Support? What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
  "\x{1f1ef}",
  "\x{1f1f5}",
  "\x{1f1fa}",
  "\x{1f1e6}"
];

Slide 30

Slide 30 text

Unicode Support? What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
  "\x{1f1ef}", # REGIONAL INDICATOR SYMBOL LETTER J
  "\x{1f1f5}", # REGIONAL INDICATOR SYMBOL LETTER P
  "\x{1f1fa}", # REGIONAL INDICATOR SYMBOL LETTER U
  "\x{1f1e6}"  # REGIONAL INDICATOR SYMBOL LETTER A
];

Slide 31

Slide 31 text

Unicode Support? What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🇯🇵🇺🇦" =~ /(\X)/g); say Dumper([@m])'
$VAR1 = [
  "\x{1f1ef}\x{1f1f5}",
  "\x{1f1fa}\x{1f1e6}"
];

Slide 32

Slide 32 text

Unicode Support? What will the following say?
$ node -e \
  'console.log("🇯🇵🇺🇦".match(/(.)/ug))'

Slide 33

Slide 33 text

Unicode Support? What will the following say?
$ node -e \
  'console.log("🇯🇵🇺🇦".match(/(.)/ug))'
[ '🇯', '🇵', '🇺', '🇦' ]

Slide 34

Slide 34 text

Unicode Support? What will the following say?
$ node -e \
  'console.log("🇯🇵🇺🇦".match(/(\X)/ug))'
🙅 [ '🇯🇵','🇺🇦' ]
🙆 SyntaxError: Invalid regular expression: /(\X)/: Invalid escape
    at [eval]:1:24
    at Script.runInThisContext (node:vm:129:12)
    at Object.runInThisContext (node:vm:305:38)
    at node:internal/process/execution:75:19
    at [eval]-wrapper:6:22
    at evalScript (node:internal/process/execution:74:60)
    at node:internal/main/eval_string:27:3
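Note that the error only appears because of the `u` flag. Without it, JavaScript silently treats `\X` as a literal "X", which is arguably worse. A short check:

```javascript
// With the u flag, unknown escapes are rejected when the pattern is compiled.
let error = null;
try {
  new RegExp("(\\X)", "u");
} catch (e) {
  error = e;
}
console.log(error instanceof SyntaxError); // true

// Without the u flag, \X is an identity escape: it just matches the letter X.
console.log(/^\X$/.test("X")); // true
```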

Slide 35

Slide 35 text

🤦

Slide 36

Slide 36 text

Unicode Support? FYI: A workaround for modern JS
$ node -e 'segmenter = new Intl.Segmenter();
  segment = [...segmenter.segment("🇯🇵🇺🇦")].map(v=>v.segment);
  console.log(segment)'
🙆 [ '🇯🇵','🇺🇦' ]
// cf. https://developer.mozilla.org/ja/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter
// unsupported by Firefox (as of Mar 2023)

Slide 37

Slide 37 text

Unicode Support? Grapheme Cluster • Defined in: • https://unicode.org/reports/tr29/ • \X is supported by: • 🐘 PHP • 🐪 Perl • 💎 Ruby • Not yet supported by: • 🦏 JavaScript (Intl.Segmenter where available) • 🐍 Python
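Since availability varies by runtime, a JavaScript helper probably wants a feature check first. A minimal sketch, assuming a runtime that ships Intl.Segmenter with ICU data; `graphemes` is a hypothetical helper name:

```javascript
// Split a string into grapheme clusters, or fail loudly where unsupported.
function graphemes(str) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    throw new Error("Intl.Segmenter is not available in this runtime");
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(str), (s) => s.segment);
}

// Two flags written as four REGIONAL INDICATOR code points (J P U A).
const flags = "\u{1F1EF}\u{1F1F5}\u{1F1FA}\u{1F1E6}";

console.log([...flags].length);       // 4 code points
console.log(graphemes(flags).length); // 2 grapheme clusters
```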

Slide 38

Slide 38 text

use CPAN Regexp::Common $ perl -MRegexp::Common -E 'say $RE{net}{IPv6}'

Slide 39

Slide 39 text

use CPAN Regexp::Common $perl -MRegexp::Common -E 'say $RE{net}{IPv6}' (?:(?|(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a- fA-F]{1,4})|(?::(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:): (?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F] {1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:)(?:)(?:)(?:):(?: [0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:)(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}))|(?::(?:)(?:)(?:)(?:)(?:)(?:)(?:)(?:):)|(?:(?:[0-9a-fA-F]{1,4}):(?:)(?:):(?:[0-9a- fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a- fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?: [0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F] {1,4}):(?:)(?:)(?:)(?:)(?:)(?:)(?:):)|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F] {1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA- F]{1,4}):(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a- 
fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:)(?:)(?:):)|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA- F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a- fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:) (?:)(?:)(?:):)|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a- fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:):(?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a- fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:)(?:):)|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:): (?:[0-9a-fA-F]{1,4}))|(?:(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:)(?:):)|(?:(?:[0-9a-fA-F] {1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:[0-9a-fA-F]{1,4}):(?:)(?:):)))

Slide 40

Slide 40 text

use CPAN Regexp::Assemble
$ egrep '^.{5}$' /usr/share/dict/words \
  | perl -MRegexp::Assemble -nl \
    -E 'BEGIN{$ra=Regexp::Assemble->new}' \
    -E '$ra->add($_);' \
    -E 'END{say $ra->re}'
# cf. https://www.nytimes.com/games/wordle/index.html
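To get a feel for what assembling a word list into one pattern involves, here is a toy JavaScript version that only merges common prefixes through a trie. It is a simplified illustration, not Regexp::Assemble's actual algorithm, and `assemble` is a name invented here.

```javascript
// Build one anchored regex that matches exactly the given words.
function assemble(words) {
  const END = ""; // trie key meaning "a word may end at this node"
  const trie = {};
  for (const word of words) {
    let node = trie;
    for (const ch of word) node = node[ch] || (node[ch] = {});
    node[END] = true;
  }

  // Escape regex metacharacters in a single character.
  const escape = (ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

  // Turn a trie node into the pattern for everything below it.
  const emit = (node) => {
    const keys = Object.keys(node).filter((k) => k !== END);
    if (keys.length === 0) return "";
    const alts = keys.map((k) => escape(k) + emit(node[k]));
    const body =
      alts.length === 1 && !node[END] ? alts[0] : `(?:${alts.join("|")})`;
    return node[END] ? body + "?" : body; // optional tail if a word ends here
  };

  return new RegExp(`^${emit(trie)}$`);
}

const re = assemble(["crane", "crate", "slate"]);
console.log(re.source);        // ^(?:cra(?:ne|te)|slate)$
console.log(re.test("crate")); // true
console.log(re.test("cran"));  // false
```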

Slide 41

Slide 41 text

Wrap up • (?:ir)?regular expressions • Regexp ≠ Regular Expression • Regexp ⊇ Regular Expression • Definition of characters • [\x00-\xff] • [\u0000-\uFFFF] • [\u{0000}-\u{10FFFF}] • (?:un)?availability of \X • Using perl? use CPAN! • Ask ChatGPT :-?

Slide 42

Slide 42 text

BTW: the Bible is obsolete • 3rd edition: August 8, 2006 • Too old, especially for JS

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

Thank you 🙇

Slide 46

Slide 46 text

Questions and answers answer($_) foreach (/($questions)/sg);