Slide 1

Slide 1 text

@dankogai Perl and the rest of the world - What has(n't) changed in two decades?

Slide 2

Slide 2 text

Table of Contents
• What has changed: 2003 -> 2023?
• Perl
• The rest of the world
• Data matters more than language
• Doing Unicode right

Slide 3

Slide 3 text

What has changed: 2003 -> 2023?
Not much for Perl!
• 🦏 JavaScript: ES3 -> ES2023
• 🐍 Python: 2.2 -> 3.11
• 💎 Ruby: 1.8 (no Rails!) -> 3.2
• 🐘 PHP: 4 (not even 5) -> 8
• 🐪 Perl: 5.8 -> 5.36
• Perl6? It is Raku now!

Slide 4

Slide 4 text

What has changed: 2003 -> 2023?
And the rest of the world…
• 💻 -> 📱
• 🦏 > 🐪, 🐘, 🐍, 💎, …
• 32-bit -> 64-bit
• SOAP, XML… -> JSON
• Bunch of legacy encodings -> UTF-8

Slide 5

Slide 5 text

Unicoding the World
Perl 5.8 (released 2002)
• One of the first programming languages to harness Unicode
• use utf8;
• use Encode;
• \x{} notation (\u{} in other languages)
• /./ matches a Unicode code point
• /\X/ matches a Unicode grapheme cluster
• /\p{Han}/ matches 漢字 (Han characters)

Slide 6

Slide 6 text

Unicoding the World
What is a character?
• String is /.*/ but . =
  • [\x00-\xff] # legacy world of bytes
  • [\u0000-\uFFFF] # prematurely modern
  • [\u{0000}-\u{10FFFF}] # correctly modern

Slide 7

Slide 7 text

Unicoding the World
What is a character?
• String is /.*/ but . =
  • [\x00-\xff] # Perl < 5.7
  • [\u0000-\uFFFF] # Java(Script)?, Python2, …
  • [\u{0000}-\u{10FFFF}] # Perl, Ruby, Python3, …
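The three definitions of "." give three different lengths for the same string. A small sketch in JavaScript, assuming a runtime with a global TextEncoder (Node 11+ or any current browser); the variable names are only for illustration.

```javascript
// One character outside the Basic Multilingual Plane.
const snake = "\u{1F40D}"; // 🐍 U+1F40D SNAKE

// Bytes: the length of the UTF-8 encoding.
const byteLength = new TextEncoder().encode(snake).length;

// UTF-16 code units: what String.prototype.length reports.
const codeUnitLength = snake.length;

// Code points: string iteration (spread, for-of) walks code points.
const codePointLength = [...snake].length;

console.log({ byteLength, codeUnitLength, codePointLength });
// { byteLength: 4, codeUnitLength: 2, codePointLength: 1 }
```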

Slide 8

Slide 8 text

Unicode Support?
What will the following say?
$ python2 -c 'print(len("🐍"))'
2 # unless --enable-unicode=ucs4
$ python3 -c 'print(len("🐍"))'
1 # unconditionally. The way it is supposed to be

Slide 9

Slide 9 text

Unicode Support?
What will the following say?
$ node -e 'console.log("🐍".length)'
2 # 🤦
$ node -e 'console.log([..."🐍"].length)'
1 # 👍

Slide 10

Slide 10 text

Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🦏🐪🐘🐍💎⚙" =~ /(.)/g); say Dumper([@m])'

Slide 11

Slide 11 text

Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🦏🐪🐘🐍💎⚙" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
  "\x{1f98f}",
  "\x{1f42a}",
  "\x{1f418}",
  "\x{1f40d}",
  "\x{1f48e}",
  "\x{2699}"
];

Slide 12

Slide 12 text

Unicode Support?
What will the following say?
$ node -e \
  'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/g))'

Slide 13

Slide 13 text

Unicode Support?
What will the following say?
$ node -e \
  'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/g))'
[ '', '', '', '', '', '', '', '', '', '', '⚙' ]
# the ten '' entries are unpaired surrogate halves, which do not render
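Why ten broken entries for five emoji: without the u flag, /./ matches one UTF-16 code unit, and every code point above U+FFFF is stored as two of them. A sketch of the standard UTF-16 arithmetic; toSurrogatePair is a hypothetical helper name, not a built-in.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const rhino = "\u{1F98F}"; // 🦏
const [high, low] = toSurrogatePair(rhino.codePointAt(0));
console.log(high.toString(16), low.toString(16)); // d83e dd8f

// Without the u flag each half matches on its own; with it, the pair stays whole.
console.log(rhino.match(/./g).length);  // 2
console.log(rhino.match(/./gu).length); // 1
```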

Slide 14

Slide 14 text

Unicode Support?
What will the following say?
$ node -e \
  'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/ug))'
[ '🦏', '🐪', '🐘', '🐍', '💎', '⚙' ]

Slide 15

Slide 15 text

Unicode Support?
What will the following say?
$ node -e \
  'console.log([..."🦏🐪🐘🐍💎⚙"])'
[ '🦏', '🐪', '🐘', '🐍', '💎', '⚙' ]

Slide 16

Slide 16 text

Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'

Slide 17

Slide 17 text

Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
  "\x{1f1ef}",
  "\x{1f1f5}",
  "\x{1f1fa}",
  "\x{1f1e6}"
];

Slide 18

Slide 18 text

Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
  "\x{1f1ef}", # REGIONAL INDICATOR SYMBOL LETTER J
  "\x{1f1f5}", # REGIONAL INDICATOR SYMBOL LETTER P
  "\x{1f1fa}", # REGIONAL INDICATOR SYMBOL LETTER U
  "\x{1f1e6}"  # REGIONAL INDICATOR SYMBOL LETTER A
];
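A flag is just two regional indicator letters side by side, so one can be built from a two-letter region code. A minimal sketch; flagFromRegionCode is a hypothetical helper, and it assumes its input is exactly two ASCII letters (no validation).

```javascript
// REGIONAL INDICATOR SYMBOL LETTER A; the other 25 letters follow in order.
const REGIONAL_INDICATOR_A = 0x1f1e6;

// Hypothetical helper: map each ASCII letter to its regional indicator.
function flagFromRegionCode(code) {
  return [...code.toUpperCase()]
    .map((ch) => String.fromCodePoint(REGIONAL_INDICATOR_A + ch.charCodeAt(0) - 0x41))
    .join("");
}

const jp = flagFromRegionCode("JP");
console.log([...jp].map((ch) => ch.codePointAt(0).toString(16)));
// [ '1f1ef', '1f1f5' ]
```

The two code points match the first two entries of the Data::Dumper output above.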

Slide 19

Slide 19 text

Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
  'my@m=("🇯🇵🇺🇦" =~ /(\X)/g); say Dumper([@m])'
$VAR1 = [
  "\x{1f1ef}\x{1f1f5}",
  "\x{1f1fa}\x{1f1e6}"
];

Slide 20

Slide 20 text

Unicode Support?
What will the following say?
$ node -e \
  'console.log("🇯🇵🇺🇦".match(/(.)/ug))'

Slide 21

Slide 21 text

Unicode Support?
What will the following say?
$ node -e \
  'console.log("🇯🇵🇺🇦".match(/(.)/ug))'
[ '🇯', '🇵', '🇺', '🇦' ]

Slide 22

Slide 22 text

Unicode Support?
What will the following say?
$ node -e \
  'console.log("🇯🇵🇺🇦".match(/(\X)/ug))'
🙅 [ '🇯🇵','🇺🇦' ]
🙆 SyntaxError: Invalid regular expression: /(\X)/: Invalid escape
    at [eval]:1:24
    at Script.runInThisContext (node:vm:129:12)
    at Object.runInThisContext (node:vm:305:38)
    at node:internal/process/execution:75:19
    at [eval]-wrapper:6:22
    at evalScript (node:internal/process/execution:74:60)
    at node:internal/main/eval_string:27:3

Slide 23

Slide 23 text

🤦

Slide 24

Slide 24 text

Unicode Support?
FYI: A workaround for modern JS
$ node -e 'segmenter = new Intl.Segmenter();
  segment = [...segmenter.segment("🇯🇵🇺🇦")].map(v=>v.segment);
  console.log(segment)'
🙆 [ '🇯🇵','🇺🇦' ]
// cf. https://developer.mozilla.org/ja/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter
// unsupported by Firefox (as of Mar 2023)
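The one-liner above can be wrapped into a reusable function. A sketch, assuming a runtime that ships Intl.Segmenter; graphemes is a hypothetical helper name, and it throws rather than guessing when the API is missing.

```javascript
// Split a string into grapheme clusters, roughly what /\X/g does in Perl.
function graphemes(text) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    throw new Error("Intl.Segmenter is not available in this runtime");
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const flags = "\u{1F1EF}\u{1F1F5}\u{1F1FA}\u{1F1E6}"; // 🇯🇵🇺🇦

console.log(graphemes(flags).length); // 2 grapheme clusters
console.log([...flags].length);       // 4 code points
console.log(flags.length);            // 8 UTF-16 code units
```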

Slide 25

Slide 25 text

Unicode Support?
Grapheme Cluster
• Defined in:
  • https://unicode.org/reports/tr29/
• \X is supported by:
  • 🐘 PHP (via preg_*())
  • 🐪 Perl
  • 💎 Ruby
• Not yet supported by:
  • 🦏 JavaScript (Intl.Segmenter where available)
  • 🐍 Python (pip install regex?)
    • https://pypi.org/project/regex/

Slide 26

Slide 26 text

Unicode Support? -- gone too far?
What will the following say?
$ swift
Welcome to Swift version 5.5.2-dev.
Type :help for assistance.
  1> "\u{3060}\u{3093}" == "\u{305f}\u{3099}\u{3093}" // だん
$R0: Bool = true

Slide 27

Slide 27 text

Unicode Support? -- gone too far?
What will the following say?
infix operator ===: ComparisonPrecedence
extension String {
  static func ===(_ lhs: Self, _ rhs: Self) -> Bool {
    return lhs.utf8.elementsEqual(rhs.utf8)
  }
}
let dan0 = "\u{3060}\u{3093}"
let dan1 = "\u{305f}\u{3099}\u{3093}"
dan0 == dan1  // true
dan0 === dan1 // false
dan0 === dan0 // true
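JavaScript sits at the opposite default: === compares code units, and canonical equivalence is opt-in through String.prototype.normalize. A sketch for contrast; canonicallyEqual is a hypothetical helper, and NFC is one reasonable choice of normalization form here.

```javascript
const dan0 = "\u3060\u3093";       // だん, precomposed だ
const dan1 = "\u305f\u3099\u3093"; // た + combining voiced sound mark + ん

// Hypothetical helper: compare after normalizing both sides to NFC.
function canonicallyEqual(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(dan0 === dan1);                // false: different code units
console.log(canonicallyEqual(dan0, dan1)); // true: same text after NFC
```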

Slide 28

Slide 28 text

Wrap↑
• Perl hasn't changed much
  • Because it didn't have to
  • Doing Unicode right since 5.8
• Other languages still have some catching up to do
  • PHP: ?
  • Ruby: well done!
  • Python: Kill 2! \X missing
  • JavaScript: use for-of; \X missing
  • Swift: gone too far?

Slide 29

Slide 29 text

Thank you 🙇

Slide 30

Slide 30 text

Questions and answers
answer($_) foreach (/($questions)/sg);