@dankogai
Perl and the rest of the world -
What has(n't) changed in two
decades?
Slide 2
Table of Contents
• What has changed: 2003 -> 2023?
• Perl
• The rest of the world
• Data matters more than language
• Doing Unicode right
Slide 3
What has changed: 2003 -> 2023?
Not much for Perl!
• 🦏 JavaScript: ES3 -> ES2023
• 🐍 Python: 2.2 -> 3.11
• 💎 Ruby: 1.8 (no rails!) -> 3.2
• 🐘 PHP: 4 -> 8
• 🐪 Perl : 5.8 -> 5.36
• Perl 6? It is Raku now!
Slide 4
What has changed: 2003 -> 2023?
And the rest of the world…
• 💻 -> 📱
• 🦏 > 🐪, 🐘, 🐍, 💎, …
• 32bit -> 64bit
• SOAP, XML… -> JSON
• Bunch of legacy encodings -> UTF-8
Slide 5
Unicoding the World
Perl 5.8 (Released 2002)
• One of the first computer languages to harness Unicode
• use utf8;
• use Encode;
• \x{} notation (\u{} in other languages)
• /./ matches Unicode codepoint
• /\X/ matches Unicode grapheme
• /\p{Han}/ matches Han characters such as 漢
Slide 6
Unicoding the World
What is a character?
• A string is /.*/, but what does . match?
• [\x00-\xff] # legacy world of bytes
• [\u0000-\uFFFF] # prematurely modern
• [\u{0000}-\u{10FFFF}] # correctly modern
Slide 7
Unicoding the World
What is a character?
• A string is /.*/, but what does . match?
• [\x00-\xff] # Perl < 5.7
• [\u0000-\uFFFF] # Java(Script)?, Python2, …
• [\u{0000}-\u{10FFFF}] # Perl, Ruby, Python3, …
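The three definitions of "character" above can be compared on a single emoji. This is a minimal Python 3 sketch, not part of the original slides; the variable names are made up for illustration, and the emoji is written as an escape so the source stays ASCII.

```python
# Count the same one-emoji string in three different units.
s = "\U0001F40D"  # U+1F40D SNAKE

byte_count = len(s.encode("utf-8"))            # legacy world of bytes
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit code units (the Java/JavaScript view)
code_points = len(s)                           # Python 3 strings count code points

print(byte_count, utf16_units, code_points)    # 4 2 1
```

Only the last count agrees with the "correctly modern" range [\u{0000}-\u{10FFFF}].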
Slide 8
Unicode Support?
What will the following say?
$ python2 -c 'print(len("🐍"))'
2 # unless --enable-unicode=ucs4
$ python3 -c 'print(len("🐍"))'
1 # unconditionally. The way it is supposed to be
Slide 9
Unicode Support?
What will the following say?
$ node -e 'console.log("🐍".length)'
2 # 🤦
$ node -e 'console.log([..."🐍"].length)'
1 # 👍
Slide 10
Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🦏🐪🐘🐍💎⚙" =~ /(.)/g); say Dumper([@m])'
Slide 11
Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🦏🐪🐘🐍💎⚙" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
"\x{1f98f}",
"\x{1f42a}",
"\x{1f418}",
"\x{1f40d}",
"\x{1f48e}",
"\x{2699}"
];
Slide 12
Unicode Support?
What will the following say?
$ node -e \
'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/g))'
Slide 13
Unicode Support?
What will the following say?
$ node -e \
'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/g))'
[
'\ud83e', '\udd8f', '\ud83d', '\udc2a',
'\ud83d', '\udc18', '\ud83d', '\udc0d',
'\ud83d', '\udc8e', '⚙'
]
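Why eleven matches for six symbols? Without the u flag, . in JavaScript matches one UTF-16 code unit, and each of the five emoji needs two of them (a surrogate pair). The split is plain arithmetic, sketched here in Python 3; the helper name to_surrogate_pair is hypothetical and not from the slides.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (above U+FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F98F)  # U+1F98F RHINOCEROS
print(f"{high:04x} {low:04x}")          # d83e dd8f
```

⚙ (U+2699) is below U+10000, so it fits in a single code unit and survives intact.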
Slide 14
Unicode Support?
What will the following say?
$ node -e \
'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/ug))'
[ '🦏', '🐪', '🐘', '🐍', '💎', '⚙' ]
Slide 15
Unicode Support?
What will the following say?
$ node -e \
'console.log([..."🦏🐪🐘🐍💎⚙"])'
[ '🦏', '🐪', '🐘', '🐍', '💎', '⚙' ]
Slide 16
Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
Slide 17
Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
"\x{1f1ef}",
"\x{1f1f5}",
"\x{1f1fa}",
"\x{1f1e6}"
];
Slide 18
Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
"\x{1f1ef}", # REGIONAL INDICATOR SYMBOL LETTER J
"\x{1f1f5}", # REGIONAL INDICATOR SYMBOL LETTER P
"\x{1f1fa}", # REGIONAL INDICATOR SYMBOL LETTER U
"\x{1f1e6}" # REGIONAL INDICATOR SYMBOL LETTER A
];
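The character names in the comments above can be looked up programmatically. A small Python 3 sketch using the standard unicodedata module (not from the original slides; the variable names are illustrative):

```python
import unicodedata

# The two flags, written as four regional indicator code points.
flags = "\U0001F1EF\U0001F1F5\U0001F1FA\U0001F1E6"

names = [unicodedata.name(ch) for ch in flags]
for ch, name in zip(flags, names):
    print(f"U+{ord(ch):04X} {name}")
```

A flag is therefore not one code point but a pair of regional indicator letters that fonts render as a single glyph.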
Slide 19
Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🇯🇵🇺🇦" =~ /(\X)/g); say Dumper([@m])'
$VAR1 = [
"\x{1f1ef}\x{1f1f5}",
"\x{1f1fa}\x{1f1e6}"
];
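What \X does for flags can be approximated by hand. The sketch below is Python 3 and deliberately simplified: it only pairs up regional indicators and treats every other code point as its own cluster, so it is not a full implementation of Unicode grapheme segmentation (UAX #29). The function names are hypothetical.

```python
def is_regional_indicator(ch):
    """True for U+1F1E6..U+1F1FF (REGIONAL INDICATOR SYMBOL LETTER A..Z)."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def split_flags(text):
    """Group consecutive regional indicators two by two (simplified, flags only)."""
    clusters = []
    i = 0
    while i < len(text):
        if (is_regional_indicator(text[i])
                and i + 1 < len(text)
                and is_regional_indicator(text[i + 1])):
            clusters.append(text[i:i + 2])  # one flag = two code points
            i += 2
        else:
            clusters.append(text[i])        # anything else stands alone
            i += 1
    return clusters

print(len(split_flags("\U0001F1EF\U0001F1F5\U0001F1FA\U0001F1E6")))  # 2
```

Combining marks, ZWJ emoji sequences, and Hangul are ignored here; a real segmenter has to handle them as well.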
Slide 20
Unicode Support?
What will the following say?
$ node -e \
'console.log("🇯🇵🇺🇦".match(/(.)/ug))'
Slide 21
Unicode Support?
What will the following say?
$ node -e \
'console.log("🇯🇵🇺🇦".match(/(.)/ug))'
[ '🇯', '🇵', '🇺', '🇦' ]
Slide 22
Unicode Support?
What will the following say?
$ node -e \
'console.log("🇯🇵🇺🇦".match(/(\X)/ug))'
🙅 [ '🇯🇵','🇺🇦' ]
🙆 SyntaxError: Invalid regular expression: /(\X)/: Invalid escape
at [eval]:1:24
at Script.runInThisContext (node:vm:129:12)
at Object.runInThisContext (node:vm:305:38)
at node:internal/process/execution:75:19
at [eval]-wrapper:6:22
at evalScript (node:internal/process/execution:74:60)
at node:internal/main/eval_string:27:3
Slide 23
🤦
Slide 24
Unicode Support?
FYI: A workaround for modern JS
$ node -e 'segmenter = new Intl.Segmenter(); segment =
[...segmenter.segment("🇯🇵🇺🇦")].map(v=>v.segment);
console.log(segment)'
🙆 [ '🇯🇵','🇺🇦' ]
// cf. https://developer.mozilla.org/ja/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter
// unsupported by Firefox (as of Mar 2023)
Unicode Support? -- gone too far?
What will the following say?
$ swift
Welcome to Swift version 5.5.2-dev.
Type :help for assistance.
1> "\u{3060}\u{3093}" == "\u{305f}\u{3099}\u{3093}" // だん
$R0: Bool = true
Slide 27
Unicode Support? -- gone too far?
What will the following say?
infix operator ===: ComparisonPrecedence
extension String {
static func ===(_ lhs:Self, _ rhs:Self)->Bool {
return lhs.utf8.elementsEqual(rhs.utf8)
}
}
let dan0 = "\u{3060}\u{3093}"
let dan1 = "\u{305f}\u{3099}\u{3093}"
dan0 == dan1 // true
dan0 === dan1 // false
dan0 === dan0 // true
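For contrast, a language that compares code points directly gets the opposite default. This Python 3 sketch (not from the original slides; variable names are illustrative) shows that the two spellings compare unequal until both are normalized to the same form, here NFC:

```python
import unicodedata

composed = "\u3060\u3093"          # HIRAGANA DA, N (precomposed)
decomposed = "\u305f\u3099\u3093"  # HIRAGANA TA + combining voiced sound mark, N

raw_equal = composed == decomposed
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

print(raw_equal)  # False: the code point sequences differ
print(nfc_equal)  # True: canonically equivalent after normalization
```

Swift's == behaves like the normalized comparison, while the custom === above behaves like the raw one.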
Slide 28
Wrap↑
• Perl hasn't changed much
• Because it didn't have to
• Doing Unicode right since 5.8
• Other languages still have some catching up to do
• PHP: ?
• Ruby: well done!
• Python: Kill 2! \X missing
• JavaScript: use for-of; \X missing
• Swift: gone too far?
Slide 29
Thank you
🙇
Slide 30
Questions and answers
answer($_) foreach (/($questions)/sg);