A Bug in Encode.pm That I Found by Accident

@hiratara

How it started
• A project to upgrade Perl 5.24.3 (2017-09-22) to a newer release
• Naturally, the CPAN modules were old as well
  ◦ Encode 2.80_01 (2016-01-25)
• Then a puzzling PR showed up out of nowhere

diff --git a/escape-invalid-utf8.t b/escape-invalid-utf8.t
index aca356b..3122bf3 100644
--- a/escape-invalid-utf8.t
+++ b/escape-invalid-utf8.t
@@ -9,6 +9,6 @@ sub escape_invalid_utf8 {
     encode 'UTF-8', $string, Encode::FB_CROAK;
 }
 
-is escape_invalid_utf8("\xff\xE3\x81\x82"), "\\xFF\xE3\x81\x82";
+is escape_invalid_utf8("\xf4\xE3\x81\x82"), "\\xF4\xE3\x81\x82";
 
 done_testing;

PR description: Fix the test code so that the test also passes on Perl-5.38.2

✅ someone approved these changes 34 seconds ago
LGTM

Wait a minute
• \xff → \xf4
• A change of just 2 bytes
• I don't really understand what it means

❌ hiratara requested changes 58 seconds ago

What is being tested

    sub escape_invalid_utf8 {
        my $bytes = shift;
        my $string = decode 'UTF-8', $bytes, Encode::FB_PERLQQ;
        encode 'UTF-8', $string, Encode::FB_CROAK;
    }

The goal is to escape sequences that are invalid as UTF-8.

What the unit test checks

    is escape_invalid_utf8("\xff\xE3\x81\x82"), "\\xFF\xE3\x81\x82";

• "\xE3\x81\x82" is "あ" in UTF-8
  ◦ It is valid, so it is left untouched
• "\xff" is invalid, so it is escaped Perl-style
• In other words, the 1 byte "\xff" becomes the 4 bytes '\' 'x' 'F' 'F'
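As a rough illustration of the behaviour this test expects, the following C sketch copies well-formed UTF-8 sequences unchanged and rewrites every other byte as `\xHH`. It is not how Encode.pm is implemented: `utf8_seq_len` and `escape_invalid_utf8` are hypothetical helpers, and the validity check only follows the byte ranges of RFC 3629, which may be looser or stricter than Encode's own rules in corner cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence at
 * s[0..n), or 0 if the bytes there are not well-formed (RFC 3629 ranges). */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;      /* no overlong forms */
        if (s[0] == 0xED) hi = 0x9F;      /* no surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;      /* no overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;      /* nothing above U+10FFFF */
    } else {
        return 0;                         /* 0x80-0xC1 and 0xF5-0xFF */
    }

    if (n < len) return 0;                /* truncated sequence */
    if (s[1] < lo || s[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}

/* Copies in[0..n) to out, replacing each byte that does not start a
 * well-formed sequence with the 4 characters \xHH. out must have room
 * for 4*n + 1 bytes. Returns the number of bytes written (without NUL). */
static size_t escape_invalid_utf8(const unsigned char *in, size_t n, char *out)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < n) {
        size_t len = utf8_seq_len(in + i, n - i);
        if (len == 0) {
            o += (size_t)sprintf(out + o, "\\x%02X", (unsigned)in[i]);
            i += 1;
        } else {
            while (len--) out[o++] = (char)in[i++];
        }
    }
    out[o] = '\0';
    return o;
}
```

Given the four bytes FF E3 81 82, this sketch writes 7 bytes: the characters `\xFF` followed by E3 81 82 unchanged, which is the result the original test asks for.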

Why the test fails

Printing escape_invalid_utf8("\xff\xE3\x81\x82") gives:
• Old Perl: \xFFあ
• New Perl: \xFF\xE3\x81\x82

Perl banjo https://perlbanjo.com/

What was changed for Perl 5.38.2
• Before: "\xff\xE3\x81\x82"
• After: "\xf4\xE3\x81\x82"
• Neither "\xff" nor "\xf4" looks like proper UTF-8
• Would "\xf5" not work?

Encode.pm
• Up to 2.94, it had its own implementation of UTF-8 processing
• From 2.95 on, it uses Perl's API
• https://github.com/dankogai/p5-encode/commit/6128f2

  "Perl 5.26 introduced infrastructure in the core that can be used by Encode to check UTF-8 stream validity much faster than before. This commit replaces the current scheme for checking UTF-8 validity if the infrastructure is available"

UTF-8 (en|de)code via the Perl API

Skip ahead until an invalid sequence is found:

    bool valid = is_utf8_string_loc_flags(s, e - s, &e_or_where_failed, flags);

An invalid sequence is then examined one character at a time with a different API:

    uv = utf8n_to_uvchr(s, e - s, &ulen, UTF8_ALLOW_ANY);
    for (i=0; i<ulen; ++i) sprintf(esc+4*i, "\\x%02X", s[i]);
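The escaping loop trusts `ulen`, the length reported back by `utf8n_to_uvchr`. The sketch below isolates that loop in a hypothetical helper, `escape_bytes` (not a function in Encode.pm), to show that the number of escaped bytes is simply whatever length it is handed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper mirroring the sprintf loop above: writes ulen bytes
 * of s as \xHH into esc, which must hold 4*ulen + 1 bytes.
 * Returns the number of characters written (without the trailing NUL). */
static size_t escape_bytes(const unsigned char *s, size_t ulen, char *esc)
{
    size_t i;

    assert(s != NULL && esc != NULL);
    esc[0] = '\0';                        /* result for ulen == 0 */
    for (i = 0; i < ulen; ++i)
        sprintf(esc + 4 * i, "\\x%02X", (unsigned)s[i]);
    return 4 * ulen;
}
```

With the bytes FF E3 81 82 and a length of 1, only `\xFF` is produced. With a length of 4, the three valid bytes that follow are escaped as well, which has the same shape as the output seen on the newer Perl.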

Perl

utf8n_to_uvchr(s, curlen, &retlen, flags)
• The actual implementation is Perl_utf8n_to_uvchr_msgs in inline.h
• Decodes the first Unicode character of the string s
• Returns the decoded Unicode character
• Stores the number of bytes processed in retlen

The DFA that validates UTF-8 (perl.h)

The DFA that validates UTF-8
• Classifies the bytes 0x00 to 0xFF into 19 types, numbered 0 to 18
• Has 13 states: N0 to N11, plus 1
• N0 to N11 are constants that are multiples of 19
• 0 (N0) is the accepting state and 1 is the reject state

Refactoring of the DFA
• It was made faster in February 2022 by a460925
• The behavior seen here is a bug introduced by that change

  "This commit effectively removes a conditional from inside the loop, and avoids some conditionals when converting the common case of the input being UTF-8 invariant (ASCII on ASCII platforms)."

    UV state = PL_strict_utf8_dfa_tab[256 + type];
    uv = (0xff >> type) & NATIVE_UTF8_TO_I8(*s);

    while (++s < send) {
        type  = PL_strict_utf8_dfa_tab[*s];
        state = PL_strict_utf8_dfa_tab[256 + state + type];
        uv = UTF8_ACCUMULATE(uv, *s);
        if (state == 0) { goto success; }
        if (UNLIKELY(state == 1)) { break; }
    }

There is no termination check! The state computed from the first byte is never tested before the loop consumes the next byte.
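To make the effect of the missing check concrete, here is a toy DFA with the same shape: state 0 accepts, state 1 rejects, and the other states are multiples of the class count so that `state + class` is a flat index. Everything in it is an assumption made for illustration: the 5 byte classes, the 3 transition rows, the ASCII shortcut, and the `check_first` switch are hypothetical and much cruder than Perl's real table, so specific inputs do not behave exactly as they do in Perl.

```c
#include <assert.h>
#include <stddef.h>

enum { NCLASS = 5, ACCEPT = 0, REJECT = 1, N1 = 1 * NCLASS, N2 = 2 * NCLASS };

/* Toy byte classes: 0 ASCII, 1 continuation, 2 two-byte lead,
 * 3 three-byte lead, 4 never valid. */
static int toy_class(unsigned char b)
{
    if (b <= 0x7F) return 0;
    if (b <= 0xBF) return 1;
    if (b <= 0xDF) return 2;
    if (b <= 0xEF) return 3;
    return 4;
}

/* Transition rows stored back to back; row N0 starts at index 0. */
static const unsigned char toy_tab[3 * NCLASS] = {
    /* N0 */ ACCEPT, REJECT, N1,     N2,     REJECT,
    /* N1 */ REJECT, ACCEPT, REJECT, REJECT, REJECT,
    /* N2 */ REJECT, N1,     REJECT, REJECT, REJECT,
};

/* Returns the number of bytes consumed when the DFA accepts, or 0 if it
 * rejects or runs out of input. check_first selects whether the state
 * reached on the first byte is tested before entering the loop. */
static size_t toy_decode_len(const unsigned char *s, size_t n, int check_first)
{
    const unsigned char *p = s, *send = s + n;
    unsigned state;

    assert(s != NULL && n > 0);
    if (*s <= 0x7F) return 1;             /* toy shortcut for ASCII */

    state = toy_tab[toy_class(*s)];
    if (check_first && state == REJECT) return 0;

    while (++p < send) {
        /* Without the check above, state may be REJECT (1) here, so this
         * lookup reads row N0 shifted by one entry. */
        state = toy_tab[state + toy_class(*p)];
        if (state == ACCEPT) return (size_t)(p - s) + 1;
        if (state == REJECT) break;
    }
    return 0;
}
```

For the bytes FF 80 80, the checked variant rejects immediately. The unchecked variant starts the loop in state 1, reads a shifted entry of row N0, and ends up accepting all three bytes as one character. A valid sequence such as E3 81 82 is accepted with length 3 either way.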

The difference between \xff and \xf4

Byte classification table:

    8, 6, 6, 6, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /*F0-FF*/

Transition table from the initial state:

    /*N0*/ 0, 1, N1, N2, N4, N7, N6, N3, N5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
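The two rows above are enough to look up the state reached right after the first byte. The sketch below does only that lookup. The rows are copied from the slide, while the `N(k)` macro assumes that Nk is exactly k * 19, and `first_state` is a hypothetical helper, not a Perl function.

```c
#include <assert.h>

enum { NTYPES = 19, REJECT = 1 };

/* Assumption: state Nk is the constant k * 19. */
#define N(k) ((k) * NTYPES)

/* Byte classes for 0xF0-0xFF, copied from the slide. */
static const unsigned char class_f0_ff[16] = {
    8, 6, 6, 6, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
};

/* Transitions out of the initial state N0, indexed by byte class. */
static const unsigned char row_n0[NTYPES] = {
    0, 1, N(1), N(2), N(4), N(7), N(6), N(3), N(5),
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1
};

/* State reached after a single first byte in the range 0xF0-0xFF. */
static unsigned first_state(unsigned char b)
{
    assert(b >= 0xF0);
    return row_n0[class_f0_ff[b - 0xF0]];
}
```

Under these assumptions 0xFF and 0xF5 are both class 1 and land on the reject state 1 before the loop starts, whereas 0xF4 is class 5 and moves to N7, so any rejection for it can only happen on a later transition inside the loop, where the state is actually checked.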

Impact on Perl itself

Code under use utf8 that contains an invalid sequence causes a panic:

    $ echo -e 'use utf8; "\xff\xE3\x81\x82"' | perl
    panic: _force_out_malformed_utf8_message should be called only when there are errors found at - line 1.

Fix status
• https://github.com/Perl/perl5/pull/22597
  ◦ Merged into blead
• https://github.com/Perl/perl5/pull/22630
  ◦ Waiting to be backported
  ◦ 5.36 will probably not get the fix

Thank you for your attention
• Use a recent version of Perl
• I am always grateful for Encode.pm

Masahiro Honma
