String meets Encoding
ima1zumi
September 11, 2022
https://rubykaigi.org/2022/presentations/ima1zumi.html#day3
Transcript
String meets Encoding RubyKaigi 2022 2022-09-10 Mari Imaizumi
Agenda
• Motivation
• CSV.read
• stackprof
• String#split
• perf
• faster String#split
• ruby/ruby #6351
Evaluation environments
• MacBook Pro 2020
• macOS 12.4
• 2 GHz Quad-Core Intel Core i5
• 32 GB 3733 MHz LPDDR4X
• Vagrant
• Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64)
• ruby 3.2.0dev (2022-09-05T15:39:37Z 63ed61e322) [x86_64-darwin21]
Introduction
@ima1zumi (Mari Imaizumi)
ESM, inc.
Hamada.rb, Fukuoka.rb
❤ Character, Character Encoding
https://hamadarb.connpass.com/event/260134/
Dive into Encoding - RubyKaigi Takeout 2021
Motivation
• > If you want to do something around encoding in Ruby, you need to speed up String#encode. Right now it takes as long to convert CP932 to UTF-8 as it does to parse KEN_ALL.CSV in pure Ruby. (DeepL translate)
• https://twitter.com/ktou/status/1436656477826019329
🙆 String#encode
🤔 CSV.read (String#split)
CSV.read("KEN_ALL.CSV")
KEN_ALL.CSV
• Zip code data in Japan
• https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html
• 16 MB
• 15 columns
• 124,541 rows
• Encoding: CP932 (Windows-31J)
KEN_ALL.CSV
01101,060 ,0600000,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,北海道,札幌市中央区,以下に掲載がない場合,0,0,0,0,0,0
01101,064 ,0640941,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｱｻﾋｶﾞｵｶ,北海道,札幌市中央区,旭ケ丘,0,0,1,0,0,0
ｳﾒ),北海道,札幌市中央区,北一条西（２０〜２８丁目）,1,0,1,0,0,0
...(about 120000 lines)...
47382,90718,9071800,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,沖縄県,八重山郡与那国町,以下に掲載がない場合,0,0,0,0,0,0
47382,90718,9071801,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ﾖﾅｸﾞﾆ,沖縄県,八重山郡与那国町,与那国,0,0,0,0,0,0
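Because the file is CP932, reading it usually involves transcoding. The following is a minimal sketch, assuming the caller wants UTF-8 strings back; the slides only show `CSV.read("KEN_ALL.CSV")`, so the `encoding:` option and the tiny stand-in file (a few abbreviated columns, not the real data) are assumptions made for illustration.

```ruby
require "csv"
require "tempfile"

# Stand-in for KEN_ALL.CSV: two abbreviated rows written as CP932 bytes.
# (The real file is about 16 MB; this sample only illustrates the call.)
sample = <<~ROWS
  01101,0600000,北海道,札幌市中央区
  47382,9071801,沖縄県,八重山郡与那国町
ROWS

rows = Tempfile.create(["ken_all_sample", ".csv"]) do |file|
  file.binmode
  file.write(sample.encode("CP932"))
  file.flush
  # "CP932:UTF-8" means: the file is CP932 on disk, hand back UTF-8 strings.
  CSV.read(file.path, encoding: "CP932:UTF-8")
end

first_row = rows.first
puts first_row.inspect
puts first_row[2].encoding
```

With the real file, the same call would take the path to KEN_ALL.CSV in place of the temporary file.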
Benchmark for CSV.read
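The benchmark slides are images, so the script behind them is not in the transcript. One way such a measurement could be taken is sketched below with the standard `benchmark` library; the synthetic CP932 data, the row count, and the temporary file name are assumptions, not the speaker's setup.

```ruby
require "benchmark"
require "csv"
require "tempfile"

# Synthetic CP932 data standing in for KEN_ALL.CSV (hypothetical content).
line = "01101,0600000,北海道,札幌市中央区,0,0,0\n".encode("CP932")

elapsed = Tempfile.create(["csv_read_bench", ".csv"]) do |file|
  file.binmode
  file.write(line * 10_000)
  file.flush
  # Wall-clock seconds for one full read, including transcoding to UTF-8.
  Benchmark.realtime { CSV.read(file.path, encoding: "CP932:UTF-8") }
end

puts format("CSV.read: %.3f s", elapsed)
```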
stackprof 🔍
Stackprof
• A sampling call-stack profiler for Ruby
• https://github.com/tmm1/stackprof
• sampling mode
• :wall, :cpu, :object, :custom
• flamegraph
stackprof --d3-flamegraph stackprof-cpu-cp932-csv.dump > stackprof.html
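A dump file like the one passed to the command above can be produced with `StackProf.run`. This is a sketch under assumptions: the workload is synthetic, the dump file name is hypothetical, and the code falls back to a plain run when the stackprof gem is not installed.

```ruby
require "csv"
require "tmpdir"

# stackprof is a third-party gem, so fall back to a plain run if it is missing.
stackprof_loaded =
  begin
    require "stackprof"
    true
  rescue LoadError
    false
  end

data = "01101,0600000,a,b\n" * 10_000 # synthetic stand-in for KEN_ALL.CSV
dump_path = File.join(Dir.tmpdir, "stackprof-cpu-sample.dump") # hypothetical file name
rows = nil

if stackprof_loaded
  # raw: true keeps the per-sample stacks that the flamegraph output needs.
  StackProf.run(mode: :cpu, raw: true, out: dump_path) { rows = CSV.parse(data) }
else
  rows = CSV.parse(data)
end

puts "parsed #{rows.length} rows"
```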
grep split
Summary
• Reading KEN_ALL.CSV with CSV.read took about 1.8 seconds.
• About 29% of CSV.read's time is spent in String#split
Measure String#split with perf
String#split
• split(pattern = nil, limit = 0)
• pattern: Regexp, String, nil
• limit: maximum number of substrings returned
• return: Array or self
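The signature above can be exercised with a few calls. The sample strings are made up for illustration; the behaviour shown is standard `String#split`.

```ruby
csv_line = "01101,0600000,北海道"

by_string  = csv_line.split(",")       # String pattern
by_regexp  = "a1b22c".split(/\d+/)     # Regexp pattern
by_space   = "  a  b c ".split         # nil pattern: split on whitespace
limited    = csv_line.split(",", 2)    # positive limit: at most 2 substrings
keep_tail  = "a,b,,".split(",", -1)    # negative limit keeps trailing empty strings
drop_tail  = "a,b,,".split(",")        # default limit drops trailing empty strings
with_block = csv_line.split(",") { |field| field } # block form returns the receiver

p by_string, by_regexp, by_space, limited, keep_tail, drop_tail
```

The block form is the case where the return value is `self` rather than an Array.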
Try perf
• performance analysis tool for Linux
perf record String#split
flamegraph (→ alphabetical order)
• rb_ary_push
• rb_enc_cr_str_copy_for_substr
• str_new0
Summary
• str_new0 40.13%
• rb_ary_push 19.25%
• rb_enc_cr_str_copy_for_substr 13%
• rb_mem_search 4.88%
• rb_enc_right_char_head 3.68%
String#split
1. Check arguments
2. Check patterns
3. loop
   1. Search substr
   2. create substr
   3. result << substr
4. return result
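The loop in these steps can be sketched in pure Ruby. The real implementation is C code inside Ruby; `naive_split` is a hypothetical name, and this sketch assumes a non-empty String separator, ignores `limit`, and keeps trailing empty strings, so it behaves like `split(sep, -1)` for non-empty input.

```ruby
# Hypothetical pure-Ruby illustration of the search / create / push loop.
def naive_split(str, sep)
  # 1-2. Check arguments and pattern (only a non-empty String is supported here).
  unless sep.is_a?(String) && !sep.empty?
    raise ArgumentError, "separator must be a non-empty String"
  end

  result = []
  pos = 0
  # 3. Loop: search for the next separator, create the substring, push it.
  while (found = str.index(sep, pos))
    result << str[pos...found]
    pos = found + sep.length
  end
  result << str[pos..] # the remainder after the last separator

  # 4. Return the collected substrings.
  result
end

p naive_split("01101,0600000,北海道", ",")
```

Each pass through the loop allocates one new String and appends it to the Array, which are the two costs that dominate the perf summary.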
rb_str_split_m summary
• str_new0 40.13%
• rb_ary_push 19.25%
• rb_enc_cr_str_copy_for_substr 13%
• rb_mem_search 4.88%
• rb_enc_right_char_head 3.68%
rb_str_subseq
• create substring from str
• set encoding and coderange to str2
rb_enc_cr_str_copy_for_substr
• set encoding
• set coderange
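The effect of copying the encoding and coderange to each substring can be observed from Ruby. This sketch does not show the C code itself; it assumes that `ascii_only?` and `valid_encoding?` are reasonable Ruby-level views of the coderange, and the sample string is made up.

```ruby
original = "01101,北海道".encode("Windows-31J")
parts = original.split(",")

parts.each do |part|
  # Every substring carries the encoding of the string it was cut from.
  puts "#{part.encoding} ascii_only=#{part.ascii_only?} valid=#{part.valid_encoding?}"
end
```

Both substrings report Windows-31J, while only the first one is ASCII-only, so the encoding is inherited but the coderange is determined per substring.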
str_enc_copy
🤔
• Don't get encoding dynamically
• just pass the Encoding of the original string
Make rb_enc_set_index_fastpath
Benchmark for String#split
https://github.com/ruby/ruby/pull/6351
[Chart: String#split on UTF-8 and US-ASCII strings, comparing ruby, rubydev, and builtruby; numeric labels not recoverable]
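A comparison along the lines of this chart could be set up as follows. This is not the benchmark from the pull request; the input strings, iteration count, and use of `Benchmark.realtime` are assumptions chosen to keep the sketch short.

```ruby
require "benchmark"

# 15 fields per line, mirroring the column count of KEN_ALL.CSV.
ascii_line = ("abcdef," * 15).chomp(",").force_encoding("US-ASCII")
utf8_line  = ("北海道," * 15).chomp(",")

iterations = 50_000
inputs = { "US-ASCII" => ascii_line, "UTF-8" => utf8_line }

results = inputs.to_h do |label, line|
  seconds = Benchmark.realtime { iterations.times { line.split(",") } }
  [label, seconds]
end

results.each { |label, seconds| puts format("%-8s %.3f s", label, seconds) }
```

Running the same script under two Ruby builds gives a before/after comparison for each encoding.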
Conclusion
• String and Encoding checks are a bit heavy
  • must_encindex
  • mustnot_broken
• Skipping unnecessary checks leads to faster speeds
• https://github.com/ruby/ruby/pull/6072#issuecomment-1191371088
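`must_encindex` and `mustnot_broken` are C-level checks, so they cannot be called from Ruby. As a rough Ruby-level analogue, and only as an assumption about what "broken" means here, the sketch below shows how a string with an invalid byte sequence is detected and repaired; the sample strings are made up.

```ruby
valid  = "郵便番号"
broken = "abc\xFF".b.force_encoding("UTF-8") # 0xFF is never valid in UTF-8

puts valid.valid_encoding?   # the coderange of this string is valid
puts broken.valid_encoding?  # this string has a broken coderange

# scrub replaces each invalid byte sequence with the given replacement.
repaired = broken.scrub("?")
puts repaired
```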