Slide 1

Slide 1 text

String meets Encoding RubyKaigi 2022 2022-09-10 Mari Imaizumi

Slide 2

Slide 2 text

Agenda
• Motivation
• CSV.read
• stackprof
• String#split
• perf
• faster String#split
• ruby/ruby #6351

Slide 3

Slide 3 text

Evaluation environments
• MacBook Pro 2020
• macOS 12.4
• 2 GHz Quad-Core Intel Core i5
• 32 GB 3733 MHz LPDDR4X
• Vagrant
• Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64)
• ruby 3.2.0dev (2022-09-05T15:39:37Z 63ed61e322) [x86_64-darwin21]

Slide 4

Slide 4 text

Introduction @ima1zumi (Mari Imaizumi) ESM, inc. Hamada.rb, Fukuoka.rb ❤ Character, Character Encoding

Slide 5

Slide 5 text

https://hamadarb.connpass.com/event/260134/

Slide 6

Slide 6 text


Slide 7

Slide 7 text

Dive into Encoding - RubyKaigi Takeout 2021

Slide 8

Slide 8 text

Motivation
• > If you want to do something around encoding in Ruby, you need to speed up String#encode. Right now it takes as long to convert CP932 to UTF-8 as it does to parse KEN_ALL.CSV in pure Ruby. (translated with DeepL)
• https://twitter.com/ktou/status/1436656477826019329

Slide 9

Slide 9 text

🙆 String#encode

Slide 10

Slide 10 text

🙆 String#encode 🤔 CSV.read (String#split)

Slide 11

Slide 11 text

CSV.read("KEN_ALL.CSV")

Slide 12

Slide 12 text

KEN_ALL.CSV
• Zip code data in Japan
• https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html
• 16 MB
• 15 columns
• 124,541 rows
• Encoding: CP932 (Windows-31J)
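A minimal sketch of reading a CP932-encoded file with CSV.read. It assumes a tiny stand-in file written to a temporary path rather than the real KEN_ALL.CSV, so the sample row and the file name are illustrative only. The `encoding: "CP932:UTF-8"` option asks Ruby to read the bytes as CP932 and return UTF-8 strings.

```ruby
require "csv"
require "tempfile"

# Illustrative stand-in for one shortened KEN_ALL.CSV row (not the real file).
sample = "01101,064,0640941,北海道,札幌市中央区,旭ケ丘\n"

rows = nil
Tempfile.create(["ken_all_sample", ".csv"]) do |file|
  file.binmode
  # CP932 is an alias of Windows-31J in Ruby.
  file.write(sample.encode("Windows-31J"))
  file.flush

  # "external:internal": read bytes as CP932, hand back UTF-8 strings.
  rows = CSV.read(file.path, encoding: "CP932:UTF-8")
end

puts rows.first.inspect
puts rows.first.last.encoding
```

Transcoding while reading is the step the quoted tweet calls slow; splitting each line into fields is the part examined in the rest of the talk.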

Slide 13

Slide 13 text

KEN_ALL.CSV
01101,060 ,0600000,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,北海道,札幌市中央区,以下に掲載がない場合,0,0,0,0,0,0
01101,064 ,0640941,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｱｻﾋｶﾞｵｶ,北海道,札幌市中央区,旭ケ丘,0,0,1,0,0,0
ｳﾒ),北海道,札幌市中央区,北一条西（２０～２８丁目）,1,0,1,0,0,0
...(about 120000 lines)...
47382,90718,9071800,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,沖縄県,八重山郡与那国町,以下に掲載がない場合,0,0,0,0,0,0
47382,90718,9071801,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ﾖﾅｸﾞﾆ,沖縄県,八重山郡与那国町,与那国,0,0,0,0,0,0

Slide 14

Slide 14 text

Benchmark for CSV.read

Slide 15

Slide 15 text

Benchmark for CSV.read

Slide 16

Slide 16 text

stackprof 🔍

Slide 17

Slide 17 text

Stackprof
• A sampling call-stack profiler for Ruby
• https://github.com/tmm1/stackprof
• sampling mode
  • :wall, :cpu, :object, :custom
• flamegraph
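A minimal sketch of collecting a CPU-mode profile with `StackProf.run`, assuming the third-party stackprof gem is installed. The workload (parsing a small in-memory CSV) and the dump file name are placeholders, not the talk's actual script. `raw: true` keeps the raw samples that the flamegraph output needs. If the gem is missing, the sketch just runs the workload unprofiled.

```ruby
require "csv"
require "tmpdir"

begin
  require "stackprof" # third-party gem: gem install stackprof
  stackprof_loaded = true
rescue LoadError
  stackprof_loaded = false
end

# Placeholder workload; the talk profiles CSV.read("KEN_ALL.CSV") instead.
data = "a,b,c\n" * 1_000
parsed = nil
dump_path = File.join(Dir.tmpdir, "stackprof-cpu-sample.dump") # hypothetical name

if stackprof_loaded
  # mode: can be :wall, :cpu, :object or :custom.
  StackProf.run(mode: :cpu, raw: true, out: dump_path) do
    parsed = CSV.parse(data)
  end
  puts "profile written to #{dump_path}"
else
  parsed = CSV.parse(data)
  puts "stackprof not installed; ran the workload without profiling"
end
```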

Slide 18

Slide 18 text

Stackprof

Slide 19

Slide 19 text


Slide 20

Slide 20 text

stackprof --d3-flamegraph stackprof-cpu-cp932-csv.dump > stackprof.html

Slide 21

Slide 21 text

Stackprof

Slide 22

Slide 22 text

grep split

Slide 23

Slide 23 text

grep split

Slide 24

Slide 24 text

Summary
• Reading KEN_ALL.CSV with CSV.read took about 1.8 seconds.
• String#split accounts for about 29% of the time spent in CSV.read.

Slide 25

Slide 25 text

Measure String#split with perf

Slide 26

Slide 26 text

String#split
• split(pattern = nil, limit = 0)
• pattern: Regexp, String, nil
• limit: maximum number of substrings to return
• return: Array (or self when a block is given)
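A short illustration of the signature above, using made-up input strings. It shows the three pattern kinds, the effect of `limit`, and the block form that returns the receiver instead of an Array.

```ruby
line = "01101,064,0640941"

by_string = line.split(",")        # String pattern
by_regexp = "a1b22c".split(/\d+/)  # Regexp pattern
by_nil    = " a  b ".split         # nil pattern: split on whitespace (when $; is nil)
limited   = line.split(",", 2)     # at most 2 substrings

# Block form: each substring is yielded and the receiver itself is returned.
collected = []
returned = line.split(",") { |field| collected << field }

p by_string  # => ["01101", "064", "0640941"]
p by_regexp  # => ["a", "b", "c"]
p by_nil     # => ["a", "b"]
p limited    # => ["01101", "064,0640941"]
p returned.equal?(line) # => true
```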

Slide 27

Slide 27 text

Try perf
• A performance analysis tool for Linux

Slide 28

Slide 28 text


Slide 29

Slide 29 text

perf record String#split

Slide 30

Slide 30 text


Slide 31

Slide 31 text

flamegraph
→ alphabetical order (the x-axis is sorted by function name, not by time)

Slide 32

Slide 32 text

flamegraph
→ alphabetical order
• rb_ary_push
• rb_enc_cr_str_copy_for_substr
• str_new0

Slide 33

Slide 33 text

Summary
• str_new0 40.13%
• rb_ary_push 19.25%
• rb_enc_cr_str_copy_for_substr 13%
• rb_mem_search 4.88%
• rb_enc_right_char_head 3.68%

Slide 34

Slide 34 text

String#split
1. Check arguments
2. Check patterns
3. loop
   1. Search substr
   2. create substr
   3. result << substr
4. return result
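The steps above can be sketched in plain Ruby. This is a simplified illustration, not the C implementation (`rb_str_split_m`): `split_sketch` is a hypothetical helper that handles only a non-empty String separator with the default limit, and it skips the awk-style handling of a single-space separator.

```ruby
# Hypothetical helper mirroring the outline: check, loop (search, create, push), return.
def split_sketch(str, sep)
  # 1-2. Check arguments and pattern (only non-empty String separators here).
  unless sep.is_a?(String) && !sep.empty?
    raise ArgumentError, "separator must be a non-empty String"
  end

  result = []
  pos = 0
  # 3. Loop: search for the next separator, create the substring, push it.
  while (hit = str.index(sep, pos))
    result << str[pos...hit]
    pos = hit + sep.length
  end
  result << str[pos..] # remainder after the last separator

  # With the default limit (0), trailing empty strings are dropped.
  result.pop while result.last&.empty?

  # 4. Return the result.
  result
end

p split_sketch("01101,064,0640941", ",")
```

Each pass through the loop creates one new String and pushes it onto the Array, which corresponds to the two largest entries in the perf summary (`str_new0` and `rb_ary_push`).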

Slide 35

Slide 35 text

rb_str_split_m summary
• str_new0 40.13%
• rb_ary_push 19.25%
• rb_enc_cr_str_copy_for_substr 13%
• rb_mem_search 4.88%
• rb_enc_right_char_head 3.68%

Slide 36

Slide 36 text

Summary
• str_new0 40.13%
• rb_ary_push 19.25%
• rb_enc_cr_str_copy_for_substr 13%
• rb_mem_search 4.88%
• rb_enc_right_char_head 3.68%

Slide 37

Slide 37 text

rb_str_subseq

Slide 38

Slide 38 text

rb_str_subseq
• create substring from str

Slide 39

Slide 39 text

rb_str_subseq
• create substring from str
• set encoding and coderange to str2

Slide 40

Slide 40 text

rb_enc_cr_str_copy_for_substr

Slide 41

Slide 41 text

rb_enc_cr_str_copy_for_substr
• set encoding

Slide 42

Slide 42 text

rb_enc_cr_str_copy_for_substr
• set encoding
• set coderange

Slide 43

Slide 43 text

str_enc_copy

Slide 44

Slide 44 text


Slide 45

Slide 45 text


Slide 46

Slide 46 text

🤔
• Don't get the encoding dynamically
• Just pass the Encoding of the original string
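Seen from Ruby, the idea rests on a simple observation: every substring produced by String#split has the same Encoding as the string being split, so the encoding is already known before any substring is created. A small illustration of that observable behaviour (it does not show the C change itself; the sample string is made up):

```ruby
# Build a CP932 (Windows-31J) string from a UTF-8 literal.
original = "北海道,札幌市中央区,旭ケ丘".encode("Windows-31J")

parts = original.split(",")

# Every substring carries the encoding of the original string.
parts.each do |part|
  puts "#{part.encode('UTF-8')}: #{part.encoding}"
end

same_encoding = parts.all? { |part| part.encoding == original.encoding }
puts same_encoding
```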

Slide 47

Slide 47 text

Make rb_enc_set_index_fastpath

Slide 48

Slide 48 text

Benchmark for String#split

Slide 49

Slide 49 text

Benchmark for String#split
https://github.com/ruby/ruby/pull/6351
[Chart: String#split (UTF-8) and String#split (US-ASCII), comparing ruby and built-ruby (ruby dev); ratio values not recoverable]
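A rough sketch of how such a comparison could be timed with the standard `benchmark` library. The input lines and the iteration count are assumptions, not the talk's actual benchmark. Running the same script under two Ruby builds would give the two sides of the comparison.

```ruby
require "benchmark"

# Made-up 15-field lines, one US-ASCII and one UTF-8 with non-ASCII characters.
fields = Array.new(15) { |i| "field#{i}" }
ascii_line = fields.join(",").encode("US-ASCII")
utf8_line  = fields.map { |field| "#{field}あ" }.join(",")

iterations = 50_000 # arbitrary; raise it for more stable timings

{ "US-ASCII" => ascii_line, "UTF-8" => utf8_line }.each do |label, line|
  seconds = Benchmark.realtime { iterations.times { line.split(",") } }
  puts format("String#split (%s): %.3fs for %d iterations", label, seconds, iterations)
end
```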

Slide 50

Slide 50 text

Conclusion
• Encoding checks on String are a bit heavy
  • must_encindex
  • mustnot_broken
• Skipping unnecessary checks makes things faster
• https://github.com/ruby/ruby/pull/6072#issuecomment-1191371088