String meets Encoding

ima1zumi
September 11, 2022

Transcript

  1. String meets Encoding
    RubyKaigi 2022
    2022-09-10 Mari Imaizumi

  2. Agenda
    • Motivation
    • CSV.read
    • stackprof
    • String#split
    • perf
    • faster String#split
    • ruby/ruby #6351

  3. Evaluation environments
    • MacBook Pro 2020
    • macOS 12.4
    • 2 GHz Quad-Core Intel Core i5
    • 32 GB 3733 MHz LPDDR4X
    • Vagrant
    • Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64)
    • ruby 3.2.0dev (2022-09-05T15:39:37Z 63ed61e322) [x86_64-darwin21]

  4. Introduction
    @ima1zumi (Mari Imaizumi)
    ESM, inc.
    Hamada.rb, Fukuoka.rb
    ❤ Character, Character Encoding

  5. https://hamadarb.connpass.com/event/260134/

  6. 6

  7. Dive into Encoding - RubyKaigi Takeout 2021

  8. Motivation
    • > If you want to do something around encoding in Ruby, you need to speed up String#encode. Right now it takes as long to convert CP932 to UTF-8 as it does to parse KEN_ALL.CSV in pure Ruby. (translated with DeepL)
    • https://twitter.com/ktou/status/1436656477826019329

  9. 🙆 String#encode

  10. 🙆 String#encode
    🤔 CSV.read (String#split)

  11. CSV.read("KEN_ALL.CSV")

  12. KEN_ALL.CSV
    • Zip code data for Japan
    • https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html
    • 16 MB
    • 15 columns
    • 124,541 rows
    • Encoding: CP932 (Windows-31J)

  13. KEN_ALL.CSV
    01101,060 ,0600000,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,北海道,札幌市中央区,以下に掲載がない場合,0,0,0,0,0,0
    01101,064 ,0640941,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｱｻﾋｶﾞｵｶ,北海道,札幌市中央区,旭ケ丘,0,0,1,0,0,0
    ｳﾒ),北海道,札幌市中央区,北一条西（２０〜２８丁目）,1,0,1,0,0,0
    ...(about 120000 lines)...
    47382,90718,9071800,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,沖縄県,八重山郡与那国町,以下に掲載がない場合,0,0,0,0,0,0
    47382,90718,9071801,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ﾖﾅｸﾞﾆ,沖縄県,八重山郡与那国町,与那国,0,0,0,0,0,0

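A minimal sketch of the `CSV.read` call from slide 11, assuming the file is transcoded from CP932 to UTF-8 while it is read. The slides do not show which options were passed, so the `encoding:` argument is an assumption. Two stand-in rows written to a temporary file take the place of KEN_ALL.CSV, and `sample_path` is a hypothetical name.

```ruby
require "csv"
require "tmpdir"

# \u5317\u6D77\u9053 is the prefecture name "Hokkaido" written in kanji.
hokkaido = "\u5317\u6D77\u9053"

# Two stand-in rows with 15 columns each, shaped like KEN_ALL.CSV.
line1 = "01101,060,0600000,A,B,C,#{hokkaido},D,E,0,0,0,0,0,0"
line2 = "01101,064,0640941,A,B,C,#{hokkaido},D,F,0,0,1,0,0,0"
cp932_bytes = "#{line1}\n#{line2}\n".encode("Windows-31J")

rows = Dir.mktmpdir do |dir|
  sample_path = File.join(dir, "sample.csv")
  File.binwrite(sample_path, cp932_bytes)
  # "Windows-31J:UTF-8" means: the file is Windows-31J, return UTF-8 strings.
  CSV.read(sample_path, encoding: "Windows-31J:UTF-8")
end

puts rows.size
puts rows.first.size
puts rows.first[6].encoding
```

Every field goes through the transcoding step before the parser sees it, which is why String#encode and the parser's String#split calls both show up when reading a CP932 file.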

  14. Benchmark for CSV.read

  15. Benchmark for CSV.read

  16. stackprof 🔍

  17. Stackprof
    • A sampling call-stack profiler for Ruby
    • https://github.com/tmm1/stackprof
    • sampling mode
    • :wall, :cpu, :object, :custom
    • flamegraph

  18. Stackprof

  19. 19

  20. stackprof --d3-flamegraph stackprof-cpu-cp932-csv.dump > stackprof.html

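A dump like the one passed to the command above could be produced roughly as follows. This is a sketch, not the speaker's script. It assumes the stackprof gem is installed and uses `StackProf.run` with `mode: :cpu` and `raw: true`, since the flamegraph reports are built from raw samples. The workload (`parse_lines`, a hypothetical helper) is a stand-in because the real input file is not part of the slides. Only the output file name is taken from the slide.

```ruby
# Stand-in workload: split many comma-separated lines, as a CSV parser would.
def parse_lines(lines)
  lines.map { |line| line.split(",") }
end

lines = Array.new(10_000) { |i| "#{i},foo,bar,baz" }

begin
  require "stackprof" # third-party gem; assumed to be installed
  # raw: true keeps the raw samples that the flamegraph reports need.
  StackProf.run(mode: :cpu, raw: true, out: "stackprof-cpu-cp932-csv.dump") do
    parse_lines(lines)
  end
rescue LoadError
  warn "stackprof is not installed; running the workload without profiling"
  parse_lines(lines)
end
```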

  21. Stackprof

  22. grep split

  23. grep split

  24. Summary
    • Reading KEN_ALL.CSV with CSV.read took about 1.8 seconds.
    • String#split accounts for 29% of the time spent in CSV.read.

  25. Measure String#split with perf

  26. String#split
    • split(pattern = nil, limit = 0)
    • pattern: Regexp, String, nil
    • limit: maximum number of substrings to return
    • return: Array, or self when a block is given

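The signature above can be exercised with a few short calls. The input strings are made up for illustration, and the block form that returns self requires Ruby 2.6 or later.

```ruby
row = "01101,060,0600000"

by_string = row.split(",")          # => ["01101", "060", "0600000"]
by_regexp = "a1b22c".split(/\d+/)   # => ["a", "b", "c"]
by_nil    = " a  b c ".split        # => ["a", "b", "c"] (runs of whitespace)
limited   = row.split(",", 2)       # => ["01101", "060,0600000"]
trailing  = "a,b,,".split(",")      # => ["a", "b"] (limit 0 drops trailing empties)
kept      = "a,b,,".split(",", -1)  # => ["a", "b", "", ""]

# With a block, each substring is yielded and the receiver is returned.
collected = []
returned = row.split(",") { |field| collected << field }

p by_string, by_regexp, by_nil, limited, trailing, kept
p collected
p returned.equal?(row) # => true
```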

  27. Try perf
    • A performance analysis tool for Linux

  28. 28

  29. perf record String#split

  30. 30

  31. flamegraph
    → alphabetical order

  32. flamegraph
    → alphabetical order
    rb_ary_push
    rb_enc_cr_str_copy_for_substr
    str_new0

  33. Summary
    • str_new0 40.13%
    • rb_ary_push 19.25%
    • rb_enc_cr_str_copy_for_substr 13%
    • rb_mem_search 4.88%
    • rb_enc_right_char_head 3.68%

  34. String#split
    • 1. Check arguments
    • 2. Check patterns
    • 3. Loop
      • 1. Search substr
      • 2. Create substr
      • 3. result << substr
    • 4. Return result

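The steps above can be sketched in plain Ruby. This is an illustration of the outline, not a translation of the C function rb_str_split_m. `naive_split` is a hypothetical helper that handles only a non-empty String separator with the default limit. It ignores Regexp and nil patterns, the special handling of a single-space separator, and explicit limits.

```ruby
# Hypothetical helper that mirrors the outline above.
def naive_split(str, sep)
  # 1. Check arguments
  raise TypeError, "separator must be a String" unless sep.is_a?(String)
  # 2. Check patterns (only the plain-string case is handled here)
  raise ArgumentError, "separator must not be empty" if sep.empty?

  result = []
  start = 0
  # 3. Loop
  while (found = str.index(sep, start)) # 3.1 search for the next separator
    substr = str[start...found]         # 3.2 create the substring
    result << substr                    # 3.3 result << substr
    start = found + sep.length
  end
  result << str[start..]                # the tail after the last separator
  result.pop while result.last&.empty?  # limit 0 drops trailing empty strings
  # 4. Return result
  result
end

p naive_split("01101,060,0600000", ",")
p naive_split("a,b,,", ",")
```

Steps 3.2 and 3.3 run once per field, which matches the profile: creating substrings and pushing them onto the result array dominate the samples.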

  35. rb_str_split_m summary
    • str_new0 40.13%
    • rb_ary_push 19.25%
    • rb_enc_cr_str_copy_for_substr 13%
    • rb_mem_search 4.88%
    • rb_enc_right_char_head 3.68%

  36. Summary
    • str_new0 40.13%
    • rb_ary_push 19.25%
    • rb_enc_cr_str_copy_for_substr 13%
    • rb_mem_search 4.88%
    • rb_enc_right_char_head 3.68%

  37. rb_str_subseq

  38. rb_str_subseq
    create substring from str

  39. rb_str_subseq
    create substring from str
    set encoding and coderange on str2

  40. rb_enc_cr_str_copy_for_substr

  41. rb_enc_cr_str_copy_for_substr
    set encoding

  42. rb_enc_cr_str_copy_for_substr
    set encoding
    set coderange

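The result of this step can be observed from Ruby: every substring returned by String#split carries the encoding of the original string. The coderange itself is internal to the C implementation. Using `ascii_only?` and `valid_encoding?` as the closest Ruby-level views of it is an interpretation, not something the slides state. The sample line is made up.

```ruby
# \u5317\u6D77\u9053 is the prefecture name "Hokkaido" written in kanji.
hokkaido = "\u5317\u6D77\u9053"
line = "01101,#{hokkaido}".encode("Windows-31J")

fields = line.split(",")

fields.each do |field|
  # Print metadata only; the field bytes are Windows-31J, not UTF-8.
  puts "#{field.encoding} ascii_only=#{field.ascii_only?} valid=#{field.valid_encoding?}"
end
```

The first field contains only ASCII bytes and the second contains multibyte characters, yet both are tagged Windows-31J like the string they were cut from.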

  43. str_enc_copy

  44. 44

  45. 45

  46. 🤔
    • Don't look up the encoding dynamically
    • Just pass the encoding of the original string

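A Ruby-level analogy of this idea, offered as a sketch only: the actual change is made in C and is not reproduced here. `split_with_known_encoding` is a hypothetical helper that reads the encoding of the original string once and tags every piece with it, instead of working it out again for each substring. Searching byte-wise is an extra assumption that holds only when the separator byte cannot occur inside a multibyte character, which is the case for "," in Windows-31J and UTF-8.

```ruby
# Hypothetical helper: split on a single-byte separator, reusing the
# encoding of the original string for every piece.
def split_with_known_encoding(str, sep = ",")
  enc = str.encoding # looked up once, outside the loop
  bytes = str.b      # binary copy, so indexes are byte offsets
  pieces = []
  start = 0
  while (pos = bytes.index(sep, start))
    pieces << bytes.byteslice(start, pos - start).force_encoding(enc)
    start = pos + sep.bytesize
  end
  pieces << bytes.byteslice(start, bytes.bytesize - start).force_encoding(enc)
  pieces
end

# \u5317\u6D77\u9053 is "Hokkaido" in kanji; \u9053 is its last character.
line = "01101,\u5317\u6D77\u9053,\u9053".encode("Windows-31J")
pieces = split_with_known_encoding(line)
puts pieces.map(&:encoding).uniq.inspect
```

Unlike the default String#split, this helper keeps trailing empty strings, so it corresponds to `split(",", -1)`.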

  47. Make rb_enc_set_index_fastpath

  48. Benchmark for String#split

  49. Benchmark for String#split
    https://github.com/ruby/ruby/pull/6351
    [Chart: String#split (UTF-8) and String#split (US-ASCII); series labels as extracted: ruby, rubydev, builtruby]

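A comparison along the lines of the chart can be sketched with the standard `benchmark` library. This is not the benchmark from the pull request: the input lines, the iteration count, and the use of `Benchmark.realtime` are all assumptions, and the timings depend on the machine and the Ruby build.

```ruby
require "benchmark"

iterations = 20_000

# 15 comma-separated fields per line, one line per encoding.
ascii_line = ("abc," * 15).chomp(",").force_encoding("US-ASCII")
utf8_line  = ("\u5317\u6D77\u9053," * 15).chomp(",") # "Hokkaido" in kanji

results = {}
{ "US-ASCII" => ascii_line, "UTF-8" => utf8_line }.each do |label, line|
  results[label] = Benchmark.realtime do
    iterations.times { line.split(",") }
  end
end

results.each do |label, seconds|
  puts format("String#split (%s): %.3fs", label, seconds)
end
```

Running the same script with two Ruby builds, one before and one after the change, gives a before-and-after comparison of the kind the chart shows.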

  50. Conclusion
    • Encoding checks on String are a bit heavy
      • must_encindex
      • mustnot_broken
    • Skipping unnecessary checks makes String#split faster
    • https://github.com/ruby/ruby/pull/6072#issuecomment-1191371088