Improved REXML XML parsing performance using StringScanner

RubyKaigi 2024 LT

* Benchmark Code : https://gist.github.com/naitoh/abc5134fdf37bb3952e36f1fb77163b0
* RubyKaigi 2019 Better CSV processing with Ruby 2.6 : https://rubykaigi.org/2019/presentations/ktou.html#apr19

NAITOH Jun

May 16, 2024

Transcript

  1. Improved REXML XML parsing performance using StringScanner
    NAITOH Jun. A member of Red Data Tools. Software Engineer at MedPeer, Inc.
  2. May 15th - 17th, 2024, NAHA CULTURAL ARTS THEATER NAHArt, Okinawa, Japan
    We are holding an after event on 5/28 (Tue)! https://medpeer.connpass.com/event/316741/
  3. Motivation
    I am the author of RBPDF, a PDF export library gem (https://github.com/naitoh/rbpdf).
    I would like to support SVG in RBPDF using REXML, which is easy to install. (SVG is an image file described in XML.)
    REXML is slower than C extension gems, so I would like to improve REXML performance.
  4. Check the current status of REXML. (Before improvement)
  5. Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6, YJIT disabled)
    [Chart: processing time against number of lines; annotation: "Less than 0.01 second"]
    * dom: 65.60x slower
    * pull (SAX): 21.56x slower
    Benchmark input:
        <?xml version="1.0"?>
        <root>
          <child id0="0" id1="0" />
          <child id0="1" id1="1" />
          <child id0="2" id1="2" />
          <child id0="3" id1="3" />
          :
          <child id0="9999" id1="9999" />
        </root>
  6. Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6, YJIT disabled)
    rexml(dom) is a DOM parser that can easily retrieve desired elements by specifying an XPath, e.g. /root/child[@id0="1"].
    The DOM parser caches and retains all parsed results, so it is not memory efficient for large XML.
    * dom: 65.60x slower
  7. Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6, YJIT disabled)
    rexml(pull) is a SAX-style parser. It parses XML sequentially from the beginning, retrieving the necessary information line by line.
    The SAX parser does not need to retain parsed results, so it is memory efficient even with large XML.
    * pull (SAX): 21.56x slower
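The two access styles on slides 6 and 7 can be sketched as follows. This is a minimal illustration that assumes the rexml gem is installed; the four-line input and the variable names are made up for the example and are not the benchmark code from the talk.

```ruby
require 'rexml/document'
require 'rexml/parsers/pullparser'

# Small stand-in for the benchmark input (assumption: 4 children instead of 10,000).
xml = +%(<?xml version="1.0"?>\n<root>\n)
4.times { |i| xml << %(  <child id0="#{i}" id1="#{i}" />\n) }
xml << "</root>\n"

# DOM style: parse everything into a tree, then query it with XPath.
doc = REXML::Document.new(xml)
dom_hit = REXML::XPath.first(doc, '/root/child[@id0="1"]')
dom_id1 = dom_hit.attributes['id1']

# Pull style: walk the events one by one and stop as soon as the element is found.
parser = REXML::Parsers::PullParser.new(xml)
pull_id1 = nil
while parser.has_next?
  event = parser.pull
  next unless event.start_element? && event[0] == 'child'

  attrs = event[1] # attribute Hash of the start tag
  if attrs['id0'] == '1'
    pull_id1 = attrs['id1']
    break
  end
end

puts "dom: #{dom_id1}, pull: #{pull_id1}"
```

The pull loop never builds a tree, which is why it can stay memory efficient on large inputs, at the cost of doing the element matching by hand.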
  8. Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6, YJIT disabled → YJIT enabled)
    * dom: 65.60x slower → 44.20x slower (-32%)
    * pull (SAX): 21.56x slower → 14.64x slower (-31%)
  9. How can I improve from here?
  10. In "Better CSV processing with Ruby 2.6", there was a suggestion to use StringScanner for REXML.
    See: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2019/
  11. REXML's parser is implemented with the Regexp class. Can this be replaced by StringScanner to speed up the process? 🤔
  12. StringScanner is 1.67x faster than Regexp in a simple case. 👍
    test.yaml:
        prelude: |
          require 'strscan'
          str = 'test string'
          s = StringScanner.new(str)
          re = /\A\w+/
        benchmark:
          's.check(/\w+/)': s.check(/\w+/)
          're.match(str)' : re.match(str)
    Result:
        $ benchmark-driver test.yaml
        Calculating -------------------------------------
              s.check(/\w+/)      9.936M i/s -     26.808M times in 2.698060s (100.64ns/i)
               re.match(str)      5.687M i/s -     17.161M times in 3.017700s (175.85ns/i)

        Comparison:
              s.check(/\w+/):   9916759.6 i/s
               re.match(str):   5938111.5 i/s - 1.67x  slower
    Is Regexp faster than StringScanner? 🤔
  13. Regexp is a batch process, so a simple comparison is not possible.
    Regexp matches the whole string in one call:
        > /\A(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/.match('This is an example string')
        => #<MatchData "This is an example string" 1:"This" 2:"is" 3:"an" 4:"example" 5:"string">
        > # $1 => "This", $2 => "is", $3 => "an", $4 => "example", $5 => "string"
    How to process in StringScanner: it processes the string while moving the pointer from the beginning of the string.
        s = StringScanner.new('This is an example string')
        p s.scan(/\w+/) #=> "This"
        p s.scan(/\w+/) #=> nil
        p s.scan(/\s+/) #=> " "
        p s.scan(/\w+/) #=> "is"
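To connect the pointer-based style on slide 13 to XML, here is a rough sketch that tokenizes one line of the benchmark input with StringScanner. The helper name scan_empty_tag and its patterns are hypothetical, written only for this illustration; this is not REXML's actual parser and it ignores most of the XML specification.

```ruby
require 'strscan'

# Hypothetical helper: reads one self-closing tag such as <child id0="1" id1="1" />
# and returns [tag_name, attributes], or nil when the line does not match.
def scan_empty_tag(line)
  s = StringScanner.new(line)
  s.skip(/\s*/)                                  # leading indentation
  return nil unless s.scan(/<([A-Za-z_][\w.-]*)/) # "<" followed by the tag name

  name = s[1]
  attrs = {}
  # Each successful scan moves the pointer past one name="value" pair.
  while s.scan(/\s+([A-Za-z_][\w.-]*)="([^"]*)"/)
    attrs[s[1]] = s[2]
  end
  return nil unless s.scan(%r{\s*/>})            # closing "/>"

  [name, attrs]
end

name, attrs = scan_empty_tag('  <child id0="1" id1="1" />')
p name  # "child"
p attrs # {"id0"=>"1", "id1"=>"1"}
```

Each pattern only has to describe the next token at the current position, so the small regular expressions stay simple and the scanner never rescans text it has already consumed.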
  14. What I did.
    * Adding benchmarks.
    * Search for optimization targets using a profiler.
    * Rewrite XML parsing process with StringScanner.
    * Fixed bugs in StringScanner.
    * Fixed XML specification violation in REXML.
    * Benchmarking and measuring effectiveness.
  15. Check the current status of REXML. (After improvement)
  16. Processing time for the number of XML lines (Result) (Ruby 3.3.0, YJIT disabled, REXML 3.2.6 → REXML master)
    * dom: 65.60x slower → 60.37x slower (-6%)
    * pull (SAX): 21.56x slower → 17.56x slower (-17%)
  17. Processing time for the number of XML lines (Result) (Ruby 3.3.0, YJIT enabled, REXML 3.2.6 → REXML master)
    * dom: 44.20x slower → 35.92x slower (-23%)
    * pull (SAX): 14.64x slower → 8.63x slower (-44%) 🎉
  18. Summary
    * Use SAX (pull parser) instead of dom if you need high performance.
    * Enable YJIT.
    * Use the latest REXML (3.2.7+).
    With the StringScanner optimization and these choices combined, the difference with libxml improved from 65x (dom, REXML 3.2.6, YJIT disabled) to 8.6x (pull, REXML master, YJIT enabled). 🎉
  19. Optimization Points
    If a regular expression in a method contains interpolation, it is slow because the interpolation is expanded each time the method is called.
    If #{ext} is constant, s.check(/[a-z]*#{ext}/o) reduces the expansion to the first call only.
    In Ruby 3.3 YJIT, it is faster to declare the pattern as a module constant (evaluated only once, at class initialization).
    Slow:
        @s = StringScanner.new('foobar')
        def foo
          @s.check(/[a-z]*#{ext}/)
        end
    Fast:
        @s = StringScanner.new('foobar')
        def foo
          @s.check(/[a-z]*#{ext}/o)
        end
    Fast (for YJIT):
        @s = StringScanner.new('foobar')
        TAG_MATCH = /[a-z]*#{ext}/
        def foo
          @s.check(TAG_MATCH)
        end
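The difference between the three variants on slide 19 can be made visible by counting how often the interpolated value is evaluated. This is only a sketch: the class Matcher, the constant EXT and the call counter are hypothetical names added for the demonstration, and it shows interpolation counts, not timings.

```ruby
require 'strscan'

class Matcher
  EXT = 'bar'
  # Built once, when the class body is evaluated.
  TAG_MATCH = /[a-z]*#{EXT}/

  attr_reader :ext_calls

  def initialize
    @s = StringScanner.new('foobar')
    @ext_calls = 0
  end

  # Counts how many times the interpolation asks for the value.
  def ext
    @ext_calls += 1
    EXT
  end

  # Interpolates (calls ext and builds the pattern) on every call.
  def slow
    @s.check(/[a-z]*#{ext}/)
  end

  # /o: interpolates only the first time this literal is evaluated.
  def once
    @s.check(/[a-z]*#{ext}/o)
  end

  # No interpolation at call time at all.
  def constant
    @s.check(TAG_MATCH)
  end
end

m = Matcher.new
3.times { m.slow }
slow_calls = m.ext_calls
3.times { m.once }
once_calls = m.ext_calls - slow_calls
3.times { m.constant }
constant_calls = m.ext_calls - slow_calls - once_calls

puts "slow: #{slow_calls}, /o: #{once_calls}, constant: #{constant_calls}"
```

Note that /o freezes the first interpolated value for the lifetime of the process, so it is only safe when the value really is constant; the module constant makes that assumption explicit.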
  20. Fix: StringScanner#captures
    https://github.com/ruby/strscan/pull/72 (merged)
    Before (not yet documented):
        s = StringScanner.new('foobarbaz') #=> #<StringScanner 0/9 @ "fooba...">
        s.scan(/(foo)(bar)(BAZ)?/)         #=> "foobar"
        s.captures                         #=> ["foo", "bar", ""]
        s.captures.compact                 #=> ["foo", "bar", ""]
    After (MatchData#captures like):
        s = StringScanner.new('foobarbaz') #=> #<StringScanner 0/9 @ "fooba...">
        s.scan(/(foo)(bar)(BAZ)?/)         #=> "foobar"
        s.captures                         #=> ["foo", "bar", nil]
        s.captures.compact                 #=> ["foo", "bar"]
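For reference, the MatchData behavior that the fix on slide 20 aligns with can be checked directly. The sketch below deliberately avoids asserting on StringScanner#captures itself, since its result depends on the installed strscan version; the variable names are illustrative.

```ruby
require 'strscan'

pattern = /(foo)(bar)(BAZ)?/

# MatchData#captures reports an optional group that did not participate as nil.
md_captures = pattern.match('foobarbaz').captures

s = StringScanner.new('foobarbaz')
matched = s.scan(pattern)
third_group = s[3] # index access also yields nil for the unmatched group

p md_captures         # ["foo", "bar", nil]
p md_captures.compact # ["foo", "bar"]
p matched             # "foobar"
p third_group         # nil

# On a strscan release that includes the fix, s.captures is expected to equal
# md_captures; older releases return "" for the unmatched group instead.
```

Returning nil instead of an empty string is what lets compact drop unmatched groups, and it distinguishes a group that matched nothing from a group that did not match.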
  21. Fix: StringScanner.new('')
    https://github.com/ruby/strscan/issues/78 (fixed)
    Bug: In JRuby, a scanner created with StringScanner.new('') can only hold US-ASCII encoded strings.
        s = StringScanner.new('')
        s << XXX
  22. Next Actions
    * I would like to support CSS selectors in REXML.
    * StringScanner documentation is scarce, e.g. StringScanner#captures.