Improved REXML XML parsing performance using StringScanner

NAITOH Jun
A member of Red Data Tools. Software Engineer at MedPeer, Inc.

May 15th - 17th, 2024 NAHA CULTURAL ARTS THEATER NAHArt, Okinawa, Japan

An after event will be held on 5/28 (Tue)! https://medpeer.connpass.com/event/316741/

Motivation

I am the author of RBPDF, a PDF export library gem (https://github.com/naitoh/rbpdf). I would like to support SVG in RBPDF using REXML, which is easy to install. (SVG is an image file format described in XML.) However, REXML is slower than C extension gems, so I would like to improve REXML performance.

Check the current status of REXML. (Before improvement)

Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6, YJIT disabled)

[Chart: processing time against the number of lines. One annotation reads "Less than 0.01 second".]

* dom: 65.60x slower
* pull (SAX): 21.56x slower

Input XML:

    <?xml version="1.0"?>
    <root>
      <child id0="0" id1="0" />
      <child id0="1" id1="1" />
      <child id0="2" id1="2" />
      <child id0="3" id1="3" />
      :
      <child id0="9999" id1="9999" />
    </root>

Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6, YJIT disabled): DOM parser

* dom: 65.60x slower
* pull (SAX): 21.56x slower

rexml (dom) is a DOM parser that can easily retrieve desired elements by specifying an XPath, for example /root/child[@id0="1"] on the input XML shown earlier. It is 65x slower. The DOM parser caches and retains all parsed results, so it is not memory efficient for large XML.
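The DOM approach described above can be sketched as follows. This is a minimal illustration, not the benchmark script from the talk: the helper names `sample_xml` and `find_id1_with_dom` are hypothetical, and the input is shrunk to a few children.

```ruby
require "rexml/document"

# Build a small XML document shaped like the benchmark input.
# (Hypothetical helper; the talk used 10,000 child elements.)
def sample_xml(count)
  children = (0...count).map { |i| %(  <child id0="#{i}" id1="#{i}" />) }
  ['<?xml version="1.0"?>', "<root>", *children, "</root>"].join("\n")
end

# Parse the whole document into an in-memory tree, then query it with XPath.
def find_id1_with_dom(xml, id0)
  doc = REXML::Document.new(xml)
  element = REXML::XPath.first(doc, %(/root/child[@id0="#{id0}"]))
  element && element.attributes["id1"]
end

puts find_id1_with_dom(sample_xml(5), 1) # prints 1
```

The whole tree stays in memory for as long as `doc` is referenced, which is the memory cost mentioned above.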

Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6, YJIT disabled): SAX parser

* dom: 65.60x slower
* pull (SAX): 21.56x slower

rexml (pull) is a SAX parser. It parses XML sequentially from the beginning, retrieving the necessary information line by line. It is 21x slower. The SAX parser does not need to retain parsed results, so it is memory efficient even with large XML.
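Sequential parsing of this kind can be sketched with REXML's pull parser. This is an illustration under assumptions, not the benchmark code from the talk: the helper names `small_xml` and `find_id1_with_pull` are hypothetical.

```ruby
require "rexml/parsers/pullparser"

# Build a small XML document shaped like the benchmark input.
# (Hypothetical helper; the talk used 10,000 child elements.)
def small_xml(count)
  children = (0...count).map { |i| %(  <child id0="#{i}" id1="#{i}" />) }
  ['<?xml version="1.0"?>', "<root>", *children, "</root>"].join("\n")
end

# Walk the events one by one and stop as soon as the wanted element is seen.
# Nothing is retained between events, so memory use stays small.
def find_id1_with_pull(xml, id0)
  parser = REXML::Parsers::PullParser.new(xml)
  while parser.has_next?
    event = parser.pull
    next unless event.start_element? && event[0] == "child"

    attributes = event[1] # Hash of attribute name => value
    return attributes["id1"] if attributes["id0"] == id0.to_s
  end
  nil
end

puts find_id1_with_pull(small_xml(5), 1) # prints 1
```

Unlike the DOM version, the loop can return before the rest of the input has been parsed.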

Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6): YJIT disabled → YJIT enabled

* dom: 65.60x slower → 44.20x slower
* pull (SAX): 21.56x slower → 14.64x slower

Chart labels: -32%, -31%.
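To reproduce a "YJIT enabled" measurement, YJIT can be turned on with the `ruby --yjit` command-line option, or at run time with `RubyVM::YJIT.enable` on Ruby 3.3 and later. The sketch below only reports whether YJIT is active in the current process. The helper name `yjit_enabled?` is hypothetical.

```ruby
# Report whether YJIT is active in the current process.
# RubyVM::YJIT is not defined on every Ruby build, so check for it first.
def yjit_enabled?
  defined?(RubyVM::YJIT) ? RubyVM::YJIT.enabled? : false
end

puts "Ruby #{RUBY_VERSION}, YJIT enabled: #{yjit_enabled?}"
```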

How can I improve from here?

In "Better CSV processing with Ruby 2.6," there was a suggestion to use StringScanner for REXML.
See: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2019/

REXML is implemented with the Regexp class. Can this be replaced by StringScanner to speed up the process? 🤔

Is Regexp faster than StringScanner? 🤔

StringScanner is 1.67x faster than Regexp in a simple case. 👍

test.yaml:

    prelude: |
      require 'strscan'
      str = 'test string'
      s = StringScanner.new(str)
      re = /\A\w+/
    benchmark:
      's.check(/\w+/)': s.check(/\w+/)
      're.match(str)': re.match(str)

Result:

    $ benchmark-driver test.yaml
    Calculating -------------------------------------
          s.check(/\w+/)      9.936M i/s -     26.808M times in 2.698060s (100.64ns/i)
           re.match(str)      5.687M i/s -     17.161M times in 3.017700s (175.85ns/i)

    Comparison:
          s.check(/\w+/):   9916759.6 i/s
           re.match(str):   5938111.5 i/s - 1.67x slower

How to process in StringScanner.

Regexp is a batch process, so a simple comparison is not possible:

    > /\A(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/.match('This is an example string')
    => #<MatchData "This is an example string" 1:"This" 2:"is" 3:"an" 4:"example" 5:"string">
    > # $1 => "This", $2 => "is", $3 => "an", $4 => "example", $5 => "string"

StringScanner processes the string while moving the pointer from the beginning of the string:

    s = StringScanner.new('This is an example string')
    p s.scan(/\w+/) #=> "This"
    p s.scan(/\w+/) #=> nil
    p s.scan(/\s+/) #=> " "
    p s.scan(/\w+/) #=> "is"

[Diagram: the characters of "This is an example ..." laid out one by one, with the scan pointer.]
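The two styles above can be put side by side in a small sketch. The helper names `words_with_scanner` and `words_with_regexp` are hypothetical and only illustrate the contrast. This is not REXML's parsing code.

```ruby
require "strscan"

# Incremental style: consume the input token by token from the current pointer.
def words_with_scanner(str)
  scanner = StringScanner.new(str)
  words = []
  until scanner.eos?
    scanner.skip(/\s+/)         # consume whitespace, if any
    word = scanner.scan(/\w+/)  # matches only at the current pointer
    break unless word           # stop on end of input or an unexpected character
    words << word
  end
  words
end

# Batch style: one regular expression has to describe the whole input up front.
def words_with_regexp(str)
  match = /\A(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/.match(str)
  match ? match.captures : []
end

p words_with_scanner("This is an example string")
# ["This", "is", "an", "example", "string"]
p words_with_regexp("This is an example string")
# ["This", "is", "an", "example", "string"]
```

The scanner version handles any number of words, while the batch pattern is fixed at five. That difference is why the two are hard to compare with a single micro-benchmark.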

What I did.

* Adding benchmarks.
* Search for optimization targets using a profiler.
* Rewrite the XML parsing process with StringScanner.
* Fixed bugs in StringScanner.
* Fixed XML specification violation in REXML.
* Benchmarking and measuring effectiveness.

Check the current status of REXML. (After improvement)

Processing time for the number of XML lines (Result) (Ruby 3.3.0, YJIT disabled): REXML 3.2.6 → REXML master

* dom: 65.60x slower → 60.37x slower
* pull (SAX): 21.56x slower → 17.56x slower

Chart labels: -6%, -17%.

Processing time for the number of XML lines (Result) (Ruby 3.3.0, YJIT enabled): REXML 3.2.6 → REXML master

* dom: 44.20x slower → 35.92x slower
* pull (SAX): 14.64x slower → 8.63x slower 🎉

Chart labels: -23%, -44%.

Summary

* Use SAX (pull parser) instead of dom if you need high performance.
* Enable YJIT.
* Use the latest REXML (3.2.7+).

By optimizing with StringScanner, the difference with libxml improved from 65x to 8.6x. 🎉

Optimization Points

If a method contains a regular expression with interpolation, it is slow because the interpolation is expanded each time the method is called. If #{ext} is constant, s.check(/[a-z]*#{ext}/o) reduces the expansion to the first call only. In Ruby 3.3 with YJIT, it is faster to declare the pattern as a module constant (evaluated only once, at class initialization).

Slow:

    @s = StringScanner.new('foobar')
    def foo
      @s.check(/[a-z]*#{ext}/)
    end

Fast:

    @s = StringScanner.new('foobar')
    def foo
      @s.check(/[a-z]*#{ext}/o)
    end

Fast (for YJIT):

    @s = StringScanner.new('foobar')
    TAG_MATCH = /[a-z]*#{ext}/
    def foo
      @s.check(TAG_MATCH)
    end
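The difference between the three variants can be observed through object identity. The sketch below assumes a fixed suffix "bar", and the names `EXT`, `dynamic_pattern`, and `once_pattern` are hypothetical. It shows how often the Regexp is built, not how fast each variant runs.

```ruby
require "strscan"

EXT = "bar"                # assumed constant suffix, for illustration only
TAG_MATCH = /[a-z]*#{EXT}/ # interpolated once, when the constant is defined

# Interpolation without /o: a new Regexp object is built on every call.
def dynamic_pattern(ext)
  /[a-z]*#{ext}/
end

# Interpolation with /o: the Regexp is built on the first call and then reused.
def once_pattern(ext)
  /[a-z]*#{ext}/o
end

puts dynamic_pattern("bar").equal?(dynamic_pattern("bar")) # false
puts once_pattern("bar").equal?(once_pattern("bar"))       # true
puts once_pattern("baz").source                            # [a-z]*bar
puts StringScanner.new("foobar").check(TAG_MATCH)          # foobar
```

The third line of output shows the caveat of `/o`: later arguments are ignored, so it is only safe when the interpolated value never changes.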

Fix: StringScanner#captures
https://github.com/ruby/strscan/pull/72 (merged)

Before (not yet documented):

    s = StringScanner.new('foobarbaz') #=> #<StringScanner 0/9 @ "fooba...">
    s.scan(/(foo)(bar)(BAZ)?/)         #=> "foobar"
    s.captures                         #=> ["foo", "bar", ""]
    s.captures.compact                 #=> ["foo", "bar", ""]

After (MatchData#captures like):

    s = StringScanner.new('foobarbaz') #=> #<StringScanner 0/9 @ "fooba...">
    s.scan(/(foo)(bar)(BAZ)?/)         #=> "foobar"
    s.captures                         #=> ["foo", "bar", nil]
    s.captures.compact                 #=> ["foo", "bar"]
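Code that has to run on strscan versions from both before and after this fix could collect the groups through StringScanner#[] instead. That method returns nil for a group that did not take part in the match. This is a sketch of one possible workaround, not part of the fix itself, and the helper name `captures_with_nil` is hypothetical.

```ruby
require "strscan"

# Collect capture groups one by one via StringScanner#[].
# StringScanner#size counts the full match as well, so groups are 1...size.
def captures_with_nil(scanner)
  (1...scanner.size).map { |i| scanner[i] }
end

s = StringScanner.new("foobarbaz")
s.scan(/(foo)(bar)(BAZ)?/)
p captures_with_nil(s)         # ["foo", "bar", nil]
p captures_with_nil(s).compact # ["foo", "bar"]

# MatchData#captures gives the same result for the same pattern.
p(/(foo)(bar)(BAZ)?/.match("foobarbaz").captures) # ["foo", "bar", nil]
```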

Fix: StringScanner.new('')
https://github.com/ruby/strscan/issues/78 (fixed)

Bug: In JRuby, StringScanner.new('') can only hold the US-ASCII encoding.

    s = StringScanner.new('')
    s << XXX

Next Actions

* I would like to support CSS selectors in REXML.
* StringScanner documentation is scarce (e.g. StringScanner#captures).

