Improved REXML XML parsing performance using StringScanner

Slide 1

Slide 1 text

Improved REXML XML parsing performance using StringScanner
NAITOH Jun
A member of Red Data Tools. Software Engineer at MedPeer, Inc.

Slide 2

Slide 2 text

May 15th - 17th, 2024 NAHA CULTURAL ARTS THEATER NAHArt, Okinawa, Japan
After event on 5/28 (Tue)! https://medpeer.connpass.com/event/316741/

Slide 3

Slide 3 text

Motivation
* I am the author of RBPDF, a PDF export library gem (https://github.com/naitoh/rbpdf).
* I would like to support SVG in RBPDF using REXML, which is easy to install. (SVG is an image file format described in XML.)
* REXML is slower than C extension gems.
* I would like to improve REXML performance.

Slide 4

Slide 4 text

Check the current status of REXML. (Before improvement)

Slide 5

Slide 5 text

Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6, YJIT disabled)
[Chart: processing time by number of lines. Annotation: "Less than 0.01 second"]
* dom: 65.60x slower
* pull(SAX): 21.56x slower

Slide 6

Slide 6 text

Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6, YJIT disabled)
* dom: 65.60x slower
* pull(SAX): 21.56x slower
rexml(dom) is a DOM parser that can easily retrieve desired elements by specifying an XPath, e.g. /root/child[@id0="1"]. It is 65x slower.
A DOM parser caches and retains all parsed results, so it is not memory efficient for large XML.
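As a rough illustration of the DOM style described above, the following sketch parses a small document and looks up one element with the XPath from the slide. The sample XML and the variable names are assumptions made for this example; they are not the benchmark data.

```ruby
require 'rexml/document'

# Hypothetical sample input; the real benchmark XML is not shown in the slides.
xml = <<~XML
  <root>
    <child id0="1">first</child>
    <child id0="2">second</child>
  </root>
XML

# The DOM parser builds and keeps the whole tree in memory.
doc = REXML::Document.new(xml)

# Retrieve the desired element by XPath.
element = REXML::XPath.first(doc, '/root/child[@id0="1"]')
puts element.text
```

The whole tree stays reachable through `doc`, which is what makes XPath lookups convenient and large inputs expensive.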

Slide 7

Slide 7 text

Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6, YJIT disabled)
* dom: 65.60x slower
* pull(SAX): 21.56x slower
rexml(pull) is a SAX parser. It parses XML sequentially from the beginning, retrieving the necessary information one line at a time. It is 21x slower.
The SAX parser does not need to retain parsed results, so it is memory efficient even with large XML.
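For comparison, here is a minimal sketch of the sequential style using REXML's pull parser. The sample XML and the `ids` array are assumptions for illustration only.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical sample input.
xml = '<root><child id0="1">first</child><child id0="2">second</child></root>'

parser = REXML::Parsers::PullParser.new(xml)
ids = []

# Events are pulled one at a time; nothing is kept unless we store it ourselves.
while parser.has_next?
  event = parser.pull
  # For a start_element event, event[0] is the tag name and event[1] the attributes.
  ids << event[1]['id0'] if event.start_element? && event[0] == 'child'
end

p ids
```

Only the values pushed into `ids` are retained, so memory use does not grow with the size of the document itself.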

Slide 8

Slide 8 text

Processing time for the number of XML lines (Ruby 3.3.0, REXML 3.2.6): YJIT disabled → YJIT enabled
* dom: 65.60x slower → 44.20x slower (-32%)
* pull(SAX): 21.56x slower → 14.64x slower (-31%)

Slide 9

Slide 9 text

How can I improve from here?

Slide 10

Slide 10 text

In "Better CSV processing with Ruby 2.6," there was a suggestion to use StringScanner for REXML. See: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2019/

Slide 11

Slide 11 text

REXML's parsing is implemented with the Regexp class. Can this be replaced by StringScanner to speed up the process? 🤔

Slide 12

Slide 12 text

Is Regexp faster than StringScanner? 🤔

test.yaml:

    prelude: |
      require 'strscan'
      str = 'test string'
      s = StringScanner.new(str)
      re = /\A\w+/
    benchmark:
      's.check(/\w+/)': s.check(/\w+/)
      're.match(str)' : re.match(str)

Result:

    $ benchmark-driver test.yaml
    Calculating -------------------------------------
          s.check(/\w+/)      9.936M i/s -     26.808M times in 2.698060s (100.64ns/i)
           re.match(str)      5.687M i/s -     17.161M times in 3.017700s (175.85ns/i)

    Comparison:
          s.check(/\w+/):   9916759.6 i/s
           re.match(str):   5938111.5 i/s - 1.67x  slower

StringScanner is 1.67x faster than Regexp in a simple case. 👍
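The same comparison can be approximated in plain Ruby without benchmark-driver. This is only a sketch: `elapsed` is a hypothetical helper, the iteration count is arbitrary, and the timings will differ from the numbers above depending on the machine and Ruby build.

```ruby
require 'strscan'

str = 'test string'
s   = StringScanner.new(str)
re  = /\A\w+/

# Both calls find the same leading word. check matches at the scan pointer
# and does not advance it, so it can be repeated safely.
scanner_result = s.check(/\w+/)
regexp_result  = re.match(str)[0]

# Hypothetical helper: seconds taken to run the block n times.
def elapsed(n)
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  n.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

n = 200_000
scanner_time = elapsed(n) { s.check(/\w+/) }
regexp_time  = elapsed(n) { re.match(str) }
puts format('s.check: %.3fs, re.match: %.3fs', scanner_time, regexp_time)
```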

Slide 13

Slide 13 text

How to process in StringScanner.

Regexp is a batch process, so a simple comparison is not possible.

    > /\A(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/.match('This is an example string')
    => #<MatchData ...>
    > # $1 => "This", $2 => "is", $3 => "an", $4 => "example", $5 => "string"

StringScanner processes the string while moving the pointer from the beginning of the string.

    s = StringScanner.new('This is an example string')
    p s.scan(/\w+/) #=> "This"
    p s.scan(/\w+/) #=> nil
    p s.scan(/\s+/) #=> " "
    p s.scan(/\w+/) #=> "is"
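Extending the pointer-based processing above into a loop gives a small tokenizer. This is an illustrative sketch, not code from REXML; the `words` array and the skip rule are assumptions.

```ruby
require 'strscan'

s = StringScanner.new('This is an example string')
words = []

until s.eos?
  if (word = s.scan(/\w+/))
    # scan matched at the pointer and advanced past the word.
    words << word
  else
    # Skip whitespace, or any other single character, so the loop always advances.
    s.scan(/\s+/) || s.getch
  end
end

p words
```

Each `scan` only looks at the text starting at the current pointer, so the string is traversed once from left to right.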

Slide 14

Slide 14 text

What I did.
* Added benchmarks.
* Searched for optimization targets using a profiler.
* Rewrote the XML parsing process with StringScanner.
* Fixed bugs in StringScanner.
* Fixed XML specification violations in REXML.
* Benchmarked and measured the effectiveness.

Slide 15

Slide 15 text

Check the current status of REXML. (After improvement)

Slide 16

Slide 16 text

Processing time for the number of XML lines (Result) (Ruby 3.3.0, YJIT disabled): REXML 3.2.6 → REXML master
* dom: 65.60x slower → 60.37x slower (-6%)
* pull(SAX): 21.56x slower → 17.56x slower (-17%)

Slide 17

Slide 17 text

Processing time for the number of XML lines (Result) (Ruby 3.3.0, YJIT enabled): REXML 3.2.6 → REXML master
* dom: 44.20x slower → 35.92x slower (-23%)
* pull(SAX): 14.64x slower → 8.63x slower (-44%) 🎉

Slide 18

Slide 18 text

Summary
* Use SAX (pull parser) instead of dom if you need high performance.
* Enable YJIT.
* Use the latest REXML (3.2.7+).
By optimizing with StringScanner, the difference with libxml improved from 65x slower (dom, REXML 3.2.6, YJIT disabled) to 8.6x slower (pull, REXML master, YJIT enabled). 🎉
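A minimal sketch of applying two of these recommendations at program start. It assumes Ruby 3.3 or later for `RubyVM::YJIT.enable` and guards the call so it is skipped elsewhere; `rexml_is_recent` is a name chosen for this example.

```ruby
require 'rexml/document'

# Enable YJIT at runtime where the build supports it.
# (Passing --yjit on the command line is the other common way.)
if defined?(RubyVM::YJIT) && RubyVM::YJIT.respond_to?(:enable)
  RubyVM::YJIT.enable
end

# Check whether the installed REXML includes the StringScanner-based parser work.
rexml_is_recent = Gem::Version.new(REXML::VERSION) >= Gem::Version.new('3.2.7')
warn "REXML #{REXML::VERSION} is older than 3.2.7" unless rexml_is_recent
```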

Slide 19

Slide 19 text

Optimization Points

If a regular expression inside a method contains interpolation, it is slow because the interpolation is expanded each time the method is called.
If #{ext} is constant, s.check(/[a-z]*#{ext}/o) reduces the expansion to the first call only.
With YJIT in Ruby 3.3, it is faster to define the pattern as a module constant (expanded only once, at class initialization).

Slow:

    @s = StringScanner.new('foobar')
    def foo
      @s.check(/[a-z]*#{ext}/)
    end

Fast:

    @s = StringScanner.new('foobar')
    def foo
      @s.check(/[a-z]*#{ext}/o)
    end

Fast (for YJIT):

    @s = StringScanner.new('foobar')
    TAG_MATCH = /[a-z]*#{ext}/
    def foo
      @s.check(TAG_MATCH)
    end
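The three variants can be observed directly by checking whether a new Regexp object is produced on each call. In this sketch the module `TagPatterns`, the constant `EXT`, and the method names are hypothetical stand-ins for the slide's `ext`; it demonstrates the behavior, not the speed difference.

```ruby
require 'strscan'

module TagPatterns
  EXT = 'bar'                  # hypothetical constant suffix
  TAG_MATCH = /[a-z]*#{EXT}/   # interpolated once, when the module body runs

  # Interpolated on every call: a new Regexp object each time.
  def self.dynamic(ext)
    /[a-z]*#{ext}/
  end

  # /o: interpolated on the first call only, then the same object is reused.
  def self.once(ext)
    /[a-z]*#{ext}/o
  end
end

s = StringScanner.new('foobar')
dynamic_hit  = s.check(TagPatterns.dynamic('bar'))
once_hit     = s.check(TagPatterns.once('bar'))
constant_hit = s.check(TagPatterns::TAG_MATCH)

fresh_each_call = !TagPatterns.dynamic('bar').equal?(TagPatterns.dynamic('bar'))
# Later arguments are ignored by /o, which is why ext must really be constant.
cached_by_o = TagPatterns.once('bar').equal?(TagPatterns.once('zzz'))
```

Because `/o` freezes the first interpolated value, passing a different `ext` later silently has no effect; the constant form makes that one-time evaluation explicit.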

Slide 20

Slide 20 text

Fix: StringScanner#captures
https://github.com/ruby/strscan/pull/72 (merged)

Before (not yet documented):

    s = StringScanner.new('foobarbaz') #=> #<StringScanner ...>
    s.scan /(foo)(bar)(BAZ)?/          #=> "foobar"
    s.captures                         #=> ["foo", "bar", ""]
    s.captures.compact                 #=> ["foo", "bar", ""]

After (MatchData#captures like):

    s = StringScanner.new('foobarbaz') #=> #<StringScanner ...>
    s.scan /(foo)(bar)(BAZ)?/          #=> "foobar"
    s.captures                         #=> ["foo", "bar", nil]
    s.captures.compact                 #=> ["foo", "bar"]
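Code that has to run on strscan versions both before and after this fix can normalize the result itself. The helper `matched_captures` below is hypothetical and written only for illustration; note that it also drops groups that legitimately matched an empty string.

```ruby
require 'strscan'

# Hypothetical helper: keep only groups that captured some text,
# whether an unmatched group is reported as "" (old) or nil (new).
def matched_captures(scanner)
  scanner.captures.reject { |c| c.nil? || c.empty? }
end

s = StringScanner.new('foobarbaz')
s.scan(/(foo)(bar)(BAZ)?/)
groups = matched_captures(s)

# MatchData#captures, the behavior the fix aligns with, reports nil for an unmatched group.
md_groups = /(foo)(bar)(BAZ)?/.match('foobarbaz').captures
```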

Slide 21

Slide 21 text

Fix: StringScanner.new('')
https://github.com/ruby/strscan/issues/78 (fixed)

Bug: In JRuby, a scanner created with StringScanner.new('') could only hold US-ASCII encoded strings.

    s = StringScanner.new('')
    s << XXX
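The pattern behind this report, starting from an empty scanner and appending non-ASCII text, looks roughly like the sketch below. The appended string is an assumption standing in for `XXX`, and the behavior described is what CRuby does; the exact failure mode on the affected JRuby versions is not shown in the slide.

```ruby
require 'strscan'

# Start from an empty, mutable string (+'' avoids appending to a frozen literal).
s = StringScanner.new(+'')

# Append UTF-8 text afterwards, as an incremental parser would.
s << "\u3042\u3044\u3046<a/>"   # "あいう<a/>"

leading = s.scan(/\p{Hiragana}+/)
encoding = s.string.encoding
```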

Slide 22

Slide 22 text

Next Actions
* I would like to support CSS selectors in REXML.
* StringScanner documentation is scarce (e.g. StringScanner#captures).

Slide 23

Slide 23 text

After event on 5/28 (Tue)! https://medpeer.connpass.com/event/316741/