Okinawa, Japan

Motivation

* I am the author of RBPDF, a PDF export library gem (https://github.com/naitoh/rbpdf).
* I would like to support SVG in RBPDF using REXML, which is easy to install. (SVG is an image file format described in XML.)
* REXML is slower than C extension gems.
* I would like to improve REXML performance.
Processing time for the number of XML lines (Ruby3.3.0)

/> <child id0="2" id1="2" />
<child id0="3" id1="3" />
:
<child id0="9999" id1="9999" />
</root>

XPath: /root/child[@id0="1"]

* dom: 65.60x slower
* pull(SAX): 21.56x slower
(REXML 3.2.6, YJIT disabled)

rexml(dom) is a DOM parser that can easily retrieve desired elements by specifying an XPath. It is 65x slower than libxml.
The DOM parser caches and retains all parsed results, so it is not memory efficient for large XML.
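The DOM lookup described above can be sketched as follows. This is a minimal illustration, not the benchmark code from the talk: the helper names `build_sample_xml` and `find_child_with_dom` are hypothetical, and the sample uses 10 children instead of 10,000 to keep it quick.

```ruby
require 'rexml/document'

# Hypothetical helper: build a small XML document shaped like the sample above.
def build_sample_xml(count)
  children = (0...count).map { |i| %(<child id0="#{i}" id1="#{i}" />) }
  %(<?xml version="1.0"?>\n<root>\n#{children.join("\n")}\n</root>\n)
end

# Hypothetical helper: parse the whole document into a tree, then query it by XPath.
def find_child_with_dom(xml, id0)
  doc = REXML::Document.new(xml) # the entire tree is kept in memory
  REXML::XPath.first(doc, %(/root/child[@id0="#{id0}"]))
end

element = find_child_with_dom(build_sample_xml(10), '1')
puts element.attributes['id1']
```

The whole tree is built before the XPath query runs, which is the memory cost the slide refers to.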
<?xml version="1.0"?>
<root>
<child id0="0" id1="0" />
<child id0="1" id1="1" />
<child id0="2" id1="2" />
<child id0="3" id1="3" />
:
<child id0="9999" id1="9999" />
</root>

* dom: 65.60x slower
* pull(SAX): 21.56x slower
(REXML 3.2.6, YJIT disabled)

rexml(pull) is a SAX parser. It is 21x slower than libxml.
It parses XML sequentially from the beginning, retrieving the necessary information line by line.
The SAX parser does not need to retain parsed results, so it is memory efficient even with large XML.
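A sequential lookup with REXML's pull parser might look like the sketch below. The helper names `build_sample_xml` and `find_id1_with_pull` are hypothetical, and the sample is shortened to 10 children; it is an illustration under those assumptions, not the benchmark code from the talk.

```ruby
require 'rexml/document'
require 'rexml/parsers/pullparser'

# Hypothetical helper: build a small XML document shaped like the sample above.
def build_sample_xml(count)
  children = (0...count).map { |i| %(<child id0="#{i}" id1="#{i}" />) }
  %(<?xml version="1.0"?>\n<root>\n#{children.join("\n")}\n</root>\n)
end

# Hypothetical helper: read events one at a time and stop at the first match.
def find_id1_with_pull(xml, id0)
  parser = REXML::Parsers::PullParser.new(xml)
  while parser.has_next?
    event = parser.pull
    # For a start_element event, event[0] is the tag name and event[1] the attribute hash.
    next unless event.start_element? && event[0] == 'child'
    attrs = event[1]
    return attrs['id1'] if attrs['id0'] == id0
  end
  nil
end

puts find_id1_with_pull(build_sample_xml(10), '1')
```

Nothing is retained between events, so memory use stays flat, but the caller has to do the matching that XPath does for the DOM parser.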
How to process in StringScanner.

StringScanner processes the string while moving a pointer from the beginning of the string.

    require 'strscan'

    s = StringScanner.new('This is an example string')
    p s.scan(/\w+/) #=> "This"
    p s.scan(/\w+/) #=> nil
    p s.scan(/\s+/) #=> " "
    p s.scan(/\w+/) #=> "is"

Regexp is a batch process, so a simple comparison is not possible.

    > /\A(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/.match('This is an example string')
    => #<MatchData "This is an example string" 1:"This" 2:"is" 3:"an" 4:"example" 5:"string">
    > # $1 => "This", $2 => "is", $3 => "an", $4 => "example", $5 => "string"
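To make the moving pointer concrete, here is a small sketch that splits a string into words by repeatedly scanning from the current position. The method name `scan_words` is hypothetical and is not part of StringScanner or REXML.

```ruby
require 'strscan'

# Hypothetical helper: collect words by advancing the scan pointer step by step.
def scan_words(text)
  s = StringScanner.new(text)
  words = []
  until s.eos?
    s.skip(/\s+/)           # move the pointer past any whitespace
    word = s.scan(/\w+/)    # returns nil (pointer unchanged) if no word starts here
    break unless word
    words << word
  end
  words
end

p scan_words('This is an example string')
```

Each call to `scan` or `skip` only looks at the text starting at the current pointer, so the input is consumed in one pass without building one large regular expression.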
What I did.

* Adding Benchmarks
* Search for optimization targets using a profiler.
* Rewrite XML parsing process with StringScanner.
* Fixed bugs in StringScanner.
* Fixed XML specification violation in REXML.
* Benchmarking and measuring effectiveness.
Summary

* Use SAX (pull parser) instead of dom if you need high performance
* Enable YJIT
* Use the latest REXML (3.2.7+)
* By optimizing with StringScanner, the difference with libxml improved from 65x to 8.6x. 🎉
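A quick way to check two of these recommendations in a running process is sketched below. The helper name `rexml_environment_report` is hypothetical. YJIT itself is normally turned on with `ruby --yjit`, or with `RubyVM::YJIT.enable` on Ruby 3.3 and later; this sketch only reports the current state and assumes RubyGems is loaded.

```ruby
require 'rexml/document'

# Hypothetical helper: report the REXML version and whether YJIT is active.
def rexml_environment_report(min_version = '3.2.7')
  # RubyVM::YJIT is only defined on CRuby builds that include YJIT.
  yjit = (defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled?) ? true : false
  {
    rexml_version: REXML::VERSION,
    rexml_recent: Gem::Version.new(REXML::VERSION) >= Gem::Version.new(min_version),
    yjit_enabled: yjit
  }
end

p rexml_environment_report
```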
Optimization Points

If a regular expression literal inside a method contains interpolation, it is slow because the interpolation is expanded each time the method is called.
If #{ext} is constant, the /o option, as in @s.check(/[a-z]*#{ext}/o), reduces the expansion to only the first call.
In Ruby 3.3 YJIT, it is faster to declare the pattern in a module constant (evaluated only once at class initialization).

Slow

    @s = StringScanner.new('foobar')
    def foo
      @s.check(/[a-z]*#{ext}/)   # Regexp rebuilt on every call
    end

Fast

    @s = StringScanner.new('foobar')
    def foo
      @s.check(/[a-z]*#{ext}/o)  # interpolated only on the first call
    end

Fast (for YJIT)

    @s = StringScanner.new('foobar')
    TAG_MATCH = /[a-z]*#{ext}/   # built once when the constant is defined
    def foo
      @s.check(TAG_MATCH)
    end
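The difference between the three forms can be observed through object identity, as in the sketch below. The names `EXT`, `dynamic_pattern`, and `once_pattern` are hypothetical and chosen for illustration; only `TAG_MATCH` appears on the slide.

```ruby
require 'strscan'

EXT = 'bar'                  # hypothetical constant suffix
TAG_MATCH = /[a-z]*#{EXT}/   # interpolated once, when the constant is defined

# A new Regexp object is built on every call.
def dynamic_pattern(ext)
  /[a-z]*#{ext}/
end

# With /o the interpolation happens on the first call only; later arguments are ignored.
def once_pattern(ext)
  /[a-z]*#{ext}/o
end

first = once_pattern('bar')
p dynamic_pattern('bar').equal?(dynamic_pattern('bar')) # different objects each call
p first.equal?(once_pattern('baz'))                     # same cached object
p StringScanner.new('foobar').check(TAG_MATCH)
```

Note that /o silently keeps the first interpolated value, so it is only safe when the interpolated part really is constant. A named constant makes that assumption explicit.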
Bug: In JRuby, StringScanner.new('') can only hold US-ASCII encoding.

    s = StringScanner.new('')
    s << XXX

Fix: StringScanner.new('')

https://github.com/ruby/strscan/issues/78 (fixed)