ChicagoRuby: Downtown - Handling String Encoding Failures/Normalization

http://www.meetup.com/ChicagoRuby/events/224393256/

Ruby 1.9 and later is encoding-aware. It tracks an external encoding and an internal encoding, which it uses to process input and output. Each source file has an encoding, too. The most common place we encounter encodings is when we read, write, or otherwise manipulate strings. There are *a lot* of gotchas in working with strings and the exceptions they may raise.

Learn how to use string encodings and how to handle any encoding issues as you follow my journey from just installing `rack-utf8_sanitizer` through writing comprehensive (passing) tests around RSpec's EncodedString. Works on Windows, too!

Benjamin Fleischer

November 03, 2015

Transcript

  1. EncodedString Working with encodings in Ruby without fear of exceptions

    Benjamin Fleischer bf@benjaminfleischer.com gh: bf4, twitter: hazula Chicago Ruby 2015
  2. Have you ever had trouble with String Encoding?

    • Gotten exceptions?
    • That were hard to debug?
  3. PopQuiz: What’s encoded in Ruby?

    • Strings and Regexen: "Rüby".encoding.name #=> "UTF-8"
    • Source File: __ENCODING__.name #=> "UTF-8"
    • Environment:
      • External: Encoding.default_external #=> "US-ASCII"
      • Internal: Encoding.default_internal #=> "UTF-8"
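A minimal sketch of the pop quiz above, runnable as a script. The variable names are illustrative, and the two environment values depend on your locale and on any `-E` or `-U` flags, so the output will differ between machines.

```ruby
# "\u00FC" is the "ü" in "Rüby"; a literal with \u escapes is always UTF-8.
str = "R\u00FCby"
string_encoding = str.encoding.name

# Regexps carry an encoding too. ASCII-only regexp literals are US-ASCII.
unicode_regexp_encoding = /R\u00FCby/.encoding.name
ascii_regexp_encoding   = /Ruby/.encoding.name

# Encoding of the current source file (UTF-8 by default since Ruby 2.0).
source_encoding = __ENCODING__.name

# Environment-dependent values. Both are Encoding objects;
# default_internal is nil when it has not been set.
external = Encoding.default_external
internal = Encoding.default_internal

puts "string:   #{string_encoding}"
puts "regexp:   #{unicode_regexp_encoding} / #{ascii_regexp_encoding}"
puts "source:   #{source_encoding}"
puts "external: #{external.name}"
puts "internal: #{internal ? internal.name : 'nil (unset)'}"
```

Note that `Encoding.default_external` returns an `Encoding` object; call `.name` on it to get the string form shown on the slide.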
  4. PopQuiz: What’s external encoding?

    • The default external encoding is initialized by the locale or -E option. https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c#L1398
    • f = File.open("example.txt"); [f.external_encoding.name, f.read.encoding.name] #=> ["UTF-8", "UTF-8"]
    • Setting default external to UTF-8
      • *nix: LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
      • windows: chcp 65001
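A sketch of how a file's external encoding shows up in practice. The temp file and the `results` hash are illustrative only. The first case depends on your environment; the other two pass the encoding explicitly in the open mode and assume `Encoding.default_internal` is unset.

```ruby
require "tempfile"

results = {}

Tempfile.create(["example", ".txt"]) do |tmp|
  # Write the UTF-8 bytes for "café" (63 61 66 C3 A9) without any conversion.
  File.binwrite(tmp.path, "caf\u00E9")

  # No encoding given: the IO reports Encoding.default_external.
  File.open(tmp.path) do |f|
    results[:default_external] = f.external_encoding
  end

  # Explicit external encoding: strings read from the IO are tagged with it
  # (assuming Encoding.default_internal is unset).
  File.open(tmp.path, "r:UTF-8") do |f|
    results[:utf8] = [f.external_encoding.name, f.read.encoding.name]
  end

  # "external:internal" pair: bytes are interpreted as ISO-8859-1 and
  # transcoded to UTF-8 while reading.
  File.open(tmp.path, "r:ISO-8859-1:UTF-8") do |f|
    results[:latin1_to_utf8] = [f.external_encoding.name,
                                f.internal_encoding.name,
                                f.read.encoding.name]
  end
end

results.each { |label, value| puts "#{label}: #{value.inspect}" }
```

Passing the encoding in the mode string keeps the result independent of `LANG`, `LC_ALL`, or the Windows code page.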
  5. Encoding Concepts

    • Encoding: str.encoding
    • Transcoding: str.encode("iso-8859-1")
    • https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c
    • https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c
    • https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c
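To make the difference between an encoding tag and transcoding concrete, here is a small sketch. `String#force_encoding` is not on the slide; it is added here only as a contrast, because it changes the tag without touching the bytes.

```ruby
utf8 = "caf\u00E9"                         # UTF-8 bytes: 63 61 66 C3 A9

# Transcoding: String#encode returns a new string with converted bytes.
latin1 = utf8.encode("ISO-8859-1")         # ISO-8859-1 bytes: 63 61 66 E9

# Relabeling: force_encoding keeps the bytes and only changes the tag.
relabeled = latin1.dup.force_encoding("UTF-8")

puts utf8.bytes.inspect                    # [99, 97, 102, 195, 169]
puts latin1.bytes.inspect                  # [99, 97, 102, 233]
puts latin1.encoding.name                  # ISO-8859-1
puts relabeled.valid_encoding?             # false: a lone E9 is not valid UTF-8

# Transcoding back recovers the original string.
round_trip = latin1.encode("UTF-8")
puts round_trip == utf8                    # true
```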
  6. The world before encodings

    • We just had -K
    • Then everything broke
    • https://web.archive.org/web/20120805050559/http://blog.grayproductions.net/articles/ruby_19s_string
    • http://wayback.archive.org/web/20120209160419/http://nuclearsquid.com/writings/ruby-1-9-encodings
    • http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
    • https://github.com/rails/rails/issues/12881 Rails serialized columns
    • http://nerds.airbnb.com/upgrading-from-ree-187-to-ruby-193/
    • http://www.benjaminfleischer.com/2013/06/10/ruby-19-upgrade-and-encoding-hell/
  7. The world after encodings

    • I had a lot of recipes I carried around
    • But I found I could fix a lot of ArgumentError bugs by just using rack-utf8_sanitizer
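For context, this is the kind of ArgumentError meant here: a string tagged as UTF-8 that holds bytes which are not valid UTF-8. The sketch does not show what rack-utf8_sanitizer does internally; `safe_match?` is a hypothetical helper added only for illustration.

```ruby
# Bytes that are not valid UTF-8 but are tagged as UTF-8,
# as untrusted input (for example a request parameter) can be.
input = "caf\xE9".b.force_encoding(Encoding::UTF_8)

puts input.valid_encoding?   # false

# Matching a regexp against the broken string raises ArgumentError.
error = begin
  input =~ /caf/
  nil
rescue ArgumentError => e
  e
end
puts error.class

# Hypothetical guard: check validity before touching the string.
def safe_match?(string, pattern)
  string.valid_encoding? && !(string =~ pattern).nil?
end

puts safe_match?(input, /caf/)    # false, and no exception
puts safe_match?("cafe", /caf/)   # true
```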
  8. Mail gem ArgumentError

    • And I got an exception when running specs against the Mail gem on the Rubinius platform.
  9. Difficulties

    • Failed expectations go through the differ
    • The differ uses EncodedString
    • Failure messages were corrupted.
    • So, I needed a special comparison.
  10. Other concepts

    • BINARY encoding
    • Ruby later added String#scrub to help
    • Dummy Encodings
    • Guess correct encoding? charlock_holmes gem
    • Encoding::UTF_8 vs. Encoding.find("UTF-8") vs. "UTF-8"
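A short sketch touching each of the concepts above except encoding detection, which needs the charlock_holmes gem. It assumes Ruby 2.1 or later for `String#scrub`; the variable names are illustrative.

```ruby
# BINARY is an alias for ASCII-8BIT: raw bytes with no character semantics.
same_constant = Encoding::BINARY.equal?(Encoding::ASCII_8BIT)

raw = "caf\xE9".b                 # String#b returns a binary-tagged copy
raw_valid = raw.valid_encoding?   # true: any byte sequence is valid binary

# The same bytes tagged as UTF-8 are invalid.
broken = raw.dup.force_encoding("UTF-8")
broken_valid = broken.valid_encoding?

# String#scrub replaces invalid byte sequences.
scrubbed_default = broken.scrub        # uses U+FFFD for UTF-8 strings
scrubbed_custom  = broken.scrub("?")

# Dummy encodings have a name, but Ruby cannot process characters in them.
utf16_dummy = Encoding::UTF_16.dummy?
utf8_dummy  = Encoding::UTF_8.dummy?

# Three ways to name an encoding. The constant and Encoding.find give the
# same object; the String is only a name that many methods accept.
found_same   = Encoding.find("UTF-8").equal?(Encoding::UTF_8)
string_equal = (Encoding::UTF_8 == "UTF-8")
by_name      = "abc".encode("UTF-8").encoding
by_constant  = "abc".encode(Encoding::UTF_8).encoding

puts scrubbed_custom                   # caf?
puts [utf16_dummy, utf8_dummy].inspect # [true, false]
puts [found_same, string_equal].inspect # [true, false]
```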
  11. EncodedString

    Slides: https://speakerdeck.com/bf4/normalization
    Working with encodings in Ruby without fear of exceptions
    Benjamin Fleischer bf@benjaminfleischer.com gh: bf4, twitter: hazula Chicago Ruby 2015
  12. Sources: (needs cleanup, kthankxbye)

    • https://web.archive.org/web/20120805050401/http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings
    • https://web.archive.org/web/20120805034228/http://blog.grayproductions.net/articles/understanding_m17n
    • https://web.archive.org/web/20120815112820/http://blog.grayproductions.net/articles/miscellaneous_m17n_details
    • https://web.archive.org/web/20120815131349/http://blog.grayproductions.net/articles/what_ruby_19_gives_us
    • https://web.archive.org/web/20120805050559/http://blog.grayproductions.net/articles/ruby_19s_string
    • http://wayback.archive.org/web/20120209160419/http://nuclearsquid.com/writings/ruby-1-9-encodings
    • http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
    • https://github.com/rails/rails/issues/12881 Rails serialized columns
    • http://nerds.airbnb.com/upgrading-from-ree-187-to-ruby-193/
    • caching http://www.benjaminfleischer.com/2013/06/10/ruby-19-upgrade-and-encoding-hell/

    Result:

    • https://github.com/rspec/rspec-support/compare/19e967a834e5cc9bb9d727ef59c5f580c1a74423%5E...master#diff-6f77530d5756f8c02ca078ddecaa891cR63
    • Change in behavior in rubies
  13. • hazula: Ruby encoding ppl: Experience with ConverterNotFoundError changing behavior in 2.1? https://www.ruby-forum.com/topic/6861247 cc @n0kada @yukihiro_matz @nalsh
    • nalsh: @hazula @n0kada @yukihiro_matz And 2.1 changes String#encode(invalid: :replace) behavior see https://github.com/ruby/ruby/blob/v2_1_0/NEWS#L176
    • n0kada: @nalsh @hazula @yukihiro_matz and Encoding.default_external is not involved at all. replied at the topic.
    • hazula: @n0kada @nalsh @yukihiro_matz Thanks so much! I actually did look at how internal is set when external != locale https://github.com/rspec/rspec-support/commit/db2c3a43e1cdb0fc1491394328f78c712aa9ed19#diff-61bdbe8c6f22bdaffeeacf872bdf7d1bR20 …
    • https://github.com/ruby/ruby/blob/v2_1_0/NEWS#L176
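A sketch of the `String#encode(invalid: :replace)` behavior discussed in this thread, written for Ruby 2.1 or later. As I understand the linked NEWS entry, 2.1 started replacing invalid bytes even when the source and destination encodings are the same; earlier Rubies returned such a string unchanged. The variable names are illustrative.

```ruby
# Invalid UTF-8 bytes tagged as UTF-8.
broken = "caf\xE9".b.force_encoding("UTF-8")

# Same source and destination encoding. On Ruby 2.1+ the invalid byte
# is replaced.
replaced = broken.encode("UTF-8", invalid: :replace, replace: "?")

# From binary to UTF-8, bytes above 0x7F count as "undefined" conversions
# rather than "invalid" ones, so undef: :replace is needed as well.
from_binary = "caf\xE9".b.encode("UTF-8", invalid: :replace,
                                          undef: :replace,
                                          replace: "?")

# Without the options, the same conversion raises.
error = begin
  "caf\xE9".b.encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

puts replaced       # caf?
puts from_binary    # caf?
puts error.class    # Encoding::UndefinedConversionError
```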
  14. • https://github.com/rspec/rspec-support/pull/151#discussion_r22572815
      ▪ I was confusing enc.ascii_compatible? with rb_enc_check
        ◦ where enc_compatible is `Encoding.compatible?(str1,str2)`
    • https://github.com/rspec/rspec-support/pull/151#discussion_r22572359 https://github.com/rspec/rspec-support/pull/151#issuecomment-70045539
    • https://github.com/rspec/rspec-support/pull/151#discussion_r22637355 per http://stackoverflow.com/questions/21289181/char174-returning-the-value-of-char0174-why which links to http://www.theasciicode.com.ar/extended-ascii-code/angle-quotes-guillemets-right-pointing-double-angle-french-quotation-marks-ascii-code-174.html
    • https://github.com/rspec/rspec-support/pull/151#discussion_r22991439 Hmm, looking at http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using and https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c#L540, it appears you can get the current encoding (codepage) on windows in the command prompt by running `chcp` and change it to utf8 via `chcp 65001`.
    • https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c#L540
    • https://github.com/rspec/rspec-support/pull/134
    • Related to https://github.com/rspec/rspec-core/pull/1760
    • https://github.com/rspec/rspec-support/pull/151
    • https://github.com/rspec/rspec-support/pull/152
    • https://github.com/rspec/rspec-support/pull/167
    • https://github.com/rspec/rspec-support/pull/172
    • https://github.com/rspec/rspec-support/pull/173 (closed without merging)
    • https://github.com/rspec/rspec-support/pull/174
    • https://github.com/rspec/rspec-dev/pull/114
    • https://github.com/rspec/rspec-dev/pull/115 (open)
    • https://github.com/rspec/rspec-core/pull/1871 (open)
    • https://github.com/rspec/rspec-support/pull/171 (split into other PRs)
    • https://github.com/rspec/rspec.github.io/pull/65 (open)
    • TBD
    • https://github.com/rspec/rspec-support/pull/176
    • https://github.com/rspec/rspec-support/pull/167
    • Differ tests no longer use Differ to report diff expectation Differ https://github.com/rspec/rspec-support/pull/174 https://github.com/ruby/ruby/blob/aacc35e144/encoding.c#L1741
    • https://github.com/rspec/rspec-support/pull/151#discussion_r22637177 https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289 https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3119-lL312 and see https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4242-L4250 and https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3917-L3973 (and maybe https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289-L4294 etc) And then probably also look at the weirdness that is converter not found: https://github.com/rubyspec/rubyspec/blob/archive/core/string/shared/encode.rb#L96-L101
      myronmarston: Wow, I had no idea just how weird and complex ruby encoding behavior is!
    • https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4242-L4250
    • https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3917-L3973
    • https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289-L4294
    • https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3119-lL3121
    • https://github.com/rubyspec/rubyspec/blob/archive/core/string/shared/encode.rb#L96-L101
    • https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/string/fixtures/utf-8-encoding.rb
    • https://github.com/bf4/rspec-support/commit/db2c3a43e1cdb0fc1491394328f78c712aa9ed19
    • https://github.com/jruby/jruby/blob/c1be61a501d1295fa1ea9f894e1a8e186411f32a/test/mri/ruby/envutil.rb#L150
    • rack invalid https://github.com/ruby/ruby/commit/4a50d447d9618b2e3df126e159aa1d735e429a70

    Refs:

    • https://github.com/rspec/rspec-support/pull/167
    • https://github.com/rspec/rspec-support/pull/151
    • https://github.com/rspec/rspec-support/pull/134#issuecomment-68984440

    Documentation:

    How to test:

    • https://github.com/rspec/rspec-support/pull/171
    • Ruby sources
    • The default external encoding is initialized by the locale or -E option. https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c#L1398
    • Not sure how to do a better test, since locale depends on weird platform-specific stuff https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/encoding/find_spec.rb#L57
    • Encoding.compatible? https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/encoding/compatible_spec.rb#L31
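The difference between `Encoding#ascii_compatible?` and `Encoding.compatible?` can be sketched as follows. The first asks about one encoding; the second asks whether two particular strings can be combined, which depends on their contents as well as their tags. The variable names are illustrative, and the link to `rb_enc_check` in the last comment is my reading of the discussion above, not something verified against the C source.

```ruby
utf8   = "caf\u00E9"                          # non-ASCII content, UTF-8
latin1 = "caf\u00E9".encode("ISO-8859-1")     # non-ASCII content, ISO-8859-1
ascii  = "cafe".encode("US-ASCII")            # ASCII-only content

# A property of a single encoding.
utf8_ascii_compat    = Encoding::UTF_8.ascii_compatible?       # true
latin1_ascii_compat  = Encoding::ISO_8859_1.ascii_compatible?  # true
utf16le_ascii_compat = Encoding::UTF_16LE.ascii_compatible?    # false

# A property of a pair of strings: returns the encoding the combined
# result would have, or nil when the two cannot be combined.
with_ascii  = Encoding.compatible?(utf8, ascii)    # Encoding::UTF_8
with_latin1 = Encoding.compatible?(utf8, latin1)   # nil

# Combining incompatible strings raises. As I understand it, this is the
# failure that the C-level rb_enc_check reports.
concat_error = begin
  utf8 + latin1
  nil
rescue Encoding::CompatibilityError => e
  e
end

puts with_ascii.inspect     # #<Encoding:UTF-8>
puts with_latin1.inspect    # nil
puts concat_error.class     # Encoding::CompatibilityError
```

Both encodings here are ASCII-compatible, yet the two non-ASCII strings still cannot be combined, which is why the two checks are easy to confuse.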