ChicagoRuby: Downtown - Handling String Encoding Failures/Normalization

http://www.meetup.com/ChicagoRuby/events/224393256/

Ruby 1.9 and later is encoding-aware. It keeps a default external encoding and a default internal encoding that it uses when processing input and output, and each source file has an encoding of its own. We most often run into encodings when we read, write, or otherwise manipulate strings. There are *a lot* of gotchas in working with strings, and many exceptions they may raise.

Learn how to use string encodings and how to handle any encoding issues as you follow my journey from just installing `rack-utf8_sanitizer` through writing comprehensive (passing) tests around RSpec's EncodedString. Works on Windows, too!

Benjamin Fleischer

November 03, 2015

Transcript

  1. EncodedString: Working with encodings in Ruby without fear of exceptions • Benjamin Fleischer bf@benjaminfleischer.com gh: bf4, twitter: hazula • Chicago Ruby 2015

  2. Who knows what String Encoding is?

  3. Have you ever had trouble with String Encoding? • Gotten exceptions? • That were hard to debug?

  4. Follow my journey into Ruby’s String Encoding and Converting internals

  5. But first a quiz!

  6. PopQuiz: What’s encoded in Ruby? • Strings and Regexen: "Rüby".encoding.name #=> "UTF-8" • Source File: __ENCODING__.name #=> "UTF-8" • Environment: • External: Encoding.default_external #=> "US-ASCII" • Internal: Encoding.default_internal #=> "UTF-8"

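
The quiz answers above can be checked from any Ruby session. This is a minimal sketch, assuming a source file in the default UTF-8 encoding; `encoding_report` is a name made up for this example. The default external and internal values depend on your locale and command-line flags, so they are printed rather than assumed.

```ruby
# Hypothetical helper: collects the encodings the quiz slide asks about.
def encoding_report
  {
    string: "R\u00FCby".encoding,                # a \u escape always yields a UTF-8 string
    ascii_regexp: /R.by/.encoding,               # ASCII-only regexp literals are US-ASCII
    unicode_regexp: /R\u00FCby/.encoding,        # a \u escape fixes the regexp to UTF-8
    source_file: __ENCODING__,                   # encoding of this source file
    default_external: Encoding.default_external, # taken from the locale or -E
    default_internal: Encoding.default_internal  # nil unless something sets it (e.g. -E or -U)
  }
end

encoding_report.each { |name, enc| puts "#{name}: #{enc.inspect}" }
```
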
  7. PopQuiz: What’s external encoding? • The default external encoding is initialized by the locale or -E option. https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c#L1398 • f = File.open("example.txt"); [f.external_encoding.name, f.read.encoding.name] #=> ["UTF-8", "UTF-8"] • Setting default external to UTF-8 • *nix: LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 • windows: chcp 65001

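
To see external and internal encodings without depending on the machine's locale, you can pass them explicitly in the file mode. The sketch below is an illustration only: `read_with_mode` is a hypothetical helper, and the temp file stands in for the slide's `example.txt`.

```ruby
require "tempfile"

# Hypothetical helper: writes raw bytes to a temp file, reopens it with the
# given mode string, and reports the IO encodings plus what #read returns.
def read_with_mode(bytes, mode)
  Tempfile.create("encoding-demo") do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.close
    File.open(tmp.path, mode) do |f|
      { external: f.external_encoding, internal: f.internal_encoding, text: f.read }
    end
  end
end

# 0xFC is "ü" in ISO-8859-1. The mode "r:ISO-8859-1:UTF-8" reads as
# external:internal, so Ruby transcodes the byte to UTF-8 while reading.
result = read_with_mode("\xFC".b, "r:ISO-8859-1:UTF-8")
p result[:external]       # encoding Ruby assumes for the bytes on disk
p result[:internal]       # encoding the returned strings are transcoded to
p result[:text]           # the transcoded text
p result[:text].encoding
```

When no encodings are given in the mode, the IO falls back to `Encoding.default_external` and `Encoding.default_internal`, which is where the locale comes in.
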
  8. PopQuiz: What’s filesystem encoding? • Not from environment

  9. PopQuiz: What’s internal encoding? • Only set when the locale != the external

  10. PopQuiz: What’s locale encoding? • Very complex

  11. Can I test default external / internal? • Sure, why not?

  12. Can I test default external / internal?

  13. Can I test default external / internal? • No problem, amiright?

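
The code from these three slides is not in the transcript. One common way to test against different defaults is to swap them for the duration of a block and restore them afterwards. The helper below is a sketch under that assumption; `with_default_encodings` is a hypothetical name, not something from RSpec or the talk.

```ruby
# Hypothetical test helper: temporarily changes Ruby's default encodings.
# $VERBOSE is silenced because reassigning the defaults can emit a warning.
def with_default_encodings(external, internal = nil)
  saved_external = Encoding.default_external
  saved_internal = Encoding.default_internal
  saved_verbose  = $VERBOSE
  $VERBOSE = nil
  begin
    Encoding.default_external = external
    Encoding.default_internal = internal
    yield
  ensure
    # Always restore, even if the block raises.
    Encoding.default_external = saved_external
    Encoding.default_internal = saved_internal
    $VERBOSE = saved_verbose
  end
end

with_default_encodings(Encoding::ISO_8859_1, Encoding::UTF_8) do
  p Encoding.default_external
  p Encoding.default_internal
end
p Encoding.default_external # back to the original value
```

These defaults are process-wide, so a helper like this is only safe when tests do not run in parallel threads.
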
  14. Oh, and one more thing:

  15. Encoding Concepts • Encoding: str.encoding • Transcoding: str.encode("iso-8859-1") • https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c • https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c • https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c

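
The difference between an encoding (a label on the bytes) and transcoding (rewriting the bytes) is easiest to see side by side. A minimal sketch; `encoding_vs_transcoding` is a hypothetical name, and `force_encoding` is included only as a contrast to `encode`.

```ruby
# Hypothetical helper contrasting transcoding with relabeling.
def encoding_vs_transcoding(utf8)
  transcoded = utf8.encode("iso-8859-1")              # bytes are rewritten
  relabeled  = utf8.dup.force_encoding("iso-8859-1")  # bytes untouched, label changed
  {
    original_bytes: utf8.bytes,
    transcoded_bytes: transcoded.bytes,
    relabeled_bytes: relabeled.bytes,
    original_length: utf8.length,
    relabeled_length: relabeled.length,
    round_trip: transcoded.encode("UTF-8") == utf8
  }
end

demo = encoding_vs_transcoding("R\u00FCby")
p demo[:original_bytes]    # [82, 195, 188, 98, 121]
p demo[:transcoded_bytes]  # [82, 252, 98, 121]
p demo[:relabeled_length]  # 5, because each byte now counts as one character
```
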
  16. Back to my journey

  17. The world before encodings • We just had -K • Then everything broke • https://web.archive.org/web/20120805050559/http://blog.grayproductions.net/articles/ruby_19s_string • http://wayback.archive.org/web/20120209160419/http://nuclearsquid.com/writings/ruby-1-9-encodings • http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/ • https://github.com/rails/rails/issues/12881 Rails serialized columns • http://nerds.airbnb.com/upgrading-from-ree-187-to-ruby-193/ • http://www.benjaminfleischer.com/2013/06/10/ruby-19-upgrade-and-encoding-hell/

  18. The world after encodings • I had a lot of recipes I carried around • But I found I could fix a lot of ArgumentError bugs by just using Rack-UTF8Sanitizer

  19. Rack-UTF8Sanitizer • http://whitequark.org/blog/2013/03/05/rack-utf8sanitizer/ • Ensure request data is UTF-8. Removes invalid bytes.

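
To make "ensure request data is UTF-8, remove invalid bytes" concrete, here is a toy Rack-style middleware. It is not the gem's implementation: `ScrubbingMiddleware` and its key list are invented for illustration, it needs Ruby 2.1+ for `String#scrub`, and it ignores things the real gem deals with, such as percent-encoded input and request bodies.

```ruby
# Hypothetical middleware, for illustration only.
class ScrubbingMiddleware
  # Assumption: only these two env keys are cleaned in this sketch.
  KEYS = %w[QUERY_STRING PATH_INFO].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    KEYS.each do |key|
      value = env[key]
      next unless value.is_a?(String)
      # Label the bytes as UTF-8, then drop any byte sequence that is invalid.
      env[key] = value.dup.force_encoding(Encoding::UTF_8).scrub("")
    end
    @app.call(env)
  end
end

# A stand-in app that echoes the query string back.
app = ->(env) { [200, {}, [env["QUERY_STRING"]]] }
env = { "QUERY_STRING" => "q=caf\xC3\xA9\xFF".b } # valid "é" followed by a stray 0xFF
_status, _headers, body = ScrubbingMiddleware.new(app).call(env)
p body.first                 # the stray byte is gone
p body.first.valid_encoding?
```

In a real application you would install the gem's own middleware instead; see its README for setup.
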
  20. Mail gem ArgumentError • And I got an exception when running specs against the Mail gem on the Rubinius platform.

  21. Mail gem ArgumentError • There was an encoding failure in displaying RSpec's failure message

  22. Mail gem ArgumentError • Danger, rabbit hole ahead

  23. RSpec-Core • Opened https://github.com/rspec/rspec-core/pull/1760

  24. RSpec-Core • Applied one of my recipes, no problem

  25. RSpec-Core • Oh, and one more thing: • a few specs

  26. On to RSpec-Support EncodedString

  27. What’s EncodedString? • Enables safely interacting with Strings. • Is lossy.

  28. What’s EncodedString? • Enables safely interacting with Strings. • Is lossy. • Some strings are just incompatible.

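
"Lossy" and "incompatible" can both be shown with core String methods. This sketch is not EncodedString itself: `lossy_encode` is a hypothetical helper, and the "?" replacement character is an assumption chosen for the example.

```ruby
# Hypothetical helper: converts to the target encoding, replacing anything
# that cannot be represented instead of raising.
def lossy_encode(string, target)
  string.encode(target, invalid: :replace, undef: :replace, replace: "?")
end

original = "caf\u00E9 \u2603"            # "café" plus a snowman
ascii    = lossy_encode(original, "US-ASCII")
puts ascii                               # caf? ?
puts ascii.encode("UTF-8") == original   # false: the replaced characters are gone for good

# Two strings with non-ASCII content in different encodings cannot be combined.
utf8   = "\u00FC"
latin1 = "\u00FC".encode("ISO-8859-1")
p Encoding.compatible?(utf8, latin1)     # nil
```
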
  29. Difficulties • Different rubies • Different platforms • Diffing test failures

  30. Difficulties • Failed expectations go through the differ • The differ uses EncodedString • Failure messages were corrupted. • So, I needed a special comparison.

  31. Difficulties • https://github.com/rspec/rspec-support/blob/master/lib/rspec/support/spec/string_matcher.rb

  32. Difficulties • https://github.com/rspec/rspec-support/blob/master/lib/rspec/support/spec/string_matcher.rb

  33. Show me the code!

  34. Show me the code! Comments?

  35. Too many comments?

  36. Show me the code!

  37. But first, the exceptions

  38. Encoding::UndefinedConversionError

  39. Encoding::CompatibilityError

  40. Encoding::InvalidByteSequenceError

  41. ArgumentError

  42. TypeError

  43. Encoding::ConverterNotFoundError

  44. RangeError
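
The transcript keeps only the titles of the exception slides, so here is one way to trigger each of the seven. These are triggers I am confident about, not necessarily the ones shown in the talk; `error_class` and `encoding_failures` are hypothetical names. For ArgumentError the sketch uses `unpack("U*")`; the more commonly reported case is matching a regexp against a string with invalid bytes, but that depends on the Ruby version and is left out.

```ruby
# Hypothetical helper: returns the class of the exception the block raises,
# or nil when nothing is raised.
def error_class
  yield
  nil
rescue StandardError => e
  e.class
end

# Hypothetical name: one trigger per exception named on the slides.
def encoding_failures
  binary    = "\x80".b                               # raw ASCII-8BIT byte
  bad_ascii = "\x80".b.force_encoding("US-ASCII")    # byte that is invalid in US-ASCII
  utf8      = "\u00FC"                               # "ü" as UTF-8
  latin1    = "\xFC".b.force_encoding("ISO-8859-1")  # "ü" as ISO-8859-1

  {
    # No UTF-8 character is defined for a high binary byte.
    undefined_conversion: error_class { binary.encode("UTF-8") },
    # Non-ASCII strings in different encodings cannot be concatenated.
    compatibility: error_class { utf8 + latin1 },
    # The source string is not valid in its own (claimed) encoding.
    invalid_byte_sequence: error_class { bad_ascii.encode("UTF-8") },
    # Decoding malformed UTF-8 with unpack.
    argument: error_class { "\xFF".unpack("U*") },
    # An encoding argument that is neither an Encoding nor string-like.
    type: error_class { "abc".dup.force_encoding(nil) },
    # No converter exists for the requested target name.
    converter_not_found: error_class { utf8.encode("no-such-encoding") },
    # The code point does not fit in the requested encoding.
    range: error_class { 128_169.chr("US-ASCII") }
  }
end

encoding_failures.each { |name, klass| puts "#{name}: #{klass}" }
```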

  45. We can fix it!

  46. We can fix it!

  47. We can fix it!

  48. We can fix it!

  49. We can fix it!

  50. We can fix it!

  51. We can fix it!

  52. We can fix it!

  53. We can fix it!

  54. We can test it! https://github.com/bf4/encoded_string/blob/7a6413ee7e57afbc66c1d8159bde7fefd263d105/spec/encoded_string_spec.rb

  55. None
  56. Extracted to Gem • gem install encoded_string

  57. Other concepts • BINARY encoding • Ruby later added String#scrub to help • Dummy Encodings • Guess correct encoding? charlock_holmes gem • Encoding::UTF_8 vs. Encoding.find("UTF-8") vs. "UTF-8"

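
Most of the bullets on this slide can be poked at with core Ruby alone (encoding detection with charlock_holmes is left out because it needs the gem). A small sketch, assuming Ruby 2.1+ for `String#scrub` and a UTF-8 source file; `other_concepts` is a made-up name.

```ruby
# Hypothetical helper gathering the core-Ruby facts behind the slide.
def other_concepts
  {
    # BINARY is another name for the ASCII-8BIT encoding object.
    binary_is_ascii_8bit: Encoding::BINARY.equal?(Encoding::ASCII_8BIT),
    # scrub replaces invalid byte sequences with the given string.
    scrubbed: "abc\u3042\x81".scrub("*"),
    # Dummy encodings are known by name but not fully supported for strings.
    iso_2022_jp_dummy: Encoding::ISO_2022_JP.dummy?,
    utf8_dummy: Encoding::UTF_8.dummy?,
    # The constant and the lookup by name return the same object...
    find_matches_constant: Encoding.find("UTF-8").equal?(Encoding::UTF_8),
    # ...and most APIs accept either the object or the name.
    name_or_object: "abc".encode("UTF-8").encoding == "abc".encode(Encoding::UTF_8).encoding
  }
end

other_concepts.each { |name, value| puts "#{name}: #{value.inspect}" }
```
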
  58. Other concepts: Rack • Rack

  59. Other concepts: Rack • Rack

  60. Quite the Journey, eh?

  61. How to help out • Talk to me

  62. EncodedString: Working with encodings in Ruby without fear of exceptions • Slides: https://speakerdeck.com/bf4/normalization • Benjamin Fleischer bf@benjaminfleischer.com gh: bf4, twitter: hazula • Chicago Ruby 2015

  63. • Sources: (needs cleanup, kthankxbye)
    - https://web.archive.org/web/20120805050401/http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings
    - https://web.archive.org/web/20120805034228/http://blog.grayproductions.net/articles/understanding_m17n
    - https://web.archive.org/web/20120815112820/http://blog.grayproductions.net/articles/miscellaneous_m17n_details
    - https://web.archive.org/web/20120815131349/http://blog.grayproductions.net/articles/what_ruby_19_gives_us
    - https://web.archive.org/web/20120805050559/http://blog.grayproductions.net/articles/ruby_19s_string
    - http://wayback.archive.org/web/20120209160419/http://nuclearsquid.com/writings/ruby-1-9-encodings
    - http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
    - https://github.com/rails/rails/issues/12881 Rails serialized columns
    - http://nerds.airbnb.com/upgrading-from-ree-187-to-ruby-193/
    - caching http://www.benjaminfleischer.com/2013/06/10/ruby-19-upgrade-and-encoding-hell/
    • Result:
    - https://github.com/rspec/rspec-support/compare/19e967a834e5cc9bb9d727ef59c5f580c1a74423%5E...master#diff-6f77530d5756f8c02ca078ddecaa891cR63
    - Change in behavior in rubies

  64. • hazula: Ruby encoding ppl: Experience with ConverterNotFoundError changing behavior in 2.1? https://www.ruby-forum.com/topic/6861247 cc @n0kada @yukihiro_matz @nalsh
    • nalsh: @hazula @n0kada @yukihiro_matz And 2.1 changes String#encode(invalid: :replace) behavior see https://github.com/ruby/ruby/blob/v2_1_0/NEWS#L176
    • n0kada: @nalsh @hazula @yukihiro_matz and Encoding.default_external is not involved at all. replied at the topic.
    • hazula: @n0kada @nalsh @yukihiro_matz Thanks so much! I actually did look at how internal is set when external != locale https://github.com/rspec/rspec-support/commit/db2c3a43e1cdb0fc1491394328f78c712aa9ed19#diff-61bdbe8c6f22bdaffeeacf872bdf7d1bR20 …
    • https://github.com/ruby/ruby/blob/v2_1_0/NEWS#L176

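
The `String#encode(invalid: :replace)` change mentioned in this exchange shows up when the source and target encodings are the same. The sketch below reflects my understanding of Ruby 2.1 and later, where a same-encoding `encode` is otherwise a no-op; `same_encoding_encode` is a hypothetical name.

```ruby
# Hypothetical helper: three ways of "converting" a UTF-8 string to UTF-8.
def same_encoding_encode(string)
  {
    plain: string.encode("UTF-8"),                       # no conversion happens
    replaced: string.encode("UTF-8", invalid: :replace), # invalid bytes are replaced (Ruby 2.1+)
    scrubbed: string.scrub                               # the dedicated method for the same job
  }
end

broken = "a\xFFz".b.force_encoding("UTF-8") # 0xFF is never valid in UTF-8
result = same_encoding_encode(broken)
p result[:plain].valid_encoding?     # false: the invalid byte is still there
p result[:replaced].valid_encoding?  # true
p result[:replaced] == result[:scrubbed]
```
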
  65. • https://github.com/rspec/rspec-support/pull/151#discussion_r22572815
    ▪ I was confusing enc.ascii_compatible? with rb_enc_check, where enc_compatible is `Encoding.compatible?(str1,str2)`
    - https://github.com/rspec/rspec-support/pull/151#discussion_r22572359 https://github.com/rspec/rspec-support/pull/151#issuecomment-70045539
    - https://github.com/rspec/rspec-support/pull/151#discussion_r22637355 per http://stackoverflow.com/questions/21289181/char174-returning-the-value-of-char0174-why which links to http://www.theasciicode.com.ar/extended-ascii-code/angle-quotes-guillemets-right-pointing-double-angle-french-quotation-marks-ascii-code-174.html
    - https://github.com/rspec/rspec-support/pull/151#discussion_r22991439 Hmm, looking at http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using and https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c#L540, it appears you can get the current encoding (codepage) on windows in the command prompt by running `chcp` and change it to utf8 via `chcp 65001`.
    - https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c#L540
    - https://github.com/rspec/rspec-support/pull/134
    - Related to https://github.com/rspec/rspec-core/pull/1760
    - https://github.com/rspec/rspec-support/pull/151
    - https://github.com/rspec/rspec-support/pull/152
    - https://github.com/rspec/rspec-support/pull/167
    - https://github.com/rspec/rspec-support/pull/172
    - https://github.com/rspec/rspec-support/pull/173 (closed without merging)
    - https://github.com/rspec/rspec-support/pull/174
    - https://github.com/rspec/rspec-dev/pull/114
    - https://github.com/rspec/rspec-dev/pull/115 (open)
    - https://github.com/rspec/rspec-core/pull/1871 (open)
    - https://github.com/rspec/rspec-support/pull/171 (split into other PRs)
    - https://github.com/rspec/rspec.github.io/pull/65 (open)
    - TBD
    - https://github.com/rspec/rspec-support/pull/176
    - https://github.com/rspec/rspec-support/pull/167
    - Differ tests no longer use Differ to report diff expectation Differ https://github.com/rspec/rspec-support/pull/174 https://github.com/ruby/ruby/blob/aacc35e144/encoding.c#L1741
    - https://github.com/rspec/rspec-support/pull/151#discussion_r22637177 https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289 https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3119-lL312 and see https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4242-L4250 and https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3917-L3973 (and maybe https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289-L4294 etc) And then probably also look at the weirdness that is converter not found: https://github.com/rubyspec/rubyspec/blob/archive/core/string/shared/encode.rb#L96-L101 myronmarston: Wow, I had no idea just how weird and complex ruby encoding behavior is!
    - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4242-L4250
    - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3917-L3973
    - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289-L4294
    - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3119-lL3121
    - https://github.com/rubyspec/rubyspec/blob/archive/core/string/shared/encode.rb#L96-L101
    - https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/string/fixtures/utf-8-encoding.rb
    - https://github.com/bf4/rspec-support/commit/db2c3a43e1cdb0fc1491394328f78c712aa9ed19
    - https://github.com/jruby/jruby/blob/c1be61a501d1295fa1ea9f894e1a8e186411f32a/test/mri/ruby/envutil.rb#L150 rack invalid https://github.com/ruby/ruby/commit/4a50d447d9618b2e3df126e159aa1d735e429a70
    • Refs:
    - https://github.com/rspec/rspec-support/pull/167
    - https://github.com/rspec/rspec-support/pull/151
    - https://github.com/rspec/rspec-support/pull/134#issuecomment-68984440
    • Documentation:
    • How to test:
    - https://github.com/rspec/rspec-support/pull/171
    - Ruby sources
    - The default external encoding is initialized by the locale or -E option. https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c#L1398
    - Not sure how to do a better test, since locale depends on weird platform-specific stuff https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/encoding/find_spec.rb#L57
    - Encoding.compatible? https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/encoding/compatible_spec.rb#L31