
ChicagoRuby: Downtown - Handling String Encoding Failures/Normalization


http://www.meetup.com/ChicagoRuby/events/224393256/

Ruby 1.9 and later are encoding-aware. Ruby tracks an external encoding and an internal encoding, which it uses when processing input and output, and every source file has an encoding too. We most often run into encodings when we read, write, or otherwise manipulate strings. There are *a lot* of gotchas in working with strings, and many possible exceptions they may raise.

Learn how to use string encodings and how to handle any encoding issues as you follow my journey from just installing `rack-utf8_sanitizer` through writing comprehensive (passing) tests around RSpec's EncodedString. Works on Windows, too!

Benjamin Fleischer

November 03, 2015

Transcript

  1. EncodedString
    Working with encodings in Ruby without fear of
    exceptions
    Benjamin Fleischer
    bf@benjaminfleischer.com
    gh: bf4, twitter: hazula
    Chicago Ruby 2015


  2. Who knows what String
    Encoding is?


  3. Have you ever had trouble
    with String Encoding?
    • Gotten exceptions?
    • That were hard to debug?


  4. Follow my journey into
    Ruby’s String Encoding
    and Converting internals


  5. But first
    a quiz!


  6. PopQuiz:
    What’s encoded in Ruby?
    • Strings and Regexen: "Rüby".encoding.name #=> "UTF-8"
    • Source File: __ENCODING__.name #=> "UTF-8"
    • Environment:
    • External: Encoding.default_external.name #=> "US-ASCII"
    • Internal: Encoding.default_internal.name #=> "UTF-8"
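The quiz answers can be checked in any Ruby session. This is a minimal sketch rather than the slide's own code. The two environment defaults depend on your locale and flags, so no output is shown for them.

```ruby
# "R\u00FCby" is "Rüby"; the \u escape makes the literal UTF-8 in any source file.
string_encoding = "R\u00FCby".encoding.name
regexp_encoding = /R\u00FCby/.encoding.name
source_encoding = __ENCODING__.name # encoding of the file this line lives in

# Both defaults depend on the locale and on flags such as -E;
# default_internal is nil unless something has set it.
external = Encoding.default_external
internal = Encoding.default_internal

puts "string:   #{string_encoding}"
puts "regexp:   #{regexp_encoding}"
puts "source:   #{source_encoding}"
puts "external: #{external.name}"
puts "internal: #{internal ? internal.name : 'nil'}"
```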


  7. PopQuiz:
    What’s external encoding?
    • The default external encoding is initialized by the locale or -E option. https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c#L1398
    • f = File.open("example.txt"); [f.external_encoding.name, f.read.encoding.name] #=> ["UTF-8", "UTF-8"]
    • Setting default external to UTF-8
    • *nix: LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
    • windows: chcp 65001
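A small sketch of how a file picks up its external encoding, assuming a throwaway temp file in place of the slide's example.txt. The mode string "r:ISO-8859-1" is standard Ruby; the file name and contents are made up for illustration.

```ruby
require "tempfile"

default_external = nil
explicit_external = nil

Tempfile.create("example") do |tmp|
  tmp.binmode
  tmp.write("R\u00FCby") # UTF-8 bytes on disk
  tmp.flush

  # No encoding given: the file is tagged with Encoding.default_external.
  File.open(tmp.path) { |f| default_external = f.external_encoding }

  # "r:ISO-8859-1" overrides the external encoding for this one file.
  File.open(tmp.path, "r:ISO-8859-1") { |f| explicit_external = f.external_encoding }
end

puts default_external == Encoding.default_external # true
puts explicit_external.name                        # ISO-8859-1
```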


  8. PopQuiz:
    What’s filesystem encoding?
    • Not from environment


  9. PopQuiz:
    What’s internal encoding?
    • Only set when the locale != the external
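The slide gives only a one-line hint, so here is a hedged sketch of what an internal encoding does, using a per-file setting instead of the global default. When an internal encoding is present, Ruby transcodes data from the external encoding as it reads. The temp file and its contents are assumptions for illustration.

```ruby
require "tempfile"

text = nil
Tempfile.create("latin1") do |tmp|
  tmp.binmode
  tmp.write("R\xFCby".b) # "Rüby" as ISO-8859-1 bytes
  tmp.flush

  # "r:EXTERNAL:INTERNAL": decode as ISO-8859-1, hand back UTF-8.
  text = File.open(tmp.path, "r:ISO-8859-1:UTF-8") { |f| f.read }
end

puts text.encoding.name   # UTF-8
puts text == "R\u00FCby"  # true
```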


  10. PopQuiz:
    What’s locale encoding?
    • Very complex


  11. Can I test default external /
    internal?
    • Sure, why not?
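One way to test against a different default external encoding is to swap it for the duration of a block and restore it afterwards. The helper name `with_default_external` is hypothetical and is not taken from the talk or from RSpec.

```ruby
# Hypothetical test helper: temporarily change Encoding.default_external.
def with_default_external(encoding)
  original = Encoding.default_external
  original_verbose = $VERBOSE
  $VERBOSE = nil # Ruby may warn when the default is reassigned
  begin
    Encoding.default_external = encoding
    yield
  ensure
    Encoding.default_external = original
    $VERBOSE = original_verbose
  end
end

before = Encoding.default_external
inside = with_default_external("ISO-8859-1") { Encoding.default_external }
after = Encoding.default_external

puts inside.name     # ISO-8859-1
puts before == after # true
```

The default is process-wide state, so a helper like this is only safe when tests do not run in parallel threads.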


  12. Can I test default external /
    internal?


  13. Can I test default external /
    internal?
    • No problem, amiright?


  14. Oh, and one more
    thing:


  15. Encoding Concepts
    • Encoding: str.encoding
    • Transcoding: str.encode("iso-8859-1")
    • https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c
    • https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c
    • https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c
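To make the two concepts concrete, this sketch contrasts transcoding (`encode`, which rewrites the bytes) with relabeling (`force_encoding`, which only changes the tag). The variable names are illustrative.

```ruby
utf8 = "R\u00FCby" # "Rüby" in UTF-8

latin1 = utf8.encode("ISO-8859-1")                # transcode: same characters, new bytes
relabeled = utf8.dup.force_encoding("ISO-8859-1") # relabel: same bytes, new tag

puts utf8.bytes.inspect      # [82, 195, 188, 98, 121]
puts latin1.bytes.inspect    # [82, 252, 98, 121]
puts relabeled.bytes.inspect # [82, 195, 188, 98, 121]
puts relabeled.length        # 5, because each byte now counts as one character
puts latin1.encode("UTF-8") == utf8 # true
```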


  16. Back to my journey


  17. The world before encodings
    • We just had -K
    • Then everything broke
    • https://web.archive.org/web/20120805050559/http://blog.grayproductions.net/articles/ruby_19s_string
    • http://wayback.archive.org/web/20120209160419/http://nuclearsquid.com/writings/ruby-1-9-encodings
    • http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
    • https://github.com/rails/rails/issues/12881 Rails serialized columns
    • http://nerds.airbnb.com/upgrading-from-ree-187-to-ruby-193/
    • http://www.benjaminfleischer.com/2013/06/10/ruby-19-upgrade-and-encoding-hell/


  18. The world after encodings
    • I had a lot of recipes I carried around
    • But I found I could fix a lot of ArgumentError
    bugs just by using Rack-UTF8Sanitizer


  19. Rack-UTF8Sanitizer
    • http://whitequark.org/blog/2013/03/05/rack-utf8sanitizer/
    • Ensure request data is UTF-8. Removes invalid
    bytes.
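The idea of "ensure UTF-8, remove invalid bytes" can be sketched in plain Ruby. This is not the gem's implementation. `sanitize_utf8` is a hypothetical helper, and it relies on `String#scrub`, which needs Ruby 2.1 or later.

```ruby
# Hypothetical helper illustrating the idea only; this is not the gem's code.
def sanitize_utf8(input)
  # Treat the raw bytes as UTF-8, then drop whatever is not valid UTF-8.
  input.dup.force_encoding(Encoding::UTF_8).scrub("")
end

dirty = "caf\xC3\xA9 \xFFok".b # raw request bytes with one stray 0xFF
clean = sanitize_utf8(dirty)

puts clean.valid_encoding?    # true
puts clean == "caf\u00E9 ok"  # true
```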


  20. Mail gem ArgumentError
    • And I got an exception when running specs against
    the Mail gem on the Rubinius platform.


  21. Mail gem ArgumentError
    • There was an encoding failure in displaying
    RSpec's failure message


  22. Mail gem ArgumentError
    • Danger, rabbit hole ahead


  23. RSpec-Core
    • Opened https://github.com/rspec/rspec-core/pull/1760


  24. RSpec-Core
    • Applied one of my recipes, no problem


  25. RSpec-Core
    • Oh, and one more thing:
    • a few specs


  26. On to RSpec-Support
    EncodedString


  27. What’s EncodedString?
    • Enables safely interacting with Strings.
    • Is lossy.


  28. What’s EncodedString?
    • Enables safely interacting with Strings.
    • Is lossy.
    • Some strings are just incompatible.
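A much-simplified sketch of the idea follows. Every string is normalized into one target encoding up front, with anything unconvertible replaced, so later operations cannot raise. `LossyString` is a hypothetical stand-in and not the real RSpec::Support::EncodedString, which covers more Ruby versions and edge cases. The sketch assumes a Ruby where `encode` with `invalid: :replace` also repairs same-encoding input (2.1 or later).

```ruby
# Hypothetical, simplified wrapper; not the RSpec implementation.
class LossyString
  REPLACEMENT = "?"

  def initialize(string, target = Encoding::UTF_8)
    @target = target
    @string = normalize(string)
  end

  def to_s
    @string
  end

  # Appending never raises: the other string is normalized first.
  def <<(other)
    @string << normalize(other)
    self
  end

  private

  # Lossy on purpose: invalid or unconvertible bytes become REPLACEMENT.
  def normalize(string)
    string.encode(@target, invalid: :replace, undef: :replace, replace: REPLACEMENT)
  end
end

safe = LossyString.new("R\u00FCby ")
safe << "\xFF".b << "ok"

puts safe.to_s.valid_encoding?     # true
puts safe.to_s == "R\u00FCby ?ok"  # true
```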


  29. Difficulties
    • Different rubies
    • Different platforms
    • Diffing test failures


  30. Difficulties
    • Failed expectations go through the differ
    • The differ uses EncodedString
    • Failure messages were corrupted.
    • So, I needed a special comparison.


  31. Difficulties
    • https://github.com/rspec/rspec-support/blob/master/lib/rspec/support/spec/string_matcher.rb


  32. Difficulties
    • https://github.com/rspec/rspec-support/blob/master/lib/rspec/support/spec/string_matcher.rb


  33. Show me the code!


  34. Show me the code!
    Comments?


  35. Too many comments?


  36. Show me the code!


  37. But first, the exceptions


  38. Encoding::UndefinedConversionError
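The slide's code is not in the transcript, so this is an illustrative reproduction. Transcoding fails with this error when a character has no equivalent in the target encoding, and the `undef: :replace` option avoids it.

```ruby
binary = "\x80".b # a high byte tagged as BINARY (ASCII-8BIT)

error = begin
  binary.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  e
end

replaced = binary.encode("UTF-8", undef: :replace, replace: "<undef>")

puts error.class # Encoding::UndefinedConversionError
puts replaced    # <undef>
```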


  39. Encoding::CompatibilityError
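An illustrative reproduction, not the slide's own code. Combining two strings raises this error when both contain non-ASCII bytes and their encodings differ, which `Encoding.compatible?` can detect ahead of time.

```ruby
utf8 = "R\u00FCby" # non-ASCII, UTF-8
binary = "\xFF".b  # non-ASCII, BINARY

compatible = Encoding.compatible?(utf8, binary) # nil: no common encoding

error = begin
  utf8 + binary
rescue Encoding::CompatibilityError => e
  e
end

# An ASCII-only string can be combined with any ASCII-compatible encoding.
ascii_ok = utf8 + "abc".b

puts compatible.inspect     # nil
puts error.class            # Encoding::CompatibilityError
puts ascii_ok.encoding.name # UTF-8
```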


  40. Encoding::InvalidByteSequenceError
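An illustrative reproduction, not the slide's own code. Transcoding fails with Encoding::InvalidByteSequenceError when the source bytes are not valid in the encoding the string claims to have. The `invalid: :replace` option substitutes a marker instead.

```ruby
mislabeled = "\x80".dup.force_encoding("US-ASCII") # 0x80 is not valid US-ASCII

error = begin
  mislabeled.encode("UTF-8")
rescue Encoding::InvalidByteSequenceError => e
  e
end

replaced = mislabeled.encode("UTF-8", invalid: :replace, replace: "<byte>")

puts mislabeled.valid_encoding? # false
puts error.class                # Encoding::InvalidByteSequenceError
puts replaced                   # <byte>
```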


  41. ArgumentError
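An illustrative reproduction, not the slide's own code. Many string operations, regexp matching among them, raise a plain ArgumentError when the string holds bytes that are invalid for its encoding. Scrubbing first (Ruby 2.1 or later) makes the match safe.

```ruby
# ISO-8859-1 bytes for "café" that are tagged as UTF-8.
broken = "caf\xE9".dup.force_encoding("UTF-8")

error = begin
  broken =~ /caf/
rescue ArgumentError => e
  e
end

position = broken.scrub("?") =~ /caf/

puts error.class # ArgumentError
puts position    # 0
```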


  42. TypeError
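The transcript does not include the slide's example, so this shows just one way a TypeError can come up: passing something that is neither an Encoding nor string-like where an encoding is expected.

```ruby
error = begin
  "abc".dup.force_encoding(nil)
rescue TypeError => e
  e
end

puts error.class # TypeError
```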


  43. Encoding::ConverterNotFoundError
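An illustrative reproduction, not the slide's own code. Asking for a target encoding that Ruby has no converter for raises this error. The name "foo" is a made-up encoding used only to trigger it.

```ruby
error = begin
  "abc".encode("foo") # "foo" is not a real encoding
rescue Encoding::ConverterNotFoundError => e
  e
end

# Checking the name first avoids the exception.
known = Encoding.name_list.include?("foo")

puts error.class # Encoding::ConverterNotFoundError
puts known       # false
```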


  44. RangeError
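An illustrative reproduction, not the slide's own code. Converting an integer to a character raises RangeError when the codepoint does not exist in the requested encoding.

```ruby
error = begin
  0x110000.chr("UTF-8") # one past the last Unicode codepoint
rescue RangeError => e
  e
end

pile = 128169.chr("UTF-8") # U+1F4A9 is a valid codepoint

puts error.class          # RangeError
puts pile == "\u{1F4A9}"  # true
```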


  45. We can fix it!
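The code on these slides is not in the transcript. As a hedged illustration of the general approach, one can rescue the specific exception and fall back to a lossy repair. `safe_concat` is a hypothetical helper, and `String#scrub` needs Ruby 2.1 or later.

```ruby
# Hypothetical helper: concatenate two strings without raising.
def safe_concat(left, right)
  left + right
rescue Encoding::CompatibilityError
  # Relabel the right-hand bytes with the left-hand encoding, then
  # replace whatever is invalid there. This loses information.
  left + right.dup.force_encoding(left.encoding).scrub("?")
end

joined = safe_concat("R\u00FCby", "\xFF".b)

puts joined == "R\u00FCby?"  # true
puts joined.valid_encoding?  # true
```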


  46. We can fix it!


  47. We can fix it!


  48. We can fix it!


  49. We can fix it!


  50. We can fix it!


  51. We can fix it!


  52. We can fix it!


  53. We can fix it!


  54. We can test it!
    https://github.com/bf4/encoded_string/blob/7a6413ee7e57afbc66c1d8159bde7fefd263d105/spec/encoded_string_spec.rb



  56. Extracted to Gem
    • gem install encoded_string


  57. Other concepts
    • BINARY encoding
    • Ruby later added String#scrub to help
    • Dummy Encodings
    • Guess correct encoding? charlock_holmes gem
    • Encoding::UTF_8 vs. Encoding.find("UTF-8") vs.
    "UTF-8"
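A few of these concepts can be poked at directly. This is a sketch for illustration only. It assumes Ruby 2.1 or later for `String#scrub` and uses ISO-2022-JP as the example of a dummy encoding.

```ruby
# BINARY is another name for ASCII-8BIT: raw bytes with no character meaning.
binary = "R\u00FCby".b
same_constant = Encoding::BINARY.equal?(Encoding::ASCII_8BIT)

# String#scrub replaces invalid bytes (U+FFFD by default for UTF-8).
scrubbed = "\xFF".dup.force_encoding("UTF-8").scrub

# Dummy encodings are known by name, but Ruby does not properly handle
# character operations in them.
dummy = Encoding::ISO_2022_JP.dummy?

# A constant and a name lookup give the same object, case-insensitively;
# a plain "UTF-8" String is only a name until something resolves it.
found = Encoding.find("utf-8")

puts binary.encoding == Encoding::BINARY # true
puts same_constant                       # true
puts scrubbed == "\uFFFD"                # true
puts dummy                               # true
puts found.equal?(Encoding::UTF_8)       # true
```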


  58. Other concepts: Rack
    • Rack


  59. Other concepts: Rack
    • Rack


  60. Quite the Journey, eh?


  61. How to help out
    • Talk to me


  62. EncodedString
    Slides: https://speakerdeck.com/bf4/normalization
    Working with encodings in Ruby without fear of
    exceptions
    Benjamin Fleischer
    bf@benjaminfleischer.com
    gh: bf4, twitter: hazula
    Chicago Ruby 2015


  63. • Sources: (needs cleanup, kthankxbye)
    • - https://web.archive.org/web/20120805050401/http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings
    • - https://web.archive.org/web/20120805034228/http://blog.grayproductions.net/articles/understanding_m17n
    • - https://web.archive.org/web/20120815112820/http://blog.grayproductions.net/articles/miscellaneous_m17n_details
    • - https://web.archive.org/web/20120815131349/http://blog.grayproductions.net/articles/what_ruby_19_gives_us
    • - https://web.archive.org/web/20120805050559/http://blog.grayproductions.net/articles/ruby_19s_string
    • - http://wayback.archive.org/web/20120209160419/http://nuclearsquid.com/writings/ruby-1-9-encodings
    • - http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
    • - https://github.com/rails/rails/issues/12881 Rails serialized columns
    • - http://nerds.airbnb.com/upgrading-from-ree-187-to-ruby-193/
    • - caching http://www.benjaminfleischer.com/2013/06/10/ruby-19-upgrade-and-encoding-hell/
    • Result:
    • - https://github.com/rspec/rspec-support/compare/19e967a834e5cc9bb9d727ef59c5f580c1a74423%5E...master#diff-6f77530d5756f8c02ca078ddecaa891cR63
    • - Change in behavior in rubies


  64. • hazula: Ruby encoding ppl: Experience with ConverterNotFoundError changing behavior in 2.1? https://www.ruby-forum.com/topic/6861247 cc @n0kada @yukihiro_matz @nalsh
    • nalsh: @hazula @n0kada @yukihiro_matz And 2.1 changes String#encode(invalid: :replace) behavior see https://github.com/ruby/ruby/blob/v2_1_0/NEWS#L176
    • n0kada: @nalsh @hazula @yukihiro_matz and Encoding.default_external is not involved at all. replied at the topic.
    • hazula: @n0kada @nalsh @yukihiro_matz Thanks so much! I actually did look at how internal is set when external != locale https://github.com/rspec/rspec-support/commit/db2c3a43e1cdb0fc1491394328f78c712aa9ed19#diff-61bdbe8c6f22bdaffeeacf872bdf7d1bR20 …
    • https://github.com/ruby/ruby/blob/v2_1_0/NEWS#L176


  65. • - https://github.com/rspec/rspec-support/pull/151#discussion_r22572815
    ▪ I was confusing enc.ascii_compatible? with rb_enc_check, where enc_compatible is `Encoding.compatible?(str1, str2)`

    ◦ - https://github.com/rspec/rspec-support/pull/151#discussion_r22572359

    https://github.com/rspec/rspec-support/pull/151#issuecomment-70045539

    - https://github.com/rspec/rspec-support/pull/151#discussion_r22637355

    per http://stackoverflow.com/questions/21289181/char174-returning-the-value-of-char0174-why which links to http://www.theasciicode.com.ar/extended-ascii-code/angle-quotes-guillemets-right-pointing-double-angle-french-quotation-marks-ascii-code-174.html

    - https://github.com/rspec/rspec-support/pull/151#discussion_r22991439

    Hmm, looking at http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using and https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c#L540, it appears you can get the current encoding (codepage)
    on windows in the command prompt by running `chcp` and change it to utf8 via `chcp 65001`.

    - https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c#L540

    - https://github.com/rspec/rspec-support/pull/134

    - Related to https://github.com/rspec/rspec-core/pull/1760

    - https://github.com/rspec/rspec-support/pull/151

    - https://github.com/rspec/rspec-support/pull/152

    - https://github.com/rspec/rspec-support/pull/167

    - https://github.com/rspec/rspec-support/pull/172

    - https://github.com/rspec/rspec-support/pull/173 (closed without merging)

    - https://github.com/rspec/rspec-support/pull/174

    - https://github.com/rspec/rspec-dev/pull/114

    - https://github.com/rspec/rspec-dev/pull/115 (open)

    - https://github.com/rspec/rspec-core/pull/1871 (open)

    - https://github.com/rspec/rspec-support/pull/171 (split into other PRs)

    - https://github.com/rspec/rspec.github.io/pull/65 (open)

    - TBD

    - https://github.com/rspec/rspec-support/pull/176

    - https://github.com/rspec/rspec-support/pull/167

    - Differ tests no longer use Differ to report diff expectation Differ https://github.com/rspec/rspec-support/pull/174

    https://github.com/ruby/ruby/blob/aacc35e144/encoding.c#L1741

    - https://github.com/rspec/rspec-support/pull/151#discussion_r22637177

    https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289

    https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3119-lL312

    and see https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4242-L4250 and https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3917-L3973 (and maybe https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289-L4294 etc)

    And then probably also look at the weirdness that is converter not found: https://github.com/rubyspec/rubyspec/blob/archive/core/string/shared/encode.rb#L96-L101

    myronmarston: Wow, I had no idea just how weird and complex ruby encoding behavior is!

    - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4242-L4250

    - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3917-L3973

    - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289-L4294

    - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3119-lL3121

    - https://github.com/rubyspec/rubyspec/blob/archive/core/string/shared/encode.rb#L96-L101

    - https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/string/fixtures/utf-8-encoding.rb

    - https://github.com/bf4/rspec-support/commit/db2c3a43e1cdb0fc1491394328f78c712aa9ed19

    - https://github.com/jruby/jruby/blob/c1be61a501d1295fa1ea9f894e1a8e186411f32a/test/mri/ruby/envutil.rb#L150

    rack invalid https://github.com/ruby/ruby/commit/4a50d447d9618b2e3df126e159aa1d735e429a70

    Refs:

    - https://github.com/rspec/rspec-support/pull/167

    - https://github.com/rspec/rspec-support/pull/151

    - https://github.com/rspec/rspec-support/pull/134#issuecomment-68984440

    Documentation: How to test:

    - https://github.com/rspec/rspec-support/pull/171


    Ruby sources

    - The default external encoding is initialized by the locale or -E option. https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c#L1398

    - Not sure how to do a better test, since locale depends on weird platform-specific stuff https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/encoding/find_spec.rb#L57

    - Encoding.compatible? https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/encoding/compatible_spec.rb#L31



