Slide 1

Slide 1 text

EncodedString Working with encodings in Ruby without fear of exceptions Benjamin Fleischer bf@benjaminfleischer.com gh: bf4, twitter: hazula Chicago Ruby 2015

Slide 2

Slide 2 text

Who knows what String Encoding is?

Slide 3

Slide 3 text

Have you ever had trouble with String Encoding? • Gotten exceptions? • That were hard to debug?

Slide 4

Slide 4 text

Follow my journey into Ruby’s String Encoding and Converting internals

Slide 5

Slide 5 text

But first a quiz!

Slide 6

Slide 6 text

PopQuiz: What’s encoded in Ruby? • Strings and Regexen: "Rüby".encoding.name #=> "UTF-8" • Source File: __ENCODING__.name #=> "UTF-8" • Environment: • External: Encoding.default_external #=> "US-ASCII" • Internal: Encoding.default_internal #=> "UTF-8"
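
A minimal sketch of the quiz answers as a runnable script. The helper name `encoding_report` is hypothetical, and the external and internal values depend on your locale and command-line flags, so they may not match the ones on the slide.

```ruby
# Hypothetical helper (not from the talk): collects the encodings the quiz asks about.
def encoding_report
  {
    string:   "R\u00FCby".encoding,       # "Rüby"; a \u escape always yields UTF-8
    regexp:   /R\u00FCby/.encoding,       # a \u escape fixes the Regexp to UTF-8
    source:   __ENCODING__,               # encoding of this source file
    external: Encoding.default_external,  # from the locale or the -E option
    internal: Encoding.default_internal   # nil unless explicitly set
  }
end

encoding_report.each do |name, encoding|
  puts format("%-8s %s", name, encoding ? encoding.name : "(not set)")
end
```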

Slide 7

Slide 7 text

PopQuiz: What’s external encoding? • The default external encoding is initialized by the locale or -E option. https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c#L1398 • f = File.open("example.txt"); [f.external_encoding.name, f.read.encoding.name] #=> ["UTF-8", "UTF-8"] • Setting default external to UTF-8 • *nix: LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 • Windows: chcp 65001
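
A small sketch of how the external encoding shows up when reading a file. `read_encodings` is a hypothetical helper, and the first result depends on your environment, so no output is shown.

```ruby
require "tempfile"

# Hypothetical helper: returns [external encoding of the IO, encoding of the String read].
def read_encodings(path, mode = "r")
  File.open(path, mode) { |f| [f.external_encoding, f.read.encoding] }
end

Tempfile.create("example") do |tmp|
  tmp.write("hello")
  tmp.flush

  # No encoding in the mode: the IO falls back to Encoding.default_external.
  p read_encodings(tmp.path).map(&:name)

  # "r:EXTERNAL:INTERNAL" overrides the defaults and transcodes on read.
  p read_encodings(tmp.path, "r:iso-8859-1:utf-8").map(&:name)
end
```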

Slide 8

Slide 8 text

PopQuiz: What’s filesystem encoding? • Not from environment

Slide 9

Slide 9 text

PopQuiz: What’s internal encoding? • Only set when the locale != the external

Slide 10

Slide 10 text

PopQuiz: What’s locale encoding? • Very complex

Slide 11

Slide 11 text

Can I test default external / internal? • Sure, why not?

Slide 12

Slide 12 text

Can I test default external / internal?

Slide 13

Slide 13 text

Can I test default external / internal? • No problem, amiright?
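
One way to test against different defaults is to swap them for the duration of a block. This is only a sketch: `with_default_encodings` is a hypothetical helper, not what RSpec uses, and because the defaults are process-wide it is not safe to run alongside other threads.

```ruby
# Hypothetical test helper (not from the talk): swaps the process-wide defaults
# for the duration of the block, then restores them.
def with_default_encodings(external, internal)
  old_external = Encoding.default_external
  old_internal = Encoding.default_internal
  old_verbose  = $VERBOSE
  $VERBOSE = nil # reassigning the defaults warns when $VERBOSE is true
  Encoding.default_external = external
  Encoding.default_internal = internal
  yield
ensure
  Encoding.default_external = old_external
  Encoding.default_internal = old_internal
  $VERBOSE = old_verbose
end

with_default_encodings(Encoding::ISO_8859_1, Encoding::UTF_8) do
  p [Encoding.default_external.name, Encoding.default_internal.name]
end
```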

Slide 14

Slide 14 text

Oh, and one more thing:

Slide 15

Slide 15 text

Encoding Concepts • Encoding: str.encoding • Transcoding: str.encode("iso-8859-1") • https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c • https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c • https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c
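
To make the distinction concrete, here is a sketch that contrasts transcoding (`encode`) with merely relabeling the bytes (`force_encoding`). The helper `transcode_vs_relabel` is hypothetical. For "ü", UTF-8 uses the bytes 195, 188 while ISO-8859-1 uses the single byte 252.

```ruby
# Hypothetical helper: contrasts transcoding with relabeling for one string.
def transcode_vs_relabel(str, target)
  transcoded = str.encode(target)              # same characters, new bytes
  relabeled  = str.dup.force_encoding(target)  # same bytes, new interpretation
  { original: str.bytes, transcoded: transcoded.bytes, relabeled: relabeled.bytes }
end

p transcode_vs_relabel("\u00FC", "iso-8859-1")
```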

Slide 16

Slide 16 text

Back to my journey

Slide 17

Slide 17 text

The world before encodings • We just had -K • Then everything broke • https://web.archive.org/web/20120805050559/http://blog.grayproductions.net/articles/ruby_19s_string • http://wayback.archive.org/web/20120209160419/http://nuclearsquid.com/writings/ruby-1-9-encodings • http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/ • https://github.com/rails/rails/issues/12881 Rails serialized columns • http://nerds.airbnb.com/upgrading-from-ree-187-to-ruby-193/ • http://www.benjaminfleischer.com/2013/06/10/ruby-19-upgrade-and-encoding-hell/

Slide 18

Slide 18 text

The world after encodings • I had a lot of recipes I carried around • But I found I could fix a lot of ArgumentError bugs just by using Rack-UTF8Sanitizer

Slide 19

Slide 19 text

Rack-UTF8Sanitizer • http://whitequark.org/blog/2013/03/05/rack-utf8sanitizer/ • Ensure request data is UTF-8. Removes invalid bytes.

Slide 20

Slide 20 text

Mail gem ArgumentError • And I got an exception when running specs against the Mail gem on the Rubinius platform.

Slide 21

Slide 21 text

Mail gem ArgumentError • There was an encoding failure in displaying RSpec's failure message

Slide 22

Slide 22 text

Mail gem ArgumentError • Danger, rabbit hole ahead

Slide 23

Slide 23 text

RSpec-Core • Opened https://github.com/rspec/rspec-core/pull/1760

Slide 24

Slide 24 text

RSpec-Core • Applied one of my recipes, no problem

Slide 25

Slide 25 text

RSpec-Core • Oh, and one more thing: • a few specs

Slide 26

Slide 26 text

On to RSpec-Support EncodedString

Slide 27

Slide 27 text

What’s EncodedString? • Enables safely interacting with Strings. • Is lossy.

Slide 28

Slide 28 text

What’s EncodedString? • Enables safely interacting with Strings. • Is lossy. • Some strings are just incompatible.

Slide 29

Slide 29 text

Difficulties • Different rubies • Different platforms • Diffing test failures

Slide 30

Slide 30 text

Difficulties • Failed expectations go through the differ • The differ uses EncodedString • Failure messages were corrupted. • So, I needed a special comparison.

Slide 31

Slide 31 text

Difficulties • https://github.com/rspec/rspec-support/blob/master/lib/rspec/support/spec/string_matcher.rb

Slide 32

Slide 32 text

Difficulties • https://github.com/rspec/rspec-support/blob/master/lib/rspec/support/spec/string_matcher.rb

Slide 33

Slide 33 text

Show me the code!

Slide 34

Slide 34 text

Show me the code! Comments?

Slide 35

Slide 35 text

Too many comments?

Slide 36

Slide 36 text

Show me the code!

Slide 37

Slide 37 text

But first, the exceptions

Slide 38

Slide 38 text

Encoding::UndefinedConversionError
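
A minimal reproduction, assuming nothing beyond core Ruby: the string is valid, but the target encoding has no equivalent for one of its characters.

```ruby
umlaut = "\u00FC" # "ü", valid UTF-8 with no US-ASCII equivalent

begin
  umlaut.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "#{e.class}: #{e.message}"
end

# Asking for a replacement avoids the exception, at the cost of losing data.
puts umlaut.encode("US-ASCII", undef: :replace, replace: "?")
```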

Slide 39

Slide 39 text

Encoding::CompatibilityError
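
A minimal reproduction: two strings that are each valid on their own, but that share no encoding able to hold both. `Encoding.compatible?` lets you check before combining them.

```ruby
utf8   = "\u00FC"                                # "ü" in UTF-8
latin1 = "\xFC".dup.force_encoding("ISO-8859-1") # "ü" in ISO-8859-1

p Encoding.compatible?(utf8, latin1) # nil: no common encoding
p Encoding.compatible?(utf8, "abc")  # an Encoding: ASCII-only strings mix freely

begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  puts "#{e.class}: #{e.message}"
end
```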

Slide 40

Slide 40 text

Encoding::InvalidByteSequenceError
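
A minimal reproduction: here the source string itself is broken, so transcoding it fails. The target encoding (UTF-16LE) is just an example chosen for this sketch.

```ruby
broken = "abc\xFF".dup.force_encoding("UTF-8") # \xFF can never appear in UTF-8
p broken.valid_encoding? # false

begin
  broken.encode("UTF-16LE")
rescue Encoding::InvalidByteSequenceError => e
  puts "#{e.class}: #{e.message}"
end

# invalid: :replace substitutes the bad byte instead of raising.
puts broken.encode("UTF-16LE", invalid: :replace, replace: "?").encode("UTF-8")
```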

Slide 41

Slide 41 text

ArgumentError
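
A minimal reproduction of the plain `ArgumentError` case: no transcoding is involved, just an ordinary operation (here a regexp match) on a string with invalid bytes. Scrubbing first is one possible workaround, shown as a sketch.

```ruby
broken = "abc\xFF".dup.force_encoding("UTF-8")

begin
  broken =~ /b/
rescue ArgumentError => e
  puts "#{e.class}: #{e.message}"
end

# Replacing the invalid bytes first makes the match safe (String#scrub, Ruby 2.1+).
p broken.scrub("?") =~ /b/
```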

Slide 42

Slide 42 text

TypeError
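
A minimal reproduction, assuming the encoding name came from somewhere that returned `nil` (a hypothetical missing config value): anything that cannot be converted to a String is rejected with a `TypeError`.

```ruby
missing_name = nil # hypothetical: an encoding name that was never configured

begin
  "abc".dup.force_encoding(missing_name)
rescue TypeError => e
  puts "#{e.class}: #{e.message}"
end
```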

Slide 43

Slide 43 text

Encoding::ConverterNotFoundError
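
A minimal reproduction: the target name ("foo", made up for this sketch) does not correspond to any converter Ruby knows about.

```ruby
begin
  "abc".encode("foo") # "foo" is not a known encoding
rescue Encoding::ConverterNotFoundError => e
  puts "#{e.class}: #{e.message}"
end
```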

Slide 44

Slide 44 text

RangeError
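
A minimal reproduction from the integer-to-character direction: a code point beyond the Unicode range cannot be turned into a UTF-8 character. Passing the encoding explicitly keeps the sketch independent of `Encoding.default_internal`.

```ruby
# 128169 (U+1F4A9) is a valid code point, so this works.
puts 128169.chr(Encoding::UTF_8)

begin
  0x110000.chr(Encoding::UTF_8) # one past the last Unicode code point
rescue RangeError => e
  puts "#{e.class}: #{e.message}"
end
```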

Slide 45

Slide 45 text

We can fix it!

Slide 46

Slide 46 text

We can fix it!

Slide 47

Slide 47 text

We can fix it!

Slide 48

Slide 48 text

We can fix it!

Slide 49

Slide 49 text

We can fix it!

Slide 50

Slide 50 text

We can fix it!

Slide 51

Slide 51 text

We can fix it!

Slide 52

Slide 52 text

We can fix it!

Slide 53

Slide 53 text

We can fix it!
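
The general shape of a lossy fix can be sketched as below. This is not the EncodedString implementation: `lossy_utf8` is a hypothetical helper that assumes UTF-8 as the target and "?" as the replacement, and it trades fidelity for never raising.

```ruby
# Hypothetical helper (not the gem's code): always returns valid UTF-8,
# replacing anything that cannot be represented.
def lossy_utf8(str, replacement = "?")
  if str.encoding == Encoding::UTF_8
    # Already labeled UTF-8: only invalid bytes need replacing.
    str.scrub(replacement)
  else
    str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: replacement)
  end
end

p lossy_utf8("a\xFFb".dup.force_encoding("UTF-8"))    # invalid byte in UTF-8
p lossy_utf8("a\xFFb".b)                              # binary data
p lossy_utf8("\xFC".dup.force_encoding("ISO-8859-1")) # convertible, nothing lost
```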

Slide 54

Slide 54 text

We can test it! https://github.com/bf4/encoded_string/blob/7a6413ee7e57afbc66c1d8159bde7fefd263d105/spec/encoded_string_spec.rb

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

Extracted to Gem • gem install encoded_string

Slide 57

Slide 57 text

Other concepts • BINARY encoding • Ruby later added String#scrub to help • Dummy Encodings • Guess correct encoding? charlock_holmes gem • Encoding::UTF_8 vs. Encoding.find("UTF-8") vs. "UTF-8"
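
A few of these concepts can be poked at directly in core Ruby. The sketch below covers BINARY, the different ways to name an encoding, dummy encodings, and `String#scrub`; encoding detection with charlock_holmes is left out because it needs the gem.

```ruby
# BINARY is another name for ASCII-8BIT: raw bytes with no character semantics.
p Encoding::BINARY.equal?(Encoding::ASCII_8BIT)  # true
p "abc".b.encoding.equal?(Encoding::BINARY)      # true

# The constant and the lookup by name return the same Encoding object.
p Encoding.find("UTF-8").equal?(Encoding::UTF_8) # true
p Encoding::UTF_8.name == "UTF-8"                # true

# Dummy encodings are placeholders that Ruby cannot fully process.
p Encoding::ISO_2022_JP.dummy? # true
p Encoding::UTF_8.dummy?       # false

# String#scrub (Ruby 2.1+) replaces invalid bytes.
p "abc\xFF".dup.force_encoding("UTF-8").scrub("?") # "abc?"
```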

Slide 58

Slide 58 text

Other concepts: Rack • Rack

Slide 59

Slide 59 text

Other concepts: Rack • Rack

Slide 60

Slide 60 text

Quite the Journey, eh?

Slide 61

Slide 61 text

How to help out • Talk to me

Slide 62

Slide 62 text

EncodedString Slides: https://speakerdeck.com/bf4/normalization Working with encodings in Ruby without fear of exceptions Benjamin Fleischer bf@benjaminfleischer.com gh: bf4, twitter: hazula Chicago Ruby 2015

Slide 63

Slide 63 text

• Sources: (needs cleanup, kthankxbye) • - https://web.archive.org/web/20120805050401/http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings • - https://web.archive.org/web/20120805034228/http://blog.grayproductions.net/articles/understanding_m17n • - https://web.archive.org/web/20120815112820/http://blog.grayproductions.net/articles/miscellaneous_m17n_details • - https://web.archive.org/web/20120815131349/http://blog.grayproductions.net/articles/what_ruby_19_gives_us • - https://web.archive.org/web/20120805050559/http://blog.grayproductions.net/articles/ruby_19s_string • - http://wayback.archive.org/web/20120209160419/http://nuclearsquid.com/writings/ruby-1-9-encodings • - http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/ • - https://github.com/rails/rails/issues/12881 Rails serialized columns • - http://nerds.airbnb.com/upgrading-from-ree-187-to-ruby-193/ • - caching http://www.benjaminfleischer.com/2013/06/10/ruby-19-upgrade-and-encoding-hell/ • Result: • - https://github.com/rspec/rspec-support/compare/19e967a834e5cc9bb9d727ef59c5f580c1a74423%5E...master#diff-6f77530d5756f8c02ca078ddecaa891cR63 • - Change in behavior in rubies

Slide 64

Slide 64 text

• hazula: Ruby encoding ppl: Experience with ConverterNotFoundError changing behavior in 2.1? https://www.ruby-forum.com/topic/6861247 cc @n0kada @yukihiro_matz @nalsh • nalsh: @hazula @n0kada @yukihiro_matz And 2.1 changes String#encode(invalid: :replace) behavior see https://github.com/ruby/ruby/blob/v2_1_0/NEWS#L176 • n0kada: @nalsh @hazula @yukihiro_matz and Encoding.default_external is not involved at all. replied at the topic. • hazula: @n0kada @nalsh @yukihiro_matz Thanks so much! I actually did look at how internal is set when external != locale https://github.com/rspec/rspec-support/commit/db2c3a43e1cdb0fc1491394328f78c712aa9ed19#diff-61bdbe8c6f22bdaffeeacf872bdf7d1bR20 … • https://github.com/ruby/ruby/blob/v2_1_0/NEWS#L176

Slide 65

Slide 65 text

• - https://github.com/rspec/rspec-support/pull/151#discussion_r22572815 ▪ I was confusing enc.ascii_compatible? ▪ with rb_enc_check ◦ where enc_compatible is `Encoding.compatible?(str1,str2)` ◦ - https://github.com/rspec/rspec-support/pull/151#discussion_r22572359 https://github.com/rspec/rspec-support/pull/151#issuecomment-70045539 - https://github.com/rspec/rspec-support/pull/151#discussion_r22637355 per http://stackoverflow.com/questions/21289181/char174-returning-the-value-of-char0174-why which links to http://www.theasciicode.com.ar/extended-ascii-code/angle-quotes-guillemets-right-pointing-double-angle-french-quotation-marks-ascii-code-174.html - https://github.com/rspec/rspec-support/pull/151#discussion_r22991439 Hmm, looking at http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using and https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c#L540, it appears you can get the current encoding (codepage) on Windows in the command prompt by running `chcp` and change it to UTF-8 via `chcp 65001`.
- https://github.com/ruby/ruby/blob/9fd7afefd04134c98abe594154a527c6cfe2123b/ext/win32ole/win32ole.c#L540 - https://github.com/rspec/rspec-support/pull/134 - Related to https://github.com/rspec/rspec-core/pull/1760 - https://github.com/rspec/rspec-support/pull/151 - https://github.com/rspec/rspec-support/pull/152 - https://github.com/rspec/rspec-support/pull/167 - https://github.com/rspec/rspec-support/pull/172 - https://github.com/rspec/rspec-support/pull/173 (closed without merging) - https://github.com/rspec/rspec-support/pull/174 - https://github.com/rspec/rspec-dev/pull/114 - https://github.com/rspec/rspec-dev/pull/115 (open) - https://github.com/rspec/rspec-core/pull/1871 (open) - https://github.com/rspec/rspec-support/pull/171 (split into other PRs) - https://github.com/rspec/rspec.github.io/pull/65 (open) - TBD - https://github.com/rspec/rspec-support/pull/176 - https://github.com/rspec/rspec-support/pull/167 - Differ tests no longer use Differ to report diff expectation Differ https://github.com/rspec/rspec-support/pull/174 https://github.com/ruby/ruby/blob/aacc35e144/encoding.c#L1741 - https://github.com/rspec/rspec-support/pull/151#discussion_r22637177 https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289 https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3119-lL312 and see https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4242-L4250 and https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3917-L3973 (and maybe https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289-L4294 etc) And then probably also look at the weirdness that is converter not found: https://github.com/rubyspec/rubyspec/blob/archive/core/string/shared/encode.rb#L96-L101 myronmarston: Wow, I had no idea just how weird and complex ruby encoding behavior is! 
- https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4242-L4250 - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3917-L3973 - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L4289-L4294 - https://github.com/ruby/ruby/blob/34fbf57aaa/transcode.c#L3119-lL3121 - https://github.com/rubyspec/rubyspec/blob/archive/core/string/shared/encode.rb#L96-L101 - https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/string/fixtures/utf-8-encoding.rb - https://github.com/bf4/rspec-support/commit/db2c3a43e1cdb0fc1491394328f78c712aa9ed19 - https://github.com/jruby/jruby/blob/c1be61a501d1295fa1ea9f894e1a8e186411f32a/test/mri/ruby/envutil.rb#L150 rack invalid https://github.com/ruby/ruby/commit/4a50d447d9618b2e3df126e159aa1d735e429a70 Refs: - https://github.com/rspec/rspec-support/pull/167 - https://github.com/rspec/rspec-support/pull/151 - https://github.com/rspec/rspec-support/pull/134#issuecomment-68984440 Documentation: How to test: - https://github.com/rspec/rspec-support/pull/171 - Ruby sources - The default external encoding is initialized by the locale or -E option. https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c#L1398 - Not sure how to do a better test, since locale depends on weird platform-specific stuff https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/encoding/find_spec.rb#L57 - Encoding.compatible? https://github.com/rubyspec/rubyspec/blob/91ce9f6549/core/encoding/compatible_spec.rb#L31 - ◦