How to convert strings (and how not to). AKA Escaping is harder than it seems.

I have seen a lot of bugs that boil down to character-set mismatches between APIs. Some thoughts about that.

Tim Hockin

March 12, 2020

Transcript

  1. Google Cloud Platform How to convert strings (and how not to) AKA: escaping is harder than it seems

    Tim Hockin <thockin@google.com> Principal Software Engineer, Google @thockin
  2. Problem statement: You have some input string that you need to store in some other system which has different rules for contents.
  3. Example: Storing a user-provided input which allows “.” in the string into a place that doesn’t.
  4. Naive solution: Replace all “.” with “-”.

    foo.bar => foo-bar
  5. Why doesn’t this work? It can cause conflicts if the resulting values need to be unique.

    foo.bar => foo-bar
    foo-bar => foo-bar
  6. Where have I seen this before? Most people experience this in printf() style formatting.

    “hello\n” => “hello” + newline
    “hello\\n” => “hello\n”
  7. How can I fix the naive solution? Escape the inputs.

    foo.bar => foo-bar
    foo-bar => foo--bar
    foo--bar => foo----bar
  8. Seriously? That’s gross.

    Yeah. And it gets worse. If you have multiple characters to handle or double-dash isn’t allowed, you need a different encoding.
    foo.bar_baz => foo-dot-bar-usc-baz
  9. Now you have to escape the escapes

    You have to handle user inputs that use your escape.
    foo-dot-bar-usc-baz => foo-esc-dot-bar-esc-usc-baz
  10. Now you have to escape the escape-escapes

    This is the last of it, I swear.
    foo-esc-dot-bar => foo-esc-esc-dot-bar
  11. Second-order problem: Most such strings have length limits. Escaping eats into that. Now you have to worry about truncation.
  12. Naive solution: Drop characters over the limit.

    foo-esc-dot-bar10 => foo-esc-esc-dot-bar1
    foo-esc-dot-bar11 => foo-esc-esc-dot-bar1
  13. More robust: Make them unique-within-set.

    foo-esc-dot-bar10 => foo-esc-esc-dot-ba-1
    foo-esc-dot-bar11 => foo-esc-esc-dot-ba-2
  14. Where have I seen this before? Remember Windows 95? FAT32 encoded long filenames like this.

    “long_file_name.txt” => “long_f~1.txt”
    “long_for_home.txt” => “long_f~2.txt”
  15. Can “within set” apply to the other cases?

    Sure, if deterministic values don’t matter. The naive solutions are OK if you KNOW there isn’t a conflict.
  16. Take-aways:

    1. Transcoding strings seems simple but it isn’t.
    2. Always consider the encoding when crossing APIs.
    3. Consider how strange input might break your code. Handle it.
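
A minimal Python sketch of the two ideas from the slides: a reversible escape (slides 7 to 10) and unique-within-set truncation (slide 13). The scheme here is an assumption for illustration, not the token scheme shown in the slides: “-” acts as the escape character and is always followed by a one-letter code, so “.” and “_” never appear in the output and neither does a double dash. The names escape, unescape and shorten are hypothetical.

```python
# Hypothetical scheme: "-" is the escape character, followed by a one-letter code.
# Every special character maps to a distinct two-character sequence, so the
# encoding can always be reversed and two different inputs never collide.
ESCAPES = {"-": "-h", ".": "-d", "_": "-u"}
UNESCAPES = {code[1]: char for char, code in ESCAPES.items()}


def escape(s: str) -> str:
    """Replace each special character with its escape sequence."""
    return "".join(ESCAPES.get(ch, ch) for ch in s)


def unescape(s: str) -> str:
    """Invert escape(); reject strings that escape() could not have produced."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "-":
            if i + 1 >= len(s) or s[i + 1] not in UNESCAPES:
                raise ValueError(f"invalid escape at index {i}")
            out.append(UNESCAPES[s[i + 1]])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def shorten(name: str, limit: int, taken: set) -> str:
    """Fit name into limit characters, unique among the names in taken.

    The result depends on the order of calls, so it is not deterministic
    across runs that see the names in a different order.
    """
    if len(name) <= limit and name not in taken:
        taken.add(name)
        return name
    n = 1
    while True:
        suffix = f"-{n}"
        if len(suffix) >= limit:
            raise ValueError("limit too small to make the name unique")
        candidate = name[: limit - len(suffix)] + suffix
        if candidate not in taken:
            taken.add(candidate)
            return candidate
        n += 1


if __name__ == "__main__":
    print(escape("foo.bar"), escape("foo-bar"))  # two distinct outputs
    taken = set()
    print(shorten("foo-esc-esc-dot-bar10", 20, taken))
    print(shorten("foo-esc-esc-dot-bar11", 20, taken))
```

A shortened name can no longer be passed to unescape to recover the original, so a caller that needs the original value would have to store the mapping separately.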