How to convert strings (and how not to). AKA Escaping is harder than it seems.

I have seen a lot of bugs that boil down to character-set mismatches between APIs. Some thoughts about that.

Tim Hockin

March 12, 2020

Transcript

  1. Google Cloud Platform. How to convert strings (and how not to). AKA: escaping is harder than it seems.
     Tim Hockin <[email protected]>, Principal Software Engineer, Google, @thockin
  2. Problem statement: You have some input string that you need to store in some other system which has different rules for contents.
  3. Example: Storing user-provided input, which allows “.” in the string, in a place that doesn’t.
  4. Why doesn’t this work? It can cause conflicts if the resulting values need to be unique.
     foo.bar => foo-bar
     foo-bar => foo-bar
  5. Where have I seen this before? Most people experience this in printf() style formatting.
     “hello\n” => “hello” + newline
     “hello\\n” => “hello\n”
  6. How can I fix the naive solution? Escape the inputs.
     foo.bar => foo-bar
     foo-bar => foo--bar
     foo--bar => foo----bar
  7. Seriously? That’s gross. Yeah. And it gets worse. If you have multiple characters to handle or double-dash isn’t allowed, you need a different encoding.
     foo.bar_baz => foo-dot-bar-usc-baz
  8. Now you have to escape the escapes. You have to handle user inputs that use your escape.
     foo-dot-bar-usc-baz => foo-esc-dot-bar-esc-usc-baz
  9. Now you have to escape the escape-escapes. This is the last of it, I swear.
     foo-esc-dot-bar => foo-esc-esc-dot-bar
  10. Second-order problem: Most such strings have length limits. Escaping eats into that. Now you have to worry about truncation.
  11. Naive solution: Drop characters over the limit.
      foo-esc-dot-bar10 => foo-esc-esc-dot-bar1
      foo-esc-dot-bar11 => foo-esc-esc-dot-bar1
  12. More robust: Make them unique-within-set.
      foo-esc-dot-bar10 => foo-esc-esc-dot-ba-1
      foo-esc-dot-bar11 => foo-esc-esc-dot-ba-2
  13. Where have I seen this before? Remember Windows95? FAT32 encoded long filenames like this.
      “long_file_name.txt” => “long_f~1.txt”
      “long_for_home.txt” => “long_f~2.txt”
  14. Can “within set” apply to the other cases? Sure, if deterministic values don’t matter. The naive solutions are OK if you KNOW there isn’t a conflict.
  15. Take-aways:
      1. Transcoding strings seems simple but it isn’t.
      2. Always consider the encoding when crossing APIs.
      3. Consider how strange input might break your code. Handle it.
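
The escaping and truncation ideas in the slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scheme from the talk: instead of the `-dot-` / `-usc-` / `-esc-` tokens it uses a hypothetical table of one-letter codes after a dash (`-h` for a literal dash, `-d` for a dot, `-u` for an underscore), which keeps the encoding reversible and never produces a double dash. The names `escape`, `unescape`, and `shorten` are made up for this example, and it assumes the target system accepts letters, digits, and single dashes.

```python
# Hypothetical escape table: each special character becomes "-" plus one letter.
# A literal "-" is escaped too, so every "-" in the output starts an escape.
ESCAPES = {"-": "-h", ".": "-d", "_": "-u"}
UNESCAPES = {code[1]: ch for ch, code in ESCAPES.items()}


def escape(s: str) -> str:
    """Encode s so the result contains no '.' or '_' and no double dash."""
    return "".join(ESCAPES.get(ch, ch) for ch in s)


def unescape(s: str) -> str:
    """Invert escape(); reject text that escape() could not have produced."""
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch in "._":
            raise ValueError(f"unescaped {ch!r} at index {i}")
        if ch != "-":
            out.append(ch)
            i += 1
            continue
        # A dash must be followed by a known one-letter code.
        if i + 1 >= len(s) or s[i + 1] not in UNESCAPES:
            raise ValueError(f"bad escape sequence at index {i}")
        out.append(UNESCAPES[s[i + 1]])
        i += 2
    return "".join(out)


def shorten(name: str, limit: int, taken: set) -> str:
    """Fit name into limit characters, unique within the set `taken`.

    Uses the "-N" suffix style from slide 12. The result depends on what
    is already in `taken`, so it is not deterministic across sets.
    """
    if len(name) <= limit and name not in taken:
        taken.add(name)
        return name
    n = 1
    while True:
        suffix = f"-{n}"
        if len(suffix) >= limit:
            raise ValueError("limit too small to make a unique name")
        candidate = name[: limit - len(suffix)] + suffix
        if candidate not in taken:
            taken.add(candidate)
            return candidate
        n += 1


print(escape("foo.bar"))       # foo-dbar
print(escape("foo-bar"))       # foo-hbar
print(escape("foo.bar_baz"))   # foo-dbar-ubaz
print(unescape(escape("foo-dbar")))  # foo-dbar

used = set()
print(shorten("foo-esc-esc-dot-bar10", 20, used))  # foo-esc-esc-dot-ba-1
print(shorten("foo-esc-esc-dot-bar11", 20, used))  # foo-esc-esc-dot-ba-2
```

Because every dash in the output is followed by exactly one code letter, two different inputs cannot encode to the same string, which is the property the naive replacement on slide 4 lacks. Shortened names are lossy and generally cannot be passed back through `unescape`; as slide 14 notes, that trade-off is only acceptable when deterministic values don’t matter.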