How to convert strings (and how not to). AKA Escaping is harder than it seems.

I have seen a lot of bugs that boil down to character-set mismatches between APIs. Some thoughts about that.

Tim Hockin

March 12, 2020

Transcript

  1. Google Cloud Platform
    How to convert strings
    (and how not to)
    AKA: escaping is harder than it seems
    Tim Hockin
    Principal Software Engineer, Google
    @thockin

  2. Google Cloud Platform
    Problem statement:
    You have some input string that you need to
    store in some other system which has different
    rules for contents.

  3. Google Cloud Platform
    Example:
    Storing a user-provided input which allows “.” in
    the string into a place that doesn’t.

  4. Google Cloud Platform
    Naive solution:
    Replace all “.” with “-”.
    foo.bar => foo-bar

  5. Google Cloud Platform
    Why doesn’t this work?
    It can cause conflicts if the resulting values
    need to be unique.
    foo.bar => foo-bar
    foo-bar => foo-bar
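
A minimal Python sketch of the conflict, assuming a hypothetical helper named `naive_convert` (the name is not from the talk):

```python
def naive_convert(name: str) -> str:
    """Replace every '.' with '-' (the naive approach)."""
    return name.replace(".", "-")


inputs = ["foo.bar", "foo-bar"]
outputs = [naive_convert(s) for s in inputs]

print(outputs)                          # ['foo-bar', 'foo-bar']
print(len(set(outputs)) < len(inputs))  # True: two distinct inputs collided
```

Because the mapping is many-to-one, the target system has no way to tell which input produced `foo-bar`.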

  6. Google Cloud Platform
    Where have I seen this before?
    Most people experience this in printf()-style
    format strings.
    “hello\n” => “hello” + newline
    “hello\\n” => “hello\n”
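
The same behavior can be seen in Python string literals, which use the same backslash escapes as C. This is only an illustration; the variable names are made up for the example:

```python
escaped_newline = "hello\n"     # backslash-n decodes to a single newline character
literal_backslash = "hello\\n"  # the doubled backslash decodes to one backslash, then 'n'

print(len(escaped_newline))     # 6
print(len(literal_backslash))   # 7
print(literal_backslash)        # hello\n
```

The escape character has to be escapable itself, otherwise there would be no way to write a literal backslash followed by `n`.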

  7. Google Cloud Platform
    How can I fix the naive solution?
    Escape the inputs.
    foo.bar => foo-bar
    foo-bar => foo--bar
    foo--bar => foo----bar
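
A sketch of this dash-doubling rule in Python, assuming a hypothetical helper named `escape_dashes`. It reproduces the three mappings above, and it also surfaces one edge case worth testing: a "." next to a "-" can still collide.

```python
def escape_dashes(name: str) -> str:
    """Double every existing '-' and map '.' to a single '-'."""
    # Order matters: double the original dashes before introducing new ones.
    return name.replace("-", "--").replace(".", "-")


for s in ["foo.bar", "foo-bar", "foo--bar"]:
    print(s, "=>", escape_dashes(s))
# foo.bar => foo-bar
# foo-bar => foo--bar
# foo--bar => foo----bar

# Edge case: adjacent '.' and '-' produce the same run of three dashes.
print(escape_dashes("a.-b"), escape_dashes("a-.b"))  # a---b a---b
```

The output for `a.-b` and `a-.b` is identical because a run of dashes carries no marker saying where one escaped character ends and the next begins.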

  8. Google Cloud Platform
    Seriously? That’s gross.
    Yeah. And it gets worse. If you have multiple
    characters to handle or double-dash isn’t
    allowed, you need a different encoding.
    foo.bar_baz => foo-dot-bar-usc-baz

  9. Google Cloud Platform
    Now you have to escape the escapes
    You have to handle user inputs that use your
    escape.
    foo-dot-bar-usc-baz => foo-esc-dot-bar-esc-usc-baz

  10. Google Cloud Platform
    Now you have to escape the escape-escapes
    This is the last of it, I swear.
    foo-esc-dot-bar => foo-esc-esc-dot-bar
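
One way to check an escaping scheme is to write the decoder and verify that every input round-trips. The sketch below does not use the word-style tokens shown on the slides. It swaps in a simpler single-escape-character encoding, an assumption made here because it is easy to decode unambiguously: "-" is the escape character and is always followed by one letter. The table and the names `ESCAPES`, `encode`, and `decode` are hypothetical.

```python
# Hypothetical escape table: '-' is the escape character, followed by one letter.
ESCAPES = {"-": "-h", ".": "-d", "_": "-u"}
UNESCAPES = {code[1]: char for char, code in ESCAPES.items()}


def encode(name: str) -> str:
    """Escape '-', '.' and '_'; the result never contains '.', '_' or '--'."""
    return "".join(ESCAPES.get(ch, ch) for ch in name)


def decode(encoded: str) -> str:
    """Invert encode(); raise ValueError on a dangling or unknown escape."""
    out = []
    chars = iter(encoded)
    for ch in chars:
        if ch != "-":
            out.append(ch)
            continue
        code = next(chars, None)  # the letter after the escape character
        if code not in UNESCAPES:
            raise ValueError(f"invalid escape in {encoded!r}")
        out.append(UNESCAPES[code])
    return "".join(out)


for s in ["foo.bar_baz", "foo-dot-bar", "foo-dbar", "a.-b", "a-.b"]:
    assert decode(encode(s)) == s  # every input survives a round trip
    print(s, "=>", encode(s))
# First line printed: foo.bar_baz => foo-dbar-ubaz
```

Since every "-" in the output is followed by exactly one code letter, the decoder never has to guess where an escape ends, and inputs that already contain something that looks like an escape (such as `foo-dbar`) are handled by the same rule.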

  11. Google Cloud Platform
    Second-order problem:
    Most such strings have length limits. Escaping
    eats into that. Now you have to worry about
    truncation.

  12. Google Cloud Platform
    Naive solution:
    Drop characters over the limit.
    foo-esc-dot-bar10 => foo-esc-esc-dot-bar1
    foo-esc-dot-bar11 => foo-esc-esc-dot-bar1
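
A sketch of the truncation conflict, starting from the already-escaped strings. The 20-character limit is an assumption inferred from the length of the outputs above, and `truncate` is a hypothetical helper:

```python
MAX_LEN = 20  # assumed limit; it matches the 20-character outputs above


def truncate(encoded: str, limit: int = MAX_LEN) -> str:
    """Drop every character past the limit."""
    return encoded[:limit]


a = truncate("foo-esc-esc-dot-bar10")
b = truncate("foo-esc-esc-dot-bar11")
print(a, b, a == b)  # foo-esc-esc-dot-bar1 foo-esc-esc-dot-bar1 True
```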

  13. Google Cloud Platform
    More robust:
    Make them unique-within-set.
    foo-esc-dot-bar10 => foo-esc-esc-dot-ba-1
    foo-esc-dot-bar11 => foo-esc-esc-dot-ba-2
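
A minimal sketch of unique-within-set truncation, again starting from the escaped strings and assuming a 20-character limit. The helper `unique_truncate` and the `-<n>` suffix format are assumptions chosen to reproduce the outputs above:

```python
def unique_truncate(encoded: str, taken: set[str], limit: int = 20) -> str:
    """Return a name of at most `limit` characters that is not yet in `taken`."""
    # Names that already fit and are unused pass through unchanged.
    if len(encoded) <= limit and encoded not in taken:
        taken.add(encoded)
        return encoded
    n = 1
    while True:
        suffix = f"-{n}"
        if len(suffix) >= limit:
            raise ValueError("ran out of room for a unique suffix")
        # Make room for the suffix, then append it.
        candidate = encoded[: limit - len(suffix)] + suffix
        if candidate not in taken:
            taken.add(candidate)
            return candidate
        n += 1


taken: set[str] = set()
print(unique_truncate("foo-esc-esc-dot-bar10", taken))  # foo-esc-esc-dot-ba-1
print(unique_truncate("foo-esc-esc-dot-bar11", taken))  # foo-esc-esc-dot-ba-2
```

The result depends on the order in which names are added, so the same input can map to different outputs in different sets.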

  14. Google Cloud Platform
    Where have I seen this before?
    Remember Windows 95? VFAT generated 8.3 short
    names for long filenames like this.
    “long_file_name.txt” => “LONG_F~1.TXT”
    “long_for_home.txt” => “LONG_F~2.TXT”

  15. Google Cloud Platform
    Can “within set” apply to the other cases?
    Sure, if deterministic values don’t matter. The
    naive solutions are OK if you KNOW there isn’t
    a conflict.
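
One way to "KNOW" there is no conflict is to check the whole set up front. The sketch below applies the naive "." to "-" replacement to a closed set of names and refuses to proceed if two different names map to the same value. The helper `convert_all` is hypothetical:

```python
def convert_all(names: list[str]) -> dict[str, str]:
    """Naively map '.' to '-' for a closed set of names; fail on any conflict."""
    seen: dict[str, str] = {}    # converted value -> original name
    result: dict[str, str] = {}  # original name -> converted value
    for name in names:
        converted = name.replace(".", "-")
        if converted in seen and seen[converted] != name:
            raise ValueError(
                f"{name!r} and {seen[converted]!r} both map to {converted!r}"
            )
        seen[converted] = name
        result[name] = converted
    return result


print(convert_all(["foo.bar", "baz.qux"]))
# {'foo.bar': 'foo-bar', 'baz.qux': 'baz-qux'}
```

Calling `convert_all(["foo.bar", "foo-bar"])` raises `ValueError`, which is the signal to fall back to a real escaping scheme or to a unique-within-set suffix.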

  16. Google Cloud Platform
    Take-aways:
    1. Transcoding strings seems simple but it
    isn’t.
    2. Always consider the encoding when
    crossing APIs.
    3. Consider how strange input might break
    your code. Handle it.
