Google Cloud Platform
How to convert strings
(and how not to)
AKA: escaping is harder than it seems
Tim Hockin
Principal Software Engineer, Google
@thockin
Slide 2
Problem statement:
You have an input string that you need to
store in another system, which has different
rules for what it may contain.
Slide 3
Example:
Storing user-provided input, which may contain “.”,
in a place that doesn’t allow it.
Slide 4
Naive solution:
Replace all “.” with “-”.
foo.bar => foo-bar
Slide 5
Why doesn’t this work?
It can cause conflicts if the resulting values
need to be unique.
foo.bar => foo-bar
foo-bar => foo-bar
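A minimal sketch of the conflict in Python. The function name naive_convert is hypothetical, chosen only for illustration. Two different inputs end up as the same stored value, so uniqueness of the inputs is lost.

```python
def naive_convert(name: str) -> str:
    """Replace every "." with "-" (the naive approach)."""
    return name.replace(".", "-")


inputs = ["foo.bar", "foo-bar"]
outputs = [naive_convert(name) for name in inputs]

# Two distinct inputs, one distinct output: the mapping cannot be reversed.
print(outputs)            # ['foo-bar', 'foo-bar']
print(len(set(outputs)))  # 1
```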
Slide 6
Where have I seen this before?
Most people first meet this in printf()-style
formatting.
“hello\n” => “hello” + newline
“hello\\n” => “hello\n”
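Python string literals use the same backslash escapes, so the effect can be checked directly. A small illustration (the variable names are arbitrary):

```python
newline = "hello\n"   # backslash-n is an escape: one newline character
literal = "hello\\n"  # escaped backslash: a real backslash followed by "n"

print(len(newline))  # 6
print(len(literal))  # 7
print(literal)       # hello\n
```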
Slide 7
How can I fix the naive solution?
Escape the inputs.
foo.bar => foo-bar
foo-bar => foo--bar
foo--bar => foo----bar
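A sketch of this escaping in Python, assuming the rule behind the examples is: a literal “-” becomes “--”, and “.” becomes “-”. The function name escape_dots is hypothetical. Under that assumed rule the three examples no longer conflict, but two consecutive dots still produce the same output as a single dash, which hints at why this is harder than it seems.

```python
def escape_dots(name: str) -> str:
    # Assumed rule: double every literal "-" first, then turn "." into "-".
    # The order matters: the other way round would also double the dashes
    # that came from dots.
    return name.replace("-", "--").replace(".", "-")


print(escape_dots("foo.bar"))   # foo-bar
print(escape_dots("foo-bar"))   # foo--bar
print(escape_dots("foo--bar"))  # foo----bar

# Remaining ambiguity under this assumed rule:
print(escape_dots("foo..bar"))  # foo--bar, same as the output for "foo-bar"
```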
Slide 8
Seriously? That’s gross.
Yeah. And it gets worse. If you have multiple
characters to handle or double-dash isn’t
allowed, you need a different encoding.
foo.bar_baz => foo-dot-bar-usc-baz
Slide 9
Now you have to escape the escapes
You have to handle user inputs that use your
escape.
foo-dot-bar-usc-baz => foo-esc-dot-bar-esc-usc-baz
Slide 10
Now you have to escape the escape-escapes
This is the last of it, I swear.
foo-esc-dot-bar => foo-esc-esc-dot-bar
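One way to keep a word-based encoding reversible is to escape every literal “-”, not only the ones that happen to form an escape word. A minimal sketch in Python, assuming a fixed table of five-character tokens; the names TOKENS, encode, and decode are hypothetical. It reproduces the foo.bar_baz example, but its outputs for inputs containing “-” are longer than the ones shown above, and adjacent special characters yield a double dash, so it would not fit a target that forbids “--”.

```python
# Hypothetical token table: each special character maps to a "-xxx-" token.
TOKENS = {".": "-dot-", "_": "-usc-", "-": "-esc-"}
DECODE = {token: char for char, token in TOKENS.items()}
TOKEN_LEN = 5  # every token is exactly five characters long


def encode(s: str) -> str:
    """Replace each special character with its token; copy the rest."""
    return "".join(TOKENS.get(ch, ch) for ch in s)


def decode(s: str) -> str:
    """Reverse encode(): every "-" in the input must start a known token."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "-":
            token = s[i:i + TOKEN_LEN]
            if token not in DECODE:
                raise ValueError(f"invalid token at index {i}: {token!r}")
            out.append(DECODE[token])
            i += TOKEN_LEN
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


print(encode("foo.bar_baz"))          # foo-dot-bar-usc-baz
print(encode("foo-dot-bar-usc-baz"))  # foo-esc-dot-esc-bar-esc-usc-esc-baz
print(decode(encode("foo-esc-dot-bar")) == "foo-esc-dot-bar")  # True
```

Because the decoder reads one whole token at every “-”, a user-supplied “-dot-” can never be mistaken for an encoded “.”.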
Slide 11
Second-order problem:
Most such strings have length limits. Escaping
eats into that. Now you have to worry about
truncation.
Slide 12
Naive solution:
Drop characters over the limit.
foo-esc-dot-bar10 => foo-esc-esc-dot-bar1
foo-esc-dot-bar11 => foo-esc-esc-dot-bar1
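A short Python sketch of why dropping characters fails. It starts from the already-escaped strings and assumes a 20-character limit, which is inferred from the examples rather than stated; the names LIMIT and truncate are hypothetical.

```python
LIMIT = 20  # assumed limit, inferred from the examples


def truncate(name: str, limit: int = LIMIT) -> str:
    """Keep only the first `limit` characters."""
    return name[:limit]


a = truncate("foo-esc-esc-dot-bar10")
b = truncate("foo-esc-esc-dot-bar11")

print(a)       # foo-esc-esc-dot-bar1
print(a == b)  # True: the distinguishing last character was cut off
```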
Slide 13
More robust:
Make them unique-within-set.
foo-esc-dot-bar10 => foo-esc-esc-dot-ba-1
foo-esc-dot-bar11 => foo-esc-esc-dot-ba-2
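A sketch of one way to do this in Python: truncate to leave room for a numeric suffix, and keep counting until the result is unused within the set. The function name assign_unique and the 20-character limit are assumptions, and the sketch expects the limit to be comfortably larger than any suffix.

```python
def assign_unique(names, limit=20):
    """Map each name to a value of at most `limit` chars, unique in this set."""
    taken = set()
    result = {}
    counter = 0
    for name in names:
        if len(name) <= limit and name not in taken:
            short = name  # fits as-is and is not yet used
        else:
            while True:
                counter += 1
                suffix = f"-{counter}"
                short = name[:limit - len(suffix)] + suffix
                if short not in taken:
                    break
        taken.add(short)
        result[name] = short
    return result


escaped = ["foo-esc-esc-dot-bar10", "foo-esc-esc-dot-bar11"]
for original, short in assign_unique(escaped).items():
    print(original, "=>", short)
# foo-esc-esc-dot-bar10 => foo-esc-esc-dot-ba-1
# foo-esc-esc-dot-bar11 => foo-esc-esc-dot-ba-2
```

The suffix a name receives depends on the order and contents of the set, so the same input can get a different value in a different set.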
Slide 14
Where have I seen this before?
Remember Windows 95? Its FAT file system (VFAT)
gave long filenames short 8.3 aliases like this.
“long_file_name.txt” => “long_f~1.txt”
“long_for_home.txt” => “long_f~2.txt”
Slide 15
Can “within set” apply to the other cases?
Sure, if you don’t need deterministic values
(the result depends on what else is in the set).
The naive solutions are OK if you KNOW there
isn’t a conflict.
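“Knowing” there isn’t a conflict can be made explicit by checking the whole set up front. A minimal Python sketch, assuming the naive “.” to “-” mapping; the function name convert_checked is hypothetical.

```python
def convert_checked(names):
    """Apply the naive "." -> "-" mapping, failing loudly on any conflict."""
    owner = {}  # converted value -> original input that produced it
    for name in names:
        converted = name.replace(".", "-")
        if owner.get(converted, name) != name:
            raise ValueError(
                f"{name!r} and {owner[converted]!r} both map to {converted!r}"
            )
        owner[converted] = name
    return {original: converted for converted, original in owner.items()}


print(convert_checked(["foo.bar", "baz.qux"]))
# {'foo.bar': 'foo-bar', 'baz.qux': 'baz-qux'}
```

Calling it with both “foo.bar” and “foo-bar” in the same set raises ValueError instead of silently storing a duplicate.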
Slide 16
Take-aways:
1. Transcoding strings seems simple but it
isn’t.
2. Always consider the encoding when
crossing APIs.
3. Consider how strange input might break
your code. Handle it.