Slide 1

Google Cloud Platform
How to convert strings (and how not to)
AKA: escaping is harder than it seems
Tim Hockin, Principal Software Engineer, Google
@thockin

Slide 2

Problem statement:
You have some input string that you need to store in some other system which has different rules for contents.

Slide 3

Example:
Storing user-provided input, which may contain “.”, in a place that doesn’t allow “.”.

Slide 4

Naive solution:
Replace all “.” with “-”.
    foo.bar => foo-bar

Slide 5

Why doesn’t this work?
Two different inputs can produce the same output, which is a conflict if the resulting values need to be unique.
    foo.bar => foo-bar
    foo-bar => foo-bar
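
The conflict is easy to reproduce. This is a minimal sketch; the function name `naive_convert` is made up for illustration and is not from the talk.

```python
def naive_convert(name: str) -> str:
    """Replace every '.' with '-' (the naive approach)."""
    return name.replace(".", "-")


# Two distinct inputs collapse to the same stored value.
inputs = ["foo.bar", "foo-bar"]
outputs = [naive_convert(s) for s in inputs]
print(outputs)  # ['foo-bar', 'foo-bar']
```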

Slide 6

Where have I seen this before?
Most people experience this in printf() style formatting.
    “hello\n” => “hello” + newline
    “hello\\n” => “hello\n”
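
Python string literals follow the same backslash convention, so the two cases can be checked directly. The variable names here are only illustrative.

```python
escaped_newline = "hello\n"     # backslash-n is an escape: 'hello' plus a newline
literal_backslash = "hello\\n"  # the backslash is itself escaped: 'hello', '\', 'n'

print(len(escaped_newline))    # 6
print(len(literal_backslash))  # 7
```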

Slide 7

How can I fix the naive solution?
Escape the inputs.
    foo.bar => foo-bar
    foo-bar => foo--bar
    foo--bar => foo----bar
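
One way to read the mapping above is character by character: “.” becomes “-”, and a literal “-” is doubled. The sketch below assumes exactly that reading; `escape` is a hypothetical name.

```python
def escape(name: str) -> str:
    """Map '.' to '-' and double any literal '-' so the three examples stay distinct."""
    out = []
    for ch in name:
        if ch == ".":
            out.append("-")
        elif ch == "-":
            out.append("--")
        else:
            out.append(ch)
    return "".join(out)


print(escape("foo.bar"))   # foo-bar
print(escape("foo-bar"))   # foo--bar
print(escape("foo--bar"))  # foo----bar
```

Even this is not airtight. Under this reading, an input where “.” and “-” sit next to each other can still collide: `escape(".-")` and `escape("-.")` both give `---`.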

Slide 8

Seriously? That’s gross.
Yeah. And it gets worse. If you have multiple characters to handle or double-dash isn’t allowed, you need a different encoding.
    foo.bar_baz => foo-dot-bar-usc-baz

Slide 9

Now you have to escape the escapes
You have to handle user input that already contains your escape sequences.
    foo-dot-bar-usc-baz => foo-esc-dot-bar-esc-usc-baz

Slide 10

Now you have to escape the escape-escapes
This is the last of it, I swear.
    foo-esc-dot-bar => foo-esc-esc-dot-bar
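
Putting the last three examples together, a literal reading is a single left-to-right scan: special characters become word tokens, and any token that already appears in the input gets `-esc-` put in front of it. The token table and the name `encode` are assumptions read off the examples, not a published scheme.

```python
# Hypothetical token table, read off the slide examples.
SPECIAL_CHARS = {".": "-dot-", "_": "-usc-"}
RESERVED_TOKENS = ("-dot-", "-usc-", "-esc-")
ESCAPE = "-esc-"


def encode(name: str) -> str:
    """Scan left to right, tokenizing special characters and escaping literal tokens."""
    out = []
    i = 0
    while i < len(name):
        # Does a reserved token start at this position in the input?
        token = next((t for t in RESERVED_TOKENS if name.startswith(t, i)), None)
        if token is not None:
            # Put the escape token in front; the two share a single dash.
            out.append(ESCAPE + token[1:])
            i += len(token)
        elif name[i] in SPECIAL_CHARS:
            out.append(SPECIAL_CHARS[name[i]])
            i += 1
        else:
            out.append(name[i])
            i += 1
    return "".join(out)


print(encode("foo.bar_baz"))          # foo-dot-bar-usc-baz
print(encode("foo-dot-bar-usc-baz"))  # foo-esc-dot-bar-esc-usc-baz
print(encode("foo-esc-dot-bar"))      # foo-esc-esc-dot-bar
```

This reproduces the examples, but it is only a sketch of one reading. Inputs whose characters overlap a token boundary still collide: `encode(".dot-")` and `encode("-dot.")` both give `-dot-dot-`.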

Slide 11

Second-order problem:
Most such strings have length limits. Escaping eats into that. Now you have to worry about truncation.

Slide 12

Naive solution:
Drop characters over the limit.
    foo-esc-dot-bar10 => foo-esc-esc-dot-bar1
    foo-esc-dot-bar11 => foo-esc-esc-dot-bar1
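
The outputs above are 20 characters long, so this sketch assumes a 20-character limit and starts from the already-escaped strings. `MAX_LEN` and `truncate` are illustrative names.

```python
MAX_LEN = 20  # assumed limit, inferred from the 20-character outputs above


def truncate(encoded: str, limit: int = MAX_LEN) -> str:
    """Drop every character past the limit."""
    return encoded[:limit]


a = truncate("foo-esc-esc-dot-bar10")
b = truncate("foo-esc-esc-dot-bar11")
print(a, b, a == b)  # foo-esc-esc-dot-bar1 foo-esc-esc-dot-bar1 True
```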

Slide 13

More robust:
Make them unique-within-set.
    foo-esc-dot-bar10 => foo-esc-esc-dot-ba-1
    foo-esc-dot-bar11 => foo-esc-esc-dot-ba-2
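
A rough sketch of unique-within-set shortening: keep a set of names already handed out, and when a name is too long (or already taken), cut it down and append a counter until it is free. The 20-character limit, the `-N` suffix format, and the name `shorten_unique` are assumptions taken from the examples above.

```python
def shorten_unique(encoded: str, taken: set, limit: int = 20) -> str:
    """Return a name of at most `limit` characters that is not already in `taken`."""
    if len(encoded) <= limit and encoded not in taken:
        taken.add(encoded)
        return encoded
    counter = 1
    while True:
        suffix = f"-{counter}"
        # Leave room for the suffix inside the length limit.
        candidate = encoded[: limit - len(suffix)] + suffix
        if candidate not in taken:
            taken.add(candidate)
            return candidate
        counter += 1


taken = set()
first = shorten_unique("foo-esc-esc-dot-bar10", taken)
second = shorten_unique("foo-esc-esc-dot-bar11", taken)
print(first)   # foo-esc-esc-dot-ba-1
print(second)  # foo-esc-esc-dot-ba-2
```

Note that the result depends on the order in which names are added to the set, so the same input can get a different output in a different set.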

Slide 14

Where have I seen this before?
Remember Windows 95? Its FAT filesystem (VFAT) gave long filenames 8.3 short aliases like this.
    “long_file_name.txt” => “long_f~1.txt”
    “long_for_home.txt” => “long_f~2.txt”

Slide 15

Can “within set” apply to the other cases?
Sure, if deterministic values don’t matter. The naive solutions are OK if you KNOW there isn’t a conflict.

Slide 16

Take-aways:
1. Transcoding strings seems simple but it isn’t.
2. Always consider the encoding when crossing APIs.
3. Consider how strange input might break your code. Handle it.