Of representation and interpretation: A unified theory - phpDay 2019

Video recording: https://www.youtube.com/watch?v=Uvy26uMys2c
Joind.in: https://joind.in/talk/85813

Many hard problems in programming originate from one single source: not properly distinguishing the representation of data from the way it is interpreted. Have you ever written code that filters $_GET for SQL injection attempts? Struggled with timezones? Tried to get escaping right for JavaScript in HTML? Detected the character encoding of a string? All are examples of this one problem. In this talk we will look at some examples of the representation-interpretation problem and find the general pattern behind it. We will see how primitive types make it so hard for us to get this right, and how we can use value objects to steer us in the right direction. Once you notice the pattern, you’ll be able to reason about and solve these problems much more easily. Contains: math, character sets, strong opinions on string escaping, and an almost illegal slide.

Arnout Boks

May 11, 2019

Transcript

  1. Of representation and interpretation A unified theory @arnoutboks Arnout Boks

    #phpday 11-05-2019
  2. @arnoutboks #phpday Hard problems in programming • Cache invalidation •

    Naming things • Off-by-one errors
  3. @arnoutboks #phpday Hard problems in programming • String escaping •

    Timezones • Character encoding
  4. @arnoutboks #phpday Many difficulties with these topics are related

  5. @arnoutboks #phpday This talk is… • My personal view •

    Not an absolute truth • Meant to make you think
  6. Data And its meaning

  7. @arnoutboks #phpday A number

  8. @arnoutboks #phpday A byte

  9. @arnoutboks #phpday Some JSON data { "income":100000 }

  10. @arnoutboks #phpday A word

  11. @arnoutboks #phpday A prime number 4931083597028501900275777672390764957284907772150208632080750184097926278850976588645578020 1366007328679544734112831735367831201557535981978545054811571939345877330038009932619505876 4525023820408110189885042615176579941704250889037029119015870030479432826073821469541570330 2279875576818956016240300641115169008728798381942582716745647748166843479284645809291315318 6007001004335318936319343912948604450370991980047709462921558180711169153031876288477878354

    1575932891093295447350881882465495060005019006274705305381164278294267474853496525745368151 1706550281905552656221353146310421008662867971144467063669219825861581112515556504813420768 6732340765505485910826956266693066236799702104812396562518006818323653959348395675357557532 4619023481064700987753027956186892925380693305204238149969945456945774138335689906005870832 1812704861133682026515905166351874029X18197693937677852928722109550412925792573818660584501 5055250274994771883129310457698090915304613359419030258813205932277444385255046677902451869 7062627788891979580423065750615669834695617797879659201644051939960716981112615195610276283 2339825791423321726961443744381056485529348876349210309887028787453233132532122678633283702 7925099749969488775936915917644588032718384740235933020374888506755706587919461134193230781 4854436454375113207098606390746417564121635042388002967808558670370387509410769821183765499 2052043682558546422885024299633226853691246485500075591664024729240716450725319674499952944 8434741902107729606820558130923626837987951966199798285525887161096136561780745661592488660 8898164568541721362920846656279131478466791550965154310113538586208196875836883595577893914 5453935681996098808540476590735897289898342504712891841626587896821853808795627903997862944 9397605467534821256750121517082737107646270712467532102483678159400087505452543537
  12. @arnoutboks #phpday Data has a different meaning under different interpretations
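
The point of slides 7 to 12 can be sketched in a few lines of PHP. The helper names below are made up for illustration; only `unpack()` and `bin2hex()` are real PHP functions. The same four bytes yield different values depending on which interpretation is applied:

```php
<?php

// Four raw bytes; on their own they carry no meaning.
$bytes = "\x00\x00\x01\x00";

// Hypothetical helper: interpret the bytes as an unsigned 32-bit big-endian integer.
function interpret_as_uint32_be(string $bytes): int
{
    return unpack('N', $bytes)[1];
}

// Hypothetical helper: interpret the same bytes as an unsigned 32-bit little-endian integer.
function interpret_as_uint32_le(string $bytes): int
{
    return unpack('V', $bytes)[1];
}

// Hypothetical helper: interpret the bytes as a hexadecimal text dump.
function interpret_as_hex_text(string $bytes): string
{
    return bin2hex($bytes);
}

echo interpret_as_uint32_be($bytes), "\n"; // 256
echo interpret_as_uint32_le($bytes), "\n"; // 65536
echo interpret_as_hex_text($bytes), "\n";  // 00000100
```

None of the three results is more "correct" than the others; which one is meaningful depends on what the producer of the bytes intended.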

  13. Functions A bit of theory

  14. @arnoutboks #phpday Functions x y y = f(x) f

  15. @arnoutboks #phpday Functions D R x y f: D →

    R y = f(x)
  16. @arnoutboks #phpday Functions <?php function f(D $x): R { $y

    = do_something_with($x); return $y; }
  17. @arnoutboks #phpday Functions float int 3.84 4 round: float →

    int 4 = round(3.84)
  18. @arnoutboks #phpday Functions D R x1 y f: D →

    R x2
  19. @arnoutboks #phpday Functions float int 3.84 4 round: float →

    int 4.35
  20. @arnoutboks #phpday Functions D R x y1 f: D →

    R y2
  21. @arnoutboks #phpday Pure functions D R x y1 f: D

    → R y2
  22. @arnoutboks #phpday Impure functions • Randomness • State (global or

    local) • IO • Filesystem • Network • System clock
  23. @arnoutboks #phpday Functions D R x ? f: D →

    R
  24. @arnoutboks #phpday Functions (in mathematics) D R x f: D

    → R
  25. @arnoutboks #phpday Functions (in programming) D R x Exception! f:

    D → R
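
As a rough illustration of slides 17 to 25, the sketch below contrasts a pure function, an impure one, and a function that only supports part of its declared domain. The function names `round_to_int`, `seconds_since` and `parse_digit` are assumptions made for this example and do not appear in the talk.

```php
<?php

// Pure: the same input always gives the same output, with no side effects.
// Several inputs may share one output (3.84 and 4.35 both map to 4).
function round_to_int(float $x): int
{
    return (int) round($x);
}

// Impure: the result depends on the system clock, not only on the argument.
function seconds_since(int $timestamp): int
{
    return time() - $timestamp;
}

// Partial: the declared domain is "string", but only single digits are supported.
// In programming, the unsupported inputs typically end in an exception.
function parse_digit(string $s): int
{
    if (preg_match('/\A[0-9]\z/', $s) !== 1) {
        throw new InvalidArgumentException('Not a single digit: ' . $s);
    }
    return (int) $s;
}

var_dump(round_to_int(3.84)); // int(4)
var_dump(round_to_int(4.35)); // int(4)

try {
    parse_digit('x');
} catch (InvalidArgumentException $e) {
    echo 'Exception: ', $e->getMessage(), "\n";
}
```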
  26. Representation & Interpretation

  27. @arnoutboks #phpday Level of abstraction float int round: abstraction

  28. @arnoutboks #phpday Level of abstraction Money string serialize_money: abstraction

  29. @arnoutboks #phpday Level of abstraction string byte[] encode_as_utf16: abstraction

  30. @arnoutboks #phpday Representation • Translation of a high-level concept to

    a lower-level representation • All input values supported • Lossless (ideally) • Same data, just in a different form
  31. @arnoutboks #phpday Level of abstraction byte[] string decode_from_utf16: abstraction

  32. @arnoutboks #phpday Level of abstraction string Money parse_money: abstraction

  33. @arnoutboks #phpday Interpretation • Translation of a low-level representation (back)

    to a higher-level concept • Usually not all input values supported • Same data, just in a different form
  34. @arnoutboks #phpday Why do we do this?

  35. None
  36. None
  37. @arnoutboks #phpday Transmission of data Money string abstraction byte[] light

    pulses, magnetic grains, etc. …
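
A minimal sketch of the representation/interpretation pair from slides 28 and 32 could look as follows. The function names `serialize_money` and `parse_money` come from the slides; the shape of the `Money` class and the `"EUR 10.50"` text format are assumptions chosen for this example.

```php
<?php

final class Money
{
    private int $cents;
    private string $currency;

    public function __construct(int $cents, string $currency)
    {
        if ($cents < 0 || preg_match('/\A[A-Z]{3}\z/', $currency) !== 1) {
            throw new InvalidArgumentException('Invalid money value');
        }
        $this->cents = $cents;
        $this->currency = $currency;
    }

    public function cents(): int
    {
        return $this->cents;
    }

    public function currency(): string
    {
        return $this->currency;
    }
}

// Representation: works for every Money value and loses no information.
function serialize_money(Money $money): string
{
    return sprintf(
        '%s %d.%02d',
        $money->currency(),
        intdiv($money->cents(), 100),
        $money->cents() % 100
    );
}

// Interpretation: only strings in the expected format are supported.
function parse_money(string $text): Money
{
    if (preg_match('/\A([A-Z]{3}) ([0-9]+)\.([0-9]{2})\z/', $text, $m) !== 1) {
        throw new InvalidArgumentException('Not a money string: ' . $text);
    }
    return new Money((int) $m[2] * 100 + (int) $m[3], $m[1]);
}

echo serialize_money(new Money(1050, 'EUR')), "\n"; // EUR 10.50
```

Because the representation is lossless, `parse_money(serialize_money($m))` gives back an equal value, while an arbitrary string such as `"ten euros"` is rejected.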
  38. Escaping A different view

  39. @arnoutboks #phpday Escaping for SQL <?php $username = $_POST['username']; $sql

    = "SELECT * FROM users WHERE username = '" . db_escape_string($username) . "'"; $result = db_query($sql);
  40. @arnoutboks #phpday Escaping for SQL <?php // $_POST['username'] = "';

    DROP TABLE users; --" $username = $_POST['username']; $sql = "SELECT * FROM users WHERE username = '" . db_escape_string($username) . "'"; $result = db_query($sql);
  41. @arnoutboks #phpday Escaping for SQL <?php // $_POST['username'] = "John

    O'Shea" $username = $_POST['username']; $sql = "SELECT * FROM users WHERE username = '" . db_escape_string($username) . "'"; $result = db_query($sql);
  42. @arnoutboks #phpday Escaping for SQL <?php // returns "John O'Shea"

    $username = get_username_using_api(); $sql = "SELECT * FROM users WHERE username = '" . db_escape_string($username) . "'"; $result = db_query($sql);
  43. @arnoutboks #phpday Escaping for SQL <?php // $username = "John

    O'Shea" $username = /* (whatever) */; $sql = "SELECT * FROM users WHERE username = '" . db_escape_string($username) . "'"; $result = db_query($sql);
  44. @arnoutboks #phpday String escaping is NOT a security measure

  45. @arnoutboks #phpday String escaping is just properly representing data

  46. @arnoutboks #phpday Escaping as representation string SQLFragment (?) db_escape_string: "O'Shea"

    "O\'Shea"
  47. @arnoutboks #phpday Escaping as representation <?php db_escape_string("John O'Shea"); // ->

    "John O\'Shea" Better: <?php as_sql_string_literal("John O'Shea"); // -> "'John O\'Shea'"
  48. @arnoutboks #phpday Escaping as representation string SQLStringLiteral as_sql_string_literal: "O'Shea" "'O\'Shea'"

  49. @arnoutboks #phpday Chained escaping as representation <?php $name = "John

    O'Shea"; $js_name = as_js_string_literal($name); $js_script = "showGreeting(" . $js_name . ")"; $html_tag = "<a onclick=" . as_html_attribute_value($js_script) . ">Show greeting</a>"; $sql = "INSERT INTO html_examples VALUES(" . as_sql_string_literal($html_tag) . ")";
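
The three helpers used on slide 49 are not defined in the deck. One possible sketch is shown below, under stated assumptions: `as_sql_string_literal` assumes a standard SQL dialect in which a quote is escaped by doubling it (the slides show backslash escaping instead, and real code would normally rely on the database driver's quoting or on parameterised queries), and the JavaScript helper leans on `json_encode()` with its `JSON_HEX_*` flags.

```php
<?php

// Assumption: standard SQL string literals, where ' is escaped as ''.
function as_sql_string_literal(string $value): string
{
    return "'" . str_replace("'", "''", $value) . "'";
}

// A JSON string is also a valid JavaScript string literal; the HEX flags
// additionally encode <, >, &, ' and " inside the value as \uXXXX escapes.
function as_js_string_literal(string $value): string
{
    return json_encode(
        $value,
        JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_THROW_ON_ERROR
    );
}

// Returns the value including its surrounding double quotes.
function as_html_attribute_value(string $value): string
{
    return '"' . htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '"';
}

$name = "John O'Shea";
$js_name = as_js_string_literal($name);
$js_script = "showGreeting(" . $js_name . ")";
$html_tag = "<a onclick=" . as_html_attribute_value($js_script) . ">Show greeting</a>";
$sql = "INSERT INTO html_examples VALUES(" . as_sql_string_literal($html_tag) . ")";

echo $html_tag, "\n";
// <a onclick="showGreeting(&quot;John O\u0027Shea&quot;)">Show greeting</a>
```

Each step represents the complete output of the previous step for the next context, so no step needs to know what the earlier ones did.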
  50. @arnoutboks #phpday Premature representation <?php foreach ($_REQUEST as $key =>

    $value) { $_REQUEST[$key] = htmlentities( db_escape_string($value) ); } // ‘Now the input data is safe’
  51. @arnoutboks #phpday Premature representation <?php foreach ($_REQUEST as $key =>

    $value) { $_REQUEST[$key] = htmlentities( db_escape_string($value) ); } // ‘Now the input data is safe’
  52. @arnoutboks #phpday Premature representation <?php foreach ($_REQUEST as $key =>

    $value) { if ( strpos($value, "DROP TABLE") !== false ) { die("SQL injection!"); } }
  53. @arnoutboks #phpday Premature representation <?php foreach ($_REQUEST as $key =>

    $value) { if ( strpos($value, "DROP TABLE") !== false ) { die("SQL injection!"); } }
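
Why representing input too early causes trouble can be shown with a small sketch. The helper `as_html_paragraph` is hypothetical; `htmlspecialchars()` is the real PHP function. Once the raw value has been escaped at the input boundary, every later consumer receives HTML while still believing it holds plain text:

```php
<?php

// Raw input as received; it is plain text, not HTML.
$raw = 'Tom & Jerry';

// Premature representation: escaping for HTML at the input boundary.
$premature = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// Hypothetical output helper that, correctly, represents plain text as HTML.
function as_html_paragraph(string $text): string
{
    return '<p>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</p>';
}

echo as_html_paragraph($raw), "\n";       // <p>Tom &amp; Jerry</p>
echo as_html_paragraph($premature), "\n"; // <p>Tom &amp;amp; Jerry</p>

// Non-HTML consumers are affected too: the text is now 15 bytes instead of 11.
echo strlen($raw), ' vs ', strlen($premature), "\n"; // 11 vs 15
```

Keeping the raw value and representing it only at the point where the target context is known avoids both problems.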
  54. What goes wrong and how to prevent that

  55. @arnoutboks #phpday Misrepresentation Money string serialize_money_ok: (€ 10) "€ 10"

  56. @arnoutboks #phpday Misrepresentation Money string serialize_money_bad: (€ 10) "10"

  57. @arnoutboks #phpday Misrepresentation Money string serialize_money_bad: (€ 10) "10" ($

    10)
  58. @arnoutboks #phpday Misrepresentation Money string (€ 10) "10" ($ 10)

    Money ?
  59. @arnoutboks #phpday That doesn’t happen to me! DateTime string 2022-03-14

    06:28:00 Europe/Rome 2022-03-14T06:28:00+01:00
  60. @arnoutboks #phpday That doesn’t happen to me! fraction float 1/3

    0.33333333333333
  61. @arnoutboks #phpday That doesn’t happen to me! string byte[] ⛽☂️

    ? encode_as_iso8859_1
  62. @arnoutboks #phpday Lossy representations cannot be re-interpreted
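
The money example from slides 55 to 58 can be sketched as below. The function names follow the slides; modelling a money value as a plain (amount, currency) pair of arguments is a simplification made only for this example.

```php
<?php

// Lossless: amount and currency both end up in the string.
function serialize_money_ok(int $amount, string $currency): string
{
    return $currency . ' ' . $amount;
}

// Lossy: the currency is silently dropped.
function serialize_money_bad(int $amount, string $currency): string
{
    return (string) $amount;
}

// With the lossless representation, different values stay distinguishable.
var_dump(serialize_money_ok(10, 'EUR') === serialize_money_ok(10, 'USD'));   // bool(false)

// With the lossy one, EUR 10 and USD 10 collapse into the same string "10",
// so no interpretation function can tell which one was meant.
var_dump(serialize_money_bad(10, 'EUR') === serialize_money_bad(10, 'USD')); // bool(true)
```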

  63. @arnoutboks #phpday Misinterpretation byte[] […] string "café"

  64. @arnoutboks #phpday Misinterpretation byte[] […] string "café" utf8 decode_from_utf8 Exception!

  65. @arnoutboks #phpday Misinterpretation byte[] […] string "café" iso8859-1 utf8

  66. @arnoutboks #phpday Misinterpretation byte[] […] string "café" iso8859-1 utf8 "café"

  67. @arnoutboks #phpday Duck typing byte[] […] iso8859-1 utf8

  68. @arnoutboks #phpday Duck typing byte[] […] iso8859-1 utf8 HOW?

  69. @arnoutboks #phpday The meaning of data comes from our interpretation

    of it…
  70. @arnoutboks #phpday …which may very well be wrong
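
The "café" slides can be reproduced with a short sketch, assuming the mbstring extension is available (the value-object slides later in the deck rely on it as well). The names `decode_from_utf8` and `decode_from_iso88591` are illustrative, and both functions return their result as UTF-8 so it can be printed.

```php
<?php

// "café" represented as bytes in two encodings. Hex escapes are used so the
// example does not depend on the encoding of this source file.
$cafe_utf8 = "caf\xC3\xA9";
$cafe_iso88591 = "caf\xE9";

// Interpretation as UTF-8: not every byte sequence is valid, so this can fail.
function decode_from_utf8(string $bytes): string
{
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        throw new InvalidArgumentException('Bytes are not valid UTF-8');
    }
    return $bytes;
}

// Interpretation as ISO-8859-1: every byte is a valid character, so this
// never fails, even when the bytes were meant as something else.
function decode_from_iso88591(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// Wrong interpretation that fails loudly:
try {
    decode_from_utf8($cafe_iso88591);
} catch (InvalidArgumentException $e) {
    echo 'Exception: ', $e->getMessage(), "\n";
}

// Wrong interpretation that succeeds silently and yields mojibake:
echo bin2hex(decode_from_iso88591($cafe_utf8)), "\n"; // 636166c383c2a9
```

The hex dump `636166c383c2a9` is the UTF-8 form of "cafÃ©": the two bytes of "é" were read as two separate ISO-8859-1 characters. Nothing in the bytes themselves reveals which interpretation was intended.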

  71. @arnoutboks #phpday Interpretation hints Store/transmit desired interpretation along with data

    Examples: • HTTP headers • Content-Type • Content-Encoding • Content-Language • UTF8 byte order mark
  72. @arnoutboks #phpday Interpretation hints Store/transmit desired interpretation along with data

    Examples: • Music staves
  73. @arnoutboks #phpday Interpretation hints utf16 LE utf16 BE […]

  74. @arnoutboks #phpday Interpretation hints utf16 LE with BOM utf16 BE

    with BOM […]
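
Reading a byte order mark as an interpretation hint might look like the sketch below. The helper name `detect_utf16_byte_order` is an assumption; the BOM byte values (FE FF for big-endian, FF FE for little-endian) are those defined for UTF-16. The sketch deliberately ignores other encodings that also use a BOM.

```php
<?php

// Hypothetical helper: use the BOM, if present, to pick the interpretation.
// Returns an encoding name, or null when the data carries no UTF-16 BOM.
function detect_utf16_byte_order(string $bytes): ?string
{
    $bom = substr($bytes, 0, 2);
    if ($bom === "\xFE\xFF") {
        return 'UTF-16BE';
    }
    if ($bom === "\xFF\xFE") {
        return 'UTF-16LE';
    }
    return null;
}

// The letter "A" (U+0041) in both byte orders, each prefixed with its BOM.
$big_endian = "\xFE\xFF\x00\x41";
$little_endian = "\xFF\xFE\x41\x00";

// Without a BOM the bytes 00 41 are ambiguous: "A" in one byte order,
// the unrelated code point U+4100 in the other.
$no_hint = "\x00\x41";

var_dump(detect_utf16_byte_order($big_endian));    // string(8) "UTF-16BE"
var_dump(detect_utf16_byte_order($little_endian)); // string(8) "UTF-16LE"
var_dump(detect_utf16_byte_order($no_hint));       // NULL
```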
  75. @arnoutboks #phpday Hints by variable naming Include intended interpretation in

    variable name: • $message_utf8 • $username_sql • $title_html
  76. @arnoutboks #phpday Hints by variable naming <?php header("Content-Type: text/plain; charset=iso-8859-1");

    print $message_utf8; Charset mismatch between header and content
  77. @arnoutboks #phpday Hints by variable naming <?php $database->query("SELECT * FROM

    users WHERE username = " . $username); Username is not properly represented for SQL
  78. @arnoutboks #phpday Hints by variable naming <?php print "<h1>" .

    htmlentities($title_html) . "</h1>"; Double escaping, title is already represented as HTML
  79. PHP’s string type A strange hybrid

  80. @arnoutboks #phpday Role overloading of string string byte[] encode_as_utf16: abstraction

  81. @arnoutboks #phpday PHP’s string type is actually a byte array

  82. @arnoutboks #phpday Role overloading of string string = byte[] encode_as_utf16:

    abstraction
  83. @arnoutboks #phpday Role overloading of string string = byte[] abstraction

    ‘real’ string use as if utf16
  84. @arnoutboks #phpday Role overloading of string string = byte[] abstraction

    ‘real’ string use as if utf16 use as if utf8
  85. @arnoutboks #phpday More accurate types • A better type system

    makes mis-interpretation more difficult • If our programming language does not provide the types, we have to do it ourselves • Value Objects!
  86. @arnoutboks #phpday More accurate types <?php class UTF8Bytes { private

    string $bytes; private function __construct(string $bytes) { $this->bytes = $bytes; } public static function fromPHPString(string $bytes) { return new self($bytes); } public function toPHPString(): string { return $this->bytes; } // ... }
  87. @arnoutboks #phpday More accurate types <?php class UTF8Bytes { private

    string $bytes; // <- PHP 7.4: typed properties private function __construct(string $bytes) { $this->bytes = $bytes; } public static function fromPHPString(string $bytes) { return new self($bytes); } public function toPHPString(): string { return $this->bytes; } // ... }
  88. @arnoutboks #phpday More accurate types <?php class ISO88591Bytes { private

    string $bytes; private function __construct(string $bytes) { $this->bytes = $bytes; } public static function fromPHPString(string $bytes) { return new self($bytes); } public function toPHPString(): string { return $this->bytes; } // ... }
  89. @arnoutboks #phpday More accurate types <?php class RealString { private

    string $value_utf16; private function __construct(string $value_utf16) { $this->value_utf16 = $value_utf16; } public static function fromUTF8(UTF8Bytes $utf8_bytes) { $value_utf16 = mb_convert_encoding( $utf8_bytes->toPHPString(), 'UTF-16', 'UTF-8'); return new self($value_utf16); } public static function fromISO88591(ISO88591Bytes $iso88591_bytes) { // ... } // ... }
  90. @arnoutboks #phpday More accurate types <?php class RealString { //

    ... public function toUTF8(): UTF8Bytes { $value_utf8 = mb_convert_encoding( $this->value_utf16, 'UTF-8', 'UTF-16'); return UTF8Bytes::fromPHPString($value_utf8); } public function toISO88591(): ISO88591Bytes { $value_iso88591 = mb_convert_encoding( $this->value_utf16, 'ISO-8859-1', 'UTF-16'); return ISO88591Bytes::fromPHPString($value_iso88591); } // ... }
  91. @arnoutboks #phpday More accurate types <?php class RealString { //

    ... public function concat(RealString $other): RealString { return new RealString($this->value_utf16 . $other->value_utf16); } public function substring(int $start, int $length): RealString { $substr_utf16 = mb_substr($this->value_utf16, $start, $length, 'UTF-16'); return new RealString($substr_utf16); } // ... }
  92. @arnoutboks #phpday More accurate types (usage) <?php $string1_utf8 = UTF8Bytes::fromPHPString(

    file_get_contents("file1.txt")); $string1 = RealString::fromUTF8($string1_utf8); $string2_iso88591 = ISO88591Bytes::fromPHPString( file_get_contents("file2.txt")); $string2 = RealString::fromISO88591($string2_iso88591); // The original charsets do not matter anymore now... $new_string = $string1->concat($string2)->substring(7, 42); header("Content-Type: text/plain; charset=utf-8"); print $new_string->toUTF8()->toPHPString();
  93. @arnoutboks #phpday Without value objects iso8859-1 utf8 string = byte[]

    abstraction
  94. @arnoutboks #phpday With value objects abstraction UTF8Bytes ISO88591Bytes RealString

  95. @arnoutboks #phpday Value objects • Achieve higher levels of abstraction

    • Avoid misinterpretation • No silver bullet! • Comes with additional overhead
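
The same value-object idea plausibly carries over to escaping: the `$title_html` naming hint from slide 78 can be turned into a type. The class `HtmlFragment` and the function `render_heading` below are hypothetical names for this sketch, not part of the talk.

```php
<?php

// A string that is known to be represented as HTML already.
final class HtmlFragment
{
    private string $html;

    private function __construct(string $html)
    {
        $this->html = $html;
    }

    // The only public way in: plain text is represented as HTML exactly once.
    public static function fromPlainText(string $text): self
    {
        return new self(htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8'));
    }

    public function toString(): string
    {
        return $this->html;
    }
}

// Accepts only HtmlFragment, so a raw string cannot slip in unescaped and an
// already escaped value cannot be escaped a second time by accident.
function render_heading(HtmlFragment $title): string
{
    return '<h1>' . $title->toString() . '</h1>';
}

echo render_heading(HtmlFragment::fromPlainText('Tom & Jerry')), "\n";
// <h1>Tom &amp; Jerry</h1>
```

Passing a plain string to `render_heading()` raises a `TypeError`, which moves the double-escaping mistake from slide 78 from a silent output bug to an immediate failure. The overhead mentioned on the slide still applies: every context needs its own small class.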
  96. @arnoutboks #phpday Recap • Data • Functions • Interpretation and

    representation • String escaping • Misrepresentation and misinterpretation • The string type • Value objects
  97. @arnoutboks #phpday The meaning of data is defined by how

    we interpret it
  98. @arnoutboks #phpday Feedback & Questions @arnoutboks @arnoutboks @aboks Arnout Boks

    Please leave your feedback on joind.in: https://joind.in/talk/85813
  99. @arnoutboks #phpday Image Credits • https://pixabay.com/en/reading-relaxation-glasses-sight-3088491/ • https://fr.m.wikipedia.org/wiki/Fichier:DARPA_Big_Data.jpg •

    https://www.flickr.com/photos/elefevre/3936916711 • https://www.flickr.com/photos/perspective/9045532603 • https://www.flickr.com/photos/wwarby/11644168395/ • https://www.flickr.com/photos/opengridscheduler/16480450157 • https://www.flickr.com/photos/cogdog/14401469262 • https://www.flickr.com/photos/paulsimpson1976/3998279762 • https://www.flickr.com/photos/pewari/3499963407