
Of representation and interpretation: A unified theory - PHPBenelux meetup

Arnout Boks
September 10, 2020


Many hard problems in programming originate from a single source: not properly distinguishing the representation of data from the way it is interpreted. Have you ever written code that filters $_GET for SQL injection attempts? Struggled with timezones? Tried to get escaping right for JavaScript in HTML? Detected the character encoding of a string? All are examples of this one problem.

In this talk we will look at some examples of the representation-interpretation problem and find the general pattern behind it. We will see how primitive types make it so hard for us to get this right, and how we can use value objects to steer us in the right direction. You’ll start finding many more examples of this pattern and understand them more easily.



Transcript

  1. Of representation and interpretation
    A unified theory
    @arnoutboks
    Arnout Boks
    PHPBenelux
    10-09-2020


  2. @arnoutboks #PHPBenelux
    Hard problems in programming
    • Cache invalidation
    • Naming things
    • Off-by-one errors


  3. Hard problems in programming
    • String escaping
    • Timezones
    • Character encoding


  4. Many difficulties with these topics are related


  5. This talk is…
    • My personal view
    • Not an absolute truth
    • Meant to make you think


  6. Data
    And its meaning


  7. A number


  8. A byte


  9. Some JSON data
    {
    "income":100000
    }


  10. A word


  11. A prime number
    4931083597028501900275777672390764957284907772150208632080750184097926278850976588645578020
    1366007328679544734112831735367831201557535981978545054811571939345877330038009932619505876
    4525023820408110189885042615176579941704250889037029119015870030479432826073821469541570330
    2279875576818956016240300641115169008728798381942582716745647748166843479284645809291315318
    6007001004335318936319343912948604450370991980047709462921558180711169153031876288477878354
    1575932891093295447350881882465495060005019006274705305381164278294267474853496525745368151
    1706550281905552656221353146310421008662867971144467063669219825861581112515556504813420768
    6732340765505485910826956266693066236799702104812396562518006818323653959348395675357557532
    4619023481064700987753027956186892925380693305204238149969945456945774138335689906005870832
    1812704861133682026515905166351874029X18197693937677852928722109550412925792573818660584501
    5055250274994771883129310457698090915304613359419030258813205932277444385255046677902451869
    7062627788891979580423065750615669834695617797879659201644051939960716981112615195610276283
    2339825791423321726961443744381056485529348876349210309887028787453233132532122678633283702
    7925099749969488775936915917644588032718384740235933020374888506755706587919461134193230781
    4854436454375113207098606390746417564121635042388002967808558670370387509410769821183765499
    2052043682558546422885024299633226853691246485500075591664024729240716450725319674499952944
    8434741902107729606820558130923626837987951966199798285525887161096136561780745661592488660
    8898164568541721362920846656279131478466791550965154310113538586208196875836883595577893914
    5453935681996098808540476590735897289898342504712891841626587896821853808795627903997862944
    9397605467534821256750121517082737107646270712467532102483678159400087505452543537


  12. Data has a different meaning under different interpretations


  13. Functions
    A bit of theory


  14. Functions
    x
    y
    y = f(x)
    f


  15. Functions
    D R
    x
    y
    f: D → R
    y = f(x)


  16. Functions
    function f(D $x): R {
        $y = do_something_with($x);
        return $y;
    }


  17. Functions
    float int
    3.84
    4
    round: float → int
    4 = round(3.84)


  18. Functions
    D R
    x1
    y
    f: D → R
    x2


  19. Functions
    float int
    3.84
    4
    round: float → int
    4.35
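
The many-to-one mapping on slides 17 and 19 can be sketched in PHP. One caveat: PHP's built-in round() returns a float, so this sketch wraps it in a hypothetical round_to_int() helper (not from the talk) to match the float → int signature on the slide.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: PHP's round() returns a float, so cast the result
// to int to get the float → int signature shown on the slide.
function round_to_int(float $x): int
{
    return (int) round($x);
}

// Two different inputs map to the same output, so the output alone
// cannot tell us which input it came from.
$a = round_to_int(3.84); // 4
$b = round_to_int(4.35); // 4
```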


  20. Functions
    D R
    x
    y1
    f: D → R
    y2


  21. Pure functions
    D R
    x
    y1
    f: D → R
    y2


  22. Impure functions
    • Randomness
    • State (global or local)
    • IO
    • Filesystem
    • Network
    • System clock
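
As a rough illustration of the "system clock" item, the sketch below contrasts a pure function with an impure wrapper around it. Both function names are made up for this example.

```php
<?php
declare(strict_types=1);

// Pure: the result depends only on the argument.
function seconds_since_midnight_at(DateTimeImmutable $moment): int
{
    return (int) $moment->format('G') * 3600
        + (int) $moment->format('i') * 60
        + (int) $moment->format('s');
}

// Impure: reads the system clock, so two calls with the same (empty)
// argument list can return different results.
function seconds_since_midnight(): int
{
    return seconds_since_midnight_at(new DateTimeImmutable('now'));
}
```

Passing the clock reading in as an argument keeps the calculation itself pure and pushes the impurity to the caller.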


  23. Functions
    D R
    x
    ?
    f: D → R


  24. Functions (in mathematics)
    D R
    x
    f: D → R


  25. Functions (in programming)
    D R
    x
    Exception!
    f: D → R


  26. Representation &
    Interpretation


  27. Level of abstraction
    float int
    round: abstraction


  28. Level of abstraction
    Money
    string
    serialize_money: abstraction


  29. Level of abstraction
    string
    byte[]
    encode_as_utf16: abstraction


  30. Representation
    • Translation of a high-level concept to a lower-level representation
    • All input values supported
    • Lossless (ideally)
    • Same data, just in a different form


  31. Level of abstraction
    byte[]
    string
    decode_from_utf16: abstraction


  32. Level of abstraction
    string
    Money
    parse_money: abstraction


  33. Interpretation
    • Translation of a low-level representation (back) to a higher-level concept
    • Usually not all input values supported
    • Same data, just in a different form
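
A minimal sketch of the two directions, using the serialize_money and parse_money names from the slides. The Money class, the "EUR 1000" string format and the choice to count in cents are assumptions made for this example, not something the talk specifies.

```php
<?php
declare(strict_types=1);

// Hypothetical minimal Money value object: an amount in cents plus a currency code.
final class Money
{
    public int $cents;
    public string $currency;

    public function __construct(int $cents, string $currency)
    {
        $this->cents = $cents;
        $this->currency = $currency;
    }
}

// Representation: defined for every Money value, and lossless.
function serialize_money(Money $money): string
{
    return $money->currency . ' ' . $money->cents;
}

// Interpretation: only some strings are valid, so this can fail.
function parse_money(string $text): Money
{
    if (preg_match('/^([A-Z]{3}) (-?\d+)\z/', $text, $m) !== 1) {
        throw new InvalidArgumentException("Not a serialized Money value: $text");
    }
    return new Money((int) $m[2], $m[1]);
}
```

Serializing and then parsing gives back an equal Money value, whereas parse_money("10") throws: not every string is a valid representation.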


  34. Why do we do this?


  35. (no text on this slide)

  36. (no text on this slide)

  37. Transmission of data
    Money
    string
    abstraction
    byte[]
    light pulses, magnetic grains, etc.


  38. Escaping
    A different view


  39. Escaping for SQL
    $username = $_POST['username'];
    $sql = "SELECT * FROM users WHERE username =
    '" . db_escape_string($username) . "'";
    $result = db_query($sql);


  40. Escaping for SQL
    // $_POST['username'] = "'; DROP TABLE users; --"
    $username = $_POST['username'];
    $sql = "SELECT * FROM users WHERE username =
    '" . db_escape_string($username) . "'";
    $result = db_query($sql);


  41. Escaping for SQL
    // $_POST['username'] = "John O'Shea"
    $username = $_POST['username'];
    $sql = "SELECT * FROM users WHERE username =
    '" . db_escape_string($username) . "'";
    $result = db_query($sql);


  42. Escaping for SQL
    // returns "John O'Shea"
    $username = get_username_using_api();
    $sql = "SELECT * FROM users WHERE username =
    '" . db_escape_string($username) . "'";
    $result = db_query($sql);


  43. Escaping for SQL
    // $username = "John O'Shea"
    $username = /* (whatever) */;
    $sql = "SELECT * FROM users WHERE username =
    '" . db_escape_string($username) . "'";
    $result = db_query($sql);


  44. String escaping is NOT a security measure


  45. String escaping is just properly representing data


  46. Escaping as representation
    string
    SQLFragment (?)
    db_escape_string:
    "O'Shea"
    "O\'Shea"


  47. Escaping as representation
    db_escape_string("John O'Shea");
    // -> "John O\'Shea"
    Better:
    as_sql_string_literal("John O'Shea");
    // -> "'John O\'Shea'"


  48. Escaping as representation
    string
    SQLStringLiteral
    as_sql_string_literal:
    "O'Shea"
    "'O\'Shea'"
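
One possible shape for as_sql_string_literal, sketched as a value object. Note the assumptions: the SQLStringLiteral class is hypothetical, and the quoting rule used here is the standard SQL one (double the single quote) for a database that does not treat backslashes specially, which differs from the backslash style on the slide. Real code would more likely rely on bound parameters or the database driver's own quoting.

```php
<?php
declare(strict_types=1);

// Hypothetical value object: a string that is already represented
// as an SQL string literal, including the surrounding quotes.
final class SQLStringLiteral
{
    private string $sql;

    private function __construct(string $sql)
    {
        $this->sql = $sql;
    }

    // ASSUMPTION: standard SQL quoting ('' for a quote inside the literal),
    // valid only for databases without backslash escapes in string literals.
    public static function fromString(string $value): self
    {
        return new self("'" . str_replace("'", "''", $value) . "'");
    }

    public function toSQL(): string
    {
        return $this->sql;
    }
}

function as_sql_string_literal(string $value): SQLStringLiteral
{
    return SQLStringLiteral::fromString($value);
}
```

Because the result has its own type, it cannot be mixed up with a plain string that still needs to be represented for SQL.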


  49. Chained escaping as representation
    $name = "John O'Shea";
    $js_name = as_js_string_literal($name);
    $js_script = "showGreeting(" . $js_name . ")";
    $html_tag = "<a onclick=\"" . htmlspecialchars($js_script) . "\">Show greeting</a>";
    $sql = "INSERT INTO html_examples VALUES("
    . as_sql_string_literal($html_tag) . ")";
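
The JavaScript and HTML steps of such a chain could look like the sketch below. The helper bodies are assumptions: as_js_string_literal is implemented with json_encode(), and as_html_attribute is a hypothetical helper built on htmlspecialchars(). The SQL step is left out to keep the example short.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: represent a string as a JavaScript string literal.
// The JSON_HEX_* flags also escape <, >, &, ' and " as \uXXXX sequences.
function as_js_string_literal(string $value): string
{
    return json_encode(
        $value,
        JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_THROW_ON_ERROR
    );
}

// Hypothetical helper: represent a string as a double-quoted HTML attribute value.
function as_html_attribute(string $value): string
{
    return '"' . htmlspecialchars($value, ENT_QUOTES | ENT_HTML5, 'UTF-8') . '"';
}

$name = "John O'Shea";

// Level 1: the name as a JavaScript string literal inside a script.
$js_script = 'showGreeting(' . as_js_string_literal($name) . ')';

// Level 2: the whole script as an HTML attribute value.
$html_tag = '<a href="#" onclick=' . as_html_attribute($js_script) . '>Show greeting</a>';
```

Each step represents the complete output of the previous step, so the browser can undo the layers in reverse order: HTML attribute first, then the JavaScript literal.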


  50. Premature representation
    foreach ($_REQUEST as $key => $value) {
    $_REQUEST[$key] = htmlentities(
    db_escape_string($value)
    );
    }
    // ‘Now the input data is safe’


  51. Premature representation
    foreach ($_REQUEST as $key => $value) {
    $_REQUEST[$key] = htmlentities(
    db_escape_string($value)
    );
    }
    // ‘Now the input data is safe’
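
A small sketch of why escaping at input time backfires. The sample input is made up; htmlspecialchars() is used with explicit flags so the result does not depend on the PHP version's defaults.

```php
<?php
declare(strict_types=1);

// What the user actually typed.
$input = 'Fish & Chips';

// Premature representation: escaping for HTML as soon as the data comes in.
$stored = htmlspecialchars($input, ENT_QUOTES, 'UTF-8'); // "Fish &amp; Chips"

// Every non-HTML use now sees different data than the user entered.
$length = strlen($stored); // 16 instead of 12

// A template that escapes at output time now escapes a second time.
$html = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'); // "Fish &amp;amp; Chips"
```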


  52. Premature representation
    foreach ($_REQUEST as $key => $value) {
    if (
    strpos($value, "DROP TABLE")
    !== false
    ) {
    die("SQL injection!");
    }
    }


  53. Premature representation
    foreach ($_REQUEST as $key => $value) {
    if (
    strpos($value, "DROP TABLE")
    !== false
    ) {
    die("SQL injection!");
    }
    }


  54. What goes wrong
    and how to prevent that


  55. Misrepresentation
    Money
    string
    serialize_money_ok:
    (€ 10)
    "€ 10"


  56. Misrepresentation
    Money
    string
    serialize_money_bad:
    (€ 10)
    "10"


  57. Misrepresentation
    Money
    string
    serialize_money_bad:
    (€ 10)
    "10"
    ($ 10)


  58. Misrepresentation
    Money
    string
    (€ 10)
    "10"
    ($ 10)
    Money
    ?
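
The collision on this slide can be reproduced in a few lines. The Money class here is a hypothetical stand-in with an amount and a currency code; serialize_money_bad follows the name on the slide.

```php
<?php
declare(strict_types=1);

// Hypothetical minimal Money value object.
final class Money
{
    public int $amount;
    public string $currency;

    public function __construct(int $amount, string $currency)
    {
        $this->amount = $amount;
        $this->currency = $currency;
    }
}

// Lossy representation: the currency is dropped.
function serialize_money_bad(Money $money): string
{
    return (string) $money->amount;
}

$euros = new Money(10, 'EUR');
$dollars = new Money(10, 'USD');

// Both values end up as "10", so no parser can tell them apart afterwards.
$same = serialize_money_bad($euros) === serialize_money_bad($dollars); // true
```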


  59. That doesn’t happen to me!
    DateTime
    string
    2022-03-14 06:28:00
    Europe/Amsterdam
    2022-03-14T06:28:00+01:00
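
A sketch of what gets lost here, using PHP's own DateTimeImmutable. The "+1 month" step is an addition for illustration: it crosses the start of daylight saving time in Amsterdam, which makes the missing time zone rules visible.

```php
<?php
declare(strict_types=1);

$original = new DateTimeImmutable(
    '2022-03-14 06:28:00',
    new DateTimeZone('Europe/Amsterdam')
);

// Represent as ISO 8601 with a UTC offset, then interpret that string again.
$serialized = $original->format(DATE_ATOM); // "2022-03-14T06:28:00+01:00"
$reparsed = new DateTimeImmutable($serialized);

// Same instant, but only the offset survived, not the time zone itself.
$zone_before = $original->getTimezone()->getName(); // "Europe/Amsterdam"
$zone_after = $reparsed->getTimezone()->getName();  // "+01:00"

// One month later the two values no longer agree on the UTC offset.
$later_original = $original->modify('+1 month')->format(DATE_ATOM); // ...+02:00
$later_reparsed = $reparsed->modify('+1 month')->format(DATE_ATOM); // ...+01:00
```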


  60. That doesn’t happen to me!
    fraction
    float
    1/3
    0.33333333333333


  61. That doesn’t happen to me!
    string
    byte[]
    ⛽☂️
    ?
    encode_as_iso8859_1
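
The same loss can be shown with the mbstring extension. This sketch assumes mbstring is installed and sets the substitute character explicitly so the outcome does not depend on configuration.

```php
<?php
declare(strict_types=1);

// Make the replacement for unrepresentable characters explicit: "?" (0x3F).
mb_substitute_character(0x3F);

// The fuel pump emoji (U+26FD) as a UTF-8 encoded PHP string.
$original = "\u{26FD}";

// ISO-8859-1 has no code for this character, so it is replaced.
$iso88591 = mb_convert_encoding($original, 'ISO-8859-1', 'UTF-8'); // "?"

// Interpreting the result again cannot bring the original back.
$back = mb_convert_encoding($iso88591, 'UTF-8', 'ISO-8859-1'); // "?"
```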


  62. Lossy representations cannot be re-interpreted


  63. Misinterpretation
    byte[]
    […]
    string
    "café"


  64. Misinterpretation
    byte[]
    […]
    string
    "café"
    utf8
    decode_from_utf8
    Exception!


  65. Misinterpretation
    byte[]
    […]
    string
    "café"
    iso8859-1
    utf8


  66. Misinterpretation
    byte[]
    […]
    string
    "café"
    iso8859-1
    utf8
    "café"
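
Both failure modes can be reproduced with mbstring, as a sketch. The byte strings are written with escapes so the example does not depend on the encoding of the source file.

```php
<?php
declare(strict_types=1);

$utf8_bytes = "caf\xC3\xA9";  // "café" represented as UTF-8
$iso88591_bytes = "caf\xE9";  // "café" represented as ISO-8859-1

// ISO-8859-1 bytes read as UTF-8: the byte sequence is invalid,
// so the interpretation fails in a detectable way.
$is_valid_utf8 = mb_check_encoding($iso88591_bytes, 'UTF-8'); // false

// UTF-8 bytes read as ISO-8859-1: every byte is valid in ISO-8859-1,
// so the interpretation "succeeds" with the wrong text: "cafÃ©".
$misread = mb_convert_encoding($utf8_bytes, 'UTF-8', 'ISO-8859-1');
```

The second case is the more dangerous one, because nothing signals that the interpretation was wrong.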


  67. Duck typing
    byte[]
    […]
    iso8859-1
    utf8


  68. Duck typing
    byte[]
    […]
    iso8859-1
    utf8
    HOW?


  69. The meaning of data comes from our interpretation of it…


  70. …which may very well be wrong


  71. Interpretation hints
    Store/transmit desired interpretation along with data
    Examples:
    • Music staves


  72. Interpretation hints
    Store/transmit desired interpretation along with data
    Examples:
    • HTTP headers
    • Content-Type
    • Content-Encoding
    • Content-Language
    • Unicode byte order mark


  73. Interpretation hints
    utf16 LE utf16 BE
    […]


  74. Interpretation hints
    utf16 LE
    with BOM
    utf16 BE
    with BOM
    […]
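
How a byte order mark works as an interpretation hint, as a minimal sketch. The decode_utf16_with_bom function is hypothetical and assumes the mbstring extension; it returns the text as a UTF-8 encoded PHP string.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: let the byte order mark decide how the
// remaining bytes are interpreted.
function decode_utf16_with_bom(string $bytes): string
{
    $bom = substr($bytes, 0, 2);
    $payload = substr($bytes, 2);

    if ($bom === "\xFF\xFE") {
        return mb_convert_encoding($payload, 'UTF-8', 'UTF-16LE');
    }
    if ($bom === "\xFE\xFF") {
        return mb_convert_encoding($payload, 'UTF-8', 'UTF-16BE');
    }
    throw new InvalidArgumentException('No UTF-16 byte order mark found');
}

// The same text in both byte orders, each prefixed with its BOM.
$little_endian = "\xFF\xFEh\x00i\x00";
$big_endian = "\xFE\xFF\x00h\x00i";
```

Both inputs decode to "hi". Without the hint, the receiver would have to guess the byte order.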


  75. Hints by variable naming
    Include intended interpretation in variable name:
    • $message_utf8
    • $username_sql
    • $title_html


  76. Hints by variable naming
    header("Content-Type: text/plain; charset=iso-8859-1");
    print $message_utf8;
    Charset mismatch between header and content


  77. Hints by variable naming
    $database->query("SELECT * FROM users
    WHERE username = " . $username);
    Username is not properly represented for SQL


  78. Hints by variable naming
    print "<h1>" . htmlentities($title_html) . "</h1>";
    Double escaping, title is already represented as HTML


  79. PHP’s string type
    A strange hybrid


  80. Role overloading of string
    string
    byte[]
    encode_as_utf16: abstraction


  81. PHP’s string type is actually a byte array
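
A quick way to see this, assuming the mbstring extension is available. The string is written with byte escapes so the result does not depend on how the source file is saved.

```php
<?php
declare(strict_types=1);

// "café" represented as UTF-8: the é takes two bytes.
$cafe = "caf\xC3\xA9";

// The core string functions count and cut bytes...
$byte_length = strlen($cafe);      // 5
$first_four = substr($cafe, 0, 4); // "caf\xC3", cut in the middle of the é

// ...while mbstring has to be told which interpretation to apply.
$char_length = mb_strlen($cafe, 'UTF-8');                     // 4
$still_valid = mb_check_encoding($first_four, 'UTF-8');       // false
```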


  82. Role overloading of string
    string
    = byte[]
    encode_as_utf16: abstraction


  83. Role overloading of string
    string
    = byte[]
    abstraction
    ‘real’ string
    use as if
    utf16


  84. Role overloading of string
    string
    = byte[]
    abstraction
    ‘real’ string
    use as if
    utf16
    use as if
    utf8


  85. More accurate types
    • A better type system makes misinterpretation more difficult
    • If our programming language does not provide the types, we have to do it ourselves
    • Value Objects!


  86. More accurate types
    class UTF8Bytes {
        private string $bytes;
        private function __construct(string $bytes) {
            $this->bytes = $bytes;
        }
        public static function fromPHPString(string $bytes): self {
            return new self($bytes);
        }
        public function toPHPString(): string {
            return $this->bytes;
        }
        // ...
    }
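
One possible refinement, not shown in the talk: since interpretation usually does not support all input values, the factory could reject byte sequences that are not valid UTF-8. The ValidatedUTF8Bytes name is hypothetical, and the check assumes the mbstring extension.

```php
<?php
declare(strict_types=1);

// Hypothetical variant of UTF8Bytes that refuses invalid input,
// so holding an instance proves the bytes are valid UTF-8.
final class ValidatedUTF8Bytes
{
    private string $bytes;

    private function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    public static function fromPHPString(string $bytes): self
    {
        if (!mb_check_encoding($bytes, 'UTF-8')) {
            throw new InvalidArgumentException('Byte sequence is not valid UTF-8');
        }
        return new self($bytes);
    }

    public function toPHPString(): string
    {
        return $this->bytes;
    }
}
```

With this variant a mislabelled ISO-8859-1 file often fails at the boundary instead of producing garbled text later, although input that happens to be valid in both encodings still slips through.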


  87. More accurate types
    class ISO88591Bytes {
        private string $bytes;
        private function __construct(string $bytes) {
            $this->bytes = $bytes;
        }
        public static function fromPHPString(string $bytes): self {
            return new self($bytes);
        }
        public function toPHPString(): string {
            return $this->bytes;
        }
        // ...
    }


  88. More accurate types
    class RealString {
        private string $value_utf16;
        private function __construct(string $value_utf16) {
            $this->value_utf16 = $value_utf16;
        }
        public static function fromUTF8(UTF8Bytes $utf8_bytes): self {
            $value_utf16 = mb_convert_encoding(
                $utf8_bytes->toPHPString(), 'UTF-16', 'UTF-8');
            return new self($value_utf16);
        }
        public static function fromISO88591(ISO88591Bytes $iso88591_bytes): self {
            // ...
        }
        // ...
    }


  89. More accurate types
    class RealString {
        // ...
        public function toUTF8(): UTF8Bytes {
            $value_utf8 = mb_convert_encoding(
                $this->value_utf16, 'UTF-8', 'UTF-16');
            return UTF8Bytes::fromPHPString($value_utf8);
        }
        public function toISO88591(): ISO88591Bytes {
            $value_iso88591 = mb_convert_encoding(
                $this->value_utf16, 'ISO-8859-1', 'UTF-16');
            return ISO88591Bytes::fromPHPString($value_iso88591);
        }
        // ...
    }


  90. More accurate types
    class RealString {
        // ...
        public function concat(RealString $other): RealString {
            return new RealString($this->value_utf16 .
                $other->value_utf16);
        }
        public function substring(int $start, int $length): RealString {
            $substr_utf16 = mb_substr($this->value_utf16,
                $start, $length, 'UTF-16');
            return new RealString($substr_utf16);
        }
        // ...
    }


  91. More accurate types (usage)
    $string1_utf8 = UTF8Bytes::fromPHPString(
        file_get_contents("file1.txt"));
    $string1 = RealString::fromUTF8($string1_utf8);
    $string2_iso88591 = ISO88591Bytes::fromPHPString(
        file_get_contents("file2.txt"));
    $string2 = RealString::fromISO88591($string2_iso88591);
    // The original charsets do not matter anymore now...
    $new_string = $string1->concat($string2)->substring(7, 42);
    header("Content-Type: text/plain; charset=utf-8");
    print $new_string->toUTF8()->toPHPString();


  92. Without value objects
    iso8859-1
    utf8
    string
    = byte[]
    abstraction


  93. With value objects
    abstraction
    UTF8Bytes ISO88591Bytes
    RealString


  94. Value objects
    • Achieve higher levels of abstraction
    • Avoid misinterpretation
    • No silver bullet!
    • Comes with additional overhead


  95. Recap
    • Data
    • Functions
    • Interpretation and representation
    • String escaping
    • Misrepresentation and misinterpretation
    • The string type
    • Value objects


  96. The meaning of data is defined by how we interpret it


  97. Questions
    @arnoutboks
    @arnoutboks
    @aboks
    Arnout Boks
    Please leave your feedback on joind.in:
    https://joind.in/talk/109ec
    We’re hiring!


  98. Image Credits
    • https://pixabay.com/en/reading-relaxation-glasses-sight-3088491/
    • https://fr.m.wikipedia.org/wiki/Fichier:DARPA_Big_Data.jpg
    • https://www.flickr.com/photos/elefevre/3936916711
    • https://www.flickr.com/photos/perspective/9045532603
    • https://www.flickr.com/photos/wwarby/11644168395/
    • https://www.flickr.com/photos/opengridscheduler/16480450157
    • https://www.flickr.com/photos/cogdog/14401469262
    • https://www.flickr.com/photos/paulsimpson1976/3998279762
    • https://www.flickr.com/photos/pewari/3499963407
