
Making the Impossible Impossible: Improving Reliability by Preventing Classes of Problems


This talk was given at SREcon EMEA 22, in Amsterdam: https://www.usenix.org/conference/srecon22emea/presentation/sinjakli

---

Service Level Objectives (SLOs) are a familiar topic in SRE circles. They provide a framework for measuring and thinking about the reliability of a service in terms of a percentage of successful operations, such as HTTP requests.

That key strength of SLOs - viewing reliability as a percentage game - can also be a weakness. Within that framing, there are certain solutions we're likely to overlook.

This talk explores another lens for reliability - one that's complementary to SLOs: structuring software in a way that rules out entire classes of problem.

We'll explore this idea via three worked examples, and finish with some concrete take-aways, including how to spot problems that fit this shape.

Chris Sinjakli

October 26, 2022

Transcript

  1. Making the Impossible Impossible
    Improving Reliability by Preventing Classes of Problems
    @ChrisSinjo


  2. Hi


  3. Hi


    Greetings


  4. @ChrisSinjo


  5. @ChrisSinjo


  6. Infra Engineer


  7. View Slide

  8. Making the Impossible Impossible
    Improving Reliability by Preventing Classes of Problems
    @ChrisSinjo


  9. We are at


    SREcon


  10. We likely share:


    - Job titles

    - Skills


    - Ways of thinking


  11. Common ground/


    "Best practices"


  12. Some ideas have
    outsized


    impact


  13. In SRE: SLOs


    (Service Level Objectives)


  14. A refresher:


    Measuring the performance
    of a service as a percentage
    of successful operations


  15. Example: HTTP requests
    (Successful requests / Total requests) x 100 ≥ 99.9%
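
As a worked instance of the formula on this slide, here is a minimal sketch. The function name `availability_percent` and the request counts are made up for illustration; only the 99.9% threshold comes from the slide.

```python
def availability_percent(successful: int, total: int) -> float:
    """Return successful operations as a percentage of all operations."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total * 100


SLO_TARGET = 99.9  # the threshold shown on the slide

# Hypothetical traffic: 1,000,000 requests, 800 of which failed.
observed = availability_percent(999_200, 1_000_000)
meets_slo = observed >= SLO_TARGET
print(f"{observed:.2f}% (meets SLO: {meets_slo})")
```

With these assumed numbers the service sits just above the target; 2,000 failures out of the same million requests would put it below.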


  16. So why am I here
    today?


  17. View Slide

  18. The perils of success


  19. The way we measure


    shapes


    The way we think


  20. The way we think


    shapes


    The solutions we explore


  21. SLOs encourage


    percentage


    thinking


  22. Instances go unhealthy





    Add health checks &


    route traffic away


  23. Instances go unhealthy





    Add health checks &


    route traffic away


  24. Regional network issues





    Serve from multiple
    regions


  25. Regional network issues





    Serve from multiple
    regions


  26. Rare slow requests





    Add timeouts to protect
    majority of traffic


  27. Rare slow requests





    Add timeouts to protect
    majority of traffic
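
The timeout idea on this slide can be sketched as below. This is an illustration under assumptions, not something the talk prescribes: `call_with_timeout` and `handler` are hypothetical names, and using a thread pool to bound the wait is just one possible approach.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s: float, *args):
    """Run fn(*args) but stop waiting after timeout_s seconds.

    Returns (True, result) on success and (False, None) on timeout.
    The worker thread is not killed; we only stop waiting for it.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return True, future.result(timeout=timeout_s)
    except FutureTimeout:
        return False, None
    finally:
        # Don't block the caller on the slow call finishing.
        pool.shutdown(wait=False)


def handler(delay_s: float) -> str:
    """Hypothetical request handler that takes delay_s seconds."""
    time.sleep(delay_s)
    return "ok"


fast = call_with_timeout(handler, 0.5, 0.01)   # finishes well within the limit
slow = call_with_timeout(handler, 0.05, 0.3)   # exceeds the limit
print(fast, slow)
```

This is a percentage-style fix in the talk's sense: the rare slow request still fails, but it no longer ties up the caller.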


  28. Example: HTTP requests
    (Successful requests / Total requests) x 100 ≥ 99.9%


  29. Reliability is a


    percentage


    game


  30. We can


    stack the odds


    in our favour


  31. Not all solutions

    look the

    same


  32. Not all solutions

    are about

    percentages


  33. Some solutions
    prevent problems
    entirely


  34. Today's talk:


    - Another lens for reliability


    - Examples in the wild

    - How to spot problems of
    this shape


  35. Today's talk:


    - Another lens for reliability


    - Examples in the wild

    - How to spot problems of
    this shape


  36. Today's talk:


    - Another lens for reliability


    - Examples in the wild

    - How to spot problems of
    this shape


  37. This is not:


    - An attack on SLOs

    - One-size-fits-all solution


    - Possible if you can't edit
    software


  38. This is not:


    - An attack on SLOs

    - One-size-fits-all solution


    - Possible if you can't edit
    software


  39. This is not:


    - An attack on SLOs

    - One-size-fits-all solution


    - Possible if you can't edit
    software


  40. Examples:


    - State machines


    - Type systems & memory
    safety

    - Database migrations


  41. Examples:


    - State machines


    - Memory safety

    - Database migrations


  42. Examples:


    - State machines


    - Memory safety

    - Database migrations


  43. Example 1


    State


    machines


  44. View Slide

  45. Collect from customer





    Pay out to merchant


  46. Collect from customer





    Pay out to merchant


  47. Payment


    💸


  48. Payment


    💸
    Created


    Submitted


    Collected


    Paid out


    Failed


  49. Simple model
    id  description            state
    1   Laptop                 submitted
    2   Phone                  collected
    3   Unused domain renewal  collected


  50. Simple model
    id  description            state
    1   Laptop                 submitted
    2   Phone                  collected
    3   Unused domain renewal  collected


  51. Simple model
    id  description            state
    1   Laptop                 collected
    2   Phone                  collected
    3   Unused domain renewal  collected


  52. Simple model
    id  description            state
    1   Laptop                 paid_out
    2   Phone                  collected
    3   Unused domain renewal  collected


  53. Simple model
    id  description            state
    1   Laptop                 submitted
    2   Phone                  collected
    3   Unused domain renewal  collected


  54. Simple model
    id  description            state
    1   Laptop                 failed
    2   Phone                  collected
    3   Unused domain renewal  collected


  55. View Slide

  56. Submitted ➡ Failed


    Collected ➡ Failed?


  57. Submitted ➡ Failed


    Collected ➡ Failed?


  58. Submitted ➡ Failed


    Paid out ➡ Failed?


  59. We want some


    restrictions


  60. State restriction pseudocode
    class Payment
      def fail()
        state = "failed"


  61. State restriction pseudocode
    class Payment
      def fail()
        if state == "submitted"
          state = "failed"
        else
          raise "Cannot fail from state: #{state}"


  62. State restriction pseudocode
    class Payment
      def submit()
        if state == "created"
          state = "submitted"
        else
          raise "Cannot submit from state: #{state}"


  63. Payment


    💸
    Created


    Submitted


    Collected


    Paid out


    Failed


  64. Payment


    💸
    Created


    Submitted


    Collected


    Payout submitted


    Paid out


    Failed


  65. State restriction pseudocode
    class Payment
      def fail()
        if state in ["submitted", "payout_submitted"]
          state = "failed"
        else
          raise "Cannot fail from state: #{state}"


  66. An


    ad-hoc


    mess


  67. Bugs 📈


    Maintenance 📈


  68. Computer
    Science has an
    answer


  69. We can use a


    state machine


  70. State machine:


    - A set of states


    - A set of allowed transitions
    between those states



  71. State machine pseudocode
    class Payment
      states(["created", "submitted", ...])

      allow_transition("created", "submitted")
      allow_transition("submitted", "collected")
      allow_transition("submitted", "failed")
      ...
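
The pseudocode on this slide could be made concrete along the following lines. This is a minimal sketch, not the Statesman API: the names `InvalidTransition`, `STATES`, `ALLOWED` and `transition_to` are hypothetical, and the `collected` to `paid_out` transition is an assumption taken from the earlier payment-flow slides rather than from the list shown here.

```python
class InvalidTransition(Exception):
    """Raised when a transition is not in the allowed set."""


class Payment:
    # States from the payment lifecycle slide.
    STATES = {"created", "submitted", "collected", "paid_out", "failed"}

    # The first three pairs are the ones listed on the slide; the last
    # is an assumption based on the earlier "Collected -> Paid out" step.
    ALLOWED = {
        ("created", "submitted"),
        ("submitted", "collected"),
        ("submitted", "failed"),
        ("collected", "paid_out"),
    }

    def __init__(self) -> None:
        self.state = "created"

    def transition_to(self, new_state: str) -> None:
        """Move to new_state, or raise if the transition is not allowed."""
        if new_state not in self.STATES:
            raise ValueError(f"unknown state: {new_state}")
        if (self.state, new_state) not in self.ALLOWED:
            raise InvalidTransition(
                f"cannot transition from {self.state!r} to {new_state!r}"
            )
        self.state = new_state


payment = Payment()
for step in ("submitted", "collected", "paid_out"):
    payment.transition_to(step)

try:
    payment.transition_to("failed")  # not in ALLOWED, so this raises
except InvalidTransition as err:
    error_message = str(err)
    print(error_message)
```

The point of the design is that every rule lives in one table. Adding a state or a retry path means editing `ALLOWED`, not hunting through per-method `if` checks.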


  72. Created Collected Paid out
    Failed
    Submitted


  73. Created Collected Paid out
    Failed
    Submitted


  74. State machine pseudocode
    class Payment
      states(["created", "submitted", ...])

      allow_transition("created", "submitted")
      allow_transition("submitted", "collected")
      allow_transition("submitted", "failed")
      ...


  75. Error: cannot transition from
    "paid out" to "failed"


  76. State machine pseudocode
    class Payment
      states(["created", "submitted", ...])

      allow_transition("created", "submitted")
      allow_transition("submitted", "collected")
      allow_transition("submitted", "failed")
      ...


  77. State machine pseudocode
    class Payment
      states(["created", "submitted", ...])

      allow_transition("created", "submitted")
      allow_transition("submitted", "collected")
      allow_transition("submitted", "failed")
      allow_transition("failed", "submitted")
      ...


  78. Created Collected Paid out
    Failed
    Submitted


  79. Often
    dismissed:

    "Too academic"


  80. https://github.com/gocardless/statesman


  81. Make the problem


    impossible


  82. Example 2


    Memory


    safety


  83. Not here to sell
    you


    Rust


  84. Something we


    often


    take for granted


  85. But first,


    some C


  86. Memory allocation in C
    char *ptr = malloc(SIZE);
    do_stuff(ptr);
    free(ptr);


  87. Use-after-free in C
    char *ptr = malloc(SIZE);
    do_stuff(ptr);
    free(ptr);

    // Many lines more code
    do_other_stuff(ptr);


  88. Undefined behaviour


    (You don't know what your program will do)


  89. Undefined behaviour


    (An attacker might be able to abuse it)


  90. https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=use+after+free+2022
    A non-scientific study


  91. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-41849
    A non-scientific study


  92. You don't know
    which one will
    be serious


  93. The assertion that
    we can simply code
    better is nonsense


  94. Something we


    often


    take for granted


  95. Garbage


    collected


    languages


  96. Garbage collection pseudocode
    def main()
      name = "Chris"
      greet(name)

    def greet(name)
      puts("Hello #{name}")


  97. Garbage collection pseudocode
    def main()
      name = "Chris"
      greet(name)

    def greet(name)
      puts("Hello #{name}")

    Falls out of scope
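
To see "falls out of scope" happen in a real garbage-collected language, here is a small sketch in Python. The `Name` wrapper class and the `probe` variable are illustrative assumptions (plain strings cannot be weakly referenced). The immediate reclamation shown relies on CPython's reference counting, and other runtimes may reclaim the value later.

```python
import gc
import weakref


class Name:
    """Small wrapper so we can hold a weak reference to the value."""

    def __init__(self, text: str) -> None:
        self.text = text


def greet(name: Name) -> str:
    return f"Hello {name.text}"


def main():
    name = Name("Chris")
    probe = weakref.ref(name)  # observes the object without keeping it alive
    greeting = greet(name)
    return greeting, probe  # `name` falls out of scope when main returns


greeting, probe = main()
gc.collect()  # CPython has already freed it via refcounting; this is belt and braces
reclaimed = probe() is None
print(greeting, reclaimed)
```

Nothing in `main` or `greet` frees the value by hand, which is exactly why a use-after-free cannot be written here.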


  98. The computer


    does it


    for you


  99. Garbage collection


    is outrageously
    successful


  100. Java


    Go


    Ruby


    Python


    JavaScript


    C#


    Haskell


    Lisp


    PHP


    Erlang


  101. But what
    about...


  102. You don't


    always want a
    runtime


  103. View Slide

  104. View Slide

  105. Stuck with


    manual memory


    management


  106. Until...


  107. View Slide

  108. View Slide

  109. View Slide

  110. View Slide

  111. Okay so


    hear me out


  112. Ownership &


    borrow-checking


  113. Tl;dr:


    Every value in memory
    has at most one owner


  114. Garbage collection pseudocode
    def main()
      name = "Chris"
      greet(name)

    def greet(name)
      puts("Hello #{name}")


  115. Rust greetings
    fn main() {
        let name = String::from("Chris");
        greet(name);
    }

    fn greet(name: String) {
        println!("Hello {}", name);
    }


  116. Rust greetings
    fn main() {
        let name = String::from("Chris");
        greet(name);
    }

    fn greet(name: String) {
        println!("Hello {}", name);
    }

    Owner transferred


  117. Rust greetings
    fn main() {
        let name = String::from("Chris");
        greet(name);
    }

    fn greet(name: String) {
        println!("Hello {}", name);
    }

    Falls out of scope
    Owner transferred


  118. Owner out-of-scope





    Value dropped


  119. Rust greetings
    fn main() {
        let name = String::from("Chris");
        greet(name);
        say_goodbye(name);
    }

    fn greet(name: String) {
        println!("Hello {}", name);
    }

    Compiler error


  120. Rust greetings
    fn main() {
        let name = String::from("Chris");
        greet(&name);
        say_goodbye(name);
    }

    fn greet(name: &String) {
        println!("Hello {}", name);
    }

    Borrow


  121. No


    manual memory


    management


  122. The computer


    does it


    for you


  123. No


    GC


  124. View Slide

  125. Make the problem


    impossible


  126. Example 3


    Database


    migrations


  127. MySQL


    (but also true in Postgres)


  128. -- Create a table
    CREATE TABLE payments (
      id int NOT NULL,
      ...
    );

    -- Realise `int` isn't large enough (2^32)
    -- You're going to run out of IDs
    ALTER TABLE payments MODIFY id bigint;
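
For a sense of scale behind "you're going to run out of IDs", here is a short arithmetic sketch. The column limits are MySQL's documented ranges for 4-byte `INT` and 8-byte `BIGINT`; the workload figures and the helper `days_until_exhausted` are invented for illustration.

```python
# Upper bounds for MySQL integer column types.
INT_SIGNED_MAX = 2**31 - 1       # INT
INT_UNSIGNED_MAX = 2**32 - 1     # INT UNSIGNED
BIGINT_SIGNED_MAX = 2**63 - 1    # BIGINT


def days_until_exhausted(current_max_id: int, ids_per_day: int, limit: int) -> int:
    """Whole days of inserts left before ids pass `limit`."""
    return (limit - current_max_id) // ids_per_day


# Hypothetical workload: 1.5 billion rows so far, 2 million new rows per day.
remaining_int = days_until_exhausted(1_500_000_000, 2_000_000, INT_SIGNED_MAX)
remaining_bigint = days_until_exhausted(1_500_000_000, 2_000_000, BIGINT_SIGNED_MAX)
print(remaining_int, remaining_bigint)
```

Under these assumed numbers a signed `int` column has under a year left, while `bigint` pushes the limit out by trillions of days.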


  129. -- Create a table
    CREATE TABLE payments (
      id int NOT NULL,
      ...
    );

    -- Realise `int` isn't large enough (2^32)
    -- You're going to run out of IDs
    ALTER TABLE payments MODIFY id bigint;


  130. -- Create a table
    CREATE TABLE payments (
      id int NOT NULL,
      ...
    );

    -- Realise `int` isn't large enough (2^32)
    -- You're going to run out of IDs
    ALTER TABLE payments MODIFY id bigint;

    Blocks all other queries


  131. 🕵
    The migrations


    reviewer



  132. Add a new column


    or


    Recreate the table

    View Slide

  133. View Slide

  134. 🕵
    The migrations


    reviewer



  135. 😰
    The migrations


    reviewer



  136. 🕵 🕵 🕵
    The migrations


    reviewers



  137. 😰 😰 😰
    The migrations


    reviewers



  138. It doesn't


    scale


  139. and it's still


    not enough


  140. Seemingly innocuous


    ALTER TABLE payments ADD COLUMN refunded boolean;


  141. But can


    still


    be dangerous


  142. -- Slow transaction
    START TRANSACTION;
    SELECT * FROM payments;

    -- Forces this to queue
    ALTER TABLE payments ADD COLUMN refunded boolean;

    -- Which blocks these
    SELECT * FROM payments WHERE id = 123;


  143. -- Slow transaction
    START TRANSACTION;
    SELECT * FROM payments;

    -- Forces this to queue
    ALTER TABLE payments ADD COLUMN refunded boolean;

    -- Which blocks these
    SELECT * FROM payments WHERE id = 123;


  144. -- Slow transaction
    START TRANSACTION;
    SELECT * FROM payments;

    -- Forces this to queue
    ALTER TABLE payments ADD COLUMN refunded boolean;

    -- Which blocks these
    SELECT * FROM payments WHERE id = 123;


  145. View Slide

  146. View Slide

  147. View Slide

  148. Tl;dr:


    - MySQL-compatible


    - Scalability (sharding)

    - High-availability



  149. Tl;dr:


    - MySQL-compatible


    - Scalability (sharding)

    - High-availability



  150. Tl;dr:


    - MySQL-compatible


    - Scalability (sharding)

    - High-availability



  151. View Slide

  152. View Slide

  153. VReplication


    A stream of changes


  154. Delete
    Insert
    Update


  155. ALTER TABLE payments MODIFY id bigint;


  156. ALTER TABLE payments MODIFY id bigint;
    id (int) description
    1 Laptop
    2 Phone


  157. ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone

    id (bigint)  description


  158. ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone

    id (bigint)  description
    1            Laptop


  159. ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

    id (bigint)  description
    1            Laptop


  160. ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

    id (bigint)  description
    1            Laptop
    2            Phone


  161. ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

    id (bigint)  description
    1            Laptop
    2            Phone
    3            Unused domain renewal


  162. ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

    id (bigint)  description
    1            Laptop
    2            Phone
    3            Unused domain renewal


  163. ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

    id (bigint)  description
    1            Laptop
    2            Phone
    3            Unused domain renewal

    User queries (via proxy)


  164. ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

    id (bigint)  description
    1            Laptop
    2            Phone
    3            Unused domain renewal

    User queries (via proxy)


  165. ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

    id (bigint)  description
    1            Laptop
    2            Phone
    3            Unused domain renewal

    User queries (via proxy)


  166. Fully-online


    schema


    migrations
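
The preceding slides (copy rows into a new table, keep it current from a stream of inserts, updates and deletes, then switch queries over) can be modelled with a toy in-memory simulation. This is only a sketch of the idea and not how Vitess or VReplication is implemented; `ToyOnlineMigration` and all of its methods are hypothetical names, and plain dictionaries stand in for tables.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
Change = Tuple[str, int, Row]  # (operation, id, row data)


class ToyOnlineMigration:
    """Copy a table into a shadow table while replaying ongoing changes."""

    def __init__(self, source: Dict[int, Row]) -> None:
        self.source = source
        self.shadow: Dict[int, Row] = {}
        self.change_log: List[Change] = []
        self._pending_ids = sorted(source)

    def write(self, op: str, row_id: int, row: Optional[Row] = None) -> None:
        """Application write: hits the live table and is logged for replay."""
        if op == "delete":
            self.source.pop(row_id, None)
        else:  # "insert" or "update"
            self.source[row_id] = dict(row or {})
        self.change_log.append((op, row_id, dict(row or {})))

    def copy_batch(self, size: int) -> bool:
        """Copy up to `size` existing rows; return True while rows remain."""
        batch = self._pending_ids[:size]
        self._pending_ids = self._pending_ids[size:]
        for row_id in batch:
            if row_id in self.source:  # may have been deleted meanwhile
                self.shadow[row_id] = dict(self.source[row_id])
        return bool(self._pending_ids)

    def replay_changes(self) -> None:
        """Apply logged changes, in order, to the shadow table."""
        for op, row_id, row in self.change_log:
            if op == "delete":
                self.shadow.pop(row_id, None)
            else:
                self.shadow[row_id] = row
        self.change_log.clear()

    def cut_over(self) -> Dict[int, Row]:
        """Catch up one last time, then hand back the new table."""
        self.replay_changes()
        return self.shadow


payments = {1: {"description": "Laptop"}, 2: {"description": "Phone"}}
migration = ToyOnlineMigration(payments)
migration.copy_batch(1)  # copies row 1
migration.write("insert", 3, {"description": "Unused domain renewal"})
migration.copy_batch(1)  # copies row 2
new_table = migration.cut_over()  # replays the insert of row 3
print(new_table == payments)
```

The live table keeps serving reads and writes throughout, which is the property that removes the blocking `ALTER TABLE` from the picture.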


  167. 😰 😰 😰
    The migrations


    reviewers



  168. People doing


    their actual job


    😀 😀 😀


  169. Make the problem


    impossible


  170. Examples


  171. View Slide

  172. Take aways:


    - Complementary technique


    - You have to write software

    - It's not easy to spot



  173. SLOs


    are alive


    and well


  174. Percentage


    solutions


    are too


  175. Percentage
    solutions


  176. A


    complementary


    technique


  177. View Slide

  178. https://gocardless.com/blog/fear-free-postgresql-migrations-for-rails/


  179. Take aways:


    - Complementary technique


    - You have to write software

    - It's not easy to spot



  180. No code
    changes


  181. This is


    not


    one of them


  182. Sometimes
    BIG


    Sometimes small


  183. Not everyone
    can build a


    database


  184. https://github.com/gocardless/statesman


  185. Maybe
    someone
    already solved it


  186. Take aways:


    - Complementary technique


    - You have to write software

    - It's not easy to spot


    - But there are some tells


  187. Take aways:


    - Complementary technique


    - You have to write software

    - It's not easy to spot


    - But there are some tells


  188. 🕵
    The migrations


    reviewer



  189. View Slide

  190. 🙄
    Smug internet

    comments



  191. View Slide

  192. 🙄
    Smug internet

    comments



  193. Examples:


    - State machines


    - Memory safety

    - Database migrations

    Add more
    unit tests
    Write
    better C
    Just hire



  194. Smug comments:


    - State machines


    - Memory safety

    - Database migrations

    Write
    better C
    Just hire



  195. Smug comments:


    - State machines


    - Memory safety

    - Database migrations

    Add more
    unit tests
    Write
    better C
    Just hire



  196. Smug comments:


    - State machines


    - Memory safety

    - Database migrations

    Add more
    unit tests
    Write
    better C
    Just hire



  197. Smug comments:
    - State machines: "Add more unit tests"
    - Memory safety: "Write better C"
    - Database migrations: "Just hire a DBA"


  198. There's
    probably more
    to it


  199. The assertion that
    we can simply code
    better is nonsense


  200. We


    can


    do better


  201. Thank you
    ✌❤
    @ChrisSinjo


    @planetscaledata


  202. Image credits
    • Poker Winnings - slgckgc - CC-BY - https://www.flickr.com/photos/slgc/42157896194/
    • Thinking Face - Twemoji - CC-BY - https://github.com/twitter/twemoji
    • Ferris (Extra-cute) - Unofficial Rust mascot - Copyright waived - https://rustacean.net/
    • A350 Board - Mark Turnauckas - CC-BY - https://www.flickr.com/photos/marktee/17118767669/
    • Play - Annie Roi - CC-BY - https://www.flickr.com/photos/annieroi/4421442720/


  203. Image credits
    • White jigsaw puzzle with missing piece - Marco Verch Professional Photographer - CC-BY - https://www.flickr.com/photos/30478819@N08/50605134766/
    • Hedge maze - claumoho - CC-BY - https://flickr.com/photos/claudiah/3929921991/
    • photo_1405_20060410 - Robo Android - CC-BY - https://www.flickr.com/photos/49140926@N07/6798304070/
    • Gears - Mustang Joe - Public Domain - https://www.flickr.com/photos/mustangjoe/20437315996/


  204. Questions?
    ✌❤
    @ChrisSinjo


    @planetscaledata
