Making the Impossible Impossible: Improving Reliability by Preventing Classes of Problems

This talk was given at SREcon EMEA 22, in Amsterdam: https://www.usenix.org/conference/srecon22emea/presentation/sinjakli

---

Service Level Objectives (SLOs) are a familiar topic in SRE circles. They provide a framework for measuring and thinking about the reliability of a service in terms of a percentage of successful operations, such as HTTP requests.

That key strength of SLOs - viewing reliability as a percentage game - can also be a weakness. Within that framing, there are certain solutions we're likely to overlook.

This talk explores another lens for reliability - one that's complementary to SLOs: structuring software in a way that rules out entire classes of problem.

We'll explore this idea via three worked examples, and finish with some concrete take-aways, including how to spot problems that fit this shape.

Chris Sinjakli

October 26, 2022

Transcript

  1. Making the Impossible Impossible: Improving Reliability by Preventing Classes of Problems @ChrisSinjo

  2. Hi

  3. Hi Greetings

  4. @ChrisSinjo

  5. @ChrisSinjo

  6. Infra Engineer

  7. None
  8. Making the Impossible Impossible: Improving Reliability by Preventing Classes of Problems @ChrisSinjo

  9. We are at SREcon

  10. We likely share:
      - Job titles
      - Skills
      - Ways of thinking

  11. Common ground/ "Best practices"

  12. Some ideas have outsized impact

  13. In SRE: SLOs (Service Level Objectives)

  14. A refresher: Measuring the performance of a service as a

    percentage of successful operations
  15. Example: HTTP requests

      (Successful requests / Total requests) x 100 ≥ 99.9%
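
As a rough illustration of the ratio on the slide above, the check fits in a few lines of Python. The function names and the choice to treat zero traffic as fully available are assumptions made for this sketch, not something from the talk.

```python
def availability_percent(successful: int, total: int) -> float:
    """Percentage of successful operations (assumption: no traffic counts as 100%)."""
    if total == 0:
        return 100.0
    return successful / total * 100


def meets_slo(successful: int, total: int, target_percent: float = 99.9) -> bool:
    """True when the measured percentage is at or above the SLO target."""
    return availability_percent(successful, total) >= target_percent


if __name__ == "__main__":
    # 500 failures in a million requests stays inside a 99.9% target.
    print(meets_slo(999_500, 1_000_000))
    # 2,000 failures in a million requests does not.
    print(meets_slo(998_000, 1_000_000))
```
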
  16. So why am I here today?

  17. None
  18. The perils of success

  19. The way we measure shapes The way we think

  20. The way we think shapes The solutions we explore

  21. SLOs encourage percentage thinking

  22. Instances go unhealthy ‑ Add health checks & route traffic away

  23. Instances go unhealthy ‑ Add health checks & route traffic away

  24. Regional network issues ‑ Serve from multiple regions

  25. Regional network issues ‑ Serve from multiple regions

  26. Rare slow requests ‑ Add timeouts to protect majority of traffic

  27. Rare slow requests ‑ Add timeouts to protect majority of traffic

  28. Example: HTTP requests

      (Successful requests / Total requests) x 100 ≥ 99.9%

  29. Reliability is a percentage game

  30. We can stack the odds in our favour

  31. Not all solutions look the same

  32. Not all solutions are about percentages

  33. Some solutions prevent problems entirely

  34. Today's talk:
      - Another lens for reliability
      - Examples in the wild
      - How to spot problems of this shape

  35. Today's talk:
      - Another lens for reliability
      - Examples in the wild
      - How to spot problems of this shape

  36. Today's talk:
      - Another lens for reliability
      - Examples in the wild
      - How to spot problems of this shape

  37. This is not:
      - An attack on SLOs
      - One-size-fits-all solution
      - Possible if you can't edit software

  38. This is not:
      - An attack on SLOs
      - One-size-fits-all solution
      - Possible if you can't edit software

  39. This is not:
      - An attack on SLOs
      - One-size-fits-all solution
      - Possible if you can't edit software

  40. Examples:
      - State machines
      - Type systems & memory safety
      - Database migrations

  41. Examples:
      - State machines
      - Memory safety
      - Database migrations

  42. Examples:
      - State machines
      - Memory safety
      - Database migrations

  43. Example 1 State machines

  44. None
  45. Collect from customer ‑ Pay out to merchant

  46. Collect from customer ‑ Pay out to merchant

  47. Payment 💸

  48. Payment 💸 Created Submitted Collected Paid out Failed

  49. Simple model (id, description, state): (1, Laptop, submitted), (2, Phone, collected), (3, Unused domain renewal, collected)

  50. Simple model (id, description, state): (1, Laptop, submitted), (2, Phone, collected), (3, Unused domain renewal, collected)

  51. Simple model (id, description, state): (1, Laptop, collected), (2, Phone, collected), (3, Unused domain renewal, collected)

  52. Simple model (id, description, state): (1, Laptop, paid_out), (2, Phone, collected), (3, Unused domain renewal, collected)

  53. Simple model (id, description, state): (1, Laptop, submitted), (2, Phone, collected), (3, Unused domain renewal, collected)

  54. Simple model (id, description, state): (1, Laptop, failed), (2, Phone, collected), (3, Unused domain renewal, collected)

  55. None
  56. Submitted ➡ Failed Collected ➡ Failed?

  57. Submitted ➡ Failed Collected ➡ Failed?

  58. Submitted ➡ Failed Paid out ➡ Failed?

  59. We want some restrictions

  60. State restriction pseudocode

          class Payment
            def fail()
              state = "failed"

  61. State restriction pseudocode

          class Payment
            def fail()
              if state == "submitted"
                state = "failed"
              else
                raise "Cannot fail from state: #{state}"

  62. State restriction pseudocode

          class Payment
            def submit()
              if state == "created"
                state = "submitted"
              else
                raise "Cannot submit from state: #{state}"

  63. Payment 💸 Created Submitted Collected Paid out Failed

  64. Payment 💸 Created Submitted Collected Payout submitted Paid out Failed

  65. State restriction pseudocode

          class Payment
            def fail()
              if state in ["submitted", "payout_submitted"]
                state = "failed"
              else
                raise "Cannot fail from state: #{state}"

  66. An ad-hoc mess

  67. Bugs 📈 Maintenance 📈

  68. Computer Science has an answer

  69. We can use a state machine

  70. State machine: - A set of states - A set

    of allowed transitions between those states
  71. State machine pseudocode

          class Payment
            states(["created", "submitted", ...])

            allow_transition("created", "submitted")
            allow_transition("submitted", "collected")
            allow_transition("submitted", "failed")
            ...

  72. Created Collected Paid out Failed Submitted

  73. Created Collected Paid out Failed Submitted

  74. State machine pseudocode

          class Payment
            states(["created", "submitted", ...])

            allow_transition("created", "submitted")
            allow_transition("submitted", "collected")
            allow_transition("submitted", "failed")
            ...

  75. Error: cannot transition from "paid out" to "failed"

  76. State machine pseudocode

          class Payment
            states(["created", "submitted", ...])

            allow_transition("created", "submitted")
            allow_transition("submitted", "collected")
            allow_transition("submitted", "failed")
            ...

  77. State machine pseudocode

          class Payment
            states(["created", "submitted", ...])

            allow_transition("created", "submitted")
            allow_transition("submitted", "collected")
            allow_transition("submitted", "failed")
            allow_transition("failed", "submitted")
            ...

  78. Created Collected Paid out Failed Submitted

  79. Often dismissed: "Too academic"

  80. https://github.com/gocardless/statesman

  81. Make the problem impossible
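
A minimal Python sketch of the state machine idea from Example 1. The transition table follows the pseudocode on the slides; the `collected` to `paid_out` transition is assumed from the payment diagram, and the names `Payment`, `transition_to` and `InvalidTransition` are hypothetical. This is not the API of the Statesman library linked above.

```python
class InvalidTransition(Exception):
    """Raised when a transition is not in the allowed set."""


# One central table of allowed transitions instead of ad-hoc checks per method.
ALLOWED_TRANSITIONS = {
    "created": {"submitted"},
    "submitted": {"collected", "failed"},
    "collected": {"paid_out"},   # assumption: taken from the payment diagram
    "failed": {"submitted"},     # retry, as added on slide 77
    "paid_out": set(),           # terminal state: nothing may follow
}


class Payment:
    def __init__(self) -> None:
        self.state = "created"

    def transition_to(self, new_state: str) -> None:
        # Any transition that is not explicitly listed is rejected.
        if new_state not in ALLOWED_TRANSITIONS.get(self.state, set()):
            raise InvalidTransition(
                f'cannot transition from "{self.state}" to "{new_state}"'
            )
        self.state = new_state


if __name__ == "__main__":
    payment = Payment()
    for state in ("submitted", "collected", "paid_out"):
        payment.transition_to(state)
    try:
        payment.transition_to("failed")
    except InvalidTransition as error:
        print(f"Error: {error}")
```

Because every state change goes through `transition_to`, adding a state such as `payout_submitted` means editing one table rather than hunting for every method that checks `state`.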

  82. Example 2 Memory safety

  83. Not here to sell you Rust

  84. Something we often take for granted

  85. But first, some C

  86. Memory allocation in C

          char *ptr = malloc(SIZE);
          do_stuff(ptr);
          free(ptr);

  87. Use-after-free in C

          char *ptr = malloc(SIZE);
          do_stuff(ptr);
          free(ptr);
          // Many lines more code
          do_other_stuff(ptr);

  88. Undefined behaviour (You don't know what your program will do)

  89. Undefined behaviour (An attacker might be able to abuse it)

  90. A non-scientific study: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=use+after+free+2022

  91. A non-scientific study: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-41849

  92. You don't know which one will be serious

  93. The assertion that we can simply code better is nonsense

  94. Something we often take for granted

  95. Garbage collected languages

  96. Garbage collection pseudocode

          def main()
            name = "Chris"
            greet(name)

          def greet(name)
            puts("Hello #{name}")

  97. Garbage collection pseudocode (annotation: Falls out of scope)

          def main()
            name = "Chris"
            greet(name)

          def greet(name)
            puts("Hello #{name}")
  98. The computer does it for you

  99. Garbage collection is outrageously successful

  100. Java Go Ruby Python JavaScript C# Haskell Lisp PHP Erlang
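
To see "the computer does it for you" in a real garbage-collected language, here is a small Python sketch. The `Name` wrapper class is an assumption added only because plain `str` objects cannot be weakly referenced; the weak reference is a probe for observing that the object is reclaimed once nothing refers to it.

```python
import gc
import weakref


class Name:
    """Tiny wrapper so the value can be observed through a weak reference."""

    def __init__(self, value: str) -> None:
        self.value = value


def greet(name: Name) -> str:
    return f"Hello {name.value}"


def main() -> weakref.ref:
    name = Name("Chris")
    greet(name)
    # Return only a weak reference: it does not keep the object alive.
    return weakref.ref(name)


if __name__ == "__main__":
    probe = main()   # `name` fell out of scope when main() returned
    gc.collect()     # ask for a collection rather than rely on timing
    print(probe() is None)
```

There is no `free` call anywhere in the program, so there is no opportunity to use the value after freeing it.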

  101. But what about...

  102. You don't always want a runtime

  103. None
  104. None
  105. Stuck with manual memory management

  106. Until...

  107. None
  108. None
  109. None
  110. None
  111. Okay so hear me out

  112. Ownership & borrow-checking

  113. Tl;dr: Every value in memory has at most one owner

  114. Garbage collection pseudocode

           def main()
             name = "Chris"
             greet(name)

           def greet(name)
             puts("Hello #{name}")

  115. Rust greetings

           fn main() {
               let name = String::from("Chris");
               greet(name);
           }

           fn greet(name: String) {
               println!("Hello {}", name);
           }

  116. Rust greetings (annotation: Owner transferred)

           fn main() {
               let name = String::from("Chris");
               greet(name);
           }

           fn greet(name: String) {
               println!("Hello {}", name);
           }

  117. Rust greetings (annotations: Owner transferred, Falls out of scope)

           fn main() {
               let name = String::from("Chris");
               greet(name);
           }

           fn greet(name: String) {
               println!("Hello {}", name);
           }

  118. Owner out-of-scope ‑ Value dropped

  119. Rust greetings (annotation: Compiler error)

           fn main() {
               let name = String::from("Chris");
               greet(name);
               say_goodbye(name);
           }

           fn greet(name: String) {
               println!("Hello {}", name);
           }

  120. Rust greetings (annotation: Borrow)

           fn main() {
               let name = String::from("Chris");
               greet(&name);
               say_goodbye(name);
           }

           fn greet(name: &String) {
               println!("Hello {}", name);
           }

  121. No manual memory management

  122. The computer does it for you

  123. No GC

  124. None
  125. Make the problem impossible

  126. Example 3 Database migrations

  127. MySQL (but also true in Postgres)

  128. -- Create a table
       CREATE TABLE payments (
         id int NOT NULL,
         ...
       );

       -- Realise `int` isn't large enough (2^32)
       -- You're going to run out of IDs
       ALTER TABLE payments MODIFY id bigint;

  129. -- Create a table
       CREATE TABLE payments (
         id int NOT NULL,
         ...
       );

       -- Realise `int` isn't large enough (2^32)
       -- You're going to run out of IDs
       ALTER TABLE payments MODIFY id bigint;

  130. -- Create a table
       CREATE TABLE payments (
         id int NOT NULL,
         ...
       );

       -- Realise `int` isn't large enough (2^32)
       -- You're going to run out of IDs
       ALTER TABLE payments MODIFY id bigint;  -- Blocks all other queries

  131. 🕵 The migrations reviewer

  132. Add a new column or Recreate the table

  133. None
  134. 🕵 The migrations reviewer

  135. 😰 The migrations reviewer

  136. 🕵 🕵 🕵 The migrations reviewers

  137. 😰 😰 😰 The migrations reviewers

  138. It doesn't scale

  139. and it's still not enough

  140. Seemingly innocuous: ALTER TABLE payments ADD COLUMN refunded boolean;

  141. But can still be dangerous

  142. -- Slow transaction
       START TRANSACTION;
       SELECT * FROM payments;

       -- Forces this to queue
       ALTER TABLE payments ADD COLUMN refunded boolean;

       -- Which blocks these
       SELECT * FROM payments WHERE id = 123;

  143. -- Slow transaction
       START TRANSACTION;
       SELECT * FROM payments;

       -- Forces this to queue
       ALTER TABLE payments ADD COLUMN refunded boolean;

       -- Which blocks these
       SELECT * FROM payments WHERE id = 123;

  144. -- Slow transaction
       START TRANSACTION;
       SELECT * FROM payments;

       -- Forces this to queue
       ALTER TABLE payments ADD COLUMN refunded boolean;

       -- Which blocks these
       SELECT * FROM payments WHERE id = 123;
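
The queueing effect in the SQL above can be modelled with a toy lock queue in Python. This is a deliberately simplified sketch, not how MySQL or Postgres implement locking: it assumes only two lock modes (`shared` for reads, `exclusive` for schema changes), never releases a lock, and makes a new request wait behind any earlier request it conflicts with, whether that request is held or still queued. All names are hypothetical.

```python
def compatible(mode_a: str, mode_b: str) -> bool:
    """Only two shared locks can coexist; anything involving exclusive conflicts."""
    return mode_a == "shared" and mode_b == "shared"


def simulate(arrivals):
    """Process lock requests in arrival order and return (held, waiting) names.

    arrivals: list of (name, mode) tuples, mode is "shared" or "exclusive".
    """
    held = []
    waiting = []
    for name, mode in arrivals:
        blocked_by_holder = any(not compatible(mode, m) for _, m in held)
        # Queue fairness: a request may not jump ahead of a conflicting waiter.
        blocked_by_waiter = any(not compatible(mode, m) for _, m in waiting)
        if blocked_by_holder or blocked_by_waiter:
            waiting.append((name, mode))
        else:
            held.append((name, mode))
    return [n for n, _ in held], [n for n, _ in waiting]


if __name__ == "__main__":
    held, waiting = simulate([
        ("slow_select", "shared"),      # long-running transaction
        ("alter_table", "exclusive"),   # queues behind the slow transaction
        ("point_select", "shared"),     # queues behind the ALTER TABLE
    ])
    print("held:", held)
    print("waiting:", waiting)
```

In this model the quick `point_select` would have run alongside `slow_select` on its own; it only stalls because the schema change is queued in between.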
  145. None
  146. None
  147. None
  148. Tl;dr:
       - MySQL-compatible
       - Scalability (sharding)
       - High-availability

  149. Tl;dr:
       - MySQL-compatible
       - Scalability (sharding)
       - High-availability

  150. Tl;dr:
       - MySQL-compatible
       - Scalability (sharding)
       - High-availability

  151. None
  152. None
  153. VReplication A stream of changes

  154. Delete Insert Update

  155. ALTER TABLE payments MODIFY id bigint;

  156. ALTER TABLE payments MODIFY id bigint;
       id (int), description: (1, Laptop), (2, Phone)

  157. ALTER TABLE payments MODIFY id bigint;
       id (int), description: (1, Laptop), (2, Phone)
       id (bigint), description: (no rows yet)

  158. ALTER TABLE payments MODIFY id bigint;
       id (int), description: (1, Laptop), (2, Phone)
       id (bigint), description: (1, Laptop)

  159. ALTER TABLE payments MODIFY id bigint;
       id (int), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)
       id (bigint), description: (1, Laptop)

  160. ALTER TABLE payments MODIFY id bigint;
       id (int), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)
       id (bigint), description: (1, Laptop), (2, Phone)

  161. ALTER TABLE payments MODIFY id bigint;
       id (int), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)
       id (bigint), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)

  162. ALTER TABLE payments MODIFY id bigint;
       id (int), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)
       id (bigint), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)

  163. ALTER TABLE payments MODIFY id bigint;
       id (int), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)
       id (bigint), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)
       User queries (via proxy)

  164. ALTER TABLE payments MODIFY id bigint;
       id (int), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)
       id (bigint), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)
       User queries (via proxy)

  165. ALTER TABLE payments MODIFY id bigint;
       id (int), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)
       id (bigint), description: (1, Laptop), (2, Phone), (3, Unused domain renewal)
       User queries (via proxy)
  166. Fully-online schema migrations
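
The copy-then-replay flow in the preceding slides can be sketched in Python. This is a toy model under stated assumptions, not the real VReplication implementation: tables are in-memory dictionaries, the change stream is a plain list of `(operation, id, description)` tuples, and `Table`, `apply_change` and `migrate_online` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    id_type: str
    rows: dict = field(default_factory=dict)  # id -> description


def apply_change(table: Table, change) -> None:
    """Apply one (operation, id, description) event to a table."""
    operation, row_id, description = change
    if operation == "delete":
        table.rows.pop(row_id, None)
    else:  # "insert" and "update" both store the latest value
        table.rows[row_id] = description


def migrate_online(old: Table, new_id_type: str, changes_during_copy) -> Table:
    """Build a new table with the wider id type without blocking the old one."""
    shadow = Table(id_type=new_id_type)
    snapshot = dict(old.rows)                 # view of the data when the copy starts
    for change in changes_during_copy:        # writes keep landing on the old table
        apply_change(old, change)
    for row_id, description in snapshot.items():
        shadow.rows[row_id] = description     # copy existing rows across
    for change in changes_during_copy:        # replay the stream onto the new table
        apply_change(shadow, change)
    return shadow                             # cut-over: the proxy routes queries here


if __name__ == "__main__":
    payments = Table("int", {1: "Laptop", 2: "Phone"})
    stream = [("insert", 3, "Unused domain renewal")]
    migrated = migrate_online(payments, "bigint", stream)
    print(migrated.id_type, migrated.rows == payments.rows)
```

The old table stays readable and writable for the whole copy, and the only step that touches user traffic is the final switch to the new table.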

  167. 😰 😰 😰 The migrations reviewers

  168. People doing their actual job 😀 😀 😀

  169. Make the problem impossible

  170. Examples

  171. None
  172. Takeaways:
       - Complementary technique
       - You have to write software
       - It's not easy to spot
  173. SLOs are alive and well

  174. Percentage solutions are too

  175. Percentage solutions

  176. A complementary technique

  177. None
  178. https://gocardless.com/blog/fear-free-postgresql-migrations-for-rails/

  179. Takeaways:
       - Complementary technique
       - You have to write software
       - It's not easy to spot
  180. No code changes

  181. This is not one of them

  182. Sometimes BIG Sometimes small

  183. Not everyone can build a database

  184. https://github.com/gocardless/statesman

  185. Maybe someone already solved it

  186. Takeaways:
       - Complementary technique
       - You have to write software
       - It's not easy to spot
       - But there are some tells

  187. Takeaways:
       - Complementary technique
       - You have to write software
       - It's not easy to spot
       - But there are some tells
  188. 🕵 The migrations reviewer

  189. None
  190. 🙄 Smug internet comments

  191. None
  192. 🙄 Smug internet comments

  193. Examples:
       - State machines
       - Memory safety
       - Database migrations
       Add more unit tests / Write better C / Just hire

  194. Smug comments:
       - State machines
       - Memory safety
       - Database migrations
       Write better C / Just hire

  195. Smug comments:
       - State machines
       - Memory safety
       - Database migrations
       Add more unit tests / Write better C / Just hire

  196. Smug comments:
       - State machines
       - Memory safety
       - Database migrations
       Add more unit tests / Write better C / Just hire

  197. Smug comments:
       - State machines: Add more unit tests
       - Memory safety: Write better C
       - Database migrations: Just hire a DBA
  198. There's probably more to it

  199. The assertion that we can simply code better is nonsense

  200. We can do better

  201. Thank you ✌❤ @ChrisSinjo @planetscaledata

  202. Image credits
       • Poker Winnings - slgckgc - CC-BY - https://www.flickr.com/photos/slgc/42157896194/
       • Thinking Face - Twemoji - CC-BY - https://github.com/twitter/twemoji
       • Ferris (Extra-cute) - Unofficial Rust mascot - Copyright waived - https://rustacean.net/
       • A350 Board - Mark Turnauckas - CC-BY - https://www.flickr.com/photos/marktee/17118767669/
       • Play - Annie Roi - CC-BY - https://www.flickr.com/photos/annieroi/4421442720/

  203. Image credits
       • White jigsaw puzzle with missing piece - Marco Verch Professional Photographer - CC-BY - https://www.flickr.com/photos/[email protected]/50605134766/
       • Hedge maze - claumoho - CC-BY - https://flickr.com/photos/claudiah/3929921991/
       • photo_1405_20060410 - Robo Android - CC-BY - https://www.flickr.com/photos/[email protected]/6798304070/
       • Gears - Mustang Joe - Public Domain - https://www.flickr.com/photos/mustangjoe/20437315996/
  204. Questions? ✌❤ @ChrisSinjo @planetscaledata