When to give up, and how?

Programs with constrained states are susceptible to ending up in contradictory situations. The most likely cause is a bug. What should a program do when it finds itself in a state where it cannot function correctly? Continue doing the wrong thing? Crash? What if the situation can be saved? Let's look at some situations and the options we have.

Björn Fahller

September 24, 2025

Transcript

  1. When to give up, and how? NDC{TechTown} 2025 © Björn Fahller @[email protected] 6-10/88

    Terminology

    Bug: The program does not behave correctly due to flaws in the encoded logic.

    Disappointment: The program is subjected to something undesired outside of its logic, e.g. received malformed packet, disk full, out of memory.

    Error handling: Logic that deals with disappointments in a useful way.

        if (::close(fd) < 0 && errno == EINVAL) { ???? }

    This presentation is about bugs.
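
The distinction between a disappointment and a bug can be sketched in code. This is a minimal illustration, not something from the slides: the `Packet` type, the wire format (first byte is the payload length) and both function names are assumptions made up for the example. A malformed packet comes from outside the program's logic, so it is reported to the caller as an ordinary error. An out-of-range index can only come from a flaw in the calling code, so it is checked with `assert()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical packet type. Assumed wire format: byte 0 holds the payload
// length, followed by that many payload bytes.
struct Packet {
    std::vector<std::uint8_t> payload;
};

// A malformed packet is a disappointment: it originates outside the program's
// logic, so it is reported to the caller and handled as a normal error.
std::optional<Packet> parse_packet(const std::vector<std::uint8_t>& bytes) {
    if (bytes.empty()) {
        return std::nullopt;  // no length byte at all
    }
    const std::size_t length = bytes[0];
    if (bytes.size() - 1 < length) {
        return std::nullopt;  // truncated payload
    }
    Packet packet;
    packet.payload.assign(bytes.begin() + 1, bytes.begin() + 1 + length);
    return packet;
}

// An out-of-range index is a bug in the caller: the encoded logic is flawed,
// so it is asserted rather than reported as an error.
std::uint8_t payload_byte(const Packet& packet, std::size_t index) {
    assert(index < packet.payload.size() && "caller bug: index out of range");
    return packet.payload[index];
}
```

A caller is expected to test the result of `parse_packet()` before use, while a failing assertion in `payload_byte()` points at code that needs fixing.
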
  6. Bugs

    How can a program know that it has a bug?

    A function can see that it has been called with invalid arguments
    A function can see that it has been called in an illegal state
    A caller can see a return/exception that is invalid
    A function can see that data is inconsistent or invalid

    But in order to see these, you need to define what legal/valid/consistent means
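
One way to define what legal and consistent mean is to write the definitions down as named predicates. The sketch below assumes a hypothetical `BoundedQueue` class; the class and its member names are made up for the example and do not come from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical bounded FIFO queue, used only to show explicit definitions
// of "legal state" and "consistent data".
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Consistent data: never more elements than the capacity.
    bool is_consistent() const { return items_.size() <= capacity_; }

    // Legal state for push(): there must be room for one more element.
    bool can_push() const { return items_.size() < capacity_; }

    // Legal state for pop(): there must be an element to remove.
    bool can_pop() const { return !items_.empty(); }

    void push(int value) {
        assert(can_push());       // called in an illegal state?
        items_.push_back(value);
        assert(is_consistent());  // is the data still consistent?
    }

    int pop() {
        assert(can_pop());        // called in an illegal state?
        const int value = items_.front();
        items_.pop_front();
        return value;
    }

private:
    std::size_t capacity_;
    std::deque<int> items_;
};
```

With the predicates public, callers can check `can_push()` or `can_pop()` first, and the same definitions are used by the checks inside the class.
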
  12. Bugs

    These counter bugs hide the real issues, making them much less likely to be found and fixed
  13. Bugs

    You have two goals
    • Cause as little damage as possible
    • Gather information so that you can fix the bug

    These are sometimes in conflict
  16. Bugs

    How about assert()

    assert() does one of:
    • nothing
    • calls abort

    When assert() is triggered, it usually does so with:
    • A message
    • Often a stack trace
    • If you’re lucky a core dump

    If you get the info from a triggered assert, you often have good chances of hunting down the bug

    Many companies have their own “assert” macros with finer control
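
As a rough idea of what such an in-house macro can look like, here is a minimal sketch. Every name in it (`MY_CHECK`, `ViolationInfo`, `violation_handler`) is hypothetical and not taken from any particular code base. It assumes two forms of finer control: the check stays active regardless of `NDEBUG`, and the reaction to a failed check is a replaceable handler that by default prints a message and calls `abort()`.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical description of a failed check.
struct ViolationInfo {
    const char* expression;
    const char* file;
    int line;
};

using ViolationHandler = void (*)(const ViolationInfo&);

// Default reaction: print where the check failed, then terminate.
inline void default_violation_handler(const ViolationInfo& info) {
    std::fprintf(stderr, "%s:%d: check failed: %s\n",
                 info.file, info.line, info.expression);
    std::abort();
}

// Program-wide handler; assign to the returned reference to replace it.
inline ViolationHandler& violation_handler() {
    static ViolationHandler handler = default_violation_handler;
    return handler;
}

// Unlike assert(), this check is not compiled out when NDEBUG is defined.
#define MY_CHECK(cond)                                                       \
    do {                                                                     \
        if (!(cond)) {                                                       \
            violation_handler()(ViolationInfo{#cond, __FILE__, __LINE__});   \
        }                                                                    \
    } while (false)
```

A replacement handler could, for example, write to the product's own log before terminating. A handler that returns lets execution continue past the failed check, which is a policy decision each project has to make deliberately.
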
  20. Bugs

    Contracts?

    Much better visibility of expectations
    Program wide - unclear what implementations will support regarding custom violation handlers
  21. Story time

    <How a policy of “always keep running” caused many crashes, and how adding deliberate termination points helped fix it./>
  22. You need info!

    If you’re going to terminate:
    • Gather as much information as possible, so you can learn how to fix the underlying bug
    • Terminate as early as possible, before the state has been contaminated by erroneous work
  23. Can we do better?

    Architecture for robustness
    Case study – Ericsson AXE telephone exchange
  24. AXE

    Digital phone exchange
    First deployment in 1976
    Can serve a city
  25. AXE Central Processor

    Handles billing, monitoring, and subscriber services like call forwarding and wakeup calls. Typically stores 10-15 words of data per subscriber.

    Two CPUs for redundancy
    Each continuously monitored for faults
    Run in parallel, cycle by cycle, and continuously monitored for differences by the MAU
    When a fault is detected, the faulty CPU is shut down, and an alarm is raised
  31. AXE Regional Processors

    Can connect local phone calls even if contact with the central processor is down
    Two CPUs for redundancy and damage containment
    Each continuously monitored for faults
    When a fault is detected in one CPU, its work is moved to the other, and a restart is attempted
  36. AXE

    Ongoing phone calls work without any CPUs running. CPUs being down may prevent making new phone calls.

    The most local stage, capable of handling 30 subscribers, can handle phone calls within that group in total isolation. There are no CPUs involved.
  39. AXE

    The autonomy of parts makes the design highly fault tolerant
    All fault detection is aimed at HW malfunction
    SW was bug free?
  41. What can we learn from this?

    • Fault tolerance is architectural
    • Divide your system into parts that can work in isolation and autonomously
    • Allow parts to fail to save the whole
    • Information about failures must reach the developers

    If you can, add a mechanism that “calls home”, or gives the user a way to send info back to you
  46. What if the problem is local?

    If you are certain that you found the inconsistency early enough, fail the offending part to save the rest
    – Fail the transaction
    – Drop the video stream
    – Abort the download

    How sure are you?

    These things are extremely difficult to test
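
Failing one part to save the rest can be sketched for the transaction case. This is an illustration under stated assumptions, not code from the talk: the `Transaction` and `Report` types, the consistency rule (line items must add up to the recorded total) and `process_all()` are all hypothetical. It also assumes that an inconsistent transaction cannot have contaminated any of the others, which is exactly the part that is hard to be sure about.

```cpp
#include <numeric>
#include <string>
#include <vector>

// Hypothetical transaction type, used only for illustration.
struct Transaction {
    std::string id;
    std::vector<long> line_items;
    long recorded_total;
};

// Assumed definition of "consistent": the line items add up to the total.
bool is_consistent(const Transaction& t) {
    return std::accumulate(t.line_items.begin(), t.line_items.end(), 0L)
           == t.recorded_total;
}

struct Report {
    std::vector<std::string> committed;
    std::vector<std::string> failed;  // ids to pass on to the developers
};

// Processes every transaction in isolation. An inconsistent transaction is
// failed and recorded; the remaining transactions are still processed.
Report process_all(const std::vector<Transaction>& batch) {
    Report report;
    for (const Transaction& t : batch) {
        if (!is_consistent(t)) {
            report.failed.push_back(t.id);  // fail only the offending part
            continue;
        }
        report.committed.push_back(t.id);   // stand-in for the real commit
    }
    return report;
}
```

The `failed` list matters as much as the isolation: it is the information that has to reach the developers if the underlying bug is ever to be fixed.
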
  52. Oh, and one neat trick!

        if (part_of_functionality.is_inconsistent()) {
            if (fork() == 0) {
                log(MAJOR, "Remove {} due to inconsistent state",
                    part_of_functionality.identity());
                abort();
            }
            part_of_functionality.force_kill();
        }

    fork(): On a unix-like system, this creates a child process that is an exact copy of the parent.
    fork() == 0: Return value of 0 means we’re the child process.
    abort(): Forces termination, and saves a snapshot of the process memory as a ‘core’ file.
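
A self-contained variant of the trick might look like the sketch below. The function name `dump_core_and_continue()` is hypothetical, and two things are added that the slide does not show: a check for `fork()` failing, and a `waitpid()` call so the parent reaps the child instead of leaving a zombie. Whether a core file is actually written depends on system settings such as the core size limit, so the function only reports whether the child was terminated by `SIGABRT`.

```cpp
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks a child that aborts immediately, so that a snapshot of the current
// process state may be saved as a core file while the parent keeps running.
// Returns true if the child was created and was terminated by SIGABRT.
bool dump_core_and_continue() {
    const pid_t child = fork();
    if (child < 0) {
        return false;  // fork failed: no snapshot, but the parent carries on
    }
    if (child == 0) {
        std::abort();  // child: terminate at once, possibly leaving a core file
    }
    int status = 0;
    if (waitpid(child, &status, 0) != child) {  // parent: reap the child
        return false;
    }
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}
```

Waiting for the child blocks the parent until the child has terminated, which can take a while for a process with a lot of memory. The slide's version does not wait, which keeps the parent responsive at the cost of an unreaped child.
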
  56. Oh, and one neat trick!

    The ‘core’ file can be read in a debugger, as if you had stopped on a breakpoint. It shows you the current state, but it doesn’t show how you got there.

    To improve your chances of fixing bugs, include a simple RAM log in the objects. It can be a std::vector<std::string>. Reading the log in the debugger shows you how you got there.
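
A RAM log of this kind can be very small. The sketch below uses a `std::vector<std::string>` as suggested above. The `RamLog` and `Session` classes are hypothetical, and the size limit is an added assumption so that a long-lived object cannot grow its log without bound.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory log kept inside an object. It is bounded: when the
// log is full, the oldest entry is dropped to make room for the new one.
class RamLog {
public:
    explicit RamLog(std::size_t max_entries) : max_entries_(max_entries) {}

    void add(std::string message) {
        if (max_entries_ == 0) {
            return;  // logging disabled
        }
        if (entries_.size() == max_entries_) {
            entries_.erase(entries_.begin());  // drop the oldest entry
        }
        entries_.push_back(std::move(message));
    }

    const std::vector<std::string>& entries() const { return entries_; }

private:
    std::size_t max_entries_;
    std::vector<std::string> entries_;
};

// Hypothetical object that records how it reached its current state.
class Session {
public:
    void open()  { history_.add("open");  open_ = true; }
    void close() { history_.add("close"); open_ = false; }

    bool is_open() const { return open_; }
    const RamLog& history() const { return history_; }

private:
    bool open_ = false;
    RamLog history_{64};
};
```

Because the log is an ordinary member, it ends up in the core file together with the rest of the object and can be inspected in the debugger like any other field.
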
  57. Summary

    • No generic right answers
      – Sacrifice execution to save data?
      – Continue running doing the wrong thing?
      – Revert to “safe mode”?
    • Resist the temptation to sweep bugs under the rug
    • If you must crash, crash with actionable information
    • Make sure there’s a way for that information to reach the developer
    • Resilience is architectural. You can only do so much as an individual developer
    • Every bit of cleverness you add to keep the system running is added complexity that makes the system more prone to bugs
  63. When to give up, and how?

    Björn Fahller
    [email protected]
    @rollbear
    @[email protected]
    @rollbear.bsky.social
    https://speakerdeck.com/rollbear/when-to-give-up