The Safety First App

Talk on The Safety First App from DDD Melbourne 2026.

Renaldi Gondosubroto

February 22, 2026

Transcript

  1. The Safety-First App

    About Me: Renaldi Gondosubroto, Author and Senior Software Engineer

    • Currently hold 20 certifications for Microsoft Azure, and am an AWS Community Builder and Microsoft Certified Trainer
    • Instructor and Author (wrote the Azure AI Engineer Associate Study Guide)
    • International speaker at 50+ events and conferences
    • Organizer of the Melbourne Python meetup
    • Enjoy all things AWS, open-source, generative AI, and Virtual Reality

    @Renaldig @renaldigondosubroto @renaldig @therenaldigram
  2. Agenda

    • Why "safety-first" is a product capability
    • The 12 recovery patterns, with examples and trade-offs
    • System-aware design through retries, receipts, degradation, and rescue mode
    • Incident-ready UX with customer-facing runbooks and comms
    • Worksheet to audit one feature in a sprint
    • Q&A
  3. WHY IT MATTERS: Most apps are built for the sunny day

    But customers live in the messy middle.

    • Users mistype, misclick, and change their mind mid-flow
    • Networks wobble, services time out, and retries happen
    • Safety patterns turn panic into progress
    • The goal: fewer tickets, fewer pages, calmer users
  4. REALITY CHECK: “Rainy day” failure modes are normal

    Design for these as first-class states.

    • Double-clicks and repeated taps under pressure
    • Refresh, back button, tab close, and session expiry
    • Latency spikes, partial responses, 429s, and 5xxs
    • Mobile backgrounding and offline transitions
  5. OUTCOME: A shared language for safety

    Product, design, and engineering can align on “recovery”.

    • Reversible actions instead of irreversible mistakes
    • Durable intent that survives retries and refresh
    • Explainable failures with an obvious next step
    • Safe resumption from a checkpoint with one click
  6. MAP: The 12 patterns

    A field guide for turning failures into recoveries.

    User-facing recovery
    • 01 Undo everywhere
    • 02 Autosave + drafts
    • 03 Guard destructive actions
    • 04 Resilient forms
    • 05 Explainable errors
    • 06 One-click recovery links

    System-aware resilience
    • 07 Degraded states
    • 08 Idempotent actions
    • 09 Long-running receipts
    • 10 Outbox + offline queue
    • 11 Rescue mode
    • 12 Customer runbooks
  7. PATTERN 01: Undo everywhere

    Make accidental actions cheap to fix.

    • Default to “do + undo” instead of blocking confirmations
    • Keep undo available long enough to cover distraction and latency
    • Make undo server-backed when the action has side effects
    • Instrument undo rate to find confusing UI
    • Undo in AI tools is a great example

    EXAMPLE: Gmail “Undo Send”, Slack message delete, Trello card move.

    SKETCH:
      const intentId = uuid();
      await api.removeItem({ id, intentId });
      toast.undo('Removed', () => api.restoreItem({ id, intentId }));

    Trade-off: requires storage or compensating actions.
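To make the "server-backed undo" bullet concrete, here is a minimal sketch of what the server side of `removeItem` / `restoreItem` might look like. Everything beyond those two names is an assumption for illustration: the in-memory `createItemStore`, the `UNDO_WINDOW_MS` value, and passing `now` explicitly so the window is easy to reason about.

```javascript
// Hypothetical in-memory stand-in for a server that supports undoable removal.
const UNDO_WINDOW_MS = 10_000; // assumed window; tune to your latency and UX

function createItemStore(items) {
  const live = new Map(items.map((item) => [item.id, item]));
  const removed = new Map(); // intentId -> { item, removedAt }

  return {
    has: (id) => live.has(id),
    // Soft-remove: keep the item around so the removal can be reversed.
    removeItem({ id, intentId }, now) {
      const item = live.get(id);
      if (!item) return false;
      live.delete(id);
      removed.set(intentId, { item, removedAt: now });
      return true;
    },
    // Restore only succeeds inside the undo window.
    restoreItem({ intentId }, now) {
      const entry = removed.get(intentId);
      if (!entry || now - entry.removedAt > UNDO_WINDOW_MS) return false;
      live.set(entry.item.id, entry.item);
      removed.delete(intentId);
      return true;
    },
  };
}

const store = createItemStore([{ id: 'a', name: 'Milk' }]);
store.removeItem({ id: 'a', intentId: 'intent-1' }, 0);
const undone = store.restoreItem({ intentId: 'intent-1' }, 5_000);
console.log(undone, store.has('a')); // true true
```

Keying the removal by intent rather than by item means a late or repeated undo request refers to one specific removal, which is the storage cost the trade-off line mentions.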
  8. PATTERN 02: Autosave + draft states

    Treat “save” as infrastructure.

    • Persist drafts on every meaningful change
    • Expose Draft, Saving, and Saved as explicit UI states
    • Resume drafts after refresh, navigation, or auth expiry
    • Design for conflicts; last-write-wins is rarely humane

    EXAMPLE: Google Docs, Notion, Figma drafts and autosave.

    SKETCH:
      // Create the debounced saver once so rapid edits collapse into one save.
      const saveDraft = debounce((data) => {
        setStatus('saving');
        return api.saveDraft({ id, data })
          .then(() => setStatus('saved'));
      }, 800);

      // On every meaningful change:
      setStatus('draft');
      saveDraft(data);

    Trade-off: conflicts need a policy users can understand.
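One possible alternative to last-write-wins is a version check on save. The sketch below is an assumption-laden illustration, not the talk's implementation: `createDraftStore`, the `baseVersion` parameter, and the `'conflict'` result shape are all hypothetical names.

```javascript
// Hypothetical server-side draft store using a version counter per draft.
function createDraftStore() {
  const drafts = new Map(); // id -> { data, version }

  return {
    // baseVersion is the version the client last saw (0 for a new draft).
    saveDraft({ id, data, baseVersion }) {
      const current = drafts.get(id) ?? { data: null, version: 0 };
      if (baseVersion !== current.version) {
        // Someone else saved in between: report it instead of overwriting.
        return { status: 'conflict', theirs: current.data, version: current.version };
      }
      const next = { data, version: current.version + 1 };
      drafts.set(id, next);
      return { status: 'saved', version: next.version };
    },
  };
}

const draftStore = createDraftStore();
const first = draftStore.saveDraft({ id: 'doc-1', data: 'Hello', baseVersion: 0 });
const stale = draftStore.saveDraft({ id: 'doc-1', data: 'Hi', baseVersion: 0 });
console.log(first.status, stale.status); // saved conflict
```

Returning the other party's data with the conflict gives the UI enough to offer "keep mine", "keep theirs", or a merge, which is one way to make the policy understandable to users.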
  9. PATTERN 03: Guard destructive actions

    Add friction only when blast radius is high.

    • Prefer reversible deletes with a retention window
    • Use confirmation that proves intent, not “Are you sure?”
    • Show consequences in plain language before the action
    • Make recovery visible: restore, undo, or support path

    EXAMPLE: GitHub repo deletion requires typing the repo name.

    SKETCH:
      if (typed !== expected) throw new Error('Confirm text mismatch');
      await api.deleteProject({ id, confirm: typed });
      toast.success('Deleted. Restore available for 30 days.');

    Trade-off: friction can reduce conversion on low-risk actions.
  10. PATTERN 04: Resilient forms

    Never lose input.

    • Keep form state local first; sync is a separate concern
    • Validate inline and early, close to the field
    • Retry submit without clearing inputs or jumping scroll
    • Persist partial progress for long forms and wizards

    EXAMPLE: Checkout flows that keep fields after a payment failure.

    SKETCH:
      useEffect(() => localStorage.setItem(key, JSON.stringify(state)), [state]);
      const submit = async () => {
        setStatus('submitting');
        await api.submit(state).catch(() => setStatus('fix-and-retry'));
      };

    Trade-off: you must handle stale drafts and validation drift.
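The trade-off above (stale drafts and validation drift) can be handled by stamping each saved draft. The following is a rough sketch under stated assumptions: `SCHEMA_VERSION`, `MAX_AGE_MS`, and the `createMemoryStorage` stand-in (which mimics the `getItem` / `setItem` / `removeItem` shape of `localStorage` so the example runs outside a browser) are all illustrative choices.

```javascript
const SCHEMA_VERSION = 2;               // assumed: bump when fields or validation change
const MAX_AGE_MS = 24 * 60 * 60 * 1000; // assumed: drafts expire after one day

// Minimal stand-in with the same getItem/setItem/removeItem shape as localStorage.
function createMemoryStorage() {
  const data = new Map();
  return {
    getItem: (k) => (data.has(k) ? data.get(k) : null),
    setItem: (k, v) => { data.set(k, String(v)); },
    removeItem: (k) => { data.delete(k); },
  };
}

function saveDraft(storage, key, state, now) {
  storage.setItem(key, JSON.stringify({ v: SCHEMA_VERSION, savedAt: now, state }));
}

function loadDraft(storage, key, now) {
  const raw = storage.getItem(key);
  if (raw === null) return null;
  try {
    const draft = JSON.parse(raw);
    const fresh = draft.v === SCHEMA_VERSION && now - draft.savedAt <= MAX_AGE_MS;
    if (fresh) return draft.state;
  } catch {
    // Corrupt JSON falls through to the cleanup below.
  }
  storage.removeItem(key); // stale, outdated, or unreadable: drop it
  return null;
}

const storage = createMemoryStorage();
saveDraft(storage, 'checkout', { email: 'a@example.com' }, 0);
const restored = loadDraft(storage, 'checkout', 60_000);
const expired = loadDraft(storage, 'checkout', MAX_AGE_MS + 60_001);
console.log(restored.email, expired); // a@example.com null
```

In a browser, `window.localStorage` could be passed in place of the stand-in, since it exposes the same three methods.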
  11. PATTERN 05: Explainable errors

    Make the next step obvious.

    • Say what failed in user language, not system language
    • Offer the best next action: retry, edit, or contact support
    • Include a correlation ID for support and incident triage
    • Avoid blame and mystery: no raw error codes, no “unexpected error”

    EXAMPLE: “Payment declined. Try a different card or contact your bank.”

    SKETCH:
      showError({
        title: 'Could not save changes',
        detail: 'The server timed out. Your draft is safe.',
        action: { label: 'Retry', onClick: retry },
        correlationId,
      });

    Trade-off: requires good error taxonomy and mapping.
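The "error taxonomy and mapping" from the trade-off line could start as small as a lookup table. This is a sketch only: the categories, the copy, the `classify` rules, and the `toUserError` name are assumptions, and a real product would derive them from its own failure modes.

```javascript
// Hypothetical taxonomy: internal failure kinds -> user-facing copy and next action.
const ERROR_COPY = {
  timeout: {
    title: 'Could not save changes',
    detail: 'The server timed out. Your draft is safe.',
    action: 'retry',
  },
  validation: {
    title: 'Some details need attention',
    detail: 'Check the highlighted fields and try again.',
    action: 'edit',
  },
  rate_limited: {
    title: 'Too many attempts',
    detail: 'Wait a moment, then try again.',
    action: 'retry',
  },
};

const FALLBACK = {
  title: 'Something went wrong on our side',
  detail: 'Nothing was lost. If this keeps happening, contact support.',
  action: 'contact-support',
};

// Map a low-level error to a taxonomy category.
function classify(err) {
  if (err.status === 408 || err.status === 504 || err.code === 'ETIMEDOUT') return 'timeout';
  if (err.status === 400 || err.status === 422) return 'validation';
  if (err.status === 429) return 'rate_limited';
  return 'unknown';
}

function toUserError(err, correlationId) {
  const copy = ERROR_COPY[classify(err)] ?? FALLBACK;
  return { ...copy, correlationId };
}

const shown = toUserError({ status: 504 }, 'corr-123');
console.log(shown.title, shown.action, shown.correlationId);
// Could not save changes retry corr-123
```

Unmapped failures fall back to copy that still offers a next step and carries the correlation ID, so support can trace them.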
  12. PATTERN 06: One-click recovery links

    Resume beats restart.

    • Create stable links to a safe checkpoint
    • Rehydrate state from the server, not fragile client memory
    • Use clear “Resume” language instead of “Start over”
    • Make recovery the default when a flow fails mid-way

    EXAMPLE: Resume checkout, continue draft, retry from step 3.

    SKETCH:
      const receipt = await api.startFlow({ cartId });
      router.push('/checkout/resume/' + receipt.id);
      // resume page fetches state by receipt.id

    Trade-off: requires stable IDs and durable server state.
  13. SHIFT: System-aware product design

    When the backend is degraded, your UI still has choices.
  14. PATTERN 07: Degraded states

    Show what still works.

    • Expose partial degradation instead of a blank spinner
    • Degrade optional features first; keep the core path alive
    • Use banners to explain impact and expected duration
    • Lean on cached data with staleness labels when needed

    EXAMPLE: Read-only mode with cached data and disabled actions.

    SKETCH:
      if (status.payments === 'down') {
        disable('pay');
        showBanner('Payments are degraded. You can still browse and save carts.');
      }

    Trade-off: consistency; you must label stale data clearly.
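When more than one dependency can degrade, the single `if` in the sketch may be generalized into a small dependency map. The version below is illustrative only; the `FEATURES` table, service names, and `deriveUiState` are assumptions rather than anything shown in the talk.

```javascript
// Hypothetical dependency map: which services each feature needs, and whether it is core.
const FEATURES = {
  browse: { needs: ['catalog'], core: true },
  saveCart: { needs: ['carts'], core: true },
  pay: { needs: ['payments'], core: true },
  recommendations: { needs: ['catalog', 'ml'], core: false },
};

function deriveUiState(status) {
  const enabled = {};
  const banners = [];
  for (const [name, { needs, core }] of Object.entries(FEATURES)) {
    const down = needs.filter((svc) => status[svc] !== 'up');
    enabled[name] = down.length === 0;
    // Only announce core outages; optional features are simply switched off.
    if (down.length > 0 && core) {
      banners.push(`${name} is unavailable (${down.join(', ')} degraded)`);
    }
  }
  return { enabled, banners };
}

const ui = deriveUiState({ catalog: 'up', carts: 'up', payments: 'down', ml: 'down' });
console.log(ui.enabled.browse, ui.enabled.pay, ui.banners.length); // true false 1
```

Marking features as core or optional in data keeps the "degrade optional features first" decision reviewable by product and design, not buried in conditionals.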
  15. PATTERN 08: Idempotent actions

    Retry should never double charge.

    • Generate an idempotency key per user intent, not per request
    • Reuse the same key across retries, refresh, and back/forward
    • Return the same result for the same key, even after timeouts
    • Show a receipt so users can verify what happened

    EXAMPLE: Stripe idempotency keys; safe “Pay again” buttons.

    SKETCH:
      const key = getOrCreateKey('checkout');
      await api.charge({ amount, key });
      // server: key -> stored result
      // retries return the same receipt

    Trade-off: keys need storage and expiry strategy.
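The two server-side comments in the sketch ("key -> stored result", "retries return the same receipt") can be illustrated with a small in-memory service. This is a simplified sketch, not any payment provider's behavior: `createChargeService`, the receipt shape, and rejecting a reused key with a different amount are assumptions, and a real server would need an atomic check-and-store plus key expiry.

```javascript
// Hypothetical charge service that deduplicates by idempotency key.
function createChargeService() {
  const results = new Map(); // idempotency key -> stored receipt
  let chargesMade = 0;

  return {
    chargeCount: () => chargesMade,
    charge({ amount, key }) {
      const previous = results.get(key);
      if (previous) {
        // Same key, different request: refuse rather than guess the intent.
        if (previous.amount !== amount) {
          throw new Error('Idempotency key reused with a different amount');
        }
        return previous; // a retry gets the stored receipt, with no second charge
      }
      chargesMade += 1;
      const receipt = { receiptId: `rcpt-${chargesMade}`, amount, key };
      results.set(key, receipt);
      return receipt;
    },
  };
}

const payments = createChargeService();
const attempt1 = payments.charge({ amount: 4200, key: 'checkout-42' });
const attempt2 = payments.charge({ amount: 4200, key: 'checkout-42' }); // e.g. retry after a timeout
console.log(attempt1.receiptId === attempt2.receiptId, payments.chargeCount()); // true 1
```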
  16. PATTERN 09: Long-running work with receipts

    Progress that survives refresh.

    • Start work returns a receipt ID immediately
    • Progress is queryable and resumable across sessions
    • Users can close the tab without losing the outcome
    • Receipts double as audit logs during incidents

    EXAMPLE: Report generation, uploads, exports, model runs.

    SKETCH:
      const { receiptId } = await api.startExport(params);
      router.push('/exports/' + receiptId);
      // page polls /exports/{id}/status

    Trade-off: requires background workers and status APIs.
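Behind the polled status endpoint there has to be a durable record per receipt. A minimal sketch of that record follows, with the usual caveats: `createExportJobs`, the state names, and the `report` method (called here by hand, in practice by a background worker) are hypothetical, and the `Map` stands in for durable storage.

```javascript
// Hypothetical job registry: one record per receipt ID.
function createExportJobs() {
  const jobs = new Map(); // receiptId -> { params, state, progress, result }
  let nextId = 1;

  return {
    // Starting work returns a receipt immediately; no work happens here.
    start(params) {
      const receiptId = `exp-${nextId++}`;
      jobs.set(receiptId, { params, state: 'queued', progress: 0, result: null });
      return { receiptId };
    },
    // Called by the worker as it makes progress.
    report(receiptId, progress, result = null) {
      const job = jobs.get(receiptId);
      if (!job) throw new Error(`Unknown receipt ${receiptId}`);
      job.progress = progress;
      job.state = progress >= 100 ? 'done' : 'running';
      job.result = progress >= 100 ? result : null;
    },
    // What the status endpoint would return to a polling page.
    status(receiptId) {
      const job = jobs.get(receiptId);
      if (!job) return { state: 'not-found' };
      return { state: job.state, progress: job.progress, result: job.result };
    },
  };
}

const jobs = createExportJobs();
const { receiptId } = jobs.start({ format: 'csv' });
jobs.report(receiptId, 40);
const midway = jobs.status(receiptId);
jobs.report(receiptId, 100, '/files/report.csv');
const final = jobs.status(receiptId);
console.log(midway.state, final.state, final.result); // running done /files/report.csv
```

Because status is looked up purely by receipt ID, a user who closes the tab can return to the same URL later and see the same outcome.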
  17. PATTERN 10: Outbox + offline queue

    Treat the network as intermittent.

    • Capture intent locally, then sync when the network cooperates
    • Show queued, syncing, and failed states in the UI
    • Allow users to cancel or edit queued actions
    • Reconcile conflicts deterministically and transparently

    EXAMPLE: Email send queue, notes sync, offline-first mobile apps.

    SKETCH:
      outbox.enqueue({ type: 'UPDATE', payload, key });
      syncLoop();
      // UI: “3 changes queued”
      // server: idempotent processing by key

    Trade-off: conflict resolution becomes a product decision.
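The `outbox.enqueue` / `syncLoop` pair in the sketch might be backed by something like the queue below. It is deliberately simplified and synchronous so the behavior is easy to follow; `createOutbox`, `flush`, `cancel`, and the boolean-returning `send` callback are assumptions, and a real client would persist the queue and send asynchronously.

```javascript
// Hypothetical outbox: holds actions locally and delivers them in order.
function createOutbox(send) {
  const queue = []; // pending actions, oldest first

  return {
    pending: () => queue.length, // drives UI copy such as "3 changes queued"
    enqueue(action) { queue.push(action); },
    // Let users withdraw an action that has not been sent yet.
    cancel(key) {
      const i = queue.findIndex((a) => a.key === key);
      if (i !== -1) queue.splice(i, 1);
      return i !== -1;
    },
    // Deliver in order; stop at the first failure so ordering is preserved.
    flush() {
      let sent = 0;
      while (queue.length > 0) {
        let ok;
        try { ok = send(queue[0]); } catch { ok = false; }
        if (!ok) break;
        queue.shift();
        sent += 1;
      }
      return sent;
    },
  };
}

let online = false;
const delivered = [];
const outbox = createOutbox((action) => {
  if (!online) return false; // simulate a flaky network
  delivered.push(action.key);
  return true;
});

outbox.enqueue({ type: 'UPDATE', payload: { title: 'A' }, key: 'k1' });
outbox.enqueue({ type: 'UPDATE', payload: { title: 'B' }, key: 'k2' });
const sentOffline = outbox.flush();
online = true;
const sentOnline = outbox.flush();
console.log(sentOffline, sentOnline, outbox.pending()); // 0 2 0
```

An action is removed from the queue only after `send` reports success, so a crash between sending and removal can lead to a resend; that is why the slide pairs the outbox with idempotent processing by key on the server.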
  18. PATTERN 11: Rescue mode

    Make the product safer during incidents.

    • Use feature flags to disable risky paths without redeploying
    • Switch to read-only mode when writes threaten integrity
    • Block destructive actions first; keep viewing and exporting
    • Explain what’s disabled and where users can verify outcomes

    EXAMPLE: Read-only mode during outages; disabling integrations.

    SKETCH:
      if (flags.rescueMode) {
        disableAllWrites();
        showBanner('Rescue mode: changes are paused to protect your data.');
      }

    Trade-off: must be pre-wired and rehearsed.
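"Pre-wired" could mean that every action is already classified before an incident happens, so a single flag can gate the risky ones. A minimal sketch, assuming a hypothetical `ACTIONS` registry and `canRun` guard; the flag object would normally come from a feature-flag service and is hard-coded here.

```javascript
// Hypothetical action registry: each action is classified ahead of time.
const ACTIONS = {
  viewInvoice: { kind: 'read' },
  exportData: { kind: 'read' },
  updateProfile: { kind: 'write' },
  deleteAccount: { kind: 'destructive' },
};

// Decide whether an action may run under the current flags.
function canRun(actionName, flags) {
  const action = ACTIONS[actionName];
  if (!action) return { allowed: false, reason: 'unknown-action' };
  if (flags.rescueMode && action.kind !== 'read') {
    return {
      allowed: false,
      reason: 'Rescue mode: changes are paused to protect your data.',
    };
  }
  return { allowed: true };
}

const normal = canRun('updateProfile', { rescueMode: false });
const paused = canRun('updateProfile', { rescueMode: true });
const stillWorks = canRun('exportData', { rescueMode: true });
console.log(normal.allowed, paused.allowed, stillWorks.allowed); // true false true
```

Running the same guard on the server as well as in the UI would keep the read-only promise even for clients that have not refreshed their flags.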
  19. PATTERN 12: Customer-facing runbooks

    Make support effective at 2 AM.

    • Provide a consistent incident message users can share
    • Expose correlation IDs, receipts, and known-issue links
    • Document safe user steps: verify, retry, or wait
    • Align the UX with your operational playbook

    EXAMPLE: In-product service status + “How to recover” guidance.

    SKETCH:
      # Incident quick guide
      • What’s impacted
      • What still works
      • How users can verify receipts
      • When to retry safely
      • Support escalation path

    Trade-off: runbooks must stay current and accessible.
  20. TRADEOFFS: How to choose the right safety tool

    Prefer reversibility, then durability, then friction.

    • Can we make it reversible instead of confirmable?
    • What is the blast radius if this goes wrong?
    • Is retry safe, or can it duplicate side effects?
    • Do users have a stable place to resume and verify?
    • Can support diagnose this without engineering?
    • Are you building enough safety into AI tools?
  21. APPLY IT: Mini Case Study: A Checkout Flow

    Same backend. Different product outcomes.

    Before: fragile
    • Remove item requires confirmation modal
    • Form clears after a payment failure
    • Timeout forces a start over
    • Retry can double-charge
    • More complaints received

    After: recoverable
    • Undo on remove item
    • Autosaved draft checkout details
    • Receipt page survives refresh
    • Idempotent “Pay” with safe retries
    • Higher user satisfaction
  22. TAKEAWAY: Audit Worksheet

    Use this on one feature. Aim to finish in a single sprint.

    1. What is the most common accidental action and is it undoable?
    2. Does user input persist across refresh, navigation, and errors?
    3. Are destructive actions reversible or guarded with intent proof?
    4. Do errors explain what happened and the best next step?
    5. Is there a one-click recovery link to a safe checkpoint?
    6. Can retries duplicate side effects or are actions idempotent?
    7. Do long tasks have durable receipts and resumable progress?
    8. Does the UI show degraded states instead of hiding behind spinners?
    9. Is there an outbox for queued work when the network is flaky?
    10. Can we switch to rescue mode quickly during incidents?
    11. Can support answer “what happened” with receipts and IDs?
    12. Have we run the rainy-day test for this flow?

    Tip: Screenshot this slide, then audit one high-impact flow.
  23. EXECUTION: How to ship this in one sprint

    Small surface area. High leverage.

    • Pick one flow with high ticket volume or revenue risk
    • Add receipts and safe retries first, as they unlock many patterns
    • Layer in undo, drafts, and explainable errors on top
    • Instrument undo rate, error rate, abandon rate, and time-to-recover
    • Write the support runbook as you build; it exposes gaps fast
  24. PROOF: What to measure

    You’re winning when users recover quickly and confidently.

    • Undo usage and reversal success rate
    • Autosave success rate and draft resume rate
    • Retry success rate without duplicate side effects
    • Time-to-recover after an error or timeout
    • Support tickets per 1,000 sessions for the target flow
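A few of the measures above reduce to simple arithmetic over an event log. The sketch below shows one way to compute three of them; the event names (`undo_attempted`, `undo_succeeded`, `recovered`), the `msSinceError` field, and the choice of the median for time-to-recover are all assumptions for illustration.

```javascript
// Hypothetical summary over a list of analytics events for one flow.
function summarize(events, sessions, tickets) {
  const count = (type) => events.filter((e) => e.type === type).length;

  // Median time-to-recover, in milliseconds.
  const recoveries = events
    .filter((e) => e.type === 'recovered')
    .map((e) => e.msSinceError)
    .sort((a, b) => a - b);
  const mid = Math.floor(recoveries.length / 2);
  let median = null;
  if (recoveries.length > 0) {
    median = recoveries.length % 2 === 1
      ? recoveries[mid]
      : (recoveries[mid - 1] + recoveries[mid]) / 2;
  }

  const attempts = count('undo_attempted');
  return {
    undoSuccessRate: attempts ? count('undo_succeeded') / attempts : null,
    medianTimeToRecoverMs: median,
    ticketsPer1000Sessions: sessions ? (tickets * 1000) / sessions : null,
  };
}

const report = summarize(
  [
    { type: 'undo_attempted' },
    { type: 'undo_succeeded' },
    { type: 'undo_attempted' },
    { type: 'recovered', msSinceError: 4000 },
    { type: 'recovered', msSinceError: 2000 },
    { type: 'recovered', msSinceError: 9000 },
  ],
  2000, // sessions in the period
  6,    // support tickets for the target flow
);
console.log(report.undoSuccessRate, report.medianTimeToRecoverMs, report.ticketsPer1000Sessions);
// 0.5 4000 3
```

Rates return `null` when their denominator is zero, so a dashboard can show "no data" instead of a misleading 0.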
  25. CLOSE: Safety is a feature

    • Fewer tickets
    • Fewer late-night pages
    • Users who feel looked after
    • Pick one flow this week and apply the worksheet
  26. Thank You! Questions?

    • Renaldi Gondosubroto
    • [email protected]
    • www.renaldigondosubroto.com

    @Renaldig @renaldigondosubroto @renaldig @therenaldigram