Better Architecture by Telling Stories

It is difficult to explain architectural concepts and outcomes in simple language. When we tell stakeholders that "the system needs five-nines availability" or "we need Amazon-level scalability" their eyes tend to glaze over. Abstract architectural language doesn't mean a lot to people who don't build and operate systems. This can make it really difficult to get buy-in for the things we really need to do: improve our security, fix our scalability, improve our resilience or improve our user experience.

This talk (re)introduces a proven, but underused, technique, using specific scenarios - short, structured stories - to bring architectural aspects of a system to life, particularly for less technical stakeholders. Instead of abstract discussions about "resilience" I'll explain how to tell stories about "what happens when a database fails at 4am on Black Friday with 5000 online customers".

Eoin Woods

May 06, 2026

Transcript

  1. © 2025, Artechra Limited. All rights reserved. Eoin Woods

    • Independent consultant (software architecture, CTO)
    • 10 years as CTO in delivery consultancy - Endava
    • 10 years in capital markets - UBS and BGI
    • 10+ years in products - Bull, Sybase, InterTrust
    • Co-author of three Software Architecture books etc
    ADDRESSING ENERGY EFFICIENCY IN SYSTEM DESIGN: A JOURNEY FROM ARCHITECTURE TO OPERATION. EOIN WOODS. A thesis submitted in partial fulfilment of the requirements of the University of East London for the degree of Doctor of Philosophy, December 2018
  2. “We really need this system to be very scalable … we have big plans.”

    “Yes understood, that’s a given.”
    “And really cost effective … we have no revenue yet, remember.”
    “Yes ok, that’s important.”
    “And after all we have personal data, so prioritise security.”
    “Well of course, we always do that.”
    “And ease of use is critical, no unnecessary steps in sign up or login.”
    “Err ok … we’ll get on it.”
  3. I wonder what they mean by “cost effective” … or for that matter “prioritise security” … or any of the rest of it? … what I need to do is to make some sensible choices and then go back and play it back to them in a structured way. They can then make some decisions.
  4. “Good news, I’ve thought through your requirements, and I think I know what you need, here are my proposals:”

    Need: Scalability. Solution:
    • Microservices in containers on AWS (EKS)
    • Multiple databases
    • Cloudflare CDN for web assets
    Need: Efficiency. Solution:
    • Limit data retention to 1 year to minimize storage costs
    • Shut down test environments nightly
    Need: Security. Solution:
    • Identity verification on registration
    • Long passwords
    • Password change every 4 weeks
    • Mandatory MFA on login
    • Data encrypted in database
    Need: Usability. Solution:
    • UX design agency employed
    • Emphasis on consistent metaphors and structure throughout UI
    • Task oriented UI – minimize steps to achieve goal
    “Ok great! Can’t say I totally understand all that but looking forward to the demo!”
  5. Anything familiar at all in that interaction?

    • Ambiguity?
    • Implicit tradeoffs and implications?
    • Lack of shared language?
    • Difficulty reaching a shared understanding?
  6. Architectural scenarios replace vague aspirations - “must be scalable” - with concrete situations to reason about - “here’s what happens with 5000 users”
  7. OUR EXAMPLE ORGANISATION

    A retailer operating a large network of physical stores, a high-traffic e-commerce website, and a popular mobile application. The Retail Automation System (RAS) is their core platform designed to unify all customer interactions and provide all critical retail processing functions. They operate 350 physical stores across the UK, serving approximately 4 million active loyalty customers. The transaction value is highly skewed, as in-store transactions account for 70% of physical units sold (high volume, low complexity) while web and mobile transactions account for 80% of gross transaction value. Like most retailers, online and web shopping offers a range of fulfilment options (Click & Collect, Ship From Store, returns).
  8. AN EXAMPLE (BRIEF) SCENARIO

    Name (What is the situation?): Database Server Failure During Peak Load
    Attribute (What are we explaining?): Availability (Resilience)
    Stimulus (What happened?): The primary database server hangs unexpectedly due to a hardware failure at 19:00 during peak load.
    Environment (What was going on?): The system is operating under normal peak load conditions with 5,000 concurrent users and the primary database server is handling transaction processing.
    Response (What does the system do?):
    • The system detects the database failure within 5 seconds and automatically fails over to the standby replica database server.
    • User sessions are preserved, and customers experience no more than a 10-second delay in their current transactions.
    • The system continues to process orders without any data loss, and customers can complete their purchases without interruption.
    • The system remains operational with 90% of normal capacity while the primary server is being recovered.
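The five-field structure on this slide can also be captured as a small record so that scenarios are stored and rendered consistently. The following is a minimal sketch only: the `Scenario` class, its `render` method and the `db_failure` variable are hypothetical names chosen for illustration and are not part of the talk. The field values are condensed from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One architectural scenario, using the five fields from the slide."""

    name: str                  # What is the situation?
    attribute: str             # Which quality are we explaining?
    stimulus: str              # What happened?
    environment: str           # What was going on at the time?
    response: tuple[str, ...]  # What does the system do?

    def render(self) -> str:
        """Return the scenario as plain text, one labelled field per line."""
        lines = [
            f"Name: {self.name}",
            f"Attribute: {self.attribute}",
            f"Stimulus: {self.stimulus}",
            f"Environment: {self.environment}",
            "Response:",
        ]
        lines.extend(f"  • {step}" for step in self.response)
        return "\n".join(lines)


# Values condensed from the example slide (hypothetical variable name).
db_failure = Scenario(
    name="Database Server Failure During Peak Load",
    attribute="Availability (Resilience)",
    stimulus="The primary database server hangs unexpectedly due to a "
             "hardware failure at 19:00 during peak load.",
    environment="Normal peak load, 5,000 concurrent users, primary database "
                "server handling transaction processing.",
    response=(
        "Failure detected within 5 seconds; automatic failover to standby replica.",
        "User sessions preserved; no more than a 10-second delay.",
        "Orders continue to be processed without data loss.",
        "System runs at 90% of normal capacity while the primary is recovered.",
    ),
)

print(db_failure.render())
```

Making every field mandatory means a scenario cannot be created with the Stimulus or Environment left out, which is one way a vague scenario tends to start.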
  9. USE CASE vs SCENARIO

    Use-cases describe what a system does, architectural scenarios describe how well it does it – and under what conditions
  10. USE CASE vs SCENARIO

    Use Case: Apply Promotion Discount
    Actor: Store Manager
    Goal: Apply % discount to product
    Precondition: Manager logged in with PRICE_CONTROL role
    Flow:
    1. Manager locates & selects product
    2. Manager enters discount % and end-date
    3. System validates discount % against pricing rules
    4. System applies discount to product until end-date
    Postcondition: Product has discount applied until end-date

    Scenario: POS Device Throughput
    Quality: Performance (Throughput)
    Stimulus: POS device processing 30 txn/min from 14:00-16:00
    Environment: Saturday, all stores busy, system at 80% load, 75% card, 25% cash
    Response:
    • 99.5% txns processed in 3 sec
    • 0.5% txns processed in 7 sec via auto retry
    • Inventory records sync’ed in < 10 sec
  11. HOW DOES THIS HELP?

    • A story not a list of facts – gain attention
    • Comprehensible by relevant audience
    • Context and implications are clear
    • Useful for describing current state and discussing future state
    • Highlights what is unknown
    Image: Engrossed Young Reader, StockCake
  12. A RATHER POOR SCENARIO

    Name: Database Server Problem (Very general name, could be many situations)
    Attribute: Availability (Resilience)
    Stimulus: An error occurs with the database (Not enough detail, not specific enough)
    Environment: The system is busy with transactional workload (Needs more precision. How busy? With what exactly?)
    Response: (Not enough info to grab attention and be credible … not precise enough)
    • The system recovers from the database failure and does not lose any data
    • Customers do not have to log into the system again
  13. WHAT MAKES A GOOD SCENARIO?

    • Credible
    • Significant
    • Precise
    • Specific
    • Comprehensible
  14. A GOOD SCENARIO - CREDIBLE

    “Stimulus: Our stores across Yorkshire, Lancashire and Greater Manchester lose connectivity to the Core Retail Platform due to a router failure and a misconfigured secondary device …”
    vs
    “Stimulus: All our stores have a network problem …”
    Realistic situation, good level of detail
  15. A GOOD SCENARIO - SIGNIFICANT

    “Stimulus: A shopfloor employee attempts to access customer address details using a store mobile device …”
    vs
    “Stimulus: A store manager logs into RAS using username, password and 2FA code …”
    Illustrates something you need to explain and that the reader is interested in
  16. A GOOD SCENARIO - SPECIFIC

    “Stimulus: 50 new stores are opened across Scotland in a 5 month period, all being in city centres and between 35-45% of our average store size …”
    vs
    “Stimulus: We are opening lots of new stores in Scotland …”
    Describe a situation, don’t generalise, the scenario loses its power
  17. A GOOD SCENARIO - PRECISE

    “Response: Automatic failover to secondary data center, in-progress transactions preserved, service interruption < 30 seconds, zero data loss, 99.95% uptime measured monthly …”
    vs
    “Response: Automatic failover to secondary data center, quickly, minimal data loss, no significant uptime impact, …”
    Use facts, numbers, specific details, to gain attention
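A precise figure such as “99.95% uptime measured monthly” can be turned into a downtime budget with simple arithmetic, which helps when checking that the numbers in a response are consistent with each other. This is a rough sketch under stated assumptions: a 30-day month, and every failover costing the full 30 seconds quoted as the upper bound. The function and variable names are hypothetical.

```python
def downtime_budget_minutes(availability_pct: float, period_minutes: float) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    return period_minutes * (100.0 - availability_pct) / 100.0


MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # assumption: a 30-day month

budget = downtime_budget_minutes(99.95, MINUTES_PER_30_DAY_MONTH)

# Assumption: each failover uses the scenario's "< 30 seconds" upper bound.
interruption_minutes = 30 / 60
max_failovers = int(budget // interruption_minutes)

print(f"Monthly downtime budget: {budget:.1f} minutes")
print(f"Worst-case failovers that fit in the budget: {max_failovers}")
```

Under these assumptions the monthly budget is about 21.6 minutes, so the 30-second interruption and the 99.95% target in the scenario are comfortably compatible.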
  18. A GOOD SCENARIO - COMPREHENSIBLE

    “Response: The attack is detected and blocked before attackers can access RAS, security team alerted, connection terminated…”
    vs
    “Response: The application firewall flags an SQL injection attempt, this is logged to CSLP, an alert is raised for the SOC, the application firewall closes the websocket, …”
    Write it in language the reader will understand, explain jargon, concise as possible
  19. WHAT MAKES A GOOD SCENARIO?

    Name: Database Server Problem
    Attribute: Availability (Resilience)
    Stimulus: An error occurs with the database
    Environment: The system is busy with transactional workload
    Response:
    • The system recovers from the database failure and does not lose any data
    • Customers do not have to log into the system again
  20. WHAT MAKES A GOOD SCENARIO?

    Name: Database Server Failure During Peak Load
    Attribute: Availability (Resilience)
    Stimulus: The primary database server hangs unexpectedly due to a hardware failure at 19:00 during peak load.
    Environment: The system is operating under normal peak load conditions with 5,000 concurrent users and the primary database server is handling transaction processing.
    Response:
    • The system detects the database failure within 5 seconds and automatically fails over to the standby replica database server.
    • User sessions are preserved, and customers experience no more than a 10-second delay in their current transactions.
    • The system continues to process orders without any data loss, and customers can complete their purchases without interruption.
    • The system remains operational with 90% of normal capacity while the primary server is being recovered.
    Credible, Significant, Specific, Precise, Comprehensible
  21. USES FOR SCENARIOS

    • What to build
    • Identify trade-offs
    • Drive research
    • Assess design options
    • Explain behaviour
  22. USES FOR SCENARIOS: WHAT TO BUILD

    “Do we really need user registration? It hurts usability.”
    “True, but how about security? …”

    Scenario: Rushed Customer Abandons Basket
    Quality: Usability
    Stimulus: First time customer arrives via clicking a social media ad link to buy one item
    Environment: Normal trading, customer has 5 minutes before a meeting so is rushed
    Response:
    1. Customer selects item and goes to checkout
    2. Customer not logged in so offered “Login” and “Register” options
    3. Customer selects “Register” and is asked for name, address, email, DoB and password
    4. Customer abandons purchase (stats show 65% of first time customers react this way)

    Scenario: Unregistered Customer Fraud
    Quality: Security
    Stimulus: Customer uses stolen card details and “guest” checkout option to place 12 orders in 20 minutes for high value items
    Environment: Weekend evening, low traffic
    Response:
    1. All 12 orders pass payment authorization
    2. Fulfilment begins
    3. Fraud system flags probable fraud on all 12 orders in ~15 minutes, orders all marked as “suspended” 10 minutes later
    4. 5 orders are dispatched from regional warehouse in 20 minutes, assumed lost, total value £3,800
    5. Chargeback process initiated on all 12 transactions

    Helps the business and POs decide relative importance
  23. USES FOR SCENARIOS: TRADEOFFS

    “Can we have resilience with less cost than parallel regions?”
    “Yes, but …”

    Scenario: Recover from DB Failure (Backups)
    Quality: Availability
    Stimulus: DB corruption at 14:00 on Saturday
    Environment: 2000 web users, all 350 stores online
    Response:
    1. Monitoring reports database errors
    2. Load balancer recognizes failures within 30 seconds, fails over to minimal failover system (static pages plus order storage without confirmation)
    3. Customers with open baskets in db notified baskets are saved in 5 minutes
    4. Database recovered in 40 minutes + 30 minutes to replay txn logs
    5. System back online in 80 minutes with 30 sec data loss (from txn log)

    Scenario: Recover from DB Failure (Hot Standby)
    Quality: Availability
    Stimulus: DB corruption at 14:00 on Saturday
    Environment: 2000 web users, all 350 stores online
    Response:
    1. Monitoring reports database errors
    2. Load balancer recognizes failures within 30 seconds, fails over to backup region
    3. ~20 sec of txns lost from the old system
    4. WebUI recognizes failover, automatically refreshes to update state, replays missing txns, displays warning to users to check their purchases
    5. Store devices recognize failover and refresh or restart operation to resync with servers
    6. System fully available within 45 seconds
  24. USES FOR SCENARIOS: DRIVE RESEARCH

    “What happens if the store wireless network fails?”
    “Things work in offline mode … I think”

    Scenario: Store Wireless Network Fails and the Associate Devices Fall Back to Offline Mode
    Quality: Resilience
    Stimulus: In one store the internal WAP serving associates’ handheld devices fails due to a hardware failure and no automatic failover is provided
    Environment: Routine mid-day retail operations, store operating at ~70% of capacity
    Response:
    1. Associates’ devices report connectivity failure
    2. Devices report working in offline mode, use cached data to provide stock level estimates and store location of 10,000 most popular products and accept customer returns for upload when reconnected, but cannot provide promotions information or locate customer orders delivered to site
    3. Associates use devices to accept returns and answer stock queries (informing customers of estimated values)
    4. Customer collections handled by manual processing and using desktop terminals
    5. Associates plug devices into docking stations to upload and sync data

    Who tested this? Do we have the sync logic in RAS to handle this?
  25. USES FOR SCENARIOS: EXPLAIN BEHAVIOUR

    “So what would happen if the web UI became overloaded?”
    “A couple of things, …”

    Scenario: Unexpected Demand Overloads Web UI
    Quality: Scalability, Resilience
    Stimulus: Unexpected online marketing for new Nintendo Switch Pokemon game causes huge online store traffic spike (~20,000 concurrent requests) on a Thursday morning
    Environment: Normal light weekday load, ~750 concurrent users, autoscaling configured (~3-minute startup)
    Response:
    1. Within 60 seconds web servers saturated, response times climb from < 200ms to ~35sec
    2. CDN still serving content and keeping pace with traffic
    3. Users missing page content and refresh browsers, increasing traffic further (some post to social media)
    4. Web servers effectively stalled due to overload
    5. 15 new servers come online via auto-scaling, 3 of original 5 servers auto restarted as assumed hung
    6. Traffic gradually redistributes over the servers (~4 minutes)
    7. Database server now overloaded and response times climb from 250ms to 10+ sec
    8. DBAs respond to monitoring call and bring 4 additional instances online, restoring service in ~12 mins
    9. System returns to normal operation after ~26 minutes, supporting new load which continues for ~4 hours
    10. ~10,000 users abandon session, significant social media noise for ~8 hours following
  26. USES FOR SCENARIOS

    • What to build: Brings choices to life, sparks better conversations
    • Understand design options & tradeoffs: Illustrates what happens when X is prioritised over Y
    • Drive research: Writing scenarios quickly reveals what is unknown, driving research effort in a valuable direction
    • Explain behaviour: Don’t just write “happy path” scenarios, lots of learning in the failure mode ones too
  27. A FEW DIFFICULTIES TO BE AWARE OF

    • Stakeholder Attention: If stakeholders won’t help to identify meaningful scenarios and won’t spend the time to understand them.
    • Time from the Team: Similarly the development team need to engage with this idea and spend the time to create and analyse scenarios
    • Excessive Number: The opposite problem! Once people get the idea the tendency is to do dozens … which just defocuses people
  28. ARCHITECTURAL QUALITY TRADEOFFS

    • We know that we have tension and trade-offs between architectural qualities – scenarios highlight this
    • A scenario demanding security may well conflict with a scenario needing runtime efficiency
    • Individual scenarios don’t really help so we need some techniques for visualising and analysing these tradeoffs
    Image by Arek Socha from Pixabay
  29. TRADEOFF MATRICES

    “How do we connect the store servers to the RAS platform services?”

    Step 1: Define Arch Quality Factors
    Factor | Definition | Measurement
    Performance | API request speed from client call to server code invocation | H (< 5ms) / M (5-10ms) / L (>10ms)
    Operational Complexity | Amount of manual intervention needed in operation | Frequent (daily) / Occasional (week-month) / Never
    Capacity Elasticity | Ease of adding server capacity | High (minutes) / Medium (hours) / Low (days)
    Development Complexity | Relative effort to create and maintain APIs | High / Medium / Low

    Step 2: Capture Factors for Each Option in a Tradeoff Matrix (options for client API communication)
    Option | Performance | Operational Complexity | Capacity Elasticity | Dev Complexity
    Synchronous JSON over HTTP | M | N | M | L
    Synchronous gRPC | H | N | L | M
    Asynchronous Messaging using pub/sub (ActiveMQ) | L | M | H | H
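A tradeoff matrix of this kind is small enough to hold as plain data and query directly. The sketch below copies the ratings from the slide and lists, for each option, the factors on which it gets the most favourable rating. The `BEST` mapping is an assumption about which end of each scale is desirable (fast performance, no manual intervention, high elasticity, low development effort), and the names `FACTORS`, `MATRIX`, `BEST` and `strengths` are hypothetical.

```python
FACTORS = [
    "Performance",
    "Operational Complexity",
    "Capacity Elasticity",
    "Dev Complexity",
]

# Ratings copied from the slide, in the same order as FACTORS.
MATRIX = {
    "Synchronous JSON over HTTP": ["M", "N", "M", "L"],
    "Synchronous gRPC": ["H", "N", "L", "M"],
    "Asynchronous Messaging using pub/sub (ActiveMQ)": ["L", "M", "H", "H"],
}

# Assumption: the rating treated as most favourable for each factor.
BEST = {
    "Performance": "H",             # fastest request handling
    "Operational Complexity": "N",  # "Never" needs manual intervention
    "Capacity Elasticity": "H",     # capacity added in minutes
    "Dev Complexity": "L",          # least effort to build and maintain
}


def strengths(option: str) -> list[str]:
    """Return the factors on which an option has the most favourable rating."""
    return [
        factor
        for factor, rating in zip(FACTORS, MATRIX[option])
        if rating == BEST[factor]
    ]


for option in MATRIX:
    print(f"{option}: {', '.join(strengths(option)) or 'none'}")
```

No option is strongest on every factor, which is the tradeoff the matrix is meant to expose. The ratings remain judgements, so the output is a prompt for discussion rather than a score.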
  30. SCENARIO INTERACTIONS

    “How do our scenarios interact with each other?”
    (“-” for conflicts, “+” for reinforcing)

    Columns are the same five scenarios as the rows, in the same order: Performance | Security | Availability | Usability | Modifiability
    Performance: Sale Day Demand Surge | (blank) | -- | ++ | + | -
    Security: Authentication via OAuth | -- | (blank) | 0 | -- | +
    Availability: Automatic Recovery from DB Failure | ++ | 0 | (blank) | 0 | 0
    Usability: Checkout Without Account | + | -- | 0 | (blank) | 0
    Modifiability: Change Payment Provider | - | + | 0 | 0 | (blank)
  31. SCENARIOS vs STAKEHOLDERS

    “Who cares? About what?”
    (High, Medium, Low level of importance)

    Scenario | Customer | Product Team | Tech Ops Team | Marketing Team | Commercial Ops Team
    Performance: Sale Day Demand Surge | H | H | H | H | M
    Security: Authentication via OAuth | L | H | M | L | L
    Availability: Automatic Recovery from DB Failure | M | M | H | L | L
    Usability: Checkout Without Account | H | L | L | H | L
    Modifiability: Change Payment Provider | L | M | L | L | H
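The stakeholder matrix can be queried the same way, for example to find which scenarios to bring to a meeting with a particular group. This is an illustrative sketch using the ratings from the slide; the names `STAKEHOLDERS`, `IMPORTANCE` and `high_priority_for` are hypothetical.

```python
STAKEHOLDERS = [
    "Customer",
    "Product Team",
    "Tech Ops Team",
    "Marketing Team",
    "Commercial Ops Team",
]

# Importance ratings copied from the slide, in the same order as STAKEHOLDERS.
IMPORTANCE = {
    "Performance: Sale Day Demand Surge": ["H", "H", "H", "H", "M"],
    "Security: Authentication via OAuth": ["L", "H", "M", "L", "L"],
    "Availability: Automatic Recovery from DB Failure": ["M", "M", "H", "L", "L"],
    "Usability: Checkout Without Account": ["H", "L", "L", "H", "L"],
    "Modifiability: Change Payment Provider": ["L", "M", "L", "L", "H"],
}


def high_priority_for(stakeholder: str) -> list[str]:
    """Return the scenarios a stakeholder group rates as High importance."""
    column = STAKEHOLDERS.index(stakeholder)
    return [
        scenario
        for scenario, ratings in IMPORTANCE.items()
        if ratings[column] == "H"
    ]


for group in STAKEHOLDERS:
    print(f"{group}: {'; '.join(high_priority_for(group))}")
```

Every group rates at least one scenario as High, which suggests that each has a stake in the conversation.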
  32. RADAR CHART

    • Relative strength of two or three options vs architectural qualities
    • Illustrates tradeoffs (e.g. “security goes up, usability down”)
    Need to stress that this is qualitative (judgement), not quantitative (calculated)
    radarchart.io
  33. Communication is difficult … particularly when it is about architectural qualities!

    Photo by Issa K_T on Unsplash
  34. ARCHITECTURAL SCENARIOS CAN HELP

    • Memorable name, real situation
    • About a specific quality
    • And something significant that happens to our system
    • In a specific situation
    • That explains what the system does or needs to do
  35. Scenarios can turn vague aspirations such as “needing scalability” into comprehensible, meaningful “stories” that allow better decisions to be made.

    Let’s return to our conversation earlier …
  36. “We really need this system to be very scalable … we have big plans.”

    “Yes understood, that’s a given.”
    “And really cost effective … we have no revenue yet, remember.”
    “Yes ok, that’s important.”
    “And after all we have personal data, so prioritise security.”
    “Well of course, we always do that.”
    “And ease of use is critical, no unnecessary steps in sign up or login.”
    “Err ok … we’ll get on it.”
  37. This is all subtle and contextual stuff … and there seem to be lots of tradeoffs, I’d better create some realistic scenarios …
  38. “So when you said scalable, this is what I think you mean … let’s consider what a busy day looks like and what would happen if we had a freak peak in load …”

    Name: Overload Condition on Busy Day
    Attribute: Scalability
    Stimulus: …
    Environment: …
    Response: …
    “Oh I see and can we handle that peak case?”
    “Of course, but not very cost effectively … about 250% of the operational cost and 75% more delivery time … it’s a real tradeoff”
    “Ok, I understand … you know just surviving a freak peak is probably fine for now.”
  39. “And that need for security and also usability needs thought through too … let me sketch a couple of scenarios for you …”

    Name: Customer Abandons Sale due to Registration
    Attribute: Usability
    Stimulus: …
    Environment: …
    Response: …

    Name: Written Off Fraud Due to Unregistered Customers
    Attribute: Security
    Stimulus: …
    Environment: …
    Response: …
    “I get it, this is what you guys call “a tradeoff” right? We’ll need to think about this some more.”
  40. Which scenarios do you need to go back to work and write this week?
  41. Thank You – Questions?

    Eoin Woods
    Artechra
    [email protected]
    www.eoinwoods.info