THE PROMISE
• Infinite scalability and loose coupling
• Deploy independently, ship faster
• Real-time everything, millisecond responses
• Cost efficiency that makes CFOs smile

WHAT ACTUALLY HAPPENS AT 2 AM
• Debugging becomes forensic investigation
• Event schemas drift, consumers break silently
• Product managers expect instant consistency
• The month-end AWS bill arrives with surprises

"EDA is a trade-off: you exchange request-response simplicity for scalability and resilience. Know the price before you pay it."
EDA STRUGGLES WHEN...
• Dev, Ops, QA are separate teams with handoffs
• Centralized architecture review boards approve every change
• Teams don't own their services end-to-end

EDA THRIVES WHEN...
• Stream-aligned teams own domains end-to-end
• Platform teams provide self-service infrastructure
• Teams can deploy independently without coordination

Truth: You cannot successfully adopt EDA without fixing your organizational silos and how teams are organized. Technology follows structure.

THE PRACTICAL GUIDANCE
Team Size: Two-pizza "self-sufficient" teams (6-8 people) can own 2-3 services.
Ownership: One team owns each event schema. Consumers adopt it.
On-call: Teams on call for their own services build better systems.

"Organizations design systems that mirror their communication structures." — Conway's Law
Strong Signal: Yes
• Multiple teams need data from the same business events
• Traffic patterns are spiky or unpredictable
• Downstream processing can tolerate 1-30s delay
• You need a complete audit trail of state changes

Strong Signal: No!!
• Simple linear processes, simple CRUD app, single team, low-scale needs
• Strict ACID transactions across operations, strict ordering requirements
• End-to-end SLAs under 100ms required
• Team lacks distributed-systems experience

The Critical Question: Can your business process tolerate eventual consistency? If the answer is no, EDA adds complexity without solving your actual problem.
Producers (API Gateway, IoT Core): Detect state changes and emit events. Own the event schema. Know nothing about consumers.
Brokers/Routers (EventBridge, MSK*): Route events to interested parties. Handle delivery guarantees. Decouple producers from consumers.
Consumers (Lambda, Step Functions, SQS): React to events. Must handle duplicates gracefully. Can transform, store, or trigger workflows.
Events (CloudEvents specification*): Facts that happened in the system.

The Core Trade-off: Producers never know about consumers. This is loose coupling — and also why debugging is harder. Correlation IDs are your lifeline.
A command is a message sent to initiate a state change in the system or in a specific component. Commands express intent.
• "PlaceOrder" — can be rejected, validated, denied
Events record facts:
• "OrderPlaced" — already happened, cannot be undone

Why Immutability Matters
• Enables replay for debugging and recovery
• Past-tense naming prevents semantic confusion

Design Choice: Include essential data in events, not just IDs. Reference-only events create runtime coupling; overly fat events create schema coupling. Find the balance.

Event Anatomy — What Every Event Must Have (CloudEvents spec; example below)
• eventId: UUID for idempotency checks
• eventType: Domain.Entity.PastTenseAction
• timestamp: when it occurred (UTC, ISO-8601)
• correlationId (traceparent): links related events for tracing
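As a concrete illustration, here is a minimal sketch in Python of an event envelope carrying the four mandatory fields above. The field names follow the slide's anatomy rather than the exact CloudEvents attribute names, and the `detail` payload is a hypothetical example.

```python
import json
import uuid
from datetime import datetime, timezone

def make_event(event_type: str, detail: dict, correlation_id: str) -> dict:
    """Build an event envelope with the four mandatory fields."""
    return {
        "eventId": str(uuid.uuid4()),           # UUID for idempotency checks
        "eventType": event_type,                # Domain.Entity.PastTenseAction
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC ISO-8601
        "correlationId": correlation_id,        # links related events for tracing
        "detail": detail,                       # essential data, not just IDs
    }

# Hypothetical usage: an order was placed (past tense: it already happened).
event = make_event(
    event_type="Sales.Order.Placed",
    detail={"orderId": "o-123", "total": 42.50, "currency": "USD"},
    correlation_id=str(uuid.uuid4()),
)
print(json.dumps(event, indent=2))
```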
Selection Principle:
• Start with EventBridge for routing.
• Add SQS for buffering.
• Use Kinesis only when you need strict ordering or stream replay.
• The simplest service that meets requirements wins.
Poll-based invocation gives you control over batching and retries; async invocation gives you simplicity.

The Kinesis/DynamoDB Streams Trap: A failed batch blocks the entire shard until resolved. One poison message halts your pipeline. Always configure maxRetryAttempts, bisectBatchOnFunctionError, and a destination DLQ, as in the sketch below.
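A minimal sketch of wiring those guardrails onto a Kinesis event source mapping with boto3. The ARNs are placeholders, and the retry/bisect values are illustrative assumptions, not recommendations.

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical ARNs: replace with your own stream, function, and DLQ.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/orders",
    FunctionName="process-orders",
    StartingPosition="LATEST",
    BatchSize=100,
    # Stop retrying a failed batch after a bounded number of attempts
    # instead of blocking the shard forever.
    MaximumRetryAttempts=3,
    # Split a failing batch in half and retry, isolating the poison message.
    BisectBatchOnFunctionError=True,
    # Send records that still fail to a dead-letter destination for replay.
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:us-east-1:123456789012:orders-dlq"
        }
    },
)
```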
Choose Lambda When
• Traffic is spiky with idle periods between bursts
• Function execution is typically under 15 seconds
• You want zero infrastructure management
• Cold-start latency is acceptable (100-500ms)

Consider Fargate/ECS When
• Sustained high throughput (1M+ events/day, steady)
• Processing requires 15+ minutes per item
• You need persistent connections (WebSockets, DB pools)
• Memory needs regularly exceed 10GB

THE COST CROSSOVER
At approximately 1-2 million invocations per day with average duration, Fargate becomes cheaper than Lambda. But total cost of ownership includes operational overhead — Lambda still wins if your team is small and you value simplicity over raw cost efficiency.

Reality: Lambda is not always the answer. Model your costs monthly before committing. Use the AWS Pricing Calculator with realistic traffic patterns (a rough model is sketched below).
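To make the crossover concrete, here is a back-of-envelope monthly cost model. The per-unit prices are approximate public us-east-1 list prices and will drift, so treat them as assumptions and verify with the AWS Pricing Calculator.

```python
# Back-of-envelope monthly cost comparison (prices are approximate
# us-east-1 list prices and change over time -- verify before deciding).

LAMBDA_PER_REQUEST = 0.20 / 1_000_000      # USD per invocation
LAMBDA_PER_GB_SECOND = 0.0000166667        # USD per GB-second
FARGATE_PER_VCPU_HOUR = 0.04048            # USD per vCPU-hour
FARGATE_PER_GB_HOUR = 0.004445             # USD per GB-hour

def lambda_monthly(invocations_per_day: float, avg_duration_s: float,
                   memory_gb: float) -> float:
    monthly = invocations_per_day * 30
    return monthly * (LAMBDA_PER_REQUEST
                      + avg_duration_s * memory_gb * LAMBDA_PER_GB_SECOND)

def fargate_monthly(vcpu: float, memory_gb: float, tasks: int = 1) -> float:
    hours = 24 * 30  # one always-on task
    return tasks * hours * (vcpu * FARGATE_PER_VCPU_HOUR
                            + memory_gb * FARGATE_PER_GB_HOUR)

# Hypothetical workload: 2M events/day, 200ms average, 512MB functions,
# versus one always-on 1 vCPU / 2GB Fargate task.
print(f"Lambda:  ${lambda_monthly(2_000_000, 0.2, 0.5):,.0f}/month")
print(f"Fargate: ${fargate_monthly(vcpu=1, memory_gb=2):,.0f}/month")
```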
Orchestration
• A central coordinator directs each step of the workflow.
• Strengths: Visual debugging, explicit compensation, clear ownership of workflow logic
• Challenge: Orchestrator becomes a bottleneck for changes. Tighter coupling to the coordinator.
• Step Functions, Express Workflows

Choreography
• Services react independently to events. No central coordinator. Each service decides how to respond to what it observes.
• Strengths: Loose coupling, autonomous teams, resilient to single points of failure
• Challenge: No single view of workflow state. Debugging requires correlating logs across multiple services.
• EventBridge rules, SNS/SQS fan-out (a choreography publish is sketched below)

Tip: "Orchestrate within bounded contexts. Choreograph across them." - The pragmatic middle ground that scales.
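A minimal sketch of the choreography side: a producer publishes a domain event to an EventBridge bus and lets whatever rules match it decide what happens next. The bus name, source, and detail payload are hypothetical.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical: the orders service announces a fact; it neither knows
# nor cares which consumers' EventBridge rules will match it.
events.put_events(
    Entries=[
        {
            "EventBusName": "commerce-bus",       # assumed custom bus
            "Source": "com.example.orders",
            "DetailType": "OrderPlaced",
            "Detail": json.dumps(
                {
                    "orderId": "o-123",
                    "correlationId": "7f3c9b2e",  # propagated for tracing
                }
            ),
        }
    ]
)
```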
Exactly-once delivery is one of the classic distributed-systems fallacies: in distributed systems, messages WILL be delivered more than once. Network timeouts, retries, and at-least-once delivery mean duplicates are guaranteed – design for it. A payment processed twice means an angry customer and a potential fraud investigation.
PATTERN 1: IDEMPOTENCY KEY CHECK
• Store the eventId in DynamoDB before processing. Use a conditional write — if the key exists, skip processing entirely.
• ConditionExpression: attribute_not_exists(pk)

PATTERN 2: NATURAL IDEMPOTENCY
• Design operations to produce the same result regardless of execution count. Use SET instead of INCREMENT operations.
• SET balance = :newValue (not balance + :delta)

LAMBDA POWERTOOLS SHORTCUT
The @idempotent decorator handles all the complexity. Uses DynamoDB with a configurable TTL. One decorator, production-grade idempotency.

Cost Trade-off: Idempotency checks add latency (~5-15ms) and DynamoDB costs. For high-volume, low-value events, consider time-windowed deduplication instead. (Pattern 1 is sketched below.)
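A minimal sketch of Pattern 1 with boto3, assuming a hypothetical dedup-table with partition key pk. The conditional write either claims the eventId or tells us a duplicate already arrived.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("dedup-table")  # hypothetical table

def handle(event: dict) -> None:
    print("processing", event["eventId"])  # placeholder business logic

def process_once(event: dict) -> None:
    try:
        # Atomically claim this eventId; fails if another invocation
        # (or a duplicate delivery) already wrote the same key.
        table.put_item(
            Item={"pk": event["eventId"]},
            ConditionExpression="attribute_not_exists(pk)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery: skip processing entirely
        raise
    handle(event)
```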
"When I click Submit, the data should be there immediately. All screens should show the updated state. This is basic functionality." WHAT EDA ACTUALLY PROVIDES Acknowledgment that request was accepted. Actual processing happens asynchronously within seconds to minutes. Different views may show different states temporarily. The Real Fix: Eventual consistency is a product and UX problem disguised as a technical one. Solve it with design and communication, not just engineering. THE PRACTICAL SOLUTION Optimistic UI Show expected state immediately, reconcile when confirmation arrives. Explicit SLAs "99th percentile completion within 30 seconds" - measurable, contractual. Push Notifications WebSockets via API Gateway or AppSync for real-time completion updates.
• No shared call stack like a monolith
• Distributed systems require MORE observability investment
• Trade-off accepted: one complex system exchanged for many simple systems, but debugging now spans service boundaries

Budget Reality: Plan for 15-20% of infrastructure cost on observability. CloudWatch Logs ($0.50/GB ingestion), X-Ray traces, and custom metrics add up fast at scale.

CLOUDWATCH METRICS TO ALERT ON
Event Latency: Time from publish to consumer ack (p50, p95, p99)
DLQ Depth: Any message > 0 needs investigation
Consumer Lag: Events waiting to be processed

THE PRACTICAL SOLUTION
Correlation IDs: Generate at the entry point, propagate through every event and log. This is non-negotiable. Without it, you're blind.
Structured Logging: JSON logs with consistent fields. Lambda Powertools Logger does this with zero config. CloudWatch Insights then becomes powerful. (See the sketch below.)
Distributed Tracing: X-Ray for AWS-native tracing. Trace context must propagate through async boundaries — this requires manual work.
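A minimal sketch of structured logging with correlation IDs using Powertools for AWS Lambda (Python). The service name and the incoming field names are assumptions matching the event anatomy earlier.

```python
from aws_lambda_powertools import Logger

logger = Logger(service="payments")  # emits structured JSON log lines

def handler(event, context):
    # Propagate the correlation ID from the incoming event so every
    # log line in this invocation carries it for CloudWatch Insights.
    logger.append_keys(
        correlation_id=event.get("correlationId", "unknown"),
        event_type=event.get("eventType"),
    )
    logger.info("event received")
    # ... business logic would go here ...
    return {"status": "accepted"}
```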
Failures are features, not bugs. Design your error-handling paths as carefully as your happy paths. Test them in production.

Poison Messages: Malformed events that fail repeatedly, blocking the queue. Fix: maxReceiveCount + DLQ + alerting
Downstream Outage: A database or API becomes unavailable and causes a cascade of failures. Fix: Circuit breaker + exponential backoff
Throttling Storm: A burst exceeds Lambda concurrency and events pile up. Fix: Reserved concurrency + SQS buffer

THE PRACTICAL SOLUTION - THE DLQ STRATEGY EVERY EDA NEEDS
DLQ on Everything: SQS queues, Lambda async, EventBridge rules (see the sketch below)
Alert on DLQ > 0: Any message needs investigation
Build Replay Tooling: Reprocess after the fix is deployed
Archive to S3: Before DLQ retention expires (14 days)
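A minimal sketch of "DLQ on everything" for the SQS case with boto3: a main queue whose redrive policy moves a message to the DLQ after it has been received maxReceiveCount times without being deleted. Queue names and the count are illustrative.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Create the dead-letter queue first so the main queue can reference its ARN.
dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Main queue: after 5 failed receives, SQS moves the message to the DLQ
# instead of letting a poison message cycle through consumers forever.
sqs.create_queue(
    QueueName="orders",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```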
CQRS
Separate read and write models. Writes go to one store optimized for writes; reads come from another optimized for queries.
Use When: Read and write patterns differ significantly
Avoid When: Simple CRUD, team unfamiliar with the pattern

EVENT SOURCING
Store events as the source of truth, not current state. Rebuild state by replaying events. Complete audit trail built in.
Use When: Audit requirements, temporal queries
Avoid When: Simple state tracking, no audit needs

Caution: These patterns add significant complexity. Most applications do not need them. Start simple; add complexity only when requirements demand it.

AWS IMPLEMENTATION
Event Store: DynamoDB with a sort key for event sequence. Kinesis for high-volume append-only. (An append sketch follows.)
Read Models: DynamoDB for key-value, OpenSearch for full-text, RDS for complex joins.
Projections: Lambda subscribed to DynamoDB Streams builds read models from events.
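A minimal sketch of the DynamoDB event-store append under the assumptions above (hypothetical table event-store, partition key pk = aggregate ID, sort key sk = event sequence number). The conditional write doubles as an optimistic-concurrency check so two writers cannot both append event number n.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("event-store")  # hypothetical table

def append_event(aggregate_id: str, sequence: int, event: dict) -> None:
    """Append event number `sequence` for one aggregate, exactly once."""
    try:
        table.put_item(
            Item={"pk": aggregate_id, "sk": sequence, **event},
            # Fails if this sequence number already exists for the aggregate:
            # a concurrent writer won the race, so the caller must re-read
            # the stream and retry with the next sequence number.
            ConditionExpression="attribute_not_exists(sk)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            raise RuntimeError("concurrent append detected; reload and retry")
        raise
```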
PATTERN COMPARISON
• Choreography: Excellent / Low / Medium / Distributed. Cost: Low. Best for: loosely coupled domains.
• Orchestration: Good / Medium / High / Centralized. Cost: Medium. Best for: complex workflows.
• Event sourcing: Excellent / High / Excellent / Medium. Cost: High. Best for: audit requirements.
• CQRS: Eventual / Excellent / Medium / Good. Cost: Medium. Best for: read or write optimization.
• Saga: Eventual / Good / High / Excellent. Cost: Medium. Best for: long transactions.
• Outbox: Strong / Good / Medium / Good. Cost: Low. Best for: transactional consistency.
Every anti-pattern here comes from a real production incident. Learn from others' mistakes.

EVENT SPAGHETTI: Every service publishes events consumed by every other service. Circular dependencies. Fix: Clear domain boundaries, event ownership
THE GOD EVENT: One massive event type with 50+ fields. Every change affects every consumer. Fix: Granular event types per business action
SYNC IN DISGUISE: Producer waits for the consumer's response via another event. Request-response with extra steps. Fix: If you need sync, use sync. Be honest.
NO DLQ: Failed events disappear into the void. No visibility into failures. Data loss during incidents. Fix: DLQ on everything, alert on depth > 0
UNBOUNDED FAN-OUT: One event triggers 100 consumers. Thundering herd on downstream services. Fix: Rate limiting, SQS buffering, concurrency limits
OPTIMISTIC PROCESSING: No idempotency. Assuming events arrive exactly once. Duplicate processing causes corruption. Fix: Idempotency key on every consumer
Idempotency, DLQs, and Replay as First-Class Design: Why these are not "nice to have" but foundational to correctness.
Choreography vs Orchestration: How to balance autonomy and debuggability using EventBridge and Step Functions.
Eventual Consistency as a UX Problem: Why many EDA failures are actually product failures.
Cost Inflection Points: When Lambda stops being cheaper, and how FinOps becomes an architectural concern.
Conway's Law in Practice: Why EDA fails without aligned team ownership and platform enablement.
EDA is not a silver bullet: You exchange request-response simplicity for scalability and resilience. Make the trade consciously.
Idempotency is non-negotiable: Messages will be delivered more than once. Design every consumer to handle duplicates gracefully.
Observability requires more investment, not less: Correlation IDs, structured logging, and distributed tracing are mandatory. Budget 15-20% for observability.
Eventual consistency is a UX problem: Solve it with product (optimistic UI, notifications), not just engineering. Set explicit SLAs.
Orchestrate within, choreograph between: Use Step Functions for complex workflows within a domain. Use EventBridge for cross-domain communication.