Resilient Software Architecture for Microservices:
A Pragmatic Approach to Dealing with Chaos
Dr. Walter Kammergruber
MicroService Meetup Munich, 26.06.2019
Overview
1. Why is resilience important (especially) for microservices?
2. Best practices for designing resilient distributed architectures
3. Testing resilience in applications
Murphy's law
“Anything that can possibly go wrong, does.”
Sack, John. The Butcher: The Ascent of Yerupaja epigraph (1952), reprinted in Shapiro, Fred R., ed., The Yale Book of Quotations 529 (2006)
‘Resilience is an imperative. Our software runs on the truly dismal computers we call data centers. Besides being heinously complex ... they are unreliable and prone to operator error.’
Marius Eriksen, former “peddler of abstraction” at Twitter
E. B. Walker Photography https://www.flickr.com/photos/premierehdr/
Potential Problems with microservices in general
● Performance
● Correctness
● Monitoring
● Debugging
● Efficiency
● Security
● Operability
● Resilience
Resilience in microservices is hard
At least one of these applies:
● software you didn't write
● hardware you can't touch
● network you can't configure
Stuff
● breaks in new and surprising ways
● and your customers shouldn't notice
Best Practices
Very Simple Example
Invoice Service → Exchange Rate Service: $€£¥₽ ?
Defensive Programming
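MAX_RETRIES = 7

# Fetches exchange rates, retrying with an increasing wait, and falls back to
# the cache when every attempt fails. call_exchange_rates, cache_rates,
# cached_rates, increase_wait_time and Error are assumed to be defined
# elsewhere; present? comes from ActiveSupport (Rails).
def retrieve_exchange_rates
  retry_count = 0
  wait_time = increase_wait_time(retry_count)
  while retry_count < MAX_RETRIES
    response = call_exchange_rates
    if response.present?
      cache_rates(response)
      return response
    end
    sleep(wait_time)
    retry_count += 1
    wait_time = increase_wait_time(retry_count)
  end
  cached = cached_rates
  return cached if cached.present?
  raise Error, 'Could not fetch Exchange Rate'
end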
What's wrong?
Code smells: it is not the job of the business logic (your code) to know about:
- Number of retries
- (Increasing) wait period
- Caching
Not in this example, but frequently found:
- Strategies for dealing with request peaks (fallbacks, throttling, prioritization, etc.)
- Timeout thresholds ((milli)seconds)
What to do?
- Service objects, maybe as an extra library
- Job queues if possible
- Ambassador Pattern (see later)
- Use for example Finagle or Linkerd
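The service-object option can be sketched roughly as follows. This is only an illustration, not code from the talk: the `RetryPolicy` class, its parameters and the injected `sleeper` are hypothetical names. The point is that retry count, wait period and fallback live in one reusable object instead of in the business logic.

```ruby
# Hypothetical service object (not from the talk): owns retry count, backoff
# and fallback so that the business logic does not have to.
class RetryPolicy
  def initialize(max_retries: 3, base_wait: 0.1, sleeper: ->(seconds) { sleep(seconds) })
    @max_retries = max_retries
    @base_wait = base_wait
    @sleeper = sleeper # injectable so tests do not really wait
  end

  # Runs the block, retrying on StandardError with exponential backoff.
  # Returns the fallback (if given) once all attempts have failed.
  def call(fallback: nil)
    attempts = 0
    begin
      attempts += 1
      yield
    rescue StandardError
      if attempts <= @max_retries
        @sleeper.call(@base_wait * (2**(attempts - 1)))
        retry
      end
      raise if fallback.nil?
      fallback
    end
  end
end

# The business code only states what it wants.
policy = RetryPolicy.new(max_retries: 2, sleeper: ->(_seconds) {})
calls = 0
rates = policy.call(fallback: { 'EUR' => 1.0 }) do
  calls += 1
  raise IOError, 'exchange rate service down' if calls < 3
  { 'EUR' => 1.1 }
end
```

Because the policy is a separate object, it could be moved into a shared library or replaced by a proxy without touching the calling code.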
The reactive manifesto
“The system stays responsive in the face of
failure. This applies not only to
highly-available, mission-critical systems —
any system that is not resilient will be
unresponsive after a failure.”
www.reactivemanifesto.org
The reactive manifesto, paradigms for resilience:
● Delegation
● Replication
● Containment
● Isolation
www.reactivemanifesto.org
Delegation
“Delegating a task asynchronously to another
component means that the execution of the
task will take place in the context of that other
component. This delegated context could entail
running in a different error handling context, on
a different thread, in a different process, or on a
different network node, to name a few
possibilities.”
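A small sketch of delegation to a different thread with its own error-handling context. The `Worker` class is a hypothetical name chosen for this illustration; a real system would more likely use a job queue or an actor library.

```ruby
# Hypothetical worker (not from the talk): tasks are handed over through a
# queue and run on another thread, inside that thread's error handling.
class Worker
  attr_reader :errors

  def initialize
    @tasks = Queue.new
    @errors = Queue.new
    @thread = Thread.new do
      while (task = @tasks.pop) # nil is the stop signal
        begin
          task.call
        rescue StandardError => e
          @errors << e # the failure stays with the worker, not the caller
        end
      end
    end
  end

  # Returns immediately; the task runs later on the worker thread.
  def delegate(&task)
    @tasks << task
    self
  end

  def shutdown
    @tasks << nil
    @thread.join
  end
end

results = Queue.new
worker = Worker.new
worker.delegate { results << 21 * 2 }
worker.delegate { raise 'exchange rate service down' }
worker.delegate { results << :still_running }
worker.shutdown
```

The failing task does not stop the worker or the caller; the error is recorded in the worker's context and the next task still runs.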
Replication
“Executing a component simultaneously in
different places is referred to as replication. This
can mean executing on different threads or
thread pools, processes, network nodes, or
computing centers.”
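As a rough illustration of replication on different threads, the sketch below sends the same request to several replicas at once and takes the first successful answer. The helper `first_successful` and the lambdas standing in for replicas are assumptions made for this example.

```ruby
# Hypothetical helper (not from the talk): calls every replica on its own
# thread and returns the first successful answer; raises if all fail.
def first_successful(replicas)
  answers = Queue.new
  replicas.each do |replica|
    Thread.new do
      answers << [:ok, replica.call]
    rescue StandardError => e
      answers << [:error, e]
    end
  end

  errors = []
  replicas.size.times do
    status, value = answers.pop
    return value if status == :ok

    errors << value
  end
  raise "all #{errors.size} replicas failed"
end

# Lambdas stand in for calls to three replicated service instances.
replicas = [
  -> { raise IOError, 'node 1 down' },
  -> { { 'EUR' => 1.1 } },
  -> { raise IOError, 'node 3 down' }
]
rates = first_successful(replicas)
```

One healthy replica is enough for the caller to get an answer, at the cost of doing the work several times.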
Isolation (and Containment)
“Isolation can be defined in terms of decoupling,
both in time and space. Decoupling in time
means that the sender and receiver can have
independent life-cycles—they do not need to be
present at the same time for communication to
be possible. It is enabled by adding
asynchronous boundaries between the
components, communicating through
message-passing.”
General basic dos for preparing for resilience
● Logging
● Monitoring
● Integration Testing
● System Testing
Service Design
● Cluster services into use cases: What SLAs do you need for that?
● Make a list of your services: Which are most important for your business?
● What are the challenges?
○ Response time (e.g. online shop, UX)
○ Are any peaks to be expected?
○ Data-intensive services (network bandwidth, CPU, GPU, I/O writes)
Ambassador pattern (aka sidecar)
Host:
○ Application: main functionality
○ Ambassador: proxy to handle:
● Retry
● Circuit breaking
● Monitoring
● Security
Application → Ambassador → Remote Service
docs.microsoft.com/en-us/azure/architecture/patterns/ambassador
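Circuit breaking, one of the duties an ambassador can take over, might look roughly like this. The `CircuitBreaker` class, its thresholds and the injected `clock` are hypothetical and only illustrate the idea; a real ambassador would usually be a separate proxy process rather than a class inside the application.

```ruby
# Hypothetical circuit breaker (not from the talk or the linked docs).
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 3, reset_after: 30, clock: -> { Time.now.to_f })
    @failure_threshold = failure_threshold
    @reset_after = reset_after # seconds the circuit stays open
    @clock = clock             # injectable so tests can control time
    @failures = 0
    @opened_at = nil
  end

  def open?
    !@opened_at.nil? && @clock.call - @opened_at < @reset_after
  end

  # Forwards the call to the remote service unless the circuit is open.
  def call
    raise OpenError, 'circuit open, failing fast' if open?

    begin
      result = yield
    rescue StandardError
      @failures += 1
      @opened_at = @clock.call if @failures >= @failure_threshold
      raise
    end
    @failures = 0
    @opened_at = nil
    result
  end
end

breaker = CircuitBreaker.new(failure_threshold: 2, reset_after: 10)
2.times do
  begin
    breaker.call { raise IOError, 'remote service down' }
  rescue IOError
    # the application would use a fallback here
  end
end
```

After two failures the breaker is open, so further calls fail fast instead of waiting on a remote service that is known to be down.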
Anti-Corruption Layer pattern
Subsystem A (microservices) ↔ Anti-Corruption Layer ↔ Subsystem B (aka legacy)
docs.microsoft.com/en-us/azure/architecture/patterns/anti-corruption-layer
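A minimal sketch of what such a translation layer could look like. The legacy record format, the `Invoice` struct and the `InvoiceTranslator` module are all invented for this illustration.

```ruby
# Hypothetical legacy record of subsystem B (format invented for this sketch).
LEGACY_INVOICE = { 'INV_NO' => '000042', 'AMT_CENTS' => '1999', 'CURR' => 'eur' }.freeze

# Model used by the microservices of subsystem A.
Invoice = Struct.new(:number, :amount, :currency, keyword_init: true)

# The anti-corruption layer translates between both models so that legacy
# naming and encoding do not leak into the new services.
module InvoiceTranslator
  def self.from_legacy(record)
    Invoice.new(
      number: Integer(record.fetch('INV_NO'), 10),
      amount: Rational(Integer(record.fetch('AMT_CENTS'), 10), 100),
      currency: record.fetch('CURR').upcase
    )
  end

  def self.to_legacy(invoice)
    {
      'INV_NO' => format('%06d', invoice.number),
      'AMT_CENTS' => (invoice.amount * 100).to_i.to_s,
      'CURR' => invoice.currency.downcase
    }
  end
end

invoice = InvoiceTranslator.from_legacy(LEGACY_INVOICE)
```

Only the translator knows about the legacy field names, so a change on the legacy side touches one place.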
Message Queues
Producer → Broker (Topic → Queue 1, Queue 2, Queue 3) → Consumers
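The topic and queue layout can be imitated in memory to show the decoupling. The `Broker` class below is a hypothetical stand-in for a real message broker, and the consumer names are made up.

```ruby
# Hypothetical in-memory broker (not a real messaging API): a topic fans each
# message out to all bound queues; every consumer reads from its own queue.
class Broker
  def initialize
    @topics = Hash.new { |hash, name| hash[name] = [] }
  end

  # Creates a queue bound to the topic and returns it.
  def bind(topic)
    queue = Queue.new
    @topics[topic] << queue
    queue
  end

  def publish(topic, message)
    @topics[topic].each { |queue| queue << message }
  end
end

broker = Broker.new
received = Queue.new
consumers = %w[billing mailing audit].map do |name|
  queue = broker.bind('invoices')
  Thread.new do
    while (message = queue.pop) # nil is the stop signal
      received << "#{name}: #{message}"
    end
  end
end

broker.publish('invoices', 'invoice 42 created')
broker.publish('invoices', nil) # stop all consumers
consumers.each(&:join)
```

The producer never talks to a consumer directly, so a slow or crashed consumer only delays its own queue.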
Compensating Transaction pattern: SaaS shop
Counter operation in each step of the long-running transaction:
1. Register User → Compensate: Delete User
2. Acquire Package → Compensate: Delete Package association
3. Choose a Plan → Compensate: Delete Plan association
4. Issue payment → Compensate: Cancel payment / trigger refund
5. Send confirmation email → Compensate: Send cancellation email
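The step and counter-operation pairs could be wired together roughly as follows. `Step` and `run_saga` are hypothetical names, and the empty lambdas stand in for the real operations. In this sketch only completed steps are compensated, in reverse order.

```ruby
# Hypothetical saga runner (not from the talk): each step carries its own
# compensation. If a step fails, completed steps are undone in reverse order.
Step = Struct.new(:name, :action, :compensation)

def run_saga(steps, log)
  completed = []
  steps.each do |step|
    step.action.call
    log << "done: #{step.name}"
    completed << step
  end
  true
rescue StandardError => e
  log << "failed: #{e.message}"
  completed.reverse_each do |step|
    step.compensation.call
    log << "compensated: #{step.name}"
  end
  false
end

log = []
steps = [
  Step.new('register user', -> {}, -> {}),
  Step.new('acquire package', -> {}, -> {}),
  Step.new('choose a plan', -> {}, -> {}),
  Step.new('issue payment', -> { raise 'card declined' }, -> {}),
  Step.new('send confirmation email', -> {}, -> {})
]
run_saga(steps, log)
```

Here the payment fails, so the plan, package and user steps are compensated and the confirmation email is never sent.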
Microservice Antipatterns
● Microlith: Everything depends on everything else
● The More The Merrier: Too much of everything, i.e. communication, too
complex, bad service cuts
● Flying Blind: Too little logging and monitoring
● Pride and Wrath of Distributed Services: “Yeah, it’s going to work just fine”
Nice read: itnext.io/anti-patterns-of-microservices-6e802553bd46
General (Resiliency) Anti-patterns
Do not violate these programming principles:
● Separation of concerns → enable change / improvements
● Information hiding → enable change / improvements
● Loose coupling → enable change / improvements
● Fail-fast (but of course be resilient as a system) → detect failures / corrupt
data
You might violate:
● “Don't repeat yourself”, “Cohesion”
Do violate:
● “If it ain't broke, don't fix it.”
Testing with Chaos
A QA engineer walks into a bar.
Orders a beer. Orders 0 beers.
Orders 99999999999 beers.
Orders a lizard. Orders -1 beers.
Orders a ueicbksjdhd.
First real customer walks in and asks where the bathroom is. The bar bursts into flames, killing everyone.
@brenankeller, 2018-11-30, 1:21PM
Principles of Chaos Engineering
“Chaos Engineering is the discipline of
experimenting on a system in order to build
confidence in the system’s capability to
withstand turbulent conditions in production.”
principlesofchaos.org
Cynefin complexity framework
Cynefin (pronounced kun-EV-in) means haunt, habitat, acquainted, accustomed, or familiar.
● Complex: unknown unknowns
● Chaotic: unknowables
● Obvious: known knowns
● Complicated: known unknowns
● Disorder
Source: https://learning.oreilly.com/library/view/the-agile-developers/9781787280205/
“Chaos doesn’t cause problems, it reveals them.”*
Nora Jones, Senior Software Engineer, Netflix
*(really hard to find the original source, but quoted very frequently - must be somehow a valid quote)
Chaos engineering general approach
1. Formulate hypothesis: What might go wrong in the system?
2. Set up experiment: Can you recreate the failure without impacting users?
3. Minimize blast radius: Try the smallest experiment first to learn something.
4. Run the experiment: Monitor the results and the system behavior carefully.
5. Analyze: If the system did not work as expected, well, you found a bug. If
everything worked as it should, increase the blast radius and repeat.
6. Fix it: If an error occurred, use your metrics to find the problem and fix
it. Repeat the experiment.
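The loop described in the steps above can be sketched as a small runner. Everything here is an assumption for illustration: `run_experiment`, the percentage-based blast radii and the lambdas that stand in for real fault injection and real metric checks.

```ruby
# Hypothetical experiment runner (not from the talk): starts with the
# smallest blast radius and widens it only while the hypothesis holds.
def run_experiment(blast_radii:, inject:, hypothesis_holds:)
  blast_radii.each do |radius|
    inject.call(radius) # run the experiment at this blast radius
    # Analyze: a broken hypothesis is a bug to fix before repeating.
    return { status: :bug_found, radius: radius } unless hypothesis_holds.call(radius)
  end
  { status: :hypothesis_confirmed, radius: blast_radii.last }
end

injected = []
result = run_experiment(
  blast_radii: [1, 5, 25],                       # percent of traffic affected
  inject: ->(percent) { injected << percent },   # stand-in for fault injection
  hypothesis_holds: ->(percent) { percent < 25 } # stand-in for checking metrics
)
```

In this made-up run the system copes with 1% and 5% of traffic affected and breaks at 25%, which is where the fix and the repeat would start.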
Formulate hypothesis → Set up experiment → Run the experiment → Analyze → Fix it
Do not forget the blast radius!!!
Examples for what you can test
● Maxing out CPU cores on an Elasticsearch cluster.
● Time travel: Forcing system clocks out of sync with each
other.
● Simulating the failure of an entire region or datacenter.
● Partially deleting Kafka topics over a variety of instances
to recreate an issue that occurred in production.
Examples for what you can test (continued)
● Injecting latency between services for a select percentage
of traffic over a predetermined period of time.
● Function-based chaos (runtime injection): Randomly
causing functions to throw exceptions.
● Executing a routine in driver code emulating I/O errors.
learning.oreilly.com/library/view/chaos-engineering/9781491988459
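Latency injection and function-based chaos can be tried out in a few lines. The `ChaosWrapper` class and its parameters are hypothetical and not taken from the book or from any chaos tool; production setups typically inject faults at the proxy or infrastructure level.

```ruby
# Hypothetical chaos wrapper: with a given probability it delays a call or
# makes it raise, to see how the caller copes.
class ChaosWrapper
  class InjectedFailure < StandardError; end

  def initialize(failure_rate: 0.0, latency_rate: 0.0, latency: 0.5,
                 random: Random.new, sleeper: ->(seconds) { sleep(seconds) })
    @failure_rate = failure_rate
    @latency_rate = latency_rate
    @latency = latency
    @random = random   # seedable for reproducible experiments
    @sleeper = sleeper # injectable so tests do not really wait
  end

  def call
    @sleeper.call(@latency) if @random.rand < @latency_rate
    raise InjectedFailure, 'chaos: injected exception' if @random.rand < @failure_rate

    yield
  end
end

# Roughly one in five calls fails; the caller has to cope with that.
chaos = ChaosWrapper.new(failure_rate: 0.2, random: Random.new(42))
outcomes = Array.new(100) do
  chaos.call { :ok }
rescue ChaosWrapper::InjectedFailure
  :failed
end
```

Counting the `:failed` entries in `outcomes` shows how often the caller had to handle an injected error.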
Take away: Be pragmatic
1. Know your most important services
2. Consider the End-to-End User Experience (UX)
3. Be defensive to failure
4. Architect for change
5. Expect the worst and deal with it
6. What is the simplest solution that works for me?
7. Play around with Chaos
8. Never forget about people: ask other teams and experts, and bring them
on board.
Dr. Walter Kammergruber
CTO
@woidda
xing.com/profile/WalterChristian_Kammergruber
linkedin.com/in/dr-walter-kammergruber-9b643130