21
How is it prevented for memory?
• Nothing — silent data corruption.
• Parity — detect, but not correct, (some) errors.
• Error Correcting Codes (ECC) — correct and detect bit
errors. Can usually detect more than it can correct.
• Application — can perform checksumming, be pessimistic
and have redundancy.
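To make the parity bullet concrete, here is a minimal sketch of even parity over a single byte. The helper names (`parity_bit`, `valid?`) and the sample byte are made up for illustration; real memory parity is implemented in hardware, not in application code.

```ruby
# Even parity: the stored bit is 1 when the byte has an odd number of 1-bits,
# so that data plus parity always contains an even number of 1-bits.
def parity_bit(byte)
  byte.to_s(2).count("1") % 2
end

# A word is considered intact when its recomputed parity matches the stored one.
def valid?(byte, stored_parity)
  parity_bit(byte) == stored_parity
end

original  = 0b1011_0010
parity    = parity_bit(original)

one_flip  = original ^ 0b0000_0100 # a single corrupted bit
two_flips = original ^ 0b0001_0100 # two corrupted bits

puts valid?(original, parity)  # intact data passes the check
puts valid?(one_flip, parity)  # a single flip is detected (but cannot be located)
puts valid?(two_flips, parity) # a double flip cancels out and goes unnoticed
```

This is why the slide says parity detects only some errors: any even number of flipped bits leaves the parity unchanged, and even a detected error cannot be corrected because the position of the flipped bit is unknown.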
Slide 22
Slide 22 text
22
Slide 23
Slide 23 text
23
$ sudo dmidecode --type 16
# dmidecode 2.12
SMBIOS 2.8 present.
Handle 0x0059, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 512 GB
Error Information Handle: Not Provided
Number Of Devices: 8
Slide 24
Slide 24 text
24
WOULD YOU BUY
DISCARDED DRAM?
Slide 25
Slide 25 text
25
Assume the system will fail
• CPU – it will misbehave. It will overheat.
• DRAM – it will corrupt your data. It will fail. It will modify your
program.
• Operating System – it will panic. It will be slow. It will corrupt
your data.
• Disk – it will corrupt data. It will be slow. It will fail.
Slide 26
Slide 26 text
26
System Communication
Application
Process
Slide 27
Slide 27 text
27
GRACEFUL
LOOSE COUPLING
Slide 28
Slide 28 text
28
Slide 29
Slide 29 text
29
It is usually not the first error that causes failure,
but the errors that follow.
Slide 30
Slide 30 text
30
Redundant backups with degraded
functionality
Slide 31
Slide 31 text
31
A single component failure should not be
able to compromise the performance
or availability of the entire system.
Slide 32
Slide 32 text
No content
Slide 33
Slide 33 text
search
sessions
carts
mysql
cdn
Slide 34
Slide 34 text
No content
Slide 35
Slide 35 text
Customer signed out
Slide 36
Slide 36 text
36
Another example
class Product
  def tags
    redis.smembers("product:#{id}:tags")
  end
end
Slide 37
Slide 37 text
37
Slightly better
class Product
  def tags
    redis.smembers("product:#{id}:tags")
  rescue Redis::BaseError => e
    # Redis is unavailable: report the error and fall back to no tags.
    ErrorReporter.log(e)
    []
  end
end
Slide 38
Slide 38 text
38
System Communication
Application
Process
Slide 39
Slide 39 text
39
(Micro)service equation
Uptime = A^N
Uptime: total availability
A: availability per service
N: number of services
Even monolithic services easily have tens of
dependencies:
RDBMS, Redis, Memcached, ElasticSearch,
S3, CDN, APIs, Mailing, CRM, …
41
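A quick worked example of Uptime = A^N shows how fast total availability erodes as dependencies pile up. This is a sketch under two assumptions that are not on the slides: every request needs all N services, and the services fail independently. The 99.99% per-service figure is purely illustrative.

```ruby
# Total availability when a request needs all N services to be up,
# each independently available with probability A.
def total_availability(per_service, services)
  per_service**services
end

MINUTES_PER_DAY = 24 * 60

[1, 10, 50, 100].each do |n|
  uptime   = total_availability(0.9999, n)
  downtime = (1 - uptime) * MINUTES_PER_DAY
  puts format("N=%3d  uptime=%.4f%%  expected downtime=%.1f min/day",
              n, uptime * 100, downtime)
end
```

With 100 hard dependencies at 99.99% each, total availability drops to roughly 99%, which is why each dependency needs a fallback rather than being treated as always available.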
Slide 42
Slide 42 text
42
Resiliency Matrix
                     Checkout               Admin        Storefront
MySQL Shard          Unavailable            Unavailable  Degraded
MySQL Master         Available (if cached)  Unavailable  Available
Kafka                Available              Degraded     Available
External HTTP API    Degraded               Available    Unavailable
redis-sessions       Unavailable            Unavailable  Degraded
logging (disk full)  Unavailable            Unavailable  Unavailable
Slide 43
Slide 43 text
43
Simulate TCP conditions with Toxiproxy
https://github.com/shopify/toxiproxy
Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
  Shop.first # this takes at least 1s
end
Toxiproxy[/redis/].down do
  session[:user_id] # this will raise an exception
end
Slide 44
Slide 44 text
Write a Toxiproxy test for each cell
44
# test/integration/resiliency_matrix_test.rb
def test_section_a_mq_a_down
  Toxiproxy[:message_queue_a].down do
    get '/section_a'
    assert_response :success
  end
end

def test_section_b_datastore_b
  Toxiproxy[:datastore_b].down do
    get '/section_b'
    assert_response 500
  end
end
# ... and every other cell
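One way to keep the tests in step with the matrix is to encode the matrix as data and derive one expectation per cell. The sketch below is hypothetical: the `MATRIX` hash covers only three rows, and the mapping from cell state to HTTP status (degraded still answers 200, unavailable answers 500) is an assumption, not something the slides specify. It deliberately leaves Toxiproxy out so that it runs on its own.

```ruby
# Hypothetical encoding of part of the resiliency matrix:
# dependency => { section => state when that dependency is down }.
MATRIX = {
  mysql_shard:    { checkout: :unavailable, admin: :unavailable, storefront: :degraded },
  kafka:          { checkout: :available,   admin: :degraded,    storefront: :available },
  redis_sessions: { checkout: :unavailable, admin: :unavailable, storefront: :degraded }
}.freeze

# Assumed mapping from cell state to the HTTP status a test should expect.
EXPECTED_STATUS = { available: 200, degraded: 200, unavailable: 500 }.freeze

# Flatten the matrix into [dependency, section, expected_status] triples,
# one per cell, so that no cell can be forgotten.
def cells
  MATRIX.flat_map do |dependency, sections|
    sections.map { |section, state| [dependency, section, EXPECTED_STATUS.fetch(state)] }
  end
end

cells.each do |dependency, section, status|
  puts "test_#{section}_#{dependency}_down expects HTTP #{status}"
end
```

In a real suite, each triple could become a generated test method that wraps the request in a Toxiproxy `down` block for that dependency and asserts the expected status.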
Slide 45
Slide 45 text
Even with fallbacks, the system is still vulnerable
to slowness. ECONNREFUSED is a luxury in
distributed systems; slowness is the killer.
45
48
Timeouts will waste resources.
Response time might go up by
at least that timeout.
Slide 49
Slide 49 text
Circuit Breakers
After e timeouts, raise immediately for t seconds.
49
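The rule on this slide can be sketched in a few lines of Ruby. This is a minimal, single-threaded illustration, not the Semian implementation; the class name, the `error_threshold` and `reset_after` parameters (the slide's e and t), and the injectable `clock` are all assumptions made for the example.

```ruby
require "timeout"

# Minimal circuit breaker sketch: after `error_threshold` consecutive timeouts,
# fail fast for `reset_after` seconds instead of calling the slow dependency.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(error_threshold:, reset_after:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold # "e" on the slide
    @reset_after = reset_after         # "t" on the slide, in seconds
    @clock = clock
    @errors = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit open" if open?

    begin
      result = yield
      @errors = 0      # a success closes the circuit again
      @opened_at = nil
      result
    rescue Timeout::Error
      @errors += 1
      @opened_at = @clock.call if @errors >= @error_threshold
      raise
    end
  end

  private

  def open?
    !@opened_at.nil? && (@clock.call - @opened_at) < @reset_after
  end
end

now = 0.0 # fake clock so the example needs no sleeping
breaker = CircuitBreaker.new(error_threshold: 2, reset_after: 5, clock: -> { now })

2.times do
  begin
    breaker.call { raise Timeout::Error, "slow dependency" }
  rescue Timeout::Error
    # the breaker has counted this timeout
  end
end

begin
  breaker.call { :never_runs }
rescue CircuitBreaker::OpenError => e
  puts e.message # rejected without touching the dependency
end

now += 6
puts breaker.call { :ok } # after t seconds the dependency is tried again
```

While the circuit is open, callers pay no timeout at all, which is what keeps one slow dependency from tying up every worker.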
Slide 50
Slide 50 text
Bulkheads
Only allow t processes to access a resource simultaneously.
50
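A bulkhead can be sketched as a ticket counter that rejects callers once all tickets are taken. The slide talks about limiting processes; to stay self-contained, this example limits callers within a single Ruby process, which is a simplifying assumption. The class and parameter names are hypothetical.

```ruby
# Minimal bulkhead sketch: at most `tickets` callers may use the resource at once.
# Extra callers are rejected immediately instead of queueing behind a slow resource.
class Bulkhead
  class FullError < StandardError; end

  def initialize(tickets:)
    @tickets = tickets # "t" on the slide
    @in_use = 0
    @lock = Mutex.new
  end

  def acquire
    @lock.synchronize do
      raise FullError, "no tickets left" if @in_use >= @tickets
      @in_use += 1
    end

    begin
      yield
    ensure
      # Always hand the ticket back, even if the block raised.
      @lock.synchronize { @in_use -= 1 }
    end
  end
end

bulkhead = Bulkhead.new(tickets: 1)

bulkhead.acquire do
  begin
    # The only ticket is already held, so this second caller is rejected.
    bulkhead.acquire { :never_runs }
  rescue Bulkhead::FullError => e
    puts "rejected: #{e.message}"
  end
end

puts bulkhead.acquire { :ok } # the ticket was released, so this succeeds
```

Capping concurrent access means a slow resource can occupy at most t callers, leaving the rest free to serve requests that do not need it.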
Slide 51
Slide 51 text
51
SHOPIFY/SEMIAN
Slide 52
Slide 52 text
52
System Communication
Application
Process
Slide 53
Slide 53 text
53
Antifragile Process
Fragile Robust Antifragile
Slide 54
Slide 54 text
54
ROOT CAUSE ANALYSIS
Slide 55
Slide 55 text
55
GAME DAY
Slide 56
Slide 56 text
56
Slide 57
Slide 57 text
Resiliency Maturity Pyramid
57
No resiliency effort
Testing with mocks
Toxiproxy tests and matrix
Resiliency Patterns
Production Practice Days (Games)
Kill nodes
Latency
Application-Specific Fallbacks
Kill DC
Slide 58
Slide 58 text
Failure Tolerance vs. Reliability
[Figure: systems plotted by reliability (x-axis: 98%, 99%, 99.9%, 99.98%, 99.99%, 99.999%, 99.99999%, 100%) against failure tolerance (y-axis: None, Basic, Fallbacks, Covered resiliency matrix, Resiliency Primitives, Indifferent to failure, Antifragile).]
Regions: Foundation, Early, Maturing, Mature apps, Premature resiliency.
Plotted examples: Prototype, Hackathon, Docker, Shopify, a low quality Erlang app, Rails, MySQL, Redis, Ruby, DNS, Linux, ECC memory, your shitty bank's COBOL mainframe that they pray won’t break, some runtime genetic programming craziness, Aerospace, Google, HFT, Probably you.
Balance reliability and failure tolerance. Your app is the sum.
Slide 59
Slide 59 text
Final remarks
59
Draw your resiliency matrix, write Toxiproxy tests and
implement application-specific fallbacks
Everything will fail. Embrace graceful loose coupling.
Make resiliency fun!
Be careful when introducing new dependencies. Where
are they on the reliability/tolerance graph?
Slide 60
Slide 60 text
Resources
@Sirupsen on Twitter
github.com/Shopify/semian: Resiliency toolkit for Ruby
github.com/Shopify/toxiproxy: Testing resiliency (any language)
github.com/eapache/go-resiliency: Go library for resiliency
github.com/Netflix/Hystrix: JVM library for resiliency
Release It: Incredible book on resiliency
Shopify Blog post on Resiliency
Netflix Engineering Blog
Hystrix Wiki
Semian README
Talk on Resilient Routing and Discovery @ DockerCon 15
Talk on Building and Testing Resilient Applications @ GoRuCo 15
ParisRB June 15 slides from @byroot
60
Slide 61
Slide 61 text
Thank You!
FOLLOW @SIRUPSEN
Slide 62
Slide 62 text
62
• Atomic by Ema Dimitrova from the Noun Project
• Rocket by Simon Mettler from the Noun Project
• processor by Creative Stall from the Noun Project
• Another Squat Discovered by Nahemoth (Flickr)
• Starting Line by rowens27 (Flickr)
• Domino Effect by Bro. Jeffrey Pioquinto, SJ (Flickr)
• Under Pressure by Feans (Flickr)
Credits