Slide 1

A Roadmap Towards Chaos Engineering
Jose E.

Slide 2

Agenda
– The Roadmap that got us to Chaos Engineering.
– 8 Stability Patterns.
– 4 ways to achieve observability.

Slide 3

1 – Observability
2 – Alerting
3 – Incident Management
4 – Test Harness

Slide 4

What we have learned and how we do it

Slide 5

Before running chaos experiments, make sure you account for:
– Stability Patterns
– Observability

Slide 6

8 Stability Patterns
The principle is that everything will fail.

Slide 7

Timeout & Retries
– Without timeouts, slow calls end up depleting your HTTP or DB connection pools.
– Pay close attention to achieving comprehensive retries.
– Do not overwhelm the server with retries.
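A minimal sketch of retries that avoid overwhelming the server, assuming capped exponential backoff with jitter (one common approach; the slide does not name a specific strategy). The function name `call_with_retries` and its parameters are hypothetical, and `operation` is expected to enforce its own timeout.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call operation(); on failure, retry with capped exponential backoff.

    operation is assumed to apply its own timeout (for example, a client-side
    request timeout), so a slow server cannot hold a pooled connection forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Wait a random time up to the capped exponential delay so that
            # many clients do not retry in lockstep against a struggling server.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

The attempt cap bounds the extra load a failing dependency receives, and the random wait spreads the retries out over time.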

Slide 8

Circuit Breaker
– Be a good client.
– Open the circuit during failure events.
– Include the circuit state in your monitoring.
– Graceful degradation.
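The circuit breaker can be sketched as a small state machine. This is an illustrative version under assumptions, not the speaker's implementation: the class name, the thresholds, and the `fallback` parameter (standing in for graceful degradation) are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        """Current state; a value worth exporting to monitoring."""
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, operation, fallback=None):
        if self.state == "open":
            # Spare the failing server: do not even attempt the call.
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("circuit is open")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()  # degrade gracefully
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

While the circuit is open, callers get the fallback immediately, so the client stays responsive and the server gets room to recover.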

Slide 9

Bulkhead Pattern
– Error isolation.
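One way to picture error isolation is a separate, bounded pool of slots per dependency, so that a stuck dependency can only exhaust its own compartment. The sketch below assumes a semaphore-based bulkhead; the `Bulkhead` class and its names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slot."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Reject immediately instead of queueing behind a saturated dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot in this compartment")
        try:
            return operation()
        finally:
            self._slots.release()
```

With one `Bulkhead` instance per downstream dependency, calls to a healthy dependency keep working even when another compartment is full.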

Slide 10

Steady State
– Define what a steady state is for your API.
– Avoid an ever-increasing amount of data.

Slide 11

Fail Fast
– The other side of the timeout.
– Early validation.
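Early validation can be sketched as checking the whole request before doing any expensive or irreversible work. The function `update_prices`, its parameters, and the SKU set are hypothetical names chosen for illustration.

```python
def update_prices(updates, known_skus, apply_update):
    """Validate every update first; apply them only if all are valid."""
    errors = [
        f"Invalid reference of sku {update['sku']}"
        for update in updates
        if update["sku"] not in known_skus
    ]
    if errors:
        # Fail before touching any downstream system.
        raise ValueError("; ".join(errors))
    for update in updates:
        apply_update(update)
    return len(updates)
```

A bad request is rejected at once, without holding connections or leaving a half-applied batch behind.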

Slide 12

Handshaking
– Allows the server to reject calls.
– Avoids memory overflow.
– Pull monitoring using /health endpoints.
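A rough sketch of handshaking, assuming the server tracks its in-flight requests and reports through a `/health` response whether it can accept more. The `Service` class, the JSON fields, and the use of status 503 are assumptions, not details from the slides.

```python
import json


class Service:
    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def health(self):
        """Return (status_code, body) for a GET /health request."""
        accepting = self.in_flight < self.max_in_flight
        body = json.dumps({"accepting": accepting, "in_flight": self.in_flight})
        return (200 if accepting else 503), body

    def handle(self, work):
        # Reject calls beyond capacity instead of buffering them in memory.
        if self.in_flight >= self.max_in_flight:
            return 503, "busy, try again later"
        self.in_flight += 1
        try:
            return 200, work()
        finally:
            self.in_flight -= 1
```

A monitor or load balancer can poll `health()` and stop routing traffic to an instance that reports it is not accepting work.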

Slide 13

Decoupling via Middleware
– Avoid waiting for a response (“fire and forget”).
– The system can process other things while waiting.
– The server side will never be overwhelmed.
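The "fire and forget" idea can be illustrated with an in-process queue standing in for real messaging middleware (an assumption made to keep the example self-contained). The helper `start_worker` is hypothetical.

```python
import queue
import threading


def start_worker(handler, max_pending=100):
    """Start a consumer thread and return the queue producers publish to."""
    inbox = queue.Queue(maxsize=max_pending)

    def run():
        while True:
            message = inbox.get()
            try:
                if message is None:  # sentinel: stop the worker
                    return
                handler(message)
            except Exception:
                pass  # a real consumer would log or dead-letter the message
            finally:
                inbox.task_done()

    threading.Thread(target=run, daemon=True).start()
    return inbox
```

A producer calls `inbox.put_nowait(message)` and moves on without waiting for a response. The consumer pulls messages at its own pace, and the bounded queue pushes back on producers (`put_nowait` raises `queue.Full`) rather than flooding the consumer.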

Slide 14

Test Harness

Slide 15

4 Ways to Achieve Observability
The goal is to answer all the questions we have about our system.

Slide 16

– Logging *events
– Metrics & Reports
– Tracing
– Alerting

Slide 17

Logging
– Choose what you want to see, then choose the tools. Don’t be afraid of tool proliferation.
– Logs exist because somebody wrote something. APMs are good, but nothing beats intentionality.
– Ordered vs structured logs.

[INFO ] [AvalaraCaptureOperation] 04d6fa2e-71b7-4881-88b1-71c7afcb2b76.motosport.10.42.7.10 - Discrepancy of -0.42 dollars found | Taxes charged to user: 2.53 | Taxes reported by Remote Tax Service 2.95 | order OrderLookupResult(orderId=m7075551, orderDate=2019-07-26, catalogId=Motosport, [email protected], estimatedOrderTotalTax=2.53, shippingCostTax=0.52, shipmentTaxes=[ShipmentTax(id='m7075551-42616991', facility=SLCW, items=[ItemTax(sku=ABL000E-X001-Y003, quantity=1, tic=P0000000, productGroupId=100001782, totalTax=2.43, price=33.44, originalPrice=33.44, unitaryClientTax=2.08)], shippingCostTaxItem=ShippingCostTaxItem(cost=7.0, tax=0.52), capturedDate=2019-07-26, captured=true)], isPartnerOrder=false, isExemptTaxOrder=false, address=Destination(country=US, address1=3240 Gilliland Rd, address2=3240 Gilliland Rd, city=Springtown, state=TX, zip=76082-5233), provider=Avalara, created=2019-07-26, lastModified=2019-07-26)

---------------------------------------------------------------------------------------------

2019-08-21 23:32:33,153 merch-log ERROR c.b.s.m.w.c.c.BaseRestController - "netsuite-price-5879-18" | 10.42.7.10:"Apache-HttpClient/4.5.9" 524 PUT "/erp/variants/prices" {} | { "request-size"=103909b, "duration"=5543ms, "response-size"=8192b } com.backcountry.supplychain.merch.business.common.exceptions.BadRequestServiceException: Bad Request received. Invalid value for field 'sku' on object 'erpVariantPriceUpdateModelList': Invalid reference of sku COL4271-BLA-LT. Invalid value for field 'sku' on object 'erpVariantPriceUpdateModelList': Invalid reference of sku COL4271-BLA-XS
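For contrast with the free-text samples above, here is a hedged sketch of a structured log entry: the same tax discrepancy emitted as one JSON object per line, so fields can be queried instead of parsed out of a sentence. It uses Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, the `fields` key, and the field names are assumptions, not the format the speaker's systems use.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields are passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    return logger
```

A call such as `log.info("Tax discrepancy found", extra={"fields": {"orderId": "m7075551", "charged": 2.53, "reported": 2.95}})` then yields a line whose `orderId` can be filtered on directly.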

Slide 18

Tracing
– Causality across systems.
– TraceId and object ids.
– These are the hardest to code, as context needs to be passed across systems.
The Event Log
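Passing context across systems can be sketched as: reuse the caller's trace id if one arrives, otherwise create one, then attach it to every outgoing call and every logged event. The header name `X-Trace-Id` and all function names below are hypothetical.

```python
import contextvars
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name
_trace_id = contextvars.ContextVar("trace_id", default=None)


def start_trace(incoming_headers):
    """Adopt the caller's trace id, or start a new trace at the edge."""
    trace_id = incoming_headers.get(TRACE_HEADER) or str(uuid.uuid4())
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers():
    """Headers to attach to calls made to downstream systems."""
    return {TRACE_HEADER: _trace_id.get()}


def log_event(message, **object_ids):
    """Build an event carrying the trace id plus any object ids."""
    return {"traceId": _trace_id.get(), "message": message, **object_ids}
```

Because every system forwards the same id, events from different services can be joined on `traceId` to reconstruct the causal chain.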

Slide 19

Tracing

Slide 20

Metrics & Reports
– Aggregate data.
– Expose values: the good, the bad, and the ugly.
– Reports can be pushed to the customer.

Slide 21

Status code ratio by minute
Tax discrepancies by site
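A metric like "status code ratio by minute" could be derived from request events roughly as follows. This is a sketch under assumptions: the input shape (pairs of timestamp and HTTP status) and the function name are hypothetical, and a production setup would more likely compute this in a metrics backend.

```python
from collections import Counter, defaultdict
from datetime import datetime


def status_ratio_by_minute(requests):
    """Map each minute to the share of responses per status class (2xx, 5xx, ...)."""
    per_minute = defaultdict(Counter)
    for timestamp, status in requests:
        minute = timestamp.replace(second=0, microsecond=0)
        per_minute[minute][f"{status // 100}xx"] += 1
    return {
        minute: {cls: count / sum(counts.values()) for cls, count in counts.items()}
        for minute, counts in per_minute.items()
    }
```

Reporting ratios rather than raw counts keeps the chart comparable between quiet and busy minutes.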

Slide 22

Alerting
– Differentiate Warnings from Criticals.
– Criticals are meant to wake someone up.
– 6 practices to make great alerting:
● Stop using emails for Critical alerts.
● Write runbooks.
● Delete and tune alerts.
● Use maintenance periods.
● Attempt self healing, but be careful.
● Overcome arbitrary static thresholds.
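As one illustration of moving past arbitrary static thresholds, an alert can compare a new value against the recent behaviour of the metric itself. The sketch below assumes a simple "mean plus or minus k standard deviations" rule, which is only one of many options and is not prescribed by the slides. The function name and defaults are hypothetical.

```python
import statistics


def is_anomalous(history, value, sigmas=3.0, min_points=5):
    """Flag value if it deviates from recent history by more than `sigmas` deviations."""
    if len(history) < min_points:
        return False  # not enough data to judge; stay quiet
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    # Guard against a zero spread when the history is perfectly flat.
    return abs(value - mean) > sigmas * max(spread, 1e-9)
```

The threshold follows the metric as its normal level shifts, which avoids retuning a fixed number by hand.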

Slide 23

Summary
– A maturity model in the form of an ongoing Roadmap that can get us to Chaos Engineering activities.
– 8 Stability Patterns that can apply to software, hardware, network, etc.
– 4 ways to achieve observability and be able to answer all the questions we have about our systems.

Slide 24

Thank you very much! Pura Vida!