A ROADMAP TOWARDS CHAOS ENGINEERING

Chaos Conf
September 26, 2019

Jose Esquivel, Backcountry

A common problem with Chaos Experimentation is knowing where to start. In this talk, Principal Software Engineer Jose Esquivel presents a roadmap for Chaos Experimentation that can be applied to any organization.


Transcript

  1. A Roadmap Towards Chaos Engineering. Jose E.

  2. Agenda

    – The roadmap that got us to Chaos Engineering.
    – 8 stability patterns.
    – 4 ways to achieve observability.
  3. Roadmap

    1 – Observability
    2 – Alerting
    3 – Incident Management
    4 – Test Harness
  4. What have we learned, and how do we do it?
  5. Before running Chaos Experimentation, make sure you account for:

    – Stability Patterns
    – Observability
  6. 8 Stability Patterns

    The principle is that everything will fail.
  7. Timeout & Retries

    – The consequence of missing timeouts is depleted HTTP or DB pools.
    – Pay close attention to achieving comprehensive retries.
    – Do not overwhelm the server with retries.
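The retry guidance on this slide can be sketched as a bounded retry loop with capped exponential backoff and jitter. This is a minimal illustration, not the speaker's implementation; the function name `retry_with_backoff` and all parameter names and defaults are assumptions made for the example.

```python
import random
import time


def retry_with_backoff(call, max_attempts=3, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Run `call()` until it succeeds or `max_attempts` is reached.

    Only TimeoutError is retried here; which errors are safe to retry
    is an application decision (hypothetical choice for this sketch).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # bounded retries: give up instead of hammering the server
            # Exponential backoff, capped at max_delay, scaled by random jitter
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters are injectable only so the behavior can be checked without real waiting; the timeout itself is assumed to be enforced inside `call`.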
  8. Circuit Breaker

    – Be a good fellow client.
    – Open the circuit during failure events.
    – Include this in your monitoring.
    – Graceful degradation.
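A circuit breaker along these lines can be sketched as a small state machine with closed, open, and half-open states. The class below is an illustrative assumption (names, thresholds, and the `fallback` hook are all hypothetical), not code from the talk.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        # Exposing the state makes it easy to include in monitoring
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, func, fallback=None):
        if self.state == "open":
            if fallback is not None:
                return fallback()  # graceful degradation
            raise CircuitOpenError("circuit is open")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        # Success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the remote server receives no traffic from this client, which is the "good fellow client" behavior: the failing dependency gets room to recover.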
  9. Bulkhead Pattern

    – Error isolation.
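One common way to get this kind of isolation is to cap the concurrency available to each dependency, so a slow dependency can only exhaust its own compartment. The sketch below uses a semaphore; the `Bulkhead` class and its names are assumptions for illustration only.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slot."""


class Bulkhead:
    def __init__(self, max_concurrent):
        # Each dependency gets its own Bulkhead, hence its own slot budget
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately rather than queueing behind a slow dependency
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot in this compartment")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

With one `Bulkhead` per downstream service, saturation of one compartment leaves calls to the other services unaffected.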

  10. Steady State

    – Define what a steady state is for your API.
    – Avoid an ever-increasing amount of data.
  11. Fail Fast

    – The other side of the timeout.
    – Early validation.
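Early validation can be illustrated by checking a request before any expensive work starts, so a bad request fails in microseconds instead of after a long remote call. The function, field names, and rules below are hypothetical and chosen only for the example.

```python
class ValidationError(ValueError):
    """Raised before any expensive work when the request is invalid."""


def update_price(request, send):
    """Validate `request` up front, then hand it to the costly `send` call."""
    sku = request.get("sku")
    price = request.get("price")
    if not isinstance(sku, str) or not sku:
        raise ValidationError("sku is required")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price < 0:
        raise ValidationError("price must be a non-negative number")
    return send(request)  # only reached when the input is well formed
```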
  12. Handshaking

    – Allows the server to reject calls.
    – Avoids memory overflow.
    – Pull monitoring using /health endpoints.
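A rough sketch of the idea: the server tracks in-flight work, rejects new calls once it is at capacity, and reports that state through a health check that a monitor or load balancer can poll. The `AdmissionGate` class, the status strings, and the 200/503 mapping are assumptions, not details from the talk.

```python
import threading


class AdmissionGate:
    def __init__(self, capacity):
        self.capacity = capacity
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_enter(self):
        """Return True if the call is admitted, False if it must be rejected."""
        with self._lock:
            if self._in_flight >= self.capacity:
                return False
            self._in_flight += 1
            return True

    def leave(self):
        """Mark one admitted call as finished."""
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)

    def health(self):
        """Payload a /health handler could return: (HTTP status, body)."""
        with self._lock:
            accepting = self._in_flight < self.capacity
            body = {
                "status": "UP" if accepting else "BUSY",
                "in_flight": self._in_flight,
                "capacity": self.capacity,
            }
        return (200 if accepting else 503), body
```

Because work is refused at the door, the number of requests held in memory stays bounded by `capacity`.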
  13. Uncoupling via Middleware

    – Avoid waiting for a response: “fire and forget”.
    – The system can process other things while waiting.
    – The server side will never be overwhelmed.
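In production the middleware would typically be a message broker; as a stand-in, the sketch below uses an in-process bounded queue with one consumer thread to show the shape of "fire and forget". The helper names `start_worker` and `publish` are hypothetical.

```python
import queue
import threading


def start_worker(handler, maxsize=100):
    """Start a consumer thread; returns (inbox, thread)."""
    inbox = queue.Queue(maxsize=maxsize)  # bounded, so the consumer sets the pace

    def loop():
        while True:
            message = inbox.get()
            try:
                if message is not None:
                    handler(message)
            except Exception:
                pass  # a real consumer would log or dead-letter the message
            finally:
                inbox.task_done()
            if message is None:  # None is the shutdown signal in this sketch
                return

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return inbox, thread


def publish(inbox, message):
    """Fire and forget: enqueue and return without waiting for a result."""
    try:
        inbox.put_nowait(message)
        return True
    except queue.Full:
        return False  # the caller decides whether to drop, retry, or alert
```

The producer returns immediately after `publish`, and the bounded queue means the consumer works through messages at its own rate instead of absorbing every burst.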
  14. Test Harness

  15. 4 Ways to Achieve Observability

    The goal is to answer all the questions we have about our system.
  16. Logging (*events)

    Metrics & Reports
    Tracing
    Alerting

  17. Logging

    – Choose what you want to see, then choose the tools. Don’t be afraid of tool proliferation.
    – This exists because somebody wrote something. APMs are good, but nothing beats intentionality.
    – Ordered vs. structured logs.

    [INFO ] [AvalaraCaptureOperation] 04d6fa2e-71b7-4881-88b1-71c7afcb2b76.motosport.10.42.7.10 - Discrepancy of -0.42 dollars found | Taxes charged to user: 2.53 | Taxes reported by Remote Tax Service 2.95 | order OrderLookupResult(orderId=m7075551, orderDate=2019-07-26, catalogId=Motosport, email=gumbydunn@yahoo.com, estimatedOrderTotalTax=2.53, shippingCostTax=0.52, shipmentTaxes=[ShipmentTax(id='m7075551-42616991', facility=SLCW, items=[ItemTax(sku=ABL000E-X001-Y003, quantity=1, tic=P0000000, productGroupId=100001782, totalTax=2.43, price=33.44, originalPrice=33.44, unitaryClientTax=2.08)], shippingCostTaxItem=ShippingCostTaxItem(cost=7.0, tax=0.52), capturedDate=2019-07-26, captured=true)], isPartnerOrder=false, isExemptTaxOrder=false, address=Destination(country=US, address1=3240 Gilliland Rd, address2=3240 Gilliland Rd, city=Springtown, state=TX, zip=76082-5233), provider=Avalara, created=2019-07-26, lastModified=2019-07-26)

    ---------------------------------------------------------------------------------------------

    2019-08-21 23:32:33,153 merch-log ERROR c.b.s.m.w.c.c.BaseRestController - "netsuite-price-5879-18" | 10.42.7.10:"Apache-HttpClient/4.5.9" 524 PUT "/erp/variants/prices" {} | { "request-size"=103909b, "duration"=5543ms, "response-size"=8192b }
    com.backcountry.supplychain.merch.business.common.exceptions.BadRequestServiceException: Bad Request received. Invalid value for field 'sku' on object 'erpVariantPriceUpdateModelList': Invalid reference of sku COL4271-BLA-LT. Invalid value for field 'sku' on object 'erpVariantPriceUpdateModelList': Invalid reference of sku COL4271-BLA-XS
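To illustrate the structured side of "ordered vs. structured logs", here is a small sketch that emits each log event as one JSON object using Python's standard `logging` module. The `JsonFormatter` class, the `fields` convention, and the field names are assumptions for the example; the slide's own logs come from a different stack.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with named fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields are passed as logger.info(..., extra={"fields": {...}})
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as `logger.info("Tax discrepancy found", extra={"fields": {"order_id": "m7075551", "charged": 2.53, "reported": 2.95}})` then yields a line that can be filtered or aggregated by `order_id` without parsing free text.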
  18. Tracing

    – Causality across systems.
    – TraceId and object IDs.
    – These are the hardest to code, as context needs to be passed across systems.

    The Event Log
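Passing context across systems usually means reading a trace ID from the incoming request, keeping it available for the duration of that request, and attaching it to every outgoing call and log event. The sketch below shows that flow with `contextvars`; the header name `X-Trace-Id` and all function names are assumptions, not a standard or the speaker's code.

```python
import contextvars
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch
_trace_id = contextvars.ContextVar("trace_id", default=None)


def start_trace(incoming_headers):
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers(headers=None):
    """Copy `headers` and attach the current trace ID for the next system."""
    result = dict(headers or {})
    trace_id = _trace_id.get()
    if trace_id is not None:
        result[TRACE_HEADER] = trace_id
    return result


def log_event(message, **object_ids):
    """Build an event carrying the trace ID plus object IDs such as order_id."""
    return {"trace_id": _trace_id.get(), "message": message, **object_ids}
```

Searching the logs of every system for one trace ID then reconstructs the chain of cause and effect, while object IDs tie the events to a specific order or item.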
  19. Tracing

  20. Metrics & Reports

    – Aggregate data.
    – Expose values: the good, the bad, and the ugly.
    – Reports can be pushed to the customer.
  21. Status code ratio by minute; tax discrepancies by site
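A metric like "status code ratio by minute" can be derived by bucketing request events per minute and computing the share of each status class. The function below is a minimal sketch under the assumption that events arrive as `(timestamp, status_code)` pairs; that input shape and the function name are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def status_ratio_by_minute(events):
    """Map each minute to the share of 2xx/4xx/5xx (etc.) responses.

    `events` is an iterable of (datetime, status_code) pairs.
    """
    buckets = defaultdict(Counter)
    for timestamp, status in events:
        minute = timestamp.replace(second=0, microsecond=0)
        buckets[minute][f"{status // 100}xx"] += 1

    ratios = {}
    for minute, counts in sorted(buckets.items()):
        total = sum(counts.values())
        ratios[minute] = {cls: n / total for cls, n in counts.items()}
    return ratios
```

Reporting ratios rather than raw counts keeps the chart comparable between quiet and busy minutes.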

  22. Alerting

    – Differentiate Warnings from Criticals.
    – Criticals are meant to wake someone up.
    – 6 practices to make great alerting:
      •  Stop using emails for Critical alerts.
      •  Write runbooks.
      •  Delete and tune alerts.
      •  Use maintenance periods.
      •  Attempt self-healing, but be careful.
      •  Overcome arbitrary static thresholds.
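One possible way to move past arbitrary static thresholds is to compare the current value with a baseline computed from recent history, and to separate Warning from Critical by how far the value deviates. The sketch below uses mean and standard deviation; the sigma multipliers, the `min_spread` floor, and the function name are illustrative assumptions, and other baselining methods would work as well.

```python
from statistics import mean, stdev


def classify(history, current, warn_sigmas=2.0, crit_sigmas=4.0, min_spread=1e-9):
    """Return "ok", "warning", or "critical" for `current`.

    `history` holds recent samples of the same metric (for example,
    errors per minute). Only upward deviations raise an alert here.
    """
    if len(history) < 2:
        return "ok"  # not enough data to form a baseline
    baseline = mean(history)
    # Floor the spread so a perfectly flat history does not yield zero width
    spread = max(stdev(history), min_spread)
    deviation = current - baseline
    if deviation > crit_sigmas * spread:
        return "critical"  # meant to wake someone up
    if deviation > warn_sigmas * spread:
        return "warning"
    return "ok"
```

Because the threshold follows the metric's own recent behavior, it needs less manual tuning than a fixed number chosen once and forgotten.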
  23. Summary

    – A maturity model in the form of an ongoing roadmap that can get us to Chaos Engineering activities.
    – 8 stability patterns that can apply to software, hardware, network, etc.
    – 4 ways to achieve observability and be able to answer all the questions we have about our systems.
  24. Thank you very much! Pura Vida!