Mesh, not just for the club (anymore)

Presented at Maritime DevCon 2019 https://maritimedevcon.ca/

A mesh is defined as an “interlaced structure”, and this exactly describes what we are building when we create distributed systems. The pieces talk to and depend on one another. Managing, monitoring, and controlling the ether between our services is a daunting task, and one that can lead to failure scenarios if mismanaged. In this talk, Josh Comer will explore how adding a Service Mesh to our systems can make them not only more resilient but also more flexible and transparent. Drawing on first-hand stories of distributed failure, Josh will discuss how a mesh could have prevented them. Hopefully, at the end, you will agree that if you could only take one technology to a distributed desert island, it would be a Service Mesh.

Josh Comer

June 08, 2019

Transcript

  1. #MaritimeDevCon - 08/06/19 - @jjcomer What is a distributed system

    A grouping of related, yet independently resourced, processes.
    Ok… but why?
    • Resilience
    • Performance
    • Economics
    • … and more
  2. Microservices

    Strict(ish) set of definitions
    NO SHARING!
    12 Factor Apps are better: https://12factor.net/
    Fun fact: “If your services must be deployed in a certain order, you have a monolith”
  3. What are the problems we face

    Coordination, coordination, coordination
    Oh and coordination
    Deployments without downtime
    Resilience
    Configuration
    Coordination
    Observability
  4. Agnostic configuration... the dream

    Create build artifacts that can be used anywhere
    The environment is the environment!
    • Pull vs Push (Prometheus)
    • Docker logging drivers (log to stdout)
    • Environment aware URLs
      ◦ http://serviceb vs http://env1.us.serviceb
    • Constant secret keys
      ◦ Requesting “DB_PASSWORD” will return the correct password for context
      ◦ Limits blast radius
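As a minimal sketch of the "constant keys, environment supplies the values" idea, the following reads the same keys in every environment. The `load_config` helper and the `SERVICE_B_URL` variable are hypothetical names chosen for illustration; only `DB_PASSWORD` and `http://serviceb` come from the slide.

```python
import os


def load_config(env=None):
    """Read settings by constant key; the environment supplies the values.

    `load_config` and SERVICE_B_URL are hypothetical names for this sketch.
    """
    env = os.environ if env is None else env
    return {
        # Same logical hostname everywhere; resolving it is the platform's job.
        "service_b_url": env.get("SERVICE_B_URL", "http://serviceb"),
        # Same key in every environment; each one injects its own secret.
        "db_password": env["DB_PASSWORD"],
    }
```

Because the artifact never names an environment, the same build can be promoted from test to production unchanged.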
  5. Mesh definition

    From the Dictionary:
    Noun: a weblike pattern or construction
    Verb: to coordinate closely
    Also a fabric which makes great club wear
  6. Service mesh definition

    “A service mesh is a configurable, low‑latency infrastructure layer designed to handle a high volume of network‑based interprocess communication among application infrastructure services using application programming interfaces (APIs).” -- nginx site
  7. [Diagram labels: Data Plane, Control Plane, Service (×4), Service Discovery, Telemetry]
  8. [Diagram labels: Service A Instance, Service B Instance, Service C Instance, mesh node (M) ×3, Mesh Control Plane]
  9. Zoom out a bit...

    [Diagram labels: Requestor, Service, Load Balancer, Instance 1, Instance 2, Instance 3]
  10. Zoom in a bit...

    [Diagram labels: Service A Instance, Mesh node (Service A), Service B, Instance 1, Instance 2, Instance 3, mesh node (M) ×3]
  11. Case study #1 -- Architecture Diagrams

    You are trying to put together onboarding materials... but there are NO DIAGRAMS. The platform is too large for any one person to know all service relationships, and you need this information to be accurate (forever).
  12. Discoverability

    The mesh controls all communication
    It knows exactly who talks to whom
    Can glean extra information: latency and frequency
    Only drawback is that a communication must happen at least once before the mesh knows about it
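To illustrate how "who talks to whom" plus latency and frequency can fall out of observed traffic, here is a small sketch that folds call records into a dependency graph. The record shape `(caller, callee, latency_ms)` and the `build_service_graph` name are assumptions made for this example, not the export format of any particular mesh.

```python
from collections import defaultdict


def build_service_graph(calls):
    """Aggregate observed (caller, callee, latency_ms) records into edges.

    The record shape is a hypothetical stand-in for mesh telemetry.
    """
    edges = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for caller, callee, latency_ms in calls:
        edge = edges[(caller, callee)]
        edge["count"] += 1
        edge["total_ms"] += latency_ms
    # Report frequency and mean latency per observed relationship.
    return {
        pair: {"count": e["count"], "avg_ms": e["total_ms"] / e["count"]}
        for pair, e in edges.items()
    }
```

An edge only exists once a call has been observed, which is the drawback the slide mentions: a relationship that has never been exercised is invisible.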
  13. Case study #2 -- retry failure

    The worst outage ever
    Microservice API → 4 layers deep with 5× retries at each level
    Load overwhelmed the second layer, which led to 5^3 retries per request
    Ended up needing an entire restart of the platform to flush the queues
    [Diagram labels: Client, Service A, Service B, Service C]
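The 5^3 figure can be reproduced with a short calculation. This sketch assumes three calling hops (Client to A, A to B, B to C), each making up to five attempts, and that every attempt fails; the function name is hypothetical.

```python
def attempts_reaching_bottom(hops, attempts_per_hop):
    """Count calls that hit the deepest service when every attempt fails.

    Each hop independently re-issues the request `attempts_per_hop` times,
    so the fan-out multiplies at every level.
    """
    if hops == 0:
        return 1
    return attempts_per_hop * attempts_reaching_bottom(hops - 1, attempts_per_hop)


# Three retrying hops with five attempts each.
worst_case = attempts_reaching_bottom(hops=3, attempts_per_hop=5)
```

Under those assumptions a single client request can turn into 125 calls at the bottom layer, which is why retrying at every level amplifies an overload instead of relieving it.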
  14. Resilience

    All communication goes through the mesh
    Control plane manages distributed circuit breakers
    Sidebar: The request timer should start at the edge of the platform. Each level down just gets a slice of the total request time.
    [Diagram labels: Client, Service A, Service B, Service C, with timeouts 30s, 10s, 3s]
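One way to picture the "slice of the total request time" idea is a deadline created once at the edge and consulted before every downstream call. This is a sketch under assumptions: the `Deadline` class and its methods are hypothetical, and real meshes typically carry the budget in request headers rather than an in-process object.

```python
import time


class Deadline:
    """A request budget started at the platform edge (hypothetical helper)."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        """Seconds left in the overall budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, cap_s):
        """Timeout for a downstream call: its own cap, or less if time is short."""
        return min(cap_s, self.remaining())
```

With a 30s budget at the edge, a level capped at 10s gets its full 10s early in the request, but only 5s once 25s have already elapsed, and nothing once the budget is spent.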
  15. Case study #2.5 -- retry resolution

    As part of the outage a defect was found in the retry logic, which prevented us from disabling retries
    This meant a change was required to resolve the defect
    A lot of disruption to teams and a complete redeployment
  16. Externalization of dependencies

    Services don’t need to implement retry logic
    Only need to write logic for error states
    Ability to dramatically change networking implementation without changing service code
  17. Case study #3 -- Security

    The security team finally noticed the platform you have been building. They would like to know why your data plane is in the clear… Well, now you need to encrypt all traffic. This means securely setting up an internal CA and issuing certs to each service instance (with responsible expirations). Services need to change their code to install the CA, use TLS, and verify incoming connections. Oh, and do a complete redeploy. Not to mention bootstrapping into TLS mode.
  18. Security

    All communication goes through the service mesh
    0 Trust Networking!
    Mesh supports mutual TLS (including authorization) between nodes
    No service changes required!
    [Diagram labels: Service A Instance, Service B Instance, mesh node (M) ×2, Secure (via Mesh), Clear (localhost) ×2]
  19. Case study #4 -- new feature roll out

    I want to release a new feature to my customers in a safe manner. There can’t be any downtime, and if things go wrong I would like a small blast radius. Feature flags are ok, but they can create spaghetti code and need to be removed after the fact (intentional tech debt).
  20. Traffic distribution

    • All traffic goes through the service mesh!
    • Provides ability to shape traffic
      ◦ Blue/Green deploys
      ◦ Percentage based routing
      ◦ Tenant based routing
      ◦ Ghost traffic
    • If things go wrong, only a small fraction of requests are affected
    • No service changes required to support new deployment models
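Percentage based routing can be sketched as a deterministic hash of some request key into 100 buckets. This is an illustrative assumption about how such a split could work, not the routing algorithm of any specific mesh; `pick_version` and the "canary"/"stable" labels are hypothetical.

```python
import hashlib


def pick_version(request_key, canary_percent):
    """Send a stable `canary_percent` of keys to 'canary', the rest to 'stable'.

    Hashing the key (for example a request or tenant ID) keeps the decision
    consistent across retries of the same request.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Keying on a tenant ID instead of a request ID turns the same function into a rough form of tenant based routing, since every request from one tenant lands in the same bucket.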
  21. [Diagram labels: Service A Instance, Service B Instance, Service B’ Instance, mesh node (M) ×3]

    All requests to Service B are mirrored to B’
    Responses from B’ are automatically discarded
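The mirroring behaviour can be sketched in a few lines: both versions see the request, but only the primary's response reaches the caller. The `handle_with_mirror` function and its callable arguments are hypothetical; a real mesh would usually mirror asynchronously rather than in-line as shown here.

```python
def handle_with_mirror(request, primary, shadow):
    """Send the request to both handlers; return only the primary's response."""
    response = primary(request)
    try:
        shadow(request)  # response intentionally discarded
    except Exception:
        # A failing shadow (B') must never affect the caller.
        pass
    return response
```

This is what makes ghost traffic safe for trying out a new version: B’ is exercised with production requests, yet its output and its failures stay invisible to clients.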
  22. Latency

    A service mesh will add latency
    No way to avoid this
    Any extra hops will incur latency
    Are the benefits worth it?
  23. Resource cost

    A proxy is required for every instance of a service
    While efficient, they still require memory, CPU, and bandwidth

    Mesh          CPU                   Memory
    Istio         0.6 vCPU/1000 r/s     50M/proxy
    Linkerd 2.0   0.2-1.5 vCPU          128M-2G
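For a rough sense of scale, the Istio figures quoted on the slide can be turned into a back-of-the-envelope estimate. This sketch assumes cost scales linearly with traffic and proxy count, which is a simplification; the function name and parameters are hypothetical.

```python
def istio_proxy_overhead(instances, total_rps,
                         vcpu_per_1000_rps=0.6, mem_mb_per_proxy=50):
    """Estimate sidecar overhead from the slide's quoted Istio figures.

    Assumes linear scaling: CPU follows total requests per second,
    memory follows the number of proxies (one per service instance).
    """
    vcpu = vcpu_per_1000_rps * total_rps / 1000
    memory_mb = mem_mb_per_proxy * instances
    return vcpu, memory_mb
```

For example, 20 instances serving 2000 r/s in total would come to roughly 1.2 vCPU and 1000M of memory for the proxies alone under these assumptions.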
  24. Additional maintenance

    • Requires integration with multiple systems
      ◦ Service discovery
      ◦ Orchestration layer
    • Easier to maintain than embedded network code
  25. There can be only one

    The battle for orchestration is over
    CNCF is here to save the day
    K8S won!
    Others will continue to exist due to prior adoption
    Most cloud native solutions are targeting k8s first or only
  26. Mesh Darwinism

    Allow the best to survive…
    Unlike orchestration, many can co-exist
    Landscape has already churned: Istio → Linkerd 2.0
    Adopting any particular mesh is a big investment to sink in; how can we allow easy churn?
  27. SMI for the win

    Hot off the Press!
    Microsoft, Linkerd, HashiCorp, Solo.io, Kinvolk, and Weaveworks
    Removes the lock-in
    Allows adoption to be a one time cost to service teams
    Data and Control Planes can be changed without changing service code
  28. SMI -- What does it cover?

    Traffic Policy -- Identity and Security
    Traffic Telemetry -- Key metrics
    Traffic Management -- Controlling distribution
    https://smi-spec.io/ *
    * Only for k8s
  29. Telemetry

    OpenTelemetry -- https://opentelemetry.io/
    Merger of OpenTracing and OpenCensus
    Finally!
    Covers Tracing and Metrics -- Long term plan to include logging
    One library to rule all observability!
    * Not k8s specific
  30. In summary

    A Service Mesh adds an abstraction between your services and how they communicate
    Consistent telemetry increases discoverability and observability
    Distributed error handling can allow the system to breathe and recover
    0 Trust Networking helps secure your infrastructure
    OPEN STANDARDS!!!!!
  31. whoami

    Josh Comer
    Software Architect -- SRE @Cvent (we are hiring)
    Here in Fredericton!