
Understanding microservice latency


An introduction to Distributed Tracing and Zipkin with a demo using Spring Cloud Sleuth

Adrian Cole

June 30, 2016


Transcript

  1. © 2016 Pivotal !1 Understanding microservice latency An introduction to

    Distributed Tracing and Zipkin Adrian Cole, Pivotal @adrianfcole
  2. Latency Analysis Microservice and data pipeline architectures are often

    a graph of components, distributed across a network. A call graph or data flow can become delayed or fail due to the nature of the operation, components, or edges between them. We want to understand our current architecture and troubleshoot latency problems in production.
  3. Troubleshooting latency problems When was the event and how long

    did it take? Where did this happen? Which event was it? Is it abnormal?
  4. When was the event and how long did it take?

    First log statement was at 15:31:29.103 GMT… last… 15:31:30.530 Server Received:15:31:29:103 POST /things Server Sent:15:31:30:530 Duration: 1427 milliseconds
  5. wombats:10.2.3.47:8080 Server log says Client IP was 1.2.3.4 This is

    a shard in the wombats cluster, listening on 10.2.3.47:8080 Server Received:15:31:29:103 POST /things Server Sent:15:31:30:530 Duration: 1427 milliseconds Where did this happen? peer.ipv4 1.2.3.4
  6. wombats:10.2.3.47:8080 Which event was it? The HTTP response header had

    “request-id: abcd-ffe”. Is that what you mean? Server Received:15:31:29:103 POST /things Server Sent:15:31:30:530 Duration: 1427 milliseconds peer.ipv4 1.2.3.4 http.request-id abcd-ffe
  7. wombats:10.2.3.47:8080 Is it abnormal? I’ll check other logs for this

    request id and see what I can find out. Server Received:15:31:29:103 POST /things Server Sent:15:31:30:530 Duration: 1427 milliseconds Well, average response time for POST /things in the last 2 days is 100ms peer.ipv4 1.2.3.4 http.request-id abcd-ffe
  8. wombats:10.2.3.47:8080 Achieving understanding I searched the logs for others in

    that group; they took about the same time. Server Received:15:31:29:103 POST /things Server Sent:15:31:30:530 Duration: 1427 milliseconds Ok, looks like this client is in the experimental group for HD uploads peer.ipv4 1.2.3.4 http.request-id abcd-ffe http.request.size 15 MiB http.url …&features=HD-uploads
  9. POST /things There are often two sides to the story Client

    Sent:15:31:28:500 Client Received:15:31:31:000 Duration: 2500 milliseconds Server Received:15:31:29:103 POST /things Server Sent:15:31:30:530 Duration: 1427 milliseconds
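The gap between the two timelines above can be worked out directly. A minimal sketch, assuming both clocks are on a comparable timeline (real systems have clock skew):

```java
// Worked arithmetic for the slide above: client vs server view of POST /things.
// Assumes the two clocks are comparable; in practice skew blurs this number.
public class TwoSides {
    public static void main(String[] args) {
        long clientDurationMs = 2500; // Client Sent 15:31:28.500 -> Client Received 15:31:31.000
        long serverDurationMs = 1427; // Server Received 15:31:29.103 -> Server Sent 15:31:30.530
        // Time the client waited that the server never saw: wire transit plus queueing.
        long unaccountedMs = clientDurationMs - serverDurationMs;
        System.out.println("unaccounted: " + unaccountedMs + " ms"); // prints "unaccounted: 1073 ms"
    }
}
```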
  10. and not all operations are on the critical path Wire

    Send Store Async Store Wire Send POST /things POST /things
  11. and not all operations are relevant Wire Send Store Async

    Async Store Failed Wire Send POST /things POST /things KQueueArrayWrapper.kev UnboundedFuturePool-2 SelectorUtil.select LockSupport.parkNan ReferenceQueue.remove
  12. It’s getting harder to draw our architecture Polyglot microservice and

    data flow architectures are increasingly easy to write and deploy.
  13. Can we make troubleshooting wizard-free? We no longer need wizards

    to deploy complex architectures. We shouldn’t need wizards to troubleshoot them, either!
  14. Distributed Tracing commoditizes knowledge Distributed tracing systems collect end-to-end latency

    graphs (traces) in near real-time. You can compare traces to understand why certain requests take longer than others.
  15. Distributed Tracing Vocabulary A Span is an individual operation that

    took place. A span contains timestamped events and tags. A Trace is an end-to-end latency graph, composed of spans.
  16. wombats:10.2.3.47:8080 A Span is an individual operation Server Received POST

    /things Server Sent Events Tags Operation peer.ipv4 1.2.3.4 http.request-id abcd-ffe http.request.size 15 MiB http.url …&features=HD-uploads
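The span anatomy above can be sketched as a plain data structure. This is only an illustration of the vocabulary (operation, timestamped events, tags), not Zipkin's actual model; the epoch timestamps are made up to match the clock times on the slide:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative span: one operation with timestamped events and key/value tags.
public class SpanSketch {
    public static void main(String[] args) {
        String operation = "POST /things";

        // Timestamped events, as epoch millis (values chosen to match the slide).
        Map<String, Long> events = new LinkedHashMap<>();
        events.put("Server Received", 1467300689103L); // 15:31:29.103
        events.put("Server Sent",     1467300690530L); // 15:31:30.530

        // Tags: key/value context attached to the operation.
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("peer.ipv4", "1.2.3.4");
        tags.put("http.request-id", "abcd-ffe");
        tags.put("http.request.size", "15 MiB");

        long durationMs = events.get("Server Sent") - events.get("Server Received");
        System.out.println(operation + " duration: " + durationMs + " ms");
    }
}
```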
  17. Tracers record time, duration and host Wire Send Store Async

    Store Wire Send POST /things POST /things
  18. Tracers cement shape with IDs No parent means this is

    the root of the trace Each span in a trace has its own ID The parent ID tells you where you are in the call graph
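The ID rules above can be sketched as data. The IDs here are made up; the point is the relationships: a shared trace ID, a span's own ID, and a parent ID that places it in the call graph:

```java
// Sketch of how IDs cement a trace's shape (IDs are made up for illustration).
public class TraceShape {
    record Span(String traceId, String spanId, String parentId, String name) {}

    public static void main(String[] args) {
        // No parent: this span is the root of the trace.
        Span root  = new Span("7af36e3c", "7af36e3c", null, "POST /things");
        // The parent ID tells you where this span sits in the call graph.
        Span child = new Span("7af36e3c", "9b125fd0", "7af36e3c", "store");

        System.out.println("same trace: " + root.traceId().equals(child.traceId()));
        System.out.println("child of root: " + child.parentId().equals(root.spanId()));
    }
}
```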
  19. Tracers send trace data out of process Tracers propagate IDs

    in-band, to tell the receiver there’s a trace in progress Completed spans are reported out-of-band, to reduce overhead and allow for batching
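Concretely, Zipkin tracers propagate the in-band IDs as B3 request headers. The header names below are Zipkin's; the ID values are made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// B3 headers carry the trace context in-band on each outgoing request.
public class B3Propagation {
    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-B3-TraceId", "463ac35c9f6413ad");       // shared by every span in the trace
        headers.put("X-B3-SpanId", "72485a3953bb6124");        // this operation's own ID
        headers.put("X-B3-ParentSpanId", "463ac35c9f6413ad");  // where we sit in the call graph
        headers.put("X-B3-Sampled", "1");                      // 0 tells downstream not to record
        headers.forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```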
  20. Tracers usually live in your application Tracers execute in your

    production apps! They are written to not log too much, and to not cause applications to crash. - propagate structural data in-band, and the rest out-of-band - have instrumentation or sampling policy to manage volume - often include opinionated instrumentation of layers such as HTTP
  21. Tracing Systems are Observability Tools Tracing systems collect, process and

    present data reported by tracers. - aggregate spans into trace trees - provide query and visualization for latency analysis - have retention policy (usually days)
  22. Zipkin has pluggable architecture Tracers report spans via HTTP or Kafka.

    Servers collect spans, storing them in MySQL, Cassandra, or Elasticsearch. Users query for traces via Zipkin’s Web UI or API.
    services:
      storage:
        image: openzipkin/zipkin-cassandra:1.1.5
        container_name: cassandra
        ports:
          - 9042:9042
      server:
        image: openzipkin/zipkin:1.1.5
        environment:
          - STORAGE_TYPE=cassandra
          - CASSANDRA_CONTACT_POINTS=cassandra
        ports:
          - 9411:9411
        depends_on:
          - storage
  23. Zipkin has starter architecture Tracing is new for a lot

    of folks. For many, the MySQL option is a good start, as it is familiar.
    services:
      storage:
        image: openzipkin/zipkin-mysql:1.1.5
        container_name: mysql
        ports:
          - 3306:3306
      server:
        image: openzipkin/zipkin:1.1.5
        environment:
          - STORAGE_TYPE=mysql
          - MYSQL_HOST=mysql
        ports:
          - 9411:9411
        depends_on:
          - storage
  24. Zipkin can be as simple as a single file

    $ curl -SL 'https://search.maven.org/remote_content?g=io.zipkin.java&a=zipkin-server&v=LATEST&c=exec' > zipkin.jar
    $ java -jar zipkin.jar
    [Spring Boot ASCII banner]
    :: Spring Boot :: (v1.3.5.RELEASE)
    2016-06-23 20:17:09.123 INFO 37481 --- [main] zipkin.server.ZipkinServer : Starting ZipkinServer on acole with PID 37481
    —snip—
    $ curl -s localhost:9411/api/v1/services | jq .
    [
      "zipkin-server"
    ]
  25. Zipkin lives in GitHub Zipkin was created by Twitter in

    2012. In 2015, OpenZipkin became the primary fork. OpenZipkin is an org on GitHub. It contains tracers, an OpenAPI spec, service components, and Docker images. https://github.com/openzipkin
  26. Two Spring Boot (Java) services collaborate over http. Zipkin will

    show how long the whole operation took, as well as how much time was spent in each service. https://github.com/adriancole/sleuth-webmvc-example Distributed Tracing across Spring Boot apps
  27. Web requests in the demo are served by Spring MVC

    controllers. Tracing of these is performed automatically by Spring Cloud Sleuth. Spring Cloud Sleuth reports to Zipkin via HTTP by depending on spring-cloud-sleuth-zipkin. https://cloud.spring.io/spring-cloud-sleuth/ Spring Cloud Sleuth Java
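For Sleuth of this vintage, wiring the demo to Zipkin is roughly one dependency plus one property. A sketch; check the Sleuth documentation for your version:

```xml
<!-- Maven dependency that makes Spring Cloud Sleuth report spans to Zipkin over HTTP -->
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-sleuth-zipkin</artifactId>
</dependency>
```

Then point it at the server in application.properties via `spring.zipkin.baseUrl` (it defaults to http://localhost:9411, which matches the Zipkin server started earlier).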
  28. Wrapping up Start by sending traces directly to a Zipkin

    server. Grow into fanciness as you need it: sampling, streaming, etc. Remember you are not alone! @adrianfcole #zipkin gitter.im/spring-cloud/spring-cloud-sleuth gitter.im/openzipkin/zipkin