
distributed tracing and observability for your integration platform - OUGN 2018

Jorge Quilcate

March 09, 2018

Transcript

  1. service collaboration
    ## orchestration
    * **explicit** data flow
    * coordination, coupling
    ## choreography
    * **implicit** data flow
    * flexible, moves faster
  2. observability is for **unknown unknowns**
    it is about making a system more:
    * debuggable: tracking down failures and bugs
    * understandable: answering questions, spotting trends
    a superset of:
    * monitoring: how to operate a system
    * instrumentation: how to develop a system to be monitorable
    Observability for Emerging Infra: What Got You Here Won't Get You There - Charity Majors https://www.youtube.com/watch?v=1wjovFSCGhE
  3. observability won’t replace your intuition
    Monitoring and Observability - Cindy Sridharan https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
  4. metrics
    * from host-oriented to application-oriented metrics: from the USE method to the RED method
    * USE: Utilization, Saturation, Errors
    * RED: Rate, Errors, Duration (a minimal sketch follows below)
    * Metrics are cheap. Get every meaningful number exposed.
    * But Metrics don’t tell **stories**.
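
To make the RED method concrete, here is a minimal sketch using the Prometheus Java simpleclient; the metric names, the `status` label, and the handler shape are illustrative assumptions, not from the deck:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;

public class RedMetrics {
    // Rate + Errors: count every request, labelled by outcome.
    static final Counter REQUESTS = Counter.build()
            .name("http_requests_total").help("Total HTTP requests.")
            .labelNames("status").register();

    // Duration: observe latency as a histogram so percentiles can be derived.
    static final Histogram DURATION = Histogram.build()
            .name("http_request_duration_seconds").help("Request latency in seconds.")
            .register();

    static void handle(Runnable handler) {
        Histogram.Timer timer = DURATION.startTimer();
        try {
            handler.run();
            REQUESTS.labels("200").inc();   // request rate
        } catch (RuntimeException e) {
            REQUESTS.labels("500").inc();   // error rate
            throw e;
        } finally {
            timer.observeDuration();        // duration distribution
        }
    }
}
```

One counter and one histogram cover all three RED signals: the counter yields request rate and error rate, the histogram yields the duration distribution.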
  5. logging & events
    * log actionable events.
    * bring as much **context** as possible (e.g. build id, user id, device, etc.); see the MDC sketch below.
    * think about Retention and Sampling.
    * logging is a streaming data problem.
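
One common way to carry that context in Java logging is SLF4J's MDC (mapped diagnostic context); the keys and handler below are illustrative assumptions:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderHandler {
    private static final Logger log = LoggerFactory.getLogger(OrderHandler.class);

    void handle(String userId, String buildId) {
        // Attach context once; the logging layout can include it on every line.
        MDC.put("userId", userId);
        MDC.put("buildId", buildId);
        try {
            log.info("order submitted");   // an actionable event, with context
        } finally {
            MDC.clear();                   // avoid leaking context across requests
        }
    }
}
```

With a pattern layout that prints MDC keys, every event logged inside the `try` block carries the same context without repeating it in each message.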
  6. distributed tracing
    * 1 story, N storytellers
    * **aggregated** traces, **across** boundaries
    * “distributed tracing _commoditizes knowledge_” - Adrian Cole
  7. distributed tracing approaches
    ## Black-Box
    implicit, agent-based, language/framework specific.
    e.g. most SaaS, Instana, Apache SkyWalking, etc.
    ## Annotation-Based
    explicitly adding global identifiers to tasks to track.
    e.g. OpenZipkin, OpenTracing, OpenCensus, Dapper
  8. `dapper`: impact on Google’s development (AdWords)
    **performance**: Devs were able to track progress against request latency targets and pinpoint easy optimization opportunities.
    **correctness**: it became possible to see where clients were accessing the master replica when they didn’t need to.
    **understanding**: it became possible to understand the latency of fan-out queries to back-ends.
    **testing**
  9. `opentracing`
    [architecture diagram: application logic, µ-service frameworks, control-flow packages, RPC frameworks, and existing instrumentation all talk to the OpenTracing API in main(); underneath, a TRACER implementation (e.g. Jaeger) provides the tracing infrastructure within the service process]
  10. opentracing semantics
    * `trace` = tree of `spans` (more generally, a DAG)
    * `span`: `service_name` + `operation` + `tags` + `logs` + `latency`
    * a `span` is identified by its `context`
    * `spanContext`: `traceId` + `spanId` + `baggage` (see the sketch below)
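
A minimal sketch of these semantics with the OpenTracing Java API (the operation name, tag, and log message are made up for the example):

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class Checkout {
    void placeOrder() {
        Tracer tracer = GlobalTracer.get();  // e.g. backed by a Jaeger tracer

        // A span: operation name + tags + logs + latency (start..finish).
        Span span = tracer.buildSpan("place-order").start();
        try {
            span.setTag("user.id", "42");    // tag: indexed key/value
            span.log("inventory reserved");  // log: timestamped event on the span
        } finally {
            span.finish();                   // records the span's latency
        }
        // span.context() carries traceId + spanId + baggage across boundaries.
    }
}
```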
  11. `canopy`: how facebook does it
    > `canopy` constructs traces by propagating identifiers through the system to correlate information across components.
    (see the sketch below for the propagation mechanics)
    **challenges** with this:
    * end-to-end data is heterogeneous [...] consuming instrumented data directly is cumbersome and infeasible at scale.
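
`canopy`'s internal APIs are not public in this form, but the identifier-propagation idea the quote describes looks roughly like this with OpenTracing's inject/extract (OpenTracing Java 0.32+; the carrier maps are assumptions for the sketch):

```java
import io.opentracing.SpanContext;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapAdapter;
import io.opentracing.util.GlobalTracer;

import java.util.HashMap;
import java.util.Map;

public class Propagation {
    // Client side: serialize traceId/spanId into outgoing request headers.
    static Map<String, String> inject(SpanContext context) {
        Tracer tracer = GlobalTracer.get();
        Map<String, String> headers = new HashMap<>();
        tracer.inject(context, Format.Builtin.HTTP_HEADERS, new TextMapAdapter(headers));
        return headers;
    }

    // Server side: rebuild the context so the next span joins the same trace.
    static SpanContext extract(Map<String, String> headers) {
        Tracer tracer = GlobalTracer.get();
        return tracer.extract(Format.Builtin.HTTP_HEADERS, new TextMapAdapter(headers));
    }
}
```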
  12. `canopy`: how facebook does it
    > unless we provide further abstractions (on top of traces and events) users will have to consume trace data directly, which entails complicated queries to extract simple high-level features.
    > **users should be able to view traces through a lens appropriate for their particular tasks**.
  13. > “the best way to find patterns in a system is looking at it from above”
    Es Devlin, designer, in Abstract: The Art of Design
  14. simulation with `simianviz`
    * how would my architecture look if it grows?
    * monitoring tools often “explode on impact” with real-world use cases at scale
    * `spigo` (aka `simianviz`) is a tool that produces output in any format to feed your monitoring tools, from a laptop - Adrian Cockcroft
  15. ## references
    ### papers
    * **dapper** https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf
    * **canopy** http://cs.brown.edu/~jcmace/papers/kaldor2017canopy.pdf
    * **automating failure testing research at internet scale** https://people.ucsc.edu/~palvaro/socc16.pdf
    * data on the outside vs data on the inside http://cidrdb.org/cidr2005/papers/P12.pdf
    * pivot tracing http://sigops.org/sosp/sosp15/current/2015-Monterey/printable/122-mace.pdf
    ### articles
    * ok log https://peter.bourgon.org/ok-log/
    * logs - 12 factor application https://12factor.net/logs
    * the problem with logging https://blog.codinghorror.com/the-problem-with-logging/
    * logging v. instrumentation https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html
    * logs and metrics https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38
    * measure anything, measure everything https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
    * metrics, tracing and logging https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
    * monitoring and observability https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
    * monitoring in the time of cloud native https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
    * sre book https://landing.google.com/sre/book/index.html
    * distributed tracing at uber https://eng.uber.com/distributed-tracing/
    * spigo and simianviz https://github.com/adrianco/spigo
    * observability: what’s in a name? https://honeycomb.io/blog/2017/08/observability-whats-in-a-name/
    * wtf is operations? #serverless https://charity.wtf/2016/05/31/wtf-is-operations-serverless/
    * event foo: what should i add to an event https://honeycomb.io/blog/2017/08/event-foo-what-should-i-add-to-an-event/
  16. ## references
    ### articles (continued)
    * Google’s approach to Observability https://medium.com/@rakyll/googles-approach-to-observability-frameworks-c89fc1f0e058
    * Microservices and Observability https://medium.com/@rakyll/microservices-observability-26a8b7056bb4
    * Best Practices for Observability https://honeycomb.io/blog/2017/11/best-practices-for-observability/
    * https://thenewstack.io/dev-ops-doesnt-matter-need-observability/
    ### talks
    * “Observability for Emerging Infra: What Got You Here Won't Get You There” by Charity Majors https://www.youtube.com/watch?v=1wjovFSCGhE
    * “The Verification of a Distributed System” by Caitie McCaffrey https://www.youtube.com/watch?v=kDh5BrqiGhI
    * “Mastering Chaos - A Netflix Guide to Microservices” by Josh Evans https://www.youtube.com/watch?v=CZ3wIuvmHeM
    * “Monitoring Microservices” by Tom Wilkie https://www.youtube.com/watch?v=emaPPg_zxb4
    * “Microservice application tracing standards and simulations” by Adrian Cole and Adrian Cockcroft https://www.slideshare.net/adriancockcroft/microservices-application-tracing-standards-and-simulators-adrians-at-oscon
    * “Intuition Engineering at Netflix” by Justin Reynolds https://vimeo.com/173607639
    * “Distributed Tracing: Understanding how all your components work together” by José Carlos Chávez https://speakerdeck.com/jcchavezs/distributed-tracing-understanding-how-your-all-your-components-work-together
    * “Monitoring isn't just an accident” https://docs.google.com/presentation/d/1IEJIaQoCjzBsVq0h2Y7qcsWRWPS5lYt9CS2Jl25eurc/edit#slide=id.g327c9fd948_0_534
    * “Orchestrating Chaos: Applying Database Research in the Wild” by Peter Alvaro https://www.youtube.com/watch?v=YplkQu6a80Q
  17. ## references
    ### articles & talks (continued)
    * “The Verification of A Distributed System” by Caitie McCaffrey https://github.com/CaitieM20/Talks/tree/master/TheVerificationOfADistributedSystem
    * “Testing in Production” by Charity Majors https://opensource.com/article/17/8/testing-production
    * “Data on the outside vs Data on the inside - Review” by Adrian Colyer https://blog.acolyer.org/2016/09/13/data-on-the-outside-versus-data-on-the-inside/