distributed tracing and observability for your integration platform - OUGN 2018

Slide 1

Slide 1 text

distributed tracing and observability for your integration platform
@jeqo89

Slide 2

Slide 2 text

monitoring & observability

Slide 3

Slide 3 text

drivers

Slide 4

Slide 4 text

distributed systems & complexity

Slide 5

Slide 5 text

service collaboration

## orchestration

* **explicit** data flow
* coordination, coupling

## choreography

* **implicit** data flow
* flexible, moves faster
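The contrast between the two collaboration styles can be sketched in a few lines. This is a minimal illustration, not taken from the talk: the service functions, the `EventBus` class, and the topic names are all hypothetical.

```python
from collections import defaultdict


# Orchestration: one coordinator calls each step explicitly, so the
# data flow is readable in a single place (and coupled to it).
def reserve_stock(order):
    return {**order, "stock": "reserved"}


def charge_payment(order):
    return {**order, "payment": "charged"}


def orchestrate(order):
    order = reserve_stock(order)
    order = charge_payment(order)
    return order


# Choreography: services only react to events. No single component
# describes the whole flow, which is what makes it implicit.
class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every published topic, in order

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, payload):
        self.log.append(topic)
        for handler in self.handlers[topic]:
            handler(payload)


def choreograph(order):
    bus = EventBus()
    # Hypothetical topic names; each subscriber knows only its own step.
    bus.subscribe(
        "order.created",
        lambda o: bus.publish("stock.reserved", {**o, "stock": "reserved"}),
    )
    bus.subscribe(
        "stock.reserved",
        lambda o: bus.publish("payment.charged", {**o, "payment": "charged"}),
    )
    bus.publish("order.created", order)
    return bus.log
```

In the choreographed version the only way to see the full flow is to observe the events at runtime, which is where tracing becomes useful.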

Slide 6

Slide 6 text

ever-changing deployment platforms

Slide 7

Slide 7 text

observability

Slide 9

Slide 9 text

observability

* is for **unknown unknowns**
* is about making a system more:
  * debuggable: tracking down failures and bugs
  * understandable: answering questions, spotting trends
* is a superset of:
  * monitoring: how to operate a system
  * instrumentation: how to develop a system to be monitorable

Observability for Emerging Infra: What Got You Here Won't Get You There - Charity Majors https://www.youtube.com/watch?v=1wjovFSCGhE

Slide 10

Slide 10 text

observability methods

Slide 11

Slide 11 text

observability methods

Slide 12

Slide 12 text

observability won’t replace your intuition

Monitoring and Observability - Cindy Sridharan https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c

Slide 13

Slide 13 text

metrics

* From host-oriented to application-oriented metrics: from the USE method to the RED method
  * USE: Utilization, Saturation, Errors
  * RED: Requests (rate), Errors, Duration
* Metrics are cheap. Get every meaningful number exposed.
* But metrics don’t tell **stories**.
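As a rough illustration of the RED method, the sketch below summarizes one observation window of requests. The `Request` record, the choice of 5xx as "error", and the nearest-rank 95th percentile are assumptions made for the example, not something the talk prescribes.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP-style status code (assumption for this sketch)


def red_summary(requests, window_seconds):
    """Summarize one observation window using the RED method."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": None}

    # Errors: here, any 5xx response counts as a failed request.
    errors = sum(1 for r in requests if r.status >= 500)

    # Duration: nearest-rank 95th percentile of the observed latencies.
    durations = sorted(r.duration_ms for r in requests)
    rank = math.ceil(0.95 * total)
    p95 = durations[rank - 1]

    return {
        # Requests: how many requests per second the service handled.
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": p95,
    }
```

The summary gives three numbers per service, but no hint of which downstream call made a request slow, which is the gap the later slides on tracing address.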

Slide 14

Slide 14 text

logging & events

* log actionable events.
* bring as **much context** as possible (e.g. build id, user id, device, etc.)
* think about retention and sampling.
* logging is a streaming data problem.
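One possible way to attach context to every event is to emit structured (JSON) log lines. The sketch below uses only the Python standard library; the logger name `orders`, the `context` attribute, and the field values are hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, merging in its context."""

    def format(self, record):
        event = {"level": record.levelname, "message": record.getMessage()}
        # "context" is a custom attribute set through the `extra` argument.
        event.update(getattr(record, "context", {}))
        return json.dumps(event, sort_keys=True)


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_event(logger, message, **context):
    # Context such as build id, user id, or device travels with the event.
    logger.info(message, extra={"context": context})


def demo():
    stream = io.StringIO()
    logger = build_logger(stream)
    log_event(logger, "payment failed", build_id="b-42", user_id="u-7")
    return json.loads(stream.getvalue())
```

Because each line is a self-describing event, it can be shipped to a stream processor and filtered or sampled by any of its fields.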

Slide 15

Slide 15 text

distributed tracing

Slide 16

Slide 16 text

tracing _monoliths_

Slide 17

Slide 17 text

tracing `distributed` systems

Slide 18

Slide 18 text

distributed tracing

* 1 story, N storytellers
* **aggregated** traces, **across** boundaries
* “distributed tracing _commoditizes knowledge_” - Adrian Cole

Slide 19

Slide 19 text

distributed tracing approaches

## Black-Box

Implicit, agent-based, language/framework specific.

e.g. most SaaS, Instana, Apache SkyWalking, etc.

## Annotation-Based

Explicitly adding global identifiers to the tasks to track.

e.g. OpenZipkin, OpenTracing, OpenCensus, Dapper
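The annotation-based idea of carrying a global identifier along with each task can be sketched as follows. This is a simplified illustration, not the wire format of any of the tools listed: the header names and the dictionary-based span are hypothetical.

```python
import uuid

# Hypothetical header names; real tracers define their own formats.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"


def start_span(operation, incoming_headers=None):
    """Start a span, joining the caller's trace if identifiers came in."""
    headers = incoming_headers or {}
    return {
        # Reuse the global trace id when present, otherwise start a trace.
        "trace_id": headers.get(TRACE_HEADER) or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": headers.get(PARENT_HEADER),
        "operation": operation,
    }


def inject(span, outgoing_headers):
    """Copy the identifiers into the headers of an outgoing call."""
    outgoing_headers[TRACE_HEADER] = span["trace_id"]
    outgoing_headers[PARENT_HEADER] = span["span_id"]
    return outgoing_headers
```

A caller would run `inject(span, headers)` before each outgoing request, and the receiving service would pass those headers to `start_span`, so both spans end up sharing one trace id.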

Slide 20

Slide 20 text

`dapper`: impact on Google’s development (AdWords)

* **performance**: devs were able to track progress against request latency targets and pinpoint easy optimization opportunities.
* **correctness**: it became possible to see where clients were accessing the master replica when they didn’t need to.
* **understanding**: it became possible to understand how long fan-out queries to back-ends take.
* **testing**

Slide 21

Slide 21 text

distributed tracing **commons**

Slide 22

Slide 22 text

`opentracing`

[diagram labels: OpenTracing API; application logic; µ-service frameworks; control-flow packages; RPC frameworks; existing instrumentation; tracing infrastructure; main(); tracer (Jaeger); service process]

Slide 23

Slide 23 text

opentracing semantics

* `trace` = tree of `spans` (i.e. DAG)
* `span`: `service_name` + `operation` + `tags` + `logs` + `latency`
* `span` identified by `context`
* `spanContext`: `traceId` + `spanId` + `baggage`
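The semantics above can be modelled with a couple of data classes. This is a sketch of the data model only, not the OpenTracing API itself; the field names follow the slide, while `parent_span_id`, the millisecond timestamps, and `children_by_parent` are assumptions added to make the tree explicit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    # Baggage: key/value pairs that travel with the trace across services.
    baggage: Tuple[Tuple[str, str], ...] = ()


@dataclass
class Span:
    service_name: str
    operation: str
    context: SpanContext
    start_ms: float
    finish_ms: float
    parent_span_id: Optional[str] = None  # assumption: single parent link
    tags: Dict[str, str] = field(default_factory=dict)
    logs: List[Tuple[float, str]] = field(default_factory=list)

    @property
    def latency_ms(self):
        return self.finish_ms - self.start_ms


def children_by_parent(spans):
    """Group span ids by parent id; the key None holds the root spans."""
    tree = {}
    for span in spans:
        tree.setdefault(span.parent_span_id, []).append(span.context.span_id)
    return tree
```

Given all spans that share a `trace_id`, `children_by_parent` is enough to rebuild the tree that a tracing UI draws as a waterfall.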

Slide 24

Slide 24 text

demo: understanding how integrations collaborate with distributed tracing

Slide 28

Slide 28 text

what’s next?

Slide 29

Slide 29 text

`canopy`: how facebook does it

> `canopy` constructs traces by propagating identifiers through the system to correlate information across components.

**challenges** about this:

* end-to-end data is heterogeneous [...] consuming instrumented data directly is cumbersome and infeasible at scale.

Slide 30

Slide 30 text

`canopy`: how facebook does it

> unless we provide further abstractions (on top of traces and events) users will have to consume trace data directly, which entails complicated queries to extract simple high-level features.

> **users should be able to view traces through a lens appropriate for their particular tasks**.
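To make the idea of a "lens" concrete, here is a small sketch that derives one high-level feature from raw spans: the summed span duration per service. It is only an illustration of extracting features from trace data, not how Canopy is implemented; the span dictionaries and their keys are hypothetical.

```python
from collections import defaultdict


def latency_by_service(spans):
    """Sum span durations per service (nested spans are counted in full)."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["service"]] += span["finish_ms"] - span["start_ms"]
    return dict(totals)


def slowest_service(spans):
    """Return the service with the largest summed duration, if any."""
    totals = latency_by_service(spans)
    return max(totals, key=totals.get) if totals else None


# Hypothetical trace: one gateway request fanning out to two services.
EXAMPLE_TRACE = [
    {"service": "gateway", "start_ms": 0, "finish_ms": 120},
    {"service": "orders", "start_ms": 10, "finish_ms": 50},
    {"service": "orders", "start_ms": 60, "finish_ms": 100},
    {"service": "db", "start_ms": 15, "finish_ms": 45},
]
```

A user interested only in "which service dominates this request" can call `slowest_service(EXAMPLE_TRACE)` instead of writing a query over individual spans.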

Slide 31

Slide 31 text

`canopy`: building models from traces

Slide 32

Slide 32 text

netflix “pain suit”

Intuition Engineering at Netflix - Justin Reynolds https://vimeo.com/173607639

Slide 33

Slide 33 text

vizceral - intuition engineering

Intuition Engineering at Netflix - Justin Reynolds https://vimeo.com/173607639

Slide 34

Slide 34 text

> “the best way to find patterns in a system is looking at it from above”

es devlin, designer - Abstract: The Art of Design

Slide 35

Slide 35 text

simulation with `simianviz`

* what would my architecture look like as it grows?
* monitoring tools often “explode on impact” with real-world use cases at scale
* `spigo` (aka simianviz) is a tool that can produce output in any format to feed your monitoring tools from a laptop - Adrian Cockcroft

Slide 36

Slide 36 text

demo: simulate a migration with simianviz and vizceral

Slide 39

Slide 39 text

https://twitter.com/mipsytipsy/status/932551447555858433

https://twitter.com/jessitron/status/579109266042150912

Slide 40

Slide 40 text

## references

### papers

* **dapper** https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf
* **canopy** http://cs.brown.edu/~jcmace/papers/kaldor2017canopy.pdf
* **automating failure testing research at internet scale** https://people.ucsc.edu/~palvaro/socc16.pdf
* data on the outside vs data on the inside http://cidrdb.org/cidr2005/papers/P12.pdf
* pivot tracing http://sigops.org/sosp/sosp15/current/2015-Monterey/printable/122-mace.pdf

### articles

* ok log https://peter.bourgon.org/ok-log/
* logs - 12 factor application https://12factor.net/logs
* the problem with logging https://blog.codinghorror.com/the-problem-with-logging/
* logging v. instrumentation https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html
* logs and metrics https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38
* measure anything, measure everything https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
* metrics, tracing and logging https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
* monitoring and observability https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
* monitoring in the time of cloud native https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
* sre book https://landing.google.com/sre/book/index.html
* distributed tracing at uber https://eng.uber.com/distributed-tracing/
* spigo and simianviz https://github.com/adrianco/spigo
* observability: what’s in a name? https://honeycomb.io/blog/2017/08/observability-whats-in-a-name/
* wtf is operations? #serverless https://charity.wtf/2016/05/31/wtf-is-operations-serverless/
* event foo: what should i add to an event https://honeycomb.io/blog/2017/08/event-foo-what-should-i-add-to-an-event/

Slide 41

Slide 41 text

## references

### articles (continued)

* Google’s approach to Observability https://medium.com/@rakyll/googles-approach-to-observability-frameworks-c89fc1f0e058
* Microservices and Observability https://medium.com/@rakyll/microservices-observability-26a8b7056bb4
* Best Practices for Observability https://honeycomb.io/blog/2017/11/best-practices-for-observability/
* https://thenewstack.io/dev-ops-doesnt-matter-need-observability/

### talks

* “Observability for Emerging Infra: What Got You Here Won't Get You There” by Charity Majors https://www.youtube.com/watch?v=1wjovFSCGhE
* “The Verification of a Distributed System” by Caitie McCaffrey https://www.youtube.com/watch?v=kDh5BrqiGhI
* “Mastering Chaos - A Netflix Guide to Microservices” by Josh Evans https://www.youtube.com/watch?v=CZ3wIuvmHeM
* “Monitoring Microservices” by Tom Wilkie https://www.youtube.com/watch?v=emaPPg_zxb4
* “Microservice application tracing standards and simulations” by Adrian Cole and Adrian Cockcroft https://www.slideshare.net/adriancockcroft/microservices-application-tracing-standards-and-simulators-adrians-at-oscon
* “Intuition Engineering at Netflix” by Justin Reynolds https://vimeo.com/173607639
* “Distributed Tracing: Understanding how your all your components work together” by José Carlos Chávez https://speakerdeck.com/jcchavezs/distributed-tracing-understanding-how-your-all-your-components-work-together
* “Monitoring isn't just an accident” https://docs.google.com/presentation/d/1IEJIaQoCjzBsVq0h2Y7qcsWRWPS5lYt9CS2Jl25eurc/edit#slide=id.g327c9fd948_0_534
* “Orchestrating Chaos: Applying Database Research in the Wild” by Peter Alvaro https://www.youtube.com/watch?v=YplkQu6a80Q

Slide 42

Slide 42 text

## references

### articles (continued)

* “The Verification of A Distributed System” by Caitie McCaffrey https://github.com/CaitieM20/Talks/tree/master/TheVerificationOfADistributedSystem
* “Testing in Production” by Charity Majors https://opensource.com/article/17/8/testing-production
* “Data on the outside vs Data on the inside - Review” by Adrian Colyer https://blog.acolyer.org/2016/09/13/data-on-the-outside-versus-data-on-the-inside/

Slide 43

Slide 43 text

**thank you!**

* github.com/jeqo/talk-observing-distributed-systems
* github.com/sysco-middleware/talk-observability-tracing-kafka
* github.com/sysco-middleware/apm-oracle-fmw
* jeqo.github.io
* twitter.com/jeqo89