Slide 1

Slide 1 text

Observing Distributed Systems
Jorge Quilcate
@jeqo89
github.com/jeqo/talk-observing-distributed-systems

Slide 2

Slide 2 text

Jorge Quilcate
Peruvian in Norway
Software Engineer at Sysco AS, part of the Middleware team
Starting my journey in Distributed Systems
Open-source contributor, Apache Kafka project
Oracle ACE Associate
jeqo.github.io | github.com/jeqo | @jeqo89

Slide 3

Slide 3 text

Objective
Explore tools to increase the level of observability in our applications

Slide 4

Slide 4 text

Observability

Slide 5

Slide 5 text

Observability
Metrics, Tracing and Logging - Peter Bourgon
https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

Slide 6

Slide 6 text

Observability
Metrics, Tracing and Logging - Peter Bourgon
https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

Slide 7

Slide 7 text

Observability
Metrics, Tracing and Logging - Peter Bourgon
https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

Slide 8

Slide 8 text

Logging
➔ Discrete events: `+load => +logs`
➔ Log actionable events: peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html
➔ Easy to add, hard to manage: blog.codinghorror.com/the-problem-with-logging/
➔ Do not try to manage logs as part of your application: 12factor.net/logs
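The last two points can be illustrated with a minimal sketch: the application only writes one event per line to stdout and leaves collection, routing, and storage to the environment. The logger name `tweets-app`, the JSON field names, and the helper functions are assumptions made for illustration; they do not come from the talk's demo code.

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(stream=sys.stdout):
    """Return a logger that only writes to the given stream (stdout by default)."""
    logger = logging.getLogger("tweets-app")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


def handle_requests(logger, count):
    """Emit one discrete event per request: more load produces more log lines."""
    for i in range(count):
        logger.info("request %d handled", i)


if __name__ == "__main__":
    handle_requests(build_logger(), 3)
```

Because the process never opens log files or talks to a log server, a forwarder can be attached or replaced without touching the application.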

Slide 9

Slide 9 text

https://twitter.com/copyconstruct/status/938444628923097089

Slide 10

Slide 10 text

OK Log
                                   --> ingestor . . . store
                                  |
service --(stdout)--> forwarder --|--> ingestor . . . store
                                  |
                                   --> ingestor . . . store
OK Log: Distributed and Coördination-Free Logging - Peter Bourgon
https://www.youtube.com/watch?v=gWWK2eyZ-sc

Slide 11

Slide 11 text

FluentD/Fluent-bit
                                    --> fluentd . . . store
          (files)                  |
service --(stdout)--> fluent-bit --|--> fluentd . . . store
          (docker)                 |
                                    --> fluentd . . . store

Slide 12

Slide 12 text

Demo: Centralizing logs in Docker with Fluent-bit

Slide 13

Slide 13 text

Metrics
➔ Aggregated values: `+load => =metrics`
➔ RED method: twitter.com/LindsayofSF/status/692191001692237825
  ◆ Request rate
  ◆ Error rate
  ◆ Duration
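A minimal sketch of the idea behind `+load => =metrics`: however many requests arrive, the state kept per endpoint is a fixed set of aggregated numbers from which the three RED values are derived. `RedMetrics` and its field names are hypothetical helpers for illustration, not part of Prometheus or of the talk's demo.

```python
from dataclasses import dataclass


@dataclass
class RedMetrics:
    """Aggregated counters for one endpoint (hypothetical helper)."""

    requests: int = 0
    errors: int = 0
    duration_seconds_sum: float = 0.0

    def observe(self, duration_seconds, failed=False):
        """Fold one request into the aggregates; no per-request data is kept."""
        self.requests += 1
        if failed:
            self.errors += 1
        self.duration_seconds_sum += duration_seconds

    def snapshot(self, window_seconds):
        """Summarize the window as Request rate, Error rate, and Duration."""
        average = (
            self.duration_seconds_sum / self.requests if self.requests else 0.0
        )
        return {
            "request_rate": self.requests / window_seconds,
            "error_rate": self.errors / window_seconds,
            "avg_duration_seconds": average,
        }
```

In a real setup a client library would typically expose such counters for scraping instead of computing rates inside the application; the sketch only shows why the storage cost of metrics does not grow with traffic.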

Slide 14

Slide 14 text

Describing Load
Twitter use case - Nov, 2016
➔ Post tweets:
  ◆ 4.6k requests/second on average
  ◆ 12k requests/second at peak
➔ Home timeline:
  ◆ 300k requests/second
Handling a load of 12,000 writes/second would be simple. However, the problem was not the volume of tweets but the fan-out.
`#reads/sec = 25 * #writes/sec`
Designing Data-Intensive Applications (Chapter 1) - Martin Kleppmann
https://dataintensive.net
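The factor of 25 in the formula follows from the figures on this slide, assuming it compares home timeline reads against the peak write rate. A short worked calculation (the constant and function names are illustrative):

```python
# Figures taken from the slide above.
POST_TWEETS_AVERAGE_PER_SECOND = 4_600
POST_TWEETS_PEAK_PER_SECOND = 12_000
HOME_TIMELINE_READS_PER_SECOND = 300_000


def read_write_ratio(reads_per_second, writes_per_second):
    """How many timeline reads happen for every tweet written."""
    return reads_per_second / writes_per_second


peak_ratio = read_write_ratio(
    HOME_TIMELINE_READS_PER_SECOND, POST_TWEETS_PEAK_PER_SECOND
)
average_ratio = read_write_ratio(
    HOME_TIMELINE_READS_PER_SECOND, POST_TWEETS_AVERAGE_PER_SECOND
)

print(peak_ratio)  # 25.0
print(round(average_ratio))  # 65
```

Against the average write rate the ratio is even higher, which is why the read path rather than the write volume dominates the description of load.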

Slide 15

Slide 15 text

Prometheus
Prometheus Architecture
https://prometheus.io/docs/introduction/overview

Slide 16

Slide 16 text

Demo: Instrumenting metrics in JAX-RS with Prometheus

Slide 17

Slide 17 text

Distributed Tracing with OpenTracing

Slide 18

Slide 18 text

Monolithic vs Distributed Tracing

Slide 19

Slide 19 text

Origins

Slide 20

Slide 20 text

OpenTracing
➔ Based on the “Dapper” paper
  ◆ Used in most systems at Google
  ◆ Follows an annotation-based approach, as opposed to a black-box approach
➔ “Just an API”
➔ `Trace = DAG[Span]`
DAG: Directed Acyclic Graph (in practice, usually a tree)
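The `Trace = DAG[Span]` notation can be sketched as a data structure. This is a simplification under stated assumptions: each span has at most one parent, so the graph degenerates into a tree, whereas OpenTracing itself describes relationships through references such as ChildOf and FollowsFrom. The `Span` class, the `render` function, and the operation names are hypothetical and are not the OpenTracing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed operation inside a trace (illustrative, not the OpenTracing API)."""

    span_id: str
    operation: str
    parent_id: Optional[str] = None  # None marks the root span


def children_by_parent(spans):
    """Index spans by the id of their parent, preserving input order."""
    index = {}
    for span in spans:
        index.setdefault(span.parent_id, []).append(span)
    return index


def render(spans):
    """Return the trace as indented lines, each parent before its children."""
    index = children_by_parent(spans)
    lines = []

    def walk(parent_id, depth):
        for span in index.get(parent_id, []):
            lines.append("  " * depth + span.operation)
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


trace = [
    Span("1", "POST /tweets"),
    Span("2", "save-tweet", parent_id="1"),
    Span("3", "publish-event", parent_id="1"),
    Span("4", "db-insert", parent_id="2"),
]
print("\n".join(render(trace)))
```

Each span only records who caused it; the shape of the whole request is recovered afterwards by following those references, which is what a tracing backend does when it assembles spans reported by different services.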

Slide 21

Slide 21 text

OpenTracing
[Diagram labels: OpenTracing API; application logic; µ-service frameworks; control-flow packages; RPC frameworks; existing instrumentation; tracing infrastructure; main(); Tracer (Jaeger); service process]
OpenTracing Isn't just Tracing: Measure Twice, Instrument Once - Ben Sigelman
https://www.youtube.com/watch?v=NyySNe6Rr_g

Slide 22

Slide 22 text

Distributed Trace

Slide 23

Slide 23 text

JaegerTracing
JaegerTracing - Architecture
http://jaeger.readthedocs.io/en/latest/architecture/

Slide 24

Slide 24 text

Demo: Intro to the OpenTracing API

Slide 25

Slide 25 text

What happens when we take something smelly and increase its surface area?
MONOLITH
Engineering you - Martin Thompson
https://www.youtube.com/watch?v=S4LzzuMTqjs&t=1177s

Slide 26

Slide 26 text

Demo: Tweets app “Instrument once, Measure twice”

Slide 27

Slide 27 text

Tweets App - v1: Monolith approach

Slide 28

Slide 28 text

Tweets App - v2: Data pipeline approach

Slide 29

Slide 29 text

Challenges and Opportunities with OpenTracing
➔ Adoption, compatibility, and level of conformance with the API (Gitter)
➔ Access, scope, and level of granularity for different scenarios (Canopy)

Slide 30

Slide 30 text

https://twitter.com/mipsytipsy/status/932551447555858433

Slide 31

Slide 31 text

What’s next?

Slide 32

Slide 32 text

Lineage-Driven Fault Injection
Orchestrating Chaos: Applying Database Research in the Wild - Peter Alvaro
https://www.youtube.com/watch?v=YplkQu6a80Q

Slide 33

Slide 33 text

Lineage-Driven Fault Injection
Orchestrating Chaos: Applying Database Research in the Wild - Peter Alvaro
https://www.youtube.com/watch?v=YplkQu6a80Q

Slide 34

Slide 34 text

Lineage-Driven Fault Injection
Orchestrating Chaos: Applying Database Research in the Wild - Peter Alvaro
https://www.youtube.com/watch?v=YplkQu6a80Q

Slide 35

Slide 35 text

Lineage-Driven Fault Injection
Orchestrating Chaos: Applying Database Research in the Wild - Peter Alvaro
https://www.youtube.com/watch?v=YplkQu6a80Q

Slide 36

Slide 36 text

Intuition Engineering
Intuition Engineering at Netflix - Justin Reynolds
https://vimeo.com/173607639

Slide 37

Slide 37 text

References
➔ Benjamin Sigelman et al. - “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure” https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf
➔ Raja R. Sambasivan et al. - “So, you want to trace your distributed system? Key design insights from years of practical experience” http://www.pdl.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-14-102.pdf
➔ Monitoring in the time of cloud native https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
➔ OK Log https://peter.bourgon.org/ok-log/
➔ Metrics, Tracing and Logging https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
➔ Distributed Tracing at Uber https://eng.uber.com/distributed-tracing/
➔ Monitoring and Observability https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
➔ Measure Anything, Measure Everything https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
➔ The death of ops is greatly exaggerated https://medium.com/@copyconstruct/the-death-of-ops-is-greatly-exaggerated-ff3bd4a67f24
➔ Logs and Metrics https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38
➔ Logs - 12 Factor Application https://12factor.net/logs
➔ Take OpenTracing for a HotRod Ride https://medium.com/opentracing/take-opentracing-for-a-hotrod-ride-f6e3141f7941
➔ The Problem with Logging https://blog.codinghorror.com/the-problem-with-logging/
➔ Logging v. Instrumentation https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html
➔ SRE Book https://landing.google.com/sre/book/index.html
➔ Canopy: An End-to-End Performance Tracing And Analysis System http://cs.brown.edu/~jcmace/papers/kaldor2017canopy.pdf
➔ Peter Alvaro et al. - Lineage-Driven Fault Injection https://people.eecs.berkeley.edu/~palvaro/molly.pdf
➔ Vizceral Open Source - Netflix Techblog https://medium.com/netflix-techblog/vizceral-open-source-acc0c32113fe

Slide 38

Slide 38 text

https://twitter.com/jessitron/status/579109266042150912

Slide 39

Slide 39 text

Observing Distributed Systems
Jorge Quilcate
@jeqo89
github.com/jeqo/talk-observing-distributed-systems