#BuildStuffUA
A collection of independent components that appears to its users as a single coherent system.
Characteristics:
- Concurrency
- No global clock
- Independent failures
Image source: https://link.medium.com/jey42ga7p1
Distributed Systems
[Diagram: a GET /media/e5k2 request flowing through the API Proxy, the Media API and the Videos service, backed by DB1, DB2, DB3 and DB4. Labels: Error 1152 ER_ABORTING_CONNECTION, 500 Internal Error, 500 Internal Error.]
@jcchavezs - #BuildStuffUA
Distributed Systems
[Diagram: plumbing analogy. Labels: cold water storage tank, shutoff valve, tank valve, first floor branch. Callout: "I AM HERE! First floor distributor is clogged!"]
Distributed Systems
to operate a Distributed System?
• Systems change all the time
• Things fail in unexpected ways
• Unknown unknowns
• Most problems are the convergence of many different things failing at once
• Everyone on the team is expected to respond with the same confidence and tools, regardless of experience or expertise, yet the more components there are, the less any individual knows about them
Distributed Tracing
[Diagram: trace timeline (Time axis) for TraceID d52d38b69b0fb15efa across the Auth, Videos and Images services. Span log: "[1508410442] no cache for resource, retrieving from DB". One span is marked "error" with "Aborted connection" and the callout "I AM HERE!"]
Distributed Tracing
The Answers
a request/message pass through?
• What occurred in each service for a given request/message?
• Where did the error happen?
• Where are the bottlenecks?
• What is the critical path for a request?
• Who should I page?
Distributed Tracing
Logs tell you that an event happened.
Metrics tell you how many events of this type are happening in the system.
Tracing tells you what happened (who did what) and how its impact propagated across your system.
Image source: https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
Distributed Tracing
Trace’s Anatomy
[Diagram: a trace containing the spans /videos and mysql.Get]
• A trace shows an execution path through a distributed system
• A span in the trace represents a logical unit of work (with a start and an end)
• A context includes information that should be propagated across services
• Tags and logs (optional) add complementary information to spans
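The anatomy above can be sketched as a small data model. This is a minimal illustration, not Zipkin's actual model: the class names, field names and the IDs `t1`, `a` and `b` are hypothetical, and treating `mysql.Get` as a child of `/videos` is an assumption based on the diagram.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SpanContext:
    """Identifiers that are propagated across services (hypothetical shape)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None


@dataclass
class Span:
    """A logical unit of work with a start and an end, in microseconds."""
    name: str
    context: SpanContext
    start_us: int
    end_us: int
    # Optional complementary information.
    tags: Dict[str, str] = field(default_factory=dict)
    logs: List[Tuple[int, str]] = field(default_factory=list)

    @property
    def duration_us(self) -> int:
        return self.end_us - self.start_us


def children_of(trace: List[Span], parent: Span) -> List[Span]:
    """Return the spans of the trace whose parent is the given span."""
    return [s for s in trace if s.context.parent_id == parent.context.span_id]


# A two-span trace: /videos is the root and mysql.Get is assumed to be its child.
root = Span("/videos", SpanContext("t1", "a"), start_us=0, end_us=900)
child = Span(
    "mysql.Get",
    SpanContext("t1", "b", parent_id="a"),
    start_us=100,
    end_us=700,
    tags={"error": "true"},
)
trace = [root, child]
```

Both spans share one trace ID, and the parent ID is what lets a UI rebuild the tree from a flat list of spans.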
Distributed Tracing
Leg 1: Inbound propagation
a request or consume a message it will extract (if possible) the context from upstream to continue the trace; otherwise it will start a new trace.
[Diagram: API Proxy calls Media API with GET /media/{id}, carrying TraceID: fAf3oXL6DS, SpanID: dZ0xHIBa1A, ...]
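Inbound propagation could look roughly like the sketch below. It assumes the context travels in HTTP headers named after Zipkin's B3 convention (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`), which the slide does not spell out. `TraceContext`, `new_id` and `extract_or_start` are hypothetical names.

```python
import secrets
from typing import Mapping, NamedTuple, Optional


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_id: Optional[str]


def new_id() -> str:
    """Random 64-bit identifier rendered as 16 lower-hex characters."""
    return secrets.token_hex(8)


def extract_or_start(headers: Mapping[str, str]) -> TraceContext:
    """Continue the upstream trace if its context is present, else start a new one."""
    # HTTP header names are case-insensitive, so normalise before looking them up.
    lowered = {k.lower(): v for k, v in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    span_id = lowered.get("x-b3-spanid")
    if trace_id and span_id:
        return TraceContext(trace_id, span_id, lowered.get("x-b3-parentspanid"))
    # No usable upstream context: this service becomes the root of a new trace.
    return TraceContext(new_id(), new_id(), None)


# IDs taken from the slide; they are illustrative rather than valid hex IDs.
continued = extract_or_start({"X-B3-TraceId": "fAf3oXL6DS", "X-B3-SpanId": "dZ0xHIBa1A"})
fresh = extract_or_start({"Accept": "application/json"})
```

The "if possible" in the slide maps to the fallback branch: a missing or partial context should never fail the request, only start a new trace.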
Distributed Tracing
Leg 2: Outbound propagation
an outbound call to another service it will inject the context in the request (headers) or message (metadata).
[Diagram: Media API (TraceID: fAf3oXL6DS, SpanID: y74fr5udj) calls the Video service with GET /videos through an http/get span, carrying TraceID: fAf3oXL6DS, ParentID: y74fr5udj, SpanID: dZ0xHIBa1A.]
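A minimal sketch of outbound propagation follows, again assuming B3-style header names that the slide does not name. `TraceContext`, `child_of` and `inject` are hypothetical helpers, and the IDs are the illustrative ones from the diagram.

```python
import secrets
from typing import Dict, NamedTuple, Optional


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_id: Optional[str]


def child_of(parent: TraceContext) -> TraceContext:
    """Create a span context in the same trace, parented to the current span."""
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.span_id)


def inject(ctx: TraceContext, headers: Dict[str, str]) -> Dict[str, str]:
    """Write the context into the outgoing request headers and return them."""
    headers["X-B3-TraceId"] = ctx.trace_id
    headers["X-B3-SpanId"] = ctx.span_id
    if ctx.parent_id is not None:
        headers["X-B3-ParentSpanId"] = ctx.parent_id
    return headers


# The Media API's server span is the parent of the client span for GET /videos.
server_ctx = TraceContext("fAf3oXL6DS", "y74fr5udj", None)
client_ctx = child_of(server_ctx)
outgoing = inject(client_ctx, {"Accept": "application/json"})
```

For a message broker the same three values would go into message metadata instead of HTTP headers.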
Distributed Tracing
Leg 3: In-process propagation
inside the service it will use the server context as a parent to create local spans.
[Diagram labels: Media API, Cache service, Images service, GET /images; local spans mysql.Query and redis.Get.]
Distributed Tracing
Overhead for users:
• Observability tools are meant to be unintrusive
• Sampling reduces overhead
• (Don’t) trace every single operation
Overhead for developers:
• Not all libraries are ready to have instrumentation plugged in
• Instrumentation can be delegated to common frameworks
• Getting sampling right is hard
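As an illustration of how sampling reduces overhead, here is a minimal rate-based sampler. It is a generic sketch rather than any specific tracer's implementation: `is_sampled` is a hypothetical name, and the decision is derived from the trace ID so that every service seeing the same trace reaches the same answer.

```python
def is_sampled(trace_id_hex: str, rate: float) -> bool:
    """Decide whether to keep a trace, given a sampling rate between 0.0 and 1.0."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # Map the trace ID onto 10,000 buckets and keep the lowest rate * 10,000.
    # This is only uniform if trace IDs are randomly generated.
    bucket = int(trace_id_hex, 16) % 10_000
    return bucket < round(rate * 10_000)


# Sequential IDs are used here only to make the kept fraction easy to check.
kept = sum(is_sampled(format(i, "016x"), 0.1) for i in range(10_000))
```

A fixed rate like this is the simplest option. The "right sampling is hard" point is that a rate low enough to be cheap can also miss the rare failing request you wanted to see.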
Zipkin
inspired by Google Dapper (2010). It was open-sourced by Twitter (2012).
• Mature tracing model emerged from users’ needs.
• Used by large companies like LINE, Netflix, SoundCloud and Yelp, but also by small ones.
• Strong community:
  ◦ @zipkinproject
  ◦ gitter.im/openzipkin
Zipkin
[Diagram: Zipkin architecture.
• Spans are sent over http/kafka/grpc
• Collector: receives spans, deserializes them and schedules them for storage
• Storage: stores spans in a DB (Cassandra/MySQL/Elasticsearch)
• API: retrieves data
• UI: visualizes]
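To make the "send spans over http" step concrete, the sketch below builds a span in Zipkin's v2 JSON format and prepares a POST to the collector's `/api/v2/spans` endpoint. The helper names, the service name, the IDs and the `localhost:9411` address are assumptions for illustration, and the request is only constructed, not sent.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def to_zipkin_v2(
    trace_id: str,
    span_id: str,
    name: str,
    service: str,
    timestamp_us: int,
    duration_us: int,
    parent_id: Optional[str] = None,
    tags: Optional[Dict[str, str]] = None,
) -> dict:
    """Build one span as a dict following the Zipkin v2 JSON field names."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": timestamp_us,  # epoch microseconds
        "duration": duration_us,    # microseconds
        "localEndpoint": {"serviceName": service},
        "tags": tags or {},
    }
    if parent_id is not None:
        span["parentId"] = parent_id
    return span


def build_report_request(
    spans: List[dict], base_url: str = "http://localhost:9411"
) -> urllib.request.Request:
    """Prepare (but do not send) the HTTP request that reports a batch of spans."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/v2/spans",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical IDs and service name, chosen only to be valid lower-hex values.
span = to_zipkin_v2(
    trace_id="463ac35c9f6413ad",
    span_id="72485a3953bb6124",
    parent_id="463ac35c9f6413ad",
    name="mysql.query",
    service="media-api",
    timestamp_us=1508410442000000,
    duration_us=1500,
    tags={"error": "Aborted connection"},
)
request = build_report_request([span])
```

Passing `request` to `urllib.request.urlopen` would deliver the batch to a collector listening at that address. Real tracers usually buffer spans and report them asynchronously so that reporting stays off the request path.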
are complex and will be.
• Distributed tracing helps you understand latencies, critical paths and errors within a request or message flow.
• Distributed tracing provides contextual insights within a request; it is complementary to other observability tools.