As systems get more complex, reasoning about performance gets more difficult. Telemetry data emitted by our services is noisy and usually unhelpful in stressful situations. Distributed tracing, in particular, can provide rich, contextual data, but root-cause analysis can still be convoluted. In this talk, I'll review a few statistics-based approaches we have applied to quickly identify which properties of the system are correlated with performance issues.
To support this type of aggregate trace analysis, we need data, but data isn't cheap. We want to gather only the relevant traces and bias our collection toward traces that show abnormal behavior. I'll also talk about a few sampling approaches we use to minimize the cost and overhead of this analysis.