The Morning Paper on Operability

Adrian Colyer
September 20, 2016

Talk from Operability.io, September 19th 2016

Transcript

  1. Adrian Colyer | @adriancolyer
    The Morning Paper
    on Operability

  2. blog.acolyer.org
    400
    Foundations
    Frontiers

  3.
    Operability starts with design and development

  4.
    LISA ‘07

  5.
    We have long believed that 80% of operations
    issues originate in design and development...
    when systems fail, there is a natural tendency
    to look first to operations since that is where
    the problem actually took place. Most
    operations issues however, either have their
    genesis in design and development or are
    best solved there.


  6.
    If the development team is frequently called
    in the middle of the night, automation is the
    likely outcome. If operations is frequently
    called, the usual reaction is to grow the
    operations team.


  7.
    The three tenets
    1. Expect failures to happen regularly, and handle them gracefully
    2. Keep things as simple as possible
    3. Automate everything
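As a concrete illustration of the first tenet, here is a minimal sketch (the service and item names are hypothetical) of handling an unreliable downstream call by retrying and then degrading gracefully, instead of failing the whole request:

```python
import random

def fetch_recommendations(user_id):
    """Hypothetical downstream call that fails regularly."""
    if random.random() < 0.3:  # simulate a ~30% failure rate
        raise ConnectionError("recommendation service unavailable")
    return ["personalised-%d" % i for i in range(3)]

def recommendations_with_fallback(user_id, retries=3):
    """Expect failure: retry a few times, then degrade gracefully to a
    static default rather than failing the whole page."""
    for _ in range(retries):
        try:
            return fetch_recommendations(user_id)
        except ConnectionError:
            continue
    return ["top-seller-1", "top-seller-2", "top-seller-3"]  # degraded mode

items = recommendations_with_fallback(user_id=42)
print(len(items))  # always 3, whether personalised or degraded
```

The caller never sees the failure; it only ever sees some list of three items, which is the "handle them gracefully" half of the tenet.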

  8. (image slide)

  9.
    Figuring out what’s going on - the big picture

  10. (image slide)

  11.


    Modern Internet services are often implemented as
    complex, large-scale distributed systems. These
    applications are constructed from collections of software
    modules that may be developed by different teams,
    perhaps in different programming languages, and could
    span many thousands of machines across multiple
    physical facilities…. Understanding system behavior in
    this context requires observing related activities across
    many different programs and machines.

  12.


    When systems involve not just dozens of subsystems but
    dozens of engineering teams, even our best and most
    experienced engineers routinely guess wrong about the
    root cause of poor end-to-end performance. In such
    situations, Dapper can furnish much-needed facts and is
    able to answer many important performance questions
    conclusively.
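The core mechanism behind Dapper-style tracing can be sketched in a few lines: every unit of work records a span carrying the request's trace id and its parent span's id, so an out-of-band collector can reassemble the end-to-end tree. The field names and collector below are illustrative, not Dapper's actual format:

```python
import time
import uuid

collected = []  # stand-in for an out-of-band trace collector

def new_span(name, trace_id=None, parent_id=None):
    """One span: a named, timed unit of work tagged with the request's
    trace id and its parent span id (field names are illustrative)."""
    return {
        "trace_id": trace_id or uuid.uuid4().hex,  # shared by the whole request
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "name": name,
        "start": time.time(),
    }

def traced_call(parent, name, work):
    """Propagate the trace id into a downstream call and record a span."""
    span = new_span(name, parent["trace_id"], parent["span_id"])
    result = work()
    span["duration"] = time.time() - span["start"]
    collected.append(span)
    return result

root = new_span("GET /home")
traced_call(root, "user-service", lambda: "profile")
traced_call(root, "feed-service", lambda: "stories")
collected.append(root)

# All spans share one trace id, so a collector can stitch the
# end-to-end request tree back together across machines.
print(len({s["trace_id"] for s in collected}))  # 1
```

Because the trace id rides along with every downstream call, no single machine needs the whole picture; the collector joins the spans afterwards.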

  13. (image slide)

  14.
    … many systems are like the Facebook systems we study;
    they grow organically over time in a culture that favors
    innovation over standardization (e.g., “move fast and
    break things” is a well-known Facebook slogan). There is
    broad diversity in programming languages,
    communication middleware, execution environments,
    and scheduling mechanisms. Adding instrumentation
    retroactively to such an infrastructure is a Herculean
    task.


  15.
    Never fear, the logs are here!
    ● Request identifier
    ● Host identifier
    ● Host-local timestamp
    ● Unique event label


    Our key observation is that the sheer volume of
    requests handled by modern services allows us to
    gather observations of the order in which
    messages are logged over a tremendous number of
    requests.
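Given those four fields, recovering per-request flows from intermingled log output is essentially a group-and-sort, as this small sketch (with made-up log records) shows:

```python
from collections import defaultdict

# Made-up log records carrying the four fields from the slide:
# request identifier, host identifier, host-local timestamp, event label.
logs = [
    {"req": "r1", "host": "A", "ts": 0.01, "event": "recv"},
    {"req": "r2", "host": "A", "ts": 0.02, "event": "recv"},
    {"req": "r1", "host": "B", "ts": 0.05, "event": "read-block"},
    {"req": "r1", "host": "A", "ts": 0.09, "event": "reply"},
    {"req": "r2", "host": "A", "ts": 0.11, "event": "reply"},
]

def per_request_flows(records):
    """Group messages by request id, then order each group by its
    host-local timestamps to recover one flow per request."""
    flows = defaultdict(list)
    for rec in records:
        flows[rec["req"]].append(rec)
    for flow in flows.values():
        flow.sort(key=lambda r: r["ts"])  # rough order: clocks are host-local
    return {req: [r["event"] for r in flow] for req, flow in flows.items()}

print(per_request_flows(logs)["r1"])  # ['recv', 'read-block', 'reply']
```

Host-local clocks make the cross-host order only approximate for any one request; the paper's observation is that aggregating over huge numbers of requests lets the common orderings emerge anyway.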

  16. (image slide)

  17. (image slide)

  18.


    The information in the logs is sufficiently rich to allow
    the recovering of the inherent structure of the dispersed
    and intermingled log output messages, thus enabling
    useful performance profilers like lprof.

  19.


    Our evaluation shows that lprof can accurately attribute
    88% of the log messages from widely-used production
    quality distributed systems and is helpful in debugging
    65% of the sampled real-world performance anomalies.

  20. (image slide)

  21.
    Narrowing it down

  22. (image slide)

  23.


    The greatest challenge is posed by bugs that only recur
    in production and cannot be reproduced in-house.
    Diagnosing the root cause and fixing such bugs is truly
    hard. In [57] developers noted: "We don't have tools for
    the once every 24 hours bug in a 100 machine cluster".
    An informal poll on Quora asked "What is a coder's worst
    nightmare?," and the answers were "The bug only occurs
    in production and can't be replicated locally," and "The
    cause of the bug is unknown."

  24. (image slide)

  25.


    We evaluated our Gist prototype using 11 failures from 7
    different programs including Apache, SQLite, and
    Memcached. The Gist prototype managed to
    automatically build failure sketches with an average
    accuracy of 96% for all the failures while incurring an
    average performance overhead of 3.74%. On average,
    Gist incurs 166x less runtime performance overhead
    than a state-of-the-art record/replay system.

  26. (image slide)

  27.
    #define SIZE 20

    double mult(double z[], int n)
    {
        int i, j;
        i = 0;
        for (j = 0; j < n; j++) {
            i = i + j + 1;
            z[i] = z[i] * (z[0] + 1.0);
        }
        return z[n];
    }

    void copy(double to[], double from[], int count)
    {
        int n = (count + 7) / 8;
        switch (count % 8) do {
            case 0: *to++ = *from++;
            case 7: *to++ = *from++;
            case 6: *to++ = *from++;
            case 5: *to++ = *from++;
            case 4: *to++ = *from++;
            case 3: *to++ = *from++;
            case 2: *to++ = *from++;
            case 1: *to++ = *from++;
        } while (--n > 0);
        return mult(to, 2);
    }

    int main(int argc, char *argv[])
    {
        double x[SIZE], y[SIZE];
        double *px = x;
        while (px < x + SIZE)
            *px++ = (px - x) * (SIZE + 1.0);
        return copy(y, x, SIZE);
    }

  28.
    t(double z[],int n){int i,j;for(;;){i=i+j+1;z[i]=z[i]*z[0]+0;}return z[n];}
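The minimised crash-inducing input above is the kind of output delta debugging produces. A compact sketch of the ddmin idea from "Simplifying and isolating failure-inducing input": repeatedly delete chunks of the input, keeping any smaller input that still triggers the failure, and refine the granularity until no single deletion helps. A toy failure predicate stands in for the real compiler crash:

```python
def ddmin(failing, fails):
    """Sketch of ddmin: delete chunks of the failing input, keep any
    smaller input that still fails, refine chunk size until 1-minimal."""
    n = 2  # current number of chunks
    while len(failing) >= 2:
        chunk = len(failing) // n
        reduced = False
        for i in range(n):
            # candidate = input with the i-th chunk deleted
            candidate = failing[:i * chunk] + failing[(i + 1) * chunk:]
            if candidate and fails(candidate):
                failing, n, reduced = candidate, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(failing):
                break  # no single-character deletion preserves the failure
            n = min(n * 2, len(failing))  # try finer granularity
    return failing

# Toy stand-in for "this input crashes the compiler":
crashes = lambda s: "(" in s and ")" in s
print(ddmin("double mult(double z[], int n)", crashes))  # ()
```

Each surviving character is necessary for the failure, which is exactly the "1-minimal" guarantee the paper proves for the full algorithm.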

  29. (image slide)

  30. (image slide)

  31.
    mult(double *z, int n)
    {
        int i;
        int j;
        for (;;) {
            i = i + j + 1;
            z[i] = z[i] * (z[0] + 0);
        }
    }

  32. (image slide)

  33.
    DEMi’s three stages
    1. Find a minimal causal sequence of external events
    2. Minimise internal events
    3. Minimise the payloads of external messages

  34.
    Bug           Found or       #events:           Minimised   Time
                  reproduced?    Total (external)   #events     (secs)
    raft-45       reproduced     1140 (108)         37 (8)      498
    raft-46       reproduced     1730 (108)         61 (8)      250
    raft-56       found          780 (108)          23 (8)      197
    raft-58a      found          2850 (108)         226 (31)    43345
    raft-58b      found          1500 (208)         40 (9)      42
    raft-42       reproduced     1710 (208)         180 (21)    10558
    raft-66       found          400 (68)           77 (15)     334
    spark-2294    reproduced     1000 (30)          40 (3)      97
    spark-2294-c  reproduced     700 (3)            51 (3)      270
    spark-3150    reproduced     600 (20)           14 (3)      26
    spark-9256    found          600 (20)           16 (3)      15

  35.
    Why oh why…?

  36. (image slide)

  37. (image slide)

  38.
    Predicate-based:
      network send < 10KB
      and network receive < 10KB
      and client wait times > 100ms
      and cpu usage < 5
    Causal:
      “Rotation of the redo log file”
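Checking such a predicate against per-interval metrics is straightforward, as the sketch below shows; the metric samples are made up, and DBSherlock's real contribution is learning these predicates from user-labelled abnormal regions rather than evaluating hand-written ones:

```python
# Illustrative per-interval metric samples (values are made up):
samples = [
    {"net_send_kb": 250, "net_recv_kb": 300, "client_wait_ms": 12,  "cpu": 40},
    {"net_send_kb": 4,   "net_recv_kb": 6,   "client_wait_ms": 480, "cpu": 2},
    {"net_send_kb": 7,   "net_recv_kb": 3,   "client_wait_ms": 950, "cpu": 1},
]

def predicate(s):
    """The slide's predicate: network send < 10KB and network receive
    < 10KB and client wait times > 100ms and cpu usage < 5."""
    return (s["net_send_kb"] < 10 and s["net_recv_kb"] < 10
            and s["client_wait_ms"] > 100 and s["cpu"] < 5)

# Intervals matching the predicate are the candidates that a causal
# explanation (e.g. rotation of the redo log file) must account for.
abnormal = [i for i, s in enumerate(samples) if predicate(s)]
print(abnormal)  # [1, 2]
```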

  39. (image slide)

  40. (image slide)

  41. (image slide)

  42. (image slide)

  43.


    Our extensive experiments show that our algorithm is
    highly effective in identifying the correct explanations
    and is more accurate than the state-of-the-art algorithm.
    As a much needed tool for coping with the increasing
    complexity of today's DBMS, DBSherlock is released as
    an open-source module in our workload management
    toolkit.

  44. (image slide)

  45.
    Closing the loop

  46. (image slide)

  47.


    A unifying theme of many ongoing trends in software
    engineering is a blurring of the boundaries between
    building and operating software products. In this paper,
    we explore what we consider to be the logical next step
    in this succession: integrating runtime monitoring data
    from production deployments of the software into the
    tools developers utilize in their daily workflows (i.e.,
    IDEs) to enable tighter feedback loops. We refer to this
    notion as feedback-driven development (FDD).

  48. (image slide)

  49.
    Timeless advice

  50. (image slide)

  51.


    The complexity of complex systems makes it impossible
    for them to run without multiple flaws being present.
    Because these are individually insufficient to cause
    failure, they are regarded as a minor factor during
    operations. Complex systems therefore run in degraded
    mode as their normal mode of operation.

  52. 01. A new paper every weekday
    Published at https://blog.acolyer.org.
    02. Delivered straight to your inbox
    If you prefer an email-based subscription to read at your leisure.
    03. Announced on Twitter
    I’m @adriancolyer.
    04. Go to a Papers We Love Meetup
    A repository of academic computer science papers and a community who loves reading them.
    05. Share what you learn
    Anyone can take part in the great conversation.

  53.
    Papers from this talk:

  54.
    Paper links
    ● On designing and deploying internet-scale services: https://blog.acolyer.org/2016/09/12/on-designing-and-deploying-internet-scale-services/
    ● Dapper: https://blog.acolyer.org/2015/10/06/dapper-a-large-scale-distributed-systems-tracing-infrastructure/
    ● The Mystery Machine: https://blog.acolyer.org/2015/10/07/the-mystery-machine-end-to-end-performance-analysis-of-large-scale-internet-services/
    ● Gorilla: https://blog.acolyer.org/2016/05/03/gorilla-a-fast-scalable-in-memory-time-series-database/
    ● lprof: https://blog.acolyer.org/2015/10/08/lprof-a-non-intrusive-request-flow-profiler-for-distributed-systems/
    ● Pivot Tracing: https://blog.acolyer.org/2015/10/13/pivot-tracing-dynamic-causal-monitoring-for-distributed-systems/
    ● Failure sketching: https://blog.acolyer.org/2015/10/12/failure-sketching-a-technique-for-automated-root-cause-diagnosis-of-in-production-failures/
    ● Simplifying and isolating failure-inducing input: https://blog.acolyer.org/2015/11/16/simplifying-and-isolating-failure-inducing-input/
    ● Hierarchical delta debugging: https://blog.acolyer.org/2015/11/17/hierarchical-delta-debugging/
    ● Minimising faulty executions: https://blog.acolyer.org/2015/11/18/minimizing-faulty-executions-of-distributed-systems/
    ● DBSherlock: https://blog.acolyer.org/2016/07/14/dbsherlock-a-performance-diagnostic-tool-for-transactional-databases/
    ● Runtime metric meets developer: https://blog.acolyer.org/2015/11/10/runtime-metric-meets-developer-building-better-cloud-applications-using-feedback/
    ● How complex systems fail: https://blog.acolyer.org/2016/02/10/how-complex-systems-fail/
