The Morning Paper on Operability

Adrian Colyer
September 20, 2016

Talk from Operability.io, September 19th 2016

Transcript

  1. Adrian Colyer | @adriancolyer
    The Morning Paper
    on Operability

  2. blog.acolyer.org
    400
    Foundations
    Frontiers

  3.
    Operability starts with design and development

  4.
    LISA ‘07

  5.
    We have long believed that 80% of operations
    issues originate in design and development...
    when systems fail, there is a natural tendency
    to look first to operations since that is where
    the problem actually took place. Most
    operations issues however, either have their
    genesis in design and development or are
    best solved there.


  6.
    If the development team is frequently called
    in the middle of the night, automation is the
    likely outcome. If operations is frequently
    called, the usual reaction is to grow the
    operations team.


  7.
    The three tenets
    1. Expect failures to happen regularly, and handle them gracefully
    2. Keep things as simple as possible
    3. Automate everything
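As a concrete illustration of the first tenet, here is a minimal sketch (the service and item names are hypothetical) of handling an unreliable downstream call by retrying and then degrading gracefully, instead of failing the whole request:

```python
import random

def fetch_recommendations(user_id):
    """Hypothetical downstream call that fails regularly."""
    if random.random() < 0.3:  # simulate a ~30% failure rate
        raise ConnectionError("recommendation service unavailable")
    return ["personalised-%d" % i for i in range(3)]

def recommendations_with_fallback(user_id, retries=3):
    """Expect failure: retry a few times, then degrade gracefully to a
    static default rather than failing the whole page."""
    for _ in range(retries):
        try:
            return fetch_recommendations(user_id)
        except ConnectionError:
            continue
    return ["top-seller-1", "top-seller-2", "top-seller-3"]  # degraded mode

items = recommendations_with_fallback(user_id=42)
print(len(items))  # always 3, whether personalised or degraded
```

The caller never sees the failure; it only ever sees some list of three items, which is the "handle them gracefully" half of the tenet.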

  8. (image slide)

  9.
    Figuring out what’s going on - the big picture

  10. (image slide)

  11.


    Modern Internet services are often implemented as
    complex, large-scale distributed systems. These
    applications are constructed from collections of software
    modules that may be developed by different teams,
    perhaps in different programming languages, and could
    span many thousands of machines across multiple
    physical facilities…. Understanding system behavior in
    this context requires observing related activities across
    many different programs and machines.

  12.


    When systems involve not just dozens of subsystems but
    dozens of engineering teams, even our best and most
    experienced engineers routinely guess wrong about the
    root cause of poor end-to-end performance. In such
    situations, Dapper can furnish much-needed facts and is
    able to answer many important performance questions
    conclusively.
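The core mechanism behind Dapper-style tracing can be sketched in a few lines: every unit of work records a span carrying the request's trace id and its parent span's id, so an out-of-band collector can reassemble the end-to-end tree. The field names and collector below are illustrative, not Dapper's actual format:

```python
import time
import uuid

collected = []  # stand-in for an out-of-band trace collector

def new_span(name, trace_id=None, parent_id=None):
    """One span: a named, timed unit of work tagged with the request's
    trace id and its parent span id (field names are illustrative)."""
    return {
        "trace_id": trace_id or uuid.uuid4().hex,  # shared by the whole request
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "name": name,
        "start": time.time(),
    }

def traced_call(parent, name, work):
    """Propagate the trace id into a downstream call and record a span."""
    span = new_span(name, parent["trace_id"], parent["span_id"])
    result = work()
    span["duration"] = time.time() - span["start"]
    collected.append(span)
    return result

root = new_span("GET /home")
traced_call(root, "user-service", lambda: "profile")
traced_call(root, "feed-service", lambda: "stories")
collected.append(root)

# All spans share one trace id, so a collector can stitch the
# end-to-end request tree back together across machines.
print(len({s["trace_id"] for s in collected}))  # 1
```

Because the trace id rides along with every downstream call, no single machine needs the whole picture; the collector joins the spans afterwards.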

  13. (image slide)

  14.
    … many systems are like the Facebook systems we study;
    they grow organically over time in a culture that favors
    innovation over standardization (e.g., “move fast and
    break things” is a well-known Facebook slogan). There is
    broad diversity in programming languages,
    communication middleware, execution environments,
    and scheduling mechanisms. Adding instrumentation
    retroactively to such an infrastructure is a Herculean
    task.


  15.
    Never fear, the logs are here!
    ● Request identifier
    ● Host identifier
    ● Host-local timestamp
    ● Unique event label


    Our key observation is that the sheer volume of
    requests handled by modern services allows us to
    gather observations of the order in which
    messages are logged over a tremendous number of
    requests.
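Given those four fields, recovering per-request flows from intermingled log output is essentially a group-and-sort, as this small sketch (with made-up log records) shows:

```python
from collections import defaultdict

# Made-up log records carrying the four fields from the slide:
# request identifier, host identifier, host-local timestamp, event label.
logs = [
    {"req": "r1", "host": "A", "ts": 0.01, "event": "recv"},
    {"req": "r2", "host": "A", "ts": 0.02, "event": "recv"},
    {"req": "r1", "host": "B", "ts": 0.05, "event": "read-block"},
    {"req": "r1", "host": "A", "ts": 0.09, "event": "reply"},
    {"req": "r2", "host": "A", "ts": 0.11, "event": "reply"},
]

def per_request_flows(records):
    """Group messages by request id, then order each group by its
    host-local timestamps to recover one flow per request."""
    flows = defaultdict(list)
    for rec in records:
        flows[rec["req"]].append(rec)
    for flow in flows.values():
        flow.sort(key=lambda r: r["ts"])  # rough order: clocks are host-local
    return {req: [r["event"] for r in flow] for req, flow in flows.items()}

print(per_request_flows(logs)["r1"])  # ['recv', 'read-block', 'reply']
```

Host-local clocks make the cross-host order only approximate for any one request; the paper's observation is that aggregating over huge numbers of requests lets the common orderings emerge anyway.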

  16. (image slide)

  17. (image slide)

  18.


    The information in the logs is sufficiently rich to allow
    the recovering of the inherent structure of the dispersed
    and intermingled log output messages, thus enabling
    useful performance profilers like lprof.

  19.


    Our evaluation shows that lprof can accurately attribute
    88% of the log messages from widely-used production
    quality distributed systems and is helpful in debugging
    65% of the sampled real-world performance anomalies.

  20. (image slide)

  21.
    Narrowing it down

  22. (image slide)

  23.


    The greatest challenge is posed by bugs that only recur
    in production and cannot be reproduced in-house.
    Diagnosing the root cause and fixing such bugs is truly
    hard. In [57] developers noted: "We don't have tools for
    the once every 24 hours bug in a 100 machine cluster".
    An informal poll on Quora asked "What is a coder's worst
    nightmare?," and the answers were "The bug only occurs
    in production and can't be replicated locally," and "The
    cause of the bug is unknown."

  24. (image slide)

  25.


    We evaluated our Gist prototype using 11 failures from 7
    different programs including Apache, SQLite, and
    Memcached. The Gist prototype managed to
    automatically build failure sketches with an average
    accuracy of 96% for all the failures while incurring an
    average performance overhead of 3.74%. On average,
    Gist incurs 166x less runtime performance overhead
    than a state-of-the-art record/replay system.

  26. (image slide)

  27.
    #define SIZE 20

    double mult(double z[], int n)
    {
        int i, j;
        i = 0;
        for (j = 0; j < n; j++) {
            i = i + j + 1;
            z[i] = z[i] * (z[0] + 1.0);
        }
        return z[n];
    }

    void copy(double to[], double from[], int count)
    {
        int n = (count + 7) / 8;
        switch (count % 8) do {
            case 0: *to++ = *from++;
            case 7: *to++ = *from++;
            case 6: *to++ = *from++;
            case 5: *to++ = *from++;
            case 4: *to++ = *from++;
            case 3: *to++ = *from++;
            case 2: *to++ = *from++;
            case 1: *to++ = *from++;
        } while (--n > 0);
        return mult(to, 2);
    }

    int main(int argc, char *argv[])
    {
        double x[SIZE], y[SIZE];
        double *px = x;
        while (px < x + SIZE)
            *px++ = (px - x) * (SIZE + 1.0);
        return copy(y, x, SIZE);
    }

  28.
    t(double z[],int n){int i,j;for(;;){i=i+j+1;z[i]=z[i]*z[0]+0;}return z[n];}
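The minimised crash-inducing input above is the kind of output delta debugging produces. A compact sketch of the ddmin idea from "Simplifying and isolating failure-inducing input": repeatedly delete chunks of the input, keeping any smaller input that still triggers the failure, and refine the granularity until no single deletion helps. A toy failure predicate stands in for the real compiler crash:

```python
def ddmin(failing, fails):
    """Sketch of ddmin: delete chunks of the failing input, keep any
    smaller input that still fails, refine chunk size until 1-minimal."""
    n = 2  # current number of chunks
    while len(failing) >= 2:
        chunk = len(failing) // n
        reduced = False
        for i in range(n):
            # candidate = input with the i-th chunk deleted
            candidate = failing[:i * chunk] + failing[(i + 1) * chunk:]
            if candidate and fails(candidate):
                failing, n, reduced = candidate, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(failing):
                break  # no single-character deletion preserves the failure
            n = min(n * 2, len(failing))  # try finer granularity
    return failing

# Toy stand-in for "this input crashes the compiler":
crashes = lambda s: "(" in s and ")" in s
print(ddmin("double mult(double z[], int n)", crashes))  # ()
```

Each surviving character is necessary for the failure, which is exactly the "1-minimal" guarantee the paper proves for the full algorithm.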

  29. (image slide)

  30. (image slide)

  31.
    mult(double *z, int n)
    {
        int i;
        int j;
        for (;;) {
            i = i + j + 1;
            z[i] = z[i] * (z[0] + 0);
        }
    }

  32. (image slide)

  33.
    DEMi’s three stages
    1. Find a minimal causal sequence of external events
    2. Minimise internal events
    3. Minimise the payloads of external messages

  34.
    Bug           Found or       #events:           Minimised   Time
                  reproduced?    Total (external)   #events     (secs)
    raft-45       reproduced     1140 (108)         37 (8)      498
    raft-46       reproduced     1730 (108)         61 (8)      250
    raft-56       found          780 (108)          23 (8)      197
    raft-58a      found          2850 (108)         226 (31)    43345
    raft-58b      found          1500 (208)         40 (9)      42
    raft-42       reproduced     1710 (208)         180 (21)    10558
    raft-66       found          400 (68)           77 (15)     334
    spark-2294    reproduced     1000 (30)          40 (3)      97
    spark-2294-c  reproduced     700 (3)            51 (3)      270
    spark-3150    reproduced     600 (20)           14 (3)      26
    spark-9256    found          600 (20)           16 (3)      15

  35.
    Why oh why…?

  36. (image slide)

  37. (image slide)

  38.
    Predicate-based:
      network send < 10KB
      and network receive < 10KB
      and client wait times > 100ms
      and cpu usage < 5
    Causal:
      “Rotation of the redo log file”
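Checking such a predicate against per-interval metrics is straightforward, as the sketch below shows; the metric samples are made up, and DBSherlock's real contribution is learning these predicates from user-labelled abnormal regions rather than evaluating hand-written ones:

```python
# Illustrative per-interval metric samples (values are made up):
samples = [
    {"net_send_kb": 250, "net_recv_kb": 300, "client_wait_ms": 12,  "cpu": 40},
    {"net_send_kb": 4,   "net_recv_kb": 6,   "client_wait_ms": 480, "cpu": 2},
    {"net_send_kb": 7,   "net_recv_kb": 3,   "client_wait_ms": 950, "cpu": 1},
]

def predicate(s):
    """The slide's predicate: network send < 10KB and network receive
    < 10KB and client wait times > 100ms and cpu usage < 5."""
    return (s["net_send_kb"] < 10 and s["net_recv_kb"] < 10
            and s["client_wait_ms"] > 100 and s["cpu"] < 5)

# Intervals matching the predicate are the candidates that a causal
# explanation (e.g. rotation of the redo log file) must account for.
abnormal = [i for i, s in enumerate(samples) if predicate(s)]
print(abnormal)  # [1, 2]
```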

  39. (image slide)

  40. (image slide)

  41. (image slide)

  42. (image slide)

  43.


    Our extensive experiments show that our algorithm is
    highly effective in identifying the correct explanations
    and is more accurate than the state-of-the-art algorithm.
    As a much needed tool for coping with the increasing
    complexity of today's DBMS, DBSherlock is released as
    an open-source module in our workload management
    toolkit.

  44. (image slide)

  45.
    Closing the loop

  46. (image slide)

  47.


    A unifying theme of many ongoing trends in software
    engineering is a blurring of the boundaries between
    building and operating software products. In this paper,
    we explore what we consider to be the logical next step
    in this succession: integrating runtime monitoring data
    from production deployments of the software into the
    tools developers utilize in their daily workflows (i.e.,
    IDEs) to enable tighter feedback loops. We refer to this
    notion as feedback-driven development (FDD).

  48. (image slide)

  49.
    Timeless advice

  50. (image slide)

  51.


    The complexity of complex systems makes it impossible
    for them to run without multiple flaws being present.
    Because these are individually insufficient to cause
    failure, they are regarded as a minor factor during
    operations. Complex systems therefore run in degraded
    mode as their normal mode of operation.

  52. 01. A new paper every weekday
    Published at https://blog.acolyer.org.
    02. Delivered straight to your inbox
    If you prefer an email-based subscription to read at your leisure.
    03. Announced on Twitter
    I’m @adriancolyer.
    04. Go to a Papers We Love Meetup
    A repository of academic computer science papers and a community who loves reading them.
    05. Share what you learn
    Anyone can take part in the great conversation.

  53.
    Papers from this talk:

  54.
    Paper links
    ● On designing and deploying internet-scale services: https://blog.acolyer.org/2016/09/12/on-designing-and-deploying-internet-scale-services/
    ● Dapper: https://blog.acolyer.org/2015/10/06/dapper-a-large-scale-distributed-systems-tracing-infrastructure/
    ● The Mystery Machine: https://blog.acolyer.org/2015/10/07/the-mystery-machine-end-to-end-performance-analysis-of-large-scale-internet-services/
    ● Gorilla: https://blog.acolyer.org/2016/05/03/gorilla-a-fast-scalable-in-memory-time-series-database/
    ● lprof: https://blog.acolyer.org/2015/10/08/lprof-a-non-intrusive-request-flow-profiler-for-distributed-systems/
    ● Pivot Tracing: https://blog.acolyer.org/2015/10/13/pivot-tracing-dynamic-causal-monitoring-for-distributed-systems/
    ● Failure sketching: https://blog.acolyer.org/2015/10/12/failure-sketching-a-technique-for-automated-root-cause-diagnosis-of-in-production-failures/
    ● Simplifying and isolating failure-inducing input: https://blog.acolyer.org/2015/11/16/simplifying-and-isolating-failure-inducing-input/
    ● Hierarchical delta debugging: https://blog.acolyer.org/2015/11/17/hierarchical-delta-debugging/
    ● Minimising faulty executions: https://blog.acolyer.org/2015/11/18/minimizing-faulty-executions-of-distributed-systems/
    ● DBSherlock: https://blog.acolyer.org/2016/07/14/dbsherlock-a-performance-diagnostic-tool-for-transactional-databases/
    ● Runtime metric meets developer: https://blog.acolyer.org/2015/11/10/runtime-metric-meets-developer-building-better-cloud-applications-using-feedback/
    ● How complex systems fail: https://blog.acolyer.org/2016/02/10/how-complex-systems-fail/
