When Systems Fail: Lessons Learned from Real-Life Experiences as SREs

Umegbewe

June 16, 2023

Transcript

  1. When Systems Fail:
    Lessons Learned from Real-Life Experiences as SREs
    Great Umegbewe - Infrastructure Engineer, Gala Games
    Jubril Oyetunji - Site Reliability Engineer

  2. Introduction
    Hello, I'm Great, an Infrastructure Engineer currently at Gala Games. I've spent over 3 years
    monitoring, diagnosing, and troubleshooting large-scale distributed systems, ensuring
    optimal performance and reliability.
    Github
    Twitter
    [email protected]

  3. Introduction
    Hello, everyone. I'm Jubril, an independent Software Engineer/SRE. Much like Great, I've
    spent the last 3 years doing SRE/DevOps work for various companies.
    Github
    Twitter
    [email protected]

  4. What this talk is NOT
    A guide for your company to implement incident response
    A technology-specific pitch for a product
    A deep dive into the technical details of incident response

  5. The Role of an SRE
    Site Reliability Engineers (SREs) bridge the gap between development and operations. Our
    key responsibilities include:
    Developing and maintaining scalable and resilient systems
    Monitoring system health and performance
    Responding to incidents and resolving them effectively
    Conducting root cause analysis after incidents
    Implementing automation to reduce manual work and prevent recurring issues
    Enhancing system security and data protection

  6. The Importance of System Reliability
    System reliability is critical. When systems fail, the impacts can be profound:
    Financial Loss: Interruptions can result in lost sales and extra costs to resolve the
    issues.
    Reputation Damage: System failures can negatively affect a company's reputation,
    eroding customer trust.
    Productivity Loss: Employees may not be able to perform their duties, leading to lost
    productivity.
    Data Loss: In some cases, system failures can lead to data loss, which can have
    severe implications.
    Learning from failures is thus vital to prevent these impacts.

  7. What Constitutes a 'System Failure'?
    A system failure occurs when a system or component doesn't perform its intended
    function. This could be a complete shutdown, performance degradation, security breach, or
    data corruption. The impact can range from minor user inconvenience to severe disruption
    of business operations.

  8. Case Study 1: High Database CPU Usage Incident
    We observed a sudden and sustained spike in our database CPU usage. The CPU peaked
    anytime a specific reporting function was initiated.
    The System: Reporting system, built on Ruby.
    The Failure: A specific operation caused the database CPU usage to skyrocket,
    slowing down the entire system and delaying the generation of reports.
    The Impact: Report generation was severely delayed, hampering business
    decision-making. This also strained our database resources, impacting other applications
    relying on the same database.

  9. Analysis of Case Study 1
    While investigating the root cause, we identified a piece of code, reproduced in simplified
    form below, that was leading to inefficient database querying:
    users = User.all
    users.each do |user|
      posts = user.posts
      posts.each do |post|
        # some magic is going on here
      end
    end

  10. Root Cause: This code was performing inefficient database operations for two
    reasons. First, User.all loaded all users into memory at once; second, it issued a
    separate database query for each user's posts: a classic N+1 problem.
    Response: The offending code was dug out and refactored (a sketch of the refactor
    follows below).
    What Worked: The refactored code significantly reduced database CPU usage, and we
    shifted reporting queries to read replicas.
    What Didn't: Our initial alerting system didn't catch this issue until we were impacted.
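
    As an illustration, here is a minimal sketch of the kind of refactor that addresses an N+1
    pattern like this, assuming Rails/ActiveRecord (model and association names mirror the
    snippet above): eager-load the posts and walk the users in batches.

    # Eager-load posts to avoid one query per user, and iterate users in
    # batches so the whole table is never held in memory at once.
    User.includes(:posts).find_each(batch_size: 500) do |user|
      user.posts.each do |post|
        # some magic is going on here
      end
    end

    With includes, ActiveRecord fetches the posts for each batch of users in a single
    additional query instead of issuing one query per user.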

  11. Lessons from Case Study 1
    This incident taught us valuable lessons about efficient database operations and proactive
    monitoring:
    Efficient Database Operations: Always be mindful of the potential impacts of
    database operations.
    Proactive Monitoring: Granular monitoring of database operations is critical. Being
    alerted early can help prevent small issues from escalating into major problems.
    Code Reviews: Regular code reviews can catch potentially problematic code. Pay
    special attention to code that interacts with the database, as inefficient queries can
    have a profound system-wide impact. Automated checks can help here too; a small
    sketch follows below.
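
    One such check, as a rough sketch: the bullet gem (a common Rails add-on) flags N+1
    queries and unnecessary eager loading during development and test runs. The
    configuration below assumes a standard Rails setup.

    # config/environments/development.rb (sketch; adjust to your setup)
    config.after_initialize do
      Bullet.enable        = true   # turn N+1 detection on
      Bullet.bullet_logger = true   # write findings to log/bullet.log
      Bullet.rails_logger  = true   # also surface them in the Rails log
    end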

  12. Case Study 2: Database Outage Incident
    In Q1 2023, I received a call from a client that their database was down and requests to
    the API were failing.
    The System: Our primary database was running on Azure, which handles most of our
    customer transactions.
    The Failure: CPU usage on the database server spiked to 100%, causing the
    database to crash.
    The Impact: The database was down for half a day, leading to a disruption of the
    backend.

  13. Analysis of Case Study 2
    Hopping onto the Azure portal, I was able to see that the database was indeed down and
    CPU usage was at 100%.
    Root Cause: The deprecation of the Azure single-server instance type caused a severe
    drop in performance.
    Response: We quickly migrated the database to a new server instance while we
    worked on bringing the primary server back online.
    What Worked: Migrating over to a newer instance type fixed the problem; however, it
    didn't give us the full picture.
    What Didn't: Our system monitoring failed to alert us because it was nonexistent; even
    a basic external probe, as sketched below, would likely have caught this sooner.
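
    As an illustration of how little it takes to get a first signal, here is a rough Ruby sketch of
    an external liveness probe. The hostname and port are placeholders; in practice the
    check would run on a schedule and hand off to paging.

    require "socket"

    # Tiny external liveness probe: can we even open a TCP connection to the
    # database endpoint? Host and port below are placeholders.
    def db_reachable?(host, port)
      Socket.tcp(host, port, connect_timeout: 5).close
      true
    rescue SocketError, SystemCallError
      false
    end

    unless db_reachable?("example-db.database.azure.com", 5432)
      warn "ALERT: database endpoint unreachable"
      # hand off to a pager, chat webhook, or Azure Monitor here
    end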

  14. Lessons from Case Study 2
    This incident was a wake-up call and led to several important changes.
    Monitoring: We implemented monitoring using Azure Monitor.
    Deprecation: We started monitoring for deprecation notices from our cloud providers.

  15. Case Study 3: High CPU steal times
    In Q4 2022, we encountered higher-than-usual RTTs, caused by major performance
    degradation of our message broker (self-hosted RabbitMQ) running in the cloud.
    The System: RabbitMQ, others.
    The Failure: RabbitMQ was significantly degraded, leading to slow message
    processing and communication between services.
    The Impact: Slowed service-to-service communication resulted in delays across our
    platform, leading to complaints and lowered system throughput.

  16. Analysis of Case Study 3
    Our investigation into the incident revealed an often overlooked metric: CPU steal time.
    Root Cause: High CPU steal time on our RabbitMQ server. This happens when the
    physical CPU is too busy servicing other virtual machines (VMs), causing ours to wait
    and leading to performance degradation (a quick way to measure this is sketched
    below).
    Response: We attempted applying back pressure and rolling reboots (one node at a
    time so as not to lose quorum), then contacted our cloud service provider to discuss
    the issue. We also explored options like cross-AZ configurations and resizing our VMs
    to have more CPU resources.
    What Worked: The above helped to an extent, but did not completely resolve the
    issue.
    What Didn't: Initial attempts to resize the VM were unsuccessful due to capacity
    constraints with our service provider.
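
    For illustration, a rough Ruby sketch of how steal time can be read on a Linux guest
    outside of a full monitoring stack (the sampling interval is arbitrary): /proc/stat exposes
    cumulative CPU counters, and the eighth value on the cpu line is time stolen by the
    hypervisor.

    # Sample /proc/stat twice and report the share of CPU time stolen by the
    # hypervisor between the two samples (Linux guests only).
    def cpu_counters
      values = File.read("/proc/stat").lines.first.split[1..].map(&:to_i)
      # order: user nice system idle iowait irq softirq steal guest guest_nice
      { total: values.sum, steal: values[7] }
    end

    before = cpu_counters
    sleep 5
    after = cpu_counters

    steal_pct = 100.0 * (after[:steal] - before[:steal]) /
                (after[:total] - before[:total])
    puts format("CPU steal over the last 5s: %.2f%%", steal_pct)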

  17. Lessons from Case Study 3
    This incident shed light on the nuances of working with virtualized systems.
    Monitor the Right Metrics: Beyond traditional performance metrics, some metrics are
    unique to virtual environments. We incorporated CPU steal time into our routine
    monitoring.
    Instance Optimization: RabbitMQ performs best on instances with high disk I/O
    throughput; switching to such instances could significantly improve its performance.
    We learned to choose our resources based on the specific needs of our applications.
    Flexible Infrastructure: We recognized the need for a more scalable infrastructure.
    This led us to consider alternatives such as containerization with Kubernetes to better
    manage resources.

  18. Case Study 4: OpenSearch cluster outage
    Like most outages, this one started with a call from a client: their application was down
    and its search functionality was not working.
    The System: The primary feature of the application was search, and it was powered by
    an OpenSearch cluster.
    The Failure: The master node of the cluster was down and quorum was lost.
    The Impact: The search functionality was down for a few hours.

  19. Analysis of Case Study 4
    Root Cause: The configuration being used was not suitable for the scale and size of
    the cluster.
    Response: I increased the size of the nodes as well as the memory (JVM heap)
    available to Java.
    What Worked: Increasing the size of the nodes and the memory available to Java fixed
    the problem.
    What Didn't: The configuration being used was not the one intended for production;
    OpenSearch wasn't picking it up, so it fell back to a similarly named config file. A basic
    cluster-health probe, as sketched below, could have made the degraded state visible
    much sooner.
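
    As a sketch of that kind of basic check, the following Ruby snippet polls OpenSearch's
    _cluster/health REST endpoint; the host, port, and lack of authentication are
    simplifications.

    require "net/http"
    require "json"
    require "uri"

    # Minimal health probe against the cluster health API.
    # Host/port are placeholders; add auth and TLS as the cluster requires.
    health = JSON.parse(Net::HTTP.get(URI("http://localhost:9200/_cluster/health")))

    status = health["status"]           # "green", "yellow" or "red"
    nodes  = health["number_of_nodes"]

    puts "cluster status=#{status}, nodes=#{nodes}"
    warn "ALERT: cluster is not green" unless status == "green"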

  20. Lessons from Case Study 4
    Again, this incident shows how much care we need to take as SREs and how important it
    is to have a good monitoring system in place.
    Configuration: We started monitoring for configuration changes.
    Automation: We started automating the deployment of our OpenSearch clusters
    (which was slightly more painful than one would expect).

  21. Broader Learnings for SREs and Organizations
    These experiences taught us broader lessons applicable to all SREs and organizations:
    1. Preparedness: Always be prepared for a system failure. Regular load tests and system
    checks can prevent surprises.
    2. Automation: The more you can automate, the less likely it is that human error will
    cause system failures.
    3. Transparency: Encourage a culture where teams feel comfortable reporting potential
    issues. This leads to early detection and mitigation.
    4. Continuous Learning: Embrace failures as learning opportunities. Regular
    post-mortem reviews are crucial.
    5. Resilience: Build systems with failure in mind to ensure they can recover quickly when
    incidents occur.

  22. Q&A
    We are now open to any questions you might have. Please feel free to laugh or ask about
    any aspect of site reliability engineering, incident management, or anything else that's
    been covered in this presentation.

  23. Conclusion
    In conclusion, system failures are costly and disruptive, but they also provide valuable
    opportunities for learning and growth. By examining our failures, we can improve our
    systems, our processes, and ourselves.
    Thanks for your time and attention.

  24. Because you could 'live log' with your failure, where
    does that bring you? Back to rm -rf
    me!
