When Systems Fail: Lessons Learned from Real-Life Experiences as SREs

Umegbewe

June 16, 2023

Transcript

  1. When Systems Fail:
    Lessons Learned from Real-Life Experiences as SREs
    Great Umegbewe - Infrastructure Engineer, Gala Games
    Jubril Oyetunji - Site Reliability Engineer

  2. Introduction
    Hello, I'm Great, an Infrastructure Engineer currently at Gala Games. I've spent over 3 years
    monitoring, diagnosing, and troubleshooting large-scale distributed systems, ensuring
    optimal performance and reliability.
    Github
    Twitter
    [email protected]

  3. Introduction
    Hello, everyone. I'm Jubril, an independent Software Engineer/SRE. Much like Great, I've
    spent the last 3 years doing SRE/DevOps work for various companies.
    Github
    Twitter
    [email protected]

  4. What this talk is NOT
    A guide for your company to implement incident response
    A technology-specific pitch for a product
    A deep dive into the technical details of incident response

  5. The Role of an SRE
    Site Reliability Engineers (SREs) bridge the gap between development and operations. Our
    key responsibilities include:
    Developing and maintaining scalable and resilient systems
    Monitoring system health and performance
    Responding to incidents and resolving them effectively
    Conducting root cause analysis after incidents
    Implementing automation to reduce manual work and prevent recurring issues
    Enhancing system security and data protection

  6. The Importance of System Reliability
    System reliability is critical. When systems fail, the impacts can be profound:
    Financial Loss: Interruptions can result in lost sales and extra costs to resolve the
    issues.
    Reputation Damage: System failures can negatively affect a company's reputation,
    eroding customer trust.
    Productivity Loss: Employees may not be able to perform their duties, leading to lost
    productivity.
    Data Loss: In some cases, system failures can lead to data loss, which can have
    severe implications.
    Learning from failures is thus vital to prevent these impacts.

  7. What Constitutes a 'System Failure'?
    A system failure occurs when a system or component doesn't perform its intended
    function. This could be a complete shutdown, performance degradation, security breach, or
    data corruption. The impact can range from minor user inconvenience to severe disruption
    of business operations.

  8. Case Study 1: High Database CPU Usage Incident
    We observed a sudden and sustained spike in our database CPU usage. The CPU peaked
    anytime a specific reporting function was initiated.
    The System: Reporting system, built on Ruby.
    The Failure: A specific operation caused the database CPU usage to skyrocket,
    slowing down the entire system and delaying the generation of reports.
    The Impact: Report generation was severely delayed, hampering business
    decision-making. This also strained our database resources, impacting other applications
    relying on the same database.

  9. Analysis of Case Study 1
    While investigating the root cause, we identified a piece of code, reproduced in simplified
    form below, that was leading to inefficient database querying:
    users = User.all
    users.each do |user|
      posts = user.posts
      posts.each do |post|
        # some magic is going on here
      end
    end

  10. Root Cause: This code was performing inefficient database operations for two
    reasons. First, User.all loaded all users into memory at once; second, it issued a
    separate database query for each user's posts: a classic N+1 problem.
    Response: The offending code was dug out and refactored (a sketch of the refactor
    follows below).
    What Worked: The refactored code significantly reduced database CPU usage, and we
    shifted reporting queries to read replicas.
    What Didn't: Our initial alerting system didn't catch this issue until we were impacted.
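
    As an illustration, here is a minimal sketch of the kind of refactor that addresses an N+1
    pattern like this, assuming Rails/ActiveRecord (model and association names mirror the
    snippet above): eager-load the posts and walk the users in batches.

    # Eager-load posts to avoid one query per user, and iterate users in
    # batches so the whole table is never held in memory at once.
    User.includes(:posts).find_each(batch_size: 500) do |user|
      user.posts.each do |post|
        # some magic is going on here
      end
    end

    With includes, ActiveRecord fetches the posts for each batch of users in a single
    additional query instead of issuing one query per user.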

  11. Lessons from Case Study 1
    This incident taught us valuable lessons about efficient database operations and proactive
    monitoring:
    Efficient Database Operations: Always be mindful of the potential impacts of
    database operations.
    Proactive Monitoring: Granular monitoring of database operations is critical. Being
    alerted early can help prevent small issues from escalating into major problems.
    Code Reviews: Regular code reviews can catch potentially problematic code. Pay
    special attention to code that interacts with the database, as inefficient queries can
    have a profound system-wide impact. Automated checks can help here too; a small
    sketch follows below.
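
    One such check, as a rough sketch: the bullet gem (a common Rails add-on) flags N+1
    queries and unnecessary eager loading during development and test runs. The
    configuration below assumes a standard Rails setup.

    # config/environments/development.rb (sketch; adjust to your setup)
    config.after_initialize do
      Bullet.enable        = true   # turn N+1 detection on
      Bullet.bullet_logger = true   # write findings to log/bullet.log
      Bullet.rails_logger  = true   # also surface them in the Rails log
    end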

  12. Case Study 2: Database Outage Incident
    In Q1 2023, I received a call from a client that their database was down and requests to
    the API were failing.
    The System: Our primary database was running on Azure, which handles most of our
    customer transactions.
    The Failure: CPU usage on the database server spiked to 100%, causing the
    database to crash.
    The Impact: The database was down for half a day, leading to a disruption of the
    backend.

  13. Analysis of Case Study 2
    Hopping onto the Azure portal, I was able to see that the database was indeed down and
    CPU usage was at 100%.
    Root Cause: The deprecation of the Azure single-server instance type caused a severe
    drop in performance.
    Response: We quickly migrated the database to a new server instance while we
    worked on bringing the primary server back online.
    What Worked: Migrating over to a newer instance type fixed the problem; however, it
    didn't give us the full picture.
    What Didn't: Our system monitoring failed to alert us because it was nonexistent; even
    a basic external probe, as sketched below, would likely have caught this sooner.
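
    As an illustration of how little it takes to get a first signal, here is a rough Ruby sketch of
    an external liveness probe. The hostname and port are placeholders; in practice the
    check would run on a schedule and hand off to paging.

    require "socket"

    # Tiny external liveness probe: can we even open a TCP connection to the
    # database endpoint? Host and port below are placeholders.
    def db_reachable?(host, port)
      Socket.tcp(host, port, connect_timeout: 5).close
      true
    rescue SocketError, SystemCallError
      false
    end

    unless db_reachable?("example-db.database.azure.com", 5432)
      warn "ALERT: database endpoint unreachable"
      # hand off to a pager, chat webhook, or Azure Monitor here
    end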

  14. Lessons from Case Study 2
    This incident was a wake-up call and led to several important changes.
    Monitoring: We implemented monitoring using Azure Monitor.
    Deprecation: We started monitoring for deprecation notices from our cloud providers.

  15. Case Study 3: High CPU steal times
    In Q4 2022, we encountered higher-than-usual RTTs, caused by major performance
    degradation of our message broker (self-hosted RabbitMQ) running in the cloud.
    The System: RabbitMQ, others.
    The Failure: RabbitMQ was significantly degraded, leading to slow message
    processing and communication between services.
    The Impact: Slowed service-to-service communication resulted in delays across our
    platform, leading to complaints and lowered system throughput.

  16. Analysis of Case Study 3
    Our investigation into the incident revealed an often overlooked metric: CPU steal time.
    Root Cause: High CPU steal time on our RabbitMQ server. This happens when the
    physical CPU is too busy servicing other virtual machines (VMs), causing ours to wait
    and leading to performance degradation (a quick way to measure this is sketched
    below).
    Response: We attempted applying back pressure and rolling reboots (one node at a
    time so as not to lose quorum), then contacted our cloud service provider to discuss
    the issue. We also explored options like cross-AZ configurations and resizing our VMs
    to have more CPU resources.
    What Worked: The above helped to an extent, but did not completely resolve the
    issue.
    What Didn't: Initial attempts to resize the VM were unsuccessful due to capacity
    constraints with our service provider.
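
    For illustration, a rough Ruby sketch of how steal time can be read on a Linux guest
    outside of a full monitoring stack (the sampling interval is arbitrary): /proc/stat exposes
    cumulative CPU counters, and the eighth value on the cpu line is time stolen by the
    hypervisor.

    # Sample /proc/stat twice and report the share of CPU time stolen by the
    # hypervisor between the two samples (Linux guests only).
    def cpu_counters
      values = File.read("/proc/stat").lines.first.split[1..].map(&:to_i)
      # order: user nice system idle iowait irq softirq steal guest guest_nice
      { total: values.sum, steal: values[7] }
    end

    before = cpu_counters
    sleep 5
    after = cpu_counters

    steal_pct = 100.0 * (after[:steal] - before[:steal]) /
                (after[:total] - before[:total])
    puts format("CPU steal over the last 5s: %.2f%%", steal_pct)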

  17. Lessons from Case Study 3
    This incident shed light on the nuances of working with virtualized systems.
    Monitor the Right Metrics: Beyond traditional performance metrics, some metrics are
    unique to virtual environments. We incorporated CPU steal time into our routine
    monitoring.
    Instance Optimization: RabbitMQ performs best on instances with high disk I/O
    throughput; switching to such instances could significantly improve its performance.
    We learned to choose our resources based on the specific needs of our applications.
    Flexible Infrastructure: We recognized the need for a more scalable infrastructure.
    This led us to consider alternatives such as containerization with Kubernetes to better
    manage resources.

  18. Case Study 4: OpenSearch cluster outage
    Like most outages, this one started with a call from a client: their application was down
    and its search functionality was not working.
    The System: The primary feature of the application was search, and it was powered by
    an OpenSearch cluster.
    The Failure: The master node of the cluster was down and quorum was lost.
    The Impact: The search functionality was down for a few hours.

  19. Analysis of Case Study 4
    Root Cause: The configuration being used was not suitable for the scale and size of
    the cluster.
    Response: I increased the size of the nodes as well as the memory (JVM heap)
    available to Java.
    What Worked: Increasing the size of the nodes and the memory available to Java fixed
    the problem.
    What Didn't: The configuration being used was not the one intended for production;
    OpenSearch wasn't picking it up, so it fell back to a similarly named config file. A basic
    cluster-health probe, as sketched below, could have made the degraded state visible
    much sooner.
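
    As a sketch of that kind of basic check, the following Ruby snippet polls OpenSearch's
    _cluster/health REST endpoint; the host, port, and lack of authentication are
    simplifications.

    require "net/http"
    require "json"
    require "uri"

    # Minimal health probe against the cluster health API.
    # Host/port are placeholders; add auth and TLS as the cluster requires.
    health = JSON.parse(Net::HTTP.get(URI("http://localhost:9200/_cluster/health")))

    status = health["status"]           # "green", "yellow" or "red"
    nodes  = health["number_of_nodes"]

    puts "cluster status=#{status}, nodes=#{nodes}"
    warn "ALERT: cluster is not green" unless status == "green"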

  20. Lessons from Case Study 4
    Again, this incident shows how much care we need to take as SREs and how important it
    is to have a good monitoring system in place.
    Configuration: We started monitoring for configuration changes.
    Automation: We started automating the deployment of our OpenSearch clusters
    (which was slightly more painful than one would expect).

  21. Broader Learnings for SREs and Organizations
    These experiences taught us broader lessons applicable to all SREs and organizations:
    1. Preparedness: Always be prepared for a system failure. Regular load tests and system
    checks can prevent surprises.
    2. Automation: The more you can automate, the less likely it is that human error will
    cause system failures.
    3. Transparency: Encourage a culture where teams feel comfortable reporting potential
    issues. This leads to early detection and mitigation.
    4. Continuous Learning: Embrace failures as learning opportunities. Regular
    post-mortem reviews are crucial.
    5. Resilience: Build systems with failure in mind to ensure they can recover quickly when
    incidents occur.

  22. Q&A
    We are now open to any questions you might have. Please feel free to laugh or ask about
    any aspect of site reliability engineering, incident management, or anything else that's
    been covered in this presentation.

  23. Conclusion
    In conclusion, system failures are costly and disruptive, but they also provide valuable
    opportunities for learning and growth. By examining our failures, we can improve our
    systems, our processes, and ourselves.
    Thanks for your time and attention.

  24. Because you could 'live log' with your failure, where
    does that bring you? Back to rm -rf
    me!
