know what's wrong!)
• alter system state, make it difficult to discover the original condition and confuse later troubleshooting
• can spiral out of control
you know:
• add as you learn more about the problem
• determine the impact: who and what are affected
• establish implications to SLAs
• refine the scope
• centralize information
• create dedicated Teams/Slack channels
• regularly share updates, observations
people work!
• support and protect the team
• assign a liaison for all communications, no exceptions!
• agree on an update cycle
• organize breaks, snacks, drinks
• plan early for shift work/relief teams
• who discovered it?
• who/what does it affect?
• how do we reproduce it?
• what's been done so far?
• if intermittent, how often or what is the timing?
• is it related to or dependent on something else?
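For the "how often or what is the timing?" question, a rough count of matching log lines per hour can turn "sometimes" into numbers. The following is a minimal sketch, not a definitive tool: the function name `errors_per_hour` is hypothetical, and it assumes every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, which may not match your logs.

```shell
# errors_per_hour FILE PATTERN
# Count lines matching PATTERN, grouped by date and hour.
# Assumption: each line begins with "YYYY-MM-DD HH:MM:SS".
errors_per_hour() {
    grep -E -- "$2" "$1" |
        awk '{ print $1, substr($2, 1, 2) ":00" }' |  # keep date and hour only
        sort |
        uniq -c                                       # count per date/hour bucket
}
```

Called as `errors_per_hour /path/to/app.log 'ERROR|FATAL'` (path and pattern are placeholders), it prints one line per hour that had at least one match, which makes clusters around a particular time easier to spot.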
"It started recently." When, exactly? • "It's been like this for a while." For how long? • "It only happens sometimes." What do events have in common? Establish expectations • How long should it take?
the logs.
• No, really. Read the logs.
• Look for errors in the minutes/hours/days before the incident
• Look for recent changes
• grep the diagnostic directory for similar/related errors/entries
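The log-reading steps above can be sketched with a few small shell functions. This is only an illustration: the function names are hypothetical, the directory and pattern you pass in depend on your system, and it assumes a `grep` that supports `-r` and a `find` that supports `-mmin` (both GNU and BSD versions do, but neither option is in POSIX).

```shell
# Sketch only: function names are hypothetical helpers, not standard tools.

# search_diag DIR PATTERN
# Recursively list matching lines, prefixed with file name and line number.
search_diag() {
    grep -rnE -- "$2" "$1"
}

# count_by_file DIR PATTERN
# Show how many matching lines each file contains, skipping files with none.
count_by_file() {
    grep -rcE -- "$2" "$1" | grep -v ':0$'
}

# recently_changed DIR MINUTES
# List files modified within the last MINUTES minutes (a hint at recent changes).
recently_changed() {
    find "$1" -type f -mmin "-$2"
}
```

For example, `search_diag /path/to/diag 'ERROR|FATAL'` followed by `recently_changed /path/to/diag 60` shows which entries match and which files were touched in the last hour; both the path and the pattern are placeholders.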
based on assumptions;
• ...when queries come from a blog post;
• ...when using duplicates or "improvements" of built-in instrumentation.
You can't prove a negative!
• "If that were true, we'd get an alert."
• "We've never had that problem before."
identical) configurations and topologies
  You can't test a RAC issue on a single-node system.
• Populated with representative data
  Performance problems in 1M-row tables won't show in small samples.
• Similar visibility to production
  Stakeholders and monitoring tools need access to duplicate results.
OS tools & utilities
• ps
• wc
• date
• du, df
• iostat, vmstat, sar, etc.
• env | sort
• history
• cat
• more, less
• tail, head, watch, strace
• grep
• awk
• sed
• Regular expressions
only! Never overwrite anything!
• Use timestamps in the filename
• Log everything
• Redirect all output to a file
  • >>, tee -a
• Add times strategically throughout
• Too much information is better than not enough
• Don't assume; capture basic information (environment, settings, etc.)
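The logging habits above can be sketched as a small shell preamble. Treat it as one possible arrangement rather than a standard: the `LOG_DIR` variable, the file naming scheme, and the `log_note` and `run_logged` helpers are all hypothetical.

```shell
# Timestamped log name so an earlier log is never overwritten.
# LOG_DIR and the naming scheme are hypothetical; $$ (the shell's PID)
# keeps two sessions started in the same second apart.
LOG="${LOG_DIR:-.}/troubleshoot_$(date +%Y%m%d_%H%M%S)_$$.log"

# log_note TEXT...
# Append a timestamped note to the log (>> appends, never truncates).
log_note() {
    printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG"
}

# run_logged CMD [ARGS...]
# Record the command line, then append its stdout and stderr to the log
# while still showing them on screen. Note that the pipeline returns
# tee's exit status, not the command's.
run_logged() {
    log_note "RUN: $*"
    "$@" 2>&1 | tee -a "$LOG"
}

# Capture basic information up front instead of assuming it.
log_note "Session start on $(uname -n) as $(id -un)"
run_logged uname -a
run_logged sh -c 'env | sort'
log_note "Session end"
```

Wrapping each diagnostic command in `run_logged` keeps the order of events and their timestamps in a single file. Be aware that `env | sort` may write sensitive values into the log, so store it accordingly.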
is true.
• Being repeated on multiple blogs doesn't make it accurate.
• Does it apply to:
  • your situation?
  • your version?
  • your OS?
• Be cautious of "silver bullet" fixes.