Database Troubleshooting

September 17, 2024 From the Trenches: Database Troubleshooting

Database Reliability Engineering MAA ⁘ RAC ⁘ RMAN Data Guard
⁘ Sharding ⁘ Partitioning Information Lifecycle Management Exadata & Engineered Systems Database Modernization Upgrades ⁘ Patching ⁘ Migrations Cloud ⁘ Hybrid Automation DevOps ⁘ IaC ⁘ Containers ⁘ Terraform Vagrant ⁘ Ansible Observability AHF ⁘ TFA ⁘ CHA ⁘ CHM

www.viscosityna.com @ViscosityNA Oracle on Docker Running Oracle Databases in Linux
Containers Free sample chapter: https://oraclesean.com

www.viscosityna.com @ViscosityNA

www.viscosityna.com @ViscosityNA "Put out the fire now! Find the cause
later."

www.viscosityna.com @ViscosityNA firefighting is emotional. troubleshooting is logical.

www.viscosityna.com @ViscosityNA right brain emotion logic left brain

www.viscosityna.com @ViscosityNA slow left brain right brain fast

www.viscosityna.com @ViscosityNA Stress Inoculation

Practiced skills become autonomous, releasing cognitive resources.

Fixes vs. solutions Fixes: • often quick and undocumented (I
know what's wrong!) • alter system state, make it dif fi cult to discover the original condition and confuse later troubleshooting • can spiral out of control

Fixes vs. solutions Fixes: • don't address root causes •
may introduce unintended effects • are perceived as faster but often take longer than deliberate efforts toward a solution

First moves

First moves • take a deep breath • list what
you know: • add as you learn more about the problem • determine the impact—who and what are affected • establish implications to SLAs • re fi ne the scope • centralize information • create dedicated Teams/Slack channels • regularly share updates, observations

First moves Multiple systems/symptoms: • what they have in common
may be a clue • prioritize the issue that is most likely the root cause • share responsibilities and coordinate efforts across a team

Leadership responsibilities

Leadership responsibilities Resist the urge to micromanage! • don't interfere—let
people work! • support and protect the team • assign a liaison for all communications—no exceptions! • agree on an update cycle • organize breaks, snacks, drinks • plan early for shift work/relief teams

Preserve evidence

Preserve evidence • create a new directory dedicated to the
issue • make copies of logs, directories, con fi gurations, etc. • document every change!

Ask questions

Ask questions • what changed? • when did it start?
• who discovered it? • who/what does it affect? • how do we reproduce it? • what's been done so far? • if intermittent, how often or what is the timing? • is it related to or dependent on something else?

Ask questions Eliminate ambiguities • "It's slow." How slow? •
"It started recently." When, exactly? • "It's been like this for a while." For how long? • "It only happens sometimes." What do events have in common? Establish expectations • How long should it take?

Troubleshooting techniques

Logs are friends Oracle is a well-instrumented application. • Read
the logs. • No, really. Read the logs. • Look for errors in the minutes/hours/days before the incident • Look for recent changes • grep the diagnostic directory for similar/related errors/entries

Logs are friends A log entry identifying the problem is
present in ~75% of issues I see

Believe only what you can prove • "It's not X.
I checked." Really? Show me proof. • "It can't possibly be X!" It's... probably X! • "We didn't change anything." Except for... • "That change is unrelated." It's probably that change

Proof isn't always proof • ...when queries are incorrect or
based on assumptions; • ...when queries come from a blog post; • ...when using duplicates or "improvements" of built-in instrumentation. You can't prove a negative! • "If that were true, we'd get an alert." • "We've never had that problem before."

Proof isn't always proof We're blind to facts that contradict
our cognitive bias.

Reproduce the issue Test system guidelines: • Identical (or nearly
identical) con fi gurations and topologies You can't test a RAC issue on a single-node system. • Populated with representative data Performance problems in 1M row tables won't show in small samples. • Similar visibility to production Stakeholders and monitoring tools need access to duplicate results.

Test systems To be useful, test systems, tools & utilities
need to exist before the problem occurs.

I'm stuck!

What to do when you're stuck • Read the documentation
• Recruit a second set of eyes • Ask questions • Diagram the problem or draw a picture • Take a break

Database tools • Autonomous Health Framework (AHF) • changes, events,
analyze, tail, param • diagcollect • OraCHK/ExaCHK • Remote Diagnostic Assistant (RDA) • Enterprise Manager (OEM) • EDB360 (https://carlos-sierra.net) • ShellCheck online (https://www.shellcheck.net) install (https://github.com/koalaman/shellcheck)

Database tables • dba_errors • dba_source • dba_autotask* • dba_scheduler*
• v$rman_output • v$rman_status • v$session* • v$active_session_* • v$backup_* • v$datafile* • v$parameter* • v$spparameter

OS tools & utilities • Read the documentation • Recruit
a second set of eyes • Ask questions • Diagram the problem or draw a picture • Take a break

www.viscosityna.com @ViscosityNA tee, >, >> diff curl, wget, mailx mktemp
ps wc date du, df iostat, vmstat, sar, etc. env | sort history cat more, less tail, head, watch, strace grep awk sed Regular expressions OS tools & utilities

www.viscosityna.com @ViscosityNA suffix=$(date -u +'%Y-%m-%dT%H:%M:%SZ') suffix=$(date '+%Y%m%d%H%M') logfile=$(mktemp -t $(hostname)_${suffix}.XXXX.out)
printf "%-10s\n" "$var" printf "%s\n" "$(date '+%F %T')" >> $__logfile OS tools & utilities

www.viscosityna.com @ViscosityNA export NLS_DATE_FORMAT='YYYY-MM-DD HH24:MI:SS' bash: set -e: exit on
non-zero return code set -u: treat unset variables as errors set -x: generate verbose output Environmental setup

www.viscosityna.com @ViscosityNA alter session set nls_date_format='YYYY-MM-DD HH24:MI:SS'; select sys_context() --
Write messages to the alert log: dbms_system.ksdwrt(2, to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS') || ' My text'); SQL*Plus setup

www.viscosityna.com @ViscosityNA set pages 9999 set lines 200 set timing
on set time on -- Format columns col member format a20 col member for a20 SQL*Plus setup

Track your work

Track your work • Create a dedicated directory • Add
only! Never overwrite anything! • Use timestamps in the fi lename • Log everything • Redirect all output to a fi le • >>, tee -a • Add times strategically throughout • Too much information is better than not enough • Don't assume—capture basic information (environment, settings, etc)

Finding answers

Finding answers • Chrome: Return 100 AI-free results by adding
to its search engine settings: {google:baseURL}search?q=%s&{google:RLZ} {google:originalQueryForSuggestion} {google:assistedQueryStats}{google:searchFieldtrialParameter} {google:language}{google:prefetchSource}{google:searchClient} {google:sourceId}{google:contextualSearchVersion} ie={inputEncoding}&num=100

Finding answers • Not everything you read on the internet
is true. • Being repeated on multiple blogs doesn't make it accurate. • Does it apply to: • your situation? • your version? • your OS? • Be cautious of "silver bullet" fi xes.

Finding answers • Wrapping terms in quotes forces them into
the results. • These are not the same: • ORA-600 kdspf 4194 • ORA-600 "kdspf" "4194"

Finding answers Always include an AHF diagnostic collection when raising
Service Requests!

Occam's Razor The simplest, most elegant explanation is usually the
one closest to the truth.

Errors are errors. There is no such thing as "acceptable
errors" in production environments.

www.viscosityna.com @ViscosityNA Questions [email protected] https://linktr.ee/oraclesean

Database Troubleshooting

Database Troubleshooting

More Decks by Sean Scott

Other Decks in Technology

Featured

Transcript