Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Database Troubleshooting

Sean Scott
September 17, 2024

Database Troubleshooting

Sean Scott

September 17, 2024
Tweet

More Decks by Sean Scott

Other Decks in Technology

Transcript

  1. Database Reliability Engineering MAA ⁘ RAC ⁘ RMAN Data Guard

    ⁘ Sharding ⁘ Partitioning Information Lifecycle Management Exadata & Engineered Systems Database Modernization Upgrades ⁘ Patching ⁘ Migrations Cloud ⁘ Hybrid Automation DevOps ⁘ IaC ⁘ Containers ⁘ Terraform Vagrant ⁘ Ansible Observability AHF ⁘ TFA ⁘ CHA ⁘ CHM
  2. www.viscosityna.com @ViscosityNA Oracle on Docker Running Oracle Databases in Linux

    Containers Free sample chapter: https://oraclesean.com
  3. Fixes vs. solutions Fixes: • often quick and undocumented (I

    know what's wrong!) • alter system state, make it dif fi cult to discover the original condition and confuse later troubleshooting • can spiral out of control
  4. Fixes vs. solutions Fixes: • don't address root causes •

    may introduce unintended effects • are perceived as faster but often take longer than deliberate efforts toward a solution
  5. First moves • take a deep breath • list what

    you know: • add as you learn more about the problem • determine the impact—who and what are affected • establish implications to SLAs • re fi ne the scope • centralize information • create dedicated Teams/Slack channels • regularly share updates, observations
  6. First moves Multiple systems/symptoms: • what they have in common

    may be a clue • prioritize the issue that is most likely the root cause • share responsibilities and coordinate efforts across a team
  7. Leadership responsibilities Resist the urge to micromanage! • don't interfere—let

    people work! • support and protect the team • assign a liaison for all communications—no exceptions! • agree on an update cycle • organize breaks, snacks, drinks • plan early for shift work/relief teams
  8. Preserve evidence • create a new directory dedicated to the

    issue • make copies of logs, directories, con fi gurations, etc. • document every change!
  9. Ask questions • what changed? • when did it start?

    • who discovered it? • who/what does it affect? • how do we reproduce it? • what's been done so far? • if intermittent, how often or what is the timing? • is it related to or dependent on something else?
  10. Ask questions Eliminate ambiguities • "It's slow." How slow? •

    "It started recently." When, exactly? • "It's been like this for a while." For how long? • "It only happens sometimes." What do events have in common? Establish expectations • How long should it take?
  11. Logs are friends Oracle is a well-instrumented application. • Read

    the logs. • No, really. Read the logs. • Look for errors in the minutes/hours/days before the incident • Look for recent changes • grep the diagnostic directory for similar/related errors/entries
  12. Believe only what you can prove • "It's not X.

    I checked." Really? Show me proof. • "It can't possibly be X!" It's... probably X! • "We didn't change anything." Except for... • "That change is unrelated." It's probably that change
  13. Proof isn't always proof • ...when queries are incorrect or

    based on assumptions; • ...when queries come from a blog post; • ...when using duplicates or "improvements" of built-in instrumentation. You can't prove a negative! • "If that were true, we'd get an alert." • "We've never had that problem before."
  14. Reproduce the issue Test system guidelines: • Identical (or nearly

    identical) con fi gurations and topologies You can't test a RAC issue on a single-node system. • Populated with representative data Performance problems in 1M row tables won't show in small samples. • Similar visibility to production Stakeholders and monitoring tools need access to duplicate results.
  15. Test systems To be useful, test systems, tools & utilities

    need to exist before the problem occurs.
  16. What to do when you're stuck • Read the documentation

    • Recruit a second set of eyes • Ask questions • Diagram the problem or draw a picture • Take a break
  17. Database tools • Autonomous Health Framework (AHF) • changes, events,

    analyze, tail, param • diagcollect • OraCHK/ExaCHK • Remote Diagnostic Assistant (RDA) • Enterprise Manager (OEM) • EDB360 (https://carlos-sierra.net) • ShellCheck online (https://www.shellcheck.net) install (https://github.com/koalaman/shellcheck)
  18. Database tables • dba_errors • dba_source • dba_autotask* • dba_scheduler*

    • v$rman_output • v$rman_status • v$session* • v$active_session_* • v$backup_* • v$datafile* • v$parameter* • v$spparameter
  19. OS tools & utilities • Read the documentation • Recruit

    a second set of eyes • Ask questions • Diagram the problem or draw a picture • Take a break
  20. www.viscosityna.com @ViscosityNA tee, >, >> diff curl, wget, mailx mktemp

    ps wc date du, df iostat, vmstat, sar, etc. env | sort history cat more, less tail, head, watch, strace grep awk sed Regular expressions OS tools & utilities
  21. www.viscosityna.com @ViscosityNA export NLS_DATE_FORMAT='YYYY-MM-DD HH24:MI:SS' bash: set -e: exit on

    non-zero return code set -u: treat unset variables as errors set -x: generate verbose output Environmental setup
  22. www.viscosityna.com @ViscosityNA alter session set nls_date_format='YYYY-MM-DD HH24:MI:SS'; select sys_context() --

    Write messages to the alert log: dbms_system.ksdwrt(2, to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS') || ' My text'); SQL*Plus setup
  23. www.viscosityna.com @ViscosityNA set pages 9999 set lines 200 set timing

    on set time on -- Format columns col member format a20 col member for a20 SQL*Plus setup
  24. Track your work • Create a dedicated directory • Add

    only! Never overwrite anything! • Use timestamps in the fi lename • Log everything • Redirect all output to a fi le • >>, tee -a • Add times strategically throughout • Too much information is better than not enough • Don't assume—capture basic information (environment, settings, etc)
  25. Finding answers • Chrome: Return 100 AI-free results by adding

    to its search engine settings: {google:baseURL}search?q=%s&{google:RLZ} {google:originalQueryForSuggestion} {google:assistedQueryStats}{google:searchFieldtrialParameter} {google:language}{google:prefetchSource}{google:searchClient} {google:sourceId}{google:contextualSearchVersion} ie={inputEncoding}&num=100
  26. Finding answers • Not everything you read on the internet

    is true. • Being repeated on multiple blogs doesn't make it accurate. • Does it apply to: • your situation? • your version? • your OS? • Be cautious of "silver bullet" fi xes.
  27. Finding answers • Wrapping terms in quotes forces them into

    the results. • These are not the same: • ORA-600 kdspf 4194 • ORA-600 "kdspf" "4194"
  28. Errors are errors. There is no such thing as "acceptable

    errors" in production environments.