Slide 1

Slide 1 text

September 17, 2024 From the Trenches: Database Troubleshooting

Slide 2

Slide 2 text

Database Reliability Engineering MAA ⁘ RAC ⁘ RMAN Data Guard ⁘ Sharding ⁘ Partitioning Information Lifecycle Management Exadata & Engineered Systems Database Modernization Upgrades ⁘ Patching ⁘ Migrations Cloud ⁘ Hybrid Automation DevOps ⁘ IaC ⁘ Containers ⁘ Terraform Vagrant ⁘ Ansible Observability AHF ⁘ TFA ⁘ CHA ⁘ CHM

Slide 3

Slide 3 text

www.viscosityna.com @ViscosityNA Oracle on Docker Running Oracle Databases in Linux Containers Free sample chapter: https://oraclesean.com

Slide 4

Slide 4 text

www.viscosityna.com @ViscosityNA

Slide 5

Slide 5 text

www.viscosityna.com @ViscosityNA "Put out the fire now! Find the cause later."

Slide 6

Slide 6 text

www.viscosityna.com @ViscosityNA firefighting is emotional. troubleshooting is logical.

Slide 7

Slide 7 text

www.viscosityna.com @ViscosityNA right brain emotion logic left brain

Slide 8

Slide 8 text

www.viscosityna.com @ViscosityNA slow left brain right brain fast

Slide 9

Slide 9 text

www.viscosityna.com @ViscosityNA Stress Inoculation

Slide 10

Slide 10 text

Practiced skills become autonomous, releasing cognitive resources.

Slide 11

Slide 11 text

Fixes vs. solutions Fixes: • often quick and undocumented (I know what's wrong!) • alter system state, make it dif fi cult to discover the original condition and confuse later troubleshooting • can spiral out of control

Slide 12

Slide 12 text

Fixes vs. solutions Fixes: • don't address root causes • may introduce unintended effects • are perceived as faster but often take longer than deliberate efforts toward a solution

Slide 13

Slide 13 text

First moves

Slide 14

Slide 14 text

First moves • take a deep breath • list what you know: • add as you learn more about the problem • determine the impact—who and what are affected • establish implications to SLAs • re fi ne the scope • centralize information • create dedicated Teams/Slack channels • regularly share updates, observations

Slide 15

Slide 15 text

First moves Multiple systems/symptoms: • what they have in common may be a clue • prioritize the issue that is most likely the root cause • share responsibilities and coordinate efforts across a team

Slide 16

Slide 16 text

Leadership responsibilities

Slide 17

Slide 17 text

Leadership responsibilities Resist the urge to micromanage! • don't interfere—let people work! • support and protect the team • assign a liaison for all communications—no exceptions! • agree on an update cycle • organize breaks, snacks, drinks • plan early for shift work/relief teams

Slide 18

Slide 18 text

Preserve evidence

Slide 19

Slide 19 text

Preserve evidence • create a new directory dedicated to the issue • make copies of logs, directories, con fi gurations, etc. • document every change!

Slide 20

Slide 20 text

Ask questions

Slide 21

Slide 21 text

Ask questions • what changed? • when did it start? • who discovered it? • who/what does it affect? • how do we reproduce it? • what's been done so far? • if intermittent, how often or what is the timing? • is it related to or dependent on something else?

Slide 22

Slide 22 text

Ask questions Eliminate ambiguities • "It's slow." How slow? • "It started recently." When, exactly? • "It's been like this for a while." For how long? • "It only happens sometimes." What do events have in common? Establish expectations • How long should it take?

Slide 23

Slide 23 text

Troubleshooting techniques

Slide 24

Slide 24 text

Logs are friends Oracle is a well-instrumented application. • Read the logs. • No, really. Read the logs. • Look for errors in the minutes/hours/days before the incident • Look for recent changes • grep the diagnostic directory for similar/related errors/entries

Slide 25

Slide 25 text

Logs are friends A log entry identifying the problem is present in ~75% of issues I see

Slide 26

Slide 26 text

Believe only what you can prove • "It's not X. I checked." Really? Show me proof. • "It can't possibly be X!" It's... probably X! • "We didn't change anything." Except for... • "That change is unrelated." It's probably that change

Slide 27

Slide 27 text

Proof isn't always proof • ...when queries are incorrect or based on assumptions; • ...when queries come from a blog post; • ...when using duplicates or "improvements" of built-in instrumentation. You can't prove a negative! • "If that were true, we'd get an alert." • "We've never had that problem before."

Slide 28

Slide 28 text

Proof isn't always proof We're blind to facts that contradict our cognitive bias.

Slide 29

Slide 29 text

Reproduce the issue Test system guidelines: • Identical (or nearly identical) con fi gurations and topologies You can't test a RAC issue on a single-node system. • Populated with representative data Performance problems in 1M row tables won't show in small samples. • Similar visibility to production Stakeholders and monitoring tools need access to duplicate results.

Slide 30

Slide 30 text

Test systems To be useful, test systems, tools & utilities need to exist before the problem occurs.

Slide 31

Slide 31 text

I'm stuck!

Slide 32

Slide 32 text

What to do when you're stuck • Read the documentation • Recruit a second set of eyes • Ask questions • Diagram the problem or draw a picture • Take a break

Slide 33

Slide 33 text

Tools

Slide 34

Slide 34 text

Database tools • Autonomous Health Framework (AHF) • changes, events, analyze, tail, param • diagcollect • OraCHK/ExaCHK • Remote Diagnostic Assistant (RDA) • Enterprise Manager (OEM) • EDB360 (https://carlos-sierra.net) • ShellCheck online (https://www.shellcheck.net) install (https://github.com/koalaman/shellcheck)

Slide 35

Slide 35 text

Database tables • dba_errors • dba_source • dba_autotask* • dba_scheduler* • v$rman_output • v$rman_status • v$session* • v$active_session_* • v$backup_* • v$datafile* • v$parameter* • v$spparameter

Slide 36

Slide 36 text

OS tools & utilities • Read the documentation • Recruit a second set of eyes • Ask questions • Diagram the problem or draw a picture • Take a break

Slide 37

Slide 37 text

www.viscosityna.com @ViscosityNA tee, >, >> diff curl, wget, mailx mktemp ps wc date du, df iostat, vmstat, sar, etc. env | sort history cat more, less tail, head, watch, strace grep awk sed Regular expressions OS tools & utilities

Slide 38

Slide 38 text

www.viscosityna.com @ViscosityNA suffix=$(date -u +'%Y-%m-%dT%H:%M:%SZ') suffix=$(date '+%Y%m%d%H%M') logfile=$(mktemp -t $(hostname)_${suffix}.XXXX.out) printf "%-10s\n" "$var" printf "%s\n" "$(date '+%F %T')" >> $__logfile OS tools & utilities

Slide 39

Slide 39 text

www.viscosityna.com @ViscosityNA export NLS_DATE_FORMAT='YYYY-MM-DD HH24:MI:SS' bash: set -e: exit on non-zero return code set -u: treat unset variables as errors set -x: generate verbose output Environmental setup

Slide 40

Slide 40 text

www.viscosityna.com @ViscosityNA alter session set nls_date_format='YYYY-MM-DD HH24:MI:SS'; select sys_context() -- Write messages to the alert log: dbms_system.ksdwrt(2, to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS') || ' My text'); SQL*Plus setup

Slide 41

Slide 41 text

www.viscosityna.com @ViscosityNA set pages 9999 set lines 200 set timing on set time on -- Format columns col member format a20 col member for a20 SQL*Plus setup

Slide 42

Slide 42 text

Track your work

Slide 43

Slide 43 text

Track your work • Create a dedicated directory • Add only! Never overwrite anything! • Use timestamps in the fi lename • Log everything • Redirect all output to a fi le • >>, tee -a • Add times strategically throughout • Too much information is better than not enough • Don't assume—capture basic information (environment, settings, etc)

Slide 44

Slide 44 text

Finding answers

Slide 45

Slide 45 text

Finding answers • Chrome: Return 100 AI-free results by adding to its search engine settings: {google:baseURL}search?q=%s&{google:RLZ} {google:originalQueryForSuggestion} {google:assistedQueryStats}{google:searchFieldtrialParameter} {google:language}{google:prefetchSource}{google:searchClient} {google:sourceId}{google:contextualSearchVersion} ie={inputEncoding}&num=100

Slide 46

Slide 46 text

Finding answers • Not everything you read on the internet is true. • Being repeated on multiple blogs doesn't make it accurate. • Does it apply to: • your situation? • your version? • your OS? • Be cautious of "silver bullet" fi xes.

Slide 47

Slide 47 text

Finding answers • Wrapping terms in quotes forces them into the results. • These are not the same: • ORA-600 kdspf 4194 • ORA-600 "kdspf" "4194"

Slide 48

Slide 48 text

Finding answers Always include an AHF diagnostic collection when raising Service Requests!

Slide 49

Slide 49 text

Occam's Razor The simplest, most elegant explanation is usually the one closest to the truth.

Slide 50

Slide 50 text

Errors are errors. There is no such thing as "acceptable errors" in production environments.

Slide 51

Slide 51 text

www.viscosityna.com @ViscosityNA Questions [email protected] https://linktr.ee/oraclesean

Slide 52

Slide 52 text

No content