Autonomous Health Framework

Autonomous Health Framework: How to Use Your Database "Swiss Army
Knife” (Without Poking an Eye Out)

@ViscosityNA www.viscosityna.com www.viscosityna.com @ViscosityNA Sean Scott Working with Oracle technology
since 1995   Development ⁘ DBA ⁘ Reliability Engineering ⁘ DevOps   Oracle OpenWorld ⁘ Collaborate/IOUG ⁘ Regional UG RAC/MAA ⁘ Data Guard ⁘ Sharding ⁘ Exadata/ODA   Diagnostic Tools (AHF, TFA, RDA, CHA, CHM)   DR, HA, Site Reliability/Continuity   Upgrade ⁘ Migration ⁘ Cloud DevOps ⁘ Infrastructure as Code ⁘ Automation   Containers ⁘ Virtualization

"My experience with AHF/TFA is..." a) Hang on, spell that
'AHF' thing one more time. b) Used it for an SR once—no lingering trauma. c) Used it extensively—currently seeking therapy. d) Dude, I could be the PM. e) Whatever, someone said there were GIFs.

You need AHF

You need AHF • AHF diagnostic collections required by MOS
for some SR • Diagnostic collections accelerate SR resolution • Cluster-aware ADR log inspection and management • Advanced system and log monitoring • Incident control and notification • Connect to MOS • SMTP, REST APIs

You need AHF • Built-in Oracle tools: • ORAchk/EXAchk •
OS Watcher • Cluster Verification Utility (CVU) • Hang Manager • Diagnostic Assistant

You need AHF • Integrated with: • Database • ASM
and Clusterware • Automatic Diagnostic Repository (ADR) • Grid Infrastructure Management Repository (GIMR) • Cluster Health Advisor (CHA) & Cluster Health Monitor (CHM) • Enterprise Manager

You need AHF • Cluster aware: • Run commands for
all, some nodes • Cross-node configuration and file inspection • Central management for ADR • Consolidated diagnostic collection

You need AHF • Over 800 health checks • 400
identified as critical/failures • Severe problem check daily: 2AM • All known problem check weekly: 3AM Sunday • Auto-generates a collection when problems detected • Everything required to diagnose & resolve • Results delivered to the notification email

AHF is FREE!

References & Availability

AHF MOS Master Document: 2550798.1 • Briefly 2832630.1, 2832594.1 •
Downloads available for • Linux, ZLinux • Solaris x86/SPARC64 • HPUX • AIX 6/7 • Win 64-bit

AHF MOS Master Document: 2550798.1 • Major release each quarter
• Typically follows DBRU schedule • Naming convention is year, quarter, release: YY.Q.R • 21.4.0, 21.4.1 • Latest version: 22.1.1 (as of 2022-06-18) • Intermediate releases are common!

Installation

Types of installs: Daemon or root • Recommended method •
Cluster awareness • Full AHF capabilities • Includes compliance checks • Enables notifications • Automatic diagnostic collection when issues are detected • May conflict with existing AHF/TFA installations

Types of installs: Local or non-root • Reduced feature set
• No automatic or remote diagnostics, collections • Limited file visibility (must be readable by Oracle home owner) • /var/log/messages • Some Grid Infrastructure logs • May co-exist with Daemon installations • No special pre-install considerations

Recommendation #0 Don’t follow the instructions

A brief history lesson… • There are two flavors of
TFA • A version downloaded from MOS • A version included in Grid Infrastructure install & patches • GI version is not fully featured • GI and MOS versions can interfere, conflict

Install AHF • Oracle’s instructions work when things are perfect
• Systems are rarely perfect! • AHF and TFA are known for certain… ahem, peculiarities

Recommendation #1 Remove existing AHF/TFA before install

TFA pre-installation checks # Uninstall TFA (as root) tfactl uninstall
# Check for existing AHF/TFA installs which tfactl which ahfctl

TFA pre-installation checks # Locate leftover setup configuration files find
/ -name tfa_setup.txt # Verify files are removed find / -name tfactl find / -name startOSWbb.sh

TFA pre-installation checks # Remove legacy/existing AHF/TFA installations for d
in $(find / -name uninstalltfa) do cd $(dirname $d) ./tfactl uninstall # cd .. && rm -fr . done # Insure ALL AHF/TFA processes are stopped/inactive prior to uninstall # PERFORM THIS STEP ON ALL NODES

Post-installation sanity checks

Command line tools: ahfctl and tfactl $ tfactl <command> <options>
- or - $ tfactl tfactl> <command> <options> $ tfactl help $ tfactl <command> help $ ahfctl <command> <options> - or - $ ahfctl ahfctl> <command> <options> $ ahfctl help $ ahfctl <command> help

Post-install checks ahfctl version tfactl status ahfctl statusahf tfactl toolstatus
tfactl print hosts tfactl print components tfactl print protocols tfactl print config -node all

status vs statusahf [root@node1 ~]# tfactl status .---------------------------------------------------------------------------------------------. | Host
| Status of TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | | node2 | RUNNING | 30339 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' [root@node1 ~]#

[root@node1 ~]# tfactl statusahf .---------------------------------------------------------------------------------------------. | Host | Status of
TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | | node2 | RUNNING | 30339 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' ------------------------------------------------------------ Master node = node1 orachk daemon version = 214100 Install location = /opt/oracle.ahf/orachk Started at = Wed Feb 02 20:50:12 GMT 2022 Scheduler type = TFA Scheduler Scheduler PID: 28883 ... status vs statusahf

------------------------------------------------------------ ID: orachk.autostart_client_oratier1 ------------------------------------------------------------ AUTORUN_FLAGS = -usediscovery -profile oratier1 -dball
-showpass -tag autostart_client_oratier1 -readenvconfig COLLECTION_RETENTION = 7 AUTORUN_SCHEDULE = 3 2 * * 1,2,3,4,5,6 ------------------------------------------------------------ ------------------------------------------------------------ ID: orachk.autostart_client ------------------------------------------------------------ AUTORUN_FLAGS = -usediscovery -tag autostart_client -readenvconfig COLLECTION_RETENTION = 14 AUTORUN_SCHEDULE = 3 3 * * 0 ------------------------------------------------------------ Next auto run starts on Feb 03, 2022 02:03:00 ID:orachk.AUTOSTART_CLIENT_ORATIER1 statusahf option in tfactl is deprecated and will be removed in AHF 22.1.0. Please start using ahfctl for statusahf, Example: ahfctl statusahf status vs statusahf

Failed installs and upgrades Common issues

Warning remains after a successful upgrade [root@node1 ahf]# ahfctl statusahf
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade. .---------------------------------------------------------------------------------------------. | Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | | node2 | RUNNING | 24554 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' • Run ahfctl syncpatch

Not all nodes appear after upgrade [root@node1 ahf]# tfactl syncnodes
Current Node List in TFA : 1. node1 2. node2 Node List in Cluster : 1. node1 2. node2 Node List to sync TFA Certificates : 1 node2 Do you want to update this node list? Y|[N]: Syncing TFA Certificates on node2 : TFA_HOME on node2 : /opt/oracle.ahf/tfa ...

Not all nodes appear after upgrade (cont) ... TFA_HOME on
node2 : /opt/oracle.ahf/tfa DATA_DIR on node2 : /opt/oracle.ahf/data/node2/tfa Shutting down TFA on node2... Copying TFA Certificates to node2... Copying SSL Properties to node2... Sleeping for 5 seconds... Starting TFA on node2... .---------------------------------------------------------------------------------------------. | Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | | node2 | RUNNING | 30339 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' [root@node1 ahf]#

Installation and upgrade issues • Post-installation troubleshooting: • ahfctl stopahf;
ahfctl startahf • tfactl stop; tfactl start • tfactl status • ahfctl statusahf • tfactl toolstatus • tfactl syncnodes • ahfctl syncpatch

Installation and upgrade issues • Post-installation troubleshooting: • tfactl diagnosetfa
• Create an SR and upload result to MOS

Recommendation #3 Move the repository to shared storage!

RAC: Move the repository to shared storage Local Repository Local
Repository Files needed by MOS

RAC: Move the repository to shared storage Shared Repository Files
needed by MOS

• One AHF home for the Enterprise • AHF unaffected
by GI, database patching • Lower maintenance overhead • Easier to keep AHF current Coming to AHF: Standalone AHF Home (22.2?)

• One AHF home for the Enterprise • AHF unaffected
by GI, database patching • Lower maintenance overhead • Easier to keep AHF current • Centralized Collection Manager & EXAchk/ORAchk reporting • AHF Health Check added to EXAchk/ORAchk • New compliance dashboard in OEM • Greater visibility for CHA, CHM, QoS, Hang Manager, etc. Coming to AHF: Standalone AHF Home (22.2?)

Recommendation #4 Configure email notification

Set email notifications [root@node1 ~]# tfactl set [email protected] Successfully set
[email protected] .---------------------------------------------------------------------------. | node1 | +----------------------------------------------+----------------------------+ | Configuration Parameter | Value | +----------------------------------------------+----------------------------+ | Notification Address ( notificationAddress ) | [email protected] | '----------------------------------------------+----------------------------'

Set email notifications Send test email: tfactl sendmail [email protected]

AHF SMTP configuration also applies to ORAchk/EXAchk New in AHF
22.1.1

Recommendation #5 Additional configurations

Recommended configurations # Repository settings tfactl set autodiagcollect=ON # default
tfactl set trimfiles=ON # default tfactl set reposizeMB= # default=10240 tfactl set rtscan=ON # default tfactl set redact=mask # default=none # Disk space monitoring tfactl set diskUsageMon=ON # default=OFF tfactl set diskUsageMonInterval=240 # Depends on activity. default=60 # Log purge tfactl set autopurge=ON # If space is slim. default=OFF tfactl set manageLogsAutoPurge=ON # default=OFF tfactl set manageLogsAutoPurgeInterval=720 # Set to 12 hours. default=60 tfactl set manageLogsAutoPurgePolicyAge=30d # default=30 tfactl set minfileagetopurge=48 # default=12

Recommended configurations [root@node1 ~]# tfactl set manageLogsAutoPurge=ON Successfully set manageLogsAutoPurge=ON
.-------------------------------------------------------. | node1 | +-----------------------------------------------+-------+ | Configuration Parameter | Value | +-----------------------------------------------+-------+ | Managelogs Auto Purge ( manageLogsAutoPurge ) | ON | '-----------------------------------------------+-------' [root@node1 ~]# tfactl set manageLogsAutoPurgePolicyAge=30d Successfully set manageLogsAutoPurgePolicyAge=30d .-------------------------------------------------------------------------------. | node1 | +-----------------------------------------------------------------------+-------+ | Configuration Parameter | Value | +-----------------------------------------------------------------------+-------+ | Logs older than the time period will be auto purged(days[d]|hours[h]) | 30d | '-----------------------------------------------------------------------+-------'

Annoyances

Annoyances • Documentation isn’t always current • Commands, options, and
syntax may not match docs • Run tfactl <command> -h or tfactl <command> help • Some commands are user (root, oracle, grid) specific • Regression (usually minor) • Don’t build complex automation on new features • Don’t (always) rush to upgrade to the latest version • Example: GI can’t always see/manage DB & vice-versa

Annoyances • The transition from tfactl to ahfctl is incomplete
• Commands may be: • …available in both • …deprecated in tfactl • …new and unavailable in tfactl • …not ported to ahfctl (yet)

Annoyances • Date format options in commands are inconsistent •
Some require quotes, some don’t, some work either way • Some take double quotes, others take single quotes • YYYY/MM/DD or YYYY-MM-DD or YYYYMMDD or … • Some take dates and times separately • Sometimes there are -d and -t flags • Some take timestamps • Some work with either, others are specific

However… Many commands (incl. complex ones) have an -example option
[root@node1 ~]# tfactl diagcollect -examples Examples: /opt/oracle.ahf/tfa/bin/tfactl diagcollect Trim and Zip all files updated in the last 1 hours as well as chmos/osw data from across the cluster and collect at the initiating node Note: This collection could be larger than required but is there as the simplest way to capture diagnostics if an issue has recently occurred. /opt/oracle.ahf/tfa/bin/tfactl diagcollect -last 8h Trim and Zip all files updated in the last 8 hours as well as chmos/osw data from across the cluster and collect at the initiating node /opt/oracle.ahf/tfa/bin/tfactl diagcollect -database hrdb,fdb -last 1d -z foo Trim and Zip all files from databases hrdb & fdb in the last 1 day and collect at the initiating node ...

However… Many commands (incl. complex ones) have an -example option
[oracle@node1 ~]$ tfactl analyze -examples Examples: /opt/oracle.ahf/tfa/bin/tfactl analyze -since 5h Show summary of events from alert logs, system messages in last 5 hours. /opt/oracle.ahf/tfa/bin/tfactl analyze -comp os -since 1d Show summary of events from system messages in last 1 day. /opt/oracle.ahf/tfa/bin/tfactl analyze -search "ORA-" -since 2d Search string ORA- in alert and system logs in past 2 days. /opt/oracle.ahf/tfa/bin/tfactl analyze -search "/Starting/c" -since 2d Search case sensitive string "Starting" in past 2 days. /opt/oracle.ahf/tfa/bin/tfactl analyze -comp osw -since 6h Show OSWatcher Top summary in last 6 hours. ...

Recommendation #6 Get to know the diagnostic collection options

Diagnostic collections diagcollect [ [component1] [component2] ... [componenteN] | [-srdc
<srdc_profile>] | [-defips] ] [-sr <SR#>] [-node <all|local|n1,n2,..>] [-tag <tagname>] [-z <filename>] [-acrlevel <system,database,userdata>] [-last <n><m|h|d> | -from <time> -to <time> | -for <time>] [-nocopy] [-notrim] [-silent] [-cores] [-collectalldirs] [-collectdir <dir1,dir2..>] [-collectfiles <file1,..,fileN,dir1,..,dirN> [-onlycollectfiles] ]

Diagnostic collections - components diagcollect [component1] [component2] ... [componenteN] -acfs
-afd -ahf -ashhtml -ashtext -asm -awrhtml -awrtext -cfgtools -cha -crs   -crsclient -cvu -database -dataguard -dbclient -dbwlm -em -emagent -emagenti -emplugins -install   -ips -ocm -oms -omsi -os -procinfo -qos -rhp -sosreport -tns -wls

Diagnostic collections - 170+ SRDC profiles diagcollect ... -srdc <srdc_profile>
diagcollect -srdc -help <srdc_profile> can be any of the following, DBCORRUPT Required Diagnostic Data Collection for a Generic Database Corruption DBDATAGUARD Required Diagnostic Data Collection for Data Guard issues including Broker Listener_Services SRDC - Data Collection for TNS-12516 / TNS-12518 / TNS-12519 / TNS-12520. Naming_Services SRDC - Data Collection for ORA-12154 / ORA-12514 / ORA-12528. ORA-00020 SRDC for database ORA-00020 Maximum number of processes exceeded ORA-00060 SRDC for ORA-00060. Internal error code. ORA-00494 SRDC for ORA-00494. ORA-00600 SRDC for ORA-00600. Internal error code. ... ora4023 SRDC - ORA-4023 : Checklist of Evidence to Supply ora4063 SRDC - ORA-4063 : Checklist of Evidence to Supply ora445 SRDC - ORA-445 or Unable to Spawn Process: Checklist of Evidence to Supply (Doc ID 2500730.1) xdb600 SRDC - Required Diagnostic Data Collection for XDB ORA-00600 and ORA-07445 zlgeneric SRDC - Zero Data Loss Recovery Appliance (ZDLRA) Data Collection.

Diagnostic collections - Misc diagcollect ... -defips -sr <SR#> -node
<all|local|n1,n2,..> -defips Include in the default collection the IPS Packages for: ASM, CRS and Databases -sr Enter SR number to which the collection will be uploaded -node Specify comma separated list of host names for collection

Diagnostic collections - Time ranges diagcollect ... -last <n><m|h|d> -since
-from <time> -to <time> -for <time> -last <n><m|h|d> Files from last 'n' [m]inutes, 'n' [d]ays or 'n' [h]ours -since Same as -last. Kept for backward compatibility. -from "Mon/dd/yyyy hh:mm:ss" From <time> or "yyyy-mm-dd hh:mm:ss" or "yyyy-mm-ddThh:mm:ss" or "yyyy-mm-dd" -to "Mon/dd/yyyy hh:mm:ss" To <time> or "yyyy-mm-dd hh:mm:ss" or "yyyy-mm-ddThh:mm:ss" or "yyyy-mm-dd" -for "Mon/dd/yyyy" For <date>. or "yyyy-mm-dd"

Diagnostic collections - File management diagcollect ... -nocopy -notrim -tag
<tagname> -z <zipname> -collectalldirs -collectdir <dir1,dir2..> -collectfiles <file1,..,fileN,dir1,..,dirN> [-onlycollectfiles] -nocopy Does not copy back the zip files to initiating node from all nodes -notrim Does not trim the files collected -tag <tagname> The files will be collected into tagname directory inside the repository -z <zipname> The collection zip file will be given this name in the collection repo -collectalldirs Collect all files from a directory marked "Collect All” flag to true -collectdir Specify a comma separated list of directories and the collection will include all files from these irrespective of type and time constraints in addition to the components specified -collectfiles Specify a comma separated list of files/directories and the collection will include the files and directories in addition to the components specified. if -onlycollectfiles is also used, then no other components will be collected.

Diagnostic collections - File redaction diagcollect ... -mask | -sanitize
tfactl set redact=mask tfactl set redact=sanitize tfactl set redact=none sanitize: Replaces sensitive data in collections with random characters mask: Replaces sensitive data in collections with asterisks (*)

Diagnostic collections: diagcollect -examples tfactl diagcollect Trim and Zip all
files updated in the last 1 hours as well as chmos/osw data from across the cluster and collect at the initiating node Note: This collection could be larger than required but is there as the simplest way to capture diagnostics if an issue has recently occurred. tfactl diagcollect -last 8h Trim and Zip all files updated in the last 8 hours as well as chmos/osw data from across the cluster and collect at the initiating node tfactl diagcollect -database hrdb,fdb -last 1d -z foo Trim and Zip all files from databases hrdb & fdb in the last 1 day and collect at the initiating node tfactl diagcollect -crs -os -node node1,node2 -last 6h Trim and Zip all crs files, o/s logs and chmos/osw data from node1 & node2 updated in the last 6 hours and collect at the initiating node

Diagnostic collections: diagcollect -examples tfactl diagcollect -asm -node node1 -from
"Mar/15/2022" -to "Mar/15/2022 21:00:00" Trim and Zip all ASM logs from node1 updated between from and to time and collect at the initiating node tfactl diagcollect -for "Mar/15/2022" Trim and Zip all log files updated on "Mar/15/2022" and collect at the collect at the initiating node tfactl diagcollect -for "Mar/15/2022 21:00:00" Trim and Zip all log files updated from 09:00 on "Mar/15/2022" to 09:00 on “Mar/16/2022"(i.e. 12 hours before and after the time given) and collect at the initiating node tfactl diagcollect -crs -collectdir /tmp_dir1,/tmp_dir2 Trim and Zip all crs files updated in the last 1 hours Also collect all files from /tmp_dir1 and /tmp_dir2 at the initiating node

Recommendation #7 True and tune tfactl managelogs

ADR log management Report space use for database, GI logs
Report space variations over time # Reporting tfactl managelogs -show usage # Show all space use in ADR tfactl managelogs -show usage -gi # Show GI space use tfactl managelogs -show usage -database # Show DB space use tfactl managelogs -show usage -saveusage # Save use for variation reports # Report space use variation tfactl managelogs -show variation -since 1d tfactl managelogs -show variation -since 1d -gi tfactl managelogs -show variation -since 1d -database

ADR log management Purge logs in ADR across cluster nodes
ALERT, INCIDENT, TRACE, CDUMP, HM, UTSCDMP, LOG All diagnostic subdirectories must be owned by dba/grid # Purge ADR files tfactl managelogs -purge -older 30d -dryrun # Estimated space saving tfactl managelogs -purge -older 30d # Purge logs > 30 days old tfactl managelogs -purge -older 30d -gi # GI only tfactl managelogs -purge -older 30d -database # Database only tfactl managelogs -purge -older 30d -database all # All databases tfactl managelogs -purge -older 30d -database SID1,SID3 tfactl managelogs -purge -older 30d -node all # All nodes tfactl managelogs -purge -older 30d -node local # Local node tfactl managelogs -purge -older 30d -node NODE1,NODE3

Purging seems slow? • First-time purge can take a long
time for: • Large directories • Many files • NOTE: Purge operation loops over files • Strategies for first time purge: • Delete in batches by age—365 days, 180 days, 90 days, etc. • Delete database and GI homes separately • Delete for individual SIDs, nodes

Check file ownership! • Files cannot be deleted if subdirectories
under ADR_HOME are not owned by grid/oracle or oinstall/dba • One mis-owned subdirectory • No files under that ADR_HOME will be purged • Even subdirectories with correct ownership! • Depending on version • grid may not be able to delete files in database ADR_HOMEs • oracle may not be able to delete files in GI ADR_HOMEs

Files aren't deleted when the ADR_HOME... • ...schema version is
mismatched or obsolete • ...library version is mismatched • ...is unregistered • ...is for an orphaned CRS event or user • ...is for an inactive listener

Files aren't deleted when... • ORACLE_SID or ORACLE_HOME not present
in oratab • Duplicate ORACLE_SIDs are present in oratab • Database unique name doesn't match the directory • Common after cloning operations • ADR_BASE is not set properly • $ORACLE_HOME/log/diag directory is missing • $ORACLE_HOME/log/diag/adrci_dir.mif missing • $ORACLE_HOME/log/diag/adrci_dir.mif doesn’t list ADR_BASE

The best of AHF: alertsummary

alertsummary # Summarize events in database and ASM alert logs
tfactl alertsummary [root@node1 ~]# tfactl alertsummary Output from host : node1 ------------------------------ Reading /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ------------------------------------------------------------------------ 02 02 2022 20:04:57 Database started ------------------------------------------------------------------------ 02 02 2022 20:07:41 Database started Summary: Ora-600=0, Ora-7445=0, Ora-700=0 ~~~~~~~ Warning: Only FATAL errors reported Warning: These errors were seen and NOT reported Ora-15173 Ora-15032 Ora-15017 Ora-15013 Ora-15326

The best of AHF: analyze

analyze # Perform system analysis of DB, ASM, GI, system,
OS Watcher logs/output tfactl analyze # Options: -search "pattern" # Search in DB and CRS alert logs # Sets the search period to -last 1h # Override with -last xh|xd -verbose timeline file1 file2 # Shows timeline for specified files

analyze INFO: analyzing all (Alert and Unix System Logs) logs
for the last 1440 minutes... Please wait... INFO: analyzing host: node1 Report title: Analysis of Alert,System Logs Report date range: last ~1 day(s) Report (default) time zone: GMT - Greenwich Mean Time Analysis started at: 03-Feb-2022 06:27:46 PM GMT Elapsed analysis time: 0 second(s). Configuration file: /opt/oracle.ahf/tfa/ext/tnt/conf/tnt.prop Configuration group: all Total message count: 963, from 02-Feb-2022 08:01:39 PM GMT to 03-Feb-2022 04:23:43 PM GMT Messages matching last ~1 day(s): 963, from 02-Feb-2022 08:01:39 PM GMT to 03-Feb-2022 04:23:43 PM GMT last ~1 day(s) error count: 4, from 02-Feb-2022 08:03:31 PM GMT to 02-Feb-2022 08:11:12 PM GMT last ~1 day(s) ignored error count: 0 last ~1 day(s) unique error count: 3 Message types for last ~1 day(s) Occurrences percent server name type ----------- ------- -------------------- ----- 952 98.9% node1 generic 7 0.7% node1 WARNING 4 0.4% node1 ERROR ----------- ------- 963 100.0%

analyze ... Unique error messages for last ~1 day(s) Occurrences
percent server name error ----------- ------- ----------- ----- 2 50.0% node1 [OCSSD(30863)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 . 1 25.0% node1 [OCSSD(2654)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 node2 . 1 25.0% node1 [OCSSD(2654)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 . ----------- ------- 4 100.0%

The best of AHF: changes

changes # Find changes made on the system tfactl changes
# Times and ranges -for "YYYY-MM-DD" -from "YYYY-MM-DD" -to "YYYY-MM-DD" -from "YYYY-MM-DD HH24:MI:SS" -to "YYYY-MM-DD HH24:MI:SS" -last 6h -last 1d

changes [root@node1 ~]# tfactl changes -last 2d Output from host
: node2 ------------------------------ [Feb/02/2022 20:11:16.438]: Package: cvuqdisk-1.0.10-1.x86_64 Output from host : node1 ------------------------------ [Feb/02/2022 19:57:16.438]: Package: cvuqdisk-1.0.10-1.x86_64 [Feb/02/2022 20:11:16.438]: Package: cvuqdisk-1.0.10-1.x86_64

The best of AHF: events

events [root@node1 ~]# tfactl events -last 1d Output from host
: node2 ------------------------------ Event Summary: INFO :3 ERROR :2 WARNING :0 Event Timeline: [Feb/02/2022 20:10:46.649 GMT]: [crs]: 2022-02-02 20:10:46.649 [ORAROOTAGENT(27881)]CRS-5822: Agent '/u01/app/19.3.0.0/grid/ bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:1:3} in /u01/app/grid/diag/crs/node2/crs/trace/ ohasd_orarootagent_root.trc. [Feb/02/2022 20:11:12.856 GMT]: [crs]: 2022-02-02 20:11:12.856 [OCSSD(28472)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 node2 . [Feb/02/2022 20:11:57.000 GMT]: [asm.+ASM2]: Reconfiguration started (old inc 0, new inc 4) [Feb/02/2022 20:28:31.000 GMT]: [db.db193h1.DB193H12]: Starting ORACLE instance (normal) (OS id: 24897) [Feb/02/2022 20:28:42.000 GMT]: [db.db193h1.DB193H12]: Reconfiguration started (old inc 0, new inc 4)

The best of AHF: grep

grep # Find patterns in multiple files tfactl grep "ERROR"
alert tfactl grep -i "error" alert,trace [root@node1 ~]# tfactl grep -i "error" alert Output from host : node1 ------------------------------ Searching 'error' in alert Searching /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- 28: PAGESIZE AVAILABLE_PAGES EXPECTED_PAGES ALLOCATED_PAGES ERROR(s) 375:Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_32035.trc: 378:Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_32049.trc: 446:ERROR: /* ASMCMD */ALTER DISKGROUP ALL MOUNT 543: PAGESIZE AVAILABLE_PAGES EXPECTED_PAGES ALLOCATED_PAGES ERROR(s) 1034:Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_28105.trc: ...

The best of AHF: param

param # View database parameters - cluster aware param <parameter>
tfactl> param sga_target Output from host : vna1 ------------------------------ .-------------------------------------------------. | DB PARAMETERS | +----------+------+----------+------------+-------+ | DATABASE | HOST | INSTANCE | PARAM | VALUE | +----------+------+----------+------------+-------+ | vna | vna1 | VNA1 | sga_target | 1536M | ‘----------+------+----------+------------+-------'

param # View database parameters - cluster aware tfactl> param
-h Output from host : vna1 ------------------------------ Usage : /opt/oracle.ahf/tfa/bin/tfactl [run] param <name pattern> Show value of OS/DB parameters matching input e.g: /opt/oracle.ahf/tfa/bin/tfactl param sga_max /opt/oracle.ahf/tfa/bin/tfactl param sga_min /opt/oracle.ahf/tfa/bin/tfactl param db_unique /opt/oracle.ahf/tfa/bin/tfactl param shmmax /opt/oracle.ahf/tfa/bin/tfactl run param sga_max /opt/oracle.ahf/tfa/bin/tfactl run param sga_min /opt/oracle.ahf/tfa/bin/tfactl run param db_unique /opt/oracle.ahf/tfa/bin/tfactl run param shmmax

sga_target Output from host : vna1 ------------------------------ .-------------------------------------------------. | DB PARAMETERS | +----------+------+----------+------------+-------+ | DATABASE | HOST | INSTANCE | PARAM | VALUE | +----------+------+----------+------------+-------+ | vna | vna1 | VNA1 | sga_target | 1536M | ‘----------+------+----------+------------+-------'

sga Output from host : vna1 ------------------------------ .-------------------------------------------------. | DB PARAMETERS | +----------+------+----------+------------+-------+ | DATABASE | HOST | INSTANCE | PARAM | VALUE | +----------+------+----------+------------+-------+ | vna | vna1 | VNA1 | sga_target | 1536M | ‘----------+------+----------+------------+-------'

param # There are more parameters for sga* SQL> show
parameter sga NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ allow_group_access_to_sga boolean FALSE lock_sga boolean FALSE pre_page_sga boolean TRUE sga_max_size big integer 1536M sga_min_size big integer 0 sga_target big integer 1536M unified_audit_sga_queue_size integer 1048576

sga_max Output from host : vna1 ------------------------------ Output from host : vna2 ------------------------------

shmmax Output from host : vna1 ------------------------------ Output from host : vna2 ------------------------------

The best of AHF: ps

ps # List processes - default flags are "-ef" ps
pmon ps <flags> pmon tfactl> ps pmon Output from host : vna1 ------------------------------ grid 15260 1 0 14:30 ? 00:00:00 asm_pmon_+ASM1 oracle 16883 1 0 14:31 ? 00:00:00 ora_pmon_VNA1 Output from host : vna2 ------------------------------ grid 8063 1 0 14:25 ? 00:00:00 asm_pmon_+ASM2 oracle 9929 1 0 14:27 ? 00:00:00 ora_pmon_VNA2...

ps tfactl> ps aux pmon Output from host : vna1
------------------------------ grid 15260 0.0 1.0 1556860 79508 ? Ss 14:30 0:00 asm_pmon_+ASM1 oracle 16883 0.0 0.8 2297012 66148 ? Ss 14:31 0:00 ora_pmon_VNA1 Output from host : vna2 ------------------------------ grid 8063 0.0 1.0 1556860 79896 ? Ss 14:25 0:00 asm_pmon_+ASM2 oracle 9929 0.0 0.8 2297012 66168 ? Ss 14:27 0:00 ora_pmon_VNA2

The best of AHF: pstack

pstack # Print a stack trace for a process .------------------------------------------------------------------.
| TOOLS STATUS - HOST : vna1 | +----------------------+--------------+--------------+-------------+ | Tool Type | Tool | Version | Status | +----------------------+--------------+--------------+-------------+ | AHF Utilities | alertsummary | 21.4.1 | DEPLOYED | ... | | ps | 21.4.1 | DEPLOYED | | | pstack | 21.4.1 | DEPLOYED | | | summary | 21.4.1 | DEPLOYED | ... | | vi | 21.4.1 | DEPLOYED | +----------------------+--------------+--------------+-------------+

pstack tfactl> pstack -h Output from host : vna1 ------------------------------
Error: pstack command not found in system. If its installed, please set the PATH and try again. yum install -y gdb

pstack tfactl> pstack mmon Output from host : vna1 ------------------------------
# pstack output for pid : 15318 #0 0x00007f33bac6928a in semtimedop () from /lib64/libc.so.6 #1 0x0000000011c58285 in sskgpwwait () #2 0x0000000011c543db in skgpwwait () #3 0x000000001144ccba in ksliwat () #4 0x000000001144c06c in kslwaitctx () #5 0x0000000011a6fd40 in ksarcv () #6 0x00000000038174fa in ksbabs () #7 0x0000000003835ab3 in ksbrdp () #8 0x0000000003c19a4d in opirip () #9 0x00000000024c23e5 in opidrv ()

pstack # ahfctl pstack accepts standard flags Usage pstack <pid|process
name> [-n <n>] [-s <secs>] Print stack trace of a running process <n> times. Sleep <secs> seconds between runs. e.g: pstack lmd pstack 2345 -n 5 -s 5 run pstack lmd run pstack 2345 -n 5 -s 5

The best of AHF: summary

summary # Generate a system summary tfactl> summary -h ---------------------------------------------------------------------------------
Usage : TFACTL [run] summary -help --------------------------------------------------------------------------------- Command : /opt/oracle.ahf/tfa/bin/tfactl [run] summary [OPTIONS] Following Options are supported: [no_components] : [Default] Complete Summary Collection -overview : [Optional/Default] Complete Summary Collection - Overview -crs : [Optional/Default] CRS Status Summary -asm : [Optional/Default] ASM Status Summary -acfs : [Optional/Default] ACFS Status Summary -database : [Optional/Default] DATABASE Status Summary -exadata : [Optional/Default] EXADATA Status Summary Not enabled/ignored in Windows and Non-Exadata machine -patch : [Optional/Default] Patch Details -listener : [Optional/Default] LISTENER Status Summary -network : [Optional/Default] NETWORK Status Summary -os : [Optional/Default] OS Status Summary -tfa : [Optional/Default] TFA Status Summary -summary : [Optional/Default] Summary Tool Metadata -json : [Optional] - Prepare json report -html : [Optional] - Prepare html report -print : [Optional] - Display [html or json] Report at Console -silent : [Optional] - Interactive console by defauly -history <num> : [Optional] - View Previous <numberof> Summary Collection History in Interpreter -node <node(s)> : [Optional] - local or Comma Separated Node Name(s) -help : Usage/Help. ---------------------------------------------------------------------------------

summary Example output tfactl> summary Executing Summary in Parallel on
Following Nodes: Node : vna1 Node : vna2 LOGFILE LOCATION : /opt/oracle.ahf/…/log/summary_command_20220316151853_vna1_18097.log Component Specific Summary collection : - Collecting CRS details ... Done. - Collecting ASM details ... Done. - Collecting ACFS details ... Done. - Collecting DATABASE details ... Done. - Collecting PATCH details ... Done. - Collecting LISTENER details ... Done. - Collecting NETWORK details ... Done. - Collecting OS details ... Done. - Collecting TFA details ... Done. - Collecting SUMMARY details ... Done. Remote Summary Data Collection : In-Progress - Please wait ... - Data Collection From Node - vna2 .. Done. Prepare Clusterwide Summary Overview ... Done cluster_status_summary

summary Example output (cont) COMPONENT DETAILS STATUS +-----------+---------------------------------------------------------------------------------------------------+---------+ ... PATCH
.----------------------------------------------. OK | CRS_PATCH_CONSISTENCY_ACROSS_NODES : OK | | DATABASE_PATCH_CONSISTENCY_ACROSS_NODES : OK | '----------------------------------------------' LISTENER .-----------------------. OK | LISTNER_STATUS : OK | '-----------------------' NETWORK .---------------------------. OK | CLUSTER_NETWORK_STATUS : | '---------------------------' OS .-----------------------. OK | MEM_USAGE_STATUS : OK | '-----------------------' TFA .----------------------. OK | TFA_STATUS : RUNNING | '----------------------' SUMMARY .------------------------------------. OK | SUMMARY_EXECUTION_TIME : 0H:1M:52S | ‘------------------------------------' +-----------+---------------------------------------------------------------------------------------------------+---------+

summary Interactive menu ### Entering in to SUMMARY Command-Line Interface
### tfactl_summary>list Components : Select Component - select [component_number|component_name] 1 => overview 2 => crs_overview 3 => asm_overview 4 => acfs_overview 5 => database_overview 6 => patch_overview 7 => listener_overview 8 => network_overview 9 => os_overview 10 => tfa_overview 11 => summary_overview tfactl_summary>

summary Interactive menu tfactl_summary>5 ORACLE_HOME_DETAILS ORACLE_HOME_NAME +-----------------------------------------------------------------------------------+------------------+ .-------------------------------------------------------------------------------. OraDB19Home1 |
DATABASE_DETAILS | DATABASE_NAME | +---------------------------------------------------------------+---------------+ | .-----------------------------------------------------------. | VNA | | | DB_BLOCKS | STATUS | DB_CHAINS | INSTANCE_NAME | HOSTNAME | | | | +-----------+--------+-----------+---------------+----------+ | | | | PASS | OPEN | FAIL | VNA1 | vna1 | | | | | PASS | OPEN | FAIL | VNA2 | vna2 | | | | '-----------+--------+-----------+---------------+----------' | | '---------------------------------------------------------------+---------------' +-----------------------------------------------------------------------------------+------------------+ tfactl_summary_databaseoverview>list Status Type: Select Status Type - select [status_type_number|status_type_name] 1 => database_clusterwide_status 2 => database_vna1 3 => database_vna2

summary Interactive menu tfactl_summary_databaseoverview>list Status Type: Select Status Type -
select [status_type_number|status_type_name] 1 => database_clusterwide_status 2 => database_vna1 3 => database_vna2 tfactl_summary_databaseoverview>2 =====> database_sql_statistics =====> database_instance_details =====> database_components_version =====> database_system_events =====> database_hanganalyze =====> database_rman_stats =====> database_incidents =====> database_account_status =====> database_tablespace_details =====> database_status_summary =====> database_sqlmon_statistics =====> database_problems =====> database_statistics =====> database_group_details =====> database_pdb_stats =====> database_configuration_details

The best of AHF: tail

tail # Tail logs by name or pattern tfactl tail
alert_ # Tail all logs matching alert_ tfactl tail alert_ORCL1.log -exact # Tail for an exact match tfactl tail -f alert_ # Follow logs(local node only) [root@node1 ~]# tfactl tail -f alert_ Output from host : node1 ------------------------------ ==> /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log <== NOTE: cleaning up empty system-created directory '+DATA/vgtol7-rac-c/OCRBACKUP/backup00.ocr.274.1095654191' 2022-02-03T12:23:35.194335+00:00 NOTE: cleaning up empty system-created directory '+DATA/vgtol7-rac-c/OCRBACKUP/backup01.ocr.274.1095654191' 2022-02-03T16:23:43.602629+00:00 NOTE: cleaning up empty system-created directory '+DATA/vgtol7-rac-c/OCRBACKUP/backup01.ocr.275.1095668599' ==> /u01/app/oracle/diag/rdbms/db193h1/DB193H11/trace/alert_DB193H11.log <== TABLE SYS.WRI$_OPTSTAT_HISTHEAD_HISTORY: ADDED INTERVAL PARTITION SYS_P301 (44594) VALUES LESS THAN (TO_DATE(‘... SYS.WRI$_OPTSTAT_HISTGRM_HISTORY: ADDED INTERVAL PARTITION SYS_P304 (44594) VALUES LESS THAN (TO_DATE(‘... 2022-02-03T06:00:16.143988+00:00 Thread 1 advanced to log sequence 22 (LGWR switch) Current log# 2 seq# 22 mem# 0: +DATA/DB193H1/ONLINELOG/group_2.265.1095625353

Questions

Autonomous Health Framework

Autonomous Health Framework

More Decks by Seán Scott

Other Decks in Technology

Featured

Transcript