
Autonomous Health Framework: How to Use Your Database "Swiss Army Knife" (Without Poking an Eye Out!)

KScope 2022

Sean Scott

June 20, 2022

Transcript

  1. @ViscosityNA www.viscosityna.com Sean Scott
Working with Oracle technology since 1995
Development ⁘ DBA ⁘ Reliability Engineering ⁘ DevOps
Oracle OpenWorld ⁘ Collaborate/IOUG ⁘ Regional UG
RAC/MAA ⁘ Data Guard ⁘ Sharding ⁘ Exadata/ODA
Diagnostic Tools (AHF, TFA, RDA, CHA, CHM)
DR, HA, Site Reliability/Continuity
Upgrade ⁘ Migration ⁘ Cloud
DevOps ⁘ Infrastructure as Code ⁘ Automation
Containers ⁘ Virtualization
  2. "My experience with AHF/TFA is..." a) Hang on, spell that

    'AHF' thing one more time. b) Used it for an SR once—no lingering trauma. c) Used it extensively—currently seeking therapy. d) Dude, I could be the PM. e) Whatever, someone said there were GIFs.
  3. You need AHF • AHF diagnostic collections required by MOS

    for some SR • Diagnostic collections accelerate SR resolution • Cluster-aware ADR log inspection and management • Advanced system and log monitoring • Incident control and notification • Connect to MOS • SMTP, REST APIs
  4. You need AHF • Built-in Oracle tools: • ORAchk/EXAchk •

    OS Watcher • Cluster Verification Utility (CVU) • Hang Manager • Diagnostic Assistant
  5. You need AHF • Integrated with: • Database • ASM

    and Clusterware • Automatic Diagnostic Repository (ADR) • Grid Infrastructure Management Repository (GIMR) • Cluster Health Advisor (CHA) & Cluster Health Monitor (CHM) • Enterprise Manager
  6. You need AHF • Cluster aware: • Run commands for

    all, some nodes • Cross-node configuration and file inspection • Central management for ADR • Consolidated diagnostic collection
  7. You need AHF • Over 800 health checks • 400

    identified as critical/failures • Severe problem check daily: 2AM • All known problem check weekly: 3AM Sunday • Auto-generates a collection when problems detected • Everything required to diagnose & resolve • Results delivered to the notification email
  8. AHF MOS Master Document: 2550798.1 • Briefly 2832630.1, 2832594.1 •

    Downloads available for • Linux, ZLinux • Solaris x86/SPARC64 • HPUX • AIX 6/7 • Win 64-bit
  9. AHF MOS Master Document: 2550798.1 • Major release each quarter

    • Typically follows DBRU schedule • Naming convention is year, quarter, release: YY.Q.R • 21.4.0, 21.4.1 • Latest version: 22.1.1 (as of 2022-06-18) • Intermediate releases are common!
  10. Types of installs: Daemon or root • Recommended method •

    Cluster awareness • Full AHF capabilities • Includes compliance checks • Enables notifications • Automatic diagnostic collection when issues are detected • May conflict with existing AHF/TFA installations
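For orientation, a daemon (root) install from the MOS download usually looks something like the sketch below. The zip name, staging directory, and target locations are placeholders, and options can differ between AHF releases, so treat this as illustrative rather than the canonical procedure:
# As root; file and directory names are examples only
unzip AHF-LINUX_v22.1.1.zip -d /tmp/ahf_install
cd /tmp/ahf_install
./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir /u01/app/oracle.ahf/data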
  11. Types of installs: Local or non-root • Reduced feature set

    • No automatic or remote diagnostics, collections • Limited file visibility (must be readable by Oracle home owner) • /var/log/messages • Some Grid Infrastructure logs • May co-exist with Daemon installations • No special pre-install considerations
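A local (non-root) install is, roughly, the same installer run as the Oracle home owner instead of root; again, the paths are illustrative and prompts vary by version:
# As the Oracle home owner (not root); location is an example only
cd /tmp/ahf_install
./ahf_setup -ahf_loc /u01/app/oracle/oracle.ahf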
  12. A brief history lesson… • There are two flavors of

    TFA • A version downloaded from MOS • A version included in Grid Infrastructure install & patches • GI version is not fully featured • GI and MOS versions can interfere, conflict
  13. Install AHF • Oracle’s instructions work when things are perfect

    • Systems are rarely perfect! • AHF and TFA are known for certain… ahem, peculiarities
  14. TFA pre-installation checks
# Uninstall TFA (as root)
tfactl uninstall
# Check for existing AHF/TFA installs
which tfactl
which ahfctl
  15. TFA pre-installation checks
# Kill any leftover processes (and make sure they stay that way!)
pkill "oswbb|OSWatcher*|toolstatus|tfactl"
sleep 300
ps -ef | egrep -i "oswbb|OSWatcher|toolstatus|tfactl"
# Check for leftover, conflicting processes
ps -ef | egrep -i "oswbb|OSWatcher|ahf|tfa|prw|toolstatus"
  16. TFA pre-installation checks
# Locate leftover setup configuration files
find / -name tfa_setup.txt
# Verify files are removed
find / -name tfactl
find / -name startOSWbb.sh
  17. TFA pre-installation checks
# Remove legacy/existing AHF/TFA installations
for d in $(find / -name uninstalltfa)
do
  cd $(dirname $d)
  ./tfactl uninstall
  # cd .. && rm -fr .
done
# Ensure ALL AHF/TFA processes are stopped/inactive prior to uninstall
# PERFORM THIS STEP ON ALL NODES
  18. Command line tools: ahfctl and tfactl $ tfactl <command> <options>

    - or - $ tfactl tfactl> <command> <options> $ tfactl help $ tfactl <command> help $ ahfctl <command> <options> - or - $ ahfctl ahfctl> <command> <options> $ ahfctl help $ ahfctl <command> help
  19. Post-install checks
ahfctl version
tfactl status
ahfctl statusahf
tfactl toolstatus
tfactl print hosts
tfactl print components
tfactl print protocols
tfactl print config -node all
  20. status vs statusahf [root@node1 ~]# tfactl status .---------------------------------------------------------------------------------------------. | Host

    | Status of TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | | node2 | RUNNING | 30339 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' [root@node1 ~]#
  21. [root@node1 ~]# tfactl statusahf .---------------------------------------------------------------------------------------------. | Host | Status of

    TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | | node2 | RUNNING | 30339 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' ------------------------------------------------------------ Master node = node1 orachk daemon version = 214100 Install location = /opt/oracle.ahf/orachk Started at = Wed Feb 02 20:50:12 GMT 2022 Scheduler type = TFA Scheduler Scheduler PID: 28883 ... status vs statusahf
  22. ------------------------------------------------------------ ID: orachk.autostart_client_oratier1 ------------------------------------------------------------ AUTORUN_FLAGS = -usediscovery -profile oratier1 -dball

    -showpass -tag autostart_client_oratier1 -readenvconfig COLLECTION_RETENTION = 7 AUTORUN_SCHEDULE = 3 2 * * 1,2,3,4,5,6 ------------------------------------------------------------ ------------------------------------------------------------ ID: orachk.autostart_client ------------------------------------------------------------ AUTORUN_FLAGS = -usediscovery -tag autostart_client -readenvconfig COLLECTION_RETENTION = 14 AUTORUN_SCHEDULE = 3 3 * * 0 ------------------------------------------------------------ Next auto run starts on Feb 03, 2022 02:03:00 ID:orachk.AUTOSTART_CLIENT_ORATIER1 statusahf option in tfactl is deprecated and will be removed in AHF 22.1.0. Please start using ahfctl for statusahf, Example: ahfctl statusahf status vs statusahf
  23. Warning remains after a successful upgrade [root@node1 ahf]# ahfctl statusahf

    WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade. .---------------------------------------------------------------------------------------------. | Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | | node2 | RUNNING | 24554 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' • Run ahfctl syncpatch
  24. Not all nodes appear after upgrade [root@node1 ahf]# tfactl status

    .---------------------------------------------------------------------------------------------. | Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' [root@node1 ahf]# • Run ahfctl syncnodes
  25. Not all nodes appear after upgrade [root@node1 ahf]# tfactl syncnodes

    Current Node List in TFA : 1. node1 2. node2 Node List in Cluster : 1. node1 2. node2 Node List to sync TFA Certificates : 1 node2 Do you want to update this node list? Y|[N]: Syncing TFA Certificates on node2 : TFA_HOME on node2 : /opt/oracle.ahf/tfa ...
  26. Not all nodes appear after upgrade (cont) ... TFA_HOME on

    node2 : /opt/oracle.ahf/tfa DATA_DIR on node2 : /opt/oracle.ahf/data/node2/tfa Shutting down TFA on node2... Copying TFA Certificates to node2... Copying SSL Properties to node2... Sleeping for 5 seconds... Starting TFA on node2... .---------------------------------------------------------------------------------------------. | Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | | node2 | RUNNING | 30339 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' [root@node1 ahf]#
  27. OS Watcher not managed by AHF/TFA message • Legacy TFA

    install present • OS Watcher process running during install/upgrade • Multiple install.properties or tfa_setup.txt files • Check logs & permissions • Reinstall [root@node1 ahf]# tfactl toolstatus | grep oswbb | | oswbb | 8.3.2 | NOT MANAGED BY TFA | | | oswbb | 8.3.2 | RUNNING |
  28. Installation and upgrade issues • Post-installation troubleshooting: • ahfctl stopahf;

    ahfctl startahf • tfactl stop; tfactl start • tfactl status • ahfctl statusahf • tfactl toolstatus • tfactl syncnodes • ahfctl syncpatch
  29. Update repository location tfactl> print repository .--------------------------------------------------------. | node1 |

    +----------------------+---------------------------------+ | Repository Parameter | Value | +----------------------+---------------------------------+ | Location | /opt/oracle.ahf/data/repository | | Maximum Size (MB) | 10240 | | Current Size (MB) | 11 | | Free Size (MB) | 10229 | | Status | OPEN | '----------------------+---------------------------------'
  30. Update repository location tfactl> set repositorydir=/some/directory/repository Successfully changed repository .-------------------------------------------------------------.

    | Repository Parameter | Value | +---------------------------+---------------------------------+ | Old Location | /opt/oracle.ahf/data/repository | | New Location | /some/directory/repository | | Current Maximum Size (MB) | 10240 | | Current Size (MB) | 0 | | Status | OPEN | ‘---------------------------+---------------------------------' # Repository commands are applied only on the local node
  31. • One AHF home for the Enterprise • AHF unaffected

    by GI, database patching • Lower maintenance overhead • Easier to keep AHF current Coming to AHF: Standalone AHF Home (22.2?)
  32. • One AHF home for the Enterprise • AHF unaffected

    by GI, database patching • Lower maintenance overhead • Easier to keep AHF current • Centralized Collection Manager & EXAchk/ORAchk reporting • AHF Health Check added to EXAchk/ORAchk • New compliance dashboard in OEM • Greater visibility for CHA, CHM, QoS, Hang Manager, etc. Coming to AHF: Standalone AHF Home (22.2?)
  33. Set email notifications [root@node1 ~]# tfactl set notificationAddress=user@example.com Successfully set

    notificationAddress=user@example.com .---------------------------------------------------------------------------. | node1 | +----------------------------------------------+----------------------------+ | Configuration Parameter | Value | +----------------------------------------------+----------------------------+ | Notification Address ( notificationAddress ) | user@example.com | '----------------------------------------------+----------------------------'
  34. Set email notifications [root@node1 ~]# tfactl print smtp .---------------------------. |

    SMTP Server Configuration | +---------------+-----------+ | Parameter | Value | +---------------+-----------+ | smtp.auth | false | | smtp.port | 25 | | smtp.from | tfa | | smtp.cc | - | | smtp.password | ******* | | smtp.ssl | false | | smtp.debug | true | | smtp.user | - | | smtp.host | localhost | | smtp.bcc | - | | smtp.to | - | '---------------+-----------' View SMTP settings: tfactl print smtp
  35. Set email notifications [root@node1 ~]# tfactl set smtp .---------------------------. |

    SMTP Server Configuration | +---------------+-----------+ | Parameter | Value | +---------------+-----------+ | smtp.password | ******* | | smtp.debug | true | | smtp.user | - | | smtp.cc | - | | smtp.port | 25 | | smtp.from | tfa | | smtp.bcc | - | | smtp.to | - | | smtp.auth | false | | smtp.ssl | false | | smtp.host | localhost | ‘---------------+-----------' Enter the SMTP property you want to update : smtp.host Configure SMTP settings: tfactl set smtp Opens an interactive dialog
  36. Set email notifications Enter the SMTP property you want to

    update : smtp.host Enter value for smtp.host : 127.0.0.1 SMTP Property smtp.host updated with 127.0.0.1 Do you want to continue ? Y|N : Y .---------------------------. | SMTP Server Configuration | +---------------+-----------+ | Parameter | Value | +---------------+-----------+ | smtp.port | 25 | | smtp.cc | - | | smtp.user | - | | smtp.password | ******* | | smtp.debug | true | | smtp.host | 127.0.0.1 | | smtp.ssl | false | ... View SMTP settings: tfactl print smtp Configure SMTP settings: tfactl set smtp
  37. Recommended configurations
# Repository settings
tfactl set autodiagcollect=ON # default
tfactl set trimfiles=ON # default
tfactl set reposizeMB= # default=10240
tfactl set rtscan=ON # default
tfactl set redact=mask # default=none
# Disk space monitoring
tfactl set diskUsageMon=ON # default=OFF
tfactl set diskUsageMonInterval=240 # Depends on activity. default=60
# Log purge
tfactl set autopurge=ON # If space is slim. default=OFF
tfactl set manageLogsAutoPurge=ON # default=OFF
tfactl set manageLogsAutoPurgeInterval=720 # Set to 12 hours. default=60
tfactl set manageLogsAutoPurgePolicyAge=30d # default=30
tfactl set minfileagetopurge=48 # default=12
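To confirm the values took effect on every node, the post-install commands from slide 19 apply here as well:
tfactl print config -node all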
  38. Recommended configurations [root@node1 ~]# tfactl set manageLogsAutoPurge=ON Successfully set manageLogsAutoPurge=ON

    .-------------------------------------------------------. | node1 | +-----------------------------------------------+-------+ | Configuration Parameter | Value | +-----------------------------------------------+-------+ | Managelogs Auto Purge ( manageLogsAutoPurge ) | ON | '-----------------------------------------------+-------' [root@node1 ~]# tfactl set manageLogsAutoPurgePolicyAge=30d Successfully set manageLogsAutoPurgePolicyAge=30d .-------------------------------------------------------------------------------. | node1 | +-----------------------------------------------------------------------+-------+ | Configuration Parameter | Value | +-----------------------------------------------------------------------+-------+ | Logs older than the time period will be auto purged(days[d]|hours[h]) | 30d | '-----------------------------------------------------------------------+-------'
  39. Annoyances • Documentation isn’t always current • Commands, options, and

    syntax may not match docs • Run tfactl <command> -h or tfactl <command> help • Some commands are user (root, oracle, grid) specific • Regression (usually minor) • Don’t build complex automation on new features • Don’t (always) rush to upgrade to the latest version • Example: GI can’t always see/manage DB & vice-versa
  40. Annoyances • The transition from tfactl to ahfctl is incomplete

    • Commands may be: • …available in both • …deprecated in tfactl • …new and unavailable in tfactl • …not ported to ahfctl (yet)
  41. Annoyances • Date format options in commands are inconsistent •

    Some require quotes, some don’t, some work either way • Some take double quotes, others take single quotes • YYYY/MM/DD or YYYY-MM-DD or YYYYMMDD or … • Some take dates and times separately • Sometimes there are -d and -t flags • Some take timestamps • Some work with either, others are specific
  42. However… Most commands have good help options: tfactl <command> -h

    [root@node1 ~]# tfactl diagcollect -h Collect logs from across nodes in cluster Usage : /opt/oracle.ahf/tfa/bin/tfactl diagcollect [ [component_name1] [component_name2] ... [component_nameN] | [-srdc <srdc_profile>] | [-defips]] [-sr <SR#>] [-node <all|local|n1,n2,..>] [-tag <tagname>] [-z <filename>] [-acrlevel <system,database,userdata>] [-last <n><m|h|d>| -from <time> -to <time> | -for <time>] [-nocopy] [-notrim] [-silent] [-cores][-collectalldirs][-collectdir <dir1,dir2..>][-collectfiles <file1,..,fileN,dir1,..,dirN> [-onlycollectfiles]][- examples] components:-ips|-database|-asm|-crsclient|-dbclient|-dbwlm|-tns|-rhp|-procinfo|-cvu|-afd|-crs|-cha|-wls|-emagenti|-emagent|-oms|-omsi|-ocm|-emplugins|-em|- acfs|-install|-cfgtools|-os|-ashhtml|-ashtext|-awrhtml|-awrtext|-sosreport|-qos|-ahf|-dataguard -srdc Service Request Data Collection (SRDC). -database Specify comma separated list of db unique names for collection -defips Include in the default collection the IPS Packages for: ASM, CRS and Databases -sr Enter SR number to which the collection will be uploaded -node Specify comma separated list of host names for collection -tag <tagname> The files will be collected into tagname directory inside repository -z <zipname> The collection zip file will be given this name within the TFA collection repository -last <n><m|h|d> Files from last 'n' [m]inutes, 'n' [d]ays or 'n' [h]ours -since Same as -last. Kept for backward compatibility. -from "Mon/dd/yyyy hh:mm:ss" From <time> or "yyyy-mm-dd hh:mm:ss" or "yyyy-mm-ddThh:mm:ss" or “yyyy-mm-dd" ...
  43. However… Many commands (incl. complex ones) have an -example option

    [root@node1 ~]# tfactl diagcollect -examples Examples: /opt/oracle.ahf/tfa/bin/tfactl diagcollect Trim and Zip all files updated in the last 1 hours as well as chmos/osw data from across the cluster and collect at the initiating node Note: This collection could be larger than required but is there as the simplest way to capture diagnostics if an issue has recently occurred. /opt/oracle.ahf/tfa/bin/tfactl diagcollect -last 8h Trim and Zip all files updated in the last 8 hours as well as chmos/osw data from across the cluster and collect at the initiating node /opt/oracle.ahf/tfa/bin/tfactl diagcollect -database hrdb,fdb -last 1d -z foo Trim and Zip all files from databases hrdb & fdb in the last 1 day and collect at the initiating node ...
  44. However… Many commands (incl. complex ones) have an -example option

    [oracle@node1 ~]$ tfactl analyze -examples Examples: /opt/oracle.ahf/tfa/bin/tfactl analyze -since 5h Show summary of events from alert logs, system messages in last 5 hours. /opt/oracle.ahf/tfa/bin/tfactl analyze -comp os -since 1d Show summary of events from system messages in last 1 day. /opt/oracle.ahf/tfa/bin/tfactl analyze -search "ORA-" -since 2d Search string ORA- in alert and system logs in past 2 days. /opt/oracle.ahf/tfa/bin/tfactl analyze -search "/Starting/c" -since 2d Search case sensitive string "Starting" in past 2 days. /opt/oracle.ahf/tfa/bin/tfactl analyze -comp osw -since 6h Show OSWatcher Top summary in last 6 hours. ...
  45. Diagnostic collections diagcollect [ [component1] [component2] ... [componentN] | [-srdc

    <srdc_profile>] | [-defips] ] [-sr <SR#>] [-node <all|local|n1,n2,..>] [-tag <tagname>] [-z <filename>] [-acrlevel <system,database,userdata>] [-last <n><m|h|d> | -from <time> -to <time> | -for <time>] [-nocopy] [-notrim] [-silent] [-cores] [-collectalldirs] [-collectdir <dir1,dir2..>] [-collectfiles <file1,..,fileN,dir1,..,dirN> [-onlycollectfiles] ]
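As a worked example (the database name, tag, and zip name are placeholders), a cluster-wide collection covering the last 12 hours might be requested as:
tfactl diagcollect -database orcl -last 12h -node all -tag pre_sr -z pre_sr.zip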
  46. Diagnostic collections - components diagcollect [component1] [component2] ... [componentN] -acfs

    -afd -ahf -ashhtml -ashtext -asm -awrhtml -awrtext -cfgtools -cha -crs 
 -crsclient -cvu -database -dataguard -dbclient -dbwlm -em -emagent -emagenti -emplugins -install 
 -ips -ocm -oms -omsi -os -procinfo -qos -rhp -sosreport -tns -wls
  47. Diagnostic collections - 170+ SRDC profiles diagcollect ... -srdc <srdc_profile>

    diagcollect -srdc -help <srdc_profile> can be any of the following, DBCORRUPT Required Diagnostic Data Collection for a Generic Database Corruption DBDATAGUARD Required Diagnostic Data Collection for Data Guard issues including Broker Listener_Services SRDC - Data Collection for TNS-12516 / TNS-12518 / TNS-12519 / TNS-12520. Naming_Services SRDC - Data Collection for ORA-12154 / ORA-12514 / ORA-12528. ORA-00020 SRDC for database ORA-00020 Maximum number of processes exceeded ORA-00060 SRDC for ORA-00060. Internal error code. ORA-00494 SRDC for ORA-00494. ORA-00600 SRDC for ORA-00600. Internal error code. ... ora4023 SRDC - ORA-4023 : Checklist of Evidence to Supply ora4063 SRDC - ORA-4063 : Checklist of Evidence to Supply ora445 SRDC - ORA-445 or Unable to Spawn Process: Checklist of Evidence to Supply (Doc ID 2500730.1) xdb600 SRDC - Required Diagnostic Data Collection for XDB ORA-00600 and ORA-07445 zlgeneric SRDC - Zero Data Loss Recovery Appliance (ZDLRA) Data Collection.
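For example, an SRDC-driven collection for an ORA-00600 uses the profile name from the list above; AHF then prompts for the error time and the affected database:
tfactl diagcollect -srdc ORA-00600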
  48. Diagnostic collections - Misc diagcollect ... -defips -sr <SR#> -node

    <all|local|n1,n2,..> -defips Include in the default collection the IPS Packages for: ASM, CRS and Databases -sr Enter SR number to which the collection will be uploaded -node Specify comma separated list of host names for collection
  49. Diagnostic collections - Time ranges diagcollect ... -last <n><m|h|d> -since

    -from <time> -to <time> -for <time> -last <n><m|h|d> Files from last 'n' [m]inutes, 'n' [d]ays or 'n' [h]ours -since Same as -last. Kept for backward compatibility. -from "Mon/dd/yyyy hh:mm:ss" From <time> or "yyyy-mm-dd hh:mm:ss" or "yyyy-mm-ddThh:mm:ss" or "yyyy-mm-dd" -to "Mon/dd/yyyy hh:mm:ss" To <time> or "yyyy-mm-dd hh:mm:ss" or "yyyy-mm-ddThh:mm:ss" or "yyyy-mm-dd" -for "Mon/dd/yyyy" For <date>. or "yyyy-mm-dd"
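Illustrative time-range collections using the accepted formats (the dates are placeholders):
tfactl diagcollect -last 4h
tfactl diagcollect -for "2022-03-15"
tfactl diagcollect -from "2022-03-15 09:00:00" -to "2022-03-15 13:00:00"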
  50. Diagnostic collections - File management diagcollect ... -nocopy -notrim -tag

    <tagname> -z <zipname> -collectalldirs -collectdir <dir1,dir2..> -collectfiles <file1,..,fileN,dir1,..,dirN> [-onlycollectfiles] -nocopy Does not copy back the zip files to initiating node from all nodes -notrim Does not trim the files collected -tag <tagname> The files will be collected into tagname directory inside the repository -z <zipname> The collection zip file will be given this name in the collection repo -collectalldirs Collect all files from a directory marked "Collect All” flag to true -collectdir Specify a comma separated list of directories and the collection will include all files from these irrespective of type and time constraints in addition to the components specified -collectfiles Specify a comma separated list of files/directories and the collection will include the files and directories in addition to the components specified. if -onlycollectfiles is also used, then no other components will be collected.
  51. Diagnostic collections - File redaction
diagcollect ... -mask | -sanitize
tfactl set redact=mask
tfactl set redact=sanitize
tfactl set redact=none
sanitize: Replaces sensitive data in collections with random characters
mask: Replaces sensitive data in collections with asterisks (*)
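Redaction can be applied per collection with the flags above, or made the default with the set command; for instance:
tfactl diagcollect -last 2h -mask
tfactl set redact=sanitize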
  52. Diagnostic collections: diagcollect -examples tfactl diagcollect Trim and Zip all

    files updated in the last 1 hours as well as chmos/osw data from across the cluster and collect at the initiating node Note: This collection could be larger than required but is there as the simplest way to capture diagnostics if an issue has recently occurred. tfactl diagcollect -last 8h Trim and Zip all files updated in the last 8 hours as well as chmos/osw data from across the cluster and collect at the initiating node tfactl diagcollect -database hrdb,fdb -last 1d -z foo Trim and Zip all files from databases hrdb & fdb in the last 1 day and collect at the initiating node tfactl diagcollect -crs -os -node node1,node2 -last 6h Trim and Zip all crs files, o/s logs and chmos/osw data from node1 & node2 updated in the last 6 hours and collect at the initiating node
  53. Diagnostic collections: diagcollect -examples tfactl diagcollect -asm -node node1 -from

    "Mar/15/2022" -to "Mar/15/2022 21:00:00" Trim and Zip all ASM logs from node1 updated between from and to time and collect at the initiating node tfactl diagcollect -for "Mar/15/2022" Trim and Zip all log files updated on "Mar/15/2022" and collect at the collect at the initiating node tfactl diagcollect -for "Mar/15/2022 21:00:00" Trim and Zip all log files updated from 09:00 on "Mar/15/2022" to 09:00 on “Mar/16/2022"(i.e. 12 hours before and after the time given) and collect at the initiating node tfactl diagcollect -crs -collectdir /tmp_dir1,/tmp_dir2 Trim and Zip all crs files updated in the last 1 hours Also collect all files from /tmp_dir1 and /tmp_dir2 at the initiating node
  54. ADR log management
Report space use for database, GI logs
Report space variations over time
# Reporting
tfactl managelogs -show usage # Show all space use in ADR
tfactl managelogs -show usage -gi # Show GI space use
tfactl managelogs -show usage -database # Show DB space use
tfactl managelogs -show usage -saveusage # Save use for variation reports
# Report space use variation
tfactl managelogs -show variation -since 1d
tfactl managelogs -show variation -since 1d -gi
tfactl managelogs -show variation -since 1d -database
  55. ADR log management
Purge logs in ADR across cluster nodes
ALERT, INCIDENT, TRACE, CDUMP, HM, UTSCDMP, LOG
All diagnostic subdirectories must be owned by dba/grid
# Purge ADR files
tfactl managelogs -purge -older 30d -dryrun # Estimated space saving
tfactl managelogs -purge -older 30d # Purge logs > 30 days old
tfactl managelogs -purge -older 30d -gi # GI only
tfactl managelogs -purge -older 30d -database # Database only
tfactl managelogs -purge -older 30d -database all # All databases
tfactl managelogs -purge -older 30d -database SID1,SID3
tfactl managelogs -purge -older 30d -node all # All nodes
tfactl managelogs -purge -older 30d -node local # Local node
tfactl managelogs -purge -older 30d -node NODE1,NODE3
  56. Purging seems slow? • First-time purge can take a long

    time for: • Large directories • Many files • NOTE: Purge operation loops over files • Strategies for first time purge: • Delete in batches by age—365 days, 180 days, 90 days, etc. • Delete database and GI homes separately • Delete for individual SIDs, nodes
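A minimal sketch of that staged approach, reusing the managelogs options from the previous slide (the age steps are only an example):
for age in 365d 180d 90d 60d 30d
do
  tfactl managelogs -purge -older $age -dryrun   # estimate the space savings first
  tfactl managelogs -purge -older $age
done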
  57. Check file ownership! • Files cannot be deleted if subdirectories

    under ADR_HOME are not owned by grid/oracle or oinstall/dba • One mis-owned subdirectory • No files under that ADR_HOME will be purged • Even subdirectories with correct ownership! • Depending on version • grid may not be able to delete files in database ADR_HOMEs • oracle may not be able to delete files in GI ADR_HOMEs
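One hedged way to spot a mis-owned subdirectory under an ADR base (the path and the expected owners here are assumptions; adjust them to your layout):
find /u01/app/oracle/diag -type d ! -user oracle ! -user grid -ls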
  58. Files aren't deleted when the ADR_HOME... • ...schema version is

    mismatched or obsolete • ...library version is mismatched • ...is unregistered • ...is for an orphaned CRS event or user • ...is for an inactive listener
  59. Files aren't deleted when... • ORACLE_SID or ORACLE_HOME not present

    in oratab • Duplicate ORACLE_SIDs are present in oratab • Database unique name doesn't match the directory • Common after cloning operations • ADR_BASE is not set properly • $ORACLE_HOME/log/diag directory is missing • $ORACLE_HOME/log/diag/adrci_dir.mif missing • $ORACLE_HOME/log/diag/adrci_dir.mif doesn’t list ADR_BASE
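A few quick checks for the conditions above (paths follow the defaults used elsewhere in this deck):
cat /etc/oratab                           # ORACLE_SID / ORACLE_HOME entries, duplicates
ls -ld $ORACLE_HOME/log/diag              # directory must exist
cat $ORACLE_HOME/log/diag/adrci_dir.mif   # must exist and list the ADR_BASE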
  60. alertsummary # Summarize events in database and ASM alert logs

    tfactl alertsummary [root@node1 ~]# tfactl alertsummary Output from host : node1 ------------------------------ Reading /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ------------------------------------------------------------------------ 02 02 2022 20:04:57 Database started ------------------------------------------------------------------------ 02 02 2022 20:07:41 Database started Summary: Ora-600=0, Ora-7445=0, Ora-700=0 ~~~~~~~ Warning: Only FATAL errors reported Warning: These errors were seen and NOT reported Ora-15173 Ora-15032 Ora-15017 Ora-15013 Ora-15326
  61. analyze
# Perform system analysis of DB, ASM, GI, system, OS Watcher logs/output
tfactl analyze
# Options:
-comp [db|asm|crs|acfs|oratop|os|osw|oswslabinfo] # default=all
-type [error|warning|generic] # default=error
-node [all|local|nodename] # default=all
-o filename # Output to filename
# Times and ranges
-for "YYYY-MM-DD"
-from "YYYY-MM-DD" -to "YYYY-MM-DD"
-from "YYYY-MM-DD HH24:MI:SS" -to "YYYY-MM-DD HH24:MI:SS"
-last 6h
-last 1d
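Combining those options, an illustrative run that pulls only OS errors from the last day across all nodes:
tfactl analyze -comp os -type error -last 1d -node all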
  62. analyze
# Perform system analysis of DB, ASM, GI, system, OS Watcher logs/output
tfactl analyze
# Options:
-search "pattern" # Search in DB and CRS alert logs
                  # Sets the search period to -last 1h
                  # Override with -last xh|xd
-verbose timeline file1 file2 # Shows timeline for specified files
  63. analyze INFO: analyzing all (Alert and Unix System Logs) logs

    for the last 1440 minutes... Please wait... INFO: analyzing host: node1 Report title: Analysis of Alert,System Logs Report date range: last ~1 day(s) Report (default) time zone: GMT - Greenwich Mean Time Analysis started at: 03-Feb-2022 06:27:46 PM GMT Elapsed analysis time: 0 second(s). Configuration file: /opt/oracle.ahf/tfa/ext/tnt/conf/tnt.prop Configuration group: all Total message count: 963, from 02-Feb-2022 08:01:39 PM GMT to 03-Feb-2022 04:23:43 PM GMT Messages matching last ~1 day(s): 963, from 02-Feb-2022 08:01:39 PM GMT to 03-Feb-2022 04:23:43 PM GMT last ~1 day(s) error count: 4, from 02-Feb-2022 08:03:31 PM GMT to 02-Feb-2022 08:11:12 PM GMT last ~1 day(s) ignored error count: 0 last ~1 day(s) unique error count: 3 Message types for last ~1 day(s) Occurrences percent server name type ----------- ------- -------------------- ----- 952 98.9% node1 generic 7 0.7% node1 WARNING 4 0.4% node1 ERROR ----------- ------- 963 100.0%
  64. analyze ... Unique error messages for last ~1 day(s) Occurrences

    percent server name error ----------- ------- ----------- ----- 2 50.0% node1 [OCSSD(30863)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 . 1 25.0% node1 [OCSSD(2654)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 node2 . 1 25.0% node1 [OCSSD(2654)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 . ----------- ------- 4 100.0%
  65. changes
# Find changes made on the system
tfactl changes
# Times and ranges
-for "YYYY-MM-DD"
-from "YYYY-MM-DD" -to "YYYY-MM-DD"
-from "YYYY-MM-DD HH24:MI:SS" -to "YYYY-MM-DD HH24:MI:SS"
-last 6h
-last 1d
  66. changes [root@node1 ~]# tfactl changes -last 2d Output from host

    : node2 ------------------------------ [Feb/02/2022 20:11:16.438]: Package: cvuqdisk-1.0.10-1.x86_64 Output from host : node1 ------------------------------ [Feb/02/2022 19:57:16.438]: Package: cvuqdisk-1.0.10-1.x86_64 [Feb/02/2022 20:11:16.438]: Package: cvuqdisk-1.0.10-1.x86_64
  67. events [root@node1 ~]# tfactl events -last 1d Output from host

    : node2 ------------------------------ Event Summary: INFO :3 ERROR :2 WARNING :0 Event Timeline: [Feb/02/2022 20:10:46.649 GMT]: [crs]: 2022-02-02 20:10:46.649 [ORAROOTAGENT(27881)]CRS-5822: Agent '/u01/app/19.3.0.0/grid/ bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:1:3} in /u01/app/grid/diag/crs/node2/crs/trace/ ohasd_orarootagent_root.trc. [Feb/02/2022 20:11:12.856 GMT]: [crs]: 2022-02-02 20:11:12.856 [OCSSD(28472)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 node2 . [Feb/02/2022 20:11:57.000 GMT]: [asm.+ASM2]: Reconfiguration started (old inc 0, new inc 4) [Feb/02/2022 20:28:31.000 GMT]: [db.db193h1.DB193H12]: Starting ORACLE instance (normal) (OS id: 24897) [Feb/02/2022 20:28:42.000 GMT]: [db.db193h1.DB193H12]: Reconfiguration started (old inc 0, new inc 4)
  68. grep # Find patterns in multiple files tfactl grep "ERROR"

    alert tfactl grep -i "error" alert,trace [root@node1 ~]# tfactl grep -i "error" alert Output from host : node1 ------------------------------ Searching 'error' in alert Searching /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- 28: PAGESIZE AVAILABLE_PAGES EXPECTED_PAGES ALLOCATED_PAGES ERROR(s) 375:Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_32035.trc: 378:Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_32049.trc: 446:ERROR: /* ASMCMD */ALTER DISKGROUP ALL MOUNT 543: PAGESIZE AVAILABLE_PAGES EXPECTED_PAGES ALLOCATED_PAGES ERROR(s) 1034:Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_28105.trc: ...
  69. param # View database parameters - cluster aware param <parameter>

    tfactl> param sga_target Output from host : vna1 ------------------------------ .-------------------------------------------------. | DB PARAMETERS | +----------+------+----------+------------+-------+ | DATABASE | HOST | INSTANCE | PARAM | VALUE | +----------+------+----------+------------+-------+ | vna | vna1 | VNA1 | sga_target | 1536M | ‘----------+------+----------+------------+-------'
  70. param # View database parameters - cluster aware tfactl> param

    -h Output from host : vna1 ------------------------------ Usage : /opt/oracle.ahf/tfa/bin/tfactl [run] param <name pattern> Show value of OS/DB parameters matching input e.g: /opt/oracle.ahf/tfa/bin/tfactl param sga_max /opt/oracle.ahf/tfa/bin/tfactl param sga_min /opt/oracle.ahf/tfa/bin/tfactl param db_unique /opt/oracle.ahf/tfa/bin/tfactl param shmmax /opt/oracle.ahf/tfa/bin/tfactl run param sga_max /opt/oracle.ahf/tfa/bin/tfactl run param sga_min /opt/oracle.ahf/tfa/bin/tfactl run param db_unique /opt/oracle.ahf/tfa/bin/tfactl run param shmmax
  71. param # View database parameters - cluster aware tfactl> param

    sga_target Output from host : vna1 ------------------------------ .-------------------------------------------------. | DB PARAMETERS | +----------+------+----------+------------+-------+ | DATABASE | HOST | INSTANCE | PARAM | VALUE | +----------+------+----------+------------+-------+ | vna | vna1 | VNA1 | sga_target | 1536M | ‘----------+------+----------+------------+-------'
  72. param # View database parameters - cluster aware tfactl> param

    sga Output from host : vna1 ------------------------------ .-------------------------------------------------. | DB PARAMETERS | +----------+------+----------+------------+-------+ | DATABASE | HOST | INSTANCE | PARAM | VALUE | +----------+------+----------+------------+-------+ | vna | vna1 | VNA1 | sga_target | 1536M | ‘----------+------+----------+------------+-------'
  73. param # There are more parameters for sga* SQL> show

    parameter sga NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ allow_group_access_to_sga boolean FALSE lock_sga boolean FALSE pre_page_sga boolean TRUE sga_max_size big integer 1536M sga_min_size big integer 0 sga_target big integer 1536M unified_audit_sga_queue_size integer 1048576
  74. param # View database parameters - cluster aware tfactl> param

    sga_max Output from host : vna1 ------------------------------ Output from host : vna2 ------------------------------
  75. param # View database parameters - cluster aware tfactl> param

    shmmax Output from host : vna1 ------------------------------ Output from host : vna2 ------------------------------
  76. ps # List processes - default flags are "-ef" ps

    pmon ps <flags> pmon tfactl> ps pmon Output from host : vna1 ------------------------------ grid 15260 1 0 14:30 ? 00:00:00 asm_pmon_+ASM1 oracle 16883 1 0 14:31 ? 00:00:00 ora_pmon_VNA1 Output from host : vna2 ------------------------------ grid 8063 1 0 14:25 ? 00:00:00 asm_pmon_+ASM2 oracle 9929 1 0 14:27 ? 00:00:00 ora_pmon_VNA2...
  77. ps tfactl> ps aux pmon Output from host : vna1

    ------------------------------ grid 15260 0.0 1.0 1556860 79508 ? Ss 14:30 0:00 asm_pmon_+ASM1 oracle 16883 0.0 0.8 2297012 66148 ? Ss 14:31 0:00 ora_pmon_VNA1 Output from host : vna2 ------------------------------ grid 8063 0.0 1.0 1556860 79896 ? Ss 14:25 0:00 asm_pmon_+ASM2 oracle 9929 0.0 0.8 2297012 66168 ? Ss 14:27 0:00 ora_pmon_VNA2
  78. pstack # Print a stack trace for a process .------------------------------------------------------------------.

    | TOOLS STATUS - HOST : vna1 | +----------------------+--------------+--------------+-------------+ | Tool Type | Tool | Version | Status | +----------------------+--------------+--------------+-------------+ | AHF Utilities | alertsummary | 21.4.1 | DEPLOYED | ... | | ps | 21.4.1 | DEPLOYED | | | pstack | 21.4.1 | DEPLOYED | | | summary | 21.4.1 | DEPLOYED | ... | | vi | 21.4.1 | DEPLOYED | +----------------------+--------------+--------------+-------------+
  79. pstack tfactl> pstack -h Output from host : vna1 ------------------------------

    Error: pstack command not found in system. If its installed, please set the PATH and try again. yum install -y gdb
  80. pstack tfactl> pstack mmon Output from host : vna1 ------------------------------

    # pstack output for pid : 15318 #0 0x00007f33bac6928a in semtimedop () from /lib64/libc.so.6 #1 0x0000000011c58285 in sskgpwwait () #2 0x0000000011c543db in skgpwwait () #3 0x000000001144ccba in ksliwat () #4 0x000000001144c06c in kslwaitctx () #5 0x0000000011a6fd40 in ksarcv () #6 0x00000000038174fa in ksbabs () #7 0x0000000003835ab3 in ksbrdp () #8 0x0000000003c19a4d in opirip () #9 0x00000000024c23e5 in opidrv ()
  81. pstack
# ahfctl pstack accepts standard flags
Usage: pstack <pid|process name> [-n <n>] [-s <secs>]
Print stack trace of a running process <n> times. Sleep <secs> seconds between runs.
e.g:
pstack lmd
pstack 2345 -n 5 -s 5
run pstack lmd
run pstack 2345 -n 5 -s 5
  82. summary # Generate a system summary tfactl> summary -h ---------------------------------------------------------------------------------

    Usage : TFACTL [run] summary -help --------------------------------------------------------------------------------- Command : /opt/oracle.ahf/tfa/bin/tfactl [run] summary [OPTIONS] Following Options are supported: [no_components] : [Default] Complete Summary Collection -overview : [Optional/Default] Complete Summary Collection - Overview -crs : [Optional/Default] CRS Status Summary -asm : [Optional/Default] ASM Status Summary -acfs : [Optional/Default] ACFS Status Summary -database : [Optional/Default] DATABASE Status Summary -exadata : [Optional/Default] EXADATA Status Summary Not enabled/ignored in Windows and Non-Exadata machine -patch : [Optional/Default] Patch Details -listener : [Optional/Default] LISTENER Status Summary -network : [Optional/Default] NETWORK Status Summary -os : [Optional/Default] OS Status Summary -tfa : [Optional/Default] TFA Status Summary -summary : [Optional/Default] Summary Tool Metadata -json : [Optional] - Prepare json report -html : [Optional] - Prepare html report -print : [Optional] - Display [html or json] Report at Console -silent : [Optional] - Interactive console by defauly -history <num> : [Optional] - View Previous <numberof> Summary Collection History in Interpreter -node <node(s)> : [Optional] - local or Comma Separated Node Name(s) -help : Usage/Help. ---------------------------------------------------------------------------------
  83. summary Example output tfactl> summary Executing Summary in Parallel on

    Following Nodes: Node : vna1 Node : vna2 LOGFILE LOCATION : /opt/oracle.ahf/…/log/summary_command_20220316151853_vna1_18097.log Component Specific Summary collection : - Collecting CRS details ... Done. - Collecting ASM details ... Done. - Collecting ACFS details ... Done. - Collecting DATABASE details ... Done. - Collecting PATCH details ... Done. - Collecting LISTENER details ... Done. - Collecting NETWORK details ... Done. - Collecting OS details ... Done. - Collecting TFA details ... Done. - Collecting SUMMARY details ... Done. Remote Summary Data Collection : In-Progress - Please wait ... - Data Collection From Node - vna2 .. Done. Prepare Clusterwide Summary Overview ... Done cluster_status_summary
  84. summary Example output (cont) COMPONENT DETAILS STATUS +-----------+---------------------------------------------------------------------------------------------------+---------+ CRS .-----------------------------------------------.

    PROBLEM | CRS_SERVER_STATUS : ONLINE | | CRS_STATE : ONLINE | | CRS_INTEGRITY_CHECK : FAIL | | CRS_RESOURCE_STATUS : OFFLINE Resources Found | '-----------------------------------------------' ASM .-----------------------------. PROBLEM | ASM_DISK_SIZE_STATUS : OK | | ASM_BLOCK_STATUS : PASS | | ASM_CHAIN_STATUS : PASS | | ASM_INCIDENTS : FAIL | | ASM_PROBLEMS : FAIL | '-----------------------------' ACFS .-----------------------. OFFLINE | ACFS_STATUS : OFFLINE | ‘-----------------------' DATABASE .-----------------------------------------------------------------------------------------------. PROBLEM | ORACLE_HOME_NAME | ORACLE_HOME_DETAILS | +------------------+----------------------------------------------------------------------------+ | OraDB19Home1 | .------------------------------------------------------------------------. | | | | INCIDENTS | DB_BLOCKS | DATABASE_NAME | DB_CHAINS | PROBLEMS | STATUS | | | | +-----------+-----------+---------------+-----------+----------+---------+ | | | | PROBLEM | PASS | VNA | PROBLEM | PROBLEM | PROBLEM | | | | '-----------+-----------+---------------+-----------+----------+---------' | '------------------+----------------------------------------------------------------------------'
  85. summary Example output (cont) COMPONENT DETAILS STATUS +-----------+---------------------------------------------------------------------------------------------------+---------+ ... PATCH

    .----------------------------------------------. OK | CRS_PATCH_CONSISTENCY_ACROSS_NODES : OK | | DATABASE_PATCH_CONSISTENCY_ACROSS_NODES : OK | '----------------------------------------------' LISTENER .-----------------------. OK | LISTNER_STATUS : OK | '-----------------------' NETWORK .---------------------------. OK | CLUSTER_NETWORK_STATUS : | '---------------------------' OS .-----------------------. OK | MEM_USAGE_STATUS : OK | '-----------------------' TFA .----------------------. OK | TFA_STATUS : RUNNING | '----------------------' SUMMARY .------------------------------------. OK | SUMMARY_EXECUTION_TIME : 0H:1M:52S | ‘------------------------------------' +-----------+---------------------------------------------------------------------------------------------------+---------+
  86. summary Interactive menu ### Entering in to SUMMARY Command-Line Interface

    ### tfactl_summary>list Components : Select Component - select [component_number|component_name] 1 => overview 2 => crs_overview 3 => asm_overview 4 => acfs_overview 5 => database_overview 6 => patch_overview 7 => listener_overview 8 => network_overview 9 => os_overview 10 => tfa_overview 11 => summary_overview tfactl_summary>
  87. summary Interactive menu tfactl_summary>5 ORACLE_HOME_DETAILS ORACLE_HOME_NAME +-----------------------------------------------------------------------------------+------------------+ .-------------------------------------------------------------------------------. OraDB19Home1 |

    DATABASE_DETAILS | DATABASE_NAME | +---------------------------------------------------------------+---------------+ | .-----------------------------------------------------------. | VNA | | | DB_BLOCKS | STATUS | DB_CHAINS | INSTANCE_NAME | HOSTNAME | | | | +-----------+--------+-----------+---------------+----------+ | | | | PASS | OPEN | FAIL | VNA1 | vna1 | | | | | PASS | OPEN | FAIL | VNA2 | vna2 | | | | '-----------+--------+-----------+---------------+----------' | | '---------------------------------------------------------------+---------------' +-----------------------------------------------------------------------------------+------------------+ tfactl_summary_databaseoverview>list Status Type: Select Status Type - select [status_type_number|status_type_name] 1 => database_clusterwide_status 2 => database_vna1 3 => database_vna2
  88. summary Interactive menu tfactl_summary_databaseoverview>list Status Type: Select Status Type -

    select [status_type_number|status_type_name] 1 => database_clusterwide_status 2 => database_vna1 3 => database_vna2 tfactl_summary_databaseoverview>2 =====> database_sql_statistics =====> database_instance_details =====> database_components_version =====> database_system_events =====> database_hanganalyze =====> database_rman_stats =====> database_incidents =====> database_account_status =====> database_tablespace_details =====> database_status_summary =====> database_sqlmon_statistics =====> database_problems =====> database_statistics =====> database_group_details =====> database_pdb_stats =====> database_configuration_details
  89. tail # Tail logs by name or pattern tfactl tail

    alert_ # Tail all logs matching alert_ tfactl tail alert_ORCL1.log -exact # Tail for an exact match tfactl tail -f alert_ # Follow logs(local node only) [root@node1 ~]# tfactl tail -f alert_ Output from host : node1 ------------------------------ ==> /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log <== NOTE: cleaning up empty system-created directory '+DATA/vgtol7-rac-c/OCRBACKUP/backup00.ocr.274.1095654191' 2022-02-03T12:23:35.194335+00:00 NOTE: cleaning up empty system-created directory '+DATA/vgtol7-rac-c/OCRBACKUP/backup01.ocr.274.1095654191' 2022-02-03T16:23:43.602629+00:00 NOTE: cleaning up empty system-created directory '+DATA/vgtol7-rac-c/OCRBACKUP/backup01.ocr.275.1095668599' ==> /u01/app/oracle/diag/rdbms/db193h1/DB193H11/trace/alert_DB193H11.log <== TABLE SYS.WRI$_OPTSTAT_HISTHEAD_HISTORY: ADDED INTERVAL PARTITION SYS_P301 (44594) VALUES LESS THAN (TO_DATE(‘... SYS.WRI$_OPTSTAT_HISTGRM_HISTORY: ADDED INTERVAL PARTITION SYS_P304 (44594) VALUES LESS THAN (TO_DATE(‘... 2022-02-03T06:00:16.143988+00:00 Thread 1 advanced to log sequence 22 (LGWR switch) Current log# 2 seq# 22 mem# 0: +DATA/DB193H1/ONLINELOG/group_2.265.1095625353