or corruption Data Protection Concepts • Intra-block and inter-block corruption • Intra-block: Confined to one block • Inter-block: Exists between blocks • Block header & footer match • Checksum is valid • Yet data is logically inconsistent • Example: lost write
error Data Protection Concepts • Database level: • Bad DDL or DML • Bad application logic • Software bugs • OS-level: • Deleting/changing database files • May be unintentional or malicious
error Data Protection Concepts • Database level: • Bad DDL or DML • Bad application logic • Software bugs • OS-level: • Deleting/changing database files • May be unintentional or malicious RMAN backups can address scenarios beyond "vanilla" full database recovery. Recovery KPIs and enterprise risks may require a layered backup approach beyond the standard weekly level 0/daily level 1/hourly archivelogs.
besides a full restore? Partial restore of tables, datafiles, tablespaces, PDBs Flashback, point-in-time recovery Restore control file, server parameter file Recover using a backup control file Restore RAC to Restart/single instance Restore/recover from different media (disk, tape)
besides a full restore? Data Guard switchover, failover Reinstantiate a Data Guard standby Change file names/paths (different filesystem, ASM to filesystem, etc) Restore a database copy on the same host (change DB ID/name) Block media recovery Restore to a different host, loss of ORACLE_HOME
besides a full restore? Manual TSPITR using an auxiliary instance Data Pump export and import Using Guaranteed Restore Points (GRP) Rolling back failed patches or upgrades Register backup sets/pieces Recreate missing archive log files
and intangibles? Modifying monitoring and alerting Determining space requirements Running jobs with screen or nohup alter system set job_queue_processes=0 Spooling output, echoing commands, setting meaningful timestamps Estimating recovery time
and intangibles? Preparing an alternative recovery while the primary recovery is running Separating root cause analysis and resolution from "the blame game" Preserving immutable diagnostics, esp. for a suspected breach Responsibilities and boundaries (no C-Suite interference) Rotation schedules and relief teams Phone numbers for 24-hour pizza/food delivery
they tested and validated regularly? • Are the recovery procedures clear and well-documented? • Do they support multiple restore/recovery scenarios? • Do they meet standards for RTO/RPO? • What are the essential dependencies? • What factors might affect recovery time? • Is database recovery validation an isolated or coordinated exercise?
DR/HA Postmortems • Over-reliant on teams "instinctively" knowing what to do • Over-optimistic RTO/RPO without empirical basis • Narrow focus (full restore only) • Isolated scope (DBA steps only) • Strong cognitive bias/blindness • Written once and rarely/never updated
Contents of dba_directories, utl_file_dir including: • External tables • BFILE data • Data Pump parameter, log and dump files • SQL*Loader control files • Compiled Pro*C/C++, Pro*COBOL, etc. • External procedures (e.g. EXTPROC) • OS files/executables called via UTL_FILE
srvctl configuration: • Database and instance settings • Services and service configurations • Environment variables (eg TNS_ADMIN for EBS) • Listener configurations and endpoints
resistant • Hardened systems resist pressure up to a point • Beyond that threshold, they break • Hardening introduces brittleness • Over-engineering resistance is (exponentially) costly
increases the potential of: • Failure • Exceptions/variations between prod/non-prod • Meaningful configuration and parameter differences • Simple procedures limit the scope of QA
reduces manual effort • Cognitive load is a finite resource • Automation takes care of the (technical) "How to" • Allows teams to focus on "What, Why, and (conceptual) How" • Automation addresses • Dependencies, sequence • Trivial activities (easily undervalued/missed/run out of order)
use "common" variables in scripts • Avoid the dreaded "Oops! I ran that in the wrong window!" • Make scripts and documentation "copy/paste-proof" • Copy/paste should work correctly: • ...for every command! • ...in every database! • ...in every environment!
Set recovery parameters with the correct values export __db_name= # Add the database name # Not: # Set recovery parameters; change values as needed export __db_name=test # Change the database name!
are easier to manage when they're the same • Differentiate systems via parameters only • Procedures that work across multiple environments are easier to: • Test • Validate • Practice