Slide 1

Slide 1 text

BBM’s 150M+ users Oracle to Postgres migration without downtime Alvaro Hernandez (OnGres)

Slide 2

Slide 2 text

ALVARO HERNANDEZ @ahachete DBA and Java Software Developer OnGres and ToroDB Founder PostgreSQL Community active member Database Frequent Speaker

Slide 3

Slide 3 text

BBM’s 150M+ users Oracle to Postgres migration 3 Why Oracle to PostgreSQL migrations? ► Cost, cost, cost ► Licensing terms (“men in black”) ► Don’t want to store core data on proprietary RDBMS ► Oracle unavailable as-a-service on most clouds ► PostgreSQL is a natural (best) fit for Oracle migrations: plpgsql, DBA similar ops, advanced SQL support

Slide 4

Slide 4 text

How we Live Migrated Millions of BBM Users & its Infrastructure Across the Pacific Mohan Krishnan (Emtek CTO) Google Cloud Next '18 https:/ /www.youtube.com/watch?v=QsvSr3kGxnQ

Slide 5

Slide 5 text

BBM’s 150M+ users Oracle to Postgres migration 5 BBM: not your father’s BBM ► No longer depending on BlackBerry ► Android & iOS ► Modernized, being constantly improved ► Similar to WeChat: a platform, including payments ► 150M+ users

Slide 6

Slide 6 text

BBM’s 150M+ users Oracle to Postgres migration 6 Some facts about BBM’s platform ► ~ 6.000 servers ► Previous infra running on-prem in Canada. Current in GCP in Asia region ► Oracle/PostgreSQL core part of the infrastructure: id (user login and auth services) ► DB size some TB, high trafficked, OLTP pattern ► Cassandra used for user messages (+400 nodes)

Slide 7

Slide 7 text

BBM’s 150M+ users Oracle to Postgres migration 7 BBM migration requirements ► No downtime (near-zero downtime) ► No data loss (obviously). Data verified/checksummed ► Across the Pacific (200-250ms latency) ► Long testing period with real data ► Full PostgreSQL stack architecture on GCP: HA, monitoring, backups, DR, configuration tuning. Fully automated deployment

Slide 8

Slide 8 text

Migration steps ► DDL migration ► PL/SQL stored procedures ► PostgreSQL architecture on GCP ► Live data migration technique and tools ► Rollback mechanism ► Testing and production cut-over

Slide 9

Slide 9 text

BBM’s 150M+ users Oracle to Postgres migration 9 DDL migration ► Main application was using Hibernate, so application needed few changes ► Used ora2pg for DDL migration (awesome tool!) ► However, requires performance-sensitive changes: number in Oracle is a numeric in PostgreSQL. But may be better an integer or bigint, need to know the data. ► Timestamps are painful (well, always!)

Slide 10

Slide 10 text

BBM’s 150M+ users Oracle to Postgres migration 10 DDL migration (II) ► Carefully adjust users and permissions ► Data constraints, default values from sequences could be refactored from triggers to normal constraints ► Cleanup time! (remove unused indexes...) ► Ensure all tables have a PK (required for live data migration)

Slide 11

Slide 11 text

BBM’s 150M+ users Oracle to Postgres migration 11 DDL migration (III): partitioning ► Data in Oracle was not partitioned. Wanted to use the migration to partition ► Partitioning code and functions/triggers created (9.6, no native partitioning as in 10/11) ► Transparent to data migration as long as schema is compatible between source/destination ► Data clean up

Slide 12

Slide 12 text

BBM’s 150M+ users Oracle to Postgres migration 12 PL/SQL stored procedures ► Plpgsql is similar to Oracle’s PL/SQL, but not directly compatible ► Procedures need to be ported mostly manually ► Create a set of unit tests to define existing valid and invalid inputs for existing code ► Port to plpgsql. This was 9.6. PostgreSQL v11 introduced autonomous transactions, very welcomed! ► BBM did not contain many stored procedures

Slide 13

Slide 13 text

BBM’s 150M+ users Oracle to Postgres migration 13 PostgreSQL Architecture on GCP ► Unmanaged: GCE instances, not CloudSQL ► Fully automated deployment with Ansible (+shell, gcloud API for deployment). Env tested with Vagrant ► Heavily tuned config file (learn more: PostgreSQL Configuration for Humans) ► PgBouncer for connection pooling ► Patroni for HA. Consul for DCS. ► WAL-e for backup and DR, DataDog for monitoring

Slide 14

Slide 14 text

BBM’s 150M+ users Oracle to Postgres migration 14 PostgreSQL Architecture on GCP (II)

Slide 15

Slide 15 text

BBM’s 150M+ users Oracle to Postgres migration 15 Live data migration ► No downtime requirement discards dump/restore solutions (like ora2pg) ► ETL is not an option either: requires created_at, updated_at fields; works in batches; requires expensive extra indexes or slow scans ► A CDC (Change Data Capture) solution is required that supports Oracle as source and PostgreSQL as destination

Slide 16

Slide 16 text

BBM’s 150M+ users Oracle to Postgres migration 16 CDC: Change Data Capture ► Select an arbitrary point in the (source) database timeline (SCN in Oracle, LSN in PG) ► Phase 1: data load. Obtain a snapshot of the database at the selected SCN. Copy all the data from all tables to destination. ► Phase 2. Deltas. Database (or external software) mines the logs (changes to the database) happened since that SCN and extracts them to the destination database (as SQL or equivalent)

Slide 17

Slide 17 text

Oracle Golden Gate Supporting Oracle to PostgreSQL CDC https:/ /unsplash.com/@gersonrepreza

Slide 18

Slide 18 text

BBM’s 150M+ users Oracle to Postgres migration 18 Golden Gate ► Implements CDC pattern, supports many srcs and destinations ► Wait, Oracle sw to migrate off of Oracle? ► Yes, site-wide license

Slide 19

Slide 19 text

BBM’s 150M+ users Oracle to Postgres migration 19 Migration architecture

Slide 20

Slide 20 text

BBM’s 150M+ users Oracle to Postgres migration 20 GoldenGate (GG) gotchas ► PITA to get it working. Took experienced GG admin 2 full- time weeks to get this setup working. Documentation output → 12 pages of commands and config files. ► Oracle to PostgreSQL use case poorly documented (if at all). Blogs/posts almost all do batch uploads or only deltas, but not real-time migration. ► Uses ODBC (by Progress), which brings its own gotchas (e.g.: timestamp(6) to timestamp to make it work).

Slide 21

Slide 21 text

BBM’s 150M+ users Oracle to Postgres migration 21 GoldenGate (GG) gotchas (II) ► Synchronization point (between initial data load and CDC) is not enforced. Easy to overlap or miss data. ► Failure means mostly retry, which requires a lot of manual commands. Ended up fully scripting and automating GG operations. Poor debugging. ► Key to make it work is a hack: Add (where SCN == CSN) “, sqlpredicate 'AS OF SCN XXXXXXXX';” to table copy definitions config to do initial snapshot.

Slide 22

Slide 22 text

BBM’s 150M+ users Oracle to Postgres migration 22 GG configuration snippet examples extract live userid $user password $password rmthost localhost, mgrport 7810 rmttrail ./dirdat/ac table accountservice.account; extract load userid $user password $password rmthost localhost, mgrport 7810 rmttask replicat, group load table schema.table, sqlpredicate 'AS OF SCN XXXXXXXX';

Slide 23

Slide 23 text

BBM’s 150M+ users Oracle to Postgres migration 23 DDL changes for CDC ► DDL cannot be changed while replication happens. However: ► Before initial load: DROP constraints, FKs, indexes to speed up load. ► After initial load: re-create all the missing DDL (indexes may be created concurrently). ► Create a temporal tombstone table to signal the end of data copy.

Slide 24

Slide 24 text

BBM’s 150M+ users Oracle to Postgres migration 24 Sequence management ► Sequences are not propagated. ► During cut-over, once tombstone is propagated, query sequence state in Oracle and apply to sequences in PostgreSQL. ► Needs to be scripted and fully automated. ► Optional: increment sequence numbers to a rounded number, to identify when migration happened.

Slide 25

Slide 25 text

BBM’s 150M+ users Oracle to Postgres migration 25 Rollback mechanism ► What if after cut-over migration fails (i.e. PostgreSQL is not performant enough, application bugs, etc) but writes are already happening in PostgreSQL? ► A “rollback” (switchback) mechanism is required. PostgreSQL’s logical decoding was used. ► raw_decoding plugin. Output to file. If rollback, execute the changes via sqlplus back into Oracle. Required code to adjust timestamp and other data types syntax.

Slide 26

Slide 26 text

BBM’s 150M+ users Oracle to Postgres migration 26 Testing and production cut-over ► Live migration was running cross-DC (Canada to Asia) for more than 2 months. ► Re-started frequently for application testing on PG on destination. ► Cut-over fully scripted, tested 3-5 times and timed. ► Database-contributed downtime: 5 mins real, another 40 for data checksumming.

Slide 27

Slide 27 text

BBM’s 150M+ users BBM’s 150M+ users Oracle to Postgres Oracle to Postgres migration migration without downtime without downtime