Upgrade to Pro — share decks privately, control downloads, hide ads and more …

BBM’s 150M+ Users Oracle to Postgres Migration ...

OnGres
November 09, 2018

BBM’s 150M+ Users Oracle to Postgres Migration without Downtime

BBM (the Black Berry Messenger) is one of the largest chat and voice/video applications in the world, with more than 150M users. And it was running on on-premise Oracle. We helped them migrate to PostgreSQL running on GCP with real-time replication and near-zero downtime.

OnGres

November 09, 2018
Tweet

More Decks by OnGres

Other Decks in Programming

Transcript

  1. ALVARO HERNANDEZ @ahachete DBA and Java Software Developer OnGres and

    ToroDB Founder PostgreSQL Community active member Database Frequent Speaker
  2. BBM’s 150M+ users Oracle to Postgres migration 3 Why Oracle

    to PostgreSQL migrations? ► Cost, cost, cost ► Licensing terms (“men in black”) ► Don’t want to store core data on proprietary RDBMS ► Oracle unavailable as-a-service on most clouds ► PostgreSQL is a natural (best) fit for Oracle migrations: plpgsql, DBA similar ops, advanced SQL support
  3. How we Live Migrated Millions of BBM Users & its

    Infrastructure Across the Pacific Mohan Krishnan (Emtek CTO) Google Cloud Next '18 https:/ /www.youtube.com/watch?v=QsvSr3kGxnQ
  4. BBM’s 150M+ users Oracle to Postgres migration 5 BBM: not

    your father’s BBM ► No longer depending on BlackBerry ► Android & iOS ► Modernized, being constantly improved ► Similar to WeChat: a platform, including payments ► 150M+ users
  5. BBM’s 150M+ users Oracle to Postgres migration 6 Some facts

    about BBM’s platform ► ~ 6.000 servers ► Previous infra running on-prem in Canada. Current in GCP in Asia region ► Oracle/PostgreSQL core part of the infrastructure: id (user login and auth services) ► DB size some TB, high trafficked, OLTP pattern ► Cassandra used for user messages (+400 nodes)
  6. BBM’s 150M+ users Oracle to Postgres migration 7 BBM migration

    requirements ► No downtime (near-zero downtime) ► No data loss (obviously). Data verified/checksummed ► Across the Pacific (200-250ms latency) ► Long testing period with real data ► Full PostgreSQL stack architecture on GCP: HA, monitoring, backups, DR, configuration tuning. Fully automated deployment
  7. Migration steps ► DDL migration ► PL/SQL stored procedures ►

    PostgreSQL architecture on GCP ► Live data migration technique and tools ► Rollback mechanism ► Testing and production cut-over
  8. BBM’s 150M+ users Oracle to Postgres migration 9 DDL migration

    ► Main application was using Hibernate, so application needed few changes ► Used ora2pg for DDL migration (awesome tool!) ► However, requires performance-sensitive changes: number in Oracle is a numeric in PostgreSQL. But may be better an integer or bigint, need to know the data. ► Timestamps are painful (well, always!)
  9. BBM’s 150M+ users Oracle to Postgres migration 10 DDL migration

    (II) ► Carefully adjust users and permissions ► Data constraints, default values from sequences could be refactored from triggers to normal constraints ► Cleanup time! (remove unused indexes...) ► Ensure all tables have a PK (required for live data migration)
  10. BBM’s 150M+ users Oracle to Postgres migration 11 DDL migration

    (III): partitioning ► Data in Oracle was not partitioned. Wanted to use the migration to partition ► Partitioning code and functions/triggers created (9.6, no native partitioning as in 10/11) ► Transparent to data migration as long as schema is compatible between source/destination ► Data clean up
  11. BBM’s 150M+ users Oracle to Postgres migration 12 PL/SQL stored

    procedures ► Plpgsql is similar to Oracle’s PL/SQL, but not directly compatible ► Procedures need to be ported mostly manually ► Create a set of unit tests to define existing valid and invalid inputs for existing code ► Port to plpgsql. This was 9.6. PostgreSQL v11 introduced autonomous transactions, very welcomed! ► BBM did not contain many stored procedures
  12. BBM’s 150M+ users Oracle to Postgres migration 13 PostgreSQL Architecture

    on GCP ► Unmanaged: GCE instances, not CloudSQL ► Fully automated deployment with Ansible (+shell, gcloud API for deployment). Env tested with Vagrant ► Heavily tuned config file (learn more: PostgreSQL Configuration for Humans) ► PgBouncer for connection pooling ► Patroni for HA. Consul for DCS. ► WAL-e for backup and DR, DataDog for monitoring
  13. BBM’s 150M+ users Oracle to Postgres migration 15 Live data

    migration ► No downtime requirement discards dump/restore solutions (like ora2pg) ► ETL is not an option either: requires created_at, updated_at fields; works in batches; requires expensive extra indexes or slow scans ► A CDC (Change Data Capture) solution is required that supports Oracle as source and PostgreSQL as destination
  14. BBM’s 150M+ users Oracle to Postgres migration 16 CDC: Change

    Data Capture ► Select an arbitrary point in the (source) database timeline (SCN in Oracle, LSN in PG) ► Phase 1: data load. Obtain a snapshot of the database at the selected SCN. Copy all the data from all tables to destination. ► Phase 2. Deltas. Database (or external software) mines the logs (changes to the database) happened since that SCN and extracts them to the destination database (as SQL or equivalent)
  15. BBM’s 150M+ users Oracle to Postgres migration 18 Golden Gate

    ► Implements CDC pattern, supports many srcs and destinations ► Wait, Oracle sw to migrate off of Oracle? ► Yes, site-wide license
  16. BBM’s 150M+ users Oracle to Postgres migration 20 GoldenGate (GG)

    gotchas ► PITA to get it working. Took experienced GG admin 2 full- time weeks to get this setup working. Documentation output → 12 pages of commands and config files. ► Oracle to PostgreSQL use case poorly documented (if at all). Blogs/posts almost all do batch uploads or only deltas, but not real-time migration. ► Uses ODBC (by Progress), which brings its own gotchas (e.g.: timestamp(6) to timestamp to make it work).
  17. BBM’s 150M+ users Oracle to Postgres migration 21 GoldenGate (GG)

    gotchas (II) ► Synchronization point (between initial data load and CDC) is not enforced. Easy to overlap or miss data. ► Failure means mostly retry, which requires a lot of manual commands. Ended up fully scripting and automating GG operations. Poor debugging. ► Key to make it work is a hack: Add (where SCN == CSN) “, sqlpredicate 'AS OF SCN XXXXXXXX';” to table copy definitions config to do initial snapshot.
  18. BBM’s 150M+ users Oracle to Postgres migration 22 GG configuration

    snippet examples extract live userid $user password $password rmthost localhost, mgrport 7810 rmttrail ./dirdat/ac table accountservice.account; extract load userid $user password $password rmthost localhost, mgrport 7810 rmttask replicat, group load table schema.table, sqlpredicate 'AS OF SCN XXXXXXXX';
  19. BBM’s 150M+ users Oracle to Postgres migration 23 DDL changes

    for CDC ► DDL cannot be changed while replication happens. However: ► Before initial load: DROP constraints, FKs, indexes to speed up load. ► After initial load: re-create all the missing DDL (indexes may be created concurrently). ► Create a temporal tombstone table to signal the end of data copy.
  20. BBM’s 150M+ users Oracle to Postgres migration 24 Sequence management

    ► Sequences are not propagated. ► During cut-over, once tombstone is propagated, query sequence state in Oracle and apply to sequences in PostgreSQL. ► Needs to be scripted and fully automated. ► Optional: increment sequence numbers to a rounded number, to identify when migration happened.
  21. BBM’s 150M+ users Oracle to Postgres migration 25 Rollback mechanism

    ► What if after cut-over migration fails (i.e. PostgreSQL is not performant enough, application bugs, etc) but writes are already happening in PostgreSQL? ► A “rollback” (switchback) mechanism is required. PostgreSQL’s logical decoding was used. ► raw_decoding plugin. Output to file. If rollback, execute the changes via sqlplus back into Oracle. Required code to adjust timestamp and other data types syntax.
  22. BBM’s 150M+ users Oracle to Postgres migration 26 Testing and

    production cut-over ► Live migration was running cross-DC (Canada to Asia) for more than 2 months. ► Re-started frequently for application testing on PG on destination. ► Cut-over fully scripted, tested 3-5 times and timed. ► Database-contributed downtime: 5 mins real, another 40 for data checksumming.
  23. BBM’s 150M+ users BBM’s 150M+ users Oracle to Postgres Oracle

    to Postgres migration migration without downtime without downtime