Upgrade to Pro — share decks privately, control downloads, hide ads and more …



Peter Geoghegan

October 17, 2019

More Decks by Peter Geoghegan

Other Decks in Technology


  1. https://speakerdeck.com/peterg/xact-rollback-pgconfeu My perspective Understanding any complex system requires good mental

    model - Considering alternative approaches often makes the overall picture clearer - I will use real world alternatives (SQL Server and Oracle) Theory-focussed, but grounded in practicalities
  2. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Overview 1. Concurrency control and recovery Two closely related

    concepts 2. Advantages of versioned storage Advantages for high availability, recovery, index access methods 3. B-Tree deduplication Outlook for future improvements.
  3. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial

    Rollbacks Using Write-Ahead Logging C. MOHAN IBM Almaden Research Center and DON HADERLE IBM Santa Teresa Laboratory and BRUCE LINDSAY, HAMID PIRAHESH and PETER SCHWARZ IBM Almaden Research Center In this paper we present a simple and efficient method, called ARIES ( Algorithm for Recouery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine- granularity (e.g., record) locking and recovery using write-ahead logging (WAL). We introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logging is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks We deal with a variety of features that are very Important in building and operating an industrial-strength transaction processing system ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high concurrency lock modes (e. g., increment /decrement) which exploit the semantics of the opera- tions and require the ability to perform operation logging. ARIES is flexible with respect to the kinds of buffer management policies that can be implemented. It supports objects of varying length efficiently. By enabling parallelism during restart, page-oriented redo, and logical undo, it enhances concurrency and performance. We show why some of the System R paradigms for logging and recovery, which were based on the shadow page technique, need to be changed in the context of WAL. We compare ARIES to the WAL-based recovery methods of
  4. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Concurrency control: Two common approaches 2PL (Two phase locking)

    is classic approach - Lock records that are written and read - Readers can be blocked by writers, and vice-versa MVCC versions rows so that old versions can be returned instead of locking - Used by PostgreSQL, and by Oracle - Can be used by Microsoft SQL Server
  5. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Berkeley-era storage — goals Original goal was greatly simplifying

    recovery code - No write-ahead log — the database is the log - Tuples versioned and chained together - No-overwrite storage Time travel feature supported under this model - VACUUM used to migrate old versions to archival storage — not for garbage
  6. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Time travel get removed from PostgreSQL PostgreSQL project introduces

    MVCC in the late 1990s - VACUUM now strictly about garbage collection In effect, “logical” approach to MVCC, not Oracle style block-level MVCC - Commit log still maintains status of transactions - Index AMs still don’t need to think about concurrency control (e.g. locks on key values)
  7. https://speakerdeck.com/peterg/xact-rollback-pgconfeu WAL is added to PostgreSQL PostgreSQL introduced WAL based

    recovery around the same time as MVCC was added - REDO-only WAL performs a lot better, and enables WAL-based replication and disaster recovery - Still simpler than traditional approach in many ways — no UNDO, no need to explicitly roll back transactions at end of recovery (or during transaction abort)
  8. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Overview 1. Concurrency control and recovery Two closely related

    concepts 2. Advantages of versioned storage Advantages for high availability, recovery, index access methods 3. B-Tree deduplication Outlook for future improvements.
  9. https://speakerdeck.com/peterg/xact-rollback-pgconfeu “Instantaneous” rollback Aborting a transaction occurs instantly in Postgres

    No need to synchronously roll back - VACUUM will clean it up later - Opportunistic garbage collection can also reuse free space as needed Other DB systems must restore previous/latest state. - Non-obvious disadvantages, particularly around indexing
  10. – Architecture of a Database System [emphasis added], from “Transactions:

    Concurrency Control and Recovery” “All production databases today support ACID transactions. As a rule, they use write-ahead logging for durability, and two-phase locking for concurrency control. An exception is PostgreSQL, which uses multi-version concurrency control throughout. Oracle pioneered the limited use of multi-version concurrency side-by-side with locking as a way to offer relaxed consistency models like Snapshot Isolation and Read Consistency.”
  11. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Bitmap indexes in Oracle Oracle’s bitmap indexes appear to

    be roughly comparable to Postgres GIN index at the data structure level - Each is a B-Tree that is optimized for duplicates - Though GIN is used in many other ways (e.g. full text search, jsonb indexing) - GIN with btree_gin opclass is pretty close Nevertheless, Oracle’s bitmap indexes are strictly for data warehousing only
  12. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Bitmap indexes in Oracle (Cont.) Locking requirements for indexes

    are notably different in Oracle due to approach to concurrency control and recovery - Bitmap indexes have lockable entries (index page entry) that can point to thousands of rows at once (just like GIN) - Even simple INSERTs deadlock with each other, so using bitmap indexes in OLTP environment is a non-starter (very much unlike GIN indexes from Postgres) This is much less noticeable with Oracle B-Tree indexes, though they seem to work in essentially the same way
  13. – Architecture of a Database System [emphasis added], from “Full-Text

    Search” “In most cases the full-text index is updated asynchronously (“crawled”) rather than being maintained transactionally; PostgreSQL is unusual in offering the option of a full-text index with transactional updates”
  14. https://speakerdeck.com/peterg/xact-rollback-pgconfeu “Rollback” within an index All index access methods must

    support VACUUM - VACUUM removes tuples in the index, regardless of index access method - Rollback is instant because it is logical Cleanup is decoupled from transaction abort, simplifying locking - Semantic knowledge of the data not really available for concurrency control to exploit with GIN or GiST
  15. – Architecture of a Database System [emphasis added], from “Transactions:

    Concurrency control and Recovery” “It is a significant intellectual and engineering challenge to take a textbook access method algorithm … and implement a correct, high- concurrency, recoverable version in a transactional system. For this reason, most leading DBMSs still only implement heap files and B+trees as transactionally protected access methods; PostgreSQL’s GiST implementation is a notable exception.”
  16. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Recovery and abort with a GiST index GiST indexes

    are not just highly concurrent, recoverable, and transactional — GiST infrastructure is also highly extensible - Can be used to implement many different kinds of tree-like indexes - Again, rollback is instant because it is logical UNDO-based rollback very difficult to support with GiST, too
  17. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Summary of differences In Oracle and SQL Server without

    ADR, index and table structure always more or less reflects most recent version Postgres centralizes concurrency control by versioning table tuples — tying them to transactions directly SQL Server 2019 with the new ADR feature enabled also versions tuples directly, purely to increase availability/reduce recovery time
  18. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Overview 1. Concurrency control and recovery Two closely related

    concepts 2. Advantages of versioned storage Advantages for high availability, recovery, index access methods 3. B-Tree deduplication Outlook for future improvements.
  19. https://speakerdeck.com/peterg/xact-rollback-pgconfeu B-Tree deduplication Ongoing work for Postgres 13: Deduplication Work

    by Anastasia Lubennikova and myself Targets standard B-Tree access method (default index type in Postgres) Plays to the strengths of the Postgres approach to MVCC
  20. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Why deduplicate? Logical MVCC enables index compression without affecting

    locking/concurrency control Occurs as needed, to avoid a page split - Lazy, goal-orientated, low overhead - Version sensitive but not version aware Made possible by “logical MVCC” Ameliorates downsides of “logical MVCC”, too
  21. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Version sensitive unique index deduplication Unique indexes can also

    support deduplication Version sensitive but not version aware Cooperates with VACUUM - Many workloads have unique indexes that greatly benefit - Extremely lazy here
  22. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Index size reductions Size of all indexes taken together

    0% 12.5% 25% 37.5% 50% 62.5% 75% 87.5% 100% TPC-H MGD Postgres 12 (original size) Patch
  23. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Index size reductions Size of single low cardinality index

    on text columns 0% 12.5% 25% 37.5% 50% 62.5% 75% 87.5% 100% "UK Land Registry" test Postgres 12 (original size) Patch
  24. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Design goals for deduplication Deduplication strategy differs from GIN,

    since trade-offs differ — standard B-Trees offer high concurrency Opportunistic garbage collection is very important with standard B-Trees “Best of both worlds” performance Posting lists (Heap TID lists) are not compressed - Limits buffer lock contention and other overhead - Keeps logic for page splits largely unchanged
  25. https://speakerdeck.com/peterg/xact-rollback-pgconfeu Conclusions There are significant differences among all major systems

    that implement MVCC - Accidents of history are the norm with MVCC designs - Evolution of each design constrains and enables innovation Overall degree of optimization probably just as important as fundamentals in practice