Upgrade to Pro — share decks privately, control downloads, hide ads and more …

UPSERT use-cases

Peter Geoghegan
September 18, 2015

UPSERT use-cases

Ecobox
Home
Blog
About
Tickets
Sessions
Venue & Hotel
Sponsors

POSTGRES OPEN
SEPT 16TH - 18TH ・ DALLAS

UPSERT use cases

Back
Date: Sept. 18, 2015
Time: 13:30 - 14:20
Room: Houston Ballroom B/C
Level: Intermediate
Feedback: Leave feedback
PostgreSQL 9.5 will have support for a feature that is popularly known as "UPSERT" - the ability to either insert or update a row according to whether an existing row with the same key exists. If such a row already exists, the implementation should update it. If not, a new row should be inserted. This is supported by way of a new high level syntax (a clause that extends the INSERT statement) that more or less relieves the application developer from having to give any thought to race conditions. This common operation for client applications is set to become far simpler and far less error-prone than legacy ad-hoc approaches to UPSERT involving subtransactions. Moreover, the new implementation performs much better than those legacy approaches.

While the feature is most obviously compelling for OLTP and web application use cases, it's also true that the syntax is powerful enough to be very useful in many real world data integration scenarios. The non-standard PostgreSQL syntax offer explicit, fine grained control over where and how to update. For example, an update may not actually affect an existing row due to not satisfying some additional criteria (i.e. due to not passing the ON CONFLICT ... DO UPDATE special, dedicated WHERE clause).

This talk gives an overview of the feature from a high level, and examines these use cases. You will learn how you might want to use the new UPSERT feature in your application beyond the obvious. In passing, there will be brief discussion of why UPSERT's implementation proved to be a hard problem, and, relatedly, why a custom syntax was used instead of the SQL standard's MERGE syntax.

Peter Geoghegan

September 18, 2015
Tweet

More Decks by Peter Geoghegan

Other Decks in Programming

Transcript

  1. What is UPSERT? • INSERT row, or UPDATE existing one

    to reconcile it with what you wanted to INSERT. Always one or the other of those two. No “ifs”, no “buts”. • Available in PostgreSQL 9.5 (recently released). • 9.5 also adds a DO NOTHING variant that never locks row. MySQL calls this INSERT IGNORE. • You more or less don't have to worry about concurrency/race conditions with the feature. 2
  2. What is UPSERT? • “Fundamental UPSERT property”. At READ COMMITTED

    isolation level, you should always get an insert or update. • No unprincipled deadlocks. • No spurious unique constraint violations. • Only way to really ensure this is to make it driven by insert. Take alternative path (i.e., go update existing row) when would-be duplicate violation detected. • Most requested by those with OLTP and web app databases. 3
  3. Syntax -- “UPSERT” some distributors: INSERT INTO distributors (did, dname)

    VALUES (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; -- “INSERT IGNORE” variant: INSERT INTO distributors (did, dname) VALUES (7, 'Redline GmbH') ON CONFLICT (did) DO NOTHING; -- Consider not updating existing distributor: INSERT INTO distributors (did, dname) AS d VALUES (8, 'Anvil Distribution') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; WHERE d.zipcode != '21201'; 4
  4. -- “UPSERT” some distributors: INSERT INTO distributors (did, dname) VALUES

    (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; -- “INSERT IGNORE” variant: INSERT INTO distributors (did, dname) VALUES (7, 'Redline GmbH') ON CONFLICT (did) DO NOTHING; -- Consider not updating existing distributor: INSERT INTO distributors (did, dname) AS d VALUES (8, 'Anvil Distribution') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; WHERE d.zipcode != '21201'; Syntax 5
  5. Syntax -- “UPSERT” some distributors: INSERT INTO distributors (did, dname)

    VALUES (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; -- “INSERT IGNORE” variant: INSERT INTO distributors (did, dname) VALUES (7, 'Redline GmbH') ON CONFLICT (did) DO NOTHING; -- Consider not updating existing distributor: INSERT INTO distributors (did, dname) AS d VALUES (8, 'Anvil Distribution') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; WHERE d.zipcode != '21201'; 6
  6. Syntax -- “UPSERT” some distributors: INSERT INTO distributors (did, dname)

    VALUES (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; • Must specify which unique index to take update action on. • Prefer to do this with inference clause, naming relevant column (or columns, or expressions, or some combination). • Alternatively, ON CONFLICT can name constraint directly, but avoid it (needed for exclusion constraints, but not unique constraints). • Very flexible and forgiving. Optimizer throws error if it cannot find satisfactory unique index. 7
  7. Why not implement MERGE? • SQL standard specified MERGE statement

    has been implemented by a few other popular database systems. • Synchronizes two tables, inserting, updating, and deleting as it goes. Individual MERGE statement may lack WHEN NOT MATCHED THEN INSERT “handler”. In contrast, Postgres UPSERT is “driven by INSERT”. • Source table could be VALUES(), so MERGE appears UPSERT-like. • Lacks “fundamental UPSERT property”, because can (for example) raise duplicate violations for simple UPSERT cases. • ON CONFLICT DO UPDATE has some amount of MERGE-like flexibility. • Idea that MERGE is equivalent to UPSERT is unfortunate myth. 8
  8. MERGE vs. UPSERT • UPSERT is about guaranteeing a certain

    outcome (insert or update). • MERGE operates according to the same rules as individual INSERT, UPDATE, or DELETE statements. • No games are played with visibility, and no unique indexes are required. • Can be simulated with writeable WITH statements in Postgres. • The point of MERGE is that it’s fast to bulk load when reconciliation driven by conventional join, and it’s a higher level construct for reconciliation-orientated DML. 11
  9. Summary of UPSERT • UPSERT has very flexible syntax —

    far more flexible (and far less error prone) than comparable MySQL feature, for example. • Complicated to implement, but not complicated to use. Involved small revision to chapter in documentation on MVCC rules, for example. The behavior of UPSERT is now described specially. • Unlike MERGE, has no race conditions. No duplicate violations from arbiter unique index. No “unprincipled deadlocks”. • Unlike MERGE, inserting is one possible outcome that you must always be happy with. 13
  10. Data integration • Uniform access required (e.g. from SQL). •

    Diverse sources. Often autonomous, distributed. • Diversity around encoding (In the sense of how domain- specific ideas are represented). • Not just from other databases. • Everything must fit into “mediated” schema. • “Screen scraping” is about the worst/best example of this. 15
  11. Where the rubber hits the road • Many sources of

    data may require integration into main database, with varying levels of quality. • With complex data landscape, idempotence is important. • e.g. When consuming same message multiple times from Apache Kafka node. • The need to reconcile table in live system is in tension with MVCC semantics. • Do you even know if you can “fence” tables during ETL? Or are you just too lazy busy to find out for sure? 16
  12. When dealing with this complexity, “just be careful” is a

    bad plan, or at least seems unlikely to scale well. The experience of the last 40 years suggests that in general, a few good primitives should be leveraged to the hilt to deal with complexity. PostgreSQL’s UPSERT feature is built to be such a primitive; it addresses increasingly important problems surrounding data integration. My perspective 17
  13. • Retail system syncs with disparate suppliers • Ad-hoc text

    file formats • May sell same product — still not purchasing from manufacturer
  14. CREATE TABLE products ( sku serial PRIMARY KEY, upc_barcode bigint,

    description text NOT NULL, -- Time that product was entered into our system created_at timestamp with timezone not null default now(), -- Time that third-party file indicates change is effective by last_updated_at timestamp with timezone, case_size numeric NOT NULL, current_stock_qty numeric NOT NULL default 0, current_retail_price numeric NOT NULL, CONSTRAINT unique_barcode UNIQUE (upc_barcode) ); CREATE TABLE suppliers ( supplier_code integer PRIMARY KEY, -- example abridged here ); CREATE TABLE supplier_prices ( upc_barcode integer REFERENCES products, supplier_code integer REFERENCES suppliers, price numeric NOT NULL, CONSTRAINT product_supplier PRIMARY KEY (upc_barcode, supplier_code) ); 20
  15. -- New CSV file downloaded, with new suggested retail price

    -- effective from right now: INSERT INTO products (upc_barcode, description, current_retail_price, last_updated_at) AS p -- Timestamp here (and everything else) originates from some -- supplier’s system VALUES (804551312731, 'Whistler Brand Shampoo', 5.99, '2015-09-18 12:00:00-5') ON CONFLICT (upc_barcode) DO UPDATE SET description = EXCLUDED.description, last_updated_at = EXCLUDED.last_updated_at, current_retail_price = EXCLUDED.current_retail_price WHERE p.last_updated_at < EXCLUDED.last_updated_at; 21 • INSERT statements like this built from data in ad-hoc CSV formats (XML if you’re lucky). • Retailer always sells products at suggested price in this example. • Probably no way of representing in ad-hoc system that this only concerns “supplier independent” facets of product. • Every CSV file line represents: “Here is a product that you may or may not have encountered before”. Most details (e.g. description) seldom change.
  16. INSERT INTO products (upc_barcode, description, current_retail_price, last_updated_at) AS p VALUES

    (739189350438, 'Mixed Nuts', 5.99, '2015-09-18 12:00:00-5’) -- Partial unique index support: ON CONFLICT (upc_barcode) WHERE is_active DO UPDATE SET description = EXCLUDED.description, last_updated_at = EXCLUDED.last_updated_at, current_retail_price = EXCLUDED.current_retail_price; 23 • Products may need to be “logically” deleted. Need to stay in table though, which prevents actual deletion. • “is_active” boolean column represents this concept. • Partial unique indexes can enforce uniqueness among active products alone. Must consider what happens if and when product re-added. • With PostgreSQL’s ON CONFLICT DO UPDATE, this “just works”.
  17. Safety first • Lots of thought went into making the

    feature easy to use correctly, as we’ve seen. • Almost as much thought went into making it hard to use incorrectly. • The way that the semantics of a query depend on the presence of a constraint (or unique index) is pretty novel. • This behooves us to bend over backwards to make it safe. • This is thought to be a strong overall advantage for PostgreSQL. I think that we’re the first to make UPSERT easy. 24
  18. postgres=# EXPLAIN INSERT INTO upsert (key, val) VALUES(1, 'Baz') ON

    CONFLICT (key) DO UPDATE SET val = EXCLUDED.val WHERE EXCLUDED.val <> 'Contrived'; QUERY PLAN ---------------------------------------------------- Insert on upsert (cost=0.00..0.01 rows=1 width=0) Conflict Resolution: UPDATE Conflict Arbiter Indexes: upsert_pkey Conflict Filter: (excluded.val <> 'Contrived'::text) -> Result (cost=0.00..0.01 rows=1 width=0) (5 rows) 25
  19. postgres=# EXPLAIN INSERT INTO upsert (key, val) VALUES(1, 'Baz') ON

    CONFLICT (key) WHERE is_active DO UPDATE SET val = EXCLUDED.val WHERE EXCLUDED.val <> 'Contrived'; QUERY PLAN ---------------------------------------------------- Insert on upsert (cost=0.00..0.01 rows=1 width=0) Conflict Resolution: UPDATE Conflict Arbiter Indexes: upsert_pkey, key_is_active_idx Conflict Filter: (excluded.val <> 'Contrived'::text) -> Result (cost=0.00..0.01 rows=1 width=0) (5 rows) 26
  20. Advanced inference • Also supports expression indexes. • Is extremely

    forgiving of ordering and redundancy. • A predicate on inference specification will work with unique indexes that satisfy it. This includes equivalent indexes without any predicate, for example. • Again: Use inference specification. Do not name constraint directly. • Exclusion constraints require this, though (although you may not have to use an explicit constraint at all for DO NOTHING). 27
  21. Other advanced features • postgres_fdw supports DO NOTHING variant only.

    • Unique index inference specification mandatory for DO UPDATE, but foreign tables never have any local index/constraint. • RLS works — INSERT, UPDATE, and ALL policies handled at each stage. • Updatable views work. 28
  22. Other advanced features • Logical decoding plugins will see either

    INSERT or UPDATE for each affected row. • New-to-9.5 subselect update list syntax works, of course. • In general, ON CONFLICT DO UPDATE statement totally unrestricted in structure. • Can contain subselects in target list, just like an INSERT or UPDATE. • Special WHERE clause (final update decision) can contain subselects. • RETURNING supported, so may appear in WITH clause. 29