Slide 1

Slide 1 text

UPSERT use-cases Peter Geoghegan Twitter: @petervgeoghegan FOSDEM — January 31, 2016

Slide 2

Slide 2 text

What is UPSERT? • INSERT row, or UPDATE existing one to reconcile it with what you wanted to INSERT. Always one or the other of those two. No “ifs”, no “buts”. • Available in PostgreSQL 9.5 (recently released). • 9.5 also adds a DO NOTHING variant that never locks row. MySQL calls this INSERT IGNORE. • You more or less don't have to worry about concurrency/race conditions with the feature. 2

Slide 3

Slide 3 text

What is UPSERT? • “Fundamental UPSERT property”. At READ COMMITTED isolation level, you should always get an insert or update. • No unprincipled deadlocks. • No spurious unique constraint violations. • Only way to really ensure this is to make it driven by insert. Take alternative path (i.e., go update existing row) when would-be duplicate violation detected. • Most requested by those with OLTP and web app databases. 3

Slide 4

Slide 4 text

Syntax -- “UPSERT” some distributors: INSERT INTO distributors (did, dname) VALUES (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; -- “INSERT IGNORE” variant: INSERT INTO distributors (did, dname) VALUES (7, 'Redline GmbH') ON CONFLICT (did) DO NOTHING; -- Consider not updating existing distributor: INSERT INTO distributors (did, dname) AS d VALUES (8, 'Anvil Distribution') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; WHERE d.zipcode != '21201'; 4

Slide 5

Slide 5 text

-- “UPSERT” some distributors: INSERT INTO distributors (did, dname) VALUES (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; -- “INSERT IGNORE” variant: INSERT INTO distributors (did, dname) VALUES (7, 'Redline GmbH') ON CONFLICT (did) DO NOTHING; -- Consider not updating existing distributor: INSERT INTO distributors (did, dname) AS d VALUES (8, 'Anvil Distribution') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; WHERE d.zipcode != '21201'; Syntax 5

Slide 6

Slide 6 text

Syntax -- “UPSERT” some distributors: INSERT INTO distributors (did, dname) VALUES (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; -- “INSERT IGNORE” variant: INSERT INTO distributors (did, dname) VALUES (7, 'Redline GmbH') ON CONFLICT (did) DO NOTHING; -- Consider not updating existing distributor: INSERT INTO distributors (did, dname) AS d VALUES (8, 'Anvil Distribution') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; WHERE d.zipcode != '21201'; 6

Slide 7

Slide 7 text

Syntax -- “UPSERT” some distributors: INSERT INTO distributors (did, dname) VALUES (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; • Must specify which unique index to take update action on. • Prefer to do this with inference clause, naming relevant column (or columns, or expressions, or some combination). • Alternatively, ON CONFLICT can name constraint directly, but avoid it (needed for exclusion constraints, but not unique constraints). • Very flexible and forgiving. Optimizer throws error if it cannot find satisfactory unique index. 7

Slide 8

Slide 8 text

Why not implement MERGE? • SQL standard specified MERGE statement has been implemented by a few other popular database systems. • Synchronizes two tables, inserting, updating, and deleting as it goes. Individual MERGE statement may lack WHEN NOT MATCHED THEN INSERT “handler”. In contrast, Postgres UPSERT is “driven by INSERT”. • Source table could be VALUES(), so MERGE appears UPSERT-like. • Lacks “fundamental UPSERT property”, because can (for example) raise duplicate violations for simple UPSERT cases. • ON CONFLICT DO UPDATE has some amount of MERGE-like flexibility. • Idea that MERGE is equivalent to UPSERT is unfortunate myth. 8

Slide 9

Slide 9 text

UPSERT before 9.5 http://www.postgresql.org/docs/9.4/static/plpgsql-control-structures.html#PLPGSQL- UPSERT-EXAMPLE

Slide 10

Slide 10 text

Making MERGE statement “do UPSERT” http://www.stackoverflow.com/a/22777749

Slide 11

Slide 11 text

MERGE vs. UPSERT • UPSERT is about guaranteeing a certain outcome (insert or update). • MERGE operates according to the same rules as individual INSERT, UPDATE, or DELETE statements. • No games are played with visibility, and no unique indexes are required. • Can be simulated with writeable WITH statements in Postgres. • The point of MERGE is that it’s fast to bulk load when reconciliation driven by conventional join, and it’s a higher level construct for reconciliation-orientated DML. 11

Slide 12

Slide 12 text

MongoDB UPSERT bug https://jira.mongodb.org/browse/SERVER-14322

Slide 13

Slide 13 text

Summary of UPSERT • UPSERT has very flexible syntax — far more flexible (and far less error prone) than comparable MySQL feature, for example. • Complicated to implement, but not complicated to use. Involved small revision to chapter in documentation on MVCC rules, for example. The behavior of UPSERT is now described specially. • Unlike MERGE, has no race conditions. No duplicate violations from arbiter unique index. No “unprincipled deadlocks”. • Unlike MERGE, inserting is one possible outcome that you must always be happy with. 13

Slide 14

Slide 14 text

New use-cases

Slide 15

Slide 15 text

Data integration • Uniform access required (e.g. from SQL). • Diverse sources. Often autonomous, distributed. • Diversity around encoding (In the sense of how domain- specific ideas are represented). • Not just from other databases. • Everything must fit into “mediated” schema. • “Screen scraping” is about the worst/best example of this. 15

Slide 16

Slide 16 text

Where the rubber hits the road • Many sources of data may require integration into main database, with varying levels of quality. • With complex data landscape, idempotence is important. • e.g. When consuming same message multiple times from Apache Kafka node. • The need to reconcile table in live system is in tension with MVCC semantics. • Do you even know if you can “fence” tables during ETL? Or are you just too lazy busy to find out for sure? 16

Slide 17

Slide 17 text

When dealing with this complexity, “just be careful” is a bad plan, or at least seems unlikely to scale well. The experience of the last 40 years suggests that in general, a few good primitives should be leveraged to the hilt to deal with complexity. PostgreSQL’s UPSERT feature is built to be such a primitive; it addresses increasingly important problems surrounding data integration. My perspective 17

Slide 18

Slide 18 text

Example: accidental distributed system

Slide 19

Slide 19 text

• Retail system syncs with disparate suppliers • Ad-hoc text file formats • May sell same product — still not purchasing from manufacturer

Slide 20

Slide 20 text

CREATE TABLE products ( sku serial PRIMARY KEY, upc_barcode bigint, description text NOT NULL, -- Time that product was entered into our system created_at timestamp with timezone not null default now(), -- Time that third-party file indicates change is effective by last_updated_at timestamp with timezone, case_size numeric NOT NULL, current_stock_qty numeric NOT NULL default 0, current_retail_price numeric NOT NULL, CONSTRAINT unique_barcode UNIQUE (upc_barcode) ); CREATE TABLE suppliers ( supplier_code integer PRIMARY KEY, -- example abridged here ); CREATE TABLE supplier_prices ( upc_barcode integer REFERENCES products, supplier_code integer REFERENCES suppliers, price numeric NOT NULL, CONSTRAINT product_supplier PRIMARY KEY (upc_barcode, supplier_code) ); 20

Slide 21

Slide 21 text

-- New CSV file downloaded, with new suggested retail price -- effective from right now: INSERT INTO products (upc_barcode, description, current_retail_price, last_updated_at) AS p -- Timestamp here (and everything else) originates from some -- supplier’s system VALUES (804551312731, 'Whistler Brand Shampoo', 5.99, '2015-09-18 12:00:00-5') ON CONFLICT (upc_barcode) DO UPDATE SET description = EXCLUDED.description, last_updated_at = EXCLUDED.last_updated_at, current_retail_price = EXCLUDED.current_retail_price WHERE p.last_updated_at < EXCLUDED.last_updated_at; 21 • INSERT statements like this built from data in ad-hoc CSV formats (XML if you’re lucky). • Retailer always sells products at suggested price in this example. • Probably no way of representing in ad-hoc system that this only concerns “supplier independent” facets of product. • Every CSV file line represents: “Here is a product that you may or may not have encountered before”. Most details (e.g. description) seldom change.

Slide 22

Slide 22 text

Advanced capabilities

Slide 23

Slide 23 text

INSERT INTO products (upc_barcode, description, current_retail_price, last_updated_at) AS p VALUES (739189350438, 'Mixed Nuts', 5.99, '2015-09-18 12:00:00-5’) -- Partial unique index support: ON CONFLICT (upc_barcode) WHERE is_active DO UPDATE SET description = EXCLUDED.description, last_updated_at = EXCLUDED.last_updated_at, current_retail_price = EXCLUDED.current_retail_price; 23 • Products may need to be “logically” deleted. Need to stay in table though, which prevents actual deletion. • “is_active” boolean column represents this concept. • Partial unique indexes can enforce uniqueness among active products alone. Must consider what happens if and when product re-added. • With PostgreSQL’s ON CONFLICT DO UPDATE, this “just works”.

Slide 24

Slide 24 text

Safety first • Lots of thought went into making the feature easy to use correctly, as we’ve seen. • Almost as much thought went into making it hard to use incorrectly. • The way that the semantics of a query depend on the presence of a constraint (or unique index) is pretty novel. • This behooves us to bend over backwards to make it safe. • This is thought to be a strong overall advantage for PostgreSQL. I think that we’re the first to make UPSERT easy. 24

Slide 25

Slide 25 text

postgres=# EXPLAIN INSERT INTO upsert (key, val) VALUES(1, 'Baz') ON CONFLICT (key) DO UPDATE SET val = EXCLUDED.val WHERE EXCLUDED.val <> 'Contrived'; QUERY PLAN ---------------------------------------------------- Insert on upsert (cost=0.00..0.01 rows=1 width=0) Conflict Resolution: UPDATE Conflict Arbiter Indexes: upsert_pkey Conflict Filter: (excluded.val <> 'Contrived'::text) -> Result (cost=0.00..0.01 rows=1 width=0) (5 rows) 25

Slide 26

Slide 26 text

postgres=# EXPLAIN INSERT INTO upsert (key, val) VALUES(1, 'Baz') ON CONFLICT (key) WHERE is_active DO UPDATE SET val = EXCLUDED.val WHERE EXCLUDED.val <> 'Contrived'; QUERY PLAN ---------------------------------------------------- Insert on upsert (cost=0.00..0.01 rows=1 width=0) Conflict Resolution: UPDATE Conflict Arbiter Indexes: upsert_pkey, key_is_active_idx Conflict Filter: (excluded.val <> 'Contrived'::text) -> Result (cost=0.00..0.01 rows=1 width=0) (5 rows) 26

Slide 27

Slide 27 text

Advanced inference • Also supports expression indexes. • Is extremely forgiving of ordering and redundancy. • A predicate on inference specification will work with unique indexes that satisfy it. This includes equivalent indexes without any predicate, for example. • Again: Use inference specification. Do not name constraint directly. • Exclusion constraints require this, though (although you may not have to use an explicit constraint at all for DO NOTHING). 27

Slide 28

Slide 28 text

Other advanced features • postgres_fdw supports DO NOTHING variant only. • Unique index inference specification mandatory for DO UPDATE, but foreign tables never have any local index/constraint. • RLS works — INSERT, UPDATE, and ALL policies handled at each stage. • Updatable views work. 28

Slide 29

Slide 29 text

Other advanced features • Logical decoding plugins will see either INSERT or UPDATE for each affected row. • New-to-9.5 subselect update list syntax works, of course. • In general, ON CONFLICT DO UPDATE statement totally unrestricted in structure. • Can contain subselects in target list, just like an INSERT or UPDATE. • Special WHERE clause (final update decision) can contain subselects. • RETURNING supported, so may appear in WITH clause. 29

Slide 30

Slide 30 text

Thanks! Any questions?