UPSERT use-cases

UPSERT use-cases Peter Geoghegan Twitter: @petervgeoghegan FOSDEM — January 31,
2016

What is UPSERT? • INSERT row, or UPDATE existing one
to reconcile it with what you wanted to INSERT. Always one or the other of those two. No “ifs”, no “buts”. • Available in PostgreSQL 9.5 (recently released). • 9.5 also adds a DO NOTHING variant that never locks row. MySQL calls this INSERT IGNORE. • You more or less don't have to worry about concurrency/race conditions with the feature. 2

What is UPSERT? • “Fundamental UPSERT property”. At READ COMMITTED
isolation level, you should always get an insert or update. • No unprincipled deadlocks. • No spurious unique constraint violations. • Only way to really ensure this is to make it driven by insert. Take alternative path (i.e., go update existing row) when would-be duplicate violation detected. • Most requested by those with OLTP and web app databases. 3

Syntax -- “UPSERT” some distributors: INSERT INTO distributors (did, dname)
VALUES (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; -- “INSERT IGNORE” variant: INSERT INTO distributors (did, dname) VALUES (7, 'Redline GmbH') ON CONFLICT (did) DO NOTHING; -- Consider not updating existing distributor: INSERT INTO distributors (did, dname) AS d VALUES (8, 'Anvil Distribution') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; WHERE d.zipcode != '21201'; 4

-- “UPSERT” some distributors: INSERT INTO distributors (did, dname) VALUES
(5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; -- “INSERT IGNORE” variant: INSERT INTO distributors (did, dname) VALUES (7, 'Redline GmbH') ON CONFLICT (did) DO NOTHING; -- Consider not updating existing distributor: INSERT INTO distributors (did, dname) AS d VALUES (8, 'Anvil Distribution') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; WHERE d.zipcode != '21201'; Syntax 5

VALUES (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; -- “INSERT IGNORE” variant: INSERT INTO distributors (did, dname) VALUES (7, 'Redline GmbH') ON CONFLICT (did) DO NOTHING; -- Consider not updating existing distributor: INSERT INTO distributors (did, dname) AS d VALUES (8, 'Anvil Distribution') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; WHERE d.zipcode != '21201'; 6

VALUES (5, 'Gizmo transglobal'), (6, 'Associated Computing, inc') ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname; • Must specify which unique index to take update action on. • Prefer to do this with inference clause, naming relevant column (or columns, or expressions, or some combination). • Alternatively, ON CONFLICT can name constraint directly, but avoid it (needed for exclusion constraints, but not unique constraints). • Very ﬂexible and forgiving. Optimizer throws error if it cannot ﬁnd satisfactory unique index. 7

Why not implement MERGE? • SQL standard speciﬁed MERGE statement
has been implemented by a few other popular database systems. • Synchronizes two tables, inserting, updating, and deleting as it goes. Individual MERGE statement may lack WHEN NOT MATCHED THEN INSERT “handler”. In contrast, Postgres UPSERT is “driven by INSERT”. • Source table could be VALUES(), so MERGE appears UPSERT-like. • Lacks “fundamental UPSERT property”, because can (for example) raise duplicate violations for simple UPSERT cases. • ON CONFLICT DO UPDATE has some amount of MERGE-like ﬂexibility. • Idea that MERGE is equivalent to UPSERT is unfortunate myth. 8

UPSERT before 9.5 http://www.postgresql.org/docs/9.4/static/plpgsql-control-structures.html#PLPGSQL- UPSERT-EXAMPLE

Making MERGE statement “do UPSERT” http://www.stackoverﬂow.com/a/22777749

MERGE vs. UPSERT • UPSERT is about guaranteeing a certain
outcome (insert or update). • MERGE operates according to the same rules as individual INSERT, UPDATE, or DELETE statements. • No games are played with visibility, and no unique indexes are required. • Can be simulated with writeable WITH statements in Postgres. • The point of MERGE is that it’s fast to bulk load when reconciliation driven by conventional join, and it’s a higher level construct for reconciliation-orientated DML. 11

MongoDB UPSERT bug https://jira.mongodb.org/browse/SERVER-14322

Summary of UPSERT • UPSERT has very ﬂexible syntax —
far more ﬂexible (and far less error prone) than comparable MySQL feature, for example. • Complicated to implement, but not complicated to use. Involved small revision to chapter in documentation on MVCC rules, for example. The behavior of UPSERT is now described specially. • Unlike MERGE, has no race conditions. No duplicate violations from arbiter unique index. No “unprincipled deadlocks”. • Unlike MERGE, inserting is one possible outcome that you must always be happy with. 13

New use-cases

Data integration • Uniform access required (e.g. from SQL). •
Diverse sources. Often autonomous, distributed. • Diversity around encoding (In the sense of how domain- speciﬁc ideas are represented). • Not just from other databases. • Everything must ﬁt into “mediated” schema. • “Screen scraping” is about the worst/best example of this. 15

Where the rubber hits the road • Many sources of
data may require integration into main database, with varying levels of quality. • With complex data landscape, idempotence is important. • e.g. When consuming same message multiple times from Apache Kafka node. • The need to reconcile table in live system is in tension with MVCC semantics. • Do you even know if you can “fence” tables during ETL? Or are you just too lazy busy to ﬁnd out for sure? 16

When dealing with this complexity, “just be careful” is a
bad plan, or at least seems unlikely to scale well. The experience of the last 40 years suggests that in general, a few good primitives should be leveraged to the hilt to deal with complexity. PostgreSQL’s UPSERT feature is built to be such a primitive; it addresses increasingly important problems surrounding data integration. My perspective 17

Example: accidental distributed system

• Retail system syncs with disparate suppliers • Ad-hoc text
ﬁle formats • May sell same product — still not purchasing from manufacturer

CREATE TABLE products ( sku serial PRIMARY KEY, upc_barcode bigint,
description text NOT NULL, -- Time that product was entered into our system created_at timestamp with timezone not null default now(), -- Time that third-party file indicates change is effective by last_updated_at timestamp with timezone, case_size numeric NOT NULL, current_stock_qty numeric NOT NULL default 0, current_retail_price numeric NOT NULL, CONSTRAINT unique_barcode UNIQUE (upc_barcode) ); CREATE TABLE suppliers ( supplier_code integer PRIMARY KEY, -- example abridged here ); CREATE TABLE supplier_prices ( upc_barcode integer REFERENCES products, supplier_code integer REFERENCES suppliers, price numeric NOT NULL, CONSTRAINT product_supplier PRIMARY KEY (upc_barcode, supplier_code) ); 20

-- New CSV file downloaded, with new suggested retail price
-- effective from right now: INSERT INTO products (upc_barcode, description, current_retail_price, last_updated_at) AS p -- Timestamp here (and everything else) originates from some -- supplier’s system VALUES (804551312731, 'Whistler Brand Shampoo', 5.99, '2015-09-18 12:00:00-5') ON CONFLICT (upc_barcode) DO UPDATE SET description = EXCLUDED.description, last_updated_at = EXCLUDED.last_updated_at, current_retail_price = EXCLUDED.current_retail_price WHERE p.last_updated_at < EXCLUDED.last_updated_at; 21 • INSERT statements like this built from data in ad-hoc CSV formats (XML if you’re lucky). • Retailer always sells products at suggested price in this example. • Probably no way of representing in ad-hoc system that this only concerns “supplier independent” facets of product. • Every CSV ﬁle line represents: “Here is a product that you may or may not have encountered before”. Most details (e.g. description) seldom change.

Advanced capabilities

INSERT INTO products (upc_barcode, description, current_retail_price, last_updated_at) AS p VALUES
(739189350438, 'Mixed Nuts', 5.99, '2015-09-18 12:00:00-5’) -- Partial unique index support: ON CONFLICT (upc_barcode) WHERE is_active DO UPDATE SET description = EXCLUDED.description, last_updated_at = EXCLUDED.last_updated_at, current_retail_price = EXCLUDED.current_retail_price; 23 • Products may need to be “logically” deleted. Need to stay in table though, which prevents actual deletion. • “is_active” boolean column represents this concept. • Partial unique indexes can enforce uniqueness among active products alone. Must consider what happens if and when product re-added. • With PostgreSQL’s ON CONFLICT DO UPDATE, this “just works”.

Safety ﬁrst • Lots of thought went into making the
feature easy to use correctly, as we’ve seen. • Almost as much thought went into making it hard to use incorrectly. • The way that the semantics of a query depend on the presence of a constraint (or unique index) is pretty novel. • This behooves us to bend over backwards to make it safe. • This is thought to be a strong overall advantage for PostgreSQL. I think that we’re the ﬁrst to make UPSERT easy. 24

postgres=# EXPLAIN INSERT INTO upsert (key, val) VALUES(1, 'Baz') ON
CONFLICT (key) DO UPDATE SET val = EXCLUDED.val WHERE EXCLUDED.val <> 'Contrived'; QUERY PLAN ---------------------------------------------------- Insert on upsert (cost=0.00..0.01 rows=1 width=0) Conflict Resolution: UPDATE Conflict Arbiter Indexes: upsert_pkey Conflict Filter: (excluded.val <> 'Contrived'::text) -> Result (cost=0.00..0.01 rows=1 width=0) (5 rows) 25

postgres=# EXPLAIN INSERT INTO upsert (key, val) VALUES(1, 'Baz') ON
CONFLICT (key) WHERE is_active DO UPDATE SET val = EXCLUDED.val WHERE EXCLUDED.val <> 'Contrived'; QUERY PLAN ---------------------------------------------------- Insert on upsert (cost=0.00..0.01 rows=1 width=0) Conflict Resolution: UPDATE Conflict Arbiter Indexes: upsert_pkey, key_is_active_idx Conflict Filter: (excluded.val <> 'Contrived'::text) -> Result (cost=0.00..0.01 rows=1 width=0) (5 rows) 26

Advanced inference • Also supports expression indexes. • Is extremely
forgiving of ordering and redundancy. • A predicate on inference speciﬁcation will work with unique indexes that satisfy it. This includes equivalent indexes without any predicate, for example. • Again: Use inference speciﬁcation. Do not name constraint directly. • Exclusion constraints require this, though (although you may not have to use an explicit constraint at all for DO NOTHING). 27

Other advanced features • postgres_fdw supports DO NOTHING variant only.
• Unique index inference speciﬁcation mandatory for DO UPDATE, but foreign tables never have any local index/constraint. • RLS works — INSERT, UPDATE, and ALL policies handled at each stage. • Updatable views work. 28

Other advanced features • Logical decoding plugins will see either
INSERT or UPDATE for each affected row. • New-to-9.5 subselect update list syntax works, of course. • In general, ON CONFLICT DO UPDATE statement totally unrestricted in structure. • Can contain subselects in target list, just like an INSERT or UPDATE. • Special WHERE clause (ﬁnal update decision) can contain subselects. • RETURNING supported, so may appear in WITH clause. 29

Thanks! Any questions?

UPSERT use-cases

UPSERT use-cases

Peter Geoghegan

More Decks by Peter Geoghegan

Other Decks in Programming

Featured

Transcript

UPSERT use-cases Peter Geoghegan Twitter: @petervgeoghegan FOSDEM — January 31,

What is UPSERT? • INSERT row, or UPDATE existing one

What is UPSERT? • “Fundamental UPSERT property”. At READ COMMITTED

Syntax -- “UPSERT” some distributors: INSERT INTO distributors (did, dname)

-- “UPSERT” some distributors: INSERT INTO distributors (did, dname) VALUES

Syntax -- “UPSERT” some distributors: INSERT INTO distributors (did, dname)

Syntax -- “UPSERT” some distributors: INSERT INTO distributors (did, dname)

Why not implement MERGE? • SQL standard speciﬁed MERGE statement

UPSERT before 9.5 http://www.postgresql.org/docs/9.4/static/plpgsql-control-structures.html#PLPGSQL- UPSERT-EXAMPLE

Making MERGE statement “do UPSERT” http://www.stackoverﬂow.com/a/22777749

MERGE vs. UPSERT • UPSERT is about guaranteeing a certain

MongoDB UPSERT bug https://jira.mongodb.org/browse/SERVER-14322

Summary of UPSERT • UPSERT has very ﬂexible syntax —

New use-cases

Data integration • Uniform access required (e.g. from SQL). •

Where the rubber hits the road • Many sources of

When dealing with this complexity, “just be careful” is a

Example: accidental distributed system

• Retail system syncs with disparate suppliers • Ad-hoc text

CREATE TABLE products ( sku serial PRIMARY KEY, upc_barcode bigint,

-- New CSV file downloaded, with new suggested retail price

Advanced capabilities

INSERT INTO products (upc_barcode, description, current_retail_price, last_updated_at) AS p VALUES

Safety ﬁrst • Lots of thought went into making the

postgres=# EXPLAIN INSERT INTO upsert (key, val) VALUES(1, 'Baz') ON

postgres=# EXPLAIN INSERT INTO upsert (key, val) VALUES(1, 'Baz') ON

Advanced inference • Also supports expression indexes. • Is extremely

Other advanced features • postgres_fdw supports DO NOTHING variant only.

Other advanced features • Logical decoding plugins will see either

Thanks! Any questions?