Upgrade to Pro — share decks privately, control downloads, hide ads and more …

UPSERT use-cases

Peter Geoghegan
September 18, 2015

UPSERT use-cases

Ecobox
Home
Blog
About
Tickets
Sessions
Venue & Hotel
Sponsors

POSTGRES OPEN
SEPT 16TH - 18TH ・ DALLAS

UPSERT use cases

Back
Date: Sept. 18, 2015
Time: 13:30 - 14:20
Room: Houston Ballroom B/C
Level: Intermediate
Feedback: Leave feedback
PostgreSQL 9.5 will have support for a feature that is popularly known as "UPSERT" - the ability to either insert or update a row according to whether an existing row with the same key exists. If such a row already exists, the implementation should update it. If not, a new row should be inserted. This is supported by way of a new high level syntax (a clause that extends the INSERT statement) that more or less relieves the application developer from having to give any thought to race conditions. This common operation for client applications is set to become far simpler and far less error-prone than legacy ad-hoc approaches to UPSERT involving subtransactions. Moreover, the new implementation performs much better than those legacy approaches.

While the feature is most obviously compelling for OLTP and web application use cases, it's also true that the syntax is powerful enough to be very useful in many real world data integration scenarios. The non-standard PostgreSQL syntax offer explicit, fine grained control over where and how to update. For example, an update may not actually affect an existing row due to not satisfying some additional criteria (i.e. due to not passing the ON CONFLICT ... DO UPDATE special, dedicated WHERE clause).

This talk gives an overview of the feature from a high level, and examines these use cases. You will learn how you might want to use the new UPSERT feature in your application beyond the obvious. In passing, there will be brief discussion of why UPSERT's implementation proved to be a hard problem, and, relatedly, why a custom syntax was used instead of the SQL standard's MERGE syntax.

Peter Geoghegan

September 18, 2015
Tweet

More Decks by Peter Geoghegan

Other Decks in Programming

Transcript

  1. UPSERT use-cases
    Peter Geoghegan
    Twitter: @petervgeoghegan
    FOSDEM — January 31, 2016

    View Slide

  2. What is UPSERT?
    • INSERT row, or UPDATE existing one to reconcile it
    with what you wanted to INSERT. Always one or the
    other of those two. No “ifs”, no “buts”.
    • Available in PostgreSQL 9.5 (recently released).
    • 9.5 also adds a DO NOTHING variant that never
    locks row. MySQL calls this INSERT IGNORE.
    • You more or less don't have to worry about
    concurrency/race conditions with the feature.
    2

    View Slide

  3. What is UPSERT?
    • “Fundamental UPSERT property”. At READ COMMITTED
    isolation level, you should always get an insert or update.
    • No unprincipled deadlocks.
    • No spurious unique constraint violations.
    • Only way to really ensure this is to make it driven by insert.
    Take alternative path (i.e., go update existing row) when
    would-be duplicate violation detected.
    • Most requested by those with OLTP and web app
    databases.
    3

    View Slide

  4. Syntax
    -- “UPSERT” some distributors:
    INSERT INTO distributors (did, dname)
    VALUES (5, 'Gizmo transglobal'),
    (6, 'Associated Computing, inc')
    ON CONFLICT (did)
    DO UPDATE SET dname = EXCLUDED.dname;
    -- “INSERT IGNORE” variant:
    INSERT INTO distributors (did, dname)
    VALUES (7, 'Redline GmbH')
    ON CONFLICT (did)
    DO NOTHING;
    -- Consider not updating existing distributor:
    INSERT INTO distributors (did, dname) AS d
    VALUES (8, 'Anvil Distribution')
    ON CONFLICT (did)
    DO UPDATE SET dname = EXCLUDED.dname;
    WHERE d.zipcode != '21201';
    4

    View Slide

  5. -- “UPSERT” some distributors:
    INSERT INTO distributors (did, dname)
    VALUES (5, 'Gizmo transglobal'),
    (6, 'Associated Computing, inc')
    ON CONFLICT (did)
    DO UPDATE SET dname = EXCLUDED.dname;
    -- “INSERT IGNORE” variant:
    INSERT INTO distributors (did, dname)
    VALUES (7, 'Redline GmbH')
    ON CONFLICT (did)
    DO NOTHING;
    -- Consider not updating existing distributor:
    INSERT INTO distributors (did, dname) AS d
    VALUES (8, 'Anvil Distribution')
    ON CONFLICT (did)
    DO UPDATE SET dname = EXCLUDED.dname;
    WHERE d.zipcode != '21201';
    Syntax
    5

    View Slide

  6. Syntax
    -- “UPSERT” some distributors:
    INSERT INTO distributors (did, dname)
    VALUES (5, 'Gizmo transglobal'),
    (6, 'Associated Computing, inc')
    ON CONFLICT (did)
    DO UPDATE SET dname = EXCLUDED.dname;
    -- “INSERT IGNORE” variant:
    INSERT INTO distributors (did, dname)
    VALUES (7, 'Redline GmbH')
    ON CONFLICT (did)
    DO NOTHING;
    -- Consider not updating existing distributor:
    INSERT INTO distributors (did, dname) AS d
    VALUES (8, 'Anvil Distribution')
    ON CONFLICT (did)
    DO UPDATE SET dname = EXCLUDED.dname;
    WHERE d.zipcode != '21201';
    6

    View Slide

  7. Syntax
    -- “UPSERT” some distributors:
    INSERT INTO distributors (did, dname)
    VALUES (5, 'Gizmo transglobal'),
    (6, 'Associated Computing, inc')
    ON CONFLICT (did)
    DO UPDATE SET dname = EXCLUDED.dname;
    • Must specify which unique index to take update action on.
    • Prefer to do this with inference clause, naming relevant column (or
    columns, or expressions, or some combination).
    • Alternatively, ON CONFLICT can name constraint directly, but
    avoid it (needed for exclusion constraints, but not unique
    constraints).
    • Very flexible and forgiving. Optimizer throws error if it cannot find
    satisfactory unique index.
    7

    View Slide

  8. Why not implement MERGE?
    • SQL standard specified MERGE statement has been implemented by a
    few other popular database systems.
    • Synchronizes two tables, inserting, updating, and deleting as it goes.
    Individual MERGE statement may lack WHEN NOT MATCHED THEN
    INSERT “handler”. In contrast, Postgres UPSERT is “driven by INSERT”.
    • Source table could be VALUES(), so MERGE appears UPSERT-like.
    • Lacks “fundamental UPSERT property”, because can (for example)
    raise duplicate violations for simple UPSERT cases.
    • ON CONFLICT DO UPDATE has some amount of MERGE-like flexibility.
    • Idea that MERGE is equivalent to UPSERT is unfortunate myth.
    8

    View Slide

  9. UPSERT before 9.5
    http://www.postgresql.org/docs/9.4/static/plpgsql-control-structures.html#PLPGSQL-
    UPSERT-EXAMPLE

    View Slide

  10. Making MERGE statement
    “do UPSERT”
    http://www.stackoverflow.com/a/22777749

    View Slide

  11. MERGE vs. UPSERT
    • UPSERT is about guaranteeing a certain outcome (insert or
    update).
    • MERGE operates according to the same rules as individual
    INSERT, UPDATE, or DELETE statements.
    • No games are played with visibility, and no unique indexes are
    required.
    • Can be simulated with writeable WITH statements in Postgres.
    • The point of MERGE is that it’s fast to bulk load when
    reconciliation driven by conventional join, and it’s a higher level
    construct for reconciliation-orientated DML.
    11

    View Slide

  12. MongoDB UPSERT bug
    https://jira.mongodb.org/browse/SERVER-14322

    View Slide

  13. Summary of UPSERT
    • UPSERT has very flexible syntax — far more flexible (and far
    less error prone) than comparable MySQL feature, for example.
    • Complicated to implement, but not complicated to use.
    Involved small revision to chapter in documentation on MVCC
    rules, for example. The behavior of UPSERT is now described
    specially.
    • Unlike MERGE, has no race conditions. No duplicate violations
    from arbiter unique index. No “unprincipled deadlocks”.
    • Unlike MERGE, inserting is one possible outcome that you
    must always be happy with.
    13

    View Slide

  14. New use-cases

    View Slide

  15. Data integration
    • Uniform access required (e.g. from SQL).
    • Diverse sources. Often autonomous, distributed.
    • Diversity around encoding (In the sense of how domain-
    specific ideas are represented).
    • Not just from other databases.
    • Everything must fit into “mediated” schema.
    • “Screen scraping” is about the worst/best example of this.
    15

    View Slide

  16. Where the rubber hits the
    road
    • Many sources of data may require integration into main
    database, with varying levels of quality.
    • With complex data landscape, idempotence is important.
    • e.g. When consuming same message multiple times
    from Apache Kafka node.
    • The need to reconcile table in live system is in tension
    with MVCC semantics.
    • Do you even know if you can “fence” tables during ETL?
    Or are you just too lazy busy to find out for sure?
    16

    View Slide

  17. When dealing with this complexity, “just be
    careful” is a bad plan, or at least seems unlikely
    to scale well.
    The experience of the last 40 years suggests
    that in general, a few good primitives should be
    leveraged to the hilt to deal with complexity.
    PostgreSQL’s UPSERT feature is built to be
    such a primitive; it addresses increasingly
    important problems surrounding data
    integration.
    My perspective
    17

    View Slide

  18. Example: accidental
    distributed system

    View Slide

  19. • Retail system syncs with disparate suppliers
    • Ad-hoc text file formats
    • May sell same product — still not purchasing from manufacturer

    View Slide

  20. CREATE TABLE products
    (
    sku serial PRIMARY KEY,
    upc_barcode bigint,
    description text NOT NULL,
    -- Time that product was entered into our system
    created_at timestamp with timezone not null default now(),
    -- Time that third-party file indicates change is effective by
    last_updated_at timestamp with timezone,
    case_size numeric NOT NULL,
    current_stock_qty numeric NOT NULL default 0,
    current_retail_price numeric NOT NULL,
    CONSTRAINT unique_barcode UNIQUE (upc_barcode)
    );
    CREATE TABLE suppliers
    (
    supplier_code integer PRIMARY KEY,
    -- example abridged here
    );
    CREATE TABLE supplier_prices
    (
    upc_barcode integer REFERENCES products,
    supplier_code integer REFERENCES suppliers,
    price numeric NOT NULL,
    CONSTRAINT product_supplier PRIMARY KEY (upc_barcode, supplier_code)
    );
    20

    View Slide

  21. -- New CSV file downloaded, with new suggested retail price
    -- effective from right now:
    INSERT INTO products (upc_barcode,
    description,
    current_retail_price,
    last_updated_at) AS p
    -- Timestamp here (and everything else) originates from some
    -- supplier’s system
    VALUES (804551312731, 'Whistler Brand Shampoo', 5.99, '2015-09-18 12:00:00-5')
    ON CONFLICT (upc_barcode)
    DO UPDATE SET description = EXCLUDED.description,
    last_updated_at = EXCLUDED.last_updated_at,
    current_retail_price = EXCLUDED.current_retail_price
    WHERE p.last_updated_at < EXCLUDED.last_updated_at;
    21
    • INSERT statements like this built from data in ad-hoc CSV formats (XML if
    you’re lucky).
    • Retailer always sells products at suggested price in this example.
    • Probably no way of representing in ad-hoc system that this only concerns
    “supplier independent” facets of product.
    • Every CSV file line represents: “Here is a product that you may or may not have
    encountered before”. Most details (e.g. description) seldom change.

    View Slide

  22. Advanced capabilities

    View Slide

  23. INSERT INTO products (upc_barcode,
    description,
    current_retail_price,
    last_updated_at) AS p
    VALUES (739189350438, 'Mixed Nuts', 5.99, '2015-09-18 12:00:00-5’)
    -- Partial unique index support:
    ON CONFLICT (upc_barcode) WHERE is_active
    DO UPDATE SET description = EXCLUDED.description,
    last_updated_at = EXCLUDED.last_updated_at,
    current_retail_price = EXCLUDED.current_retail_price;
    23
    • Products may need to be “logically” deleted. Need to stay in table
    though, which prevents actual deletion.
    • “is_active” boolean column represents this concept.
    • Partial unique indexes can enforce uniqueness among active
    products alone. Must consider what happens if and when product
    re-added.
    • With PostgreSQL’s ON CONFLICT DO UPDATE, this “just works”.

    View Slide

  24. Safety first
    • Lots of thought went into making the feature easy to use
    correctly, as we’ve seen.
    • Almost as much thought went into making it hard to use
    incorrectly.
    • The way that the semantics of a query depend on the
    presence of a constraint (or unique index) is pretty novel.
    • This behooves us to bend over backwards to make it safe.
    • This is thought to be a strong overall advantage for
    PostgreSQL. I think that we’re the first to make UPSERT easy.
    24

    View Slide

  25. postgres=# EXPLAIN INSERT INTO upsert (key, val)
    VALUES(1, 'Baz') ON CONFLICT (key) DO UPDATE
    SET val = EXCLUDED.val WHERE EXCLUDED.val <> 'Contrived';
    QUERY PLAN
    ----------------------------------------------------
    Insert on upsert (cost=0.00..0.01 rows=1 width=0)
    Conflict Resolution: UPDATE
    Conflict Arbiter Indexes: upsert_pkey
    Conflict Filter: (excluded.val <> 'Contrived'::text)
    -> Result (cost=0.00..0.01 rows=1 width=0)
    (5 rows)
    25

    View Slide

  26. postgres=# EXPLAIN INSERT INTO upsert (key, val)
    VALUES(1, 'Baz') ON CONFLICT (key) WHERE is_active DO UPDATE
    SET val = EXCLUDED.val WHERE EXCLUDED.val <> 'Contrived';
    QUERY PLAN
    ----------------------------------------------------
    Insert on upsert (cost=0.00..0.01 rows=1 width=0)
    Conflict Resolution: UPDATE
    Conflict Arbiter Indexes: upsert_pkey, key_is_active_idx
    Conflict Filter: (excluded.val <> 'Contrived'::text)
    -> Result (cost=0.00..0.01 rows=1 width=0)
    (5 rows)
    26

    View Slide

  27. Advanced inference
    • Also supports expression indexes.
    • Is extremely forgiving of ordering and redundancy.
    • A predicate on inference specification will work with unique
    indexes that satisfy it. This includes equivalent indexes
    without any predicate, for example.
    • Again: Use inference specification. Do not name constraint
    directly.
    • Exclusion constraints require this, though (although you may
    not have to use an explicit constraint at all for DO NOTHING).
    27

    View Slide

  28. Other advanced features
    • postgres_fdw supports DO NOTHING variant only.
    • Unique index inference specification mandatory
    for DO UPDATE, but foreign tables never have
    any local index/constraint.
    • RLS works — INSERT, UPDATE, and ALL policies
    handled at each stage.
    • Updatable views work.
    28

    View Slide

  29. Other advanced features
    • Logical decoding plugins will see either INSERT or UPDATE for
    each affected row.
    • New-to-9.5 subselect update list syntax works, of course.
    • In general, ON CONFLICT DO UPDATE statement totally
    unrestricted in structure.
    • Can contain subselects in target list, just like an INSERT or
    UPDATE.
    • Special WHERE clause (final update decision) can contain
    subselects.
    • RETURNING supported, so may appear in WITH clause.
    29

    View Slide

  30. Thanks!
    Any questions?

    View Slide