PyCon Canada 2017 - Postgres, MVCC, and You

Postgres, MVCC, and You  (or, why COUNT(*) is slow) David
Wolever @wolever PyCon Canada 2017 - Hello! - About me: Akindi, excited to be at the ﬁrst PyCon Canada outside of Toronto, love SQL

@wolever COUNT(*) =# select count(*) from attendees; count ------- 540
(1 row) In this talk I’m going to be talking about COUNT(*) COUNT(*) is the SQL function to count the number of rows in a table. For example, if we wanted to count the number of attendees at PyCon this year, we might use: ……… And we would get:

=# select count(*) from messages; count ---------- 35,664,481 (1 row)
Time: 3,396.989 ms @wolever And one day at work I was trying to count the number of `messages` in our messages table: … … … Dang, that’s really slow. So I got curious: counting the number of rows in a table should be easy! What’s making it so dang slow?

What is a COUNT(*)? @wolever Now, before we go any
further, a quick refresher on SQL:

S  Q  L @wolever The initialism SQL stands for Structured
Query Language

Structured  Query  Language @wolever A declarative language for querying data
And it’s a declarative language used for querying data. For example:

@wolever =# select * from pycons; year | country |
city ------+---------+------------- 2012 | CA | Toronto 2012 | US | Santa Clara 2013 | CA | Toronto 2013 | US | Santa Clara 2014 | CA | Montreal 2015 | CA | Montreal 2016 | CA | Toronto 2016 | US | Portland 2017 | CA | Montreal 2017 | US | Portland We could use SQL to get a list of PyCon locations:

@wolever =# select * from pycons; year | country |
city ------+---------+------------- 2012 | CA | Toronto 2012 | US | Santa Clara 2013 | CA | Toronto 2013 | US | Santa Clara 2014 | CA | Montreal 2015 | CA | Montreal 2016 | CA | Toronto 2016 | US | Portland 2017 | CA | Montreal 2017 | US | Portland (and please forgive how incredibly incomplete this list is)

@wolever SQL can also be used to perform aggregate calculations,
like counting. For example:

@wolever =# select count(*) from pycons .. where city =
'Montreal'; count ------- 3 To count the number of PyCons that have been in Montreal,

'Montreal'; count ------- 3 We can use the COUNT aggregate function, which counts…

'Montreal'; count ------- 3 … the number of rows. (and if you’re wondering about the "star"…

'Montreal'; count ------- 3 … it’s there because we want to count all the rows, not just non-null values. And don’t worry; if that doesn’t make any sense, you can ignore it).

So that covers the…

… "querying data" part of "SQL". And I’m sure some of you have tuned out because this is old-hat for you… but don’t worry - you’ll be learning new things in about 4 minutes. But second part of the deﬁnition I want to touch on is the…

… "declarative" portion.

@wolever warning: And before we go any further, I need
to satisfy the language lawyers in the crows and include this disclaimer:

@wolever warning: generalizations ahead I’m going to make generalizations which
aren’t always, strictly, 100% true. But they are True Enough for now.

@wolever declarative: In a declarative language like SQL, you…

@wolever declarative: describe what you want … describe what you
want

@wolever declarative: describe what you want imperative: Where as in
an imperative language like Python, you…

@wolever declarative: describe what you want imperative: describe a series
of steps … describe a series of steps, which hopefully end up at what you want. For example…

of steps If you were hungry, in a declarative language you might say…

of steps … "I want a lettuce and tomato sandwich".

of steps Whereas, in an imperative language, you would say "get a tomato from the fridge, get a knife from the drawer, use the knife to cut the tomato…" and so on. Or for a more concrete example:

@wolever If you were using Python to ﬁnd a list
of PyCons in Montreal, you might have code that looks something like this:

@wolever for pycon in open("pycon-list.txt"):  year, country, city = pycon.strip().split()
if city == "Montreal": print(pycon) Notice that we’re describing a series of steps:

if city == "Montreal": print(pycon) - opening a ﬁle

if city == "Montreal": print(pycon) - iterating over the lines

if city == "Montreal": print(pycon) - Splitting each line on whitespace

if city == "Montreal": print(pycon) - Comparing the city to "Montreal"

if city == "Montreal": print(pycon) - And if the city is Montreal, printing the line

@wolever Now contrast this with SQL…

@wolever select * from pycons where city = 'Montreal'; Where
we describe what we want:

@wolever select * from pycons where city = 'Montreal'; We
want to: - Get all the columns (that is, the year, country, and city)

@wolever select * from pycons where city = 'Montreal'; -
From the list of PyCons - the pycons table, in SQL parlance

@wolever select * from pycons where city = 'Montreal'; -
Where the city is Montreal

@wolever select * from pycons where city = 'Montreal'; Now,
a big advantage of declarative languages is that they can be very powerful and descriptive; you tell the computer what you want, and it decides how to get that thing you want. But this comes at a cost: computers dumb, and sometimes they make bad decisions.

@wolever imperative:  (usually) easy to ﬁgure out what’s going on
"under the hood" Now, with imperative languages, when the computer makes a bad decision or does something unexpected, it’s fairly straight forward to "peek under the hood" and see what’s going on; after all, you’ve described a series of small, granular steps.

@wolever imperative:  (usually) easy to ﬁgure out what’s going on
"under the hood" declarative:  … not so much But this isn’t so true with declarative languages. When a program written in a declarative language is misbehaving - in this case, taking far longer to than we expect to execute the query - we need to understand how the computer is taking that request we’ve made and transforming it into actual, concrete steps.

Time: 3,396.989 ms @wolever So, getting back to the original question of…

Time: 3,396.989 ms @wolever … why is this `COUNT(*)` so slow? The first thing we need to figure out is what’s going on under the hood. That is, when we run this query, what are the concrete steps Postgres is going to take to figure out how many rows there are in the messages table?

@wolever =# explain select * from messages where id =
13291416051; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = '13291416051'::bigint) To ﬁnd out, we can use the "EXPLAIN" command. EXPLAIN, like the name suggests, asks Postgres to explain the steps it will take in executing the query.

1234; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = '13291416051'::bigint) For example, this is what EXPLAIN tells us for a simple query that looks up one message by its ID: (next)

1234; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = 1234) (pause for 1 sec)

1234; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = 1234) The output of explain is called the "query plan" - the plan Postgres will follow to execute the query - and there are a few important things to notice here:

1234; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = 1234) First, the "Index Scan". Properly explaining indexes is out of the scope of this talk, but I’d really encourage all of you – even if you think you know about indexes – to check out use-the-index-luke.com:

1234; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = 1234) use-the-index-luke.com This site is great. Unless you’re an actual, professional, database administrator, I promise you’ll learn something new.

1234; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = 1234) Anyway, the important thing to know about indexes is that they are one of the tools the database uses to quickly and eﬃciently ﬁnd the data you’re looking for. And the "index scan" portion of the query plan tells us that Postgres will be using the "messages_id" index to look up the message we’re asking for.

1234; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = 1234) The next important thing to notice are the cost, row count, and row width estimates. These are, in a word, magical. Postgres (and most other SQL databases) keep track of the statistical distribution of the data they store, and the query planner uses these statistics when it’s deciding how best to execute the query.

1234; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = 1234) In this case, the query planner estimates that the start-up cost of the query will be 0.56, and the execution cost will be 8.79. These numbers are totally arbitrary and machine-dependent, so don’t read into them much beyond "smaller is better" and "below 10 is good".

1234; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = 1234) Next, we’ve got the row count and width estimates. The row count is the number of rows the query planner thinks the query will return. In this case, because the "messages_id" index is not unique - that is, it’s possible for there to be duplicate ids - it has guessed that this query will return about 13 rows (in reality, though, it actually returns zero). The width is the estimated number of bytes per row returned.

1234; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using messages_id on messages (cost=0.56..8.79 rows=13 width=61) Index Cond: (id = 1234) Finally, the "Index Cond" tells us that it will be using the index to look for all the rows where "id = 1234". It’s fascinating to look at the plans for more complex queries, and I’d really encourage all of you to give it a try some time. And when you do, notice that the plan is dependent on the values being queried for; for example, for some reason, if I change that ID to one billion, Postgres uses a table scan instead of an index scan.

@wolever =# explain select count(*) from messages; QUERY PLAN --------------------------------------------------------------------------
Aggregate (cost=887442.29..887442.30 rows=1 width=8) -> Seq Scan on messages (cost=0.00..798267.83 rows=35669783 width=0) Anyway, now that we know a tiny bit about explain and the query planner, let’s take a look at the plan for our COUNT(*) query:

Aggregate (cost=887442.29..887442.30 rows=1 width=8) -> Seq Scan on messages (cost=0.00..798267.83 rows=35669783 width=0) … ouch

Aggregate (cost=887442.29..887442.30 rows=1 width=8) -> Seq Scan on messages (cost=0.00..798267.83 rows=35669783 width=0) That "seq scan" means that Postgres is going to be walking through every row in the table, and that row estimate…

Aggregate (cost=887442.29..887442.30 rows=1 width=8) -> Seq Scan on messages (cost=0.00..798267.83 rows=35669783 width=0) … tells us that’s going to be about 35 million rows.

Aggregate (cost=887442.29..887442.30 rows=1 width=8) -> Seq Scan on messages (cost=0.00..798267.83 rows=35669783 width=0) No wonder it’s so slow: no shortcuts are being taken, and no optimization is being performed.

@wolever "Counting all the things" is pretty common… … why
shouldn’t that be fast? And that seems pretty weird. After all, "counting all the things" is a pretty common operation, and databases are supposed to make common operations fast. Why shouldn’t that be fast?

@wolever A quick trip to Google, though, leads us to…

https://wiki.postgresql.org/wiki/Slow_Counting    “The reason why this is slow is related
to the MVCC implementation in PostgreSQL. The fact that multiple transactions can see different states of the data means that there can be no straightforward way for "COUNT(*)" to summarize data across the whole table; PostgreSQL must walk through all rows, in some sense.” @wolever … the Postgres wiki page on "slow counting", which has this helpful explanation:

https://wiki.postgresql.org/wiki/Slow_Counting    “The reason why this is slow is related
to the MVCC implementation in PostgreSQL. The fact that multiple transactions can see different states of the data means that there can be no straightforward way for "COUNT(*)" to summarize data across the whole table; PostgreSQL must walk through all rows, in some sense.” @wolever <read explanation>

@wolever But that’s…

… @wolever

…… @wolever

……… @wolever but that’s… not very helpful

@wolever The next thing we need to do is understand…

@wolever M  V  C  C … MVCC. MVCC stands for

@wolever Multi  Version  Concurrency  Control Multi Version Concurrency Control. MVCC
allows…

Multi  Version  Concurrency  Control @wolever Allows multiple concurrent readers and
writers… … multiple people to be reading from and writing to the database, all at the same time …

Multi  Version  Concurrency  Control @wolever Allows multiple concurrent readers and
writers… … without everything breaking all the time … without everything breaking all the time.

@wolever To use the very most cliché example: imagine you’re
working at a bank and implementing the "transfer funds" function. It might look something like this:

@wolever def transfer(amount, src_account, dst_account): src_account.balance -= amount dst_account.balance +=
amount src_account.save() dst_account.save()

amount src_account.save() dst_account.save() Deduct the amount from the source

amount src_account.save() dst_account.save() Add the amount to the destination

amount src_account.save() dst_account.save() And then save the new balances.

@wolever But there are some problems… Now, there are some
problems here. In fact there are quite a few. But I’m going to focus on one in particular:

@wolever def transfer(amount, src_account, dst_account): if src_account.balance < amount: raise
InsufficientFunds() src_account.balance -= amount dst_account.balance += amount src_account.save() *CHECK BALANCE OF ALL ACCOUNTS* dst_account.save() What happens if - and this is slightly contrived, so bear with me - between saving the balance of the ﬁrst account and saving the value of the second account, the FDIC comes along to audit the bank and asks them to account for the balance of each bank account. They’ll ﬁnd that $100 has just vanished - it isn’t in the source account, but hasn’t made it to the destination account yet. (and for a slightly less contrived example: the same thing will happen if the program crashes between saving the account balances)

@wolever Enter MVCC! This is exactly the problem (well, exactly
one of the problems) addressed by Postgres’ multi version concurrency control…

@wolever Transactions … transactions. Transactions are the tool that Postgres
(and all SQL databases) use to group a bunch of related queries together, making sure that other people querying the database can’t see intermediary results, and make sure the database doesn’t end up in an inconsistent state if something crashing in the middle of those queries. The best way to show how transactions work is with every speaker’s favourite thing…

@wolever (live demo ) Transactions a live demo - We’ll
start with just two bank accounts: select * from accounts; - Next, from this other terminal, we’re going to transfer $100 from one to the other. We’ll start by creating a transaction with "begin": begin; - this tells postgres that the following commands only make sense to perform as a group. For example, when we deduct $100 from Alex’s account: update accounts set balance = balance - 100 where name = 'Alex'; - we should see the $100 deducted: select * from accounts; - but no one else should: select * from accounts; - once we’ve added that $100 into Chris’ account, though: update accounts set balance = balance + 100 where name = 'Chris';

@wolever cool Okay, I don’t know about you, but I
think that’s absurdly cool.

@wolever Anyway, by this point you might, rightfully, be wondering…

@wolever but what does this have to do with counting?
… what the heck does this have to do with counting?

To understand, we need to go one level deeper and find out exactly how transactions are implemented under the hood. How is it that Postgres can have one set of data for one transaction, and a different set of data for a different connection?

And I only used one transaction in this example… but it works just as well with 10, 100, or 1,000 transactions, each changing an arbitrarily large number of rows! How does that work?

@wolever Rows are immutable Once written to disk, they’re never
changed The ﬁrst thing to understand is that rows in Postgres are immutable. Once a row has been written to disk, it’s never changed or removed.

@wolever When an update statement would change a row… …
a whole new row is saved instead And when a row is updated - for example, with the update statement - a whole new row is written.

@wolever name balance Alex 500 Chris 500 For example, our
accounts table might start out looking like this:

@wolever name balance Alex 500 Chris 500 update accounts  set
balance = balance - 100  where name = 'Alex'; And when Alex’s balance is updated, a new row will be added:

@wolever name balance Alex 500 Chris 500 Alex 400 update
accounts  set balance = balance - 100  where name = 'Alex'; Similarly, when Chris’ balance is updated

@wolever name balance Alex 500 Chris 500 Alex 400 update
accounts  set balance = balance - 100  where name = 'Alex'; update accounts  set balance = balance - 100  where name = 'Chris';

@wolever name balance Alex 500 Chris 500 Alex 400 Chris
600 update accounts  set balance = balance - 100  where name = 'Alex'; update accounts  set balance = balance + 100  where name = 'Chris'; A new row will be added to reﬂect that new balance.

@wolever … but wait … but wait. If the table
actually has four rows, why aren’t four rows returned? How come we only see two? How does Postgres know which rows should be seen by each transaction?

@wolever xid Enter the ﬁnal piece of the puzzle… XID.
The xid, or transaction ID, is a unique, sequential, integer assigned to each transaction.

@wolever =# begin; =# select txid_current(); txid_current -------------- 4713389 =#
select txid_current(); txid_current -------------- 4713389 =# commit; =# begin; =# select txid_current(); txid_current -------------- 4713390 =# commit; =# select txid_current(); txid_current -------------- 4713391 We can see how it works using the `txid_current()` function. If we begin a transaction then call txid_current(), we’ll see the current transaction ID.

select txid_current(); txid_current -------------- 4713389 =# commit; =# begin; =# select txid_current(); txid_current -------------- 4713390 =# commit; =# select txid_current(); txid_current -------------- 4713391 And notice that, if we call it again in the same transaction, we’ll get the same ID…

select txid_current(); txid_current -------------- 4713389 =# commit; =# begin; =# select txid_current(); txid_current -------------- 4713390 =# commit; =# select txid_current(); txid_current -------------- 4713391 … but if we commit that transaction and start another, we’ll get a new ID.

select txid_current(); txid_current -------------- 4713389 =# commit; =# begin; =# select txid_current(); txid_current -------------- 4713390 =# commit; =# select txid_current(); txid_current -------------- 4713391 And ﬁnally, each statement executed outside a transaction is assigned a new txid. When ever a transaction creates, updates, or deletes a row, that transaction’s ID is included in the row in one of two columns:

@wolever xmin name balance 100 Alex 500 100 Chris 500
xmin, which is the ID of the transaction which created the row, and …

@wolever xmin xmax name balance 100 Alex 500 100 Chris
500 … xmax, which is the ID of the transaction which deleted that row. Notice that, in this example, the xmax is empty because neither of these rows have been deleted. So let’s see those updates again, but this time including the xmin and xmax:

@wolever xmin xmax name balance 100 Alex 500 100 Chris
500 110 Alex 400 update accounts  set balance = balance - 100  where name = 'Alex'; When $100 is deducted from Alex’s balance, the new row that’s written includes the ID of that transaction; we’ll say 110.

@wolever xmin xmax name balance 100 110 Alex 500 100
Chris 500 110 Alex 400 update accounts  set balance = balance - 100  where name = 'Alex'; Next, the old row is "deleted" by writing the same transaction ID to the row’s xmax. (and yes, I did lie a bit earlier when I said rows were immutable; to be more precise, the data in the rows are immutable, but the xids are changed)

110 Chris 500 110 Alex 400 110 Chris 600 update accounts  set balance = balance - 100  where name = 'Alex'; update accounts  set balance = balance + 100  where name = 'Chris'; And the same thing happens when Chris’ balance is updated. (Now, interestingly, notice that rows aren’t changed when the transaction is committed or aborted. Explaining why is out of the scope of this talk, but I’d encourage you to try and ﬁgure it out!)

110 Chris 500 110 Alex 400 110 Chris 600 update accounts  set balance = balance - 100  where name = 'Alex'; update accounts  set balance = balance + 100  where name = 'Chris'; PHEW! By now you’re hopefully starting to see how Postgres is able to ﬁgure out which rows are visible to a particular transaction:

def is_row_visible(cur_xid, row): if row.xmin > cur_xid: return False if
row.xmax <= cur_xid: return False return True @wolever For each row, a straight forward visibility check is done:

row.xmax <= cur_xid: return False return True @wolever First, the row’s xmin - that is, the transaction that created the row - is checked. If it’s greater than the current transaction’s ID, then the row was created in the future and is (obviously) invisible.

row.xmax <= cur_xid: return False return True @wolever Second, the row’s xmax - that is, the transaction that removed row - is checked. If it’s less than or equal to the current transaction’s ID, then the row was deleted in the past and shouldn’t be visible.

row.xmax <= cur_xid: return False return True @wolever And otherwise the row is visible. Now… this is a bit of a simpliﬁcation, but it’s good enough to understand what’s going on. For the database geeks, though, the real algorithm looks a bit more like this:

def is_row_visible(cur_xid, row): if row.xmin > cur_xid or tx_aborted(row.xmin): return
False if row.xmax <= cur_xid and tx_comitted(row.xmax): return False return True @wolever Recall that transactions can be either committed or aborted, and that needs to be considered when checking visibility:

False if row.xmax <= cur_xid and tx_comitted(row.xmax): return False return True @wolever Rows are hidden if the transaction that created them was aborted…

False if row.xmax <= cur_xid and tx_comitted(row.xmax): return False return True @wolever … and similarly, deleted rows are only hidden if the deleting transaction was committed.

def is_row_visible(cur_xid, row): if row.xmin > cur_xid or row.xmin_aborted: return
False if tx_aborted(row.xmin): row.xmin_aborted = True row.xmin_committed = False return False if row.xmax < cur_xid and row.xmax_committed: return False if row.xmax == cur_xid: return False if row.xmax < cur_xid and tx_committed(row.xmax): row.xmax_aborted = False row.xmax_committed = True return False return True @wolever … okay and actually that was a bit of a lie too, because checking a transaction’s state - whether it was committed or aborted - can be slow, so that’s also cached on the row. BUT! Don’t worry - those details aren’t important right now. Because now we can ﬁnally answer the question we came here to ask…

COUNT(*) is slow because there is no one “correct” COUNT(*)
A table’s row count depends on the current transaction @wolever Why is COUNT(*) so dang slow? And there can be lots of transactions Why is COUNT(*) so slow? There is no one "correct" count The number of rows in a table depends on the current transaction And since there can be any number of transactions at a time, there’s no sensible way to cache or otherwise optimize the counting.

… do we still have time? @wolever If so, there
are a couple obvious questions I want to touch on…

How are deleted rows removed from disk? VACUUM @wolever Does
that mean the database will grow forever? How are deleted rows removed from disk? Does that mean that, even if there’s only one table with one row being updated, the database’s size on disk will grow unboundedly? That’s what VACUUM is for. Vacuum performs "routine maintenance" on the database, including identifying rows that will always be invisible to every future transaction and deletes them. Note, though, that for Reasons, the space isn’t released back to the operating system; it will just be re-used for new rows in the same table.

Transaction IDs are integers… @wolever What happens when they overflow?
You may also have noticed that transaction IDs are integers… and integers have a nasty tendency to overflow. Another piece of the routine maintenance performed by vacuum is updating minimum/maximum xids to make sure they don’t overflow. But if that doesn’t happen…

WARNING: database "pycon" must be vacuumed within 177009986 transactions HINT:
To avoid a database shutdown, execute a database-wide VACUUM in "pycon". @wolever … within two million transactions of an overﬂow, Postgres will start issuing these warnings

ERROR: database is not accepting commands to avoid wraparound data
loss in database "pycon" HINT: Stop the postmaster and use a standalone backend to VACUUM in "pycon". @wolever … and when there are less than one million transactions left before an overﬂow, the database will refuse to do any writes.

Our friends at Sentry have a good writeup @wolever https://blog.sentry.io/2015/07/23/transaction-id-wraparound-in-postgres.html 

Each row only has one xmax… @wolever What if two
transactions write to the same row? The second always blocks

David Wolever @wolever Work with me: [email protected]  https://akindi.com/pages/jobs Also, I’m
hiring! My company, Akindi, builds a product that lets teachers print Scantron-style bubble sheets from any printer, scan them from any scanner, and get all the results online. We are - and this isn’t bragging, it’s an objective fact - the best on the market at what we do… and that’s mostly because we’re the only product in our market built after the year 2000. So if you’d enjoy working on software that makes teacher’s lives less terrible, come say hi! I’d love to chat with you!

References: - Postgres Internals Presentations:  http://momjian.us/main/presentations/internals.html - Especially: http://momjian.us/main/writings/pgsql/mvcc.pdf -
Introduction to Postgres' Physical Storage:  http://rachbelaid.com/introduction-to-postgres-physical-storage/ - Transaction ID wraparound:  https://blog.sentry.io/2015/07/23/transaction-id-wraparound-in-postgres.html - Use the Index Luke:  http://use-the-index-luke.com/ - Postgres documentation on Vacuuming:  https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html @wolever

PyCon Canada 2017 - Postgres, MVCC, and You

PyCon Canada 2017 - Postgres, MVCC, and You

More Decks by David Wolever

Other Decks in Technology

Featured

Transcript