Slide 1

Slide 1 text

Thinking in SQL M I C H A E L S W I E T O N S W I E T O N @ AT O M I C O B J E C T. C O M turn off wifi! Hello - I’m Michael Swieton. I’m a software developer at Atomic Object. We are a software design and development consultancy, and we built software for all sorts of platforms - embedded, mobile, web, server, and desktop. Today I am here to talk a bit about SQL databases because they end up being the datastore for so many of the apps that get written.

Slide 2

Slide 2 text

“… you want to build a model in your head that allows you to develop a deep understanding of the language; if you encounter a puzzle, you’ll feed it to your model and deduce the answer.” B R U C E E C K E L - T H I N K I N G I N J AVA The title of this talk is inspired by Bruce Eckel’s “Thinking in …” series of books. I hope that what I cover today will help you to improve your mental model of how SQL works. We tend to look at our problems through the lens of our tools. Often this works well - but with ORMs like ActiveRecord or Hibernate, it can be constraining.

Slide 3

Slide 3 text

How to think about joins How to think about indexes COMING ATTRACTIONS We'll cover SQL joins and some related bits - that's about 70% Then I have a largely disconnected section about indexes Everything is going to be fairly vendor-agnostic. You shouldn't have to be a SQL expert, but I won't be covering syntax at all here (despite the conference name, ha ha) I’ve made sure this talk will fit in this time slot (45 minutes) with room for questions. Feel free to ask questions I know we should have some really solid experts in the room - please feel free to heckle me and tell me all the things I'm doing wrong Finally, a disclaimer before we begin: databases are complicated, and everything I say should be mostly correct - but nothing I say will be completely 100% correct all the time.

Slide 4

Slide 4 text

J O I N

Slide 5

Slide 5 text

a INNER JOIN b a LEFT OUTER JOIN b a RIGHT OUTER JOIN b BACK TO SCHOOL A quick review: * “Give me all the pairs” * “Give me all the items on the left, and any pairings if we find a match” * “Give me all the items on the right, and any pairings if we find a match” But despite being able to regurgitate these simplified definitions, often we still think about it more like this NEXT

Slide 6

Slide 6 text

http://guides.rubyonrails.org/association_basics.html * We think about joins like "let's link record A to record B" ** Joins are just how a “has many”, “belongs to” relationship works But… these are not relational relationships. These are OO. This isn’t wrong per se - but you miss out on a lot of power if this is your only worldview of your data

Slide 7

Slide 7 text

For each A, find an associated B I N S T E A D O F: Instead of this, let’s turn it around. Instead, turn it around NEXT

Slide 8

Slide 8 text

Link every A to every B, then filter the results. T U R N I T A R O U N D : * Instead think about it as a 2 step operation: # Let's link every A to every B (that's the cartesian product) # And then let's exclude pairings that don't work * This is kind of an inversion of how we normally think - but it’s closer to the relational model Note the way we changed the language here: We stopped being imperative and talking about single records and instead are focusing on sets.

Slide 9

Slide 9 text

SELECT people.* FROM people LEFT JOIN party_invitations ON people.id = party_invitations.person_id AND party_invitations.party_id = “SyntaxCon 2016” WHERE party_invitation.id IS NULL Here’s a warm up exercise for us. I’ve got a more in-depth example, but I want to start with this. I think this query shows the problem with the idea of JOINs as “for every A, get a B”. This query finds people that have not been invited to the party. Poor them! What’s so interesting about this is that we aren’t looking for pairings: we’re looking for where we haven’t found one. Thinking about joins as just how we do “has many” makes it easy to miss a query like this, since this is the negative.

Slide 10

Slide 10 text

EXAMPLE PROBLEM What is each person’s most popular blog post? Ok, here’s problem that we’re going to dig into in depth. NEXT At Atomic, everyone writes for our blog, so this is a question we might want to ask. This form of question is very common question actually: What’s the most recent log message for each device? What’s the most expensive car in each class? Whatever. It’s easy to answer for a single person

Slide 11

Slide 11 text

SELECT * FROM blog_posts WHERE author_id=5 ORDER BY popularity DESC LIMIT 1 What is each person’s most popular blog post? It’s easy to get the answer for a single record, and we can use LINQ or Ruby’s enumerable to run the query many times - but this is inefficient.

Slide 12

Slide 12 text

What is each person’s most popular blog post? Get every post that has none more popular by author. There are other ways to get it. I’ll show you what those look like later. But right now I’m going to do it the hard way to demonstrate how I think about queries. NEXT First, I’m going to reword it a little. This is the same transformation I talked about before: • We aren’t imperatively talking about every record • Instead we have a rule that can apply declaratively to all records • And only the ones we want will fall out

Slide 13

Slide 13 text

Get every post that has none more popular by author. This looks like a join Here’s the key thing: NEXT This is a join. Rows in table blog_posts are related to other blog_posts - even though there isn’t a foreign key relationship. Ok, let’s write a query for this. NEXT

Slide 14

Slide 14 text

Get every post that has none more popular by author. SELECT * FROM blogposts p1

Slide 15

Slide 15 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views Get every post that has none more popular by author. Joins don’t need to be a simple equality: it is any boolean expression.

Slide 16

Slide 16 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id Get every post that has none more popular by author.

Slide 17

Slide 17 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL Get every post that has none more popular by author.

Slide 18

Slide 18 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL Ok, so I hope that’s useful to see how I built that, and it’s a nice example, but it feels really opaque still.

Slide 19

Slide 19 text

Why is this so hard? A D I G R E S S I O N : Why is this so nonobvious?

Slide 20

Slide 20 text

Joins are not fundamental operations. Outer joins are very not fundamental operations. A H Y P O T H E S I S : In arithmetic, addition and subtraction are fairly fundamental. Multiplication is essentially repeated addition, but it’s still an “atom” - an effective single building block. SQL joins are not.

Slide 21

Slide 21 text

A INNER JOIN B ON A.id = B.fk σ (A⋈B) A D E F I N I T I O N : A.id=B.fk An inner join is essentially 2 operations: • the cartesian product • then filter Let’s get that definition into a more familiar syntax NEXT

Slide 22

Slide 22 text

A INNER JOIN B ON A.id = B.fk cross(A, B).filter( function(a, b) { return a.id == b.id; } ); A D E F I N I T I O N : I made up this syntax. It’s not real - I just have it here to demonstrate that there’s really two operations happening.

Slide 23

Slide 23 text

A LEFT OUTER JOIN B ON A.id = B.fk … wailing and gnashing of teeth … A D E F I N I T I O N : Now, outer joins… These are messy.

Slide 24

Slide 24 text

This is the definition on wikipedia on the “Relational Algebra” page. NEXT A left outer join involves 6 separate operations. All this work is to let you get the rows on the left (or right for a right outer join) even if there’s no match on the right (or left).

Slide 25

Slide 25 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL Back to our query. There’s still a lot going on here. Here’s the question: What happens first? Let’s take another short digression. NEXT

Slide 26

Slide 26 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL With *THIS* notation it is not obvious what the order of operations is. So I made one up. This isn’t official or rigorous - it’s just me writing out how I think about things and I welcome feedback.

Slide 27

Slide 27 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL

Slide 28

Slide 28 text

1. JOIN (i.e. the cartesian product) SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL Your FROM clause happens first, but within the FROM, your cartesian products happen first. Note that I’ve highlighted the JOIN but NOT the condition - it (conceptually) creates the cartesian product before it filters

Slide 29

Slide 29 text

1. JOIN (i.e. the cartesian product) 2. Filter rows by JOIN & OUTER JOIN SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL

Slide 30

Slide 30 text

1. JOIN (i.e. the cartesian product) 2. Filter rows by JOIN & OUTER JOIN 3. Filter rows by WHERE SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL

Slide 31

Slide 31 text

1. JOIN (i.e. the cartesian product) 2. Filter rows by JOIN & OUTER JOIN 3. Filter rows by WHERE 4. GROUP and aggregates 5. Filter rows by HAVING 6. Filter columns by SELECT SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL Your GROUP, HAVING, and SELECT clauses all work more straightforward. The only really wacky one is the join, for reasons we discussed earlier. Ok, now let’s see how this actually works in practice for this query.

Slide 32

Slide 32 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 NULL NULL NULL NULL 1 1 BAR 30 2 2 QUX 20 So here’s that query, NEXT and here’s some data. Intentionally small dataset just for space on the slide. Let’s execute it according to my made-up SQL order of operations.

Slide 33

Slide 33 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 1 1 BAR 30 2 2 QUX 20 These are the droids we’re looking for.

Slide 34

Slide 34 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 NULL NULL NULL NULL 1 1 BAR 30 2 2 QUX 20 1 . J O I N - C A R T E S I A N P R O D U C T

Slide 35

Slide 35 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 NULL NULL NULL NULL 0 1 FOO 15 0 1 FOO 15 0 1 FOO 15 1 1 BAR 30 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 NULL NULL NULL NULL 1 1 BAR 30 0 1 FOO 15 1 1 BAR 30 1 1 BAR 30 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 NULL NULL NULL NULL 2 2 QUX 20 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 1 . J O I N - C A R T E S I A N P R O D U C T

Slide 36

Slide 36 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 NULL NULL NULL NULL 0 1 FOO 15 0 1 FOO 15 0 1 FOO 15 1 1 BAR 30 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 NULL NULL NULL NULL 1 1 BAR 30 0 1 FOO 15 1 1 BAR 30 1 1 BAR 30 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 NULL NULL NULL NULL 2 2 QUX 20 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 2 . F I LT E R B Y J O I N

Slide 37

Slide 37 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 NULL NULL NULL NULL 0 1 FOO 15 0 1 FOO 15 0 1 FOO 15 1 1 BAR 30 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 NULL NULL NULL NULL 1 1 BAR 30 0 1 FOO 15 1 1 BAR 30 1 1 BAR 30 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 NULL NULL NULL NULL 2 2 QUX 20 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 2 . F I LT E R B Y J O I N It’s actually fairly likely that your database will start its filtering here too - an index on employee is going to make this match restrict the size of the problem very effectively.

Slide 38

Slide 38 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 NULL NULL NULL NULL 0 1 FOO 15 0 1 FOO 15 0 1 FOO 15 1 1 BAR 30 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 NULL NULL NULL NULL 1 1 BAR 30 0 1 FOO 15 1 1 BAR 30 1 1 BAR 30 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 NULL NULL NULL NULL 2 2 QUX 20 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 2 . F I LT E R B Y J O I N

Slide 39

Slide 39 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 NULL NULL NULL NULL 0 1 FOO 15 0 1 FOO 15 0 1 FOO 15 1 1 BAR 30 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 NULL NULL NULL NULL 1 1 BAR 30 0 1 FOO 15 1 1 BAR 30 1 1 BAR 30 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 NULL NULL NULL NULL 2 2 QUX 20 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 2 . 5 . F I N I S H T H E J O I N As we discussed earlier, outer joins are complicated. An OUTER JOIN will give you NULLs where there are no matches, but since we had matches here, we don’t need this tuple.

Slide 40

Slide 40 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 NULL NULL NULL NULL 0 1 FOO 15 0 1 FOO 15 0 1 FOO 15 1 1 BAR 30 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 NULL NULL NULL NULL 1 1 BAR 30 0 1 FOO 15 1 1 BAR 30 1 1 BAR 30 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 NULL NULL NULL NULL 2 2 QUX 20 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 3 . F I LT E R B Y W H E R E

Slide 41

Slide 41 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 NULL NULL NULL NULL 0 1 FOO 15 0 1 FOO 15 0 1 FOO 15 1 1 BAR 30 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 NULL NULL NULL NULL 1 1 BAR 30 0 1 FOO 15 1 1 BAR 30 1 1 BAR 30 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 NULL NULL NULL NULL 2 2 QUX 20 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 G R O U P, H AV I N G , A G G R E G AT E S The remaining steps are for queries with GROUP BY, HAVING, or aggregate functions, so we don’t change anything for those.

Slide 42

Slide 42 text

SELECT * FROM blogposts p1 LEFT OUTER JOIN blogposts p2 ON p1.views < p2.views AND p1.employee_id=p2.employee_id WHERE p2.id IS NULL P1.ID P1.EMP_ID P1.TITLE P1.VIEWS P2.ID P2.EMP_ID P2.TITLE P2.VIEWS 0 1 FOO 15 NULL NULL NULL NULL 0 1 FOO 15 0 1 FOO 15 0 1 FOO 15 1 1 BAR 30 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 NULL NULL NULL NULL 1 1 BAR 30 0 1 FOO 15 1 1 BAR 30 1 1 BAR 30 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 NULL NULL NULL NULL 2 2 QUX 20 0 1 FOO 15 2 2 QUX 20 1 1 BAR 30 2 2 QUX 20 2 2 QUX 20 And now we’re finished with the top records per employee. Let’s pause for a second there. What do you think of that? Do you believe me that this works?

Slide 43

Slide 43 text

SELECT * FROM ( SELECT *, RANK(views) OVER (PARTITION BY employee_id) AS post_rank FROM blogposts ) WHERE post_rank = 1; Now, I just showed you the hard way to do it. Here’s the easy way, and honestly a lot more powerful. You probably should be using this solution when you have this problem. I’m not going to dig into this much - go read your database server’s manual. And here’s another NEXT

Slide 44

Slide 44 text

SELECT DISTINCT ON (employee_id) * FROM blogposts ORDER BY employee_id, views DESC; And here’s an even simpler way of doing it. If you understand the hard way of doing it, you’re going to have the mental model you’ll need to understand a lot of other and more complicated queries, so that’s why I wanted to cover it. But this is a good lesson too: You should check out your tools and see what they’ll allow. They often support much more elegant solutions than we may have been stuck with twenty years ago.

Slide 45

Slide 45 text

SELECT people.* FROM people LEFT JOIN party_invitations ON people.id = party_invitations.person_id AND party_invitations.party_id = “SyntaxCon 2016” WHERE party_invitation.id IS NULL 1. Make the cartesian product 2. Filter to desired pairings and NULLs 3. Then select rows from the result Ok let’s go back to the beginning for a moment. NEXT This is the same query we looked at first. Now, having walked through that execution, it should be clear how this one works. Let’s mark it up a bit. NEXT NEXT NEXT That’s all I wanted to say about joins before we move on.

Slide 46

Slide 46 text

I N D E X E S We use indexes to make things faster. But if you’re going to be even remotely effective you need to know a little bit more about how they work. Let’s dig in.

Slide 47

Slide 47 text

A redundant copy Of some columns Of all rows In sorted order That is automatically maintained as data is updated AN INDEX IS: Here’s my definition of an index: Ok, now let’s see how it works.

Slide 48

Slide 48 text

L. NAME F. NAME GENDER DOB NALL KEVIN MALE 1983-03-25 EMERSON ALYSIA FEMALE 1979-03-23 PINO ADRIANE FEMALE 2002-07-24 GAGNON GRACIA FEMALE 2015-09-15 EVERS CHARLES MALE 1995-04-15 NOBLES MAIA FEMALE 1992-06-23 AIKEN LORINDA FEMALE 1964-05-05 MARINO MARVIN MALE 1980-08-12 SELECT * FROM people WHERE last_name=‘EVERS’ L. NAME AIKEN EMERSON EVERS GAGNON MARINO NALL NOBLES PINO Here’s some data. Suppose we want to query it. NEXT Now this isn’t that hard of a question for us to answer, but it’s still harder than it needs to be, and without some ordering to the records it quickly becomes impossible to answer manually as our dataset gets bigger. Let’s add an index on last name. NEXT Things to note: • The index is in order: this allows us to search according to its keys • And it references the original source data. An indirection must happen when we want rows not in the index! • In practice this will be a B-tree or equivalent search-oriented structure - displayed here as a list for ease • It’s a redundant copy of the data - this must be kept up to date as things change! So if we want to answer this query, we can very efficiently search the index.

Slide 49

Slide 49 text

L. NAME F. NAME GENDER DOB NALL KEVIN MALE 1983-03-25 EMERSON ALYSIA FEMALE 1979-03-23 PINO ADRIANE FEMALE 2002-07-24 GAGNON GRACIA FEMALE 2015-09-15 EVERS CHARLES MALE 1995-04-15 NOBLES MAIA FEMALE 1992-06-23 AIKEN LORINDA FEMALE 1964-05-05 MARINO MARVIN MALE 1980-08-12 SELECT * FROM people WHERE last_name=‘EVERS’ L. NAME AIKEN EMERSON EVERS GAGNON MARINO NALL NOBLES PINO So if we want to answer this query, we can very efficiently search the index. It’ll find the record in the index, and then follow a pointer back to the source table for the remaining columns.

Slide 50

Slide 50 text

L. NAME F. NAME GENDER DOB NALL KEVIN MALE 1983-03-25 EMERSON ALYSIA FEMALE 1979-03-23 PINO ADRIANE FEMALE 2002-07-24 GAGNON GRACIA FEMALE 2015-09-15 EVERS CHARLES MALE 1995-04-15 NOBLES MAIA FEMALE 1992-06-23 AIKEN LORINDA FEMALE 1964-05-05 MARINO MARVIN MALE 1980-08-12 SELECT * FROM people WHERE last_name LIKE ‘E%’ AND gender=‘Female’ L. NAME AIKEN EMERSON EVERS GAGNON MARINO NALL NOBLES PINO Ok, let’s try this again with a new query. NEXT Now the index will quickly get us a smaller set of records to match: NEXT

Slide 51

Slide 51 text

L. NAME F. NAME GENDER DOB NALL KEVIN MALE 1983-03-25 EMERSON ALYSIA FEMALE 1979-03-23 PINO ADRIANE FEMALE 2002-07-24 GAGNON GRACIA FEMALE 2015-09-15 EVERS CHARLES MALE 1995-04-15 NOBLES MAIA FEMALE 1992-06-23 AIKEN LORINDA FEMALE 1964-05-05 MARINO MARVIN MALE 1980-08-12 L. NAME AIKEN EMERSON EVERS GAGNON MARINO NALL NOBLES PINO SELECT * FROM people WHERE last_name LIKE ‘E%’ AND gender=‘Female’ But the index does not get us all the way: we need to go back to the source data before we can filter it to the one matching row.

Slide 52

Slide 52 text

L. NAME F. NAME GENDER DOB NALL KEVIN MALE 1983-03-25 EMERSON ALYSIA FEMALE 1979-03-23 PINO ADRIANE FEMALE 2002-07-24 GAGNON GRACIA FEMALE 2015-09-15 EVERS CHARLES MALE 1995-04-15 NOBLES MAIA FEMALE 1992-06-23 AIKEN LORINDA FEMALE 1964-05-05 MARINO MARVIN MALE 1980-08-12 L. NAME AIKEN EMERSON EVERS GAGNON MARINO NALL NOBLES PINO SELECT * FROM people WHERE last_name LIKE ‘E%’ AND gender=‘Female’ But the index does not get us all the way: we need to go back to the source data before we can filter it to the one matching row.

Slide 53

Slide 53 text

L. NAME F. NAME GENDER DOB NALL KEVIN MALE 1983-03-25 EMERSON ALYSIA FEMALE 1979-03-23 PINO ADRIANE FEMALE 2002-07-24 GAGNON GRACIA FEMALE 2015-09-15 EVERS CHARLES MALE 1995-04-15 NOBLES MAIA FEMALE 1992-06-23 AIKEN LORINDA FEMALE 1964-05-05 MARINO MARVIN MALE 1980-08-12 SELECT * FROM people WHERE first_name=‘Kevin’ L. NAME AIKEN EMERSON EVERS GAGNON MARINO NALL NOBLES PINO At this point you should be getting a clue to the main indexing principle: the database can only take advantage of the indexed columns in the order you give it. The index is sorted by last name, so it obviously does not help us answer this question NEXT However, less obviously, even this index doesn’t help any NEXT

Slide 54

Slide 54 text

L. NAME F. NAME GENDER DOB NALL KEVIN MALE 1983-03-25 EMERSON ALYSIA FEMALE 1979-03-23 PINO ADRIANE FEMALE 2002-07-24 GAGNON GRACIA FEMALE 2015-09-15 EVERS CHARLES MALE 1995-04-15 NOBLES MAIA FEMALE 1992-06-23 AIKEN LORINDA FEMALE 1964-05-05 MARINO MARVIN MALE 1980-08-12 SELECT * FROM people WHERE first_name=‘Kevin’ L. NAME F. NAME AIKEN LORINDA EMERSON ALYSIA EVERS CHARLES GAGNON GRACIA MARINO MARVIN NALL KEVIN NOBLES MAIA PINO ADRIANE The index is sorted by the columns you put in it in that order. Merely having the column in an index does not help you, if you are not also matching on the first column of the index! More examples in a moment. For example, NEXT

Slide 55

Slide 55 text

CoreyCurtin KevinNall AlysiaEmerson AdrianePino GraciaGagnon CharlesEvers MaiaNobles LorindaAiken AN INDEX ON: first_name, last_name WHERE first_name=‘Gracia’ AND last_name=‘Gagnon’ HELPS A LOT WITH: Suppose we had an index on first_name, last_name. Here’s a binary tree representation of the data in the index. Note that I’ve run the keys together - this is how I think about indexes. Since you can’t query by the second or third column or nth column of a compound index, I think this is a good way to think about it. Now your database doesn’t use a naive binary tree for indexes. But I do think it’s a good way to visualize it. Please forgive my crappy binary tree here, but I hope it gets the idea across. Looking at this, it should be obvious that it will help with a value like this. NEXT

Slide 56

Slide 56 text

CoreyCurtin KevinNall AlysiaEmerson AdrianePino GraciaGagnon CharlesEvers MaiaNobles LorindaAiken AN INDEX ON: first_name, last_name GraciaGagnon HELPS A LOT WITH: It’s essentially just walking the tree looking for matches on this value.

Slide 57

Slide 57 text

CoreyCurtin KevinNall AlysiaEmerson AdrianePino GraciaGagnon CharlesEvers MaiaNobles LorindaAiken AN INDEX ON: first_name, last_name GraciaGagnon HELPS A LOT WITH: It’s essentially just walking the tree looking for matches on this value.

Slide 58

Slide 58 text

CoreyCurtin KevinNall AlysiaEmerson AdrianePino GraciaGagnon CharlesEvers MaiaNobles LorindaAiken AN INDEX ON: first_name, last_name WHERE last_name=‘Gagnon’ DOES NOTHING FOR: Now at the same time, it’s clear that this is just not something

Slide 59

Slide 59 text

AN INDEX ON: first_name, last_name HELPS A LOT WITH: WHERE first_name = ‘Bob’ WHERE first_name = ‘Bob’ AND last_name = ‘Jones’ WHERE first_name = ‘Bob’ AND last_name LIKE ‘J%’ WHERE first_name = ‘B%’ WHERE first_name = ‘B%’ AND last_name LIKE ‘J%’ DOES NOTHING FOR: WHERE last_name LIKE ‘J%’ WHERE first_name LIKE ‘%ob’ Let me summarize: Here is a different example index. In this case, we do first and last name.

Slide 60

Slide 60 text

Any column Other stuff… sometimes WHAT CAN BE INDEXED? Most of the time, the index contains the values in a column. However, you can sometimes index expressions: Function results for example. In particular, Postgres does include mechanisms for indexing json columns effectively. That’s not going to be a magical way to speed up access to anything, but if there’s a specific key in the json column that is a condition in queries, you can index a specific path or expression in the document.

Slide 61

Slide 61 text

WHY ISN’T IT USING MY INDEX? Your database will estimate costs and pick the cheapest strategy. Now, let’s assume that it’s possible for it to be using the index. E.g. you have a possible prefix match. NEXT The database has a cost estimator that uses a number of techniques to guess what the best strategy is. It’s not using your index because it disagrees with you. Here’s an example

Slide 62

Slide 62 text

SELECT * FROM people WHERE last_name=‘EVERS’ Not using an index: 1. Scan the table 2. Filter on last_name=‘EVERS’ (TABLE ROW READ COST) * (NUMBER OF ROWS) Let’s look at this query again. We could do it just by reading rows in the table NEXT And it guesses the cost like this. NEXT And if we fill in what we know (and what the database knows), it looks like this NEXT

Slide 63

Slide 63 text

SELECT * FROM people WHERE last_name=‘EVERS’ Not using an index: 1. Scan the table 2. Filter on last_name=‘EVERS’ (TABLE ROW READ COST) * 8

Slide 64

Slide 64 text

SELECT * FROM people WHERE last_name=‘EVERS’ Using an index on last_name: 1. Index range scan on last_name=‘EVERS’ 2. Read remaining columns from source table (INDEX READ COST) * (ESTIMATED NUMBER OF ROWS) + (TABLE ROW READ COST) * (ESTIMATED NUMBER OF ROWS) Now, is this cheaper than a full scan or not? NEXT Well… maybe NEXT

Slide 65

Slide 65 text

SELECT * FROM people WHERE last_name=‘EVERS’ Using an index on last_name: 1. Index range scan on last_name=‘EVERS’ 2. Read remaining columns from source table (INDEX READ COST) * 1+ (TABLE ROW READ COST) * 1 Your database keeps statistics about the data. This is summary data - designed to be small so it can look at it quickly to make decisions, and it’s cached and not 100% up to date all the time. Based on these statistics, it can estimate that an equality comparison on last_name will result in a single match and it’s going to do this instead of the full scan. Here’s how it knows that

Slide 66

Slide 66 text

Postgres has some good “system tables” that you can look at. I copied this query (and the one from earlier) from their documentation. We’re querying on DOB and gender, so let’s look at those: • First, look at the n_distinct column: • The “-1” means that every value is unique • If it were a positive integer, it’s the number of unique values • And the -0.22 is a proportion: It means that for gender, there’s one unique value for every 5 records. Our test data had 3 males and 5 females, so that seems close. • It also knows some of the most common values for each column. Given this, your database can make a guess at the rows we will get back.

Slide 67

Slide 67 text

SELECT * FROM people WHERE dob < ‘1990-01-01’ Not using an index: 1. Scan the table 2. Filter on dob < ‘1990-01-01’ (TABLE ROW READ COST) * (NUMBER OF ROWS) Here’s a different condition. Now this one is interesting. Again, we could do it just by reading rows in the table and it will have comparable costs.

Slide 68

Slide 68 text

SELECT * FROM people WHERE dob < ‘1990-01-01’ Not using an index: 1. Scan the table 2. Filter on dob < ‘1990-01-01’ (TABLE ROW READ COST) * 8 Here’s a different condition. Now this one is interesting. Again, we could do it just by reading rows in the table and it will have comparable costs.

Slide 69

Slide 69 text

SELECT * FROM people WHERE dob < ‘1990-01-01’ Using an index on dob 1. Index range scan on dob < ‘1990-01-01’ 2. Read remaining columns from source table (INDEX READ COST) * (ESTIMATED NUMBER OF ROWS) + (TABLE ROW READ COST) * (ESTIMATED NUMBER OF ROWS) And here’s the index version. Everything here looks the same, but let’s fill in values.

Slide 70

Slide 70 text

SELECT * FROM people WHERE dob < ‘1990-01-01’ Using an index on dob 1. Index range scan on dob < ‘1990-01-01’ 2. Read remaining columns from source table (INDEX READ COST) * 8 + (TABLE ROW READ COST) * 8 And here’s the index version. Everything here looks the same, but let’s fill in values. In this case, it doesn’t know how many rows will match so it assumes all of them will. In this case it’s going to probably prefer a full table scan or try using a different index.

Slide 71

Slide 71 text

SELECT * FROM people WHERE last_name = ‘Evers’ AND gender=‘Male’ Cost for using either index: (INDEX READ COST) * (ESTIMATED NUMBER OF ROWS) + (TABLE ROW READ COST) * (ESTIMATED NUMBER OF ROWS) One last example. Suppose in this case we have two indexes: one on last_name, and one on gender. In this case, the formula for cost is the same. But let’s look more closely NEXT

Slide 72

Slide 72 text

SELECT * FROM people WHERE last_name = ‘Evers’ AND gender=‘Male’ Cost for using last_name: (INDEX READ COST) * 1 + (TABLE ROW READ COST) * 1 So no surprise, if we use last_name we estimate 1 row and it’s fairly cheap.

Slide 73

Slide 73 text

SELECT * FROM people WHERE last_name = ‘Evers’ AND gender=‘Male’ Cost for using gender: (INDEX READ COST) * (8 * 0.22) + (TABLE ROW READ COST) * (8 * 0.22) But here’s what it looks like if we use gender Note that extra factor in there: For gender, we know we don’t have a lot of unique values. So it scales it according to it’s statistics about how much that value narrows the field.

Slide 74

Slide 74 text

Prefix match Prefer indexes with higher cardinality INDEXING PRINCIPLES Reviewing, we earlier covered the first principle of indexes: you can use an index if its first key is in your query. And now we can add a second principle: indexes on more varied columns are better. An index on a value like gender usually only reduce your result set by 50%. But a value like date of birth might get you down to a single row.

Slide 75

Slide 75 text

http://use-the-index-luke.com http://modern-sql.com/ FURTHER READING These are two web sites by Markus Winand. They have some really fantastic content for understanding indexes and new SQL features. There’s a lot out there and your database probably has some tools you’re not using that would let you better understand your queries and perhaps write better ones, so you should understand that. I hope that what I’ve covered today is gives you some helpful heuristics to start with. Thanks!

Slide 76

Slide 76 text

M I C H A E L S W I E T O N S W I E T O N @ AT O M I C O B J E C T. C O M Hello - I’m Michael Swieton. I’m a software developer at Atomic Object. We are a software design and development consultancy, and we built software for all sorts of platforms - embedded, mobile, web, server, and desktop. Today I am here to talk a bit about SQL databases because they end up being the datastore for so many of the apps that get written.