Voilá! Indexes -- A Look at Some Simple Preventative Magick

This is Basil Smockwhitener, wizard and gentleman. He’s currently trekking
across golden sands, dunes rippling like waves across the landscape, with his intrepid manservant Fabian trudging along manfully behind.

Basil & Fabian are visiting Basil’s old school-friend Ptolemy, taking
in the sights, getting their portraits painted at the pyramids —generally living it up as tourists, though Fabian probably could do without carrying all their luggage.

Somehow, Basil ﬁnds an old brass oil lamp. “Tarnished as
anything,” Basil would later say of it. “Bit of buﬃng brought out a nice shine.”

It turns out that a “bit of buﬃng” also brings
out a nice genie. A cliche, perhaps, but no one turns down a cliche oﬀering three wishes. Now, Basil is faced with a decision. What does a wizard wish for, who already has just about everything he wants?

After a bit of thought, Basil remembers a passion of
his from childhood: librarians. He always wanted to be a librarian, though his father tried to discourage him. Basil had all of the librarian trading cards, even the rare ones, and his Eratosthenes card was even signed!

But libraries were few and far between in those days,
and the only way to become a librarian was to wait for one to die. Or oﬀ them yourself—librarians are a particularly cut-throat bunch. But Basil didn’t have enough patience (or bloodlust) for that, and so traded his dream (and collection of trading cards) for a career in magic, instead. Now, though, it all comes rushing back. With the genie standing before him, he suddenly knows exactly what to wish for: A library!

Not just any library, though. Basil wants one to rival
any other in the world. To make sure he has more content than anyone else, he stipulates that the library must receive one new scroll every minute while the sun shines. Basil is in a hurry. We’ve all been there, right? The product is practically right there, all you need to do is finish it. So he takes some shortcuts. He builds out his schema, but leaves off a few “minor details” that he’s sure can be fixed later.

create_table :languages do |t| t.string :name end create_table :nations do
|t| t.string :name t.belongs_to :language end create_table :people do |t| t.string :first_name t.string :last_name t.date :born_on t.date :died_on t.belongs_to :nation end create_table :scrolls do |t| t.string :title t.integer :pages t.integer :word_count t.belongs_to :material t.belongs_to :language t.string :color t.string :ink_color t.integer :author_id t.datetime :published_at t.datetime :added_at end create_table :classifications do |t| t.integer :subject_id t.integer :scroll_id end create_table :subjects do |t| t.string :name end create_table :materials do |t| t.string :name end Library Schema v1.0 “What’s the Worst that Could Happen” His schema looks something like this. Can you spot what’s missing? How many of us have ever created a database table without immediately adding indexes to query it? How many of us have lived to regret it? We’re not proud of it, but it happens.

Well, Basil is about to learn exactly why this is
a bad idea, just as we all have (or soon will, I’m sure). With one new scroll coming in every minute, within an hour his shelves hold sixty of them. By lunch there are more than two hundred. By night-fall, his collection numbers over seven hundred scrolls. Big deal, right? Libraries are supposed to have scrolls.

But without an index, the only way to find a
scroll is to scan the entire set. To look at each scroll, one at a time, until you find the one you’re looking for. Not. Fun. Not only that, but only Basil and Fabian can answer queries. Patrons come in, ask for a scroll, and one of the two go and find it. But as the collection grows, the response time grows, too.

Even with both fetching scrolls simultaneously, the job is just
too big. Many of you can probably feel his pain. Those queries that are plenty fast with a handful of rows, become molasses when the rows start to number in the tens and hundreds of thousands. Why? Because your poor database has to do like Basil and Fabian, and look at each row in sequence, scanning the whole table to ﬁnd the rows that match the query. Eventually, things grind to a halt and your app goes down in a blaze of panic and ﬂame.

So, too, Basil’s library. It isn’t long before the line
snakes out the door, down the road, and over several large sand dunes, and by that time the patrons have had enough.

They torch the library. Our heroes barely escape. Still, Basil
isn’t one to let a little setback keep him down, especially when a genie still owes him two more wishes. Singed and ﬁre-shy, he considers what went wrong, realizes he needed some indexes, and tries again.

add_index :languages, :name add_index :nations, :name add_index :nations, :language_id add_index
:people, :first_name add_index :people, :last_name add_index :people, :born_on add_index :people, :died_on add_index :people, :nation_id add_index :scrolls, :title add_index :scrolls, :pages add_index :scrolls, :word_count add_index :scrolls, :material_id add_index :scrolls, :language_id add_index :scrolls, :color add_index :scrolls, :ink_color add_index :scrolls, :author_id add_index :scrolls, :published_at add_index :scrolls, :added_at add_index :classifications, :scroll_id add_index :classifications, :subject_id add_index :materials, :name Library Schema v2.0 “Index All The Things” Library 2.0! His updated schema looked like this. He isn’t taking any chances! Every bit of data he could lay hands on gets an index. Title and author name, sure, but also subject matter, word count, scroll length, material used, author gender and nationality, and much, much more. If not having indexes was bad, surely having all of them would be better. Right? Well, “better” is a relative term.

How many of you remember using a card catalog? A
remnant of different times… They were basically these small drawers that contained thousands and thousands of little cards. To find a book by author, you looked in the author catalog, which had all the cards sorted by author name. If you wanted to find a book by subject, you looked in the subject catalog, which again had all books sorted by subject. Each card, then, told you where to find the book (or, in Basil’s case, scroll) in question. This is, essentially, just what indexes are: mere sorted lists of information, where each entry points to the location of a record.

For instance, Basil’s title index might look something like this.
Indexes aren’t free, though. They require work to maintain. Every addition, every change, every deletion has to go in and update the corresponding indexes. (Can you see where this is going yet?) At first, Basil’s new library works wonderfully. Scrolls come in, he and Fabian fill out and file dozens of little cards, and the queries are now smooth, painless, and fast.

Only, the number of people asking for books varies at
diﬀerent times of day. Folks come in during lunch, or after work, inundating the library with requests for scrolls. The queries are fast, but even with both Basil and Fabian working together there isn’t time to process the new scrolls! Remember, a new one shows up every minute, regardless of how many people are in line. If a deluge of visitors takes half an hour to drain, that’s a pile of thirty scrolls that needs to be indexed…and indexing is no simple matter, now.

Dozens of indexes, with cards to be filled out and
properly filed for each one. And this doesn’t even consider the occasional damaged scroll that had to be removed from circulation! Those required Basil or Fabian to find all the entries for that scroll, in every index, and remove them.

This was all compounded with the difficulty of figuring out
which indexes to use for a query. Some are easy—want a scroll by title? Piece of cake. But what indexes would you use to find all scrolls by Greek authors on the topic of linguistics, written more than two centuries ago? In blue ink? Fabian especially grows paralyzed when trying to work his way through queries like that.

The job is still too big. Basil even summons a
small army of homunculi to assist in ﬁling the cards, but the little buggers are constantly hungry and the cost to maintain them is just too great. In the end, once again, the line of patrons begins to snake out the door, down the road, and over the next two dunes.

And once again, the library is put to the torch
as those in line vent their frustration at the lengthy wait. Well, now, you can imagine the Basil’s despair. (Fabian, not so much—he was generally resigned to his master’s escapades, though being nearly burned to death twice is a bit of a bummer.) What to do? Having no indexes is bad, and (apparently) having indexes is bad. Only…it isn’t that having the indexes was bad, is it? It’s that there were too many. The maintenance of those indexes took too much work, which prevented the two from being to apply themselves to ﬁnding scrolls. This is exactly the case with databases, too.

Every index that you add increases the work that the
database must do, both to ﬁgure out how to satisfy queries, and also to keep those indexes up-to-date. In a write-heavy system, too many indexes will slow you way down. So, what do you do? I’ll tell you what Basil does—he goes to work ﬁguring out what indexes are most important. He realizes that although dozens of indexes are impossible for the two of them to stay on top of, together they ought to be able to manage a few. If he can identify the few that are most necessary, he’ll have it in the bag. Well, but that’s the crux of the matter, isn’t it? How do you know which “few” matter the most?

SELECT * FROM scrolls WHERE title = 'The Republic'; SELECT
people.* FROM people, nations WHERE people.nation_id = nations.id AND nations.name = 'Greece'; SELECT * FROM scrolls WHERE title = 'Elements'; SELECT * FROM scrolls WHERE title = 'Autobiography of Kurigalzu'; SELECT scrolls.* FROM scrolls, people WHERE scrolls.author_id = people.id AND people.first_name = 'Aristotle'; SELECT * FROM scrolls WHERE title = 'Odyssey'; SELECT scrolls.* FROM scrolls, people WHERE scrolls.author_id = people.id AND people.first_name = 'Wenamun'; SELECT * FROM scrolls WHERE title = 'Analects'; SELECT scrolls.* FROM classifications, scrolls, subjects WHERE classifications.subject_id = subjects.id AND classifications.scroll_id = scrolls.id AND subjects.name = 'Geometry'; SELECT * FROM scrolls WHERE title = 'History of the Peloponnesian War'; SELECT * FROM scrolls WHERE title = 'Tao Te Ching'; In Basil’s case, he has a whole host of prior queries he can analyze to see which kinds of questions were asked the most, and if you have a corpus of data like that, it lets you take a brute-force approach. For example, Basil can see that, statistically, most people asked for a scroll by title, with queries by subject being second-most-common. But who’s to say this historical data is complete? What if there are columns that ought to be indexed, but which never happened to show up in queries during the period being sampled? Also, what happens when you don’t have prior data to learn from? What happens when you’re building something brand new? That ﬁrst migration you write, where you add the tables you’re going to need—how do you know which indexes to add as part of that migration? (Because, let’s face it, if you don’t add them then, when are you going to add them? Probably not until you feel the lack of them.)

The key, it turns out, is a foreign one. (Ugh,
okay, that was bad. Let’s try again.) “The important things here are foreign keys.” Better.

In database design, there’s this concept of foreign keys. If
you have a column in one table that references a key in another table, that’s a foreign key. For example, Basil’s scrolls table has an author_id ﬁeld, to indicate who authored the scroll. That’s a foreign key, because it refers to another table.

In Rails, these correspond to belongs_to and has_many associations. You
put a belongs_to on the child table (the one that contains the foreign key), and has_many on the parent table (the one that is referenced by the foreign key). So, here’s the ﬁrst tip: many times (but not always!) a foreign key should be indexed. Okay, but “not always” is not helpful. When should they be indexed?

Consider both sides of the association. Let’s say we have
Person, and Nation, and Person belongs_to Nation, and Nation has_many Persons. It’s entirely reasonable to say that you might want to know what nation a person is from — person.nation — which only needs an index on the primary key for the nations table. However, it’s going to be unlikely that asking for all people from a particular nation — nation.people — would (by itself) be helpful, since it could return thousands or even millions of rows. In that case, you may not need the index on people.nation_id.

class Scroll < ActiveRecord::Base has_many :classifications has_many :subjects, through: :classifications
end class Classification < ActiveRecord::Base belongs_to :scroll belongs_to :subject end class Subject < ActiveRecord::Base has_many :classifications has_many :scrolls, through: :classifications end Here’s another common relation: has_many :through. This is where two tables are related via an intermediate table sitting in between them, called a join table. In Basil’s case, scrolls were related to subjects via a classiﬁcations table.

That joining classiﬁcations table lets the scrolls and subjects be
connected like this, with each scroll having (potentially) multiple subjects, and each subject having (potentially) multiple scrolls.

Where are the foreign keys here? They’re not on the
scrolls table, nor on the subjects table. They all exist on that classiﬁcations table in the middle. It has both the scroll_id, and the subject_id.

class Scroll < ActiveRecord::Base has_many :classifications has_many :subjects, through: :classifications
end # this... scroll.subjects # is conceptually the same as this... scroll.classifications.map(&:subject) Do you have to index both of them? It depends. Think about how your data is being queried. Are you going to ask which subjects a scroll has? scroll.subjects. Remember it is (conceptually) a two-step process — find the classifications, and then find the subjects. Scroll.has_many :classifications, right? scroll.classifications.map(&:subject). NOTE: It’s not really implemented this way! This is just a thought experiment, to help see how these relate!

class Subject < ActiveRecord::Base has_many :classifications has_many :scrolls, through: :classifications
end # this... subject.scrolls # is conceptually the same as this... subject.classifications.map(&:scroll) What about the other way? Subject.has_many :classifications, too, or subject.classifications.map(&:scroll). Basil knows he’ll need to support both queries, so he adds indexes on both scroll_id and subject_id on the classifications table.

But wait! Fabian thinks he’s detected a ﬂaw in Basil’s
implementation. Isn’t having two indexes redundant? Can’t you just declare a single index with both ﬁelds in it?

Alas, no. Remember that indexes are sorted lists of data.
When you declare an index on multiple columns, this is what you get. It’s easy to see how you could use this to find subjects by scroll_id, because that’s sorted. But to use this index to find scrolls using only the subject_id…ugh. You’d have to do a full scan! So, no, Fabian. Nice try, but no. If you want to be able to find something by a particular column, you need that column to be declared first in the index. Basil needs both indexes.

Fabian: “So, why use multi-column indexes at all?” Oh, I’m
so glad you asked, Fabian. Remember the example with Nation and Person, where we said that it might not be useful to have an index on people.nation_id? Well.

It turns out that if you include that column in
an index with another column, it can still be useful. What if you wanted to know all Greek authors named Aristotle? An index on [nation_id, first_name] could be quite helpful in that case. Does it matter which order the columns is specified? Not really, but if we swap them around, [first_name, nation_id], now the index can be used for two different things: finding all authors with a particular name, AND finding all authors with a particular, from a particular country.

This is especially useful when you have a continuous value
(like a datetime) rather than a discrete one (like an integer, e.g. nation_id). Let’s say we have published_at, which tells the exact moment that a scroll was published. You’re not likely to ever want to query all scrolls published on April 23, 38 BC at 9:42 am.

But, if you index that ﬁeld with another column, like
author: [author_id, published_at] — hey presto! Now you can ﬁnd all scrolls by a particular author, ordered by published_at. Without that index the database has to do a lot more work to satisfy that query and return the results in order. It’ll create a temporary table (either in memory, or on disk, depending on the size of the result set), populate it, sort it, and then return the results. Much better with an index. This suggests that when deciding on your indexes, you ought to also give some consideration to how you’ll be ordering things. Having the database sort large numbers of records when there’s not a reasonable index to choose from will hurt your performance! Consider the following fairly common example from Basil’s library.

Scroll.order("added_at DESC").limit(10) Here we have a query that returns the
ten most recently added scrolls. We’re not finding by anything here—we’re only sorting by added_at, and returning the first ten. At first blush, it might seem like this isn’t so bad—we’re only returning the first ten rows, after all. But if you don’t have an index on added_on, it falls back to some default ordering (maybe the order of the records on disk), and the database will have to create a temporary table to hold the entire scrolls table in order to do the sort for you. Ouch! Please, index on those sort columns!

So, Basil’s almost ready, but Fabian asks him another question:
how do you know which indexes a query is going to use? How can you know whether it will use any indexes at all? Well, now that pulls Basil up short. Here he’d gone and added all these indexes, but how did he know they were actually going to be useful? A bit of hunting through his spellbook turned up a handy incantation: EXPLAIN.

Scroll. joins(:author, author: :nation). where(nations: { name: "Greece" }) It’s
super handy, and super useful, and all of you in the room can learn it. Consider this query. It’s ﬁnding all scrolls by Greek authors, right? Here, let’s add the EXPLAIN incantation to it.

Scroll. joins(:author, author: :nation). where(nations: { name: "Greece" }). explain
It’s as easy as that!

+--------+---------+------+---------------------------+--------------------+ | type | table | type | possible_keys |
key |... +--------+---------+------+---------------------------+--------------------+ | SIMPLE | nations | ref | PRIMARY,idx_nations_name | idx_nations_name |... | SIMPLE | people | ref | PRIMARY,idx_people_nation | idx_people_nation |... | SIMPLE | scrolls | ref | idx_scrolls_author | idx_scrolls_author |... +--------+---------+------+---------------------------+--------------------+ +---------+ +---------+------------+------+--------------------------+ | table |...| key_len | ref | rows | Extra | +---------+ +---------+------------+------+--------------------------+ | nations |...| 768 | const | 1 | Using where; Using index | | people |...| 5 | nations.id | 1 | Using where; Using index | | scrolls |...| 5 | people.id | 107 | Using where | +---------+ +---------+------------+------+--------------------------+ (MySQL output) EXPLAIN with indexes But once indexes have been added…! A bit of warning: query optimization (the process your database uses to determine the best way to answer the query) is a complicated science. Databases consider more than just the presence or absence of indexes; they also look at how much data is in a table, or how well a particular index appears to cover a table. When your database has little data in it, you may ﬁnd that EXPLAIN will prefer to do a full table scan, and refuse to use those indexes you so painstakingly added. If you’re worried about whether an index will be used at crunch time, you can populate a database with mocked data, and then run the explain.

remove_index :people, :born_on remove_index :people, :died_on remove_index :scrolls, :pages remove_index
:scrolls, :word_count remove_index :scrolls, :material_id remove_index :scrolls, :color remove_index :scrolls, :ink_color # ... Library Schema v3.0 “Just What the Doctor Ordered” At any rate, Basil’s new v3 schema looks something like this. Note the indexes this time — he’s only got a few of them. He’s tried the EXPLAIN spell on the queries he expects to get most often, and he’s happy with what those returned. He’s conﬁdent that together, he and Fabian can stay on top of this.

He summons his genie again, makes his wish, and gets
to work. And… (dum, duh duh DUM! *dramatic pause*) It seems to be working! Hour after hour, Basil and Fabian fetch, index and ﬁle scroll after scroll. The heavy loads are easy handled, with Basil needing to call in a homunculus or two only at the peakiest of peak times.

There are a few queries that were stumpers — pesky
researchers wanting to know how many Egyptian authors wrote on the topic of geometry between the 8th and 5th centuries BC, on yellow parchment, in blue ink — but Basil simply tells them to wait, and he handles those after hours, when load is particularly low.

It turns out, though, that Alexandrians have a thing about
burning libraries. They torch Basil’s on general principle just a few weeks later. Oh, well! Users are notoriously ﬁckle.

How about you? How healthy is your database? Which indexes
are you missing? Remember: foreign keys and sort criteria are candidates for indexes!

Jamis Buck @jamis [email protected] For more about Basil & Fabian:
http://blog.jamisbuck.org “Mazes for Programmers” http://pragprog.com/book/jbmaze

Voilá! Indexes -- A Look at Some Simple Prevent...

Voilá! Indexes -- A Look at Some Simple Preventative Magick

Other Decks in Technology

Featured

Transcript