Databases for developers
Heimir Sverrisson
Mojotech
Boulder
Slide 2
Overview
● Storage methods
● Access paths
● Schema considerations
● Query tuning (Execution plans)
● The available database engines
Slide 3
The common use case
● We have a collection of objects in memory
● Need to persist them in an efficient way
○ Efficient to write and to read (find)
● This is where ORMs enter the picture
○ ActiveRecord (Rails/Ruby)
○ SQLAlchemy (Python)
○ Hibernate/NHibernate (Java and .Net)
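As a sketch of what these ORMs automate, here is the common case done by hand in Python with the standard-library sqlite3 module. The User class and the save/find helpers are purely illustrative, not part of any ORM:

```python
import sqlite3

# A hand-rolled sketch of what an ORM like ActiveRecord or SQLAlchemy
# automates: mapping an in-memory object to a table row and back.
class User:
    def __init__(self, id, last_name):
        self.id, self.last_name = id, last_name

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_name TEXT)")

def save(user):
    # "persist": write the object's attributes as a row
    conn.execute("INSERT INTO users (id, last_name) VALUES (?, ?)",
                 (user.id, user.last_name))

def find_by_last_name(name):
    # "find": read rows back and rebuild objects from them
    rows = conn.execute(
        "SELECT id, last_name FROM users WHERE last_name = ?", (name,))
    return [User(*row) for row in rows]

save(User(1, "Adams"))
print(find_by_last_name("Adams")[0].last_name)  # Adams
```

An ORM generates essentially this mapping code for every model class, which is why it shines in the common case.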
Slide 4
Object Relational Mapping (ORM)
● Designed to simplify the life of developers
● Supposed to fix the “Impedance mismatch”
between an Object model and a Relational
Database
● Usually simple to use for the common case
● Can be really painful to deal with in more
complex situations
Slide 5
Structured Query Language (SQL)
● Somewhat standardized way of interacting
with a Relational Database
● One flavor per vendor, with a growing
common subset
● A set-based language that can be hard to
grasp if you are used to procedural thinking
● Well supported for development and
end-user interaction
Slide 6
Database storage
● Data and indices are stored in fixed-size
blocks on disk (2k, 4k, 8k, 16k)
● Some databases have one file per
table/index; others organize objects into
tablespaces, with one or more files per
tablespace
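For illustration, SQLite (covered later in this deck) exposes its block size, which it calls the page size, through a pragma. A sketch using the standard-library sqlite3 module:

```python
import sqlite3

# SQLite stores tables and indices in fixed-size pages, the same idea
# as the disk blocks described above. The default is typically 4096 bytes.
conn = sqlite3.connect(":memory:")
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
print(page_size)
```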
Slide 7
The Rails DB data model
● Two tables with a foreign key relationship
Slide 8
Structure of the users table
And an index on the last_name column
Slide 9
Table data blocks
● Here showing only the id and last_name columns
Slide 10
Finding by last name -> Table scan
User.where("last_name = 'Adams'")
User Load (0.8ms) SELECT "users".* FROM "users" WHERE
(last_name = 'Adams')
#
Slide 11
Table scan
● Need to read all 100 blocks of the data, as
there is no guaranteed order by last_name
● Access time grows as O(n) where n is the
number of rows in the table
● Fine for small tables, but very bad for big
tables and selective queries
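The table scan shows up directly in the optimizer's plan. A small SQLite sketch (the schema is a minimal stand-in for the talk's users table; exact plan wording varies by SQLite version):

```python
import sqlite3

# With no index on last_name, the planner has no choice but a full
# table scan: the O(n) access path described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.executemany("INSERT INTO users (last_name) VALUES (?)",
                 [("Adams",), ("Armstrong",), ("Haley",)])

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE last_name = 'Adams'"
).fetchall()
# Each plan row is (id, parent, notused, detail); the detail string
# reads like 'SCAN users' when every block must be read.
print(plan[0][3])
```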
Slide 12
Create index leaf blocks
● One entry per table row, containing index value and a row-id
Slide 13
Intermediate index blocks
Slide 14
Index root block
Slide 15
Using index to access data
SELECT * FROM users
WHERE last_name = 'Armstrong';
Slide 16
Index root block
last_name='Armstrong'
Slide 21
Index lookup
● Start by reading the root block of the index
● The index is ordered by last_name
● Navigate through the index down to a leaf
block and finally read the data
● Access time grows as O(log(n)) where n is
the number of rows in the table
● E.g. growing the data by three orders of
magnitude adds only about three levels to
the index tree!
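The root-to-leaf walk described above is visible in the query plan. A SQLite sketch of the switch from scan to index lookup (table and index names mirror the earlier slides; exact plan wording varies by version):

```python
import sqlite3

# Same table as before, but now with an index on last_name, so the
# planner can traverse the B-tree instead of reading every block.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.execute("CREATE INDEX index_users_on_last_name ON users(last_name)")
conn.execute("INSERT INTO users (last_name) VALUES ('Armstrong')")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE last_name = 'Armstrong'"
).fetchall()
# The detail string now reads like:
#   SEARCH users USING INDEX index_users_on_last_name (last_name=?)
print(plan[0][3])
```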
Slide 22
Index impact
Slide 23
The Rails DB data model (again)
● A user has many orders
● An order belongs to a user
Slide 24
Index impact
Order.joins(:user).
select('users.id').
select('sum(price) as the_sum').
where("last_name='Haley'").
group("users.id")
Becomes:
SELECT users.id,sum(price) as the_sum
FROM "orders"
INNER JOIN "users" ON "users"."id" = "orders"."user_id"
WHERE (last_name='Haley')
GROUP BY users.id
Slide 25
Time for 1k users, 850k orders
No index: 0.243674 elapsed 0.221640 user 0.021172 system
CREATE INDEX index_users_on_last_name ON users(last_name);
0.243382 elapsed 0.221215 user 0.021065 system
CREATE INDEX index_orders_on_user_id ON orders(user_id);
0.004092 elapsed 0.004052 user 0.000024 system
60-fold speedup!
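The effect can be reproduced in miniature with SQLite: once orders.user_id is indexed, the join probes the index per user instead of scanning the whole orders table. A sketch, with the schema as a minimal stand-in for the talk's tables:

```python
import sqlite3

# Compare the join's query plan before and after indexing the
# foreign-key column orders.user_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE users (id INTEGER PRIMARY KEY, last_name TEXT);
  CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, price REAL);
  CREATE INDEX index_users_on_last_name ON users(last_name);
""")

query = """EXPLAIN QUERY PLAN
  SELECT users.id, sum(price) AS the_sum
  FROM orders
  INNER JOIN users ON users.id = orders.user_id
  WHERE last_name = 'Haley'
  GROUP BY users.id"""

before = [row[3] for row in conn.execute(query)]   # orders side: SCAN
conn.execute("CREATE INDEX index_orders_on_user_id ON orders(user_id)")
after = [row[3] for row in conn.execute(query)]    # orders side: SEARCH ... USING INDEX

print(before)
print(after)
```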
Slide 26
The downside of indexing
● Extra work for all DML statements
● Deleting a row also needs to delete from
each index on the table
● Inserting will have to also insert into each
index (possibly causing a block split)
● Update is the combination of Delete/Insert
on the indices of the updated columns
Slide 27
Access Paths
How to get to the data
Slide 28
Access paths
● Table scan - read all the data
○ Fast for small tables
● Index lookup - one row at a time
○ Traverse the index from the root down to the leaf
○ Equi-join
● Index range-scan
○ Find the first row with the index
○ Scan the index for subsequent rows
○ BETWEEN, greater-than, less-than
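A range-scan sketch in SQLite: the planner finds the first matching leaf entry via the B-tree and then walks the ordered leaf entries, and it reports the range bounds in the plan (exact wording varies by version):

```python
import sqlite3

# A BETWEEN predicate on an indexed column produces an index range scan:
# one B-tree descent to the first match, then a sequential walk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.execute("CREATE INDEX index_users_on_last_name ON users(last_name)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users "
    "WHERE last_name BETWEEN 'A' AND 'B'").fetchall()
# Detail reads like:
#   SEARCH users USING INDEX index_users_on_last_name (last_name>? AND last_name<?)
print(plan[0][3])
```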
Slide 29
Schema considerations
Slide 30
Primary keys
● All tables should have a PK
● Preferably a meaningless integer called id
● Should not depend on business rules or
even be visible to end-users
● Never create a composite primary key!
Slide 31
Foreign keys
● Should always be declared
● Should have a meaningful name, e.g. user_id,
owner_id
● Make sure to index all foreign keys!
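Note that declaring a foreign key and indexing it are separate steps: SQLite (like PostgreSQL) does not index FK columns automatically. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
  CREATE TABLE users (id INTEGER PRIMARY KEY);
  CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id)  -- declared, meaningfully named FK
  );
  -- The index does NOT come for free with the FK declaration:
  CREATE INDEX index_orders_on_user_id ON orders(user_id);
""")

indexes = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(indexes)
```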
Slide 32
Query tuning
Execution plans
Slide 33
Explain a Query
EXPLAIN
SELECT u.last_name, SUM(o.price) AS total_price
FROM users AS u
INNER JOIN orders AS o ON o.user_id = u.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
Slide 34
Query plan
Limit (cost=34753.11..34753.12 rows=5 width=13)
-> Sort (cost=34753.11..34754.15 rows=416 width=13)
Sort Key: (sum(o.price))
-> HashAggregate (cost=34741.00..34746.20 rows=416 width=13)
Group Key: u.last_name
-> Hash Join (cost=30.50..30491.00 rows=850000 width=13)
Hash Cond: (o.user_id = u.id)
-> Seq Scan on orders o (rows=850000 width=10)
-> Hash (rows=1000 width=11)
-> Seq Scan on users u (rows=1000 width=11)
Slide 35
Query plan GUI
● Hover over each step to get more info
Slide 36
Cost estimation
● The database engine needs metadata to
come up with a good plan
○ Is there an index on the join column, or do we have
to do a full scan?
○ In case of multiple indices, pick the best one to use
● It also needs data statistics (row count and
block count) to adapt to changing data
volume
Slide 37
Collecting Statistics
● Useful metadata
○ Number of distinct values in a column
○ Histogram of the distribution
● Must be updated when the data changes
drastically
○ Initial data load
○ Big batch updates
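SQLite has a small-scale analog of this machinery: the ANALYZE command writes per-table and per-index statistics into the sqlite_stat1 table, which the planner then consults. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.execute("CREATE INDEX index_users_on_last_name ON users(last_name)")
conn.executemany("INSERT INTO users (last_name) VALUES (?)",
                 [("Adams",), ("Armstrong",)])

# Collect statistics; the planner reads sqlite_stat1 on later queries.
conn.execute("ANALYZE")
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
# stat holds the row count followed by rows-per-distinct-value figures.
print(stats)
```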
Slide 38
PostgreSQL statistics
● Statistics are updated as part of `vacuuming`
tables (getting rid of obsolete row versions)
● Can be auto-configured with:
○ autovacuum=on in postgresql.conf
● Can also be done interactively with the
ANALYZE or VACUUM ANALYZE
commands
Slide 39
MySQL Statistics
● Depend on the setting of the
innodb_stats_on_metadata variable
● If set (which is the default), the statistics will
be updated on:
○ SHOW TABLE STATUS and SHOW INDEX
○ Access to the INFORMATION_SCHEMA
● Can also run the ANALYZE TABLE command
Slide 40
Database engines
Slide 41
SQLite
● Open-source single-file library implementing
a good chunk of standard SQL
● Very lightweight, with a small footprint
● No separate server process
● Available on almost all platforms, including
cell phones, and its database files are portable
● Trivial to use with Rails
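The serverless, single-file model in action, sketched with Python's standard-library sqlite3 module (the file path is illustrative):

```python
import os
import sqlite3
import tempfile

# No server process: the whole database is one ordinary, portable file.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite3")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.execute("INSERT INTO users (last_name) VALUES ('Adams')")
conn.commit()
conn.close()

# Re-open the same file: the data is durably stored in it, and the file
# could just as well be copied to another machine first.
conn2 = sqlite3.connect(path)
name = conn2.execute("SELECT last_name FROM users").fetchone()[0]
print(name)  # Adams
```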
Slide 42
MySQL
● Open-source (and commercial) database
● Falling behind on standards
● Development has slowed since Oracle
acquired Sun, which had acquired MySQL
● Decent support for high reliability and
scaling
● Excellent community support
Slide 43
PostgreSQL
● Open-source and commercial versions
● Good support for standard SQL (CTEs and
window functions supported)
● Active development community
● Some support for high availability and
scaling
● Great community support
Slide 44
MS SQLServer
● Acquired from Sybase in 1993 (v 4.21 NT)
● Maintained for Windows since then by MS
● Runs only on MS Windows (Linux coming!)
● Very standards compliant
● Rather scalable: partitioning, parallel query
● Somewhat highly available (failover clustering)
● Not free
Slide 45
Oracle
● First version 1978
● Very standards compliant
● Highly scalable, partitioning, parallel
everything, real clustering
● Multi-platform (written in C since version 3)
● Highly available
● Not inexpensive