
Databases for developers


A talk given at Ruby Meetup in Boulder on April 20, 2016.

Heimir Sverrisson

April 21, 2016


Transcript

  1. Overview
     • Storage methods
     • Access paths
     • Schema considerations
     • Query tuning (execution plans)
     • The available database engines
  2. The common use case
     • We have a collection of objects in memory
     • Need to persist them in an effective way
       ◦ Effective to write and read (find)
     • This is where ORMs enter the picture
       ◦ ActiveRecord (Rails/Ruby)
       ◦ SQLAlchemy (Python)
       ◦ Hibernate/NHibernate (Java and .Net)
  3. Object Relational Mapping (ORM)
     • Designed to simplify the life of developers
     • Supposed to fix the “impedance mismatch” between an object model and a relational database
     • Usually simple to use for the common case
     • Can be really horrible to deal with in more complex situations
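The mapping an ORM performs can be sketched in a few lines of plain Ruby. `ToyRecord` and its methods are hypothetical stand-ins to show the idea, not ActiveRecord's actual implementation:

```ruby
# Toy sketch of the object-to-SQL mapping an ORM performs.
# ToyRecord is hypothetical; real ORMs also escape values,
# track changes, cache connections, etc.
class ToyRecord
  # Conventional table name: lowercased class name, pluralized naively.
  def self.table_name
    name.downcase + "s"
  end

  def initialize(attrs)
    @attrs = attrs
  end

  # The INSERT statement that would persist this object.
  def to_insert_sql
    cols = @attrs.keys.join(", ")
    vals = @attrs.values.map { |v| "'#{v}'" }.join(", ")
    "INSERT INTO #{self.class.table_name} (#{cols}) VALUES (#{vals})"
  end

  # The SELECT a finder like User.where(...) would run.
  def self.where_sql(condition)
    "SELECT * FROM #{table_name} WHERE (#{condition})"
  end
end

class User < ToyRecord; end

puts User.new(last_name: "Adams").to_insert_sql
puts User.where_sql("last_name = 'Adams'")
```

The "common case" is exactly this kind of mechanical translation; the pain starts when the object graph no longer maps one-to-one onto tables.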
  4. Structured Query Language (SQL)
     • Somewhat standard way of interacting with a relational database
     • One flavor per vendor, with a growing common subset
     • A set language that is sometimes hard to grasp if you are used to procedural thinking
     • Well supported for development and end-user interaction
  5. Database storage
     • Data and indices are stored in fixed-size blocks on disk (2k, 4k, 8k, 16k)
     • Some databases have one file per table/index; others organize the objects into table spaces, with one or more files per table space
  6. Finding by last name -> Table scan
     User.where("last_name = 'Adams'")
     User Load (0.8ms) SELECT "users".* FROM "users" WHERE (last_name = 'Adams')
     #<User id: 564, last_name: "Adams">
     #<User id: 610, last_name: "Adams">
     #<User id: 804, last_name: "Adams">
  7. Table scan
     • Need to read all 100 blocks of the data, as there is no guaranteed order by last_name
     • Access time grows as O(n), where n is the number of rows in the table
     • Fine for small tables, but very bad for big tables and selective queries
  8. Create index leaf blocks
     • One entry per table row, containing the index value and a row-id
  9. Index lookup
     • Start by reading the root block of the index
     • The index is ordered by last_name
     • Navigate through the index down to a leaf block and finally read the data
     • Access time grows as O(log(n)), where n is the number of rows in the table
     • E.g. access time only roughly triples when data grows by three orders of magnitude!
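The difference between the two access paths can be sketched over in-memory Ruby arrays, with a sorted array standing in for the B-tree index's ordered keys (the sample rows are hypothetical):

```ruby
# Sketch of the two access paths on in-memory data: a table scan
# is a linear pass over every row, while an index lookup is a
# binary search over keys kept in sorted order (as in a B-tree).
rows = [
  { id: 564, last_name: "Adams" },
  { id: 12,  last_name: "Baker" },
  { id: 77,  last_name: "Haley" },
  { id: 901, last_name: "Young" },
]

# "Table scan": O(n) - touches every row until a match is found.
scan_hit = rows.find { |r| r[:last_name] == "Haley" }

# "Index lookup": keys sorted by last_name, searched in O(log n).
index = rows.sort_by { |r| r[:last_name] }
index_hit = index.bsearch { |r| "Haley" <=> r[:last_name] }

puts scan_hit[:id]   # both paths return the same row
puts index_hit[:id]
```

Either path finds the same row; only the amount of data touched differs, which is why the gap widens as the table grows.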
  10. The Rails DB data model (again)
      • A user has many orders
      • An order belongs to a user
  11. Index impact
      Order.joins(:user).
        select('users.id').
        select('sum(price) as the_sum').
        where("last_name='Haley'").
        group("users.id")
      Becomes:
      SELECT users.id, sum(price) as the_sum
      FROM "orders" INNER JOIN "users" ON "users"."id" = "orders"."user_id"
      WHERE (last_name='Haley')
      GROUP BY users.id
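On hypothetical in-memory sample data, the same join, filter, group and sum logic looks like this in plain Ruby:

```ruby
# Plain-Ruby equivalent of the join + WHERE + GROUP BY + sum(price)
# query above, over hypothetical in-memory sample data.
users  = [
  { id: 1, last_name: "Haley" },
  { id: 2, last_name: "Adams" },
  { id: 3, last_name: "Haley" },
]
orders = [
  { user_id: 1, price: 10.0 },
  { user_id: 1, price: 5.0 },
  { user_id: 2, price: 7.0 },
  { user_id: 3, price: 2.5 },
]

sums = orders
  # join to users and keep only orders whose user is named Haley
  .select { |o| users.find { |u| u[:id] == o[:user_id] }[:last_name] == "Haley" }
  # GROUP BY users.id
  .group_by { |o| o[:user_id] }
  # sum(price) per group
  .transform_values { |os| os.sum { |o| o[:price] } }

p sums  # => {1=>15.0, 3=>2.5}
```

The database does the same work; the point of slide 12 is that without an index the "join to users" step is a full scan per lookup.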
  12. Time for 1k users, 850k orders
      No index:
        0.243674 elapsed, 0.221640 user, 0.021172 system
      CREATE INDEX index_users_on_last_name ON users(last_name);
        0.243382 elapsed, 0.221215 user, 0.021065 system
      CREATE INDEX index_orders_on_user_id ON orders(user_id);
        0.004092 elapsed, 0.004052 user, 0.000024 system
      60-fold speedup!
  13. The downside of indexing
      • Extra work for all DML statements
      • Deleting a row also needs to delete from each index on the table
      • Inserting also has to insert into each index (possibly causing a block split)
      • An update is the combination of delete/insert on the indices of the updated columns
  14. Access paths
      • Table scan - read all the data
        ◦ Fast for small tables
      • Index lookup - one row at a time
        ◦ Traverse the index from the root down to the leaf
        ◦ Equi-join
      • Index range-scan
        ◦ Find the first row with the index
        ◦ Scan the index for subsequent rows
        ◦ BETWEEN, greater-than, less-than
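The range-scan path can be sketched over a sorted in-memory key array, with a binary search standing in for the index descent (the key list is hypothetical):

```ruby
# Sketch of an index range-scan: binary-search to the first
# qualifying key, then read sequential entries until the
# condition fails - here, keys BETWEEN "B" AND "D".
keys = ["Adams", "Baker", "Cole", "Davis", "Evans"]

first = keys.bsearch_index { |k| k >= "B" }  # find-minimum mode
result = []
i = first
while i && i < keys.size && keys[i] <= "D"
  result << keys[i]  # leaf entries are read in key order
  i += 1
end

p result  # => ["Baker", "Cole"]
```

This is why range predicates on an indexed column stay cheap: one descent plus a short sequential read of adjacent leaf entries.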
  15. Primary keys
      • All tables should have a PK
      • Preferably a meaningless integer called id
      • Should not depend on business rules, or even be visible to end-users
      • Never create a composite primary key!
  16. Foreign keys
      • Should always be declared
      • Should have a meaningful name, e.g. user_id, owner_id
      • Make sure to index all foreign keys!
  17. Explain a Query
      EXPLAIN SELECT u.last_name, SUM(o.price) AS total_price
      FROM users AS u INNER JOIN orders AS o ON o.user_id = u.id
      GROUP BY 1 ORDER BY 2 DESC LIMIT 5;
  18. Query plan
      Limit (cost=34753.11..34753.12 rows=5 width=13)
        -> Sort (cost=34753.11..34754.15 rows=416 width=13)
             Sort Key: (sum(o.price))
             -> HashAggregate (cost=34741.00..34746.20 rows=416 width=13)
                  Group Key: u.last_name
                  -> Hash Join (cost=30.50..30491.00 rows=850000 width=13)
                       Hash Cond: (o.user_id = u.id)
                       -> Seq Scan on orders o (rows=850000 width=10)
                       -> Hash (rows=1000 width=11)
                            -> Seq Scan on users u (rows=1000 width=11)
  19. Cost estimation
      • The database engine needs metadata to come up with a good plan
        ◦ Is there an index on the join column, or do we have to do a full scan?
        ◦ In case of multiple indices, pick the best one to use
      • It also needs data statistics, row count and block count, to adapt to changing data volume
  20. Collecting Statistics
      • Useful metadata
        ◦ Number of distinct values in a column
        ◦ Histogram of the distribution
      • Must be updated when the data changes drastically
        ◦ Initial data load
        ◦ Big batch updates
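Both statistics can be sketched in plain Ruby over a hypothetical last_name column, to make concrete what the engine is actually collecting:

```ruby
# Sketch of the two statistics mentioned above, computed in plain
# Ruby over a hypothetical last_name column: the distinct-value
# count and a value-frequency histogram.
column = ["Adams", "Haley", "Adams", "Young", "Haley", "Haley"]

distinct  = column.uniq.size  # how many distinct values exist
histogram = column.tally      # value => frequency (Ruby >= 2.7)

p distinct   # => 3
p histogram  # => {"Adams"=>2, "Haley"=>3, "Young"=>1}
```

The optimizer uses exactly these numbers to estimate selectivity: a predicate on a value that the histogram shows to be rare favors an index lookup, a frequent one favors a scan.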
  21. PostgreSQL statistics
      • Statistics are updated as part of `vacuuming` tables (getting rid of obsolete row versions)
      • Can be auto-configured with:
        ◦ autovacuum=on in postgresql.conf
      • Can also be done interactively with the ANALYZE or VACUUM ANALYZE commands
  22. MySQL Statistics
      • Depends on the setting of the innodb_stats_on_metadata variable
      • If set (which is the default), the statistics will be updated on:
        ◦ SHOW TABLE STATUS and SHOW INDEX
        ◦ Access to the INFORMATION_SCHEMA
      • Can also run the ANALYZE TABLE command
  23. SQLite
      • Open-source single-file library implementing a good chunk of standard SQL
      • Very lightweight, with a small footprint
      • No separate server process
      • Available on almost all platforms, including cell phones, and the database files are portable
      • Trivial to use with Rails
  24. MySQL
      • Open-source (and commercial) database
      • Falling behind on standards
      • Development has slowed since Oracle bought Sun, which had bought MySQL
      • Decent support for high reliability and scaling
      • Excellent community support
  25. PostgreSQL
      • Open-source and commercial versions
      • Good support for standard SQL (CTEs and window functions supported)
      • Active development community
      • Some support for high availability and scaling
      • Great community support
  26. MS SQLServer
      • Acquired from Sybase in 1993 (v 4.21 NT)
      • Maintained for Windows by MS since then
      • Runs only on MS Windows (Linux coming!)
      • Very standards compliant
      • Rather scalable: partitioning, parallel query
      • Somewhat HA (failover clustering)
      • Not free
  27. Oracle
      • First version 1978
      • Very standards compliant
      • Highly scalable: partitioning, parallel everything, real clustering
      • Multi-platform (written in C since version 3)
      • Highly available
      • Not inexpensive