Slide 1

Databases for Developers
Heimir Sverrisson, Mojotech, Boulder

Slide 2

Overview
● Storage methods
● Access paths
● Schema considerations
● Query tuning (execution plans)
● The available database engines

Slide 3

The common use case
● We have a collection of objects in memory
● Need to persist them in an effective way
  ○ Effective to write and to read (find)
● This is where ORMs enter the picture
  ○ ActiveRecord (Rails/Ruby)
  ○ SQLAlchemy (Python)
  ○ Hibernate/NHibernate (Java and .Net)

Slide 4

Object Relational Mapping (ORM)
● Designed to simplify the life of developers
● Supposed to fix the “impedance mismatch” between an object model and a relational database
● Usually simple to use for the common case
● Can be really horrible to deal with in more complex situations

Slide 5

Structured Query Language (SQL)
● Somewhat standard way of interacting with a relational database
● One flavor per vendor, with a growing common subset
● A set-oriented language that is sometimes hard to grasp if you are used to procedural thinking (see the sketch below)
● Well supported for development and end-user interaction
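
To illustrate the set-oriented point above: instead of looping over rows in application code and updating them one at a time, a single statement operates on the whole set. This is only a sketch, using the users/orders example tables that appear later in this deck, PostgreSQL's UPDATE ... FROM syntax, and a made-up total_spent column that is not part of the example schema:

-- Set-oriented SQL: one statement recomputes the total for every user at once
UPDATE users
SET    total_spent = o.sum_price          -- total_spent is hypothetical, for illustration only
FROM  (SELECT user_id, SUM(price) AS sum_price
       FROM orders
       GROUP BY user_id) AS o
WHERE  users.id = o.user_id;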

Slide 6

Database storage
● Data and indices are stored in fixed-size blocks on disk (2k, 4k, 8k, 16k)
● Some databases have one file per table/index; others organize the objects into table spaces, with one or more files per table space
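
As a concrete, PostgreSQL-specific way to see the block size in use, the server exposes it as a read-only setting:

-- PostgreSQL: block (page) size used for table and index storage, fixed at build time
SHOW block_size;   -- commonly 8192, i.e. 8k blocks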

Slide 7

The Rails DB data model
● Two tables with a foreign key relationship (sketched below)
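
A minimal DDL sketch of that model, reconstructed from the columns used later in the deck (id, last_name, user_id, price); the real Rails schema would carry more columns, and first_name here is only an assumption:

-- Hypothetical reconstruction of the example schema (PostgreSQL flavor)
CREATE TABLE users (
  id         bigserial PRIMARY KEY,
  first_name text,                -- assumed, not shown on the slides
  last_name  text
);

CREATE TABLE orders (
  id      bigserial PRIMARY KEY,
  user_id bigint NOT NULL REFERENCES users(id),  -- the foreign key relationship
  price   numeric(10,2)
);
-- The indexes discussed on later slides (last_name, user_id) are created separately.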

Slide 8

Structure of the users table
And an index on the last_name column

Slide 9

Table data blocks
● Here only showing the id and last_name columns

Slide 10

Finding by last name -> table scan
User.where("last_name = 'Adams'")
  User Load (0.8ms)  SELECT "users".* FROM "users" WHERE (last_name = 'Adams')
#
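
To confirm which access path is used, the generated SQL can be fed to EXPLAIN (PostgreSQL syntax shown; other engines have equivalents). Without an index on last_name the planner can only scan the whole table; the plan shape in the comment is illustrative, exact costs and casts will differ:

-- With no index on last_name, expect a plan of roughly this shape:
--   Seq Scan on users  (cost=... rows=... width=...)
--     Filter: (last_name = 'Adams')
EXPLAIN SELECT * FROM users WHERE last_name = 'Adams';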

Slide 11

Table scan
● Need to read all 100 blocks of the data, as there is no guaranteed order by last_name
● Access time grows as O(n), where n is the number of rows in the table
● Fine for small tables, but very bad for big tables and selective queries

Slide 12

Create index leaf blocks
● One entry per table row, containing the index value and a row-id

Slide 13

Intermediate index blocks

Slide 14

Index root block

Slide 15

Using index to access data
SELECT * FROM users WHERE last_name = 'Armstrong';

Slide 16

Index root block last_name='Armstrong'

Slide 17

Index root block last_name='Armstrong'

Slide 18

Index root block last_name='Armstrong'

Slide 19

Index root block last_name='Armstrong'

Slide 20

Index root block last_name='Armstrong'

Slide 21

Index lookup
● Start by reading the root block of the index
● The index is ordered by last_name
● Navigate down through the index to a leaf block and finally read the data
● Access time grows as O(log(n)), where n is the number of rows in the table
● I.e. when the table grows a thousand-fold, a lookup only becomes a couple of block reads more expensive (one or two extra index levels), not a thousand times slower
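
The same EXPLAIN check as before shows the difference once the index exists. This is a sketch of the expected plan shape only; the exact node (plain index scan vs. bitmap scan) and the costs depend on the data and the planner:

-- With index_users_on_last_name in place, expect something like:
--   Index Scan using index_users_on_last_name on users  (cost=... rows=... width=...)
--     Index Cond: (last_name = 'Armstrong')
EXPLAIN SELECT * FROM users WHERE last_name = 'Armstrong';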

Slide 22

Index impact

Slide 23

The Rails DB data model (again)
● A user has many orders
● An order belongs to a user

Slide 24

Index impact
Order.joins(:user).
  select('users.id').
  select('sum(price) as the_sum').
  where("last_name='Haley'").
  group("users.id")
Becomes:
SELECT users.id, sum(price) as the_sum
FROM "orders"
INNER JOIN "users" ON "users"."id" = "orders"."user_id"
WHERE (last_name='Haley')
GROUP BY users.id

Slide 25

Time for 1k users, 850k orders
No index:
  0.243674 elapsed  0.221640 user  0.021172 system
CREATE INDEX index_users_on_last_name ON users(last_name);
  0.243382 elapsed  0.221215 user  0.021065 system
CREATE INDEX index_orders_on_user_id ON orders(user_id);
  0.004092 elapsed  0.004052 user  0.000024 system
60-fold speedup!

Slide 26

The downside of indexing
● Extra work for all DML statements
● Deleting a row also has to delete the entry from each index on the table
● Inserting also has to insert into each index (possibly causing a block split)
● An update is the combination of delete/insert on the indices of the updated columns

Slide 27

Access Paths
How to get to the data

Slide 28

Access paths
● Table scan - read all the data
  ○ Fast for small tables
● Index lookup - one row at a time
  ○ Traverse the index from the root down to the leaf
  ○ Equi-join
● Index range-scan
  ○ Find the first row with the index
  ○ Scan the index for subsequent rows
  ○ BETWEEN, greater-than, less-than (examples below)
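
A few hedged examples of predicates that typically map to each access path, using the deck's users table and its last_name index (the optimizer always has the final say, and first_name is an assumed, unindexed column):

-- Index lookup: equality on an indexed column
SELECT * FROM users WHERE last_name = 'Armstrong';

-- Index range scan: locate the first entry, then walk the ordered index
SELECT * FROM users WHERE last_name BETWEEN 'Adams' AND 'Baker';
SELECT * FROM users WHERE last_name < 'Baker';

-- Table scan: no usable index on the filtered column
SELECT * FROM users WHERE first_name = 'Neil';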

Slide 29

Schema considerations

Slide 30

Primary keys
● All tables should have a PK
● Preferably a meaningless integer called id (see the sketch below)
● Should not depend on business rules or even be visible to end-users
● Never create a composite primary key!
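
A minimal sketch of that convention in PostgreSQL terms (the table and column names here are made up for illustration):

-- Surrogate, meaningless primary key: an auto-incrementing integer named id
CREATE TABLE products (
  id   bigserial PRIMARY KEY,   -- or, in newer PostgreSQL: bigint GENERATED ALWAYS AS IDENTITY
  sku  text UNIQUE,             -- business identifiers get UNIQUE constraints, not the PK role
  name text
);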

Slide 31

Foreign keys
● Should always be declared
● Should have a meaningful name, e.g. user_id, owner_id
● Make sure to index all foreign keys! (see the sketch below)
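
Both halves of that advice, sketched against the deck's users/orders example (the constraint name is made up; note that PostgreSQL, Oracle and SQL Server do not index foreign key columns automatically, while MySQL's InnoDB does):

-- Declare the relationship so the database can enforce it
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_user_id
  FOREIGN KEY (user_id) REFERENCES users(id);

-- And index it, so joins and referential checks don't fall back to table scans
CREATE INDEX index_orders_on_user_id ON orders(user_id);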

Slide 32

Query tuning
Execution plans

Slide 33

Explain a Query
EXPLAIN
SELECT u.last_name, SUM(o.price) AS total_price
FROM users AS u
INNER JOIN orders AS o ON o.user_id = u.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;

Slide 34

Query plan
Limit  (cost=34753.11..34753.12 rows=5 width=13)
  ->  Sort  (cost=34753.11..34754.15 rows=416 width=13)
        Sort Key: (sum(o.price))
        ->  HashAggregate  (cost=34741.00..34746.20 rows=416 width=13)
              Group Key: u.last_name
              ->  Hash Join  (cost=30.50..30491.00 rows=850000 width=13)
                    Hash Cond: (o.user_id = u.id)
                    ->  Seq Scan on orders o  (rows=850000 width=10)
                    ->  Hash  (rows=1000 width=11)
                          ->  Seq Scan on users u  (rows=1000 width=11)
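
Plain EXPLAIN only shows the planner's estimates. To compare them with reality, PostgreSQL can also execute the statement and report actual row counts, timings and buffer usage; since ANALYZE really runs the query, wrap data-modifying statements in a transaction you roll back:

-- Each plan node gains "actual time=... rows=... loops=..." next to the estimates
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.last_name, SUM(o.price) AS total_price
FROM users AS u
INNER JOIN orders AS o ON o.user_id = u.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;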

Slide 35

Query plan GUI
● Hover over each step to get more info

Slide 36

Cost estimation
● The database engine needs metadata to come up with a good plan
  ○ Is there an index on the join column, or do we have to do a full scan?
  ○ In case of multiple indices, pick the best one to use
● It also needs data statistics (row counts and block counts) to adapt to changing data volumes
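
In PostgreSQL the row and block counts the planner works from are visible in the system catalog (the numbers reflect the last ANALYZE/VACUUM, not a live count):

-- Planner's view of table sizes: estimated rows and number of blocks
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname IN ('users', 'orders');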

Slide 37

Collecting Statistics
● Useful metadata
  ○ Number of distinct values in a column
  ○ Histogram of the distribution
● Must be updated when the data changes drastically
  ○ Initial data load
  ○ Big batch updates
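
In PostgreSQL these per-column statistics can be inspected in the pg_stats view, e.g. for the indexed column of the example schema:

-- Distinct-value estimate, most common values and histogram for last_name
SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'users' AND attname = 'last_name';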

Slide 38

PostgreSQL statistics
● Statistics are updated as part of `vacuuming` tables (get rid of obsolete row versions)
● Can be auto-configured with:
  ○ autovacuum=on in postgresql.conf
● Can also be done interactively with the ANALYZE or VACUUM ANALYZE commands
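
Interactively that looks like this (plain ANALYZE only recomputes statistics; VACUUM ANALYZE also cleans up dead row versions first):

ANALYZE users;            -- refresh planner statistics for one table
VACUUM ANALYZE orders;    -- reclaim obsolete row versions, then refresh statistics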

Slide 39

MySQL Statistics
● Depend on the setting of the innodb_stats_on_metadata variable
● If it is set (historically the default; newer MySQL versions default it to off), statistics are updated on:
  ○ SHOW TABLE STATUS and SHOW INDEX
  ○ Access to the INFORMATION_SCHEMA
● Can also run the ANALYZE TABLE command
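
The corresponding manual commands in MySQL:

SHOW VARIABLES LIKE 'innodb_stats_on_metadata';   -- check how statistics collection is configured
ANALYZE TABLE users, orders;                      -- recompute index statistics explicitly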

Slide 40

Database engines

Slide 41

SQLite
● Open source library that implements a good chunk of standard SQL; each database is a single file
● Very lightweight, small footprint
● No separate server process
● Available on almost all platforms, including cell phones, and the database files are portable
● Trivial to use with Rails

Slide 42

MySQL
● Open source (and commercial) database
● Falling behind on standards
● Development has been less active since Oracle bought Sun, which had bought MySQL AB
● Decent support for high reliability and scaling
● Excellent community support

Slide 43

PostgreSQL
● Open source, with commercial distributions available
● Good support for standard SQL (CTEs and window functions supported)
● Active development community
● Some support for high availability and scaling
● Great community support

Slide 44

MS SQLServer
● Originally based on code licensed from Sybase; version 4.21 for Windows NT shipped in 1993
● Maintained for Windows by Microsoft since then
● Runs only on MS Windows (Linux support coming!)
● Very standards compliant
● Rather scalable: partitioning, parallel query
● Somewhat HA (failover clustering)
● Not free

Slide 45

Oracle
● First version 1978
● Very standards compliant
● Highly scalable: partitioning, parallel everything, real clustering
● Multi-platform (written in C since version 3)
● Highly available
● Not inexpensive