
Databases for developers


A talk given at Ruby Meetup in Boulder on April 20, 2016.

Heimir Sverrisson

April 21, 2016


Transcript

  1. Overview
     • Storage methods
     • Access paths
     • Schema considerations
     • Query tuning (execution plans)
     • The available database engines
  2. The common use case
     • We have a collection of objects in memory
     • Need to persist them in an effective way
       ◦ Effective to write and read (find)
     • This is where ORMs enter the picture
       ◦ ActiveRecord (Rails/Ruby)
       ◦ SQLAlchemy (Python)
       ◦ Hibernate/NHibernate (Java and .Net)
  3. Object Relational Mapping (ORM)
     • Designed to simplify the life of developers
     • Supposed to fix the “impedance mismatch” between an object model and a relational database
     • Usually simple to use for the common case
     • Can be really horrible to deal with in more complex situations
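The mapping an ORM performs can be sketched in a few lines of plain Ruby. `ToyRecord` and its methods are hypothetical stand-ins to show the idea, not ActiveRecord's actual implementation:

```ruby
# Toy sketch of the object-to-SQL mapping an ORM performs.
# ToyRecord is hypothetical; real ORMs also escape values,
# track changes, cache connections, etc.
class ToyRecord
  # Conventional table name: lowercased class name, pluralized naively.
  def self.table_name
    name.downcase + "s"
  end

  def initialize(attrs)
    @attrs = attrs
  end

  # The INSERT statement that would persist this object.
  def to_insert_sql
    cols = @attrs.keys.join(", ")
    vals = @attrs.values.map { |v| "'#{v}'" }.join(", ")
    "INSERT INTO #{self.class.table_name} (#{cols}) VALUES (#{vals})"
  end

  # The SELECT a finder like User.where(...) would run.
  def self.where_sql(condition)
    "SELECT * FROM #{table_name} WHERE (#{condition})"
  end
end

class User < ToyRecord; end

puts User.new(last_name: "Adams").to_insert_sql
puts User.where_sql("last_name = 'Adams'")
```

The "common case" is exactly this kind of mechanical translation; the pain starts when the object graph no longer maps one-to-one onto tables.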
  4. Structured Query Language (SQL)
     • Somewhat standard way of interacting with a relational database
     • One flavor per vendor, with a growing common subset
     • A set language that is sometimes hard to grasp if you are used to procedural thinking
     • Well supported for development and end-user interaction
  5. Database storage
     • Data and indices are stored in fixed-size blocks on disk (2k, 4k, 8k, 16k)
     • Some databases have one file per table/index; others organize the objects into table spaces, with one or more files per table space
  6. Finding by last name -> Table scan
     User.where("last_name = 'Adams'")
     User Load (0.8ms) SELECT "users".* FROM "users" WHERE (last_name = 'Adams')
     #<User id: 564, last_name: "Adams">
     #<User id: 610, last_name: "Adams">
     #<User id: 804, last_name: "Adams">
  7. Table scan
     • Need to read all 100 blocks of the data, as there is no guaranteed order by last_name
     • Access time grows as O(n), where n is the number of rows in the table
     • Fine for small tables, but very bad for big tables and selective queries
  8. Create index leaf blocks
     • One entry per table row, containing the index value and a row-id
  9. Index lookup
     • Start by reading the root block of the index
     • The index is ordered by last_name
     • Navigate through the index down to a leaf block and finally read the data
     • Access time grows as O(log(n)), where n is the number of rows in the table
     • E.g. access time only roughly triples when data grows by three orders of magnitude!
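The difference between the two access paths can be sketched over in-memory Ruby arrays, with a sorted array standing in for the B-tree index's ordered keys (the sample rows are hypothetical):

```ruby
# Sketch of the two access paths on in-memory data: a table scan
# is a linear pass over every row, while an index lookup is a
# binary search over keys kept in sorted order (as in a B-tree).
rows = [
  { id: 564, last_name: "Adams" },
  { id: 12,  last_name: "Baker" },
  { id: 77,  last_name: "Haley" },
  { id: 901, last_name: "Young" },
]

# "Table scan": O(n) - touches every row until a match is found.
scan_hit = rows.find { |r| r[:last_name] == "Haley" }

# "Index lookup": keys sorted by last_name, searched in O(log n).
index = rows.sort_by { |r| r[:last_name] }
index_hit = index.bsearch { |r| "Haley" <=> r[:last_name] }

puts scan_hit[:id]   # both paths return the same row
puts index_hit[:id]
```

Either path finds the same row; only the amount of data touched differs, which is why the gap widens as the table grows.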
  10. The Rails DB data model (again)
      • A user has many orders
      • An order belongs to a user
  11. Index impact
      Order.joins(:user).
        select('users.id').
        select('sum(price) as the_sum').
        where("last_name='Haley'").
        group("users.id")
      Becomes:
      SELECT users.id, sum(price) as the_sum
      FROM "orders" INNER JOIN "users" ON "users"."id" = "orders"."user_id"
      WHERE (last_name='Haley')
      GROUP BY users.id
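On hypothetical in-memory sample data, the same join, filter, group and sum logic looks like this in plain Ruby:

```ruby
# Plain-Ruby equivalent of the join + WHERE + GROUP BY + sum(price)
# query above, over hypothetical in-memory sample data.
users  = [
  { id: 1, last_name: "Haley" },
  { id: 2, last_name: "Adams" },
  { id: 3, last_name: "Haley" },
]
orders = [
  { user_id: 1, price: 10.0 },
  { user_id: 1, price: 5.0 },
  { user_id: 2, price: 7.0 },
  { user_id: 3, price: 2.5 },
]

sums = orders
  # join to users and keep only orders whose user is named Haley
  .select { |o| users.find { |u| u[:id] == o[:user_id] }[:last_name] == "Haley" }
  # GROUP BY users.id
  .group_by { |o| o[:user_id] }
  # sum(price) per group
  .transform_values { |os| os.sum { |o| o[:price] } }

p sums  # => {1=>15.0, 3=>2.5}
```

The database does the same work; the point of slide 12 is that without an index the "join to users" step is a full scan per lookup.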
  12. Time for 1k users, 850k orders
      No index:
        0.243674 elapsed, 0.221640 user, 0.021172 system
      CREATE INDEX index_users_on_last_name ON users(last_name);
        0.243382 elapsed, 0.221215 user, 0.021065 system
      CREATE INDEX index_orders_on_user_id ON orders(user_id);
        0.004092 elapsed, 0.004052 user, 0.000024 system
      60-fold speedup!
  13. The downside of indexing
      • Extra work for all DML statements
      • Deleting a row also needs to delete from each index on the table
      • Inserting also has to insert into each index (possibly causing a block split)
      • An update is the combination of delete/insert on the indices of the updated columns
  14. Access paths
      • Table scan - read all the data
        ◦ Fast for small tables
      • Index lookup - one row at a time
        ◦ Traverse the index from the root down to the leaf
        ◦ Equi-join
      • Index range-scan
        ◦ Find the first row with the index
        ◦ Scan the index for subsequent rows
        ◦ BETWEEN, greater-than, less-than
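The range-scan path can be sketched over a sorted in-memory key array, with a binary search standing in for the index descent (the key list is hypothetical):

```ruby
# Sketch of an index range-scan: binary-search to the first
# qualifying key, then read sequential entries until the
# condition fails - here, keys BETWEEN "B" AND "D".
keys = ["Adams", "Baker", "Cole", "Davis", "Evans"]

first = keys.bsearch_index { |k| k >= "B" }  # find-minimum mode
result = []
i = first
while i && i < keys.size && keys[i] <= "D"
  result << keys[i]  # leaf entries are read in key order
  i += 1
end

p result  # => ["Baker", "Cole"]
```

This is why range predicates on an indexed column stay cheap: one descent plus a short sequential read of adjacent leaf entries.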
  15. Primary keys
      • All tables should have a PK
      • Preferably a meaningless integer called id
      • Should not depend on business rules, or even be visible to end-users
      • Never create a composite primary key!
  16. Foreign keys
      • Should always be declared
      • Should have a meaningful name, e.g. user_id, owner_id
      • Make sure to index all foreign keys!
  17. Explain a Query
      EXPLAIN SELECT u.last_name, SUM(o.price) AS total_price
      FROM users AS u INNER JOIN orders AS o ON o.user_id = u.id
      GROUP BY 1 ORDER BY 2 DESC LIMIT 5;
  18. Query plan
      Limit (cost=34753.11..34753.12 rows=5 width=13)
        -> Sort (cost=34753.11..34754.15 rows=416 width=13)
             Sort Key: (sum(o.price))
             -> HashAggregate (cost=34741.00..34746.20 rows=416 width=13)
                  Group Key: u.last_name
                  -> Hash Join (cost=30.50..30491.00 rows=850000 width=13)
                       Hash Cond: (o.user_id = u.id)
                       -> Seq Scan on orders o (rows=850000 width=10)
                       -> Hash (rows=1000 width=11)
                            -> Seq Scan on users u (rows=1000 width=11)
  19. Cost estimation
      • The database engine needs metadata to come up with a good plan
        ◦ Is there an index on the join column, or do we have to do a full scan?
        ◦ In case of multiple indices, pick the best one to use
      • It also needs data statistics, row count and block count, to adapt to changing data volume
  20. Collecting Statistics
      • Useful metadata
        ◦ Number of distinct values in a column
        ◦ Histogram of the distribution
      • Must be updated when the data changes drastically
        ◦ Initial data load
        ◦ Big batch updates
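Both statistics can be sketched in plain Ruby over a hypothetical last_name column, to make concrete what the engine is actually collecting:

```ruby
# Sketch of the two statistics mentioned above, computed in plain
# Ruby over a hypothetical last_name column: the distinct-value
# count and a value-frequency histogram.
column = ["Adams", "Haley", "Adams", "Young", "Haley", "Haley"]

distinct  = column.uniq.size  # how many distinct values exist
histogram = column.tally      # value => frequency (Ruby >= 2.7)

p distinct   # => 3
p histogram  # => {"Adams"=>2, "Haley"=>3, "Young"=>1}
```

The optimizer uses exactly these numbers to estimate selectivity: a predicate on a value that the histogram shows to be rare favors an index lookup, a frequent one favors a scan.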
  21. PostgreSQL statistics
      • Statistics are updated as part of `vacuuming` tables (getting rid of obsolete row versions)
      • Can be auto-configured with:
        ◦ autovacuum=on in postgresql.conf
      • Can also be done interactively with the ANALYZE or VACUUM ANALYZE commands
  22. MySQL Statistics
      • Depends on the setting of the innodb_stats_on_metadata variable
      • If set (which is the default), the statistics will be updated on:
        ◦ SHOW TABLE STATUS and SHOW INDEX
        ◦ Access to the INFORMATION_SCHEMA
      • Can also run the ANALYZE TABLE command
  23. SQLite
      • Open-source single-file library implementing a good chunk of standard SQL
      • Very lightweight, with a small footprint
      • No separate server process
      • Available on almost all platforms, including cell phones, and the database files are portable
      • Trivial to use with Rails
  24. MySQL
      • Open-source (and commercial) database
      • Falling behind on standards
      • Development has slowed since Oracle bought Sun, which had bought MySQL
      • Decent support for high reliability and scaling
      • Excellent community support
  25. PostgreSQL
      • Open-source and commercial versions
      • Good support for standard SQL (CTEs and window functions supported)
      • Active development community
      • Some support for high availability and scaling
      • Great community support
  26. MS SQLServer
      • Acquired from Sybase in 1993 (v 4.21 NT)
      • Maintained for Windows by MS since then
      • Runs only on MS Windows (Linux coming!)
      • Very standards compliant
      • Rather scalable: partitioning, parallel query
      • Somewhat HA (failover clustering)
      • Not free
  27. Oracle
      • First version 1978
      • Very standards compliant
      • Highly scalable: partitioning, parallel everything, real clustering
      • Multi-platform (written in C since version 3)
      • Highly available
      • Not inexpensive