MODELING THE DATA ▸ Start with an RDBMS. Proven and reliable way of modeling schemas. ▸ Central facts table centered around the User ▸ Joined with other tables to do analytical queries ▸ Number of hits, Number of conversions and so on
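A minimal sketch of that facts-table-plus-joins idea, using SQLite in Python for brevity. Table and column names here are my own assumptions for illustration, not the actual production schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Central facts table centered around the User, joined tables for events.
cur.executescript("""
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL
);
CREATE TABLE hits (
    hit_id  INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(user_id),
    url     TEXT
);
CREATE TABLE conversions (
    conversion_id INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL REFERENCES users(user_id)
);
""")

cur.execute("INSERT INTO users VALUES (1, 100)")
cur.executemany("INSERT INTO hits (user_id, url) VALUES (?, ?)",
                [(1, "/a"), (1, "/b"), (1, "/c")])
cur.execute("INSERT INTO conversions (user_id) VALUES (1)")

# Analytical query: number of hits and conversions per user.
row = cur.execute("""
    SELECT u.user_id,
           (SELECT COUNT(*) FROM hits h WHERE h.user_id = u.user_id),
           (SELECT COUNT(*) FROM conversions c WHERE c.user_id = u.user_id)
    FROM users u
""").fetchone()
print(row)  # (1, 3, 1)
```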
DEEP DIVING INTO USER BEHAVIOR ▸ Clickstream data - HTML + CSS data capture and visualization ▸ Record an entire user session on the site, then replay it later. ▸ “Send an HTTP call when the user moves the mouse.” ▸ That means a considerable increase in data volume.
PERILS OF SQLITE ▸ Minimal concurrency support: concurrent access is only safe for reads. ▸ Database locking ▸ “When any process wants to write, it must lock the entire database file for the duration of its update.” ▸ Writes that arrive while the file is locked fail, and unless the app retries them, that is direct data loss. RIP
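You can reproduce the lock in a few lines of Python. This sketch opens two connections to the same file; with `timeout=0` the second writer fails immediately instead of waiting (file path and table are made up for the demo):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None -> autocommit, so we control the transaction explicitly.
# timeout=0 -> raise immediately instead of waiting for the lock.
writer = sqlite3.connect(path, isolation_level=None, timeout=0)
writer.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
writer.execute("BEGIN EXCLUSIVE")  # locks the entire database file
writer.execute("INSERT INTO events (payload) VALUES ('a')")

other = sqlite3.connect(path, timeout=0)
locked = False
try:
    other.execute("INSERT INTO events (payload) VALUES ('b')")
except sqlite3.OperationalError as exc:
    locked = True
    print(exc)  # database is locked
writer.execute("COMMIT")
```

If the second process just drops the message on that error instead of retrying, the data is gone.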
SCALING SQLITE? ▸ Maybe I can have a lot of SQLite DB files instead of one. ▸ Divide them up based on account_id ▸ Divide them up even further based on time ▸ Daily is too frequent, monthly too infrequent. ▸ account_id/week_{week_number}.db looks good
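The routing function for that layout is tiny. A sketch, assuming ISO week numbers (the talk doesn't specify which week-numbering scheme was used):

```python
import datetime

def shard_path(account_id: int, ts: datetime.date) -> str:
    """Route a message to its per-account, per-week SQLite file.

    Follows the account_id/week_{week_number}.db layout above,
    using the ISO week number (1..53) as an assumption.
    """
    week = ts.isocalendar()[1]
    return f"{account_id}/week_{week}.db"

print(shard_path(42, datetime.date(2016, 3, 15)))  # 42/week_11.db
```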
MEH ▸ My locking problem has still not been fixed ▸ Too primitive and bare-bones. ▸ I really don’t wanna manage 10k database files myself. ▸ Please don’t do this.
ITERATION #2 ▸ PostgreSQL! ▸ SQL-standard compliant, good JSON support. ▸ Parallel query execution ▸ MVCC: each transaction gets its own snapshot of the DB ▸ Before implementing it, let's do some benchmarking.
DETOUR - BENCHMARKING ▸ Isn’t an exact science. ▸ Wrong: “My code takes n seconds to run” ▸ Kinda sorta right: ▸ “My code takes n seconds to run on a 4-core CPU, 12 GB RAM machine which has no other processes running, and has an SSD capable of reaching 4k maximum IOPS. I can optimize it to insert k messages per second.” ▸ TL;DR: It’s complicated.
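One cheap way to get past "takes n seconds": warm up first, run several times, and report the median rather than a single measurement. A minimal harness sketch (function names are my own):

```python
import statistics
import time

def bench(fn, warmup=3, runs=10):
    """Warm up, then return the median wall-clock time of `runs` executions.

    The median over several runs is less noisy than one measurement,
    but the number still only holds for THIS machine, load, and storage.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

median = bench(lambda: sum(range(100_000)))
print(f"median: {median:.6f}s")
```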
BENCHMARKING TIPS ▸ Keep it close to the real use case if you can. ▸ Utilities like iostat, iotop, htop, and iftop are your friends. ▸ Tweak a config and see the impact. ▸ Don’t use resource-heavy monitoring tools. ▸ Don’t believe blog posts! Do your own benchmarking.
PORTING LEGACY CODE ▸ “Sqlite to Postgres migration should be simple!” ▸ Nope. SQL is just one part of the equation. Different systems work differently. ▸ Still easier than re-writing for a NoSQL system from scratch.
BACK TO ITERATION #2 ▸ We now have a system that is ▸ Concurrent. No more database locks. Yay! ▸ Keeps data consistent. ▸ Handles a decent amount of load. ▸ What can go wrong?
PROBLEMS WITH ITERATION #2 ▸ Data insertion performance will drop in a few days or weeks. ▸ Or even hours during your load testing. ▸ Huge tables with hundreds of millions of rows. ▸ Worse: Large indexes on these tables.
INDEXES ARE A SCAM! ▸ Great for READ operations ▸ Suck for WRITE operations. 1 write = 1 table write + 1 update per index. ▸ Index pages that don’t fit in PG’s shared buffers get read from and written back to disk. ▸ We’ll end up with higher disk I/O.
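The write amplification is easy to see even in SQLite. This sketch inserts the same rows into an unindexed table and into one carrying two indexes, and prints both timings (absolute numbers are machine-dependent; the point is the extra B-tree maintenance per row):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plain (user_id INTEGER, ts INTEGER)")
conn.execute("CREATE TABLE indexed (user_id INTEGER, ts INTEGER)")
conn.execute("CREATE INDEX idx_user ON indexed (user_id)")
conn.execute("CREATE INDEX idx_ts ON indexed (ts)")

rows = [(i % 1000, i) for i in range(50_000)]

def timed_insert(table: str) -> float:
    start = time.perf_counter()
    with conn:  # one transaction for the whole batch
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return time.perf_counter() - start

print(f"no index:  {timed_insert('plain'):.3f}s")
print(f"2 indexes: {timed_insert('indexed'):.3f}s")  # each row also updates both B-trees
```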
ITERATION #3 - IMPROVING PG ▸ Avoid indexes in the beginning. ▸ Partition one huge table into many smaller tables. PG offers partitioning based on inheritance. ▸ Divide up your data - ▸ random(data) ▸ my_custom_algorithm(data)
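A sketch of what `my_custom_algorithm(data)` could look like for time-based partitioning: route each event to a weekly child table, and generate the inheritance-style DDL (pre-PG-10 style, matching the talk's era). Table and column names are illustrative assumptions:

```python
import datetime

def child_table(ts: datetime.date) -> str:
    """Name of the weekly child table an event lands in."""
    year, week, _ = ts.isocalendar()
    return f"events_{year}_w{week:02d}"

def child_ddl(ts: datetime.date) -> str:
    """DDL for one child partition using PG table inheritance."""
    year, week, _ = ts.isocalendar()
    return (
        f"CREATE TABLE {child_table(ts)} "
        f"(CHECK (iso_year = {year} AND iso_week = {week})) "
        f"INHERITS (events);"
    )

print(child_table(datetime.date(2016, 3, 15)))  # events_2016_w11
print(child_ddl(datetime.date(2016, 3, 15)))
```

The CHECK constraints let PG's constraint exclusion skip irrelevant children at query time.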
SHARDING ▸ Shard means “a small part of a whole”. ▸ How to choose where should an incoming message go? ▸ database_node = (message_id % 2) + 1 ▸ database_node = random(1, 2)
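Both routing strategies from the slide, as code. The trade-off: modulo is deterministic, so you can always find a message again; random spreads load uniformly but gives you no way to locate a row without scanning every node:

```python
import random

def shard_by_id(message_id: int, nodes: int = 2) -> int:
    """Deterministic: the same message always maps to the same node."""
    return (message_id % nodes) + 1

def shard_random(nodes: int = 2) -> int:
    """Uniform spread, but a message can't be located later without scanning."""
    return random.randint(1, nodes)

print(shard_by_id(7))   # 2
print(shard_by_id(10))  # 1
```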
REPLICATION ▸ Your primary DB should always have a slave running with it. ▸ The slave becomes master when the old master dies. ▸ Question: What happens when the old master comes back to life? We now have 2 masters and no slaves! ▸ This is the classic “split brain” problem.
ITERATION #6 ▸ We now have a cluster of DB nodes ▸ Each node has a replica ▸ We are HA or “Highly Available”. ▸ Can something still go wrong? Why do you hate me?
ITERATION #6 ▸ On the master machine, run the following: ▸ “DROP DATABASE” ▸ What happens to the slave? It friggin’ deletes everything too. ▸ Slaves are dumb: they faithfully replay whatever the master does.
BACKUPS! ▸ Replicas are not backups ▸ Take regular backups of your entire database. ▸ PG offers fantastic support for base backups + WAL archiving. ▸ An untested backup is no backup at all.
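A sketch of the relevant `postgresql.conf` settings for WAL archiving (the `/backup/wal` destination is a placeholder; in practice you'd archive to separate storage and pair this with periodic base backups via `pg_basebackup`):

```ini
# postgresql.conf - continuous WAL archiving
wal_level = replica            # "archive" on releases before PG 9.6
archive_mode = on
# Copy each finished WAL segment to the backup area.
# %p = path of the segment, %f = its file name; never overwrite an existing file.
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
```

And then actually restore from it on a scratch machine now and then, because an untested backup is no backup at all.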
LESSONS IN SCALABILITY ▸ 1. THINK ABOUT PERF FROM THE START ▸ 2. DO EXTENSIVE BENCHMARKING ▸ 3. USE QUEUES AND PARTITION YOUR DATA ▸ 4. SHARDING IS UNAVOIDABLE AT SCALE ▸ 5. USE REPLICATION ▸ 6. BACKUPS ARE GOING TO SAVE YOU