of time series data points per day
• Must write data at wire speeds
• Read slices of data for graphing and analysis
• Also write various aggregations and summarizations

Why We Need a Database
• Replicates data across fault domains
• EC2-aware data placement strategies
• Good support for write-heavy workloads
• Compatible data model for time series data
• Automatic data expiration with TTLs

Why not MySQL?
• Relational data model is not a good match
• Experience with operating large, sharded deployments

Why not HBase?
• Operational complexity: ZooKeeper, Hadoop, HDFS, ...
• Special "master" role

Why not Dynamo?
• Avoid vendor lock-in and high cost
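The per-write TTL expiration mentioned above (which Cassandra provides natively, e.g. via `USING TTL` in CQL) can be sketched in plain Python. This is an illustration of the behavior only; `TTLStore` is a hypothetical helper, not any driver API:

```python
import time

class TTLStore:
    """Minimal sketch of per-write TTL expiration (hypothetical helper)."""

    def __init__(self):
        self._data = {}  # key -> (value, absolute expiry timestamp)

    def put(self, key, value, ttl_seconds):
        # Each write carries its own TTL, as in Cassandra.
        self._data[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expiry = item
        if time.time() >= expiry:
            # Lazily drop expired data on read, roughly analogous to
            # Cassandra discarding expired cells during reads/compaction.
            del self._data[key]
            return None
        return value

store = TTLStore()
store.put("cpu.user", 42.0, ttl_seconds=60)
print(store.get("cpu.user"))  # → 42.0
```

The point of the sketch: expiration is a property of each write, so old time series points vanish without any explicit delete pass.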
◦ Boto for the AWS API
◦ Fabric + Puppet for bootstrapping
◦ Fabric for operations

• One CLI tool
◦ Launch a new cluster
◦ Upsize a cluster
◦ Replace a dead node
◦ Remove existing nodes
◦ List nodes in a cluster
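A "one CLI tool" with a subcommand per operation above might be structured with `argparse` subparsers, as in this sketch. The subcommand names mirror the list; the tool name, flags, and defaults are assumptions, and the handlers are stubs where a real tool would call out to Boto and Fabric:

```python
import argparse

def build_parser():
    # One CLI entry point; each cluster operation is a subcommand.
    parser = argparse.ArgumentParser(prog="cluster")
    sub = parser.add_subparsers(dest="command", required=True)

    launch = sub.add_parser("launch", help="Launch a new cluster")
    launch.add_argument("--name", required=True)
    launch.add_argument("--size", type=int, default=3)

    upsize = sub.add_parser("upsize", help="Add nodes to an existing cluster")
    upsize.add_argument("--name", required=True)
    upsize.add_argument("--count", type=int, default=1)

    replace = sub.add_parser("replace", help="Replace a dead node")
    replace.add_argument("--node", required=True)

    remove = sub.add_parser("remove", help="Remove existing nodes")
    remove.add_argument("--node", required=True)

    listing = sub.add_parser("list", help="List nodes in a cluster")
    listing.add_argument("--name", required=True)

    return parser

args = build_parser().parse_args(["launch", "--name", "metrics", "--size", "5"])
print(args.command, args.name, args.size)  # → launch metrics 5
```

Keeping every operation behind a single parser gives one consistent `--help` surface for operators instead of a pile of one-off scripts.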
• Automatic upgrades
• Full-time operations
• “Infinitely” scalable
• Automatic scaling
• Likely decreasing costs
◦ AWS has a history of aggressively reducing prices
◦ Last Dynamo price reduction: March 2013

Cons
• Vendor lock-in
• Complicated cost model
◦ Based on “write units” and “read units”
◦ Request rate, data size, consistency model
• No organizational experience
◦ Must endure growing pains of new service adoption
• No TTL for data
◦ Impacts costs
◦ Efficient data deletion requires engineering investment
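To make the "write units / read units" cost model concrete, here is a sketch of DynamoDB's provisioned-capacity arithmetic as it stood around this period: a write unit covers one write per second of an item up to 1 KB, a read unit covers one strongly consistent read per second of up to 4 KB, and eventually consistent reads cost half. The function names are illustrative, and actual per-unit prices change over time, so none appear here:

```python
import math

def write_units(writes_per_sec, item_kb):
    # Items larger than 1 KB consume one write unit per (rounded-up) KB.
    return writes_per_sec * math.ceil(item_kb / 1.0)

def read_units(reads_per_sec, item_kb, strongly_consistent=True):
    # Items larger than 4 KB consume one read unit per (rounded-up) 4 KB.
    units = reads_per_sec * math.ceil(item_kb / 4.0)
    # Eventually consistent reads are billed at half the rate.
    return units if strongly_consistent else math.ceil(units / 2)

print(write_units(1000, 2.5))       # 1000 writes/s of 2.5 KB items → 3000
print(read_units(500, 6.0))         # strongly consistent 6 KB reads → 1000
print(read_units(500, 6.0, False))  # eventually consistent → 500
```

This is why the model is called complicated: the bill depends jointly on request rate, item size, and consistency level, not just on bytes stored.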