Running Cassandra in AWS

In this presentation (first delivered at the Boston AWS meetup on April 8th, 2013), we explain why we chose Cassandra over other platforms and describe the novel architecture that lets us tolerate the failures one should expect when running at scale on AWS.


April 08, 2013



  1. Stackdriver at a Glance
    Stackdriver's hosted monitoring service helps SaaS companies innovate more by reducing the burden of day-to-day operations
    • Focus on complex distributed systems
    • Founded by cloud/infrastructure industry veterans (Microsoft, VMware, EMC, Endeca, Red Hat) with deep systems and DevOps expertise
    • Team of 15, based in Downtown Boston
    • Private beta underway, let us know if you want to get involved
  2. Problem Domain
    Monitor customer cloud-hosted applications
    • Inventory
    • Services
    • Performance data
    Analyze
    • Groups
    • Aggregation
    • Report, recommend, alert, optimize...
  3. Lambda Architecture
    • Typical of modern architectures for on-line applications.
    • Formalized by Nathan Marz
    • Composed of "batch", "speed", and "serving" layers
    • Batch layer
      ◦ Store of record
      ◦ Compute arbitrary views
    • Speed layer
      ◦ Low latency updates
      ◦ Streaming algorithms
    • Serving layer
      ◦ Combine data from batch and speed layers to answer queries
    [Diagram labels: Data, Batch, Speed, Serving]
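The three layers on the Lambda Architecture slide can be illustrated with a small sketch. This is not Stackdriver's code: the function and class names (`batch_view`, `SpeedView`, `serve`) and the choice of a per-key sum as the "view" are assumptions made purely for illustration.

```python
from collections import defaultdict


def batch_view(events):
    """Batch layer (illustrative): recompute per-key totals from the full store of record."""
    totals = defaultdict(float)
    for key, value in events:
        totals[key] += value
    return dict(totals)


class SpeedView:
    """Speed layer (illustrative): incrementally updated totals for events
    that have arrived since the last batch run."""

    def __init__(self):
        self.totals = defaultdict(float)

    def update(self, key, value):
        self.totals[key] += value


def serve(key, batch, speed):
    """Serving layer (illustrative): combine the batch and speed results for a query."""
    return batch.get(key, 0.0) + speed.totals.get(key, 0.0)
```

In this sketch a query for a key sees both the archived data summarized by the batch view and anything the speed layer has absorbed since; once a new batch view covers those events, the speed view would be reset.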
  4. Stackdriver Architecture
    • Shares characteristics of lambda architecture
    • Analysis path
      ◦ Compute aggregations
      ◦ Create recommendations
    • Indexing path
      ◦ Make "live" data available "pre-analysis"
    • Query layer
      ◦ Combine "live" and analyzed data to answer queries
      ◦ May require on-the-fly analysis
    • Alerting path
      ◦ Stream processing to detect policy-based anomalies (not discussed here)
    [Diagram labels: Data, Database, Query (Serving), Analysis (Batch), Indexing (Speed), Alerting (Speed), Notification (Serving)]
  5. Database Options
    • We chose Cassandra!
      ◦ True P2P architecture
      ◦ Good support for write-heavy workloads
      ◦ Compatible data model
    • Why not MySQL?
      ◦ Experience with operating large, sharded deployments
      ◦ Relational data model not a good match
    • Why not HBase?
      ◦ Operational complexity - zk, hadoop, hdfs, ...
      ◦ Special "Master" role
    • Why not Dynamo?
      ◦ Avoid vendor lock-in and high cost
  6. Stackdriver Architecture ++
    • Critical archival pipeline has very small surface area
    • Data path has multiple recovery options
    • Scales out easily
    • Cassandra consolidates results of analysis (batch) and indexing (speed)
    • Cassandra stores immutable data, so consistency is not a problem
    • Cassandra is "soft state"
    [Diagram labels: Replicate, Analyze, Archive, Index, Cleanse, Roll-ups, Recs, Analysis, Inventory Data, Series Data, Query]
  7. Cassandra at Stackdriver
    Cluster Configuration
    • Version: Datastax Community Edition 1.2.3
    • Replication Factor: 3
    • Vnodes
    • Murmur3Partitioner
    • Ec2Snitch
      ◦ Aids in request efficiency
      ◦ Enables Cassandra to ensure replicas are in different Availability Zones
    • phi_convict_threshold: 8 -> 12
      ◦ Used to determine when nodes are down
      ◦ AWS network can be spotty
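To see why raising `phi_convict_threshold` from 8 to 12 helps on a spotty network, here is a simplified sketch of a phi accrual failure detector. It assumes heartbeat inter-arrival times are exponentially distributed, so that phi grows linearly with the time since the last heartbeat; it is a model of the idea, not Cassandra's actual implementation. The function names and the 1-second mean heartbeat interval used in the usage note are illustrative assumptions.

```python
import math

# Under an exponential model, P(silence >= t) = exp(-t / mean), so
# phi = -log10(P) = (t / mean) / ln(10).
PHI_FACTOR = 1.0 / math.log(10.0)


def phi(seconds_since_last_heartbeat, mean_interval):
    """Suspicion level for a peer that has been silent for the given time."""
    return PHI_FACTOR * seconds_since_last_heartbeat / mean_interval


def silence_before_conviction(threshold, mean_interval):
    """Seconds of silence needed before phi reaches the conviction threshold."""
    return threshold * mean_interval * math.log(10.0)
```

Under this model, with an assumed mean heartbeat interval of 1 second, a threshold of 8 marks a node down after about 18.4 seconds of silence, while a threshold of 12 waits about 27.6 seconds. The higher value trades slower detection of real failures for fewer false convictions during transient network hiccups.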
  8. Cassandra Topology in AWS
    • Where we started...
      ◦ us-east-1a: node 1
      ◦ us-east-1b: node 2
      ◦ us-east-1c: node 3
    • Where we are...
      ◦ us-east-1a: nodes 1, 4
      ◦ us-east-1b: nodes 2, 5
      ◦ us-east-1c: nodes 3, 6
    • Keep it balanced!
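The "Keep it balanced!" advice can be sketched as a placement rule: each new node goes to the Availability Zone that currently has the fewest nodes. The helper names `pick_zone` and `plan_upsize` are hypothetical and are not taken from Stackdriver's tooling; ties are broken alphabetically only to make the result deterministic.

```python
from collections import Counter


def pick_zone(existing_zones, zones):
    """Return the zone with the fewest nodes (ties broken by zone name)."""
    counts = Counter(existing_zones)
    return min(zones, key=lambda z: (counts[z], z))


def plan_upsize(existing_zones, zones, new_nodes):
    """Return the zones for `new_nodes` additional nodes, keeping zones balanced."""
    placed = list(existing_zones)
    plan = []
    for _ in range(new_nodes):
        zone = pick_zone(placed, zones)
        placed.append(zone)
        plan.append(zone)
    return plan
```

Starting from one node in each of `us-east-1a`, `us-east-1b`, and `us-east-1c`, calling `plan_upsize` for three more nodes assigns one to each zone, which matches growing from three nodes to six while keeping every zone at the same size.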
  9. Cassandra EC2 Node Configuration
    • m1.xlarge (4 cores, 15 GB RAM)
      ◦ 4 ephemeral disks available
    • 1 disk used for CommitLog
      ◦ ext4 - defaults,noatime
      ◦ Sequential Writes
    • 3 disks RAID-0 for Data Volume
      ◦ ext4 - defaults,noatime
      ◦ mdadm RAID-0
      ◦ Compactions
      ◦ Heavy Read/Write IO
  10. Cassandra Automation and Operations
    • Combination of Boto, Fabric, & Puppet
      ◦ Boto for AWS API
      ◦ Fabric + Puppet for Bootstrapping
      ◦ Fabric for Operations
    • One command to:
      ◦ Launch a new cluster
      ◦ Upsize a cluster
      ◦ Replace a dead node
      ◦ Remove existing nodes
      ◦ List nodes in a cluster
  11. Cassandra Backups using S3
    • No Cassandra Powered Backups
    • Restore from S3
    • Useful for major version upgrades
    [Diagram labels: Data, S3, Bulk Loader, Elastic Map Reduce, Cassandra]
    1. Data is archived when it is received
    2. Bulk loader reads from S3
    3. EMR re-analyzes data
    4. Cassandra is repopulated