the NASA Advanced Supercomputing facility (NAS) at NASA Ames in silicon valley • consulted at various federal agencies during the tech boom of the '90s • classified and unclassified • http://about.me/christopherkeller
writes around the clock from incoming data, but far fewer analytical reads • we retain the data in a raw format, but it does not need to be in a database (however we can easily load old data) • we need flexibility as technology and our requirements evolve over time
our available data to make supercomputing more secure for our customers • replace a COTS security event management system • poor query performance • difficult to extend and integrate with our custom software • pre-defined analytics were a big plus, but overall there were more minuses than pluses for our environment
no single point of failure • reads are fast, writes are faster • i did research other solutions (couchbase, hbase, mongo, riak, etc), but didn’t find anything compelling enough to trial
cluster on a spare server • wrote the cassandra equivalent of “hello world” to check • replication / availability • data expiration • rough performance estimates
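roughly what that "hello world" check looked like with pycassa (keyspace, column family, and host names here are made up for illustration, not our real schema):

    # minimal pycassa "hello world": make a keyspace, write a column with a ttl,
    # read it back -- enough to confirm replication settings and data expiration
    from pycassa.system_manager import SystemManager, SIMPLE_STRATEGY, UTF8_TYPE
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    sysm = SystemManager('cass-node1:9160')
    sysm.create_keyspace('demo', SIMPLE_STRATEGY, {'replication_factor': '3'})
    sysm.create_column_family('demo', 'events', comparator_type=UTF8_TYPE,
                              default_validation_class=UTF8_TYPE)
    sysm.close()

    pool = ConnectionPool('demo', ['cass-node1:9160', 'cass-node2:9160'])
    events = ColumnFamily(pool, 'events')

    # the ttl is the data-expiration check: the column disappears after 30 days
    events.insert('hello', {'world': 'it works'}, ttl=30 * 86400)
    print(events.get('hello'))        # {'world': 'it works'}
    pool.dispose()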
easier than i thought • the NAS is very receptive to new technology even though we prefer to be system integrators rather than developers • my testing showed that cassandra works...shocking!!! • open source resources are good, DataStax being able to provide support after i leave is better
22.5k • 3 of them for our production cluster • 1 for our data parsing and loading • 1 for our analytics • those were our only purchases, the rest has been primarily my labor hours
1.1.3 • three virtual nodes per physical server • 7 CPUs, 15 GB RAM, 1.2 TB disk • eight disks per physical server • 2 running the hypervisor + OS in a RAID 1 • 2 disks per virtual machine in a RAID 0
days over the course of christmas/new year's 2011 • very helpful to understand our hardware limits and how cassandra scaled • understanding how to model the data and effectively use cassandra took a lot longer...i'm still learning
to ask the data • if you know these, your job just got exponentially easier • if you never deviate from this, you're lucky • once you realize how powerful cassandra is, you'll figure out new questions that may change things • don't use supercolumns
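re: "don't use supercolumns" -- plain composite columns cover the same ground; a hypothetical sketch of a column family built around one question ("what happened on host X between t1 and t2?"), not our actual schema:

    # a regular column family instead of a supercolumn family: composite column
    # names of (timestamp, field), so a single time-range slice answers the question
    from pycassa.system_manager import SystemManager, UTF8_TYPE
    from pycassa.types import CompositeType, LongType, UTF8Type

    sysm = SystemManager('cass-node1:9160')
    sysm.create_column_family(
        'demo', 'events_by_host',
        comparator_type=CompositeType(LongType(), UTF8Type()),  # (timestamp, field)
        default_validation_class=UTF8_TYPE,
        key_validation_class=UTF8_TYPE)                         # row key: host + day
    sysm.close()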
tools from scratch • cluster start up and shutdown scripts • use good CM software (we use puppet) • OS, Cassandra & JVM upgrades • cassandra-env.sh & cassandra.yaml
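re: the cluster start up / shutdown scripts above -- the per-node shutdown half is roughly this (a sketch; the nodetool steps are the generic ones and the init-script path is whatever your distro uses, not necessarily exactly what we run):

    #!/usr/bin/env python
    # per-node shutdown: stop taking client traffic, leave the ring cleanly,
    # flush everything to disk, then stop the service
    import subprocess

    def run(cmd):
        print('+ ' + ' '.join(cmd))
        subprocess.check_call(cmd)

    def stop_node():
        run(['nodetool', 'disablethrift'])   # no more client requests
        run(['nodetool', 'disablegossip'])   # announce we're leaving the ring
        run(['nodetool', 'drain'])           # flush memtables, empty the commitlog
        run(['/etc/init.d/cassandra', 'stop'])

    if __name__ == '__main__':
        stop_node()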
to organize the data • secondary indexes • materialized views • i'd get failures and errors in cassandra that were solved by changing the schema to be more efficient (based on our questions) • try not to think relationally; it wasn't helping me
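a "materialized view" here just means writing each event into more than one column family, each keyed for a different question; a sketch with hypothetical names (assumes both column families use the (timestamp, field) comparator from the earlier sketch):

    # fan the same event out to two column families in one batch
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily
    from pycassa.batch import Mutator

    pool = ConnectionPool('demo', ['cass-node1:9160'])
    events_by_host = ColumnFamily(pool, 'events_by_host')
    events_by_ip = ColumnFamily(pool, 'events_by_ip')

    def store_event(host_day, src_ip, ts, fields):
        cols = dict(((ts, name), value) for name, value in fields.items())
        b = Mutator(pool)
        b.insert(events_by_host, host_day, cols)   # view 1: by host, time ordered
        b.insert(events_by_ip, src_ip, cols)       # view 2: by source ip
        b.send()

    store_event('host42:2012-08-20', '10.1.2.3', 1345500000,
                {'action': 'login_failure', 'user': 'someuser'})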
a single table • aren’t you glad you optimized the schemas for the questions ahead of time? • manual joins by reading successive column families resulted in timeout errors even though the cluster was idle and everything was on the same switch segment
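the payoff: each question becomes a single slice on a single row in a single column family, no client-side joins (again, hypothetical names):

    # "what happened on host42 on the 20th between these two timestamps?"
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('demo', ['cass-node1:9160'])
    events_by_host = ColumnFamily(pool, 'events_by_host')

    window = events_by_host.get('host42:2012-08-20',
                                column_start=(1345490000,),
                                column_finish=(1345510000,),
                                column_count=1000)
    for (ts, field), value in window.items():
        print('%d %s=%s' % (ts, field, value))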
annoying, but can be solved with discipline • give yourself a lot of experimentation time if you’re new to cassandra • if you are hitting problems...likely you’re doing it wrong
least every gc_grace_seconds • script which staggers weekly repairs across each node • once a node has its token assigned, you can remove initial_token from cassandra.yaml and keep the same file across nodes • you are free to use the Thrift bindings for the language of your choice, but save yourself time and use a high-level client (e.g. Java, Python, Scala, PHP, Erlang, etc)
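re: the staggered weekly repairs -- a stripped-down sketch of the idea (host names and the scheduling policy here are illustrative, not our exact setup):

    #!/usr/bin/env python
    # every node runs this daily from cron, but only repairs its own primary
    # ranges on its assigned weekday, so the whole ring gets repaired once a
    # week -- well inside gc_grace_seconds
    import socket
    import subprocess
    import time

    NODES = ['cass-node1', 'cass-node2', 'cass-node3']   # one weekday slot each

    if __name__ == '__main__':
        me = socket.gethostname().split('.')[0]
        if time.localtime().tm_wday == NODES.index(me) % 7:
            subprocess.check_call(['nodetool', 'repair', '-pr'])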
hours writing code to load data into cassandra, then another few hours writing code to retrieve it • the data browsers aren't great and are unhelpful with blobs • then i'd profile the performance, tweak the code, tweak the schema, reload the data and repeat until i was happy
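the load-then-profile loop in miniature (batch sizes, counts, and names are illustrative):

    # batch the inserts, time them, tweak the code or the schema, reload, repeat
    import time
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('demo', ['cass-node1:9160'])
    events = ColumnFamily(pool, 'events_by_host')

    n = 100000
    start = time.time()
    b = events.batch(queue_size=200)         # send mutations every 200 columns
    for i in range(n):
        b.insert('host%d:2012-08-20' % (i % 50), {(i, 'seq'): str(i)})
    b.send()                                 # flush whatever is left queued
    elapsed = time.time() - start
    print('%d inserts in %.1fs (%.0f inserts/s)' % (n, elapsed, n / elapsed))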
using sub-processes for parallel performance • pycassa is our cassandra client library • our web layer is currently ruby on rails, but we might end up going with django to stay language-consistent
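a sketch of how the sub-process parallelism can fit together with pycassa (names are illustrative; the detail worth calling out is that each worker process builds its own ConnectionPool after the fork rather than sharing one):

    # one worker process per chunk of input; per-process connection pools
    from multiprocessing import Pool as ProcessPool
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    SERVERS = ['cass-node1:9160', 'cass-node2:9160', 'cass-node3:9160']
    _cf = None

    def init_worker():
        global _cf
        _cf = ColumnFamily(ConnectionPool('demo', SERVERS), 'events_by_host')

    def load_chunk(rows):
        # rows: list of (row_key, timestamp, field, value) tuples
        for key, ts, field, value in rows:
            _cf.insert(key, {(ts, field): value})
        return len(rows)

    if __name__ == '__main__':
        chunks = [[('host1:2012-08-20', 1345500000, 'action', 'login_failure')],
                  [('host2:2012-08-20', 1345500060, 'action', 'login_success')]]
        workers = ProcessPool(processes=4, initializer=init_worker)
        print('loaded %d events' % sum(workers.map(load_chunk, chunks)))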
crash but it doesn't seem to affect cassandra stability • other cassandra sites haven't seen this, so it may just be a consequence of java 6 on gentoo • commitlog_total_space_in_mb was being ignored (fixed in 1.1.3)
drop column families • pick your poison: full disks or data that never goes away • recent java 6 JVM patches required raising the per-thread stack size to 180k • nodes were up individually, zero log errors, gossip is up, but the nodes weren't talking collectively