
Cassandra at NASA


Near real time event monitoring project with Cassandra as the data store.


Christopher Keller

December 13, 2012


Transcript

  1. WHO AM I? • a CSC solutions architect working at the advanced supercomputing facility (NAS) at NASA Ames in silicon valley • consulted at various federal agencies during the tech boom of the 90’s • classified and unclassified • http://about.me/christopherkeller
  2. ENVIRONMENT • unix based enterprise (desktops, servers, supercomputers) • heavy writes around the clock from incoming data, but far fewer analytical reads • we retain the data in a raw format, but it does not need to be in a database (however we can easily load old data) • we need flexibility as technology and our requirements evolve over time
  3. THE PROBLEM • TL;DR - how to use all of our available data to make supercomputing more secure for our customers • replace a COTS security event management system • poor query performance • difficult to extend and integrate with our custom software • pre-defined analytics were a big plus, but more overall minuses for our environment
  4. WHY CASSANDRA • snapshotting for backups was lightning fast • no single point of failure • reads are fast, writes are faster • i did research other solutions (couchbase, hbase, mongo, riak, etc), but didn’t find anything compelling enough to trial
  5. WHY CASSANDRA • simple clustering = win • availability + scalability + replication • built in data expiration was key • enabling technology that allowed us to ask new questions
  6. IN THE BEGINNING... • set up a virtualized three node cluster on a spare server • wrote the cassandra equivalent of “hello world” to check • replication / availability • data expiration • rough performance estimates
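A “hello world” check of data expiration could look roughly like the sketch below. It assumes a client object that behaves like a pycassa ColumnFamily (pycassa is the client named later in this deck), whose insert call accepts a ttl in seconds. The row-key layout, the "raw" column name and the 90 day retention are hypothetical and chosen only for illustration.

```python
import json

SECONDS_PER_DAY = 24 * 60 * 60


def store_event(column_family, event, retention_days=90):
    """Write one event so that its columns expire after retention_days.

    column_family is assumed to expose insert(key, columns, ttl=...),
    as a pycassa ColumnFamily does. The "host:timestamp" row key and the
    "raw" column name are hypothetical, not taken from the slides.
    """
    row_key = "%s:%d" % (event["host"], event["timestamp"])
    # Keep the whole event as one JSON blob in a single column.
    columns = {"raw": json.dumps(event, sort_keys=True)}
    ttl_seconds = retention_days * SECONDS_PER_DAY
    # Cassandra drops the columns on its own once the TTL has elapsed.
    column_family.insert(row_key, columns, ttl=ttl_seconds)
    return row_key, ttl_seconds
```

Writing a row with a short TTL and reading it back before and after the TTL elapses is one simple way to confirm that expiration behaves as expected on a test cluster.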
  7. ARE YOU KIDDING ME? • selling cassandra to management was easier than i thought • the NAS is very receptive to new technology even though we prefer to be system integrators rather than developers • my testing showed that cassandra works...shocking!!! • open source resources are good, DataStax being able to provide support after i leave is better
  8. TAX DOLLARS AT WORK • bought five servers for around 22.5k • 3 of them for our production cluster • 1 for our data parsing and loading • 1 for our analytics • those were our only purchases, the rest has been primarily my labor hours
  9. [Figures: write operations (operations vs. elapsed time) and latency (milliseconds vs. elapsed time), each comparing 6 nodes, 9 nodes (v) and 9 nodes (p)] • cassandra 1.0.6 • http://christophernkeller.tumblr.com/post/15242366864/cassandra-benchmarks
  10. TAKEAWAY • bare metal > virtualized w/ assigned disks > fully virtualized • match your hardware to your environment, expertise, and requirements
  11. CURRENT CLUSTER • gentoo running xen 4.1.2 & apache cassandra 1.1.3 • three virtual nodes per physical server • 7 cpu’s, 15gig RAM, 1.2 TB disk • eight disks per physical server • 2 running the hypervisor + OS in a RAID 1 • 2 disks per virtual machine in a RAID 0
  12. ELAPSED TIME • empty rack to benchmarks took about five days over the course of christmas/new years 2011 • very helpful to understand our hardware limits and how cassandra scaled • understanding how to model the data and effectively use cassandra took a lot longer...i’m still learning
  13. HELPFUL TIPS • always start with the questions you plan to ask the data • if you know these your job just got exponentially easier • if you never deviate from this, you’re lucky • once you realize how powerful cassandra is, you’ll figure out new questions that may change things • don’t use supercolumns
  14. MAINTENANCE • i haven’t done serious sys-admin work in years...had to develop tools from scratch • cluster start up and shutdown scripts • use good CM software (we use puppet) • OS, Cassandra & JVM upgrades • cassandra-env.sh & cassandra.yaml
  15. TRIAL AND ERROR • a lot of testing dealing with how to organize the data • secondary indexes • materialized views • i’d get failures and errors in cassandra that were solved by changing the schema to be more efficient (based on our questions) • try not to think relationally, it wasn’t helping me
  16. THIS WORKED...POORLY

      uid  job        hobby
      1    architect  jiu-jitsu
      2    toddler    gaming

      uid  name    age  gender
      1    chris   39   male
      2    jaeden  2    male

      uid  employer  phone       address
      1    csc       5555555555  123 Main St
      2    mom       4444444444  123 Main St
  17. THIS WORKED WELL

      columns 1234, 1235
        architect  json blob
        toddler    json blob

      columns 4567, 7364
        male       json blob  json blob

      columns 3453, 4554
        chris      json blob
        jaeden     json blob

      json blobs: {"age":"39", "name":"chris","gender":"male"...} {"age":"2", "name":"jaeden","gender":"male"...}
  18. WHY DID THAT WORK • we only have to query a single table • aren’t you glad you optimized the schemas for the questions ahead of time? • manual joins by reading successive column families resulted in timeout errors even though the cluster was idle and everything was on the same switch segment
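The query-first, denormalized layout can be sketched in plain Python. This is only one reading of the slides: it assumes each row is keyed by a value we want to ask about (a job, a gender, a name), each column name is a record id, and each column value is the full record as a JSON blob. The "field:value" key format and the function names are hypothetical.

```python
import json
from collections import defaultdict


def build_views(records, query_fields):
    """Denormalize records into one row per (field, value) we plan to query.

    records maps a record id to a dict. The result maps a row key such as
    "job:architect" to {record_id: json_blob}, so answering a question is a
    single row read instead of a manual join.
    """
    views = defaultdict(dict)
    for record_id, record in records.items():
        blob = json.dumps(record, sort_keys=True)
        for field in query_fields:
            row_key = "%s:%s" % (field, record[field])
            # The same blob is stored once per question; that is the
            # duplication cost of de-normalizing.
            views[row_key][record_id] = blob
    return dict(views)


def lookup(views, field, value):
    """Answer one question with one row read and no joins."""
    row = views.get("%s:%s" % (field, value), {})
    return [json.loads(blob) for blob in row.values()]


people = {
    1: {"name": "chris", "age": "39", "gender": "male", "job": "architect"},
    2: {"name": "jaeden", "age": "2", "gender": "male", "job": "toddler"},
}
views = build_views(people, ["job", "gender", "name"])
```

Here `lookup(views, "gender", "male")` returns both records from a single row, while `lookup(views, "job", "architect")` returns only the first. The trade-off is that every change to a record has to be rewritten in each row that carries its blob.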
  19. LESSONS LEARNED • if your data changes frequently, de-normalization is annoying, but can be solved with discipline • give yourself a lot of experimentation time if you’re new to cassandra • if you are hitting problems...likely you’re doing it wrong
  20. TECHNICAL TIPS • use ‘-pr’ to repair each node at least every gc_grace_seconds • script which staggers weekly repairs across each node • once you assign a token ID, you can remove it from cassandra.yaml and keep the same file across nodes • you are free to use the Thrift bindings for the language of your choice, but save yourself time and use a high level client (e.g. for Java, Python, Scala, PHP, Erlang, etc)
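A script that staggers weekly repairs might start from something like the sketch below. It is not the author's script: the host names and the weekday spreading rule are hypothetical, and it only builds the nodetool command lines rather than running them. It assumes a weekly cycle has to fit inside gc_grace_seconds, whose default is 864000 seconds (10 days).

```python
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def staggered_repair_schedule(hosts, gc_grace_seconds=864000):
    """Spread one weekly `nodetool repair -pr` per host across the week.

    Returns a list of (weekday, host, command) tuples. Raises ValueError if
    a weekly cycle would not repair every node within gc_grace_seconds.
    """
    if gc_grace_seconds < SECONDS_PER_WEEK:
        raise ValueError("weekly repairs are too infrequent for this gc_grace_seconds")
    schedule = []
    for index, host in enumerate(hosts):
        # Spread hosts as evenly as possible over the seven weekdays.
        day = DAYS[(index * 7) // len(hosts)]
        # -pr repairs only the node's primary range, so each range is
        # repaired once per cycle instead of once per replica.
        command = ["nodetool", "-h", host, "repair", "-pr"]
        schedule.append((day, host, command))
    return schedule


for day, host, command in staggered_repair_schedule(["cass1", "cass2", "cass3"]):
    print(day, host, " ".join(command))
```

With more than seven nodes some weekdays carry two repairs, so a real schedule would probably also offset the start times within a day.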
  21. HOW I SPENT MY TIME • i’d spend a few hours writing code to load data into cassandra, then another few hours writing code to retrieve it • the data browsers aren’t great and unhelpful with blobs • then i’d profile the performance, tweak the code, tweak the schema, reload the data and repeat until i was happy
  22. ANALYTICS • all server side analytics are developed in python using sub-processes for parallel performance • pycassa is our cassandra client library • our web layer is currently ruby on rails, but we might end up going with django to stay language consistent
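The slides do not say how the sub-processes are organized, so the following is only a guess at the general shape: split the events into chunks, count each chunk in a worker process, then merge the partial results. The use of multiprocessing.Pool, the per-host count and all function names are assumptions made for illustration.

```python
from collections import Counter
from multiprocessing import Pool


def count_events_by_host(events):
    """Per-chunk work: count events per host (hypothetical analytic)."""
    return Counter(event["host"] for event in events)


def chunked(items, size):
    """Split a list into consecutive chunks of at most size items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge_counts(partials):
    """Combine the per-chunk counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_count(events, workers=4, chunk_size=1000):
    """Fan the chunks out to worker processes and merge their results."""
    with Pool(processes=workers) as pool:
        partials = pool.map(count_events_by_host, chunked(events, chunk_size))
    return merge_counts(partials)


if __name__ == "__main__":
    sample = [{"host": "a"}, {"host": "b"}, {"host": "a"}, {"host": "c"}]
    print(parallel_count(sample, workers=2, chunk_size=2))
```

Separate processes rather than threads let CPU-bound Python code use more than one core, which is presumably the “parallel performance” the slide refers to.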
  23. SHOW STOPPERS • dealing with an incredibly annoying recurring JMX crash, but it doesn’t seem to affect cassandra stability • other cassandra sites haven’t seen this, so it may just be a consequence of java6 on gentoo • commitlog_total_space_in_mb was being ignored (FIXED in 1.1.3)
  24. RECENT SHOW STOPPERS • 1.1.3 accidentally removed the ability to drop column families • pick your poison - full disks or data that never goes away • recent v6 JVM patches required setting the per-thread stack size (-Xss) to 180k • nodes were up individually, zero log errors, gossip is up, but the nodes weren’t talking collectively
  25. SHOUT OUT • the folks at datastax have been very helpful • Tyler Hobbs (cassandra developer) • Darren Sack (accounts) • Michael Shaler (biz dev) • everyone in #cassandra on irc.freenode.org