Cassandra at Gowalla

Adam Keys
November 29, 2011

How we use the Cassandra distributed database at Gowalla, what worked well, and what we'll do differently in the future.

Transcript

  1. Cassandra at Gowalla, A Retrospective

    Adam Keys. Austin on Rails, November 2011. @therealadam, http://therealadam.com, http://github.com/therealadam. Tuesday, November 29, 11
  2. Why Cassandra?

    Applications that have stable access patterns. High-velocity data growth. Time-oriented data access. Dynamo-style operation.
  3. Why not Cassandra?

    Prototypes, getting things off the ground. Applications that change their query patterns often. Applications that don’t grow data quickly.
  4. Audit (https://github.com/therealadam/audit)

    Store ActiveRecord (AR) change data in Cassandra. Our training-wheels trial project. Incrementally deployed using rollout and degrade. Worked well, so we proceeded.
  5. Chronologic (https://github.com/gowalla/chronologic/)

    Activity feeds stored in Cassandra. Started off as a secondary index cache, but became a system of record. Works pretty well, but the query/access model didn’t always jibe with how web developers expected to access data.
  6. Active stories

    Store “joinability” data for users at a spot so we can pre-merge stories. Built and integrated in one pull request a few weeks before launch. Has worked pretty well.
  7. Social graph caches

    Store friends from other systems so we can quickly list/suggest friends. This started life on Redis, but the data was growing too quickly. We decoupled it from Redis and wrote a Cassandra backend. We incrementally deployed it and got Redis out of the picture within two weeks. That was cool.
  8. Stable on launch

    A couple of weeks before launch, I switched to “devlops” mode. Adam McManus, our ops guy, and I focused on tuning Cassandra for better read performance and on resolving stability problems. We ended up bringing in a DataStax consultant to help us verify we were doing the right things with Cassandra. As a result, our cluster held up well at launch and we didn’t have any Cassandra-related problems.
  9. Easy to tune

    I found Cassandra interesting and easy to tune. It takes a little upfront research to figure out exactly what the knobs mean and what the reporting tools are saying. Once I figured that out, it was easy to iteratively tweak things and see whether they had a positive effect on the performance of our cluster.
  10. Time-series or semi-granular data

    Of the databases I’ve tinkered with, Cassandra stands out in terms of modeling time-related data. If an application is going to pull data in time-order most of the time, Cassandra is a really great place to start. I also like the column-oriented data model. It’s great if you mostly need a key-value store, but occasionally need a key-key-value store.
  11. Developer localhost setups

    We started using Cassandra in the 0.6 release, when it was a giant pain to set up locally (XML configs). It’s better now, but I should have put more energy into helping the other developers on our team get Cassandra up and working properly. If I were to do it again, I’d probably look into leaning on the install scripts the cassandra gem includes, rather than Homebrew and a myriad of scripts to hack the Cassandra config.
  12. Eventual consistency, magic database voodoo

    Cassandra does not work like MySQL or Redis. It has different design constraints and a relatively unique approach to those constraints. In advocating and explaining Cassandra, I think I pitched too much as a database nerd and not enough as “here’s a great tool that can help us solve some problems”. I hope that CQL makes it easier to put Cassandra in front of non-database nerds in terms they can easily relate to and be productive with right away.
  13. Rigid query model

    Once we got several million rows of data into Cassandra, we found it difficult to quickly change how we represented that data. It became a game of “how can we incrementally rejigger this data structure to have these other properties we just figured out we want?” I’m not sure that’s a game you can easily win at with Cassandra. I’d love to read more about building evolvable data structures in Cassandra and see how people are dealing with high-volume, evolving data.
  14. More like a hash, less like a database

    Having developed a database-like thing, I have come to the conclusion that developers really don’t like databases very much. AR was hugely successful because it was so much more effective than anything before it that tried to make databases just go away. The closer a database is to one of the native data structures in the host language, the better. If it’s not a native data structure, it should be something they can create in a REPL and then say “magically save this for me!”
  15. Better tools and automation

    That said, every abstraction leaks. Once it does, developers want simple and useful tools that let them figure out what’s going on, see what the data really looks like, tinker with it, and get back to their abstracted world as quickly as possible. This starts with tools for setting up the database, continues through interacting with it (a database REPL), and extends to operating it (logging, introspection, etc.). Cassandra does pretty well with these tools, but they’re still a bit nerdy.
  16. Moar indexes

    We didn’t design our applications to use secondary indexes (a great feature) because they didn’t exist yet. I should have spent more time integrating this into the design of our services. We got bitten a lot towards the end of our release cycle because we were building all of our indexes in the application and hadn’t designed for reverse indexes. We also designed a rather coarse schema, which further complicated ad-hoc querying, which is another thing non-database-nerds love.
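
Slide 4 mentions deploying incrementally with rollout and degrade. The talk does not show how those were wired up, so the following is only a minimal Python sketch of the general idea: a percentage gate decides which users write to the new store, and a failure gate switches the feature off when the store misbehaves. The class names, method names, and the `record_audit` helper are all hypothetical and are not the API of the rollout or degrade gems.

```python
import zlib


class Rollout:
    """Hypothetical percentage-based feature gate (not the rollout gem's API)."""

    def __init__(self):
        self.percentages = {}  # feature name -> percent of users enabled

    def activate_percentage(self, feature, percent):
        self.percentages[feature] = percent

    def active(self, feature, user_id):
        # Hash the user id so a given user always lands in the same bucket.
        bucket = zlib.crc32(str(user_id).encode("utf-8")) % 100
        return bucket < self.percentages.get(feature, 0)


class Degrade:
    """Hypothetical failure gate: stops calling an action after repeated errors."""

    def __init__(self, max_failures):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def tripped(self):
        return self.failures >= self.max_failures

    def perform(self, action, fallback):
        if self.tripped:
            return fallback()
        try:
            return action()
        except Exception:
            self.failures += 1
            return fallback()


def record_audit(rollout, degrade, store, user_id, change):
    """Write a change record only for enabled users; return True if it was stored."""
    if not rollout.active("audit", user_id):
        return False

    def write():
        store(change)
        return True

    return degrade.perform(write, lambda: False)
```

Raising the percentage step by step widens the set of users whose writes reach the new store, while the failure gate keeps a struggling backend from taking the rest of the request down with it.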
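
Slide 10 describes Cassandra as a key-value store that can occasionally act as a key-key-value store, and as a good fit for data read in time order. As a rough illustration only, the sketch below models that shape in memory: each row key maps to columns kept sorted by a timestamp column key. `WideRowStore` and its methods are hypothetical names for this example and say nothing about Cassandra's actual client API or storage engine.

```python
import bisect


class WideRowStore:
    """In-memory stand-in for a wide row: row key -> columns sorted by column key."""

    def __init__(self):
        self.rows = {}  # row_key -> list of (column_key, value), sorted by column_key

    def insert(self, row_key, column_key, value):
        columns = self.rows.setdefault(row_key, [])
        keys = [k for k, _ in columns]
        i = bisect.bisect_left(keys, column_key)
        if i < len(columns) and columns[i][0] == column_key:
            columns[i] = (column_key, value)  # overwrite an existing column
        else:
            columns.insert(i, (column_key, value))  # keep columns in sorted order

    def get(self, row_key, column_key):
        """Key-key-value lookup: row key, then column key."""
        for key, value in self.rows.get(row_key, []):
            if key == column_key:
                return value
        return None

    def slice(self, row_key, start, end):
        """Return columns with start <= column_key <= end, oldest first."""
        columns = self.rows.get(row_key, [])
        keys = [k for k, _ in columns]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        return columns[lo:hi]

    def latest(self, row_key, count):
        """Return up to `count` most recent columns, newest first."""
        if count <= 0:
            return []
        columns = self.rows.get(row_key, [])
        return list(reversed(columns[-count:]))
```

Because the columns inside a row are already ordered by time, "the newest N events for this user" becomes a read of one row's tail rather than a sort at query time.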
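
Slide 16 mentions building all indexes in the application. The talk gives no implementation details, so this is a small Python sketch, under assumed names, of what maintaining such an index by hand involves: every write has to update both the primary row and the value-to-keys index, and a changed value has to be removed from its old index entry. `IndexedStore`, `put`, and `find_by` are hypothetical and use plain dictionaries in place of a real database.

```python
from collections import defaultdict


class IndexedStore:
    """Rows keyed by id, plus an application-maintained index on one field."""

    def __init__(self, indexed_field):
        self.indexed_field = indexed_field
        self.rows = {}                 # row_key -> row dict
        self.index = defaultdict(set)  # field value -> set of row keys

    def put(self, row_key, row):
        # Remove the stale index entry first if the row already exists.
        old = self.rows.get(row_key)
        if old is not None and self.indexed_field in old:
            self.index[old[self.indexed_field]].discard(row_key)

        self.rows[row_key] = dict(row)

        # Then record the new value, so lookups by value stay correct.
        if self.indexed_field in row:
            self.index[row[self.indexed_field]].add(row_key)

    def find_by(self, value):
        """Return the row keys whose indexed field equals `value`, sorted."""
        return sorted(self.index.get(value, set()))
```

Each additional query shape needs another structure like `index`, written and kept consistent by the application, which is the kind of bookkeeping a built-in secondary index takes over.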