MongoDBWorld 2014: Service oriented clusters

Foursquare is an app you can use to discover the best places to go for things like food and drinks. You can also use Foursquare to check in and see where friends are. We just released a new app called Swarm that’s totally focused on check-ins and making plans with your friends. We’re based here in NYC. We have around 50 million users and 160 employees, and most of our data is stored in mongo.

I’m going to talk about service oriented clusters. When an application reaches a certain size in terms of complexity, it’s usually a good idea to split it up into multiple services that talk over the network. I think it can also be prudent to split up and isolate large mongo clusters to align with those services. But what’s the best way to split up an existing cluster, and what are some things you might have to think about? Also, what else can you do to truly isolate a data source that’s shared by multiple services?

Why do we love mongo? It has so many great features. Four and a half years ago, when Foursquare first started playing around with mongo, the ease of getting started and the built-in geo indexing were killer features for us. We also knew that we were going to need to shard some collections and were a bit daunted at the prospect of writing the tooling to shard and rebalance ourselves. So mongo was and still is a big win for us.

Mongo does it all, so let’s use it for all our online data, full text search, blobs, and analytics. Let’s put all of our data into mongo and store it for all time, and we’ll add more replicas to our replica sets and more shards to our clusters.

That’s definitely a reasonable strategy. It’s basically what we did at Foursquare for a couple of years. In those couple of years we signed up millions of users and collected billions of check-ins in our mongo clusters. (Check-ins were actually always stored in their own dedicated cluster, because by the time we moved them into mongo we had over 100 million.) So, doing everything in one cluster makes sense, and if you manage it effectively it will probably get you really far.

But after some time, maybe many years after you started down this road, you’ll realize you have a mongolith on your hands.

And you’ll have mongolithic problems. If you’re doing analytics work you could have online and offline work competing for resources, although you might be able to mitigate that with tag sets. What might be harder to separate is legitimate online queries over many collections that have very different performance characteristics.

Within any system, the more moving parts there are, the harder it is to understand what’s going on. Everything is highly correlated with everything else. And in the case of mongo, one query affects the performance of another. I think the addition of a new query pattern to mongo actually has a nonlinear detrimental effect on a human’s ability to understand what the hell is going on. Low cost, high frequency queries will execute on data that’s fully paged into memory, but if you’re also running high cost queries that execute less frequently, those can cause page faults, and that could cause the next batch of high rate queries to fault because their data was flushed out of memory. Or even worse, if the low frequency operation is a write, it can cause those high frequency queries to queue up. The more collections and the more query patterns you have on a single cluster, the harder it is to figure out exactly what’s going on when there is a problem. If you look in the profiling logs, sometimes it’s unclear whether things are there because they are actually slow or because general competition for resources made them slow. It can be

You can use a command line tool that ships with mongo called mongotop to see which collections are accounting for the majority of your query execution time. It polls mongo every second and tells you how much time out of every second was spent in which mongo collections. We had a relatively small collection spread out on a 25 shard cluster that was queried at a pretty high rate with all-shards queries. We noticed that it was taking a very significant portion of the overall query time, so we moved it from that 25 shard cluster to a 3 node cluster and saw a big reduction in load on the larger cluster.

To sum up, by splitting out collections into multiple clusters you’re isolating the effects they can have on each other. We’ve found that it’s possible to get more predictable query performance on more isolated clusters. We’ve also been able to cut down on our overall resource usage in some cases by splitting into smaller clusters.

Of course there are some downsides. I think the major one is just that there are more configs to manage. You’ll need more tooling to be able to scale that out. There’ll be more maintenance work, and a lot more pressure to automate the manual work you’re doing if you haven’t already. But that could be offset by less time spent investigating performance issues, thanks to the isolation. Sometimes the decision around where to put things can be difficult, so you’ll probably want to think about placement rules ahead of time. It’s also possible that if you run multiple clusters before you’re really ready, or for the wrong reasons, you’ll just create more problems for yourself. What’s worked for us may not work for you.

I’ve presented a case for splitting your collections across multiple clusters, but the reasons I’ve talked about so far have nothing to do with services. So let’s talk about monolithic applications. They’re analogous to the mongolith. But splitting up giant complex applications is something everybody’s been on the same page about for a long time. The fix is this idea called service oriented architecture, where you split your monolithic application into multiple networked services, and maybe all your new features are developed as little microservices. We started splitting things up at Foursquare a couple of years ago. At first it was along coarse boundaries like web site, API, and offline queue workers, and now most new features end up on their own services.

Less code to build. Less can go wrong in a deploy. Deploy more often. Less going on in a given binary. Easier to profile and optimize. It’s easier to pin down the cause of a resource leak because there’s less code running and less stuff going on at the same time. Services that don’t depend on each other fail independently.

So you build out these services and they’re independent, but are they really loosely coupled? Well, not if they depend on the same mongo cluster.

If all your services depend on the same mongo cluster, then an operations error on the cluster affects all the services. You also have a lot going on at the same time, which can make it difficult to figure out the source of problems. And you’ll have resource contention across your services. So the biggest problem is that the presumed benefit of having independent services is lost.

Ok, so you’ve decided that it’s a good idea to break out some collections that are only used by certain services. How do you actually do that?

The high level procedure is pretty simple. You take advantage of the mongo oplog: you note the timestamp of the latest op, you copy the collection data to a new cluster, and then you start to replay the oplog from the noted timestamp. Dropbox actually created a suite of tools to do those operations efficiently. They also have a really great blog post where they go into detail about how it works and how they were able to tune performance. We actually used our own code at Foursquare, though, because it predates hydra. Those steps and tools are available and pretty easy to understand. They’ll get you to a point where your collection remains in sync between two clusters, but then you have to cut over the reads and writes. What are the complications involved in doing that, and what are some possible strategies?

Before I get into how to cut over your read and write traffic, I want to go over how the oplog works, since it’s important for understanding some of the more nuanced problems that can arise when you cut over. Every write operation that the primary of a replica set receives is recorded in the oplog, and that’s what the secondaries of the replica set tail to remain synchronized. It’s implemented as a regular capped collection, which means you can use your regular mongo client libraries to interact with it. And it’s designed to be idempotent.

What does idempotent actually mean? I found a pretty clear and concise summary on Wikipedia. ... Basically, any operation that’s dependent on prior state is not idempotent. Mongo will write those to the oplog as $sets. A $set operation doesn’t depend on the current value of the field. It’s stateless.
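
As a minimal illustration of that distinction, here is a sketch in plain Python. The helper names `apply_inc` and `apply_set` are made up for this example and only mimic the effect of `$inc` and `$set` on a dict; they are not mongo APIs.

```python
def apply_inc(doc, field, amount):
    """Stateful update: the result depends on the field's current value."""
    doc[field] = doc.get(field, 0) + amount


def apply_set(doc, field, value):
    """Stateless update: the result is the same however often it is applied."""
    doc[field] = value


inc_doc = {"a": 0}
apply_inc(inc_doc, "a", 1)
apply_inc(inc_doc, "a", 1)  # replaying the increment changes the result

set_doc = {"a": 0}
apply_set(set_doc, "a", 1)
apply_set(set_doc, "a", 1)  # replaying the set changes nothing

print(inc_doc, set_doc)
```

Replaying the increment leaves `a` at 2, while replaying the set leaves it at 1, which is why the set form is the one that is safe to apply more than once.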

Here’s an example of an oplog entry. The important bits here are: “ts”, the timestamp (newer entries will always have a larger timestamp than older ones); “op”, which is the update, delete, or insert; “ns”, which is the collection; “o2”, a query to find the document that changed; and “o”, the operation to apply to the document.
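
The entry shown on the slide is not part of this transcript. As a rough stand-in, an update entry with those fields might look like the Python dict below. Every value here is invented for illustration (the namespace `mydb.users`, the `_id`, the timestamp), and the real “ts” is a BSON timestamp rather than a tuple.

```python
# Illustrative only: a simplified oplog entry for an update.
oplog_entry = {
    "ts": (1403000000, 1),    # stand-in for a BSON timestamp (seconds, ordinal)
    "op": "u",                # "i" = insert, "u" = update, "d" = delete
    "ns": "mydb.users",       # namespace: database name, a dot, collection name
    "o2": {"_id": 42},        # query that finds the document that changed
    "o": {"$set": {"a": 1}},  # operation to apply to that document
}

OP_NAMES = {"i": "insert", "u": "update", "d": "delete"}


def describe(entry):
    """Render an entry as a short human-readable summary."""
    return f"{OP_NAMES[entry['op']]} on {entry['ns']} matching {entry.get('o2')}"


print(describe(oplog_entry))
```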

A single operation pulled from the oplog isn’t idempotent by itself. If newer operations have already been applied, reapplying an old oplog entry could revert a change made by a newer oplog entry. However, the stream of all operations from some point in time going forward is idempotent. So you can go back to any old timestamp in the oplog and replay all the ops onto copies of the same dataset, and you should always end up with the same result.
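
A small sketch of both claims, using an in-memory list as a pretend oplog. The entry shape is simplified to `$set` updates with integer timestamps, and `apply_op` and `replay` are hypothetical helpers, not part of any mongo driver.

```python
import copy


def apply_op(data, entry):
    """Apply one simplified oplog entry (only $set updates) to a dict of documents."""
    doc = data[entry["o2"]["_id"]]
    doc.update(entry["o"]["$set"])


def replay(data, oplog, from_ts):
    """Apply every entry with ts >= from_ts, oldest first."""
    for entry in sorted(oplog, key=lambda e: e["ts"]):
        if entry["ts"] >= from_ts:
            apply_op(data, entry)
    return data


oplog = [
    {"ts": 1, "o2": {"_id": 1}, "o": {"$set": {"a": 1}}},
    {"ts": 2, "o2": {"_id": 1}, "o": {"$set": {"a": 2}}},
    {"ts": 3, "o2": {"_id": 1}, "o": {"$set": {"a": 3}}},
]

current = {1: {"a": 3}}  # state after all three ops have been applied once

# Replaying the whole stream from an old timestamp converges to the same state.
replayed = replay(copy.deepcopy(current), oplog, from_ts=1)

# Reapplying a single old entry on its own reverts a newer change.
reverted = copy.deepcopy(current)
apply_op(reverted, oplog[0])

print(replayed, reverted)
```

The full replay ends at `a = 3` again, while the lone old entry leaves `a = 1`.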

Here’s a refresher on the high level collection copy procedure. First, take note of the latest op timestamp in the oplog. Then copy the data and indexes. Then tail the oplog operations from the noted timestamp going forward in time and apply those ops to the new cluster. You would leave the oplog tailing process running until you cut over all your applications to read and write from the new cluster.
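
The three steps can be walked through with an in-memory simulation. `FakeCluster` and `apply_entries` are invented for this sketch; inserts are modeled as `$set` on a new document, and index copying is left out. Real tooling would read the actual oplog through a driver.

```python
import copy


class FakeCluster:
    """In-memory stand-in for a cluster: documents plus an oplog of $set entries."""

    def __init__(self):
        self.docs = {}
        self.oplog = []
        self.clock = 0

    def set_field(self, _id, field, value):
        self.clock += 1
        self.docs.setdefault(_id, {"_id": _id})[field] = value
        self.oplog.append(
            {"ts": self.clock, "o2": {"_id": _id}, "o": {"$set": {field: value}}}
        )


def apply_entries(target_docs, entries):
    """Apply simplified oplog entries to the copy, creating documents as needed."""
    for e in entries:
        _id = e["o2"]["_id"]
        target_docs.setdefault(_id, {"_id": _id}).update(e["o"]["$set"])


old = FakeCluster()
old.set_field(1, "a", 1)

# Step 1: note the latest op timestamp.
noted_ts = old.oplog[-1]["ts"]

# Step 2: copy the data. Writes keep arriving on the old cluster meanwhile.
new_docs = copy.deepcopy(old.docs)
old.set_field(1, "a", 2)
old.set_field(2, "b", 5)

# Step 3: replay oplog entries from the noted timestamp onward.
apply_entries(new_docs, [e for e in old.oplog if e["ts"] >= noted_ts])

print(new_docs == old.docs)
```

Starting the replay at the noted timestamp means some ops are applied to data that already reflects them, which is harmless because the stream is idempotent.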

Back to how to actually cut over. I think you have a few basic options. You can just change the hosts in your code and redeploy your application, either via a rolling deploy or by turning all the old code off before turning the new code on. Another option is to dynamically switch which database to use while your code is still running. In that case, you also have a few options. You can just flip the switch; you can turn your writes off, flip the switch, and then turn writes back on; or you can do something more complex to ensure an atomic switchover. I’ll go into these options in a little bit more detail.

If you’re cool with some downtime, the simplest thing to do is just turn everything off and start up new code. Otherwise, if you opt to change your code and do a rolling deploy, you should consider a few possible problems. While the roll is taking place, reads and writes from some servers will be hitting your old cluster while other servers will be hitting the new cluster. If you need to read your writes, keep in mind that writes to the primary of the old cluster will be delayed in appearing on the primary of the new cluster due to replication lag. It’s also possible to experience some data loss due to write conflicts on the source and destination.

The reason data loss can occur goes back to the CAP theorem. We’re introducing a partition in the system and optimizing for availability. Here we’re allowing two separate nodes to accept writes, and in doing so we’re sacrificing data consistency. Here’s one example of how things can become inconsistent and you can lose data. You have two servers running different versions of the code. server1 is running the old version and talking to your old cluster. server2 is running the new version and talking to the new cluster. Assume you start out with some document where the field “a” has a value of zero. An increment comes in from server2, which is talking to the new cluster, so the new cluster has “a”’s value as 1. Then another increment operation comes in from server1, talking to the old cluster. Now the old cluster has “a”’s value as 1. Remember that the oplog stores idempotent versions of all operations, so an increment operation gets stored as a set to the ultimate value. The operation that’s recorded in the oplog and propagated from the old mongo to the new mongo is {$set: {a:1}}. The new cluster ends up with “a” at 1 even though two increments were issued, so one of them is lost.
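
The same sequence of events, traced with two dicts standing in for the two clusters. The `increment` helper is hypothetical; it applies the increment locally and returns the `$set` form that would be recorded in the oplog.

```python
old_cluster = {"a": 0}
new_cluster = {"a": 0}


def increment(cluster, field):
    """Apply an increment locally and return its idempotent oplog form."""
    cluster[field] += 1
    return {"$set": {field: cluster[field]}}


# server2 (new code) increments on the new cluster: "a" becomes 1 there.
increment(new_cluster, "a")

# server1 (old code) increments on the old cluster: "a" becomes 1 there as well.
oplog_op = increment(old_cluster, "a")

# The oplog tailer propagates the old cluster's op to the new cluster.
new_cluster.update(oplog_op["$set"])

print(new_cluster)  # two increments were issued, but "a" is 1
```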

First of all, if a collection is only being updated using stateless operations like sets, this isn’t a problem. Also, there’s no such thing as 100% durability. If you’re using mongo, you’re probably cool with some risk of losing data, even if you’re using it in the non-default and most risk averse way. But the fact that you can’t have 100% durability applies to any system. You have to decide for yourself what level of risk for data loss is acceptable. It may be the case that stateful updates to the same document would be rare enough during a deploy window that it wouldn’t be worth the effort to avoid them.

To reduce the amount of time it takes to switch over and reduce your risk of data loss, you can implement dynamic switching in your code. We have a system that we call throttles to dynamically control which code gets executed. It’s implemented on top of zookeeper, and there’s a nice web page where you can turn code paths on or off. We also use that for gradually rolling out new features and A/B testing.
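
The talk doesn’t show how throttles are implemented, so the following is only a guess at the general shape of a dynamic switch. `Throttle` and `SwitchingStore` are hypothetical names, the flag lives in process memory instead of zookeeper, and plain dicts stand in for the two clusters. The oplog tailing that keeps the clusters in sync is omitted.

```python
class Throttle:
    """Hypothetical runtime flag. In the talk the flag lives in zookeeper."""

    def __init__(self):
        self.use_new_cluster = False


class SwitchingStore:
    """Routes each read and write to the old or new cluster based on the flag."""

    def __init__(self, old, new, throttle):
        self.old = old
        self.new = new
        self.throttle = throttle

    def _active(self):
        # Checked on every call, so a flip takes effect without a redeploy.
        return self.new if self.throttle.use_new_cluster else self.old

    def write(self, key, value):
        self._active()[key] = value

    def read(self, key):
        return self._active().get(key)


throttle = Throttle()
old_cluster, new_cluster = {}, {}
store = SwitchingStore(old_cluster, new_cluster, throttle)

store.write("a", 1)              # goes to the old cluster
throttle.use_new_cluster = True  # flip the switch at runtime
store.write("b", 2)              # goes to the new cluster

print(old_cluster, new_cluster)
```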

The dynamic switch still isn’t atomic, though in a typical case the changes will be reflected across all applications within a second. Because there is still a little bit of delay, it’s still possible to have data loss, although the risk is reduced. Based on write rates and your tolerance for loss, that might be acceptable.

If you really want to minimize the risk of data loss, but are willing to accept a small amount of downtime, you can implement a distributed barrier where all the applications acknowledge that they’re ready to switch over and block until every server has acknowledged. The code on this slide is just a theoretical high level example of how that might be implemented on top of zookeeper.
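
The slide’s zookeeper code is not included in this transcript. As a single-process stand-in for the same idea, the sketch below uses Python’s `threading.Barrier`, with threads playing the role of application servers. The function and event names are made up, and a real version would need a barrier that works across machines.

```python
import threading

NUM_SERVERS = 4
barrier = threading.Barrier(NUM_SERVERS)
events = []
events_lock = threading.Lock()


def record(event):
    with events_lock:
        events.append(event)


def cut_over(server_id):
    record(("stopped_writes_to_old", server_id))
    # Acknowledge readiness and block until every server has done the same.
    # The timeout keeps one missing server from blocking the rest forever:
    # if it expires, the barrier breaks and wait() raises BrokenBarrierError.
    barrier.wait(timeout=5)
    record(("started_writes_to_new", server_id))


threads = [threading.Thread(target=cut_over, args=(i,)) for i in range(NUM_SERVERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print([kind for kind, _ in events])
```

No server starts writing to the new cluster until every server has stopped writing to the old one, which is what removes the window for conflicting writes. The cost is that writes pause while servers wait at the barrier.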

To recap the options: you can take some downtime. If all your writes come from a queuing system, that could be pretty seamless. You could just flip the switch or roll out your code, but you have to accept the possibility of some data loss. Or you can create some sort of distributed barrier, which would be a lot of fun to build. It would sacrifice availability, so basically there would be some downtime, but the idea is that it would be very, very short if it works correctly. That’s potentially a big if, and the downside is that you could block your entire system.

Here’s what we’ve actually done. We used different solutions for different collections. In one case all the collection updates were actually coming from an asynchronous queue worker, so we just let the queues build up while we did the dynamic code switch (which only took a few seconds). In another case we deemed it very unlikely that we’d have write conflicts within the few seconds it took for the switch to take effect. We always try to do the simplest thing possible given our risk tolerance and our knowledge of our own application.

Let’s say you have services that depend on a single mongo collection. If some of those services only need read access, you might be able to isolate that read traffic to some other system, if you’re willing to accept some replication delay.

So you could go to something that looks like this. That read-only system could also be mongo, but maybe it would only store some piece of the data, or maybe it would only have some but not all of the indexes. Or maybe it wouldn’t be mongo, but your own custom in-memory storage system.

We’ve actually done something like that at Foursquare. We have a system that uses the basic procedure for copying collections but copies them into our own storage system. We keep the data in sync by tailing the mongo oplogs and dumping the changed documents into Kafka queues that multiple storage systems can read from. The backing store for our system is redis, and we use the most appropriate data structure that redis exposes for a particular query pattern. The limitation of this system is that we don’t have the ability to do arbitrary queries of the data. We only expose very specific APIs. The system is isolated from mongo operational issues, although the data it contains will always be delayed from what’s in mongo. And those delays can increase if there’s some sort of problem.
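
The flow described above can be sketched with standard library stand-ins: a `queue.Queue` in place of the queue system and a dict in place of redis. All names here are hypothetical, and nothing below reflects the actual Foursquare implementation.

```python
import queue

mongo_docs = {1: {"_id": 1, "name": "Cafe A"}}
change_queue = queue.Queue()  # stand-in for the queue that consumers read from
read_store = {}               # stand-in for the redis-backed store


def publish_change(oplog_entry):
    """Tailer side: look up the changed document and enqueue a full copy of it."""
    doc_id = oplog_entry["o2"]["_id"]
    change_queue.put(dict(mongo_docs[doc_id]))


def consume_changes():
    """Consumer side: keep only what the one supported lookup needs."""
    while not change_queue.empty():
        doc = change_queue.get()
        read_store[doc["_id"]] = doc["name"]


def get_name(doc_id):
    """The only query this read-only system exposes."""
    return read_store.get(doc_id)


# A write lands in mongo and shows up in the oplog.
mongo_docs[1]["name"] = "Cafe B"
publish_change({"op": "u", "o2": {"_id": 1}, "o": {"$set": {"name": "Cafe B"}}})

stale = get_name(1)  # the change is queued but not yet consumed
consume_changes()
fresh = get_name(1)

print(stale, fresh)
```

The gap between `stale` and `fresh` is the replication delay the talk mentions: readers see the change only after the consumer has caught up.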

To summarize, there are possible benefits to splitting out your collections across multiple clusters. You might be able to get more predictable performance, make better use of your resources, and better understand what’s going on. But there are trade-offs to be made with the complexity of managing multiple clusters. When you start splitting your application into multiple services and you intend for them to be isolated, you have another reason to split up your data storage layer if that’s a shared dependency. Switching over write traffic is a bit tricky, but like anything else, try to fully understand the potential risks and make decisions based on your own tolerances.

More Decks by Jon Hoffman