
Switching Data Stores: A Postmodern Comedy

sarahmei
October 18, 2013


What happens when you pick the wrong data store? A few years ago, open source web application Diaspora did just that. We started out on MongoDB, and then after eight months in production, switched to MySQL.

With eight months of production data to convert, it was not a switch we made lightly. I'll talk about the mechanics of how it was done, as well as the motivations for the original choice, and how we knew when it was time.


Transcript

  1. Switching Data Stores
    A Postmodern Comedy
    Sarah Mei
    Ministry of Velocity
    @sarahmei
    Hello! I'm Sarah Mei. I’m super excited to be here today to do this talk, and I owe a huge debt
    to John, and Ruth, and the rest of the White October folks for inviting me and continuing to
    invite me, even though I’m really bad at replying to email. But I think this has been a fantastic
    conference, hands down the best-organized event I’ve ever spoken at.
    I want to ask a bit of an odd question - how many .NET devs do we have here? That is pretty
    awesome. I do a lot of conference talks, but mostly at Ruby conferences and JavaScript
    conferences. I hardly ever get to talk to .NET folks. So I’m super happy to be here.
    I suppose by now it’s obvious, but uh...I’m not from around here [conf was in Oxford UK]. I’m
    from San Francisco. But this is a lovely place. I walked around Oxford a little bit yesterday and
    I did see one thing that was really weird. Hopefully one of y’all can explain it to me.


  2. Picture taken by me! CC BY 3.0 license
    This place - in addition to being quite beautiful - had a sign next to it saying it’s a field for
    “football.”
    Now there were people playing a game here before I took this picture. It was very strange,
    though - they weren’t wearing helmets or shoulder pads, I didn’t see any spandex, anywhere,
    and strangest of all, actually, nobody seemed to be receiving any kind of concussive head
    injuries. That don’t sound like no kinda football...


  3. Copyright 2000 Green Bay Packers
    ...I’m used to. So that was confusing.
    However, right across the walking path from that field was this one:


  4. Picture taken by me! CC BY 3.0 license
    That’s a cow down there in the lower right. The picture doesn’t adequately convey how large
    this cow is. It is a massive cow. I know this is business as usual for you folks who live here,
    but I grew up in the farm country of California, so this made me feel right at home. So thanks
    for that. I don’t know about your strange nonviolent sports habits, but this I can get behind.
    So I’m super happy to be here, with you to talk about databases. Let me tell you a little bit
    about who I am.


  5. OBLIGATORY VANITY SLIDE
    @sarahmei
    Twitter logo copyright Twitter Inc., Octocat copyright GitHub Inc., RailsBridge logo copyright RailsBridge Inc.
    This is me, on twitter and github. There’s 95% of my internet usage right there.
    I’m the founder of RailsBridge - we’re a nonprofit and we run workshops to get more women
    and other underrepresented folks into the Ruby community. This is a way more diverse crowd
    than at any Ruby or Javascript conference I’ve been to in the last two years. So,
    congratulations. I would like to get us where you are now.


  6. Logo copyright Ministry of Velocity LLC
    This is my company - I run a little consulting company in San Francisco. My official title is
    Head of Propaganda, because I’m the one who does the conference talks, and also because
    the best part of making your own company is that you get to make your own title.
    I do Ruby, JavaScript, Go, Android, iOS, pair programming, test-driven development. I build
    web and mobile applications for clients. This logo is new, by the way, please let me force you
    to take a business card, some time today, because I love them so much.
    Now you’ll notice that nowhere in there did I say that I build database engines. Neither I nor
    my company builds, or maintains, or provides support for any data store. We don’t sell a
    service based on any particular database. As a consultant, I have the luxury to choose
    whatever data store is right for each project. I run 3 to 6 projects a year, on average, so I see
    a lot of different kinds of data, a lot of different use cases, and I’ve put into production many
    of the data stores we’ve heard about here today, and a few that we haven’t.
    And that’s why I’m here. I’m here to tell you about what happens when you pick the wrong
    one. The example I’m going to use is actually not from any of my paid work. It’s from an
    open source project that I worked on called Diaspora.


  7. Picture taken by me! CC BY 3.0 license
    Anyone heard of Diaspora before? I was on the core team for a couple of years.
    Diaspora is a distributed social network built in Ruby. Their name may be familiar to you
    because they were the first big....


  8. Copyright 2010 New York Times
    ...Kickstarter success story. Back in early 2010, four students from New York University -
    Max, Dan, Rafi, and Ilya - made a video asking for $10,000 to spend the summer eating
    ramen and working on a distributed alternative to Facebook.


  9. Copyright 2010 Kickstarter
    And they hit a nerve. There had just been another Facebook privacy scandal, and by the time
    the dust settled on their Kickstarter, they had raised over $200,000 - for a software project
    that did not yet have a single line of code written. This was the first kickstarter project that
    way overran its goal. It got written up in the New York Times - which was a bit of a scandal,
    actually, because the photo they used, which I showed you in the last slide, had a dirty
    technology joke written on the chalkboard and no one noticed until it was printed in the
    Times.
    Anyway! They left school and came out to San Francisco to start working on this project.


  10. Copyright 2011 Henrik Moltke
    They ended up in my office. I was working for Pivotal Labs at the time, and one of the guys’
    older brothers also worked there, so we offered them desk space and internet, and...as you
    can see in the picture...beer.
    They stayed there all summer, mostly keeping to themselves, banging away on code, and by
    September they had a working, pretty minimal, but working implementation of a distributed
    social network built in Ruby on Rails, and backed by MongoDB.


  11. Diaspora diagram by Daniel Grippi. Rails logo copyright David Heinemeier Hansson. MongoDB logo copyright 10Gen.
    Now let's talk for a moment about what that actually means, because that’s a lot of
    buzzwords at once. “A distributed social network built in Ruby on Rails, backed by
    MongoDB.”
    Let’s start with "distributed social network." Now, "social network," I'm pretty sure y'all know
    what that is, right...


  12. Copyright 2010 Columbia Pictures
    ...having seen the movie? There’s everything you need to know about Facebook in two hours.
    Diaspora's interface, once you log in...


  13. [Screenshot: the Diaspora activity stream]
    ...looks pretty similar to Facebook's. There's a feed in the middle showing all your friends'
    posts - that’s what everyone looks at - and some other random stuff to do on the sides that
    no one ever looks at. The main difference between this and a traditional centralized social
    network is invisible to the end users. It’s the "distributed" part.


  14. Diaspora diagram by Daniel Grippi.
    Distributed means that the Diaspora infrastructure is not located behind a single web
    address. There are hundreds of independent Diaspora servers. The code is open source, so if
    you wanted to, you could clone it and stand up your own server. Each server, which we call a
    pod, has its own database and its own set of users, and it will interoperate with all the other
    Diaspora servers, which each have their own database and set of users.
    So in this diagram each of the circles is a server, a pod, and they communicate with each
    other through an API.
    Once you set up an account on a pod, in order to get anything out of it, you have to follow
    some other people. You can follow other folks who have accounts on your server, and you
    can also follow people who have accounts on other servers. When someone you follow on
    another server posts a message, it first goes into their pod database, then your pod is
    notified over the API and the message goes into your pod’s database, and then the next time
    you look at your activity feed, you see that post mixed in with posts from the other folks you
    follow.
    Comments work the same way. On any single post, some comments might be from people on
    the same server as the post's originator, and some might be from people on other servers.
    Everyone who has permission to see the post sees all the comments, just as you would
    expect if everyone were on a single logical server.
    There are a couple of interesting advantages to this architecture - first the technical.
    Technically, it’s...


  15. http://www.flickr.com/photos/criminalintent/2569906181
    ...fault-tolerant. Here is a very important fault tolerant system that every office should have.
    If any one of the pods goes down, it doesn't bring the others down. Related, if you’re in a
    country that’s not friendly towards dissidents, and your government shuts down outgoing
    internet, you can run a pod internally without having to rely on servers in an outside data
    center.
    Legally, there are interesting ramifications to this architecture as well.


  16. http://www.flickr.com/photos/bstabler/770416963
    Each pod is free to set its own terms of service, and on most of them, you can post content
    without giving up your rights to it - rights you do give up when you post content to a service like
    Facebook. Diaspora is free software both in the gratis and the libre sense of the term, and
    most of the people who run pods care deeply about that sort of thing.
    So that's the architecture of the system. Let's look at the architecture within a single pod.


  17. [Diagram: a single pod - MongoDB and Rails code, with two interfaces: the UI, used by the website, and the API, used by federation and the mobile clients]
    Ruby logo copyright Yukihiro Matsumoto. MongoDB logo copyright 10Gen.
    Each pod on its own is a pretty standard web application. It has a database, originally
    MongoDB, some Rails code, and two main interfaces: a visual UI that we saw a screenshot of
    just now, which is how website users interact with Diaspora, and a programmatic API that is
    used by inter-pod communication - the technical term for that is “federation” - as well as by
    the various Diaspora mobile clients.
    So that's the architecture of a server. Let's look more closely at this bottom part. Let’s look at
    how we store all of this social data.
    But before we do, let’s talk a little bit about what social data is.


  18. Photos all from rubyfriends.com. Thanks Matt Rogers, Steve Klabnik, Nell Shamrell, Katrina Owen, Sam Livingston-Grey, Josh Susser, Akshay Khole, Pradyumna Dandwate, and Hephzibah Watharkar!
    Social data is information about our network of friends, their friends, and their activity.
    Conceptually, we do think about it as a network - an undirected graph in which we are in the
    center, and our friends radiate out around us. When we store social data, we’re storing that
    graph topology, as well as the activity that moves along those edges.
    And for a long time now, the received wisdom, mostly from people working at large social
    data companies, has been that...


  19. THE RECEIVED WISDOM
    “Social data is not relational.”
    ...social data is not relational. The implication being, of course, that if you store it in a
    relational database, you’re doing it wrong.
    Some people say graph databases are more natural - I’m not going to cover that here, since
    graph databases are by and large too niche to be actually used. Other people say that
    document databases are perfect for social data, and those are mainstream enough to actually
    be used. So let’s look at why people think that.


  20. WHAT IS MONGO?
    “Document-oriented storage”
    MongoDB logo copyright 10Gen.
    MongoDB, of course, is one of these document-oriented databases. Instead of storing your
    data in tables made out of individual rows, like a relational database does, it stores your data
    in collections made out of individual documents. And a document is essentially just a big
    JSON blob with no particular format or schema.
    Here's an example of how Mongo stores data.


  21. THE RELATIONSHIPS IN YOUR DATA
    [Diagram: tv_shows → many seasons → many episodes → many reviews and many cast_members]
    Let's say you have a set of relationships like this that you need to model. This is quite similar
    to a project that came through Pivotal that used Mongo, and was actually the best use case
    I've ever seen for a document database.
    So here we have a set of TV shows, each show has many seasons, each season has many
    episodes, and each episode has many reviews and many cast members. The way this site
    works is that users come in to a page for a particular TV show, they see all the seasons and
    all the episodes and all the reviews and all the cast members for that show on one page. So
    from the application perspective, when the user visited a page, we wanted to pull all of the
    information connected to that TV show all at once.
    There are a number of ways you could model this data. In a typical relational store, each of
    these boxes would be a table. You have a tv_shows table, a seasons table with a foreign key
    into tv_shows, an episodes table with a foreign key into the seasons table, and reviews and
    cast_members tables with foreign keys into the episodes table. So to get all the information
    for a tv show, you're looking at a five-table join.
    Another way to think about modeling this data, though, is as nested hashes, or dictionaries -
    as one big nested key/value data structure. So a TV show is a hash, and within that, an array
    of seasons, each of which is a hash. Within each season, an array of episodes, each of which
    is a hash, and so on. That's how you’d model this in Mongo - each TV show is a document
    that contains within it all of the information we need for one show.
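    To make that concrete, here is roughly what that five-table join looks like in SQL. This is just a sketch - the table and column names are illustrative, not an actual schema from that project:

      SELECT *
      FROM tv_shows
        JOIN seasons           ON seasons.tv_show_id      = tv_shows.id
        JOIN episodes          ON episodes.season_id      = seasons.id
        LEFT JOIN reviews      ON reviews.episode_id      = episodes.id
        LEFT JOIN cast_members ON cast_members.episode_id = episodes.id
      WHERE tv_shows.title = 'Babylon 5';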


  22. {title: 'Babylon 5',
       seasons: [
         {season_number: '1',
          episodes: [
            {ordinal_within_season: '1',
             title: 'Midnight on the Firing Line',
             reviews: [{...}],
             cast_members: [{...}]
            }
          ]
         }
       ]
      }
    Here's an example document for one TV show, Babylon 5. It's got some metadata up top -
    the title - and then it's got an array of seasons, with the square brackets there, and each
    season is itself a hash with metadata, like the season number, and then an array of episodes.
    And within each episode, some metadata, and then arrays of reviews and cast members
    within that.
    It’s basically...


  23. http://en.wikipedia.org/wiki/File:Cauliflower_Fractal_AVM.JPG
    ...a huge fractal data structure. Sets of sets of sets of sets. And because all of the data we
    need for a TV show is under one document, it's very fast to pull all the required data at once,
    even if there's a lot of it. There’s a TV show in the states called General Hospital that’s been
    running for over 50 seasons now, it has over 12,000 episodes. The time it takes in a relational
    store to do the five-table join to get all that data for 12,000 episodes is not inconsiderable.
    So this is the ideal use case for a document store.
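    For contrast, here's a sketch of the single-document fetch in the mongo shell - collection and field names are assumptions, but the point is that one query returns the whole show:

      db.tv_shows.findOne({title: 'General Hospital'})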
    And looking at this structure of documents made out of nested hashes, social data fits it quite
    reasonably. When you come to a social networking site - here's the same screenshot of
    Diaspora we saw earlier -


  24. [Screenshot: the Diaspora activity stream]
    - there's only one important part of the page, and that's your activity stream. This is the
    query that gets all of the posts from the people you follow, ordered by most recent. Each of
    those posts has nested information within it, like photos, likes, reshares, comments, and
    so forth.
    The nested structure of activity stream data looks similar to what we were looking at with the
    TV shows. Here are the relationships drawn out.


  25. THE RELATIONSHIPS IN SOCIAL DATA
    [Diagram: users → many friends → many posts → many comments and many likes; each comment has one commenter, each like has one liker]
    Users have friends, friends have posts, posts have comments and likes, each comment has
    one commenter and each like has one liker. Relationship-wise, it's not a whole lot more
    complicated than TV shows. And just like TV shows, we want to pull all this data at one time,
    right after the user logs in.
    In a relational store, with the data fully normalized, that is, with no duplicate data, it would be
    a seven-table join to get everything out. It certainly seems like it would be better to store a
    user’s activity stream as one big denormalized nested data structure, rather than doing that
    join every time.
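    Here's a sketch of what that stream query looks like against a hypothetical normalized schema - the table and column names are made up for illustration. Notice that the users table shows up more than once:

      SELECT posts.*, comments.*, likes.*
      FROM users
        JOIN friendships              ON friendships.user_id = users.id
        JOIN users AS friends         ON friends.id          = friendships.friend_id
        JOIN posts                    ON posts.author_id     = friends.id
        LEFT JOIN comments            ON comments.post_id    = posts.id
        LEFT JOIN users AS commenters ON commenters.id       = comments.author_id
        LEFT JOIN likes               ON likes.post_id       = posts.id
      WHERE users.id = 42
      ORDER BY posts.created_at DESC;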


  26. looking over a cliff
    http://www.flickr.com/photos/ccheviron/3602582959/
    And so, four students from NYU, after reading some blog posts from people at Etsy and Flickr
    and Facebook, chose MongoDB for storage of their social data. It was not an unreasonable
    choice at the time, given the information they had.
    But there is one really important difference between the data they were modeling, and
    Mongo’s ideal use case, that they didn’t notice in their evaluation.


  27. THE IDEAL CASE
    [Diagram: tv_shows → many seasons → many episodes → many reviews and many cast_members]
    In the ideal case - the TV show example - each of these domain objects is a different type.
    But with our social relationships -


  28. THE REAL WORLD
    [Diagram: the same relationships, but friends, commenters, and likers are highlighted as the same type - they are all users]
    - some of these domain objects are the same type. In fact, all of these green boxes are the
    same type - they are all Diaspora users. A user has friends, and the friends may themselves
    be users, or they may not, because it's a distributed system (that's a whole layer of
    complexity that I'm kind of skimming over...). And in the same way, commenters and likers
    may be users.
    This may not seem relevant at first, but this duplication of types makes it much more difficult
    to denormalize a stream into a single document, than it is to denormalize a TV show. And
    that’s because in different places in your documents, you may be referring to the same
    conceptual domain object. Someone who liked a post in your activity stream may also be a
    friend, and may also have commented on a post in your stream.
    This is not impossible to handle in Mongo. One option is...


  29. http://www.flickr.com/photos/yannisag/3175076105
    ...duplication.
    You can copy all the user data you need into every place you need it. So data for an individual
    user is duplicated - and every time they comment on something, and every time they like
    something, every time someone follows them - you make a separate copy of their user data.
    The advantage is that it's present everywhere you need it, and you can still pull the whole
    activity stream back as a single document.
    Here’s what this kind of fully denormalized stream document looks like:


  30. JOE’S FULLY DENORMALIZED STREAM
    {
     name: Joe,
     url: '...',
     stream:
       [{
         user: {name: Jane, url: '...'},
         title: 'today',
         body: 'go fly a kite',
         likes: [
           {user: {name: Lu, url: '...'}},
           {user: {name: Joe, url: '...'}}
         ],
       }]
    }
    Notice that we’ve got copies of user data inlined here. This is Joe’s stream, and it has a copy
    of his user data at the top level. Joe’s stream contains Jane’s post. Joe has clicked the “like”
    button on Jane’s post, so under likes for Jane’s post, we have another copy of Joe’s data.
    I think you can see why this is attractive. All the data you need is already located where you
    need it. I think you can also see why this is dangerous. When someone’s user data changes,
    we need to walk through all the activity streams where they’ve been active and change it in all
    those different places. This is very error-prone.
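    As a sketch of what that fan-out looks like in the mongo shell - and this only covers the copies embedded under likes; collection and field names are assumptions:

      var joeUrl  = 'https://example.com/u/joe';  // hypothetical identifier for Joe
      var newName = 'Joseph';
      db.streams.find({'stream.likes.user.url': joeUrl}).forEach(function (doc) {
        doc.stream.forEach(function (post) {
          post.likes.forEach(function (like) {
            if (like.user.url === joeUrl) { like.user.name = newName; }
          });
        });
        db.streams.save(doc);  // rewrite the whole stream document with the edited copies
      });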
    So the other approach you can take to this problem in Mongo, which is more familiar to those
    of us with a relational background, is to store references to users in the activity stream
    document, rather than the actual user data.
    So instead of inlining this user data wherever you need it, ...


  31. JOE’S SLIGHTLY MORE NORMALIZED STREAM
    {
     id: 1,
     name: Joe,
     url: '...',
     stream:
       [{
         user: 2,
         title: 'today',
         body: 'go fly a kite',
         likes: [
           {user: 3},
           {user: 1}
         ],
       }]
    }
    ...you give each user an ID, here at the top - as an aside, Mongo actually uses BSON IDs,
    which are strings sort of like GUIDs, but to make the slides easier to read I’m just using
    integers. So once users have IDs, we store the user’s ID every place that we were previously
    inlining data.
    So this eliminates our duplication problem - now, when user data changes, there’s only one
    place we have to change it. However, now we can't get all the information we need to
    construct an activity stream from a single document. Our efficiency has been reduced, and
    our complexity has been increased, because now to construct the activity stream, we have to
    retrieve the stream document, and then go back and retrieve all the user documents that the
    stream references, to get their user data.
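    A sketch of that two-step fetch in the mongo shell, using the IDs from the slide (collection and field names are assumptions):

      var streamDoc = db.streams.findOne({id: 1});  // Joe's stream
      var userIds = [];
      streamDoc.stream.forEach(function (post) {
        userIds.push(post.user);                    // the post's author
        post.likes.forEach(function (like) { userIds.push(like.user); });
      });
      var users = db.users.find({id: {$in: userIds}}).toArray();
      // ...and stitching those users back into the posts and likes happens in application code.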
    And what you really want, here...


  32. joins
    http://www.flickr.com/photos/evaekeblad/2351025393/
    ...is a join - one query that mashes together the activity stream and all the users that the
    stream references. SQL is great at joins. Mongo is not. It just plain doesn’t do them. So you
    end up manually doing that mashup in your application code, instead.
    In contrast when you're looking at this set of relationships...


  33. THESE DOCUMENTS ARE SELF-CONTAINED
    [Diagram: tv_shows → many seasons → many episodes → many reviews and many cast_members]
    ...nothing here has that type of complexity. Because all these boxes are different entities, the
    entire query can be denormalized into one document with no duplication and no references.
    In this document database, there are no links between documents. It requires no joins.
    But on a social network, nothing is that self-contained.


  34. [Screenshot: the Diaspora activity stream]
    When you come to a site like this, any time you see something that looks like a name or a
    picture - these things here - you expect to be able to click on it and go see that user. Go see
    their profile, go see their posts! In a TV show application, it doesn't really happen that way. If
    you're on season 1 episode 1 of Babylon 5, you don't expect to be able to click through to
    season 1 episode 1 of General Hospital.
    TV shows have no links between the documents. But social data has a lot of linking between
    the documents, and in many ways, that’s the most important data in the whole system. This
    linking is inherent in the nature of social data, because it's all about who's doing it and
    what’s your relationship to them.
    Once we started doing ugly joins manually in the Diaspora code, we knew it was ...


  35. http://www.flickr.com/photos/davida3/939738862/
    ...the first sign of trouble. This definitely looks like trouble. It certainly doesn’t look like some
    place you want to in...sert... your credit card, right? Maybe it’s not malicious! Maybe they just
    need a dictionary. But it’s definitely a red flag.
    The moment that you start storing references to other objects in mongo, that’s a red flag. It's
    a sign that your data actually is relational, that there's value to that structure, and that you are
    going against the basic concept of a document data store.
    Because when Mongo says it’s about documents...


  36. http://www.flickr.com/photos/horrgakx/2963449465
    ...you should really believe it. And when they say "document," in many ways they mean
    something you can print out on a piece of paper and hold. It's a self-contained piece of
    information. It may have a lot of internal structure - it may have headings and subheadings
    and paragraphs and footers - but it doesn't have links to other things, and it doesn't have
    references to other things. It's a self-contained piece of semi-structured data.
    So if your data looks like that, you’ve got documents, and then it's a good use case for
    Mongo. If there’s value in the links between documents - then you don’t actually have
    documents. And therefore it’s probably not a good use case for Mongo.


  37. WHAT THEY ACTUALLY MEAN
    “Social data is not relational.”
    1. It’s more naturally modeled as a graph - TRUE
    2. It’s faster to retrieve when denormalized - TRUE
    And it turns out, that when people say "social data isn't relational," that's not actually what
    they mean. What they mean is one of these things:
    Sometimes, they mean that conceptually, social data is more of a graph than a set of tables.
    And this is true. But there are actually very few concepts in the world that are naturally
    modeled as normalized tables. We use that structure because it's efficient, because it avoids
    duplication, and because when it is slow, we know how to fix it.
    And other times, they mean that it will be faster to get all the data for a social query when it's
    denormalized into a single document. And this is also true. When your social data is in a
    relational store, you need a many-table join to extract the activity stream for a particular user,
    and that does get slow as your tables get bigger.
    But there is a well-known solution to this problem, and it’s called caching. I loved Neha’s talk
    this morning on this subject, and I've actually seen lots of projects that cache denormalized
    activity stream data into a document database like Mongo. Once you do that, hey! The only
    problem you have is cache invalidation! And, really, how hard could that be?


  38. CACHE INVALIDATION
    “There are only two hard problems in computer science: cache invalidation and naming things.”
    PHIL KARLTON
    Ha. Yeah, it turns out cache invalidation is actually pretty hard.
    Phil Karlton is the guy who wrote most of SSL version 3, X11, and OpenGL, so he knows a
    thing or two about computer science. Here’s my favorite variation on this by the way:


  39. CACHE INVALIDATION
    “There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.”
    UNKNOWN SMART*SS
    This one is not attributed to Phil Karlton.
    But what is cache invalidation, anyway, and why is it so hard?
    Well, cache invalidation is just knowing when a piece of your cached data is out of date, and
    needs to be updated or replaced. Here's a typical example that I see every day in web
    applications:


  40. WRITES WITH A CACHE
    [Diagram: a user with 2 followers posts (title “halo”, body “whirled”); step 1 writes the post to the backing store, step 2 appends it to follower 1’s and follower 2’s cached activity streams; read requests hit the cache]
    We have a standard backing store, call it, say, “postgres,” and in front of it we have a cache
    layer, call it, say, “memcached.” Application read requests for a user’s activity stream go to
    the cache rather than the database directly.
    Application writes are more complicated. Let's say a user with two followers writes a new
    post. The first thing that happens [1] is that the post data is copied into the backing store.
    Once that completes, a background job [2] appends that post to the cached activity stream of
    both of the users who follow the author. Neha talked about this this morning, that Twitter
    holds the user's activity stream in an in-memory cache, and appends to it when someone that
    user follows posts something. And that’s a very common approach, even for smaller
    applications that employ some kind of activity stream.
    When the author in this example changes an existing post, for example updating the title, the
    write process is essentially the same as a create, except instead of appending to the cache, it
    updates an item that’s already there.
    But…what happens if that step 2 background job fails partway through? It’s bound to happen.
    Machines get rebooted, partitioned, deployed to, all the time. And chances are, when that
    happens, you'll have inconsistent data in your cache. Some copies of the post will have the old
    title, and some copies will have the new title. That's a hard problem, but it's not impossible to
    solve, because with a cache, there's always...


  41. http://www.flickr.com/photos/andyz/3856834913
    ...the nuclear option. You can always delete the entire activity stream record out of your cache
    and regenerate it from your consistent backing store. It may be slow, but at least it’s possible.
    But what if...


  42. WRITES WITH A CACHE
    [Diagram: the same flow with no backing store - the post goes straight into follower 1’s and follower 2’s cached activity streams, and read requests hit the cache]
    ...there is no backing store? What if you skip step 1? When Mongo is all you have - it's like
    having a cache with no backing store behind it. It will become inconsistent. Not eventually
    consistent - just plain, flat-out inconsistent - and at that point, you have no options. Not
    even a nuclear one.
    There is no way to regenerate the data in a consistent state.


  43. PRO TIP
    Database != Cache
    Oh sh*t. What now?
    So when we stored our social data in Mongo, we were conflating a database with a cache.
    Databases and caches are very different things. They have very different ideas about
    permanence and transience, duplication and references, integrity, and speed.
    So once we figured out that we had accidentally chosen a cache for our database...what did
    we do about it?
    Well, that's the million dollar question. Which is, I looked it up, 622,704 pounds, according to
    today's exchange rate. But if that's the million-dollar question, then I think we already
    answered the billion-dollar question. In the last few slides, I talked about how we used
    Mongo, versus how it was designed to be used, as though it were all obvious, and we just
    failed to research adequately before choosing.
    But it wasn’t obvious at all. Mongo docs tell you what it’s good at, without emphasizing what
    it’s not good at. Naturally, right? All projects do that. So it took us quite a while to figure out
    that we were using Mongo the wrong way. It took about six months, actually. A lot of user
    complaints about inconsistent and lost data, and a lot of investigation.


  44. CONVERSION PROCESS
    [Diagram: the social-data relationships again (users, friends, posts, comments, likes, commenter, liker), with comments and likes marked as the starting point]
    The data conversion itself was straightforward in comparison. I'll run you through our
    approach, which I've used on a bunch of other conversion projects since this one.
    To pull data out of Mongo and put it in a SQL store, you need to normalize and de-duplicate
    it. And you want to start with the leaf nodes in your dependency graph. Let's look at what that
    means. Here our leaf nodes are commenters and likers, but they aren’t really leaf nodes,
    because they’re users, and users are actually the root of this whole mess.
    So the leaf nodes from a dependency standpoint are actually comments and likes. Let's start
    with likes.


  45. HOW TO CONVERT DATA
    1. Export
    2. Convert
    3. Normalize
    4. Import
    4a. Repeat
    5. Process
    Once you've identified the data to be extracted, the procedure is:
    1. EXPORT
    2. CONVERT
    3. NORMALIZE
    4. IMPORT
    4A. REPEAT
    5. PROCESS
    I break it into these steps primarily because they are easy to test independently. Let’s go
    through each one.


  46. STEP ONE
    1. Export
    {‘likes’: [
      {‘post_bson_id’: 1234, ‘user_bson_id’: 5678},
      {‘post_bson_id’: 1234, ‘user_bson_id’: 9012},
    ]}
    The first step is to write a map reduce in JavaScript that dumps all the like data to JSON, one
    JSON hash per like. You need two things for each like: the post's BSON ID, the user's BSON ID.
    There’s no other metadata - likes are essentially just a join table.
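    A minimal sketch of that export in the mongo shell - written as a simple cursor walk rather than a full map reduce, and assuming likes are embedded in post documents; field names are assumptions:

      db.posts.find({}, {likes: 1}).forEach(function (post) {
        (post.likes || []).forEach(function (like) {
          printjson({post_bson_id: post._id, user_bson_id: like.user_id});
        });
      });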


  47. STEP TWO
    2. Convert
    {‘likes’: [
      {‘post_bson_id’: 1234, ‘user_bson_id’: 5678},
      {‘post_bson_id’: 1234, ‘user_bson_id’: 9012},
    ]}

    post_bson_id, user_bson_id
    1234, 5678
    1234, 9012
    Step two is to convert that JSON to a CSV file, one row per like.


  48. STEP THREE
    3. Normalize
    id            - auto-increment primary key
    post_bson_id  - from Mongo
    user_bson_id  - from Mongo
    post_id       - initially blank
    user_id       - initially blank
    Step three is to normalize.
    Create a likes table in your relational store with five columns:
    id, which is your auto increment primary key,
    post_bson_id and user_bson_id, which are the two IDs you exported from Mongo, and
    post_id and user_id, initially blank, which will eventually contain the foreign keys into those
    tables that don’t actually exist yet.
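    In MySQL that table might look like this - a sketch, assuming the BSON IDs are stored as 24-character strings:

      CREATE TABLE likes (
        id           INT AUTO_INCREMENT PRIMARY KEY,
        post_bson_id VARCHAR(24),
        user_bson_id VARCHAR(24),
        post_id      INT,   -- filled in during the process step
        user_id      INT    -- filled in during the process step
      );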


  49. STEP FOUR
    4. Import
    LOAD DATA INFILE
    Step four is to load that CSV into your relational store. It’s a pretty basic LOAD DATA INFILE.
    At this stage, only id, post_bson_id, and user_bson_id will be present in each row.
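    A sketch of that load, assuming the CSV from step two lives at /tmp/likes.csv and has a header row:

      LOAD DATA INFILE '/tmp/likes.csv'
        INTO TABLE likes
        FIELDS TERMINATED BY ','
        IGNORE 1 LINES
        (post_bson_id, user_bson_id);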


  50. STEP FOUR A
    4a. Repeat
    id                       - auto-increment primary key
    bson_id                  - from Mongo
    user_bson_id             - from Mongo
    text (+ other metadata)  - from Mongo
    Step 4A is to repeat the process with posts so that we can populate the likes table’s post_id
    field.
    We now have a table for likes, the leaf node, so we go up one level to its container - posts -
    and repeat steps 1 through 4.
    We make a posts table with an auto-increment primary key, a bson_id column containing the
    post's BSON ID, a user_bson_id column containing the BSON ID of the author, and of course,
    all the actual data for a post like date and text.


  51. STEP FIVE
    5. Process
    UPDATE likes, posts
      SET likes.post_id = posts.id
      WHERE posts.bson_id = likes.post_bson_id;

    ALTER TABLE likes
      DROP COLUMN post_bson_id;

    ALTER TABLE likes
      MODIFY post_id INT NOT NULL;
    Step 5 is to process. This is the hard one usually. Write a SQL statement similar to this [first
    statement] that copies post_ids into the likes table based on matching up the BSON IDs.
    Once you’ve done that, you can delete the post_bson_id column from the likes table [second
    statement].
    But I recommend before you do that, you do this [third statement]: make likes.post_id not
    null. If you’re lucky, this will actually work the first time, meaning that every row in the likes
    table actually has a post_id.
    I’ve never been lucky. Sometimes when it doesn’t work, it means we didn’t import all the
    posts. Sometimes it means we have likes that refer to a post that has been deleted. This is
    where we do the drudge work of actually figuring out how to make the data consistent.
    Once you’ve got that (NO PROBLEM AMIRITE), go back and repeat for users.
    This process is sadly the best case scenario.


  52. WHAT IF YOU’RE MORE FULLY DENORMALIZED?
    1. Export
    2. Convert
    3. Normalize (skipped)
    4. Import
    4a. Repeat (skipped)
    5. Process - A LOT
    The process I just described assumes that everything you need to refer to in the normalized
    version of your schema already has a BSON ID inside of Mongo. That is, it assumes you've
    gotten to the point where you're referring to duplicated objects using Mongo's BSON IDs,
    rather than actually duplicating the data.
    If that's not the case - if you're copying data - then it's even harder. At that point you need to
    figure out which instances of the copied data are referring to the same conceptual instance,
    and then reconcile the inevitable conflicting data. Usually the easiest way to do this is to:
    1. export all the data, duplicates, conflicts, and all, out of Mongo,
    2. convert it to a CSV,
    3. skip normalizing for the moment,
    4. import it into a temporary table in your SQL store,
    4A. no repeating, and then finally,
    5. figure out which SQL statements you need to transform that mess of a temporary table into
    the consistent version in the final table. Then you can delete the temporary table.
    Most projects need a combination of these two approaches. The relative proportion depends
    on how the project has constructed their document store.
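    For the temporary-table variant, the step 5 SQL ends up doing the de-duplication itself. A sketch, with made-up table and column names, that collapses duplicated user copies into one row per BSON ID (MAX() here is just an arbitrary way to pick a winner when copies conflict):

      INSERT INTO users (bson_id, name, url)
      SELECT user_bson_id, MAX(name), MAX(url)
      FROM tmp_users
      GROUP BY user_bson_id;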


  53. THE DAMAGE
    8 months of production data
    1.2 million rows
    4 pair-weeks of development
    2 hours of downtime
    So here’s the damage - we had 8 months of production data. It turned into about 1.2 million
    rows all told in our SQL store. We spent four pair-weeks on development, and had about 2
    hours of downtime.
    That was more than acceptable for a project that was in pre-alpha. We could have actually
    reduced that downtime more, but we had budgeted for 8 hours of downtime, so 2 actually
    seemed fantastic.
    All in all -


  54. NOT BAD http://i.imgur.com/4aRuE.png
    Not bad for a project that was pre-alpha at the time. Barack, Michelle, and grumpy cat all
    approve.
    I want to leave you with one final thought. Remember this project?


  55. EPILOGUE
    [Diagram: tv_shows → many seasons → many episodes → many reviews and many cast_members]
    The TV shows? This is the perfect use case for Mongo - each show is one document, perfectly
    self-contained. No references to anything, no duplication, and no way for the data to become
    inconsistent.
    This is the second time I’ve given this talk. About a month after the first time I gave it, this
    project was still humming along nicely on Mongo, and then one Monday, in the weekly
    planning meeting, the client brought up a new feature that one of their investors wanted:
    when they were looking at an episode, and looking at all the cast members - the actors that
    were in that episode - they wanted to be able to click on an actor’s name and see that
    person’s entire television career. A chronological listing of all of the episodes of all the
    different shows they had ever been in.
    Since each show and all of its information including cast members were individual
    documents, to implement this, we would have to search through every document to find and
    de-duplicate instances of the actor they clicked on. At a minimum, we would have had to
    maintain an external index of actor information, which would have the same invalidation
    issues as any other kind of cache.
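    For a flavor of what that lookup forces, here's a sketch in the mongo shell - you scan every matching show document and pick out the cast members in code; the cast-member field names are assumptions:

      var actorName = '...';  // the actor the user clicked on
      db.tv_shows.find({'seasons.episodes.cast_members.name': actorName}).forEach(function (show) {
        show.seasons.forEach(function (season) {
          season.episodes.forEach(function (episode) {
            episode.cast_members.forEach(function (member) {
              if (member.name === actorName) {
                print(show.title + ' S' + season.season_number + 'E' + episode.ordinal_within_season);
              }
            });
          });
        });
      });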


  56. EPILOGUE
    [Diagram: tv_shows → many seasons → many episodes → many reviews and many cast_members]
    The client expected this feature to be trivial. And if the data had been in a SQL store, it would
    have been. As it was, we first tried to convince the PM that they didn’t need it, then we
    offered some alternatives, such as linking to an IMDB search for the actor's name, but of
    course, since this company made money from advertising, they wanted users to stay on their
    site rather than going off to IMDB.
    This feature request eventually prompted this project’s conversion to postgres. Because after
    a lot more conversation with the client, we realized that the business saw lots of value in
    linking TV shows together - this is one of them, but also things like related shows - "if you
    like this, you may also like" - these other shows. See other shows this director has been
    involved with. See episodes of other shows that were released the same week this episode
    was.
    This was ultimately...


  57. http://www.d80.co.uk/image.axd?picture=2011%2F3%2FTHE_SOFTWARE_DEVELOPMENT_PROJECT.jpg
    ... a communication problem rather than a technical problem. If we had had these
    conversations sooner, if we had taken the time to really understand how the client saw the
    data and what they wanted to do with it, we probably would have done the conversion earlier,
    when there was less data, and it was easier.


  58. FINAL PRO TIP
    Structure < Value
    So that's what I want to leave you with. My last pro tip. The structure of the data is important
    when you're picking a data store, but more important than structure is the value of the data.
    You need to understand where in your data - and where in its connections - the value lies.
    Because if it looks like you have documents but there is value in the links between them - you
    don't actually have documents.
    Ultimately you want to choose a data store...


  59. OK, ONE MORE
    Make valuable things easy.
    ... that makes valuable things easy.
    And my very last pro tip -


  60. http://www.flickr.com/photos/[email protected]/4937109161
    ...you should really consider those helmets and shoulder pads and stuff, you guys, because
    this looks super dangerous. We need all your IQ points.


  61. Thank you!
    Sarah Mei
    Ministry of Velocity
    @sarahmei
    http://www.tehcute.com/baby-turtle-eats-strawberry.html
    And that’s all I’ve got. Thank you very much!
