
Switching Data Stores: A Postmodern Comedy

sarahmei
October 18, 2013


What happens when you pick the wrong data store? A few years ago, open source web application Diaspora did just that. We started out on MongoDB, and then after eight months in production, switched to MySQL.

With eight months of production data to convert, it was not a switch we made lightly. I'll talk about the mechanics of how it was done, as well as the motivations for the original choice, and how we knew when it was time.


Transcript

  1. Switching Data Stores
    A Postmodern Comedy
    Sarah Mei
    Ministry of Velocity
    @sarahmei
    Hello! I'm Sarah Mei. I’m super excited to be here today to do this talk, and I owe a huge debt
    to John, and Ruth, and the rest of the White October folks for inviting me and continuing to
    invite me, even though I’m really bad at replying to email. But I think this has been a fantastic
    conference, hands down the best-organized event I’ve ever spoken at.
    I want to ask a bit of an odd question - how many .NET devs do we have here? That is pretty
    awesome. I do a lot of conference talks, but mostly at Ruby conferences and JavaScript
    conferences. I hardly ever get to talk to .NET folks. So I’m super happy to be here.
    I suppose by now it’s obvious, but uh...I’m not from around here [conf was in Oxford UK]. I’m
    from San Francisco. But this is a lovely place. I walked around Oxford a little bit yesterday and
    I did see one thing that was really weird. Hopefully one of y’all can explain it to me.


  2. Picture taken by me! CC BY 3.0 license
    This place - in addition to being quite beautiful - had a sign next to it saying it’s a field for
    “football.”
    Now there were people playing a game here before I took this picture. It was very strange,
    though - they weren’t wearing helmets or shoulder pads, I didn’t see any spandex, anywhere,
    and strangest of all, actually, nobody seemed to be receiving any kind of concussive head
    injuries. That don’t sound like no kinda football...


  3. Copyright 2000 Green Bay Packers
    ...I’m used to. So that was confusing.
    However, right across the walking path from that field was this one:


  4. Picture taken by me! CC BY 3.0 license
    That’s a cow down there in the lower right. The picture doesn’t adequately convey how large
    this cow is. It is a massive cow. I know this is business as usual for you folks who live here,
    but I grew up in the farm country of California, so this made me feel right at home. So thanks
    for that. I don’t know about your strange nonviolent sports habits, but this I can get behind.
    So I’m super happy to be here, with you to talk about databases. Let me tell you a little bit
    about who I am.


  5. OBLIGATORY VANITY SLIDE
    @sarahmei
    Twitter logo copyright Twitter Inc., Octocat copyright GitHub Inc., RailsBridge logo copyright RailsBridge Inc.
    This is me, on twitter and github. There’s 95% of my internet usage right there.
    I’m the founder of RailsBridge - we’re a nonprofit and we run workshops to get more women
    and other underrepresented folks into the Ruby community. This is a way more diverse crowd
    than at any Ruby or Javascript conference I’ve been to in the last two years. So,
    congratulations. I would like to get us where you are now.


  6. Logo copyright Ministry of Velocity LLC
    This is my company - I run a little consulting company in San Francisco. My official title is
    Head of Propaganda, because I’m the one who does the conference talks, and also because
    the best part of making your own company is that you get to make your own title.
    I do Ruby, JavaScript, Go, Android, iOS, pair programming, test-driven development. I build
    web and mobile applications for clients. This logo is new, by the way, please let me force you
    to take a business card, some time today, because I love them so much.
    Now you’ll notice that nowhere in there did I say that I build database engines. Neither I nor
    my company builds, or maintains, or provides support for any data store. We don’t sell a
    service based on any particular database. As a consultant, I have the luxury to choose
    whatever data store is right for each project. I run 3 to 6 projects a year, on average, so I see
    a lot of different kinds of data, a lot of different use cases, and I’ve put into production many
    of the data stores we’ve heard about here today, and a few that we haven’t.
    And that’s why I’m here. I’m here to tell you about what happens when you pick the wrong
    one. The example I’m going to use is actually not from any of my paid work. It’s from an
    open source project that I worked on called Diaspora.


  7. Picture taken by me! CC BY 3.0 license
    Anyone heard of Diaspora before? I was on the core team for a couple of years.
    Diaspora is a distributed social network built in Ruby. Their name may be familiar to you
    because they were the first big....


  8. Copyright 2010 New York Times
    ...Kickstarter success story. Back in early 2010, four students from New York University -
    Max, Dan, Rafi, and Ilya - made a video asking for $10,000 to spend the summer eating
    ramen and working on a distributed alternative to Facebook.


  9. Copyright 2010 Kickstarter
    And they hit a nerve. There had just been another Facebook privacy scandal, and by the time
    the dust settled on their Kickstarter, they had raised over $200,000 - for a software project
    that did not yet have a single line of code written. This was the first kickstarter project that
    way overran its goal. It got written up in the New York Times - which was a bit of a scandal,
    actually, because the photo they used, which I showed you in the last slide, had a dirty
    technology joke written on the chalkboard and no one noticed until it was printed in the
    Times.
    Anyway! They left school and came out to San Francisco to start working on this project.


  10. Copyright 2011 Henrik Moltke
    They ended up in my office. I was working for Pivotal Labs at the time, and one of the guys’
    older brothers also worked there, so we offered them desk space and internet, and...as you
    can see in the picture...beer.
    They stayed there all summer, mostly keeping to themselves, banging away on code, and by
    September they had a working, pretty minimal, but working implementation of a distributed
    social network built in Ruby on Rails, and backed by MongoDB.


  11. Diaspora diagram by Daniel Grippi. Rails logo copyright David Heinemeier Hansson. MongoDB logo copyright 10Gen.
    Now let's talk for a moment about what that actually means, because that’s a lot of
    buzzwords at once. “A distributed social network built in Ruby on Rails, backed by
    MongoDB.”
    Let’s start with "distributed social network." Now, "social network," I'm pretty sure y'all know
    what that is, right...


  12. Copyright 2010 Columbia Pictures
    ...having seen the movie? There’s everything you need to know about Facebook in two hours.
    Diaspora's interface, once you log in...


  13. [Screenshot: the Diaspora activity stream]
    ...looks pretty similar to Facebook's. There's a feed in the middle showing all your friends'
    posts - that’s what everyone looks at - and some other random stuff to do on the sides that
    no one ever looks at. The main difference between this and a traditional centralized social
    network is invisible to the end users. It’s the "distributed" part.


  14. Diaspora diagram by Daniel Grippi.
    Distributed means that the Diaspora infrastructure is not located behind a single web
    address. There are hundreds of independent Diaspora servers. The code is open source, so if
    you wanted to, you could clone it and stand up your own server. Each server, which we call a
    pod, has its own database and its own set of users, and it will interoperate with all the other
    Diaspora servers, which each have their own database and set of users.
    So in this diagram each of the circles is a server, a pod, and they communicate with each
    other through an API.
    Once you set up an account on a pod, in order to get anything out of it, you have to follow
    some other people. You can follow other folks who have accounts on your server, and you
    can also follow people who have accounts on other servers. When someone you follow on
    another server posts a message, it first goes into their pod database, then your pod is
    notified over the API and the message goes into your pod’s database, and then the next time
    you look at your activity feed, you see that post mixed in with posts from the other folks you
    follow.
    Comments work the same way. On any single post, some comments might be from people on
    the same server as the post's originator, and some might be from people on other servers.
    Everyone who has permission to see the post sees all the comments, just as you would
    expect if everyone were on a single logical server.
    There are a couple of interesting advantages to this architecture - first the technical.
    Technically, it’s...


  15. http://www.flickr.com/photos/criminalintent/2569906181
    ...fault-tolerant. Here is a very important fault tolerant system that every office should have.
    If any one of the pods goes down, it doesn't bring the others down. Related, if you’re in a
    country that’s not friendly towards dissidents, and your government shuts down outgoing
    internet, you can run a pod internally without having to rely on servers in an outside data
    center.
    Legally, there are interesting ramifications to this architecture as well.


  16. http://www.flickr.com/photos/bstabler/770416963
    Each pod is free to set its own terms of service, and on most of them, you can post content
    without giving up your rights to it - rights you do give up when you post content to a service like
    Facebook. Diaspora is free software both in the gratis and the libre sense of the term, and
    most of the people who run pods care deeply about that sort of thing.
    So that's the architecture of the system. Let's look at the architecture within a single pod.


  17. [Diagram: a single pod - MongoDB and Rails code, with two interfaces: the UI, used by the website, and the API, used by federation and the mobile clients]
    Ruby logo copyright Yukihiro Matsumoto. MongoDB logo copyright 10Gen.
    Each pod on its own is a pretty standard web application. It has a database, originally
    MongoDB, some Rails code, and two main interfaces: a visual UI that we saw a screenshot of
    just now, which is how website users interact with Diaspora, and a programmatic API that is
    used by inter-pod communication - the technical term for that is “federation” - as well as by
    the various Diaspora mobile clients.
    So that's the architecture of a server. Let's look more closely at this bottom part. Let’s look at
    how we store all of this social data.
    But before we do, let’s talk a little bit about what social data is.


  18. Photos all from rubyfriends.com. Thanks Matt Rogers, Steve Klabnik, Nell Shamrell, Katrina Owen, Sam Livingston-Grey, Josh Susser, Akshay Khole, Pradyumna Dandwate, and Hephzibah Watharkar!
    Social data is information about our network of friends, their friends, and their activity.
    Conceptually, we do think about it as a network - an undirected graph in which we are in the
    center, and our friends radiate out around us. When we store social data, we’re storing that
    graph topology, as well as the activity that moves along those edges.
    And for a long time now, the received wisdom, mostly from people working at large social
    data companies, has been that...


  19. THE RECEIVED WISDOM
    “Social data is not relational.”
    ...social data is not relational. The implication being, of course, that if you store it in a
    relational database, you’re doing it wrong.
    Some people say graph databases are more natural - I’m not going to cover that here, since
    graph databases are by and large too niche to be actually used. Other people say that
    document databases are perfect for social data, and those are mainstream enough to actually
    be used. So let’s look at why people think that.


  20. WHAT IS MONGO?
    “Document-oriented storage”
    MongoDB logo copyright 10Gen.
    MongoDB, of course, is one of these document-oriented databases. Instead of storing your
    data in tables made out of individual rows, like a relational database does, it stores your data
    in collections made out of individual documents. And a document is essentially just a big
    JSON blob with no particular format or schema.
    Here's an example of how Mongo stores data.


  21. THE RELATIONSHIPS IN YOUR DATA
    [Diagram: tv_shows → many seasons → many episodes → many reviews and many cast_members]
    Let's say you have a set of relationships like this that you need to model. This is quite similar
    to a project that came through Pivotal that used Mongo, and was actually the best use case
    I've ever seen for a document database.
    So here we have a set of TV shows, each show has many seasons, each season has many
    episodes, and each episode has many reviews and many cast members. The way this site
    works is that users come in to a page for a particular TV show, they see all the seasons and
    all the episodes and all the reviews and all the cast members for that show on one page. So
    from the application perspective, when the user visited a page, we wanted to pull all of the
    information connected to that TV show all at once.
    There are a number of ways you could model this data. In a typical relational store, each of
    these boxes would be a table. You have a tv_shows table, a seasons table with a foreign key
    into tv_shows, an episodes table with a foreign key into the seasons table, and reviews and
    cast_members tables with foreign keys into the episodes table. So to get all the information
    for a tv show, you're looking at a five-table join.
    Another way to think about modeling this data, though, is as nested hashes, or dictionaries -
    as one big nested key/value data structure. So a TV show is a hash, and within that, an array
    of seasons, each of which is a hash. Within each season, an array of episodes, each of which
    is a hash, and so on. That's how you’d model this in Mongo - each TV show is a document
    that contains within it all of the information we need for one show.
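    To make that concrete, here is roughly what that five-table join looks like in SQL. This is just a sketch - the table and column names are illustrative, not an actual schema from that project:

      SELECT *
      FROM tv_shows
        JOIN seasons           ON seasons.tv_show_id      = tv_shows.id
        JOIN episodes          ON episodes.season_id      = seasons.id
        LEFT JOIN reviews      ON reviews.episode_id      = episodes.id
        LEFT JOIN cast_members ON cast_members.episode_id = episodes.id
      WHERE tv_shows.title = 'Babylon 5';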


  22. {title: 'Babylon 5',
       seasons: [
         {season_number: '1',
          episodes: [
            {ordinal_within_season: '1',
             title: 'Midnight on the Firing Line',
             reviews: [{...}],
             cast_members: [{...}]
            }
          ]
         }
       ]
      }
    Here's an example document for one TV show, Babylon 5. It's got some metadata up top -
    the title - and then it's got an array of seasons, with the square brackets there, and each
    season is itself a hash with metadata, like the season number, and then an array of episodes.
    And within each episode, some metadata, and then arrays of reviews and cast members
    within that.
    It’s basically...


  23. http://en.wikipedia.org/wiki/File:Cauliflower_Fractal_AVM.JPG
    ...a huge fractal data structure. Sets of sets of sets of sets. And because all of the data we
    need for a TV show is under one document, it's very fast to pull all the required data at once,
    even if there's a lot of it. There’s a TV show in the states called General Hospital that’s been
    running for over 50 seasons now, it has over 12,000 episodes. The time it takes in a relational
    store to do the five-table join to get all that data for 12,000 episodes is not inconsiderable.
    So this is the ideal use case for a document store.
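    For contrast, here's a sketch of the single-document fetch in the mongo shell - collection and field names are assumptions, but the point is that one query returns the whole show:

      db.tv_shows.findOne({title: 'General Hospital'})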
    And looking at this structure of documents made out of nested hashes, social data fits it quite
    reasonably. When you come to a social networking site - here's the same screenshot of
    Diaspora we saw earlier -


  24. [Screenshot: the Diaspora activity stream]
    - there's only one important part of the page, and that's your activity stream. This is the
    query that gets all of the posts from the people you follow, ordered by most recent. Each of
    those posts has nested information within it, like photos, likes, reshares, comments, and
    so forth.
    The nested structure of activity stream data looks similar to what we were looking at with the
    TV shows. Here are the relationships drawn out.


  25. THE RELATIONSHIPS IN SOCIAL DATA
    [Diagram: users → many friends → many posts → many comments and many likes; each comment has one commenter, each like has one liker]
    Users have friends, friends have posts, posts have comments and likes, each comment has
    one commenter and each like has one liker. Relationship-wise, it's not a whole lot more
    complicated than TV shows. And just like TV shows, we want to pull all this data at one time,
    right after the user logs in.
    In a relational store, with the data fully normalized, that is, with no duplicate data, it would be
    a seven-table join to get everything out. It certainly seems like it would be better to store a
    user’s activity stream as one big denormalized nested data structure, rather than doing that
    join every time.
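    Here's a sketch of what that stream query looks like against a hypothetical normalized schema - the table and column names are made up for illustration. Notice that the users table shows up more than once:

      SELECT posts.*, comments.*, likes.*
      FROM users
        JOIN friendships              ON friendships.user_id = users.id
        JOIN users AS friends         ON friends.id          = friendships.friend_id
        JOIN posts                    ON posts.author_id     = friends.id
        LEFT JOIN comments            ON comments.post_id    = posts.id
        LEFT JOIN users AS commenters ON commenters.id       = comments.author_id
        LEFT JOIN likes               ON likes.post_id       = posts.id
      WHERE users.id = 42
      ORDER BY posts.created_at DESC;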


  26. looking over a cliff
    http://www.flickr.com/photos/ccheviron/3602582959/
    And so, four students from NYU, after reading some blog posts from people at Etsy and Flickr
    and Facebook, chose MongoDB for storage of their social data. It was not an unreasonable
    choice at the time, given the information they had.
    But there is one really important difference between the data they were modeling, and
    Mongo’s ideal use case, that they didn’t notice in their evaluation.


  27. THE IDEAL CASE
    [Diagram: tv_shows → many seasons → many episodes → many reviews and many cast_members]
    In the ideal case - the TV show example - each of these domain objects is a different type.
    But with our social relationships -


  28. THE REAL WORLD
    [Diagram: the same relationships, but friends, commenters, and likers are highlighted as the same type - they are all users]
    - some of these domain objects are the same type. In fact, all of these green boxes are the
    same type - they are all Diaspora users. A user has friends, and the friends may themselves
    be users, or they may not, because it's a distributed system (that's a whole layer of
    complexity that I'm kind of skimming over...). And in the same way, commenters and likers
    may be users.
    This may not seem relevant at first, but this duplication of types makes it much more difficult
    to denormalize a stream into a single document, than it is to denormalize a TV show. And
    that’s because in different places in your documents, you may be referring to the same
    conceptual domain object. Someone who liked a post in your activity stream may also be a
    friend, and may also have commented on a post in your stream.
    This is not impossible to handle in Mongo. One option is...


  29. http://www.flickr.com/photos/yannisag/3175076105
    ...duplication.
    You can copy all the user data you need into every place you need it. So data for an individual
    user is duplicated - and every time they comment on something, and every time they like
    something, every time someone follows them - you make a separate copy of their user data.
    The advantage is that it's present everywhere you need it, and you can still pull the whole
    activity stream back as a single document.
    Here’s what this kind of fully denormalized stream document looks like:


  30. JOE’S FULLY DENORMALIZED STREAM
    {
     name: Joe,
     url: '...',
     stream:
       [{
         user: {name: Jane, url: '...'},
         title: 'today',
         body: 'go fly a kite',
         likes: [
           {user: {name: Lu, url: '...'}},
           {user: {name: Joe, url: '...'}}
         ],
       }]
    }
    Notice that we’ve got copies of user data inlined here. This is Joe’s stream, and it has a copy
    of his user data at the top level. Joe’s stream contains Jane’s post. Joe has clicked the “like”
    button on Jane’s post, so under likes for Jane’s post, we have another copy of Joe’s data.
    I think you can see why this is attractive. All the data you need is already located where you
    need it. I think you can also see why this is dangerous. When someone’s user data changes,
    we need to walk through all the activity streams where they’ve been active and change it in all
    those different places. This is very error-prone.
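    As a sketch of what that fan-out looks like in the mongo shell - and this only covers the copies embedded under likes; collection and field names are assumptions:

      var joeUrl  = 'https://example.com/u/joe';  // hypothetical identifier for Joe
      var newName = 'Joseph';
      db.streams.find({'stream.likes.user.url': joeUrl}).forEach(function (doc) {
        doc.stream.forEach(function (post) {
          post.likes.forEach(function (like) {
            if (like.user.url === joeUrl) { like.user.name = newName; }
          });
        });
        db.streams.save(doc);  // rewrite the whole stream document with the edited copies
      });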
    So the other approach you can take to this problem in Mongo, which is more familiar to those
    of us with a relational background, is to store references to users in the activity stream
    document, rather than the actual user data.
    So instead of inlining this user data wherever you need it, ...


  31. JOE’S SLIGHTLY MORE NORMALIZED STREAM
    {
     id: 1,
     name: Joe,
     url: '...',
     stream:
       [{
         user: 2,
         title: 'today',
         body: 'go fly a kite',
         likes: [
           {user: 3},
           {user: 1}
         ],
       }]
    }
    ...you give each user an ID, here at the top - as an aside, Mongo actually uses BSON IDs,
    which are strings sort of like GUIDs, but to make the slides easier to read I’m just using
    integers. So once users have IDs, we store the user’s ID every place that we were previously
    inlining data.
    So this eliminates our duplication problem - now, when user data changes, there’s only one
    place we have to change it. However, now we can't get all the information we need to
    construct an activity stream from a single document. Our efficiency has been reduced, and
    our complexity has been increased, because now to construct the activity stream, we have to
    retrieve the stream document, and then go back and retrieve all the user documents that the
    stream references, to get their user data.
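    A sketch of that two-step fetch in the mongo shell, using the IDs from the slide (collection and field names are assumptions):

      var streamDoc = db.streams.findOne({id: 1});  // Joe's stream
      var userIds = [];
      streamDoc.stream.forEach(function (post) {
        userIds.push(post.user);                    // the post's author
        post.likes.forEach(function (like) { userIds.push(like.user); });
      });
      var users = db.users.find({id: {$in: userIds}}).toArray();
      // ...and stitching those users back into the posts and likes happens in application code.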
    And what you really want, here...


  32. joins
    http://www.flickr.com/photos/evaekeblad/2351025393/
    ...is a join - one query that mashes together the activity stream and all the users that the
    stream references. SQL is great at joins. Mongo is not. It just plain doesn’t do them. So you
    end up manually doing that mashup in your application code, instead.
    In contrast when you're looking at this set of relationships...


  33. THESE DOCUMENTS ARE SELF-CONTAINED
    [Diagram: tv_shows → many seasons → many episodes → many reviews and many cast_members]
    ...nothing here has that type of complexity. Because all these boxes are different entities, the
    entire query can be denormalized into one document with no duplication and no references.
    In this document database, there are no links between documents. It requires no joins.
    But on a social network, nothing is that self-contained.


  34. [Screenshot: the Diaspora activity stream]
    When you come to a site like this, any time you see something that looks like a name or a
    picture - these things here - you expect to be able to click on it and go see that user. Go see
    their profile, go see their posts! In a TV show application, it doesn't really happen that way. If
    you're on season 1 episode 1 of Babylon 5, you don't expect to be able to click through to
    season 1 episode 1 of General Hospital.
    TV shows have no links between the documents. But social data has a lot of linking between
    the documents, and in many ways, that’s the most important data in the whole system. This
    linking is inherent in the nature of social data, because it's all about who's doing it and
    what’s your relationship to them.
    Once we started doing ugly joins manually in the Diaspora code, we knew it was ...


  35. http://www.flickr.com/photos/davida3/939738862/
    ...the first sign of trouble. This definitely looks like trouble. It certainly doesn’t look like some
    place you want to in...sert... your credit card, right? Maybe it’s not malicious! Maybe they just
    need a dictionary. But it’s definitely a red flag.
    The moment that you start storing references to other objects in mongo, that’s a red flag. It's
    a sign that your data actually is relational, that there's value to that structure, and that you are
    going against the basic concept of a document data store.
    Because when Mongo says it’s about documents...


  36. http://www.flickr.com/photos/horrgakx/2963449465
    ...you should really believe it. And when they say "document," in many ways they mean
    something you can print out on a piece of paper and hold. It's a self-contained piece of
    information. It may have a lot of internal structure - it may have headings and subheadings
    and paragraphs and footers - but it doesn't have links to other things, and it doesn't have
    references to other things. It's a self-contained piece of semi-structured data.
    So if your data looks like that, you’ve got documents, and then it's a good use case for
    Mongo. If there’s value in the links between documents - then you don’t actually have
    documents. And therefore it’s probably not a good use case for Mongo.


  37. WHAT THEY ACTUALLY MEAN
    “Social data is not relational.”
    1. It’s more naturally modeled as a graph - TRUE
    2. It’s faster to retrieve when denormalized - TRUE
    And it turns out, that when people say "social data isn't relational," that's not actually what
    they mean. What they mean is one of these things:
    Sometimes, they mean that conceptually, social data is more of a graph than a set of tables.
    And this is true. But there are actually very few concepts in the world that are naturally
    modeled as normalized tables. We use that structure because it's efficient, because it avoids
    duplication, and because when it is slow, we know how to fix it.
    And other times, they mean that it will be faster to get all the data for a social query when it's
    denormalized into a single document. And this is also true. When your social data is in a
    relational store, you need a many-table join to extract the activity stream for a particular user,
    and that does get slow as your tables get bigger.
    But there is a well-known solution to this problem, and it’s called caching. I loved Neha’s talk
    this morning on this subject, and I've actually seen lots of projects that cache denormalized
    activity stream data into a document database like Mongo. Once you do that, hey! The only
    problem you have is cache invalidation! And, really, how hard could that be?


  38. CACHE INVALIDATION
    “There are only two hard problems in computer science: cache invalidation and naming things.”
    PHIL KARLTON
    Ha. Yeah, it turns out cache invalidation is actually pretty hard.
    Phil Karlton is the guy who wrote most of SSL version 3, X11, and OpenGL, so he knows a
    thing or two about computer science. Here’s my favorite variation on this by the way:


  39. CACHE INVALIDATION
    “There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.”
    UNKNOWN SMART*SS
    This one is not attributed to Phil Karlton.
    But what is cache invalidation, anyway, and why is it so hard?
    Well, cache invalidation is just knowing when a piece of your cached data is out of date, and
    needs to be updated or replaced. Here's a typical example that I see every day in web
    applications:


  40. WRITES WITH A CACHE
    [Diagram: a user with 2 followers posts (title “halo”, body “whirled”); step 1 writes the post to the backing store, step 2 appends it to follower 1’s and follower 2’s cached activity streams; read requests hit the cache]
    We have a standard backing store, call it, say, “postgres,” and in front of it we have a cache
    layer, call it, say, “memcached.” Application read requests for a user’s activity stream go to
    the cache rather than the database directly.
    Application writes are more complicated. Let's say a user with two followers writes a new
    post. The first thing that happens [1] is that the post data is copied into the backing store.
    Once that completes, a background job [2] appends that post to the cached activity stream of
    both of the users who follow the author. Neha talked about this this morning, that Twitter
    holds the user's activity stream in an in-memory cache, and appends to it when someone that
    user follows posts something. And that’s a very common approach, even for smaller
    applications that employ some kind of activity stream.
    When the author in this example changes an existing post, for example updating the title, the
    write process is essentially the same as a create, except instead of appending to the cache, it
    updates an item that’s already there.
    But…what happens if that step 2 background job fails partway through? It’s bound to happen.
    Machines get rebooted, partitioned, deployed to, all the time. And chances are, when that
    happens, you'll have inconsistent data in your cache. Some copies of the post will have the old
    title, and some copies will have the new title. That's a hard problem, but it's not impossible to
    solve, because with a cache, there's always...


  41. http://www.flickr.com/photos/andyz/3856834913
    ...the nuclear option. You can always delete the entire activity stream record out of your cache
    and regenerate it from your consistent backing store. It may be slow, but at least it’s possible.
    But what if...


  42. WRITES WITH A CACHE
    [Diagram: the same flow with no backing store - the post goes straight into follower 1’s and follower 2’s cached activity streams, and read requests hit the cache]
    ...there is no backing store? What if you skip step 1? When Mongo is all you have - it's like
    having a cache with no backing store behind it. It will become inconsistent. Not eventually
    consistent - just plain, flat-out inconsistent - and at that point, you have no options. Not
    even a nuclear one.
    There is no way to regenerate the data in a consistent state.


  43. PRO TIP
    Database != Cache
    Oh sh*t. What now?
    So when we stored our social data in Mongo, we were conflating a database with a cache.
    Databases and caches are very different things. They have very different ideas about
    permanence and transience, duplication and references, integrity, and speed.
    So once we figured out that we had accidentally chosen a cache for our database...what did
    we do about it?
    Well, that's the million dollar question. Which is, I looked it up, 622,704 pounds, according to
    today's exchange rate. But if that's the million-dollar question, then I think we already
    answered the billion-dollar question. In the last few slides, I talked about how we used
    Mongo, versus how it was designed to be used, as though it were all obvious, and we just
    failed to research adequately before choosing.
    But it wasn’t obvious at all. Mongo docs tell you what it’s good at, without emphasizing what
    it’s not good at. Naturally, right? All projects do that. So it took us quite a while to figure out
    that we were using Mongo the wrong way. It took about six months, actually. A lot of user
    complaints about inconsistent and lost data, and a lot of investigation.


  44. CONVERSION PROCESS
    [Diagram: the social-data relationships again (users, friends, posts, comments, likes, commenter, liker), with comments and likes marked as the starting point]
    The data conversion itself was straightforward in comparison. I'll run you through our
    approach, which I've used on a bunch of other conversion projects since this one.
    To pull data out of Mongo and put it in a SQL store, you need to normalize and de-duplicate
    it. And you want to start with the leaf nodes in your dependency graph. Let's look at what that
    means. Here our leaf nodes are commenters and likers, but they aren’t really leaf nodes,
    because they’re users, and users are actually the root of this whole mess.
    So the leaf nodes from a dependency standpoint are actually comments and likes. Let's start
    with likes.


  45. HOW TO CONVERT DATA
    1. Export
    2. Convert
    3. Normalize
    4. Import
    4a. Repeat
    5. Process
    Once you've identified the data to be extracted, the procedure is:
    1. EXPORT
    2. CONVERT
    3. NORMALIZE
    4. IMPORT
    4A. REPEAT
    5. PROCESS
    I break it into these steps primarily because they are easy to test independently. Let’s go
    through each one.


  46. STEP ONE
    1. Export
    {‘likes’: [
      {‘post_bson_id’: 1234, ‘user_bson_id’: 5678},
      {‘post_bson_id’: 1234, ‘user_bson_id’: 9012},
    ]}
    The first step is to write a map reduce in JavaScript that dumps all the like data to JSON, one
    JSON hash per like. You need two things for each like: the post's BSON ID, the user's BSON ID.
    There’s no other metadata - likes are essentially just a join table.
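    A minimal sketch of that export in the mongo shell - written as a simple cursor walk rather than a full map reduce, and assuming likes are embedded in post documents; field names are assumptions:

      db.posts.find({}, {likes: 1}).forEach(function (post) {
        (post.likes || []).forEach(function (like) {
          printjson({post_bson_id: post._id, user_bson_id: like.user_id});
        });
      });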


  47. STEP TWO
    2. Convert
    {‘likes’: [
      {‘post_bson_id’: 1234, ‘user_bson_id’: 5678},
      {‘post_bson_id’: 1234, ‘user_bson_id’: 9012},
    ]}

    post_bson_id, user_bson_id
    1234, 5678
    1234, 9012
    Step two is to convert that JSON to a CSV file, one row per like.


  48. STEP THREE
    3. Normalize
    id            - auto-increment primary key
    post_bson_id  - from Mongo
    user_bson_id  - from Mongo
    post_id       - initially blank
    user_id       - initially blank
    Step three is to normalize.
    Create a likes table in your relational store with five columns:
    id, which is your auto increment primary key,
    post_bson_id and user_bson_id, which are the two IDs you exported from Mongo, and
    post_id and user_id, initially blank, which will eventually contain the foreign keys into those
    tables that don’t actually exist yet.
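    In MySQL that table might look like this - a sketch, assuming the BSON IDs are stored as 24-character strings:

      CREATE TABLE likes (
        id           INT AUTO_INCREMENT PRIMARY KEY,
        post_bson_id VARCHAR(24),
        user_bson_id VARCHAR(24),
        post_id      INT,   -- filled in during the process step
        user_id      INT    -- filled in during the process step
      );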


  49. STEP FOUR
    4. Import
    LOAD DATA INFILE
    Step four is to load that CSV into your relational store. It’s a pretty basic LOAD DATA INFILE.
    At this stage, only id, post_bson_id, and user_bson_id will be present in each row.
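    A sketch of that load, assuming the CSV from step two lives at /tmp/likes.csv and has a header row:

      LOAD DATA INFILE '/tmp/likes.csv'
        INTO TABLE likes
        FIELDS TERMINATED BY ','
        IGNORE 1 LINES
        (post_bson_id, user_bson_id);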


  50. STEP FOUR A
    4a. Repeat
    id                       - auto-increment primary key
    bson_id                  - from Mongo
    user_bson_id             - from Mongo
    text (+ other metadata)  - from Mongo
    Step 4A is to repeat the process with posts so that we can populate the likes table’s post_id
    field.
    We now have a table for likes, the leaf node, so we go up one level to its container - posts -
    and repeat steps 1 through 4.
    We make a posts table with an auto-increment primary key, a bson_id column containing the
    post's BSON ID, a user_bson_id column containing the BSON ID of the author, and of course,
    all the actual data for a post like date and text.


  51. STEP FIVE
    5. Process
    UPDATE likes, posts
      SET likes.post_id = posts.id
      WHERE posts.bson_id = likes.post_bson_id;

    ALTER TABLE likes
      DROP COLUMN post_bson_id;

    ALTER TABLE likes
      MODIFY post_id INT NOT NULL;
    Step 5 is to process. This is the hard one usually. Write a SQL statement similar to this [first
    statement] that copies post_ids into the likes table based on matching up the BSON IDs.
    Once you’ve done that, you can delete the post_bson_id column from the likes table [second
    statement].
    But I recommend before you do that, you do this [third statement]: make likes.post_id not
    null. If you’re lucky, this will actually work the first time, meaning that every row in the likes
    table actually has a post_id.
    I’ve never been lucky. Sometimes when it doesn’t work, it means we didn’t import all the
    posts. Sometimes it means we have likes that refer to a post that has been deleted. This is
    where we do the drudge work of actually figuring out how to make the data consistent.
    Once you’ve got that (NO PROBLEM AMIRITE), go back and repeat for users.
    This process is sadly the best case scenario.


  52. WHAT IF YOU’RE MORE FULLY DENORMALIZED?
    1. Export
    2. Convert
    3. Normalize (skipped)
    4. Import
    4a. Repeat (skipped)
    5. Process - A LOT
    The process I just described assumes that everything you need to refer to in the normalized
    version of your schema already has a BSON ID inside of Mongo. That is, it assumes you've
    gotten to the point where you're referring to duplicated objects using Mongo's BSON IDs,
    rather than actually duplicating the data.
    If that's not the case - if you're copying data - then it's even harder. At that point you need to
    figure out which instances of the copied data are referring to the same conceptual instance,
    and then reconcile the inevitable conflicting data. Usually the easiest way to do this is to:
    1. export all the data, duplicates, conflicts, and all, out of Mongo,
    2. convert it to a CSV,
    3. skip normalizing for the moment,
    4. import it into a temporary table in your SQL store,
    4A. no repeating, and then finally,
    5. figure out which SQL statements you need to transform that mess of a temporary table into
    the consistent version in the final table. Then you can delete the temporary table.
    Most projects need a combination of these two approaches. The relative proportion depends
    on how the project has constructed their document store.
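    For the temporary-table variant, the step 5 SQL ends up doing the de-duplication itself. A sketch, with made-up table and column names, that collapses duplicated user copies into one row per BSON ID (MAX() here is just an arbitrary way to pick a winner when copies conflict):

      INSERT INTO users (bson_id, name, url)
      SELECT user_bson_id, MAX(name), MAX(url)
      FROM tmp_users
      GROUP BY user_bson_id;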


  53. THE DAMAGE
    8 months of production data
    1.2 million rows
    4 pair-weeks of development
    2 hours of downtime
    So here’s the damage - we had 8 months of production data. It turned into about 1.2 million
    rows all told in our SQL store. We spent four pair-weeks on development, and had about 2
    hours of downtime.
    That was more than acceptable for a project that was in pre-alpha. We could have actually
    reduced that downtime more, but we had budgeted for 8 hours of downtime, so 2 actually
    seemed fantastic.
    All in all -


  54. NOT BAD http://i.imgur.com/4aRuE.png
    Not bad for a project that was pre-alpha at the time. Barack, Michelle, and grumpy cat all
    approve.
    I want to leave you with one final thought. Remember this project?


  55. EPILOGUE
    [Diagram: tv_shows → many seasons → many episodes → many reviews and many cast_members]
    The TV shows? This is the perfect use case for Mongo - each show is one document, perfectly
    self-contained. No references to anything, no duplication, and no way for the data to become
    inconsistent.
    This is the second time I’ve given this talk. About a month after the first time I gave it, this
    project was still humming along nicely on Mongo, and then one Monday, in the weekly
    planning meeting, the client brought up a new feature that one of their investors wanted:
    when they were looking at an episode, and looking at all the cast members - the actors that
    were in that episode - they wanted to be able to click on an actor’s name and see that
    person’s entire television career. A chronological listing of all of the episodes of all the
    different shows they had ever been in.
    Since each show and all of its information including cast members were individual
    documents, to implement this, we would have to search through every document to find and
    de-duplicate instances of the actor they clicked on. At a minimum, we would have had to
    maintain an external index of actor information, which would have the same invalidation
    issues as any other kind of cache.
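    For a flavor of what that lookup forces, here's a sketch in the mongo shell - you scan every matching show document and pick out the cast members in code; the cast-member field names are assumptions:

      var actorName = '...';  // the actor the user clicked on
      db.tv_shows.find({'seasons.episodes.cast_members.name': actorName}).forEach(function (show) {
        show.seasons.forEach(function (season) {
          season.episodes.forEach(function (episode) {
            episode.cast_members.forEach(function (member) {
              if (member.name === actorName) {
                print(show.title + ' S' + season.season_number + 'E' + episode.ordinal_within_season);
              }
            });
          });
        });
      });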


  56. EPILOGUE
    [Diagram: tv_shows → many seasons → many episodes → many reviews and many cast_members]
    The client expected this feature to be trivial. And if the data had been in a SQL store, it would
    have been. As it was, we first tried to convince the PM that they didn’t need it, then we
    offered some alternatives, such as linking to an IMDB search for the actor's name, but of
    course, since this company made money from advertising, they wanted users to stay on their
    site rather than going off to IMDB.
    This feature request eventually prompted this project’s conversion to postgres. Because after
    a lot more conversation with the client, we realized that the business saw lots of value in
    linking TV shows together - this is one of them, but also things like related shows - "if you
    like this, you may also like" - these other shows. See other shows this director has been
    involved with. See episodes of other shows that were released the same week this episode
    was.
    This was ultimately...


  57. http://www.d80.co.uk/image.axd?picture=2011%2F3%2FTHE_SOFTWARE_DEVELOPMENT_PROJECT.jpg
    ... a communication problem rather than a technical problem. If we had had these
    conversations sooner, if we had taken the time to really understand how the client saw the
    data and what they wanted to do with it, we probably would have done the conversion earlier,
    when there was less data, and it was easier.


  58. FINAL PRO TIP
    Structure < Value
    So that's what I want to leave you with. My last pro tip. The structure of the data is important
    when you're picking a data store, but more important than structure is the value of the data.
    You need to understand where in your data - and where in its connections - the value lies.
    Because if it looks like you have documents but there is value in the links between them - you
    don't actually have documents.
    Ultimately you want to choose a data store...


  59. OK, ONE MORE
    Make valuable things easy.
    ... that makes valuable things easy.
    And my very last pro tip -


  60. http://www.flickr.com/photos/[email protected]/4937109161
    ...you should really consider those helmets and shoulder pads and stuff, you guys, because
    this looks super dangerous. We need all your IQ points.


  61. Thank you!
    Sarah Mei
    Ministry of Velocity
    @sarahmei
    http://www.tehcute.com/baby-turtle-eats-strawberry.html
    And that’s all I’ve got. Thank you very much!
