10 Common Misconceptions about Apache CouchDB

Joan Touzet
November 13, 2013

Learn about the common mistakes people make when they first try Apache CouchDB, and what designs work best with its architecture.

Transcript

  1. 10 Common Misconceptions
    about Apache CouchDB
    Joan Touzet - @wohali – http://www.atypical.net/

  2. First…
    This presentation is a bit dry.
    But these problems keep coming up
    on mailing lists, IRC, and in discussion
    with developers.
    Today, I’d rather inform than
    entertain.

  3. The Platform
    Source: Wikimedia Commons

  4. 1. “Every cell of my SQL table is now a doc”
    Even “every row is now a doc” is a rough approximation.
    RDBMS Normal Forms do not strictly apply to CouchDB or
    NoSQL.
    I had a big analysis here of Normal Forms and CouchDB but it
    doesn’t fit into a 20 minute slot…

  5. What are the objectives of normalization?
    1. To free the collection of relations from undesirable insertion, update
    and deletion dependencies;
    2. To reduce the need for restructuring the collection of relations, as new
    types of data are introduced, and thus increase the life span of
    application programs;
    3. To make the relational model more informative to users;
    4. To make the collection of relations neutral to the query statistics,
    where these statistics are liable to change as time goes by.
    Source: Codd, E.F. “Further Normalization of the Data Base Relational Model,” ACM Trans. on DB Systems, 1971.

  6. Do these still apply for CouchDB?
    1. To free the collection of relations from undesirable
    insertion, update and deletion dependencies;
    CouchDB is ACID compliant per single document
    operation.
    Keep this in mind when designing your data model.

  7. Do these still apply for CouchDB?
    2. To reduce the need for restructuring the collection of
    relations, as new types of data are introduced, and
    thus increase the life span of application programs;
    Just create a new document.
    type is a near-universal document field.

  8. Do these still apply for CouchDB?
    3. To make the relational model more informative to
    users;
    The document model replaces the relational
    model.
    The spirit is met: docs should be self-contained,
    self-describing, & stand alone.

  9. Do these still apply for CouchDB?
    4. To make the collection of relations neutral to the
    query statistics, where these statistics are liable to
    change as time goes by.
    Views are independent of the document
    structure.

  10. Recommendation
    Design your database around access patterns, not
    warehousing.
    1. Record everything as it comes in, as a new doc
    2. Timestamp everything*
    3. Highly relational data may be a bad fit for
    CouchDB.
    *Do not trust client-side time stamps to give you absolute event ordering!
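The "record everything as a new doc, timestamp everything" advice can be sketched as a small helper that builds an append-only event document. This is only an illustration: the helper name and the field names (`entity`, `recorded_at`, `payload`) are assumptions, not CouchDB conventions, and the timestamp is meant to be taken on a machine you control rather than on the client.

```javascript
// Hypothetical helper: builds one immutable event document per incoming fact.
// Field names other than "type" are made up for this sketch.
function makeEventDoc(type, entity, payload, now) {
  return {
    type: type,                                      // the near-universal "type" field
    entity: entity,                                  // what the event is about
    recorded_at: (now || new Date()).toISOString(),  // stamped by the server, not the client
    payload: payload
  };
}

// Example: a new reading becomes a new doc instead of an update to an old one.
var reading = makeEventDoc("reading", "sensor-1", { celsius: 21 }, new Date(0));
```

No `_id` is set here; POSTing such a document to the database lets CouchDB assign a UUID.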

  11. Document Access & Design
    Source: Itoya, Ginza, Tokyo, Japan

  12. 2. “Cool, attachments!”
    Large attachments can create performance issues, especially
    for replication.
    – Replication of other documents may get held up by a big one
    • This can be catered for via filtered replication
    – Large files can rapidly eat available disk space
    – >1GB attachments are not a first-order design scenario.
    Attachments are also not available to view servers.
    You wouldn’t store video files as BLOBs in Oracle, would you?
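As a rough sketch of the filtered replication mentioned above: a filter function lives in a design document and is called with `(doc, req)`; only documents for which it returns true are replicated. The database names, host, and the design document name `repl` below are made up for illustration.

```javascript
// Filter function as it might be stored under "filters" in a design document.
// Returning false keeps documents that carry attachments out of this replication.
function noAttachments(doc, req) {
  return !doc._attachments;
}

// Hypothetical body for POST /_replicate; source, target, and the
// "repl/no_attachments" filter name are assumptions for this sketch.
var replicationRequest = {
  source: "files",
  target: "http://replica.example.com:5984/files",
  filter: "repl/no_attachments"
};
```

With a filter like this, documents holding large attachments can be moved by a separate replication, so they do not hold up the small ones.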

  13. Recommendation
    Use Couch doc _ids/UUIDs to tag large assets
    Stash them in a CDN, S3, Riak, Dropbox, BitTorrent…

  14. 3. “What’s a conflict / 409 HTTP error?”
    For a single, never-replicated CouchDB:
    Someone updated the doc while you weren’t looking.
    The database is not conflicted, you are. ☺
    Repeat your GET-modify-PUT loop.
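A minimal sketch of that GET-modify-PUT loop follows. To keep it runnable without a server, it uses a tiny in-memory stand-in (`FakeDb`, entirely hypothetical) that mimics the one behaviour that matters here: a PUT carrying a stale `_rev` is rejected with status 409. In a real client, `get` and `put` would be HTTP calls.

```javascript
// In-memory stand-in for one CouchDB database (illustration only).
function FakeDb() {
  this.docs = {};
}
FakeDb.prototype.get = function (id) {
  return JSON.parse(JSON.stringify(this.docs[id])); // return a copy
};
FakeDb.prototype.put = function (doc) {
  var current = this.docs[doc._id];
  if (current && current._rev !== doc._rev) {
    var err = new Error("conflict");
    err.status = 409; // someone updated the doc while you weren't looking
    throw err;
  }
  var generation = current ? parseInt(current._rev, 10) + 1 : 1;
  var stored = JSON.parse(JSON.stringify(doc));
  stored._rev = generation + "-fake";
  this.docs[doc._id] = stored;
  return stored._rev;
};

// GET-modify-PUT with retry: on 409, fetch the latest revision and reapply the change.
function updateWithRetry(db, id, mutate, maxAttempts) {
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    var doc = db.get(id);
    mutate(doc);
    try {
      return db.put(doc);
    } catch (err) {
      if (err.status !== 409) throw err; // only conflicts are retried
    }
  }
  throw new Error("gave up after " + maxAttempts + " attempts");
}
```

The important design point is that `mutate` is reapplied to the freshly fetched document on every attempt, so the other writer's change is never silently overwritten.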

  15. 3. “What’s a conflict / 409 HTTP error?”
    For multiple, replicating CouchDB servers:
    Multiple Couches may have different views of the DB.
    Replication will reconcile differences, but will leave behind
    traces of the disagreement.
    These traces are your document conflicts. Look at them.

  16. My personal recommendation
    Design to avoid conflicts at all costs.
    Consider creating a new document for every change.
    Views will then help you find the latest info.
    For more info:
    http://guide.couchdb.org/v1/conflicts.html
    http://docs.couchdb.org/en/latest/intro/consistency.html

  17. 3b. “Conflicts don’t happen in my design.”
    Check anyway.
    GET /db/bob?conflicts=true
    and look for a "_conflicts" field listing the conflicting revisions.
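A small sketch of that check: when a document is fetched with `?conflicts=true`, a conflicted document carries a `_conflicts` array of revision IDs, and an unconflicted one has no such field. The helper name and the sample revision IDs below are made up.

```javascript
// Returns the conflicting revision IDs from a document body fetched with
// ?conflicts=true, or an empty array when the document is not conflicted.
function conflictRevs(doc) {
  return Array.isArray(doc._conflicts) ? doc._conflicts : [];
}

// Made-up example bodies, shaped like responses to GET /db/bob?conflicts=true.
var clean = { _id: "bob", _rev: "3-aaa", name: "Bob" };
var conflicted = { _id: "bob", _rev: "3-bbb", name: "Bob", _conflicts: ["3-aaa"] };
```

Each revision listed can then be fetched with `GET /db/bob?rev=...`, merged by the application, and the losing revisions deleted.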

  18. 4. “What happened to my doc revisions?”
    Repeat after me: "MVCC tokens are not a revision control system"
    CouchDB does not keep all document history forever.
    Compaction will remove all but the latest + conflicts.
    Replication will not replicate all historical versions.
    – In fact, it transfers every leaf and its body, and just the path of revisions that lead to that leaf, but
    only up to _revs_limit revisions per leaf.

  19. Recommendations
    If you need it, keep everything. Don’t overwrite/delete it.
    Write a new document for every single transaction.
    Views can then help you find the latest info.
    In short, design for your access pattern.
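One way to picture "views can then help you find the latest info" is a view keyed by `[entity, timestamp]`, queried in descending order with a limit of one. The sketch below emulates that locally so it can run without CouchDB; the document fields (`type: "change"`, `entity`, `recorded_at`) and function names are assumptions for illustration.

```javascript
// Map function for a hypothetical view: one row per change document.
// Inside CouchDB, emit is a global; it is a parameter here so the sketch runs standalone.
function mapChanges(doc, emit) {
  if (doc.type === "change") {
    emit([doc.entity, doc.recorded_at], null);
  }
}

// Local emulation of asking that view for the newest row of one entity.
// Against CouchDB, the equivalent query would use descending=true and limit=1,
// with startkey=[entity, {}] and endkey=[entity].
function latestChangeId(docs, entity) {
  var rows = [];
  docs.forEach(function (doc) {
    mapChanges(doc, function (key, value) {
      rows.push({ id: doc._id, key: key, value: value });
    });
  });
  var mine = rows.filter(function (row) { return row.key[0] === entity; });
  // ISO 8601 UTC strings in one format sort chronologically as plain strings.
  mine.sort(function (a, b) {
    return a.key[1] < b.key[1] ? 1 : a.key[1] > b.key[1] ? -1 : 0;
  });
  return mine.length ? mine[0].id : null;
}
```

Nothing is ever overwritten in this design, so the full history stays available and conflicts on a single hot document are avoided.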

  20. Document Views/Queries
    Source: Graffiti as captured by Google Earth, Tokyo, Japan

  21. 5. “Views are hard / Couch has no SELECT”
    Multiple variations on this theme:
    “I’ll just treat CouchDB as a key/value store and do all my
    own queries client-side.”
    “I have highly relational data and views aren’t powerful
    enough.”
    “I need keyword / full text search.”

  22. Recommendations
    MapReduce covers many query needs.
    It is not SQL SELECT. Still, it is very powerful.
    Think about your access patterns up front.
    couchdb-lucene and Cloudant offer full-text search.

  23. Worked example: directory tree-view
    map:
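    function(doc) {
      // Only index file documents that actually carry a path.
      if (doc.type === "file" && typeof doc.path === "string") {
        // Strip one trailing slash so "Music/" and "Music" yield the same key.
        var raw_path = doc.path.substr(-1) === "/" ? doc.path.slice(0, -1) : doc.path;
        // Emit the path as an array of components so group_level can roll up by depth.
        emit(raw_path.split('/'), 1);
      }
    }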
    reduce:
    _sum
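To see what this map and `_sum` pair produces, here is a rough local emulation of grouping by `group_level`. It is a sketch, not how CouchDB computes reductions internally: the map function is adapted from the slide with `emit` passed in as a parameter, and the sample documents are made up.

```javascript
// Map function adapted from the slide; emit is a parameter so it can run outside CouchDB.
function mapTree(doc, emit) {
  if (doc.type === "file") {
    var rawPath = doc.path.slice(-1) === "/" ? doc.path.slice(0, -1) : doc.path;
    emit(rawPath.split("/"), 1);
  }
}

// Emulates _sum with group_level: truncate each key to `level` components
// and add up the values that share the truncated key.
function groupSum(docs, level) {
  var totals = {};
  docs.forEach(function (doc) {
    mapTree(doc, function (key, value) {
      var group = JSON.stringify(key.slice(0, level));
      totals[group] = (totals[group] || 0) + value;
    });
  });
  return totals;
}

// Made-up sample documents.
var sampleDocs = [
  { type: "file", path: "Docs/report.txt" },
  { type: "file", path: "Music/The Beatles/" },
  { type: "file", path: "Music/The Beatles/Norwegian_Wood.mp3" },
  { type: "file", path: "Music/readme.txt" },
  { type: "note", path: "ignored" }
];
```

Calling `groupSum(sampleDocs, 1)` counts entries per top-level directory, and a level of 2 counts them per second-level path, which is the "stats by depth" idea behind the tree view.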

  24. Worked example: tree view stats
    GET /files/_design/group/_view/tree_view?
    group_level=2 (depth)
    &startkey=["Docs"]&endkey=["Music"] (split directory paths)
    {"rows":[
    {"key":["Docs"],"value":108},
    {"key":["Music"],"value":328}
    ]}

  25. Worked example: dir contents
    GET /files/_design/group/_view/tree_view?
    reduce=false (entries, not stats)
    &startkey=["Music","The Beatles"] (directory of interest)
    &endkey=["Music","The Beatles", {}]
    {"total_rows":8,"offset":3,"rows":[
    {"id":"c4d985d30df7c30de29f1774e5001baa","key":["Music","The Beatles"],"value":1},
    {"id":"c4d985d30df7c30de29f1774e5002422","key":["Music","The Beatles","I_Wanna_Hold_Your_Hand.mp3"],"value":1},
    {"id":"c4d985d30df7c30de29f1774e50040fb","key":["Music","The Beatles","Magical_Mystery_Tour.mp3"],"value":1},
    {"id":"c4d985d30df7c30de29f1774e500315b","key":["Music","The Beatles","Norwegian_Wood.mp3"],"value":1},
    {"id":"c4d985d30df7c30de29f1774e5004689","key":["Music","The Beatles","Sgt._Peppers_Lonely_Hearts_Club_Band.mp3"],"value":1}, … ]}
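View keys in the query string are JSON values and need URL-encoding when sent by a real client. The sketch below builds the directory-contents query; the helper name `viewUrl` is made up. The trailing `{}` in `endkey` works as a high sentinel because objects sort after strings and arrays in CouchDB's view collation.

```javascript
// Builds a view URL, JSON-encoding and URL-encoding every parameter value.
function viewUrl(base, params) {
  var parts = Object.keys(params).map(function (name) {
    return name + "=" + encodeURIComponent(JSON.stringify(params[name]));
  });
  return base + "?" + parts.join("&");
}

// Everything stored under ["Music", "The Beatles"], as individual rows.
var dirContentsUrl = viewUrl("/files/_design/group/_view/tree_view", {
  reduce: false,                               // entries, not stats
  startkey: ["Music", "The Beatles"],          // the directory itself sorts first
  endkey: ["Music", "The Beatles", {}]         // {} collates after any file name
});
```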

  26. 6. “SHOW/LIST is easy server-side HTML”
    (A Geo Metro with a trailer hitch is not an ideal towing vehicle.)

  27. Recommendations
    Views are best at MapReduce.
    Render HTML in the browser, or use a full server-side
    framework.
    SHOW/LIST is a last resort for legacy clients that expect CSV,
    XML, etc. It is not for rendering images, HTML, etc.
    To me, SHOW/LIST is ugly and should probably be deprecated.

  28. 7. “View groups are for developer convenience”
    Each view group becomes a separate view server
    process when accessed.
    8 view groups means 8 couchjs processes
    Use this to your advantage! (# CPUs/cores)

  29. Replication
    Source: Memory Base Alpha

  30. 8. “Cool, so I now have a message queue, right?”
    Replication makes no guarantee of delivery, timeliness, or that
    all updates occur in the source’s _changes feed order.
    It’s more of a mailbox with weak ordering.
    Strong ordering is hard, if not impossible.
    “…it is impossible to order events with respect to time in a distributed
    system, this means they must be ordered causally.” (Riak docs)
    Partial ordering is possible, via causal relations.
    Read up on Lamport timestamps, vector clocks, etc. for more.
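As a starting point for that reading, here is a minimal Lamport clock sketch. It is a generic illustration of causal ordering, not something CouchDB provides: each node keeps a counter, increments it on every local event or send, and on receive jumps past the timestamp carried by the message.

```javascript
// Minimal Lamport logical clock.
function LamportClock() {
  this.time = 0;
}

// Call for every local event, including sending a message;
// the returned value is the timestamp to attach to that event or message.
LamportClock.prototype.tick = function () {
  this.time += 1;
  return this.time;
};

// Call when a message arrives carrying the sender's timestamp.
LamportClock.prototype.receive = function (remoteTime) {
  this.time = Math.max(this.time, remoteTime) + 1;
  return this.time;
};
```

If event A causally precedes event B, A's timestamp is smaller than B's. The converse does not hold, which is why this gives only a partial ordering; vector clocks go further by detecting concurrent events.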

  31. Recommendations
    You want a message queue? Use one.
    Apache ActiveMQ or Qpid
    RabbitMQ (AMQP)
    …or heck, CICS / IBM WebSphere MQ
    You want eventually-consistent ordered info? Couch.

  32. 9. “I just need replication, not Couch. I’ll write my own.”
    It’s not easy.
    To get all the edge cases right, and to be interoperable,
    it’ll take you months of effort.
    This is worse if you attempt it with flat files.
    Just ask how long it took PouchDB to get it right!

  33. 10. “I need High Availability, so I’ll use replication.”
    No, you want the BigCouch merge and HAProxy/nginx:
    1. Automatic database and view sharding
    2. Optimized internal replication
    3. Tunable Dynamo-like parameters
    4. Lots more I don’t have time to talk about

  34. Thank you for listening!
    Joan Touzet - @wohali – http://www.atypical.net/
