10 Common Misconceptions about Apache CouchDB

Joan Touzet
November 13, 2013

Learn about the common mistakes people make when they first try Apache CouchDB, and what designs work best with its architecture.

Transcript

  1. First…

    This presentation is a bit dry. But these problems keep coming up on mailing lists, IRC, and in discussion with developers. Today, I’d rather inform than entertain.
  2. 1. “Every cell of my SQL table is now a doc”

    Even “every row is now a doc” is a rough approximation. RDBMS Normal Forms do not strictly apply to CouchDB or NoSQL. I had a big analysis here of Normal Forms and CouchDB but it doesn’t fit into a 20 minute slot…
  3. What are the objectives of normalization?

    1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
    2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
    3. To make the relational model more informative to users;
    4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
    Source: Codd, E.F. “Further Normalization of the Data Base Relational Model,” ACM Trans. on DB Systems, 1971.
  4. Do these still apply for CouchDB?

    1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
    CouchDB is ACID compliant per single document operation. Keep this in mind when designing your data model.
  5. Do these still apply for CouchDB?

    2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
    Just create a new document. A “type” field is a near-universal document convention.
  6. Do these still apply for CouchDB?

    3. To make the relational model more informative to users;
    The document model replaces the relational model. The spirit is met: docs should be self-contained, self-describing, & stand alone.
  7. Do these still apply for CouchDB?

    4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
    Views are independent of the document structure.
  8. Recommendation

    Design your database around access patterns, not warehousing.
    1. Record everything as it comes in, as a new doc
    2. Timestamp everything*
    3. Highly relational data may be a bad fit for CouchDB.
    *Do not trust client-side time stamps to give you absolute event ordering!
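
The "record everything as a new doc, and timestamp it" advice can be sketched roughly as follows. This is only an illustration: the `reading` document type and the `sensor`, `value` and `recorded_at` fields are hypothetical names, not from the talk, and `emit` is passed in as an argument so the sketch runs outside CouchDB (inside a design document the map function would take only `doc` and call the built-in `emit`).

```javascript
// Hypothetical event document: one new doc per incoming reading.
const exampleDoc = {
  _id: "reading-0001",
  type: "reading",
  sensor: "kitchen",
  value: 21.5,
  recorded_at: "2013-11-13T10:15:00Z" // assumed to be assigned server-side
};

// CouchDB-style map function: key each reading by [sensor, timestamp]
// so that rows for one sensor sort together in time order.
function mapReadings(doc, emit) {
  if (doc.type === "reading" && doc.sensor && doc.recorded_at) {
    emit([doc.sensor, doc.recorded_at], doc.value);
  }
}

// Tiny harness standing in for the view server (illustration only).
function runMap(docs) {
  const rows = [];
  docs.forEach(function (doc) {
    mapReadings(doc, function (key, value) {
      rows.push({ id: doc._id, key: key, value: value });
    });
  });
  return rows;
}
```

With keys shaped like this, a view query restricted to one sensor with `descending=true` and `limit=1` would return that sensor's most recent reading.
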
  9. 2. “Cool, attachments!”

    Large attachments can create performance issues, especially for replication.
    – Replication of other documents may get held up by a big one
      • This can be catered for via filtered replication
    – Large files can rapidly eat available disk space
    – >1GB attachments are not a first-order design scenario.
    Attachments are also not available to view servers.
    You wouldn’t store video files as BLOBs in Oracle, would you?
  10. Recommendation

    Use Couch doc _ids/UUIDs to tag large assets.
    Stash them in a CDN, S3, Riak, Dropbox, BitTorrent…
  11. 3. “What’s a conflict / 409 HTTP error?”

    For a single, never-replicated CouchDB:
    Someone updated the doc while you weren’t looking.
    The database is not conflicted, you are. ☺
    Repeat your GET-modify-PUT loop.
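
A minimal sketch of the GET-modify-PUT loop, under stated assumptions: `getDoc` and `putDoc` are hypothetical helpers supplied by the caller (they are not a CouchDB client API), `putDoc` is assumed to resolve to an object with a numeric `status` (409 on conflict), and the retry limit is arbitrary.

```javascript
// Retry a GET-modify-PUT cycle until the PUT stops returning 409.
// getDoc(id)  -> Promise of the current doc, including its _rev (hypothetical)
// putDoc(doc) -> Promise of { status: <HTTP status code> }       (hypothetical)
async function updateWithRetry(getDoc, putDoc, id, modify, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const current = await getDoc(id);        // GET: read the latest _rev
    const updated = modify({ ...current });  // modify a shallow copy
    const res = await putDoc(updated);       // PUT: carries the _rev we read
    if (res.status !== 409) {
      return { status: res.status, attempts: attempt };
    }
    // 409: someone else wrote first; loop around and re-read.
  }
  throw new Error("still conflicting after " + maxAttempts + " attempts");
}
```

The important design point is that `modify` is re-applied to the freshly fetched document on every attempt, rather than re-sending the stale copy with a new `_rev`.
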
  12. 3. “What’s a conflict / 409 HTTP error?”

    For multiple, replicating CouchDB servers:
    Multiple Couches may have different views of the DB.
    Replication will reconcile differences, but will leave behind traces of the disagreement.
    These traces are your document conflicts. Look at them.
  13. My personal recommendation

    Design to avoid conflicts at all costs.
    Consider creating a new document for every change. Views will then help you find the latest info.
    For more info:
    http://guide.couchdb.org/v1/conflicts.html
    http://docs.couchdb.org/en/latest/intro/consistency.html
  14. 3b. “Conflicts don’t happen in my design.”

    Check anyway.
    GET /db/bob?conflicts=true
    and look for a "_conflicts" field in the response. It lists the conflicting revisions and is absent when there are none.
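
Checking one document at a time does not scale, so a view can do the looking for you. The sketch below follows the commonly documented pattern of testing `doc._conflicts` inside a map function; the function name and the explicit `emit` argument are conveniences so the sketch runs outside CouchDB.

```javascript
// Map function listing every document that currently has conflicts.
// CouchDB exposes the non-winning revisions to map functions as
// doc._conflicts (an array of revision IDs).
function mapConflicts(doc, emit) {
  if (doc._conflicts) {
    // Emit the winning revision followed by the conflicting ones.
    emit(doc._id, [doc._rev].concat(doc._conflicts));
  }
}

// Client-side equivalent for a doc fetched with ?conflicts=true.
function hasConflicts(doc) {
  return Array.isArray(doc._conflicts) && doc._conflicts.length > 0;
}
```

An empty result from such a view is the evidence that conflicts really do not happen in your design.
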
  15. 4. “What happened to my doc revisions?”

    Repeat after me: "MVCC tokens are not a revision control system"
    CouchDB does not keep all document history forever.
    Compaction will remove all but the latest + conflicts.
    Replication will not replicate all historical versions.
    – In fact, it transfers every leaf and its body, and just the path of revisions that lead to that leaf, but only up to _revs_limit revisions per leaf.
  16. Recommendations

    If you need it, keep everything. Don’t overwrite/delete it.
    Write a new document for every single transaction.
    Views can then help you find the latest info.
    In short, design for your access pattern.
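
One way a view can "find the latest info" is a custom reduce that keeps only the newest value per key. This is a sketch under assumptions: the `txn` type and the `account`, `balance` and `recorded_at` fields are hypothetical, and timestamps are assumed to be ISO 8601 strings in a single time zone so that string comparison matches time order.

```javascript
// Map: one row per transaction document, keyed by account.
function mapTxn(doc, emit) {
  if (doc.type === "txn") {
    emit(doc.account, { at: doc.recorded_at, balance: doc.balance });
  }
}

// Reduce: keep the value with the greatest timestamp. The result has the
// same shape as the inputs, so the same logic also serves the rereduce
// pass. CouchDB does not call reduce with an empty values array.
function reduceLatest(keys, values, rereduce) {
  return values.reduce(function (latest, v) {
    return v.at > latest.at ? v : latest;
  });
}
```

Queried with `group=true`, a view built this way would return one row per account holding its most recent balance, while every historical transaction document stays untouched.
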
  17. 5. “Views are hard / Couch has no SELECT”

    Multiple variations on this theme:
    “I’ll just treat CouchDB as a key/value store and do all my own queries client-side.”
    “I have highly relational data and views aren’t powerful enough.”
    “I need keyword / full text search.”
  18. Recommendations

    MapReduce covers many query needs. It is not SQL SELECT. Still, it is very powerful.
    Think about your access patterns up front.
    couchdb-lucene and Cloudant offer full-text search.
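  19. Worked example: directory tree-view

    map:
      function(doc) {
        if (doc.type === "file") {
          var raw_path;
          // Drop a trailing slash so directories and files split the same way.
          if (doc.path.substr(-1) === "/") {
            raw_path = doc.path.slice(0, -1);
          } else {
            raw_path = doc.path;
          }
          // Key is the path split into components, e.g. ["Music", "The Beatles"].
          emit(raw_path.split('/'), 1);
        }
      }

    reduce: _sum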
  20. Worked example: tree view stats

    GET /files/_design/group/_view/tree_view?
        group_level=2                          (depth)
        &startkey=["Docs"]&endkey=["Music"]    (split directory paths)

    {"rows":[
      {"key":["Docs"],"value":108},
      {"key":["Music"],"value":328}
    ]}
  21. Worked example: dir contents

    GET /files/_design/group/_view/tree_view?
        reduce=false                             (entries, not stats)
        &startkey=["Music","The Beatles"]        (directory of interest)
        &endkey=["Music","The Beatles", {}]

    {"total_rows":8,"offset":3,"rows":[
      {"id":"c4d985d30df7c30de29f1774e5001baa","key":["Music","The Beatles"],"value":1},
      {"id":"c4d985d30df7c30de29f1774e5002422","key":["Music","The Beatles","I_Wanna_Hold_Your_Hand.mp3"],"value":1},
      {"id":"c4d985d30df7c30de29f1774e50040fb","key":["Music","The Beatles","Magical_Mystery_Tour.mp3"],"value":1},
      {"id":"c4d985d30df7c30de29f1774e500315b","key":["Music","The Beatles","Norwegian_Wood.mp3"],"value":1},
      {"id":"c4d985d30df7c30de29f1774e5004689","key":["Music","The Beatles","Sgt._Peppers_Lonely_Hearts_Club_Band.mp3"],"value":1},
      …
    ]}
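
The queries in this worked example are shown unencoded for readability. In a real request, `startkey` and `endkey` are JSON values that must also be URL-encoded. The helper below is a hypothetical convenience function (not part of any CouchDB client) sketching that step. The trailing `{}` in the end key works because objects sort after strings in CouchDB's view collation, so it closes the range over everything inside the directory.

```javascript
// Build a view query URL. Key parameters are JSON-encoded, then every
// name and value is URL-encoded. Other parameters are sent as plain text.
function viewUrl(base, params) {
  const jsonParams = ["key", "startkey", "endkey"];
  const query = Object.keys(params).map(function (name) {
    const raw = jsonParams.includes(name)
      ? JSON.stringify(params[name])
      : String(params[name]);
    return encodeURIComponent(name) + "=" + encodeURIComponent(raw);
  });
  return base + "?" + query.join("&");
}

// The "dir contents" query from the slide above.
const dirContentsUrl = viewUrl("/files/_design/group/_view/tree_view", {
  reduce: false,
  startkey: ["Music", "The Beatles"],
  endkey: ["Music", "The Beatles", {}]
});
```
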
  22. 6. “SHOW/LIST is easy server-side HTML”

    (A Geo Metro with a trailer hitch is not an ideal towing vehicle.)
  23. Recommendations

    Views are best at MapReduce.
    Render HTML in the browser, or use a full server-side framework.
    SHOW/LIST is a last resort for legacy clients that expect CSV, XML, etc. It is not for rendering images, HTML, etc.
    To me, SHOW/LIST is ugly and should probably be deprecated.
  24. 7. “View groups are for developer convenience”

    Each view group becomes a separate view server process when accessed.
    8 view groups means 8 couchjs processes.
    Use this to your advantage! (# CPUs/cores)
  25. 8. “Cool, so I now have a message queue, right?”

    Replication makes no guarantee of delivery, timeliness, or that all updates occur in the source’s _changes feed order.
    It’s more of a mailbox with weak ordering.
    Strong ordering is hard, if not impossible.
    “…it is impossible to order events with respect to time in a distributed system, this means they must be ordered causally.” (Riak docs)
    Partial ordering is possible, via causal relations. Read up on Lamport timestamps, vector clocks, etc. for more.
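
To make "partial ordering via causal relations" concrete, here is a small generic vector clock comparison. It illustrates the general technique only. It is not how CouchDB orders updates, and the clock representation (a plain object mapping a node name to a counter) is an assumption made for the sketch.

```javascript
// Compare two vector clocks (objects mapping node name -> counter).
// Returns "before", "after", "equal", or "concurrent".
function compareClocks(a, b) {
  let aAhead = false;
  let bAhead = false;
  const nodes = new Set(Object.keys(a).concat(Object.keys(b)));
  nodes.forEach(function (n) {
    const x = a[n] || 0; // a missing entry counts as zero
    const y = b[n] || 0;
    if (x > y) aAhead = true;
    if (y > x) bAhead = true;
  });
  if (aAhead && bAhead) return "concurrent"; // no causal order exists
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}
```

The "concurrent" result is the point: some pairs of events simply have no defined order, which is why a replicated store cannot stand in for a strictly ordered queue.
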
  26. Recommendations

    You want a message queue? Use one.
    – Apache ActiveMQ or Qpid
    – RabbitMQ (AMQP)
    – …or heck, CICS / IBM WebSphere MQ
    You want eventually-consistent ordered info? Couch.
  27. 9. “I just need replication, not Couch. I’ll write my own.”

    It’s not easy. To get all the edge cases right, and to be interoperable, it’ll take you months of effort.
    This is worse if you attempt it with flat files.
    Just ask how long it took PouchDB to get it right!
  28. 10. “I need High Availability, so I’ll use replication.”

    No, you want the BigCouch merge and haproxy/nginx:
    1. Automatic database and view sharding
    2. Optimized internal replication
    3. Tunable DynamoDB-like parameters
    4. Lots more I don’t have time to talk about