
Take the Magic out of NoSQL Data Stores

We’ll discuss the planning and development of Cashbox, a completely managed document store written for .NET that sits within a single process. The talk runs from the first steps of understanding the problem and working out how to address it through to writing a custom storage engine to replace the stop-gap storage solutions used at first. Most topics are covered at a high level, with code used only to point out problems and the approaches taken to address them.

Travis Smith

March 22, 2012

Transcript

  1. $ whoami travis $ finger travis Login: travis Name: Travis Smith Directory: @legomasternet Shell: Rails, CoffeeScript, Mongo No Mail for [email protected] Scientist, Gamer, OSS Developer, and Lead Engineer at Facio.
  2. Cashbox v0.1ish • API contains • Store • Retrieve • Delete • List • Keeps all data in an in-memory dictionary • After x transactions, writes dictionary to disk • Start up is just loading dictionary from disk • No UPSERT, two SQL commands per Store
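The v0.1 design on this slide can be sketched in a few lines. This is a minimal Python illustration, not the Cashbox .NET code: the class name `DictStore`, the JSON file format, and the `flush_every` parameter (standing in for the slide's "x transactions") are all assumptions made for the example.

```python
import json
import os


class DictStore:
    """Hypothetical v0.1-style store: a dict that is dumped to disk periodically."""

    def __init__(self, path, flush_every=3):
        self.path = path
        self.flush_every = flush_every  # the slide's "x transactions" (assumed name)
        self.pending = 0
        self.data = {}
        # Start-up is just loading the dictionary from disk, if a file exists.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.data = json.load(f)

    def _count_write(self):
        # Every mutating call counts as one transaction.
        self.pending += 1
        if self.pending >= self.flush_every:
            self.flush()

    def flush(self):
        # Write the whole dictionary to disk in one go.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.data, f)
        self.pending = 0

    def store(self, key, value):
        self.data[key] = value
        self._count_write()

    def retrieve(self, key):
        return self.data.get(key)

    def delete(self, key):
        self.data.pop(key, None)
        self._count_write()

    def list_keys(self):
        return sorted(self.data)
```

The weakness is visible in the sketch: anything written since the last flush is lost if the process dies, and every flush rewrites the entire data set.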
  3. Look it up, I swear it’s true • Moving on to v0.5 • SQLite for storage • Nothing in memory • No indexing
  4. Heaping piles of files • Easy • Just dump it on the end • Header marked for deletion • Slow to search
  5. Hashing • Hash function describes location of record • Space is fluffed • Fast for retrieving by key • Collisions!
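To make the trade-off on this slide concrete, here is a small Python sketch of hash-addressed storage. The slot size, slot count, and the choice of CRC32 as the hash function are assumptions for illustration only; the talk does not say which scheme was considered.

```python
import zlib

SLOT_SIZE = 64   # bytes reserved per record (assumed); unused bytes are the "fluff"
SLOT_COUNT = 8   # deliberately tiny so collisions are easy to provoke


def slot_offset(key):
    """Map a key directly to a byte offset in a pre-sized file."""
    slot = zlib.crc32(key.encode("utf-8")) % SLOT_COUNT
    return slot * SLOT_SIZE


def find_collision(keys):
    """Return the first pair of distinct keys that map to the same offset."""
    seen = {}
    for key in keys:
        offset = slot_offset(key)
        if offset in seen:
            return seen[offset], key
        seen[offset] = key
    return None
```

Lookup by key is a single computation plus one seek, which is why it is fast. With more keys than slots a collision is guaranteed, so a real implementation needs a collision strategy (chaining, probing, or resizing) on top of this.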
  6. I (am) SAM • Wikipedia claims ISAM has been replaced with B+ Trees (except for VSAM) • Only works well with fixed length records • Index not stored with data, fast index searches
  7. Bee + Tree = Honey • Indexes all data in a B+ tree • Who wants to write binary trees? • Thank Wikipedia for this image!
  8. Break-it-down • Heap • All data retrieved by key, slow • Hash • Problems, collisions, partitions • ISAM/VSAM • Not fixed length data • B+ Trees • Complicated • Roll my own • Yay!
  9. Whatever will I do? • Write forward-only data storage with in-memory index • No free-heap • Fast read, fast write • Sequential reads slower • Easily transactional; persistable partial tx
  10. // TODO: add title • For .NET; always work with streams • Easy to use memory streams for testing • Easy to port to mobile platforms (isolated storage or file streams)
  11. Recorded • int HeaderVersion • long RecordSize • StorageAction Action { Store / Delete } • string Key • string Table
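One way to picture this record header on disk is the Python sketch below. Only the five field names come from the slide. The byte layout (little-endian, 32-bit version, 64-bit size, one-byte action, length-prefixed UTF-8 strings) and the helper names are assumptions, not the actual Cashbox format.

```python
import io
import struct
from dataclasses import dataclass

STORE, DELETE = 0, 1             # StorageAction values (numbering assumed)
_FIXED = struct.Struct("<iqB")   # int32 version, int64 record size, uint8 action


@dataclass
class RecordHeader:
    header_version: int
    record_size: int
    action: int
    key: str
    table: str


def _write_str(stream, text):
    # Strings are stored as a 2-byte length followed by UTF-8 bytes (assumed).
    raw = text.encode("utf-8")
    stream.write(struct.pack("<H", len(raw)))
    stream.write(raw)


def _read_str(stream):
    (length,) = struct.unpack("<H", stream.read(2))
    return stream.read(length).decode("utf-8")


def write_header(stream, header):
    stream.write(_FIXED.pack(header.header_version, header.record_size, header.action))
    _write_str(stream, header.key)
    _write_str(stream, header.table)


def read_header(stream):
    version, size, action = _FIXED.unpack(stream.read(_FIXED.size))
    key = _read_str(stream)
    table = _read_str(stream)
    return RecordHeader(version, size, action, key, table)
```

Because the header carries the record size, a reader can skip over the payload without parsing it, which is what makes a fast start-up scan possible.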
  12. Tabled • Table is like a topic • Used to store type - .NET is strongly typed • Could just be a classification
  13. Streamed • Content of a stream has a header • Plus n records • Each record is read on start up to build an index • Layout: Stream Start, Stream Header, Record, Record, Record, Record, Stream End
  14. Where did I put that? • Indexes are important • Index is a dictionary of key (as RecordHeader) and stream location of record • Read from the stream at start up • Updated as data changes during runtime
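The start-up scan described on the last two slides might look like the following Python sketch. It uses a simplified record layout (action, key length, payload length, then key and payload) and a made-up 4-byte stream header `CBX1`; both are assumptions, and the table field is left out for brevity.

```python
import io
import struct

MAGIC = b"CBX1"                # hypothetical stream header
REC = struct.Struct("<BHq")    # action, key length, payload length (assumed layout)
STORE, DELETE = 0, 1


def append(stream, action, key, payload=b""):
    """Write one record at the end of the stream and return its offset."""
    stream.seek(0, io.SEEK_END)
    offset = stream.tell()
    raw_key = key.encode("utf-8")
    stream.write(REC.pack(action, len(raw_key), len(payload)))
    stream.write(raw_key)
    stream.write(payload)
    return offset


def build_index(stream):
    """Scan every record once and map each live key to its latest offset."""
    stream.seek(0)
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("missing stream header")
    index = {}
    while True:
        offset = stream.tell()
        fixed = stream.read(REC.size)
        if len(fixed) < REC.size:
            break  # reached the end of the stream
        action, key_len, payload_len = REC.unpack(fixed)
        key = stream.read(key_len).decode("utf-8")
        stream.seek(payload_len, io.SEEK_CUR)  # skip the payload, only offsets matter
        if action == STORE:
            index[key] = offset       # a later store wins over an earlier one
        else:
            index.pop(key, None)      # a delete record removes the key
    return index
```

Replaying the records in order is enough to recover the index, because a later record for the same key always supersedes an earlier one.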
  15. How does it work? • Start with empty stream • Insert stream header • Leave pointer at end of stream
  16. Inserting • Place a new store record at the pointer location (end of stream) • Update index to point at that location
  17. Inserting, again • Place a new store record at the pointer location (end of stream) • Update index to point at that location • Exactly the same as before
  18. Update is a store • Place a new store record at the pointer location (end of stream) • Update index to point at that location • Seems similar, except we have a dead record
  19. Delete • Place a new delete record at the pointer location (end of stream) • Delete records are just a header, no data • Remove item from index
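The insert, update, and delete steps from the preceding slides fit into one small class. This Python sketch is an illustration under assumptions: the class name `LogStore`, the `CBX1` stream header, and the record layout are invented here, and the start-up scan of an existing stream is omitted, so it expects an empty stream.

```python
import io
import struct

REC = struct.Struct("<BHq")   # action, key length, payload length (assumed layout)
STORE, DELETE = 0, 1


class LogStore:
    """Hypothetical forward-only store over any seekable binary stream."""

    def __init__(self, stream):
        self.stream = stream
        self.index = {}  # key -> offset of the latest store record
        self.stream.seek(0, io.SEEK_END)
        if self.stream.tell() == 0:
            self.stream.write(b"CBX1")  # made-up stream header

    def _append(self, action, key, payload=b""):
        # Records only ever go at the end of the stream.
        self.stream.seek(0, io.SEEK_END)
        offset = self.stream.tell()
        raw_key = key.encode("utf-8")
        self.stream.write(REC.pack(action, len(raw_key), len(payload)) + raw_key + payload)
        return offset

    def store(self, key, payload):
        # Insert and update are the same operation; an update simply
        # leaves the previous record behind as a dead record.
        self.index[key] = self._append(STORE, key, payload)

    def delete(self, key):
        # A delete record is just a header with no payload.
        self._append(DELETE, key)
        self.index.pop(key, None)

    def retrieve(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.stream.seek(offset)
        _, key_len, payload_len = REC.unpack(self.stream.read(REC.size))
        self.stream.seek(key_len, io.SEEK_CUR)  # skip over the key bytes
        return self.stream.read(payload_len)
```

Passing an `io.BytesIO()` gives an in-memory store for tests, and passing a file opened in binary read/write mode gives a persistent one, which mirrors the deck's point about always working with streams.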
  20. Compact • Go through the index • Copy records in the index to a second stream • Clean up • Eliminates dead records and recovers space
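Compaction as described here can be sketched as a copy driven by the index. The Python below is a minimal illustration with assumed names (`put`, `get`, `compact`) and an assumed record layout; swapping the new stream in for the old one, the "clean up" step, is left out.

```python
import io
import struct

HEADER = b"CBX1"              # made-up stream header
REC = struct.Struct("<BHq")   # action, key length, payload length (assumed layout)
STORE = 0


def put(stream, key, payload):
    """Append a store record and return its offset."""
    stream.seek(0, io.SEEK_END)
    offset = stream.tell()
    raw_key = key.encode("utf-8")
    stream.write(REC.pack(STORE, len(raw_key), len(payload)) + raw_key + payload)
    return offset


def get(stream, offset):
    """Read the payload of the record that starts at offset."""
    stream.seek(offset)
    _, key_len, payload_len = REC.unpack(stream.read(REC.size))
    stream.seek(key_len, io.SEEK_CUR)
    return stream.read(payload_len)


def compact(source, index):
    """Copy only the records the index still points at into a second stream."""
    target = io.BytesIO()
    target.write(HEADER)
    new_index = {}
    for key, offset in index.items():
        source.seek(offset)
        fixed = source.read(REC.size)
        _, key_len, payload_len = REC.unpack(fixed)
        body = source.read(key_len + payload_len)
        new_index[key] = target.tell()  # record lands at the current end
        target.write(fixed + body)
    return target, new_index
```

Dead records and delete records are never referenced by the index, so they are simply not copied, and that is where the space is recovered.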
  21. Where did I put that? • Indexes are important • Index is a dictionary of key (as RecordHeader) and stream location of record • Read from the stream at start up • Updated as data changes during runtime
  22. Limitations • High volumes will require more compactions • Currently no transactions • Actor model has eventual consistency in this implementation • Only within process boundaries