File-Based Data Stores: Where's the Magic?

We’ll discuss the planning and development of Cashbox, a completely managed document store written for .NET that sits within a single process. The talk goes from the first steps of understanding the problem and discovering how to address it to writing a custom storage engine that replaces the stop-gap storage solutions used at first. Most topics are covered at a high level, with code used only to point out problems and discuss the approaches taken.

Travis Smith

November 08, 2012

Transcript

  1. Cashbox v0.1ish
     • API contains: Store, Retrieve, Delete, List
     • Keeps all data in in-memory dictionary
     • After x transactions, writes dictionary to disk
     • Start up is just loading dictionary from disk
     • No UPSERT, two SQL commands per Store
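
The v0.1 behaviour described on this slide can be sketched roughly as follows. This is a minimal illustration in Python rather than Cashbox's .NET code: the `DictStore` class, the `flush_every` parameter, the `list_all` method and the JSON file format are all assumptions made for the example, and the slide's remark about SQL commands is not modelled.

```python
import json
import os
import tempfile


class DictStore:
    """Hypothetical v0.1-style store: a dictionary flushed to disk every N changes."""

    def __init__(self, path, flush_every=3):
        self.path = path
        self.flush_every = flush_every  # the "x transactions" from the slide
        self.pending = 0
        self.data = {}
        # Start-up is just loading the dictionary from disk, if a file exists.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.data = json.load(f)

    def _changed(self):
        self.pending += 1
        if self.pending >= self.flush_every:
            self.flush()

    def flush(self):
        # Write the whole dictionary out in one go (JSON is an assumption).
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.data, f)
        self.pending = 0

    def store(self, key, value):
        self.data[key] = value
        self._changed()

    def retrieve(self, key):
        return self.data.get(key)

    def delete(self, key):
        self.data.pop(key, None)
        self._changed()

    def list_all(self):
        return list(self.data.values())


# Usage: the second store triggers a flush; a new instance reloads from disk.
path = os.path.join(tempfile.mkdtemp(), "cashbox.json")
first = DictStore(path, flush_every=2)
first.store("a", "one")
first.store("b", "two")
reloaded = DictStore(path)
```

Anything written after the last flush is lost if the process dies, which is one reason a design like this only works as a stop-gap.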
  2. Look it up, I swear it’s true
     • Moving on to v0.5
     • SQLite for storage
     • Nothing in memory
     • No indexing
  3. Heaping piles of files
     • Easy
     • Just dump it on the end
     • Header marked for deletion
     • Slow to search
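
A heap file along these lines might look like the sketch below. The record layout (a one-byte deleted flag, then key and value lengths) and the function names are hypothetical; the point is only that appending is trivial, deleting flips a flag in the header, and finding a key means scanning from the start.

```python
import io
import struct

# Hypothetical record header: deleted flag (1 byte), key length (2), value length (4).
HEADER = struct.Struct("<BHI")


def heap_append(f, key, value):
    """Dump a record on the end of the file and return its offset."""
    k = key.encode("utf-8")
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(HEADER.pack(0, len(k), len(value)) + k + value)
    return offset


def heap_scan(f):
    """Yield (offset, deleted, key, value) for every record, in file order."""
    f.seek(0)
    while True:
        offset = f.tell()
        raw = f.read(HEADER.size)
        if len(raw) < HEADER.size:
            return
        deleted, klen, vlen = HEADER.unpack(raw)
        key = f.read(klen).decode("utf-8")
        value = f.read(vlen)
        yield offset, bool(deleted), key, value


def heap_find(f, key):
    """Linear search over the whole file: the slow part of a heap."""
    for offset, deleted, k, value in heap_scan(f):
        if k == key and not deleted:
            return offset, value
    return None


def heap_delete(f, key):
    """Mark the record's header as deleted; the bytes stay where they are."""
    found = heap_find(f, key)
    if found is None:
        return False
    f.seek(found[0])
    f.write(b"\x01")
    return True


heap = io.BytesIO()
heap_append(heap, "a", b"one")
heap_append(heap, "b", b"two")
heap_delete(heap, "a")
```

After the delete, the file is still the same size: the space held by the dead record is not reclaimed.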
  4. Hashing
     • Hash function describes location of record
     • Space is fluffed
     • Fast for retrieving by key
     • Collisions!
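
As a rough illustration of a hashed file, the sketch below pre-allocates a fixed number of fixed-size slots and lets a hash of the key pick the slot. The slot size, slot count, use of CRC32 and the linear-probing collision strategy are all assumptions for the example; the slides do not say how collisions would be handled.

```python
import io
import zlib

SLOT_SIZE = 32   # hypothetical slot: 1 length byte + up to 31 payload bytes
SLOT_COUNT = 8   # the file is pre-allocated ("fluffed") to this many slots


def slot_of(key):
    """The hash function decides where the record lives."""
    return zlib.crc32(key.encode("utf-8")) % SLOT_COUNT


def new_table():
    return io.BytesIO(bytes(SLOT_SIZE * SLOT_COUNT))


def _read_slot(f, slot):
    f.seek(slot * SLOT_SIZE)
    raw = f.read(SLOT_SIZE)
    used = raw[0]
    return raw[1:1 + used] if used else None


def put(f, key, value):
    k = key.encode("utf-8")
    payload = k + b"\x00" + value  # keys are assumed not to contain NUL bytes
    if len(payload) > SLOT_SIZE - 1:
        raise ValueError("record does not fit in a slot")
    start = slot_of(key)
    for step in range(SLOT_COUNT):  # linear probing: try the next slot on collision
        slot = (start + step) % SLOT_COUNT
        existing = _read_slot(f, slot)
        if existing is None or existing.split(b"\x00", 1)[0] == k:
            f.seek(slot * SLOT_SIZE)
            f.write((bytes([len(payload)]) + payload).ljust(SLOT_SIZE, b"\x00"))
            return slot
    raise RuntimeError("table is full")


def get(f, key):
    k = key.encode("utf-8")
    start = slot_of(key)
    for step in range(SLOT_COUNT):
        existing = _read_slot(f, (start + step) % SLOT_COUNT)
        if existing is None:
            return None
        stored_key, value = existing.split(b"\x00", 1)
        if stored_key == k:
            return value
    return None


# Eight keys in eight slots, so some keys are likely to collide and get probed.
table = new_table()
for n in range(SLOT_COUNT):
    put(table, f"key{n}", str(n).encode("ascii"))
```

The file occupies its full size even when empty, and variable-length or oversized records do not fit the scheme, which matches the drawbacks listed on the slide.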
  5. I (am) SAM
     • Wikipedia claims ISAM has been replaced with B+ Trees (except for VSAM)
     • Only works well with fixed length records
     • Index not stored with data, fast index searches
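
The two properties named here, fixed-length records and an index kept apart from the data, can be illustrated with a much-simplified sketch. It is not a real ISAM implementation: the `RECORD_SIZE` constant, the function names and the use of a sorted in-memory key list are assumptions made for the example.

```python
import bisect
import io

RECORD_SIZE = 16  # hypothetical fixed record length in bytes


def write_records(data, rows):
    """Write (key, value) rows in key order; return the separate sorted key index."""
    keys = []
    for key, value in sorted(rows):
        if len(value) > RECORD_SIZE:
            raise ValueError("value exceeds the fixed record length")
        data.write(value.ljust(RECORD_SIZE, b"\x00"))  # pad to the fixed length
        keys.append(key)
    return keys


def lookup(data, keys, key):
    """Binary-search the index, then jump straight to record_number * RECORD_SIZE."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    data.seek(i * RECORD_SIZE)
    # Padding is stripped, so values ending in NUL bytes are not supported here.
    return data.read(RECORD_SIZE).rstrip(b"\x00")


data = io.BytesIO()
keys = write_records(data, [("b", b"two"), ("a", b"one"), ("c", b"three")])
```

Because every record has the same length, the position of record number i is simple arithmetic; that shortcut disappears once records vary in size.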
  6. Bee + Tree = Honey
     • Indexes all data in a B+ tree
     • Who wants to write binary trees?
     • Thank Wikipedia for this image!
  7. Break-it-down
     • Heap: All data retrieved by key, slow
     • Hash: Problems, collisions, partitions
     • ISAM/VSAM: Not fixed length data
     • B+ Trees: Complicated
     • Roll my own: Yay!
  8. Whatever will I do?
     • Write forward-only data storage with in-memory index
     • No free-heap
     • Fast read, fast write
     • Sequential reads slower
     • Easily transactional; persistable partial tx
  9. // TODO: add title
     • For .NET; always work with streams
     • Easy to use memory streams for testing
     • Easy to port to mobile platforms (isolated storage or file streams)
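
The benefit of coding against a stream abstraction can be shown with a small analogue. The slide is about .NET streams; the sketch below uses Python's `io.BytesIO` and an ordinary file object as stand-ins, and the `append_line` and `read_lines` helpers are hypothetical.

```python
import io
import os
import tempfile


def append_line(stream, text):
    """Works with any seekable binary stream: in memory, on disk, or otherwise."""
    stream.seek(0, io.SEEK_END)
    stream.write(text.encode("utf-8") + b"\n")


def read_lines(stream):
    stream.seek(0)
    return stream.read().decode("utf-8").splitlines()


# In tests: an in-memory stream, no disk involved.
memory = io.BytesIO()
append_line(memory, "hello")
from_memory = read_lines(memory)

# Elsewhere: exactly the same code against a file stream.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "w+b") as file_stream:
    append_line(file_stream, "hello")
    from_file = read_lines(file_stream)
```

The storage code never learns which kind of stream it was given, which is what makes swapping in a platform-specific stream straightforward.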
  10. Recorded
      • int HeaderVersion
      • long RecordSize
      • StorageAction Action { Store / Delete }
      • string Key
      • string Table
  11. Tabled
      • Table is like a topic
      • Used to store type - .NET is strongly typed
      • Could just be a classification
  12. Streamed
      • Content of a stream has a header
      • Plus n records
      • Each record is read on start up to build an index
      • Diagram: Stream Start | Stream Header | Record | Record | Record | Record | Stream End
  13. Where did I put that?
      • Indexes are important
      • Index is a dictionary of key (as RecordHeader) and stream location of record
      • Read from the stream at start up
      • Updated as data changes during runtime
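
Rebuilding the index at start-up amounts to one pass over the stream. The sketch below assumes a simplified record framing and a four-byte stream header (`CBX1`), neither of which comes from the slides, and keys the index on a `(table, key)` tuple rather than on a full record header.

```python
import io
import struct

STREAM_HEADER = b"CBX1"           # hypothetical 4-byte stream header
RECORD = struct.Struct("<BHHI")   # action, table length, key length, data length
STORE, DELETE = 1, 2


def append_record(stream, action, table, key, data=b""):
    """Write one record at the end of the stream and return its offset."""
    t, k = table.encode("utf-8"), key.encode("utf-8")
    stream.seek(0, io.SEEK_END)
    offset = stream.tell()
    stream.write(RECORD.pack(action, len(t), len(k), len(data)) + t + k + data)
    return offset


def build_index(stream):
    """Read every record once and map (table, key) to the latest store offset."""
    index = {}
    stream.seek(0)
    if stream.read(len(STREAM_HEADER)) != STREAM_HEADER:
        raise ValueError("not a recognised stream")
    while True:
        offset = stream.tell()
        raw = stream.read(RECORD.size)
        if len(raw) < RECORD.size:
            return index
        action, tlen, klen, dlen = RECORD.unpack(raw)
        table = stream.read(tlen).decode("utf-8")
        key = stream.read(klen).decode("utf-8")
        stream.seek(dlen, io.SEEK_CUR)  # skip the data; only its location matters
        if action == STORE:
            index[(table, key)] = offset   # later records win over earlier ones
        else:
            index.pop((table, key), None)  # a delete record removes the entry


stream = io.BytesIO()
stream.write(STREAM_HEADER)
first = append_record(stream, STORE, "User", "1", b"alice")
second = append_record(stream, STORE, "User", "2", b"bob")
third = append_record(stream, STORE, "User", "1", b"alice v2")  # supersedes first
append_record(stream, DELETE, "User", "2")
index = build_index(stream)
```

Because records are replayed in write order, the index ends up pointing at the newest version of each key, and deleted keys drop out.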
  14. How does it work?
      • Start with empty stream
      • Insert stream header
      • Leave pointer at end of stream
  15. Inserting
      • Place a new store record at the pointer location (end of stream)
      • Update index to point at that location
  16. Inserting, again
      • Place a new store record at the pointer location (end of stream)
      • Update index to point at that location
      • Exactly the same as before
  17. Update is a store
      • Place a new store record at the pointer location (end of stream)
      • Update index to point at that location
      • Seems similar, except we have a dead record
  18. Delete
      • Place a new delete record at the pointer location (end of stream)
      • Delete records are just a header, no data
      • Remove item from index
  19. Compact
      • Go through the index
      • Copy records in the index to a second stream
      • Clean up
      • Eliminates dead records and recovers space
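
The sequence from "How does it work?" through "Compact" can be sketched end to end as below. This is an illustration under assumptions, not Cashbox's implementation: the `ForwardOnlyStore` class, the `CBX1` stream header and the record framing are invented for the example, and the table field is left out to keep it short.

```python
import io
import struct

STREAM_HEADER = b"CBX1"        # hypothetical stream header
FRAME = struct.Struct("<BHI")  # action, key length, data length
STORE, DELETE = 1, 2


class ForwardOnlyStore:
    def __init__(self, stream=None):
        # `stream` is assumed to be an empty, seekable binary stream.
        self.stream = stream if stream is not None else io.BytesIO()
        self.index = {}  # key -> offset of the live store record
        # Start with an empty stream, insert the header, leave the pointer at the end.
        self.stream.write(STREAM_HEADER)

    def _append(self, action, key, data=b""):
        k = key.encode("utf-8")
        self.stream.seek(0, io.SEEK_END)
        offset = self.stream.tell()
        self.stream.write(FRAME.pack(action, len(k), len(data)) + k + data)
        return offset

    def store(self, key, data):
        # Insert and update are the same operation; an update leaves a dead record.
        self.index[key] = self._append(STORE, key, data)

    def delete(self, key):
        # A delete record is just a header with no data.
        if key in self.index:
            self._append(DELETE, key)
            del self.index[key]

    def retrieve(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.stream.seek(offset)
        _, klen, dlen = FRAME.unpack(self.stream.read(FRAME.size))
        self.stream.seek(klen, io.SEEK_CUR)  # skip the key bytes
        return self.stream.read(dlen)

    def size(self):
        return self.stream.seek(0, io.SEEK_END)

    def compact(self):
        # Copy only the records the index points at into a second stream.
        fresh = ForwardOnlyStore()
        for key in list(self.index):
            fresh.store(key, self.retrieve(key))
        self.stream, self.index = fresh.stream, fresh.index


db = ForwardOnlyStore()
db.store("a", b"one")
db.store("b", b"two")
db.store("a", b"uno")   # update: the first "a" record is now dead
db.delete("b")
before = db.size()
db.compact()
after = db.size()
```

In this run the stream shrinks from 45 bytes to 15, since only the live "a" record survives compaction. A real implementation would also need to swap the underlying file safely, which this sketch does not attempt.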
  20. Where did I put that?
      • Indexes are important
      • Index is a dictionary of key (as RecordHeader) and stream location of record
      • Read from the stream at start up
      • Updated as data changes during runtime
  21. Limitations
      • High volumes will require more compactions
      • Currently no transactions
      • Actor model has eventual consistency in this implementation
      • Only within process boundaries