the raw devices, to get the best performance.
• A read() corruption from a raw device is retried on another device (if the RAID level allows it), and the block is fixed on that raw device.

Raw Devices
• Perform the read/write requests received.
• Perform validation of the data block.

Each Raw Device has a set of “Priority Queues”
• Each queue has its own “elevator algorithm”
(Deadline, SSTF, FIFO, …)
• Each queue can be throttled to provide more fairness.
• The queues: Journal Writes, Recovery Reads, Async Reads, Async Writes.

struct r5l_vtable_dev_ioq {
  r5l_errno_t (*open)  (r5l_dev_ioq_t *self, va_list args);
  void        (*close) (r5l_dev_ioq_t *self);
  r5l_errno_t (*add)   (r5l_dev_ioq_t *self, r5l_dev_io_t *io);
  r5l_errno_t (*fetch) (r5l_dev_ioq_t *self, r5l_dev_io_batch_t *batch);
};

I/O Request flow:
Object → Request → r5l_dev_add_io() → r5l_dev_ioq_add() → Dev I/O Executor → r5l_dev_ioq_fetch() → read()/write() → Verify Block (re-request on failure) → Notify the Object

Root Device
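A minimal sketch of how a backend might implement the `add`/`fetch` half of `r5l_vtable_dev_ioq`, using the simplest elevator (FIFO). The concrete shapes of `r5l_errno_t`, `r5l_dev_io_t`, `r5l_dev_io_batch_t`, and the queue struct are assumptions for illustration; only the vtable itself comes from the slide.

```c
#include <stdarg.h>

/* Simplified stand-ins for the real types (assumed shapes). */
typedef int r5l_errno_t;                 /* 0 = OK */
typedef struct { int type; } r5l_dev_io_t;

#define R5L_BATCH_MAX 8
typedef struct {
    int count;
    r5l_dev_io_t *ios[R5L_BATCH_MAX];
} r5l_dev_io_batch_t;

#define R5L_QUEUE_MAX 64
typedef struct r5l_dev_ioq {
    const struct r5l_vtable_dev_ioq *vtable;
    r5l_dev_io_t *pending[R5L_QUEUE_MAX];
    int head, tail;                      /* ring-buffer indices */
} r5l_dev_ioq_t;

/* The vtable from the slide. */
struct r5l_vtable_dev_ioq {
    r5l_errno_t (*open)  (r5l_dev_ioq_t *self, va_list args);
    void        (*close) (r5l_dev_ioq_t *self);
    r5l_errno_t (*add)   (r5l_dev_ioq_t *self, r5l_dev_io_t *io);
    r5l_errno_t (*fetch) (r5l_dev_ioq_t *self, r5l_dev_io_batch_t *batch);
};

/* FIFO elevator: add appends to the ring, fetch drains up to one batch
 * in arrival order. A Deadline or SSTF elevator would differ only in
 * how fetch picks the next ios. */
static r5l_errno_t fifo_add(r5l_dev_ioq_t *q, r5l_dev_io_t *io)
{
    if ((q->tail + 1) % R5L_QUEUE_MAX == q->head)
        return -1;                       /* queue full */
    q->pending[q->tail] = io;
    q->tail = (q->tail + 1) % R5L_QUEUE_MAX;
    return 0;
}

static r5l_errno_t fifo_fetch(r5l_dev_ioq_t *q, r5l_dev_io_batch_t *batch)
{
    batch->count = 0;
    while (q->head != q->tail && batch->count < R5L_BATCH_MAX) {
        batch->ios[batch->count++] = q->pending[q->head];
        q->head = (q->head + 1) % R5L_QUEUE_MAX;
    }
    return 0;
}
```

Keeping the elevator behind a vtable is what lets each priority queue pick a different algorithm (Deadline, SSTF, FIFO, …) without the executor caring.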
(It’s a Group Device)
• Group Devices (Raid0, Raid1, …)
• Raw Devices (BlkDev, File, Socket, …)

struct r5l_dev_io_batch {
  uint64_t phy_offset;
  uint64_t phy_length;
  r5l_dev_io_t *ios[N];
};

struct r5l_dev_io {
  uint8_t type;
  uint8_t advice;
  uint8_t priority;
  r5l_dev_ptr_t ptr;
};

struct r5l_dev_ptr {
  uint64_t seqid;
  uint32_t flags;
  uint32_t section;
  uint32_t offset;
  uint32_t size;
  uint64_t dev_id;
  uint64_t cksum[4];
};
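The `cksum[4]` field in `r5l_dev_ptr` is what makes the “Verify Block” step possible: after a raw-device read, the executor can recompute the checksum and compare. A sketch under assumptions — the slide does not name the checksum algorithm, so a placeholder digest (FNV-1a folded into four words) stands in for it, and `r5l_block_verify` is a hypothetical helper name.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Struct copied from the slide. */
typedef struct r5l_dev_ptr {
    uint64_t seqid;
    uint32_t flags;
    uint32_t section;
    uint32_t offset;
    uint32_t size;
    uint64_t dev_id;
    uint64_t cksum[4];
} r5l_dev_ptr_t;

/* Placeholder digest: FNV-1a folded into four 64-bit words, each with a
 * different seed. The real checksum algorithm is not specified here. */
static void toy_digest(const void *data, size_t len, uint64_t out[4])
{
    const uint8_t *p = data;
    for (int w = 0; w < 4; w++) {
        uint64_t h = 1469598103934665603ULL + (uint64_t)w;
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;
        }
        out[w] = h;
    }
}

/* "Verify Block": recompute the checksum of the data just read and
 * compare with the pointer's stored cksum. A mismatch is the corruption
 * case that triggers a re-request on another device of the group. */
static int r5l_block_verify(const r5l_dev_ptr_t *ptr,
                            const void *data, size_t len)
{
    uint64_t got[4];
    toy_digest(data, len, got);
    return memcmp(got, ptr->cksum, sizeof got) == 0;
}
```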
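The `phy_offset`/`phy_length` pair in `r5l_dev_io_batch` suggests that a fetch can hand the executor one physically contiguous extent. A possible coalescing sketch, with heavily simplified stand-in types (the real `r5l_dev_io` carries a `r5l_dev_ptr`, not raw physical fields) and a hypothetical `batch_coalesce` helper:

```c
#include <stdint.h>

/* Simplified stand-ins: each io targets a physical extent, and a batch
 * covers one contiguous physical range (assumed field meanings). */
#define N 8
typedef struct { uint64_t phy_offset, phy_length; } r5l_dev_io_t;
typedef struct {
    uint64_t phy_offset;   /* start of the merged extent */
    uint64_t phy_length;   /* total contiguous length    */
    int      count;
    r5l_dev_io_t *ios[N];
} r5l_dev_io_batch_t;

/* Greedily fold physically contiguous ios into one batch so the executor
 * can issue a single large read()/write() instead of many small ones.
 * Returns the number of ios consumed. */
static int batch_coalesce(r5l_dev_io_t *ios, int n, r5l_dev_io_batch_t *b)
{
    if (n <= 0) return 0;
    b->phy_offset = ios[0].phy_offset;
    b->phy_length = ios[0].phy_length;
    b->count = 1;
    b->ios[0] = &ios[0];
    while (b->count < n && b->count < N &&
           ios[b->count].phy_offset == b->phy_offset + b->phy_length) {
        b->phy_length += ios[b->count].phy_length;
        b->ios[b->count] = &ios[b->count];
        b->count++;
    }
    return b->count;
}
```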