Storage '99

Matteo Bertozzi
June 15, 2014

Transcript

  1. Storage ‘99: Lessons Learned from two different Teams implementing “v2” of their File-System (If you have to ask, please leave the room)
  2. Overview

    • Focus on Storage
    • What can we do with a single machine?
    • Multi-machine distribution is just a simple RPC component, so there is no need to go into details here. It has nothing to do with storage.
    • Focus on Architecture
    • What are the problems with Monolithic systems (by examples)
    • What a system rewrite should give you for free, from day 0
    • Some implementation details can be found elsewhere, but they are probably not needed, since there is nothing complex in here.
    • There may be a couple of highly technical mentions of B+Tree balancing and node temperatures, but they are not the main focus of this overview.
    • No buzzwords, because everything is simple and clean.
    • No Google references, because Google is 15 years behind what was the state of the art in storage in 1999.
      …and they are known to build new similar systems from scratch because the previous architecture was not thought out.
  3. “Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed”

    When filesystems are not really designed for the needs of the storage layers above them, and none of them are, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own.
  4. Monolithic Systems are complex, and difficult to change

    “The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it” - Reiser’s Law Of Information Economics

    This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other.

    One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace. Very commonly, objects with different interfaces are segregated into different namespaces.

    How well your system performs is very much determined by how easy it is to make little changes to it. …and the more little experiments you make, the higher your performance is going to be.

    Architecture Matters!
  5. Monolithic Systems are complex, and difficult to change (by Examples)

    “How well your system performs is very much determined by how easy it is to make little changes to it. …and the more little experiments you make, the higher your performance is going to be”

    Monolithic
    Q: Can we add compression?
    A: Sure, let me add another component and change all the code to use the “Compressed Stream” instead of the default one.
    Q: Can we add encryption?
    A: Sure, let me add another component and change all the code to use the “Encryption Stream” instead of the default one.
    Q: Which is faster, compress before encrypt or encrypt before compress?
    A: I don’t know, I have to change all the code to swap the order.
    Q: Can you do that? How long will it take? Also, do I have to run the test twice?
    A: Yes, you have to run the test twice, and it will probably take forever…
    Q: Can’t we send the same request to two different devices?
    A: …no, it is all hardcoded, that would take forever to rewrite.

    Micro-Kernel
    Q: Can we add compression?
    A: Sure, let me add another component, then you can just say “add compression to object X”.
    Q: Can we add encryption?
    A: Sure, let me add another component, then you can just say “add encryption to object X”.
    Q: Which is faster, compress before encrypt or encrypt before compress?
    A: You can just reorder the flow at runtime and measure.
    Q: Cool, also do I have to run the test twice?
    A: No, you can just plug in a Mirror group and see which one is faster.

    [Diagram, Code v1 vs. Code v2: pipelines built from Input, Compression, Encryption and Output/Disk blocks; Code v2 adds a Mirror group feeding two Compression/Encryption orderings.]
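The "reorder the flow at runtime and measure" answer can be sketched as a list of interchangeable stages. This is a minimal illustration, not the deck's implementation: `Stage`, `write_through`, `read_through` and `xor_cipher` are hypothetical names, and the XOR function is only a toy stand-in for encryption.

```python
import zlib
from typing import Callable, List, Tuple

# A stage is a (name, encode, decode) triple; all names here are hypothetical.
Stage = Tuple[str, Callable[[bytes], bytes], Callable[[bytes], bytes]]

def xor_cipher(data: bytes, key: bytes = b"demo-key") -> bytes:
    """Toy stand-in for encryption (NOT secure); XOR is its own inverse."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

COMPRESS: Stage = ("compress", zlib.compress, zlib.decompress)
ENCRYPT: Stage = ("encrypt", xor_cipher, xor_cipher)

def write_through(stages: List[Stage], data: bytes) -> bytes:
    """Apply each stage in order on the way to the device."""
    for _, encode, _ in stages:
        data = encode(data)
    return data

def read_through(stages: List[Stage], stored: bytes) -> bytes:
    """Undo the stages in reverse order on the way back."""
    for _, _, decode in reversed(stages):
        stored = decode(stored)
    return stored

payload = b"storage " * 1024
# Swapping the order is a data change, not a code change.
for order in ([COMPRESS, ENCRYPT], [ENCRYPT, COMPRESS]):
    stored = write_through(order, payload)
    assert read_through(order, stored) == payload
    print([name for name, _, _ in order], len(stored))
```

Because every stage exposes the same interface, comparing the two orderings only requires building a different list.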
  6. Monolithic Systems are complex, and difficult to change (by Examples)

    “How well your system performs is very much determined by how easy it is to make little changes to it. …and the more little experiments you make, the higher your performance is going to be”

    Monolithic
    Q: Can we put our journal on SSD?
    A: I guess so. You’ll have to configure it on startup, and I have to change the code to make sure that it is using that device.
    Q: Cool, can you do that?
    A: Sure, there is a bunch of code to refactor, because at the moment everything is tied together.
    Q: Do we need a migration step?
    A: …oh yeah, I think so.
    Q: …after 6 months: can we use that thing you implemented to put my “Table X” on SSD too?
    A: No… we have to change all the internal flow…

    Micro-Kernel
    Q: Can we put our journal on SSD?
    A: Sure, just add an “object placement” (like you did with the mirror used before) and say on which node the “journal” object should go.
    Q: Can I use the same thing to put my “Table X” on SSD too?
    A: Yes, there is no difference. Just map your object to the destination node.

    [Diagram, Code v1 vs. Code v2: in v1, Input goes to an SSD for the Journal and to a Disk for everything else; in v2, Input goes through an Object Placement node that maps objects to SSD or Disk.]
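The "object placement" answer amounts to a lookup table in front of the devices. A minimal sketch, assuming in-memory devices; `Device` and `ObjectPlacement` are hypothetical names chosen for illustration.

```python
from typing import Dict

class Device:
    """In-memory stand-in for a disk; the name is only a label."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.blobs: Dict[str, bytes] = {}

class ObjectPlacement:
    """Routes each object to the device named in a placement table."""
    def __init__(self, default: Device) -> None:
        self.default = default
        self.rules: Dict[str, Device] = {}

    def place(self, object_name: str, device: Device) -> None:
        self.rules[object_name] = device

    def write(self, object_name: str, data: bytes) -> Device:
        # Objects without an explicit rule fall back to the default device.
        device = self.rules.get(object_name, self.default)
        device.blobs[object_name] = data
        return device

hdd, ssd = Device("hdd"), Device("ssd")
placement = ObjectPlacement(default=hdd)
placement.place("journal", ssd)
placement.place("table-x", ssd)  # same mechanism, no new code path
```

Moving "Table X" six months later is one more `place` call, which is the point the slide makes.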
  7. Monolithic Systems are complex, and difficult to change (by Examples)

    “How well your system performs is very much determined by how easy it is to make little changes to it. …and the more little experiments you make, the better it is going to be”

    Monolithic
    Q: Can I simulate a bad disk?
    A: Hmm… no, but we can make you a build with some random exceptions on the I/O path if you want to see how it behaves.
    Q: What if we have two disks? Are we able to keep going?
    A: …in theory. I don’t know, we don’t have much control over which disk we choose.
    Q: Is there a way to test how my app works against dirs with more than 10M items or files of 1EiB?
    A: You insert the data and then run the app?
    Q: Never mind…

    Micro-Kernel
    Q: Can I simulate a bad disk?
    A: You have to write a device and then add it to your tree: a few lines of code to implement the interface and throw the different exceptions you want.
    Q: What if we have two disks? Are we able to keep going?
    A: Just try it: add the “bad virtual disk” and a good one and see what happens. It is basically the same thing you did to test the two different compress/encrypt orders.
    Q: Is there a way to test how my app works against dirs with more than 1M items or files of 1PiB?
    A: Like you did for the vdisk, you can write your own object that gives back your results, so you don’t need to waste time filling up the disk and you don’t need the real disk.

    [Diagram, Code v1 (test build only): Input goes to a Disk through “if rand() % 2 throw exception”. Code v2: Input goes to a Multi-Disk Policy (e.g. Stripe, Mirror, …) over a Bad vDisk and a Disk.]
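The "bad virtual disk" idea can be sketched as a device that implements the same interface as a healthy one and fails on purpose. This is an illustrative sketch under assumed names (`MemDisk`, `BadDisk`, `Mirror`); the real device interface is not shown in the slides.

```python
import random
from typing import Dict, List

class MemDisk:
    """Healthy in-memory device: block number -> bytes."""
    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}

    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data

    def read(self, block_no: int) -> bytes:
        return self.blocks[block_no]

class BadDisk(MemDisk):
    """Virtual device that raises on a configurable fraction of reads."""
    def __init__(self, failure_rate: float, seed: int = 0) -> None:
        super().__init__()
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so test runs are repeatable

    def read(self, block_no: int) -> bytes:
        if self.rng.random() < self.failure_rate:
            raise OSError(f"simulated read failure on block {block_no}")
        return super().read(block_no)

class Mirror:
    """Group policy: write everywhere, read from the first disk that answers."""
    def __init__(self, disks: List[MemDisk]) -> None:
        self.disks = disks

    def write(self, block_no: int, data: bytes) -> None:
        for disk in self.disks:
            disk.write(block_no, data)

    def read(self, block_no: int) -> bytes:
        for disk in self.disks:
            try:
                return disk.read(block_no)
            except OSError:
                continue  # try the next replica
        raise OSError(f"all replicas failed for block {block_no}")

pool = Mirror([BadDisk(failure_rate=1.0), MemDisk()])
pool.write(7, b"payload")
print(pool.read(7))
```

With one always-failing disk and one good disk the read still succeeds, which answers "are we able to keep going?" by experiment rather than by reading the code.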
  8. Monolithic Systems are complex, and difficult to change (by Examples)

    Monolithic
    Q: How do you set the owner and permissions of a file?
    A: You have to use chmod/chown:
       chmod 777 myFile
       chown user myFile
    Q: How can I copy owner and permissions?
    A: …by hand? You list and clone.
    Q: Can we add ACLs?
    A: Sure, it will be another tool, but it will take time because it is different:
       setfacl -m u:lisa:r file
       getfacl …
    Q: Can we add custom attributes?
    A: Sure, I have to do something similar to what we did for ACLs, and you’ll get another tool like:
       setfattr -n myAttr -v myValue
       getfattr …
    Q: Why aren’t these attributes retained when I take/restore a snapshot?
    A: F!#@K, forgot to add it.
    Q: Why do we have all these different sets of tools to do the same thing?
    A: …our architecture sucks.

    Micro-Kernel
    Q: How do you set the owner and permissions of a file?
    A: Just write the “owner” and “permission” fields:
       echo "user" > myFile/@owner
       echo "777" > myFile/@permission
    Q: How can I copy owner and permissions?
    A: The same way you copy files… via cat or cp:
       cat myFile/@owner > myFile2/@owner
       cp myFile/@permission myFile2/@permission
    Q: Can we add ACLs?
    A: Sure:
       echo "u:lisa:r" >> myFile2/@acls
    Q: …oh, the same way as owner/permission and data, ok. Can we add custom attributes?
    A: Same way…
       echo "my-value" > myFile/@myAttr

    “The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it” - Reiser’s Law Of Information Economics

    If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and so on, you get what chaos theory calls a feedback loop.
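The unified-namespace answers above can be modelled with a single store in which `file/@attr` names are ordinary entries. A minimal sketch; `Namespace` and its methods are hypothetical, and a real semantic layer would do far more than a dictionary lookup.

```python
from typing import Dict

class Namespace:
    """Flat name -> bytes store; 'file/@attr' names are ordinary entries."""
    def __init__(self) -> None:
        self.entries: Dict[str, bytes] = {}

    def write(self, name: str, data: bytes, append: bool = False) -> None:
        if append:
            data = self.entries.get(name, b"") + data
        self.entries[name] = data

    def read(self, name: str) -> bytes:
        return self.entries[name]

    def copy(self, src: str, dst: str) -> None:
        # One copy routine serves data, owner, permissions, ACLs and custom attributes.
        self.write(dst, self.read(src))

ns = Namespace()
ns.write("myFile/@owner", b"user")
ns.write("myFile/@permission", b"777")
ns.copy("myFile/@owner", "myFile2/@owner")
ns.copy("myFile/@permission", "myFile2/@permission")
ns.write("myFile2/@acls", b"u:lisa:r\n", append=True)
ns.write("myFile/@myAttr", b"my-value")
```

Nothing in `copy` knows what an owner or an ACL is, so a snapshot built on the same read/write calls cannot forget an attribute type.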
  9. “Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed” (by Examples)

    Most databases implement a B+Tree. Can you see a problem with that? (Aside from the fact that database people don’t understand why balanced trees are better than unbalanced trees; see BLOB.)

    A B+Tree is efficient only if you know how the underlying storage works and you can align your block boundaries properly. To access leaf node 1 you expect to do 3 logical seeks, which in theory result in 3 physical seeks or fewer. You think you are being smart by aligning your blocks to a 4k boundary, which is a common block size used by most file systems. Can you see a problem with that?

    What if the File-System has a header at the beginning of the block? (See “Why fsck takes the whole day”.) Now, each time you want to access a single block you have to read two blocks.

    What if instead the File-System uses compression? Each read will pull in another block that may have no relation to the one that you want. The other block may have a higher or lower temperature, which means that you are wasting cache space, because you have to store the other block that comes in as a dependency of the one you want.

    [Diagram: B+Tree with root R, internal nodes A, B, C and leaves 1 to 9.]

    Why does fsck take the whole day? Who owns block X? Put a back ref into the data blocks!

    [Diagram: Metadata (ctime, mtime, mode, ...), Block Pointers, Data Blocks.]
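The "header at the beginning of the block" problem is simple arithmetic. The sketch below assumes a model in which the file system reserves `fs_header` bytes at the start of every physical block, so each block carries fewer payload bytes than the application assumes; the function name and the 64-byte header are illustrative assumptions, not figures from the slides.

```python
def physical_blocks_touched(logical_block: int,
                            app_block: int = 4096,
                            fs_block: int = 4096,
                            fs_header: int = 0) -> int:
    """Count the physical blocks read to fetch one application block."""
    payload = fs_block - fs_header           # usable bytes per physical block
    first_byte = logical_block * app_block   # application offsets, header-free
    last_byte = first_byte + app_block - 1
    return last_byte // payload - first_byte // payload + 1

# No header: a 4k-aligned read maps to exactly one physical block.
aligned = physical_blocks_touched(0)

# Hypothetical 64-byte header: the same read now straddles two blocks.
with_header = physical_blocks_touched(0, fs_header=64)

# A 3-level B+Tree lookup (root, internal node, leaf) doubles as well.
lookup_cost = sum(physical_blocks_touched(n, fs_header=64) for n in (0, 5, 9))
print(aligned, with_header, lookup_cost)
```

Under this model the application's careful 4k alignment buys nothing: every logical read costs two physical reads as soon as the header is non-zero.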
  10. “Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed” (by Examples)

    99% of the time, Sequential Writes/Reads DO NOT mean 0 seeks.

    Let’s assume the best case: there is a single process running, with only one file, and each operation (read/write) is serialized. In this case your file is perfectly sequential, and the worst case for an operation is a single seek to reach the requested position, after which it can read up to the size you specified.

    Let’s take something a bit more realistic: the same single file as before, but with a single writer and multiple readers. The file is still perfectly sequential (assuming that rewrites are in-place or you append), but if you push a read of 100MiB and your disk can do 100M/sec, you are blocking the other readers for about a second. To avoid this problem you are now probably chunking the request at 1M, which gives you a delay of 10 msec per request and will probably result in a logical seek per request (the elevator is probably smart enough to avoid the physical seek).

    …but once you start serving multiple files, writers and readers, there are no tricks and you’ll start seeking for each request. So the only way to optimize this is to know the read and write patterns and optimize the layout based on them (see later: object layouts, balancers, repackers, ditto blocks).

    [Diagram: Reader 0/1, Reader 2 and Reader 3 over blocks A1, B1, A3, A4, B1, B2, M1, M2, shown with “air holes” for the next insert and with ditto-blocks for read optimization.]
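The chunking numbers on this slide follow from dividing request size by throughput. A small sketch, reading the slide's "M" as MiB (an assumption) and ignoring seek time; the function names are illustrative.

```python
MIB = 1024 * 1024

def transfer_ms(request_bytes: int, throughput_bytes_per_sec: int) -> float:
    """Time the disk is busy with one request, ignoring seek time."""
    return 1000.0 * request_bytes / throughput_bytes_per_sec

def chunk_count(request_bytes: int, chunk_bytes: int) -> int:
    """Number of chunks needed to cover the request (ceiling division)."""
    return -(-request_bytes // chunk_bytes)

disk = 100 * MIB                         # 100 MiB/s, as in the slide
whole = transfer_ms(100 * MIB, disk)     # one unchunked 100 MiB read
per_chunk = transfer_ms(1 * MIB, disk)   # the same read split into 1 MiB chunks
chunks = chunk_count(100 * MIB, 1 * MIB)
print(whole, per_chunk, chunks)
```

Chunking does not make the 100 MiB read faster; it bounds how long any other reader waits behind it, at the price of up to one logical seek per chunk.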
  11. “Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed” (by Examples)

    Can you see any problem with the database WAL? (Aside from the fact that it is implementing something that the fs should already have, and that the most common fsync implementation is just “return 0”; see the mysql fsync ifdefs.)

    Do you understand why File-System journals log only metadata operations?

    Assume the simplest case: only one writer, sequential data. Why do you need to write the data twice?

    Now assume that you have multiple writers, and that you can afford to send the ACKs back once your object has reached a good buffer size (8k, 64k, … 1M). Why do you need to write the data twice?

    …but writing twice may be optimal sometimes. Suppose that a file is currently well laid out, you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location.

    Forcing people to disable the WAL just because it has a negative performance impact should be the first warning about how bad your system is. (Also look at the alternatives, e.g. Write-Anywhere.)
  12. What are the Problems with the current Storage Systems

    Hierarchy doesn’t scale well for Humans
    • Hierarchical and relational databases add structure (not related to the information)
    • Need to match rather than mold structure
    • Namespace fragmentation is BAD
    • The expressive power of an OS is proportional to the number of component connections, not the number of components.

    Monolithic Systems are complex, and difficult to change
    • How long does it take to change the allocation policy? …maybe a few months
    • How can you compare different allocation policies? …you cannot, it’s a different code-base
    • How long does it take to change the layout and avoid the 4k alignment? …rewriting the whole system

    Brutal to manage
    • Labels, partitions, volumes
    • Lots of limits: fs/volume size, number of files, directories
    • Different tools to manage file, block, NFS, …

    No guarantees
    • No defense against silent data corruption
    • No Consistency, Atomicity, Transactions for the app/user

    Dog Slow
    • Linear time create, fixed block size, painful RAID rebuild, growing backup time
  13. Objective: End of the Suffering

    • Figure out why storage has gotten so complicated
    • Blow away 20 years of obsolete assumptions
    • Design an integrated system from scratch
    • “Plugin” based code gives an order of magnitude more tinkering, not counting more people choosing to tinker
  14. B+Tree: Not just a Data-Structure (“B+Tree as a scalable architecture”)

    Internal nodes give you the path to the data. Leaves contain the data.

    [Diagram: Root Device (it’s a Group Device), Group Devices (Raid0, Raid1, …), Raw Devices (BlkDev, File, Socket, …); alongside, Coordinator, Coordinator, Raw Executor.]

    A relaxed B+Tree perfectly describes the architecture that we want. We start with a single node that does everything; if we add another node we add a coordinator which, like an “internal node”, gives you the pointer to reach the data. (The coordinator will easily be able to handle 500M objects on a modern laptop without hitting the disk.) Once we hit the coordinator limit we add another coordinator, and so on.

    In the same way that we describe the machines of the cluster using the B+Tree metaphor, we can also describe the Device Layer of a storage system. We can start off with a single disk and later add more disks, with the internal nodes distributing and balancing the requests to the leaves.

    By relaxing the B+Tree balancing, we can add other internal nodes (or “links”) that transform the data before it gets to the underlying device.
  15. Device Layer

    • More like the Virtual Memory system: “a virtually infinite storage space”
    • No partitions to manage on the user side
    • All bandwidth available
    • Plug/unplug of devices at runtime

    [Diagram: File-System on top of a Storage Pool; Tree Hierarchy (Group Devices & Devices) with a Group Device (e.g. Striping, Mirroring, …) over Devices (e.g. HDD, Remote, vdev, …) such as HDD, Remote and a Bad HDD Simulator.]

    Questions the pool has to answer: How do you distribute reads and writes? What about SSD vs HDD? Journal Writes, Recovery Reads, Async Reads, Async Writes? Encryption, Compression? Adding/Removing devices at runtime? Self-Healing? Which set of objects is hot? (Balance and replicate.)

    A storage pool can be composed of multiple different devices. A device can be added to a group, and the group device will manage its own set of devices with its own group policy.

    The device tree can have different compositions, e.g. the root stripes across 3 groups, and each group is a mirror of 3.

    With pluggable devices and groups you can efficiently inject new changes at runtime, or even have something like a shadow cluster for a migration or experiment. (See the previous encryption/compression example.)

    Everything is a device, even the RAM used by the File-System, which means that in conjunction with the pluggable device model you can get a free L2 cache layer.

    [Diagram: Group Cache Device (with SSD as L2) over RAM, SSD and HDDs.]
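The "everything is a device, so you get an L2 cache for free" remark can be sketched as a group device whose children are ordered from fastest to slowest. This is a simplified illustration with hypothetical names (`Tier`, `CacheGroup`); it has no eviction policy and no capacity limits, which a real cache device would need.

```python
from typing import Dict, List

class Tier:
    """One device in the cache hierarchy, e.g. RAM, SSD or HDD."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[int, bytes] = {}

class CacheGroup:
    """Group device over tiers ordered fastest first; the last tier is the backing store."""
    def __init__(self, tiers: List[Tier]) -> None:
        self.tiers = tiers

    def write(self, block_no: int, data: bytes) -> None:
        # Persist in the backing tier and drop stale copies from faster tiers.
        self.tiers[-1].blocks[block_no] = data
        for tier in self.tiers[:-1]:
            tier.blocks.pop(block_no, None)

    def read(self, block_no: int) -> bytes:
        for depth, tier in enumerate(self.tiers):
            if block_no in tier.blocks:
                data = tier.blocks[block_no]
                # Promote the block into every faster tier for the next read.
                for faster in self.tiers[:depth]:
                    faster.blocks[block_no] = data
                return data
        raise KeyError(block_no)

ram, ssd, hdd = Tier("ram"), Tier("ssd"), Tier("hdd")
cache = CacheGroup([ram, ssd, hdd])
cache.write(1, b"cold block")
first_read = cache.read(1)   # served by the HDD, then promoted to SSD and RAM
```

The file system above the group never learns that an SSD was inserted between RAM and the HDDs; it keeps issuing the same reads and writes.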
  16. Device Layer

    [Diagram: Type Balancer (Journal vs Async, …) on top of Balancing/Transformation nodes (Striping, Mirroring, Encryption, Compression, …) over Local HDD Disks, a Remote Disk, an SSD L2 Cache and an SSD Journal.]

    “Two” Kinds of Devices:
    • Group Devices
      Containers of “Raw” devices with a specified policy (e.g. Striping, Mirroring, Balancing, …). Requests are distributed among the raw devices to get the best performance. A corrupted read() from a raw device is retried on another device (if the policy allows) and the block is fixed on the first raw device.
    • “Raw” Devices
      Where the actual reads and writes of data happen (Disk, File, Remote Interface, …).
    • “Virtual” Devices
      They are “Raw” devices, but in general they are used to simulate something like a bad disk or a super-large disk, with reads and writes possibly not being persisted anywhere.
    • “Transformation” Devices
      They are placed as the parent of a Group or “Raw” device and perform a transformation of the data (e.g. compression, encryption, …). They are considered “Raw” devices since they don’t dispatch requests to multiple devices.

    “Raw” devices are traditionally simple: they don’t know much about the best layout for the objects or the relations between them. The layers above (“Object Layer” and “Semantic Layer”) are responsible for providing “hints” to the device and an efficient on-disk layout.

    In some cases an object can provide a repacker that repacks the object based on updated statistics. This can be another way of escaping the balancing time vs. space efficiency tradeoff on write, but it is also useful for static objects with relationships.

    Keep in mind that hardware RAID rewrites the full disk instead of only the live data, and it may also have problems with parity on failure. The software implementation can run on commodity hardware: if we are able to read something, it is because the full transaction was completed, and we know which blocks are live.
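The group-device rule "a corrupted read is retried on another device and the block is fixed" can be sketched with a per-block checksum. A minimal sketch under assumptions: the slides do not say how corruption is detected, so the CRC32 stored next to each block, and the names `RawDevice` and `MirrorGroup`, are illustrative choices.

```python
import zlib
from typing import Dict, List, Tuple

class RawDevice:
    """Stores (crc32, data) per block and verifies the checksum on read."""
    def __init__(self) -> None:
        self.blocks: Dict[int, Tuple[int, bytes]] = {}

    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = (zlib.crc32(data), data)

    def read(self, block_no: int) -> bytes:
        checksum, data = self.blocks[block_no]
        if zlib.crc32(data) != checksum:
            raise ValueError(f"checksum mismatch on block {block_no}")
        return data

class MirrorGroup:
    """Retries a corrupted read on another device, then repairs the bad copies."""
    def __init__(self, devices: List[RawDevice]) -> None:
        self.devices = devices

    def write(self, block_no: int, data: bytes) -> None:
        for device in self.devices:
            device.write(block_no, data)

    def read(self, block_no: int) -> bytes:
        damaged: List[RawDevice] = []
        for device in self.devices:
            try:
                data = device.read(block_no)
            except ValueError:
                damaged.append(device)
                continue
            for bad in damaged:
                bad.write(block_no, data)  # rewrite the block on the damaged device
            return data
        raise ValueError(f"no intact copy of block {block_no}")

dev_a, dev_b = RawDevice(), RawDevice()
group = MirrorGroup([dev_a, dev_b])
group.write(0, b"hello")
crc, _ = dev_a.blocks[0]
dev_a.blocks[0] = (crc, b"jello")  # simulate silent corruption on dev_a
```

A read through the group returns the intact copy from `dev_b` and leaves `dev_a` repaired, so the next read no longer needs the retry.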
  17. Object Layer

    A “File” is too limiting
    A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start reading/writing from (the offset), and the number of bytes to read/write. You can also cut bytes off the end of the file. Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted because of fairly ancient Unix file semantics, but this is likely to change someday.

    • An append-only file is too limiting: it is hard for the user to implement deletion if records are variable size (see compactions), while for the filesystem it is just removing an index from an array.
    • A table adds structure, structure that is not inherent to the information.
    • Key-Value adds too much overhead.

    An object describes your data
    • Every Object has its own on-disk layout and features.
    • Different Data Types (“Objects”) have different methods and needs.
    • An Object can also be used as a “compatibility layer”, e.g. a MySQL Table Object which reads directly from MySQL.
    • Flow Object: like a regular ‘80s file but with more flexibility. Insert/remove data at a specified offset, without rewriting the whole file.
    • The Object can give hints to the device about read/write access patterns (e.g. prefetching, repacker holes, …).

    Like Devices, Objects are pluggable, so you can easily write your own “Simulated Data Objects” that allow you to test your application without wasting a Terabyte of disk just to try your own “row count”.
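A "Simulated Data Object" for the row-count case can be as small as an object that generates rows on demand instead of storing them. The sketch below is hypothetical: `SimulatedTable`, `count` and `row` are illustrative names, not an API from the slides.

```python
from typing import Dict, Union

class SimulatedTable:
    """Pretends to hold row_count rows without storing any of them."""
    def __init__(self, row_count: int) -> None:
        self.row_count = row_count

    def count(self) -> int:
        return self.row_count

    def row(self, index: int) -> Dict[str, Union[int, str]]:
        if not 0 <= index < self.row_count:
            raise IndexError(index)
        # Rows are derived from the index, so no disk space is needed.
        return {"id": index, "payload": f"row-{index}"}

def application_row_count(table: SimulatedTable) -> int:
    """Stand-in for the application logic under test."""
    return table.count()

huge = SimulatedTable(10 ** 12)      # a trillion rows, zero bytes on disk
total = application_row_count(huge)
last = huge.row(10 ** 12 - 1)
```

The application code path is the same as with a real table object; only the object plugged in underneath changes.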
  18. Object Layer: match rather than mold structure …

    • Flow Object: like a regular ‘80s file but with more flexibility. Insert/remove data at a specified offset, without rewriting the whole file. (Specializations: Append-Only Log, Fixed-Block)
    • Deque Object: appends and pops from either side of the deque.
    • Sorted-Set Object: Key-Value Map. (Specializations: Vector, RecNo, Row-Oriented Table)
    • Metadata keeps track of field sizes and names.
    • Object Composition: one format does not fit all. Geometric objects have different needs than logs, ssets and other examples.

    [Diagram: Row-Oriented Table (Schema + SSet), each row holding Col 0 to Col 3; Column-Oriented Table (Schema + N-Vectors), each of Col 0 to Col 3 holding Row 0 to Row N; SSet mapping Key 0 … Key N to Value 0 … Value N; Vector mapping index 0 … N to Value 0 … Value N.]
  19. Semantic Layer

    • To interact with an object you name it, and you say what you want it to do.
    • A traditional file-system takes the name you give, looks through directories to find the object, and then gives the object your request to do something.
    • The semantic layer concerns itself with naming objects, and does not concern itself with how to pack objects into particular places on disk.
    • Every object has a name that identifies it. For the end user this name has a meaning, and this meaning should be captured by the Semantic Layer, while the rest of the Storage Layer is not interested in the meaning of the name.
    • A user-defined name generally has a variable length and tends to be verbose, while the storage layer needs something fixed-size and short to ensure a quick lookup. To do this, object names are converted into keys, which can be a simple hash of the name or something more elaborate.
    • The semantic layer takes names and converts them into keys; the Storage Layer takes keys and finds the objects.
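The "simple hash of the name" option for turning verbose names into short fixed-size keys might look like the sketch below. The hash function (BLAKE2b from Python's standard library) and the 12-byte key width are assumptions made for illustration; the slides only say the key should be short and fixed-size.

```python
import hashlib

KEY_BYTES = 12  # assumption: chosen to match the 12-byte oid in the sizing slide

def name_to_key(name: str) -> bytes:
    """Map a variable-length user name to a short fixed-size key."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=KEY_BYTES).digest()

short_key = name_to_key("~/myfile")
long_key = name_to_key("~/projects/storage/2014/" + "very-long-name" * 20)
print(len(short_key), len(long_key))
```

A plain hash gives quick lookups but no control over layout; the "something more elaborate" case would assign keys so that related objects sort near each other.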
  20. Semantic Layer

    [Diagram: User Requests enter the Semantic Layer, which resolves a Path/Query to a Key (Lookup Key); the Object Layer looks up the Metadata from the Key and returns an “Object Pointer” for Read/Write Requests.]

    The semantic layer takes names and converts them into keys. The object layer takes keys and finds the object.

    A simple example that motivates having an “Object Key” instead of the name provided by the user is the “Rename” operation. With an object-key you just need to replace the old mapping with the new mapping. With the user-key you need to replace every pointer containing that name (e.g. object references, snapshots…).

    A more complex example is balancing: assuming that the layout is based on a sorted data structure, by changing the key assignment the objects can be laid out in a better way.

    A more understandable example for the user is the namespace unification for metadata access:
       $ cat ~/myfile/@owner
       $ cat ~/myfile/@ctime
       $ echo "yanez" > ~/myfile/@owner
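The rename example can be made concrete with a name-to-key table in front of a key-to-object table. A minimal sketch with hypothetical names (`SemanticLayer`, `objects`, `snapshot_refs`); keys here come from a simple counter rather than a hash.

```python
import itertools
from typing import Dict, List

class SemanticLayer:
    """Maps user-visible names to internal object keys."""
    def __init__(self) -> None:
        self.names: Dict[str, int] = {}
        self._next_key = itertools.count(1)

    def create(self, name: str) -> int:
        key = next(self._next_key)
        self.names[name] = key
        return key

    def resolve(self, name: str) -> int:
        return self.names[name]

    def rename(self, old: str, new: str) -> None:
        # Only the name -> key mapping changes; the key itself is untouched.
        self.names[new] = self.names.pop(old)

objects: Dict[int, bytes] = {}       # object layer: key -> data
semantic = SemanticLayer()
key = semantic.create("report.txt")
objects[key] = b"contents"
snapshot_refs: List[int] = [key]     # e.g. a snapshot pointing at the object

semantic.rename("report.txt", "final.txt")
```

After the rename, the snapshot reference is still valid because it holds the key, not the name; nothing outside the mapping had to be rewritten.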
  21. v5 | Format

    Objects have Different Layouts
    …for different types
    …for different workloads

    [Diagram: Root Device (it’s a Group Device), Group Devices (Raid0, Raid1, …), Raw Devices (BlkDev, File, Socket, …).]

    Device Master Block: Dev Group ID, Dev ID, Raid Type, Disk Format, …, Dev Type, Dev bShift, …
    Super-Block (v5 format): Super-Blk 0, Super-Blk 1, …, Super-Blk N

    Object 0 (Object Map), e.g. 123 | 0xabcdef and 456 | 0xfedcba, pointing at Object 123 and Object 456
      Object-Map ~ Snapshots
      obj-map.add(oid, version, location)
    Object 1 (Space Map)
    Object 2 (Dedup Map), e.g. eccbc | 2 | 0xabcd, 1b2dc | 7 | 0x1234, 5849b | 5 | 0x4567, 7baf3 | 3 | 0x9ab8
      dup-map.add(hash, refs, location)
    Other system objects: Wandering Log, TXN Journal, …, Encryption Keys

    Transparent Sharding, Encryption, Deduplication, Repacking, …
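The Dedup Map rows (hash | refs | location) suggest a table keyed by content hash with a reference count. The sketch below is one plausible reading, not the v5 implementation: the class name, the `store` and `release` methods, the use of SHA-256 and the integer locations are all assumptions.

```python
import hashlib
from typing import Dict, List

class DedupMap:
    """Content hash -> [refs, location], mirroring the slide's three columns."""
    def __init__(self) -> None:
        self.entries: Dict[str, List[int]] = {}
        self.next_location = 0

    def store(self, data: bytes) -> int:
        """Return the location for data, allocating only for unseen content."""
        digest = hashlib.sha256(data).hexdigest()
        entry = self.entries.get(digest)
        if entry is None:
            entry = [0, self.next_location]
            self.next_location += 1
            self.entries[digest] = entry
        entry[0] += 1          # one more reference to the same block
        return entry[1]

    def release(self, data: bytes) -> None:
        """Drop one reference; forget the block when nobody uses it."""
        digest = hashlib.sha256(data).hexdigest()
        entry = self.entries[digest]
        entry[0] -= 1
        if entry[0] == 0:
            del self.entries[digest]

dedup = DedupMap()
loc_a = dedup.store(b"same block")
loc_b = dedup.store(b"same block")      # duplicate content, same location
loc_c = dedup.store(b"different block")
```

Two writes of identical content consume one location and leave a reference count of 2, which is the situation the `eccbc | 2 | 0xabcd` row depicts.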
  22. v5 | Layers

    Instead of Files and Directories, you have Objects! Every Object has its own on-disk layout and features.

    [Diagram labels, grouped by layer]
    Client
    • Client Layer: Sync/Async Libraries (C, C++, Python); Tools (Admin Shell, Debug Tools, Repair Tools, FUSE)
    • Binary-Field Protocol
    Server
    • RPC Layer: Protocols (Native Protocol, Text Protocol, SQL, …); User Management (Security/Auth, Quota/RateLimit, Scheduling, …)
    Storage Layer
    • Semantic Layer: API
    • Core Layer: TXN Manager, Task Scheduling, Object Observers
    • Object Layer: Objects (SSet, Deque, Flow, Vector, RecNo, Bitmap, SpaceMap, RTree, …)
    • Device Layer: Devices (File, BlkDev, Remote, …); Dev Groups (RAID0, RAID1, RAIDZ, …); I/O Management (Space Map, Balancer, Repacker, …); Quota/Throttling, Q-Scheduling

    • Pluggable Semantic Layer (File Lookup, Access Rights, ...)
    • Pluggable Device Layer (Block allocator, Disk Read/Write, ...)
    • Pluggable “File-Structures” with custom on-disk layouts.
    • Pluggable “File-Structures” that are queryable and have their own semantics.
    • Object Observers (Trigger operations)
  23. v5 | Distributed

    [Diagram: Coordinator, Coordinator, Executor.]

    • Coordinators are v5 machines with few objects: the object mapping tables and the balancer process. A coordinator easily handles 2B objects without hitting the disk on a 64G machine.
    • Executors are v5 machines that execute operations: create, delete, read, write, …
    • If a write arrives on a server not containing the object, that write will be passed to the correct server, since the internal bandwidth should be higher than the user one.

    Object Location Table: Object-ID | Replica 0 | Replica 1 | Replica 2 | Next | State
    Machine Address Set: one Machine Address Entry per machine

    Machine Address Entry
      2^16 machines = 65,536
      IPv6 length = 16 bytes, full machine name = 64 bytes
      65,536 * 16 = 1MiB, 65,536 * 64 = 4MiB
      1,000,000 * 16 = 16MB, 1,000,000 * 64 = 64MB

    Object Location Entry
      12 (oid) + 2 * 3 (replicas) + 2 (state) + 4 (next) = 24 bytes
      1GiB / entry = 44,739,242 = 44M objects (< 2^26)
      4GiB / entry = 178,956,970 = 178M objects (< 2^28)
      24GiB / entry = 1,073,741,824 = 1B objects (2^30)
      96GiB / entry = 4,294,967,296 = 4B objects (2^32)

    uint64_t
      200K req/sec = 2,924,712 years
      200M req/sec = 2,924 years
      200G req/sec = 2.9 years
    uint96_t
      1T req/sec = 38,334 years
      10G req/sec * 65,536 machines = 58 years

    …just another rpc plugin at the RPC level. It uses the “client api” like any other client, because if we need to implement another log at this level we have done something wrong in the storage layer (see the first slide :D)
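The sizing figures on this slide can be rechecked with a few lines of arithmetic. The sketch assumes a 365-day year, which is what reproduces the slide's year counts; the function names are illustrative. One observation from running the numbers: the 38,334-year and 58-year figures listed under uint96_t correspond to a 2^80 id space, and a full 2^96 space would last roughly 65,536 times longer.

```python
GIB = 1024 ** 3
SECONDS_PER_YEAR = 365 * 24 * 3600

# oid (12) + 3 replica ids of 2 bytes each + state (2) + next (4)
ENTRY_BYTES = 12 + 2 * 3 + 2 + 4

def entries_in(memory_bytes: int) -> int:
    """How many object location entries fit in the given amount of RAM."""
    return memory_bytes // ENTRY_BYTES

def years_to_exhaust(id_bits: int, requests_per_sec: float) -> float:
    """Years until an id counter of id_bits wraps at a constant request rate."""
    return 2 ** id_bits / requests_per_sec / SECONDS_PER_YEAR

print(ENTRY_BYTES)                                   # bytes per entry
print(entries_in(1 * GIB), entries_in(96 * GIB))     # objects per coordinator
print(int(years_to_exhaust(64, 200e3)))              # uint64 at 200K req/sec
print(int(years_to_exhaust(80, 1e12)))               # matches the 1T req/sec figure
print(int(years_to_exhaust(80, 10e9 * 65536)))       # matches the 65,536-machine figure
```

The same helper shows why a 64-bit id is uncomfortable at cluster scale: at 200G req/sec the counter wraps in under three years.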
  24. Conclusions

    • Architecture Matters!
    • Keep the code small and reduce the limitations
    • Device Layer
      • Single storage pool for the end user
      • Pluggable devices at runtime
      • Grouping devices for Striping, Mirroring, Balancing, …
      • Transparent encryption/compression per-device
      • Transactional
    • Object Layer
      • Need to match rather than mold structure
      • Different Data Types (“Objects”) have different methods and needs
      • Every Object has its own on-disk layout and features
      • An Object can also be used as a “compatibility layer” (e.g. a MySQL Table Object which reads directly from MySQL)
    • Semantic Layer
      • Translates the “user query” into something understandable by the system
      • Unifies the namespaces (cat ~/myfile/@owner)
    • Core Layer
      • Glue between the layers
      • User level transactions (aka object-requests coordination)
      • Object observers, snapshots, de-duplication

    How well your system performs is very much determined by how easy it is to make little changes to it. …and the more little experiments you make, the higher your performance is going to be.

    “The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it” - Reiser’s Law Of Information Economics
  25. Q&A (@Th30z)

    Abstract Storage Layer (must read if you don’t know FSs)
    https://speakerdeck.com/matteobertozzi/python-fuse-pycon4-1
    https://speakerdeck.com/matteobertozzi/r5l
    http://slideshare.net/matteobertozzi/raleighfs-v5

    ask-me://BalancedTreeVsUnbalancedTrees-andBLOB
    ask-me://node-temperature-segregation-and-variance
    ask-me://spmc-shard-vs-locks
    ask-me://journal-vs-cow-and-transactions
    ask-me://repacker-and-nesting-algorithm