
Distributed File Systems: An Overview

A brief overview of the various types of distributed filesystems.

Anant Narayanan

March 11, 2009

Transcript

  1. Distributed File Systems: An Overview

    Sections: Introduction, Classification, Lustre, Conclusion
    Anant Narayanan, Cluster and Grid Computing, Vrije Universiteit
    11 March 2009
  2. Outline

    • Introduction
    • Classification: Storage, Fault Tolerance, Applications
    • Lustre: Overview, Architecture, Implementation
  3. Definition

    • Allows access to files located on a remote host
    • In a transparent manner, as though the client were actually working on the host
    • Typically, clients do not have access to the underlying block storage
    • They interact over the network using a protocol
  4. Why do we need them?

    • Distributed applications usually require a common data store
    • Makes it easier to keep data consistent
    • Access control is possible on both the server and the client, depending on how the protocol is designed
    • Allows for implementation of replication and fault tolerance
  5. Storage: Block Oriented

    • The “usual” meaning of a file system
    • Deals with storing data on a block basis
    • Most distributed file systems are based on this at the lowest level
    • Examples: ext3, NTFS, HFS+
  6. Storage: Record Oriented

    • Were used on mainframes and minicomputers
    • Fetch and put whole records, seek to record boundaries
    • Have a lot in common with today’s databases
    • Examples: Files-11, Virtual Storage Access Method (VSAM)
  7. Storage: Object Oriented

    • Splits file metadata from file data
    • File data is further split into objects
    • Objects are stored on object storage servers
    • May or may not have a block oriented file system at the lowest layer
    • Examples: Lustre, XtreemFS
  8. Fault Tolerance

    • High availability: replication
    • Parallel: striping
    • Examples: Btrfs, Coda, GlusterFS
  9. Applications: Clusters

    • Shared disk systems (GFS)
    • Distributed disk systems (Lustre)
    • Typically used on local networks
    • Fast network access
  10. Applications: Grids or Clouds

    • Dynamic nature
    • Deal with heterogeneity
    • Deal with virtual organizations (VOs) in Grids and service level agreements (SLAs) in Clouds
    • Examples: XtreemFS, Dynamo
  11. Lustre: Overview

    • Linux + Cluster
    • Distributed, parallel, fault tolerant, object based file system
    • Tens of thousands of nodes
    • Petabytes of storage capacity
    • Hundreds of gigabytes per second of throughput
    • Without compromising on speed or security
    • From small workgroup clusters, to large-scale, multi-site clusters, to supercomputers
  12. Architecture: Layout

    Lustre clusters contain three kinds of systems:
    • File system clients, which can be used to access the file system
    • Object storage servers (OSS), which provide file I/O service
    • Metadata servers (MDS), which manage the names and directories in the file system

    [Figure 1. Systems in a Lustre cluster. Lustre clients (1-100,000) connect over several simultaneously supported network types (Elan, Myrinet, InfiniBand, GigE, with routers) to a pool of clustered MDS servers (1-100; MDS 1 active, MDS 2 standby) and to OSS servers (1-1000s). MDS disk storage contains metadata targets (MDT); OSS storage holds object storage targets (OST) on commodity storage or on enterprise-class storage arrays and SAN fabric. Shared storage enables failover.]
  13. Architecture: Components

    • File system clients: used to access the file system
    • Object storage servers (OSS): provide file I/O service and deal with block storage
    • Metadata servers (MDS): manage the names and directories in the file system and deal with authentication
  14. Architecture: Characteristics

    The following table shows the characteristics associated with each of the three types of systems. The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM) and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.

    System    Typical number of systems    Performance                           Required attached storage         Desirable hardware characteristics
    Clients   1–100,000                    1 GB/sec I/O, 1000 metadata ops       None                              None
    OSS       1–1000                       500 MB/sec to 2.5 GB/sec              File system capacity/OSS count    Good bus bandwidth
    MDS       2 (in the future 2–100)      3000–15,000 metadata operations/sec   1–2% of file system capacity      Adequate CPU power, plenty of memory
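The "required attached storage" column is simple arithmetic, which the following minimal sketch works through. The function name and the 1000 TB / 50 OSS figures are illustrative assumptions, not values from the slides; only the two rules (capacity divided by OSS count, and 1–2% of capacity for the MDS) come from the table.

```python
def size_lustre_storage(capacity_tb: float, oss_count: int) -> dict:
    """Apply the table's sizing rules to a target file system capacity.

    Hypothetical helper for illustration only; not part of any Lustre tool.
    """
    if capacity_tb <= 0 or oss_count < 1:
        raise ValueError("capacity must be positive and oss_count at least 1")
    return {
        # OSS rule: file system capacity / OSS count
        "per_oss_tb": capacity_tb / oss_count,
        # MDS rule: 1-2% of file system capacity
        "mdt_min_tb": capacity_tb / 100,
        "mdt_max_tb": capacity_tb / 50,
    }


# Assumed example: a 1000 TB file system served by 50 OSS nodes.
plan = size_lustre_storage(1000, 50)
print(plan)
```

Under these assumed inputs, each OSS would need roughly 20 TB attached, and the metadata storage would fall between 10 TB and 20 TB.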
  15. Architecture: Heterogeneous?

    • MDS and OSS may store actual data on ext3 or ZFS block file systems
    • InfiniBand, TCP/IP over Ethernet, and Myrinet are supported network types
    • Multiple CPU architectures: x86, x86_64, PPC
    • Requires a patched Linux kernel
  16. Implementation: Setup

    MDS:
      mkfs.lustre --mdt --mgs --fsname=large-fs /dev/sda
      mount -t lustre /dev/sda /mnt/mdt

    OSS1:
      mkfs.lustre --ost --fsname=large-fs --mgsnode=mds@tcp0 /dev/sdb
      mount -t lustre /dev/sdb /mnt/ost1

    Client:
      mount -t lustre mds.your.org:/large-fs /mnt/lustre-client
  17. Implementation: Networking

    • LNET abstracts over multiple supported networks
    • Provides the communication infrastructure required by Lustre
    • Takes care of abstracting over fail-over servers and load balancing
    • Provides support for Remote Direct Memory Access (RDMA)
    • Provides an end-to-end throughput of 100 MB/sec on Gigabit Ethernet networks
    • Up to 1.5 GB/sec on InfiniBand
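To put the two throughput figures in perspective, here is a small back-of-the-envelope sketch. The 1 TB transfer size and the decimal unit conversions (1 GB = 1000 MB, 1 Gbit/s = 125 MB/s raw) are assumptions made for illustration; only the 100 MB/sec and 1.5 GB/sec figures come from the slide.

```python
GIGE_MB_PER_S = 100         # end-to-end LNET figure quoted for Gigabit Ethernet
INFINIBAND_MB_PER_S = 1500  # 1.5 GB/sec, assuming 1 GB = 1000 MB


def transfer_seconds(size_mb: float, throughput_mb_per_s: float) -> float:
    """Time to move size_mb at a sustained throughput."""
    return size_mb / throughput_mb_per_s


def link_efficiency(throughput_mb_per_s: float, link_gbit_per_s: float) -> float:
    """Fraction of the raw link rate delivered end to end."""
    raw_mb_per_s = link_gbit_per_s * 1000 / 8
    return throughput_mb_per_s / raw_mb_per_s


one_tb_in_mb = 1_000_000  # assumed transfer size
gige_time = transfer_seconds(one_tb_in_mb, GIGE_MB_PER_S)
ib_time = transfer_seconds(one_tb_in_mb, INFINIBAND_MB_PER_S)
gige_eff = link_efficiency(GIGE_MB_PER_S, 1)

print(f"GigE: {gige_time:.0f} s, InfiniBand: {ib_time:.0f} s, "
      f"GigE efficiency: {gige_eff:.0%}")
```

Under these assumptions, 100 MB/sec corresponds to about 80% of the raw Gigabit Ethernet line rate, and the same transfer is 15 times faster at the quoted InfiniBand rate.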
  18. Implementation: Where are the files?

    Traditional UNIX disk file systems use inodes, which contain lists of block numbers where the file data for the inode is stored. Similarly, one inode exists on the MDT for each file in the Lustre file system. However, in the Lustre file system, the inode on the MDT does not point to data blocks, but instead points to one or more objects associated with the files. This is illustrated in Figure 5. These objects are implemented as files on the OST file systems and contain file data. Figure 6 shows how a file open operation transfers the object pointers from the MDS […]

    [Figure 5. MDS inodes point to objects; ext3 inodes point to data. One panel shows a file on the MDT whose inode carries extended attributes referencing obj1 on oss1, obj2 on oss2, and obj3 on oss3. The other shows an ordinary ext3 file whose inode holds direct, indirect, and double indirect data block pointers.]
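The contrast between the two kinds of inode can be sketched as two small data structures. This is a simplified model for illustration, not Lustre's on-disk format: the class and field names are hypothetical, and the object and server names are taken from the labels in Figure 5.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ext3Inode:
    """Ordinary ext3 file: the inode lists the blocks holding the data."""
    block_numbers: List[int]


@dataclass
class ObjectRef:
    """Pointer to one object, stored as a file on some OSS's OST."""
    object_id: str
    oss: str


@dataclass
class MdtInode:
    """File on the MDT: the inode lists objects, not data blocks."""
    name: str
    objects: List[ObjectRef]


def servers_for(inode: MdtInode) -> List[str]:
    """Return the object storage servers a client must contact for file data."""
    return [ref.oss for ref in inode.objects]


# Layout from Figure 5: three objects on three different servers.
lustre_file = MdtInode(
    name="example.dat",  # hypothetical file name
    objects=[
        ObjectRef("obj1", "oss1"),
        ObjectRef("obj2", "oss2"),
        ObjectRef("obj3", "oss3"),
    ],
)
local_file = Ext3Inode(block_numbers=[120, 121, 122])  # made-up block numbers

print(servers_for(lustre_file))
```

The point of the model is that the metadata server only hands out object pointers; the file data itself is read from and written to the object storage servers.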
  19. Implementation: Striping and Replication

    • One object per MDS inode implies “unstriped” data
    • Multiple objects per MDS inode implies that the file has been split, similar to RAID 0
    • These stripes may be duplicated across several OSS
    • Provides fault tolerance and high availability
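The RAID 0 analogy can be made concrete with a minimal sketch of round-robin striping, which maps a byte offset in the file to an object and an offset inside that object. This assumes a plain round-robin layout with a fixed stripe size; the function name and the 1 MiB stripe size are illustrative choices, not details taken from the slides or from Lustre's source.

```python
from typing import Tuple

MIB = 1024 * 1024


def locate(offset: int, stripe_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a file offset to (object index, offset within that object).

    Assumes RAID 0 style round-robin placement: stripe 0 goes to object 0,
    stripe 1 to object 1, and so on, wrapping around after stripe_count.
    """
    if offset < 0 or stripe_size < 1 or stripe_count < 1:
        raise ValueError("offset must be >= 0, sizes and counts >= 1")
    stripe_index = offset // stripe_size
    object_index = stripe_index % stripe_count
    # Each full round of stripes adds one stripe_size chunk to every object.
    object_offset = (stripe_index // stripe_count) * stripe_size + offset % stripe_size
    return object_index, object_offset


# A file striped over three objects with an assumed 1 MiB stripe size.
print(locate(0, MIB, 3))
print(locate(MIB, MIB, 3))
print(locate(3 * MIB + 5, MIB, 3))

# With a single object the file is "unstriped": offsets map through unchanged.
print(locate(5 * MIB + 7, MIB, 1))
```

With one object per inode the mapping is the identity, which matches the "unstriped" case on the slide; with several objects, consecutive stripes land on different servers and can be read in parallel.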
  20. Conclusion

    • Reliable, scalable, and performant file system
    • Open architecture and protocols

    BUT
    • Does not handle dynamic addition or removal of servers
    • Does not provide the kind of access control and security that a VO might need
  21. What next?

    • Other distributed file systems and implementation details
    • Grids (XtreemFS), Clouds (Dynamo)
    • Questions?
  22. References

    You may be required to register on the Sun website to access these documents!
    • Datasheet: http://www.sun.com/software/products/lustre/datasheet.pdf
    • Scalable Cluster Filesystem Whitepaper: http://www.sun.com/offers/docs/LustreFileSystem.pdf
    • LNET: http://www.sun.com/offers/docs/lustre_networking.pdf
    • Lustre Documentation Index: http://manual.lustre.org/index.php?title=Main_Page
    • Lustre Publications Index: http://wiki.lustre.org/index.php?title=Lustre_Publications