
Ceph

Derek Weitzel

February 17, 2012

Transcript

  1. In the Before…
     - Let's go back through some of the notable distributed file systems used in HPC
  2. In the Before…
     - There were distributed filesystems like:
     - Lustre – RAID over storage boxes
       - Recovery time after a node failure was MASSIVE! (an entire server's contents had to be copied, one to one)
       - When functional, reading/writing is EXTREMELY fast
       - Used heavily in HPC
  3. In the Before…
     - There were distributed filesystems like:
     - NFS – Network File System
       - Does this really count as distributed?
       - Single large server
       - Full POSIX support, in the kernel since… forever
       - Slow with even a moderate number of clients
       - Dead simple
  4. In the Current…
     - There are distributed filesystems like:
     - Hadoop – Apache project inspired by Google
       - Massive throughput
       - Throughput scales with attached HDs
       - Have seen VERY LARGE production clusters: Facebook, Yahoo… Nebraska
       - Doesn't even pretend to be POSIX
  5. In the Current…
     - There are distributed filesystems like:
     - GPFS (IBM) / Panasas – proprietary file systems
       - Require a closed-source kernel driver
       - Not flexible with the newest kernels / OSes
       - Good: good support and large communities
       - Can be treated as a black box by administrators
       - HUGE installations (Panasas at LANL is HUGE!!!!)
  6. Motivation
     - Ceph is an emerging technology in the production clustered environment
     - Designed for:
       - Performance – data striped over data servers
       - Reliability – no single point of failure
       - Scalability – adaptable metadata cluster
  7. Timeline
     - 2006 – Ceph paper written
     - 2007 – Sage Weil earned his PhD from Ceph (largely)
     - 2007–2010 – Development continued, primarily for DreamHost
     - March 2010 – Linus merged the Ceph client into the mainline 2.6.34 kernel
       - No more patches needed for clients
  8. Adding Ceph to the Mainline Kernel
     - Huge development!
     - Significantly lowered the cost to deploy Ceph
     - For production environments it was a little too late – 2.6.32 was the stable kernel used in RHEL 6 (CentOS 6, SL 6, Oracle 6)
  9. Ceph Overview
     - Decoupled data and metadata
       - I/O goes directly to the object servers
     - Dynamic distributed metadata management
       - Multiple metadata servers handle different directories (subtrees)
     - Reliable autonomic distributed storage
       - OSDs manage themselves by replicating and monitoring
  10. Decoupled Data and Metadata
     - Increases performance by limiting interaction between clients and servers
     - Decoupling is common in distributed filesystems: Hadoop, Lustre, Panasas…
     - In contrast to other filesystems, Ceph uses a function to calculate block locations
  11. Dynamic Distributed Metadata Management
     - Metadata is split among a cluster of servers
     - The distribution of metadata shifts with the number of requests, evening the load among metadata servers
     - Metadata servers can also quickly recover from failures by taking over a neighbor's data
     - Improves performance by leveling the metadata load
  12. Reliable Autonomic Distributed Storage
     - Data storage servers act on events by themselves
       - Each initiates replication and recovery on its own
     - Improves performance by offloading decision making to the many data servers
     - Improves reliability by removing central control of the cluster (a single point of failure)
  13. Ceph Components
     - Some quick definitions before getting into the paper
     - MDS – Metadata Server
     - OSD – Object Storage Device (the data servers)
     - MON – Monitor (now fully implemented)
  14. Ceph Components
     - Ordered: Clients, Metadata, Object Storage
     [Figure 1: System architecture. Clients perform file I/O by communicating directly with OSDs; metadata operations go to the metadata cluster, file I/O goes to the object storage cluster.]
  15. Ceph Components
     - Ordered: Clients, Metadata, Object Storage
     [Figure 1: System architecture.]
  16. Client Overview
     - Can be a FUSE mount
       - File system in user space
       - Introduced so file systems can use a better interface than the Linux kernel VFS (virtual file system)
     - Can link directly to the Ceph library
     - Built into the newest OSes
  17. Client Overview – File I/O
     - 1. Client asks the MDS for the inode information
     [Figure 1: System architecture.]
  18. Client Overview – File I/O
     - 2. MDS responds with the inode information
     [Figure 1: System architecture.]
  19. Client Overview – File I/O
     - 3. Client calculates the data location with CRUSH
     [Figure 1: System architecture.]
  20. Client Overview – File I/O
     - 4. Client reads directly off the storage nodes
     [Figure 1: System architecture.]
  21. Client Overview – File I/O
     - Client asks the MDS for only a small amount of information
       - Performance: small bandwidth between client and MDS
       - Performance: only a small cache (memory) is needed because the data is small
     - Client calculates the file location using a function (see the sketch below)
       - Reliability: saves the MDS from keeping block locations
       - The function is described in the data storage section
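     A minimal Python sketch of the four-step read path above. It is a sketch only: mds, osds and crush_locate are invented stand-ins, not the real Ceph client API, and crush_locate itself is sketched after slide 36.

       # Hypothetical sketch of the Ceph client read path (slides 17-21).
       # mds, osds and crush_locate are stand-ins, not the real client API.
       def read_object(mds, osds, crush_locate, cluster_map, path, object_no):
           # Steps 1-2: one small round trip to the MDS, which returns only
           # inode metadata (inode number, size, striping), never block locations.
           inode = mds.lookup(path)
           # Step 3: the client computes the object's location itself, so the
           # MDS never has to store or serve a block map.
           pg_members = crush_locate(inode.ino, object_no, cluster_map)
           # Step 4: read directly from one member of the placement group;
           # a write would instead go to the primary (the first member).
           return osds[pg_members[0]].read(inode.ino, object_no)
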
  22. Ceph Components
     - Ordered: Clients, Metadata, Object Storage
     [Figure 1: System architecture.]
  23. Client Overview – Namespace
     - Optimized for the common case, 'ls -l'
       - A directory listing immediately followed by a stat of each file
       - Reading a directory gives all the inodes in the directory
     - Namespace covered in detail next!

       $ ls -l
       total 0
       drwxr-xr-x 4 dweitzel swanson  63 Aug 15  2011 apache
       drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-api-java
       drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-common
       drwxr-xr-x 7 dweitzel swanson 103 Jan 18 16:37 bestman2
       drwxr-xr-x 6 dweitzel swanson  75 Jan 18 12:25 buildsys-macros
  24. Metadata Overview
     - Metadata servers (MDS) serve out the file system attributes and directory structure
     - Metadata is stored in the distributed filesystem beside the data
       - Compare this to Hadoop, where metadata is stored only on the head nodes
     - Updates are staged in a journal and flushed occasionally to the distributed file system
  25. MDS Subtree Partitioning
     - In HPC applications it is common to have 'hot' metadata that is needed by many clients
     - To be scalable, Ceph needs to distribute metadata requests among many servers
     - Each MDS monitors the frequency of queries using special counters
     - The MDSs compare the counters with each other and split the directory tree to evenly split the load (a toy sketch follows below)
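     A toy Python sketch of that counter-based comparison; the classes, threshold and hand-off rule are invented for illustration and are not Ceph's real balancing heuristics.

       from collections import Counter

       class ToyMDS:
           def __init__(self, name):
               self.name = name
               self.subtrees = set()     # directory subtrees this MDS currently owns
               self.hits = Counter()     # per-subtree popularity counters

           def record(self, subtree):
               self.hits[subtree] += 1   # bump the counter on every metadata request

           def load(self):
               return sum(self.hits.values())

       def rebalance(mds_a, mds_b, ratio=2.0):
           """If one MDS is much busier, hand its hottest subtree to the other."""
           busy, idle = sorted([mds_a, mds_b], key=ToyMDS.load, reverse=True)
           if busy.load() and busy.load() > ratio * max(idle.load(), 1):
               subtree, _ = busy.hits.most_common(1)[0]
               busy.subtrees.discard(subtree)
               idle.subtrees.add(subtree)    # cf. the "nicely exporting" MDS log on slide 63
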
  26. MDS Subtree Partitioning
     - Multiple MDSs split the metadata
     - Clients receive the metadata partition data from the MDS during a request
     [Figure: the directory tree under Root is partitioned across MDS 0–4; a busy directory is hashed across many MDSs.]
  27. MDS Subtree Partitioning
     - Busy directories (multiple creates or opens) will be hashed across multiple MDSs
     [Figure: directory tree partitioned across MDS 0–4.]
  28. MDS Subtree Partitioning
     - Clients read from a random replica
     - Updates go to the primary MDS for the subtree
     [Figure: directory tree partitioned across MDS 0–4.]
  29. Ceph Components
     - Ordered: Clients, Metadata, Object Storage
     [Figure 1: System architecture.]
  30. Data Placement
     - Need a way to evenly distribute data among the storage devices (OSDs)
       - Increased performance from even data distribution
       - Increased resiliency: with an even distribution, losing any one node minimally affects the status of the cluster
     - Problem: we don't want to keep data locations in the metadata servers
       - That requires lots of memory when there are lots of data blocks
  31. CRUSH
     - CRUSH is a pseudo-random function to find the location of data in a distributed filesystem
     - Summary: take a little information and plug it into a globally known function (hashing) to find where the data is stored
     - The input data is:
       - inode number – from the MDS
       - OSD cluster map (CRUSH map) – from the OSDs / monitors
  32. CRUSH
     - CRUSH maps a file to a list of servers that have the data
     [Figure: file → objects (ino, ono) → oid → hash(oid) & mask → pgid → CRUSH(pgid) → (osd1, osd2); OSDs are grouped by failure domain.]
  33. CRUSH
     - File to object: takes the inode number (from the MDS) and the object number within the file – (ino, ono) → oid
     [Figure: the file → object → PG → OSD mapping.]
  34. CRUSH
     - File to placement group (PG): the object ID and the number of PGs – hash(oid) & mask → pgid
     [Figure: the file → object → PG → OSD mapping.]
  35. Placement Group
     - A set of OSDs that manages a subset of the objects
     - An OSD will belong to many placement groups
     - A placement group has R OSDs, where R is the number of replicas
     - Within a PG, an OSD is either a Primary or a Replica
       - The Primary is in charge of accepting modification requests for the placement group
       - Clients write to the Primary and read from a random member of the placement group
  36. CRUSH
     - PG to OSD: the PG ID and the cluster map (from the OSDs/monitors) – CRUSH(pgid) → (osd1, osd2, …); a sketch of the whole chain follows below
     [Figure: the file → object → PG → OSD mapping.]
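     Putting slides 32–36 together, a minimal Python sketch of the mapping chain. The hash below is an ordinary digest standing in for the real CRUSH algorithm, the cluster map is a flat list with no failure domains, and the PG count, OSD names and replica count are assumptions.

       import hashlib

       PG_MASK = 0xFFF                                  # assume 4096 placement groups
       CLUSTER_MAP = [f"osd{i}" for i in range(10)]     # hypothetical OSD cluster map
       REPLICAS = 2

       def crush_locate(ino, ono, cluster_map=CLUSTER_MAP):
           oid = f"{ino:x}.{ono:08x}"                   # file -> object   (ino, ono) -> oid
           h = int(hashlib.sha1(oid.encode()).hexdigest(), 16)
           pgid = h & PG_MASK                           # object -> PG     hash(oid) & mask -> pgid
           first = pgid % len(cluster_map)              # PG -> OSDs       (toy rule, not real CRUSH)
           return [cluster_map[(first + r) % len(cluster_map)] for r in range(REPLICAS)]

       # Deterministic: any client holding the same cluster map computes the same
       # OSD list, so no server has to store per-object locations. The first OSD
       # returned plays the role of the PG primary.
       print(crush_locate(0x10000000004, 0))
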
  37. CRUSH
     - Now we know where to write and read the data
     - Now, how do we safely handle replication and node failures?
  38. Replication
     - Data is replicated to the other nodes in the placement group
     [Figure: replication timeline – the client writes to the Primary; the Primary applies the update, forwards it to the Replicas, acks the client, and later commits to disk.]
  39. Replication
     - The client writes to the placement group Primary (found with the CRUSH function)
     [Figure: replication timeline (Client, Primary, two Replicas).]
  40. Replication
     - The Primary OSD replicates to the other OSDs in the placement group
     [Figure: replication timeline (Client, Primary, two Replicas).]
  41. Replication
     - The update is committed only after the longest (slowest) replica update completes (see the sketch below)
     [Figure: replication timeline (Client, Primary, two Replicas).]
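     A toy Python sketch of the ordering on that timeline. The classes and method names are invented for illustration; real OSDs do the forwarding and the disk commits asynchronously.

       class ToyOSD:
           def __init__(self, name):
               self.name, self.memory, self.disk = name, {}, {}

           def apply(self, key, value):      # apply the update to in-memory state
               self.memory[key] = value
               return "ack"

           def commit(self, key):            # later: flush the update to disk
               self.disk[key] = self.memory[key]
               return "commit"

       def client_write(primary, replicas, key, value):
           primary.apply(key, value)                          # write goes to the PG primary
           acks = [r.apply(key, value) for r in replicas]     # primary fans out to the replicas
           assert all(a == "ack" for a in acks)               # ack the client only after the
                                                              # slowest replica has applied it
           # ... and the final commit only after every member has hit disk:
           commits = [o.commit(key) for o in [primary] + replicas]
           return all(c == "commit" for c in commits)
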
  42. Failure Detection
     - Each autonomic OSD looks after the other nodes in its placement groups (possibly many!) – a toy sketch follows below
     - Monitors keep a cluster map (used in CRUSH)
     - Multiple monitors keep an eye on the cluster configuration and dole out cluster maps
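     A hypothetical sketch of that peer monitoring; the timeout, the heartbeat bookkeeping and the monitor interface are all invented, not Ceph's actual protocol.

       import time

       HEARTBEAT_TIMEOUT = 20.0     # assumed seconds of silence before a peer is suspect

       def check_peers(last_heartbeat, monitors, now=None):
           """last_heartbeat maps peer OSD name -> time of its last heartbeat."""
           now = now if now is not None else time.time()
           for peer, seen in last_heartbeat.items():
               if now - seen > HEARTBEAT_TIMEOUT:
                   # Report the silent peer to the monitors, which mark it Down
                   # in the cluster map and hand the new map out to OSDs and clients.
                   monitors.report_failure(peer)
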
  43. Recovery & Updates
     - Recovery is entirely between OSDs
     - OSDs have two "off" modes, Down and Out
       - Down: the node could come back; the Primary role for its PGs is handed off
       - Out: the node will not come back; its data is re-replicated
  44. Recovery & Updates
     - Each object has a version number
     - Upon coming back up, an OSD checks the version numbers of its placement groups to see if they are current
     - It then checks the version numbers of objects to see which need updating (see the sketch below)
  45. Ceph Components
     - Ordered: Clients, Metadata, Object Storage (Physical)
     [Figure 1: System architecture.]
  46. Object Storage
     - The underlying filesystem can make or break a distributed one
     - Filesystems have different characteristics
       - Example: ReiserFS is good at small files
       - XFS is good at REALLY big files
     - Ceph keeps a lot of attributes on the inodes and needs a filesystem that can handle extended attributes
  47. Object Storage
     - Ceph can run on normal file systems, but slowly
       - XFS, ext3/4, …
     - The project created its own filesystem to handle the special object requirements of Ceph
       - EBOFS – Extent and B-tree based Object File System
  48. Object Storage
     - Important to note that development of EBOFS has ceased
     - Though Ceph can run on any normal filesystem (I have it running on ext4)
     - It is hugely recommended to run it on BTRFS
  49. Object Storage – BTRFS
     - Fast writes: copy-on-write file system for Linux
     - Great performance: supports small files with fast lookup using a B-tree algorithm
     - Ceph requirement: supports unlimited chaining of attributes
     - Integrated into the mainline kernel in 2.6.29
     - Considered a next-generation file system
       - Peer of ZFS from Sun
       - Child of ext3/4
  50. Performance & Scalability
     - Write latency with different replication factors
     - Remember, Ceph has to write to all replicas before it ACKs the write to the client
     [Figure: write latency (ms) vs. write size (4–4096 KB) for no replication, 2x replication and 3x replication; sync writes vs. sync lock with async write.]
  51. Performance & Scalability
     - The X-axis is the size of the write to Ceph
     - The Y-axis is the latency when writing X KB
     [Figure: write latency vs. write size.]
  52. Performance & Scalability
     - Notice, these are still small writes, < 1 MB
     - As you can see, the more replicas Ceph has to write, the slower the ACK to the client
     [Figure: write latency vs. write size.]
  53. Performance & Scalability
     - Obviously, the async write is faster
     - The latency for async writes comes from flushing buffers to Ceph
     [Figure: write latency vs. write size.]
  54. Performance and Scalability
     - Two lines for each file system
     - Writes are bunched at the top, reads at the bottom
     [Figure: per-OSD throughput (MB/sec) vs. I/O size (4 KB – 16 MB) for ebofs, ext3, reiserfs and xfs, reads and writes plotted separately. The accompanying caption notes that replication has minimal impact on per-OSD throughput, although with a fixed number of OSDs, n-way replication reduces total effective throughput by a factor of n because replicated data must be written to n OSDs.]
  55. Performance and Scalability
     - The X-axis is the KBs written to or read from the OSD
     - The Y-axis is the throughput per OSD (node)
     [Figure: per-OSD throughput vs. I/O size for ebofs, ext3, reiserfs and xfs.]
  56. Performance and Scalability
     - The custom EBOFS does much better on both writes and reads
     [Figure: per-OSD throughput vs. I/O size for ebofs, ext3, reiserfs and xfs.]
  57. Performance and Scalability
     - Writes with EBOFS max out the throughput of the underlying HD
     [Figure: per-OSD throughput vs. I/O size; the horizontal line marks the upper limit imposed by the physical disk.]
  58. Performance and Scalability
     - The X-axis is the size of the cluster
     - The Y-axis is the per-OSD throughput
     [Figure: per-OSD throughput (MB/sec) vs. OSD cluster size (2–26) for crush (32k PGs), crush (4k PGs), hash (32k PGs), hash (4k PGs) and linear placement.]
  59. Performance and Scalability
     - Most configurations hover around HD speed
     [Figure: per-OSD throughput vs. OSD cluster size.]
  60. Performance and Scalability
     - 32k PGs distribute data more evenly over the cluster than 4k PGs
     [Figure: per-OSD throughput vs. OSD cluster size.]
  61. Performance and Scalability
     - Evenly splitting the data leads to a balanced load across the OSDs
     [Figure: per-OSD throughput vs. OSD cluster size.]
  62. Conclusions
     - Very fast POSIX-compliant file system
     - General enough for many applications
     - No single point of failure – important for large data centers
     - Can handle HPC-like applications (lots of metadata, small files)
  63. Demonstration
     - Some quick things in case the demo doesn't work
     - MDS log of one MDS handing off a directory to another for load balancing:

       2012-02-16 18:15:17.686167 7f964654b700 mds.0.migrator nicely exporting to mds.1 [dir 10000000004 /hadoop-grid/ [2,head] auth{1=1} pv=2574 v=2572 cv=0/0 ap=1+2+3 state=1610612738|complete f(v2 m2012-02-16 18:14:21.322129 1=0+1) n(v86 rc2012-02-16 18:15:16.423535 b36440689 292=213+79) hs=1+8,ss=0+0 dirty=9 | child replicated dirty authpin 0x29a0fe0]
  64. Demonstration
     - Election after a Monitor was overloaded
     - Lost another election (peon ☹):

       2012-02-16 16:23:22.920514 7fcf40904700 log [INF] : mon.gamma calling new monitor election
       2012-02-16 16:23:26.167868 7fcf40904700 log [INF] : mon.gamma calling new monitor election
       2012-02-16 16:23:31.558554 7fcf40103700 log [INF] : mon.gamma@1 won leader election with quorum 1,2
       2012-02-16 17:15:36.301172 7f50b360e700 mon.gamma@1(peon).osd e26 e26: 3 osds: 2 up, 3 in
  65. Where to Find More Info
     - A new company is sponsoring development
       - http://ceph.newdream.net/
     - Instructions on setting up Ceph can be found on the Ceph wiki:
       - http://ceph.newdream.net/wiki/
     - Or on my blog:
       - http://derekweitzel.blogspot.com/