
GPFS @ NOAO

AstDatCenTec April 18-19, 2013

Joshua Hoblitt

April 18, 2013

Transcript

  1. Overview
    • Very hand-wavy discussion of block-level filesystems
    • GPFS concepts and features
    • Why we selected GPFS
    • Experience @ NOAO
    • General suggestions on RAID configuration
    • Faults
  2. Types of Filesystems
    • Limiting the definition of a FS to software which organizes data directly on block-level storage
    • Ignoring “user space” filesystems – Gluster, iRODS, etc.
    • A few hand-wavy examples...
  3. Shared FS
    [Diagram: two nodes, each running the full kernel-space stack (controller driver, SCSI transport, SCSI block device, gfs2, VFS) plus a distributed lock manager (DLM), both attached to the same “shared disk”.]
  4. Distributed FS
    [Diagram: two nodes, each with its own local disk and kernel-space stack (controller driver, SCSI transport, SCSI block device, AFS, VFS); no shared block access between them.]
  5. GPFS' model
    [Diagram: two nodes, each running the user-space mmfsd daemon above the kernel-space stack (controller driver, SCSI transport, SCSI block device, mmfs, VFS), both attached to a “shared disk”.]
  6. GPFS' model (cont.)
    [Diagram: NSD servers front the disks over SAS / FC / etc.; GPFS clients send data, metadata, and token requests to the NSD servers over the network.]
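    The NSD layer in this model can be inspected from any cluster node with the standard mm* query commands; a minimal sketch (the filesystem device name gpfs1 is hypothetical):

      # Show cluster membership and node roles.
      mmlscluster
      # Show which NSDs exist and which servers front them.
      mmlsnsd
      # Show the state of the disks backing one filesystem
      # ('gpfs1' is a made-up device name).
      mmlsdisk gpfs1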
  7. GPFS Concepts and Features
    • Does not taint the Linux kernel
    • Massively multithreaded user-space daemon
      – Manages buffering, prefetch, metadata
    • Failure groups
    • Storage pools
    • Policies
    • cNFS
    • Filesets
    • Quotas
    • 287 utilities
    • File clones
    • Replication
    • Snapshots
    • Remote cluster access
    • Direct C API
    • Callbacks
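    A few of these features as they look from the command line; a minimal sketch with hypothetical device, fileset, and snapshot names (exact option syntax varies by GPFS release):

      # Create and link a fileset (names are made up).
      mmcrfileset gpfs1 rawdata
      mmlinkfileset gpfs1 rawdata -J /gpfs1/rawdata

      # Take a filesystem snapshot.
      mmcrsnapshot gpfs1 nightly_20130418

      # Install a placement policy written in the GPFS policy language;
      # this trivial rule just sends new files to the system pool.
      cat > placement.pol <<'EOF'
      RULE 'default' SET POOL 'system'
      EOF
      mmchpolicy gpfs1 placement.pol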
  8. GPFS Organization
    [Diagram: a Cluster contains NSDs backed by Disks; a Filesystem is built from NSDs grouped into Failure Groups and Storage Pools, and subdivided into Filesets.]
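    In GPFS 3.5 this organization is declared up front in an NSD stanza file passed to mmcrnsd and mmcrfs; a minimal sketch with made-up NSD, device, and server names:

      # Hypothetical GPFS 3.5-style NSD stanza file; NSD, device, and
      # server names below are for illustration only.
      cat > nsd.stanza <<'EOF'
      %nsd: nsd=nsd1 device=/dev/sdb servers=decnsd1,decnsd2 usage=dataAndMetadata failureGroup=1 pool=system
      %nsd: nsd=nsd2 device=/dev/sdc servers=decnsd2,decnsd1 usage=dataAndMetadata failureGroup=2 pool=system
      EOF
      mmcrnsd -F nsd.stanza        # register the NSDs with the cluster
      mmcrfs gpfs1 -F nsd.stanza   # build a filesystem from those NSDs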
  9. Why choose GPFS?
    • Decision was made in 2010
      – At that time Oracle was still digesting the remains of Sun
    • Lustre lacks replication functionality
    • Desire not to be completely dependent on iRODS
    • Desire to use the same skill set for storage heaps with different usage scenarios
      – Performance & reliability
    • Experience at partner institutions – SDSC & NCSA
    • Negative feedback on glusterfs, etc.
    • ceph was still a year away from being production ready
  10. C. A. R. P.
    • Cheap
    • Available
    • Reliable
    • Performant
    • (pick two)
  11. Experience @ NOAO
    • Overall experience is highly positive
    • Scott Fadden @ IBM is a great resource – GPFS Wiki
    • Tools and support are excellent
    • Licensing with IBM has had some issues
  12. Faults Observed
    • Exposed system problems
      – Bad system RAM
      – Bad RAID controller RAM
      – Problems with ixgbe
      – Bad SAS cable
      – RAID controller firmware bug(s)
    • GPFS' own
      – GPFS version / kernel compatibility issue
  13. DECam CP cluster
    • 1st GPFS build @ NOAO
    • 4MB block size
    • 3.4 -> 3.5
    • Metadata migrated to SSDs
    • Metadata replication (creation sketched below)
    [Diagram: NSD servers decnsd1 and decnsd2 attached to disks over 3G SAS, serving clients dec01, dec02, ... dec15 over 10G Ethernet.]
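    A filesystem with these parameters could be created roughly as follows; the device name decfs and the stanza file are hypothetical, and -B/-m/-M/-r/-R are the standard mmcrfs block size and replication options:

      # Hypothetical creation of a 4 MiB block size filesystem with
      # metadata replicated (2 copies) and data left unreplicated.
      mmcrfs decfs -F nsd.stanza -B 4M -m 2 -M 2 -r 1 -R 2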
  14. Separate Data & Metadata

    disk            disk size  failure holds    holds        free TB          free TB
    name                in TB    group metadata data  in full blocks     in fragments
    --------------- ---------- ------- -------- ----- ---------------- ----------------
    Disks in storage pool: system (Maximum disk size allowed is 70 TB)
    nsd1                     9       1 No       Yes        1 (  5%)         1 ( 1%)
    nsd7                     1       1 Yes      No         1 ( 66%)         1 (13%)
    nsd3                     9       1 No       Yes        1 (  5%)         1 ( 1%)
    nsd5                     1       1 Yes      No         1 ( 65%)         1 (13%)
    nsd4                     9       2 No       Yes        1 (  5%)         1 ( 1%)
    nsd2                     9       2 No       Yes        1 (  8%)         1 ( 1%)
    nsd6                     1       2 Yes      No         1 ( 64%)         1 (14%)
    nsd8                     1       2 Yes      No         1 ( 65%)         1 (14%)
                    ----------                       ---------------- ----------------
    (pool total)            36                             3 (  6%)         1 ( 1%)
                    ==========                       ================ ================
    (data)                  35                             3 (  6%)         1 ( 1%)
    (metadata)               1                             1 ( 65%)         1 (13%)
                    ==========                       ================ ================
    (total)                 36                             3 (  6%)         1 ( 1%)

    Inode Information
    -----------------
    Number of used inodes:        1663869
    Number of free inodes:        5708931
    Number of allocated inodes:   7372800
    Maximum number of inodes:    18300928
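    The listing above is in the format reported by mmdf; a sketch of the invocation, assuming the DECam filesystem device is named decfs (the real device name is not given here):

      # Report per-NSD capacity, metadata/data roles, and inode usage.
      mmdf decfs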
  15. Pollux cluster
    • 2nd GPFS build @ NOAO
    • 2MB block size
    • 3.5 from start (used to test 3.5 prior to dec upgrade)
    • Data and metadata replication
    [Diagram: pollux1, pollux2, and pollux3 with attached disks (plus decnsd2), connected via 6G SAS and 10G Ethernet.]
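    Block size and replication settings on an existing filesystem can be checked with mmlsfs; a sketch assuming the device is named polluxfs (hypothetical):

      # -B: block size, -m: default metadata replicas, -r: default data replicas
      mmlsfs polluxfs -B -m -r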
  16. Mixed Data & Metadata

    disk            disk size  failure holds    holds        free TB          free TB
    name                in TB    group metadata data  in full blocks     in fragments
    --------------- ---------- ------- -------- ----- ---------------- ----------------
    Disks in storage pool: system (Maximum disk size allowed is 233 TB)
    pollux1_nsd1            30       1 Yes      Yes        3 (  7%)         1 ( 1%)
    pollux1_nsd2            30       1 Yes      Yes        3 (  9%)         1 ( 1%)
    pollux1_nsd3            30       1 Yes      Yes        2 (  7%)         1 ( 1%)
    pollux1_nsd4            30       1 Yes      Yes        2 (  5%)         1 ( 1%)
    pollux2_nsd2            30       2 Yes      Yes        3 (  9%)         1 ( 1%)
    pollux2_nsd3            30       2 Yes      Yes        2 (  7%)         1 ( 1%)
    pollux2_nsd4            30       2 Yes      Yes        2 (  5%)         1 ( 1%)
    pollux2_nsd1            30       2 Yes      Yes        3 (  7%)         1 ( 1%)
    pollux3_nsd3            30       3 Yes      Yes        2 (  6%)         1 ( 1%)
    pollux3_nsd2            30       3 Yes      Yes        3 (  9%)         1 ( 1%)
    pollux3_nsd1            30       3 Yes      Yes        3 (  8%)         1 ( 1%)
    pollux3_nsd4            30       3 Yes      Yes        2 (  7%)         1 ( 1%)
                    ----------                       ---------------- ----------------
    (pool total)           350                            25 (  7%)         4 ( 1%)
                    ==========                       ================ ================
    (total)                350                            25 (  7%)         4 ( 1%)

    Inode Information
    -----------------
    Number of used inodes:       71011457
    Number of free inodes:        7041919
    Number of allocated inodes:  78053376
    Maximum number of inodes:   146800640
  17. Future @ NOAO
    • 2 more GPFS clusters are planned
      – Archive copy in La Serena
      – “3rd” copy on Kitt Peak
    • Tucson archive will have its metadata migrated to dedicated SSDs (sketched below)
    • Ongoing annual expansion
    • Aggregate capacity of 1020 TiB by end of FY13
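    One plausible sequence for migrating metadata to dedicated SSDs, assuming GPFS 3.5 stanza syntax and a hypothetical device name archivefs (the exact mmchdisk invocation should be verified against the release documentation):

      # 1. Add the SSD-backed NSDs as metadata-only disks (ssd.stanza
      #    would declare them with usage=metadataOnly).
      mmadddisk archivefs -F ssd.stanza

      # 2. Change the existing spinning-disk NSDs to data-only so no new
      #    metadata lands on them (stanza-file syntax here is an
      #    assumption; check mmchdisk for your release).
      mmchdisk archivefs change -F dataonly.stanza

      # 3. Restripe so existing metadata moves off the data-only disks
      #    and onto the SSDs.
      mmrestripefs archivefs -r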
  18. General Suggestions on RAID configuration
    • Model I/O capacity through the entire stack
    • Tunneling SATA over SAS has high overhead (~30%?)
    • Match filesystem write width to RAID set array width (worked example below)
    • Array set blocksize can be tricky / non-intuitive
      – Larger blocksizes tend to optimize heavily for writes over reads
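    A worked example of matching the filesystem block size to the RAID stripe width, using made-up geometry (an 8+2 RAID6 set with a 256 KiB segment size):

      # Full-stripe data width of a hypothetical 8+2 RAID6 array with a
      # 256 KiB segment size:
      #   8 data disks * 256 KiB = 2048 KiB = 2 MiB
      # A GPFS block size of 2 MiB (or an even multiple such as 4 MiB)
      # therefore writes whole stripes and avoids parity read-modify-write.
      echo $(( 8 * 256 ))   # -> 2048 (KiB per full stripe)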
  19. RAID6 Reads
    [Chart: read throughput in MB/s (0–1200) vs. block size (32K–4096K) for 18- and 20-disk RAID6 arrays with 64k, 128k, and 256k segment sizes; series compare readahead vs. no readahead and cached vs. uncached variants (e.g. raid6_18disk_256k_ra_cached).]
  20. RAID6 Writes
    [Chart: write throughput in MB/s (0–1000) vs. block size (32K–4096K) for the same 18- and 20-disk RAID6 configurations and readahead/caching variants as the reads chart.]