NFS Tuning Secrets: or Why Does "Sync" Do Two Different Things

Greg Banks
February 01, 2008

I'm often asked to rescue situations where NFS is working too slowly for people's liking. Sometimes the answer is pointing out that NFS is already running as fast as the hardware will allow. Sometimes the answer involves undoing some clever "tuning improvements" that have been performed in the name of "increasing performance". Every now and again, there's an actual NFS bug that needs fixing.

This talk will distill that experience, and give guidance on the right way to tune Linux NFS servers and clients. I talked a little about this last year, in the context of tuning performed by SGI's NAS Server product. This year I'll expand on the subject and give direct practical advice.

I'll cover the fundamental hardware and software limits of NFS performance, so you can tell if there's any room for improvement. I'll mention some of the more dangerous or slow "improvements" you could make by tuning unwisely. I'll explain how some of the more obscure tuning options work, so you can see why and how they need to be tuned. I'll even cover some bugs which can cause performance problems.

After hearing this talk, you'll leave the room feeling rightfully confident in your ability to tune NFS in the field.

Plus, you'll have a warm fuzzy feeling about NFS and how simple and obvious it is. Sorry, just kidding about that last bit.

Transcript

  1. Greg Banks <[email protected]> Senior Software

    Engineer File Serving Technologies NFS Tuning Secrets
  2. 07/10/12 Contents • Introduction • How NFS Works • Performance

    Limiting Factors • Top 10 Rules for Tuning Clients • Top 10 Rules for Tuning Servers • Measurement Tips & Tricks • Bonus Topics • Questions?
  3. 07/10/12 About SGI • We used to make all sorts

    of cool stuff • Now we make different cool stuff: – Altix: honking great NUMA / Itanium computers – Altix XE: high-end Xeon rackmount servers – ICE: integrated Xeon / Infiniband blade clusters – Nexis: mid- to high-end NAS & SAN servers
  4. 07/10/12 File Serving Technologies • 1 of 2 storage software

    groups • We make the bits that make Nexis run • Based right here in Melbourne • Teams – XFS – NFS, Samba, iSCSI – Nexis web GUI – Performance Co-Pilot
  5. 07/10/12 About This Talk • Part of my job is

    answering questions on NFS performance • “NFS is too slow! What can I do to speed it up?” • This talk distils several years of answers • Mostly relevant to Linux NFS clients & servers – some more generic advice • Mostly considering bulk read/write workloads – Metadata workloads have less room to optimise with the NFS protocol
  6. 07/10/12 Mythbusters • Customer says, “Performance Doesn't Matter”. • This

    is never true • Translation: “I have a performance requirement...but I'm going to make you guess it” • Performance tuning will always matter
  7. 07/10/12 Contents • Introduction • How NFS Works • Performance

    Limiting Factors • Top 10 Rules for Tuning Clients • Top 10 Rules for Tuning Servers • Measurement Tips & Tricks • Bonus Topics • Questions?
  8. 07/10/12 Contents • Introduction • How NFS Works • Performance

    Limiting Factors • Top 10 Rules for Tuning Clients • Top 10 Rules for Tuning Servers • Measurement Tips & Tricks • Bonus Topics • Questions?
  9. 07/10/12 Server Workload • Mixing NFS serving and other demand

    on the server will slow down both • NFS serving isn't “cheap” – costs CPU time – costs memory – costs PCI bandwidth – costs network bandwidth – costs disk bandwidth • The NFS server is an application not a kernel feature – Works best when given all the machine's resources
  10. 07/10/12 Network Performance • Network hardware – NICs – switches

    • Network software – NIC drivers – Bonding – TCP/IP stack
  11. 07/10/12 Server FS Performance • Hardware – Disks – RAID

    controllers – FC switches – HBAs • Software – Filesystem – Volume Manager – Snapshots – Block layer – SCSI layer
  12. 07/10/12 NFS vs. Server FS Interactions • NFS server does

    things no sane local workload does – Calls f_op->open / f_op->release around each RPC – Looks up files by inode number each call – Uses disconnected dentries; reconnects disconnected dentries – Synchronises data more often – Synchronises metadata more often – Wants to write into page cache while VM is pushing pages out – Performs IO from a kernel thread – Uses f_op->sendfile / f_op->splice_read for zero-copy reads
  13. 07/10/12 NFS vs. Server FS Interactions • NFS server does

    things no sane local workload does – Calls f_op->open / f_op->release around each RPC – Looks up files by inode number each call – Uses disconnected dentries; reconnects disconnected dentries – Synchronises data more often – Synchronises metadata more often – Wants to write into page cache while VM is pushing pages out – Performs IO from a kernel thread – Uses f_op->sendfile / f_op->splice_read for zero-copy reads • The filesystem may have difficulty with these – Because developers often aren't aware of the above – Filesystem bugs might appear only when serving NFS
  14. 07/10/12 NFS Server Behaviour • Other clients may be using

    the same files • Other clients are sharing the server's resources • Server may obey client's order to synchronise data...or not – even though that breaks the NFS protocol and is unsafe
  15. 07/10/12 NFS Server Behaviour • Other clients may be using

    the same files • Other clients are sharing the server's resources • Server may obey client's order to synchronise data...or not – even though that breaks the NFS protocol and is unsafe • NFS server application efficiency issues – Thread to CPU mapping – Data structures – Lock contention, cacheline bouncing
  16. 07/10/12 NFS Client Behaviour • Parallelism on the wire, for

    performance – Can result in IO mis-ordering at server – Can defeat filesystem & VFS streaming optimisations – e.g. VFS readahead on server •
  17. 07/10/12 NFS Client Behaviour • Parallelism on the wire, for

    performance – Can result in IO mis-ordering at server – Can defeat filesystem & VFS streaming optimisations – e.g. VFS readahead on server • Transfer size is limited by client support • Linux clients do not align IOs (except to their own page size) – may be slow if server's page size is larger; RAID stripe issues •
  18. 07/10/12 NFS Client Behaviour • Parallelism on the wire, for

    performance – Can result in IO mis-ordering at server – Can defeat filesystem & VFS streaming optimisations – e.g. VFS readahead on server • Transfer size is limited by client support • Linux clients do not align IOs (except to their own page size) – may be slow if server's page size is larger; RAID stripe issues • Reads: client does its own readahead – may interact with server & RAID readahead • Writes: client decides COMMIT interval – without adequate knowledge of what is an efficient IO size on server
  19. 07/10/12 Application Behaviour • IO sizes • Buffered or O_DIRECT

    or O_SYNC or async IO • fsync • write/read or pwrite/pread or mmap or sendfile/splice • Threads/parallelism • Application buffering/caching/readahead • Time behaviour: burstiness, correlation
  20. 07/10/12 Contents • Introduction • How NFS Works • Performance

    Limiting Factors • Top 10 Rules for Tuning Clients • Top 10 Rules for Tuning Servers • Measurement Tips & Tricks • Bonus Topics • Questions?
  21. 07/10/12 #1 First tune your network • NFS cannot be

    any faster than the network • To test the network, use – ping – ping -f – ttcp or nttcp • ethtool is your friend • Think about network bottlenecks, other users of network
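    A minimal sketch of that checklist, assuming the server answers to the hostname "server" and the client NIC is eth0 (both names are placeholders). Flood ping needs root, and ttcp needs a receiver started on the far end first:

        # confirm negotiated speed and duplex before blaming NFS
        ethtool eth0
        # baseline latency and packet loss
        ping -c 100 server
        # flood ping; dots that stay on screen are dropped packets
        ping -f -c 10000 server
        # raw TCP throughput: run "ttcp -r -s" on the server, then transmit from the client
        ttcp -t -s server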
  22. 07/10/12 #2 Use NFSv3 • Not NFSv2 under any circumstances

    • NFSv4...meh • On modern Linux clients, this is the default
  23. 07/10/12 #3 Use TCP • Not UDP under any circumstances

    • On modern Linux clients, this is the default
  24. 07/10/12 #4 Use the maximum transfer size • Use the

    largest rsize / wsize options that both the client and the server support • On modern Linux clients, this is the default •
  25. 07/10/12 #4 Use the maximum transfer size • Use the

    largest rsize / wsize options that both the client and the server support • On modern Linux clients, this is the default • Larger is always better – no more resources are being wasted – larger always faster, up to the server's limit – modern servers & clients can do 1MiB •
  26. 07/10/12 #4 Use the maximum transfer size • Use the

    largest rsize / wsize options that both the client and the server support • On modern Linux clients, this is the default • Larger is always better – no more resources are being wasted – larger always faster, up to the server's limit – modern servers & clients can do 1MiB • Various sizes below which performance is worse – [rw]size < client page size (e.g. 4K) is bad – [rw]size < server's page size is bad – [rw]size < server's RAID stripe width is bad
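    For illustration, a mount line that makes those choices explicit (the 1 MiB figure assumes both ends support it; the export path and mount point are placeholders). The client only requests rsize/wsize, and the server may negotiate them down, so check /proc/mounts afterwards:

        mount -t nfs -o rw,hard,intr,tcp,vers=3,rsize=1048576,wsize=1048576 \
            server:/export /mnt/export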
  27. 07/10/12 #5 Don't use soft • Use hard, not soft

    • On modern Linux clients, this is the default • Only bend this rule if you know what you're doing – If you had to ask, you don't know what you're doing
  28. 07/10/12 #6 Use intr • Because you want to be

    able to interrupt your program when the server isn't there • By the time you discover you wanted this, it's too late – time to reboot the client machine • This is NOT the default!
  29. 07/10/12 #7 Use the maximum MTU • Use the largest

    MTU that all the network machinery between client and server properly supports • This usually means 9KB • Consult your switch documentation; experiment •
  30. 07/10/12 #7 Use the maximum MTU • Use the largest

    MTU that all the network machinery between client and server properly supports • This usually means 9KB • Consult your switch documentation; experiment • Why? – Each Ethernet frame received takes CPU work – Each TCP segment received takes CPU work – Larger MTU => more data moved per unit of CPU work – Very visible on fast networks (10gige, IB)
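    A hedged sketch of enabling and verifying jumbo frames (eth0 and the 9000-byte MTU are assumptions, and every switch port in the path must be configured to match):

        ip link set dev eth0 mtu 9000
        # verify the path carries 9000-byte frames without fragmenting:
        # 8972 = 9000 minus 20 bytes of IP header minus 8 bytes of ICMP header
        ping -M do -s 8972 -c 5 server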
  31. 07/10/12 #8 No More Mount Options • Options you really

    really don't want: – sync • Client emits small WRITEs serially (instead of large IOs in parallel) • Each WRITE RPC waits until the server says the data is on disk • Slooooooooooow – noac on Linux • Implies sync
  32. 07/10/12 #9 Parallelism • Client OS will have a tunable

    to control the parallelism on the wire, e.g. – Number of slots (Linux) – Number of nfs threads – Number of biods • If you can figure out where this is, ensure it's about 16 • On modern Linux clients, this is the default
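    On Linux the slot count is the sunrpc tcp_slot_table_entries tunable; a hedged way to inspect and set it (the exact sysctl path has moved around between kernel versions, so treat this as a sketch):

        # current value; 16 has been the long-standing default
        cat /proc/sys/sunrpc/tcp_slot_table_entries
        # set it before the filesystems are mounted
        echo 16 > /proc/sys/sunrpc/tcp_slot_table_entries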
  33. 07/10/12 #10 Umm... • If I could think of ten

    rules, I'd be writing for Letterman
  34. 07/10/12 #10 Client Readahead • Seriously... • On Linux clients,

    tune max readahead to be a multiple of 4 – Ensures READ rpcs will be aligned to rsize thanks to VFS readahead code – Server can be sensitive to that alignment • Default is 15 (bad); 16 is good; more than #slots (16) is bad • SLES – echo 16 > /proc/sys/fs/nfs/nfs_max_readahead • Others – Change NFS_MAX_READAHEAD in fs/nfs/super.c
  35. 07/10/12 A Word About Defaults • You may have noticed...

    • Modern Linux clients mostly have good defaults • Most mount options you specify will make things worse • Don't just copy mount options from your old /etc/fstab • Start with rw,intr
  36. 07/10/12 Mount Options: Trust...But Verify • What you specify in

    /etc/fstab isn't necessarily what applies • Especially: [rw]size is negotiated with server • If unspecified, vers & proto may be negotiated with server • Look in /proc/mounts after mounting – Modern nfs-utils have nfsstat -m
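    Concretely, the verification step looks like this (output formats vary with kernel and nfs-utils versions):

        # what the kernel actually negotiated, per mount
        grep nfs /proc/mounts
        # modern nfs-utils: effective options (rsize, wsize, proto, vers) per mount
        nfsstat -m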
  37. 07/10/12 Contents • Introduction • How NFS Works • Performance

    Limiting Factors • Top 10 Rules for Tuning Clients • Top 10 Rules for Tuning Servers • Measurement Tips & Tricks • Bonus Topics • Questions?
  38. 07/10/12 #1 Tune the Storage Hardware • Choose hardware RAID

    stripe unit • Choose number of disks in hw RAID set – You want to encourage NFS to do Full Stripe Writes • Avoiding a Read-Modify-Write cycle in the RAID controller – So choose a stripe width == MIN(max_sectors_kb, NFS max transfer size)/N – Some RAID hw prefers 2^n+1 RAID sets e.g. 4+1, 8+1 •
  39. 07/10/12 #1 Tune the Storage Hardware • Choose hardware RAID

    stripe unit • Choose number of disks in hw RAID set – You want to encourage NFS to do Full Stripe Writes • Avoiding a Read-Modify-Write cycle in the RAID controller – So choose a stripe width == MIN(max_sectors_kb, NFS max transfer size)/N – Some RAID hw prefers 2^n+1 RAID sets e.g. 4+1, 8+1 • Choose RAID caching mode – Nobrainers: want read caching, write caching – Write cache mirroring = slow but safe...choose carefully
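    A hedged worked example of that formula (the 512 KiB transfer size and the 8+1 RAID5 set are assumptions, not figures from the slides):

        stripe unit = MIN(max_sectors_kb, NFS max transfer) / N
                    = MIN(512 KiB, 512 KiB) / 8 = 64 KiB per data disk
        full stripe = 8 data disks * 64 KiB = 512 KiB

    so a single 512 KiB WRITE lands as one full stripe write and the controller never has to read-modify-write parity.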
  40. 07/10/12 #2 Tune the Block Layer • Choose the right

    IO scheduler for your workload – CFQ (Complete Fair Queuing) seems to work OK – Even though it's dumb about iocontexts & NFS – echo cfq > /sys/block/$sdx/queue/scheduler – But...YMMV. Experiment! •
  41. 07/10/12 #2 Tune the Block Layer • Choose the right

    IO scheduler for your workload – CFQ (Complete Fair Queuing) seems to work OK – Even though it's dumb about iocontexts & NFS – echo cfq > /sys/block/$sdx/queue/scheduler – But...YMMV. Experiment! • Increase CTQ (Command Tag Queue) depth – Improve SCSI parallelism => better disk performance – Sometimes unobvious upper limits; per-HBA, per-RAID controller – Default might be 1 (worst case), try increasing it – echo 4 > /sys/block/$sdx/device/queue_depth
  42. 07/10/12 #2 Son of Tune the Block Layer • Bump

    up max_sectors_kb to get large IOs – You want the largest IOs possible going to disk • That are multiples of RAID stripe width – Linux limit varies with server page size • Altix: 16 KiB pages => 2 MiB max_sectors_kb • x86_64: 4 KiB pages => 512 KiB max_sectors_kb – echo 512 > /sys/block/$sdx/queue/max_sectors_kb •
  43. 07/10/12 #2 Son of Tune the Block Layer • Bump

    up max_sectors_kb to get large IOs – You want the largest IOs possible going to disk • That are multiples of RAID stripe width – Linux limit varies with server page size • Altix: 16 KiB pages => 2 MiB max_sectors_kb • x86_64: 4 KiB pages => 512 KiB max_sectors_kb – echo 512 > /sys/block/$sdx/queue/max_sectors_kb • Check your block device max readahead is adequate – cat /sys/block/$sdx/queue/read_ahead_kb • Don't forget to make your changes persistent – sgisetqdepth on SGI ProPack – Add an /etc/init.d script or a udev rule
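    A minimal sketch of the init-script route to persistence (the device names, scheduler and values are assumptions; adjust them per array):

        #!/bin/sh
        # re-apply block layer tuning at boot; /sys settings do not survive a reboot
        for dev in sdb sdc sdd; do
            echo cfq > /sys/block/$dev/queue/scheduler
            echo 512 > /sys/block/$dev/queue/max_sectors_kb
            echo 512 > /sys/block/$dev/queue/read_ahead_kb
            echo 4   > /sys/block/$dev/device/queue_depth
        done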
  44. 07/10/12 #3 Tune the Filesystem • Know your filesystem –

    RTFM, experiment! • Tune for NFS workload – Not local workloads – Not your application run locally • Some tunings must be done at mkfs time – So don't wait until your data is already on the fs •
  45. 07/10/12 #3 Tune the Filesystem • Know your filesystem –

    RTFM, experiment! • Tune for NFS workload – Not local workloads – Not your application run locally • Some tunings must be done at mkfs time – So don't wait until your data is already on the fs • Choose partitioning+Volume Manager arrangement to align filesystem structures to underlying RAID stripes • Use noatime or relatime (modern kernels) – If you can; noatime confuses some apps like mail readers
  46. 07/10/12 #3 Revenge of Tune the Filesystem • XFS needs

    care & attention to get the best performance – Default options historically compatible not performance optimal – Be careful with units in mkfs.xfs, xfs_info •
  47. 07/10/12 #3 Revenge of Tune the Filesystem • XFS needs

    care & attention to get the best performance – Default options historically compatible not performance optimal – Be careful with units in mkfs.xfs, xfs_info • Optimise log IO – Maximise log size (128 MiB historically), but aligned to RAID stripe width – mkfs -l size=... -l sunit=... -l version=2 – Log buffer size needs to be >= log stripe unit, multiple of RAID stripe width – mount -o logbufs=8 logbsize=... •
  48. 07/10/12 #3 Revenge of Tune the Filesystem • XFS needs

    care & attention to get the best performance – Default options historically compatible not performance optimal – Be careful with units in mkfs.xfs, xfs_info • Optimise log IO – Maximise log size (128 MiB historically), but aligned to RAID stripe width – mkfs -l size=... -l sunit=... -l version=2 – Log buffer size needs to be >= log stripe unit, multiple of RAID stripe width – mount -o logbufs=8 logbsize=... • Consider a separate log device...NVRAM/SSD if you can • Lazy superblock counters to reduce SB IO – mkfs -l lazy-count=1
  49. 07/10/12 #3 Bride of Tune the Filesystem • Make XFS

    align disk structures to underlying RAID stripes – mkfs -d sunit=... -d swidth=... • Increase directory block size – Larger block size improves workloads which modify large directories – mkfs -n size=16384 •
  50. 07/10/12 #3 Bride of Tune the Filesystem • Make XFS

    align disk structures to underlying RAID stripes – mkfs -d sunit=... -d swidth=... • Increase directory block size – Larger block size improves workloads which modify large directories – mkfs -n size=16384 • Choose number of Allocation Groups – Default is...arbitrary – Fewer AGs can improve some workloads – mkfs -d agcount=... OR mkfs -d agsize=... • Increase inode hash size on inode-heavy workloads – mount -o ihashsize=... # Not needed on latest XFS
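    Pulling the XFS options from the last few slides into one hedged example (the device, the 64 KiB stripe unit and the 8-disk width are assumptions; sunit and swidth are given to mkfs.xfs in 512-byte sectors):

        # 64 KiB stripe unit = 128 sectors; 8 data disks => swidth = 1024 sectors
        mkfs.xfs -d sunit=128,swidth=1024 \
                 -l size=128m,version=2,sunit=128,lazy-count=1 \
                 -n size=16384 /dev/sdb1
        # logbsize must be at least the log stripe unit (64 KiB here)
        mount -o logbufs=8,logbsize=256k,noatime /dev/sdb1 /export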
  51. 07/10/12 #4 Tune the VM • Make the VM push

    unstable pages to disk faster – Hopefully, some before the client does a COMMIT – Especially useful if you enable the async export option – sysctl vm.dirty_writeback_centisecs=50 – sysctl vm.dirty_background_ratio=10 • But...YMMV • Do not reduce vm.dirty_ratio – Can cause nfsds to block unnecessarily – You really don't want that
  52. 07/10/12 #5 Tune PCI Cards • Some PCI performance effects...

    – Cards sharing a bus share bus bandwidth – Two cards in a bus can slow down the bus rate – Putting a card in some slots can slow down slots on other buses – Throughput limits in PCI bridges – DMA resource limits in PCI bridges – Network cards use lots of small DMAs, very unfriendly to PCI – Some PCI devices don't play well with others, need to be on their own bus – Some cards require MaxReadSize tuned upwards – BIOS or driver setting – Some cards benefit from Write Combining on PCI mappings – driver setting – Some cards benefit from using MSI – driver, BIOS settings
  53. 07/10/12 #5 Ghost of Tune PCI Cards • NUMA effects

    – Minimise NUMA hops: ensure interrupts are bound to nearby CPUs – Fast networks, filesystems can saturate NUMA interconnects – So try to arrange for page cache to be near the NIC & HBA that DMA to it – Turn off cpuset page spreading, it pessimises DMA patterns •
  54. 07/10/12 #5 Ghost of Tune PCI Cards • NUMA effects

    – Minimise NUMA hops: ensure interrupts are bound to nearby CPUs – Fast networks, filesystems can saturate NUMA interconnects – So try to arrange for page cache to be near the NIC & HBA that DMA to it – Turn off cpuset page spreading, it pessimises DMA patterns • FC multipathing: choose default paths carefully – Want to balance IOs down multiple paths to the storage • Know your network, SCSI, and platform hardware! – RTFM. Experiment
  55. 07/10/12 #6 Tune the Network • Check basic parameters –

    Speed, duplex, errors – ethtool •
  56. 07/10/12 #6 Tune the Network • Check basic parameters –

    Speed, duplex, errors – ethtool • Tune interrupt coalescing for bulk traffic – Reduce the NIC interrupt rate for a given workload • Adds latency, but it increases NFS server throughput • If you have latency-sensitive traffic as well, consider a separate NIC for that – ethtool -C ethN rx-usecs 80 rx-frames 20 rx-usecs-irq 80 rx-frames-irq 20 – Some cards only support a subset of those parameters – Current generation 1gige, 10gige, next gen IB cards
  57. 07/10/12 #6 The Evil of Tune the Network • Bind

    NIC interrupts to CPUs – Keep device cachelines hot in one CPU – ifconfig tells you the IRQ – printf %08x $[1<<$cpu] > /proc/irq/$irq/smp_affinity – Careful manual binding may be better than irqbalanced •
  58. 07/10/12 #6 The Evil of Tune the Network • Bind

    NIC interrupts to CPUs – Keep device cachelines hot in one CPU – ifconfig tells you the IRQ – printf %08x $[1<<$cpu] > /proc/irq/$irq/smp_affinity – Careful manual binding may be better than irqbalanced • Increase socket buffer sizes – Need to allow effective TCP buffering, window scaling on faster networks – Modern kernels do this automatically; older ones have bad defaults – sysctl net.ipv4.tcp_{,r,w}mem='8192 262144 524288' – More for 10gige – Actual sysctls needed vary with kernel version; consult kernel docs
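    A hedged sketch of the binding step (the interface name and CPU number are assumptions; /proc/interrupts is an alternative to ifconfig for finding the IRQ):

        # find the IRQ for eth2, then pin it to CPU 3
        irq=$(awk -F: '/eth2/ { gsub(/ /, "", $1); print $1; exit }' /proc/interrupts)
        printf %08x $((1 << 3)) > /proc/irq/$irq/smp_affinity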
  59. 07/10/12 #6 The Curse of Tune The Network • Enable

    TSO (TCP Segment Offload) – Moves some of the simpler TCP grunt work from software to hardware – Larger DMAs to card, less CPU work per byte sent – May be off by default – ethtool -K ethN tso on – Current generation 1gige, 10gige, IB cards •
  60. 07/10/12 #6 The Curse of Tune The Network • Enable

    TSO (TCP Segment Offload) – Moves some of the simpler TCP grunt work from software to hardware – Larger DMAs to card, less CPU work per byte sent – May be off by default – ethtool -K ethN tso on – Current generation 1gige, 10gige, IB cards • Possibly enable LRO (Large Receive Offload) – Receive-side equivalent of TSO – May help for streaming workloads. Experiment! – Current generation 10gige, next generation IB
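    A quick hedged check-and-enable sequence (eth0 is a placeholder; not every driver exposes every offload, and older ethtool versions may not know the lro keyword):

        # list what the driver currently offloads (tso, lro, checksumming, scatter-gather)
        ethtool -k eth0
        # enable TSO, and LRO where the driver supports it
        ethtool -K eth0 tso on
        ethtool -K eth0 lro on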
  61. 07/10/12 #6 Flesh for Tune the Network • Enable RSS

    (Receive Side Scaling) – Splits interrupt load into multiple MSI-X -> spread across multiple CPUs – Fiddly to tune, consult driver docs – Current generation 10gige, next generation IB cards •
  62. 07/10/12 #6 Flesh for Tune the Network • Enable RSS

    (Receive Side Scaling) – Splits interrupt load into multiple MSI-X -> spread across multiple CPUs – Fiddly to tune, consult driver docs – Current generation 10gige, next generation IB cards • Check hardware checksum offload is enabled – Some drivers let you turn it off with ethtool (why?) – Current generation 1gige, 10gige, in next generation IB cards •
  63. 07/10/12 #6 Flesh for Tune the Network • Enable RSS

    (Receive Side Scaling) – Splits interrupt load into multiple MSI-X -> spread across multiple CPUs – Fiddly to tune, consult driver docs – Current generation 10gige, next generation IB cards • Check hardware checksum offload is enabled – Some drivers let you turn it off with ethtool (why?) – Current generation 1gige, 10gige, in next generation IB cards • Enable IPoIB Connected Mode – Allows larger MTU (~64KiB) => less CPU spent in interrupt – See OFED 1.2 release notes
  64. 07/10/12 #6 Tune the Network meets Dracula • Fix default

    ARP settings – Default Linux settings do weird things to incoming packet paths if you use multiple NICs in the same broadcast domain – echo 1 > /proc/sys/net/ipv4/conf/ethN/arp_ignore – echo 2 > /proc/sys/net/ipv4/conf/ethN/arp_announce •
  65. 07/10/12 #6 Tune the Network meets Dracula • Fix default

    ARP settings – Default Linux settings do weird things to incoming packet paths if you use multiple NICs in the same broadcast domain – echo 1 > /proc/sys/net/ipv4/conf/ethN/arp_ignore – echo 2 > /proc/sys/net/ipv4/conf/ethN/arp_announce • Bonding: match the incoming and outgoing transmit hashes – Keep driver cachelines hot in one CPU – Especially important on NUMA platforms – Consult your switch docs...and don't trust them. Experiment!
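    The same ARP fix expressed as sysctls, which is easier to persist in /etc/sysctl.conf (applying it to "all" rather than per interface is an assumption; the per-interface form works too):

        sysctl -w net.ipv4.conf.all.arp_ignore=1
        sysctl -w net.ipv4.conf.all.arp_announce=2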
  66. 07/10/12 #7 Think about async • The async export option

    can put newly written data at risk • But it's faster for some workloads • Sometimes the speed-up is worth the risk – scratch/temporary data – server on UPS
  67. 07/10/12 #8 Use no_subtree_check • The subtree_check export option is

    for “added security” – No beneficial effect if you only export mountpoints – Significant CPU cost on metadata-heavy workloads (specSFS) – Arguably, breaks NFS filehandle semantics • Default has changed, so always explicitly specify no_subtree_check
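    In /etc/exports that just means spelling the option out on every export line, e.g. (the path and client wildcard are placeholders):

        /export/data    *(rw,sync,no_subtree_check)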
  68. 07/10/12 #9 Use More Server Threads • Default in most

    Linux distros is 4 or 8: way too low • Figuring out how many you really need is hard – MIN(disk queue depth/fudge factor, max number of parallel requests from clients in your expected peak workload) – You are not expected to understand this •
  69. 07/10/12 #9 Use More Server Threads • Default in most

    Linux distros is 4 or 8: way too low • Figuring out how many you really need is hard – MIN(disk queue depth/fudge factor, max number of parallel requests from clients in your expected peak workload) – You are not expected to understand this • Since 2.6.19 no CPU performance penalty with lots of nfsds – memory use: ~ 1.1 MiB/nfsd • Some server data structures sized by initial #nfsds •
  70. 07/10/12 #9 Use More Server Threads • Default in most

    Linux distros is 4 or 8: way too low • Figuring out how many you really need is hard – MIN(disk queue depth/fudge factor, max number of parallel requests from clients in your expected peak workload) – You are not expected to understand this • Since 2.6.19 no CPU performance penalty with lots of nfsds – memory use: ~ 1.1 MiB/nfsd • Some server data structures sized by initial #nfsds • So use too many...like 128 • USE_KERNEL_NFSD_NUMBER in /etc/sysconfig/nfs
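    A hedged sketch of raising the count on a running server (USE_KERNEL_NFSD_NUMBER is the SUSE spelling; Red Hat-style distros use RPCNFSDCOUNT in the same file):

        # takes effect immediately, no restart needed
        rpc.nfsd 128
        # confirm the running thread count
        cat /proc/fs/nfsd/threads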
  71. 07/10/12 Contents • Introduction • How NFS Works • Performance

    Limiting Factors • Top 10 Rules for Tuning Clients • Top 10 Rules for Tuning Servers • Measurement Tips & Tricks • Bonus Topics • Questions?
  72. 07/10/12 Measurement Tips • Measure subsystems separately first – Local

    throughput to disk – Network throughput to client – Only then measure NFS performance •
  73. 07/10/12 Measurement Tips • Measure subsystems separately first – Local

    throughput to disk – Network throughput to client – Only then measure NFS performance • Choose a good measurement tool. A good tool... – Does accurate measurements – Is convenient to use – Is efficient – Shows time variation – e.g. Performance Co-Pilot (plug!)
  74. 07/10/12 Measurement Tips Strikes Back • Be aware of your

    benchmark's behaviour – Does it report in MiB/s or MB/s or something weird? – Does it count close or fsync time? – Does it accidentally trigger cache effects? – Is it emitting the right IO sizes? – Is it doing buffered, direct or sync IO? – When writing, is it truncating or over-writing? – Streaming or random IOs? •
  75. 07/10/12 Measurement Tips Strikes Back • Be aware of your

    benchmark's behaviour – Does it report in MiB/s or MB/s or something weird? – Does it count close or fsync time? – Does it accidentally trigger cache effects? – Is it emitting the right IO sizes? – Is it doing buffered, direct or sync IO? – When writing, is it truncating or over-writing? – Streaming or random IOs? • Measure the right thing – Performance depends on workload – So measure something as close as possible to the workload you care about • If possible, measure the workload itself
  76. 07/10/12 Measurement Tips Begins • Be aware: there are two

    page caches: client, server • Either can skew performance results – Server cache: reads at network speed not disk speed – Client cache: reads at local memory copy speed • Need to ensure their states are appropriately empty (or full!) •
  77. 07/10/12 Measurement Tips Begins • Be aware: there are two

    page caches: client, server • Either can skew performance results – Server cache: reads at network speed not disk speed – Client cache: reads at local memory copy speed • Need to ensure their states are appropriately empty (or full!) • Standard trick: use files > either RAM size • Other tricks: flush page cache by – Using bcfree – echo 1 > /proc/sys/vm/drop_caches – umount, mount the filesystem on both client, server – Some benchmarks have a warmup or ageing phase
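    The drop_caches trick in full, run on both client and server between measurements (echo 3 also drops dentries and inodes; whether you want that depends on what you are measuring):

        sync                                 # push dirty data first; drop_caches only frees clean pages
        echo 1 > /proc/sys/vm/drop_caches    # page cache only
        echo 3 > /proc/sys/vm/drop_caches    # page cache plus dentries and inodes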
  78. 07/10/12 Contents • Introduction • How NFS Works • Performance

    Limiting Factors • Top 10 Rules for Tuning Clients • Top 10 Rules for Tuning Servers • Measurement Tips & Tricks • Bonus Topics • Questions?
  79. 07/10/12 Why NFSv2 Sucks • Limited file size (4 GiB)

    • Limited transfer size (8 KiB) • Writes are sync = slow • ls -l requires many more roundtrips – v2: READDIR + N * (LOOKUP + GETATTR) – v3: READDIRPLUS
  80. 07/10/12 Why NFSv4 Sucks • Much more complex than v3

    – Yet doesn't address important design flaws of v3 • Linux server idmapping is single-threaded • Linux server and pseudo-root – Very painful to have more than 1 v4 export point • Great, it's secure! – But have you ever tried setting up Kerberos on Linux? • Linux client & server still can't do extended attributes • What's with all the W*****s crud in the protocol? – Those guys have their own perfectly adequate protocol
  81. 07/10/12 Why NFS on UDP Sucks • Client: poor congestion

    control – Implemented in the sunrpc layer, not the transport protocol – Linux: based on RTT estimators plus a little fuzz – Results in a hair-trigger soft option – No, timeo= will not save you – Hence spurious EIO errors to userspace – Good applications die...bad applications corrupt your data when writing •
  82. 07/10/12 Why NFS on UDP Sucks • Client: poor congestion

    control – Implemented in the sunrpc layer, not the transport protocol – Linux: based on RTT estimators plus a little fuzz – Results in a hair-trigger soft option – No, timeo= will not save you – Hence spurious EIO errors to userspace – Good applications die...bad applications corrupt your data when writing • IPID aliasing – Fundamental design issue in IP reassembly – Causes invisible data corruption at high data rates – IP layer passes corrupt data up to NFS
  83. 07/10/12 Why NFS on UDP Sucks vs Mothra • Routers

    drop UDP packets first when congested – UDP == “expendable” – Causes retries => slowdown – ...or IPID aliasing => corruption • Limited transfer size (Linux: 32 KiB) • Linux server: single socket performance limit
  84. 07/10/12 Testing Your Network • Use ethtool to test NIC

    settings – Speed negotiated with switch properly – Full duplex – Interrupt coalescing enabled – TSO enabled – Hardware checksum enabled – Scatter/gather enabled • Use ping for basic connectivity and stability – Summary tells you dropped packets – Watch netstat -i for errors
  85. 07/10/12 Testing Your Network vs Predator • Use ping -f

    for stress testing connectivity – Also finds some interrupt problems – Dots display tells you dropped packets – Watch netstat -i, netstat -s for errors • Use ttcp or nttcp for testing TCP throughput – Make sure you can fill the pipe – Watch netstat -st for retransmits
  86. 07/10/12 “Sync” vs “Sync” • Client mount option vs server

    export option • They do different things • Do not have to be consistent • Basic rule: sync + sync = very slow • But what do they do?
  87. 07/10/12 Sync: the Client Mount Option • Linux generic mount

    option • NFS client serialises IOs • Each WRITE waits until data is on stable storage • Sloooow, but for NFS no safer • Use async, this is the default
  88. 07/10/12 Sync: the Server Export Option • sync specifies the

    RFC-compliant behaviour • async makes the server lie to the client about data stability • Significantly faster for some workloads •
  89. 07/10/12 Sync: the Server Export Option • sync specifies the

    RFC-compliant behaviour • async makes the server lie to the client about data stability • Significantly faster for some workloads • But a real danger of losing written data – Server says “sure, the data's on disk” – Client forgets the data is dirty...may even delete its copy – If server crashes before the data hits disk, client won't resend => data loss – Use async only if the performance is worth the risk • Historically the default changed – whiny message from exportfs unless you explicitly choose sync or async
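    A hedged /etc/exports illustration of making that choice per export (paths are placeholders): keep sync for anything you cannot afford to lose, and confine async to data you can regenerate.

        # safe: WRITE and COMMIT stability is honoured
        /export/home     *(rw,sync,no_subtree_check)
        # fast, but acknowledged writes can vanish if the server crashes
        /export/scratch  *(rw,async,no_subtree_check)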
  90. 07/10/12 NUMA Effects • NFS thread pool mode – Want

    pernode at 2 CPUs/node – Want percpu at > 2 CPUs/node – echo percpu > /proc/fs/nfsd/pool_mode # while nfsd stopped • NFS doesn't play nice with cpusets – NFS wants to do its own CPU management – Cpuset memory allocation policies can slow NFS on high bandwidth setups – Please don't mix cpusets & NFS
  91. 07/10/12 Using iozone Effectively • Ensure close time is measured:

    -c • Take care with choosing IO block sizes • Default workloads are sensitive to cache effects • Cluster mode -+m rocks
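    A hedged iozone invocation that sidesteps those traps (the file size, record size and path are assumptions; -c folds close time into the result and -e folds in fsync/flush time):

        # 8 GiB file (bigger than RAM on both ends), 1 MiB records, write/rewrite and read/reread
        iozone -c -e -r 1m -s 8g -i 0 -i 1 -f /mnt/nfs/iozone.tmp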
  92. 07/10/12 Contents • Introduction • How NFS Works • Performance

    Limiting Factors • Top 10 Rules for Tuning Clients • Top 10 Rules for Tuning Servers • Measurement Tips & Tricks • Bonus Topics • Questions? •