
NFS Tuning Secrets: or Why Does "Sync" Do Two Different Things

Greg Banks
February 01, 2008


I'm often asked to rescue situations where NFS is working too slowly for people's liking. Sometimes the answer is pointing out that NFS is already running as fast as the hardware will allow it. Sometimes the answer involves undoing some clever "tuning improvements" that have been performed in the name of "increasing performance". Every now and again, there's an actual NFS bug that needs fixing.

This talk will distill that experience, and give guidance on the right way to tune Linux NFS servers and clients. I talked a little about this last year, in the context of tuning performed by SGI's NAS Server product. This year I'll expand on the subject and give direct practical advice.

I'll cover the fundamental hardware and software limits of NFS performance, so you can tell if there's any room for improvement. I'll mention some of the more dangerous or slow "improvements" you could make by tuning unwisely. I'll explain how some of the more obscure tuning options work, so you can see why and how they need to be tuned. I'll even cover some bugs which can cause performance problems.

After hearing this talk, you'll leave the room feeling rightfully confident in your ability to tune NFS in the field.

Plus, you'll have a warm fuzzy feeling about NFS and how simple and obvious it is. Sorry, just kidding about that last bit.


Transcript

  1. NFS Tuning Secrets
    Greg Banks
    Senior Software Engineer
    File Serving Technologies

  2. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  3. About SGI
    • We used to make all sorts of cool stuff
    • Now we make different cool stuff:
    – Altix: honking great NUMA / Itanium computers
    – Altix XE: high-end Xeon rackmount servers
    – ICE: integrated Xeon / Infiniband blade clusters
    – Nexis: mid- to high-end NAS & SAN servers

  4. File Serving Technologies
    • 1 of 2 storage software groups
    • We make the bits that make Nexis run
    • Based right here in Melbourne
    • Teams
    – XFS
    – NFS, Samba, iSCSI
    – Nexis web GUI
    – Performance Co-Pilot

  5. About This Talk
    • Part of my job is answering questions on NFS performance
    • “NFS is too slow! What can I do to speed it up?”
    • This talk distils several years of answers
    • Mostly relevant to Linux NFS clients & servers
    – some more generic advice
    • Mostly considering bulk read/write workloads
    – Metadata workloads have less room to optimise with the NFS protocol

  6. Mythbusters
    • Customer says, “Performance Doesn't Matter”.
    • This is never true
    • Translation: “I have a performance requirement...but I'm
    going to make you guess it”
    • Performance tuning will always matter

  7. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  8. How NFS Works: Your View

  9. How NFS Works: My View

  10. NFS Read Path

  11. NFS Write Path

  12. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  13. Server Workload
    • Mixing NFS serving and other demand on the server will
    slow down both
    • NFS serving isn't “cheap”
    – costs CPU time
    – costs memory
    – costs PCI bandwidth
    – costs network bandwidth
    – costs disk bandwidth
    • The NFS server is an application, not a kernel feature
    – Works best when given all the machine's resources

  14. Network Performance
    • Network hardware
    – NICs
    – switches
    • Network software
    – NIC drivers
    – Bonding
    – TCP/IP stack

  15. Server FS Performance
    • Hardware
    – Disks
    – RAID controllers
    – FC switches
    – HBAs
    • Software
    – Filesystem
    – Volume Manager
    – Snapshots
    – Block layer
    – SCSI layer

  16. NFS vs. Server FS Interactions
    • NFS server does things no sane local workload does
    – Calls f_op->open / f_op->release around each RPC
    – Looks up files by inode number each call
    – Uses disconnected dentries; reconnects disconnected dentries
    – Synchronises data more often
    – Synchronises metadata more often
    – Wants to write into page cache while VM is pushing pages out
    – Performs IO from a kernel thread
    – Calls f_op->sendfile / f_op->splice_read for zero-copy reads

  17. NFS vs. Server FS Interactions
    • NFS server does things no sane local workload does
    – Calls f_op->open / f_op->release around each RPC
    – Looks up files by inode number each call
    – Uses disconnected dentries; reconnects disconnected dentries
    – Synchronises data more often
    – Synchronises metadata more often
    – Wants to write into page cache while VM is pushing pages out
    – Performs IO from a kernel thread
    – Calls f_op->sendfile / f_op->splice_read for zero-copy reads
    • The filesystem may have difficulty with these
    – Because developers often aren't aware of the above
    – Filesystem bugs might appear only when serving NFS

  18. NFS Server Behaviour
    • Other clients may be using the same files
    • Other clients are sharing the server's resources
    • Server may obey client's order to synchronise data...or not
    – even though that breaks the NFS protocol and is unsafe

  19. NFS Server Behaviour
    • Other clients may be using the same files
    • Other clients are sharing the server's resources
    • Server may obey client's order to synchronise data...or not
    – even though that breaks the NFS protocol and is unsafe
    • NFS server application efficiency issues
    – Thread to CPU mapping
    – Data structures
    – Lock contention, cacheline bouncing

  20. NFS Client Behaviour
    • Parallelism on the wire, for performance
    – Can result in IO mis-ordering at server
    – Can defeat filesystem & VFS streaming optimisations
    – e.g. VFS readahead on server

  21. NFS Client Behaviour
    • Parallelism on the wire, for performance
    – Can result in IO mis-ordering at server
    – Can defeat filesystem & VFS streaming optimisations
    – e.g. VFS readahead on server
    • Transfer size is limited by client support
    • Linux clients do not align IOs (except to their own page size)
    – may be slow if server's page size is larger; RAID stripe issues

  22. NFS Client Behaviour
    • Parallelism on the wire, for performance
    – Can result in IO mis-ordering at server
    – Can defeat filesystem & VFS streaming optimisations
    – e.g. VFS readahead on server
    • Transfer size is limited by client support
    • Linux clients do not align IOs (except to their own page size)
    – may be slow if server's page size is larger; RAID stripe issues
    • Reads: client does its own readahead
    – may interact with server & RAID readahead
    • Writes: client decides COMMIT interval
    – without adequate knowledge of what is an efficient IO size on server

  23. Application Behaviour
    • IO sizes
    • Buffered or O_DIRECT or O_SYNC or async IO
    • fsync
    • write/read or pwrite/pread or mmap or sendfile/splice
    • Threads/parallelism
    • Application buffering/caching/readahead
    • Time behaviour: burstiness, correlation

  24. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  25. #1 First tune your network
    • NFS cannot be any faster than the network
    • To test the network, use
    – ping
    – ping -f
    – ttcp or nttcp
    • ethtool is your friend
    • Think about network bottlenecks, other users of network
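    A minimal test sequence along those lines (hostnames and interface names are
    placeholders; ttcp/nttcp flags vary slightly between versions):
      ethtool eth0                  # link speed, duplex, offload settings
      ping -c 100 nfsserver         # basic reachability and RTT
      ping -f -c 10000 nfsserver    # flood ping (as root) to expose packet loss
      ttcp -r -s                    # on the server: raw TCP sink
      ttcp -t -s nfsserver          # on the client: fill the pipe, note the MB/s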

  26. #2 Use NFSv3
    • Not NFSv2 under any circumstances
    • NFSv4...meh
    • On modern Linux clients, this is the default

  27. #3 Use TCP
    • Not UDP under any circumstances
    • On modern Linux clients, this is the default

  28. #4 Use the maximum transfer size
    • Use the largest rsize / wsize options that both the client and
    the server support
    • On modern Linux clients, this is the default

  29. #4 Use the maximum transfer size
    • Use the largest rsize / wsize options that both the client and
    the server support
    • On modern Linux clients, this is the default
    • Larger is always better
    – no more resources are being wasted
    – larger always faster, up to the server's limit
    – modern servers & clients can do 1MiB

  30. #4 Use the maximum transfer size
    • Use the largest rsize / wsize options that both the client and
    the server support
    • On modern Linux clients, this is the default
    • Larger is always better
    – no more resources are being wasted
    – larger always faster, up to the server's limit
    – modern servers & clients can do 1MiB
    • Various sizes below which performance is worse
    – [rw]size < client page size (e.g. 4K) is bad
    – [rw]size < server's page size is bad
    – [rw]size < server's RAID stripe width is bad
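    For example, a mount that pins rules #2–#4 explicitly (server name and sizes
    are placeholders; modern clients negotiate all of this anyway):
      mount -t nfs -o vers=3,proto=tcp,rsize=1048576,wsize=1048576 \
            nfsserver:/export /mnt/export
      grep /mnt/export /proc/mounts   # confirm what was actually negotiated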

  31. #5 Don't use soft
    • Use hard, not soft
    • On modern Linux clients, this is the default
    • Only bend this rule if you know what you're doing
    – If you had to ask, you don't know what you're doing

  32. #6 Use intr
    • Because you want to be able to interrupt your program when
    the server isn't there
    • By the time you discover you wanted this, it's too late
    – time to reboot the client machine
    • This is NOT the default!

  33. #7 Use the maximum MTU
    • Use the largest MTU that all the network machinery
    between client and server properly supports
    • This usually means 9KB
    • Consult your switch documentation; experiment

  34. #7 Use the maximum MTU
    • Use the largest MTU that all the network machinery
    between client and server properly supports
    • This usually means 9KB
    • Consult your switch documentation; experiment
    • Why?
    – Each Ethernet frame received takes CPU work
    – Each TCP segment received takes CPU work
    – Larger MTU => more data per CPU
    – Very visible on fast networks (10gige, IB)
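    A sketch of raising and verifying the MTU (interface and host are placeholders;
    8972 = 9000 bytes minus 20 for the IP header and 8 for the ICMP header):
      ip link set eth0 mtu 9000            # or: ifconfig eth0 mtu 9000
      ping -M do -s 8972 -c 10 nfsserver   # "don't fragment": proves the whole path copes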

  35. #8 No More Mount Options
    • Options you really really don't want:
    – sync
    • Client emits small WRITEs serially (instead of large IOs in parallel)
    • Each WRITE RPC waits until the server says the data is on disk
    • Slooooooooooow
    – noac on Linux
    • Implies sync

  36. #9 Parallelism
    • Client OS will have a tunable to control the parallelism on the
    wire, e.g.
    – Number of slots (Linux)
    – Number of nfs threads
    – Number of biods
    • If you can figure out where this is, ensure it's about 16
    • On modern Linux clients, this is the default
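    On Linux clients of this vintage the slot count is the sunrpc slot table; a
    hedged example of checking and raising it (set it before mounting):
      sysctl sunrpc.tcp_slot_table_entries          # view the current value
      sysctl -w sunrpc.tcp_slot_table_entries=16    # same file lives under /proc/sys/sunrpc/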

  37. #10 Umm...
    • If I could think of ten rules, I'd be writing for Letterman

  38. #10 Client Readahead
    • Seriously...
    • On Linux clients, tune max readahead to be a multiple of 4
    – Ensures READ rpcs will be aligned to rsize thanks to VFS readahead code
    – Server can be sensitive to that alignment
    • Default is 15 (bad); 16 is good; more than #slots (16) is bad
    • SLES
    – echo 16 > /proc/sys/fs/nfs/nfs_max_readahead
    • Others
    – Change NFS_MAX_READAHEAD in fs/nfs/super.c

  39. A Word About Defaults
    • You may have noticed...
    • Modern Linux clients mostly have good defaults
    • Most mount options you specify will make things worse
    • Don't just copy mount options from your old /etc/fstab
    • Start with rw,intr
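    So a sensible /etc/fstab line is simply (paths are placeholders):
      nfsserver:/export  /mnt/export  nfs  rw,intr  0 0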

  40. Mount Options: Trust...But Verify
    • What you specify in /etc/fstab isn't necessarily what applies
    • Especially: [rw]size is negotiated with server
    • If unspecified, vers & proto may be negotiated with server
    • Look in /proc/mounts after mounting
    – Modern nfs-utils have nfsstat -m

  41. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  42. #1 Tune the Storage Hardware
    • Choose hardware RAID stripe unit
    • Choose number of disks in hw RAID set
    – You want to encourage NFS to do Full Stripe Writes
    • Avoiding a Read-Modify-Write cycle in the RAID controller
    – So choose a stripe width == MIN(max_sectors_kb,NFS max transfer size)/N
    – Some RAID hw prefers 2N+1 RAID sets e.g. 4+1, 8+1

  43. #1 Tune the Storage Hardware
    • Choose hardware RAID stripe unit
    • Choose number of disks in hw RAID set
    – You want to encourage NFS to do Full Stripe Writes
    • Avoiding a Read-Modify-Write cycle in the RAID controller
    – So choose a stripe width == MIN(max_sectors_kb,NFS max transfer size)/N
    – Some RAID hw prefers 2N+1 RAID sets e.g. 4+1, 8+1
    • Choose RAID caching mode
    – Nobrainers: want read caching, write caching
    – Write cache mirroring = slow but safe...choose carefully
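    A worked example of the arithmetic above, with hypothetical numbers:
      # max_sectors_kb = NFS max transfer = 512 KiB, 4+1 RAID5 set (N = 4 data disks)
      # MIN(512 KiB, 512 KiB) / 4  =  128 KiB chunk per disk
      # one full-sized 512 KiB WRITE then covers 4 x 128 KiB = a full stripe,
      # so the controller never has to read-modify-write parity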

  44. #2 Tune the Block Layer
    • Choose the right IO scheduler for your workload
    – CFQ (Complete Fair Queuing) seems to work OK
    – Even though it's dumb about iocontexts & NFS
    – echo cfq > /sys/block/$sdx/queue/scheduler
    – But...YMMV. Experiment!

  45. #2 Tune the Block Layer
    • Choose the right IO scheduler for your workload
    – CFQ (Complete Fair Queuing) seems to work OK
    – Even though it's dumb about iocontexts & NFS
    – echo cfq > /sys/block/$sdx/queue/scheduler
    – But...YMMV. Experiment!
    • Increase CTQ (Command Tag Queue) depth
    – Improve SCSI parallelism => better disk performance
    – Sometimes unobvious upper limits; per-HBA, per-RAID controller
    – Default might be 1 (worst case), try increasing it
    – echo 4 > /sys/block/$sdx/device/queue_depth

  46. #2 Son of Tune the Block Layer
    • Bump up max_sectors_kb to get large IOs
    – You want the largest IOs possible going to disk
    • That are multiples of RAID stripe width
    – Linux limit varies with server page size
    • Altix: 16 KiB pages => 2 MiB max_sectors_kb
    • x86_64: 4 KiB pages => 512 KiB max_sectors_kb
    – echo 512 > /sys/block/$sdx/queue/max_sectors_kb

  47. #2 Son of Tune the Block Layer
    • Bump up max_sectors_kb to get large IOs
    – You want the largest IOs possible going to disk
    • That are multiples of RAID stripe width
    – Linux limit varies with server page size
    • Altix: 16 KiB pages => 2 MiB max_sectors_kb
    • x86_64: 4 KiB pages => 512 KiB max_sectors_kb
    – echo 512 > /sys/block/$sdx/queue/max_sectors_kb
    • Check your block device max readahead is adequate
    – cat /sys/block/$sdx/queue/read_ahead_kb
    • Don't forget to make your changes persistent
    – sgisetqdepth on SGI ProPack
    – Add an /etc/init.d script or a udev rule (sketch below)
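    One way to do that is a udev rule; a sketch only, since rule syntax and the
    attributes you may set vary with your udev version:
      # /etc/udev/rules.d/60-block-tune.rules  (one rule per line)
      ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd*", ATTR{queue/scheduler}="cfq", ATTR{queue/max_sectors_kb}="512", ATTR{queue/read_ahead_kb}="1024"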

  48. #3 Tune the Filesystem
    • Know your filesystem – RTFM, experiment!
    • Tune for NFS workload
    – Not local workloads
    – Not your application run locally
    • Some tunings must be done at mkfs time
    – So don't wait until your data is already on the fs

  49. #3 Tune the Filesystem
    • Know your filesystem – RTFM, experiment!
    • Tune for NFS workload
    – Not local workloads
    – Not your application run locally
    • Some tunings must be done at mkfs time
    – So don't wait until your data is already on the fs
    • Choose partitioning+Volume Manager arrangement to align
    filesystem structures to underlying RAID stripes
    • Use noatime or relatime (modern kernels)
    – If you can; noatime confuses some apps like mail readers

  50. #3 Revenge of Tune the Filesystem
    • XFS needs care & attention to get the best performance
    – Default options historically compatible not performance optimal
    – Be careful with units in mkfs.xfs, xfs_info

  51. #3 Revenge of Tune the Filesystem
    • XFS needs care & attention to get the best performance
    – Default options historically compatible not performance optimal
    – Be careful with units in mkfs.xfs, xfs_info
    • Optimise log IO
    – Maximise log size (128 MiB historically), but aligned to RAID stripe width
    – mkfs -l size=... -l sunit=... -l version=2
    – Log buffer size needs to be >= log stripe unit, multiple of RAID stripe width
    – mount -o logbufs=8,logbsize=...

  52. #3 Revenge of Tune the Filesystem
    • XFS needs care & attention to get the best performance
    – Default options historically compatible not performance optimal
    – Be careful with units in mkfs.xfs, xfs_info
    • Optimise log IO
    – Maximise log size (128 MiB historically), but aligned to RAID stripe width
    – mkfs -l size=... -l sunit=... -l version=2
    – Log buffer size needs to be >= log stripe unit, multiple of RAID stripe width
    – mount -o logbufs=8,logbsize=...
    • Consider a separate log device...NVRAM/SSD if you can
    • Lazy superblock counters to reduce SB IO
    – mkfs -l lazy-count=1

  53. #3 Bride of Tune the Filesystem
    • Make XFS align disk structures to underlying RAID stripes
    – mkfs -d sunit=... -d swidth=...
    • Increase directory block size
    – Larger block size improves workloads which modify large directories
    – mkfs -n size=16384

  54. #3 Bride of Tune the Filesystem
    • Make XFS align disk structures to underlying RAID stripes
    – mkfs -d sunit=... -d swidth=...
    • Increase directory block size
    – Larger block size improves workloads which modify large directories
    – mkfs -n size=16384
    • Choose number of Allocation Groups
    – Default is...arbitrary
    – Fewer AGs can improve some workloads
    – mkfs -d agcount=... OR mkfs -d agsize=...
    • Increase inode hash size on inode-heavy workloads
    – mount -o ihashsize=... # Not needed on latest XFS
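    Pulling the XFS options together, a hedged example for an array with a 256 KiB
    stripe unit and 4 data disks (every number here needs adjusting to your RAID):
      mkfs.xfs -d su=256k,sw=4,agcount=16 \
               -l version=2,size=128m,su=256k,lazy-count=1 \
               -n size=16384  /dev/sdb1
      mount -o logbufs=8,logbsize=256k,noatime /dev/sdb1 /export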

  55. #4 Tune the VM
    • Make the VM push unstable pages to disk faster
    – Hopefully, some before the client does a COMMIT
    – Especially useful if you enable the async export option
    – sysctl vm.dirty_writeback_centisecs=50
    – sysctl vm.dirty_background_ratio=10
    • But...YMMV
    • Do not reduce vm.dirty_ratio
    – Can cause nfsds to block unnecessarily
    – You really don't want that
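    The usual way to make those persistent is /etc/sysctl.conf (same values as above):
      vm.dirty_writeback_centisecs = 50
      vm.dirty_background_ratio = 10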

  56. #5 Tune PCI Cards
    • Some PCI performance effects...
    – Cards sharing a bus share bus bandwidth
    – Two cards in a bus can slow down the bus rate
    – Putting a card in some slots can slow down slots on other buses
    – Throughput limits in PCI bridges
    – DMA resource limits in PCI bridges
    – Network cards use lots of small DMAs, very unfriendly to PCI
    – Some PCI devices don't play well with others, need to be on their own bus
    – Some cards require MaxReadSize tuned upwards – BIOS or driver setting
    – Some cards benefit from Write Combining on PCI mappings – driver setting
    – Some cards benefit from using MSI – driver, BIOS settings

  57. #5 Ghost of Tune PCI Cards
    • NUMA effects
    – Minimise NUMA hops: ensure interrupts are bound to nearby CPUs
    – Fast networks, filesystems can saturate NUMA interconnects
    – So try to arrange for page cache to be near the NIC & HBA that DMA to it
    – Turn off cpuset page spreading, it pessimises DMA patterns

  58. #5 Ghost of Tune PCI Cards
    • NUMA effects
    – Minimise NUMA hops: ensure interrupts are bound to nearby CPUs
    – Fast networks, filesystems can saturate NUMA interconnects
    – So try to arrange for page cache to be near the NIC & HBA that DMA to it
    – Turn off cpuset page spreading, it pessimises DMA patterns
    • FC multipathing: choose default paths carefully
    – Want to balance IOs down multiple paths to the storage
    • Know your network, SCSI, and platform hardware!
    – RTFM. Experiment

  59. #6 Tune the Network
    • Check basic parameters
    – Speed, duplex, errors
    – ethtool

  60. #6 Tune the Network
    • Check basic parameters
    – Speed, duplex, errors
    – ethtool
    • Tune interrupt coalescing for bulk traffic
    – Reduce the NIC interrupt rate for a given workload
    • Adds latency, but it increases NFS server throughput
    • If you have latency-sensitive traffic as well, consider a separate NIC for that
    – ethtool -C ethN rx-usecs 80 rx-frames 20 rx-usecs-irq 80 rx-frames-irq 20
    – Some cards only support a subset of those parameters
    – Current generation 1gige, 10gige, next gen IB cards

  61. #6 The Evil of Tune the Network
    • Bind NIC interrupts to CPUs
    – Keep device cachelines hot in one CPU
    – ifconfig tells you the IRQ
    – printf %08x $[1<<$cpu] > /proc/irq/$irq/smp_affinity
    – Careful manual binding may be better than irqbalanced

  62. #6 The Evil of Tune the Network
    • Bind NIC interrupts to CPUs
    – Keep device cachelines hot in one CPU
    – ifconfig tells you the IRQ
    – printf %08x $[1<<$cpu] > /proc/irq/$irq/smp_affinity
    – Careful manual binding may be better than irqbalanced
    • Increase socket buffer sizes
    – Need to allow effective TCP buffering, window scaling on faster networks
    – Modern kernels do this automatically; older ones have bad defaults
    – sysctl net.ipv4.tcp_{,r,w}mem='8192 262144 524288'
    – More for 10gige
    – Actual sysctls needed vary with kernel version; consult kernel docs
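    A worked example of the IRQ binding above, assuming eth0 turns out to be on
    IRQ 58 (hypothetical) and we want it serviced by CPU 2:
      grep eth0 /proc/interrupts                 # find the IRQ; say it is 58
      printf %08x $[1<<2]                        # CPU 2 -> mask 00000004
      echo 00000004 > /proc/irq/58/smp_affinity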

  63. #6 The Curse of Tune The Network
    • Enable TSO (TCP Segmentation Offload)
    – Moves some of the simpler TCP grunt work from software to hardware
    – Larger DMAs to card, less CPU work per byte sent
    – May be off by default
    – ethtool -K ethN tso on
    – Current generation 1gige, 10gige, IB cards

  64. #6 The Curse of Tune The Network
    • Enable TSO (TCP Segmentation Offload)
    – Moves some of the simpler TCP grunt work from software to hardware
    – Larger DMAs to card, less CPU work per byte sent
    – May be off by default
    – ethtool -K ethN tso on
    – Current generation 1gige, 10gige, IB cards
    • Possibly enable LRO (Large Receive Offload)
    – Receive-side equivalent of TSO
    – May help for streaming workloads. Experiment!
    – Current generation 10gige, next generation IB

  65. #6 Flesh for Tune the Network
    • Enable RSS (Receive Side Scaling)
    – Splits interrupt load into multiple MSI-X -> spread across multiple CPUs
    – Fiddly to tune, consult driver docs
    – Current generation 10gige, next generation IB cards

  66. #6 Flesh for Tune the Network
    • Enable RSS (Receive Side Scaling)
    – Splits interrupt load into multiple MSI-X -> spread across multiple CPUs
    – Fiddly to tune, consult driver docs
    – Current generation 10gige, next generation IB cards
    • Check hardware checksum offload is enabled
    – Some drivers let you turn it off with ethtool (why?)
    – Current generation 1gige, 10gige, in next generation IB cards

  67. #6 Flesh for Tune the Network
    • Enable RSS (Receive Side Scaling)
    – Splits interrupt load into multiple MSI-X -> spread across multiple CPUs
    – Fiddly to tune, consult driver docs
    – Current generation 10gige, next generation IB cards
    • Check hardware checksum offload is enabled
    – Some drivers let you turn it off with ethtool (why?)
    – Current generation 1gige, 10gige, in next generation IB cards
    • Enable IPoIB Connected Mode
    – Allows larger MTU (~64KiB) => less CPU spent in interrupt
    – See OFED 1.2 release notes

  68. #6 Tune the Network meets Dracula
    • Fix default ARP settings
    – Default Linux settings do weird things to incoming packet paths if you use
    multiple NICs in the same broadcast domain
    – echo 1 > /proc/sys/net/ipv4/conf/ethN/arp_ignore
    – echo 2 > /proc/sys/net/ipv4/conf/ethN/arp_announce

  69. #6 Tune the Network meets Dracula
    • Fix default ARP settings
    – Default Linux settings do weird things to incoming packet paths if you use
    multiple NICs in the same broadcast domain
    – echo 1 > /proc/sys/net/ipv4/conf/ethN/arp_ignore
    – echo 2 > /proc/sys/net/ipv4/conf/ethN/arp_announce
    • Bonding: match the incoming and outgoing transmit hashes
    – Keep driver cachelines hot in one CPU
    – Especially important on NUMA platforms
    – Consult your switch docs...and don't trust them. Experiment!
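    On the Linux side the outgoing hash is the bonding driver's xmit_hash_policy; a
    sketch for an 802.3ad bond (the matching incoming hash lives in the switch config):
      # modprobe.conf / modprobe.d entry -- example values only
      options bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4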

  70. #7 Think about async
    • The async export option can put newly written data at risk
    • But it's faster for some workloads
    • Sometimes the speed-up is worth the risk
    – scratch/temporary data
    – server on UPS

  71. #8 Use no_subtree_check
    • The subtree_check export option is for “added security”
    – No beneficial effect if you only export mountpoints
    – Significant CPU cost on metadata-heavy workloads (specSFS)
    – Arguably, breaks NFS filehandle semantics
    • Default has changed, so always explicitly specify
    no_subtree_check
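    A hedged /etc/exports sketch combining rules #7 and #8 (paths and the client
    network are placeholders):
      /export   192.168.1.0/24(rw,sync,no_subtree_check)
      /scratch  192.168.1.0/24(rw,async,no_subtree_check)   # expendable data only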

  72. #9 Use More Server Threads
    • Default in most Linux distros is 4 or 8: way too low
    • Figuring out how many you really need is hard
    – MIN(disk queue depth/fudge factor, max number of parallel requests from
    clients in your expected peak workload)
    – You are not expected to understand this

  73. #9 Use More Server Threads
    • Default in most Linux distros is 4 or 8: way too low
    • Figuring out how many you really need is hard
    – MIN(disk queue depth/fudge factor, max number of parallel requests from
    clients in your expected peak workload)
    – You are not expected to understand this
    • Since 2.6.19 no CPU performance penalty with lots of nfsds
    – memory use: ~ 1.1 MiB/nfsd
    • Some server data structures sized by initial #nfsds

  74. #9 Use More Server Threads
    • Default in most Linux distros is 4 or 8: way too low
    • Figuring out how many you really need is hard
    – MIN(disk queue depth/fudge factor, max number of parallel requests from
    clients in your expected peak workload)
    – You are not expected to understand this
    • Since 2.6.19 no CPU performance penalty with lots of nfsds
    – memory use: ~ 1.1 MiB/nfsd
    • Some server data structures sized by initial #nfsds
    • So use too many....like 128
    • USE_KERNEL_NFSD_NUMBER in /etc/sysconfig/nfs
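    For example (SUSE-style sysconfig variable from the bullet above; rpc.nfsd can
    also change the count on a running server):
      # /etc/sysconfig/nfs
      USE_KERNEL_NFSD_NUMBER="128"
      # or at runtime:
      rpc.nfsd 128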

  75. #10 Whoops, no #10
    • Letterman still won't return my calls

  76. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  77. Measurement Tips
    • Measure subsystems separately first
    – Local throughput to disk
    – Network throughput to client
    – Only then measure NFS performance

  78. Measurement Tips
    • Measure subsystems separately first
    – Local throughput to disk
    – Network throughput to client
    – Only then measure NFS performance
    • Choose a good measurement tool. A good tool...
    – Does accurate measurements
    – Is convenient to use
    – Is efficient
    – Shows time variation
    – e.g. Performance Co-Pilot (plug!)

  79. Measurement Tips Strikes Back
    • Be aware of your benchmark's behaviour
    – Does it report in MiB/s or MB/s or something weird?
    – Does it count close or fsync time?
    – Does it accidentally trigger cache effects?
    – Is it emitting the right IO sizes?
    – Is it doing buffered, direct or sync IO?
    – When writing, is it truncating or over-writing?
    – Streaming or random IOs?

  80. Measurement Tips Strikes Back
    • Be aware of your benchmark's behaviour
    – Does it report in MiB/s or MB/s or something weird?
    – Does it count close or fsync time?
    – Does it accidentally trigger cache effects?
    – Is it emitting the right IO sizes?
    – Is it doing buffered, direct or sync IO?
    – When writing, is it truncating or over-writing?
    – Streaming or random IOs?
    • Measure the right thing
    – Performance depends on workload
    – So measure something as close as possible to the workload you care about
    • If possible, measure the workload itself

  81. Measurement Tips Begins
    • Be aware: there are two page caches: client, server
    • Either can skew performance results
    – Server cache: reads at network speed not disk speed
    – Client cache: reads at local memory copy speed
    • Need to ensure their states are appropriately empty (or full!)

  82. Measurement Tips Begins
    • Be aware: there are two page caches: client, server
    • Either can skew performance results
    – Server cache: reads at network speed not disk speed
    – Client cache: reads at local memory copy speed
    • Need to ensure their states are appropriately empty (or full!)
    • Standard trick: use files > either RAM size
    • Other tricks: flush page cache by
    – Using bcfree
    – echo 1 > /proc/sys/vm/drop_caches
    – umount, mount the filesystem on both client, server
    – Some benchmarks have a warmup or ageing phase

  83. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  84. Why NFSv2 Sucks
    • Limited file size (4 GiB)
    • Limited transfer size (8 KiB)
    • Writes are sync = slow
    • ls -l requires many more roundtrips
    – v2: READDIR + N * (LOOKUP + GETATTR)
    – v3: READDIRPLUS

  85. Why NFSv4 Sucks
    • Much more complex than v3
    – Yet doesn't address important design flaws of v3
    • Linux server idmapping is single-threaded
    • Linux server and pseudo-root
    – Very painful to have more than 1 v4 export point
    • Great, it's secure!
    – But have you ever tried setting up Kerberos on Linux?
    • Linux client & server still can't do extended attributes
    • What's with all the W*****s crud in the protocol?
    – Those guys have their own perfectly adequate protocol

  86. Why NFS on UDP Sucks
    • Client: poor congestion control
    – Implemented in the sunrpc layer, not the transport protocol
    – Linux: based on RTT estimators plus a little fuzz
    – Results in hair trigger soft option
    – No, timeo= will not save you
    – Hence spurious EIO errors to userspace
    – Good applications die...bad applications corrupt your data when writing

  87. Why NFS on UDP Sucks
    • Client: poor congestion control
    – Implemented in the sunrpc layer, not the transport protocol
    – Linux: based on RTT estimators plus a little fuzz
    – Results in hair trigger soft option
    – No, timeo= will not save you
    – Hence spurious EIO errors to userspace
    – Good applications die...bad applications corrupt your data when writing
    • IPID aliasing
    – Fundamental design issue in IP reassembly
    – Causes invisible data corruption at high data rates
    – IP layer passes corrupt data up to NFS

  88. Why NFS on UDP Sucks vs Mothra
    • Routers drop UDP packets first when congested
    – UDP == “expendable”
    – Causes retries => slowdown
    – ...or IPID aliasing => corruption
    • Limited transfer size (Linux: 32 KiB)
    • Linux server: single socket performance limit

  89. Testing Your Network
    • Use ethtool to test NIC settings
    – Speed negotiated with switch properly
    – Full duplex
    – Interrupt coalescing enabled
    – TSO enabled
    – Hardware checksum enabled
    – Scatter/gather enabled
    • Use ping for basic connectivity and stability
    – Summary tells you dropped packets
    – Watch netstat -i for errors

  90. Testing Your Network vs Predator
    • Use ping -f for stress testing connectivity
    – Also finds some interrupt problems
    – Dots display tells you dropped packets
    – Watch netstat -i, netstat -s for errors
    • Use ttcp or nttcp for testing TCP throughput
    – Make sure you can fill the pipe
    – Watch netstat -st for retransmits
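    The checks from this slide and the previous one, as concrete commands (eth0 is
    a placeholder; which offloads appear depends on the driver):
      ethtool eth0       # speed, duplex, link detected
      ethtool -c eth0    # interrupt coalescing settings
      ethtool -k eth0    # offloads: tso, rx/tx checksumming, scatter-gather
      netstat -i         # per-interface error and drop counters
      netstat -s         # protocol counters; watch TCP retransmits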

  91. “Sync” vs “Sync”
    • Client mount option vs server export option
    • They do different things
    • Do not have to be consistent
    • Basic rule: sync + sync = very slow
    • But what do they do?

  92. Sync: the Client Mount Option
    • Linux generic mount option
    • NFS client serialises IOs
    • Each WRITE waits until data is on stable storage
    • Sloooow, but for NFS no safer
    • Use async, this is the default

  93. Sync: the Server Export Option
    • sync specifies the RFC-compliant behaviour
    • async makes the server lie to the client about data stability
    • Significantly faster for some workloads

  94. Sync: the Server Export Option
    • sync specifies the RFC-compliant behaviour
    • async makes the server lie to the client about data stability
    • Significantly faster for some workloads
    • But a real danger of losing written data
    – Server says “sure, the data's on disk”
    – Client forgets the data is dirty...may even delete its copy
    – If server crashes before the data hits disk, client won't resend => data loss
    – Use async only if the performance is worth the risk
    • Historically the default changed
    – whiny message from exportfs unless you explicitly choose sync or async
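    Side by side, the two unrelated knobs (paths are placeholders):
      mount -o sync nfsserver:/export /mnt     # client: serialise WRITEs, wait for stable storage
      /export  *(rw,sync,no_subtree_check)     # server /etc/exports: honest, RFC-compliant replies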

  95. NUMA Effects
    • NFS thread pool mode
    – Want pernode at 2 CPUs/node
    – Want percpu at > 2 CPUs/node
    – echo percpu > /proc/fs/nfsd/pool_mode # while nfsd stopped
    • NFS doesn't play nice with cpusets
    – NFS wants to do its own CPU management
    – Cpuset memory allocation policies can slow NFS on high bandwidth setups
    – Please don't mix cpusets & NFS

  96. Using iozone Effectively
    • Ensure close time is measured: -c
    • Take care with choosing IO block sizes
    • Default workloads are sensitive to cache effects
    • Cluster mode -+m rocks
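    A hedged iozone invocation along those lines (sizes, thread count and the
    client list file are examples only):
      iozone -c -e -r 1m -s 8g -i 0 -i 1 -t 4 -+m clients.cfg
      #  -c count close(), -e count flush/fsync, -r record size, -s file size (> RAM),
      #  -i 0 -i 1 write & read tests, -t threads, -+m cluster-mode client file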

  97. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?
