
NFS Tuning Secrets: or Why Does "Sync" Do Two Different Things

Greg Banks
February 01, 2008


I'm often asked to rescue situations where NFS is working too slowly for people's liking. Sometimes the answer is pointing out that NFS is already running as fast as the hardware will allow it. Sometimes the answer involves undoing some clever "tuning improvements" that have been performed in the name of "increasing performance". Every now and again, there's an actual NFS bug that needs fixing.

This talk will distill that experience, and give guidance on the right way to tune Linux NFS servers and clients. I talked a little about this last year, in the context of tuning performed by SGI's NAS Server product. This year I'll expand on the subject and give direct practical advice.

I'll cover the fundamental hardware and software limits of NFS performance, so you can tell if there's any room for improvement. I'll mention some of the more dangerous or slow "improvements" you could make by tuning unwisely. I'll explain how some of the more obscure tuning options work, so you can see why and how they need to be tuned. I'll even cover some bugs which can cause performance problems.

After hearing this talk, you'll leave the room feeling rightfully confident in your ability to tune NFS in the field.

Plus, you'll have a warm fuzzy feeling about NFS and how simple and obvious it is. Sorry, just kidding about that last bit.


Transcript

  1. NFS Tuning Secrets
    Greg Banks
    Senior Software Engineer
    File Serving Technologies

  2. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  3. About SGI
    • We used to make all sorts of cool stuff
    • Now we make different cool stuff:
    – Altix: honking great NUMA / Itanium computers
    – Altix XE: high-end Xeon rackmount servers
    – ICE: integrated Xeon / Infiniband blade clusters
    – Nexis: mid- to high-end NAS & SAN servers

  4. File Serving Technologies
    • 1 of 2 storage software groups
    • We make the bits that make Nexis run
    • Based right here in Melbourne
    • Teams
    – XFS
    – NFS, Samba, iSCSI
    – Nexis web GUI
    – Performance Co-Pilot

  5. About This Talk
    • Part of my job is answering questions on NFS performance
    • “NFS is too slow! What can I do to speed it up?”
    • This talk distils several years of answers
    • Mostly relevant to Linux NFS clients & servers
    – some more generic advice
    • Mostly considering bulk read/write workloads
    – Metadata workloads have less room to optimise with the NFS protocol

  6. Mythbusters
    • Customer says, “Performance Doesn't Matter”.
    • This is never true
    • Translation: “I have a performance requirement...but I'm
    going to make you guess it”
    • Performance tuning will always matter

  7. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  8. How NFS Works: Your View

  9. How NFS Works: My View

  10. NFS Read Path

  11. NFS Write Path

  12. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  13. Server Workload
    • Mixing NFS serving and other demand on the server will
    slow down both
    • NFS serving isn't “cheap”
    – costs CPU time
    – costs memory
    – costs PCI bandwidth
    – costs network bandwidth
    – costs disk bandwidth
    • The NFS server is an application, not a kernel feature
    – Works best when given all the machine's resources

  14. Network Performance
    • Network hardware
    – NICs
    – switches
    • Network software
    – NIC drivers
    – Bonding
    – TCP/IP stack

  15. Server FS Performance
    • Hardware
    – Disks
    – RAID controllers
    – FC switches
    – HBAs
    • Software
    – Filesystem
    – Volume Manager
    – Snapshots
    – Block layer
    – SCSI layer

  16. NFS vs. Server FS Interactions
    • NFS server does things no sane local workload does
    – Calls f_op->open / f_op->release around each RPC
    – Looks up files by inode number each call
    – Uses disconnected dentries; reconnects disconnected dentries
    – Synchronises data more often
    – Synchronises metadata more often
    – Wants to write into page cache while VM is pushing pages out
    – Performs IO from a kernel thread
    – Calls f_op->sendfile / f_op->splice_read for zero-copy reads

  17. NFS vs. Server FS Interactions
    • NFS server does things no sane local workload does
    – Calls f_op->open / f_op->release around each RPC
    – Looks up files by inode number each call
    – Uses disconnected dentries; reconnects disconnected dentries
    – Synchronises data more often
    – Synchronises metadata more often
    – Wants to write into page cache while VM is pushing pages out
    – Performs IO from a kernel thread
    – Calls f_op->sendfile / f_op->splice_read for zero-copy reads
    • The filesystem may have difficulty with these
    – Because developers often aren't aware of the above
    – Filesystem bugs might appear only when serving NFS

  18. NFS Server Behaviour
    • Other clients may be using the same files
    • Other clients are sharing the server's resources
    • Server may obey client's order to synchronise data...or not
    – even though that breaks the NFS protocol and is unsafe

  19. NFS Server Behaviour
    • Other clients may be using the same files
    • Other clients are sharing the server's resources
    • Server may obey client's order to synchronise data...or not
    – even though that breaks the NFS protocol and is unsafe
    • NFS server application efficiency issues
    – Thread to CPU mapping
    – Data structures
    – Lock contention, cacheline bouncing

  20. NFS Client Behaviour
    • Parallelism on the wire, for performance
    – Can result in IO mis-ordering at server
    – Can defeat filesystem & VFS streaming optimisations
    – e.g. VFS readahead on server

  21. NFS Client Behaviour
    • Parallelism on the wire, for performance
    – Can result in IO mis-ordering at server
    – Can defeat filesystem & VFS streaming optimisations
    – e.g. VFS readahead on server
    • Transfer size is limited by client support
    • Linux clients do not align IOs (except to their own page size)
    – may be slow if server's page size is larger; RAID stripe issues

  22. NFS Client Behaviour
    • Parallelism on the wire, for performance
    – Can result in IO mis-ordering at server
    – Can defeat filesystem & VFS streaming optimisations
    – e.g. VFS readahead on server
    • Transfer size is limited by client support
    • Linux clients do not align IOs (except to their own page size)
    – may be slow if server's page size is larger; RAID stripe issues
    • Reads: client does its own readahead
    – may interact with server & RAID readahead
    • Writes: client decides COMMIT interval
    – without adequate knowledge of what is an efficient IO size on server

  23. Application Behaviour
    • IO sizes
    • Buffered or O_DIRECT or O_SYNC or async IO
    • fsync
    • write/read or pwrite/pread or mmap or sendfile/splice
    • Threads/parallelism
    • Application buffering/caching/readahead
    • Time behaviour: burstiness, correlation

  24. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  25. #1 First tune your network
    • NFS cannot be any faster than the network
    • To test the network, use
    – ping
    – ping -f
    – ttcp or nttcp
    • ethtool is your friend
    • Think about network bottlenecks, other users of network
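    A minimal test sequence along those lines (hostnames and interface names are
    placeholders; ttcp/nttcp flags vary slightly between versions):
      ethtool eth0                  # link speed, duplex, offload settings
      ping -c 100 nfsserver         # basic reachability and RTT
      ping -f -c 10000 nfsserver    # flood ping (as root) to expose packet loss
      ttcp -r -s                    # on the server: raw TCP sink
      ttcp -t -s nfsserver          # on the client: fill the pipe, note the MB/s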

  26. #2 Use NFSv3
    • Not NFSv2 under any circumstances
    • NFSv4...meh
    • On modern Linux clients, this is the default

  27. #3 Use TCP
    • Not UDP under any circumstances
    • On modern Linux clients, this is the default

  28. #4 Use the maximum transfer size
    • Use the largest rsize / wsize options that both the client and
    the server support
    • On modern Linux clients, this is the default

  29. #4 Use the maximum transfer size
    • Use the largest rsize / wsize options that both the client and
    the server support
    • On modern Linux clients, this is the default
    • Larger is always better
    – no more resources are being wasted
    – larger always faster, up to the server's limit
    – modern servers & clients can do 1MiB

  30. #4 Use the maximum transfer size
    • Use the largest rsize / wsize options that both the client and
    the server support
    • On modern Linux clients, this is the default
    • Larger is always better
    – no more resources are being wasted
    – larger always faster, up to the server's limit
    – modern servers & clients can do 1MiB
    • Various sizes below which performance is worse
    – [rw]size < client page size (e.g. 4K) is bad
    – [rw]size < server's page size is bad
    – [rw]size < server's RAID stripe width is bad
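    For example, a mount that pins rules #2–#4 explicitly (server name and sizes
    are placeholders; modern clients negotiate all of this anyway):
      mount -t nfs -o vers=3,proto=tcp,rsize=1048576,wsize=1048576 \
            nfsserver:/export /mnt/export
      grep /mnt/export /proc/mounts   # confirm what was actually negotiated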

  31. #5 Don't use soft
    • Use hard, not soft
    • On modern Linux clients, this is the default
    • Only bend this rule if you know what you're doing
    – If you had to ask, you don't know what you're doing

  32. #6 Use intr
    • Because you want to be able to interrupt your program when
    the server isn't there
    • By the time you discover you wanted this, it's too late
    – time to reboot the client machine
    • This is NOT the default!

  33. #7 Use the maximum MTU
    • Use the largest MTU that all the network machinery
    between client and server properly supports
    • This usually means 9KB
    • Consult your switch documentation; experiment

  34. #7 Use the maximum MTU
    • Use the largest MTU that all the network machinery
    between client and server properly supports
    • This usually means 9KB
    • Consult your switch documentation; experiment
    • Why?
    – Each Ethernet frame received takes CPU work
    – Each TCP segment received takes CPU work
    – Larger MTU => more data per CPU
    – Very visible on fast networks (10gige, IB)
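    A sketch of raising and verifying the MTU (interface and host are placeholders;
    8972 = 9000 bytes minus 20 for the IP header and 8 for the ICMP header):
      ip link set eth0 mtu 9000            # or: ifconfig eth0 mtu 9000
      ping -M do -s 8972 -c 10 nfsserver   # "don't fragment": proves the whole path copes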

  35. #8 No More Mount Options
    • Options you really really don't want:
    – sync
    • Client emits small WRITEs serially (instead of large IOs in parallel)
    • Each WRITE RPC waits until the server says the data is on disk
    • Slooooooooooow
    – noac on Linux
    • Implies sync

  36. #9 Parallelism
    • Client OS will have a tunable to control the parallelism on the
    wire, e.g.
    – Number of slots (Linux)
    – Number of nfs threads
    – Number of biods
    • If you can figure out where this is, ensure it's about 16
    • On modern Linux clients, this is the default
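    On Linux clients of this vintage the slot count is the sunrpc slot table; a
    hedged example of checking and raising it (set it before mounting):
      sysctl sunrpc.tcp_slot_table_entries          # view the current value
      sysctl -w sunrpc.tcp_slot_table_entries=16    # same file lives under /proc/sys/sunrpc/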

  37. #10 Umm...
    • If I could think of ten rules, I'd be writing for Letterman

  38. #10 Client Readahead
    • Seriously...
    • On Linux clients, tune max readahead to be a multiple of 4
    – Ensures READ rpcs will be aligned to rsize thanks to VFS readahead code
    – Server can be sensitive to that alignment
    • Default is 15 (bad); 16 is good; more than #slots (16) is bad
    • SLES
    – echo 16 > /proc/sys/fs/nfs/nfs_max_readahead
    • Others
    – Change NFS_MAX_READAHEAD in fs/nfs/super.c

  39. A Word About Defaults
    • You may have noticed...
    • Modern Linux clients mostly have good defaults
    • Most mount options you specify will make things worse
    • Don't just copy mount options from your old /etc/fstab
    • Start with rw,intr
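    So a sensible /etc/fstab line is simply (paths are placeholders):
      nfsserver:/export  /mnt/export  nfs  rw,intr  0 0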

  40. Mount Options: Trust...But Verify
    • What you specify in /etc/fstab isn't necessarily what applies
    • Especially: [rw]size is negotiated with server
    • If unspecified, vers & proto may be negotiated with server
    • Look in /proc/mounts after mounting
    – Modern nfs-utils have nfsstat -m

  41. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  42. #1 Tune the Storage Hardware
    • Choose hardware RAID stripe unit
    • Choose number of disks in hw RAID set
    – You want to encourage NFS to do Full Stripe Writes
    • Avoiding a Read-Modify-Write cycle in the RAID controller
    – So choose a stripe width == MIN(max_sectors_kb,NFS max transfer size)/N
    – Some RAID hw prefers 2N+1 RAID sets e.g. 4+1, 8+1

  43. #1 Tune the Storage Hardware
    • Choose hardware RAID stripe unit
    • Choose number of disks in hw RAID set
    – You want to encourage NFS to do Full Stripe Writes
    • Avoiding a Read-Modify-Write cycle in the RAID controller
    – So choose a stripe width == MIN(max_sectors_kb,NFS max transfer size)/N
    – Some RAID hw prefers 2N+1 RAID sets e.g. 4+1, 8+1
    • Choose RAID caching mode
    – Nobrainers: want read caching, write caching
    – Write cache mirroring = slow but safe...choose carefully
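    A worked example of the arithmetic above, with hypothetical numbers:
      # max_sectors_kb = NFS max transfer = 512 KiB, 4+1 RAID5 set (N = 4 data disks)
      # MIN(512 KiB, 512 KiB) / 4  =  128 KiB chunk per disk
      # one full-sized 512 KiB WRITE then covers 4 x 128 KiB = a full stripe,
      # so the controller never has to read-modify-write parity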

  44. #2 Tune the Block Layer
    • Choose the right IO scheduler for your workload
    – CFQ (Complete Fair Queuing) seems to work OK
    – Even though it's dumb about iocontexts & NFS
    – echo cfq > /sys/block/$sdx/queue/scheduler
    – But...YMMV. Experiment!

  45. #2 Tune the Block Layer
    • Choose the right IO scheduler for your workload
    – CFQ (Complete Fair Queuing) seems to work OK
    – Even though it's dumb about iocontexts & NFS
    – echo cfq > /sys/block/$sdx/queue/scheduler
    – But...YMMV. Experiment!
    • Increase CTQ (Command Tag Queue) depth
    – Improve SCSI parallelism => better disk performance
    – Sometimes unobvious upper limits; per-HBA, per-RAID controller
    – Default might be 1 (worst case), try increasing it
    – echo 4 > /sys/block/$sdx/device/queue_depth

  46. #2 Son of Tune the Block Layer
    • Bump up max_sectors_kb to get large IOs
    – You want the largest IOs possible going to disk
    • That are multiples of RAID stripe width
    – Linux limit varies with server page size
    • Altix: 16 KiB pages => 2 MiB max_sectors_kb
    • x86_64: 4 KiB pages => 512 KiB max_sectors_kb
    – echo 512 > /sys/block/$sdx/queue/max_sectors_kb

  47. #2 Son of Tune the Block Layer
    • Bump up max_sectors_kb to get large IOs
    – You want the largest IOs possible going to disk
    • That are multiples of RAID stripe width
    – Linux limit varies with server page size
    • Altix: 16 KiB pages => 2 MiB max_sectors_kb
    • x86_64: 4 KiB pages => 512 KiB max_sectors_kb
    – echo 512 > /sys/block/$sdx/queue/max_sectors_kb
    • Check your block device max readahead is adequate
    – cat /sys/block/$sdx/queue/read_ahead_kb
    • Don't forget to make your changes persistent
    – sgisetqdepth on SGI ProPack
    – Add an /etc/init.d script or a udev rule (sketch below)
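    One way to do that is a udev rule; a sketch only, since rule syntax and the
    attributes you may set vary with your udev version:
      # /etc/udev/rules.d/60-block-tune.rules  (one rule per line)
      ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd*", ATTR{queue/scheduler}="cfq", ATTR{queue/max_sectors_kb}="512", ATTR{queue/read_ahead_kb}="1024"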

  48. #3 Tune the Filesystem
    • Know your filesystem – RTFM, experiment!
    • Tune for NFS workload
    – Not local workloads
    – Not your application run locally
    • Some tunings must be done at mkfs time
    – So don't wait until your data is already on the fs

  49. #3 Tune the Filesystem
    • Know your filesystem – RTFM, experiment!
    • Tune for NFS workload
    – Not local workloads
    – Not your application run locally
    • Some tunings must be done at mkfs time
    – So don't wait until your data is already on the fs
    • Choose partitioning+Volume Manager arrangement to align
    filesystem structures to underlying RAID stripes
    • Use noatime or relatime (modern kernels)
    – If you can; noatime confuses some apps like mail readers

  50. #3 Revenge of Tune the Filesystem
    • XFS needs care & attention to get the best performance
    – Default options historically compatible not performance optimal
    – Be careful with units in mkfs.xfs, xfs_info

  51. #3 Revenge of Tune the Filesystem
    • XFS needs care & attention to get the best performance
    – Default options historically compatible not performance optimal
    – Be careful with units in mkfs.xfs, xfs_info
    • Optimise log IO
    – Maximise log size (128 MiB historically), but aligned to RAID stripe width
    – mkfs -l size=... -l sunit=... -l version=2
    – Log buffer size needs to be >= log stripe unit, multiple of RAID stripe width
    – mount -o logbufs=8,logbsize=...

  52. #3 Revenge of Tune the Filesystem
    • XFS needs care & attention to get the best performance
    – Default options historically compatible not performance optimal
    – Be careful with units in mkfs.xfs, xfs_info
    • Optimise log IO
    – Maximise log size (128 MiB historically), but aligned to RAID stripe width
    – mkfs -l size=... -l sunit=... -l version=2
    – Log buffer size needs to be >= log stripe unit, multiple of RAID stripe width
    – mount -o logbufs=8,logbsize=...
    • Consider a separate log device...NVRAM/SSD if you can
    • Lazy superblock counters to reduce SB IO
    – mkfs -l lazy-count=1

  53. #3 Bride of Tune the Filesystem
    • Make XFS align disk structures to underlying RAID stripes
    – mkfs -d sunit=... -d swidth=...
    • Increase directory block size
    – Larger block size improves workloads which modify large directories
    – mkfs -n size=16384

  54. #3 Bride of Tune the Filesystem
    • Make XFS align disk structures to underlying RAID stripes
    – mkfs -d sunit=... -d swidth=...
    • Increase directory block size
    – Larger block size improves workloads which modify large directories
    – mkfs -n size=16384
    • Choose number of Allocation Groups
    – Default is...arbitrary
    – Fewer AGs can improve some workloads
    – mkfs -d agcount=... OR mkfs -d agsize=...
    • Increase inode hash size on inode-heavy workloads
    – mount -o ihashsize=... # Not needed on latest XFS
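    Pulling the XFS options together, a hedged example for an array with a 256 KiB
    stripe unit and 4 data disks (every number here needs adjusting to your RAID):
      mkfs.xfs -d su=256k,sw=4,agcount=16 \
               -l version=2,size=128m,su=256k,lazy-count=1 \
               -n size=16384  /dev/sdb1
      mount -o logbufs=8,logbsize=256k,noatime /dev/sdb1 /export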

  55. #4 Tune the VM
    • Make the VM push unstable pages to disk faster
    – Hopefully, some before the client does a COMMIT
    – Especially useful if you enable the async export option
    – sysctl vm.dirty_writeback_centisecs=50
    – sysctl vm.dirty_background_ratio=10
    • But...YMMV
    • Do not reduce vm.dirty_ratio
    – Can cause nfsds to block unnecessarily
    – You really don't want that
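    The usual way to make those persistent is /etc/sysctl.conf (same values as above):
      vm.dirty_writeback_centisecs = 50
      vm.dirty_background_ratio = 10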

  56. #5 Tune PCI Cards
    • Some PCI performance effects...
    – Cards sharing a bus share bus bandwidth
    – Two cards in a bus can slow down the bus rate
    – Putting a card in some slots can slow down slots on other buses
    – Throughput limits in PCI bridges
    – DMA resource limits in PCI bridges
    – Network cards use lots of small DMAs, very unfriendly to PCI
    – Some PCI devices don't play well with others, need to be on their own bus
    – Some cards require MaxReadSize tuned upwards – BIOS or driver setting
    – Some cards benefit from Write Combining on PCI mappings – driver setting
    – Some cards benefit from using MSI – driver, BIOS settings

  57. #5 Ghost of Tune PCI Cards
    • NUMA effects
    – Minimise NUMA hops: ensure interrupts are bound to nearby CPUs
    – Fast networks, filesystems can saturate NUMA interconnects
    – So try to arrange for page cache to be near the NIC & HBA that DMA to it
    – Turn off cpuset page spreading, it pessimises DMA patterns

  58. #5 Ghost of Tune PCI Cards
    • NUMA effects
    – Minimise NUMA hops: ensure interrupts are bound to nearby CPUs
    – Fast networks, filesystems can saturate NUMA interconnects
    – So try to arrange for page cache to be near the NIC & HBA that DMA to it
    – Turn off cpuset page spreading, it pessimises DMA patterns
    • FC multipathing: choose default paths carefully
    – Want to balance IOs down multiple paths to the storage
    • Know your network, SCSI, and platform hardware!
    – RTFM. Experiment

  59. #6 Tune the Network
    • Check basic parameters
    – Speed, duplex, errors
    – ethtool

  60. #6 Tune the Network
    • Check basic parameters
    – Speed, duplex, errors
    – ethtool
    • Tune interrupt coalescing for bulk traffic
    – Reduce the NIC interrupt rate for a given workload
    • Adds latency, but it increases NFS server throughput
    • If you have latency-sensitive traffic as well, consider a separate NIC for that
    – ethtool -C ethN rx-usecs 80 rx-frames 20 rx-usecs-irq 80 rx-frames-irq 20
    – Some cards only support a subset of those parameters
    – Current generation 1gige, 10gige, next gen IB cards

  61. #6 The Evil of Tune the Network
    • Bind NIC interrupts to CPUs
    – Keep device cachelines hot in one CPU
    – ifconfig tells you the IRQ
    – printf %08x $[1<<$cpu] > /proc/irq/$irq/smp_affinity
    – Careful manual binding may be better than irqbalanced

  62. #6 The Evil of Tune the Network
    • Bind NIC interrupts to CPUs
    – Keep device cachelines hot in one CPU
    – ifconfig tells you the IRQ
    – printf %08x $[1<<$cpu] > /proc/irq/$irq/smp_affinity
    – Careful manual binding may be better than irqbalanced
    • Increase socket buffer sizes
    – Need to allow effective TCP buffering, window scaling on faster networks
    – Modern kernels do this automatically; older ones have bad defaults
    – sysctl net.ipv4.tcp_{,r,w}mem='8192 262144 524288'
    – More for 10gige
    – Actual sysctls needed vary with kernel version; consult kernel docs
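    A worked example of the IRQ binding above, assuming eth0 turns out to be on
    IRQ 58 (hypothetical) and we want it serviced by CPU 2:
      grep eth0 /proc/interrupts                 # find the IRQ; say it is 58
      printf %08x $[1<<2]                        # CPU 2 -> mask 00000004
      echo 00000004 > /proc/irq/58/smp_affinity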

  63. #6 The Curse of Tune The Network
    • Enable TSO (TCP Segmentation Offload)
    – Moves some of the simpler TCP grunt work from software to hardware
    – Larger DMAs to card, less CPU work per byte sent
    – May be off by default
    – ethtool -K ethN tso on
    – Current generation 1gige, 10gige, IB cards

  64. #6 The Curse of Tune The Network
    • Enable TSO (TCP Segmentation Offload)
    – Moves some of the simpler TCP grunt work from software to hardware
    – Larger DMAs to card, less CPU work per byte sent
    – May be off by default
    – ethtool -K ethN tso on
    – Current generation 1gige, 10gige, IB cards
    • Possibly enable LRO (Large Receive Offload)
    – Receive-side equivalent of TSO
    – May help for streaming workloads. Experiment!
    – Current generation 10gige, next generation IB

  65. #6 Flesh for Tune the Network
    • Enable RSS (Receive Side Scaling)
    – Splits interrupt load into multiple MSI-X -> spread across multiple CPUs
    – Fiddly to tune, consult driver docs
    – Current generation 10gige, next generation IB cards

  66. #6 Flesh for Tune the Network
    • Enable RSS (Receive Side Scaling)
    – Splits interrupt load into multiple MSI-X -> spread across multiple CPUs
    – Fiddly to tune, consult driver docs
    – Current generation 10gige, next generation IB cards
    • Check hardware checksum offload is enabled
    – Some drivers let you turn it off with ethtool (why?)
    – Current generation 1gige, 10gige, in next generation IB cards

  67. #6 Flesh for Tune the Network
    • Enable RSS (Receive Side Scaling)
    – Splits interrupt load into multiple MSI-X -> spread across multiple CPUs
    – Fiddly to tune, consult driver docs
    – Current generation 10gige, next generation IB cards
    • Check hardware checksum offload is enabled
    – Some drivers let you turn it off with ethtool (why?)
    – Current generation 1gige, 10gige, in next generation IB cards
    • Enable IPoIB Connected Mode
    – Allows larger MTU (~64KiB) => less CPU spent in interrupt
    – See OFED 1.2 release notes

  68. #6 Tune the Network meets Dracula
    • Fix default ARP settings
    – Default Linux settings do weird things to incoming packet paths if you use
    multiple NICs in the same broadcast domain
    – echo 1 > /proc/sys/net/ipv4/conf/ethN/arp_ignore
    – echo 2 > /proc/sys/net/ipv4/conf/ethN/arp_announce

  69. #6 Tune the Network meets Dracula
    • Fix default ARP settings
    – Default Linux settings do weird things to incoming packet paths if you use
    multiple NICs in the same broadcast domain
    – echo 1 > /proc/sys/net/ipv4/conf/ethN/arp_ignore
    – echo 2 > /proc/sys/net/ipv4/conf/ethN/arp_announce
    • Bonding: match the incoming and outgoing transmit hashes
    – Keep driver cachelines hot in one CPU
    – Especially important on NUMA platforms
    – Consult your switch docs...and don't trust them. Experiment!
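    On the Linux side the outgoing hash is the bonding driver's xmit_hash_policy; a
    sketch for an 802.3ad bond (the matching incoming hash lives in the switch config):
      # modprobe.conf / modprobe.d entry -- example values only
      options bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4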

  70. #7 Think about async
    • The async export option can put newly written data at risk
    • But it's faster for some workloads
    • Sometimes the speed-up is worth the risk
    – scratch/temporary data
    – server on UPS

  71. #8 Use no_subtree_check
    • The subtree_check export option is for “added security”
    – No beneficial effect if you only export mountpoints
    – Significant CPU cost on metadata-heavy workloads (specSFS)
    – Arguably, breaks NFS filehandle semantics
    • Default has changed, so always explicitly specify
    no_subtree_check
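    A hedged /etc/exports sketch combining rules #7 and #8 (paths and the client
    network are placeholders):
      /export   192.168.1.0/24(rw,sync,no_subtree_check)
      /scratch  192.168.1.0/24(rw,async,no_subtree_check)   # expendable data only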

  72. #9 Use More Server Threads
    • Default in most Linux distros is 4 or 8: way too low
    • Figuring out how many you really need is hard
    – MIN(disk queue depth/fudge factor, max number of parallel requests from
    clients in your expected peak workload)
    – You are not expected to understand this

  73. #9 Use More Server Threads
    • Default in most Linux distros is 4 or 8: way too low
    • Figuring out how many you really need is hard
    – MIN(disk queue depth/fudge factor, max number of parallel requests from
    clients in your expected peak workload)
    – You are not expected to understand this
    • Since 2.6.19 no CPU performance penalty with lots of nfsds
    – memory use: ~ 1.1 MiB/nfsd
    • Some server data structures sized by initial #nfsds

  74. #9 Use More Server Threads
    • Default in most Linux distros is 4 or 8: way too low
    • Figuring out how many you really need is hard
    – MIN(disk queue depth/fudge factor, max number of parallel requests from
    clients in your expected peak workload)
    – You are not expected to understand this
    • Since 2.6.19 no CPU performance penalty with lots of nfsds
    – memory use: ~ 1.1 MiB/nfsd
    • Some server data structures sized by initial #nfsds
    • So use too many....like 128
    • USE_KERNEL_NFSD_NUMBER in /etc/sysconfig/nfs
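    For example (SUSE-style sysconfig variable from the bullet above; rpc.nfsd can
    also change the count on a running server):
      # /etc/sysconfig/nfs
      USE_KERNEL_NFSD_NUMBER="128"
      # or at runtime:
      rpc.nfsd 128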

  75. #10 Whoops, no #10
    • Letterman still won't return my calls

  76. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  77. Measurement Tips
    • Measure subsystems separately first
    – Local throughput to disk
    – Network throughput to client
    – Only then measure NFS performance

  78. Measurement Tips
    • Measure subsystems separately first
    – Local throughput to disk
    – Network throughput to client
    – Only then measure NFS performance
    • Choose a good measurement tool. A good tool...
    – Does accurate measurements
    – Is convenient to use
    – Is efficient
    – Shows time variation
    – e.g. Performance Co-Pilot (plug!)

  79. Measurement Tips Strikes Back
    • Be aware of your benchmark's behaviour
    – Does it report in MiB/s or MB/s or something weird?
    – Does it count close or fsync time?
    – Does it accidentally trigger cache effects?
    – Is it emitting the right IO sizes?
    – Is it doing buffered, direct or sync IO?
    – When writing, is it truncating or over-writing?
    – Streaming or random IOs?

  80. Measurement Tips Strikes Back
    • Be aware of your benchmark's behaviour
    – Does it report in MiB/s or MB/s or something weird?
    – Does it count close or fsync time?
    – Does it accidentally trigger cache effects?
    – Is it emitting the right IO sizes?
    – Is it doing buffered, direct or sync IO?
    – When writing, is it truncating or over-writing?
    – Streaming or random IOs?
    • Measure the right thing
    – Performance depends on workload
    – So measure something as close as possible to the workload you care about
    • If possible, measure the workload itself

  81. Measurement Tips Begins
    • Be aware: there are two page caches: client, server
    • Either can skew performance results
    – Server cache: reads at network speed not disk speed
    – Client cache: reads at local memory copy speed
    • Need to ensure their states are appropriately empty (or full!)

  82. Measurement Tips Begins
    • Be aware: there are two page caches: client, server
    • Either can skew performance results
    – Server cache: reads at network speed not disk speed
    – Client cache: reads at local memory copy speed
    • Need to ensure their states are appropriately empty (or full!)
    • Standard trick: use files > either RAM size
    • Other tricks: flush page cache by
    – Using bcfree
    – echo 1 > /proc/sys/vm/drop_caches
    – umount, mount the filesystem on both client, server
    – Some benchmarks have a warmup or ageing phase

  83. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?

  84. Why NFSv2 Sucks
    • Limited file size (4 GiB)
    • Limited transfer size (8 KiB)
    • Writes are sync = slow
    • ls -l requires many more roundtrips
    – v2: READDIR + N * (LOOKUP + GETATTR)
    – v3: READDIRPLUS

  85. Why NFSv4 Sucks
    • Much more complex than v3
    – Yet doesn't address important design flaws of v3
    • Linux server idmapping is single-threaded
    • Linux server and pseudo-root
    – Very painful to have more than 1 v4 export point
    • Great, it's secure!
    – But have you ever tried setting up Kerberos on Linux?
    • Linux client & server still can't do extended attributes
    • What's with all the W*****s crud in the protocol?
    – Those guys have their own perfectly adequate protocol

  86. Why NFS on UDP Sucks
    • Client: poor congestion control
    – Implemented in the sunrpc layer, not the transport protocol
    – Linux: based on RTT estimators plus a little fuzz
    – Results in hair trigger soft option
    – No, timeo= will not save you
    – Hence spurious EIO errors to userspace
    – Good applications die...bad applications corrupt your data when writing

  87. Why NFS on UDP Sucks
    • Client: poor congestion control
    – Implemented in the sunrpc layer, not the transport protocol
    – Linux: based on RTT estimators plus a little fuzz
    – Results in hair trigger soft option
    – No, timeo= will not save you
    – Hence spurious EIO errors to userspace
    – Good applications die...bad applications corrupt your data when writing
    • IPID aliasing
    – Fundamental design issue in IP reassembly
    – Causes invisible data corruption at high data rates
    – IP layer passes corrupt data up to NFS

  88. Why NFS on UDP Sucks vs Mothra
    • Routers drop UDP packets first when congested
    – UDP == “expendable”
    – Causes retries => slowdown
    – ...or IPID aliasing => corruption
    • Limited transfer size (Linux: 32 KiB)
    • Linux server: single socket performance limit

  89. Testing Your Network
    • Use ethtool to test NIC settings
    – Speed negotiated with switch properly
    – Full duplex
    – Interrupt coalescing enabled
    – TSO enabled
    – Hardware checksum enabled
    – Scatter/gather enabled
    • Use ping for basic connectivity and stability
    – Summary tells you dropped packets
    – Watch netstat -i for errors

  90. Testing Your Network vs Predator
    • Use ping -f for stress testing connectivity
    – Also finds some interrupt problems
    – Dots display tells you dropped packets
    – Watch netstat -i, netstat -s for errors
    • Use ttcp or nttcp for testing TCP throughput
    – Make sure you can fill the pipe
    – Watch netstat -st for retransmits
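    The checks from this slide and the previous one, as concrete commands (eth0 is
    a placeholder; which offloads appear depends on the driver):
      ethtool eth0       # speed, duplex, link detected
      ethtool -c eth0    # interrupt coalescing settings
      ethtool -k eth0    # offloads: tso, rx/tx checksumming, scatter-gather
      netstat -i         # per-interface error and drop counters
      netstat -s         # protocol counters; watch TCP retransmits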

  91. “Sync” vs “Sync”
    • Client mount option vs server export option
    • They do different things
    • Do not have to be consistent
    • Basic rule: sync + sync = very slow
    • But what do they do?

  92. Sync: the Client Mount Option
    • Linux generic mount option
    • NFS client serialises IOs
    • Each WRITE waits until data is on stable storage
    • Sloooow, but for NFS no safer
    • Use async, this is the default

  93. Sync: the Server Export Option
    • sync specifies the RFC-compliant behaviour
    • async makes the server lie to the client about data stability
    • Significantly faster for some workloads

  94. Sync: the Server Export Option
    • sync specifies the RFC-compliant behaviour
    • async makes the server lie to the client about data stability
    • Significantly faster for some workloads
    • But a real danger of losing written data
    – Server says “sure, the data's on disk”
    – Client forgets the data is dirty...may even delete its copy
    – If server crashes before the data hits disk, client won't resend => data loss
    – Use async only if the performance is worth the risk
    • Historically the default changed
    – whiny message from exportfs unless you explicitly choose sync or async
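    Side by side, the two unrelated knobs (paths are placeholders):
      mount -o sync nfsserver:/export /mnt     # client: serialise WRITEs, wait for stable storage
      /export  *(rw,sync,no_subtree_check)     # server /etc/exports: honest, RFC-compliant replies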

  95. NUMA Effects
    • NFS thread pool mode
    – Want pernode at 2 CPUs/node
    – Want percpu at > 2 CPUs/node
    – echo percpu > /proc/fs/nfsd/pool_mode # while nfsd stopped
    • NFS doesn't play nice with cpusets
    – NFS wants to do its own CPU management
    – Cpuset memory allocation policies can slow NFS on high bandwidth setups
    – Please don't mix cpusets & NFS

  96. Using iozone Effectively
    • Ensure close time is measured: -c
    • Take care with choosing IO block sizes
    • Default workloads are sensitive to cache effects
    • Cluster mode -+m rocks
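    A hedged iozone invocation along those lines (sizes, thread count and the
    client list file are examples only):
      iozone -c -e -r 1m -s 8g -i 0 -i 1 -t 4 -+m clients.cfg
      #  -c count close(), -e count flush/fsync, -r record size, -s file size (> RAM),
      #  -i 0 -i 1 write & read tests, -t threads, -+m cluster-mode client file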

  97. Contents
    • Introduction
    • How NFS Works
    • Performance Limiting Factors
    • Top 10 Rules for Tuning Clients
    • Top 10 Rules for Tuning Servers
    • Measurement Tips & Tricks
    • Bonus Topics
    • Questions?
