Deep Dive into a Shallow Write Pool

Presented at Apache CouchDB Conf 2013 in Vancouver.

Jason Johnson

November 13, 2013

Transcript

  1. I’m a programmer.
    • Not an electrical engineer.
    • Not a systems guy.
    I research and prototype stuff. Most days I write code.
    Things I’ve learned?
    • Drive write caches are evil.
    • Drives are phenomenal.
    • File systems aren’t simple and guarantee interesting things.
    • The storage demands of today are going to prompt a return to first principles.
  2. What is /dev/sdc, anyway?
    [Diagram labels: PCIe x16 to North Bridge; Disk Array Controller; ROC; DDR; Backplane; Hard Disk; Solid State (three drives); MLC NAND; DDR]
  3. Floating-Gate Metal-Oxide-Semiconductor Field-Effect Transistor (MOSFET)
    • Single-Level Cell (SLC): 2 states, 0 or 1
    • Multi-Level Cell (MLC): N states, delimited by thresholds: 00, 01, 10, etc.
    [Four cell diagrams, each labeled Insulation, Source, Drain, Control Gate, Substrate, Insulator, with the Floating Gate marked where shown]
    • Read 1 (one) Operation: 1V, GND; floating gate (no current)
    • Write Operation (Fowler-Nordheim tunneling): 20V, GND
    • Read 0 (zero) Operation: 1V, GND
    • Erase Operation: GND, 20V
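
    The "N states, delimited by thresholds" idea can be illustrated with a small sketch. It is only a toy model: the threshold voltages below are hypothetical, the function names are invented for illustration, and the mapping from a level to an actual bit pattern is device-specific and not modeled here.

```python
from bisect import bisect_right


def cell_level(voltage, thresholds):
    """Return the index of the band that a sensed voltage falls into.

    `thresholds` is an ascending list of reference voltages; N thresholds
    delimit N + 1 distinguishable states.
    """
    return bisect_right(thresholds, voltage)


def bits_per_cell(thresholds):
    """Number of bits needed to label every state the thresholds delimit."""
    states = len(thresholds) + 1
    return (states - 1).bit_length()


# Hypothetical reference voltages, chosen only to make the example concrete.
SLC_THRESHOLDS = [2.0]             # one threshold  -> 2 states
MLC_THRESHOLDS = [1.0, 2.0, 3.0]   # three thresholds -> 4 states

print(cell_level(0.4, SLC_THRESHOLDS), bits_per_cell(SLC_THRESHOLDS))
print(cell_level(2.5, MLC_THRESHOLDS), bits_per_cell(MLC_THRESHOLDS))
```

    The point of the sketch is that packing more states into one cell narrows each band, so the same sensing error is more likely to cross a threshold.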
  4. Storage Media Survey

    Assertion                                                          HDD   SSD
    Stores data.                                                       Yes   Yes
    Relies on components found in your average stereo.                 Yes   Yes
    Relies on physical phenomenon.                                     Yes   No
    Relies on quantum mechanical phenomenon.                           No    Yes
    Executes millions of lines of proprietary code when saving data.   Yes   Yes
  5. Tuning the File System
    • Start Simple
    • Disable Caching
    • Tools: sysbench, iozone, iostat, vmstat
    • Apply Increasing Parallel I/O
    • ext2, ext3, ext4, xfs, btrfs, zfs?
    • Graph Everything
  6. Volume Creation & Cache Disablement

    arcconf \
      create 1 logicaldrive \
      stripesize 256 \
      wcache wt \
      rcache roff \
      max \
      0 \
      0 7 \
      0 8 \
      0 9 \
      0 10

    arcconf \
      setcache 1 \
      device 0 [7-10] \
      wt
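  7. EXT4 vs. Optimized EXT4 for a 256KB Stripe w/ 4 Drives

    Default:

    mkfs.ext4 /dev/sdc1

    Optimized (extended options joined into one comma-separated argument so the command parses as written):

    mkfs.ext4 \
      -b 4096 \
      -E stride=4,stripe_width=16 \
      -J device=/dev/sdb1 \
      /dev/sdc1

    Mount with: noatime, stripe=16, nobarrier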
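
    As a cross-check on the ext4 geometry options, the sketch below derives `stride` and `stripe_width` from a per-disk chunk size. It follows the definitions in the mke2fs manual (stride is the number of filesystem blocks per chunk on one disk; stripe_width is stride times the number of data disks). The helper name `ext4_geometry` is invented for illustration, and which per-disk chunk size the array really uses is not stated in the transcript, so both inputs shown are assumptions.

```python
def ext4_geometry(chunk_kib, data_disks, block_bytes=4096):
    """Return (stride, stripe_width) in filesystem blocks.

    chunk_kib   -- data written to one disk before moving to the next, in KiB
    data_disks  -- number of data-bearing disks in the stripe
    block_bytes -- filesystem block size passed to mkfs.ext4 -b
    """
    chunk_bytes = chunk_kib * 1024
    if chunk_bytes % block_bytes:
        raise ValueError("chunk size must be a multiple of the block size")
    stride = chunk_bytes // block_bytes
    return stride, stride * data_disks


# A 16 KiB chunk reproduces the values on the slide.
print(ext4_geometry(16, 4))   # (4, 16)

# A 64 KiB chunk, the stripe unit used on the XFS slide, would give larger values.
print(ext4_geometry(64, 4))   # (16, 64)
```

    If the two file systems are meant to be compared on the same array, it is worth confirming that both were created against the same chunk size before reading too much into the benchmark numbers.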
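  8. XFS vs. Optimized XFS for a 256KB Stripe w/ 4 Drives

    Default:

    mkfs.xfs /dev/sdc1

    Optimized (log options joined into one comma-separated argument so the command parses as written):

    mkfs.xfs \
      -d sw=4,su=64k \
      -l logdev=/dev/sdb1,size=128m \
      /dev/sdc1

    Mount with: noatime, logdev=/dev/sdb1, nobarrier, inode64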
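
    For XFS, `su` is the stripe unit written to each disk and `sw` is the number of stripe units per full stripe, so their product should equal the stripe named in the slide title. A minimal sketch of that arithmetic follows; the helper names are invented for illustration, and the parser handles only the `k`, `m`, and `g` suffixes as powers of 1024.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '64k' or '128m' to bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def xfs_full_stripe(su, sw):
    """Full stripe size in bytes for mkfs.xfs -d su=<su>,sw=<sw>."""
    return parse_size(su) * sw


full = xfs_full_stripe("64k", 4)
print(full, full // 1024)   # 262144 256
```

    With `su=64k` and `sw=4` the full stripe comes to 256 KiB, which matches the slide title.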
  9. sysbench Parameters

    sysbench \
      --num-threads=[8-128] \
      --test=fileio \
      --file-total-size=10G \
      --file-test-mode=rndwr \
      --file-fsync-all=on \
      --file-num=64 \
      --file-block-size=16384 \
      [prepare|run|cleanup]
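
    The bracketed placeholders on this slide imply a sweep over thread counts and the three sysbench phases. One way to expand them into concrete command lines is sketched below. The slide gives only the range 8 to 128, so the doubling steps are an assumption, as is the order of one `prepare`, one `run` per thread count, and one `cleanup`. The sketch only prints the commands; it does not execute them.

```python
import shlex

# Options copied from the slide; only thread count and phase vary.
FIXED_OPTIONS = [
    "--test=fileio",
    "--file-total-size=10G",
    "--file-test-mode=rndwr",
    "--file-fsync-all=on",
    "--file-num=64",
    "--file-block-size=16384",
]


def thread_sweep(low=8, high=128):
    """Thread counts from low to high, doubling each step (an assumption)."""
    counts = []
    n = low
    while n <= high:
        counts.append(n)
        n *= 2
    return counts


def sysbench_cmd(threads, phase):
    """Build the argument list for one sysbench invocation."""
    if phase not in ("prepare", "run", "cleanup"):
        raise ValueError(f"unknown phase: {phase}")
    return ["sysbench", f"--num-threads={threads}", *FIXED_OPTIONS, phase]


def plan():
    """Prepare the files once, run at each thread count, then clean up."""
    counts = thread_sweep()
    commands = [sysbench_cmd(counts[0], "prepare")]
    commands += [sysbench_cmd(n, "run") for n in counts]
    commands.append(sysbench_cmd(counts[0], "cleanup"))
    return commands


for argv in plan():
    print(shlex.join(argv))
```

    Keeping the fixed options in one list makes it harder for a single run to drift from the others, which matters when the results are graphed side by side.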
  10. • Erlang R16B02
      ◦ base
      ◦ base-hipe
      ◦ Erlang Solutions
    • CouchDB master
      ◦ +A [4-16]
      ◦ +S 24:24
    • Ubuntu 12.04
      ◦ ulimit & sysctl tuned
    • Tsung
      ◦ ~270,000 potential
      ◦ 6 nodes, single target
      ◦ simulate massive ingestion or massively broken app servers
    [Diagram labels: six Tsung nodes at 20,000 each, pointed at one CouchDB target; 24 core, 128GB RAM, 12 SAS/SSD, 7k-series Adaptec, 10Gbit, Ubuntu 12.04]
  11. Simple CouchDB Configuration

    Setting                    Value
    couchdb.database_dir       /mnt/data
    couchdb.view_index_dir     /mnt/views
    couchdb.delayed_commits    false
    couchdb.file_compression   none
    httpd.socket_options       [{nodelay, true}, {keepalive, true}]
    uuids.algorithm            sequential
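
    The dotted setting names on this slide map naturally onto CouchDB's ini-style configuration, where `section.key` becomes `key = value` under `[section]`. The sketch below renders the slide's settings in that form. It assumes the CouchDB 1.x ini layout; the helper name `render_ini` is invented for illustration, and where the output should be saved (for example a `local.ini`) depends on the installation.

```python
import configparser
import io

# Settings exactly as listed on the slide, keyed as "section.key".
SETTINGS = {
    "couchdb.database_dir": "/mnt/data",
    "couchdb.view_index_dir": "/mnt/views",
    "couchdb.delayed_commits": "false",
    "couchdb.file_compression": "none",
    "httpd.socket_options": "[{nodelay, true}, {keepalive, true}]",
    "uuids.algorithm": "sequential",
}


def render_ini(settings):
    """Render dotted settings as ini text with one section per prefix."""
    parser = configparser.ConfigParser(interpolation=None)
    for dotted, value in settings.items():
        section, key = dotted.split(".", 1)
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, key, value)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(render_ini(SETTINGS))
```

    Generating the file from one dictionary keeps the benchmark configuration reviewable in a single place, which helps when the same settings must be applied to several test hosts.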
  12. 1, 10, 100, 500, 1000 & 1200 Concurrency, 16KB * 10 bulk payload
    [Chart values: 12,630; 9,070; 39.25%; 21.9 hours; 30.6 hours]
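
    The figures on this slide can be sanity-checked with a little arithmetic. The transcript does not give units, so the sketch below assumes that 12,630 and 9,070 are per-second rates for the two configurations and that 21.9 and 30.6 hours are the corresponding run times; both pairings are assumptions, and the function names are invented for illustration.

```python
def percent_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


def implied_work(rate_per_second, hours):
    """Total units processed at a steady rate over the given duration."""
    return rate_per_second * hours * 3600


FAST_RATE, SLOW_RATE = 12_630, 9_070   # values from the slide, units assumed
FAST_HOURS, SLOW_HOURS = 21.9, 30.6    # values from the slide, pairing assumed

print(round(percent_gain(FAST_RATE, SLOW_RATE), 2))   # 39.25
print(round(implied_work(FAST_RATE, FAST_HOURS)))
print(round(implied_work(SLOW_RATE, SLOW_HOURS)))
```

    The percentage matches the 39.25% on the slide. Under the stated assumptions, both rate and duration pairs work out to a total near one billion units, which would be consistent with the durations being time-to-ingest estimates for a fixed workload; the slide itself does not say so.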
  13. “Cute graphs. What about me?”
    • Unreliable storage media is the reality
      ◦ Phenomenal characteristics
      ◦ Proprietary code
    • Revisit operational practices
      ◦ Hardware selection
      ◦ Disk array controller configuration
      ◦ File system initialization
      ◦ Database configuration
    • Trust but verify
      ◦ Demand transparency