Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Tuning Linux for PostgreSQL

Tuning Linux for PostgreSQL

Slides from my talk at Secon 2015 Penza, Russia

Alexey Lesovsky

April 15, 2015
Tweet

More Decks by Alexey Lesovsky

Other Decks in Education

Transcript

  1. Пенза, 24 апреля 2015 2015.secon.ru About: • Алексей Лесовский •

    PostgreSQL DBA • Linux system administrator PostgreSQL-Consulting.com • Поддержка 24/7 • Аудит и оптимизация производительности • Консультации и тренинги • Мониторинг и аварийное реагирование • Планирование ресурсов
  2. Пенза, 24 апреля 2015 2015.secon.ru Agenda • Linux on PostgreSQL

    ? • Resources and Databases • OS subsystems • CPU & Process scheduling, Power saving • Memory & VM, NUMA, Huge Pages • Storage & Filesystems, Input/Output • Other misc
  3. Пенза, 24 апреля 2015 2015.secon.ru Intro. • Linux as a

    platform for PostgreSQL database • Active development & Community support • Rich features & Fast implementation • Stable & Mature & Durable • Default settings in Linux
  4. Пенза, 24 апреля 2015 2015.secon.ru Tuning targets CPU Disk Memory

    Concurrency Query Speed Sort, Group, Hash,... OS cache (page cache) DB cache (shared_buffers) Operations (work_mem) DB storage Transactional Log (WAL) Cold start
  5. Пенза, 24 апреля 2015 2015.secon.ru Tuning targets CPU Disk Memory

    CPU Scheduler NUMA Power saving Virtual Memory NUMA Huge Pages FileSystems Storage I/O
  6. Пенза, 24 апреля 2015 2015.secon.ru Tuning targets: • CPU scheduler

    • Virtual memory and NUMA • Huge pages • File systems • Storage IO • Power saving policy • Others
  7. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: CPU scheduler •

    CPU scheduler is ... • sysctl: kernel.sched_migration_cost_ns • sysctl: kernel.sched_autogroup_enabled • http://www.postgresql.org/message-id/[email protected] • http://kernelnewbies.org/Linux_2_6_38#head-59575a6aeafa38490226a560ee02de89829a5b20
  8. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: CPU scheduler •

    WARNING: should use update kernel because • Ubuntu 12.04 Bug #1055222, • Ubuntu 14.04 Bug #1422016 A kernel panic occurs while disabling the automatic task group creation within /etc/sysctl.conf using "kernel.sched_autogroup_enabled = 0". After adding the "noautogroup" kernel parameter to GRUB, the system will boot without issue.
  9. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: CPU scheduler •

    pgbench tests: $ pgbench -S -c 8 -T 30 -U postgres pgbench transaction type: SELECT only scaling factor: 30 duration: 30 s number of clients: 8 number of threads: 1 sched_migration_cost_ns = 50000, sched_autogroup_enabled = 1 - tps: 22621, 22692, 22502 sched_migration_cost_ns = 500000, sched_autogroup_enabled = 0 - tps: 23689, 23930, 23657
  10. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: Virtual Memory •

    Virtual Memory • Allocator, Caching, Dirty pages and Writeback • vm.dirty_background_ratio & vm.dirty_ratio • vm.dirty_background_bytes & vm.dirty_bytes
  11. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: Virtual Memory 0%

    of pagecache RAM dirty do nothing dirty_backgrouond_ratio (10% of RAM dirty) – wakeup flushd dirty_ratio (20% of RAM dirty) – process start sync writes 100% of pagecache RAM dirty flushd writes dirty buffers in background flushd and write()'ng process write dirty buffer
  12. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: NUMA c1 c2

    c3 c4 c1 c2 c3 c4 c1 c2 c3 c4 c1 c2 c3 c4 M1 M3 M4 M2 S1 S2 S4 S3 S – Socket C – CPU Core M – Memory Bank
  13. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: NUMA • BIOS:

    enable memory node interleaving • kernel boot: numa=of • numactl utility • sysctl: vm.zone_reclaim_mode = 0 • sysctl: kernel.numa_balancing = 0
  14. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: NUMA $ numactl

    --hardware available: 1 nodes (0-0) -- NUMA disabled node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 196578 MB node 0 free: 1122 MB node distances: node 0 0: 10 $ numactl --hardware available: 2 nodes (0-1) -- NUMA enabled node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17 node 0 size: 65501 MB node 0 free: 4186 MB node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23 node 1 size: 65536 MB node 1 free: 10803 MB node distances: node 0 1 0: 10 21 1: 21 10
  15. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: Huge Pages •

    Huge pages vs. Transparent huge pages • Transparent huge pages • echo never > /sys/kernel/mm/transparent_hugepage/enabled • echo never > /sys/kernel/mm/transparent_hugepage/defrag
  16. Пенза, 24 апреля 2015 2015.secon.ru PostgreSQL with Huge Pages •

    libhugetlbfs • sysctl: vm.hugetlb_shm_group • sysctl: vm.nr_hugepages • sysctl: vm.nr_overcommit_hugepages • environment var: HUGETLB_SHM & LD_PRELOAD • postgresql.conf: huge_pages = try | on
  17. Пенза, 24 апреля 2015 2015.secon.ru PostgreSQL with Huge Pages •

    $ grep ^Huge /proc/meminfo HugePages_Total: 75532 HugePages_Free: 54 HugePages_Rsvd: 13 HugePages_Surp: 0 Hugepagesize: 2048 kB • $ pmap -x $(pgrep postgres|head -n 1) Address Kbytes RSS Dirty Mode Mapping 00007f24a1e00000 25802752 0 0 rw-s- anon_hugepage (deleted)
  18. Пенза, 24 апреля 2015 2015.secon.ru Huge Pages: Resume • Disable

    Transparent Huge Pages • PostgreSQL <= 9.2: libhugetlbfs • PostgreSQL 9.3: huge pages not supported • PostgreSQL >= 9.4: huge pages native support
  19. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: Filesystems • Ext3

    vs Ext4 vs XFS • Filesystem Barriers • Disable Write Cache • hdparm -W0 /device/ • MegaCli64 -LDSetProp -DisDskCache -Lall -aALL • Hardware RAID + BBU = barrier=0 • Software RAID = barrier=1 • Enterprise SSD with Power Loss Protection = barrier=0 • Consumer SSD w/o Power Loss Protection = barrier=1 • http://www.thessdreview.com/
  20. Пенза, 24 апреля 2015 2015.secon.ru Filesystems barriers • $ pgbench

    -s 30 -c 8 -T 30 -U postgres pgbench transaction type: TPC-B (sort of) scaling factor: 30 query mode: simple number of clients: 8 number of threads: 1 duration: 30 s barrier=1 barrier=0 tps = 1005 tps = 1142 tps = 988 tps = 1164 tps = 988 tps = 1129
  21. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: Storage IO •

    SATA/SAS vs SSD • IO elevators • noop: SSD, hi-end storage • deadline: RAID, SATA/SAS • cfq: good default • # echo 'elevator_name' > /sys/block/<device>/queue/scheduler • kernel boot: elevator=<name> • /sys/block/*/queue/: rotational, rq_affinity, read_ahead_kb
  22. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: Power saving policy

    • acpi_cpufreq vs intel_pstate • scaling_governor • /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors • /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor • acpi_cpufreq + performance • intel_pstate + powersave
  23. Пенза, 24 апреля 2015 2015.secon.ru Power saving policy: acpi_cpufreq +

    ondemand $ select o.payment_method_id,pc.name,count(o.payment_method_id) as cnt from operation o LEFT JOIN payment_method pm ON pm.id=o.payment_method_id LEFT JOIN payment_classify pc ON pc.id = pm.payment_classify_id WHERE (type='deposit') AND (DATE(o.created)>='2015-04-05' AND DATE(o.created)<='2015-04-05') GROUP BY o.payment_method_id,pc.name QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=44.24..44.25 rows=1 width=26) (actual time=238.506..238.512 rows=20 loops=1) -> Nested Loop Left Join (cost=0.01..44.23 rows=1 width=26) (actual time=0.068..232.756 rows=10660 loops=1) -> Nested Loop Left Join (cost=0.01..43.64 rows=1 width=16) (actual time=0.058..206.598 rows=10660 loops=1) Join Filter: (pm.id = o.payment_method_id) Rows Removed by Join Filter: 394420 -> Index Scan using operation_date_created_type_state_key on operation o (cost=0.01..41.78 rows=1 width=8) (actual time=0.042..109.343 rows=10660 loops=1) Index Cond: ((date(created) >= '2015-04-05'::date) AND (date(created) <= '2015-04-05'::date) AND (type = 'deposit'::op_type)) -> Seq Scan on payment_method pm (cost=0.00..1.38 rows=38 width=12) (actual time=0.001..0.004 rows=38 loops=10660) -> Index Scan using pc_pkey on payment_classify pc (cost=0.00..0.58 rows=1 width=22) (actual time=0.002..0.002 rows=1 loops=10660) Index Cond: (id = pm.payment_classify_id) Total runtime: 238.630 ms
  24. Пенза, 24 апреля 2015 2015.secon.ru Power saving policy: acpi_cpufreq +

    performance $ select o.payment_method_id,pc.name,count(o.payment_method_id) as cnt from operation o LEFT JOIN payment_method pm ON pm.id=o.payment_method_id LEFT JOIN payment_classify pc ON pc.id = pm.payment_classify_id WHERE (type='deposit') AND (DATE(o.created)>='2015-04-05' AND DATE(o.created)<='2015-04-05') GROUP BY o.payment_method_id,pc.name QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=44.24..44.25 rows=1 width=26) (actual time=193.589..193.593 rows=20 loops=1) -> Nested Loop Left Join (cost=0.01..44.23 rows=1 width=26) (actual time=0.034..188.948 rows=10660 loops=1) -> Nested Loop Left Join (cost=0.01..43.64 rows=1 width=16) (actual time=0.030..166.176 rows=10660 loops=1) Join Filter: (pm.id = o.payment_method_id) Rows Removed by Join Filter: 394420 -> Index Scan using operation_date_created_type_state_key on operation o (cost=0.01..41.78 rows=1 width=8) (actual time=0.022..87.541 rows=10660 loops=1) Index Cond: ((date(created) >= '2015-04-05'::date) AND (date(created) <= '2015-04-05'::date) AND (type = 'deposit'::op_type)) -> Seq Scan on payment_method pm (cost=0.00..1.38 rows=38 width=12) (actual time=0.001..0.003 rows=38 loops=10660) -> Index Scan using pc_pkey on payment_classify pc (cost=0.00..0.58 rows=1 width=22) (actual time=0.002..0.002 rows=1 loops=10660) Index Cond: (id = pm.payment_classify_id) Total runtime: 193.659 ms
  25. Пенза, 24 апреля 2015 2015.secon.ru Tuning target: Misc • Clocksource

    • acpi_pm vs. hpet vs. tsc • /sys/devices/system/clocksource/clocksource0/available_clocksource • /sys/devices/system/clocksource/clocksource0/current_clocksource
  26. Пенза, 24 апреля 2015 2015.secon.ru Clocksource: TSC $ select o.payment_method_id,pc.name,count(o.payment_method_id)

    as cnt from operation o LEFT JOIN payment_method pm ON pm.id=o.payment_method_id LEFT JOIN payment_classify pc ON pc.id = pm.payment_classify_id WHERE (type='deposit') AND (DATE(o.created)>='2015-04-05' AND DATE(o.created)<='2015-04-05') GROUP BY o.payment_method_id,pc.name QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=44.24..44.25 rows=1 width=26) (actual time=244.300..244.308 rows=20 loops=1) -> Nested Loop Left Join (cost=0.01..44.23 rows=1 width=26) (actual time=0.075..238.265 rows=10660 loops=1) -> Nested Loop Left Join (cost=0.01..43.64 rows=1 width=16) (actual time=0.064..211.236 rows=10660 loops=1) Join Filter: (pm.id = o.payment_method_id) Rows Removed by Join Filter: 394420 -> Index Scan using operation_date_created_type_state_key on operation o (cost=0.01..41.79 rows=1 width=8) (actual time=0.046..111.409 rows=10660 loops=1) Index Cond: ((date(created) >= '2015-04-05'::date) AND (date(created) <= '2015-04-05'::date) AND (type = 'deposit'::op_type)) -> Seq Scan on payment_method pm (cost=0.00..1.38 rows=38 width=12) (actual time=0.001..0.004 rows=38 loops=10660) -> Index Scan using pc_pkey on payment_classify pc (cost=0.00..0.58 rows=1 width=22) (actual time=0.002..0.002 rows=1 loops=10660) Index Cond: (id = pm.payment_classify_id) Total runtime: 244.433 ms
  27. Пенза, 24 апреля 2015 2015.secon.ru Clocksource: ACPI_PM $ select o.payment_method_id,pc.name,count(o.payment_method_id)

    as cnt from operation o LEFT JOIN payment_method pm ON pm.id=o.payment_method_id LEFT JOIN payment_classify pc ON pc.id = pm.payment_classify_id WHERE (type='deposit') AND (DATE(o.created)>='2015-04-05' AND DATE(o.created)<='2015-04-05') GROUP BY o.payment_method_id,pc.name QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=44.24..44.25 rows=1 width=26) (actual time=1062.165..1062.188 rows=20 loops=1) -> Nested Loop Left Join (cost=0.01..44.23 rows=1 width=26) (actual time=0.104..1047.211 rows=10660 loops=1) -> Nested Loop Left Join (cost=0.01..43.64 rows=1 width=16) (actual time=0.091..962.252 rows=10660 loops=1) Join Filter: (pm.id = o.payment_method_id) Rows Removed by Join Filter: 394420 -> Index Scan using operation_date_created_type_state_key on operation o (cost=0.01..41.79 rows=1 width=8) (actual time=0.046..104.546 rows=10660 loops=1) Index Cond: ((date(created) >= '2015-04-05'::date) AND (date(created) <= '2015-04-05'::date) AND (type = 'deposit'::op_type)) -> Seq Scan on payment_method pm (cost=0.00..1.38 rows=38 width=12) (actual time=0.002..0.039 rows=38 loops=10660) -> Index Scan using pc_pkey on payment_classify pc (cost=0.00..0.58 rows=1 width=22) (actual time=0.003..0.004 rows=1 loops=10660) Index Cond: (id = pm.payment_classify_id) Total runtime: 1062.316 ms
  28. Пенза, 24 апреля 2015 2015.secon.ru Bonus. Tuning target: Pgbouncer •

    open files limit • nofile (/etc/security/limits.conf) • prlimit utility (util-linux) • sysctl: net.ipv4.ip_local_port_range