HBaseCon 2015 - HBase at Scale

HBase at Scale in an Online and High-Demand Environment


Jeremy Carroll

May 08, 2015


Transcript

  1. None
  2. HBase: Online, at Low Latency. Jeremy Carroll & Tian-Ying Chang
  3. 50+ Billion Pins categorized by people into more than 1 Billion Boards
  4. Use Cases: products running on HBase / Zen
     1. SmartFeed: the SmartFeed service renders the Pinterest landing page for feeds.
     2. Typeahead: provides personal suggestions based on a prefix for a user.
     3. Messages: send messages to your friends; conversations around Pins.
     4. Interests: follow interests and be inspired about the things you love.
  5. Online Performance

  6. Moving Instance Types: validating performance
     • Previously tuned hi1.4xlarge instance workloads were running smoothly at scale
     • AWS deprecated hi1.4xlarge instances, making it harder to get capacity in availability zones
     • Undertook validation of the new i2 platform as a replacement instance type
     • Uncovered some best practices while configuring for EC2
  7. Problem: validating the new i2 platform. [Chart: p99.9 zen get nodes latency, in ms]
  8. Garbage Collection: pause time in milliseconds. [Chart: CMS Initial-Mark, Promotion Failures, ParNew, CMS Remark]
  9. Problem #1: Promotion Failures
     Issues seen in production
     • Heap fragmentation causing promotion failures
     • BlockCache at high QPS was causing fragmentation
     • Keeping BloomFilters for hundreds of billions of rows from being evicted, leading to latency issues
     Solutions
     • Went off-heap for BlockCache using BucketCache
     • Ensured memory space for Memstore + Blooms / Indexes on heap with CombinedCache; MaxDirectMemorySize for off-heap
     • Monitoring % of LRU heap used for blooms & indexes
     • Tuning BucketCache
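A minimal sketch of what the off-heap BucketCache setup looks like; the property and flag names are the standard HBase ones, while the sizes shown are illustrative rather than Pinterest's production values.

    # hbase-site.xml (illustrative sizes): serve data blocks from an off-heap
    # BucketCache while index/bloom blocks stay in the on-heap LRU (CombinedCache)
    #   hbase.bucketcache.ioengine = offheap
    #   hbase.bucketcache.size     = 8192      # MB of off-heap bucket cache
    #   hfile.block.cache.size     = 0.2       # on-heap LRU fraction for index/bloom blocks
    #
    # hbase-env.sh: reserve the direct-memory budget the BucketCache will draw from
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xmx12g -XX:MaxDirectMemorySize=10g"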
  10. Problem #2: Pause Time
     Calculating deltas
     • Started noticing that 'user + sys' compared against 'real' was very different
     • Random spikes of the 'real vs user+sys' delta, sometimes concentrated on hourly boundaries
     • Found resources online, but none of the fixes seemed to work:
       http://www.slideshare.net/cuonghuutran/gc-andpagescanattacksbylinux
       http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html
       http://www.evanjones.ca/jvm-mmap-pause.html
     • Ended up tracing all disk IO on the system to find latency outliers
     Symptom: low user, low sys, high real
  11. Cloud Problems: noisy neighbors. [Chart: /dev/sda average IO, in milliseconds]

  12. Formatting with EXT4, logging to the instance-store volume. Pause time in ms:
      100%     1251.60
      99.99%   1223.52
      99.9%     241.33
      90%       151.45
  13. Instance Layout: filesystem hierarchy
     Implementation
     • Ephemeral disks formatted as JBOD
     • All logging happens to the first ephemeral disk, with aggressive log rotation
     • GC / H-Daemon logs were written to /var/log, which is now linked to the first ephemeral volume
     What we did not know
     • GC statistics are logged to /tmp (addressed with PerfDisableSharedMem)
     [Diagram: / on sda (instance store); /var/log linked to /data0; /data0../dataX on ephemeral0..ephemeralX]
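A minimal sketch of that layout on a fresh instance; the device name (/dev/xvdb) and mount point are illustrative assumptions, and only the first ephemeral volume is shown.

    # format the first ephemeral disk as XFS and mount it as /data0
    mkfs.xfs -f /dev/xvdb
    mkdir -p /data0
    mount -o noatime /dev/xvdb /data0

    # keep the root instance-store volume effectively read-only for logs:
    # move /var/log onto the ephemeral volume and link it back
    mv /var/log /data0/log
    ln -s /data0/log /var/log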
  14. Formatting with XFS, PerfDisableSharedMem & logging to ephemeral. Pause time in ms:
      100%      115.64
      99.99%    107.88
      99.9%      94.01
      90%        66.38
  15. Resolution: comparison before / after tuning. [Chart: p99.9 zen get nodes latency, in ms]
  16. Online Performance: changes for success on EC2
     JVM options
     -server -XX:+PerfDisableSharedMem -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
     -XX:+UseCompressedOops -XX:+CMSParallelRemarkEnabled
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
     -Dnetworkaddress.cache.ttl=300 -Djava.net.preferIPv4Stack=true
     Instance configuration
     • Treat instance-store (sda) as read only
     • irqbalance >= 1.0.6 (Ubuntu 12.04)
     • Kernel 3.8+ for disk performance: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811
     • XFS filesystem formatting
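For reference, a sketch of how those JVM options might be wired into hbase-env.sh; the flag list is exactly the one on the slide, and everything else about the environment is assumed.

    # hbase-env.sh (sketch): apply the slide's RegionServer JVM options
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -server -XX:+PerfDisableSharedMem \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:+UseCompressedOops -XX:+CMSParallelRemarkEnabled \
      -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
      -Dnetworkaddress.cache.ttl=300 -Djava.net.preferIPv4Stack=true"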
  17. Monitoring

  18. Monitoring HBase: lots of data points
     • Collect statistics via local collection daemons
     • Daemons query the local JMX port, and use filters to grab the statistics we care about:
       http://regionserver:60030/jmx?qry=hadoop:service=HBase,name=RPCStatistics-60020
     • Collect per-table statistics
     • Visualize with in-house dashboards, or OpenTSDB
     • Garbage collection analysis performed by parsing GC logs fed into R as TSV
     [Diagram: table_1 / table_2 -> collector -> regionserver:60030/jmx -> OpenTSDB]
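A minimal sketch of what such a collector does per scrape, assuming jq is available on the host; the bean filter is the one quoted on the slide.

    # pull RegionServer RPC statistics from the JMX JSON servlet and keep the first bean
    curl -s 'http://regionserver:60030/jmx?qry=hadoop:service=HBase,name=RPCStatistics-60020' \
      | jq '.beans[0]'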
  19. Dashboards: deep dives
     • Metrics are usually low value, except when you need them
     • Dashboards for all H-Stack daemons (DataNode, NameNode, etc.)
     • Use raw metrics to drive insights regarding increased load on tables / regions
     • A system like OpenTSDB can handle high-cardinality metrics
     • Compactions can elevate CPU greatly
  20. HBase Alerts: analytics for HBase on-call, Nagios-based
     • Usually unexpected traffic increases / capacity
     • Clustered commands are very useful: % of daemons down, queues (RPC / replication)
     • ZooKeeper quorum membership
     • Stale configuration files (on disk, not in memory)
     • Critical daemons (NameNode, Master, etc.)
     • Long-running master tasks (e.g. splitting)
     • HDFS free space: room for compactions, accounting for replication
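A minimal sketch of the free-space check behind that last alert, assuming a Hadoop 2-style hdfs CLI; the 70% threshold is illustrative.

    # Nagios-style check: warn when HDFS usage leaves too little headroom for
    # compactions and 3x replication
    pct_used=$(hdfs dfsadmin -report | awk -F': ' '/^DFS Used%/ {print int($2); exit}')
    if [ "$pct_used" -gt 70 ]; then
      echo "CRITICAL: HDFS ${pct_used}% used"
      exit 2
    fi
    echo "OK: HDFS ${pct_used}% used"
    exit 0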
  21. HotSpots: debugging imbalanced requests
     Spammers
     • A few users request the same row over and over
     • Rate limiting / caching
     Real-time analysis
     • tcpdump is very helpful: tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings
     • Looking at per-region request stats
     Code issues
     • Hard-coded key in product (e.g. the Messages launch)
     [Chart: CPU utilization]
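The tcpdump one-liner from the slide can be extended into a quick hot-key survey; the grep pattern for key-like strings is illustrative and depends on the row-key format.

    # sample ~10k RPC packets and surface the most frequently seen key-like strings
    tcpdump -i eth0 -w - -s 0 -c 10000 tcp port 60020 \
      | strings \
      | grep -E '^[A-Za-z0-9_:-]{8,}$' \
      | sort | uniq -c | sort -rn | head -20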
  22. Capacity Planning: supporting fast growth

  23. Capacity Planning: start small, split for growth
     [Diagram: master and slave clusters (table, thrift, zookeeper, config) linked by replication]
     Feature cost
     • HDFS disk space
     • CPU: compression (Prefix, FastDiff, Snappy)
     • Memory: indexes, Bloom filters
     Managed splitting
     • Disable auto-splitting and monitor region size
     • Manually split the slave cluster: no impact to the online-facing cluster
     • Switch between master and slave clusters
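A minimal sketch of managed splitting under these assumptions: auto-splits are effectively disabled by raising the region size threshold, and the table name and split key below are illustrative.

    # hbase-site.xml (sketch): raise the split threshold so regions never auto-split
    #   hbase.hregion.max.filesize = 107374182400   # 100 GB, i.e. effectively never

    # split manually on the slave cluster at a chosen key, then swap master/slave
    echo "split 'zen_nodes', 'user:8000000'" | hbase shell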
  29. Capacity Planning: balancing regions to servers
     [Diagram: regions 1-5 assigned across two RegionServers]
     • Salted keys for uniform distribution
     • Per-table region assignment
     • A one or two region difference can cause a big difference in load: an average of 2.5 regions
       per RegionServer means assignments of 2 or 3 regions, about a 30% load difference
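A minimal sketch of key salting with a pre-split table; the table name, column family, one-character hex salt, and example key are all illustrative.

    # pre-split the table on one-byte salt prefixes so rows spread evenly across servers
    echo "create 'zen_nodes_salted', 'd', SPLITS => ['1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']" | hbase shell

    # derive the salt prefix for a key before reading or writing it
    key="user:12345"
    salt=$(printf '%s' "$key" | md5sum | cut -c1)
    echo "${salt}:${key}"    # e.g. "7:user:12345"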
  30. Launching: from development to production
     • Started on a NameSpace table on a shared cluster
     • Rolled out slowly to production, by x% of users
     • Split the table to get additional RegionServers serving traffic
     • Migrated to a dedicated cluster
     • As the experiment ramped up, added / removed capacity as the feature was adopted
  31. Availability and disaster recovery

  32. Availability: strategies for mitigating failure
     Conditions for failure
     • Termination notices from the underlying physical host
     • The default replication factor of 3 in one zone is dangerous; placement groups may make this worse
     • Unexpected events, such as the AWS global reboot
     Stability patterns
     • Use the highest-numbered instance type in a family
     • Multi availability zone + block placement
     • Make changes to only one cluster at a time
     [Diagram: Master in US-East-1A replicating to Slave in US-East-1E]
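One common way to get availability-zone-aware block placement is Hadoop rack awareness, treating each AZ as a rack. A sketch under that assumption follows; the lookup file and the wiring via topology.script.file.name illustrate the stock mechanism, not Pinterest's exact setup.

    #!/bin/sh
    # topology script: map each host/IP HDFS passes in to a "rack" named after its
    # availability zone, so the block placement policy spreads replicas across AZs
    # (the /etc/hadoop/host_to_az.txt lookup file is an assumed convention)
    for host in "$@"; do
      az=$(awk -v h="$host" '$1 == h {print $2}' /etc/hadoop/host_to_az.txt)
      echo "/${az:-default-rack}"
    done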
  33. Disaster Recovery: bootstrapping new clusters
     [Diagram: Master (US-East-1A) -> Slave (US-East-1E) -> DR Slave (US-East-1D), with HDFS and S3 backup paths]
     • Back up and recover tables: ExportSnapshot to an HDFS cluster, with throttling
     • DistCp WAL logs hourly to HDFS clusters
     • Clone snapshot + WALPlayer to recover a table
     • Data can be restored from S3, HDFS, or another cluster, with rate-limited copying
     hbasehbackup -t unique_index -d new_cluster -b 10
     hbaserecover -t unique_index -ts 2015-04-08-16-01-15 --source-cluster zennotifications
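hbasehbackup / hbaserecover are internal wrappers; a rough sketch of the same flow with stock HBase tooling is below. Snapshot, cluster, and path names are illustrative, and flag support (e.g. ExportSnapshot's -bandwidth throttle) varies by HBase version.

    # export a snapshot of the table to the backup HDFS cluster, throttled
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot unique_index_snap \
      -copy-to hdfs://backup-cluster/hbase \
      -mappers 4 -bandwidth 10

    # ship WALs hourly, then clone the snapshot and replay WAL edits onto the clone
    hadoop distcp hdfs://prod-cluster/hbase/.logs hdfs://backup-cluster/wals/2015040816
    echo "clone_snapshot 'unique_index_snap', 'unique_index_restored'" | hbase shell
    hbase org.apache.hadoop.hbase.mapreduce.WALPlayer \
      hdfs://backup-cluster/wals/2015040816 unique_index unique_index_restored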
  34. Backup Monitoring: reliable cloud backups
     Snapshot backup routine
     • ZooKeeper-based configuration for each cluster
     • Backup metadata is sent to ElasticSearch for integration with dashboards
     • Monitoring and alerting around WAL copy & snapshot status
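A sketch of what recording that backup metadata could look like; the ElasticSearch host, index name, and document fields are illustrative assumptions.

    # record the outcome of a snapshot run so dashboards and alerts can query it
    curl -s -XPOST 'http://elasticsearch:9200/hbase-backups/run' -d '{
      "cluster":   "zennotifications",
      "table":     "unique_index",
      "snapshot":  "unique_index_snap",
      "status":    "ok",
      "timestamp": "2015-04-08T16:01:15Z"
    }'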
  35. Operations: running on Amazon Web Services

  36. Rolling Compaction: important for online-facing clusters
     • Only one region per server is selected: avoids blocking regions in the compaction queue
     • Controlled concurrency: contains the space spike and reduces the added network and disk traffic
     • Controlled stop time: stop before daytime traffic ramps up, or if compaction causes performance issues
     • Resume the next night: filter out the regions that have already been compacted
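A minimal sketch of a nightly driver for this; the region list file, pacing, and the 22:00-06:00 window are illustrative (Pinterest's actual tool handles per-server selection and resumption).

    # rolling major compaction: one region at a time, stopping before daytime traffic
    while read -r region; do
      h=$(date +%H)
      if [ "$h" -ge 6 ] && [ "$h" -lt 22 ]; then
        break                              # stop before daytime traffic ramps up
      fi
      echo "major_compact '$region'" | hbase shell
      sleep 60                             # pace compactions to limit disk/network spikes
    done < regions_to_compact.txt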
  37. Rolling Restart: change management while retaining availability
     [Flow diagram, per-server maintenance: check health -> get lock -> get server locality ->
      check server status -> region movement w/ threads -> stop RegionServer + cooldown ->
      region movement w/ threads -> start RegionServer + cooldown -> verify locality ->
      check server status -> release lock -> update status + cooldown]
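A sketch of the restart step itself, using HBase's bundled graceful_stop.sh; the lock, health-check, and locality-verification steps from the diagram live in internal tooling and are not shown.

    # rolling restart: drain each RegionServer, restart it, move its regions back
    while read -r rs; do
      ./bin/graceful_stop.sh --restart --reload "$rs"
      sleep 300                            # cooldown before the next server
    done < regionservers.txt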
  38. Next Challenges: looking forward
     • Upgrade to the latest stable release (1.x) from 0.94.x with no downtime
     • Increasing performance: lower latency, better compaction throughput / less write amplification
     • Regional failover: cross-datacenter replication is in production now; next is failing over between AWS regions
  39. © Copyright, All Rights Reserved Pinterest Inc. 2015