
HBaseCon 2015 - HBase at Scale

HBase at Scale in an Online and High-Demand Environment

Jeremy Carroll

May 08, 2015

Transcript

  1. Use Cases. Products running on HBase / Zen:

     • SmartFeed: the SmartFeed service renders the Pinterest landing page for feeds
     • Typeahead: provides personalized suggestions based on a prefix for a user
     • Messages: send messages to your friends; conversations around Pins
     • Interests: follow interests and be inspired about the things you love

  2. Moving Instance Types. Validating performance:

     • Previously tuned hi1.4xlarge instance workloads were running smoothly at scale
     • AWS deprecated hi1.4xlarge instances, making it harder to get capacity in availability zones
     • Undertook validation of the new i2 platform as a replacement instance type
     • Uncovered some best practices while configuring for EC2

  3. Problem #1: Promotion Failures. Issues seen in production:

     • Heap fragmentation was causing promotion failures
     • The BlockCache at high QPS was causing fragmentation
     • Keeping the BloomFilters for hundreds of billions of rows from being evicted was leading to latency issues

     Solutions (a config sketch follows):
     • Went off-heap for the BlockCache using BucketCache
     • Ensured memory space for the Memstore plus Blooms / indexes on heap with CombinedCache, and MaxDirectMemorySize for the off-heap portion
     • Monitoring the % of the LRU heap used for Blooms and indexes
     • Tuning the BucketCache

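     The off-heap move boils down to a handful of RegionServer settings. A minimal
     sketch, assuming the hbase.bucketcache.* keys of HBase 0.96+/1.x (exact key
     names vary by release, and the sizes below are placeholders, not Pinterest's
     production values):

       <!-- hbase-site.xml -->
       <property>
         <name>hbase.bucketcache.ioengine</name>
         <value>offheap</value>
       </property>
       <property>
         <name>hbase.bucketcache.size</name>
         <value>16384</value> <!-- off-heap cache in MB; placeholder -->
       </property>

       # hbase-env.sh: direct-memory headroom for the BucketCache
       export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=20g"

     With the combined cache, only data blocks move off-heap; index and Bloom
     blocks stay in the on-heap LRU, which is why the deck monitors the LRU share
     used by Blooms and indexes.
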
  4. Problem #2: Pause Time. Calculating deltas:

     • Started noticing that 'user + sys' compared against 'real' was very different: low user, low sys, high real
     • Random spikes of the 'real' vs. 'user + sys' delta, sometimes concentrated on hourly boundaries
     • Found resources online, but none of the fixes seemed to work:
       http://www.slideshare.net/cuonghuutran/gc-andpagescanattacksbylinux
       http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html
       http://www.evanjones.ca/jvm-mmap-pause.html
     • Ended up tracing all disk IO on the system to find latency outliers (a log-scan sketch follows)

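     A minimal sketch of the kind of GC-log scan that surfaces these outliers,
     assuming the standard "[Times: user=0.12 sys=0.03, real=1.45 secs]" suffix
     that ParNew/CMS prints with -XX:+PrintGCDetails; the 0.1 s threshold is an
     arbitrary placeholder:

       import java.io.BufferedReader;
       import java.nio.file.Files;
       import java.nio.file.Paths;
       import java.util.regex.Matcher;
       import java.util.regex.Pattern;

       public class GcPauseDelta {
           // Matches the timing suffix of each GC event in the log.
           private static final Pattern TIMES =
               Pattern.compile("user=([0-9.]+) sys=([0-9.]+), real=([0-9.]+)");

           public static void main(String[] args) throws Exception {
               try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                   String line;
                   while ((line = in.readLine()) != null) {
                       Matcher m = TIMES.matcher(line);
                       if (!m.find()) continue;
                       double user = Double.parseDouble(m.group(1));
                       double sys = Double.parseDouble(m.group(2));
                       double real = Double.parseDouble(m.group(3));
                       // GC threads run in parallel, so wall-clock time should not
                       // exceed total CPU time. A large positive delta means the
                       // JVM stalled off-CPU, e.g. waiting on disk writeback.
                       double delta = real - (user + sys);
                       if (delta > 0.1) {
                           System.out.printf("%.2fs off-CPU stall: %s%n", delta, line.trim());
                       }
                   }
               }
           }
       }
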
  5. Formatting with EXT4. Logging to the instance-store volume:

     Pause time in ms:
       100%      1251.60
       99.99%    1223.52
       99.9%      241.33
       90%        151.45

  6. Instance Layout. Filesystem hierarchy:

     Implementation:
     • Ephemeral disks formatted as JBOD
     • All logging happens to the first ephemeral disk, with aggressive log rotation
     • GC / H-Daemon logs were written to /var/log, which is now linked to the first ephemeral volume

     What we did not know:
     • GC statistics are logged in /tmp: the JVM mmaps its hsperfdata counters there, so page writeback of that file can stall the whole JVM; -XX:+PerfDisableSharedMem disables it

     [Diagram: / on the instance-store device (sda); /var/log linked to ephemeral0; /data0 through /dataX on ephemeral0 through ephemeralX]

  7. Formatting with XFS. PerfDisableSharedMem and logging to ephemeral:

     Pause time in ms:
       100%      115.64
       99.99%    107.88
       99.9%      94.01
       90%        66.38

  8. Online Performance. Changes for success on EC2:

     JVM options:
       -server -XX:+PerfDisableSharedMem -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
       -XX:+UseCompressedOops -XX:+CMSParallelRemarkEnabled
       -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
       -Dnetworkaddress.cache.ttl=300 -Djava.net.preferIPv4Stack=true

     Instance configuration:
     • Treat the instance-store volume (sda) as read-only
     • irqbalance >= 1.0.6 (Ubuntu 12.04)
     • Kernel 3.8+ for disk performance: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811
     • XFS filesystem

  9. Monitoring HBase. Lots of data points:

     • Collect statistics via local collection daemons
     • Daemons query the local JMX port, using filters to grab only the statistics we care about (a polling sketch follows):
       http://regionserver:60030/jmx?qry=hadoop:service=HBase,name=RPCStatistics-60020
     • Collect per-table statistics
     • Visualize with in-house dashboards, or OpenTSDB
     • Garbage-collection analysis performed by parsing GC logs fed into R as TSV

     [Diagram: per-table stats flow from the RegionServer JMX endpoint through a local collector into OpenTSDB]

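     A minimal sketch of what one poll by a local collection daemon might look
     like, assuming the JMX JSON servlet above; the substring filters are
     illustrative, not the deck's actual metric list:

       import java.io.BufferedReader;
       import java.io.InputStreamReader;
       import java.net.HttpURLConnection;
       import java.net.URL;

       public class JmxPoller {
           public static void main(String[] args) throws Exception {
               // The qry filter keeps the servlet's JSON payload small.
               URL url = new URL("http://regionserver:60030/jmx?"
                       + "qry=hadoop:service=HBase,name=RPCStatistics-60020");
               HttpURLConnection conn = (HttpURLConnection) url.openConnection();
               conn.setConnectTimeout(2000);
               conn.setReadTimeout(2000);
               try (BufferedReader in = new BufferedReader(
                       new InputStreamReader(conn.getInputStream()))) {
                   String line;
                   while ((line = in.readLine()) != null) {
                       // Forward only the counters we care about to the metrics pipeline.
                       if (line.contains("callQueueLen") || line.contains("rpcQueueTime")) {
                           System.out.println(line.trim());
                       }
                   }
               }
           }
       }
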
  10. Dashboards. Deep dives:

     • Metrics are usually low value, except when you need them
     • Dashboards for all H-Stack daemons (DataNode, NameNode, etc.)
     • Use raw metrics to drive insights regarding increased load on tables / regions
     • Use a system like OpenTSDB which can handle high-cardinality metrics
     • Compactions can elevate CPU greatly

  11. On-Call. HBase alerts from a Nagios-based system:

     • Pages are usually unexpected traffic increases / capacity
     • Clustered commands are very useful: % of daemons down, queues (RPC / replication)
     • ZooKeeper quorum membership
     • Stale configuration files (on disk, not in memory)
     • Critical daemons (NameNode, Master, etc.)
     • Long-running master tasks (e.g. splitting)
     • HDFS free space: room for compactions, accounting for replication

  12. HotSpots. Debugging imbalanced requests:

     Spammers:
     • A few users request the same row over and over
     • Rate limiting / caching

     Real-time analysis:
     • tcpdump is very helpful:
       tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings
     • Looking at per-region request stats

     Code issues:
     • A hard-coded key in the product, e.g. the Messages launch

     [Chart: CPU utilization spiking on the hot-spotted server]

  13. Capacity Planning. Start small, split for growth:

     Replication feature cost:
     • HDFS disk space
     • CPU: compression (Prefix, FastDiff, Snappy)
     • Memory: indexes, Bloom filters

     Managed splitting (a split sketch follows):
     • Disable auto-splitting and monitor region size
     • Manually split the slave cluster: no impact to the online-facing cluster
     • Switch between master and slave clusters

     [Diagram, built up over several slides: thrift clients locate the active cluster via ZooKeeper; the master cluster replicates to the slave, and the roles swap once the slave has been split]

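     A minimal sketch of one manual split against the slave cluster, assuming the
     HBase 1.x Admin API (the deck's clusters ran 0.94.x at the time); the table
     name and split point are placeholders:

       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.hbase.HBaseConfiguration;
       import org.apache.hadoop.hbase.TableName;
       import org.apache.hadoop.hbase.client.Admin;
       import org.apache.hadoop.hbase.client.Connection;
       import org.apache.hadoop.hbase.client.ConnectionFactory;
       import org.apache.hadoop.hbase.util.Bytes;

       public class ManualSplit {
           public static void main(String[] args) throws Exception {
               // Configuration points at the slave cluster, so the online-facing
               // master cluster is untouched while regions are split.
               Configuration conf = HBaseConfiguration.create();
               try (Connection conn = ConnectionFactory.createConnection(conf);
                    Admin admin = conn.getAdmin()) {
                   admin.split(TableName.valueOf("smartfeed"), Bytes.toBytes("user|m"));
               }
           }
       }
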
  19. Capacity Planning. Balancing regions to servers:

     • Salted keys for uniform distribution (a salting sketch follows)
     • Per-table region assignment
     • A one- or two-region difference can cause a big difference in load: at an average of 2.5 regions per RegionServer, servers are assigned either 2 or 3 regions, about a 30% difference in load

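     A minimal, hypothetical sketch of key salting (not Pinterest's code): a
     deterministic one-byte bucket prefix derived from the key spreads writes
     uniformly across BUCKETS regions; the bucket count is a placeholder:

       import java.util.Arrays;

       public class SaltedKeys {
           private static final int BUCKETS = 32; // placeholder: size to the region count

           // Prefix the row key with a stable one-byte salt so sequential or
           // popular keys spread across regions instead of hot-spotting one server.
           public static byte[] salt(byte[] rowKey) {
               int bucket = Math.floorMod(Arrays.hashCode(rowKey), BUCKETS);
               byte[] salted = new byte[rowKey.length + 1];
               salted[0] = (byte) bucket;
               System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
               return salted;
           }
       }

     The trade-off: point reads recompute the salt cheaply, but range scans must
     fan out across all buckets.
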
  20. Launching. From development to production:

     • Started on a namespace table on a shared cluster
     • Rolled out slowly to production by x% of users
     • Split the table to get additional RegionServers serving traffic
     • Migrated to a dedicated cluster
     • As the experiment ramped up, added / removed capacity as the feature was adopted

  21. Availability. Strategies for mitigating failure:

     Conditions for failure:
     • Termination notices from the underlying host
     • The default replication factor of 3 within a single zone is dangerous
     • Placement groups may make this worse
     • Unexpected events, such as the AWS global reboot

     Stability patterns:
     • Use the highest-numbered instance type in a family
     • Multiple availability zones plus block placement
     • Make changes to only one cluster at a time

     [Diagram: master cluster in us-east-1a replicating to a slave cluster in us-east-1e, on separate physical hosts]

  22. Disaster Recovery. Bootstrapping new clusters:

     • Back up and recover tables
     • ExportSnapshot to an HDFS cluster, with a throttle
     • DistCp WAL logs hourly to HDFS clusters
     • Clone snapshot + WALPlayer to recover a table
     • Data can be restored from S3, HDFS, or another cluster, with rate-limited copying:

       hbasehbackup -t unique_index -d new_cluster -b 10
       hbaserecover -t unique_index -ts 2015-04-08-16-01-15 --source-cluster zennotifications

     [Diagram: master (us-east-1a) replicates to a slave (us-east-1e) and a DR slave (us-east-1d); backups flow to HDFS and S3]

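     The export step can also be driven programmatically; a minimal sketch
     assuming the 1.x ExportSnapshot tool (the snapshot name, destination, and
     bandwidth cap are placeholders; hbasehbackup above is Pinterest's internal
     wrapper):

       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.hbase.HBaseConfiguration;
       import org.apache.hadoop.hbase.snapshot.ExportSnapshot;
       import org.apache.hadoop.util.ToolRunner;

       public class SnapshotBackup {
           public static void main(String[] args) throws Exception {
               Configuration conf = HBaseConfiguration.create();
               // Copy a table snapshot to the backup HDFS cluster, throttled so the
               // transfer does not saturate the online cluster's network.
               int rc = ToolRunner.run(conf, new ExportSnapshot(), new String[] {
                   "-snapshot", "unique_index-2015-04-08",
                   "-copy-to", "hdfs://backup-cluster/hbase",
                   "-bandwidth", "10"   // MB/s per mapper
               });
               System.exit(rc);
           }
       }
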
  23. Backup Monitoring. Reliable cloud backups:

     Snapshot backup routine:
     • ZooKeeper-based configuration for each cluster
     • Backup metadata is sent to ElasticSearch for integration with dashboards
     • Monitoring and alerting around WAL copy and snapshot status

  24. Rolling Compaction. Important for online-facing clusters:

     Only one region per server is selected (a selection sketch follows):
     • Avoids a blocked region in the queue

     Controlled concurrency:
     • Controls the space spike
     • Reduces the increased network and disk traffic

     Controlled time to stop:
     • Stop before daytime traffic ramps up
     • Stop if compaction is causing a perf issue

     Resume the next night:
     • Filter out the regions that have already been compacted

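     A minimal sketch of the one-region-per-server selection, assuming the HBase
     1.x Admin API rather than the deck's internal tool:

       import java.util.HashSet;
       import java.util.Set;
       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.hbase.HBaseConfiguration;
       import org.apache.hadoop.hbase.HRegionLocation;
       import org.apache.hadoop.hbase.ServerName;
       import org.apache.hadoop.hbase.TableName;
       import org.apache.hadoop.hbase.client.Admin;
       import org.apache.hadoop.hbase.client.Connection;
       import org.apache.hadoop.hbase.client.ConnectionFactory;
       import org.apache.hadoop.hbase.client.RegionLocator;

       public class RollingCompact {
           public static void main(String[] args) throws Exception {
               Configuration conf = HBaseConfiguration.create();
               TableName table = TableName.valueOf(args[0]);
               try (Connection conn = ConnectionFactory.createConnection(conf);
                    Admin admin = conn.getAdmin();
                    RegionLocator locator = conn.getRegionLocator(table)) {
                   Set<ServerName> busy = new HashSet<>();
                   for (HRegionLocation loc : locator.getAllRegionLocations()) {
                       // At most one region per RegionServer per pass, so no server
                       // ever queues more than one major compaction at a time.
                       if (busy.add(loc.getServerName())) {
                           admin.majorCompactRegion(loc.getRegionInfo().getRegionName());
                       }
                   }
               }
           }
       }
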
  25. Rolling Restart. Change management while retaining availability:

     [Flowchart, per server:]
     1. Maintenance check: check health, get lock, get server locality, check server status
     2. Region movement w/threads (a drain sketch follows), stop RegionServer + cooldown
     3. Start RegionServer + cooldown, region movement w/threads
     4. Verify locality, check server status, update status + cooldown, release lock

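     A minimal sketch of the region-movement (drain) step, assuming the HBase 1.x
     Admin API; HBase's own bin/graceful_stop.sh performs a similar drain-and-restart
     sequence. Server names are placeholders in the "host,port,startcode" form:

       import java.util.List;
       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.hbase.HBaseConfiguration;
       import org.apache.hadoop.hbase.HRegionInfo;
       import org.apache.hadoop.hbase.ServerName;
       import org.apache.hadoop.hbase.client.Admin;
       import org.apache.hadoop.hbase.client.Connection;
       import org.apache.hadoop.hbase.client.ConnectionFactory;
       import org.apache.hadoop.hbase.util.Bytes;

       public class DrainServer {
           public static void main(String[] args) throws Exception {
               ServerName source = ServerName.valueOf(args[0]);
               ServerName dest = ServerName.valueOf(args[1]);
               Configuration conf = HBaseConfiguration.create();
               try (Connection conn = ConnectionFactory.createConnection(conf);
                    Admin admin = conn.getAdmin()) {
                   List<HRegionInfo> regions = admin.getOnlineRegions(source);
                   for (HRegionInfo region : regions) {
                       // Move every region off the server before stopping it, so
                       // clients never see a region offline during the restart.
                       admin.move(region.getEncodedNameAsBytes(),
                                  Bytes.toBytes(dest.getServerName()));
                   }
               }
           }
       }
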
  26. Next Challenges. Looking forward:

     • Upgrade to the latest stable release (1.x) from 0.94.x with no downtime

     Increasing performance:
     • Lower latency
     • Better compaction throughput / less write amplification

     Regional failover:
     • Cross-datacenter replication is in production now
     • Failing over between AWS regions