
HBaseCon 2015 - HBase at Scale


HBase at Scale in an Online and High-Demand Environment

Jeremy Carroll

May 08, 2015
Transcript

  1. HBase
     Jeremy Carroll & Tian-Ying Chang
     Online, at Low Latency


  2. 50+ Billion Pins
    categorized by people into more than

    1 Billion Boards

  3. Use Cases
     Products running on HBase / Zen
     1. SmartFeed - renders the Pinterest landing page for feeds.
     2. Typeahead - provides personal suggestions based on a prefix for a user.
     3. Messages - send messages to your friends. Conversations around Pins.
     4. Interests - follow interests and be inspired about the things you love.

  4. Online Performance


  5. Moving Instance Types
     Validating performance
     • Previously tuned hi1.4xlarge instance workloads were running smoothly at scale
     • AWS deprecated hi1.4xlarge instances, making it harder to get capacity in availability zones
     • Undertook validation of the new i2 platform as a replacement instance type
     • Uncovered some best practices while configuring for EC2

  6. Problem
     Validating the new i2 platform
     [Chart: p99.9 zen get nodes in ms]

  7. Garbage Collection
     [Chart: pause time in milliseconds, broken out by CMS Initial-Mark, Promotion Failures, ParNew, and CMS Remark]

  8. Problem #1: Promotion Failures
     Issues seen in production
     • Heap fragmentation causing promotion failures
     • BlockCache at high QPS was causing fragmentation
     • BloomFilters for hundreds of billions of rows could not be kept from eviction, leading to latency issues
     Solutions
     • Went off-heap for BlockCache using BucketCache
     • Ensured memory space for Memstore + Blooms / Indexes on heap with CombinedCache, and MaxDirectMemorySize off-heap
     • Monitoring % of LRU heap used for blooms & indexes, and tuning BucketCache
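The off-heap BucketCache move described above maps to a handful of settings. A minimal sketch using the Apache HBase property names (`hbase.bucketcache.ioengine`, `hbase.bucketcache.size`, `hfile.block.cache.size`); the sizes are illustrative assumptions, not Pinterest's values, and `-XX:MaxDirectMemorySize` must be raised to cover the off-heap region:

```xml
<!-- Hedged sketch, hbase-site.xml: serve data blocks from an off-heap
     BucketCache while blooms/indexes stay in the on-heap LRU cache
     (combined-cache mode). Sizes are illustrative. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <!-- Off-heap cache size in MB; must fit under -XX:MaxDirectMemorySize -->
  <name>hbase.bucketcache.size</name>
  <value>8192</value>
</property>
<property>
  <!-- Fraction of heap kept for the LRU cache holding blooms and indexes -->
  <name>hfile.block.cache.size</name>
  <value>0.2</value>
</property>
```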

  9. Problem #2: Pause Time
     Calculating deltas
     • Started noticing 'user + sys' compared against 'real' was very different: low user, low sys, high real
     • Random spikes of the 'real vs user+sys' delta, sometimes concentrated on hourly boundaries
     • Found resources online, but none of the fixes seemed to work:
       http://www.slideshare.net/cuonghuutran/gc-andpagescanattacksbylinux
       http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html
       http://www.evanjones.ca/jvm-mmap-pause.html
     • Ended up tracing all disk IO on the system to find latency outliers
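The 'real vs user+sys' delta above is easy to compute from the GC log itself. A minimal sketch, assuming CMS/ParNew GC logging with the standard "[Times: ...]" suffix (the threshold value is an assumption):

```python
import re

# Flag GC events where wall-clock ('real') time far exceeds CPU time
# ('user' + 'sys') -- the signature of the stalls described above, where
# the pause is spent waiting on something outside the JVM (e.g. disk IO).
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), real=(?P<real>[\d.]+) secs\]"
)

def suspicious_pauses(log_lines, threshold_secs=0.5):
    """Return (user, sys, real) tuples where real - (user + sys) > threshold."""
    hits = []
    for line in log_lines:
        m = TIMES_RE.search(line)
        if not m:
            continue
        user, sy, real = (float(m.group(g)) for g in ("user", "sys", "real"))
        if real - (user + sy) > threshold_secs:
            hits.append((user, sy, real))
    return hits
```

Feeding it the `-Xloggc` file and bucketing hits by timestamp is enough to spot the hourly-boundary clustering mentioned above.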

  10. Cloud Problems
      Noisy neighbors
      [Chart: /dev/sda average io in milliseconds]

  11. Formatting with EXT4
      Logging to instance-store volume
      Pause time in ms:
        100%     1251.60
        99.99%   1223.52
        99.9%     241.33
        90%       151.45

  12. Instance Layout
      Implementation
      • Ephemeral disks formatted as JBOD
      • All logging happens to the first ephemeral disk with aggressive log rotation
      • GC / H-Daemon logs were written to /var/log, which is now linked to the first ephemeral volume
      What we did not know
      • GC statistics are logged to /tmp (disabled with PerfDisableSharedMem)
      Filesystem hierarchy
      [Diagram: / on sda; /var/log linked to ephemeral0; /data0 on ephemeral0, /data1 on ephemeral1, /dataX on ephemeralX]

  13. Formatting with XFS
      PerfDisableSharedMem & logging to ephemeral
      Pause time in ms:
        100%     115.64
        99.99%   107.88
        99.9%     94.01
        90%       66.38

  14. Resolution
      Comparison before / after tuning
      [Chart: p99.9 zen get nodes in ms]

  15. Online Performance
      Changes for success on EC2
      JVM Options
      -server -XX:+PerfDisableSharedMem -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
      -XX:+UseCompressedOops -XX:+CMSParallelRemarkEnabled
      -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
      -Dnetworkaddress.cache.ttl=300 -Djava.net.preferIPv4Stack=true
      Instance Configuration
      • Treat instance-store (sda) as read-only
      • irqbalance >= 1.0.6 (Ubuntu 12.04)
      • Kernel 3.8+ for disk performance
        https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811
      • XFS filesystem formatting

  16. Monitoring HBase
      Lots of data points
      • Collect statistics via local collection daemons
      • Daemons query the local JMX port, and use filters to grab statistics we care about:
        http://regionserver:60030/jmx?qry=hadoop:service=HBase,name=RPCStatistics-60020
      • Collect per-table statistics
      • Visualize with in-house dashboards, or OpenTSDB
      • Garbage collection analysis performed by parsing GC logs fed into R as TSV
      [Diagram: per-table stats flow from regionserver:60030/jmx through a local collector into OpenTSDB]
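The collector loop above can be sketched briefly. The JMX servlet URL and query filter are from the slide; the metric-name prefixes and the `extract_metrics` helper are illustrative assumptions, since attribute names vary by HBase version:

```python
import json
from urllib.request import urlopen

# The RegionServer JMX servlet returns JSON of the form {"beans": [...]};
# a local daemon fetches it and keeps only the numeric attributes we care
# about before shipping them to the time-series store.
JMX_URL = ("http://regionserver:60030/jmx?"
           "qry=hadoop:service=HBase,name=RPCStatistics-60020")

def extract_metrics(jmx_json, wanted_prefixes=("get", "multi")):
    """Flatten the JMX bean list, keeping numeric attrs matching a prefix."""
    metrics = {}
    for bean in jmx_json.get("beans", []):
        for key, value in bean.items():
            if isinstance(value, (int, float)) and key.lower().startswith(wanted_prefixes):
                metrics[key] = value
    return metrics

def collect(url=JMX_URL):
    # Network call against the local daemon's JMX port.
    with urlopen(url) as resp:
        return extract_metrics(json.load(resp))
```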

  17. Dashboards
      Deep dives
      Metrics
      • Usually low value, except when you need them
      • Dashboards for all H-Stack daemons (DataNode, NameNode, etc.)
      • Use raw metrics to drive insights regarding increased load to tables / regions
      • Use a system like OpenTSDB which can handle high-cardinality metrics
      • Compactions can elevate CPU greatly

  18. HBase Alerts
      Nagios-based system for HBase on-call
      • Usually unexpected traffic increases / capacity
      • Clustered commands very useful
        - % daemons down, queues (RPC / Replication)
      • ZooKeeper quorum membership
      • Stale configuration files (on disk, not in memory)
      • Critical daemons (NameNode, Master, etc.)
      • Long-running master tasks (e.g. splitting)
      • HDFS free space
        - Room for compactions, accounting for replication

  19. HotSpots
      Debugging imbalanced requests
      Spammers
      • A few users request the same row over and over
      • Rate limiting / caching
      Real-time analysis
      • tcpdump is very helpful:
        tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings
      • Looking at per-region request stats
      Code issues
      • Hard-coded key in product (e.g. Messages launch)
      [Chart: CPU utilization]
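The tcpdump pipeline above yields a stream of printable strings; counting them surfaces hot keys. A rough sketch (treating each string as a candidate row key is a heuristic, and the function name and thresholds are illustrative):

```python
from collections import Counter

# Feed the output of `tcpdump ... | strings` into this counter and surface
# keys that dominate traffic -- the "same row over and over" pattern above.
def hot_keys(lines, top_n=5, min_share=0.01):
    """Return (key, count) pairs exceeding min_share of total requests."""
    counts = Counter(line.strip() for line in lines if line.strip())
    total = sum(counts.values())
    return [(key, count) for key, count in counts.most_common(top_n)
            if count / total >= min_share]
```

In practice this runs as the consuming end of the pipe, e.g. `tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings | python hotspot.py`.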

  20. Capacity Planning
    supporting fast growth


  21. Capacity Planning
      Start small. Split for growth
      [Diagram: master and slave clusters (thrift + ZooKeeper), config-driven replication between them]
      Feature cost
      • HDFS disk space
      • CPU - compression (Prefix, FastDiff, Snappy)
      • Memory - index, Bloom filters
      Managed splitting
      • Disable auto-splitting and monitor region size
      • Manually split the slave cluster
        - No impact to the online-facing cluster
        - Switch between master and slave clusters


  27. Capacity Planning
      Balancing regions to servers
      [Diagram: per-table regions distributed across region servers]
      • Salted keys for uniform distribution
      • Per-table region assignment
      • A one- or two-region difference can cause a big difference in load
      • An average of 2.5 regions per RS causes region assignments of 2 or 3
      • Load is 30% different
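The 2-vs-3 region arithmetic above can be made concrete. A sketch (the function name is illustrative), assuming salted keys make region load roughly uniform so a server's traffic share is proportional to its region count:

```python
# With uniformly loaded regions, a server's share of traffic is
# proportional to how many regions it hosts.
def load_vs_average(regions_on_server, avg_regions_per_server):
    """Relative load of one server versus the fleet average."""
    return regions_on_server / avg_regions_per_server

# An average of 2.5 regions per server means assignments land on 2 or 3:
heavy = load_vs_average(3, 2.5)   # 1.2x the average load
light = load_vs_average(2, 2.5)   # 0.8x the average load
# A single-region difference moves a server well above or below average,
# which is why imbalance matters so much at low regions-per-server counts.
```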

  28. Launching
      From development to production
      • Started on a namespace table on a shared cluster
      • Rolled out slowly to production by x% of users
      • Split the table to get additional region servers serving traffic
      • Migrated to a dedicated cluster
      • As the experiment ramped up, added / removed capacity as the feature was adopted

  29. Availability
    and disaster recovery


  30. Availability
      Strategies for mitigating failure
      Conditions for failure
      • Termination notices from the underlying host
      • Default RF of 3 in one zone is dangerous
      • Placement groups may make this worse
      • Unexpected events - AWS global reboot
      Stability patterns
      • Highest-numbered instance type in a family
      • Multi availability zone + block placement
      • Make changes to only one cluster at a time
      [Diagram: master in US-East-1A replicating to slave in US-East-1E, on separate physical hosts]

  31. Disaster Recovery
      Bootstrapping new clusters
      [Diagram: master (US-East-1A) replicating to slave (US-East-1E) and DR slave (US-East-1D), with backups to HDFS and S3]
      Backup and recover tables
      • ExportSnapshot to an HDFS cluster with throttle
      • DistCp WAL logs hourly to HDFS clusters
      • Clone snapshot + WALPlayer to recover a table
      • Data can be restored from S3, HDFS, or another cluster with rate-limited copying
      hbasehbackup -t unique_index -d new_cluster -b 10
      hbaserecover -t unique_index -ts 2015-04-08-16-01-15 --source-cluster zennotifications

  32. Backup Monitoring
      Reliable cloud backups
      Snapshot backup routine
      • ZooKeeper-based configuration for each cluster
      • Backup metadata is sent to ElasticSearch for integration with dashboards
      • Monitoring and alerting around WAL copy & snapshot status

  33. Operations
    running on amazon web services


  34. Rolling Compaction
      Important for online-facing clusters
      Only one region per server is selected
      • Avoid blocked regions in the queue
      Controlled concurrency
      • Control the space spike
      • Reduce increased network and disk traffic
      Controlled time to stop
      • Stop before daytime traffic ramps up
      • Stop if compaction is causing a perf issue
      Resume the next night
      • Filter out the regions that have already run compaction
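The selection step above can be sketched as a pure function. The helper name and data shapes are hypothetical, not the actual Pinterest tooling; the point is one not-yet-compacted region per server per pass:

```python
# Pick at most one region per region server that has not yet been major
# compacted, so no server queues more than one compaction at a time.
# The next nightly pass passes in the updated already_compacted set,
# which is how "resume the next night" filters out finished regions.
def pick_regions(regions_by_server, already_compacted):
    """regions_by_server: {server: [region, ...]} -> {server: region}."""
    picks = {}
    for server, regions in regions_by_server.items():
        candidates = [r for r in regions if r not in already_compacted]
        if candidates:
            picks[server] = candidates[0]
    return picks
```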

  35. 4
    37
    Maintenance
    Check Health
    Get Lock
    Get Server Locality
    Check Server Status
    Region Movement w/Threads
    Start RegionServer
    + Cooldown
    Release Lock
    Update Status + Cooldown
    Rolling
    Restart
    Change management
    while retaining
    availability
    1
    2
    3
    Region Movement w/Threads
    Stop RegionServer
    + Cooldown
    Verify Locality
    Check Server Status

    View full-size slide

  36. Next Challenges
      Looking forward
      • Upgrade to latest stable (1.x) from 0.94.x with no downtime
      Increasing performance
      • Lower latency
      • Better compaction throughput / less write amplification
      Regional failover
      • Cross-datacenter is in production now
      • Failing over between AWS regions

  37. © Copyright, All Rights Reserved Pinterest Inc. 2015
