
HBaseCon 2015 - HBase at Scale


HBase at Scale in an Online and High-Demand Environment

Jeremy Carroll

May 08, 2015
Transcript

  1. HBase
     Jeremy Carroll & Tian-Ying Chang
     Online, at Low Latency


  2. 50+ Billion Pins
    categorized by people into more than

    1 Billion Boards

  3. Use Cases
     Products running on HBase / Zen
     1. SmartFeed - renders the Pinterest landing page for feeds.
     2. Typeahead - provides personal suggestions based on a prefix for a user.
     3. Messages - send messages to your friends. Conversations around Pins.
     4. Interests - follow interests and be inspired about the things you love.

  4. Online Performance


  5. Moving Instance Types
     Validating performance
     • Previously tuned hi1.4xlarge instance workloads were running smoothly at scale
     • AWS deprecated hi1.4xlarge instances, making it harder to get capacity in availability zones
     • Undertook validation of the new i2 platform as a replacement instance type
     • Uncovered some best practices while configuring for EC2

  6. Problem
     Validating the new i2 platform
     [Chart: p99.9 zen get nodes in ms]

  7. Garbage Collection
     [Chart: pause time in milliseconds, broken out by CMS Initial-Mark, Promotion Failures, ParNew, and CMS Remark]

  8. Problem #1: Promotion Failures
     Issues seen in production
     • Heap fragmentation causing promotion failures
     • BlockCache at high QPS was causing fragmentation
     • BloomFilters for hundreds of billions of rows could not be kept from eviction, leading to latency issues
     Solutions
     • Went off-heap for BlockCache using BucketCache
     • Ensured memory space for Memstore + Blooms / Indexes on heap with CombinedCache, and MaxDirectMemorySize off-heap
     • Monitoring % of LRU heap used for blooms & indexes, and tuning BucketCache
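The off-heap BucketCache move described above maps to a handful of settings. A minimal sketch using the Apache HBase property names (`hbase.bucketcache.ioengine`, `hbase.bucketcache.size`, `hfile.block.cache.size`); the sizes are illustrative assumptions, not Pinterest's values, and `-XX:MaxDirectMemorySize` must be raised to cover the off-heap region:

```xml
<!-- Hedged sketch, hbase-site.xml: serve data blocks from an off-heap
     BucketCache while blooms/indexes stay in the on-heap LRU cache
     (combined-cache mode). Sizes are illustrative. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <!-- Off-heap cache size in MB; must fit under -XX:MaxDirectMemorySize -->
  <name>hbase.bucketcache.size</name>
  <value>8192</value>
</property>
<property>
  <!-- Fraction of heap kept for the LRU cache holding blooms and indexes -->
  <name>hfile.block.cache.size</name>
  <value>0.2</value>
</property>
```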

  9. Problem #2: Pause Time
     Calculating deltas
     • Started noticing 'user + sys' compared against 'real' was very different: low user, low sys, high real
     • Random spikes of the 'real vs user+sys' delta, sometimes concentrated on hourly boundaries
     • Found resources online, but none of the fixes seemed to work:
       http://www.slideshare.net/cuonghuutran/gc-andpagescanattacksbylinux
       http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html
       http://www.evanjones.ca/jvm-mmap-pause.html
     • Ended up tracing all disk IO on the system to find latency outliers
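The 'real vs user+sys' delta above is easy to compute from the GC log itself. A minimal sketch, assuming CMS/ParNew GC logging with the standard "[Times: ...]" suffix (the threshold value is an assumption):

```python
import re

# Flag GC events where wall-clock ('real') time far exceeds CPU time
# ('user' + 'sys') -- the signature of the stalls described above, where
# the pause is spent waiting on something outside the JVM (e.g. disk IO).
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), real=(?P<real>[\d.]+) secs\]"
)

def suspicious_pauses(log_lines, threshold_secs=0.5):
    """Return (user, sys, real) tuples where real - (user + sys) > threshold."""
    hits = []
    for line in log_lines:
        m = TIMES_RE.search(line)
        if not m:
            continue
        user, sy, real = (float(m.group(g)) for g in ("user", "sys", "real"))
        if real - (user + sy) > threshold_secs:
            hits.append((user, sy, real))
    return hits
```

Feeding it the `-Xloggc` file and bucketing hits by timestamp is enough to spot the hourly-boundary clustering mentioned above.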

  10. Cloud Problems
      Noisy neighbors
      [Chart: /dev/sda average io in milliseconds]

  11. Formatting with EXT4
      Logging to instance-store volume
      Pause time in ms:
        100%     1251.60
        99.99%   1223.52
        99.9%     241.33
        90%       151.45

  12. Instance Layout
      Implementation
      • Ephemeral disks formatted as JBOD
      • All logging happens to the first ephemeral disk with aggressive log rotation
      • GC / H-Daemon logs were written to /var/log, which is now linked to the first ephemeral volume
      What we did not know
      • GC statistics are logged to /tmp (disabled with PerfDisableSharedMem)
      Filesystem hierarchy
      [Diagram: / on sda; /var/log linked to ephemeral0; /data0 on ephemeral0, /data1 on ephemeral1, /dataX on ephemeralX]

  13. Formatting with XFS
      PerfDisableSharedMem & logging to ephemeral
      Pause time in ms:
        100%     115.64
        99.99%   107.88
        99.9%     94.01
        90%       66.38

  14. Resolution
      Comparison before / after tuning
      [Chart: p99.9 zen get nodes in ms]

  15. Online Performance
      Changes for success on EC2
      JVM Options
      -server -XX:+PerfDisableSharedMem -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
      -XX:+UseCompressedOops -XX:+CMSParallelRemarkEnabled
      -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
      -Dnetworkaddress.cache.ttl=300 -Djava.net.preferIPv4Stack=true
      Instance Configuration
      • Treat instance-store (sda) as read-only
      • irqbalance >= 1.0.6 (Ubuntu 12.04)
      • Kernel 3.8+ for disk performance
        https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811
      • XFS filesystem formatting

  16. Monitoring HBase
      Lots of data points
      • Collect statistics via local collection daemons
      • Daemons query the local JMX port, and use filters to grab statistics we care about:
        http://regionserver:60030/jmx?qry=hadoop:service=HBase,name=RPCStatistics-60020
      • Collect per-table statistics
      • Visualize with in-house dashboards, or OpenTSDB
      • Garbage collection analysis performed by parsing GC logs fed into R as TSV
      [Diagram: per-table stats flow from regionserver:60030/jmx through a local collector into OpenTSDB]
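The collector loop above can be sketched briefly. The JMX servlet URL and query filter are from the slide; the metric-name prefixes and the `extract_metrics` helper are illustrative assumptions, since attribute names vary by HBase version:

```python
import json
from urllib.request import urlopen

# The RegionServer JMX servlet returns JSON of the form {"beans": [...]};
# a local daemon fetches it and keeps only the numeric attributes we care
# about before shipping them to the time-series store.
JMX_URL = ("http://regionserver:60030/jmx?"
           "qry=hadoop:service=HBase,name=RPCStatistics-60020")

def extract_metrics(jmx_json, wanted_prefixes=("get", "multi")):
    """Flatten the JMX bean list, keeping numeric attrs matching a prefix."""
    metrics = {}
    for bean in jmx_json.get("beans", []):
        for key, value in bean.items():
            if isinstance(value, (int, float)) and key.lower().startswith(wanted_prefixes):
                metrics[key] = value
    return metrics

def collect(url=JMX_URL):
    # Network call against the local daemon's JMX port.
    with urlopen(url) as resp:
        return extract_metrics(json.load(resp))
```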

  17. Dashboards
      Deep dives
      Metrics
      • Usually low value, except when you need them
      • Dashboards for all H-Stack daemons (DataNode, NameNode, etc.)
      • Use raw metrics to drive insights regarding increased load to tables / regions
      • Use a system like OpenTSDB which can handle high-cardinality metrics
      • Compactions can elevate CPU greatly

  18. HBase Alerts
      Nagios-based system for HBase on-call
      • Usually unexpected traffic increases / capacity
      • Clustered commands very useful
        - % daemons down, queues (RPC / Replication)
      • ZooKeeper quorum membership
      • Stale configuration files (on disk, not in memory)
      • Critical daemons (NameNode, Master, etc.)
      • Long-running master tasks (e.g. splitting)
      • HDFS free space
        - Room for compactions, accounting for replication

  19. HotSpots
      Debugging imbalanced requests
      Spammers
      • A few users request the same row over and over
      • Rate limiting / caching
      Real-time analysis
      • tcpdump is very helpful:
        tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings
      • Looking at per-region request stats
      Code issues
      • Hard-coded key in product (e.g. Messages launch)
      [Chart: CPU utilization]
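The tcpdump pipeline above yields a stream of printable strings; counting them surfaces hot keys. A rough sketch (treating each string as a candidate row key is a heuristic, and the function name and thresholds are illustrative):

```python
from collections import Counter

# Feed the output of `tcpdump ... | strings` into this counter and surface
# keys that dominate traffic -- the "same row over and over" pattern above.
def hot_keys(lines, top_n=5, min_share=0.01):
    """Return (key, count) pairs exceeding min_share of total requests."""
    counts = Counter(line.strip() for line in lines if line.strip())
    total = sum(counts.values())
    return [(key, count) for key, count in counts.most_common(top_n)
            if count / total >= min_share]
```

In practice this runs as the consuming end of the pipe, e.g. `tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings | python hotspot.py`.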

  20. Capacity Planning
    supporting fast growth


  21. Capacity Planning
      Start small. Split for growth
      [Diagram: master and slave clusters (thrift + ZooKeeper), config-driven replication between them]
      Feature cost
      • HDFS disk space
      • CPU - compression (Prefix, FastDiff, Snappy)
      • Memory - index, Bloom filters
      Managed splitting
      • Disable auto-splitting and monitor region size
      • Manually split the slave cluster
        - No impact to the online-facing cluster
        - Switch between master and slave clusters


  27. Capacity Planning
      Balancing regions to servers
      [Diagram: per-table regions distributed across region servers]
      • Salted keys for uniform distribution
      • Per-table region assignment
      • A one- or two-region difference can cause a big difference in load
      • An average of 2.5 regions per RS causes region assignments of 2 or 3
      • Load is 30% different
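The 2-vs-3 region arithmetic above can be made concrete. A sketch (the function name is illustrative), assuming salted keys make region load roughly uniform so a server's traffic share is proportional to its region count:

```python
# With uniformly loaded regions, a server's share of traffic is
# proportional to how many regions it hosts.
def load_vs_average(regions_on_server, avg_regions_per_server):
    """Relative load of one server versus the fleet average."""
    return regions_on_server / avg_regions_per_server

# An average of 2.5 regions per server means assignments land on 2 or 3:
heavy = load_vs_average(3, 2.5)   # 1.2x the average load
light = load_vs_average(2, 2.5)   # 0.8x the average load
# A single-region difference moves a server well above or below average,
# which is why imbalance matters so much at low regions-per-server counts.
```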

  28. Launching
      From development to production
      • Started on a namespace table on a shared cluster
      • Rolled out slowly to production by x% of users
      • Split the table to get additional region servers serving traffic
      • Migrated to a dedicated cluster
      • As the experiment ramped up, added / removed capacity as the feature was adopted

  29. Availability
    and disaster recovery


  30. Availability
      Strategies for mitigating failure
      Conditions for failure
      • Termination notices from the underlying host
      • Default RF of 3 in one zone is dangerous
      • Placement groups may make this worse
      • Unexpected events - AWS global reboot
      Stability patterns
      • Highest-numbered instance type in a family
      • Multi availability zone + block placement
      • Make changes to only one cluster at a time
      [Diagram: master in US-East-1A replicating to slave in US-East-1E, on separate physical hosts]

  31. Disaster Recovery
      Bootstrapping new clusters
      [Diagram: master (US-East-1A) replicating to slave (US-East-1E) and DR slave (US-East-1D), with backups to HDFS and S3]
      Backup and recover tables
      • ExportSnapshot to an HDFS cluster with throttle
      • DistCp WAL logs hourly to HDFS clusters
      • Clone snapshot + WALPlayer to recover a table
      • Data can be restored from S3, HDFS, or another cluster with rate-limited copying
      hbasehbackup -t unique_index -d new_cluster -b 10
      hbaserecover -t unique_index -ts 2015-04-08-16-01-15 --source-cluster zennotifications

  32. Backup Monitoring
      Reliable cloud backups
      Snapshot backup routine
      • ZooKeeper-based configuration for each cluster
      • Backup metadata is sent to ElasticSearch for integration with dashboards
      • Monitoring and alerting around WAL copy & snapshot status

  33. Operations
    running on amazon web services


  34. Rolling Compaction
      Important for online-facing clusters
      Only one region per server is selected
      • Avoid blocked regions in the queue
      Controlled concurrency
      • Control the space spike
      • Reduce increased network and disk traffic
      Controlled time to stop
      • Stop before daytime traffic ramps up
      • Stop if compaction is causing a perf issue
      Resume the next night
      • Filter out the regions that have already run compaction
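The selection step above can be sketched as a pure function. The helper name and data shapes are hypothetical, not the actual Pinterest tooling; the point is one not-yet-compacted region per server per pass:

```python
# Pick at most one region per region server that has not yet been major
# compacted, so no server queues more than one compaction at a time.
# The next nightly pass passes in the updated already_compacted set,
# which is how "resume the next night" filters out finished regions.
def pick_regions(regions_by_server, already_compacted):
    """regions_by_server: {server: [region, ...]} -> {server: region}."""
    picks = {}
    for server, regions in regions_by_server.items():
        candidates = [r for r in regions if r not in already_compacted]
        if candidates:
            picks[server] = candidates[0]
    return picks
```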

  35. 4
    37
    Maintenance
    Check Health
    Get Lock
    Get Server Locality
    Check Server Status
    Region Movement w/Threads
    Start RegionServer
    + Cooldown
    Release Lock
    Update Status + Cooldown
    Rolling
    Restart
    Change management
    while retaining
    availability
    1
    2
    3
    Region Movement w/Threads
    Stop RegionServer
    + Cooldown
    Verify Locality
    Check Server Status

    View full-size slide

  36. Next Challenges
      Looking forward
      • Upgrade to latest stable (1.x) from 0.94.x with no downtime
      Increasing performance
      • Lower latency
      • Better compaction throughput / less write amplification
      Regional failover
      • Cross-datacenter is in production now
      • Failing over between AWS regions

  37. © Copyright, All Rights Reserved Pinterest Inc. 2015
