Operate HBase clusters at Scale

Presented by Florentin Dubois & Kevin Georges at SysadminDays #8 (https://sysadmindays.fr)

Renaud Chaput

October 20, 2018

Transcript

  1. Operate HBase clusters at scale
    with a monitoring goal
    > by Metrics

  2. @sysadmindays @ovh 2
    Kevin Georges
    Engineering manager
    @0xd33d33

  3. Florentin Dubois
    Software engineer
    @FlorentinDUBOIS

  4. OVHcloud

  5. OVH

  6. OVH
    Prescience

  7. What are we doing?

  8. Metrics Data Platform

  9. Metrics Data Platform
    432,000,000,000 data points / day

  10. Metrics Data Platform
    10 TB / day

  11. Metrics Data Platform
    5,000,000 data points / s

  12. Metrics Data Platform
    500,000,000 series
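
The platform figures are consistent with each other: a sustained 5,000,000 data points per second over 86,400 seconds gives the 432,000,000,000 data points per day quoted earlier. The sketch below checks that arithmetic and derives a rough bytes-per-point figure from the 10 TB daily volume. The class and method names are illustrative, and the per-point figure assumes decimal terabytes and that the daily volume covers exactly those points, which the slides do not state.

```java
final class IngestMath {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Data points per day for a sustained per-second ingest rate. */
    static long pointsPerDay(long pointsPerSecond) {
        return pointsPerSecond * SECONDS_PER_DAY;
    }

    /** Average bytes per data point for a given daily volume in bytes. */
    static double bytesPerPoint(double bytesPerDay, long pointsPerDay) {
        return bytesPerDay / pointsPerDay;
    }

    public static void main(String[] args) {
        long perDay = pointsPerDay(5_000_000L);
        // Assumption: 10 TB means 10 * 10^12 bytes.
        double perPoint = bytesPerPoint(10e12, perDay);
        System.out.printf("%d points/day, about %.1f bytes/point%n", perDay, perPoint);
    }
}
```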

  13. Our Infrastructure

  14. 2 regions

  15. Our cluster sizes
    BHS:
    ● 30 nodes
    ● 400 TB
    ● 120 Mbps
    GRA:
    ● 150 nodes
    ● 2 PB
    ● 1.1 Gbps
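
Dividing each cluster's capacity by its node count suggests both regions run at a similar storage density per node. A small sketch of that division follows; the class name is illustrative and the GRA figure assumes 1 PB = 1000 TB.

```java
final class ClusterDensity {
    /** Average storage per node, in terabytes. */
    static double terabytesPerNode(double totalTerabytes, int nodes) {
        return totalTerabytes / nodes;
    }

    public static void main(String[] args) {
        System.out.printf("BHS: %.1f TB/node%n", terabytesPerNode(400, 30));
        // Assumption: 2 PB is taken as 2000 TB.
        System.out.printf("GRA: %.1f TB/node%n", terabytesPerNode(2000, 150));
    }
}
```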

  16. Warp 10

  17. Warp 10 on top of HBase

  18. Our cluster architecture
    [Diagram. Components shown: Warp10 Ingress, Kafka, Warp10 Store (x2),
    Warp10 Directory (x2), Warp10 Egress (x2), and four HBase nodes,
    each running a region server + datanode]

  19. Our real cluster architecture

  20. Manage multiple hardware configurations

  21. Hardware pitfalls
    Make sure the number of controllers matches the
    number of disks & SATA ports
    Make sure that your network link can handle your
    disk I/O capacity
    Check how threads are distributed (IRQ, NUMA
    surprises, ingest + processing + GC + ...)
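
The network-versus-disk pitfall comes down to comparing two bandwidths. A minimal sketch of that comparison, with hypothetical numbers (12 disks at 150 MB/s each behind a 10 Gbps link) that do not come from the talk:

```java
final class LinkCheck {
    /** Aggregate sequential disk throughput of one node, in gigabits per second. */
    static double diskGbps(int disks, double megabytesPerSecondPerDisk) {
        return disks * megabytesPerSecondPerDisk * 8 / 1000.0;
    }

    /** True when the disks can push more data than the network link can carry. */
    static boolean networkIsBottleneck(int disks, double megabytesPerSecondPerDisk, double linkGbps) {
        return diskGbps(disks, megabytesPerSecondPerDisk) > linkGbps;
    }

    public static void main(String[] args) {
        // Hypothetical node: 12 disks x 150 MB/s against a 10 Gbps link.
        System.out.printf("disks: %.1f Gbps, network-bound: %b%n",
                diskGbps(12, 150), networkIsBottleneck(12, 150, 10));
    }
}
```

With these assumed figures the disks could deliver more than the link carries, so re-replication or compaction traffic would be limited by the network rather than by the drives.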

  22. What’s Apache HBase?

  23. What’s Apache HBase? #KeyValue

  24. What’s Apache HBase? #SortedColumnStore

  25. What’s Apache HBase? #ColumnStore

  26. What’s Apache HBase? #ColumnStore

  27. What’s Apache HBase? #Columnar?

  28. What’s Apache HBase? #ColumnStore

  29. What’s Apache HBase? #ColumnStore

  30. What’s Apache HBase? #ColumnStore
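
The hashtags on these slides describe HBase as a key-value store whose rows are kept in sorted order. As a rough mental model only, the sketch below uses nested `TreeMap`s: row key to column to value, with a range scan over row keys. It is not the HBase client API; real HBase sorts byte-array keys, groups columns into families, and versions cells by timestamp. The class name and the sample keys are made up for illustration.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

final class SortedStoreSketch {
    // row key -> (column -> value); both levels stay sorted by key
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    void put(String row, String column, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
    }

    String get(String row, String column) {
        NavigableMap<String, String> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }

    /** Range scan: rows with startRow <= key < stopRow, returned in key order. */
    SortedMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        SortedStoreSketch store = new SortedStoreSketch();
        store.put("cpu#host2", "v", "0.7"); // hypothetical keys: metric name, then host
        store.put("cpu#host1", "v", "0.4");
        store.put("mem#host1", "v", "12");
        // Rows sharing the "cpu#" prefix are adjacent, so one range scan returns them all.
        System.out.println(store.scan("cpu#", "cpu$").keySet());
    }
}
```

Because rows are sorted, the choice of row key decides which data ends up adjacent and can be read with a single scan.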

  31. Use cases

  32. Use case families
    • Billing
    (e.g. bill on maximum consumption in a month)
    • Monitoring
    (APM, infrastructure, appliances, ...)
    • IoT
    (manage devices, operator integration, ...)
    • Geolocation
    (manage localized fleets)

  33. Use cases
    • DC Temperature/Elec/Cooling map
    • Pay-as-you-go billing (PCI/IPLB)
    • GSCAN
    • Monitoring
    • ML model scoring (anti-fraud)
    • Pattern detection for medical applications

  34. Detect errors

  35. Extract errors from logs

  36. Tailor
    Forward logs and extract metrics!
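
Turning logs into metrics can be as simple as counting lines per severity. The sketch below is not Tailor's implementation, which the slides do not show. It assumes each log line starts with a level keyword followed by a logger name and a colon, as in the Hadoop and HBase excerpts quoted later in this deck; the class name is illustrative.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogLevelCounter {
    enum Level { INFO, WARN, ERROR, FATAL }

    // Assumed line shape: "<LEVEL> <logger.class.Name>: message"
    private static final Pattern LINE =
            Pattern.compile("^(INFO|WARN|ERROR|FATAL)\\s+([\\w.]+):");

    /** Counts log lines per level; lines that do not match (e.g. stack traces) are skipped. */
    static Map<Level, Integer> count(List<String> lines) {
        Map<Level, Integer> counts = new EnumMap<>(Level.class);
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (m.find()) {
                counts.merge(Level.valueOf(m.group(1)), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of(
                "ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(...):",
                "INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream",
                "java.io.IOException: Could not read from stream",
                "FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required.")));
    }
}
```

Each counter can then be pushed as a time series, which makes an error burst visible on the same dashboards as the other metrics.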

  37. Monitoring JVM
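
The slides do not say how the JVM metrics are collected. One standard source is the `java.lang.management` API, which exposes per-collector GC counts and cumulative pause time. The sketch below reads them in-process; the class name is illustrative, and a real deployment would more likely sample the same beans remotely over JMX.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcSnapshot {
    /** Total time spent in garbage collection across all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total GC time: " + totalGcMillis() + " ms");
    }
}
```

Sampling these counters periodically and plotting the deltas shows how much wall-clock time a region server loses to GC.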

  38. Documentation

  39. JVM GC
    The good, the bad and the ugly

  40. The good

  41. The bad

  42. … and the ugly
    #java #jdk11 #zgc

  43. Monitoring HBase

  44. Number of open regions

  45. Queue lengths

  46. Number of read and write requests

  47. Preserve data locality
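
Data locality is usually expressed as the share of a region server's store-file bytes that have an HDFS replica on the same host. HBase region servers commonly expose a comparable figure through their metrics (`percentFilesLocal`, as far as this editor is aware). The sketch below only shows the ratio itself; the names and the treatment of an empty server are assumptions.

```java
final class Locality {
    /** Fraction (0.0 to 1.0) of store-file bytes with a replica on the region server's host. */
    static double localityIndex(long localBytes, long totalBytes) {
        if (totalBytes == 0) {
            return 1.0; // assumption: a server with no data counts as fully local
        }
        return (double) localBytes / totalBytes;
    }

    public static void main(String[] args) {
        // Hypothetical server: 750 GB of its 1000 GB of store files are local.
        System.out.printf("locality: %.0f%%%n", 100 * localityIndex(750, 1000));
    }
}
```

A value that drops after region moves means reads now cross the network until a major compaction rewrites the files locally.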

  48. Host health

  49. Pokédex
    Inventory all animals.

  50. Merging all data sources

  51. Global visualization

  52. Correlate information

  53. Sacha
    The best tamer!

  54. An awesome command line tool

  55. Retrieving bare information

  56. Create region map

  57. Move a region to another region server

  58. Drain regions from a region server
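
Draining a region server amounts to planning one move per region it hosts. The sketch below is not the tool shown in the talk. It assumes a region map of server name to region names and sends each region to whichever remaining server currently holds the fewest regions. All names are hypothetical; in a real cluster each planned move would be applied with the HBase shell `move` command or the Admin API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DrainPlanner {
    /** Returns region -> target server for every region on the drained server. */
    static Map<String, String> plan(Map<String, List<String>> regionsByServer, String drained) {
        // Region count of every server that stays in service.
        Map<String, Integer> load = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : regionsByServer.entrySet()) {
            if (!e.getKey().equals(drained)) {
                load.put(e.getKey(), e.getValue().size());
            }
        }
        if (load.isEmpty()) {
            throw new IllegalStateException("no server left to receive regions");
        }
        Map<String, String> moves = new LinkedHashMap<>();
        for (String region : regionsByServer.getOrDefault(drained, List.of())) {
            // Least-loaded server first; ties go to the alphabetically first name.
            String target = Collections.min(load.entrySet(), Map.Entry.comparingByValue()).getKey();
            moves.put(region, target);
            load.merge(target, 1, Integer::sum);
        }
        return moves;
    }

    public static void main(String[] args) {
        Map<String, List<String>> regionMap = Map.of(
                "rs1", List.of("a", "b", "c"),
                "rs2", List.of("d"),
                "rs3", List.of("e", "f"));
        System.out.println(plan(regionMap, "rs1"));
    }
}
```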

  59. Managing multiple hardware profiles

  60. Balance the cluster
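
With several hardware profiles in one cluster, an even region count per server is not necessarily the right target. The slides do not describe the balancing algorithm used, so the sketch below shows one possible approach: give each server a weight (for example proportional to its disk count, an assumption) and compute region targets proportional to those weights, using largest-remainder rounding so the targets add up to the total.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WeightedBalance {
    /** Target region count per server, proportional to each server's weight. */
    static Map<String, Integer> targets(Map<String, Integer> weights, int totalRegions) {
        int weightSum = weights.values().stream().mapToInt(Integer::intValue).sum();
        if (weightSum <= 0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
        Map<String, Integer> targets = new TreeMap<>();
        int assigned = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            int share = (int) ((long) totalRegions * e.getValue() / weightSum); // floor of exact share
            targets.put(e.getKey(), share);
            assigned += share;
        }
        // Regions lost to rounding go to the servers with the largest remainders.
        List<String> byRemainder = new ArrayList<>(weights.keySet());
        byRemainder.sort(Comparator
                .comparingLong((String s) -> -((long) totalRegions * weights.get(s) % weightSum))
                .thenComparing(Comparator.<String>naturalOrder()));
        for (int i = 0; i < totalRegions - assigned; i++) {
            targets.merge(byRemainder.get(i), 1, Integer::sum);
        }
        return targets;
    }

    public static void main(String[] args) {
        // Hypothetical profiles: "large" nodes carry three times the weight of "small" ones.
        System.out.println(targets(Map.of("small", 1, "large", 3), 12));
    }
}
```

Comparing these targets with the current region map gives the list of moves needed to rebalance.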

  61. Tips & tricks

  62. Xceiver
    HDFS:
    ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(...):
    DataXceiver: java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256

  63. Xceiver
    HDFS:
    ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(...):
    DataXceiver: java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256
    HBASE:
    INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream
    java.io.IOException: Could not read from stream
    INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-546...
    WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException:
    Unable to create new block.
    WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-546.. bad
    dn[0]
    FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required.
    Forcing server shutdown

  64. Xceiver
    if (curXceiverCount > dataXceiverServer.maxXceiverCount) {
      throw new IOException("xceiverCount " + curXceiverCount
          + " exceeds the limit of concurrent xcievers "
          + dataXceiverServer.maxXceiverCount);
    }
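
The check above is a plain counter against a fixed ceiling, which is why a DataNode at the limit rejects new block transfers and the failure then surfaces in HBase. The sketch below is a simplified, self-contained model of that behaviour, not the DataNode source; the class name is illustrative. On the HDFS side the ceiling is set in `hdfs-site.xml`, under `dfs.datanode.max.transfer.threads` in current Hadoop releases (older releases used `dfs.datanode.max.xcievers`); check the documentation of the version in use before changing it.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

final class XceiverGate {
    private final int maxXceiverCount;
    private final AtomicInteger active = new AtomicInteger();

    XceiverGate(int maxXceiverCount) {
        this.maxXceiverCount = maxXceiverCount;
    }

    /** Registers one more concurrent transfer, or fails once the limit is exceeded. */
    void open() throws IOException {
        int current = active.incrementAndGet();
        if (current > maxXceiverCount) {
            active.decrementAndGet(); // the rejected transfer does not count
            throw new IOException("xceiverCount " + current
                    + " exceeds the limit of concurrent xcievers " + maxXceiverCount);
        }
    }

    /** Marks one transfer as finished. */
    void close() {
        active.decrementAndGet();
    }

    public static void main(String[] args) {
        XceiverGate gate = new XceiverGate(2); // tiny limit, for demonstration only
        try {
            gate.open();
            gate.open();
            gate.open(); // third concurrent transfer is rejected
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```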

  66. IPC queue
    HBASE

  67. Hardware pitfalls
    Make sure the number of controllers matches the
    number of disks & SATA ports
    Make sure that your network link can handle your
    disk I/O capacity
    Check how threads are distributed (IRQ, NUMA
    surprises, ingest + processing + GC + ...)

  69. What we achieved!

  70. 5 million puts/s

  71. ...

  72. Thanks!