Operate HBase clusters at Scale

Presented by Florentin Dubois & Kevin Georges at SysadminDays #8 (https://sysadmindays.fr)

Renaud Chaput

October 20, 2018

Transcript

  1. Operate HBase clusters at scale with a monitoring goal 1 </> by Metrics
  2. @sysadmindays @ovh 2 Kevin Georges Engineering manager @0xd33d33

  3. @sysadmindays @ovh 3 Florentin Dubois Software engineer @FlorentinDUBOIS

  4. @sysadmindays @ovh 4 OVHcloud

  5. @sysadmindays @ovh 5 OVH

  6. @sysadmindays @ovh 6 OVH Prescience

  7. What are we doing? 7

  8. @sysadmindays @ovh 8 Metrics Data Platform

  9. @sysadmindays @ovh 9 Metrics Data Platform 432,000,000,000 data points / day

  10. @sysadmindays @ovh 10 Metrics Data Platform 10 TB / day

  11. @sysadmindays @ovh 11 Metrics Data Platform 5,000,000 dp/s

  12. @sysadmindays @ovh 12 Metrics Data Platform 500,000,000 series
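
The figures on slides 9 to 11 can be cross-checked with a little arithmetic. The sketch below assumes that the 10 TB per day is measured in decimal terabytes and refers to the same data points as the 5,000,000 dp/s rate; the slides do not say whether it is raw or stored volume.

```python
# Back-of-the-envelope check of the figures quoted on the slides.
SECONDS_PER_DAY = 24 * 60 * 60

points_per_second = 5_000_000             # from slide 11
points_per_day = points_per_second * SECONDS_PER_DAY

bytes_per_day = 10 * 10**12               # assumption: 10 TB as decimal terabytes
bytes_per_point = bytes_per_day / points_per_day

print(points_per_day)                     # 432000000000, matching slide 9
print(round(bytes_per_point, 1))          # 23.1 bytes per data point on average
```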

  13. Our Infrastructure 13

  14. @sysadmindays @ovh 14 2 regions

  15. @sysadmindays @ovh 15 Our cluster sizes

    BHS: • 30 nodes • 400 TB • 120 Mbps
    GRA: • 150 nodes • 2 PB • 1.1 Gbps
  16. @sysadmindays @ovh 16 Warp 10

  17. @sysadmindays @ovh 17 Warp 10 on top of HBase

  18. @sysadmindays @ovh 18 Our cluster architecture

    [Diagram: Warp10 Ingress, Kafka, Warp10 Store, Warp10 Directory and Warp10 Egress in front of four "Region server + Datanode" nodes]
  19. @sysadmindays @ovh 19 Our real cluster architecture

  20. @sysadmindays @ovh 20 Manage multiple hardware configurations

  21. @sysadmindays @ovh 21 Hardware pitfalls

    • Be sure the number of controllers matches the number of disks and SATA ports
    • Be sure that your network link can handle your disk I/O capacity
    • Be sure of your thread distribution (IRQ, NUMA surprises, ingest + processing + GC + ...)
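
The second pitfall comes down to a unit conversion. A minimal sketch, assuming hypothetical hardware figures (12 disks at 150 MB/s each, a 10 Gbps link) that are not taken from the talk:

```python
def disk_bandwidth_gbps(disk_count: int, mb_per_s_per_disk: float) -> float:
    """Aggregate disk throughput in gigabits per second (decimal units)."""
    return disk_count * mb_per_s_per_disk * 8 / 1000


def link_is_bottleneck(disk_count: int, mb_per_s_per_disk: float, link_gbps: float) -> bool:
    """True when the disks can move more data than the network link can carry."""
    return disk_bandwidth_gbps(disk_count, mb_per_s_per_disk) > link_gbps


# Hypothetical node: 12 disks at 150 MB/s behind a 10 Gbps link.
print(disk_bandwidth_gbps(12, 150.0))        # 14.4
print(link_is_bottleneck(12, 150.0, 10.0))   # True
```

In this made-up configuration the disks can deliver about 14.4 Gbps, so the network link saturates first.
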
  22. What’s Apache HBase? 22

  23. @sysadmindays @ovh 23 What’s Apache HBase? #KeyValue

  24. @sysadmindays @ovh 24 What’s Apache HBase? #SortedColumnStore

  25. @sysadmindays @ovh 25 What’s Apache HBase? #ColumnStore

  26. @sysadmindays @ovh 26 What’s Apache HBase? #ColumnStore

  27. @sysadmindays @ovh 27 What’s Apache HBase? #Columnar?

  28. @sysadmindays @ovh 28 What’s Apache HBase? #ColumnStore

  29. @sysadmindays @ovh 29 What’s Apache HBase? #ColumnStore

  30. @sysadmindays @ovh 30 What’s Apache HBase? #ColumnStore

  31. Use cases 31

  32. @sysadmindays @ovh 32 Use case families

    • Billing (e.g. bill on maximum consumption in a month)
    • Monitoring (APM, infrastructure, appliances, ...)
    • IoT (manage devices, operator integration, ...)
    • Geolocation (manage localized fleets)
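
The billing example (bill on the maximum consumption in a month) is a per-month maximum over a time series. A minimal sketch in plain Python, with a hypothetical `monthly_max` helper and made-up sample values; the platform itself would express this in its own query language, which the slides do not show:

```python
from datetime import datetime, timezone


def monthly_max(points):
    """points: iterable of (unix_timestamp_seconds, value).

    Returns a dict mapping 'YYYY-MM' (UTC) to the peak value seen in that month.
    """
    peaks = {}
    for ts, value in points:
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        peaks[month] = max(value, peaks.get(month, value))
    return peaks


# Made-up consumption samples.
samples = [
    (datetime(2018, 10, 3, tzinfo=timezone.utc).timestamp(), 120.0),
    (datetime(2018, 10, 17, tzinfo=timezone.utc).timestamp(), 340.0),
    (datetime(2018, 11, 2, tzinfo=timezone.utc).timestamp(), 90.0),
]
print(monthly_max(samples))   # {'2018-10': 340.0, '2018-11': 90.0}
```
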
  33. @sysadmindays @ovh 33 Use cases

    • DC Temperature/Elec/Cooling map
    • Pay as you go billing (PCI/IPLB)
    • GSCAN
    • Monitoring
    • ML Model scoring (Anti-Fraud)
    • Pattern Detection for medical applications
  34. Detect errors 34

  35. @sysadmindays @ovh 35 Extract errors from logs

  36. @sysadmindays @ovh 36 Tailor Forward logs and extract metrics!
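
Extracting metrics from logs can be as simple as counting lines per level and logger. A minimal sketch, assuming Log4j-style lines in which the level is followed by the logger name and a colon (as in the excerpts on slides 62 and 63); the pattern and function name are illustrative and not taken from the tool shown in the talk:

```python
import re
from collections import Counter

# Level, whitespace, fully qualified logger name, colon.
LOG_LINE = re.compile(
    r"\b(?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\s+(?P<logger>[\w.$]+):"
)


def count_by_level_and_logger(lines):
    """Return a Counter keyed by (level, logger) for every line that matches."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[(match["level"], match["logger"])] += 1
    return counts


log_lines = [
    "ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException",
    "INFO org.apache.hadoop.dfs.DFSClient: Abandoning block",
    "INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream",
    "a line without a level",
]
for (level, logger), n in count_by_level_and_logger(log_lines).items():
    print(level, logger, n)
```

Each (level, logger) pair can then be pushed as a counter series, so an error burst shows up on a graph instead of being buried in a log file.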

  37. @sysadmindays @ovh 37 Monitoring JVM

  38. @sysadmindays @ovh 38 Documentation

  39. JVM GC The good, the bad and the ugly 39

  40. @sysadmindays @ovh 40 The good

  41. @sysadmindays @ovh 41 The bad

  42. @sysadmindays @ovh 42 … and the ugly #java #jdk11 #zgc

  43. @sysadmindays @ovh 43 Monitoring HBase

  44. @sysadmindays @ovh 44 Number of open regions

  45. @sysadmindays @ovh 45 Queues length

  46. @sysadmindays @ovh 46 Number of read and write requests

  47. @sysadmindays @ovh 47 Preserve data locality
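
The indicators on slides 44 to 47 can be read from the JSON that a region server exposes through its JMX servlet. The sketch below works on an already-parsed payload; the bean names and attribute names are assumptions that should be checked against the HBase version in use, and the sample payload is made up.

```python
import json

# Assumed bean names; verify against your HBase version.
SERVER_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=Server"
IPC_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=IPC"


def extract_metrics(jmx_payload: dict) -> dict:
    """Pick a few region server indicators out of a parsed JMX JSON document."""
    beans = {bean.get("name"): bean for bean in jmx_payload.get("beans", [])}
    server = beans.get(SERVER_BEAN, {})
    ipc = beans.get(IPC_BEAN, {})
    return {
        "open_regions": server.get("regionCount"),
        "read_requests": server.get("readRequestCount"),
        "write_requests": server.get("writeRequestCount"),
        "locality_percent": server.get("percentFilesLocal"),
        "general_queue_calls": ipc.get("numCallsInGeneralQueue"),
    }


# Made-up payload with the same overall shape as a JMX servlet response.
sample = json.loads("""
{"beans": [
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",
   "regionCount": 412, "readRequestCount": 1200, "writeRequestCount": 98000,
   "percentFilesLocal": 97.5},
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=IPC",
   "numCallsInGeneralQueue": 3}
]}
""")
print(extract_metrics(sample))
```

Missing beans or attributes come back as `None` rather than raising, which keeps a collector running when a metric is renamed between versions.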

  48. @sysadmindays @ovh 48 Host health

  49. Pokédex 49 Inventory all animals.

  50. @sysadmindays @ovh 50 Merging all data sources

  51. @sysadmindays @ovh 51 Global visualization

  52. @sysadmindays @ovh 52 Correlate information

  53. Sacha 53 The best tamer!

  54. @sysadmindays @ovh 54 An awesome command line tool

  55. @sysadmindays @ovh 55 Retrieving bare information

  56. @sysadmindays @ovh 56 Create region map

  57. @sysadmindays @ovh 57 Move region to another region server

  58. @sysadmindays @ovh 58 Drain regions of the region server
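
Draining a region server amounts to planning one move per region it hosts. The slides do not show how the tool does this, so the sketch below is only one possible approach: a hypothetical `plan_drain` helper that sends each region to the currently least-loaded remaining server. It only computes the plan; the moves themselves would be carried out through HBase (for example with the shell's `move` command).

```python
def plan_drain(region_map, draining):
    """region_map: {server_name: [region, ...]}.

    Returns a list of (region, source, target) moves that empties `draining`,
    always picking the least-loaded remaining server (ties broken by name).
    """
    load = {server: len(regions)
            for server, regions in region_map.items() if server != draining}
    if not load:
        raise ValueError("no other region server available")
    moves = []
    for region in region_map.get(draining, []):
        target = min(load, key=lambda server: (load[server], server))
        moves.append((region, draining, target))
        load[target] += 1
    return moves


# Hypothetical region map.
cluster = {"rs1": ["a", "b", "c"], "rs2": ["d"], "rs3": []}
for move in plan_drain(cluster, "rs1"):
    print(move)
# ('a', 'rs1', 'rs3')
# ('b', 'rs1', 'rs2')
# ('c', 'rs1', 'rs3')
```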

  59. @sysadmindays @ovh 59 Managing multiple hardware profiles

  60. @sysadmindays @ovh 60 Balance the cluster
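
With several hardware profiles in one cluster, an even split of regions overloads the smaller machines. One way to account for this, sketched below under the assumption that each server is given a relative capacity weight, is to split the regions proportionally to those weights. The function name, the weights and the largest-remainder rounding are illustrative choices, not the method presented in the talk.

```python
def target_region_counts(total_regions, weights):
    """Split total_regions across servers proportionally to their weights.

    weights: {server_name: relative_capacity}. Uses largest-remainder rounding
    so that the targets always sum to total_regions.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    exact = {s: total_regions * w / total_weight for s, w in weights.items()}
    targets = {s: int(share) for s, share in exact.items()}
    leftover = total_regions - sum(targets.values())
    # Hand the remaining regions to the servers with the largest fractional part.
    by_remainder = sorted(weights, key=lambda s: (exact[s] - targets[s], s), reverse=True)
    for server in by_remainder[:leftover]:
        targets[server] += 1
    return targets


# Hypothetical profiles: two large nodes with twice the capacity of a small one.
print(target_region_counts(100, {"big-1": 2, "big-2": 2, "small-1": 1}))
# {'big-1': 40, 'big-2': 40, 'small-1': 20}
```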

  61. Tips & tricks 61

  62. @sysadmindays @ovh 62 Xreceiver

    HDFS:
    ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(...): DataXceiver: java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256
  63. @sysadmindays @ovh 63 Xreceiver

    HDFS:
    ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(...): DataXceiver: java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256
    HBASE:
    INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Could not read from stream
    INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-546...
    WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
    WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-546.. bad dn[0]
    FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server shutdown
  64. @sysadmindays @ovh 64 Xreceiver

    if (curXceiverCount > dataXceiverServer.maxXceiverCount) {
      throw new IOException("xceiverCount " + curXceiverCount
          + " exceeds the limit of concurrent xcievers "
          + dataXceiverServer.maxXceiverCount);
    }
  65. @sysadmindays @ovh 65 Xreceiver

    if (curXceiverCount > dataXceiverServer.maxXceiverCount) {
      throw new IOException("xceiverCount " + curXceiverCount
          + " exceeds the limit of concurrent xcievers "
          + dataXceiverServer.maxXceiverCount);
    }
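
The message built by this check is easy to spot in DataNode logs before it cascades into a region server shutdown. A minimal sketch that pulls the observed count and the configured limit out of each matching line; the function name is illustrative. In HDFS the limit is configured with `dfs.datanode.max.transfer.threads` (named `dfs.datanode.max.xcievers` in older releases); the slides do not say which value the speakers settled on.

```python
import re

# Matches the message format shown on slides 62 to 65, including the
# "xcievers" spelling used in the Hadoop source.
XCEIVER_LIMIT = re.compile(
    r"xceiverCount\s+(\d+)\s+exceeds the limit of concurrent xcievers\s+(\d+)"
)


def xceiver_overruns(lines):
    """Return a list of (observed_count, configured_limit) for matching lines."""
    hits = []
    for line in lines:
        match = XCEIVER_LIMIT.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


datanode_log = [
    "ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(...): DataXceiver: "
    "java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256",
    "INFO org.apache.hadoop.dfs.DFSClient: Abandoning block",
]
print(xceiver_overruns(datanode_log))   # [(258, 256)]
```
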
  66. @sysadmindays @ovh 66 Ipc queue HBASE

  67. @sysadmindays @ovh 67 Hardware pitfalls

    • Be sure the number of controllers matches the number of disks and SATA ports
    • Be sure that your network link can handle your disk I/O capacity
    • Be sure of your thread distribution (IRQ, NUMA surprises, ingest + processing + GC + ...)
  68. @sysadmindays @ovh 68 Hardware pitfalls

    • Be sure the number of controllers matches the number of disks and SATA ports
    • Be sure that your network link can handle your disk I/O capacity
    • Be sure of your thread distribution (IRQ, NUMA surprises, ingest + processing + GC + ...)
  69. What we achieved! 69

  70. @sysadmindays @ovh 70 5 million puts/s

  71. @sysadmindays @ovh 71 ...

  72. Thanks! 72