Slide 1

Slide 1 text

Operate HBase clusters at scale with a monitoring goal, by Metrics

Slide 2

Slide 2 text

@sysadmindays @ovh 2 Kevin Georges Engineering manager @0xd33d33

Slide 3

Slide 3 text

Florentin Dubois Software engineer @FlorentinDUBOIS

Slide 4

Slide 4 text

OVHcloud

Slide 5

Slide 5 text

OVH

Slide 6

Slide 6 text

OVH Presience

Slide 7

Slide 7 text

What are we doing?

Slide 8

Slide 8 text

Metrics Data Platform

Slide 9

Slide 9 text

Metrics Data Platform: 432,000,000,000 data points / day

Slide 10

Slide 10 text

Metrics Data Platform: 10 TB / day

Slide 11

Slide 11 text

Metrics Data Platform: 5,000,000 dp/s
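
The per-second and per-day figures quoted on these slides agree with each other, which a quick calculation confirms. This is only a sanity check; the class and method names are illustrative and do not come from the talk.

```java
final class IngestRate {
    static final long SECONDS_PER_DAY = 24L * 60L * 60L; // 86,400 seconds

    // Converts a sustained per-second rate into a per-day total.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static long pointsPerDay(long pointsPerSecond) {
        return Math.multiplyExact(pointsPerSecond, SECONDS_PER_DAY);
    }

    public static void main(String[] args) {
        // 5,000,000 dp/s sustained over one day.
        System.out.println(pointsPerDay(5_000_000L)); // prints 432000000000
    }
}
```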

Slide 12

Slide 12 text

Metrics Data Platform: 500,000,000 series

Slide 13

Slide 13 text

Our Infrastructure

Slide 14

Slide 14 text

2 regions

Slide 15

Slide 15 text

Our cluster sizes
BHS:
● 30 nodes
● 400 TB
● 120 Mbps
GRA:
● 150 nodes
● 2 PB
● 1.1 Gbps

Slide 16

Slide 16 text

Warp 10

Slide 17

Slide 17 text

Warp 10 on top of HBase

Slide 18

Slide 18 text

Our cluster architecture
[Diagram] Components shown: Warp10 Ingress, Kafka, Warp10 Store, Warp10 Directory, Warp10 Egress, and four Region server + Datanode nodes.

Slide 19

Slide 19 text

Our real cluster architecture

Slide 20

Slide 20 text

Manage multiple hardware configurations

Slide 21

Slide 21 text

Hardware pitfalls
• Be sure that the number of controllers matches the number of disks and SATA ports
• Be sure that your network link can handle your disk I/O capacity
• Be sure of your thread distribution (IRQ, NUMA surprises, ingest + processing + GC + ...)
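
The network-versus-disk pitfall comes down to simple arithmetic. The sketch below compares the aggregate throughput of a node's disks with its network link. The disk count, per-disk throughput and link speed are hypothetical figures chosen for illustration, not numbers from the talk.

```java
final class LinkSizing {
    // Aggregate sequential throughput of all data disks, in megabits per second.
    static long diskCapacityMbps(int diskCount, long perDiskMegabytesPerSecond) {
        return diskCount * perDiskMegabytesPerSecond * 8L; // 8 bits per byte
    }

    // True when the network link can carry everything the disks can deliver.
    static boolean linkCoversDisks(long linkMbps, int diskCount, long perDiskMegabytesPerSecond) {
        return linkMbps >= diskCapacityMbps(diskCount, perDiskMegabytesPerSecond);
    }

    public static void main(String[] args) {
        // Hypothetical node: 12 disks at 150 MB/s each behind a 10 Gbps link.
        System.out.println(diskCapacityMbps(12, 150));        // prints 14400
        System.out.println(linkCoversDisks(10_000, 12, 150)); // prints false
    }
}
```

In this made-up configuration the disks can deliver more than the link can carry, so the network would be the bottleneck during heavy replication or recovery traffic.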

Slide 22

Slide 22 text

What’s Apache HBase?

Slide 23

Slide 23 text

What’s Apache HBase? #KeyValue

Slide 24

Slide 24 text

What’s Apache HBase? #SortedColumnStore

Slide 25

Slide 25 text

What’s Apache HBase? #ColumnStore

Slide 26

Slide 26 text

What’s Apache HBase? #ColumnStore

Slide 27

Slide 27 text

What’s Apache HBase? #Columnar?

Slide 28

Slide 28 text

What’s Apache HBase? #ColumnStore

Slide 29

Slide 29 text

What’s Apache HBase? #ColumnStore

Slide 30

Slide 30 text

What’s Apache HBase? #ColumnStore

Slide 31

Slide 31 text

Use cases

Slide 32

Slide 32 text

Use case families
• Billing (e.g. bill on maximum consumption in a month)
• Monitoring (APM, infrastructure, appliances, ...)
• IoT (manage devices, operator integration, ...)
• Geo Location (manage localized fleets)

Slide 33

Slide 33 text

Use cases
• DC Temperature/Elec/Cooling map
• Pay as you go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML Model scoring (Anti-Fraud)
• Pattern Detection for medical applications

Slide 34

Slide 34 text

Detect errors

Slide 35

Slide 35 text

Extract errors from logs

Slide 36

Slide 36 text

Tailor Forward logs and extract metrics!
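
One simple way to turn logs into metrics is to count lines per log level and publish the counts as time series. The sketch below does only the counting. It assumes each log line starts with its level, and the sample lines in `main` are made up. The tooling actually used in the talk is not shown in the slides, so this is an illustration, not that tool.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogLevelCounter {
    // Matches a leading log level; this line layout is an assumption.
    private static final Pattern LEVEL =
        Pattern.compile("^(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\\b");

    // Returns how many lines were seen for each level, sorted by level name.
    // Lines without a recognizable leading level are ignored.
    static Map<String, Integer> countByLevel(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LEVEL.matcher(line.trim());
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> lines = List.of(
            "ERROR sample.Component: something failed",
            "INFO sample.Component: retrying",
            "ERROR sample.Other: something else failed");
        System.out.println(countByLevel(lines)); // prints {ERROR=2, INFO=1}
    }
}
```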

Slide 37

Slide 37 text

Monitoring JVM
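
The JVM exposes heap and garbage-collection figures through the standard `java.lang.management` API. The sketch below reads them from the local JVM. A real deployment would more likely read the same beans remotely over JMX from each region server. How the speakers collected these metrics is not stated in the slides, and the class name `JvmSnapshot` is illustrative.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class JvmSnapshot {
    // Total number of collections across all collectors of this JVM.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when undefined for a collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time across all collectors, in milliseconds.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when undefined for a collector
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // Heap currently in use, in bytes.
    static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + heapUsedBytes());
        System.out.println("GC count: " + totalGcCount());
        System.out.println("GC time (ms): " + totalGcTimeMillis());
    }
}
```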

Slide 38

Slide 38 text

Documentation

Slide 39

Slide 39 text

JVM GC: the good, the bad and the ugly

Slide 40

Slide 40 text

The good

Slide 41

Slide 41 text

The bad

Slide 42

Slide 42 text

… and the ugly #java #jdk11 #zgc

Slide 43

Slide 43 text

Monitoring HBase

Slide 44

Slide 44 text

Number of open regions

Slide 45

Slide 45 text

Queues length

Slide 46

Slide 46 text

Number of read and write requests
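
Request metrics are typically published as ever-increasing counters (HBase region servers expose, for example, `readRequestCount` and `writeRequestCount` over JMX), so a dashboard has to turn two successive samples into a rate. The helper below is a minimal sketch of that conversion. The class name and the sample values are illustrative, and the slides do not say how the speakers computed their graphs.

```java
final class CounterRate {
    // Per-second rate between two samples of a monotonically increasing counter.
    // Returns 0 when the counter went backwards, for example after a process restart.
    static double perSecond(long previous, long current, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        long delta = current - previous;
        if (delta < 0) {
            return 0.0;
        }
        return delta * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Made-up samples taken 60 seconds apart.
        System.out.println(perSecond(1_000_000L, 1_600_000L, 60_000L)); // prints 10000.0
    }
}
```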

Slide 47

Slide 47 text

Preserve data locality

Slide 48

Slide 48 text

Host health

Slide 49

Slide 49 text

Pokédex: inventory all animals.

Slide 50

Slide 50 text

Merging all data sources

Slide 51

Slide 51 text

Global visualization

Slide 52

Slide 52 text

Correlate information

Slide 53

Slide 53 text

Sacha: the best tamer!

Slide 54

Slide 54 text

An awesome command line tool

Slide 55

Slide 55 text

Retrieving bare information

Slide 56

Slide 56 text

Create region map

Slide 57

Slide 57 text

Move region to another region server

Slide 58

Slide 58 text

Drain regions of the region server

Slide 59

Slide 59 text

Managing multiple hardware profiles

Slide 60

Slide 60 text

Balance the cluster
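
When servers have different hardware profiles, one plausible way to balance is to give each server a share of regions proportional to a weight. The sketch below computes such targets with a largest-remainder split. The slides do not describe the tool's actual algorithm, so the weighting scheme, the class name and the sample weights are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WeightedBalance {
    // Splits totalRegions across servers in proportion to their weights.
    // Regions left over after rounding down go to the largest fractional shares.
    static Map<String, Integer> targets(Map<String, Integer> weights, int totalRegions) {
        if (weights.isEmpty() || totalRegions < 0) {
            throw new IllegalArgumentException("need at least one server and a non-negative region count");
        }
        long weightSum = 0;
        for (int w : weights.values()) {
            if (w <= 0) {
                throw new IllegalArgumentException("weights must be positive");
            }
            weightSum += w;
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        Map<String, Long> remainders = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            long scaled = (long) totalRegions * e.getValue();
            int base = (int) (scaled / weightSum);
            result.put(e.getKey(), base);
            remainders.put(e.getKey(), scaled % weightSum);
            assigned += base;
        }
        // Largest remainder first; the sort is stable, so ties keep insertion order.
        List<String> order = new ArrayList<>(weights.keySet());
        order.sort((a, b) -> Long.compare(remainders.get(b), remainders.get(a)));
        int leftover = totalRegions - assigned; // always smaller than the number of servers
        for (int i = 0; i < leftover; i++) {
            result.merge(order.get(i), 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical profiles: a "large" server takes three times the share of a "small" one.
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("small", 1);
        weights.put("large", 3);
        System.out.println(targets(weights, 8)); // prints {small=2, large=6}
    }
}
```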

Slide 61

Slide 61 text

Tips & tricks

Slide 62

Slide 62 text

Xceiver
HDFS:
ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(...): DataXceiver: java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256

Slide 63

Slide 63 text

Xceiver
HDFS:
ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(...): DataXceiver: java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256
HBASE:
INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Could not read from stream
INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-546...
WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-546.. bad dn[0]
FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server shutdown

Slide 64

Slide 64 text

Xceiver
if (curXceiverCount > dataXceiverServer.maxXceiverCount) {
  throw new IOException("xceiverCount " + curXceiverCount
      + " exceeds the limit of concurrent xcievers "
      + dataXceiverServer.maxXceiverCount);
}

Slide 65

Slide 65 text

Xceiver
if (curXceiverCount > dataXceiverServer.maxXceiverCount) {
  throw new IOException("xceiverCount " + curXceiverCount
      + " exceeds the limit of concurrent xcievers "
      + dataXceiverServer.maxXceiverCount);
}
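
The guard above is what produces the HDFS error once too many transfer threads are active on a DataNode. The toy model below is not Hadoop code. It is a self-contained sketch, with the hypothetical class `XceiverGuard`, that reproduces the same check so the effect of raising the limit can be seen. On a real cluster the limit is set in hdfs-site.xml through `dfs.datanode.max.transfer.threads` (named `dfs.datanode.max.xcievers` in older releases).

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

final class XceiverGuard {
    private final int maxXceiverCount;
    private final AtomicInteger current = new AtomicInteger();

    XceiverGuard(int maxXceiverCount) {
        this.maxXceiverCount = maxXceiverCount;
    }

    // Registers one more active transfer; fails with the same message as the
    // check above once the configured limit is exceeded.
    void acquire() throws IOException {
        int curXceiverCount = current.incrementAndGet();
        if (curXceiverCount > maxXceiverCount) {
            current.decrementAndGet(); // the rejected transfer does not stay counted
            throw new IOException("xceiverCount " + curXceiverCount
                + " exceeds the limit of concurrent xcievers " + maxXceiverCount);
        }
    }

    // Marks one transfer as finished.
    void release() {
        current.decrementAndGet();
    }

    public static void main(String[] args) {
        XceiverGuard guard = new XceiverGuard(256);
        try {
            for (int i = 0; i < 300; i++) {
                guard.acquire(); // the 257th concurrent transfer is rejected
            }
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a larger limit the same loop of 300 concurrent transfers completes without an exception, which is why raising the setting makes the HBase-side symptoms disappear as long as the node has the threads and memory to back it.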

Slide 66

Slide 66 text

IPC queue (HBASE)

Slide 67

Slide 67 text

Hardware pitfalls
• Be sure that the number of controllers matches the number of disks and SATA ports
• Be sure that your network link can handle your disk I/O capacity
• Be sure of your thread distribution (IRQ, NUMA surprises, ingest + processing + GC + ...)

Slide 68

Slide 68 text

Hardware pitfalls
• Be sure that the number of controllers matches the number of disks and SATA ports
• Be sure that your network link can handle your disk I/O capacity
• Be sure of your thread distribution (IRQ, NUMA surprises, ingest + processing + GC + ...)

Slide 69

Slide 69 text

What we achieved!

Slide 70

Slide 70 text

5 million puts/s

Slide 71

Slide 71 text

...

Slide 72

Slide 72 text

Thanks!