Slide 1

No content

Slide 2

Agenda
- HBase at LINE Messaging Platform
- HBase and Kafka data pipeline
- HBase and Kafka data pipeline applications

Slide 3

HBase at LINE Messaging Platform

Slide 4

About me and HBase Unit
- Joined LINE in 2018 as a new graduate
- Member of the HBase Unit for the LINE Messaging Platform
[Diagram: server app 1, server app 2, ... backed by our HBase clusters]

Slide 5

Which data we store in our HBase
[Diagram: friend and message data, chat metadata; e.g. the RECEIVED_MESSAGE and SEND_MESSAGE tables]

Slide 6

HBase Architecture
[Diagram: a client talks to the cluster; controller nodes (3, 5, 7, ...) run HMaster, NameNode, JournalNode, and the ZooKeeper quorum; worker nodes (4+) each run an HRegionServer and a DataNode]

Slide 7

HBase Architecture (※3 replicas)
[Diagram: same layout; HDFS blocks a, b, and c each have a replica on all three DataNodes]

Slide 8

HBase Architecture (※3 replicas)
[Diagram: regions 1-4 are served by the HRegionServers, on top of the replicated HDFS blocks]

Slide 9

HBase Architecture (※3 replicas)
[Diagram: all regions 1-4 assigned across the HRegionServers; any HRegionServer can serve a region, since the region's blocks are replicated across the DataNodes]

Slide 10

HBase internal write flow
[Diagram: a client talks to Region 1 on RegionServer A; the region holds a memstore, while WALs and HFiles live in HDFS]
1. Client sends a mutation

Slide 11

HBase internal write flow
1. Client sends a mutation
2. Append to the Write Ahead Log (WAL)
3. Update the memstore

Slide 12

HBase internal write flow
1. Client sends a mutation
2. Append to the Write Ahead Log (WAL)
3. Update the memstore
4. Flush the memstore to an HFile
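
To make the write path concrete, here is a minimal client-side sketch using the HBase 1.x Java API; the table and column names are hypothetical. Steps 2-4 above happen inside the RegionServer once put() is called.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("message"))) {  // hypothetical table
      Put put = new Put(Bytes.toBytes("rowkey-1"));
      put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
      // On the RegionServer this appends to the WAL, then updates the
      // memstore; the memstore is flushed to an HFile later.
      table.put(put);
    }
  }
}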

Slide 13

Restore memstore from WAL on RegionServer failure
[Diagram: RegionServer A fails and Region 1's memstore is lost, but its WALs and HFiles remain in HDFS; Region 1 is reopened on RegionServer B]

Slide 14

Restore memstore from WAL on RegionServer failure
[Diagram: RegionServer B restores Region 1's memstore by replaying the WALs from HDFS]

Slide 15

HBase replication and reliability
[Diagram: on RegionServer A in the source cluster, a ReplicationSource reads the WALs of A from HDFS and ships WALEntry batches through a ReplicationEndpoint to RegionServers in the destination cluster]

Slide 16

HBase replication and reliability
[Diagram: same flow, with retries at both hops: shipping WALEntry batches and applying them at the destination RegionServers]

Slide 17

HBase replication and reliability
[Diagram: the replication offset of A (how far A's WALs have been shipped) is recorded in ZooKeeper]

Slide 18

[Same diagram as the previous slide.]

Slide 19

HBase replication and reliability
[Diagram: after RegionServer A dies, RegionServer B takes over: its ReplicationSource reads the WALs of A and resumes from the replication offset of A stored in ZooKeeper]

Slide 20

Setup replication and use cases

$ hbase shell
> add_peer '1', CLUSTER_KEY => "backup001.linecorp.com,...:2181:/hbase"

[Diagram: the user cluster replicates to a backup cluster, and from the Tokyo region to a DR cluster in the Osaka region]

Slide 21

Pluggable ReplicationEndpoint*
Example: logging WALs

$ hbase shell
> add_peer '1', ENDPOINT_CLASSNAME => 'com.linecorp.hbase.LoggingReplicationEndpoint'

* https://issues.apache.org/jira/browse/HBASE-11367
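
The slide only shows how the endpoint is registered. As a rough illustration, a WAL-logging endpoint on HBase 1.x might be sketched as follows; the package and class name come from the slide, everything else is an assumption about how one would implement the HBASE-11367 API.

package com.linecorp.hbase;

import java.util.UUID;
import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
import org.apache.hadoop.hbase.wal.WAL;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingReplicationEndpoint extends BaseReplicationEndpoint {
  private static final Logger LOG = LoggerFactory.getLogger(LoggingReplicationEndpoint.class);

  @Override
  public UUID getPeerUUID() {
    // A stable id for this "peer"; no remote cluster is involved here.
    return UUID.nameUUIDFromBytes("logging-endpoint".getBytes());
  }

  @Override
  public boolean replicate(ReplicateContext context) {
    for (WAL.Entry entry : context.getEntries()) {
      LOG.info("WAL entry: {}", entry);  // just log each shipped WAL entry
    }
    return true;  // ack so the replication source advances its offset
  }

  @Override
  protected void doStart() { notifyStarted(); }

  @Override
  protected void doStop() { notifyStopped(); }
}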

Slide 22

HBase and Kafka pipeline
The first application

Slide 23

In 2017
- We were using HBase 0.90.6-cdh3u5, released in 2012 and no longer supported by the community
- It replicated to an HBase 0.94 cluster used for statistical analysis
[Diagram: server → 0.90.6-cdh3u5 cluster → replication → stats cluster (0.94)]

Slide 24

In 2017
- We were migrating from HBase 0.90.6-cdh3u5 to HBase 1.2.5
[Diagram: the server dual-writes to the 0.90.6-cdh3u5 and 1.2.5 clusters while data is copied across; the old cluster still replicates to the stats cluster (0.94)]

Slide 25

In 2017
- We needed to replicate to the stats cluster so that the statistical analysis kept working
[Diagram: a replication path is now needed from the 1.2.5 cluster to the stats cluster (0.94)]

Slide 26

In 2017
- HBase 1.2.5's official replication does not support replicating to HBase 0.94
[Diagram: the 1.2.5 → 0.94 replication path is marked incompatible]

Slide 27

Why we cannot replicate from 1.2.5 to 0.94
From "HBASE AT LINE 2017" by Tomu Tsuruhara at LINE DEVELOPER DAY 2017
[Release timeline, 2011-2017: 0.90, 0.92, 0.94 (and 0.90.6-cdh3u5), 0.96, 0.98, 1.0, 1.1, 1.2, 1.3; the wire protocol change and API clean-up at 0.96 ("Singularity") put 0.94 and 1.2.5 on incompatible sides]

Slide 28

The pipeline and the first application
- It was difficult to migrate the stats cluster side for various reasons
- So we replicate from HBase 1.2.5 to HBase 0.94 through Kafka, converting the protocol in between
[Diagram: 1.2.5 cluster → custom ReplicationEndpoint → Kafka (custom protocol) → replayer, which uses the HBase 0.94 client and protocol → stats cluster (0.94); the server still dual-writes, and data is copied from 0.90.6-cdh3u5]

Slide 29

Kafka
[Diagram: producers send key:value records to a topic spread over partitions 1-3 on the Kafka brokers; the key determines the partition (a:3, b:1, c:2, d:3, ...); consumers read from the partitions]
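
A minimal producer sketch illustrating the key-to-partition mapping in the diagram; the broker address and topic name are made up.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka001.linecorp.com:9092");  // hypothetical broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // The default partitioner hashes the key, so records with the same key
      // always land in the same partition and stay ordered there.
      producer.send(new ProducerRecord<>("example-topic", "a", "v1"));
      producer.send(new ProducerRecord<>("example-topic", "a", "v2"));  // same partition as above
      producer.send(new ProducerRecord<>("example-topic", "b", "v3"));  // possibly another partition
    }
  }
}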

Slide 30

Protocol for the pipeline
- Avoids contaminating the replayer for HBase 0.94 with the HBase 1.2.5 client
- Defined with Protocol Buffers; contains
  - WAL metadata
  - Cells
- Almost the same as HBase 1.2.5's protocol (a sketch follows)
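
The actual schema is not shown in the deck. A hypothetical Protocol Buffers sketch of "WAL metadata plus cells", shaped like HBase 1.2.5's WAL protocol, could look like this; all message and field names are assumptions (the enum codes borrow HBase's KeyValue type codes).

syntax = "proto2";

// Hypothetical schema; mirrors the shape of HBase 1.2.5's WAL protocol.
message WALEntry {
  required string table_name = 1;           // WAL metadata
  required bytes encoded_region_name = 2;
  required uint64 sequence_id = 3;
  required uint64 write_time = 4;
  repeated Cell cells = 5;
}

message Cell {
  required bytes row = 1;
  required bytes family = 2;
  required bytes qualifier = 3;
  required uint64 timestamp = 4;
  required CellType cell_type = 5;          // PUT, DELETE, ...
  optional bytes value = 6;
}

enum CellType {
  PUT = 4;
  DELETE = 8;
}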

Slide 31

ReplicationEndpoint producing to Kafka
- Uses the pluggable ReplicationEndpoint
- One topic per table
- Kafka key:
  - Encoded region name (the region identifier)
  - Rowkey
[Diagram: ReplicationSource → KafkaReplicationEndpoint → Kafka]

Slide 32

Setup KafkaReplicationEndpoint

$ hbase shell
> add_peer '1',
    ENDPOINT_CLASSNAME => 'com.linecorp.hbase.KafkaReplicationEndpoint',
    CONFIG => {
      "kafka.config.bootstrap.servers" => "kafka001.linecorp.com,...",
      "kafka.config.linger.ms" => "1000",
      "kafka.config.acks" => "all",
      "kafka.config.retries" => "100",
      "kafka.config.client.id" => "linehbase-wal-replicator",
      "topic.name.prefix" => "linehbase-wal",
      "topic.name.suffix" => "v1"
    }
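
Putting the pieces together, a minimal sketch of what such an endpoint could look like on HBase 1.2, using the topic-per-table and region+rowkey key scheme above. LINE's real implementation is not public, so everything beyond the HBase and Kafka client APIs here is an assumption.

package com.linecorp.hbase;

import java.util.Properties;
import java.util.UUID;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.wal.WAL;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical sketch; the real KafkaReplicationEndpoint is internal to LINE.
public class KafkaReplicationEndpoint extends BaseReplicationEndpoint {
  private KafkaProducer<byte[], byte[]> producer;

  @Override
  public UUID getPeerUUID() {
    // A fixed id standing in for a destination cluster id.
    return UUID.nameUUIDFromBytes(Bytes.toBytes("kafka-peer"));
  }

  @Override
  protected void doStart() {
    Properties props = new Properties();
    // In the real setup the "kafka.config.*" keys from the peer CONFIG above
    // would be stripped of their prefix and passed through to the producer.
    props.put("bootstrap.servers", ctx.getConfiguration().get("kafka.config.bootstrap.servers"));
    props.put("acks", "all");
    props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
    producer = new KafkaProducer<>(props);
    notifyStarted();
  }

  @Override
  protected void doStop() {
    producer.close();
    notifyStopped();
  }

  @Override
  public boolean replicate(ReplicateContext context) {
    try {
      for (WAL.Entry entry : context.getEntries()) {
        // Topic per table: prefix + table name + suffix.
        String topic = "linehbase-wal-"
            + entry.getKey().getTablename().getNameAsString() + "-v1";
        for (Cell cell : entry.getEdit().getCells()) {
          // Kafka key = encoded region name + rowkey, so all edits of a row
          // within a region stay ordered inside one partition.
          byte[] key = Bytes.add(entry.getKey().getEncodedRegionName(),
              CellUtil.cloneRow(cell));
          byte[] value = CellUtil.cloneValue(cell);  // stand-in for the protobuf encoding
          producer.send(new ProducerRecord<>(topic, key, value));
        }
      }
      producer.flush();  // only report success once Kafka has the batch
      return true;
    } catch (Exception e) {
      return false;  // the replication source retries the whole batch
    }
  }
}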

Slide 33

The replayer for HBase 0.94
- Consumes the WAL-compatible protobuf data from Kafka
- Converts it to HBase 0.94 mutations (Put, Delete, and so on)
- Writes them using the HBase 0.94 library (sketch below)
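
A rough sketch of such a replayer loop, showing only the Put path; WALEntry here is a stand-in for a class generated from the hypothetical protobuf schema above, and the topic, table, and group names are made up.

import java.util.Collections;
import java.util.Properties;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class Replayer {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka001.linecorp.com:9092");  // hypothetical
    props.put("group.id", "replayer-0.94");                        // hypothetical
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    // HTable is the HBase 0.94 client class; Connection/Table came later.
    HTable table = new HTable(HBaseConfiguration.create(), "stats_table");  // hypothetical
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("linehbase-wal-stats_table-v1"));
      while (true) {
        for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(1000)) {
          WALEntry entry = WALEntry.parseFrom(rec.value());  // hypothetical generated class
          for (WALEntry.Cell cell : entry.getCellsList()) {
            Put put = new Put(cell.getRow().toByteArray());
            // HBase 0.94 uses Put.add(...), not addColumn(...).
            put.add(cell.getFamily().toByteArray(), cell.getQualifier().toByteArray(),
                cell.getTimestamp(), cell.getValue().toByteArray());
            table.put(put);
          }
        }
      }
    }
  }
}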

Slide 34

HBase and Kafka data pipeline
- This kind of pipeline is called "change data capture"
- Strengths!
  - Easy to consume the database mutations
  - High reliability, thanks to the HBase replication implementation and Kafka
- Weaknesses ☹
  - Asynchronous, so there can be delay (100 ms or more)
  - A mutation event cannot see other rowkeys or columns as of the mutation
    - Consumers need to aggregate or query the database themselves

Slide 35

Without the HBase and Kafka data pipeline
[Diagram: the server writes to the HBase tables and would also write to Kafka directly]
- Add a Kafka path to every HBase write path?
- Retry on Kafka failure?
- Won't it affect the service?
- Durability when a server fails while sending to Kafka?

Slide 36

HBase and Kafka data pipeline: reliability
- Add a Kafka path to every HBase write path? → Yes, just by adding a peer
- Retry on Kafka failure? → Yes, Kafka client retries plus retries in the replication source
- Won't it affect the service? → No issue for short failures
- Durability when a server fails while sending to Kafka? → No issue, thanks to replication failover
[Diagram: the RegionServer's ReplicationSource ships WALEntry batches to the ReplicationEndpoint with retries, tracking the replication offset in ZooKeeper]

Slide 37

HBase and Kafka pipeline
Applications

Slide 38

Applications
- We have been running this pipeline for several years and have built applications on it
  - 20+ target tables
  - 1.2M WAL messages/sec at peak
- Introducing the four kinds of use cases and applications we have so far:
  - Replication or data migration that the built-in HBase replication cannot handle
  - Applications running business logic that treat the WAL as an event stream
  - Near-realtime statistics analysis
  - Abuser detection at the storage side

Slide 39

Replication or data migration
[Diagram: a non-secure 1.2.5 cluster, a Kerberos-secured cluster, and a 0.94 cluster]

Slide 40

Replication or data migration
[Diagram: the pipeline replays into the 0.94 cluster via a replayer using the HBase 0.94 client, and into the Kerberos-secured cluster with Kerberos authentication]

Slide 41

Replication or data migration
[Diagram: the same, plus replaying into other middleware]

Slide 42

Applications with WALs: UserSettings
- The user settings service manages per-user settings in key-value format
- It is used not only in the Messaging Platform but also in other services
- Other services want to know about settings changes
[Diagram: a family app service asks the user-settings service for the latest settings, which are stored in the user-settings table]

Slide 43

Applications with WALs: UserSettings
[Diagram: user-settings service → user-settings table → WALs → WAL consumer / event producer → settings events delivered to the Service A and Service B consumers]

Slide 44

Near-realtime statistics analysis
- Traffic bursts to 3x~4x of the daily peak at 00:00 on New Year's Day
  - New Year greetings: "Akeome LINE"
- We monitor various metrics during the New Year burst
  - Message count
    - An important metric, because the load is proportional to the message count (and it's fun)
  - High resolution: every 1 sec, or even every 100 ms
  - Near-realtime: <= 10 seconds delay

Slide 45

Near-realtime statistics analysis
[Diagram: a server operation writes to the SEND_MESSAGE table; the WALs flow to a WAL consumer that counts messages in 100 ms buckets]
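
A minimal sketch of the bucketing idea: consume the SEND_MESSAGE WAL topic and count records into 100 ms buckets. The broker, topic, and group names are assumptions, and a real counter would use the WAL write time decoded from the payload rather than the Kafka record timestamp.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewYearMessageCounter {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka001.linecorp.com:9092");  // hypothetical
    props.put("group.id", "akeome-counter");                        // hypothetical
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    Map<Long, Long> buckets = new HashMap<>();  // bucket start (ms) -> message count
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("linehbase-wal-SEND_MESSAGE-v1"));  // assumed topic
      while (true) {
        for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(100)) {
          // Round the timestamp down to its 100 ms bucket and count the record.
          long bucket = rec.timestamp() / 100 * 100;
          buckets.merge(bucket, 1L, Long::sum);
        }
        // A real implementation would periodically publish and expire buckets.
      }
    }
  }
}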

Slide 46

No content

Slide 47

400K msgs at 00:00:03

Slide 48

No content

Slide 49

Abuser detection
- There are various abusers in the LINE Messaging Platform
- We detect them from various aspects
- For the persistent storage, HBase:
  - Abuse patterns that store massive data over the long term are critical
  - They hurt not only disk usage but also HBase performance
    - Which might affect many other users

Slide 50

Abuser detection
[Diagram: the server writes to the tables; WALs flow to a WAL consumer doing count aggregation (1m, 1d, and 2w counts plus a count changelog); penalty rules ban an abuser and store the penalty in the user-penalties table; the PenaltyGateway reads penalties and blocks requests]

Slide 51

Future work
- Expand usage of the HBase and Kafka data pipeline
  - Secondary index (materialized view)
  - Incremental backup

Slide 52

Secondary index
- HBase only supports indexing by row and column: key → value
  - For example, when Alice becomes a friend of Bob
    - We store Alice → Bob in HBase
    - But lookup from Bob is not supported
- A secondary index is needed for reverse lookup: value → keys (see the sketch below)
- Apache Phoenix provides one option for HBase, with SQL
  - Overhead
  - Overkill for just a secondary index
- We currently use Redis and Cassandra for this purpose
- But we want the secondary index in HBase, for several reasons
  - Reliability
  - Performance
  - Consistency model
  - ...
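
A minimal sketch of a WAL-driven reverse-index builder for the Alice → Bob example, assuming a hypothetical friend_by_value index table and column family; in practice this would run inside a WAL consumer like the ones above.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseIndexBuilder {
  private final Table indexTable;

  public ReverseIndexBuilder(Connection conn) throws Exception {
    // Hypothetical index table holding value -> keys.
    this.indexTable = conn.getTable(TableName.valueOf("friend_by_value"));
  }

  // Called for each key -> value Put decoded from the WAL stream,
  // e.g. row = "Alice", value = "Bob".
  public void onPut(byte[] row, byte[] value) throws Exception {
    Put put = new Put(value);  // index row is the original value ("Bob")
    // One column per original key, so several keys can point at one value.
    put.addColumn(Bytes.toBytes("i"), row, row);
    indexTable.put(put);  // a lookup of "Bob" now returns "Alice"
  }
}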

Slide 53

Secondary index
[Diagram: the server writes key → value to the tables; a WAL consumer builds the secondary index, writing value → keys back into the tables]

Slide 54

HBase's incremental backup
[Timeline diagram: take a full backup (F) first; then a cron job runs an MR job at t1, t2, t3 to take incremental backups of the WALs and HFiles into storage (HDFS, Amazon S3, ...)]

Slide 55

HBase's incremental backup: pain points
[Timeline diagram: a bug is released between backup runs; restoring from the backup loses sound data written after the last run at t3]
- All WALs remain on the cluster until the cron job runs
- The MR job puts extra load on the cluster

Slide 56

Incremental backup using the pipeline
[Diagram: HBase → WAL consumer → storage (HDFS, Amazon S3, ...); take a snapshot (F) once, then stream the WALs continuously; no impact on HBase]

Slide 57

Incremental backup using the pipeline
[Diagram: the WAL consumer additionally makes HFiles from the accumulated WALs at t1, t2, t3 for fast restore]

Slide 58

Incremental backup using the pipeline: restore
[Diagram: when a bug is released, restore HBase from the snapshot (F) plus the backed-up WALs/HFiles up to t1, t2, or t3]

Slide 59

Conclusion
- HBase and Kafka data pipeline for the LINE Messaging Platform
  - Built on the HBase WAL and replication
  - A powerful and reliable way to consume DB mutations
- Our actual use cases of the pipeline
  - Replication or data migration that the built-in HBase replication cannot handle
  - Applications running business logic that treat the WAL as an event stream
  - Near-realtime statistics analysis
  - Abuser detection at the storage side
- Possible use cases
  - Secondary index
  - Incremental backup
  - What's your idea?