Overview of LINE Messaging storages

› Redis: in-memory data structure store, used as both a database and a cache.
› Apache HBase: distributed key/value store that provides random, realtime read/write access, with HDFS (Hadoop Distributed File System) as the persistence layer.
› Apache Kafka: asynchronous task processor.
› Stored data: messages, events, user data, social graph, …

[Diagram: an HBase client looks up the Master and the meta table, then talks to RegionServers hosting table regions (tableA region 1, tableA region 2); RegionServers persist data to HDFS.]
Storage requirements

› Maintainable and scalable; avoid unnecessary cost.
› Performant, reliable, and highly available.
› Data consistency.

Scale:
› 1.6 petabytes of data for service purposes.
› Approx. 2.7 trillion requests / day.
› Approx. 3x traffic at peak on New Year (100 million requests / sec).
Redis in-house sharded cluster:
› No dynamic resizing.
› No way to upgrade versions.
› High operational cost.

Redis official cluster (from Redis 3.x):
› Dynamic resizing.
› Consumes more memory.
› Performance degrades as the cluster grows.
› Gossip traffic eats a lot of network bandwidth in a big cluster.

In general, the cost of memory > the cost of disk.

Apache HBase:
› Horizontal scalability.
› Rolling version upgrades.
› Easy to operate (plus in-house automation tools we developed).

[Diagram: background migration from the old cluster to the new Apache HBase cluster while the application keeps serving traffic.]
Highly Available and fault tolerant

[Diagram: under high traffic and a high volume of data, backend application threads pile up waiting on a busy storage node.]
› Clients use short timeouts, but sometimes that is not enough: keep recovery time at its minimum.
› Redis: a short circuit breaker for fast failure (see the sketch below):
  › Redis is single-threaded, so one slow instance can make the application wait on thousands of Redis responses.
  › When response times increase, temporarily mark the shard as failed and stop sending requests to it.
› HBase: dual clusters:
  › Both clusters store the same data.
  › Requests pick the fastest response, the higher-priority cluster, or merge results.
  › Mostly for immutable data, or where occasional eventual consistency is tolerated.

Used in critical places with high traffic and a high volume of data.
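A minimal sketch of the kind of per-shard circuit breaker described above. The class name, thresholds, and structure are illustrative assumptions, not LINE's actual implementation:

import java.util.concurrent.atomic.AtomicLong;

// Illustrative per-shard short circuit breaker (not LINE's actual code).
// When a response is slow, the shard is marked failed for a short cool-down
// window, so callers fail fast instead of queueing behind the busy instance.
public class ShardBreaker {
    private static final long SLOW_THRESHOLD_MS = 50;   // assumed tuning value
    private static final long COOL_DOWN_MS = 1_000;     // assumed tuning value

    private final AtomicLong failedUntil = new AtomicLong(0);

    public boolean allowRequest() {
        return System.currentTimeMillis() >= failedUntil.get();
    }

    public void recordResponseTime(long elapsedMs) {
        if (elapsedMs > SLOW_THRESHOLD_MS) {
            // Temporarily mark the shard as failed; requests skip it until this expires.
            failedUntil.set(System.currentTimeMillis() + COOL_DOWN_MS);
        }
    }
}

A caller would check allowRequest() before issuing a Redis command and, on fast failure, fall back to the primary storage instead of blocking a thread.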
Redis as primary storage

[Diagram: the backend application writes to the Redis storage cluster (primary storage) and syncs data asynchronously to HBase through an asynchronous task processor on Apache Kafka, with retries.]

› No transactions between Redis and HBase.
› Suffers from race conditions: both storages must be kept consistent over time.
› Still everything in memory (expensive).
› Big technical debt.
HBase as primary storage

[Diagram: the backend application writes to the Apache HBase cluster (primary storage) and syncs data asynchronously to the Redis cache cluster through an asynchronous task processor on Apache Kafka, with retries; a sketch of such a worker follows.]

› Single source of truth.
› Reduced memory usage.
› No data loss, thanks to HBase persistence and HDFS data replication.
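A minimal sketch of the asynchronous task processor under the assumptions above: cache-sync tasks flow through Kafka and are retried on failure. The topic name, CacheClient interface, and retry-by-re-enqueue strategy are assumptions for illustration; only the Kafka client calls are real API:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Illustrative cache-sync worker (not LINE's actual code): consume sync tasks
// from Kafka, apply them to the Redis cache, re-enqueue failed tasks for retry.
public class CacheSyncWorker {
    interface CacheClient { void apply(String key, String value) throws Exception; }

    public static void run(Properties props, CacheClient redis) {
        // props is assumed to carry bootstrap servers and (de)serializer settings
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(List.of("cache-sync-tasks"));
            while (true) {
                for (ConsumerRecord<String, String> task : consumer.poll(Duration.ofSeconds(1))) {
                    try {
                        redis.apply(task.key(), task.value()); // mirror the HBase write into Redis
                    } catch (Exception e) {
                        // Failed tasks go back onto the topic and are retried later.
                        producer.send(new ProducerRecord<>("cache-sync-tasks", task.key(), task.value()));
                    }
                }
            }
        }
    }
}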
HBase requirements and standards

› HBase became the primary storage for our core features, with a bigger impact on user experience.
› New features are built on top of HBase.
› It started to be used not only for messaging but also for new services and modules (approx. 200M active users).
› Redis, as a cache, still serves most of the reads.
LINE Messaging service storage requirements

› Data consistency, and reduced risk of data loss.
› High availability, by dualizing clusters in critical areas.
› Make our cluster more reliable and performant: evaluate every version, setting, and feature we use.
› Performance depends on how we use it: how we model and access data plays an important role.
Evaluate new versions and features

[Diagram: the backend application's RPCs (Put, Get, Scan, …) against the production HBase cluster are mirrored by a Replayer consumer and replayed against a shadow HBase cluster in a test environment; a sketch of such a replayer follows.]

› Replay RPCs.
› Deploy the new version or feature.
› Detect and fix bugs, backport, etc.
› Contribute to the open source project.
› Safely apply on production!
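A sketch of what a replayer consumer could look like. The LoggedRpc record and the way RPCs are captured and transported are assumptions; only the HBase client calls are real API:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

// Illustrative replayer (not LINE's actual code): mirrored production RPCs are
// re-issued against the shadow (test-env) cluster to exercise the new version.
public class ShadowReplayer {
    record LoggedRpc(String type, String table, byte[] row,
                     byte[] family, byte[] qualifier, byte[] value) {}

    public static void replay(Connection shadowConn, LoggedRpc rpc) throws Exception {
        try (Table table = shadowConn.getTable(TableName.valueOf(rpc.table()))) {
            switch (rpc.type()) {
                case "Put" -> table.put(new Put(rpc.row())
                        .addColumn(rpc.family(), rpc.qualifier(), rpc.value()));
                case "Get" -> table.get(new Get(rpc.row()));
                case "Scan" -> {
                    try (ResultScanner rs = table.getScanner(new Scan().withStartRow(rpc.row()))) {
                        rs.forEach(r -> {}); // drain; we only care about shadow-cluster behavior
                    }
                }
                default -> { /* other RPC types omitted in this sketch */ }
            }
        }
    }
}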
Evaluate new versions and features

[Diagram: the same RPC replay setup (Put, Get, Scan, …), mirroring RPCs from a production HBase RegionServer to a test-env RegionServer via the Replayer consumer.]

Recent work: improve cluster reliability by reducing performance spikes:
› Access to disk is more unpredictable than access to memory.
› Networks can be occasionally unstable.
Testing Hedged Reads

› Hedged Reads may not trigger under usual traffic: we need to simulate a slow network or a flaky disk.
› We can use LD_PRELOAD, which tells the Unix dynamic linker to load our code before any other library.
› In our case, we want to sleep whenever read(2) is called.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

// http://man7.org/linux/man-pages/man2/read.2.html
typedef ssize_t (*real_read_t)(int, void *, size_t);

ssize_t read(int fd, void *data, size_t size) {
    // Our malicious code: simulate a slow disk / network
    sleep(3);
    // Behave just like the regular syscall would
    return ((real_read_t)dlsym(RTLD_NEXT, "read"))(fd, data, size);
}

Compile it as a shared library:

gcc -shared -fPIC -o inject_read.so inject_read.c -ldl

Then add to your hadoop-env.sh:

export LD_PRELOAD=${path_to_file}/inject_read.so
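To check the injection before pointing it at Hadoop, one can preload the library into any dynamically linked binary that calls read(2); this quick smoke test is our suggestion, not from the original slides:

LD_PRELOAD=./inject_read.so cat /etc/hostname

The output should appear only after the injected 3-second delay.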
Thread dump on the RegionServer (abridged):

   - parking to wait for <0x00007f5395078198> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
   at org.apache.hadoop.hbase.regionserver.HStore.add(HStore.java:724)
   …

"RpcServer.default.RWQ.Fifo.read.handler=309,queue=26,port=11471" …
   java.lang.Thread.State: WAITING (parking)
   - parking to wait for <0x00007f5395078198> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
   …

"RpcServer.default.RWQ.Fifo.read.handler=330,queue=34,port=11471" #378 daemon prio=5 os_prio=0 tid=0x00007f63afa57000 nid=0xce06 waiting on condition [0x00007f52bbf01000]
   java.lang.Thread.State: WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for <0x00007f55a244c520> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
   at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
   at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
   at org.apache.hadoop.hdfs.DFSInputStream.getFirstToComplete(DFSInputStream.java:1435)
   at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1400)
   at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1538)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1507)
   …
Testing Hedged Reads: deadlock

1. HBase acquires the read lock: lock.readLock().lock();
2. The HDFS client submits a task to read from a Datanode.
3. No actual result comes back from the Datanode.
4. The results blocking queue is empty, so the read lock is never released!

Simplified read path (a runnable reproduction follows):

hedgedService.submit(readTask);
future = hedgedService.poll(timeout);  // consumes the (failed) task, if any
try {
    if (future != null) future.get();
} catch (ExecutionException e) {
    // Ignore
}
result = hedgedService.take();  // BlockingQueue.take()!!
// hangs forever because there is no completed task in the BlockingQueue

We were affected by HDFS-11303 "Hedged read might hang infinitely if read data from all DN failed", and backported HDFS-11303 to our internal LINE Hadoop branch.
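A self-contained reproduction of the hang pattern, under the simplification above (this models the buggy logic, it is not the actual HDFS source):

import java.util.concurrent.*;

// The only submitted task fails; poll() consumes its failed Future, the failure
// is ignored, and the subsequent take() blocks forever on an empty queue.
public class HedgedReadHang {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        ExecutorCompletionService<byte[]> hedgedService = new ExecutorCompletionService<>(pool);

        hedgedService.submit(() -> { throw new java.io.IOException("all DN reads failed"); });

        Future<byte[]> failed = hedgedService.poll(100, TimeUnit.MILLISECONDS);
        try {
            if (failed != null) failed.get();   // surfaces the IOException...
        } catch (ExecutionException e) {
            // ...which is ignored, just like in the buggy read path
        }

        System.out.println("calling take() with nothing left in the queue...");
        hedgedService.take();                   // hangs forever: no task will ever complete
    }
}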
› Let the community know about it: HBASE-24469 Hedged read might hang infinitely if read data from all DN failed.
› Fixed metrics: the hedged-read metrics are mentioned in the official book, so they should be there! We spent lots of time figuring out whether our testing process or our settings were wrong.
› Added the metrics back to HBase 1.x: HBASE-24435 Bring back hedged reads metrics to branch-1.
› Exposed a new metric on the main branch: HBASE-24994 Add hedgedReadOpsInCurThread metric.
Evaluate new versions and features

Made our clusters stronger by:

› Fixed:
  › HBASE-23205 Correctly update the position of WALs currently being replicated
  › HBASE-22715 All scan requests should be handled by scan handler threads in RWQueueRpcExecutor
  › HBASE-21418 Reduce a number of reseek operations in MemstoreScanner when seek point is close to the current row
  › HBASE-24994 Add hedgedReadOpsInCurThread metric
  › HBASE-24435 Bring back hedged reads metrics to branch-1
  › HBASE-24402 Moving the meta region causes MetricsException when using above 2.6.0 hadoop version
  › …
› Reported:
  › HBASE-24903 'scandetail' log message is missing when responseTooSlow happens in the rpc that closes the scanner
  › HBASE-21738 Remove all the CSLM#size operation in our memstore because it's an quite time consuming
  › HBASE-24469 Hedged read might hang infinitely if read data from all DN failed
  › …
› Backports to our branches:
  › HBASE-24742 Improve performance of SKIP vs SEEK logic
  › HBASE-24282 'scandetail' log message is missing when responseTooSlow happens on the first scan rpc call
  › HBASE-21748 Remove all the CSLM#size operation in our memstore because it's an quite time consuming
  › HDFS-11303 Hedged read might hang infinitely if read data from all DN failed
  › …
The importance of a good schema

› The cluster can be healthy, yet we could still see bad performance.
› Table schema design plays an important role in performance.
› It is important to understand the technology's internals.
Study case: Message Id list table

MessageIdList table: a list of message Ids per chat.

[Chat screenshot: users A and B exchange messages ("WOW", "Awesome!"), each marked "Read 17:23 PM".]

› Key: userId1 : userId2; values: messageId 7, messageId 11, …
› Event "mark as read": user A reads up to messageId 11.
› We want to scan by messageId range.
Study case: Message Id list table

MessageIdList table schema:

› Row key: hash(userIdA):userIdB, with userId A < userId B; e.g. hash(111):222, hash(333):555. The hash prefix spreads rows across regions (e.g. RegionServer1, regionB).
› One column family, with one column per messageId (col: messageId).
› Cell value: senderUserId; cell version: messageId.
› Using the messageId as the version allows us to do range scans, e.g. scan from version 123 to version 245 (see the sketch below).
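A sketch of the range scan this schema enables (table and row names are illustrative, taken from the example above). Because the cell version is the messageId, setTimeRange() turns a "messageId range" query into a plain versioned read on a single row:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MessageIdRangeScan {
    public static void scanRange(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("MessageIdList"))) {
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("hash(333):555"))
                    .withStopRow(Bytes.toBytes("hash(333):555"), true) // single row, inclusive
                    .setTimeRange(123L, 246L);  // versions 123..245 (max is exclusive)
            try (ResultScanner results = table.getScanner(scan)) {
                results.forEach(r -> System.out.println(r)); // each cell: sender, messageId
            }
        }
    }
}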
Study case: Message Id list table

Sorted MemStore: an in-memory sorted store implemented as a SkipList.

[Diagram: a skip list with several levels of forward pointers ending in ∞ sentinels, shown looking up key 8.]

› Insert / Delete / Read: O(log N) time complexity.
› Scales better than a simple linked list!
Study case: Message Id list table

Memstore: SkipList. Scan row key "hash(333):555" from version 1000 to version 1010.

[Diagram: the SkipList holds cells keyed by (RowKey, Column, Version = messageId); whenever a column's version falls outside 1000 <= version <= 1010, the scanner seeks to the next column: "hope you have more luck in the next column".]
Study case: Message Id list table

Memstore: SkipList. Scan row key "hash(333):555" from version 1000 to version 1010.

› Each seek to the next column is implemented with ConcurrentSkipListMap.tailMap, which needs to traverse the SkipList: O(log N).
› With M columns to check, the scan costs M reseeks: O(M * log N) (see the toy model below).
› Tried to fix: HBASE-21418 Reduce a number of reseek operations in MemstoreScanner when seek point is close to the current row.
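A toy model of that cost, assuming the MemStore is a ConcurrentSkipListMap keyed by "column/version" strings (this is a simplification, not HBase code). Each reseek is a fresh tailMap() lookup, i.e. a full skip-list descent, even when the target is the very next entry:

import java.util.concurrent.ConcurrentSkipListMap;

public class ReseekCost {
    public static void main(String[] args) {
        ConcurrentSkipListMap<String, Long> memstore = new ConcurrentSkipListMap<>();
        for (int col = 0; col < 1000; col++) {
            memstore.put(String.format("col%04d/v%d", col, 1000 + col), (long) col);
        }
        // Reseek-style access: one O(log N) skip-list descent per column.
        int reseeks = 0;
        for (String key = memstore.firstKey(); key != null; reseeks++) {
            var next = memstore.tailMap(key, false);   // descends from the top every call
            key = next.isEmpty() ? null : next.firstKey();
        }
        System.out.println("reseeks: " + reseeks);     // M descents, O(M * log N) overall
    }
}

HBASE-21418's idea is essentially to skip the descent when the seek point is close to the current position and just step forward instead.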
The importance of a good schema

Study case: Message Id list table

On disk (HFile), the same scan (row key "hash(333):555", versions 1000 to 1010) works like this:

› Load the data block from disk into a ByteBuffer, starting at position 0.
› blockSeek then advances cell by cell from position 0 until it reaches the first matching cell (e.g. position 780: include; position 790: include; …).
› This advance is linear within the block: O(N) (see the sketch below).
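A toy model of that linear walk, using a simplified cell layout of [keyLen][key bytes] rather than the real HFile format (the format details here are assumptions for illustration):

import java.nio.ByteBuffer;

public class BlockSeekSketch {
    // Cells sit back to back in the block, so finding the first matching key
    // can only advance cell by cell from position 0: an O(N) walk in the block.
    static int blockSeek(ByteBuffer block, byte[] targetKey) {
        block.position(0);
        while (block.hasRemaining()) {
            int pos = block.position();
            int keyLen = block.getInt();            // length-prefixed cell key
            byte[] key = new byte[keyLen];
            block.get(key);                         // advance past this cell
            if (java.util.Arrays.compare(key, targetKey) >= 0) {
                return pos;                         // e.g. position 780 in the slide
            }
        }
        return -1;                                  // key not in this block
    }

    public static void main(String[] args) {
        ByteBuffer block = ByteBuffer.allocate(64);
        for (String k : new String[] {"a", "b", "c"}) {
            byte[] kb = k.getBytes();
            block.putInt(kb.length).put(kb);
        }
        block.flip();
        System.out.println(blockSeek(block, "b".getBytes())); // prints 5: offset of cell "b"
    }
}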
› We want to offer the best to our users while avoiding unnecessary cost.
› Storage needs to be performant, reliable, highly available, and scalable.
› We need to protect our data against inconsistencies.
› Make our clusters reliable:
  › Test and evaluate every version or feature carefully.
  › Build a safe testing environment as similar to production as possible.
› Good data schema design is key for performance:
  › You must understand the technology's internals to make a good design.
Multi Data Center architecture:

[Diagram: Apache HBase in the JP1 DC replicating asynchronously, with some latency, to Apache HBase in the JP2 DC; eventual consistency.]

› Disaster recovery only, for now: machines are underutilized.
› The nature of a messaging service makes active-active multi-DC very challenging.

Adapt better to projects with different needs; key/value storage limitations:
› Some projects have no such high performance requirements, but need better consistency and transactional features.
› They still require scalability.
› We need to consider and explore new storages.