

Traffic intensive storages at LINE's Messaging Application

Javier Akira Luca de Tena
LINE Z Part Team Senior Software Engineer
https://linedevday.linecorp.com/2020/ja/sessions/6595
https://linedevday.linecorp.com/2020/en/sessions/6595

LINE DevDay 2020

November 25, 2020

Transcript

  1. Agenda › Introduction › LINE Messaging service storage requirements ›

    Achieving storage requirements with Apache HBase › Make cluster stronger by evaluation › The importance of a good schema
  3. LINE Messaging Application › Approx. 200 million active users ›

    Around 4 billion messages sent every day. › More than a messaging application: › Many family services: News, Music, LIVE, Pay, etc…
  4. LINE Messaging App simplified logic send message Messaging Application Backend

    event: message received read message event: message sent event: message checked update profile event: profile updated event: notified update profile event: notified read message Basically a big synchronization service. A B
  5. Overview of LINE Messaging backend architecture Backend Application LEGY Asynchronous

    task processor Apache Kafka Redis Redis Redis Storages play a very important role in LINE Messaging service. Apache HBase Cluster
  6. Overview of LINE Messaging backend architecture Backend Application LEGY Asynchronous

    task processor Apache Kafka Redis Redis Redis In-memory data structure store Used as a database and cache Distributed key/value store that provides random, realtime read/write access on top of HDFS (Hadoop Distributed File System) persistence layer Messages Events User data Social Graph … Apache HBase Cluster
  7. Java Application Cluster Redis in-house sharded cluster Overview of LINE

    Messaging Storages RedisClusterMonitor master slave shard-1 ClusterManagerServer Zookeeper (shards info) master slave shard-2 master slave shard-3 LINE Redis Client Sync Update Monitoring Health Check Redis official cluster gossip protocol Redis Client Masters Slaves Java Application
  8. Java Application Cluster Redis in-house sharded cluster RedisClusterMonitor master slave

    shard-1 ClusterManagerServer master slave shard-2 master slave shard-3 LINE Redis Client Update Monitoring Health Check Redis official cluster gossip protocol Redis Client Masters Slaves Java Application Sync Zookeeper (shards info) Overview of LINE Messaging Storages
  9. Java Application Apache HBase Zookeeper (configuration, synchronization, etc…)

    Master HBase Client RegionServer RegionServer RegionServer meta table tableA region 2 tableA region 1 Lookup Master, meta table HDFS (Hadoop Distributed File System) Overview of LINE Messaging Storages
  10. Datanode Datanode Datanode blockA replica = 3 HDFS Namenode blockA

    blockA blockA Datanode blockB blockB blockB Apache HBase Java Application Master HBase Client RegionServer RegionServer RegionServer Lookup Master, meta table Zookeeper (configuration, synchronization, etc…) Overview of LINE Messaging Storages
  11. Agenda › Introduction › LINE Messaging service storage requirements ›

    Achieving storage requirements with Apache HBase › Make cluster stronger by evaluation › The importance of a good schema
  12. LINE Messaging service storage requirements 200M AU (approx.) Storage Storage

    Storage Maintainable and scalable Avoid unnecessary cost Performant Reliable Highly available Data consistency 1.6 Petabytes of data for service purposes Approx. 2.7 trillion requests / day x3 traffic at peak on New Year (100 million requests / sec) …
  13. Maintainable and scalable LINE Messaging service storage requirements › Redis

    in-house sharded cluster: › No dynamic resizing. › No practical path for version upgrades. › High operational cost. › Redis official cluster (from Redis 3.x): › Dynamic resizing. › Consumes more memory. › Performance degradation when the cluster becomes big. › Gossip traffic eats lots of network in a big cluster. › In general, cost of memory > cost of disk. › Apache HBase › Horizontal scalability. › Rolling version upgrades. › Easy to operate (+ developed in-house automation tools). HBase node Performance HBase node HBase node Performance Redis in-house cluster Redis official cluster Apache HBase cluster old cluster new cluster Application background migration
  14. Highly Available and fault tolerant LINE Messaging service storage requirements

    Backend Application Critical places with high traffic and high volume of data Storage Storage Storage Backend Application Backend Application
  15. LINE Messaging service storage requirements Backend Application Critical places with

    high traffic and high volume of data Storage Storage Storage Backend Application Backend Application busy! waiting waiting Threads waiting waiting Highly Available and fault tolerant
  16. LINE Messaging service storage requirements Apache HBase Cluster-A Backend Application

    › Client short timeout. But sometimes not enough. › Keep recovery time at its minimum possible. › Redis: short circuit breaker for fast failure: › Redis is single threaded; one slow instance can make the application wait for thousands of Redis responses. › When response times increase, temporarily mark the shard as failed and stop sending requests to it. › HBase Dual clusters: › Store same data. › Requests will select the fastest / highest priority cluster, or merge results. › Mostly for immutable data or where punctual eventual consistency is tolerated. Threads waiting busy! Redis Redis Apache HBase Cluster-B RegionServer RegionServer RegionServer Redis Cache Critical places with high traffic and high volume of data Highly Available and fault tolerant
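The short circuit breaker described above can be sketched in a few lines of Java. This is a minimal illustration, not LINE's actual implementation; all class and method names are hypothetical. The idea: record observed response times, and while a shard is marked failed, callers fail fast instead of parking threads on slow Redis responses.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Minimal sketch of a short circuit breaker for one Redis shard.
 * A response slower than the threshold "opens" the breaker; while it is open,
 * callers fail fast instead of queueing behind a slow shard.
 */
class ShardCircuitBreaker {
    private final long slowThresholdMillis;  // response time considered "slow"
    private final long openDurationMillis;   // how long the shard stays marked failed
    private final AtomicLong openedAt = new AtomicLong(-1);

    ShardCircuitBreaker(long slowThresholdMillis, long openDurationMillis) {
        this.slowThresholdMillis = slowThresholdMillis;
        this.openDurationMillis = openDurationMillis;
    }

    /** Record an observed response time; open the breaker if it was too slow. */
    void onResponse(long elapsedMillis, long nowMillis) {
        if (elapsedMillis > slowThresholdMillis) {
            openedAt.set(nowMillis);
        }
    }

    /** True while the shard is marked failed: callers fail fast, no request is sent. */
    boolean isOpen(long nowMillis) {
        long openedTime = openedAt.get();
        return openedTime >= 0 && nowMillis - openedTime < openDurationMillis;
    }
}
```

A real breaker would also track error rates and use half-open probing before fully closing; the sketch only shows the fail-fast mechanism the slide describes.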
  17. LINE Messaging service storage requirements Data Consistency Backend Application Redis

    Storage Cluster Active Users Redis usage Early LINE Messaging Application: most of our data in memory. Primary storage
  18. LINE Messaging service storage requirements Data Consistency Apache HBase Cluster

    async sync data Backend Application Redis Storage Cluster › No transactions between Redis and HBase. › It suffers from race conditions. Need to keep both storages consistent over time. › Still all in memory (expensive). › Big technical debt. Primary storage Asynchronous task processor Apache Kafka retry
  19. LINE Messaging service storage requirements Data Consistency: Redis as cache

    - HBase as primary storage Apache HBase Cluster sync data Backend Application Redis Storage Cache Cluster Primary storage Asynchronous task processor Apache Kafka retry › Single source of truth. › Reduced memory usage. › No data loss thanks to HBase persistence and HDFS data replication.
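The cache pattern above (Redis as cache, HBase as the single source of truth) can be modeled with plain maps standing in for the two stores. This is a toy sketch, not the production code; the class and method names are hypothetical. The point it demonstrates: evicting or losing the cache never loses data, because reads fall back to the primary and repopulate the cache.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Read-through cache sketch: "hbase" is the primary storage and single source
 * of truth, "redisCache" only holds a copy that can always be rebuilt.
 */
class ReadThroughStore {
    private final Map<String, String> redisCache = new HashMap<>(); // stand-in for Redis
    private final Map<String, String> hbase = new HashMap<>();      // stand-in for HBase

    void write(String key, String value) {
        hbase.put(key, value);       // write the primary first
        redisCache.put(key, value);  // then refresh the cache
    }

    Optional<String> read(String key) {
        String cached = redisCache.get(key);
        if (cached != null) return Optional.of(cached);  // most reads hit the cache
        String fromPrimary = hbase.get(key);             // cache miss: go to HBase
        if (fromPrimary != null) redisCache.put(key, fromPrimary); // repopulate
        return Optional.ofNullable(fromPrimary);
    }

    /** Losing a cache entry is safe: the data survives in the primary storage. */
    void evict(String key) { redisCache.remove(key); }
}
```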
  20. LINE Messaging service storage requirements Increasing responsibility and usage of

    HBase › HBase becomes primary storage for our core features. › New features are built on top of HBase. › Started to be used not only for messaging, but for new services and modules. › Redis as cache still serves most of the reads. Bigger impact on user experience HBase Requirements and standards 200M AU (aprox.)
  21. Horizontal scalability. Reduced cost by reducing memory usage. Reliability and

    performance: ? Data consistency and reduce risk of data loss. › Make our cluster more reliable and performant: Evaluate every version, settings and features we use. › Performance depends on how we use it: How we model and access data play an important role. High availability by dualizing clusters in critical areas. LINE Messaging service storage requirements
  22. Agenda › Introduction › LINE Messaging service storage requirements ›

    Achieving storage requirements with Apache HBase › Make cluster stronger by evaluation › The importance of a good schema
  23. Evaluate new versions and features Apache Kafka topic Intercept Serialize

    Put, Get, Scan…. with Protobuf Replay RPC (Put, Get, Scan….) Replayer consumer Shadow HBase RPCs Application Backend HBase Cluster (prod env) HBase Cluster (test env)
  24. Apache Kafka topic Intercept Serialize Put, Get, Scan…. with Protobuf

    Replay RPC (Put, Get, Scan….) Replayer consumer Shadow HBase RPCs Application Backend HBase Cluster HBase Cluster (test env) Evaluate new versions and features › Replay RPCs › Deploy new version or feature. › Detect and fix bugs, backport, etc… › Contribute to the open source project. › Safely apply on production!
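The shadowing pipeline above can be sketched with an in-memory queue standing in for the Kafka topic (a toy model; class names and the string RPC encoding are hypothetical, and the real system serializes with Protobuf). The shape is: an interceptor on the production path records each RPC and still serves it normally, while a separate replayer consumer re-issues the recorded RPCs against a test cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of RPC shadowing: production RPCs are intercepted and published to a
 * topic (modeled as a queue), then replayed against a shadow test cluster.
 */
class RpcShadowing {
    private final BlockingQueue<String> topic = new LinkedBlockingQueue<>(); // "Kafka topic"
    final List<String> replayedOnTestCluster = new ArrayList<>();

    /** Interceptor on the production client: record the RPC, then serve it normally. */
    String interceptAndServe(String rpc) {
        topic.offer(rpc);                // fire-and-forget publish of the serialized RPC
        return "prod-result-of:" + rpc;  // the production cluster still answers the caller
    }

    /** Replayer consumer: drain recorded RPCs and re-issue them on the test cluster. */
    void replayAll() {
        String rpc;
        while ((rpc = topic.poll()) != null) {
            replayedOnTestCluster.add(rpc);  // stand-in for sending to the shadow cluster
        }
    }
}
```

The key property shown is that shadowing is asynchronous and off the critical path: the production response does not wait for the replay.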
  25. Apache Kafka topic Intercept Serialize Put, Get, Scan…. with Protobuf

    Replay RPC (Put, Get, Scan….) Replayer consumer Shadow HBase RPCs Application Backend HBase RegionServer HBase RegionServer (test env) Evaluate new versions and features Recent work: Improve cluster reliability by reducing performance spikes: › Access to disk is more unpredictable than memory access. › Networks can occasionally be unstable.
  26. HBase RegionServer Hedged Reads slow network Apache HBase HDFS Datanode

    Datanode Datanode B … B … … … B … … slow or flaky disk Disk HDFS Client
  27. How to test? Testing Hedged Reads › Hedged Reads may not trigger with usual

    traffic. › Need to simulate a slow network or a flaky disk. › We can use LD_PRELOAD. This tells the Unix dynamic linker to load your code before any other library. › In our case, we want to sleep when read(2) is called. #define _GNU_SOURCE #include <dlfcn.h> #include <unistd.h> … typedef ssize_t (*real_read_t)(int, void *, size_t); // http://man7.org/linux/man-pages/man2/read.2.html ssize_t read(int fd, void *data, size_t size) { // Our malicious code sleep(3); // Behave just like the regular syscall would return ((real_read_t)dlsym(RTLD_NEXT, "read"))(fd, data, size); } gcc -shared -fPIC -o inject_read.so inject_read.c -ldl › Add in your hadoop-env.sh: export LD_PRELOAD=${path_to_file}/inject_read.so
  28. Testing Hedged Reads: deadlock "RpcServer.default.RWQ.Fifo.write.handler=27,queue=2,port=11471" …
 java.lang.Thread.State: WAITING (parking)
 -

    parking to wait for <0x00007f5395078198> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at org.apache.hadoop.hbase.regionserver.HStore.add(HStore.java:724)
 … "RpcServer.default.RWQ.Fifo.read.handler=309,queue=26,port=11471" …
 java.lang.Thread.State: WAITING (parking)
 - parking to wait for <0x00007f5395078198> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
 … "RpcServer.default.RWQ.Fifo.read.handler=330,queue=34,port=11471" #378 daemon prio=5 os_prio=0 tid=0x00007f63afa57000 
 nid=0xce06 waiting on condition [0x00007f52bbf01000]
 java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) 
 - parking to wait for <0x00007f55a244c520> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) 
 at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
 at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193) 
 at org.apache.hadoop.hdfs.DFSInputStream.getFirstToComplete(DFSInputStream.java:1435) 
 at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1400) 
 at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1538) 
 at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1507) 
 …
  29. CompletionService<..> hedgedService = new ExecutorCompletionService<..>(..); // submit the first read

    hedgedService.submit(readTask); try { future = hedgedService.poll(timeout); } catch (ExecutionException e) { // Ignore } result = hedgedService.take(); // BlockingQueue.take() !! // hangs forever because there is no completed task in the BlockingQueue Testing Hedged Reads: deadlock HBase RegionServer HDFS Client Datanode Read 1- HBase acquires read lock: lock.readLock().lock(); 2- Submit task to read from Datanode 3- No actual result from Datanode 4- Results blocking queue is empty Read lock is never released! We were affected by HDFS-11303 Hedged read might hang infinitely if read data from all DN failed. Backported HDFS-11303 to our Hadoop internal LINE branch.
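The mechanism behind that hang can be demonstrated with the JDK's own `CompletionService`. This is a simplified stand-in, not the actual HDFS client code: if no submitted read ever completes (a Datanode that never replies, simulated here by a task that sleeps), `poll(timeout)` gives up and returns null, while a `take()` at that point would park until a task completes, which never happens.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/**
 * Demonstrates the hedged-read hang: with only never-completing tasks
 * submitted, poll(timeout) returns null; take() would block forever.
 */
class HedgedReadHang {
    static boolean pollFindsNothing() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CompletionService<String> hedgedService = new ExecutorCompletionService<>(pool);
            // A "read" that never finishes, like a Datanode that never replies.
            hedgedService.submit(() -> { Thread.sleep(60_000); return "data"; });
            // poll() with a timeout gives up and returns null...
            Future<String> f = hedgedService.poll(50, TimeUnit.MILLISECONDS);
            // ...but hedgedService.take() here would park indefinitely,
            // exactly the state the stack trace on the previous slide shows.
            return f == null;
        } catch (InterruptedException e) {
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```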
  30. Testing Hedged Reads Contributing back to the Open Source community:

    › Let the community know about this: HBASE-24469 Hedged read might hang infinitely if read data from all DN failed. › Fixed metrics: metrics are mentioned in the official book, so they must be there! Spent lots of time figuring out if the testing process or our settings were wrong. › Added metrics back to HBase 1.x: HBASE-24435 Bring back hedged reads metrics to branch-1 › Exposed new metric for main branch: HBASE-24994 Add hedgedReadOpsInCurThread metric
  31. › Testing new versions and evaluating them: › Contributions: ›

    HBASE-23205 Correctly update the position of WALs currently being replicated › HBASE-22715 All scan requests should be handled by scan handler threads in RWQueueRpcExecutor › HBASE-21418 Reduce a number of reseek operations in MemstoreScanner when seek point is close to the current row › HBASE-24994 Add hedgedReadOpsInCurThread metric › HBASE-24435 Bring back hedged reads metrics to branch-1 › HBASE-24402 Moving the meta region causes MetricsException when using above 2.6.0 hadoop version › … › Reported: › HBASE-24903 'scandetail' log message is missing when responseTooSlow happens in the rpc that closes the scanner › HBASE-21738 Remove all the CSLM#size operation in our memstore because it's an quite time consuming › HBASE-24469 Hedged read might hang infinitely if read data from all DN failed › … › Backports to our branches: › HBASE-24742 Improve performance of SKIP vs SEEK logic › HBASE-24282 'scandetail' log message is missing when responseTooSlow happens on the first scan rpc call › HBASE-21748 Remove all the CLSM#size operation in our memstore because it's an quite time consuming › HDFS-11303 Hedged read might hang infinitely if read data from all DN failed › … Made our clusters stronger by: Evaluate new versions and features
  32. Are we ready yet? › Cluster might be fast, reliable and

    healthy, but we could still see bad performance. › Table schema design plays an important role in performance. › It is important to understand the technology internals. 
  33. Agenda › Introduction › LINE Messaging service storage requirements ›

    Achieving storage requirements with Apache HBase › Make cluster stronger by evaluation › The importance of a good schema
  34. The importance of a good schema How data is organized

    in HBase: ColumnFamilyA Column Column Column … abc xyz ColumnFamilyB Column Column valueA valueB row row row … version: 10 (usually timestamp) version: 9 version: 8 Row Key Understanding HBase internals CELL: immutable and versioned
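The cell ordering described above can be modeled with a `TreeMap` and a comparator (a toy model; `Cell`, `CELL_ORDER` and `newStore` are illustrative names, not the HBase API). It shows the two properties the slide names: cells are immutable and versioned (a new version is a new cell, not an overwrite), and everything is kept sorted by row key, then column, then descending version, so the newest version of a column comes first.

```java
import java.util.Comparator;
import java.util.TreeMap;

/** Toy model of HBase cell ordering: row key, column, descending version. */
class CellModel {
    static final class Cell {
        final String row;
        final String column;
        final long version;
        Cell(String row, String column, long version) {
            this.row = row; this.column = column; this.version = version;
        }
    }

    // Sorted by row key, then column, then DESCENDING version (newest first).
    static final Comparator<Cell> CELL_ORDER = Comparator
            .comparing((Cell c) -> c.row)
            .thenComparing((Cell c) -> c.column)
            .thenComparing(Comparator.comparingLong((Cell c) -> c.version).reversed());

    static TreeMap<Cell, String> newStore() { return new TreeMap<>(CELL_ORDER); }
}
```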
  35. The importance of a good schema HBase write path: Write

    123 flush() Disk (HDFS Files) Immutable HFile HFile compact! HFile HBase RegionServer Understanding HBase internals MemStore Memory Data: Immutable CELLs sorted by - row key - column - version
  36. HBase RegionServer The importance of a good schema HBase read

    path: Read 123 HFile HFile HFile merge! Understanding HBase internals MemStore Memory Disk (HDFS Files) Immutable Data: Immutable CELLs sorted by - row key - column - version
  37. The importance of a good schema Study case: Message Id

    list table Message Id List table: A list of message Ids per chat userId1 : userId2 … messageId: 7 messageId: 11 … User A reads up to messageId: 11 We want to scan by messageId range event: Mark as read [Chat screenshot: messages between A and B shown as "Read 17:23 PM"]
  38. RegionServer1 regionA The importance of a good schema Study case:

    Message Id list table MessageIdList table: A list of message Ids per chat ColumnFamily col: messageId … col: messageId value: senderUserId, version: messageId value: senderUserId, version: messageId Row Key hash(userIdA):userIdB … RegionServer1 regionB hash(111):222 hash(333):555 userId A < UserId B
  39. RegionServer1 regionA RegionServer1 regionB The importance of a good schema

    Study case: Message Id list table MessageIdList table: A list of message Ids per chat ColumnFamily col: messageId … col: messageId value: senderUserId, version: messageId … value: senderUserId, version: messageId Using messageId for the version will allow us to do range scans: scan from version 123 to version 245 Row Key hash(111):222 hash(333):555 userId A < UserId B hash(userIdA):userIdB …
  40. The importance of a good schema Study case: Message Id

    list table Hint: the spikes disappear after we flush() the MemStore to disk. A slow mark-as-read will impact user experience and our backends.
  41. The importance of a good schema Study case: Message Id

    list table HBase read/write path: Read 123 Disk (HDFS File) HFile HFile HFile merge! RegionServer MemStore Memory
  42. The importance of a good schema Study case: Message Id

    list table MemStore: In-memory sorted store implemented as a SkipList. [Diagram: sorted linked list of values, terminated by ∞] Sorted
  43. The importance of a good schema Study case: Message Id

    list table MemStore: In-memory sorted store implemented as a SkipList. [Diagram: sorted linked list of values, terminated by ∞] Sorted › Insert / Delete / Read: time complexity O(N) › Insert 7
  44. The importance of a good schema Study case: Message Id

    list table [Diagram: SkipList with express lanes over the sorted list, terminated by ∞] Sorted MemStore: In-memory sorted store implemented as a SkipList.
  45. The importance of a good schema Study case: Message Id

    list table [Diagram: SkipList with express lanes over the sorted list, terminated by ∞] › Read 8 › Insert / Delete / Read: time complexity O(Log N) › Scales better than the simple Linked List! Sorted MemStore: In-memory sorted store implemented as a SkipList.
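The skip list the slides describe is available directly in the JDK as `ConcurrentSkipListMap`, which is in fact what HBase's MemStore uses underneath. A one-method sketch of the O(log N) seek:

```java
import java.util.concurrent.ConcurrentSkipListMap;

/** Seek in a skip list: inserts, deletes and seeks all average O(log N). */
class SkipListDemo {
    /** Returns the first key >= k, or null if there is none: one O(log N) seek. */
    static Integer seekFirstAtLeast(ConcurrentSkipListMap<Integer, String> memstore, int k) {
        return memstore.ceilingKey(k);
    }
}
```

`ceilingKey` is exactly the "seek" primitive used in the scan walkthrough on the following slides.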
  46. The importance of a good schema Study case: Message Id

    list table ∞ › Read Cell8 [Diagram: SkipList of Cells: Cell Cell Cell …] › Insert / Delete / Read: time complexity O(Log N) › Scales better than the simple Linked List! Sorted by: RowKey, Column, Version MemStore: In-memory sorted store implemented as a SkipList.
  47. The importance of a good schema Study case: Message Id

    list table Scan rowkey “hash(333):555” from version 1000 to version 1010
 [Table: MemStore cells sorted by RowKey, Column, Version (messageId); rows for hash(111):222 and hash(333):555] Memstore: SkipList sorted cells
  48. The importance of a good schema Study case: Message Id

    list table seek to row Scan rowkey “hash(333):555” from version 1000 to version 1010
 [Table: MemStore cells sorted by RowKey, Column, Version (messageId)] O(Log N) Memstore: SkipList
  49. The importance of a good schema Study case: Message Id

    list table [Table: MemStore cells sorted by RowKey, Column, Version (messageId)] Memstore: SkipList seek next column 1000 <= version <= 1010 Hope you have more luck in the next column Scan rowkey “hash(333):555” from version 1000 to version 1010

  50. The importance of a good schema Study case: Message Id

    list table Scan rowkey “hash(333):555” from version 1000 to version 1010
 [Table: MemStore cells sorted by RowKey, Column, Version (messageId)] Memstore: SkipList seek next column
  51. The importance of a good schema Study case: Message Id

    list table Scan rowkey “hash(333):555” from version 1000 to version 1010
 [Table: MemStore cells sorted by RowKey, Column, Version (messageId)] Memstore: SkipList seek next column O(Log N)
  52. The importance of a good schema Study case: Message Id

    list table [Table: MemStore cells sorted by RowKey, Column, Version (messageId)] Memstore: SkipList seek next column Java implementation: ConcurrentSkipListMap.tailMap // internally makes an immutable SubSkipList // needs to traverse the SkipList Scan rowkey “hash(333):555” from version 1000 to version 1010
 O(Log N)
  53. The importance of a good schema Study case: Message Id

    list table [Table: MemStore cells sorted by RowKey, Column, Version (messageId)] Memstore: SkipList seek next column seek next column seek next column … seek next column … include include Java implementation: ConcurrentSkipListMap.tailMap // needs to traverse the SkipList O(Log N) Scan rowkey “hash(333):555” from version 1000 to version 1010
 … …
  54. The importance of a good schema Study case: Message Id

    list table Memstore: SkipList O(Log N) O(Log N) O(Log N) O(Log N) Scan rowkey “hash(333):555” from version 1000 to version 1010
 [Table: MemStore cells sorted by RowKey, Column, Version (messageId)] … … include include … … Java implementation: ConcurrentSkipListMap.tailMap // needs to traverse the SkipList O(Log N)
  55. The importance of a good schema Study case: Message Id

    list table Memstore: SkipList O(Log N) O(Log N) O(Log N) O(Log N) M O(M * Log N) Scan rowkey “hash(333):555” from version 1000 to version 1010
 › Tried to fix: HBASE-21418 Reduce a number of reseek operations in MemstoreScanner when seek point is close to the current row [Table: MemStore cells sorted by RowKey, Column, Version (messageId)] … … include include … … Java implementation: ConcurrentSkipListMap.tailMap // needs to traverse the SkipList O(Log N)
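The O(M * Log N) behaviour above can be reproduced with a toy model of the MemStore (not the real MemstoreScanner: the class name, the "row/column" string keys, and storing the version as the map value are all simplifications for illustration). With M columns under one row, the scan issues one O(log N) reseek per column, which is exactly what the per-column "seek next column" on the previous slides adds up to.

```java
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy model of scanning one row's columns for a version range in a skip list. */
class ReseekScan {
    /**
     * Collects into `out` the versions inside [minV, maxV] for all columns of
     * one row, and returns how many O(log N) reseeks were issued: one per column.
     */
    static int scanVersionRange(ConcurrentSkipListMap<String, Long> memstore,
                                String rowPrefix, long minV, long maxV, List<Long> out) {
        int reseeks = 1;                                  // initial "seek to row"
        String key = memstore.ceilingKey(rowPrefix);
        while (key != null && key.startsWith(rowPrefix)) {
            long version = memstore.get(key);
            if (version >= minV && version <= maxV) out.add(version);  // include
            key = memstore.higherKey(key);                // "seek next column": fresh O(log N)
            reseeks++;
        }
        return reseeks;
    }
}
```

Even though only 11 versions fall in the requested range, every column under the row costs a reseek: with N cells and M columns that is O(M * log N), the source of the spikes.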
  56. The importance of a good schema Study case: Message Id

    list table HBase read/write path: Read/Write 123 flush() Disk (HDFS File) HFile HFile HFile RegionServer Spikes disappear after flush(). But why? MemStore Memory
  57. The importance of a good schema Study case: Message Id list table

    [Table: HFile cells sorted by RowKey, Column, Version (messageId)] Load data block from disk Scan rowkey “hash(333):555” from version 1000 to version 1010
 Data block ByteBuffer // position 0 HFile (Disk)
  58. The importance of a good schema Study case: Message Id list table

    [Table: HFile cells sorted by RowKey, Column, Version (messageId)] Load data block from disk Scan rowkey “hash(333):555” from version 1000 to version 1010
 Data block HFile (Disk) ByteBuffer - position 0 blockSeek position: 0
  59. The importance of a good schema Study case: Message Id list table

    [Table: HFile cells sorted by RowKey, Column, Version (messageId)] Load data block from disk Scan rowkey “hash(333):555” from version 1000 to version 1010
 Data block HFile (Disk) ByteBuffer - position 780 blockSeek advance advance advance … advance O(N) position: 780 Include position: 790 Include … advance advance
  60. The importance of a good schema Study case: Message Id

    list table MessageIdList table: A list of message Ids per chat ColumnFamily col: messageId … col: messageId value: senderUserId, version: messageId … value: senderUserId, version: messageId Row Key hash(lowerUserId):higherUserId … New schema: ColumnFamily col: EMPTY value: senderUserId, version: messageId Row Key hash(lowerUserId:higherUserId):messageId … hash(lowerUserId:higherUserId):messageId … Scan from rowkey “hash(lowerUserId:higherUserId):1000” to rowkey “hash(lowerUserId:higherUserId):1011”
 Scan rowkey “hash(lowerUserId):higherUserId” from version 1000 to version 1010

  61. The importance of a good schema Study case: Message Id

    list table Include Scan from rowkey “hash(lowerUserId:higherUserId):1000” to rowkey “hash(lowerUserId:higherUserId):1011”
 … Include [Table: RowKey hash(lowerUserId:higherUserId):messageId, Column value: senderId] Memstore: SkipList seek to row Problem solved! O(Log N)
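The fix above can be contrasted with the earlier per-column reseek sketch using the same skip-list model (a toy: string keys "chat:messageId" with zero-padding so lexicographic order matches numeric order, whereas the real rowkeys are hashed bytes). With the messageId in the row key, the query is a single O(log N) seek followed by a sequential walk over a contiguous key range via `subMap`, with no per-column reseeks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy model of the new schema: one row per messageId, scanned as a key range. */
class RowKeyRangeScan {
    static List<String> scan(ConcurrentSkipListMap<String, String> store,
                             String chat, long fromId, long toIdExclusive) {
        String start = chat + ":" + String.format("%010d", fromId);
        String stop  = chat + ":" + String.format("%010d", toIdExclusive);
        // One O(log N) seek, then next-key iteration over contiguous keys.
        return new ArrayList<>(store.subMap(start, stop).keySet());
    }
}
```

The design point: range predicates belong in the row key, where the storage engine can answer them with one seek, rather than in the version dimension, where each column costs its own reseek.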
  62. Wrap up › High demanding storage requirements: › Offer the

    best to our users while avoiding unnecessary cost. › Need to be performant, reliable, highly available and scalable. › Need to protect our data against data inconsistencies. › Make our clusters reliable: › Test and evaluate every version or feature carefully. › Build a safe testing environment as similar as possible to production. › Good data schema design is key for performance: › Must understand technology internals to make good design decisions.
  63. Future challenges › Overcome key-value storage limitations: › Better

    transactions. › Secondary Index. › Multi Data Center architecture: › Disaster Recovery for now. › Machines are underutilized. › Nature of Messaging service makes active-active multi DC very challenging. Apache HBase Apache HBase latency replication JP1 DC JP2 DC async Eventual consistency › Adapt better to projects with different needs: › No such high performance requirements. › But need better consistency and transactional features. › And still requires scalability. › Need to consider and explore new storages.