
HBase in Between

VCNC
November 15, 2013


Presented at the "Apache HBase Meetup in Seoul".

Why use HBase?
- If you don’t need complex SQL queries
- Scalability
- Auto-sharding
- Good write throughput
- More structured data for Analysis

Haeinsa
- HBase transaction library for OLTP
- Inspired by Google Percolator
- Cross-table, cross-row transactions
- Low overhead
- No consistency issues over 3 months
- Open on GitHub (http://github.com/vcnc/haeinsa)

http://engineering.vcnc.co.kr/2013/11/hbase-meetup-presentation/


Transcript

  1. Agenda
      • HBase experience
        – Between Service (OLTP)
        – Between Log Analysis (OLAP)
      • Haeinsa
        – Open-source HBase transaction library
        – Made by VCNC
      • Summary
  2. Between Architecture
      [Diagram labels: ELB (HTTP); ELB #1, #2, #3 (TCP); API #1, #2, #3; ZooKeeper; HBase (Cluster)]
  3. HBase in OLTP
      • Between has used HBase as its main DB since the beginning of the service
      • HBase in AWS
        – Uses CDH 4.4.0 (HBase 0.94.6)
        – Runs manually on EC2 instances
        – HDFS with replication x3
          • HA namenode
        – M2.4xlarge instances
          • 68.4GB RAM
  4. Why choose HBase?
      • Messaging is the key feature
      • No need for complex schemas or queries
      • Prepared for scale (+ AWS)
      • High write throughput
  5. Mistake list
      • Hot Region / Cold Region
      • Major compaction storm
      • Long log splitting in RS crash
      • TCP no delay
      • Long latency in region balancing
      • AWS storage issues
      • …
  6. Mistake – (1)
      • Hot Region / Cold Region
        – Regions are split by file size
        – Tables grow at different speeds
        – Manual split is recommended
      [Diagram: regions of tables T1 and T2 distributed across RS 1 and RS 2]
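
The manual-split recommendation on slide 6 can be illustrated by choosing split points up front instead of waiting for size-based splits. The following is only a minimal sketch: it assumes row keys begin with a uniformly distributed, fixed-width hex prefix (an assumption, the slides do not describe Between's key design), and `even_split_keys` is a hypothetical helper, not part of HBase.

```python
from typing import List


def even_split_keys(num_regions: int, prefix_width: int = 4) -> List[str]:
    """Return num_regions - 1 split keys that divide a hex keyspace evenly.

    Assumption (not from the slides): row keys start with a uniformly
    distributed hex prefix that is `prefix_width` characters wide.
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    keyspace = 16 ** prefix_width
    if num_regions > keyspace:
        raise ValueError("more regions than distinct prefixes")
    # Boundary i sits at i/num_regions of the keyspace, zero-padded in hex.
    return [
        format(i * keyspace // num_regions, "0{}x".format(prefix_width))
        for i in range(1, num_regions)
    ]


if __name__ == "__main__":
    # Four regions over a 2-character hex prefix.
    print(even_split_keys(4, prefix_width=2))
```

Keys produced this way could be supplied when the table is created (for example through the `SPLITS` option of the HBase shell's `create` command), so that each table starts with a region count chosen by the operator rather than by file size.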
  7. Mistake – (2)
      • Major compaction storm
        – Run major compaction manually in off-peak
      [Diagram: old files merged into a compacted file; peak vs. off-peak]
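
Running major compaction manually in off-peak hours implies some external scheduler that decides when it is safe to trigger it. The sketch below shows only the window check. It assumes periodic major compaction has been turned off (for instance by setting `hbase.hregion.majorcompaction` to 0) and that a separate job issues the compaction request; `in_off_peak` and the window hours are hypothetical and not taken from the slides.

```python
from datetime import time


def in_off_peak(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside the [start, end) off-peak window.

    Handles windows that wrap past midnight (for example 23:00 to 05:00).
    """
    if start <= end:
        return start <= now < end
    # Wrapped window: valid late in the evening or early in the morning.
    return now >= start or now < end


if __name__ == "__main__":
    # Hypothetical off-peak window from 02:00 to 05:00.
    if in_off_peak(time(3, 30), time(2, 0), time(5, 0)):
        print("safe to request a major compaction")
```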
  8. Mistake – (3)
      • Long latency in region balancing
      [Diagram: T1 regions moving between RS 1 and RS 2]
  9. What we learned
      • You have to understand HBase to operate it correctly!!
      • HBase is not yet optimized for latency in many cases
  10. HBase in OLAP
      • Between analyzes user action logs
        – 300M+ per day
      • HDFS + HBase + MapReduce + MySQL
  11. How we analyze
      • Cluster in office (NOT AWS)
        – We don’t have a lot of money
        – Cheap PCs
      [Diagram labels: API, S3, Log Aggregator, HBase (Cluster), MapReduce, MySQL, SQL; arrows: Upload, Download, Import]
  12. What to analyze
      • User retention
      • Activity across country, device, gender
      • Activity patterns depending on length of relationship
      • Data-driven decision making!
  13. Haeinsa – Why we made it
      • HBase only supports ACID semantics for a single row
        – Only supports checkAndPut, checkAndDelete
      • OLTP w/o transactions was a NIGHTMARE
      • No good alternatives available
      • Google built transactions on BigTable
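
To make the single-row guarantee concrete, here is a small in-memory model of checkAndPut-style semantics: the write happens only if a column still holds the expected value. This is a sketch, not HBase client code; `ToyRow`, `check_and_put`, and `add_to_balance` are hypothetical names, and the mutex merely stands in for the per-row atomicity that HBase provides.

```python
import threading


class ToyRow:
    """In-memory stand-in for one HBase row (hypothetical, for illustration)."""

    def __init__(self, **columns):
        self._columns = dict(columns)
        self._mutex = threading.Lock()  # models per-row atomicity

    def get(self, column):
        with self._mutex:
            return self._columns.get(column)

    def check_and_put(self, check_column, expected, put_column, value):
        """Write `value` only if `check_column` still equals `expected`."""
        with self._mutex:
            if self._columns.get(check_column) != expected:
                return False
            self._columns[put_column] = value
            return True


def add_to_balance(row, amount, max_retries=10):
    """Optimistic read-modify-write loop built on check_and_put."""
    for _ in range(max_retries):
        current = row.get("balance")
        if row.check_and_put("balance", current, "balance", current + amount):
            return True
    return False


if __name__ == "__main__":
    row = ToyRow(balance=10)
    add_to_balance(row, -7)
    print(row.get("balance"))
```

A loop like this is safe for one row, but it cannot make a change to two rows atomic, which is the gap the slide describes.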
  14. Haeinsa
      • Haeinsa is an open-source transaction library for HBase
      • Made & maintained by VCNC
      • Implemented using only the basic HBase library
        – Does not use coprocessors
        – Does not change HBase
  15. Haeinsa
      • Haeinsa is a layer between the application and the HBase client library
      [Diagram: stack of Application, Haeinsa, HBase Client Library]
  16. Haeinsa Mechanism – (1)
      [Diagram: rows row1 and row2, each with columns Col1, Col2, Col3 and a Lock column; labels: CheckAndPut, Check]
  17. Haeinsa – example
        BeginTransaction()
        bobBalance = Read(Bob, balance)
        Write(Bob, balance, bobBalance-$7)
        joeBalance = Read(Joe, balance)
        Write(Joe, balance, joeBalance+$7)
        Commit()
  18. checkAndPut is the atomic operation provided by HBase, so we can say that the row has not been modified since the get operation was executed.
      [Diagram: transaction timeline for rows bob and joe; labels: R, C, get, write, checkAndPut]
      checkAndPut ensures that the value of the row has not been modified since it was read. Remember: every modification via Haeinsa also modifies the Lock column.
  19. Haeinsa doesn't allow any operation to access unstable rows. That means Haeinsa locks the participating rows during the commit operation.
      [Diagram: transaction timeline for rows bob and joe; labels: R, C, get, write, checkAndPut]
      Since the row is not in the STABLE state, other transactions can't access the row during this interval. Each checkAndPut operation also ensures that the row has not been accessed by another transaction.
  20. Atomicity of the transaction is ensured by a single checkAndPut operation.
      [Diagram: transaction timeline for rows bob and joe; labels: R, C, get, write, checkAndPut, << committed >>]
      This checkAndPut operation determines whether the whole transaction succeeds or not, so the success of the transaction is decided by one atomic operation.
  21. If any checkAndPut operation fails, all rows can be recovered to the STABLE state. If the state of the primary row is COMMITTED, the transaction can be treated as succeeded, so the mutations are applied to each row. If not, the prewritten values are deleted from all rows.
      [Diagram: transaction timeline for rows bob and joe; labels: R, C, get, write, checkAndPut]
      If any of these operations fails, the state of the rows can be recovered to STABLE.
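
The commit flow on slides 18 to 21 can be approximated with a toy, single-process model. This is a sketch under stated assumptions and not Haeinsa's implementation or API: the lock cell is modeled as a `(state, version)` tuple, the primary row is simply the first row written, and `ToyTable`, `ToyTransaction`, and `ConflictError` are hypothetical names. It leaves out crash recovery by other clients, timestamps, and validation of rows that are only read.

```python
STABLE, PREWRITTEN, COMMITTED = "STABLE", "PREWRITTEN", "COMMITTED"


class ConflictError(Exception):
    """Raised when a row changed, or was locked, since it was read."""


class ToyTable:
    """In-memory stand-in for an HBase table (hypothetical, single-threaded)."""

    def __init__(self, rows):
        # Each row keeps its value, a lock cell (state, version), and any
        # prewritten value waiting to be applied.
        self.rows = {
            key: {"value": value, "lock": (STABLE, 0), "pending": None}
            for key, value in rows.items()
        }

    def check_and_put(self, key, expected_lock, **cells):
        """Update cells only if the row's lock cell equals expected_lock."""
        row = self.rows[key]
        if row["lock"] != expected_lock:
            return False
        row.update(cells)
        return True


class ToyTransaction:
    def __init__(self, table):
        self.table = table
        self.read_locks = {}  # row key -> lock cell seen at read time
        self.writes = {}      # row key -> buffered new value

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        row = self.table.rows[key]
        if row["lock"][0] != STABLE:
            raise ConflictError(key + " is not STABLE")
        self.read_locks[key] = row["lock"]
        return row["value"]

    def write(self, key, value):
        if key not in self.read_locks:
            self.read(key)  # remember the lock cell expected at commit
        self.writes[key] = value

    def commit(self):
        if not self.writes:
            return
        prewritten = []  # (row key, lock cell this transaction installed)
        for key, value in self.writes.items():
            seen = self.read_locks[key]
            lock = (PREWRITTEN, seen[1] + 1)
            ok = self.table.check_and_put(key, seen, lock=lock, pending=value)
            if not ok:
                self._rollback(prewritten)
                raise ConflictError(key + " changed since it was read")
            prewritten.append((key, lock))
        # Commit point: a single check_and_put on the primary (first) row.
        primary, primary_lock = prewritten[0]
        committed = (COMMITTED, primary_lock[1])
        if not self.table.check_and_put(primary, primary_lock, lock=committed):
            self._rollback(prewritten)
            raise ConflictError("commit of primary row failed")
        # Primary is COMMITTED: apply mutations and return rows to STABLE.
        for key, lock in prewritten[1:] + [(primary, committed)]:
            self._finish(key, lock, apply=True)

    def _rollback(self, prewritten):
        # Primary never reached COMMITTED: delete the prewritten values.
        for key, lock in prewritten:
            self._finish(key, lock, apply=False)

    def _finish(self, key, lock, apply):
        row = self.table.rows[key]
        value = row["pending"] if apply else row["value"]
        self.table.check_and_put(
            key, lock, value=value, pending=None, lock=(STABLE, lock[1])
        )


if __name__ == "__main__":
    table = ToyTable({"bob": 10, "joe": 2})
    tx = ToyTransaction(table)
    tx.write("bob", tx.read("bob") - 7)
    tx.write("joe", tx.read("joe") + 7)
    tx.commit()
    print(table.rows["bob"]["value"], table.rows["joe"]["value"])
```

Because every write bumps the version in the lock cell, a transaction that read a row before someone else committed to it fails its check at prewrite time and rolls back, which mirrors the "modified since read" check described on slide 18.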
  22. Haeinsa – Linearly scalable
      [Chart: Tx/Sec (0 to 50000) vs. ECU of HBase Cluster (0 to 1200); series: Haeinsa, HBase]
  23. Haeinsa - Latency
      [Chart: latency in ms (0 to 35) vs. ECU of HBase Cluster (0 to 1200); series: Haeinsa, HBase]
  24. Haeinsa
      • Pros
        – Linearly scalable
        – Serializability
        – Low overhead
        – Fault-tolerant
        – Not intrusive to original HBase cluster
        – Proven in practice
  25. Summary
      • Why use HBase?
        – If you don’t need complex SQL queries
        – Scalability
          • $ is the bottleneck, not storage
        – Auto-sharding
        – Good write throughput
        – More structured data for analysis
  26. Summary
      • Haeinsa
        – Transaction library for OLTP
          • Inspired by Google Percolator
        – Cross-table, cross-row transactions
        – Low overhead
        – No consistency issues over 3 months
        – Open on GitHub
          • http://github.com/vcnc/haeinsa