
HBase: An Introduction

A brief introduction to HBase

Simon Kelly

June 20, 2012


Transcript

  1. What is HBase?
     •  Open source clone of Google's BigTable
     •  Distributed
     •  Column-oriented
     •  Fault tolerant
     •  Linear scaling by adding more servers
     •  Runs on commodity (not crappy) hardware
     •  Not an RDBMS
        o  No joins
        o  No secondary indexes
        o  No SQL
  2. A bit more detail
     •  Built on Hadoop HDFS
     •  Strongly consistent
     •  Automatic sharding
     •  Automatic failover
     •  MapReduce
     •  Java client API + Thrift / REST API
     •  Block cache and Bloom filters
  3. Data model
     •  Sparse multi-dimensional map: (table, row key, column family:column, timestamp) = cell
     •  Row keys are sorted in lexicographical order
        o  i.e. 1, 12, 15, 2, 23, 3, 4
     •  Row key and cell contents are byte[]
        o  No data types
     •  Timestamp specifies different versions
     •  Tuning and storage settings are done per column family
        o  Compression, version retention, Bloom filters etc.
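The lexicographic ordering above is worth internalizing: HBase compares raw bytes, so numeric keys interleave unless you pad or binary-encode them. A minimal Python sketch (not slide material, just an illustration of the sort order):

```python
# HBase sorts row keys as raw bytes, i.e. lexicographically.
keys = ["1", "12", "15", "2", "23", "3", "4"]

# Plain string numbers interleave exactly as on the slide.
print(sorted(keys))    # ['1', '12', '15', '2', '23', '3', '4']

# Zero-padding restores numeric order.
padded = [k.zfill(4) for k in keys]
print(sorted(padded))  # ['0001', '0002', '0003', '0004', '0012', '0015', '0023']
```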
  4. Example

     Row Key     | meta:             | detail:
     ------------|-------------------|------------------
     southafrica | code=ZA           | history="…1994…"
                 | population@t1=40m | politics="…"
                 | population@t2=43m |
     zimbabwe    | code=ZW           | history="…1980…"
                 | capital=Harare    | climate="…"
                 | currency@t1=ZWD   |
                 | currency@t2=USD   |
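The example table is just a sparse map keyed by (row, family:column, timestamp). A toy in-memory model in Python makes the shape concrete; the names mirror the example, but this is an illustration, not the HBase API:

```python
# Toy model of HBase's data model: table[row]["family:qualifier"][timestamp] = value.
# Cells that are never written simply don't exist -- the map is sparse.
table = {}

def put(row, column, value, ts):
    table.setdefault(row, {}).setdefault(column, {})[ts] = value

def get(row, column):
    """Return the newest version of a cell, as a default read would."""
    versions = table[row][column]
    return versions[max(versions)]

put("zimbabwe", "meta:code", "ZW", ts=1)
put("zimbabwe", "meta:currency", "ZWD", ts=1)
put("zimbabwe", "meta:currency", "USD", ts=2)

print(get("zimbabwe", "meta:currency"))  # USD -- the latest timestamp wins
```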
  5. Schema design
     •  De-normalization
     •  Nested entities
     •  Attributes in keys
     •  e.g. OpenTSDB
        o  metric - base timestamp - tag - tag
     Ian Varley: http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012
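"Attributes in keys" means packing fields into the row key itself, so one sorted scan answers a query. A sketch of an OpenTSDB-style composite key in Python; the field widths and id assignments here are illustrative, not OpenTSDB's actual encoding:

```python
import struct

def make_key(metric_id: int, base_ts: int, tags: dict) -> bytes:
    # 3-byte metric id, 4-byte big-endian base timestamp, then tag (key id, value id) pairs.
    # Big-endian packing means byte order matches numeric order, so rows sort by time.
    key = metric_id.to_bytes(3, "big") + struct.pack(">I", base_ts)
    for k_id, v_id in sorted(tags.items()):
        key += k_id.to_bytes(3, "big") + v_id.to_bytes(3, "big")
    return key

k1 = make_key(1, 1340000000, {7: 42})
k2 = make_key(1, 1340003600, {7: 42})
assert k1 < k2  # later base timestamp sorts after, so a scan reads a time range in order
```

Because all rows for one metric share a key prefix, a range scan over that prefix reads the metric's whole history sequentially.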
  6. HBase API
     •  get(row)
     •  put(row, Map<Column, Value>)
     •  scan(key range, filter)
     •  increment(row, columns)
     •  delete(row)
     •  coprocessors (like a stored procedure)
     •  MapReduce
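Of these, scan is the least familiar coming from an RDBMS: it walks a contiguous key range in sorted order, optionally applying a filter. A toy Python stand-in for the semantics (not the real client API):

```python
# Sketch of scan(key range, filter) over a sorted key space.
from bisect import bisect_left

rows = {"row-01": "a", "row-02": "b", "row-10": "c", "zzz": "d"}

def scan(start, stop, row_filter=None):
    keys = sorted(rows)           # HBase keeps rows sorted, so this is implicit there
    i = bisect_left(keys, start)  # seek straight to the start key
    for k in keys[i:]:
        if k >= stop:             # stop key is exclusive
            break
        if row_filter is None or row_filter(k, rows[k]):
            yield k, rows[k]

print(list(scan("row-01", "row-10")))  # [('row-01', 'a'), ('row-02', 'b')]
```

This is why key design matters: anything you want to read together should sort together.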
  7. When to use HBase
     •  BIG data
        o  hundreds of millions of smallish rows
     •  Variable schema
     •  High write volume
     •  Key-based access
     •  Sequential reads
     •  Can you live without RDBMS features?
        o  typed columns, secondary indexes, cross-record transactions, joins, SQL
     •  Make sure you have enough hardware
  8. In practice
     •  3-machine cluster minimum (5 is better)
     •  Lots of RAM (4-8 GB per core)
     •  Java GC tuning (not so bad thanks to those who have gone before)
     •  Optimize for your workload
  9. Resources
     •  HBase docs
        o  http://hbase.apache.org/book/book.html
     •  Cloudera
        o  http://www.cloudera.com/blog/category/hbase/
        o  http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
        o  http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/
        o  http://www.cloudera.com/resource/hadoop-world-2011-presentation-video-advanced-hbase-schema-design/
     •  Mailing lists
        o  http://hbase.apache.org/mail-lists.html
     •  Lars George
        o  http://www.larsgeorge.com/
     •  HBase road map
        o  http://www.slideshare.net/cloudera/apache-hbase-road-map-jonathan-gray-facebook
  10. Case studies
     •  Facebook
        o  http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
     •  Hstack
        o  http://hstack.org/why-were-using-hbase-part-1/
     •  Twitter
        o  http://squarecog.wordpress.com/2010/05/20/pig-hbase-hadoop-and-twitter-hug-talk-slides/