
HBase: An Introduction

A brief introduction to HBase

Simon Kelly

June 20, 2012


Transcript

  1. What is HBase?
     •  Open source clone of Google's BigTable
     •  Distributed
     •  Column-oriented
     •  Fault tolerant
     •  Linear scaling by adding more servers
     •  Runs on commodity (not crappy) hardware
     •  Not an RDBMS
        o  No joins
        o  No secondary indexes
        o  No SQL
  2. A bit more detail
     •  Built on Hadoop HDFS
     •  Strongly consistent
     •  Automatic sharding
     •  Automatic failover
     •  MapReduce
     •  Java client API + Thrift / REST API
     •  Block cache and Bloom filters
  3. Data model
     •  Sparse multi-dimensional map: (table, row key, column family:column, timestamp) = cell
     •  Row keys are sorted in lexicographical order
        o  i.e. 1, 12, 15, 2, 23, 3, 4
     •  Row key and cell contents are byte[]
        o  No data types
     •  Timestamp specifies different versions
     •  Tuning and storage settings are done per column family
        o  Compression, version retention, Bloom filters etc.
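The lexicographic ordering above is worth internalizing: HBase compares raw bytes, so numeric keys interleave unless you pad or binary-encode them. A minimal Python sketch (not slide material, just an illustration of the sort order):

```python
# HBase sorts row keys as raw bytes, i.e. lexicographically.
keys = ["1", "12", "15", "2", "23", "3", "4"]

# Plain string numbers interleave exactly as on the slide.
print(sorted(keys))    # ['1', '12', '15', '2', '23', '3', '4']

# Zero-padding restores numeric order.
padded = [k.zfill(4) for k in keys]
print(sorted(padded))  # ['0001', '0002', '0003', '0004', '0012', '0015', '0023']
```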
  4. Example

     Row Key     | meta:             | detail:
     ------------|-------------------|------------------
     southafrica | code=ZA           | history="…1994…"
                 | population@t1=40m | politics="…"
                 | population@t2=43m |
     zimbabwe    | code=ZW           | history="…1980…"
                 | capital=Harare    | climate="…"
                 | currency@t1=ZWD   |
                 | currency@t2=USD   |
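The example table is just a sparse map keyed by (row, family:column, timestamp). A toy in-memory model in Python makes the shape concrete; the names mirror the example, but this is an illustration, not the HBase API:

```python
# Toy model of HBase's data model: table[row]["family:qualifier"][timestamp] = value.
# Cells that are never written simply don't exist -- the map is sparse.
table = {}

def put(row, column, value, ts):
    table.setdefault(row, {}).setdefault(column, {})[ts] = value

def get(row, column):
    """Return the newest version of a cell, as a default read would."""
    versions = table[row][column]
    return versions[max(versions)]

put("zimbabwe", "meta:code", "ZW", ts=1)
put("zimbabwe", "meta:currency", "ZWD", ts=1)
put("zimbabwe", "meta:currency", "USD", ts=2)

print(get("zimbabwe", "meta:currency"))  # USD -- the latest timestamp wins
```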
  5. Schema design
     •  De-normalization
     •  Nested entities
     •  Attributes in keys
     •  e.g. OpenTSDB
        o  metric - base timestamp - tag - tag
     Ian Varley: http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012
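"Attributes in keys" means packing fields into the row key itself, so one sorted scan answers a query. A sketch of an OpenTSDB-style composite key in Python; the field widths and id assignments here are illustrative, not OpenTSDB's actual encoding:

```python
import struct

def make_key(metric_id: int, base_ts: int, tags: dict) -> bytes:
    # 3-byte metric id, 4-byte big-endian base timestamp, then tag (key id, value id) pairs.
    # Big-endian packing means byte order matches numeric order, so rows sort by time.
    key = metric_id.to_bytes(3, "big") + struct.pack(">I", base_ts)
    for k_id, v_id in sorted(tags.items()):
        key += k_id.to_bytes(3, "big") + v_id.to_bytes(3, "big")
    return key

k1 = make_key(1, 1340000000, {7: 42})
k2 = make_key(1, 1340003600, {7: 42})
assert k1 < k2  # later base timestamp sorts after, so a scan reads a time range in order
```

Because all rows for one metric share a key prefix, a range scan over that prefix reads the metric's whole history sequentially.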
  6. HBase API
     •  get(row)
     •  put(row, Map<Column, Value>)
     •  scan(key range, filter)
     •  increment(row, columns)
     •  delete(row)
     •  coprocessors (like a stored procedure)
     •  MapReduce
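Of these, scan is the least familiar coming from an RDBMS: it walks a contiguous key range in sorted order, optionally applying a filter. A toy Python stand-in for the semantics (not the real client API):

```python
# Sketch of scan(key range, filter) over a sorted key space.
from bisect import bisect_left

rows = {"row-01": "a", "row-02": "b", "row-10": "c", "zzz": "d"}

def scan(start, stop, row_filter=None):
    keys = sorted(rows)           # HBase keeps rows sorted, so this is implicit there
    i = bisect_left(keys, start)  # seek straight to the start key
    for k in keys[i:]:
        if k >= stop:             # stop key is exclusive
            break
        if row_filter is None or row_filter(k, rows[k]):
            yield k, rows[k]

print(list(scan("row-01", "row-10")))  # [('row-01', 'a'), ('row-02', 'b')]
```

This is why key design matters: anything you want to read together should sort together.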
  7. When to use HBase
     •  BIG data
        o  hundreds of millions of smallish rows
     •  Variable schema
     •  High write volume
     •  Key-based access
     •  Sequential reads
     •  Can you live without RDBMS features?
        o  typed columns, secondary indexes, cross-record transactions, joins, SQL
     •  Make sure you have enough hardware
  8. In practice
     •  3-machine cluster minimum (5 is better)
     •  Lots of RAM (4-8 GB per core)
     •  Java GC tuning (not so bad thanks to those who have gone before)
     •  Optimize for your workload
  9. Resources
     •  HBase docs
        o  http://hbase.apache.org/book/book.html
     •  Cloudera
        o  http://www.cloudera.com/blog/category/hbase/
        o  http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
        o  http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/
        o  http://www.cloudera.com/resource/hadoop-world-2011-presentation-video-advanced-hbase-schema-design/
     •  Mailing lists
        o  http://hbase.apache.org/mail-lists.html
     •  Lars George
        o  http://www.larsgeorge.com/
     •  HBase road map
        o  http://www.slideshare.net/cloudera/apache-hbase-road-map-jonathan-gray-facebook
  10. Case studies
     •  Facebook
        o  http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
     •  Hstack
        o  http://hstack.org/why-were-using-hbase-part-1/
     •  Twitter
        o  http://squarecog.wordpress.com/2010/05/20/pig-hbase-hadoop-and-twitter-hug-talk-slides/