Introduction to HBase

Introduction to HBase Karthik Kumar

Overview
• Background
• Data Model
• APIs
• Case Study
• Architecture

What is HBase?
• A “sparse, distributed, consistent, multi-dimensional, sorted map”
• Column-oriented
• Based on Google’s Bigtable (2006)
• Apache project

Map
• At its core, HBase is just a mapping of keys to values
  key -> value

Sorted
• Cells are stored in lexicographic order by key
  ◦ Allows for range queries
    ▪ (return all values with keys between k1...k5)
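
Why sorted keys make range queries cheap can be sketched in a few lines. This is a toy in-memory model, not HBase code: the `SortedMap` class and its method names are made up for illustration, and the range here includes both endpoints to match the k1...k5 wording above.

```python
from bisect import bisect_left, bisect_right


class SortedMap:
    """Toy key -> value map that keeps its keys in lexicographic order."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def put(self, key, value):
        # Insert new keys at their sorted position; overwrite existing ones.
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def range(self, start, stop):
        """Return (key, value) pairs with start <= key <= stop."""
        lo = bisect_left(self._keys, start)
        hi = bisect_right(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


m = SortedMap()
for k in ["k3", "k1", "k7", "k5", "k2"]:
    m.put(k, "value-" + k)

result = m.range("k1", "k5")
print([k for k, _ in result])
```

Because the keys are already ordered, the range is found with two binary searches and one contiguous slice, with no need to examine every key.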

Multi-dimensional
• Key is actually made up of several parts
  (rowkey, column family, column, version) -> value
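
A minimal sketch of the composite key, modelled as a Python tuple. The row, family, and column names are placeholders. The sort puts the newest version first within a column, which is how HBase orders versions; everything else is plain lexicographic order on the key parts.

```python
# Each key is (rowkey, column family, column, version); values are opaque.
cells = {
    ("row2", "cf", "a", 1): "v1",
    ("row1", "cf", "b", 1): "v2",
    ("row1", "cf", "a", 2): "v3",
    ("row1", "cf", "a", 1): "v4",
}

# Sort by rowkey, family, column ascending, then version descending.
ordered_keys = sorted(cells, key=lambda k: (k[0], k[1], k[2], -k[3]))

for key in ordered_keys:
    print(key, "->", cells[key])
```

All cells of a row end up adjacent, and within a column the most recent version is the first one encountered.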

Sparse
• For a given row, nothing is stored for null/empty values
• No space is wasted on empty values
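
Sparseness can be illustrated with a dictionary of dictionaries that only ever holds the cells that were written. The row and column names below are invented for the example.

```python
table = {}  # rowkey -> {column -> value}; absent cells are simply not present


def put(row, column, value):
    table.setdefault(row, {})[column] = value


put("user1", "info:name", "Ann")
put("user1", "info:email", "ann@example.com")
put("user2", "info:name", "Bob")  # user2 has no email, so nothing is stored for it

stored_cells = sum(len(cols) for cols in table.values())
all_columns = {c for cols in table.values() for c in cols}
dense_cells = len(table) * len(all_columns)  # what a fixed-width grid would need

print(stored_cells, "cells stored;", dense_cells, "slots in a dense grid")
```

A missing cell costs nothing; reading it just finds no entry.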

Distributed
• HBase can spread data over many machines and store billions of cells, reliably
• Uses HDFS (Hadoop Distributed File System)
  ◦ Provides protection against node failures

Consistent
• HBase is strongly consistent
  ◦ All changes within the same row are atomic
  ◦ Reads always return the last written & committed value

Data Model
• Table
  ◦ Row
    ▪ Column Family
      • Column
        ◦ Cell (value, timestamp)
• 1+ columns form a row
• 1+ rows form a table
• Each column can hold 1+ versions of a value

Data Model - Table
• Clients store data in HBase tables
• A table is made up of rows

Data Model - Row
• Rows provide a logical grouping of cells
• Rows are sorted by key

Data Model - Columns
• Columns are arbitrary labels for attributes of a row
• Columns do not need to be specified up front

Data Model - Column Families
• Columns are grouped into column families
• Column families define storage attributes for their columns (compression, number of versions, etc.)
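
A rough sketch of a column family carrying a "number of versions" attribute. The `ColumnFamily` class and its `max_versions` parameter are illustrative names, and this toy drops surplus versions eagerly on every write purely to keep the example short.

```python
class ColumnFamily:
    """Toy column family: one storage setting shared by all of its columns."""

    def __init__(self, name, max_versions=3):
        self.name = name
        self.max_versions = max_versions
        self.cells = {}  # (rowkey, column) -> [(timestamp, value), ...] newest first

    def put(self, rowkey, column, timestamp, value):
        versions = self.cells.setdefault((rowkey, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # keep only the newest max_versions

    def get(self, rowkey, column, versions=1):
        return self.cells.get((rowkey, column), [])[:versions]


cf = ColumnFamily("info", max_versions=2)
cf.put("row1", "name", 1, "a")
cf.put("row1", "name", 2, "b")
cf.put("row1", "name", 3, "c")

latest = cf.get("row1", "name")
history = cf.get("row1", "name", versions=5)
print(latest, history)
```

Columns can be added freely at write time, but the retention setting lives on the family, so every column in it shares the same limit.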

Data Model - Cells
• Each cell contains a value and a version (usually a timestamp)

Putting it all together
(Table, RowKey, Family, Column, Timestamp) -> Value

Canonical Example: Google WebTable

HBase API
• HBase Shell: JRuby IRB-based shell
  ◦ JRuby = Ruby on the JVM
• REST queries (Stargate): use curl requests
• Thrift: communication protocol used for RPC
• MapReduce: create a Mapper and Reducer that query HBase
• Java, JRuby, HBql, Jython, Groovy, Scala: provide libraries that can talk to HBase

CRUD Operations
• PUT
  hbase> put 'table', 'rowkey', 'fam:col', 'value', ts
• GET
  hbase> get 'table', 'rowkey', {COLUMN => 'fam:col', TIMESTAMP => ts, VERSIONS => 4}
• DELETE
  hbase> delete 'table', 'rowkey', 'fam:col'
• SCAN
  hbase> scan 'table', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
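
To make the semantics of the four operations concrete, here is a small in-memory stand-in. It is not the HBase client API: `ToyTable` and its method names are invented, and details such as delete markers and column filters on scans are left out.

```python
import time


class ToyTable:
    """In-memory model of put / get / delete / scan on a single table."""

    def __init__(self):
        self.rows = {}  # rowkey -> {"fam:col" -> {timestamp -> value}}

    def put(self, rowkey, column, value, ts=None):
        # Default to the current time in milliseconds when no ts is given.
        ts = int(time.time() * 1000) if ts is None else ts
        self.rows.setdefault(rowkey, {}).setdefault(column, {})[ts] = value

    def get(self, rowkey, column, versions=1):
        cell = self.rows.get(rowkey, {}).get(column, {})
        return sorted(cell.items(), reverse=True)[:versions]  # newest first

    def delete(self, rowkey, column):
        row = self.rows.get(rowkey)
        if row is not None:
            row.pop(column, None)
            if not row:  # drop rows that no longer hold any cells
                del self.rows[rowkey]

    def scan(self, startrow="", limit=None):
        keys = sorted(k for k in self.rows if k >= startrow)
        if limit is not None:
            keys = keys[:limit]
        # Report the newest value of every column in each row.
        return [(k, {c: max(v.items())[1] for c, v in self.rows[k].items()})
                for k in keys]


t = ToyTable()
t.put("abc", "fam:c1", "old", ts=1)
t.put("abc", "fam:c1", "new", ts=2)
t.put("xyz", "fam:c1", "x", ts=1)
t.put("xyzzy", "fam:c2", "y", ts=1)

versions = t.get("abc", "fam:c1", versions=4)
scanned = t.scan(startrow="xyz", limit=10)
t.delete("xyz", "fam:c1")
after_delete = t.scan(startrow="xyz")
print(versions, scanned, after_delete)
```

The scan starts at the first rowkey that sorts at or after `startrow`, which is why both "xyz" and "xyzzy" are returned before the delete.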

HBase Case Study: NFL Play-by-play data
• We want to store every play for every game since 2002
• Play-by-play is time-series data
• Use play number instead of timestamp
• Identify each game by gameid
• One column family with several columns, one for each attribute of a play
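
One possible reading of these bullets is a row key that combines the gameid with the play number. The schema slide itself is not reproduced in this transcript, so the key layout, the `play_rowkey` helper, the separator, the padding width, and the "game42" id below are all assumptions. The sketch shows why the play number would need zero-padding under lexicographic ordering.

```python
def play_rowkey(gameid, play_number, width=4):
    """Hypothetical row key: gameid, underscore, zero-padded play number."""
    return f"{gameid}_{play_number:0{width}d}"


plays = [1, 2, 10]

# Without padding, "10" sorts before "2" because keys compare as strings.
unpadded = sorted(f"game42_{n}" for n in plays)

# With padding, string order matches play order.
padded = sorted(play_rowkey("game42", n) for n in plays)

print(unpadded)
print(padded)
```

With a key shaped like this, all plays of one game are contiguous and in order, so a single range scan over the game's prefix would return them in sequence.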

Sample data
gameid,qtr,min,sec,off,def,down,togo,ydline,description,offscore,defscore,season
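
A sketch of how one CSV record with this header could be turned into cells in a single column family. The header is taken from the slide; the family name "play", the `row_to_cells` helper, and the sample record are made up for illustration and are not real game data.

```python
import csv
import io

HEADER = ("gameid,qtr,min,sec,off,def,down,togo,ydline,"
          "description,offscore,defscore,season")


def row_to_cells(line, family="play"):
    """Map one CSV record to {"family:column": value}, skipping empty fields."""
    fields = HEADER.split(",")
    values = next(csv.reader(io.StringIO(line)))
    record = dict(zip(fields, values))
    return {f"{family}:{name}": value
            for name, value in record.items() if value != ""}


# Hypothetical record: "down" and "togo" are left empty on purpose.
sample = "game42,1,15,0,AAA,BBB,,,30,kickoff,0,0,2002"
cells = row_to_cells(sample)

for column, value in cells.items():
    print(column, "=", value)
```

Empty fields produce no cell at all, which is the sparse behaviour described earlier in the deck.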

Schema Design

HBase Architecture Overview
