Slide 1

Slide 1 text

Introduction to HBase Karthik Kumar

Slide 2

Slide 2 text

Overview ● Background ● Data Model ● APIs ● Case Study ● Architecture

Slide 3

Slide 3 text

What is HBase? ● A “sparse, distributed, consistent, multi-dimensional, sorted map” ● Column-oriented ● Based on Google’s BigTable (2006) ● Apache project

Slide 4

Slide 4 text

Map ● At its core, HBase is just a mapping of keys to values: key -> value

Slide 5

Slide 5 text

Sorted ● Cells are stored in lexicographic order of their keys ○ Allows for range queries ■ (e.g., return all values with keys between k1 and k5)
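
The range query idea can be sketched with a sorted key list and binary search. This is a toy model in plain Python, not HBase code; the keys, values, and the `range_scan` helper are made up for illustration, and the upper bound is treated as inclusive here.

```python
import bisect

# Toy stand-in for a sorted key space; keys and values are made up.
store = {"k3": "c", "k1": "a", "k7": "g", "k5": "e", "k2": "b"}
sorted_keys = sorted(store)  # lexicographic order, as with row keys


def range_scan(start, stop):
    """Return (key, value) pairs with start <= key <= stop."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, stop)
    return [(k, store[k]) for k in sorted_keys[lo:hi]]


print(range_scan("k1", "k5"))
```

Because the keys are kept in order, the scan only touches the slice between the two bounds instead of checking every key.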

Slide 6

Slide 6 text

Multi-dimensional ● Key is actually made up of several parts (rowkey, column family, column, version) -> value
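
A minimal sketch of the composite key, assuming a plain Python tuple stands in for it. The `CellKey` type, the `sort_order` helper, and the sample rows are hypothetical; the ordering mimics HBase's convention of sorting versions newest first by negating the version number.

```python
from typing import NamedTuple


class CellKey(NamedTuple):
    rowkey: str
    family: str
    column: str
    version: int  # typically a timestamp


def sort_order(key):
    # Ascending by rowkey, family, column; newest version first.
    return (key.rowkey, key.family, key.column, -key.version)


# Made-up cells for illustration.
cells = {
    CellKey("row2", "info", "name", 1): "Bob",
    CellKey("row1", "info", "name", 1): "Al",
    CellKey("row1", "info", "name", 2): "Alice",
    CellKey("row1", "info", "email", 1): "alice@example.com",
}

ordered = sorted(cells, key=sort_order)
for key in ordered:
    print(key, "->", cells[key])
```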

Slide 7

Slide 7 text

Sparse ● For a given row, nothing is stored for null/empty values ● No space is wasted on empty values
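
Sparseness can be illustrated with a dictionary of dictionaries in which each row holds only the columns it actually has. The table contents below are invented, and the comparison with a dense grid is only meant to show the idea, not HBase's on-disk format.

```python
# Hypothetical rows: each row stores only the columns it actually has.
table = {
    "user1": {"info:name": "Alice", "info:email": "alice@example.com"},
    "user2": {"info:name": "Bob"},
    "user3": {"info:name": "Carol", "info:phone": "555-0100"},
}

all_columns = sorted({col for row in table.values() for col in row})
stored_cells = sum(len(row) for row in table.values())
dense_cells = len(table) * len(all_columns)  # slots a fixed grid would need

print(all_columns)
print(stored_cells, "cells stored;", dense_cells, "slots in a dense grid")
print(table["user2"].get("info:email"))  # a missing cell simply is not there
```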

Slide 8

Slide 8 text

Distributed ● HBase spreads data over many machines and can reliably store billions of cells ● Uses HDFS (Hadoop Distributed File System) for storage ○ Provides protection against node failures

Slide 9

Slide 9 text

Consistent ● HBase is strongly consistent ○ All changes within the same row are atomic ○ Reads always return the last written & committed value

Slide 10

Slide 10 text

Data Model
● Table
  ○ Row
    ■ Column Family
      ● Column
        ○ Cell (value, timestamp)
● 1+ columns form a row
● 1+ rows form a table
● Each column can hold 1+ versions of a value

Slide 11

Slide 11 text

Data Model - Table ● Clients store data in HBase tables ● Made up of several rows

Slide 12

Slide 12 text

Data Model - Row ● Rows provide a logical grouping of cells ● Rows are sorted by key

Slide 13

Slide 13 text

Data Model - Columns ● Columns are arbitrary labels for attributes of a row ● Columns do not need to be specified up front

Slide 14

Slide 14 text

Data Model - Column Families ● Columns are grouped into column families ● Column families define storage attributes for their columns (compression, number of versions, etc.)

Slide 15

Slide 15 text

Data Model - Cells ● Each cell contains a value and a version (usually a timestamp)
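
A toy sketch of a versioned cell, assuming the number of retained versions is a simple cap. The `VersionedCell` class and its `max_versions` parameter are hypothetical names for illustration and are not part of any HBase API.

```python
import time


class VersionedCell:
    """Toy cell that keeps the newest max_versions values."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (version, value), newest first

    def put(self, value, version=None):
        if version is None:
            version = int(time.time() * 1000)  # default: current time in ms
        # Replace any existing value stored under the same version.
        self.versions = [(v, x) for v, x in self.versions if v != version]
        self.versions.append((version, value))
        self.versions.sort(key=lambda pair: pair[0], reverse=True)
        del self.versions[self.max_versions:]  # drop the oldest extras

    def get(self, n=1):
        return self.versions[:n]


cell = VersionedCell(max_versions=2)
cell.put("v1", version=100)
cell.put("v2", version=200)
cell.put("v3", version=300)
print(cell.get())     # newest value only
print(cell.get(n=5))  # at most max_versions entries survive
```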

Slide 16

Slide 16 text

Putting it all together (Table, RowKey, Family, Column, Timestamp) -> Value

Slide 17

Slide 17 text

Canonical Example: Google WebTable

Slide 18

Slide 18 text

HBase API ● HBase Shell: JRuby IRB-based shell ○ JRuby = Ruby on the JVM ● REST queries (Stargate): use curl requests ● Thrift: communication protocol used for RPC ● MapReduce: create a Mapper and Reducer that query HBase ● Java, JRuby, HBql, Jython, Groovy, Scala: libraries are available that can talk to HBase

Slide 19

Slide 19 text

CRUD Operations (ts is a numeric timestamp, not a quoted string)
● PUT
  hbase> put 'table', 'rowkey', 'fam:col', 'value', ts
● GET
  hbase> get 'table', 'rowkey', {COLUMN => 'fam:col', TIMESTAMP => ts, VERSIONS => 4}
● DELETE
  hbase> delete 'table', 'rowkey', 'fam:col'
● SCAN
  hbase> scan 'table', {COLUMNS => ['fam:c1', 'fam:c2'], LIMIT => 10, STARTROW => 'xyz'}
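
To make the semantics of the four operations concrete, here is an in-memory stand-in written in plain Python. `ToyTable` and its method signatures are invented for this sketch and only loosely mirror the shell commands; it is not the HBase client API and it ignores details such as how real deletes are recorded.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one table; not the HBase client API."""

    def __init__(self):
        # rowkey -> "fam:col" -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, rowkey, column, value, ts):
        self.rows[rowkey][column][ts] = value

    def get(self, rowkey, column, versions=1):
        cell = self.rows.get(rowkey, {}).get(column, {})
        newest_first = sorted(cell.items(), reverse=True)
        return newest_first[:versions]

    def delete(self, rowkey, column):
        row = self.rows.get(rowkey)
        if row is not None:
            row.pop(column, None)
            if not row:  # drop rows that no longer hold any cells
                del self.rows[rowkey]

    def scan(self, columns=None, limit=None, startrow=""):
        results = []
        for rowkey in sorted(self.rows):  # rows come back in key order
            if rowkey < startrow:
                continue
            if limit is not None and len(results) >= limit:
                break
            latest = {
                col: cell[max(cell)]  # newest version of each column
                for col, cell in self.rows[rowkey].items()
                if columns is None or col in columns
            }
            if latest:
                results.append((rowkey, latest))
        return results


t = ToyTable()
t.put("row1", "fam:col", "old", 1)
t.put("row1", "fam:col", "new", 2)
t.put("row2", "fam:col", "x", 1)
t.put("row3", "fam:other", "y", 1)

print(t.get("row1", "fam:col", versions=4))
print(t.scan(columns=["fam:col"], startrow="row2"))
t.delete("row1", "fam:col")
print(t.get("row1", "fam:col"))
```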

Slide 20

Slide 20 text

HBase Case Study: NFL Play-by-play data ● We want to store every play for every game since 2002 ● Play by play is time-series data ● Use play number instead of timestamp ● Identify each game by gameid ● One column family with several columns for each attribute of play
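
One plausible way to combine gameid and play number into a row key is sketched below; it is not necessarily the schema the talk used. The `make_rowkey` helper, the padding width, and the sample game id are assumptions. The point is that row keys sort lexicographically, so play numbers need zero padding to stay in play order.

```python
def make_rowkey(gameid, play_number, width=4):
    """Build a row key whose lexicographic order matches play order."""
    return f"{gameid}_{play_number:0{width}d}"


# Hypothetical game id; the real format depends on the data set.
game = "20020905_SF@NYG"

naive = sorted(f"{game}_{n}" for n in (1, 2, 10))
padded = sorted(make_rowkey(game, n) for n in (1, 2, 10))

print(naive)   # play 10 sorts before play 2
print(padded)  # plays sort in numeric order
```

With padded keys, a scan starting at the game id returns that game's plays in order.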

Slide 21

Slide 21 text

Sample data gameid,qtr,min,sec,off,def,down,togo,ydline,description,offscore,defscore,season
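
As a rough sketch of how one record with these fields could map onto cells in a single column family: the header comes from the slide, but the record itself, the family name `play`, and the `to_cells` helper are made up for illustration.

```python
import csv
import io

HEADER = ("gameid,qtr,min,sec,off,def,down,togo,ydline,"
          "description,offscore,defscore,season")
# Made-up record for illustration only; not taken from the real data set.
SAMPLE = "20020905_SF@NYG,1,59,30,SF,NYG,1,10,73,Example play,0,0,2002"

FAMILY = "play"  # hypothetical column family name


def to_cells(record, play_number):
    """Return (rowkey, {"family:column": value}) for one record."""
    rowkey = f"{record['gameid']}_{play_number:04d}"
    cells = {
        f"{FAMILY}:{name}": value
        for name, value in record.items()
        if name != "gameid" and value != ""  # sparse: skip empty fields
    }
    return rowkey, cells


reader = csv.DictReader(io.StringIO(HEADER + "\n" + SAMPLE + "\n"))
for play_number, record in enumerate(reader, start=1):
    rowkey, cells = to_cells(record, play_number)
    print(rowkey, cells["play:description"], len(cells))
```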

Slide 22

Slide 22 text

Schema Design

Slide 24

Slide 24 text

HBase Architecture Overview

Slide 25

Slide 25 text

?