Introduction to HBase

A talk I gave on HBase at my alma mater. The audience was undergraduate and graduate Computer Science students, and I gave a high-level overview of HBase.

Karthik Kumar

March 05, 2014


  1. What is HBase? • A “sparse, distributed, consistent, multi-dimensional, sorted map” • Column-oriented • Based on Google’s BigTable (2006) • Apache project
  2. Map • At its core, HBase is just a mapping of keys to values: key -> value
  3. Sorted • Cells are sorted lexicographically by key ◦ Allows for range queries ▪ (return all values with keys between k1...k5)
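
To illustrate why sorted keys make range queries cheap, here is a minimal Python sketch of a sorted key-value map. It is a toy model, not how HBase is implemented; the `SortedMap` class and its method names are hypothetical.

```python
import bisect


class SortedMap:
    """Toy sorted map: keys are kept in lexicographic order (not an HBase API)."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while preserving sort order
        self._values[key] = value

    def range(self, start, stop):
        """Return (key, value) pairs with start <= key <= stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


m = SortedMap()
for k in ["k7", "k3", "k1", "k5", "k9"]:
    m.put(k, "value-" + k)

result = m.range("k1", "k5")  # only the slice between the two keys is touched
```

Because the keys are already ordered, the range query is two binary searches plus a contiguous slice, with no need to examine every key. Note that the ordering is lexicographic, so a key such as "k10" sorts before "k2".
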
  4. Multi-dimensional • Key is actually made up of several parts: (rowkey, column family, column, version) -> value
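
The four-part key can be modeled as a tuple used as a dictionary key. The sketch below is only an illustration of the idea; `CellKey`, `latest`, and the sample rows are hypothetical names and data, not part of HBase.

```python
from collections import namedtuple

# Hypothetical name for the four-part key described above.
CellKey = namedtuple("CellKey", ["rowkey", "family", "column", "version"])

store = {
    CellKey("row1", "info", "name", 1): "Alice",
    CellKey("row1", "info", "name", 2): "Alicia",
    CellKey("row1", "info", "email", 1): "alice@example.com",
    CellKey("row2", "info", "name", 1): "Bob",
}


def latest(store, rowkey, family, column):
    """Return the value with the highest version for one (row, family, column)."""
    matches = [k for k in store
               if (k.rowkey, k.family, k.column) == (rowkey, family, column)]
    if not matches:
        return None
    return store[max(matches, key=lambda k: k.version)]


current_name = latest(store, "row1", "info", "name")
```

Here `current_name` is the value stored under version 2, since the lookup fixes the first three key parts and picks the highest version among the matches.
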
  5. Sparse • For a given row, we don’t store anything for null/empty values • No space is wasted on empty values
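
A small Python sketch of what sparseness buys, assuming a dict-of-dicts stand-in for a table. The row keys and column names are made up for illustration.

```python
# Each row stores only the columns that actually have a value.
rows = {
    "user1": {"info:name": "Alice", "info:email": "alice@example.com"},
    "user2": {"info:name": "Bob"},
    "user3": {"info:name": "Carol", "info:phone": "555-0100"},
}

# Every column that appears in at least one row.
all_columns = sorted({col for cells in rows.values() for col in cells})

stored_cells = sum(len(cells) for cells in rows.values())
dense_cells = len(rows) * len(all_columns)  # what a fixed rows-by-columns grid would hold
```

The sparse layout stores 5 cells, whereas a dense grid over the same 3 rows and 3 columns would reserve 9 slots, 4 of them empty.
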
  6. Distributed • HBase can spread data over many machines and reliably store billions of cells • Uses HDFS (Hadoop Distributed File System) ◦ Provides protection against node failures
  7. Consistent • HBase is strongly consistent ◦ All changes within the same row are atomic ◦ Reads always return the last written & committed value
  8. Data Model • Table ◦ Row ▪ Column Family • Column ◦ Cell (value, timestamp) • 1+ columns form a row • 1+ rows form a table • Each column can hold 1+ versions of a value
  9. Data Model - Table • Clients store data in HBase tables • Made up of several rows
  10. Data Model - Row • Rows provide a logical grouping of cells • Rows are sorted by key
  11. Data Model - Columns • Columns are arbitrary labels for attributes of a row • Do not need to be specified up front
  12. Data Model - Column Families • Columns are grouped into column families • Used to define storage attributes for columns (compression, # of versions, etc.)
  13. Data Model - Cells • Each cell contains a value and a version (usually a timestamp)
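
To make versioned cells and per-family storage attributes concrete, here is a toy Python model of a column family that retains a bounded number of versions per column. `ColumnFamily`, `max_versions`, and the sample values are hypothetical; this is a sketch of the concept, not HBase's storage code.

```python
class ColumnFamily:
    """Toy column family that keeps at most `max_versions` versions per column."""

    def __init__(self, name, max_versions=3):
        self.name = name
        self.max_versions = max_versions
        self.cells = {}  # column -> list of (timestamp, value), newest first

    def put(self, column, value, timestamp):
        versions = self.cells.setdefault(column, [])
        # A write at an existing timestamp replaces that version.
        versions[:] = [tv for tv in versions if tv[0] != timestamp]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first
        del versions[self.max_versions:]                   # trim to the configured limit

    def get(self, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return self.cells.get(column, [])[:versions]


fam = ColumnFamily("info", max_versions=2)
fam.put("name", "Alice", timestamp=100)
fam.put("name", "Alicia", timestamp=200)
fam.put("name", "Ali", timestamp=300)
```

After the three writes, only the two newest versions remain, so the value written at timestamp 100 is no longer readable.
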
  14. HBase API • HBase Shell: JRuby IRB-based shell ◦ JRuby = Ruby on the JVM • REST queries (Stargate): Use curl requests • Thrift: communication protocol used for RPC • MapReduce: Create a Mapper and Reducer that query HBase • Java, JRuby, HBql, Jython, Groovy, Scala: provide libraries that can talk to HBase
  15. CRUD Operations • PUT hbase> put 'table', 'rowkey', 'fam:col', 'value', ts • GET hbase> get 'table', 'rowkey', {COLUMN => 'fam:col', TIMESTAMP => ts, VERSIONS => 4} • DELETE hbase> delete 'table', 'rowkey', 'fam:col' • SCAN hbase> scan 'table', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
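
The semantics of the four shell commands above can be approximated with a small in-memory table in Python. This is a sketch under simplifying assumptions (a single table, no column-family configuration, no persistence); `ToyTable` and its method signatures are hypothetical and are not an HBase client library.

```python
class ToyTable:
    """In-memory stand-in for an HBase table (illustration only)."""

    def __init__(self):
        self.rows = {}  # rowkey -> {"fam:col": {timestamp: value}}

    def put(self, rowkey, column, value, ts):
        self.rows.setdefault(rowkey, {}).setdefault(column, {})[ts] = value

    def get(self, rowkey, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        cell = self.rows.get(rowkey, {}).get(column, {})
        return sorted(cell.items(), reverse=True)[:versions]

    def delete(self, rowkey, column):
        self.rows.get(rowkey, {}).pop(column, None)

    def scan(self, columns=None, limit=None, startrow=""):
        """Return [(rowkey, {column: newest value})] in row-key order."""
        results = []
        for rowkey in sorted(self.rows):
            if rowkey < startrow:
                continue
            cells = {
                col: versions[max(versions)]  # newest timestamp wins
                for col, versions in self.rows[rowkey].items()
                if columns is None or col in columns
            }
            if cells:
                results.append((rowkey, cells))
            if limit is not None and len(results) >= limit:
                break
        return results


t = ToyTable()
t.put("row1", "fam:col", "v1", ts=1)
t.put("row1", "fam:col", "v2", ts=2)
t.put("row2", "fam:col", "x", ts=1)
t.put("xyz", "fam:other", "y", ts=1)
```

With this data, `t.get("row1", "fam:col", versions=4)` returns both stored versions newest first, and `t.scan(startrow="row2")` skips `row1` and returns the remaining rows in key order.
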
  16. HBase Case Study: NFL Play-by-play data • We want to store every play for every game since 2002 • Play by play is time-series data • Use play number instead of timestamp • Identify each game by gameid • One column family with several columns for each attribute of play
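
The deck does not spell out the row key layout, so the following Python sketch shows one possible design as an assumption: a composite key of game id plus a zero-padded play number. The `play_rowkey` helper, the `game001` id, and the 4-digit width are all hypothetical.

```python
def play_rowkey(gameid, play_number, width=4):
    """Build a row key so plays of one game sort together in play order."""
    # Zero-padding makes lexicographic order agree with numeric order.
    return f"{gameid}-{play_number:0{width}d}"


padded = sorted(play_rowkey("game001", n) for n in (12, 2, 101))
unpadded = sorted(f"game001-{n}" for n in (12, 2, 101))
```

With padding, the keys sort as plays 2, 12, 101. Without it, lexicographic order gives 101, 12, 2, which would scramble a range scan over a single game.
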
  17. ?