
Introduction to HBase

A talk I gave on HBase at my alma mater. The audience was undergraduate and graduate Computer Science students, and the talk is a high-level overview of HBase.

Karthik Kumar

March 05, 2014

Transcript

  1. Introduction to HBase
    Karthik Kumar

  2. Overview
    • Background
    • Data Model
    • APIs
    • Case Study
    • Architecture
  3. What is HBase?
    • A “sparse, distributed, consistent, multi-dimensional, sorted map”
    • Column-oriented
    • Based on Google’s Bigtable (2006)
    • Apache project
  4. Map
    • At its core, HBase is just a mapping of keys to values
      key -> value
  5. Sorted
    • Cells are stored sorted lexicographically by key
      ◦ Allows for range queries
        ▪ (return all values with keys between k1...k5)
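The range query on slide 5 can be sketched with a plain sorted list of keys. This is a minimal illustration in Python, not HBase code: `SortedMap` and `range_scan` are hypothetical names, and the half-open range (start included, stop excluded) is an assumption chosen to mirror how HBase scans treat start and stop rows.

```python
import bisect


class SortedMap:
    """Toy key -> value map that keeps its keys in lexicographic order."""

    def __init__(self):
        self._keys = []    # keys, always kept sorted
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while preserving order
        self._values[key] = value

    def range_scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


m = SortedMap()
for k in ["k7", "k2", "k5", "k1", "k3"]:  # inserted out of order on purpose
    m.put(k, "value-of-" + k)

print(m.range_scan("k1", "k5"))  # keys k1, k2, k3 (k5 is excluded)
```

Because the keys are already sorted, the scan only needs two binary searches and a slice; no full pass over the data is required.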
  6. Multi-dimensional
    • Key is actually made up of several parts
      (rowkey, column family, column, version) -> value
  7. Sparse
    • For a given row, we don’t store anything for null/empty values
    • No space is wasted on empty values
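A small sketch of what "sparse" means when combined with the multi-part key from slide 6. It is a toy model in Python, not HBase's storage format; the row keys, family name, and values are made up for illustration, and `put` and `columns_of` are hypothetical helpers.

```python
# Cells are keyed by (rowkey, family, column, version).
# A missing attribute is simply never written, so it takes no space.
cells = {}


def put(rowkey, family, column, version, value):
    cells[(rowkey, family, column, version)] = value


def columns_of(rowkey):
    """Return the (family, column) pairs that actually hold data for a row."""
    return sorted({(f, c) for (r, f, c, _version) in cells if r == rowkey})


put("row1", "info", "name", 1, "alice")
put("row1", "info", "email", 1, "alice@example.com")
put("row2", "info", "name", 1, "bob")  # row2 has no email: nothing is stored

print(columns_of("row1"))
print(columns_of("row2"))
print(len(cells))  # 3 cells in total, no placeholder for row2's email
```

Contrast this with a fixed-width relational row, where every column occupies a slot (even if only a null marker) in every row.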
  8. Distributed
    • Data stored in HBase is spread over many machines, and a table can reliably hold billions of cells
    • Uses HDFS (Hadoop Distributed File System)
      ◦ Provides protection against node failures
  9. Consistent
    • HBase is strongly consistent
      ◦ All changes within the same row are atomic
      ◦ Reads always return the last written & committed value
  10. Data Model
    • Table
      ◦ Row
        ▪ Column Family
          • Column
            ◦ Cell (value, timestamp)
    • 1+ columns form a row
    • 1+ rows form a table
    • Each column can hold 1+ versions of a value
  11. Data Model - Table
    • Clients store data in HBase tables
    • A table is made up of rows
  12. Data Model - Row
    • Rows provide a logical grouping of cells
    • Rows are sorted by key
  13. Data Model - Columns
    • Columns are arbitrary labels for attributes of a row
    • Columns do not need to be specified up front
  14. Data Model - Column Families
    • Columns are grouped into column families
    • Column families define storage attributes for their columns (compression, number of versions, etc.)
  15. Data Model - Cells
    • Each cell contains a value and a version (usually a timestamp)
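To make the idea of versioned cells concrete, here is a minimal Python sketch of a single cell that keeps several timestamped values. `VersionedCell` and its `max_versions` parameter are hypothetical names; `max_versions` stands in for the "number of versions" storage attribute from slide 14, and pruning old versions eagerly on write is a simplification of this sketch.

```python
class VersionedCell:
    """Toy cell holding up to max_versions timestamped values (max_versions >= 1)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, value, ts):
        self._versions[ts] = value
        # Drop the oldest timestamps once the limit is exceeded.
        for old_ts in sorted(self._versions)[:-self.max_versions]:
            del self._versions[old_ts]

    def get(self, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        newest_first = sorted(self._versions.items(), reverse=True)
        return newest_first[:versions]


cell = VersionedCell(max_versions=2)
cell.put("a", ts=100)
cell.put("b", ts=200)
cell.put("c", ts=300)  # pushes out the version written at ts=100

print(cell.get())            # [(300, 'c')]
print(cell.get(versions=5))  # [(300, 'c'), (200, 'b')]
```

A read without an explicit version count returns only the newest value, which is why a versioned cell still behaves like a simple key -> value lookup by default.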
  16. Putting it all together
      (Table, RowKey, Family, Column, Timestamp) -> Value
  17. Canonical Example: Google WebTable

  18. HBase API
    • HBase Shell: JRuby IRB-based shell
      ◦ JRuby = Ruby on the JVM
    • REST queries (Stargate): use curl requests
    • Thrift: communication protocol used for RPC
    • MapReduce: create a Mapper and Reducer that query HBase
    • Java, JRuby, HBql, Jython, Groovy, Scala: libraries are available that can talk to HBase
  19. CRUD Operations
    • PUT
      hbase> put 'table', 'rowkey', 'fam:col', 'value', ts
    • GET
      hbase> get 'table', 'rowkey', {COLUMN => 'fam:col', TIMESTAMP => ts, VERSIONS => 4}
    • DELETE
      hbase> delete 'table', 'rowkey', 'fam:col'
    • SCAN
      hbase> scan 'table', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
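The four shell operations above can be mimicked with a small in-memory table. This is a rough Python sketch to show the semantics, not an HBase client: `ToyTable` and its method signatures are hypothetical, and deletion here removes data immediately, which is a simplification.

```python
import time


class ToyTable:
    """In-memory stand-in for a single table."""

    def __init__(self):
        self.rows = {}  # rowkey -> {"fam:col": {timestamp: value}}

    def put(self, rowkey, column, value, ts=None):
        # Default to the current time in milliseconds when no timestamp is given.
        ts = int(time.time() * 1000) if ts is None else ts
        self.rows.setdefault(rowkey, {}).setdefault(column, {})[ts] = value

    def get(self, rowkey, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        cell = self.rows.get(rowkey, {}).get(column, {})
        return sorted(cell.items(), reverse=True)[:versions]

    def delete(self, rowkey, column):
        row = self.rows.get(rowkey, {})
        row.pop(column, None)
        if not row:
            self.rows.pop(rowkey, None)  # a row with no cells does not exist

    def scan(self, startrow="", limit=None):
        """Return (rowkey, row) pairs in key order, starting at startrow."""
        keys = [k for k in sorted(self.rows) if k >= startrow]
        if limit is not None:
            keys = keys[:limit]
        return [(k, self.rows[k]) for k in keys]


t = ToyTable()
t.put("xyz1", "fam:col", "v1", ts=1)
t.put("xyz1", "fam:col", "v2", ts=2)
t.put("abc", "fam:col", "other", ts=1)

print(t.get("xyz1", "fam:col", versions=4))  # [(2, 'v2'), (1, 'v1')]
print(t.scan(startrow="xyz", limit=10))      # only 'xyz1'; 'abc' sorts before 'xyz'
t.delete("xyz1", "fam:col")
print(t.get("xyz1", "fam:col"))              # []
```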
  20. HBase Case Study: NFL Play-by-play data
    • We want to store every play for every game since 2002
    • Play by play is time-series data
    • Use play number instead of timestamp
    • Identify each game by gameid
    • One column family with several columns for each attribute of play
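One way the gameid and play number could be combined is in the row key, so that all plays of a game sit next to each other in play order. The transcript does not spell out the actual schema, so the following Python sketch is an assumption throughout: `make_rowkey`, the `_` separator, the padding width, and the `GAME01` identifier are all made up for illustration. The point it shows is that keys sort lexicographically, so numeric parts need zero-padding.

```python
def make_rowkey(gameid, play_number, width=4):
    """Build '<gameid>_<zero-padded play number>' (hypothetical layout)."""
    return f"{gameid}_{play_number:0{width}d}"


plays = (2, 10, 1)

padded = sorted(make_rowkey("GAME01", n) for n in plays)
unpadded = sorted(f"GAME01_{n}" for n in plays)

print(padded)    # ['GAME01_0001', 'GAME01_0002', 'GAME01_0010']
print(unpadded)  # ['GAME01_1', 'GAME01_10', 'GAME01_2'], play 10 lands before play 2
```

With padded keys, a scan that starts at the game's prefix returns the plays of that game in the order they happened.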
  21. Sample data
    gameid,qtr,min,sec,off,def,down,togo,ydline,description,offscore,defscore,season
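A sketch of how one line of this CSV could be turned into column/value pairs for a single column family, as slide 20 suggests. Only the header comes from the slide; the family name `play`, the helper `to_columns`, and the data line are invented for illustration and are not real play-by-play data. Empty fields are skipped, in keeping with the sparse storage described earlier.

```python
import csv
import io

HEADER = "gameid,qtr,min,sec,off,def,down,togo,ydline,description,offscore,defscore,season"
FIELDS = HEADER.split(",")


def to_columns(line, family="play"):
    """Map one CSV line to {'family:column': value}, skipping empty fields."""
    values = next(csv.reader(io.StringIO(line)))
    record = dict(zip(FIELDS, values))
    return {f"{family}:{name}": value for name, value in record.items() if value != ""}


# Made-up line: 'down' and 'togo' are left empty on purpose.
line = "GAME01,1,15,0,AAA,BBB,,,30,kickoff (made-up example),0,0,2002"
columns = to_columns(line)

print(len(columns))              # 11 of the 13 fields carry a value
print("play:down" in columns)    # False: empty fields are not stored
print(columns["play:season"])    # 2002
```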

  22. Schema Design

  23. (no transcript text for this slide)
  24. HBase Architecture Overview

  25. ?