Slide 1

HBase Storage Internals, Present and Future
Matteo Bertozzi | @Cloudera
March 2013 – Hadoop Summit Europe

Slide 2

What is HBase?
• Open source storage manager that provides random read/write on top of HDFS
• Provides tables with a “Key:Column/Value” interface
• Dynamic columns (qualifiers), no schema needed
• “Fixed” column groups (families)
• table[row:family:column] = value
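
The data model above can be sketched as a plain sorted-key store; this is an illustrative toy, not the real HBase API, and `put`/`get` are hypothetical helpers for this sketch only.

```python
# Toy model of HBase's data model: cells addressed by
# (row, family, qualifier), with dynamic qualifiers and no schema.
table = {}

def put(row, family, qualifier, value):
    table[(row, family, qualifier)] = value

def get(row, family, qualifier):
    return table.get((row, family, qualifier))

put("row1", "cf", "name", "Matteo")
put("row1", "cf", "city", "Dublin")  # new qualifier, no schema change needed
```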

Slide 3

HBase ecosystem
• Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log)
• Apache ZooKeeper for distributed coordination
• Apache Hadoop MapReduce: built-in support for running MapReduce jobs

Slide 4

How HBase Works – “View from 10,000 ft”

Slide 5

Master, Region Servers and Regions
• Region Server
  • Server that contains a set of Regions
  • Responsible for handling reads and writes
• Region
  • The basic unit of scalability in HBase
  • A subset of the table’s data
  • A contiguous, sorted range of rows stored together
• Master
  • Coordinates the HBase cluster
  • Assignment/balancing of the Regions
  • Handles admin operations (create/delete/modify table, …)

Slide 6

Autosharding and the .META. table
• A Region is a subset of the table’s data
• When there is too much data in a Region…
  • a split is triggered, creating 2 regions
• The association “Region -> Server” is stored in a system table
• The location of .META. is stored in ZooKeeper

Table      Start Key  Region ID  Region Server
testTable  Key-00     1          machine01.host
testTable  Key-31     2          machine03.host
testTable  Key-65     3          machine02.host
testTable  Key-83     4          machine01.host
…          …          …          …
users      Key-AB     1          machine03.host
users      Key-KG     2          machine02.host
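
The lookup implied by this table can be sketched with a binary search over sorted region start keys: a key belongs to the region with the greatest start key less than or equal to it. This is an illustrative sketch using the sample `testTable` entries above, not HBase client code.

```python
import bisect

# Toy .META. lookup: regions are kept sorted by start key.
regions = [("Key-00", "machine01.host"),
           ("Key-31", "machine03.host"),
           ("Key-65", "machine02.host"),
           ("Key-83", "machine01.host")]
start_keys = [start for start, _ in regions]

def locate(key):
    # rightmost region whose start key <= key
    idx = bisect.bisect_right(start_keys, key) - 1
    return regions[idx][1]
```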

Slide 7

The Write Path – Creating a New Table
• The client asks the Master to create a new table
  • hbase> create 'myTable', 'cf'
• The Master
  • Stores the table information (the “schema”)
  • Creates Regions based on the key-splits provided
    • if no splits are provided, a single region by default
  • Assigns the Regions to the Region Servers
    • the Region -> Server assignment is written to a system table called “.META.”

Slide 8

The Write Path – “Inserting” Data
• table.put(row-key:family:column, value)
• The client asks ZooKeeper for the location of .META.
• The client scans .META. to find the Region Server responsible for the key
• The client asks that Region Server to insert/update/delete the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for the key
  • The operation is written to a Write-Ahead Log (WAL)
  • …and the KeyValue is added to the store: the “MemStore”

Slide 9

The Write Path – Append-Only to Random R/W
• Files in HDFS are
  • Append-only
  • Immutable once closed
• So does HBase really provide random writes?
  • …not from a storage point of view
  • KeyValues are stored in memory and written to disk on pressure
    • Don’t worry, your data is safe in the WAL!
    • (The Region Server can recover data from the WAL in case of a crash)
  • But this allows sorting data by key before writing it to disk
• Deletes are like inserts, but with a “remove me” flag
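
The WAL-then-MemStore-then-flush flow above can be sketched as follows; all names here are illustrative stand-ins, not HBase internals.

```python
# Minimal sketch of the storage side of the write path: every mutation
# is appended to a WAL for durability, then buffered in a MemStore;
# a flush writes the buffer out, sorted by key, as an immutable file.
wal = []          # append-only log, replayed after a crash
memstore = {}     # in-memory buffer
store_files = []  # immutable, key-sorted "files"

def write(key, value):
    wal.append((key, value))  # durability first
    memstore[key] = value

def flush():
    store_files.append(sorted(memstore.items()))  # sort before "disk"
    memstore.clear()

write("Key3", "value 3")
write("Key1", "value 1")
write("Key2", "value 2")
flush()
```

Note that the WAL preserves arrival order while the flushed file is sorted, which is exactly why random writes can be served from an append-only store.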

Slide 10

The Read Path – “Reading” Data
• The client asks ZooKeeper for the location of .META.
• The client scans .META. to find the Region Server responsible for the key
• The client asks that Region Server to get the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for the key
  • The MemStore and Store Files are scanned to find the key

Slide 11

The Read Path – Append-Only to Random R/W
• On each flush a new file is created
• Each file has its KeyValues sorted by key
• Two or more files can contain the same key (updates/deletes)
• To find a key you need to scan all the files
  • …with some optimizations
    • Filter files by start/end key
    • Keep a bloom filter in each file
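
The multi-file read can be sketched by checking files newest-first, so a newer value (or a delete marker) shadows an older one. This is a toy model; `DELETED` stands in for HBase’s tombstone marker.

```python
# Sketch of a read across several store files, newest first.
DELETED = object()  # stand-in for a delete tombstone

def read(files_newest_first, key):
    for f in files_newest_first:
        if key in f:
            value = f[key]
            return None if value is DELETED else value
    return None  # key not present in any file

flush1 = {"Key0": "value 0.0", "Key5": "value 5.0"}
flush2 = {"Key0": "value 0.1", "Key5": DELETED, "Key6": "value 6.0"}
```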

Slide 12

HBase Store File Format – HFile

Slide 13

HFile Format
• Only sequential writes, just append(key, value)
• Large sequential reads are better
• Why group records in blocks?
  • Easy to split
  • Easy to read
  • Easy to cache
  • Easy to index (if records are sorted)
  • Block compression (snappy, lz4, gz, …)
• A file is a sequence of blocks (header + records), followed by the block index and a trailer
• Key/Value (record) layout:
  • Key Length : int
  • Value Length : int
  • Key : byte[]
  • Value : byte[]
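
The record layout above can be sketched as length-prefixed serialization; a simplified illustration (big-endian ints assumed here), not the actual HFile writer.

```python
import struct

# Sketch of the record layout: key length and value length as ints,
# followed by the raw key and value bytes.
def encode_record(key: bytes, value: bytes) -> bytes:
    return struct.pack(">ii", len(key), len(value)) + key + value

def decode_record(buf: bytes, offset: int = 0):
    klen, vlen = struct.unpack_from(">ii", buf, offset)
    offset += 8  # past the two length ints
    key = buf[offset:offset + klen]
    value = buf[offset + klen:offset + klen + vlen]
    return key, value, offset + klen + vlen  # next record's offset
```

Because records are self-delimiting, appending is trivial and a block can be scanned by repeatedly decoding from the returned offset.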

Slide 14

Data Block Encoding
• “Be aware of the data”
• Block encoding allows compressing the key based on what we know
  • Keys are sorted… prefixes are likely similar in most cases
  • One file contains keys from one family only
  • Timestamps are “similar”, so we can store the diff
  • Type is “put” most of the time…
• “On-disk” KeyValue layout:
  • Row Length : short
  • Row : byte[]
  • Family Length : byte
  • Family : byte[]
  • Qualifier : byte[]
  • Timestamp : long
  • Type : byte
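
The “sorted keys have similar prefixes” idea can be sketched with simple prefix encoding: each key is stored as the length of the prefix it shares with the previous key plus its remaining suffix. This is an illustration of the general technique, not HBase’s actual encoder.

```python
# Prefix-encode a sorted list of keys as (shared_prefix_len, suffix).
def prefix_encode(keys):
    encoded, prev = [], ""
    for key in keys:
        common = 0
        limit = min(len(prev), len(key))
        while common < limit and prev[common] == key[common]:
            common += 1
        encoded.append((common, key[common:]))
        prev = key
    return encoded

def prefix_decode(encoded):
    keys, prev = [], ""
    for common, suffix in encoded:
        key = prev[:common] + suffix
        keys.append(key)
        prev = key
    return keys
```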

Slide 15

Optimize the Read Path – Compactions

Slide 16

Compactions
• Reduce the number of files to look into during a scan
  • Removing duplicated keys (updated values)
  • Removing deleted keys
• A compaction creates a new file by merging the contents of two or more files
  • and then removes the old files
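
The merge step can be sketched as follows: files are visited newest-first, only the newest value per key survives, and keys whose newest entry is a delete marker are dropped. A toy model, with `None` standing in for the tombstone.

```python
# Sketch of a (major) compaction: merge sorted files into one.
def compact(files_newest_first):
    merged = {}
    for f in files_newest_first:
        for key, value in f:
            merged.setdefault(key, value)  # first (newest) entry wins
    # drop tombstoned keys, emit a single sorted file
    return sorted((k, v) for k, v in merged.items() if v is not None)

newer = [("Key0", "value 0.1"), ("Key1", "value 1.0"), ("Key5", None)]
older = [("Key0", "value 0.0"), ("Key2", "value 2.0"), ("Key5", "value 5.0")]
```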

Slide 17

Pluggable Compactions
• Try different algorithms
• Be aware of the data
  • Time series? I guess no updates from the ’80s
• Be aware of the requests
  • Compact based on statistics
    • which files are hot and which are not
    • which keys are hot and which are not

Slide 18

Snapshots – Zero-Copy Snapshots and Table Clones

Slide 19

What Is a Snapshot?
• “A snapshot is not a copy of the table”
• A snapshot is a set of metadata information
  • The table “schema” (column families and attributes)
  • The Region information (start key, end key, …)
  • The list of Store Files
  • The list of active WALs

Slide 20

How Does Taking a Snapshot Work?
• The Master orchestrates the Region Servers
  • the communication is done via ZooKeeper
  • using a “2-phase commit like” transaction (prepare/commit)
• Each Region Server is responsible for taking its “piece” of the snapshot
  • For each Region, it stores the metadata information needed
    • (list of Store Files, WALs, region start/end keys, …)

Slide 21

Cloning a Table from a Snapshot
• hbase> clone_snapshot 'snapshotName', 'tableName'
• Creates a new table with the data “contained” in the snapshot
  • No data copies involved
    • HFiles are immutable, and shared between tables and snapshots
• You can insert/update/remove data in the new table
  • No repercussions on the snapshot, the original table, or other cloned tables

Slide 22

Compactions & Archiving
• HFiles are immutable, and shared between tables and snapshots
• On compaction or table deletion, files are removed from disk
• If one of these files is referenced by a snapshot or a cloned table
  • The file is moved to an “archive” directory
  • And deleted later, when there are no more references to it

Slide 23

Future – What Can Be Improved?

Slide 24

0.96 Is Coming Up
• Moving RPC to Protobuf
  • Allows rolling upgrades with no surprises
• HBase Snapshots
• Pluggable Compactions
• Removal of -ROOT-
• Table Locks

Slide 25

0.98 and Beyond
• Transparent Table/Column-Family Encryption
• Cell-level security
• Multiple WALs per Region Server (MTTR)
• Data Placement Awareness (MTTR)
• Data Type Awareness
• Compaction policies based on the data needs
• Managing blocks directly (instead of files)

Slide 26

Questions?
Matteo Bertozzi | @Cloudera
