
HBase Storage Internals


Matteo Bertozzi

March 24, 2013


Transcript

  1. Matteo Bertozzi | @Cloudera
     March 2013 - Hadoop Summit Europe
     HBase Storage Internals, present and future!
  2. What is HBase?
     • Open source storage manager that provides random read/write on top of HDFS
     • Provides tables with a “Key:Column/Value” interface
     • Dynamic columns (qualifiers), no schema needed
     • “Fixed” column groups (families)
     • table[row:family:column] = value
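
     That bracket notation can be read as a sorted map of sorted maps. A minimal sketch of the model in plain Java (illustrative only, this is not the HBase client API; timestamps and versions are omitted):

         import java.util.SortedMap;
         import java.util.TreeMap;

         // Illustrative model only: HBase as a sorted map of sorted maps,
         // table[row][family:qualifier] = value.
         public class TableModel {
             private final SortedMap<String, SortedMap<String, String>> rows = new TreeMap<>();

             public void put(String row, String familyColumn, String value) {
                 rows.computeIfAbsent(row, k -> new TreeMap<>()).put(familyColumn, value);
             }

             public String get(String row, String familyColumn) {
                 SortedMap<String, String> columns = rows.get(row);
                 return columns == null ? null : columns.get(familyColumn);
             }

             public static void main(String[] args) {
                 TableModel table = new TableModel();
                 table.put("row-1", "cf:col", "value");   // dynamic qualifiers, no schema
                 System.out.println(table.get("row-1", "cf:col"));
             }
         }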
  3. HBase ecosystem
     • Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log)
     • Apache ZooKeeper for distributed coordination
     • Apache Hadoop MapReduce: built-in support for running MapReduce jobs
     [Diagram: the application sits on top of ZooKeeper, HDFS and MapReduce]
  4. Master, Region Servers and Regions
     • Region Server
       • Server that hosts a set of Regions
       • Responsible for handling reads and writes
     • Region
       • The basic unit of scalability in HBase
       • Subset of the table’s data
       • Contiguous, sorted range of rows stored together
     • Master
       • Coordinates the HBase cluster
       • Assignment/balancing of the Regions
       • Handles admin operations: create/delete/modify table, …
     [Diagram: Client, ZooKeeper and Master above Region Servers hosting Regions on HDFS]
  5. Autosharding and the .META. table
     • A Region is a subset of the table’s data
     • When there is too much data in a Region…
       • a split is triggered, creating 2 regions
     • The association “Region -> Server” is stored in a system table
     • The location of .META. is stored in ZooKeeper

     Example .META. content:
         Table       Start Key   Region ID   Region Server
         testTable   Key-00      1           machine01.host
         testTable   Key-31      2           machine03.host
         testTable   Key-65      3           machine02.host
         testTable   Key-83      4           machine01.host
         …           …           …           …
         users       Key-AB      1           machine03.host
         users       Key-KG      2           machine02.host

     Resulting assignment:
         machine01: Region 1 (testTable), Region 4 (testTable)
         machine02: Region 3 (testTable), Region 1 (users)
         machine03: Region 2 (testTable), Region 2 (users)
  6. The Write Path – Create a New Table
     • The client asks the Master to create a new table
       • hbase> create ‘myTable’, ‘cf’
     • The Master
       • Stores the table information (“schema”)
       • Creates Regions based on the key-splits provided
         • if no splits are provided, a single region by default
       • Assigns the Regions to the Region Servers
         • the assignment Region -> Server is written to a system table called “.META.”
     [Diagram: createTable() goes to the Master, which stores the table “metadata”, assigns the Regions to the Region Servers and “enables” the table]
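
     The same flow through the 0.94-era Java admin API, as a hedged sketch (the table name, family and split keys are illustrative):

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.hbase.HBaseConfiguration;
         import org.apache.hadoop.hbase.HColumnDescriptor;
         import org.apache.hadoop.hbase.HTableDescriptor;
         import org.apache.hadoop.hbase.client.HBaseAdmin;
         import org.apache.hadoop.hbase.util.Bytes;

         public class CreateTableExample {
             public static void main(String[] args) throws Exception {
                 Configuration conf = HBaseConfiguration.create();
                 HBaseAdmin admin = new HBaseAdmin(conf);

                 HTableDescriptor desc = new HTableDescriptor("myTable");
                 desc.addFamily(new HColumnDescriptor("cf"));

                 // Optional pre-splits: without them the table starts as one region.
                 byte[][] splitKeys = { Bytes.toBytes("Key-31"), Bytes.toBytes("Key-65") };
                 admin.createTable(desc, splitKeys);
                 admin.close();
             }
         }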
  7. The Write Path – “Inserting” Data
     • table.put(row-key:family:column, value)
     • The client asks ZooKeeper for the location of .META.
     • The client scans .META. to find the Region Server responsible for the key
     • The client asks that Region Server to insert/update/delete the specified key/value
     • The Region Server processes the request and dispatches it to the Region responsible for the key
       • The operation is written to a Write-Ahead Log (WAL)
       • …and the KeyValues are added to the Store: the “MemStore”
     [Diagram: Client asks ZooKeeper “Where is .META.?”, scans .META., then sends the KeyValue to the Region Server hosting the target Region]
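
     The same put, spelled out with the 0.94-era Java client API (row key, family, qualifier and value are illustrative; the .META. lookup happens inside the client library):

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.hbase.HBaseConfiguration;
         import org.apache.hadoop.hbase.client.HTable;
         import org.apache.hadoop.hbase.client.Put;
         import org.apache.hadoop.hbase.util.Bytes;

         public class PutExample {
             public static void main(String[] args) throws Exception {
                 Configuration conf = HBaseConfiguration.create();
                 HTable table = new HTable(conf, "myTable");

                 // table.put(row-key:family:column, value) from the slide:
                 Put put = new Put(Bytes.toBytes("row-key"));
                 put.add(Bytes.toBytes("cf"), Bytes.toBytes("column"), Bytes.toBytes("value"));
                 table.put(put);   // the client locates the right region via .META.

                 table.close();
             }
         }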
  8. The Write Path – Append Only to Random R/W
     • Files in HDFS are
       • Append-only
       • Immutable once closed
     • Does HBase provide random writes?
       • …not really, from a storage point of view
       • KeyValues are stored in memory and written to disk on pressure
         • Don’t worry, your data is safe in the WAL!
         • (The Region Server can recover data from the WAL in case of a crash)
       • But this allows sorting data by key before writing it to disk
       • Deletes are like inserts, but with a “remove me” flag
     [Diagram: each Region Server has a WAL and Regions, each Region a MemStore of sorted KeyValues (Key0 – value 0 … Key5 – value 5) plus Store Files (HFiles)]
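
     A minimal sketch of the MemStore idea, assuming a simple string key/value model: writes land in a sorted in-memory buffer and are flushed as an immutable, sorted file once a threshold is hit. The class name and threshold are illustrative; real HBase writes the WAL first and uses a concurrent skip list.

         import java.io.PrintWriter;
         import java.util.Map;
         import java.util.TreeMap;

         // Illustrative sketch only: a write buffer that keeps KeyValues sorted
         // in memory and flushes them to an immutable, sorted file on pressure.
         public class MemStoreSketch {
             private static final int FLUSH_THRESHOLD = 4;   // illustrative
             private final TreeMap<String, String> buffer = new TreeMap<>();
             private int flushCount = 0;

             public void put(String key, String value) throws Exception {
                 buffer.put(key, value);
                 if (buffer.size() >= FLUSH_THRESHOLD) {
                     flush();
                 }
             }

             private void flush() throws Exception {
                 // Keys come out of the TreeMap already sorted, so the file
                 // can be written with sequential appends only.
                 try (PrintWriter out = new PrintWriter("storefile-" + (flushCount++))) {
                     for (Map.Entry<String, String> e : buffer.entrySet()) {
                         out.println(e.getKey() + "\t" + e.getValue());
                     }
                 }
                 buffer.clear();
             }
         }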
  9. The Read Path – “Reading” Data
     • The client asks ZooKeeper for the location of .META.
     • The client scans .META. to find the Region Server responsible for the key
     • The client asks that Region Server to get the specified key/value
     • The Region Server processes the request and dispatches it to the Region responsible for the key
       • The MemStore and the Store Files are scanned to find the key
     [Diagram: Client asks ZooKeeper “Where is .META.?”, scans .META., then sends the Get to the Region Server hosting the target Region]
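
     The corresponding read through the 0.94-era Java client API (names are illustrative):

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.hbase.HBaseConfiguration;
         import org.apache.hadoop.hbase.client.Get;
         import org.apache.hadoop.hbase.client.HTable;
         import org.apache.hadoop.hbase.client.Result;
         import org.apache.hadoop.hbase.util.Bytes;

         public class GetExample {
             public static void main(String[] args) throws Exception {
                 Configuration conf = HBaseConfiguration.create();
                 HTable table = new HTable(conf, "myTable");

                 Get get = new Get(Bytes.toBytes("row-key"));
                 Result result = table.get(get);   // region lookup is transparent
                 byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column"));
                 System.out.println(value == null ? "not found" : Bytes.toString(value));

                 table.close();
             }
         }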
  10. The Read Path – Append Only to Random R/W
      • Each flush creates a new file
      • Each file has its KeyValues sorted by key
      • Two or more files can contain the same key (updates/deletes)
      • To find a key you need to scan all the files
        • …with some optimizations
        • Filter files by start/end key
        • Keep a bloom filter on each file
      Example store files after three flushes (Key0 and Key5 appear more than once):
          File 1: Key0 – value 0.0, Key2 – value 2.0, Key3 – value 3.0, Key5 – value 5.0, Key8 – value 8.0, Key9 – value 9.0
          File 2: Key0 – value 0.1, Key5 – value 5.0
          File 3: Key1 – value 1.0, Key5 – [deleted], Key6 – value 6.0, Key7 – value 7.0
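
      A minimal sketch of that lookup, assuming a string key/value model: consult the files newest first, and skip any file whose key range cannot contain the key. A real store would also consult the per-file bloom filter before reading, and would handle delete markers.

          import java.util.List;
          import java.util.NavigableMap;

          // Illustrative: each flush produced an immutable sorted file; a point
          // lookup consults every file that may contain the key, newest first.
          class StoreFileSketch {
              final String startKey, endKey;            // min/max key in the file
              final NavigableMap<String, String> data;  // sorted records (in memory here)

              StoreFileSketch(String startKey, String endKey, NavigableMap<String, String> data) {
                  this.startKey = startKey; this.endKey = endKey; this.data = data;
              }

              boolean mayContain(String key) {
                  // Start/end-key filtering from the slide.
                  return key.compareTo(startKey) >= 0 && key.compareTo(endKey) <= 0;
              }
          }

          class ReadPathSketch {
              // 'files' is ordered newest to oldest, so the first hit wins.
              static String get(List<StoreFileSketch> files, String key) {
                  for (StoreFileSketch f : files) {
                      if (!f.mayContain(key)) continue;   // filtered without reading
                      String v = f.data.get(key);
                      if (v != null) return v;            // newest version of the key
                  }
                  return null;
              }
          }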
  11. HFile Format
      • Only sequential writes, just append(key, value)
      • Large sequential reads are better
      • Why group records into blocks?
        • Easy to split
        • Easy to read
        • Easy to cache
        • Easy to index (if records are sorted)
        • Block compression (snappy, lz4, gz, …)
      File layout: a sequence of blocks (each block = header + record 0…N), followed by the block index (index 0…N) and a trailer.
      Key/Value (record) layout:
          Key Length   : int
          Value Length : int
          Key          : byte[]
          Value        : byte[]
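
      A hedged sketch of that record layout using plain java.io streams (it mirrors the description above, not the actual HFile writer code):

          import java.io.ByteArrayOutputStream;
          import java.io.DataOutputStream;
          import java.io.IOException;

          // append(key, value) writes [key length : int][value length : int][key][value],
          // so the file only ever needs sequential writes.
          public class RecordWriterSketch {
              public static byte[] encode(byte[] key, byte[] value) throws IOException {
                  ByteArrayOutputStream buf = new ByteArrayOutputStream();
                  DataOutputStream out = new DataOutputStream(buf);
                  out.writeInt(key.length);
                  out.writeInt(value.length);
                  out.write(key);
                  out.write(value);
                  return buf.toByteArray();
              }
          }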
  12. Data Block Encoding
      • “Be aware of the data”
      • Block encoding allows compressing the key based on what we know
        • Keys are sorted… prefixes are likely similar in most cases
        • One file contains keys from one family only
        • Timestamps are “similar”, so we can store the diff
        • Type is “put” most of the time…
      “On-disk” KeyValue key layout:
          Row Length    : short
          Row           : byte[]
          Family Length : byte
          Family        : byte[]
          Qualifier     : byte[]
          Timestamp     : long
          Type          : byte
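
      A minimal sketch of the prefix idea behind these encodings, assuming raw byte[] keys: because keys are sorted, each key can be stored as the length of the prefix it shares with the previous key plus the remaining suffix. The class is illustrative; HBase’s real block encoders (PREFIX, DIFF, FAST_DIFF) apply the same trick across the full KeyValue layout.

          public class PrefixEncodingSketch {
              static int commonPrefix(byte[] a, byte[] b) {
                  int n = Math.min(a.length, b.length), i = 0;
                  while (i < n && a[i] == b[i]) i++;
                  return i;
              }

              public static void main(String[] args) {
                  byte[] prev = "row-0001:cf:col".getBytes();
                  byte[] curr = "row-0002:cf:col".getBytes();
                  int shared = commonPrefix(prev, curr);
                  // Store only (shared, suffix) instead of the whole key.
                  System.out.println("shared=" + shared +
                      " suffix=" + new String(curr, shared, curr.length - shared));
              }
          }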
  13. Compactions
      • Reduce the number of files to look into during a scan
        • Removing duplicated keys (updated values)
        • Removing deleted keys
      • Creates a new file by merging the content of two or more files
        • then removes the old files
      Example: merging two files
          File A: Key0 – value 0.0, Key2 – value 2.0, Key3 – value 3.0, Key5 – value 5.0, Key8 – value 8.0, Key9 – value 9.0
          File B: Key0 – value 0.1, Key1 – value 1.0, Key4 – value 4.0, Key5 – [deleted], Key6 – value 6.0, Key7 – value 7.0
          Merged: Key0 – value 0.1, Key1 – value 1.0, Key2 – value 2.0, Key3 – value 3.0, Key4 – value 4.0, Key6 – value 6.0, Key7 – value 7.0, Key8 – value 8.0, Key9 – value 9.0
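
      A minimal sketch of that merge, assuming a string key/value model and in-memory file contents (real compactions stream a k-way merge over the sorted files instead of loading them):

          import java.util.List;
          import java.util.Map;
          import java.util.TreeMap;

          // Merge several sorted files into one, keeping only the newest
          // version of each key and dropping delete markers.
          public class CompactionSketch {
              static final String DELETED = "[deleted]";       // illustrative tombstone

              // 'files' ordered oldest to newest; later entries overwrite earlier ones.
              static TreeMap<String, String> compact(List<Map<String, String>> files) {
                  TreeMap<String, String> merged = new TreeMap<>();
                  for (Map<String, String> file : files) {
                      merged.putAll(file);                     // newest version wins
                  }
                  merged.values().removeIf(DELETED::equals);   // drop deleted keys
                  return merged;
              }
          }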
  14. Pluggable Compactions
      • Try different algorithms
      • Be aware of the data
        • Time series? I guess no updates from the ’80s
      • Be aware of the requests
        • Compact based on statistics
          • which files are hot and which are not
          • which keys are hot and which are not
  15. What Is a Snapshot?
      • “A snapshot is not a copy of the table”
      • A snapshot is a set of metadata information
        • The table “schema” (column families and attributes)
        • The Regions information (start key, end key, …)
        • The list of Store Files
        • The list of active WALs
      [Diagram: Master and ZooKeeper coordinating Region Servers, each with its WAL and Store Files (HFiles)]
  16. How Taking a Snapshot Works
      • The Master orchestrates the Region Servers
        • the communication is done via ZooKeeper
        • using a “2-phase-commit-like” transaction (prepare/commit)
      • Each Region Server is responsible for taking its “piece” of the snapshot
        • For each Region, it stores the needed metadata information
        • (list of Store Files, WALs, region start/end keys, …)
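
      On the shell side, taking and listing snapshots looks like this (the table and snapshot names are illustrative):

          hbase> snapshot ‘myTable’, ‘myTable-snapshot’
          hbase> list_snapshots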
  17. Cloning a Table from a Snapshot
      • hbase> clone_snapshot ‘snapshotName’, ‘tableName’
      • Creates a new table with the data “contained” in the snapshot
        • No data copies involved
        • HFiles are immutable, and shared between tables and snapshots
      • You can insert/update/remove data from the new table
        • No repercussions on the snapshot, the original table or other cloned tables
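
      For example (names illustrative), a clone can be written to immediately without affecting the snapshot:

          hbase> clone_snapshot ‘myTable-snapshot’, ‘myTableClone’
          hbase> put ‘myTableClone’, ‘row-key’, ‘cf:column’, ‘new-value’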
  18. Compactions & Archiving
      • HFiles are immutable, and shared between tables and snapshots
      • On compaction or table deletion, files are removed from disk
        • If one of these files is referenced by a snapshot or a cloned table
          • the file is moved to an “archive” directory
          • and deleted later, when there are no more references to it
  19. 0.96 is coming up
      • Moving RPC to Protobuf
        • Allows rolling upgrades with no surprises
      • HBase Snapshots
      • Pluggable Compactions
      • Remove -ROOT-
      • Table Locks
  20. 0.98 and Beyond
      • Transparent table/column-family encryption
      • Cell-level security
      • Multiple WALs per Region Server (MTTR)
      • Data placement awareness (MTTR)
      • Data type awareness
        • Compaction policies based on the data needs
        • Managing blocks directly (instead of files)
  21. Questions? Matteo Bertozzi | @Cloudera