
Technical Challenges of HDFS Erasure Coding on a Large Scale


LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Speaker
    Data Platform Dept., Data Engineering Center
    - Toshihiko Uchida
      - Senior Software Engineer
      - Member of the tech lead team at the Data Platform Dept.
      - Interests
        - Distributed systems
        - Formal methods
      - Apache Hadoop/Hive contributor

  2. Agenda
    - Apache HDFS at LINE
    - Introduction to HDFS Erasure Coding
    - Troubleshooting of HDFS Erasure Coding

  3. Apache HDFS at LINE

  4. Apache HDFS
    Distributed file system
    - Highly scalable distributed file system
    - Designed to run on thousands of servers and store petabyte-scale data
    - High fault tolerance
    - High throughput

  5. Apache HDFS
    Components
    - Master: NameNode
    - Slave: DataNode

  6. Scale
    Unified storage for the data lake
    - Data size: 290 PB
    - # of DataNodes: 2,000
    - Monthly increase: 10 PB

  7. HDFS Erasure Coding
    New feature in Hadoop 3
    Erasure Coding (RS-3-2)
    - A file is divided into blocks, which in turn are divided into cells
    - Parity cells are calculated from the data cells
    - Cells are striped across DataNodes (see the worked example below)
    Replication (3x)
    - A file is divided into blocks
    - Blocks are replicated across DataNodes

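    As a worked example of the RS-3-2 layout above (the 1 MiB cell size is
    assumed here; it is the default of HDFS's built-in policies): each stripe
    packs three data cells plus two parity cells, spread over five DataNodes.

```latex
% RS(k, m) with k = 3 data cells and m = 2 parity cells per stripe
\text{stripe} = (d_1, d_2, d_3 \mid p_1, p_2)
% Raw bytes stored per logical byte:
\frac{k+m}{k} = \frac{3+2}{3} \approx 1.67 \quad \text{vs. } 3.0 \text{ for 3x replication}
```
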
  8. Data Recovery
    How to recover lost data
    Erasure Coding (RS-3-2)
    - A DataNode reconstructs missing blocks from the remaining live blocks
    Replication (3x)
    - Live blocks are copied to different DataNodes

  9. Erasure Coding and Replication
    Summary
                             Replication (3x)   Erasure Coding (RS-6-3)
    Fault tolerance          2                  3
    Storage efficiency       33%                67%
    Recovery                 Replication        Reconstruction
    Locality optimization    Possible           Impossible
    Write performance        Disk-bound         Network-bound
    Read performance         Disk-bound         Network-bound
    Small file problem       Severe             Very severe
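
    The fault-tolerance and storage-efficiency figures follow directly from
    each scheme's parameters:

```latex
\text{RS}(k{=}6,\, m{=}3):\quad \text{efficiency} = \frac{k}{k+m} = \frac{6}{9} \approx 67\%,
\qquad \text{tolerates } m = 3 \text{ lost blocks}
```
```latex
\text{3x replication}:\quad \text{efficiency} = \frac{1}{3} \approx 33\%,
\qquad \text{tolerates } 3 - 1 = 2 \text{ lost replicas}
```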

  10. Usage
    Archiving
    - Used only for cold-data archives (see the policy sketch below)
    - Why not for hot data?
      - It is often hard to avoid small files for hot data
      - HDFS Erasure Coding is still immature
    - Small files are merged at archiving time to avoid the small file problem
    - 12 PB stored with RS-6-3 => 32 servers with 12 TB * 24 disks saved
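
    For reference, a minimal sketch of opting a directory into an EC policy
    via the Hadoop 3 client API. The /archive path is hypothetical;
    RS-6-3-1024k is the name of the built-in RS-6-3 policy (1 MiB cells).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class EnableEcPolicy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // Files written under this directory from now on are erasure-coded;
            // existing files keep their current layout until rewritten.
            dfs.setErasureCodingPolicy(new Path("/archive"), "RS-6-3-1024k");
        }
    }
}
```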

  11. Troubleshooting of HDFS Erasure Coding

  12. Data Corruption
    Erasure-coded files cannot be read

  13. Key Observation
    Something was wrong with erasure coding reconstruction
    - Only certain storage blocks were corrupted
      - Found by comparing corrupted files bit-wise against their originals
    - The corrupted blocks had been reconstructed in the past
    - A NullPointerException sometimes occurred in StripedBlockReconstructor
      - It made reconstruction fail
    - Corruption rate: 0.005%
      - Measured by running a Spark application that scanned all erasure-coded files

  14. Study
    Dirty buffer pool in StripedBlockReconstructor
    - HDFS-15240 Erasure Coding: dirty buffer causes reconstruction block error
      - It was still an open issue at the time of our troubleshooting
      - We helped with a unit test :)
    - Summary
      - StripedBlockReconstructor uses a buffer pool, defined as a static field,
        for reconstruction
      - The buffer pool can be polluted under a certain failure scenario
      - The next reconstruction then uses the dirty buffer pool and yields
        corrupted blocks

  15. How Reconstruction Works
    Normal scenario
    - A StripedBlockReconstructor reconstructs the missing
      blocks iteratively, one buffer-sized chunk at a time

  16. Failure Scenario
    Step 1
    - When a timeout occurs in a StripedBlockReader,
      a new reader is created that reads from a different
      source DataNode

  17. Failure Scenario
    Step 2
    - The next iteration fails with a NullPointerException
      when the StripedBlockReconstructor collects the results
      from the StripedBlockReaders

  18. Failure Scenario
    Step 3
    - The StripedBlockReconstructor asks the
      StripedBlockReaders to release (put) their buffers
      back to the buffer pool before closing the readers

  19. Failure Scenario
    Step 4
    - These StripedBlockReaders still hold the buffers
      and keep writing to them until they are closed

  20. Failure Scenario
    Step 5
    - The next reconstruction uses the dirty buffer pool
      and yields corrupted blocks (illustrated in the sketch below)
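
    A minimal, self-contained illustration of this bug class (a toy, not the
    actual HDFS code): a buffer is returned to a shared pool while a
    straggling reader thread still writes into it, so the next borrower sees
    dirty data.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class DirtyBufferPoolDemo {
    // Shared pool, analogous to the static buffer pool in StripedBlockReconstructor.
    static final ConcurrentLinkedQueue<byte[]> POOL = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[4];

        // A timed-out "reader" that still holds the buffer and keeps writing.
        Thread straggler = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { return; }
            buf[0] = 42; // writes into the buffer AFTER it was returned to the pool
        });
        straggler.start();

        POOL.offer(buf);      // bug: returned to the pool before the reader is closed
        straggler.join();

        byte[] reused = POOL.poll(); // the next reconstruction borrows the dirty buffer
        System.out.println("reused[0] = " + reused[0] + " (expected 0)");
    }
}
```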

  21. Challenge
    Applying the patch alone is not enough
    - Is it enough just to apply the patch? No.
    - Challenges
      - Detection
        - How can we ensure that no data corruption has occurred?
      - Prevention
        - How can we prevent data corruption in the future?

  22. How to Detect
    Record file checksums
    - Problem
      - HDFS cannot natively detect this kind of data corruption
    - Solution
      - Record file checksums of erasure-coded files daily with Spark
        (a minimal sketch follows below)
    - Contributions
      - HDFS-15709 EC: Socket file descriptor leak in StripedBlockChecksumReconstructor
      - HDFS-15795 EC: Wrong checksum when reconstruction was failed by exception
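
    A minimal sketch of the detection idea, using a plain Java HDFS client in
    place of the Spark job mentioned above; the /archive root is hypothetical.
    Diffing today's output against an earlier run reveals silent corruption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecordChecksums {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // /archive is an illustrative root directory for erasure-coded files.
            for (FileStatus st : fs.listStatus(new Path("/archive"))) {
                if (!st.isFile()) continue;
                FileChecksum sum = fs.getFileChecksum(st.getPath());
                // Persist path + checksum somewhere durable; printed here for the sketch.
                System.out.println(st.getPath() + "\t" + sum);
            }
        }
    }
}
```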

  23. How getFileChecksum Works
    Normal scenario
    - Get the checksums of the data blocks
    - Calculate the file checksum from them (see the composition sketch below)
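
    With the default checksum type, the file checksum is an MD5 over the
    concatenated per-block checksums (each of which is itself an MD5 over that
    block's CRCs). A toy sketch of the composition step, assuming the
    per-block checksums have already been fetched:

```java
import java.security.MessageDigest;

public class ComposeFileChecksum {
    // Toy composition: MD5 over the concatenated per-block checksums.
    static byte[] fileChecksum(byte[][] blockChecksums) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (byte[] blockChecksum : blockChecksums) {
            md5.update(blockChecksum);
        }
        return md5.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[][] blockChecksums = { {1, 2}, {3, 4} }; // toy per-block checksums
        System.out.printf("%032x%n", new java.math.BigInteger(1, fileChecksum(blockChecksums)));
    }
}
```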

  24. File Descriptor Leak
    HDFS-15709
    - If a block is missing, reconstruct it from the live blocks,
      then calculate the file checksum
    - Bug: StripedBlockChecksumReconstructor did not close its
      connections to the DataNodes, leaking socket file descriptors
      (the general fix pattern is sketched below)
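
    The general fix pattern for this class of leak (a sketch, not the actual
    HDFS-15709 patch): close every source reader even when reconstruction
    throws partway through.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

class ReconstructAndClose {
    static void run(List<? extends Closeable> readers, Runnable reconstruct) {
        try {
            reconstruct.run();          // may throw at any point
        } finally {
            for (Closeable reader : readers) {
                try {
                    reader.close();     // always release the socket / file descriptor
                } catch (IOException e) {
                    // log and keep closing the remaining readers
                }
            }
        }
    }
}
```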

  25. Wrong Checksum
    HDFS-15795
    - If a block is missing, reconstruct it from the live blocks,
      then calculate the file checksum
    - Bug: when the StripedBlockChecksumReconstructor failed,
      getFileChecksum silently ignored the missing block and
      returned a wrong checksum

  26. How to Prevent
    Verify the correctness of erasure coding reconstruction
    - Problem
      - A new bug may cause data corruption again in the future
    - Solution
      - Verify the correctness of reconstruction inside HDFS itself
    - Contribution
      - HDFS-15759 EC: Verify EC reconstruction correctness on DataNode
      - Set dfs.datanode.ec.reconstruction.validation=true (see the snippet below)
      - Monitor the EcInvalidReconstructionTasks metric
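
    As a sketch of the configuration (the property name comes from
    HDFS-15759; placing it in each DataNode's hdfs-site.xml is the assumed
    deployment):

```xml
<!-- hdfs-site.xml on each DataNode (HDFS-15759) -->
<property>
  <name>dfs.datanode.ec.reconstruction.validation</name>
  <value>true</value>
</property>
```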

  27. How It Works
    Idea: yet another reconstruction, for validation
    Assume d2 is missing:
    1. The DataNode reconstructs d2'
    2. The DataNode reconstructs d1' using d2'
    3. Compare d1' with d1
    If d1' != d1, the reconstruction is reported as failed
    (a toy version of this idea is sketched below)
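
    A toy version of the validation idea, substituting single-parity XOR for
    Reed-Solomon so that the sketch stays short and self-contained: reconstruct
    the missing cell, then re-derive a cell we do have and compare.

```java
public class ValidateReconstruction {
    // Toy single-parity code: p = d1 ^ d2 ^ d3 (byte-wise).
    static byte[] xor(byte[]... cells) {
        byte[] out = new byte[cells[0].length];
        for (byte[] cell : cells) {
            for (int i = 0; i < out.length; i++) out[i] ^= cell[i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] d1 = {1, 2}, d2 = {3, 4}, d3 = {5, 6};
        byte[] p = xor(d1, d2, d3);

        // Assume d2 is missing.
        byte[] d2prime = xor(p, d1, d3);                      // 1. reconstruct d2'
        byte[] d1prime = xor(p, d2prime, d3);                 // 2. reconstruct d1' using d2'
        boolean valid = java.util.Arrays.equals(d1prime, d1); // 3. compare with d1

        // If d1' != d1, the reconstruction task is reported as failed.
        System.out.println(valid ? "reconstruction validated" : "reconstruction invalid");
    }
}
```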

  28. Future Work
    Streaming block checksums
    - Since the patch was applied, our monitoring system has observed
      no data corruption
    - However,
      - Recording checksums of all erasure-coded files periodically is
        costly and ultimately infeasible
      - Our reconstruction validation does not provide a 100% guarantee
    - To take it one step further,
      - We need to record block checksums in a streaming fashion

  29. Conclusion
    And recommendations
    - HDFS Erasure Coding works well most of the time
    - It may still cause trouble at large scale or under certain conditions
    - Recommendations
      - Keep upgrading to the latest maintenance version
      - Regularly check HDFS JIRA issues with the ec/erasure-coding components
      - Back up some of the original data for troubleshooting
    - We will keep contributing to OSS communities

  30. We are hiring!
    - Data Platform Engineer
    - Site Reliability Engineer
    - Distributed System Administrator
    - And more…
