Technical Challenges of HDFS Erasure Coding on a Large Scale

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Speaker: Toshihiko Uchida, Data Platform Dept., Data Engineering Center
     - Senior Software Engineer
     - Tech lead team member at Data Platform Dept.
     - Interests: distributed systems, formal methods
     - Apache Hadoop/Hive contributor
  2. Agenda
     - Apache HDFS at LINE
     - Introduction to HDFS Erasure Coding
     - Troubleshooting of HDFS Erasure Coding
  3. Apache HDFS: distributed file system
     - Highly scalable distributed filesystem
     - Designed to run on thousands of servers and store petabyte-scale data
     - High fault tolerance
     - High throughput
  4. Scale: unified storage for the data lake
     - Data size: 290 PB
     - # of DataNodes: 2,000
     - Monthly increase: 10 PB
  5. HDFS Erasure Coding: new feature in Hadoop 3
     - Erasure Coding (RS-3-2)
       - A file is divided into blocks, which in turn are divided into cells
       - Parity cells are calculated from data cells
       - Cells are stored in a striped way across DataNodes
     - Replication (3x)
       - A file is divided into blocks
       - Blocks are replicated across DataNodes
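Erasure coding in HDFS is assigned per directory by setting a policy such as RS-3-2 or RS-6-3. A minimal client-side sketch using the public DistributedFileSystem API; the directory path is illustrative, and the policy is assumed to be already enabled on the cluster (e.g. via `hdfs ec -enablePolicy`).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SetEcPolicy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes fs.defaultFS points at the target HDFS cluster.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Illustrative archive directory; any directory works.
        Path archiveDir = new Path("/archive/cold-data");
        dfs.mkdirs(archiveDir);

        // Files written under this directory from now on are striped as
        // 6 data cells + 3 parity cells (built-in policy name in Hadoop 3).
        dfs.setErasureCodingPolicy(archiveDir, "RS-6-3-1024k");

        System.out.println(dfs.getErasureCodingPolicy(archiveDir));
    }
}
```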
  6. Data Recovery: how to recover lost data
     - Erasure Coding (RS-3-2): a DataNode reconstructs missing blocks from the live blocks
     - Replication (3x): live blocks are copied to different DataNodes
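To make the reconstruction idea concrete, here is a minimal sketch of parity-based recovery. It uses a single XOR parity cell as a stand-in for the Reed-Solomon code that RS-3-2 actually uses, so it illustrates the mechanism rather than the real codec.

```java
import java.util.Arrays;

public class XorReconstructionDemo {
    // Parity cell: byte-wise XOR of all input cells (simplified stand-in for Reed-Solomon).
    static byte[] parity(byte[]... cells) {
        byte[] p = new byte[cells[0].length];
        for (byte[] cell : cells) {
            for (int i = 0; i < p.length; i++) {
                p[i] ^= cell[i];
            }
        }
        return p;
    }

    public static void main(String[] args) {
        byte[] d0 = "cell-0 ".getBytes();
        byte[] d1 = "cell-1 ".getBytes();
        byte[] d2 = "cell-2 ".getBytes();
        byte[] p = parity(d0, d1, d2);

        // Suppose the DataNode holding d2 is lost: reconstruct it from the live cells.
        byte[] d2Reconstructed = parity(d0, d1, p);

        System.out.println(Arrays.equals(d2, d2Reconstructed)); // true
    }
}
```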
  7. Erasure Coding and Replication: summary (Replication 3x vs Erasure Coding RS-6-3)
     - Fault tolerance: 2 vs 3
     - Storage efficiency: 33% vs 67%
     - Recovery: replication vs reconstruction
     - Locality optimization: possible vs impossible
     - Write performance: disk-bound vs network-bound
     - Read performance: disk-bound vs network-bound
     - Small file problem: severe vs very severe
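A quick sanity check of the fault-tolerance and storage-efficiency rows (simple arithmetic, not from the talk):

```java
public class EcMath {
    public static void main(String[] args) {
        // Replication factor 3: each byte is stored 3 times, so 2 copies may be lost.
        int replicas = 3;
        System.out.printf("Replication(3x): tolerates %d failures, efficiency %.0f%%%n",
                replicas - 1, 100.0 / replicas);

        // RS-6-3: 6 data cells + 3 parity cells per stripe, so 3 cells may be lost.
        int data = 6, parity = 3;
        System.out.printf("RS-%d-%d: tolerates %d failures, efficiency %.0f%%%n",
                data, parity, parity, 100.0 * data / (data + parity));
    }
}
```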
  8. Usage: archiving
     - Only for cold data archive
     - Why not for hot data?
       - It is often hard to avoid small files for hot data
       - HDFS Erasure Coding is still immature
     - Small files are merged at archiving time to avoid the small file problem (see the sketch below)
     - 12 PB stored with RS-6-3 => 32 servers with 12 TB * 24 disks saved
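One common way to merge small files before archiving is to pack them into a container format such as a SequenceFile keyed by the original path. The talk does not say which mechanism LINE uses, so this is only an illustrative sketch; the paths and class name are placeholders.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path input = new Path("/warehouse/hot/table-x"); // many small files (illustrative)
        Path output = new Path("/archive/table-x.seq");  // one large file in the EC directory

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            RemoteIterator<LocatedFileStatus> it = fs.listFiles(input, true);
            while (it.hasNext()) {
                LocatedFileStatus status = it.next();
                byte[] content = new byte[(int) status.getLen()];
                try (InputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // Key: original path; value: raw bytes of the small file.
                writer.append(new Text(status.getPath().toString()), new BytesWritable(content));
            }
        }
    }
}
```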
  9. Key Observation: something wrong with erasure coding reconstruction
     - Only certain storage blocks were corrupted
       - Compared corrupted files bit-wise against their original copies (see the sketch below)
       - Those corrupted blocks had been reconstructed in the past
     - NullPointerException sometimes happened at StripedBlockReconstructor
       - It made reconstruction fail
     - Corruption rate: 0.005%
       - Ran a Spark application to scan all erasure-coded files
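A minimal sketch of the kind of bit-wise comparison described above: stream a suspect file and its backed-up original from HDFS and report the first differing offset. The paths are placeholders, and this is a plain Java client rather than the Spark job used at LINE.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Illustrative paths: the erasure-coded copy and its original backup.
        Path suspect = new Path("/archive/part-00042");
        Path original = new Path("/backup/part-00042");

        try (InputStream a = new BufferedInputStream(fs.open(suspect));
             InputStream b = new BufferedInputStream(fs.open(original))) {
            long offset = 0;
            while (true) {
                int x = a.read();
                int y = b.read();
                if (x != y) { // also triggers when one file is shorter than the other
                    System.out.println("First difference at byte offset " + offset);
                    return;
                }
                if (x == -1) { // both streams ended together
                    System.out.println("Files are identical");
                    return;
                }
                offset++;
            }
        }
    }
}
```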
  10. Study: dirty buffer pool in StripedBlockReconstructor
     - HDFS-15240 "Erasure Coding: dirty buffer causes reconstruction block error"
       - It was still an open issue at the time of our troubleshooting
       - We helped with a unit test :)
     - Summary
       - StripedBlockReconstructor uses a buffer pool, defined as a static field, for reconstruction
       - The buffer pool gets polluted under a certain scenario
       - The next reconstruction that draws from the dirty buffer pool yields corrupted blocks
  11. Failure Scenario, Step 1
     - When a timeout happens at a StripedBlockReader, a new one is created and reads from a different source DataNode
  12. Failure Scenario, Step 2
     - The next iteration fails due to an NPE when the StripedBlockReconstructor gets results from the StripedBlockReaders
  13. Failure Scenario, Step 3
     - The StripedBlockReconstructor asks the StripedBlockReaders to release (put) their buffers back to the buffer pool, before closing the readers
  14. Failure Scenario, Step 4
     - Those StripedBlockReaders still hold the buffers and keep writing to them until they are closed
  15. Failure Scenario, Step 5
     - The next reconstruction uses the dirty buffer pool and yields corrupted blocks (see the sketch below)
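To illustrate this failure class (not the actual HDFS code path), here is a minimal sketch of how returning a buffer to a shared pool before its writer has finished pollutes the next consumer. It uses Hadoop's ElasticByteBufferPool, which is similar in spirit to the pool the reconstructor relies on; the "late reader" thread and all names are invented for the example.

```java
import java.nio.ByteBuffer;

import org.apache.hadoop.io.ElasticByteBufferPool;

public class DirtyBufferPoolDemo {
    public static void main(String[] args) throws Exception {
        ElasticByteBufferPool pool = new ElasticByteBufferPool();

        // Reconstruction #1: a "reader" obtains a buffer from the shared pool.
        ByteBuffer buf = pool.getBuffer(false, 64);

        // Step 3 analogue: the buffer is returned to the pool before the reader is closed.
        pool.putBuffer(buf);

        // Step 4 analogue: the reader keeps writing into a buffer it no longer owns.
        Thread lateReader = new Thread(() -> {
            buf.clear();
            buf.put("stale data from a timed-out reader".getBytes());
        });
        lateReader.start();
        lateReader.join();

        // Step 5 analogue: the next reconstruction receives the same, now dirty, buffer.
        ByteBuffer next = pool.getBuffer(false, 64);
        next.clear();
        byte[] contents = new byte[next.remaining()];
        next.get(contents);
        System.out.println("Recycled buffer contains: " + new String(contents).trim());
    }
}
```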
  16. Challenge: not enough just to apply the patch
     - Is it enough just to apply the patch? No
     - Challenge
       - How to detect: how can we ensure that no data corruption has occurred?
       - How to prevent: how can we prevent data corruption in the future?
  17. How to Detect: record file checksums
     - Problem: HDFS cannot recognize this kind of data corruption natively
     - Solution: record file checksums of erasure-coded files daily with Spark (see the sketch below)
     - Contributions
       - HDFS-15709 "EC: Socket file descriptor leak in StripedBlockChecksumReconstructor"
       - HDFS-15795 "EC: Wrong checksum when reconstruction was failed by exception"
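A minimal sketch of the detection idea, using a plain Java client rather than the Spark job described in the talk: list the erasure-coded files and record FileSystem.getFileChecksum for each, so that runs from different days can be diffed. The directory and output format are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class RecordChecksums {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Illustrative: the erasure-coded archive directory.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/archive"), true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            // getFileChecksum combines block checksums into a file-level checksum;
            // missing blocks are reconstructed on the fly, which is where
            // HDFS-15709 and HDFS-15795 were found.
            FileChecksum checksum = fs.getFileChecksum(status.getPath());
            System.out.println(status.getPath() + "\t" + checksum);
        }
    }
}
```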
  18. How getFileChecksum Works: normal scenario
     - Get the checksums of the data blocks
     - Calculate a file checksum from them
  19. File Descriptor Leak: HDFS-15709
     - If a block is missing, reconstruct the block from live blocks, then calculate the file checksum
     - Bug: StripedBlockChecksumReconstructor does not close its connections to DataNodes
  20. Wrong Checksum: HDFS-15795
     - If a block is missing, reconstruct the block from live blocks, then calculate the file checksum
     - Bug: when StripedBlockChecksumReconstructor fails, getFileChecksum ignores the missing block, so the returned checksum is wrong
  21. How to Prevent: verify the correctness of erasure coding reconstruction
     - Problem: a new bug may cause data corruption again in the future
     - Solution: verify the correctness of reconstruction inside HDFS
     - Contribution: HDFS-15759 "EC: Verify EC reconstruction correctness on DataNode"
       - Set dfs.datanode.ec.reconstruction.validation=true
       - Monitor the EcInvalidReconstructionTasks metric
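The property named on the slide is a DataNode setting, so it goes into hdfs-site.xml; a sketch of the relevant entry, with the rest of the configuration omitted:

```xml
<!-- hdfs-site.xml on each DataNode -->
<property>
  <name>dfs.datanode.ec.reconstruction.validation</name>
  <value>true</value>
  <!-- When enabled, each reconstruction is re-checked; failed validations
       increment the EcInvalidReconstructionTasks DataNode metric. -->
</property>
```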
  22. How It Works: yet another reconstruction for validation
     - Assume d2 is missing (a simplified sketch follows)
       1. The DataNode reconstructs d2'
       2. The DataNode reconstructs d1' using d2'
       3. Compare d1' with d1; if d1' != d1, make the reconstruction fail
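A simplified sketch of the validation idea, again using a single XOR parity instead of Reed-Solomon: reconstruct the missing cell, then re-derive a known cell from it and compare. A buggy decode, simulated here by flipping a bit, is caught because the re-derived cell no longer matches the live one.

```java
import java.util.Arrays;

public class ValidateReconstructionDemo {
    static byte[] xor(byte[] a, byte[] b, byte[] c) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) (a[i] ^ b[i] ^ c[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] d0 = "cell-0 ".getBytes();
        byte[] d1 = "cell-1 ".getBytes();
        byte[] d2 = "cell-2 ".getBytes();
        byte[] p = xor(d0, d1, d2); // parity cell

        // Step 1: reconstruct the missing cell d2' from the live cells.
        byte[] d2Prime = xor(d0, d1, p);
        d2Prime[0] ^= 1; // simulate a buggy decode (e.g. from a dirty buffer)

        // Step 2: reconstruct d1' using d2' instead of the real d2.
        byte[] d1Prime = xor(d0, d2Prime, p);

        // Step 3: compare d1' with the live d1; a mismatch fails the reconstruction task.
        boolean valid = Arrays.equals(d1, d1Prime);
        System.out.println(valid ? "reconstruction validated" : "invalid reconstruction detected");
    }
}
```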
  23. Future Work: streaming block checksums
     - After the patch was applied, no data corruption has been observed by our monitoring system
     - However,
       - Recording checksums of all erasure-coded files periodically is costly and infeasible
       - Our reconstruction validation does not provide a 100% guarantee
     - To take it one step further, we need to record block checksums in a streaming fashion
  24. Conclusion and Recommendations
     - HDFS Erasure Coding works well most of the time
     - It may still cause trouble on a large scale or under certain conditions
     - Recommendations
       - Keep upgrading to the latest maintenance version
       - Regularly check HDFS JIRA issues with the ec/erasure-coding components
       - Back up some of the original data for troubleshooting
     - We will keep contributing to OSS communities
  25. We Are Hiring!
     - Data Platform Engineer
     - Site Reliability Engineer
     - Distributed System Administrator
     - And more…