
Technical Challenges of HDFS Erasure Coding on a Large Scale



November 10, 2021



  1. (Title slide)
  2. Speaker: Toshihiko Uchida, Data Platform Dept., Data Engineering Center
     - Senior Software Engineer
     - Tech lead team member at Data Platform Dept.
     - Interests: distributed systems, formal methods
     - Apache Hadoop/Hive contributor
  3. Agenda
     - Apache HDFS at LINE
     - Introduction to HDFS Erasure Coding
     - Troubleshooting of HDFS Erasure Coding
  4. Apache HDFS at LINE

  5. Apache HDFS: Distributed file system
     - Highly scalable distributed filesystem
     - Designed to run on thousands of servers and store petabyte-scale data
     - High fault tolerance
     - High throughput
  6. Apache HDFS Components
     - Master: NameNode
     - Slave: DataNode
  7. Scale: Unified storage for the data lake
     - Data size: 290 PB
     - # of DataNodes: 2,000
     - Monthly increase: 10 PB
  8. HDFS Erasure Coding: A new feature in Hadoop 3
     Erasure Coding (RS-3-2)
     - A file is divided into blocks, which in turn are divided into cells
     - Parity cells are calculated from data cells
     - Cells are stored in a striped way across DataNodes (see the sketch below)
     Replication (3x)
     - A file is divided into blocks
     - Blocks are replicated across DataNodes
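To make the striping concrete, here is a minimal sketch (my illustration, not Hadoop's internal code) of how an RS-3-2 layout maps a logical file offset onto the data blocks of a block group; the 1 MiB cell size is HDFS's default for the built-in policies:

```java
// A minimal sketch of round-robin cell striping under RS-3-2.
public class StripedLayoutSketch {
    static final int DATA_UNITS = 3;        // k: data blocks per block group (RS-3-2)
    static final int PARITY_UNITS = 2;      // m: parity blocks per block group
    static final long CELL_SIZE = 1L << 20; // 1 MiB striping cell (HDFS default)

    /** Index of the internal data block that stores the given logical offset. */
    static int dataBlockIndex(long logicalOffset) {
        long cellIndex = logicalOffset / CELL_SIZE;
        return (int) (cellIndex % DATA_UNITS); // cells rotate across the k data blocks
    }

    /** Offset inside that internal block. */
    static long offsetInBlock(long logicalOffset) {
        long cellIndex = logicalOffset / CELL_SIZE;
        long stripeIndex = cellIndex / DATA_UNITS; // full stripes written so far
        return stripeIndex * CELL_SIZE + logicalOffset % CELL_SIZE;
    }

    public static void main(String[] args) {
        // The first 3 MiB of the file lands on data blocks 0, 1, 2 in turn;
        // for each full stripe, 2 parity cells are computed and stored on
        // 2 more DataNodes, so every stripe spans 5 DataNodes in total.
        for (long off = 0; off < 6 * CELL_SIZE; off += CELL_SIZE) {
            System.out.printf("offset %d MiB -> data block %d, block offset %d MiB%n",
                off >> 20, dataBlockIndex(off), offsetInBlock(off) >> 20);
        }
    }
}
```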
  9. Data Recovery: How to recover lost data
     Erasure Coding (RS-3-2)
     - A DataNode reconstructs missing blocks from live blocks
     Replication (3x)
     - Live blocks are copied to different DataNodes
  10. Erasure Coding and Replication: Summary

                             Replication (3x)   Erasure Coding (RS-6-3)
      Fault tolerance        2 lost blocks      3 lost blocks
      Storage efficiency     33%                67%
      Recovery               Replication        Reconstruction
      Locality optimization  Possible           Impossible
      Write performance      Disk-bound         Network-bound
      Read performance       Disk-bound         Network-bound
      Small file problem     Severe             Very severe
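For reference, the storage-efficiency row follows directly from the ratio of data blocks to total stored blocks (a quick derivation, not from the slides):

```latex
\[
\text{storage efficiency} = \frac{k}{k+m}, \qquad
\text{3x replication: } \frac{1}{1+2} \approx 33\%, \qquad
\text{RS-6-3: } \frac{6}{6+3} \approx 67\%
\]
```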
  11. Usage: Archiving
      - Only for cold data archive
      - Why not for hot data?
        - It is often hard to avoid small files for hot data
        - HDFS Erasure Coding is still immature
      - Small files are merged at archiving time to avoid the small file problem
      - 12 PB stored with RS-6-3 => 32 servers with 12 TB * 24 disks saved (see the arithmetic below)
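As a rough check on the savings figure (my arithmetic, not from the slides), the quoted hardware corresponds to

```latex
\[
32 \text{ servers} \times 24 \text{ disks} \times 12\,\text{TB} = 9216\,\text{TB} \approx 9.2\,\text{PB of raw capacity.}
\]
```

For scale, storing 12 PB of logical data under RS-6-3 occupies about 12 x 1.5 = 18 PB of raw space, versus 12 x 3 = 36 PB under 3x replication.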
  12. Troubleshooting of HDFS Erasure Coding

  13. Data Corruption: Erasure-coded files cannot be read

  14. Key Observation: Something was wrong with erasure coding reconstruction
      - Only certain storage blocks were corrupted
        - Compared corrupted files bit-wise against their original copies
      - The corrupted blocks had been reconstructed in the past
      - NullPointerException sometimes occurred in StripedBlockReconstructor
        - It made reconstruction fail
      - Corruption rate: 0.005%
        - Ran a Spark application to scan all erasure-coded files
  15. Study: Dirty buffer pool in StripedBlockReconstructor
      - HDFS-15240 "Erasure Coding: dirty buffer causes reconstruction block error"
        - It was still an open issue at the time of our troubleshooting
        - We helped with a unit test :)
      - Summary
        - StripedBlockReconstructor uses a buffer pool, defined as a static field, for reconstruction
        - The buffer pool gets polluted under a certain scenario (steps below)
        - The next reconstruction with the dirty buffer pool yields corrupted blocks
  16. How Reconstruction Works: Normal scenario
      - A StripedBlockReconstructor reconstructs missing blocks iteratively, one buffer-sized chunk at a time (see the sketch below)
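A minimal sketch (not Hadoop's actual code) of that iterative loop: read one buffer-sized chunk from each live source, decode the missing chunk, write it to the target, repeat until the block is rebuilt. The interfaces and the 64 KiB chunk size are illustrative assumptions:

```java
import java.nio.ByteBuffer;

interface BlockReader { void readChunk(ByteBuffer dst) throws Exception; }
interface BlockWriter { void writeChunk(ByteBuffer src) throws Exception; }
interface RawDecoder  { void decode(ByteBuffer[] inputs, ByteBuffer output); }

public class ReconstructorSketch {
    static final int BUFFER_SIZE = 64 * 1024; // hypothetical chunk size

    static void reconstruct(BlockReader[] sources, RawDecoder decoder,
                            BlockWriter target, long blockLength) throws Exception {
        ByteBuffer[] inputs = new ByteBuffer[sources.length];
        for (int i = 0; i < sources.length; i++) {
            inputs[i] = ByteBuffer.allocate(BUFFER_SIZE);
        }
        ByteBuffer output = ByteBuffer.allocate(BUFFER_SIZE);

        // Last-chunk handling omitted for brevity.
        for (long done = 0; done < blockLength; done += BUFFER_SIZE) {
            for (int i = 0; i < sources.length; i++) {
                inputs[i].clear();
                sources[i].readChunk(inputs[i]); // one chunk from each live block
                inputs[i].flip();
            }
            output.clear();
            decoder.decode(inputs, output);      // recompute the missing chunk
            output.flip();
            target.writeChunk(output);           // send it to the target DataNode
        }
    }
}
```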
  17. Failure Scenario, Step 1
      - When a timeout happens at a StripedBlockReader, a new reader is created and reads from a different source DataNode
  18. Failure Scenario, Step 2
      - The next iteration fails with a NullPointerException when the StripedBlockReconstructor collects results from its StripedBlockReaders
  19. Failure Scenario, Step 3
      - The StripedBlockReconstructor asks the StripedBlockReaders to release (put) their buffers back to the buffer pool before closing the readers
  20. Failure Scenario, Step 4
      - The StripedBlockReaders still hold the buffers and keep writing to them until they are closed
  21. Failure Scenario, Step 5
      - The next reconstruction takes buffers from the now-dirty pool and yields corrupted blocks (the whole sequence is condensed in the sketch below)
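A self-contained sketch of the race, simplified from HDFS-15240 (this is not Hadoop's code; the pool and reader are stand-ins): a buffer is returned to a shared pool while a straggling reader still writes into it, so the next "reconstruction" silently sees polluted data.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedDeque;

class BufferPool {
    // Shared pool, like the static field in StripedBlockReconstructor.
    static final ConcurrentLinkedDeque<ByteBuffer> POOL = new ConcurrentLinkedDeque<>();

    static ByteBuffer get(int size) {
        ByteBuffer b = POOL.pollFirst();
        return b != null ? b : ByteBuffer.allocate(size);
    }

    static void put(ByteBuffer b) { POOL.addFirst(b); }
}

public class DirtyBufferDemo {
    public static void main(String[] args) throws Exception {
        ByteBuffer buf = BufferPool.get(4);

        // A straggling "reader" that keeps writing into its buffer after the
        // reconstructor has already returned that buffer to the pool.
        Thread straggler = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            buf.clear();
            buf.put(new byte[] {9, 9, 9, 9}); // late write pollutes the pooled buffer
        });
        straggler.start();

        BufferPool.put(buf); // step 3: buffer released while the reader still holds it
        straggler.join();    // step 4: the reader writes after the release

        // Step 5: the next reconstruction picks up the same buffer and trusts
        // its contents; here it sees the straggler's bytes.
        ByteBuffer reused = BufferPool.get(4);
        reused.flip();
        System.out.println("first byte of 'fresh' buffer: " + reused.get(0)); // prints 9
    }
}
```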
  22. Challenge: Applying the patch is not enough
      - Is it enough just to apply the patch? No
      - Challenges
        - Detection: How can we ensure that no data corruption has occurred?
        - Prevention: How can we prevent data corruption in the future?
  23. How to Detect: Record file checksums
      - Problem: HDFS cannot natively recognize this kind of data corruption
      - Solution: Record checksums of erasure-coded files daily with Spark (sketched below)
      - Contributions
        - HDFS-15709 "EC: Socket file descriptor leak in StripedBlockChecksumReconstructor"
        - HDFS-15795 "EC: Wrong checksum when reconstruction was failed by exception"
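A minimal single-process sketch of the daily checksum job (the talk used Spark; the path and scan strategy here are illustrative assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class RecordEcChecksums {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path ecRoot = new Path("/archive"); // hypothetical EC-encoded directory

        RemoteIterator<LocatedFileStatus> it = fs.listFiles(ecRoot, true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            // getFileChecksum triggers checksum computation on the DataNodes;
            // persist the result and diff it against yesterday's run to catch
            // silent corruption that HDFS itself does not notice.
            FileChecksum checksum = fs.getFileChecksum(status.getPath());
            System.out.println(status.getPath() + "\t" + checksum);
        }
    }
}
```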
  24. How getFileChecksum Works: Normal scenario
      - Get the checksums of the data blocks
      - Calculate a file checksum from them
  25. File Descriptor Leak (HDFS-15709)
      - If a block is missing, reconstruct the block from live blocks, then calculate the file checksum
      - Bug: StripedBlockChecksumReconstructor does not close the connections to DataNodes
  26. Wrong Checksum (HDFS-15795)
      - If a block is missing, reconstruct the block from live blocks, then calculate the file checksum
      - Bug: when StripedBlockChecksumReconstructor fails, getFileChecksum ignores the missing block and returns a checksum that no longer covers the whole file (illustrated below)
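A schematic illustration of the bug (not Hadoop's actual code; the checksum composition is simplified to CRC32): swallowing the per-block failure yields a plausible-looking but wrong file checksum, while the fixed variant propagates the failure.

```java
import java.util.zip.CRC32;

public class WrongChecksumSketch {
    interface BlockChecksumSource { long blockChecksum(int blockIndex) throws Exception; }

    // Buggy variant: swallows the failure and keeps going.
    static long fileChecksumBuggy(BlockChecksumSource src, int numBlocks) {
        CRC32 composite = new CRC32();
        for (int i = 0; i < numBlocks; i++) {
            try {
                composite.update(Long.toString(src.blockChecksum(i)).getBytes());
            } catch (Exception ignored) {
                // BUG: the missing block is skipped; the result looks valid
                // but no longer covers the whole file.
            }
        }
        return composite.getValue();
    }

    // Fixed variant: a failed block checksum fails the whole file checksum.
    static long fileChecksumFixed(BlockChecksumSource src, int numBlocks) throws Exception {
        CRC32 composite = new CRC32();
        for (int i = 0; i < numBlocks; i++) {
            composite.update(Long.toString(src.blockChecksum(i)).getBytes()); // propagate failure
        }
        return composite.getValue();
    }
}
```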
  27. How to Prevent: Verify the correctness of erasure coding reconstruction
      - Problem: A new bug may cause data corruption again in the future
      - Solution: Verify the correctness of reconstruction inside HDFS
      - Contribution: HDFS-15759 "EC: Verify EC reconstruction correctness on DataNode"
        - Set dfs.datanode.ec.reconstruction.validation=true (config snippet below)
        - Monitor the EcInvalidReconstructionTasks metric
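To enable the validation on DataNodes, the property named on the slide goes into hdfs-site.xml (a minimal snippet; the property name comes from HDFS-15759):

```xml
<!-- hdfs-site.xml on each DataNode: validate every EC reconstruction task -->
<property>
  <name>dfs.datanode.ec.reconstruction.validation</name>
  <value>true</value>
</property>
```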
  28. How It Works: Yet another reconstruction for validation
      Assume d2 is missing (a toy version follows below):
      1. The DataNode reconstructs d2'
      2. The DataNode reconstructs d1' with d2'
      3. Compare d1' with d1; if d1' != d1, make the reconstruction fail
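A toy version of the validation idea, using single-parity XOR in place of Reed-Solomon so that encode and decode collapse to one operation (my simplification, not the HDFS-15759 implementation):

```java
public class ValidateReconstructionSketch {
    static byte xorDecode(byte a, byte b) { return (byte) (a ^ b); }

    public static void main(String[] args) {
        byte d1 = 0x12, d2 = 0x34;  // data units
        byte p  = (byte) (d1 ^ d2); // parity unit

        // Step 1: reconstruct the missing d2 from d1 and parity.
        byte d2Prime = xorDecode(d1, p);

        // Step 2: reconstruct d1' using the freshly decoded d2' and parity.
        byte d1Prime = xorDecode(d2Prime, p);

        // Step 3: d1 is still live, so the round trip can be checked;
        // a mismatch means the first reconstruction produced bad data.
        if (d1Prime != d1) {
            throw new IllegalStateException("invalid reconstruction detected");
        }
        System.out.println("reconstruction validated: d2' = " + d2Prime);
    }
}
```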
  29. Future Work: Streaming block checksums
      - Since the patch was applied, our monitoring system has observed no data corruption
      - However
        - Periodically recording checksums of all erasure-coded files is costly and infeasible
        - Our reconstruction validation does not provide a 100% guarantee
      - To take it one step further, we need to record block checksums in a streaming fashion
  30. Conclusion and Recommendations
      - HDFS Erasure Coding works well most of the time
      - It may still cause trouble at a large scale or under certain conditions
      - Recommendations
        - Keep upgrading to the latest maintenance version
        - Regularly check HDFS JIRA issues with the ec/erasure-coding components
        - Back up some of the original data for troubleshooting
      - We will keep contributing to OSS communities
  31. We are hiring!
      - Data Platform Engineer
      - Site Reliability Engineer
      - Distributed System Administrator
      - And more…
  32. Thank you