Technical Challenges of HDFS Erasure Coding on a Large Scale

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. (Title slide)
  2. Speaker Data Platform Dept., Data Engineering Center - Toshihiko Uchida

    - Senior Software Engineer - Tech lead team member at Data Platform Dept. - Interests - Distributed systems - Formal methods - Apache Hadoop/Hive contributor
  3. Agenda - Apache HDFS at LINE - Introduction to HDFS

    Erasure Coding - Troubleshooting of HDFS Erasure Coding
  4. Apache HDFS at LINE

  5. Apache HDFS Distributed file system - Highly scalable distributed filesystem

    - Designed to run on thousands of servers and store petabyte-scale data - High fault tolerance - High throughput
  6. Apache HDFS Component - Master - NameNode - Slave -

    DataNode
  7. Scale Unified storage for data lake

    - Data size: 290 PB - # of DataNodes: 2,000 - Monthly increase: 10 PB
  8. HDFS Erasure Coding New feature in Hadoop 3 Erasure Coding

    (RS-3-2) - A file is divided into blocks, which in turn are divided into cells - Parity cells are calculated from data cells - Cells are stored in a striped way across DataNodes Replication (3x) - A file is divided into blocks - Blocks are replicated across DataNodes
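
To make the striped layout concrete, here is a minimal, self-contained sketch. It is not Hadoop code: it uses a tiny cell size and a single XOR parity cell so the example stays short, whereas real RS-3-2 computes two parity cells per stripe with Reed-Solomon coding, and the built-in HDFS policies use a 1 MiB cell size (e.g. RS-3-2-1024k).

```java
import java.util.Arrays;

/**
 * Simplified illustration of striping: a file is cut into fixed-size cells,
 * every 3 data cells form a stripe, and parity is computed per stripe.
 * (A single XOR parity cell stands in for the two Reed-Solomon parity cells
 * of RS-3-2, only to keep the sketch short.)
 */
public class StripingSketch {
    static final int CELL_SIZE = 4;   // tiny cell size for demonstration
    static final int DATA_UNITS = 3;  // the "3" in RS-3-2

    public static void main(String[] args) {
        byte[] file = "hello erasure coding".getBytes();

        // Pad the file so it divides evenly into full stripes.
        int stripeBytes = CELL_SIZE * DATA_UNITS;
        byte[] padded = Arrays.copyOf(file,
            ((file.length + stripeBytes - 1) / stripeBytes) * stripeBytes);

        for (int stripe = 0; stripe * stripeBytes < padded.length; stripe++) {
            byte[][] cells = new byte[DATA_UNITS][CELL_SIZE];
            byte[] parity = new byte[CELL_SIZE];
            for (int c = 0; c < DATA_UNITS; c++) {
                System.arraycopy(padded, stripe * stripeBytes + c * CELL_SIZE, cells[c], 0, CELL_SIZE);
                for (int i = 0; i < CELL_SIZE; i++) {
                    parity[i] ^= cells[c][i];  // parity cell is stored on its own DataNode
                }
            }
            // Each cell of the stripe (3 data + parity) goes to a different DataNode.
            System.out.printf("stripe %d: data=%s parity=%s%n",
                stripe, Arrays.deepToString(cells), Arrays.toString(parity));
        }
    }
}
```
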
  9. Data Recovery How to recover lost data Erasure Coding (RS-3-2)

    - A DataNode reconstructs missing blocks from live blocks Replication (3x) - Copy live blocks to different DataNodes
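
Continuing the simplified XOR model from the previous sketch, reconstructing a lost cell is just recombining the surviving cells of the same stripe; real HDFS performs the analogous computation with Reed-Solomon decoding over the live blocks of the block group.

```java
/** Continues the XOR sketch above: recover a lost data cell from the survivors. */
public class RecoverySketch {
    public static void main(String[] args) {
        byte[] d0 = {1, 2, 3, 4}, d1 = {5, 6, 7, 8}, d2 = {9, 10, 11, 12};
        byte[] parity = xor(xor(d0, d1), d2);  // parity written at encode time

        // Suppose the DataNode holding d1 is lost: rebuild it from the live cells.
        byte[] recovered = xor(xor(d0, d2), parity);
        System.out.println(java.util.Arrays.equals(recovered, d1));  // true
    }

    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] ^ b[i]);
        return out;
    }
}
```
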
  10. Erasure Coding and Replication Summary: Replication (3x) vs. Erasure Coding (RS-6-3)

    - Fault tolerance: 2 vs. 3 - Storage efficiency: 33% vs. 67% - Recovery: replication vs. reconstruction - Locality optimization: possible vs. impossible - Write performance: disk-bound vs. network-bound - Read performance: disk-bound vs. network-bound - Small file problem: severe vs. very severe
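
For the numbers in the summary: with 3x replication every block is stored three times, so only 1/3 (about 33%) of raw capacity holds unique data and up to 2 of the 3 replicas can be lost; with RS-6-3 each stripe stores 6 data cells plus 3 parity cells, so 6/9 (about 67%) of raw capacity is unique data and any 3 of the 9 cells can be lost.
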
  11. Usage Archiving - Only for cold data archive - Why

    not for hot data? - It is often hard to avoid small files for hot data - HDFS Erasure Coding is still immature - Merge small files when archiving to avoid the small file problem - 12 PB stored with RS-6-3 => saved 32 servers with 24 x 12 TB disks each
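
For reference, switching an archive directory to RS-6-3 is a single call once the policy has been enabled on the cluster (hdfs ec -enablePolicy -policy RS-6-3-1024k). Below is a minimal sketch, assuming a hypothetical /archive directory and an HDFS default filesystem:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/** Apply the built-in RS-6-3 policy to an archive directory (path is illustrative). */
public class SetArchivePolicy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Assumes fs.defaultFS points at HDFS.
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            Path archive = new Path("/archive");  // hypothetical cold-data directory
            dfs.mkdirs(archive);
            // Files written under /archive from now on are erasure coded with RS-6-3.
            dfs.setErasureCodingPolicy(archive, "RS-6-3-1024k");
            System.out.println(dfs.getErasureCodingPolicy(archive));
        }
    }
}
```
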
  12. Troubleshooting of HDFS Erasure Coding

  13. Data Corruption Erasure-coded files cannot be read

  14. Key Observation Something wrong with erasure coding reconstruction - Only

    certain storage blocks were corrupted - Compared corrupted files bit by bit with their originals - Those corrupted blocks had been reconstructed in the past - NullPointerException sometimes happened at StripedBlockReconstructor - It made reconstruction fail - Corruption rate: 0.005% - Ran a Spark application to scan all erasure-coded files
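
The scan itself was a Spark application; the sketch below shows the same idea in a plain single-process form, walking a directory tree and reading every erasure-coded file end to end so that unreadable blocks surface as exceptions. The root path comes from the command line.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

/** Walk a directory tree and try to read every erasure-coded file end to end. */
public class ScanErasureCodedFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(args[0]), true);
            byte[] buf = new byte[1 << 20];
            while (it.hasNext()) {
                LocatedFileStatus status = it.next();
                if (!status.isErasureCoded()) {
                    continue;  // only erasure-coded files are of interest here
                }
                try (InputStream in = fs.open(status.getPath())) {
                    while (in.read(buf) != -1) {
                        // reading the whole file surfaces blocks that cannot be served
                    }
                } catch (IOException e) {
                    System.err.println("possibly corrupted: " + status.getPath() + " - " + e);
                }
            }
        }
    }
}
```
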
  15. Study Dirty buffer pool on StripedBlockReconstructor - HDFS-15240 Erasure Coding:

    dirty buffer causes reconstruction block error - It was an open issue at the time of our troubleshooting - We helped with a unit test :) - Summary - StripedBlockReconstructor uses a buffer pool, defined as a static field, for reconstruction - The buffer pool gets polluted under a certain failure scenario - The next reconstruction with the dirty buffer pool yields corrupted blocks
  16. How Reconstruction works Normal scenario - A StripedBlockReconstructor reconstructs missing

    blocks iteratively, one buffer-sized chunk at a time
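
A conceptual sketch of that loop is shown below. The interfaces are hypothetical stand-ins rather than Hadoop classes; the point is only the shape of the iteration: read the same buffer-sized range from every live block, decode the missing cells, write the reconstructed range to the target DataNode, repeat.

```java
import java.io.IOException;

/**
 * Conceptual sketch of the reconstruction loop: a missing block is rebuilt one
 * buffer-sized chunk at a time from the live blocks of the same block group.
 * The interfaces below are illustrative stand-ins, not Hadoop classes.
 */
public class ReconstructionLoopSketch {

    interface ChunkReader { void readFully(byte[] buf, int off, int len) throws IOException; }
    interface ChunkWriter { void write(byte[] buf, int off, int len) throws IOException; }
    interface Decoder     { void decode(byte[][] inputs, byte[] output, int len); }

    static void reconstruct(ChunkReader[] liveSources, ChunkWriter target,
                            Decoder decoder, long blockLength, int bufferSize) throws IOException {
        byte[][] inputs = new byte[liveSources.length][bufferSize];
        byte[] output = new byte[bufferSize];
        long remaining = blockLength;

        while (remaining > 0) {
            int len = (int) Math.min(bufferSize, remaining);
            for (int i = 0; i < liveSources.length; i++) {
                liveSources[i].readFully(inputs[i], 0, len);  // same byte range from every live block
            }
            decoder.decode(inputs, output, len);              // recompute the missing cells for this range
            target.write(output, 0, len);                     // ship the reconstructed chunk to the target DataNode
            remaining -= len;
        }
    }
}
```
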
  17. Failure Scenario Step 1 - When a timeout happens at a

    StripedBlockReader, a new one is created and reads from a different source DataNode
  18. Failure Scenario Step 2 - The next iteration fails due

    to an NPE when the StripedBlockReconstructor gets results from the StripedBlockReaders
  19. Failure Scenario Step 3 - The StripedBlockReconstructor asks the

    StripedBlockReaders to release (put) their buffers back into the buffer pool before closing the readers
  20. Failure Scenario Step 4 - These StripedBlockReaders still hold the

    buffers and write to them until they are closed
  21. Failure Scenario Step 5 - The next reconstruction uses the

    dirty buffer pool and yields corrupted blocks
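
The five steps boil down to a shared-buffer race. The self-contained sketch below (not the actual Hadoop code) reproduces the shape of the bug: a buffer is returned to a static pool while a timed-out reader still holds a reference to it, so the next task that borrows the buffer sees its staged data overwritten.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;

/**
 * Illustration of the hazard behind HDFS-15240 (not the actual Hadoop code):
 * a buffer is returned to a shared pool while a timed-out reader can still
 * write into it, so the next task that borrows the buffer works on polluted data.
 */
public class DirtyBufferPoolSketch {
    // Shared pool, analogous to the static buffer pool used by reconstruction tasks.
    static final ConcurrentLinkedQueue<ByteBuffer> POOL = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws Exception {
        ByteBuffer buf = ByteBuffer.allocate(4);
        CountDownLatch nextTaskFilledBuffer = new CountDownLatch(1);

        // A timed-out reader from the previous task that still holds the buffer.
        Thread straggler = new Thread(() -> {
            try {
                nextTaskFilledBuffer.await();  // finishes late, after the buffer was handed on
                buf.put(0, (byte) 9);          // pollutes data the next task already staged
            } catch (InterruptedException ignored) {
            }
        });
        straggler.start();

        POOL.offer(buf);  // bug: returned to the pool before the straggler is closed

        // The next reconstruction task borrows the buffer and stages its input data.
        ByteBuffer reused = POOL.poll();
        reused.clear();
        reused.put(new byte[] {1, 2, 3, 4});
        nextTaskFilledBuffer.countDown();
        straggler.join();

        // Decoding now runs over polluted input and produces a corrupted block.
        System.out.println(Arrays.toString(reused.array()));  // [9, 2, 3, 4] instead of [1, 2, 3, 4]
    }
}
```
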
  22. Challenge Not enough just to apply the patch - Is

    it enough just to apply the patch? - No - Challenge - How to detect - How can we ensure that no data corruption occurs? - How to prevent - How can we prevent data corruption in the future?
  23. How to Detect Record file checksums - Problem - HDFS

    cannot recognize this kind of data corruption natively - Solution - Record file checksums of erasure-coded files daily with Spark - Contribution - HDFS-15709 EC: Socket file descriptor leak in StripedBlockChecksumReconstructor - HDFS-15795 EC: Wrong checksum when reconstruction was failed by exception
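
In the talk this is done daily with Spark; the minimal single-process sketch below shows the idea. It records one line per erasure-coded file with the checksum returned by getFileChecksum, so comparing the output of two runs reveals files whose contents changed without being rewritten. The output format is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.util.StringUtils;

/** Record a checksum per erasure-coded file; comparing two runs reveals silent corruption. */
public class RecordFileChecksums {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(args[0]), true);
            while (it.hasNext()) {
                LocatedFileStatus status = it.next();
                if (!status.isErasureCoded()) {
                    continue;
                }
                FileChecksum checksum = fs.getFileChecksum(status.getPath());
                // One line per file: path, algorithm, hex digest.
                System.out.println(status.getPath() + "\t" + checksum.getAlgorithmName() + "\t"
                    + StringUtils.byteToHexString(checksum.getBytes()));
            }
        }
    }
}
```
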
  24. How getFileChecksum works Normal scenario - Get checksums of data

    blocks - Calculate a file checksum from them
  25. File Descriptor Leak HDFS-15709 - If a block is missing,

    reconstruct the block from live blocks - Calculate a file checksum - Bug: StripedBlockChecksumReconstructor does not close the connections to DataNodes
  26. Wrong Checksum HDFS-15795 - If a block is missing, reconstruct

    the block from live blocks - Calculate a file checksum - Bug: when StripedBlockChecksumReconstructor fails, getFileChecksum ignores the missing block
  27. How to Prevent Verify correctness of erasure coding reconstruction -

    Problem - A new bug may cause data corruption again in the future - Solution - Verify correctness of reconstruction inside HDFS - Contribution - HDFS-15759 EC: Verify EC reconstruction correctness on DataNode - Set dfs.datanode.ec.reconstruction.validation=true - Monitor EcInvalidReconstructionTasks metric
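
Since dfs.datanode.ec.reconstruction.validation is a DataNode-side setting, it is configured in hdfs-site.xml on the DataNodes; a minimal snippet, assuming a release that contains HDFS-15759:

```xml
<!-- hdfs-site.xml on each DataNode -->
<property>
  <name>dfs.datanode.ec.reconstruction.validation</name>
  <value>true</value>
</property>
```
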
  28. How It Works Idea: Yet another reconstruction for validation Assume

    d2 is missing 1. DN reconstructs d2’ 2. DN reconstructs d1’ with d2’ 3. Compare d1’ with d1 If d1’ != d1, make the reconstruction fail
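
In the simplified XOR model used in the earlier sketches, the check looks like the code below; real DataNodes do the same thing with Reed-Solomon decoding. The slide's d1/d2 naming is kept.

```java
/** The validation idea from the slide, in the simplified XOR model used earlier. */
public class ValidationSketch {
    public static void main(String[] args) {
        byte[] d1 = {1, 2}, d2 = {3, 4}, d3 = {5, 6};
        byte[] p  = xor(xor(d1, d2), d3);      // parity written at encode time

        // d2 is missing: the DataNode reconstructs d2' from the live units.
        byte[] d2prime = xor(xor(d1, d3), p);

        // Validation: reconstruct d1' using d2' and compare it with the real d1.
        // If d2' had been corrupted (e.g. by a dirty buffer), d1' would differ from d1.
        byte[] d1prime = xor(xor(d2prime, d3), p);
        boolean valid = java.util.Arrays.equals(d1prime, d1);
        System.out.println(valid ? "d1' == d1: accept reconstruction" : "d1' != d1: fail reconstruction");
    }

    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] ^ b[i]);
        return out;
    }
}
```
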
  29. Future Work Streaming block checksums - After the patch was

    applied, no data corruption has been observed by our monitoring system - However, - Recording checksums of all erasure-coded files periodically is costly and infeasible - Our reconstruction validation does not provide a 100% guarantee - To take one step further, - We need to record block checksums in a streaming manner, as blocks are written
  30. Conclusion And recommendation - HDFS Erasure Coding works well most

    of the time - It may still cause trouble at large scale or under certain conditions - Recommendation - Keep upgrading to the latest maintenance version - Regularly check HDFS JIRA issues with the ec/erasure-coding components - Back up some of the original data for troubleshooting - We will keep contributing to OSS communities
  31. We are hiring! - Data Platform Engineer - Site Reliability

    Engineer - Distributed System Administrator - And more…
  32. Thank you