
Technical Challenges of HDFS Erasure Coding on a Large Scale


LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Speaker
    Data Platform Dept., Data Engineering Center
    - Toshihiko Uchida
      - Senior Software Engineer
      - Member of the tech lead team at the Data Platform Dept.
      - Interests
        - Distributed systems
        - Formal methods
      - Apache Hadoop/Hive contributor

  2. Agenda
    - Apache HDFS at LINE
    - Introduction to HDFS Erasure Coding
    - Troubleshooting of HDFS Erasure Coding

  3. Apache HDFS at LINE

  4. Apache HDFS
    Distributed file system
    - Highly scalable distributed file system
    - Designed to run on thousands of servers and store petabyte-scale data
    - High fault tolerance
    - High throughput

  5. Apache HDFS
    Components
    - Master: NameNode
    - Slave: DataNode

  6. Scale
    Unified storage for the data lake
    - Data size: 290 PB
    - # of DataNodes: 2,000
    - Monthly increase: 10 PB

  7. HDFS Erasure Coding
    New feature in Hadoop 3
    Erasure Coding (RS-3-2)
    - A file is divided into blocks, which in turn are divided into cells
    - Parity cells are calculated from the data cells
    - Cells are striped across DataNodes (see the worked example below)
    Replication (3x)
    - A file is divided into blocks
    - Blocks are replicated across DataNodes

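    As a worked example of the RS-3-2 layout above (the 1 MiB cell size is
    assumed here; it is the default of HDFS's built-in policies): each stripe
    packs three data cells plus two parity cells, spread over five DataNodes.

```latex
% RS(k, m) with k = 3 data cells and m = 2 parity cells per stripe
\text{stripe} = (d_1, d_2, d_3 \mid p_1, p_2)
% Raw bytes stored per logical byte:
\frac{k+m}{k} = \frac{3+2}{3} \approx 1.67 \quad \text{vs. } 3.0 \text{ for 3x replication}
```
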
  8. Data Recovery
    How to recover lost data
    Erasure Coding (RS-3-2)
    - A DataNode reconstructs missing blocks from the remaining live blocks
    Replication (3x)
    - Live blocks are copied to different DataNodes

  9. Erasure Coding and Replication
    Summary
                             Replication (3x)   Erasure Coding (RS-6-3)
    Fault tolerance          2                  3
    Storage efficiency       33%                67%
    Recovery                 Replication        Reconstruction
    Locality optimization    Possible           Impossible
    Write performance        Disk-bound         Network-bound
    Read performance         Disk-bound         Network-bound
    Small file problem       Severe             Very severe
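
    The fault-tolerance and storage-efficiency figures follow directly from
    each scheme's parameters:

```latex
\text{RS}(k{=}6,\, m{=}3):\quad \text{efficiency} = \frac{k}{k+m} = \frac{6}{9} \approx 67\%,
\qquad \text{tolerates } m = 3 \text{ lost blocks}
```
```latex
\text{3x replication}:\quad \text{efficiency} = \frac{1}{3} \approx 33\%,
\qquad \text{tolerates } 3 - 1 = 2 \text{ lost replicas}
```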

  10. Usage
    Archiving
    - Used only for cold-data archives (see the policy sketch below)
    - Why not for hot data?
      - It is often hard to avoid small files for hot data
      - HDFS Erasure Coding is still immature
    - Small files are merged at archiving time to avoid the small file problem
    - 12 PB stored with RS-6-3 => 32 servers with 12 TB * 24 disks saved
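
    For reference, a minimal sketch of opting a directory into an EC policy
    via the Hadoop 3 client API. The /archive path is hypothetical;
    RS-6-3-1024k is the name of the built-in RS-6-3 policy (1 MiB cells).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class EnableEcPolicy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // Files written under this directory from now on are erasure-coded;
            // existing files keep their current layout until rewritten.
            dfs.setErasureCodingPolicy(new Path("/archive"), "RS-6-3-1024k");
        }
    }
}
```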

  11. Troubleshooting of HDFS Erasure Coding

  12. Data Corruption
    Erasure-coded files cannot be read

  13. Key Observation
    Something was wrong with erasure coding reconstruction
    - Only certain storage blocks were corrupted
      - Found by comparing corrupted files bit-wise against their originals
    - The corrupted blocks had been reconstructed in the past
    - A NullPointerException sometimes occurred in StripedBlockReconstructor
      - It made reconstruction fail
    - Corruption rate: 0.005%
      - Measured by running a Spark application that scanned all erasure-coded files

  14. Study
    Dirty buffer pool in StripedBlockReconstructor
    - HDFS-15240 Erasure Coding: dirty buffer causes reconstruction block error
      - It was still an open issue at the time of our troubleshooting
      - We helped with a unit test :)
    - Summary
      - StripedBlockReconstructor uses a buffer pool, defined as a static field,
        for reconstruction
      - The buffer pool can be polluted under a certain failure scenario
      - The next reconstruction then uses the dirty buffer pool and yields
        corrupted blocks

  15. How Reconstruction Works
    Normal scenario
    - A StripedBlockReconstructor reconstructs the missing
      blocks iteratively, one buffer-sized chunk at a time

  16. Failure Scenario
    Step 1
    - When a timeout occurs in a StripedBlockReader,
      a new reader is created that reads from a different
      source DataNode

  17. Failure Scenario
    Step 2
    - The next iteration fails with a NullPointerException
      when the StripedBlockReconstructor collects the results
      from the StripedBlockReaders

  18. Failure Scenario
    Step 3
    - The StripedBlockReconstructor asks the
      StripedBlockReaders to release (put) their buffers
      back to the buffer pool before closing the readers

  19. Failure Scenario
    Step 4
    - These StripedBlockReaders still hold the buffers
      and keep writing to them until they are closed

  20. Failure Scenario
    Step 5
    - The next reconstruction uses the dirty buffer pool
      and yields corrupted blocks (illustrated in the sketch below)
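
    A minimal, self-contained illustration of this bug class (a toy, not the
    actual HDFS code): a buffer is returned to a shared pool while a
    straggling reader thread still writes into it, so the next borrower sees
    dirty data.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class DirtyBufferPoolDemo {
    // Shared pool, analogous to the static buffer pool in StripedBlockReconstructor.
    static final ConcurrentLinkedQueue<byte[]> POOL = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[4];

        // A timed-out "reader" that still holds the buffer and keeps writing.
        Thread straggler = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { return; }
            buf[0] = 42; // writes into the buffer AFTER it was returned to the pool
        });
        straggler.start();

        POOL.offer(buf);      // bug: returned to the pool before the reader is closed
        straggler.join();

        byte[] reused = POOL.poll(); // the next reconstruction borrows the dirty buffer
        System.out.println("reused[0] = " + reused[0] + " (expected 0)");
    }
}
```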

  21. Challenge
    Applying the patch alone is not enough
    - Is it enough just to apply the patch? No.
    - Challenges
      - Detection
        - How can we ensure that no data corruption has occurred?
      - Prevention
        - How can we prevent data corruption in the future?

  22. How to Detect
    Record file checksums
    - Problem
      - HDFS cannot natively detect this kind of data corruption
    - Solution
      - Record file checksums of erasure-coded files daily with Spark
        (a minimal sketch follows below)
    - Contributions
      - HDFS-15709 EC: Socket file descriptor leak in StripedBlockChecksumReconstructor
      - HDFS-15795 EC: Wrong checksum when reconstruction was failed by exception
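
    A minimal sketch of the detection idea, using a plain Java HDFS client in
    place of the Spark job mentioned above; the /archive root is hypothetical.
    Diffing today's output against an earlier run reveals silent corruption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecordChecksums {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // /archive is an illustrative root directory for erasure-coded files.
            for (FileStatus st : fs.listStatus(new Path("/archive"))) {
                if (!st.isFile()) continue;
                FileChecksum sum = fs.getFileChecksum(st.getPath());
                // Persist path + checksum somewhere durable; printed here for the sketch.
                System.out.println(st.getPath() + "\t" + sum);
            }
        }
    }
}
```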

  23. How getFileChecksum Works
    Normal scenario
    - Get the checksums of the data blocks
    - Calculate the file checksum from them (see the composition sketch below)
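
    With the default checksum type, the file checksum is an MD5 over the
    concatenated per-block checksums (each of which is itself an MD5 over that
    block's CRCs). A toy sketch of the composition step, assuming the
    per-block checksums have already been fetched:

```java
import java.security.MessageDigest;

public class ComposeFileChecksum {
    // Toy composition: MD5 over the concatenated per-block checksums.
    static byte[] fileChecksum(byte[][] blockChecksums) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (byte[] blockChecksum : blockChecksums) {
            md5.update(blockChecksum);
        }
        return md5.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[][] blockChecksums = { {1, 2}, {3, 4} }; // toy per-block checksums
        System.out.printf("%032x%n", new java.math.BigInteger(1, fileChecksum(blockChecksums)));
    }
}
```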

  24. File Descriptor Leak
    HDFS-15709
    - If a block is missing, reconstruct it from the live blocks,
      then calculate the file checksum
    - Bug: StripedBlockChecksumReconstructor did not close its
      connections to the DataNodes, leaking socket file descriptors
      (the general fix pattern is sketched below)
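
    The general fix pattern for this class of leak (a sketch, not the actual
    HDFS-15709 patch): close every source reader even when reconstruction
    throws partway through.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

class ReconstructAndClose {
    static void run(List<? extends Closeable> readers, Runnable reconstruct) {
        try {
            reconstruct.run();          // may throw at any point
        } finally {
            for (Closeable reader : readers) {
                try {
                    reader.close();     // always release the socket / file descriptor
                } catch (IOException e) {
                    // log and keep closing the remaining readers
                }
            }
        }
    }
}
```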

  25. Wrong Checksum
    HDFS-15795
    - If a block is missing, reconstruct it from the live blocks,
      then calculate the file checksum
    - Bug: when the StripedBlockChecksumReconstructor failed,
      getFileChecksum silently ignored the missing block and
      returned a wrong checksum

  26. How to Prevent
    Verify the correctness of erasure coding reconstruction
    - Problem
      - A new bug may cause data corruption again in the future
    - Solution
      - Verify the correctness of reconstruction inside HDFS itself
    - Contribution
      - HDFS-15759 EC: Verify EC reconstruction correctness on DataNode
      - Set dfs.datanode.ec.reconstruction.validation=true (see the snippet below)
      - Monitor the EcInvalidReconstructionTasks metric
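
    As a sketch of the configuration (the property name comes from
    HDFS-15759; placing it in each DataNode's hdfs-site.xml is the assumed
    deployment):

```xml
<!-- hdfs-site.xml on each DataNode (HDFS-15759) -->
<property>
  <name>dfs.datanode.ec.reconstruction.validation</name>
  <value>true</value>
</property>
```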

  27. How It Works
    Idea: yet another reconstruction, for validation
    Assume d2 is missing:
    1. The DataNode reconstructs d2'
    2. The DataNode reconstructs d1' using d2'
    3. Compare d1' with d1
    If d1' != d1, the reconstruction is reported as failed
    (a toy version of this idea is sketched below)
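
    A toy version of the validation idea, substituting single-parity XOR for
    Reed-Solomon so that the sketch stays short and self-contained: reconstruct
    the missing cell, then re-derive a cell we do have and compare.

```java
public class ValidateReconstruction {
    // Toy single-parity code: p = d1 ^ d2 ^ d3 (byte-wise).
    static byte[] xor(byte[]... cells) {
        byte[] out = new byte[cells[0].length];
        for (byte[] cell : cells) {
            for (int i = 0; i < out.length; i++) out[i] ^= cell[i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] d1 = {1, 2}, d2 = {3, 4}, d3 = {5, 6};
        byte[] p = xor(d1, d2, d3);

        // Assume d2 is missing.
        byte[] d2prime = xor(p, d1, d3);                      // 1. reconstruct d2'
        byte[] d1prime = xor(p, d2prime, d3);                 // 2. reconstruct d1' using d2'
        boolean valid = java.util.Arrays.equals(d1prime, d1); // 3. compare with d1

        // If d1' != d1, the reconstruction task is reported as failed.
        System.out.println(valid ? "reconstruction validated" : "reconstruction invalid");
    }
}
```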

  28. Future Work
    Streaming block checksums
    - Since the patch was applied, our monitoring system has observed
      no data corruption
    - However,
      - Recording checksums of all erasure-coded files periodically is
        costly and ultimately infeasible
      - Our reconstruction validation does not provide a 100% guarantee
    - To take it one step further,
      - We need to record block checksums in a streaming fashion

  29. Conclusion
    And recommendations
    - HDFS Erasure Coding works well most of the time
    - It may still cause trouble at large scale or under certain conditions
    - Recommendations
      - Keep upgrading to the latest maintenance version
      - Regularly check HDFS JIRA issues with the ec/erasure-coding components
      - Back up some of the original data for troubleshooting
    - We will keep contributing to OSS communities

  30. We are hiring!
    - Data Platform Engineer
    - Site Reliability Engineer
    - Distributed System Administrator
    - And more…
