When we store large data sets on a cluster, some blocks will eventually become corrupt. The cause may be a failing disk or some other fault.

One way to check the health of files in HDFS is to use the fsck command.

Basic Command.

hdfs fsck /<path_to_file>

This reports the number of files and blocks under that file or directory, and flags any blocks that are under-replicated.
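If you want to run this check from a script, the following is a minimal Python sketch. It assumes the hdfs client is on the PATH and that the report contains the usual "Total files", "Corrupt blocks" and "Under-replicated blocks" lines and a closing "The filesystem under path ... is HEALTHY" (or CORRUPT) line. The function names summarize_fsck and run_fsck are my own, and the exact report layout can differ between Hadoop versions.

```python
import subprocess


def summarize_fsck(report: str) -> dict:
    """Pull the overall status and a few counters out of fsck report text."""
    summary = {"status": "UNKNOWN"}
    for line in report.splitlines():
        line = line.strip()
        # Assumption: the report ends with "... is HEALTHY" or "... is CORRUPT".
        if line.startswith("The filesystem under path"):
            summary["status"] = line.rsplit(" ", 1)[-1]
        # Counter lines look like "Key:<whitespace>value"; keep a few of them.
        for key in ("Total files", "Corrupt blocks", "Under-replicated blocks"):
            if line.startswith(key + ":"):
                summary[key] = line.split(":", 1)[1].strip()
    return summary


def run_fsck(path: str) -> dict:
    """Run `hdfs fsck <path>` and summarize its output (needs the hdfs client)."""
    result = subprocess.run(["hdfs", "fsck", path], capture_output=True, text=True)
    return summarize_fsck(result.stdout)
```

Calling run_fsck with the path you want to check returns a small dictionary; a status other than HEALTHY is the cue to look at individual blocks.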

Check blocks on a specific file.

hdfs fsck /<path/to/corrupt/file> -locations -blocks -files

The output will be similar to the following (reformatted for readability).

BP-123123123-172.16.16.1-1231231231231:blk_1231231231_123123 len=123123 repl=3
  [
    DatanodeInfoWithStorage[172.16.16.1:1000,DS-12312312-1231-1231-1231-123123123123,DISK],
    DatanodeInfoWithStorage[172.16.16.2:1000,DS-12312312-1231-1231-1231-123123123124,DISK],
    DatanodeInfoWithStorage[172.16.16.3:1000,DS-12312312-1231-1231-1231-123123123125,DISK]
  ]
  • Block Pool: BP-123123123-172.16.16.1-1231231231231
  • Block Identifier: blk_1231231231_123123
  • Number of bytes in the block: len=123123
  • Replication Count: repl=3
  • Block information on each node: DatanodeInfoWithStorage[172.16.16.1:1000,DS-12312312-1231-1231-1231-123123123123,DISK]
    • 172.16.16.1: node IP.
    • 1000: stream port.
    • DS-12312312-1231-1231-1231-123123123123: storage ID.
    • DISK: storage type.
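The field breakdown above can be sketched as a small parser. This is only an illustration built around the sample output in this post: the regular expressions and the name parse_block_entry are my own, and I am assuming the replication field appears as either repl= or Live_repl=, since the label varies between Hadoop versions.

```python
import re

# Matches "BP-...:blk_<id>_<genstamp> len=<n> repl=<n>" (or "Live_repl=<n>").
BLOCK_RE = re.compile(
    r"(?P<pool>BP-[^:\s]+):(?P<block>blk_-?\d+_\d+)\s+"
    r"len=(?P<length>\d+)\s+(?:Live_)?repl=(?P<repl>\d+)"
)
# Matches one "DatanodeInfoWithStorage[ip:port,storage-id,storage-type]" entry.
NODE_RE = re.compile(
    r"DatanodeInfoWithStorage\[(?P<ip>[^:\]]+):(?P<port>\d+),"
    r"(?P<storage_id>[^,\]]+),(?P<storage_type>[^\]]+)\]"
)

SAMPLE = """BP-123123123-172.16.16.1-1231231231231:blk_1231231231_123123 len=123123 repl=3
  [
    DatanodeInfoWithStorage[172.16.16.1:1000,DS-12312312-1231-1231-1231-123123123123,DISK],
    DatanodeInfoWithStorage[172.16.16.2:1000,DS-12312312-1231-1231-1231-123123123124,DISK],
    DatanodeInfoWithStorage[172.16.16.3:1000,DS-12312312-1231-1231-1231-123123123125,DISK]
  ]"""


def parse_block_entry(text: str) -> dict:
    """Split one block entry into pool, block, length, replication and locations."""
    match = BLOCK_RE.search(text)
    if match is None:
        raise ValueError("no block entry found")
    locations = [
        {
            "ip": node.group("ip"),
            "port": int(node.group("port")),
            "storage_id": node.group("storage_id"),
            "storage_type": node.group("storage_type"),
        }
        for node in NODE_RE.finditer(text)
    ]
    return {
        "pool": match.group("pool"),
        "block": match.group("block"),
        "length": int(match.group("length")),
        "replication": int(match.group("repl")),
        "locations": locations,
    }


if __name__ == "__main__":
    entry = parse_block_entry(SAMPLE)
    print(entry["block"], [loc["ip"] for loc in entry["locations"]])
```

Comparing the number of parsed locations with the replication count is a quick way to spot a block that has fewer replicas than expected.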

While writing this post, I found an excellent answer about block details that is well worth a read. I have posted an extract below.

Courtesy of https://stackoverflow.com/a/34497704 (unfortunately, I was unable to embed the answer here).

BP-929597290-192.0.0.2-1439573305237 : This is Block Pool ID. Block pool is a set of blocks that belong to single name space. For simplicity, you can say that all the blocks managed by a Name Node are under the same Block Pool.

The Block Pool is formed as:  String bpid = "BP-" + rand + "-"+ ip + "-" + Time.now();        
rand = Some random number
ip = IP address of the Name Node
Time.now() - Current system time
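To make the format above concrete, here is a short sketch that splits a block pool ID back into its three parts. It assumes an IPv4 address (no hyphens inside the address) and that the last field is a Unix timestamp in milliseconds, which is how I read Time.now() here; split_block_pool_id is a name I made up for illustration.

```python
from datetime import datetime, timezone


def split_block_pool_id(bpid: str) -> dict:
    """Split "BP-<rand>-<ip>-<millis>" into its components."""
    prefix, rand, ip, millis = bpid.split("-", 3)
    if prefix != "BP":
        raise ValueError("not a block pool ID: " + bpid)
    return {
        "rand": int(rand),
        "namenode_ip": ip,
        # Assumption: the timestamp is milliseconds since the Unix epoch.
        "created": datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc),
    }


if __name__ == "__main__":
    print(split_block_pool_id("BP-929597290-192.0.0.2-1439573305237"))
```

The creation time is handy when you need to tell which NameNode format a given block pool came from.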

Read about Block Pools here: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/Federation.html

blk_1074084574_344316: Block number of the block. Each block in HDFS is given a unique identifier. The block ID is formed as: blk_blockid_genstamp

blockid = ID of the block
genstamp = an incrementing integer that records the version of a particular block
DISK = storageType. It is DISK here. Storage type can be: RAM_DISK, SSD, DISK and ARCHIVE
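The block name and storage type described above can be checked in a few lines. This is a sketch only: split_block_name and is_known_storage_type are hypothetical helper names, and the set of storage types is simply the list given in the extract.

```python
# Storage types as listed in the extract above.
STORAGE_TYPES = {"RAM_DISK", "SSD", "DISK", "ARCHIVE"}


def split_block_name(name: str) -> tuple:
    """Split "blk_<blockid>_<genstamp>" into (block_id, genstamp) integers."""
    prefix, block_id, genstamp = name.split("_")
    if prefix != "blk":
        raise ValueError("not a block name: " + name)
    return int(block_id), int(genstamp)


def is_known_storage_type(storage_type: str) -> bool:
    """Return True if the value is one of the listed storage types."""
    return storage_type in STORAGE_TYPES


if __name__ == "__main__":
    print(split_block_name("blk_1074084574_344316"))
    print(is_known_storage_type("DISK"))
```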