Best Practices for Recovering Corrupt HDFS Blocks in Production

1. The test file ruozedata.md

Upload it:
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -mkdir /blockrecover
19/07/08 00:14:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ echo "www.ruozedata.com" > ruozedata.md
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -put ruozedata.md /blockrecover
19/07/08 00:16:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -ls /blockrecover
19/07/08 00:16:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hadoop supergroup         18 2019-07-08 00:16 /blockrecover/ruozedata.md

Verify with fsck: the filesystem is healthy
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs fsck /
19/07/08 00:16:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /10.9.15.140 for path / at Mon Jul 08 00:16:49 CST 2019
...........Status: HEALTHY
 Total size:	150512 B
 Total dirs:	14
 Total files:	11
 Total symlinks:		0
 Total blocks (validated):	7 (avg. block size 21501 B)
 Minimally replicated blocks:	7 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	1
 Average block replication:	1.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		1
 Number of racks:		1
FSCK ended at Mon Jul 08 00:16:49 CST 2019 in 9 milliseconds

The filesystem under path '/' is HEALTHY

2. Delete one replica of one of the file's blocks directly on a DN node (3-replica scenario)

Delete the block file and its meta file:
[root@10-9-15-140 ~]# cd /tmp/hadoop-hadoop/dfs/data/current/BP-2139914965-127.0.0.1-1562410380097/current/finalized/subdir0/subdir0
[root@10-9-15-140 subdir0]# rm -rf blk_1073741825 blk_1073741825_1001.meta

Restart HDFS so the deletion takes effect, simulating real corruption, then check with fsck: errors are reported.
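A minimal restart sketch, assuming the stock sbin scripts of this pseudo-distributed deployment (adjust to your own start/stop procedure):

# run from $HADOOP_HOME; stops and restarts the HDFS daemons
sbin/stop-dfs.sh
sbin/start-dfs.sh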
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs fsck /
19/07/08 00:45:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /10.9.15.140 for path / at Mon Jul 08 00:45:04 CST 2019
..
/examples/input/1.log: CORRUPT blockpool BP-2139914965-127.0.0.1-1562410380097 block blk_1073741825

/examples/input/1.log: MISSING 1 blocks of total size 14 B..........Status: CORRUPT
 Total size:	150512 B
 Total dirs:	14
 Total files:	11
 Total symlinks:		0
 Total blocks (validated):	7 (avg. block size 21501 B)
  ********************************
  CORRUPT FILES:	1
  MISSING BLOCKS:	1
  MISSING SIZE:		14 B
  CORRUPT BLOCKS: 	1
  ********************************
 Minimally replicated blocks:	6 (85.71429 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	1
 Average block replication:	0.85714287
 Corrupt blocks:		1
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		1
 Number of racks:		1
FSCK ended at Mon Jul 08 00:45:04 CST 2019 in 26 milliseconds


The filesystem under path '/' is CORRUPT


3. Locate the corrupt block

To find out which blocks of the file live on which machines, so that the damaged block files can be deleted by hand on Linux, fsck takes these options:
-files     shows the file's block breakdown
-blocks    shows block details (only shown together with -files)
-locations shows the DataNode addresses hosting each block (only shown together with -blocks)
-racks     shows rack placement (together with -files)

Detect the missing blocks:
① hdfs fsck -list-corruptfileblocks
② Inspect one of the files reported above:
hdfs fsck /examples/input/1.log -files -blocks -locations

[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs fsck /examples/input/1.log -files -blocks -locations
19/07/09 08:11:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /10.9.15.140 for path /examples/input/1.log at Tue Jul 09 08:11:42 CST 2019
/examples/input/1.log 14 bytes, 1 block(s): 
/examples/input/1.log: CORRUPT blockpool BP-2139914965-127.0.0.1-1562410380097 block blk_1073741825
 MISSING 1 blocks of total size 14 B
0. BP-2139914965-127.0.0.1-1562410380097:blk_1073741825_1001 len=14 MISSING!

Status: CORRUPT
 Total size:	14 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 14 B)
  ********************************
  CORRUPT FILES:	1
  MISSING BLOCKS:	1
  MISSING SIZE:		14 B
  CORRUPT BLOCKS: 	1
  ********************************
 Minimally replicated blocks:	0 (0.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	1
 Average block replication:	0.0
 Corrupt blocks:		1
 Missing replicas:		0
 Number of data-nodes:		1
 Number of racks:		1
FSCK ended at Tue Jul 09 08:11:42 CST 2019 in 3 milliseconds


The filesystem under path '/examples/input/1.log' is CORRUPT

From the line /examples/input/1.log: CORRUPT blockpool BP-2139914965-127.0.0.1-1562410380097 block blk_1073741825 we can pinpoint exactly which block is damaged.
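With the block ID in hand, a find on the DataNode maps it to the physical files; a sketch, assuming the default dfs.datanode.data.dir used in step 2:

# run on each DataNode that held the block; prints the block and meta file paths
find /tmp/hadoop-hadoop/dfs/data -name 'blk_1073741825*' 2>/dev/null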

4. Manual repair

Take a look at the hdfs debug command; it is the key tool here:

[hadoop@10-9-15-140 ~]$ hdfs debug
Usage: hdfs debug <command> [arguments]

verify [-meta <metadata-file>] [-block <block-file>]
recoverLease [-path <path>] [-retries <num-retries>]
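For completeness, verify checks a replica's block file against its checksum meta file. A hypothetical invocation against a healthy replica, reusing the paths from step 2:

# cd into the block directory from step 2, then verify that replica
cd /tmp/hadoop-hadoop/dfs/data/current/BP-2139914965-127.0.0.1-1562410380097/current/finalized/subdir0/subdir0
hdfs debug verify -meta blk_1073741825_1001.meta -block blk_1073741825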

Manual repair steps. First list the corrupt file blocks:

hdfs fsck -list-corruptfileblocks

Then run the repair command:
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs debug  recoverLease  -path /examples/input/1.log -retries 10
19/07/09 08:57:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
recoverLease SUCCEEDED on /examples/input/1.log
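After the repair it is worth re-running fsck on the file to confirm the block is no longer reported missing; a follow-up check:

hdfs fsck /examples/input/1.log -files -blocks -locations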

Because my Hadoop deployment here is pseudo-distributed with a single DataNode, each block has only one replica. Once that replica's block file is deleted, there is nothing left to recover from, so the repair cannot be demonstrated. In a normal 3-replica setup, after one replica of a block is deleted the other two replicas remain, and the block can be repaired from them.
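In a real multi-replica cluster you would confirm the replication factor before such an experiment; a sketch using standard FsShell commands, applied to the example file from above:

# print the file's current replication factor
hdfs dfs -stat %r /examples/input/1.log
# raise it to 3 and wait for the extra copies (multi-DataNode clusters only)
hdfs dfs -setrep -w 3 /examples/input/1.log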

5. Automatic repair

After a block is damaged, the DataNode does not notice the corruption until its next directory scan;
the directory scan runs every 6 hours by default:
dfs.datanode.directoryscan.interval : 21600 (seconds)

The block is not recovered until the DN sends its next block report to the NN;
the block report is also sent every 6 hours by default:
dfs.blockreport.intervalMsec : 21600000 (milliseconds)

Only when the NN receives the block report does it schedule the recovery.
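You can confirm the values in effect on your cluster with getconf; a quick check:

hdfs getconf -confKey dfs.datanode.directoryscan.interval
hdfs getconf -confKey dfs.blockreport.intervalMsec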

For details, refer to the hdfs-default.xml parameters of the HDFS version running in production (CDH 5.12.0 here):
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.12.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

Summary:
In production I generally prefer the manual repair approach, but it requires deleting the damaged block files by hand first.
Remember: what you delete are the damaged block file and its meta file on the DataNode, never the HDFS file itself.

Alternatively, you can first download the file with get, delete it from HDFS, and then re-upload it, as sketched below.
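A sketch of that workaround; it presupposes the file is still readable (or a good local copy exists), and reuses the example file from above:

# pull a copy to local disk, remove the damaged HDFS file, then re-upload it
hdfs dfs -get /examples/input/1.log /tmp/1.log
hdfs dfs -rm /examples/input/1.log
hdfs dfs -put /tmp/1.log /examples/input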

One final warning: never run hdfs fsck / -delete. That deletes the corrupt files outright, and the data is simply lost, unless you truly do not care about losing it or are confident you can backfill it into HDFS from another source!
