1. Create and upload the file ruozedata.md
Upload it to HDFS:
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -mkdir /blockrecover
19/07/08 00:14:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ echo "www.ruozedata.com" > ruozedata.md
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -put ruozedata.md /blockrecover
19/07/08 00:16:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -ls /blockrecover
19/07/08 00:16:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r-- 1 hadoop supergroup 18 2019-07-08 00:16 /blockrecover/ruozedata.md
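To double-check the upload, you can cat the file back (an optional step, not part of the original run):
hdfs dfs -cat /blockrecover/ruozedata.md
# should print www.ruozedata.com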
Verify with fsck: the filesystem is healthy
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs fsck /
19/07/08 00:16:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /10.9.15.140 for path / at Mon Jul 08 00:16:49 CST 2019
...........Status: HEALTHY
Total size: 150512 B
Total dirs: 14
Total files: 11
Total symlinks: 0
Total blocks (validated): 7 (avg. block size 21501 B)
Minimally replicated blocks: 7 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Jul 08 00:16:49 CST 2019 in 9 milliseconds
The filesystem under path '/' is HEALTHY
2. Delete one replica of one of the file's blocks directly on the DN node (3-replica scenario)
Delete the block file and its meta file:
[root@10-9-15-140 ~]# cd /tmp/hadoop-hadoop/dfs/data/current/BP-2139914965-127.0.0.1-1562410380097/current/finalized/subdir0/subdir0
[root@10-9-15-140 subdir0]# rm -rf blk_1073741825 blk_1073741825_1001.meta
Restart HDFS directly to simulate the damage taking effect, then check with fsck: errors are reported. (Without the restart, the DN would not notice the missing block until its next directory scan.)
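On a pseudo-distributed install, the restart can be done with the standard scripts (assuming HADOOP_HOME points at the install directory; adjust for your environment):
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh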
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs fsck /
19/07/08 00:45:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /10.9.15.140 for path / at Mon Jul 08 00:45:04 CST 2019
..
/examples/input/1.log: CORRUPT blockpool BP-2139914965-127.0.0.1-1562410380097 block blk_1073741825
/examples/input/1.log: MISSING 1 blocks of total size 14 B..........Status: CORRUPT
Total size: 150512 B
Total dirs: 14
Total files: 11
Total symlinks: 0
Total blocks (validated): 7 (avg. block size 21501 B)
********************************
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 14 B
CORRUPT BLOCKS: 1
********************************
Minimally replicated blocks: 6 (85.71429 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 0.85714287
Corrupt blocks: 1
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Jul 08 00:45:04 CST 2019 in 26 milliseconds
The filesystem under path '/' is CORRUPT
3. Locate the damaged block
We want to know which of the file's blocks sit on which machines, so the damaged block files can then be deleted by hand on Linux.
-files shows per-file block information,
-blocks shows block details (only takes effect together with -files),
-locations shows the DataNode IPs holding each block (only takes effect together with -blocks),
-racks shows rack placement (used together with -files)
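For example, to print file, block, location, and rack details for everything under the root path:
hdfs fsck / -files -blocks -locations -racks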
Detect the missing blocks:
①. hdfs fsck / -list-corruptfileblocks
②. Inspect one of the files reported above:
hdfs fsck /examples/input/1.log -files -blocks -locations
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs fsck /examples/input/1.log -files -blocks -locations
19/07/09 08:11:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /10.9.15.140 for path /examples/input/1.log at Tue Jul 09 08:11:42 CST 2019
/examples/input/1.log 14 bytes, 1 block(s):
/examples/input/1.log: CORRUPT blockpool BP-2139914965-127.0.0.1-1562410380097 block blk_1073741825
MISSING 1 blocks of total size 14 B
0. BP-2139914965-127.0.0.1-1562410380097:blk_1073741825_1001 len=14 MISSING!
Status: CORRUPT
Total size: 14 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 14 B)
********************************
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 14 B
CORRUPT BLOCKS: 1
********************************
Minimally replicated blocks: 0 (0.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 0.0
Corrupt blocks: 1
Missing replicas: 0
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Tue Jul 09 08:11:42 CST 2019 in 3 milliseconds
The filesystem under path '/examples/input/1.log' is CORRUPT
The line "/examples/input/1.log: CORRUPT blockpool BP-2139914965-127.0.0.1-1562410380097 block blk_1073741825" tells us exactly which block pool and block are bad, and from it we can locate the damaged block.
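With the block ID in hand, you can locate the physical files on the DataNode with find (the data directory below is this pseudo-distributed setup's; substitute your own dfs.datanode.data.dir):
find /tmp/hadoop-hadoop/dfs/data -name "blk_1073741825*"
# normally returns the block file and its .meta file; here both are gone because we deleted them in step 2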
4. Manual repair
First, take a look at the hdfs debug command; it is the key tool here:
[hadoop@10-9-15-140 ~]$ hdfs debug
Usage: hdfs debug <command> [arguments]
verify [-meta <metadata-file>] [-block <block-file>]
recoverLease [-path <path>] [-retries <num-retries>]
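verify checks a block file against its checksum meta file. An illustrative invocation using the block from this walkthrough (in our case these files were already deleted in step 2, so this only works against an intact replica):
hdfs debug verify -meta /tmp/hadoop-hadoop/dfs/data/current/BP-2139914965-127.0.0.1-1562410380097/current/finalized/subdir0/subdir0/blk_1073741825_1001.meta -block /tmp/hadoop-hadoop/dfs/data/current/BP-2139914965-127.0.0.1-1562410380097/current/finalized/subdir0/subdir0/blk_1073741825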
Manual repair procedure:
First list the corrupt file blocks:
hdfs fsck / -list-corruptfileblocks
Then run the repair command:
[hadoop@10-9-15-140 hadoop-2.6.0-cdh5.7.0]$ hdfs debug recoverLease -path /examples/input/1.log -retries 10
19/07/09 08:57:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
recoverLease SUCCEEDED on /examples/input/1.log
Because my Hadoop deployment is pseudo-distributed with a single DataNode, every block has exactly one replica. Once that replica's block file is deleted there is no other copy left, so the recovery cannot actually be demonstrated here. In the normal case with three replicas, deleting one block file still leaves two good replicas, and the block can be repaired from them.
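On a real multi-replica cluster you would confirm the result afterwards with fsck (a hypothetical follow-up, not from this run):
hdfs fsck /examples/input/1.log -files -blocks -locations
# the block should show healthy replicas again once re-replication completes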
5. Automatic repair
After a block is damaged, the DN does not notice the damage until it runs a directory scan;
the directory scan interval defaults to 6 hours:
dfs.datanode.directoryscan.interval : 21600
The block is not recovered before the DN sends its block report to the NN;
the block report interval also defaults to 6 hours:
dfs.blockreport.intervalMsec : 21600000
Only when the NN receives the block report does it schedule the recovery (re-replicating the missing replica).
For the exact parameters, refer to the hdfs-default.xml documentation matching your production HDFS version (CDH 5.12.0 here):
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.12.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
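You can also query the values actually in effect on your cluster, for example:
hdfs getconf -confKey dfs.datanode.directoryscan.interval
hdfs getconf -confKey dfs.blockreport.intervalMsec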
Summary:
In production I generally prefer the manual repair approach, but the prerequisite is deleting the damaged block by hand first.
Remember: delete the damaged block file and its meta file on the DataNode, not the HDFS file itself.
Of course, you can also get (download) the file first, delete it from HDFS, and then re-upload it; see the sketch below.
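A minimal sketch of that round trip (file name reused from this walkthrough; the get only works while the file is still readable, e.g. from a surviving replica):
hdfs dfs -get /examples/input/1.log /tmp/1.log
hdfs dfs -rm /examples/input/1.log
hdfs dfs -put /tmp/1.log /examples/input/1.log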
Whatever you do, do not run hdfs fsck / -delete as the cleanup: that deletes the corrupt files themselves, and the data is simply gone; only do it if losing the data is acceptable, or you are confident you can backfill it into HDFS from another source!