HDFS Block Corruption Recovery in Practice

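Contents

    • I. Introduction
    • II. Practice
      • ① Create a directory in HDFS, upload a test file, and check the file's health
      • ② Locate the block, then delete one block replica and its block metadata
      • ③ Restart HDFS to simulate the corruption, then detect it with hdfs fsck /path
    • III. Repair
      • ① Manual repair with hdfs debug (recommended)
      • ② Manual repair, method two
      • ③ Automatic repair
    • IV. Summary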

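I. Introduction

① hdfs fsck /path
Checks the health of the files under path.
② hdfs fsck /path -files -blocks -locations
Prints the location of each block's replicas (-locations); it must be used together with -files -blocks.
③ hdfs fsck /path -list-corruptfileblocks
Lists the corrupt blocks and the files they belong to (-list-corruptfileblocks).
④ hdfs fsck /path -delete
Deletes the corrupted files themselves (the files stored in HDFS).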

II. Practice

① Create a directory in HDFS, upload a test file, and check the file's health

[hadoop@hadoop001 ~]$ hdfs dfs -mkdir /blocktest
[hadoop@hadoop001 ~]$ hdfs dfs -put testHdfsFile.txt /blocktest/
[hadoop@hadoop001 ~]$ hdfs dfs -ls /blocktest/
Found 1 items
-rw-r--r--   3 hadoop hadoop         60 2019-08-21 14:23 /blocktest/testHdfsFile.txt
[hadoop@hadoop001 ~]$ hdfs fsck /blocktest/
Connecting to namenode via http://hadoop001:50070/fsck?ugi=hadoop&path=%2Fblocktest
FSCK started by hadoop (auth:SIMPLE) from /172.19.252.139 for path /blocktest at Wed Aug 21 14:25:05 CST 2019
.Status: HEALTHY
 Total size:    60 B
 Total dirs:    1
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      1 (avg. block size 60 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Wed Aug 21 14:25:05 CST 2019 in 1 milliseconds

The filesystem under path '/blocktest' is HEALTHY
[hadoop@hadoop001 ~]$ 
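
To script this health check, the overall verdict can be pulled out of the fsck report. The following is a minimal sketch, assuming the report keeps the "Status:" line format shown in the output above; fsck_status is a hypothetical helper name, not part of the HDFS CLI.

```shell
#!/usr/bin/env bash
# Read `hdfs fsck` output on stdin and print the overall status word
# (for example HEALTHY or CORRUPT) taken from the first "Status:" line.
# fsck_status is a hypothetical helper for this sketch.
fsck_status() {
  sed -n 's/.*Status: \([A-Z]*\).*/\1/p' | head -n 1
}

# Demo on a fragment of the report shown above.
# On a live cluster one would run: hdfs fsck /blocktest | fsck_status
fsck_status <<'EOF'
.Status: HEALTHY
 Total size:    60 B
 Total dirs:    1
EOF
```

Matching on the "Status:" line rather than on the final summary sentence keeps the helper independent of the path that was checked.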

② Locate the block, then delete one block replica and its block metadata

[root@hadoop001 subdir0]#  hdfs fsck /blocktest/testHdfsFile.txt -files -blocks -locations
-bash: hdfs: command not found
[root@hadoop001 subdir0]# su - hadoop
Last login: Wed Aug 21 14:12:44 CST 2019 on pts/0
[hadoop@hadoop001 ~]$ hdfs fsck /blocktest/testHdfsFile.txt -files -blocks -locations
Connecting to namenode via http://hadoop001:50070/fsck?ugi=hadoop&files=1&blocks=1&locations=1&path=%2Fblocktest%2FtestHdfsFile.txt
FSCK started by hadoop (auth:SIMPLE) from /172.19.252.139 for path /blocktest/testHdfsFile.txt at Wed Aug 21 14:47:17 CST 2019
/blocktest/testHdfsFile.txt 60 bytes, 1 block(s):  OK
0. BP-577895678-172.19.252.139-1566271200217:blk_1073741826_1002 len=60 Live_repl=3 [DatanodeInfoWithStorage[172.19.252.141:50010,DS-ffd3fa19-ddbb-4f5a-b487-d1ecb6a6d95b,DISK], DatanodeInfoWithStorage[172.19.252.140:50010,DS-ce5c4933-ca59-4955-bfcd-b1c6c0276f1f,DISK], DatanodeInfoWithStorage[172.19.252.139:50010,DS-afdf9c32-a7f5-4b9b-b9ff-32bf4ea876e2,DISK]]

Status: HEALTHY
 Total size:    60 B
 Total dirs:    0
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      1 (avg. block size 60 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Wed Aug 21 14:47:17 CST 2019 in 1 milliseconds

The filesystem under path '/blocktest/testHdfsFile.txt' is HEALTHY
[hadoop@hadoop001 ~]$ logout
[root@hadoop001 subdir0]# find / -name "*blk_1073741826_1002*"
/home/hadoop/data/dfs/data/current/BP-577895678-172.19.252.139-1566271200217/current/finalized/subdir0/subdir0/blk_1073741826_1002.meta
[root@hadoop001 subdir0]# cd /home/hadoop/data/dfs/data/current/BP-577895678-172.19.252.139-1566271200217/current/finalized/subdir0/subdir0/
[root@hadoop001 subdir0]# ll
total 20
-rw-rw-r-- 1 hadoop hadoop 4233 Aug 20 12:32 blk_1073741825
-rw-rw-r-- 1 hadoop hadoop   43 Aug 20 12:32 blk_1073741825_1001.meta
-rw-rw-r-- 1 hadoop hadoop   60 Aug 21 14:23 blk_1073741826
-rw-rw-r-- 1 hadoop hadoop   11 Aug 21 14:23 blk_1073741826_1002.meta
# Delete the block file and its meta file
[root@hadoop001 subdir0]# rm -rf blk_1073741826*
[root@hadoop001 subdir0]# ll
total 12
-rw-rw-r-- 1 hadoop hadoop 4233 Aug 20 12:32 blk_1073741825
-rw-rw-r-- 1 hadoop hadoop   43 Aug 20 12:32 blk_1073741825_1001.meta
[root@hadoop001 subdir0]# 
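
In the listing above, the data file is named blk_<id> while the checksum file is named blk_<id>_<genstamp>.meta, which is why the find pattern containing the generation stamp matched only the .meta file. The sketch below derives both file names from an fsck line; it assumes the blk_<id>_<genstamp> token format shown in the output above, and block_tokens and block_files are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print every "blk_<id>_<genstamp>" token found on stdin.
block_tokens() {
  grep -o 'blk_[0-9]\{1,\}_[0-9]\{1,\}'
}

# For one token, print the two on-disk file names seen in the listing:
# the data file (blk_<id>) and the checksum file (blk_<id>_<genstamp>.meta).
block_files() {
  token=$1
  echo "${token%_*}"     # strip the trailing _<genstamp>
  echo "${token}.meta"
}

# Demo on the block line from the fsck output above (truncated after Live_repl).
line='0. BP-577895678-172.19.252.139-1566271200217:blk_1073741826_1002 len=60 Live_repl=3'
block_files "$(echo "$line" | block_tokens)"
```

Searching with find for the data-file name (the first line printed) matches both files, since the .meta name starts with it.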

③ Restart HDFS to simulate the corruption, then detect it with hdfs fsck /path

[hadoop@hadoop001 subdir0]$ hdfs fsck /
Connecting to namenode via http://hadoop001:50070
FSCK started by hadoop (auth:SIMPLE) from /127.0.0.1 for path / at Mon Apr 29 18:51:06 CST 2019
..
/blockrecover/testcorruptfiles.txt: CORRUPT blockpool BP-2041209051-127.0.0.1-1556350579057 block blk_1073741890

/blockrecover/testcorruptfiles.txt: MISSING 1 blocks of total size 51 B.............Status: CORRUPT
 Total size:    654116 B
 Total dirs:    12
 Total files:   14
 Total symlinks:                0
 Total blocks (validated):      14 (avg. block size 46722 B)
  ********************************
  CORRUPT FILES:        1
  MISSING BLOCKS:       1
  MISSING SIZE:         51 B
  CORRUPT BLOCKS:       1
  ********************************
 Minimally replicated blocks:   13 (92.85714 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     0.9285714
 Corrupt blocks:                1
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          1
 Number of racks:               1
FSCK ended at Mon Apr 29 18:51:06 CST 2019 in 41 milliseconds
The filesystem under path '/' is CORRUPT
[hadoop@hadoop001 subdir0]$ 

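Corrupt blocks: 1
One block is corrupt.
(This simulation ran in pseudo-distributed mode with a single DataNode; the output above comes from that setup, so its path, /blockrecover/testcorruptfiles.txt, differs from the earlier steps. On a multi-node cluster the missing replica may already have been repaired automatically by the time HDFS is restarted, so no corrupt block shows up.)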
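
When the report covers many files, the affected paths can be filtered out of it. This is a rough sketch that assumes the "<path>: CORRUPT blockpool ..." line format shown in the output above; corrupt_files is a hypothetical helper name. On a live cluster, hdfs fsck /path -list-corruptfileblocks reports the same information directly.

```shell
#!/usr/bin/env bash
# Read `hdfs fsck` output on stdin and print each file path that has
# at least one "CORRUPT blockpool" line, without duplicates.
corrupt_files() {
  sed -n 's|^\(/.*\): CORRUPT blockpool .*|\1|p' | sort -u
}

# Demo on the relevant lines of the report shown above.
corrupt_files <<'EOF'
..
/blockrecover/testcorruptfiles.txt: CORRUPT blockpool BP-2041209051-127.0.0.1-1556350579057 block blk_1073741890

/blockrecover/testcorruptfiles.txt: MISSING 1 blocks of total size 51 B.............Status: CORRUPT
EOF
```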

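III. Repair

① Manual repair with hdfs debug (recommended)

Manually delete the corrupt block replica. Remember: delete the corrupt block file and its meta file on the DataNode's local disk, not the HDFS file. Then run the repair command:

[hadoop@hadoop001 subdir0]$ hdfs debug recoverLease -path /blocktest/testHdfsFile.txt -retries 10

-retries: the number of times the client retries the recoverLease call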

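② Manual repair, method two

Download the file from HDFS to the local disk, delete the corresponding file in HDFS, and then upload it again. This requires that the file can still be read, i.e. at least one healthy replica of every block remains.

hdfs dfs -ls /xxxx
hdfs dfs -get /xxxx ./
hdfs dfs -rm /xxxx
hdfs dfs -put xxxx /        # after the put, the file is automatically stored as 3 replicas again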
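
The download, delete, and re-upload sequence can be wrapped so that the HDFS copy is removed only after the download has succeeded. This is a sketch under stated assumptions: reupload and the HDFS_CMD variable are hypothetical names introduced here (HDFS_CMD exists only to allow a dry run), and the file is put back at its original path rather than at /.

```shell
#!/usr/bin/env bash
# HDFS_CMD is an assumption of this sketch: it defaults to the real CLI,
# and can be set to "echo hdfs" to print the commands instead of running them.
HDFS_CMD=${HDFS_CMD:-hdfs}

# reupload <hdfs_path> <local_workdir>
# Download the file, delete it from HDFS, then upload the local copy again.
# The && chain stops at the first failing step, so -rm never runs
# unless -get succeeded.
reupload() {
  hdfs_path=$1
  workdir=$2
  name=$(basename "$hdfs_path")
  $HDFS_CMD dfs -get "$hdfs_path" "$workdir/" &&
  $HDFS_CMD dfs -rm "$hdfs_path" &&
  $HDFS_CMD dfs -put "$workdir/$name" "$hdfs_path"
}

# Dry-run usage (prints the three commands, touches nothing):
#   HDFS_CMD="echo hdfs"; reupload /blocktest/testHdfsFile.txt /tmp
```

Running the dry run first makes it easy to confirm the paths before anything is deleted from HDFS.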

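③ Automatic repair

After a block replica is damaged, the DataNode does not notice it until it runs its next directory scan.
The directory scan runs every 6 hours (the value is in seconds):
dfs.datanode.directoryscan.interval : 21600
The block is not recovered until the DataNode sends its next block report to the NameNode.
The block report is also sent every 6 hours (the value is in milliseconds):
dfs.blockreport.intervalMsec : 21600000
Only after the NameNode receives the block report does it start the recovery.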
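
The two settings use different units, which is easy to misread. The arithmetic below converts both values quoted above into hours; seconds_to_hours and millis_to_hours are hypothetical helper names. On a live cluster the effective values could be read with hdfs getconf -confKey <key> instead of being hard-coded.

```shell
#!/usr/bin/env bash
# Convert a whole number of seconds to whole hours.
seconds_to_hours() {
  echo $(( $1 / 3600 ))
}

# Convert a whole number of milliseconds to whole hours.
millis_to_hours() {
  echo $(( $1 / 1000 / 3600 ))
}

# Values quoted in the text above.
echo "dfs.datanode.directoryscan.interval: $(seconds_to_hours 21600) h"
echo "dfs.blockreport.intervalMsec:        $(millis_to_hours 21600000) h"
```

Both lines print 6 h, matching the six-hour intervals described above.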

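IV. Summary

① Keep the relationship between HDFS files and blocks clear: a file is split into blocks, and each block is usually stored as three replicas.
② In production I generally prefer the manual repair approach, but it requires first deleting the corrupt block replica by hand.
Remember: delete the corrupt block file and its meta file, not the HDFS file.
Alternatively, get the file to local disk, delete it from HDFS, and upload it again.
Never run hdfs fsck / -delete for this: it deletes the corrupted files themselves, so the data is simply lost, unless losing the data does not matter or you are confident you can reload it into HDFS from another source.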
