剑指数据仓库 - Hadoop, Part 3

1. Review of the Previous Lesson

2. Hadoop Lesson 3

  • 2.1 Blocks
  • 2.2 Small files on HDFS
  • 2.3 HDFS architecture
  • 2.4 The DataNode
  • 2.5 Data repair on HDFS
  • 2.6 The SecondaryNameNode
  • 2.7 fsimage + editlog in detail
  • 2.8 Verifying whether a file is corrupted

3. Homework for This Lesson

1. Review of the Previous Lesson

  • https://blog.csdn.net/SparkOnYarn/article/details/105085009

Recap of the previous lesson:
1. A pseudo-distributed deployment can be understood as a miniature version of a full cluster.
2. On the cloud host, the port was changed to 38088 to guard against crypto-mining attacks.
3. The MapReduce example showed that input and output both live on HDFS, while YARN handles resource scheduling.
4. Changing the hostname on CentOS 6 and CentOS 7; /etc/hosts holds the internal IP, while the hosts file under the Windows C: drive holds the external IP.
5. jps output can be misleading; only processes confirmed with ps -ef | grep are really alive.
6. The Linux OOM killer and the periodic cleanup mechanism (e.g. of /tmp) can disrupt long-running big data jobs.

2. Hadoop Lesson 3

2.1 Blocks

  • https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

There are two block-related parameters:
dfs.blocksize 128m
dfs.replication 3

An everyday analogy:
You have 260 ml of water and bottles that each hold 128 ml. How many bottles do you need?
260 / 128 = 2 bottles, with 4 ml left over

p1 128 ml full
p2 128 ml full
p3 4 ml not full, but it still takes up a bottle
In the end the water occupies 3 bottles.

In big data terms, dfs.blocksize is the bottle size. In a pseudo-distributed deployment dfs.replication = 1, because there is only one DataNode. (Do not start three DataNodes on a single machine on different ports just to raise the replication factor; that practice is not recommended.)

  • Reading the figure below: a DN node holds p1. dfs.replication=3 means 3 copies in total, including the original. Note that the replication factor (3) must be <= the number of DN nodes, and p1/p2/p3 are placed on different machines. A 260 MB file uploaded to HDFS is split into 3 blocks.

(Figure 1)

Suppose a 260 MB file is uploaded to HDFS. It is cut into 3 blocks, and with dfs.replication=3 the stored blocks are:
128m 128m 128m
128m 128m 128m
4m 4m 4m

A common interview question:
A file is 160 MB, the block size is 128 MB, and the replication factor is 2.
How many blocks are actually stored, and how much space do they occupy?
160 / 128 = 1 remainder 32 MB, so each replica takes 2 blocks (128 MB + 32 MB)
Answer: 4 blocks are actually stored, occupying 320 MB of space.

Keep two points in mind: 1. uploading data to HDFS never adds content out of thin air, so the logical size does not grow; 2. dfs.blocksize is only the maximum block size, and a block that is not full still occupies its own block file (it takes only as much disk as its actual data, not a full 128 MB).
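To see which block size and replication factor were actually applied to a file already on HDFS, hdfs dfs -stat can print them. A minimal sketch; the path is a placeholder for any file you have uploaded:

# %n = name, %b = size in bytes, %o = block size, %r = replication factor
hdfs dfs -stat "%n: %b bytes, blocksize=%o, replication=%r" /path/to/your/file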

2.2 Small Files on HDFS

Is HDFS suitable for storing small files?
If not, why not?
Suppose the uploaded files are all small, say four files of 3, 5, 6 and 10 MB, with:
dfs.blocksize 128m (the block size)
dfs.replication 3

The number of blocks is then 4 files x 3 replicas = 12 blocks.
If instead these four files are merged before upload into a single 24 MB file, then with 3 replicas the storage takes only 3 blocks.

Block metadata is recorded on the master, the NameNode. If the NameNode has only, say, 4 GB of memory, then the fewer blocks there are, the less pressure it is under; the NameNode keeps track of which machines a file is spread across and where each block is stored.

What if small files already exist on HDFS?
Run a separate service that merges them.

The goal is to merge small files into large ones. The usual convention: the merged file should not exceed the block size; keep it to around 110 MB and stay clear of the limit, so that one file is exactly one block. A 129 MB file, for example, would still be split and leave behind a 1 MB tail block.

So HDFS was designed from the start for storing large files. How does J's company decide which small files should be merged together?

  • Write a shell script that picks out every file smaller than 10 MB, using 10 MB as the threshold (a sketch is given at the end of this subsection).

Flume --> how are small files controlled when Flume writes to HDFS?

Check file sizes on HDFS: hdfs dfs -ls -h /wordcount/input
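A minimal sketch of such a small-file filter, assuming the 10 MB threshold above; the default target directory is a placeholder, and the awk field positions follow the standard hdfs dfs -ls -R output (permissions, replication, owner, group, size, date, time, path):

#!/usr/bin/env bash
# Sketch only: list HDFS files smaller than a threshold so they can be merged later.
TARGET_DIR=${1:-/wordcount/input}      # directory to scan; placeholder default
THRESHOLD=$((10 * 1024 * 1024))        # 10 MB expressed in bytes

# Skip directories (permission string starts with 'd') and print path + size for small files.
hdfs dfs -ls -R "$TARGET_DIR" | awk -v limit="$THRESHOLD" '
  NF >= 8 && $1 !~ /^d/ && $5 + 0 < limit + 0 { printf "%s\t%d bytes\n", $8, $5 }
'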

2.3 HDFS Architecture

  • https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

The NN-DN master/slave architecture:

A first look at racks: a rack is a cabinet that holds blade servers lying flat (enterprise servers are thin, flat machines). At J's company one rack holds 5 blade servers. Each blade is configured with 256 GB of memory (8 x 32 GB), 56 physical cores, 4 SSDs of 500 GB each, 10 mechanical disks of 1 TB each at 10,000 RPM, and 2 GPUs (used for data mining); the price is roughly 85,000 to 100,000 RMB per blade. Why only 5 blades per rack? Because the amperage supplied to a rack is fixed, and each blade here is a 2-CPU machine. The memory can be upgraded; J's company later upgraded to 512 GB.

  • Reading the figure below: there are two racks; Rack1 holds three machines and Rack2 holds two. The NameNode plays the master role and
    stores: the file system namespace
    a. file names
    b. the directory structure
    c. file attributes: permissions, creation time, replication factor
    d. which data blocks each file is split into (taking replicas into account; essentially a map structure) --> and which DN nodes those blocks sit on

p1 p1 p1
p2 p2 p2
p3 p3 p3

Note: a single DN node never stores more than one replica of the same block, only one.
E.g., if DN1 already stores p1, another replica of p1 will not be placed on DN1; that is the point of distributing the storage.

To look at the block storage path directly on disk:

[root@hadoop001 subdir0]# pwd
/tmp/hadoop-hadoop/dfs/data/current/BP-269313764-172.17.0.5-1585035736843/current/finalized/subdir0/subdir0

// On the Linux disk the blocks appear in pairs: a block file plus its .meta checksum file
[root@hadoop001 subdir0]# ll
total 212
-rw-rw-r-- 1 hadoop hadoop     35 Mar 24 23:24 blk_1073741826
-rw-rw-r-- 1 hadoop hadoop     11 Mar 24 23:24 blk_1073741826_1002.meta
-rw-rw-r-- 1 hadoop hadoop     23 Mar 24 23:27 blk_1073741834
-rw-rw-r-- 1 hadoop hadoop     11 Mar 24 23:27 blk_1073741834_1010.meta
-rw-rw-r-- 1 hadoop hadoop    349 Mar 24 23:27 blk_1073741835
-rw-rw-r-- 1 hadoop hadoop     11 Mar 24 23:27 blk_1073741835_1011.meta
-rw-rw-r-- 1 hadoop hadoop  33588 Mar 24 23:27 blk_1073741836
-rw-rw-r-- 1 hadoop hadoop    271 Mar 24 23:27 blk_1073741836_1012.meta
-rw-rw-r-- 1 hadoop hadoop 141083 Mar 24 23:27 blk_1073741837
-rw-rw-r-- 1 hadoop hadoop   1111 Mar 24 23:27 blk_1073741837_1013.meta

2. For a hands-on feel, upload a local file, hadoop-2.6.0-cdh5.16.2.tar.gz; this 415 MB file will be split into 4 blocks.
[hadoop@hadoop001 software]$ du -sh hadoop-2.6.0-cdh5.16.2.tar.gz
415M    hadoop-2.6.0-cdh5.16.2.tar.gz
[hadoop@hadoop001 software]$ hdfs dfs -put hadoop-2.6.0-cdh5.16.2.tar.gz /ruozedata
20/03/25 21:26:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

3. Listing the same directory again, 4 new block files have clearly appeared:
[root@hadoop001 subdir0]# ll -lh
total 419M
-rw-rw-r-- 1 hadoop hadoop   35 Mar 24 23:24 blk_1073741826
-rw-rw-r-- 1 hadoop hadoop   11 Mar 24 23:24 blk_1073741826_1002.meta
-rw-rw-r-- 1 hadoop hadoop   23 Mar 24 23:27 blk_1073741834
-rw-rw-r-- 1 hadoop hadoop   11 Mar 24 23:27 blk_1073741834_1010.meta
-rw-rw-r-- 1 hadoop hadoop  349 Mar 24 23:27 blk_1073741835
-rw-rw-r-- 1 hadoop hadoop   11 Mar 24 23:27 blk_1073741835_1011.meta
-rw-rw-r-- 1 hadoop hadoop  33K Mar 24 23:27 blk_1073741836
-rw-rw-r-- 1 hadoop hadoop  271 Mar 24 23:27 blk_1073741836_1012.meta
-rw-rw-r-- 1 hadoop hadoop 138K Mar 24 23:27 blk_1073741837
-rw-rw-r-- 1 hadoop hadoop 1.1K Mar 24 23:27 blk_1073741837_1013.meta
-rw-rw-r-- 1 hadoop hadoop 128M Mar 25 21:27 blk_1073741838
-rw-rw-r-- 1 hadoop hadoop 1.1M Mar 25 21:27 blk_1073741838_1014.meta
-rw-rw-r-- 1 hadoop hadoop 128M Mar 25 21:27 blk_1073741839
-rw-rw-r-- 1 hadoop hadoop 1.1M Mar 25 21:27 blk_1073741839_1015.meta
-rw-rw-r-- 1 hadoop hadoop 128M Mar 25 21:27 blk_1073741840
-rw-rw-r-- 1 hadoop hadoop 1.1M Mar 25 21:27 blk_1073741840_1016.meta
-rw-rw-r-- 1 hadoop hadoop  31M Mar 25 21:27 blk_1073741841
-rw-rw-r-- 1 hadoop hadoop 242K Mar 25 21:27 blk_1073741841_1017.meta

// With a replication factor of 3, block blk_1073741838 would have 2 more copies on other machines
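To confirm the split from the HDFS side rather than by poking around the DataNode's local directories, hdfs fsck can list a file's blocks. A minimal sketch; the path assumes the -put above placed the tarball under /ruozedata:

# Lists every block of the file, its length, and which DataNodes hold its replicas.
hdfs fsck /ruozedata/hadoop-2.6.0-cdh5.16.2.tar.gz -files -blocks -locations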

Back to item d of the NameNode's metadata: the block map. The NN does not persist this mapping. Instead, as the cluster starts up and runs, the DNs periodically send block reports to the NN, and from those reports the NN maintains the mapping dynamically in memory.

The NameNode's role: it manages the file system namespace and maintains every file and directory in the file system tree. This information is persisted on the local disk in two kinds of files: the image file (fsimage) and the edit log (editlog).

  • Where fsimage and editlog are stored: [root@hadoop001 current]# pwd
    /tmp/hadoop-hadoop/dfs/name/current
    -rw-rw-r-- 1 hadoop hadoop 42 Mar 25 20:49 edits_0000000000000000179-0000000000000000180
    -rw-rw-r-- 1 hadoop hadoop 1048576 Mar 25 21:27 edits_inprogress_0000000000000000181
    -rw-rw-r-- 1 hadoop hadoop 1688 Mar 25 19:49 fsimage_0000000000000000178
    -rw-rw-r-- 1 hadoop hadoop 62 Mar 25 19:49 fsimage_0000000000000000178.md5
    -rw-rw-r-- 1 hadoop hadoop 1688 Mar 25 20:49 fsimage_0000000000000000180
    -rw-rw-r-- 1 hadoop hadoop 62 Mar 25 20:49 fsimage_0000000000000000180.md5

2.4 The DataNode

The DataNode is the slave node (dn for short). It stores data blocks and their checksums.
Communication with the NameNode:
a. every 3 seconds it sends a heartbeat to the NN to say "I am still alive" (parameter: dfs.heartbeat.interval)
b. every so often it sends a block report

DataNode heartbeat parameter:

name: dfs.heartbeat.interval
value: 3s
description: Determines datanode heartbeat interval in seconds. Can use the following suffix (case insensitive): ms(millis), s(sec), m(min), h(hour), d(day) to specify the time (such as 2s, 2m, 1h, etc.). Or provide complete number in seconds (such as 30 for 30 seconds).

Block report parameters:

name: dfs.blockreport.intervalMsec
value: 21600000 (milliseconds, i.e. 6 hours)
description: Determines block reporting interval in milliseconds.

name: dfs.datanode.directoryscan.interval
value: 21600s (also 6 hours)
description: Interval in seconds for Datanode to scan data directories and reconcile the difference between blocks in memory and on the disk. Support multiple time unit suffix(case insensitive), as described in dfs.heartbeat.interval.

Summary: the DataNode stores data blocks and their checksums, and talks to the NN, sending a heartbeat every 3 seconds. The parameters you may need to change are the two above; at J's company both dfs.blockreport.intervalMsec and dfs.datanode.directoryscan.interval are set to 3 hours.

  • If your company's data volume is modest, a few tens of TB, 1 h or 2 h is fine; with 1 or 2 PB of data, do not make the interval too short.
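Before changing these intervals it helps to confirm what the cluster is currently using. A minimal sketch with hdfs getconf, which prints the value of a single configuration key; to actually change a value, set the key in hdfs-site.xml and restart the DataNodes:

# Print the effective value of each interval parameter.
hdfs getconf -confKey dfs.heartbeat.interval
hdfs getconf -confKey dfs.blockreport.intervalMsec
hdfs getconf -confKey dfs.datanode.directoryscan.interval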

2.5 Data Repair on HDFS

[ruoze@ruozedata001 current]$ hdfs debug
Usage: hdfs debug [arguments]

These commands are for advanced users only.

Incorrect usages may result in data loss. Use at your own risk.

verifyMeta -meta <metadata-file> [-block <block-file>]
computeMeta -block <block-file> -out <output-metadata-file>
recoverLease -path <path> [-retries <num-retries>]

Manual repair (relies on there being other replicas):
[ruoze@ruozedata001 current]$ hdfs debug recoverLease -path xxx -retries 10

Automatic repair:
https://ruozedata.github.io/2019/06/06/%E7%94%9F%E4%BA%A7HDFS%20Block%E6%8D%9F%E5%9D%8F%E6%81%A2%E5%A4%8D%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5(%E5%90%AB%E6%80%9D%E8%80%83%E9%A2%98)/

There are cases where both manual and automatic repair fail; in a data warehouse you then fall back on data-quality checks and a data re-run (backfill) mechanism (full outer join).
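Before choosing between manual and automatic repair you first need to know what is damaged. A minimal sketch using standard commands:

# List files that currently have corrupt or missing blocks across the namespace.
hdfs fsck / -list-corruptfileblocks
# Check DataNode liveness and capacity; dead DNs are a common cause of missing replicas.
hdfs dfsadmin -report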

2.6 The SecondaryNameNode

SecondaryNameNode: the secondary name node, SNN for short. What it does is the checkpoint, a backup every hour (so in the worst case you lose up to one hour of edits, hence "h-1"). It stores fsimage + editlog. Its job is to periodically merge the fsimage and edit files into a new fsimage and push it to the NN; this action is called a checkpoint, and its purpose is backup.

(Figure 2)
Two parameters control checkpointing: a checkpoint fires when either the time threshold or the transaction-count threshold is reached.

name: dfs.namenode.checkpoint.period
value: 3600s
description: The number of seconds between two periodic checkpoints. Support multiple time unit suffix(case insensitive), as described in dfs.heartbeat.interval.

name: dfs.namenode.checkpoint.txns
value: 1000000
description: The Secondary NameNode or CheckpointNode will create a checkpoint of the namespace every 'dfs.namenode.checkpoint.txns' transactions, regardless of whether 'dfs.namenode.checkpoint.period' has expired.

Setting the checkpoint interval to 1 minute would clearly be unreasonable; HDFS would be busy doing nothing but checkpoints. J has seen a resource-constrained cluster that started without HA and set the Secondary's checkpoint to half an hour. Data loss is usually caused by disk failure.
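As a side note, the NameNode itself can be made to write a fresh fsimage on demand, which is handy before maintenance. This is a minimal sketch, not the SNN checkpoint flow described above, and it assumes a brief period of safe mode (writes are rejected while it is on) is acceptable:

# Force the NN to persist the current namespace as a new fsimage.
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave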

2.7 fsimage and editlog Merging in Detail

For example, at the start:
transactions 1 2 3 4 are in the full backup, fsimage1
transactions 5 6 are in editlog1
What the checkpoint does is: fsimage1 + editlog1 ==> fsimage2

The SNN pushes fsimage2 to the NN. Once the NN has that file, new writes 7 and 8 are recorded in editlog2.
An hour later, checkpoint 2 runs: fsimage2 + editlog2 ==> fsimage3.
So a checkpoint is not just a backup: each one merges everything into a single, up-to-date fsimage.

edit.new means a brand-new editlog file: once a checkpoint starts, the current editlog no longer accepts writes, and new edits can only go into edit.new.

1. The fsimage + editlog files under the NameNode:

[root@hadoop001 current]# ll
total 6340
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 15:54 edits_0000000000000000001-0000000000000000002
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 24 15:54 edits_0000000000000000003-0000000000000000003
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 16:23 edits_0000000000000000004-0000000000000000005
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 24 16:23 edits_0000000000000000006-0000000000000000006
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 24 16:25 edits_0000000000000000007-0000000000000000007
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 24 16:27 edits_0000000000000000008-0000000000000000008
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 16:31 edits_0000000000000000009-0000000000000000010
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 24 16:31 edits_0000000000000000011-0000000000000000011
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 16:49 edits_0000000000000000012-0000000000000000013
-rw-rw-r-- 1 hadoop hadoop     724 Mar 24 17:49 edits_0000000000000000014-0000000000000000023
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 18:49 edits_0000000000000000024-0000000000000000025
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 19:49 edits_0000000000000000026-0000000000000000027
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 20:49 edits_0000000000000000028-0000000000000000029
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 21:49 edits_0000000000000000030-0000000000000000031
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 22:49 edits_0000000000000000032-0000000000000000033
-rw-rw-r-- 1 hadoop hadoop   12981 Mar 24 23:49 edits_0000000000000000034-0000000000000000138
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 00:49 edits_0000000000000000139-0000000000000000140
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 01:49 edits_0000000000000000141-0000000000000000142
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 02:49 edits_0000000000000000143-0000000000000000144
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 03:49 edits_0000000000000000145-0000000000000000146
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 04:49 edits_0000000000000000147-0000000000000000148
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 05:49 edits_0000000000000000149-0000000000000000150
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 06:49 edits_0000000000000000151-0000000000000000152
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 07:49 edits_0000000000000000153-0000000000000000154
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 08:49 edits_0000000000000000155-0000000000000000156
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 09:49 edits_0000000000000000157-0000000000000000158
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 10:49 edits_0000000000000000159-0000000000000000160
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 11:49 edits_0000000000000000161-0000000000000000162
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 12:49 edits_0000000000000000163-0000000000000000164
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 13:49 edits_0000000000000000165-0000000000000000166
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 14:49 edits_0000000000000000167-0000000000000000168
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 15:49 edits_0000000000000000169-0000000000000000170
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 16:49 edits_0000000000000000171-0000000000000000172
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 17:49 edits_0000000000000000173-0000000000000000174
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 18:49 edits_0000000000000000175-0000000000000000176
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 19:49 edits_0000000000000000177-0000000000000000178
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 20:49 edits_0000000000000000179-0000000000000000180
-rw-rw-r-- 1 hadoop hadoop    1217 Mar 25 21:49 edits_0000000000000000181-0000000000000000197
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 22:49 edits_0000000000000000198-0000000000000000199
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 25 22:49 edits_inprogress_0000000000000000200
-rw-rw-r-- 1 hadoop hadoop    1835 Mar 25 21:49 fsimage_0000000000000000197
-rw-rw-r-- 1 hadoop hadoop      62 Mar 25 21:49 fsimage_0000000000000000197.md5
-rw-rw-r-- 1 hadoop hadoop    1835 Mar 25 22:49 fsimage_0000000000000000199
-rw-rw-r-- 1 hadoop hadoop      62 Mar 25 22:49 fsimage_0000000000000000199.md5
-rw-rw-r-- 1 hadoop hadoop       4 Mar 25 22:49 seen_txid
-rw-rw-r-- 1 hadoop hadoop     201 Mar 24 15:42 VERSION
[root@hadoop001 current]# pwd
/tmp/hadoop-hadoop/dfs/name/current

2. The fsimage + editlog files under the SecondaryNameNode:

[hadoop@hadoop001 current]$ ll
total 5308
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 15:54 edits_0000000000000000001-0000000000000000002
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 24 16:23 edits_0000000000000000003-0000000000000000003
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 16:23 edits_0000000000000000004-0000000000000000005
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 24 16:31 edits_0000000000000000006-0000000000000000006
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 24 16:31 edits_0000000000000000007-0000000000000000007
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 24 16:31 edits_0000000000000000008-0000000000000000008
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 16:31 edits_0000000000000000009-0000000000000000010
-rw-rw-r-- 1 hadoop hadoop 1048576 Mar 24 16:49 edits_0000000000000000011-0000000000000000011
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 16:49 edits_0000000000000000012-0000000000000000013
-rw-rw-r-- 1 hadoop hadoop     724 Mar 24 17:49 edits_0000000000000000014-0000000000000000023
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 18:49 edits_0000000000000000024-0000000000000000025
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 19:49 edits_0000000000000000026-0000000000000000027
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 20:49 edits_0000000000000000028-0000000000000000029
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 21:49 edits_0000000000000000030-0000000000000000031
-rw-rw-r-- 1 hadoop hadoop      42 Mar 24 22:49 edits_0000000000000000032-0000000000000000033
-rw-rw-r-- 1 hadoop hadoop   12981 Mar 24 23:49 edits_0000000000000000034-0000000000000000138
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 00:49 edits_0000000000000000139-0000000000000000140
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 01:49 edits_0000000000000000141-0000000000000000142
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 02:49 edits_0000000000000000143-0000000000000000144
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 03:49 edits_0000000000000000145-0000000000000000146
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 04:49 edits_0000000000000000147-0000000000000000148
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 05:49 edits_0000000000000000149-0000000000000000150
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 06:49 edits_0000000000000000151-0000000000000000152
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 07:49 edits_0000000000000000153-0000000000000000154
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 08:49 edits_0000000000000000155-0000000000000000156
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 09:49 edits_0000000000000000157-0000000000000000158
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 10:49 edits_0000000000000000159-0000000000000000160
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 11:49 edits_0000000000000000161-0000000000000000162
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 12:49 edits_0000000000000000163-0000000000000000164
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 13:49 edits_0000000000000000165-0000000000000000166
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 14:49 edits_0000000000000000167-0000000000000000168
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 15:49 edits_0000000000000000169-0000000000000000170
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 16:49 edits_0000000000000000171-0000000000000000172
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 17:49 edits_0000000000000000173-0000000000000000174
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 18:49 edits_0000000000000000175-0000000000000000176
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 19:49 edits_0000000000000000177-0000000000000000178
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 20:49 edits_0000000000000000179-0000000000000000180
-rw-rw-r-- 1 hadoop hadoop    1217 Mar 25 21:49 edits_0000000000000000181-0000000000000000197
-rw-rw-r-- 1 hadoop hadoop      42 Mar 25 22:49 edits_0000000000000000198-0000000000000000199
-rw-rw-r-- 1 hadoop hadoop    1835 Mar 25 21:49 fsimage_0000000000000000197
-rw-rw-r-- 1 hadoop hadoop      62 Mar 25 21:49 fsimage_0000000000000000197.md5
-rw-rw-r-- 1 hadoop hadoop    1835 Mar 25 22:49 fsimage_0000000000000000199
-rw-rw-r-- 1 hadoop hadoop      62 Mar 25 22:49 fsimage_0000000000000000199.md5
-rw-rw-r-- 1 hadoop hadoop     201 Mar 25 22:49 VERSION
[hadoop@hadoop001 current]$ pwd
/tmp/hadoop-hadoop/dfs/namesecondary/current

Analysis:

  • On the SecondaryNameNode: fsimage_0000000000000000197 + edits_0000000000000000198-0000000000000000199 ==> fsimage_0000000000000000199;
    this latest fsimage is pushed to the NN. On the NN there is also an editlog file still being written, edits_inprogress_0000000000000000200; if that in-progress file were lost, we could only recover up to fsimage_0000000000000000199.

  • The timestamps confirm it: under the SNN directory, fsimage_0000000000000000197 from 21:49 plus edits_0000000000000000198-0000000000000000199 from 22:49 produce fsimage_0000000000000000199, timestamped 22:49. The same file appears under the NN directory at the same moment, which we can read as the SNN pushing it over: fsimage_0000000000000000199 at 22:49, while edits_inprogress_0000000000000000200 is being written at 22:49 as well.

Checkpoint flow summary:
1. roll the edit log
2. transfer fsimage + edits (there may be several edits files) to the SNN
3. merge them
4. transfer fsimage.ckpt back to the NN
5. roll over: fsimage.ckpt ==> fsimage
   edit.new ==> edit

(Figure 3)
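If you want to look inside these files, Hadoop ships an offline image viewer (hdfs oiv) and an offline edits viewer (hdfs oev). A minimal sketch using the file names from the listings above; the /tmp output paths are arbitrary:

# Dump an fsimage and one edits segment to XML for inspection (read-only; safe to run).
cd /tmp/hadoop-hadoop/dfs/name/current
hdfs oiv -p XML -i fsimage_0000000000000000199 -o /tmp/fsimage_199.xml
hdfs oev -p XML -i edits_0000000000000000198-0000000000000000199 -o /tmp/edits_198_199.xml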

2.8 Verifying Whether a File Is Corrupted

1. The procedure is as follows:

[hadoop@hadoop001 current]$ which sha1sum
/usr/bin/sha1sum
[hadoop@hadoop001 current]$ /bin/md5sum fsimage_0000000000000000199
-bash: /bin/md5sum: No such file or directory
[hadoop@hadoop001 current]$ cat fsimage_0000000000000000199.md5
  • The value computed by md5sum should match the value stored in the .md5 file shown by cat; if the checksums agree, the file is not corrupted.
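A minimal sketch of the comparison, assuming md5sum (part of coreutils) is installed; the attempt above only failed because /bin/md5sum was not the right path, so locate it first:

cd /tmp/hadoop-hadoop/dfs/name/current
which md5sum                               # locate md5sum on this machine
md5sum fsimage_0000000000000000199         # compute the checksum of the image file
cat fsimage_0000000000000000199.md5        # checksum recorded by HDFS at checkpoint time
# If the two MD5 values are identical, the fsimage is intact.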

3. Homework for This Lesson

1. HDFS currently stores its data under /tmp; change the storage directory to /home/hadoop/tmp.

2. Write up your understanding of block size, replication factor, and small files.

3. Write up the HDFS architecture and the SNN checkpoint flow.
