LeaseManager provides the methods for lease recovery.
The lease recovery algorithm is as follows:
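1. The name node (nn) looks up the lease information.
2. For each file f in the lease, take its last block b and process it as follows:
2.1 Get all data nodes (dn) that hold block b.
2.2 Pick one of these data nodes as the primary data node p.
2.3 p obtains a new generation stamp from the nn.
2.4 p fetches the block info from each dn.
2.5 p computes the minimum block length.
2.6 p updates the dns with the minimum block length and the newly generated generation stamp (described in detail later).
2.7 p acks the result of the update to the nn.
2.8 The nn updates the block info.
2.9 The nn removes the lease of file f.
2.10 The nn commits this lease change to the edit log.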
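Steps 2.3 to 2.10 can be sketched as follows. This is a minimal in-memory illustration, not Hadoop code: `Replica`, `NameNode`, `commit_recovery`, and `recover_block` are hypothetical names, and the RPCs between p, the dns, and the nn are replaced by plain method calls.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One data node's copy of block b (hypothetical model)."""
    datanode: str
    length: int
    generation_stamp: int


@dataclass
class NameNode:
    """Hypothetical stand-in for the nn: stamps, block info, leases, edit log."""
    next_stamp: int
    block_info: dict = field(default_factory=dict)  # block id -> (length, stamp)
    leases: set = field(default_factory=set)        # files with an open lease
    edit_log: list = field(default_factory=list)

    def new_generation_stamp(self) -> int:
        self.next_stamp += 1
        return self.next_stamp

    def commit_recovery(self, path, block_id, length, stamp):
        # Steps 2.8 to 2.10: update block info, drop the lease, log the change.
        self.block_info[block_id] = (length, stamp)
        self.leases.discard(path)
        self.edit_log.append(("close", path, block_id, length, stamp))


def recover_block(nn, path, block_id, replicas):
    """Work done by the primary data node p for one block."""
    stamp = nn.new_generation_stamp()             # step 2.3
    min_length = min(r.length for r in replicas)  # steps 2.4 and 2.5
    for r in replicas:                            # step 2.6
        r.length = min_length
        r.generation_stamp = stamp
    nn.commit_recovery(path, block_id, min_length, stamp)  # steps 2.7 to 2.10
    return min_length, stamp
```

With three replicas of lengths 4096, 3584, and 4096, `recover_block` settles on 3584 for every replica and stamps all of them with the same new generation stamp.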
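To understand this strategy, we looked at https://issues.apache.org/jira/browse/HADOOP-1700.
The lease recovery strategy takes, among all valid dns, the replica with the smallest block size as the reference. A new generation stamp is requested from the nn for this purpose and distributed to all of the dns. Finally, the nn writes the operation to the edit log. As the design document excerpt below states, the minimum-size block is chosen to reduce the load on the system.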
Lease recovery
The Namenode creates an in-memory lease-record when a file is opened for write (or append). The lease record contains the filename that was being written to. If a lease for a file expires (typically after 1 hour), the Namenode starts lease recovery of that lease.
The Namenode contacts the Datanodes where the last block was being written to and fetches the BlockGenerationStamp from each of them. The Datanode(s) that have a DataGenerationStamp that is equal to or greater than the number stored in the BlocksMap have a good replica of the data. Any Datanode that has a BlockGenerationStamp that is larger than what is stored in the BlocksMap is guaranteed to contain data from the last successful write to that block. The block size of each of these replicas could still be different because the write from a client might not have been committed to all replicas. The Lease Recovery process should pick one of these blocks to have the right size of data and then ensure that all the good blocks have the same size. There are three possible algorithms that it can choose:
1. Choose the replica that has the maximum size, copy data from larger replicas to the shorter replicas
2. Choose the replica that has the minimum size, truncate all larger replicas to this size
3. Choose size that is the majority of all known good replicas, then truncate larger replicas and copy data to smaller replicas
All of the above algorithms are appropriate because the Client-write did not complete and so it is up to the system to determine how much data it chooses to persist.
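To make the three options concrete, here is a small sketch that filters the good replicas by generation stamp and computes the size each option would settle on. The function names and the `(length, stamp)` tuple representation are assumptions for illustration only. "Majority" is approximated as the most common length, which is also an assumption when no length has a strict majority.

```python
from collections import Counter


def good_lengths(replicas, recorded_stamp):
    """Lengths of replicas whose stamp is >= the stamp in the BlocksMap.

    `replicas` is a list of (length, generation_stamp) tuples.
    """
    return [length for length, stamp in replicas if stamp >= recorded_stamp]


def candidate_sizes(lengths):
    """Size chosen by each of the three options."""
    most_common_length, _ = Counter(lengths).most_common(1)[0]
    return {
        "maximum": max(lengths),
        "minimum": min(lengths),
        "majority": most_common_length,
    }


def bytes_to_copy(lengths, target):
    """Bytes that shorter replicas would have to receive to reach `target`."""
    return sum(target - n for n in lengths if n < target)
```

For replicas `[(4096, 7), (3584, 7), (4096, 7), (2048, 6)]` and a recorded stamp of 7, the last replica is discarded as stale. The maximum and majority options both pick 4096 and require copying 512 bytes to the short replica, while the minimum option picks 3584 and requires no copying at all, only truncation.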
We choose the second option because it incurs the least overhead for the system. We pick the size of the minimum-size-block as the chosen size of this block. The Namenode generates a new BlockGenerationStamp for this block and sends it and the chosen size to this subset of Datanodes. The Datanodes close client-connections (if any) to that block, truncate the block file to the chosen size (if needed), persist this BlockGenerationStamp in the block meta-file and send confirmation back to the Namenode. The truncation of the block file and the storage of the new BlockGenerationStamp can be done in any order. The Namenode then writes a CloseFile record into the transaction log and deletes the lease record.
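The Datanode side of this step (truncate if needed, then persist the new stamp) might look roughly like the sketch below. It is not the HDFS implementation: `apply_recovery` and `demo` are hypothetical names, and the meta-file layout used here (a single decimal number) is a made-up simplification of the real block meta-file.

```python
import os
import tempfile


def apply_recovery(block_path, meta_path, chosen_size, new_stamp):
    """Truncate the block file to chosen_size if it is longer, then store the stamp."""
    if os.path.getsize(block_path) > chosen_size:
        os.truncate(block_path, chosen_size)
    # Write the stamp to a temporary file and rename it into place, so that a
    # crash does not leave a half-written meta file behind.
    tmp_path = meta_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(new_stamp))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, meta_path)
    return os.path.getsize(block_path)


def demo():
    """Create a 4096-byte block file, recover it to 3584 bytes with stamp 101."""
    with tempfile.TemporaryDirectory() as d:
        block = os.path.join(d, "blk_1")
        meta = os.path.join(d, "blk_1.meta")
        with open(block, "wb") as f:
            f.write(b"x" * 4096)
        size = apply_recovery(block, meta, 3584, 101)
        with open(meta) as f:
            return size, int(f.read())
```

The sketch truncates first and writes the stamp second, but as the excerpt notes, the two actions can be done in either order.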