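Version: Hadoop 2.7.0
The DataNode is the HDFS component responsible for storing and managing data. It stores blocks and serves block reads and writes, reports block information to the NameNode, and responds to commands issued by the NameNode such as cache, delete, move, and replicate. It also accepts client requests to read and write blocks, and it communicates with other DataNodes, for example to set up the data pipeline during a write and to report a successful write.
Note: the main text of the article follows. Corrections are welcome.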
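Class | Function |
---|---|
DataStorage | Manages and organizes the on-disk storage directories (current, previous, detach, tmp, etc.) |
DataXceiverServer | Handles block read and write operations |
BlockPoolManager | Manages the BPOfferService instances |
BPOfferService | Manages the BPServiceActor instances |
BPServiceActor | One per NN; handles the DN pre-registration handshake, DN registration, DN heartbeats, and the commands the NN sends back |
RPC.Server | The HDFS RPC server |
Listener | Creates the server socket and listens for accept events |
Handler | Processes requests and sends responses to the client |
Responder | Helps send responses to the client when the Handler is under too much load to finish sending them itself |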
void startDataNode(Configuration conf,
List<StorageLocation> dataDirs,
SecureResources resources
) throws IOException {
// Initialize DataStorage
storage = new DataStorage();
// global DN settings
// Register the DataNode with the MBeanServer as an MXBean
registerMXBean();
// Initialize the DataXceiverServer, which serves block reads and writes
initDataXceiver(conf);
// Start the HTTP info server
startInfoServer(conf);
pauseMonitor = new JvmPauseMonitor(conf);
pauseMonitor.start();
// BlockPoolTokenSecretManager is required to create ipc server.
// The class that manages the BlockTokenSecretManagers
this.blockPoolTokenSecretManager = new BlockPoolTokenSecretManager();
// Login is done by now. Set the DN user name.
dnUserName = UserGroupInformation.getCurrentUser().getShortUserName();
LOG.info("dnUserName = " + dnUserName);
LOG.info("supergroup = " + supergroup);
// Initialize the DataNode's RPC server
initIpcServer(conf);
// Initialize metrics, exposed through JMX as a dynamic MBean
metrics = DataNodeMetrics.create(conf, getDisplayName());
metrics.getJvmMetrics().setPauseMonitor(pauseMonitor);
// The class that manages block pools
blockPoolManager = new BlockPoolManager(this);
// Refresh the DN's list of NNs, creating the BPOfferServices and their actors
blockPoolManager.refreshNamenodes(conf);
}
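Each BPServiceActor corresponds to one NameNode, and every DN must register with every NN, so every BPServiceActor has to register with its NameNode. How are the BPServiceActors created? When the DataNode initializes, it refreshes its local list of NameNodes. During that refresh it reads the NameNode HA settings from the configuration: each nameservice gets one BPOfferService, and each NameNode address under a nameservice gets one BPServiceActor.
The method that refreshes the NameNodes: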
private void doRefreshNamenodes(
Map<String, Map<String, InetSocketAddress>> addrMap) throws IOException {
assert Thread.holdsLock(refreshNamenodesLock);
// nameservices that need to be refreshed
Set<String> toRefresh = Sets.newLinkedHashSet();
// nameservices that need to be added
Set<String> toAdd = Sets.newLinkedHashSet();
// nameservices that need to be removed
Set<String> toRemove;
synchronized (this) {
// Step 1. For each of the new nameservices, figure out whether
// it's an update of the set of NNs for an existing NS,
// or an entirely new nameservice.
for (String nameserviceId : addrMap.keySet()) {
// If it already exists, add it to the refresh list; otherwise add it to the add list
if (bpByNameserviceId.containsKey(nameserviceId)) {
toRefresh.add(nameserviceId);
} else {
toAdd.add(nameserviceId);
}
}
// Step 2. Any nameservices we currently have but are no longer present
// need to be removed.
// toRemove = bpByNameserviceId - addrMap
toRemove = Sets.newHashSet(Sets.difference(
bpByNameserviceId.keySet(), addrMap.keySet()));
assert toRefresh.size() + toAdd.size() ==
addrMap.size() :
"toAdd: " + Joiner.on(",").useForNull("" ).join(toAdd) +
" toRemove: " + Joiner.on(",").useForNull("" ).join(toRemove) +
" toRefresh: " + Joiner.on(",").useForNull("" ).join(toRefresh);
// Step 3. Start new nameservices
if (!toAdd.isEmpty()) {
LOG.info("Starting BPOfferServices for nameservices: " +
Joiner.on(",").useForNull("" ).join(toAdd));
// nsToAdd is a nameservice that does not have a BPOfferService yet
for (String nsToAdd : toAdd) {
// In a typical HA setup each nameservice has two NNs
ArrayList<InetSocketAddress> addrs =
Lists.newArrayList(addrMap.get(nsToAdd).values());
// Each nameservice gets one BPOfferService
BPOfferService bpos = createBPOS(addrs);
bpByNameserviceId.put(nsToAdd, bpos);
offerServices.add(bpos);
}
}
startAll();
}
// Step 4. Shut down old nameservices. This happens outside
// of the synchronized(this) lock since they need to call
// back to .remove() from another thread
if (!toRemove.isEmpty()) {
LOG.info("Stopping BPOfferServices for nameservices: " +
Joiner.on(",").useForNull("" ).join(toRemove));
for (String nsToRemove : toRemove) {
BPOfferService bpos = bpByNameserviceId.get(nsToRemove);
bpos.stop();
bpos.join();
// they will call remove on their own
}
}
// Step 5. Update nameservices whose NN list has changed
if (!toRefresh.isEmpty()) {
LOG.info("Refreshing list of NNs for nameservices: " +
Joiner.on(",").useForNull("" ).join(toRefresh));
for (String nsToRefresh : toRefresh) {
BPOfferService bpos = bpByNameserviceId.get(nsToRefresh);
ArrayList<InetSocketAddress> addrs =
Lists.newArrayList(addrMap.get(nsToRefresh).values());
bpos.refreshNNList(addrs);
}
}
}
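Steps 1 and 2 of doRefreshNamenodes amount to partitioning nameservice IDs into three sets. The following is a minimal sketch of that partition using only java.util collections. The class `NameserviceDiff` and its fields are hypothetical names made up for illustration; they are not part of Hadoop, and the real method also handles locking and BPOfferService lifecycle.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not a Hadoop class: it only reproduces the set
// arithmetic of steps 1 and 2 in doRefreshNamenodes.
final class NameserviceDiff {
    final Set<String> toAdd = new LinkedHashSet<>();
    final Set<String> toRefresh = new LinkedHashSet<>();
    final Set<String> toRemove = new LinkedHashSet<>();

    NameserviceDiff(Set<String> running, Set<String> configured) {
        // Step 1: a configured nameservice is refreshed if it is already
        // running, otherwise it is new.
        for (String ns : configured) {
            if (running.contains(ns)) {
                toRefresh.add(ns);
            } else {
                toAdd.add(ns);
            }
        }
        // Step 2: toRemove = running - configured.
        for (String ns : running) {
            if (!configured.contains(ns)) {
                toRemove.add(ns);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> running = new LinkedHashSet<>(Arrays.asList("ns1", "ns2"));
        Set<String> configured = new LinkedHashSet<>(Arrays.asList("ns2", "ns3"));
        NameserviceDiff diff = new NameserviceDiff(running, configured);
        System.out.println("toAdd=" + diff.toAdd
            + " toRefresh=" + diff.toRefresh
            + " toRemove=" + diff.toRemove);
    }
}
```

By construction every configured nameservice lands in exactly one of `toAdd` or `toRefresh`, which is the invariant the assert in the original method checks.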
The BPOfferService constructor follows the NN configuration under the nameservice and creates one BPServiceActor per NN:
BPOfferService(List<InetSocketAddress> nnAddrs, DataNode dn) {
Preconditions.checkArgument(!nnAddrs.isEmpty(),
"Must pass at least one NN.");
this.dn = dn;
for (InetSocketAddress addr : nnAddrs) {
this.bpServices.add(new BPServiceActor(addr, this));
}
}
Refreshing the NN list of an existing nameservice is not currently supported; the DN has to be restarted:
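/**
* Adding a new (standby) NN to a running DN is not currently supported.
* Original NN list:
* nameservice1
*   192.168.2.100
*   192.168.2.101
* nameservice2
*   192.168.2.200
*   192.168.2.201
* Current NN list:
* nameservice1
*   192.168.2.100
*   192.168.2.103
* nameservice2
*   192.168.2.200
*   192.168.2.202
* A change of NN nodes like the one above cannot be applied and throws an IOException.
* When does this IOException occur? When ClientDatanodeProtocol#refreshNamenodes is called.
* For now, after changing the addresses under a nameservice, the DN must be restarted
* for the change to take effect.
* @param addrs the NN addresses now configured for this nameservice
* @throws IOException if the configured addresses differ from the ones currently in use
*/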
void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
for (BPServiceActor actor : bpServices) {
oldAddrs.add(actor.getNNSocketAddress());
}
Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);
if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
// Keep things simple for now -- we can implement this at a later date.
throw new IOException(
"HA does not currently support adding a new standby to a running DN. " +
"Please do a rolling restart of DNs to reconfigure the list of NNs.");
}
}
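The check in refreshNNList relies on the symmetric difference of the old and new address sets: it is empty only when both sets hold exactly the same addresses. The sketch below reproduces that check with plain java.util sets instead of Guava. The class name `NNListCheck` and the port 8020 are assumptions for illustration, and the addresses are the ones from the comment above.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a Hadoop class.
final class NNListCheck {
    // Elements that are in exactly one of the two sets.
    static <T> Set<T> symmetricDifference(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.addAll(b);            // union
        Set<T> common = new HashSet<>(a);
        common.retainAll(b);         // intersection
        result.removeAll(common);    // union minus intersection
        return result;
    }

    // Builds an address without triggering a DNS lookup; port 8020 is an assumed value.
    static InetSocketAddress nn(String host) {
        return InetSocketAddress.createUnresolved(host, 8020);
    }

    public static void main(String[] args) {
        Set<InetSocketAddress> oldAddrs =
            new HashSet<>(Arrays.asList(nn("192.168.2.100"), nn("192.168.2.101")));
        Set<InetSocketAddress> newAddrs =
            new HashSet<>(Arrays.asList(nn("192.168.2.100"), nn("192.168.2.103")));
        // A non-empty difference is the condition under which refreshNNList throws.
        System.out.println("changed: " + !symmetricDifference(oldAddrs, newAddrs).isEmpty());
    }
}
```

Because the check is symmetric, removing an NN is rejected in the same way as adding one.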
Each BPServiceActor connects to its NN through connectToNNAndHandshake(), which proceeds in the following steps:
private void connectToNNAndHandshake() throws IOException {
// get NN proxy(DatanodeProtocol)
bpNamenode = dn.connectToNN(nnAddr);
// First phase of the handshake with NN - get the namespace
// info.
// Fetch the NN's version info over RPC and check that the DN's version matches the NN's; the NN's version must not be lower than the configured minimum NN version
NamespaceInfo nsInfo = retrieveNamespaceInfo();
// Verify that this matches the other NN in this HA pair.
// This also initializes our block pool in the DN if we are
// the first NN connection for this BP.
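// 1. Verify
// 1.1 that the block pool ID matches
// 1.2 that the namespace ID matches
// 1.3 that the cluster ID matches
// 2. Initialize the block pool
// blockPoolManager --> Map bpByBlockPoolId --> add block pool
// This shows that one DN can hold several block pools; a block pool is only a logical concept
// Each namespace (one NN, or an HA pair of NNs) has one block pool; in a federated cluster the different NNs share all the DNs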
bpos.verifyAndSetNamespaceInfo(nsInfo);
// Second phase of the handshake with the NN.
// Register with the NN, sending the IP, hostname, server port, HTTP port, version, storageInfo, and so on
register(nsInfo);
}
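The verification in the first handshake phase can be pictured as "the first NN to connect sets the identity, every later NN must match it". The following sketch illustrates that idea only. `NamespaceCheck` and its method are hypothetical names, the IDs in `main` are made up, and the real verifyAndSetNamespaceInfo works on a NamespaceInfo object and also initializes storage for the block pool.

```java
import java.io.IOException;

// Hypothetical helper, not a Hadoop class: models the "set on first
// connection, verify on later connections" behaviour described above.
final class NamespaceCheck {
    private String blockPoolId;
    private int namespaceId;
    private String clusterId;
    private boolean initialized;

    // Returns true when this call initialized the block pool (first NN to connect).
    synchronized boolean verifyAndSet(String bpId, int nsId, String cId) throws IOException {
        if (!initialized) {
            blockPoolId = bpId;
            namespaceId = nsId;
            clusterId = cId;
            initialized = true;
            return true;
        }
        // A later NN of the same nameservice must report identical IDs.
        checkEqual(blockPoolId, bpId, "Block pool ID");
        checkEqual(namespaceId, nsId, "Namespace ID");
        checkEqual(clusterId, cId, "Cluster ID");
        return false;
    }

    private static void checkEqual(Object ours, Object theirs, String what) throws IOException {
        if (!ours.equals(theirs)) {
            throw new IOException(what + " mismatch: previously " + ours + ", now " + theirs);
        }
    }

    public static void main(String[] args) throws IOException {
        NamespaceCheck check = new NamespaceCheck();
        // Made-up IDs for illustration only.
        System.out.println("first NN initialized pool: " + check.verifyAndSet("BP-1", 42, "CID-a"));
        System.out.println("second NN initialized pool: " + check.verifyAndSet("BP-1", 42, "CID-a"));
    }
}
```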
By default the DN sends a heartbeat to the NameNode every 3 s. The message body is as follows:
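Field | Meaning |
---|---|
DatanodeRegistration | Contains StorageInfo and DatanodeID |
StorageReport[] | Storage information for each storage directory: total volume capacity, used space, remaining space, and the disk space used by the current block pool |
CacheCapacity | Cache capacity |
CacheUsed | Cache space in use |
XmitsInProgress | Number of threads currently transferring data |
XceiverCount | Number of active DataXceiver threads, i.e. concurrent block read/write operations |
numFailedVolumes | Number of failed storage volumes |
volumeFailureSummary | Summary information about the failed storage volumes |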
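The NameNode's response is as follows:
Field | Meaning |
---|---|
commands | The set of commands |
haStatus | The NN's HA status, containing the state and the txid (transaction ID) |
rollingUpdateStatus | The rolling upgrade status |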
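As noted above, a BPServiceActor is what talks to an NN, with one BPServiceActor per NameNode. The BPOfferService has to know which BPServiceActor is attached to the ACTIVE NN so that later interactions go to that NN. Every heartbeat response carries the NN's HA state, so an HA check runs on each heartbeat to pick the node that is ACTIVE at that moment. Two cases require a switch. In the first, the NN that the bpos considers active has gone from active to standby. In the second, an NN that the bpos does not consider active has gone from standby to active.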
void updateActorStatesFromHeartbeat(
BPServiceActor actor,
NNHAStatusHeartbeat nnHaState) {
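// Take the write lock
writeLock();
try {
final long txid = nnHaState.getTxId();
// Does this heartbeat response say the NN is active?
final boolean nnClaimsActive =
nnHaState.getState() == HAServiceState.ACTIVE;
// Does the bpos currently consider this actor's NN the active one?
final boolean bposThinksActive = bpServiceToActive == actor;
// Is this claim's transaction ID newer than the last accepted one?
final boolean isMoreRecentClaim = txid > lastActiveClaimTxId;
// This NN claims active but is not the one the bpos considers active, so an HA failover has happened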
if (nnClaimsActive && !bposThinksActive) {
LOG.info("Namenode " + actor + " trying to claim ACTIVE state with " +
"txid=" + txid);
// A claim whose transaction ID is not newer is ignored
if (!isMoreRecentClaim) {
// Split-brain scenario - an NN is trying to claim active
// state when a different NN has already claimed it with a higher
// txid.
LOG.warn("NN " + actor + " tried to claim ACTIVE state at txid=" +
txid + " but there was already a more recent claim at txid=" +
lastActiveClaimTxId);
return;
} else {
// If the bpos has not selected an active node yet, simply take this NN as the active node
if (bpServiceToActive == null) {
LOG.info("Acknowledging ACTIVE Namenode " + actor);
} else {
LOG.info("Namenode " + actor + " taking over ACTIVE state from " +
bpServiceToActive + " at higher txid=" + txid);
}
bpServiceToActive = actor;
}
} else if (!nnClaimsActive && bposThinksActive) {
// This node is not active, yet it is the one the bpos considers active, so the active node has switched to standby
// Set bpServiceToActive to null and wait for the next heartbeat
LOG.info("Namenode " + actor + " relinquishing ACTIVE state with " +
"txid=" + nnHaState.getTxId());
bpServiceToActive = null;
}
// If this actor is now the active node, record the latest txid
if (bpServiceToActive == actor) {
assert txid >= lastActiveClaimTxId;
lastActiveClaimTxId = txid;
}
} finally {
writeUnlock();
}
}
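To make the state transitions easier to follow, here is a stripped-down model of the same decision logic that can be run on its own. It is a sketch under stated assumptions: `ActiveTracker` is a hypothetical class, actors are represented by plain strings, the write lock is replaced by `synchronized`, and the starting value of -1 for the last claimed txid is an assumption.

```java
// Hypothetical model, not a Hadoop class: actors are plain strings here.
final class ActiveTracker {
    private String active;                   // actor currently believed ACTIVE, or null
    private long lastActiveClaimTxId = -1;   // assumed starting value

    // Processes one heartbeat response and returns the actor believed ACTIVE afterwards.
    synchronized String onHeartbeat(String actor, boolean claimsActive, long txid) {
        boolean thinksActive = actor.equals(active);
        if (claimsActive && !thinksActive) {
            if (txid <= lastActiveClaimTxId) {
                // Stale claim (possible split brain): keep the current view.
                return active;
            }
            active = actor;                  // first active NN, or a failover
        } else if (!claimsActive && thinksActive) {
            active = null;                   // active NN stepped down; wait for the next claim
        }
        if (actor.equals(active)) {
            lastActiveClaimTxId = txid;      // remember the newest accepted claim
        }
        return active;
    }

    public static void main(String[] args) {
        ActiveTracker tracker = new ActiveTracker();
        System.out.println(tracker.onHeartbeat("nn1", true, 100));
        System.out.println(tracker.onHeartbeat("nn2", true, 90));
        System.out.println(tracker.onHeartbeat("nn1", false, 120));
        System.out.println(tracker.onHeartbeat("nn2", true, 130));
    }
}
```

The txid comparison is what protects the DN from an old active NN that has not yet noticed it lost its role: its claim carries a lower transaction ID and is dropped.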
final static int DNA_UNKNOWN = 0; // unknown action
final static int DNA_TRANSFER = 1; // transfer blocks to another datanode
final static int DNA_INVALIDATE = 2; // invalidate blocks
final static int DNA_SHUTDOWN = 3; // shutdown node (not supported)
final static int DNA_REGISTER = 4; // re-register
final static int DNA_FINALIZE = 5; // finalize previous upgrade
final static int DNA_RECOVERBLOCK = 6; // request a block recovery
final static int DNA_ACCESSKEYUPDATE = 7; // update access key
final static int DNA_BALANCERBANDWIDTHUPDATE = 8; // update balancer bandwidth
final static int DNA_CACHE = 9; // cache blocks
final static int DNA_UNCACHE = 10; // uncache blocks
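The DN periodically sends a full block report, every 6 hours by default. It also reports deleted blocks, by default every 300 s (100 × the heartbeat interval).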
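The arithmetic behind these defaults is short enough to spell out. The sketch below assumes the stock defaults of `dfs.heartbeat.interval` (3 seconds) and `dfs.blockreport.intervalMsec` (21600000 ms); `ReportIntervals` is a hypothetical class used only to show the calculation.

```java
// Hypothetical helper, not a Hadoop class.
final class ReportIntervals {
    static final long HEARTBEAT_INTERVAL_SEC = 3;              // default of dfs.heartbeat.interval
    static final long BLOCK_REPORT_INTERVAL_MS = 21_600_000L;  // default of dfs.blockreport.intervalMsec

    // The deleted-block report interval is derived as 100 heartbeats.
    static long deleteReportIntervalSec(long heartbeatSec) {
        return 100 * heartbeatSec;
    }

    static long toHours(long millis) {
        return millis / (60L * 60L * 1000L);
    }

    public static void main(String[] args) {
        System.out.println(deleteReportIntervalSec(HEARTBEAT_INTERVAL_SEC)); // prints 300
        System.out.println(toHours(BLOCK_REPORT_INTERVAL_MS));               // prints 6
    }
}
```

Since the deleted-block report interval is derived from the heartbeat interval, changing the heartbeat interval changes it too.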
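DN startup initializes many services, most of which serve data reads and writes. The DN also keeps a periodic heartbeat with the NN, and the NN uses the heartbeat responses to send commands to the DN. The NN cannot operate on a DN directly; it can only interact with it through heartbeats.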