HDFS Source Code: DataNode Startup Flow

Version: Hadoop 2.7.0

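Contents

  • HDFS Source Code: DataNode Startup Flow
    • Preface
    • I. DataNode Startup
      • 1. Related Classes
      • 2. Key Code
    • II. DN Registration Flow
      • 1. Creating the BPServiceActor
      • 2. DataNode Registration
      • 3. DataNode Heartbeats
      • 4. How the DataNode Identifies the ACTIVE BPServiceActor
      • 5. Executing the Commands Returned by the NameNode
      • 6. Block Reports
    • III. Summary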


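Preface

The DataNode is the HDFS component responsible for storing and managing data. It stores blocks and serves block reads and writes, reports block information to the NameNode, and responds to the commands the NameNode issues, such as cache, delete, move, and replicate. It also serves block read and write requests from clients and communicates with other DataNodes, for example to set up the data pipeline during a write and to report once the write has succeeded.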


Note: the main content of the article follows. Corrections are welcome.

I. DataNode Startup

1. Related Classes

Note: the key classes are marked in red.
[Figure 1: HDFS source code, DataNode startup flow]

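Class               Function
DataStorage         Manages and organizes the on-disk storage directories (current, previous, detach, tmp, etc.)
DataXceiverServer   Handles block read and write operations
BlockPoolManager    Manages the BPOfferService instances
BPOfferService      Manages the BPServiceActor instances
BPServiceActor      One per NN; performs the DN handshake (pre-registration), DN registration, and DN heartbeats, and processes the commands the NN returns
RPC.Server          The HDFS RPC server
Listener            Creates the server socket and listens for Accept events
Handler             Processes each request and sends the response to the client
Responder           Helps send responses to clients when the Handlers are under too much load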

2. Key Code

void startDataNode(Configuration conf,
                     List<StorageLocation> dataDirs,
                     SecureResources resources
                     ) throws IOException {
    // Initialize DataStorage
    storage = new DataStorage();
    // global DN settings
    // Register the DataNode with the MBeanServer (standard MBean)
    registerMXBean();
    // Initialize the DataXceiverServer, used to read/write block data
    initDataXceiver(conf);
    // Start the HTTP info server
    startInfoServer(conf);
    pauseMonitor = new JvmPauseMonitor(conf);
    pauseMonitor.start();
  
    // BlockPoolTokenSecretManager is required to create ipc server.
    // It is the class that manages the BlockTokenSecretManager instances
    this.blockPoolTokenSecretManager = new BlockPoolTokenSecretManager();

    // Login is done by now. Set the DN user name.
    dnUserName = UserGroupInformation.getCurrentUser().getShortUserName();
    LOG.info("dnUserName = " + dnUserName);
    LOG.info("supergroup = " + supergroup);
    // Initialize the DataNode's RPC server
    initIpcServer(conf);
    // Initialize the JMX metrics (dynamic MBean)
    metrics = DataNodeMetrics.create(conf, getDisplayName());
    metrics.getJvmMetrics().setPauseMonitor(pauseMonitor);
    // The class that manages the block pools
    blockPoolManager = new BlockPoolManager(this);
    // Refresh the NNs known to this DN and create the BPOfferService/BPServiceActor instances
    blockPoolManager.refreshNamenodes(conf);
}

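II. DN Registration Flow

1. Creating the BPServiceActor

Each BPServiceActor corresponds to one NameNode. Every DN must register with every NN, which means every BPServiceActor registers with its own NameNode. So how are the BPServiceActors created? When the DataNode initializes, it refreshes its local list of NameNodes. During that refresh it reads the NameNode HA settings from the configuration: each nameservice gets one BPOfferService, and each NameNode address under a nameservice gets one BPServiceActor.

The method that refreshes the NameNode list: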

private void doRefreshNamenodes(
      Map<String, Map<String, InetSocketAddress>> addrMap) throws IOException {
    assert Thread.holdsLock(refreshNamenodesLock);
    // nameservices to refresh
    Set<String> toRefresh = Sets.newLinkedHashSet();
    // nameservices to add
    Set<String> toAdd = Sets.newLinkedHashSet();
    // nameservices to remove
    Set<String> toRemove;
    
    synchronized (this) {
      // Step 1. For each of the new nameservices, figure out whether
      // it's an update of the set of NNs for an existing NS,
      // or an entirely new nameservice.
      for (String nameserviceId : addrMap.keySet()) {
        // If the nameservice already exists, queue it for refresh; otherwise queue it for addition
        if (bpByNameserviceId.containsKey(nameserviceId)) {
          toRefresh.add(nameserviceId);
        } else {
          toAdd.add(nameserviceId);
        }
      }
      
      // Step 2. Any nameservices we currently have but are no longer present
      // need to be removed.
      // toRemove = bpByNameserviceId - addrMap
      toRemove = Sets.newHashSet(Sets.difference(
          bpByNameserviceId.keySet(), addrMap.keySet()));
      
      assert toRefresh.size() + toAdd.size() ==
        addrMap.size() :
          "toAdd: " + Joiner.on(",").useForNull("").join(toAdd) +
          "  toRemove: " + Joiner.on(",").useForNull("").join(toRemove) +
          "  toRefresh: " + Joiner.on(",").useForNull("").join(toRefresh);

      
      // Step 3. Start new nameservices
      if (!toAdd.isEmpty()) {
        LOG.info("Starting BPOfferServices for nameservices: " +
            Joiner.on(",").useForNull("").join(toAdd));
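        // nsToAdd is a nameservice that does not yet have a BPOfferService
        for (String nsToAdd : toAdd) {
          // NN addresses configured under this nameservice (typically two in an HA pair)
          ArrayList<InetSocketAddress> addrs =
            Lists.newArrayList(addrMap.get(nsToAdd).values());
          // Each nameservice maps to exactly one BPOfferService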
          BPOfferService bpos = createBPOS(addrs);
          bpByNameserviceId.put(nsToAdd, bpos);
          offerServices.add(bpos);
        }
      }
      startAll();
    }

    // Step 4. Shut down old nameservices. This happens outside
    // of the synchronized(this) lock since they need to call
    // back to .remove() from another thread
    if (!toRemove.isEmpty()) {
      LOG.info("Stopping BPOfferServices for nameservices: " +
          Joiner.on(",").useForNull("").join(toRemove));
      
      for (String nsToRemove : toRemove) {
        BPOfferService bpos = bpByNameserviceId.get(nsToRemove);
        bpos.stop();
        bpos.join();
        // they will call remove on their own
      }
    }
    
    // Step 5. Update nameservices whose NN list has changed
    if (!toRefresh.isEmpty()) {
      LOG.info("Refreshing list of NNs for nameservices: " +
          Joiner.on(",").useForNull("").join(toRefresh));
      
      for (String nsToRefresh : toRefresh) {
        BPOfferService bpos = bpByNameserviceId.get(nsToRefresh);
        ArrayList<InetSocketAddress> addrs =
          Lists.newArrayList(addrMap.get(nsToRefresh).values());
        bpos.refreshNNList(addrs);
      }
    }
  }
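
The partitioning done in Steps 1 and 2 above can be reproduced with plain `java.util` sets. This is a minimal sketch, not Hadoop code: the class `NameserviceDiffSketch`, its `setOf` helper, and the nameservice IDs are made up for illustration, and it skips the locking and the actual start/stop of services.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper (not a Hadoop class) mirroring Steps 1 and 2 of doRefreshNamenodes. */
final class NameserviceDiffSketch {
    final Set<String> toAdd = new LinkedHashSet<>();
    final Set<String> toRefresh = new LinkedHashSet<>();
    final Set<String> toRemove = new LinkedHashSet<>();

    NameserviceDiffSketch(Set<String> running, Set<String> configured) {
        // Step 1: a configured nameservice is either already running (refresh) or new (add).
        for (String ns : configured) {
            if (running.contains(ns)) {
                toRefresh.add(ns);
            } else {
                toAdd.add(ns);
            }
        }
        // Step 2: toRemove = running - configured.
        toRemove.addAll(running);
        toRemove.removeAll(configured);
    }

    static Set<String> setOf(String... ids) {
        return new LinkedHashSet<>(Arrays.asList(ids));
    }

    public static void main(String[] args) {
        NameserviceDiffSketch diff =
            new NameserviceDiffSketch(setOf("ns1", "ns2"), setOf("ns2", "ns3"));
        System.out.println("toAdd=" + diff.toAdd
            + " toRefresh=" + diff.toRefresh
            + " toRemove=" + diff.toRemove);
    }
}
```

With `ns1` and `ns2` running and `ns2` and `ns3` configured, the sketch prints `toAdd=[ns3] toRefresh=[ns2] toRemove=[ns1]`, and `toAdd.size() + toRefresh.size()` equals the number of configured nameservices, which is the invariant the assertion in the real method checks.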

The BPOfferService constructor follows the NN configuration under the nameservice and creates one BPServiceActor per NN:

 BPOfferService(List<InetSocketAddress> nnAddrs, DataNode dn) {
    Preconditions.checkArgument(!nnAddrs.isEmpty(),
        "Must pass at least one NN.");
    this.dn = dn;

    for (InetSocketAddress addr : nnAddrs) {
      this.bpServices.add(new BPServiceActor(addr, this));
    }
  }
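
To make the resulting object layout concrete, here is a small stand-alone model of it. It is only a sketch: `Actor` and `OfferService` are simplified stand-ins for BPServiceActor and BPOfferService, and the host names and port in `exampleConfig` are invented.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ActorLayoutSketch {
    /** Stand-in for BPServiceActor: one per NameNode address. */
    static final class Actor {
        final InetSocketAddress nnAddr;
        Actor(InetSocketAddress nnAddr) { this.nnAddr = nnAddr; }
    }

    /** Stand-in for BPOfferService: one per nameservice, holding one Actor per NN. */
    static final class OfferService {
        final List<Actor> actors = new ArrayList<>();
        OfferService(List<InetSocketAddress> nnAddrs) {
            if (nnAddrs.isEmpty()) {
                throw new IllegalArgumentException("Must pass at least one NN.");
            }
            for (InetSocketAddress addr : nnAddrs) {
                actors.add(new Actor(addr));
            }
        }
    }

    /** Builds one OfferService per nameservice ID. */
    static Map<String, OfferService> build(Map<String, List<InetSocketAddress>> addrMap) {
        Map<String, OfferService> byNameservice = new LinkedHashMap<>();
        for (Map.Entry<String, List<InetSocketAddress>> e : addrMap.entrySet()) {
            byNameservice.put(e.getKey(), new OfferService(e.getValue()));
        }
        return byNameservice;
    }

    /** Invented example: two nameservices with two NNs each. */
    static Map<String, List<InetSocketAddress>> exampleConfig() {
        Map<String, List<InetSocketAddress>> addrMap = new LinkedHashMap<>();
        addrMap.put("ns1", Arrays.asList(
            InetSocketAddress.createUnresolved("nn1.example", 8020),
            InetSocketAddress.createUnresolved("nn2.example", 8020)));
        addrMap.put("ns2", Arrays.asList(
            InetSocketAddress.createUnresolved("nn3.example", 8020),
            InetSocketAddress.createUnresolved("nn4.example", 8020)));
        return addrMap;
    }

    static int countActors(Map<String, OfferService> layout) {
        int total = 0;
        for (OfferService bpos : layout.values()) {
            total += bpos.actors.size();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, OfferService> layout = build(exampleConfig());
        System.out.println(layout.size() + " offer services, "
            + countActors(layout) + " actors");
    }
}
```

For the invented two-nameservice configuration, the model ends up with 2 offer services and 4 actors, so a DN in such a cluster would register four times, once per NN.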

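Changing the NN list of an existing nameservice through a refresh is not currently supported; the DN has to be restarted instead.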

 /**
   * Adding a new (standby) NN to a running DN is not currently supported.
   * Original NN list:
   *   nameservice1
   *     192.168.2.100
   *     192.168.2.101
   *   nameservice2
   *     192.168.2.200
   *     192.168.2.201
   * Current NN list:
   *   nameservice1
   *     192.168.2.100
   *     192.168.2.103
   *   nameservice2
   *     192.168.2.200
   *     192.168.2.202
   * The NN change shown above cannot be applied and raises an IOException.
   * When does this IOException occur? When ClientDatanodeProtocol#refreshNamenodes
   * is called after the addresses under a nameservice have been modified. For now,
   * the DN must be restarted for such a change to take effect.
   * @param addrs
   * @throws IOException
   */
  void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
    Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
    for (BPServiceActor actor : bpServices) {
      oldAddrs.add(actor.getNNSocketAddress());
    }
    Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);
    
    if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
      // Keep things simple for now -- we can implement this at a later date.
      throw new IOException(
          "HA does not currently support adding a new standby to a running DN. " +
          "Please do a rolling restart of DNs to reconfigure the list of NNs.");
    }
  }
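
The check hinges on the symmetric difference of the old and new address sets: it is empty only when both sets hold exactly the same addresses. A minimal sketch with `java.util` sets follows, assuming addresses are represented as plain strings; `NNListCheckSketch` and its methods are hypothetical names, and the sketch returns a boolean where the real method throws an IOException.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class NNListCheckSketch {
    /** Elements that are in exactly one of the two sets. */
    static <T> Set<T> symmetricDifference(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.addAll(b);              // union of both sets
        Set<T> common = new HashSet<>(a);
        common.retainAll(b);           // intersection of both sets
        result.removeAll(common);      // union minus intersection
        return result;
    }

    /** True when the NN list changed, i.e. when refreshNNList would reject the refresh. */
    static boolean needsRestart(Set<String> oldAddrs, Set<String> newAddrs) {
        return !symmetricDifference(oldAddrs, newAddrs).isEmpty();
    }

    static Set<String> setOf(String... addrs) {
        return new HashSet<>(Arrays.asList(addrs));
    }

    public static void main(String[] args) {
        Set<String> running = setOf("192.168.2.100", "192.168.2.101");
        // Same addresses in a different order: nothing to do.
        System.out.println(needsRestart(running, setOf("192.168.2.101", "192.168.2.100")));
        // One address replaced: the DN has to be restarted.
        System.out.println(needsRestart(running, setOf("192.168.2.100", "192.168.2.103")));
    }
}
```

Note that both adding and removing an address make the symmetric difference non-empty, so either kind of change is rejected, not only the "new standby" case named in the exception message.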

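2. DataNode Registration

The connectToNNAndHandshake() method works in the following steps:

  1. Initialize the DatanodeProtocol remote proxy
  2. Fetch the NN's version information and validate it
  3. Initialize the block pool
  4. Verify that the NNs (active/standby) under the same nameservice report identical information: Blockpool ID, Namespace ID, and Cluster ID
  5. Register with the NameNode (the DN registers with every NameNode)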
private void connectToNNAndHandshake() throws IOException {
    // get NN proxy(DatanodeProtocol)
    bpNamenode = dn.connectToNN(nnAddr);

    // First phase of the handshake with NN - get the namespace
    // info.
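    // Fetch the NN's version information over RPC, check that the DN's version matches the NN's,
    // and check that the NN's version is not lower than the configured minimum NN version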
    NamespaceInfo nsInfo = retrieveNamespaceInfo();
    
    // Verify that this matches the other NN in this HA pair.
    // This also initializes our block pool in the DN if we are
    // the first NN connection for this BP.
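    // 1. Verify
    // 1.1 that the Block pool ID matches
    // 1.2 that the Namespace ID matches
    // 1.3 that the Cluster ID matches
    // 2. Initialize the block pool
    // blockPoolManager --> Map bpByBlockPoolId --> add block pool
    // This shows that one DN can host multiple block pools; a block pool is only a logical concept.
    // Each namespace (nameservice) has one block pool; in a federated cluster, the NNs of the
    // different nameservices share all the DNs.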
    bpos.verifyAndSetNamespaceInfo(nsInfo);
    // Second phase of the handshake with the NN.
    // Register with the NN, sending the IP, hostname, server port, HTTP port, version, storageInfo, etc.
    register(nsInfo);
  }
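
The "first NN initializes, later NNs must match" rule applied by verifyAndSetNamespaceInfo can be sketched as follows. This is an illustration under simplifying assumptions: `HandshakeCheckSketch` and `NsInfo` are hypothetical classes, the IDs are invented, and the sketch returns `false` on a mismatch where the real code raises an IOException.

```java
final class HandshakeCheckSketch {
    /** Reduced stand-in for NamespaceInfo with only the three compared fields. */
    static final class NsInfo {
        final String blockPoolId;
        final int namespaceId;
        final String clusterId;

        NsInfo(String blockPoolId, int namespaceId, String clusterId) {
            this.blockPoolId = blockPoolId;
            this.namespaceId = namespaceId;
            this.clusterId = clusterId;
        }
    }

    // Namespace info recorded from the first NN that completed the handshake.
    private NsInfo known;

    /** Returns true if the reported info is accepted. */
    synchronized boolean verifyAndSet(NsInfo reported) {
        if (known == null) {
            // First NN connection for this block pool: record the info
            // (this is the point where the real code initializes the block pool).
            known = reported;
            return true;
        }
        // Any later NN of the same nameservice must report identical IDs.
        return known.blockPoolId.equals(reported.blockPoolId)
            && known.namespaceId == reported.namespaceId
            && known.clusterId.equals(reported.clusterId);
    }

    public static void main(String[] args) {
        HandshakeCheckSketch bpos = new HandshakeCheckSketch();
        System.out.println(bpos.verifyAndSet(new NsInfo("BP-1", 42, "CID-a"))); // first NN
        System.out.println(bpos.verifyAndSet(new NsInfo("BP-1", 42, "CID-a"))); // matching peer
        System.out.println(bpos.verifyAndSet(new NsInfo("BP-1", 42, "CID-b"))); // mismatch
    }
}
```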

3. DataNode Heartbeats

By default the DN sends a heartbeat to the NameNode every 3 s. The request carries the following fields:

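Field                  Meaning
DatanodeRegistration   Contains the StorageInfo and the DatanodeID
StorageReport[]        Storage information for each storage directory: total volume capacity, used space, remaining space, and the disk space used by the current block pool
CacheCapacity          Cache capacity
CacheUsed              Cache space in use
XmitsInProgress        Number of threads currently performing data transfers
XceiverCount           Number of active DataXceiver threads, i.e. concurrent block read/write operations
numFailedVolumes       Number of failed storage volumes
volumeFailureSummary   Summary information about the failed storage volumes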

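The NameNode's response contains:

Field                  Meaning
commands               The set of commands for the DN
haStatus               The NN's HA status, consisting of the state and the txid (transaction ID)
rollingUpdateStatus    The rolling upgrade status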
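
Because commands travel only in the heartbeat response, the NN side behaves like a per-DN queue that the DN drains each time it heartbeats. The sketch below models that pull pattern with invented types (`NameNodeStub`, `HeartbeatResponse`, string commands); it is not the real DatanodeProtocol API and it leaves out the request fields listed above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class HeartbeatPullSketch {
    /** Reduced stand-in for the heartbeat response: commands plus HA state and txid. */
    static final class HeartbeatResponse {
        final List<String> commands;
        final String haState;
        final long txid;

        HeartbeatResponse(List<String> commands, String haState, long txid) {
            this.commands = commands;
            this.haState = haState;
            this.txid = txid;
        }
    }

    /** Hypothetical NN-side model for a single DN. */
    static final class NameNodeStub {
        private final Queue<String> pendingCommands = new ArrayDeque<>();
        private final String haState;
        private long txid;

        NameNodeStub(String haState, long txid) {
            this.haState = haState;
            this.txid = txid;
        }

        /** The NN cannot call the DN; it can only queue work for the next heartbeat. */
        void queueCommand(String command) {
            pendingCommands.add(command);
        }

        /** Called when the DN heartbeats: hand over and clear everything queued so far. */
        HeartbeatResponse handleHeartbeat() {
            List<String> drained = new ArrayList<>(pendingCommands);
            pendingCommands.clear();
            return new HeartbeatResponse(drained, haState, txid++);
        }
    }

    public static void main(String[] args) {
        NameNodeStub nn = new NameNodeStub("ACTIVE", 100L);
        nn.queueCommand("DNA_INVALIDATE");
        nn.queueCommand("DNA_TRANSFER");
        System.out.println(nn.handleHeartbeat().commands); // both queued commands
        System.out.println(nn.handleHeartbeat().commands); // nothing left to deliver
    }
}
```

One consequence of this design is that the delay before a DN sees a command is bounded by the heartbeat interval rather than by anything the NN controls.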

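4. How the DataNode Identifies the ACTIVE BPServiceActor

As described above, a BPServiceActor handles the interaction with an NN, and each NameNode has its own BPServiceActor. The DN needs to know which BPServiceActor is talking to the ACTIVE NN so that later operations can be directed at it. Every heartbeat response carries the NN's HA state, so an HA check runs on every heartbeat to determine the currently active node. Two cases trigger a switch: in the first, the NN that the bpos regards as active has moved from active to standby; in the second, an NN that the bpos does not regard as active has moved from standby to active.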

void updateActorStatesFromHeartbeat(
      BPServiceActor actor,
      NNHAStatusHeartbeat nnHaState) {
    // Acquire the write lock
    writeLock();
    try {
      final long txid = nnHaState.getTxId();
      // Does this NN claim to be active in its heartbeat response?
      final boolean nnClaimsActive =
          nnHaState.getState() == HAServiceState.ACTIVE;
      // Is this actor the one the bpos currently regards as active?
      final boolean bposThinksActive = bpServiceToActive == actor;
      // Is this txid newer than the last active claim?
      final boolean isMoreRecentClaim = txid > lastActiveClaimTxId;

      // This NN claims active but is not the one the bpos regards as active:
      // an HA failover has taken place
      if (nnClaimsActive && !bposThinksActive) {
        LOG.info("Namenode " + actor + " trying to claim ACTIVE state with " +
            "txid=" + txid);
        // Ignore the claim if its txid is not newer
        if (!isMoreRecentClaim) {
          // Split-brain scenario - an NN is trying to claim active
          // state when a different NN has already claimed it with a higher
          // txid.
          LOG.warn("NN " + actor + " tried to claim ACTIVE state at txid=" +
              txid + " but there was already a more recent claim at txid=" +
              lastActiveClaimTxId);
          return;
        } else {
          // If the bpos has not selected an active node yet, accept this NN as the active node directly
          if (bpServiceToActive == null) {
            LOG.info("Acknowledging ACTIVE Namenode " + actor);
          } else {
            LOG.info("Namenode " + actor + " taking over ACTIVE state from " +
                bpServiceToActive + " at higher txid=" + txid);
          }
          bpServiceToActive = actor;
        }
      } else if (!nnClaimsActive && bposThinksActive) {
        // This NN is not active, yet it is the one the bpos regards as active:
        // the active node has switched to standby.
        // Set bpServiceToActive to null and wait for the next heartbeat
        LOG.info("Namenode " + actor + " relinquishing ACTIVE state with " +
            "txid=" + nnHaState.getTxId());
        bpServiceToActive = null;
      }

      // If this actor is the active one, record the latest txid
      if (bpServiceToActive == actor) {
        assert txid >= lastActiveClaimTxId;
        lastActiveClaimTxId = txid;
      }
    } finally {
      writeUnlock();
    }
  }
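
The state machine above is easier to follow when replayed against a short sequence of heartbeats. The following sketch keeps the same three decisions (accept a newer active claim, ignore a stale one, drop the active node when it reports standby) but uses invented names: `ActiveTrackerSketch` identifies actors by strings and omits the locking and logging.

```java
final class ActiveTrackerSketch {
    private String activeActor;             // plays the role of bpServiceToActive
    private long lastActiveClaimTxId = -1;  // highest txid seen from an accepted active claim

    /** Applies one heartbeat response from the given actor. */
    synchronized void onHeartbeat(String actor, boolean nnClaimsActive, long txid) {
        boolean thinksActive = actor.equals(activeActor);
        if (nnClaimsActive && !thinksActive) {
            if (txid <= lastActiveClaimTxId) {
                return;                     // stale claim (possible split brain): ignore it
            }
            activeActor = actor;            // first active seen, or a failover to this actor
        } else if (!nnClaimsActive && thinksActive) {
            activeActor = null;             // the active NN stepped down; wait for a new claim
        }
        if (actor.equals(activeActor)) {
            lastActiveClaimTxId = txid;     // remember the latest txid of the active NN
        }
    }

    synchronized String active() {
        return activeActor;
    }

    public static void main(String[] args) {
        ActiveTrackerSketch bpos = new ActiveTrackerSketch();
        bpos.onHeartbeat("nn1", true, 10);   // nn1 becomes active
        bpos.onHeartbeat("nn2", true, 5);    // older txid: ignored
        System.out.println(bpos.active());
        bpos.onHeartbeat("nn2", true, 20);   // newer txid: nn2 takes over
        System.out.println(bpos.active());
        bpos.onHeartbeat("nn2", false, 21);  // nn2 reports standby: no active node
        System.out.println(bpos.active());
    }
}
```

The txid comparison is what protects the DN in a split-brain situation: an old active NN that keeps claiming ACTIVE with a lower txid can never displace the NN that took over with a higher one.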

5. Executing the Commands Returned by the NameNode

  final static int DNA_UNKNOWN = 0;    // unknown action
  final static int DNA_TRANSFER = 1;   // transfer blocks to another datanode
  final static int DNA_INVALIDATE = 2; // invalidate blocks
  final static int DNA_SHUTDOWN = 3;   // shutdown node (not supported)
  final static int DNA_REGISTER = 4;   // re-register
  final static int DNA_FINALIZE = 5;   // finalize previous upgrade
  final static int DNA_RECOVERBLOCK = 6;  // request a block recovery
  final static int DNA_ACCESSKEYUPDATE = 7;  // update access key
  final static int DNA_BALANCERBANDWIDTHUPDATE = 8; // update balancer bandwidth
  final static int DNA_CACHE = 9;      // cache blocks
  final static int DNA_UNCACHE = 10;   // uncache blocks
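
On the DN side these action codes are typically consumed by a `switch` over the command's action. The sketch below only maps each code to a description, using the constants and comments listed above; `CommandDispatchSketch` and `describe` are hypothetical, and the real handler performs the action instead of returning a string.

```java
final class CommandDispatchSketch {
    static final int DNA_UNKNOWN = 0;
    static final int DNA_TRANSFER = 1;
    static final int DNA_INVALIDATE = 2;
    static final int DNA_SHUTDOWN = 3;
    static final int DNA_REGISTER = 4;
    static final int DNA_FINALIZE = 5;
    static final int DNA_RECOVERBLOCK = 6;
    static final int DNA_ACCESSKEYUPDATE = 7;
    static final int DNA_BALANCERBANDWIDTHUPDATE = 8;
    static final int DNA_CACHE = 9;
    static final int DNA_UNCACHE = 10;

    /** Maps an action code to what the DN is being asked to do. */
    static String describe(int action) {
        switch (action) {
            case DNA_TRANSFER:                return "transfer blocks to another datanode";
            case DNA_INVALIDATE:              return "invalidate blocks";
            case DNA_SHUTDOWN:                return "shutdown node (not supported)";
            case DNA_REGISTER:                return "re-register";
            case DNA_FINALIZE:                return "finalize previous upgrade";
            case DNA_RECOVERBLOCK:            return "request a block recovery";
            case DNA_ACCESSKEYUPDATE:         return "update access key";
            case DNA_BALANCERBANDWIDTHUPDATE: return "update balancer bandwidth";
            case DNA_CACHE:                   return "cache blocks";
            case DNA_UNCACHE:                 return "uncache blocks";
            default:                          return "unknown action"; // includes DNA_UNKNOWN
        }
    }

    public static void main(String[] args) {
        for (int action = DNA_UNKNOWN; action <= DNA_UNCACHE; action++) {
            System.out.println(action + " -> " + describe(action));
        }
    }
}
```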

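6. Block Reports

The DN periodically sends a full block report, every 6 hours by default. It also reports deleted blocks, with a default period of 300 s (100 × the heartbeat interval).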
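
The default intervals work out as follows. This is a small arithmetic sketch: the constant names and the `isDue` helper are made up for illustration, and only the values (3 s heartbeat, 6 h full report, 100 heartbeats between deleted-block reports) come from the text.

```java
final class ReportIntervalSketch {
    // Default heartbeat interval: 3 s.
    static final long HEARTBEAT_INTERVAL_MS = 3_000L;
    // Full block report: every 6 hours = 21,600,000 ms.
    static final long FULL_BLOCK_REPORT_INTERVAL_MS = 6L * 60 * 60 * 1000;
    // Deleted-block report: 100 heartbeat intervals = 300,000 ms = 300 s.
    static final long DELETE_REPORT_INTERVAL_MS = 100 * HEARTBEAT_INTERVAL_MS;

    /** True once at least intervalMs has elapsed since the last report. */
    static boolean isDue(long lastReportMs, long nowMs, long intervalMs) {
        return nowMs - lastReportMs >= intervalMs;
    }

    public static void main(String[] args) {
        System.out.println("full report every " + FULL_BLOCK_REPORT_INTERVAL_MS + " ms");
        System.out.println("deleted-block report every " + DELETE_REPORT_INTERVAL_MS + " ms");
        System.out.println("heartbeats per full report: "
            + FULL_BLOCK_REPORT_INTERVAL_MS / HEARTBEAT_INTERVAL_MS);
    }
}
```

With these defaults a DN sends 7,200 heartbeats between two full block reports, which is why full reports are the expensive, infrequent path and the short-period reports carry the changes in between.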

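III. Summary

DN startup initializes many services, most of which serve data reads and writes. The DN also keeps a periodic heartbeat with the NN, and the NN uses the heartbeat responses to send commands to the DN. The NN cannot operate on a DN directly; it can only interact with it through heartbeats.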
