HDFS Startup

Tags (space-separated): Big Data, HDFS

[toc]

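All analysis here uses a single-node installation of Hadoop 2.6.4 as the example. The steps build on those in the installation document; see the Hadoop single-node installation guide. Note that the shell and Java excerpts quoted below appear to come from a newer source tree (hadoop-functions.sh and the --workers option belong to the Hadoop 3.x scripts), whereas the log output and default ports are from the 2.6.4 installation.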

A few important script files to know up front:

Assume Hadoop is installed in HADOOP_HOME.
The key script file is hadoop-functions.sh.

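Step-by-Step Walkthrough

Formatting the File System

The first step is: $ bin/hdfs namenode -format

This mainly runs the HADOOP_HOME/bin/hdfs script, which sets three important variables for the namenode subcommand: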

 namenode)
      HADOOP_SUBCMD_SUPPORTDAEMONIZATION="true"
      HADOOP_CLASSNAME='org.apache.hadoop.hdfs.server.namenode.NameNode'
      hadoop_add_param HADOOP_OPTS hdfs.audit.logger "-Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER}"
    ;;

Finally, the script executes:

hadoop_java_exec "${HADOOP_SUBCMD}" "${HADOOP_CLASSNAME}" "${HADOOP_SUBCMD_ARGS[@]}"

hadoop_java_exec is a function declared in hadoop-functions.sh. Its job is to start a Java process that runs the given command.

function hadoop_java_exec
{
  # run a java command.  this is used for
  # non-daemons

  local command=$1
  local class=$2
  shift 2

  hadoop_debug "Final CLASSPATH: ${CLASSPATH}"
  hadoop_debug "Final HADOOP_OPTS: ${HADOOP_OPTS}"
  hadoop_debug "Final JAVA_HOME: ${JAVA_HOME}"
  hadoop_debug "java: ${JAVA}"
  hadoop_debug "Class name: ${class}"
  hadoop_debug "Command line options: $*"

  export CLASSPATH
  #shellcheck disable=SC2086
  exec "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
}
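To make the argument order of that final exec line easier to see, here is a minimal Java sketch that assembles the same command line as a list. JavaExecSketch and buildCommand are hypothetical names introduced for illustration, and the option value passed in main is only a placeholder; nothing here is part of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: mirrors the argument order that
// hadoop_java_exec uses when it execs the JVM.
class JavaExecSketch {
    static List<String> buildCommand(String java, String command,
                                     List<String> hadoopOpts,
                                     String className, List<String> args) {
        List<String> cmd = new ArrayList<>();
        cmd.add(java);                 // "${JAVA}"
        cmd.add("-Dproc_" + command);  // process marker, e.g. -Dproc_namenode
        cmd.addAll(hadoopOpts);        // ${HADOOP_OPTS}, word-split by the shell
        cmd.add(className);            // class whose main method runs
        cmd.addAll(args);              // "$@", e.g. -format
        return cmd;
    }

    public static void main(String[] argv) {
        List<String> cmd = buildCommand(
            "java", "namenode",
            Arrays.asList("-Dexample.option=placeholder"), // placeholder option
            "org.apache.hadoop.hdfs.server.namenode.NameNode",
            Arrays.asList("-format"));
        System.out.println(String.join(" ", cmd));
    }
}
```

The -Dproc_namenode system property has no effect on the program itself; it makes the process easy to identify in tools such as ps.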

So the whole command chain boils down to running the main method of the org.apache.hadoop.hdfs.server.namenode.NameNode class with the argument -format.

public static void main(String argv[]) throws Exception {
    if (DFSUtil.parseHelpArgument(argv, NameNode.USAGE, System.out, true)) {
      System.exit(0);
    }

    try {
      StringUtils.startupShutdownMessage(NameNode.class, argv, LOG);
      NameNode namenode = createNameNode(argv, null);
      if (namenode != null) {
        namenode.join();
      }
    } catch (Throwable e) {
      LOG.error("Failed to start namenode.", e);
      terminate(1, e);
    }
  }

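The startupShutdownMessage method logs some startup information. On Unix systems it also registers the logger as a signal handler, so that an error is logged when the process receives a TERM, HUP, or INT signal. The point is that when a system signal ends the process, the log shows why the process exited.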

if (SystemUtils.IS_OS_UNIX) {
  try {
    SignalLogger.INSTANCE.register(LOG);
  } catch (Throwable t) {
    LOG.warn("failed to register any UNIX signal loggers: ", t);
  }
}

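Next comes createNameNode. It first parses the -format argument into StartupOption.FORMAT and then calls the format method. Because no cluster ID was specified, the system generates a new clusterId, for example CID-d2425dab-c066-4a67-954f-32228c22abe6.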

private static boolean format(Configuration conf, boolean force,
      boolean isInteractive) throws IOException {
    String nsId = DFSUtil.getNamenodeNameServiceId(conf);
    String namenodeId = HAUtil.getNameNodeId(conf, nsId);
    initializeGenericKeys(conf, nsId, namenodeId);
    checkAllowFormat(conf);

    if (UserGroupInformation.isSecurityEnabled()) {
      InetSocketAddress socAddr = DFSUtilClient.getNNAddress(conf);
      SecurityUtil.login(conf, DFS_NAMENODE_KEYTAB_FILE_KEY,
          DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, socAddr.getHostName());
    }
    
    Collection<URI> nameDirsToFormat = FSNamesystem.getNamespaceDirs(conf);
    List<URI> sharedDirs = FSNamesystem.getSharedEditsDirs(conf);
    List<URI> dirsToPrompt = new ArrayList<URI>();
    dirsToPrompt.addAll(nameDirsToFormat);
    dirsToPrompt.addAll(sharedDirs);
    List<URI> editDirsToFormat = 
                 FSNamesystem.getNamespaceEditsDirs(conf);

    // if clusterID is not provided - see if you can find the current one
    String clusterId = StartupOption.FORMAT.getClusterId();
    if(clusterId == null || clusterId.equals("")) {
      //Generate a new cluster id
      clusterId = NNStorage.newClusterID();
    }
    System.out.println("Formatting using clusterid: " + clusterId);
    
    FSImage fsImage = new FSImage(conf, nameDirsToFormat, editDirsToFormat);
    try {
      FSNamesystem fsn = new FSNamesystem(conf, fsImage);
      fsImage.getEditLog().initJournalsForWrite();

      if (!fsImage.confirmFormat(force, isInteractive)) {
        return true; // aborted
      }

      fsImage.format(fsn, clusterId);
    } catch (IOException ioe) {
      LOG.warn("Encountered exception during format: ", ioe);
      fsImage.close();
      throw ioe;
    }
    return false;
  }
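The option parsing and the cluster ID fallback in format can be sketched in isolation. This is a simplified stand-in, not the Hadoop code: FormatSketch, its two-value enum, and resolveClusterId are hypothetical, and the "CID-" plus random UUID format is an assumption inferred from the sample ID shown earlier.

```java
import java.util.UUID;

// Hypothetical, simplified stand-in for the parsing and cluster ID logic.
class FormatSketch {
    enum StartupOption { FORMAT, REGULAR }

    // Map command-line arguments to a startup option; no flag means REGULAR.
    static StartupOption parse(String[] args) {
        for (String arg : args) {
            if ("-format".equalsIgnoreCase(arg)) {
                return StartupOption.FORMAT;
            }
        }
        return StartupOption.REGULAR;
    }

    // Keep a provided cluster ID, otherwise generate a new one.
    // Assumption: generated IDs look like "CID-" followed by a random UUID.
    static String resolveClusterId(String provided) {
        if (provided == null || provided.isEmpty()) {
            return "CID-" + UUID.randomUUID();
        }
        return provided;
    }

    public static void main(String[] args) {
        StartupOption opt = parse(new String[] {"-format"});
        System.out.println(opt + " using clusterid: " + resolveClusterId(null));
    }
}
```

Running hdfs namenode with no flag corresponds to the REGULAR case here, which is the path the start script takes later.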

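Next, an FSImage is constructed. This sets the default checkpoint directories, sets up storage, and initializes the edit log. NNStorage manages the storage directories, and FSEditLog is the edit log object.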

protected FSImage(Configuration conf,
                    Collection<URI> imageDirs,
                    List<URI> editsDirs)
      throws IOException {
    this.conf = conf;

    storage = new NNStorage(conf, imageDirs, editsDirs);
    if(conf.getBoolean(DFSConfigKeys.DFS_NAMENODE_NAME_DIR_RESTORE_KEY,
                       DFSConfigKeys.DFS_NAMENODE_NAME_DIR_RESTORE_DEFAULT)) {
      storage.setRestoreFailedStorage(true);
    }

    this.editLog = FSEditLog.newInstance(conf, storage, editsDirs);
    archivalManager = new NNStorageRetentionManager(conf, storage, editLog);
  }

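With the file system image in place, FSNamesystem can be constructed. It is the container for namespace state and carries all of the NameNode's bookkeeping work. The constructor is long, so it is not reproduced here; its steps are:

1. Create the KeyProvider. This example does not enable any security features, so no KeyProvider is found.
2. Read dfs.namenode.fslock.fair and construct the FSNamesystemLock. The default is true, which means a fair read-write lock.
3. Set up the user and permissions.
4. Check whether HA is enabled.
5. Initialize the BlockManager and the managers it delegates to, including DatanodeManager (manages DataNode decommissioning [DecommissionManager] and other DataNode activity), HeartbeatManager (manages heartbeats received from DataNodes), and BlockIdManager (allocates and manages generation stamps and block IDs).
6. Construct the FSDirectory, a purely in-memory structure that works together with FSNamesystem to manage the NameNode, and construct the INode.
7. Initialize the CacheManager, which manages caching on the DataNodes.
8. Initialize the RetryCache, which caches non-idempotent requests that the RPC server has already processed successfully, so that retries can be handled.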
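As a small illustration of the fair read-write lock mentioned in the steps above, the following sketch uses the JDK's ReentrantReadWriteLock, whose constructor takes a fairness flag. FairLockSketch and newNamesystemLock are hypothetical names, and treating the boolean as the value read from dfs.namenode.fslock.fair is an assumption about how the setting is applied.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: a read-write lock whose fairness comes from a setting.
class FairLockSketch {
    // 'fair' stands in for the value of dfs.namenode.fslock.fair (default true).
    static ReentrantReadWriteLock newNamesystemLock(boolean fair) {
        return new ReentrantReadWriteLock(fair);
    }

    public static void main(String[] args) {
        ReentrantReadWriteLock lock = newNamesystemLock(true);
        lock.readLock().lock();           // many readers may hold this at once
        try {
            System.out.println("fair=" + lock.isFair());
        } finally {
            lock.readLock().unlock();
        }
        lock.writeLock().lock();          // writers get exclusive access
        try {
            System.out.println("write locked=" + lock.isWriteLocked());
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

In fair mode, waiting threads acquire the lock roughly in arrival order, which avoids starving writers at some cost in throughput.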

At this point FSNamesystem is initialized. Finally, the format method of FSImage performs the formatting, and then the NameNode shuts down.

Starting the NameNode and DataNode Processes

The second step is to start the NameNode and DataNode, using the following script:

$ sbin/start-dfs.sh

NameNode Startup

Core of the script:

#---------------------------------------------------------
# namenodes

NAMENODES=$("${HADOOP_HDFS_HOME}/bin/hdfs" getconf -namenodes 2>/dev/null)

if [[ -z "${NAMENODES}" ]]; then
  NAMENODES=$(hostname)
fi

echo "Starting namenodes on [${NAMENODES}]"
hadoop_uservar_su hdfs namenode "${HADOOP_HDFS_HOME}/bin/hdfs" \
    --workers \
    --config "${HADOOP_CONF_DIR}" \
    --hostnames "${NAMENODES}" \
    --daemon start \
    namenode ${nameStartOpt}

HADOOP_JUMBO_RETCOUNTER=$?

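In other words, the script first runs hdfs getconf -namenodes to list all NameNodes from the configuration, and then runs hdfs namenode to start the NameNode. From the analysis above, the hdfs script simply launches the Java process for the given subcommand, and the namenode subcommand still maps to the main method of the NameNode class. The other steps are the same; only the logic inside createNameNode differs, because the arguments differ. Since the startup script passes no further arguments to namenode, the default branch runs: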

default: {
        DefaultMetricsSystem.initialize("NameNode");
        return new NameNode(conf);
      }

The core is the NameNode constructor. It first calls setClientNamenodeAddress to set the NameNode address, which defaults to the value configured for fs.defaultFS, here hdfs://localhost:9000.
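To show what deriving an address from that setting amounts to, here is a minimal sketch using only java.net.URI. DefaultFsSketch and hostPort are hypothetical names, and this is not the code path Hadoop itself uses.

```java
import java.net.URI;

// Hypothetical sketch: extract host:port from an fs.defaultFS style value.
class DefaultFsSketch {
    static String hostPort(String defaultFs) {
        URI uri = URI.create(defaultFs);
        return uri.getHost() + ":" + uri.getPort();
    }

    public static void main(String[] args) {
        System.out.println(hostPort("hdfs://localhost:9000")); // prints localhost:9000
    }
}
```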

Next, the NameNode is initialized:

protected void initialize(Configuration conf) throws IOException {
    if (conf.get(HADOOP_USER_GROUP_METRICS_PERCENTILES_INTERVALS) == null) {
      String intervals = conf.get(DFS_METRICS_PERCENTILES_INTERVALS_KEY);
      if (intervals != null) {
        conf.set(HADOOP_USER_GROUP_METRICS_PERCENTILES_INTERVALS,
          intervals);
      }
    }

    UserGroupInformation.setConfiguration(conf);
    loginAsNameNodeUser(conf);

    NameNode.initMetrics(conf, this.getRole());
    StartupProgressMetrics.register(startupProgress);

    pauseMonitor = new JvmPauseMonitor();
    pauseMonitor.init(conf);
    pauseMonitor.start();
    metrics.getJvmMetrics().setPauseMonitor(pauseMonitor);

    if (NamenodeRole.NAMENODE == role) {
      startHttpServer(conf);
    }

    loadNamesystem(conf);

    rpcServer = createRpcServer(conf);

    initReconfigurableBackoffKey();

    if (clientNamenodeAddress == null) {
      // This is expected for MiniDFSCluster. Set it now using 
      // the RPC server's bind address.
      clientNamenodeAddress = 
          NetUtils.getHostPortString(getNameNodeAddress());
      LOG.info("Clients are to use " + clientNamenodeAddress + " to access"
          + " this namenode/service.");
    }
    if (NamenodeRole.NAMENODE == role) {
      httpServer.setNameNodeAddress(getNameNodeAddress());
      httpServer.setFSImage(getFSImage());
    }

    startCommonServices(conf);
    startMetricsLogger(conf);
  }

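A few steps matter most. startHttpServer starts an HTTP server whose default address is http://0.0.0.0:50070. The default HDFS HTTP server is a Jetty server; once it is up, its web page shows the monitoring status of the whole HDFS. Next the Namesystem is loaded. The parameters are checked first, and because this is a local setup, two warnings appear: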

2017-02-11 21:59:28,765 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!
2017-02-11 21:59:28,765 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one namespace edits storage directory (dfs.namenode.edits.dir) configured. Beware of data loss due to lack of redundant storage directories!

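Ignoring the single-directory warnings for image and edit log storage, the next step is to construct FSNamesystem, just as in the format logic. Then comes loadFSImage. After the FSImage is loaded, the code decides whether it needs to be saved, using this logic: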

final boolean needToSave = staleImage && !haEnabled && !isRollingUpgrade();

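In single-node mode, staleImage, haEnabled, and isRollingUpgrade() are all false. Because staleImage is false, needToSave is also false, so the saveNamespace method of fsImage is not called.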
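The condition is easy to check by hand. The sketch below evaluates the same boolean expression for the single-node case and for a stale image; NeedToSaveSketch is a hypothetical name, and the three parameters stand in for the fields of the same names.

```java
// Hypothetical sketch of the needToSave decision as a pure function.
class NeedToSaveSketch {
    static boolean needToSave(boolean staleImage, boolean haEnabled,
                              boolean rollingUpgrade) {
        return staleImage && !haEnabled && !rollingUpgrade;
    }

    public static void main(String[] args) {
        // Single-node case from the text: everything is false.
        System.out.println(needToSave(false, false, false)); // prints false
        // A stale image without HA or rolling upgrade would be saved.
        System.out.println(needToSave(true, false, false));  // prints true
    }
}
```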

When this finishes, a log line like the following appears:

2017-02-11 21:59:29,472 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 349 msecs

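This indicates that the FSImage has finished loading.

Next the RPC server is initialized. The corresponding class is RPC.Server, a Protobuf-based RPC server for clients.

In the last two lines of the method, startCommonServices starts all the *manager components as well as the httpServer and rpcServer, and also starts each ServicePlugin if any are configured. startMetricsLogger turns on metrics logging.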

DataNode Startup

Startup script:

#---------------------------------------------------------
# datanodes (using default workers file)

echo "Starting datanodes"
hadoop_uservar_su hdfs datanode "${HADOOP_HDFS_HOME}/bin/hdfs" \
    --workers \
    --config "${HADOOP_CONF_DIR}" \
    --daemon start \
    datanode ${dataStartOpt}
(( HADOOP_JUMBO_RETCOUNTER=HADOOP_JUMBO_RETCOUNTER + $? ))

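This runs hdfs datanode with no arguments. A DataNode stores a set of blocks that hold the actual file data. It communicates with the NameNode, with other DataNodes, and even with clients. A DataNode maintains only one mapping: from a block to a stream of bytes.

DataNode initialization begins by initializing the metrics system. Then comes the core code, the DataNode constructor: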

DataNode(final Configuration conf,
           final List<StorageLocation> dataDirs,
           final StorageLocationChecker storageLocationChecker,
           final SecureResources resources) throws IOException {
    super(conf);
    this.tracer = createTracer(conf);
    this.tracerConfigurationManager =
        new TracerConfigurationManager(DATANODE_HTRACE_PREFIX, conf);
    this.fileIoProvider = new FileIoProvider(conf, this);
    this.blockScanner = new BlockScanner(this);
    this.lastDiskErrorCheck = 0;
    this.maxNumberOfBlocksToLog = conf.getLong(DFS_MAX_NUM_BLOCKS_TO_LOG_KEY,
        DFS_MAX_NUM_BLOCKS_TO_LOG_DEFAULT);

    this.usersWithLocalPathAccess = Arrays.asList(
        conf.getTrimmedStrings(DFSConfigKeys.DFS_BLOCK_LOCAL_PATH_ACCESS_USER_KEY));
    this.connectToDnViaHostname = conf.getBoolean(
        DFSConfigKeys.DFS_DATANODE_USE_DN_HOSTNAME,
        DFSConfigKeys.DFS_DATANODE_USE_DN_HOSTNAME_DEFAULT);
    this.supergroup = conf.get(DFSConfigKeys.DFS_PERMISSIONS_SUPERUSERGROUP_KEY,
        DFSConfigKeys.DFS_PERMISSIONS_SUPERUSERGROUP_DEFAULT);
    this.isPermissionEnabled = conf.getBoolean(
        DFSConfigKeys.DFS_PERMISSIONS_ENABLED_KEY,
        DFSConfigKeys.DFS_PERMISSIONS_ENABLED_DEFAULT);
    this.pipelineSupportECN = conf.getBoolean(
        DFSConfigKeys.DFS_PIPELINE_ECN_ENABLED,
        DFSConfigKeys.DFS_PIPELINE_ECN_ENABLED_DEFAULT);

    confVersion = "core-" +
        conf.get("hadoop.common.configuration.version", "UNSPECIFIED") +
        ",hdfs-" +
        conf.get("hadoop.hdfs.configuration.version", "UNSPECIFIED");

    this.volumeChecker = new DatasetVolumeChecker(conf, new Timer());

    // Determine whether we should try to pass file descriptors to clients.
    if (conf.getBoolean(HdfsClientConfigKeys.Read.ShortCircuit.KEY,
              HdfsClientConfigKeys.Read.ShortCircuit.DEFAULT)) {
      String reason = DomainSocket.getLoadingFailureReason();
      if (reason != null) {
        LOG.warn("File descriptor passing is disabled because " + reason);
        this.fileDescriptorPassingDisabledReason = reason;
      } else {
        LOG.info("File descriptor passing is enabled.");
        this.fileDescriptorPassingDisabledReason = null;
      }
    } else {
      this.fileDescriptorPassingDisabledReason =
          "File descriptor passing was not configured.";
      LOG.debug(this.fileDescriptorPassingDisabledReason);
    }

    this.socketFactory = NetUtils.getDefaultSocketFactory(conf);

    try {
      hostName = getHostName(conf);
      LOG.info("Configured hostname is " + hostName);
      startDataNode(dataDirs, resources);
    } catch (IOException ie) {
      shutdown();
      throw ie;
    }
    final int dncCacheMaxSize =
        conf.getInt(DFS_DATANODE_NETWORK_COUNTS_CACHE_MAX_SIZE_KEY,
            DFS_DATANODE_NETWORK_COUNTS_CACHE_MAX_SIZE_DEFAULT) ;
    datanodeNetworkCounts =
        CacheBuilder.newBuilder()
            .maximumSize(dncCacheMaxSize)
            .build(new CacheLoader<String, Map<String, Long>>() {
              @Override
              public Map<String, Long> load(String key) throws Exception {
                final Map<String, Long> ret = new HashMap<String, Long>();
                ret.put("networkErrors", 0L);
                return ret;
              }
            });

    initOOBTimeout();
    this.storageLocationChecker = storageLocationChecker;
  }

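The most important call here is the startDataNode method. Its core steps, in summary:

1. Register the MBean.
2. Create a TcpPeerServer listening on port 50010. This server handles communication with clients and with other DataNodes, and it does not use Hadoop's IPC mechanism.
3. Start the JvmPauseMonitor, which detects JVM pauses and logs a message whenever it finds one.
4. Initialize the IpcServer, listening on port 50020.
5. Construct a BPOfferService and start its BPServiceActor thread. A BPServiceActor is a thread that first handshakes with the NameNode as a pre-registration step, then registers the DataNode with the NameNode, and afterwards periodically sends heartbeats to the NameNode and processes the commands returned in the responses.

Step 5 in detail corresponds to the following code: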

public void run() {
    LOG.info(this + " starting to offer service");

    try {
      while (true) {
        // init stuff
        try {
          // setup storage
          connectToNNAndHandshake();
          break;
        } catch (IOException ioe) {
          // Initial handshake, storage recovery or registration failed
          runningState = RunningState.INIT_FAILED;
          if (shouldRetryInit()) {
            // Retry until all namenode's of BPOS failed initialization
            LOG.error("Initialization failed for " + this + " "
                + ioe.getLocalizedMessage());
            sleepAndLogInterrupts(5000, "initializing");
          } else {
            runningState = RunningState.FAILED;
            LOG.error("Initialization failed for " + this + ". Exiting. ", ioe);
            return;
          }
        }
      }

      runningState = RunningState.RUNNING;
      if (initialRegistrationComplete != null) {
        initialRegistrationComplete.countDown();
      }

      while (shouldRun()) {
        try {
          offerService();
        } catch (Exception ex) {
          LOG.error("Exception in BPOfferService for " + this, ex);
          sleepAndLogInterrupts(5000, "offering service");
        }
      }
      runningState = RunningState.EXITED;
    } catch (Throwable ex) {
      LOG.warn("Unexpected exception in block pool " + this, ex);
      runningState = RunningState.FAILED;
    } finally {
      LOG.warn("Ending block pool service for: " + this);
      cleanUp();
    }
  }
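The first loop in run() is a retry-until-success pattern around the handshake. A minimal sketch of that pattern follows. HandshakeRetrySketch, the Handshake interface, and the maxAttempts limit are hypothetical simplifications: the real code asks shouldRetryInit() rather than counting attempts, and sleeps 5000 ms between tries.

```java
import java.io.IOException;

// Hypothetical sketch of the "retry the handshake, then serve" structure.
class HandshakeRetrySketch {
    interface Handshake {
        void connect() throws IOException;
    }

    // Returns the number of attempts used, or -1 if initialization gave up.
    static int initWithRetry(Handshake handshake, int maxAttempts, long sleepMillis) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                handshake.connect();      // stands in for connectToNNAndHandshake()
                return attempts;          // success: leave the init loop
            } catch (IOException ioe) {
                if (attempts >= maxAttempts) {
                    return -1;            // stands in for the FAILED state
                }
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] failuresLeft = {2};         // fail twice, then succeed
        int attempts = initWithRetry(() -> {
            if (failuresLeft[0]-- > 0) {
                throw new IOException("NameNode not reachable yet");
            }
        }, 5, 10L);
        System.out.println("attempts=" + attempts); // prints attempts=3
    }
}
```

Once initialization succeeds, the second loop in run() keeps calling offerService() for as long as shouldRun() returns true.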

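The BPServiceActor thread does the following things:

It sends a versionRequest to the NameNode to obtain the NameNode's namespace and version information. The response is a NamespaceInfo.

It uses the NamespaceInfo to initialize Storage, formatting it before initialization. After initialization it generates a UUID, as the following log shows: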


2017-02-11 21:59:33,901 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Setting up storage: nsid=537369943;bpid=BP-503975772-192.168.0.109-1486821555429;lv=-56;nsInfo=lv=-60;cid=CID-c79cc043-b282-435c-a0f6-d5a55b23e87e;nsid=537369943;c=0;bpid=BP-503975772-192.168.0.109-1486821555429;dnuuid=null
2017-02-11 21:59:33,902 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Generated and persisted new Datanode UUID 43ed99d1-20c6-4d71-919c-e9a70cb75c6e

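Then comes the real handshake: it sends a registerDatanode request to the NameNode. The NameNode handles the request and uses DatanodeManager to perform registerDatanode. The NameNode log then shows entries like these: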

2017-02-11 21:59:34,090 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(127.0.0.1, datanodeUuid=43ed99d1-20c6-4d71-919c-e9a70cb75c6e, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-c79cc043-b282-435c-a0f6-d5a55b23e87e;nsid=537369943;c=0) storage 43ed99d1-20c6-4d71-919c-e9a70cb75c6e
2017-02-11 21:59:34,099 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2017-02-11 21:59:34,100 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:50010
2017-02-11 21:59:34,189 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2017-02-11 21:59:34,189 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID DS-7d302778-acd6-4366-be5e-9dbf7ad22c4d for DN 127.0.0.1:50010

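It calls offerService and starts sending heartbeats periodically. Each heartbeat carries the DataNode name, the data transfer port, the total capacity, and the remaining bytes. When the NameNode receives a heartbeat, it processes it in handleHeartbeat.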
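The heartbeat contents listed above can be pictured as a small record that the NameNode side stores per DataNode. The sketch below is purely illustrative: HeartbeatSketch and its Heartbeat class are hypothetical types, not Hadoop classes, and keeping only the latest heartbeat in a map is an assumption made for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical types for illustration; these are not Hadoop classes.
class HeartbeatSketch {
    // The fields the text lists for each heartbeat.
    static final class Heartbeat {
        final String datanodeName;
        final int transferPort;
        final long capacityBytes;
        final long remainingBytes;

        Heartbeat(String datanodeName, int transferPort,
                  long capacityBytes, long remainingBytes) {
            this.datanodeName = datanodeName;
            this.transferPort = transferPort;
            this.capacityBytes = capacityBytes;
            this.remainingBytes = remainingBytes;
        }
    }

    // NameNode side: remember the most recent heartbeat per DataNode.
    private final Map<String, Heartbeat> latest = new HashMap<>();

    void handleHeartbeat(Heartbeat hb) {
        latest.put(hb.datanodeName, hb);
    }

    // Returns the last reported remaining bytes, or -1 if the node is unknown.
    long remainingBytes(String datanodeName) {
        Heartbeat hb = latest.get(datanodeName);
        return hb == null ? -1L : hb.remainingBytes;
    }

    public static void main(String[] args) {
        HeartbeatSketch nameNode = new HeartbeatSketch();
        nameNode.handleHeartbeat(new Heartbeat("127.0.0.1", 50010, 1000L, 800L));
        nameNode.handleHeartbeat(new Heartbeat("127.0.0.1", 50010, 1000L, 750L));
        System.out.println(nameNode.remainingBytes("127.0.0.1")); // prints 750
    }
}
```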

At this point both the NameNode and the DataNode are running normally, and HDFS startup is complete.