Configuration in Hadoop

The $HADOOP_INSTALL/hadoop/conf directory contains the Hadoop configuration files. They are:

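hadoop-env.sh - This file contains the environment variables used when running Hadoop. You can use these settings to change the behavior of the Hadoop daemons, for example the location where log files are stored or the maximum heap size Hadoop may use. The only variable in this file that must be set is JAVA_HOME, which specifies the JDK installation directory.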
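As a rough illustration of the settings described above, a hadoop-env.sh excerpt might look like the following. The three paths and values are assumptions chosen for the example, not defaults from this document; adjust them to your own machine.

```shell
# Excerpt of conf/hadoop-env.sh. All values below are illustrative assumptions.

# Required: root directory of the JDK that Hadoop should use (hypothetical path).
export JAVA_HOME=/usr/lib/jvm/default-java

# Optional: maximum heap size for the Hadoop daemons, in MB.
export HADOOP_HEAPSIZE=1000

# Optional: directory where the daemons write their log files (hypothetical path).
export HADOOP_LOG_DIR=/var/log/hadoop
```

Only JAVA_HOME has to be set; the other two lines can be left out until the defaults prove unsuitable.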

slaves - This file lists the hosts that run the Hadoop slave daemons (datanodes and tasktrackers), one host per line. By default it contains the single entry localhost.
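A minimal sketch of building such a slaves file is shown here. The hostnames are hypothetical, and the file is written to a scratch directory so that running the sketch cannot overwrite a real configuration; copy the result into your conf directory once it looks right.

```shell
# Write a slaves file, one host per line (hostnames are hypothetical).
CONF_DIR=$(mktemp -d)
cat > "$CONF_DIR/slaves" <<'EOF'
node01.example.com
node02.example.com
node03.example.com
EOF

# Count the configured slave hosts, ignoring blank lines.
SLAVE_COUNT=$(grep -c . "$CONF_DIR/slaves")
echo "slaves configured: $SLAVE_COUNT (file: $CONF_DIR/slaves)"
```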

hadoop-default.xml - This file contains the default settings for the Hadoop daemons and Map/Reduce jobs. Do not modify this file.

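mapred-default.xml - This file contains site-specific settings for the Hadoop Map/Reduce daemons and jobs. It is empty by default. Settings placed here override the Map/Reduce settings in hadoop-default.xml.
Use this file to tailor Map/Reduce behavior for your site.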

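hadoop-site.xml - This file contains site-specific settings for all Hadoop daemons and Map/Reduce jobs. It is empty by default. Settings placed here override those in hadoop-default.xml and mapred-default.xml. This file should contain the settings that every server and client in the Hadoop installation needs to know, such as the locations of the namenode and the jobtracker.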

Basic Configuration

Take a pass at putting together basic configuration settings for your cluster. Some of the settings that follow are required, others are recommended for more straightforward and predictable operation.

Hadoop Environment Settings - Ensure that JAVA_HOME is set in hadoop-env.sh and points to the Java installation you intend to use. You can set other environment variables in hadoop-env.sh to suit your requirements. Some of the default settings refer to the variable HADOOP_HOME. The value of HADOOP_HOME is automatically inferred from the location of the startup scripts: it is the parent directory of the bin directory that holds the Hadoop scripts. In this instance it is $HADOOP_INSTALL/hadoop.
Jobtracker and Namenode settings - Decide where to run your namenode and jobtracker. Set the variable fs.default.name to the namenode's intended host:port. Set the variable mapred.job.tracker to the jobtracker's intended host:port. These settings should be in hadoop-site.xml. You may also want to set one or more of the following ports (also in hadoop-site.xml):
dfs.datanode.port
dfs.info.port
mapred.job.tracker.info.port
mapred.task.tracker.output.port
mapred.task.tracker.report.port
Data Path Settings - Figure out where your data goes. This includes settings for where the namenode stores the namespace checkpoint and the edits log, where the datanodes store filesystem blocks, storage locations for Map/Reduce intermediate output and temporary storage for the HDFS client. The default values for these paths point to various locations in /tmp. While this might be ok for a single node installation, for larger clusters storing data in /tmp is not an option. These settings must also be in hadoop-site.xml. It is important for these settings to be present in hadoop-site.xml because they can otherwise be overridden by client configuration settings in Map/Reduce jobs. Set the following variables to appropriate values:
dfs.name.dir
dfs.data.dir
dfs.client.buffer.dir
mapred.local.dir
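The data path settings above can be sketched as a small script that creates one directory per variable and prints matching property elements to paste into hadoop-site.xml. The DATA_BASE variable, the subdirectory layout and the data-paths.xml file name are assumptions made for this example; the default base is a scratch directory only so the sketch is safe to run, and a real cluster would point it at persistent storage.

```shell
# Base directory for Hadoop data (hypothetical variable). Defaults to a scratch
# directory; on a real cluster set DATA_BASE to persistent storage instead.
DATA_BASE="${DATA_BASE:-$(mktemp -d)}"

# Emit one <property> element for a name/value pair.
emit_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Create each directory and collect the properties in a fragment file.
# Each input line is "property-name subdirectory" (the layout is an assumption).
FRAGMENT="$DATA_BASE/data-paths.xml"
: > "$FRAGMENT"
while read -r prop subdir; do
  mkdir -p "$DATA_BASE/$subdir"
  emit_property "$prop" "$DATA_BASE/$subdir" >> "$FRAGMENT"
done <<'EOF'
dfs.name.dir dfs/name
dfs.data.dir dfs/data
dfs.client.buffer.dir dfs/client
mapred.local.dir mapred/local
EOF

cat "$FRAGMENT"
```

The printed elements belong inside the configuration element of hadoop-site.xml.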
An example of a hadoop-site.xml file:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <!-- host:port of the jobtracker, without a URI scheme -->
  <value>localhost:54311</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>8</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
</configuration>

Formatting the Namenode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster. You need to do this the first time you set up a Hadoop installation. Do not format a running Hadoop filesystem: doing so erases all your data. Before formatting, ensure that the dfs.name.dir directory exists. If you kept the default, then mkdir -p /tmp/hadoop-username/dfs/name (with username replaced by your own user name) will create the directory. To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
% $HADOOP_INSTALL/hadoop/bin/hadoop namenode -format

If asked whether to [re]format, you must reply with an uppercase Y (a lowercase y is not enough) if you want to reformat; otherwise Hadoop will abort the format.
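Because formatting erases existing data, one cautious approach is to check that the name directory holds nothing before formatting. The sketch below is only an assumption about how such a guard could look; the helper name_dir_is_empty is hypothetical, and it checks for existing files, not for running daemons.

```shell
# Succeeds (exit status 0) when the directory is missing or empty.
name_dir_is_empty() {
  if [ ! -d "$1" ]; then
    return 0
  fi
  [ -z "$(ls -A "$1")" ]
}

# Hypothetical usage: only format when the name directory holds no data.
#   NAME_DIR=/tmp/hadoop-$USER/dfs/name
#   if name_dir_is_empty "$NAME_DIR"; then
#     mkdir -p "$NAME_DIR"
#     $HADOOP_INSTALL/hadoop/bin/hadoop namenode -format
#   else
#     echo "refusing to format: $NAME_DIR is not empty" >&2
#   fi
```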

Starting a Single node cluster

Run the command:
% $HADOOP_INSTALL/hadoop/bin/start-all.sh
This will start up a Namenode, a Datanode, a Jobtracker and a Tasktracker on your machine.

Stopping a Single node cluster

Run the command
% $HADOOP_INSTALL/hadoop/bin/stop-all.sh
to stop all the daemons running on your machine.

Separating Configuration from Installation

In the example described above, the configuration files used by the Hadoop cluster all lie in the Hadoop installation. This can become cumbersome when upgrading to a new release, since all custom config has to be re-created in the new installation. It is possible to separate the config from the install. To do so, select a directory to house the Hadoop configuration (let's say /foo/bar/hadoop-config). Copy all conf files to this directory. You can either set the HADOOP_CONF_DIR environment variable to refer to this directory or pass it directly to the Hadoop scripts with the --config option. In this case, the cluster start and stop commands from the two sub-sections above become
% $HADOOP_INSTALL/hadoop/bin/start-all.sh --config /foo/bar/hadoop-config
and
% $HADOOP_INSTALL/hadoop/bin/stop-all.sh --config /foo/bar/hadoop-config
Pass only the absolute path of the config directory to the scripts.
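The separation described above can be sketched as a small helper that copies the conf files and reports the absolute path of the new directory, which is the form the scripts expect. The function name separate_conf is hypothetical.

```shell
# Copy every file from an installation's conf directory into a separate
# config directory, then print that directory's absolute path.
separate_conf() {
  src="$1"
  dest="$2"
  mkdir -p "$dest" || return 1
  cp -R "$src"/. "$dest"/ || return 1
  (cd "$dest" && pwd)
}

# Hypothetical usage:
#   HADOOP_CONF_DIR=$(separate_conf "$HADOOP_INSTALL/hadoop/conf" /foo/bar/hadoop-config)
#   export HADOOP_CONF_DIR
#   $HADOOP_INSTALL/hadoop/bin/start-all.sh --config "$HADOOP_CONF_DIR"
```

Printing the path through pwd means a relative destination still ends up as an absolute path.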

Starting up a larger cluster

Ensure that the Hadoop package is accessible from the same path on all nodes that are to be included in the cluster. If you have separated configuration from the install then ensure that the config directory is also accessible the same way.
Populate the slaves file with the nodes to be included in the cluster, one node per line.
Follow the steps in the Basic Configuration section above.
Format the Namenode as described in the Formatting the Namenode section above.
Run the command % $HADOOP_INSTALL/hadoop/bin/start-dfs.sh on the node you want the Namenode to run on. This will bring up HDFS with the Namenode running on the machine you ran the command on and Datanodes on the machines listed in the slaves file mentioned above.
Run the command % $HADOOP_INSTALL/hadoop/bin/start-mapred.sh on the machine you plan to run the Jobtracker on. This will bring up the Map/Reduce cluster with Jobtracker running on the machine you ran the command on and Tasktrackers running on machines listed in the slaves file.
The above two commands can also be executed with the --config option.

Stopping the cluster

The cluster can be stopped by running % $HADOOP_INSTALL/hadoop/bin/stop-mapred.sh and then % $HADOOP_INSTALL/hadoop/bin/stop-dfs.sh on your Jobtracker and Namenode respectively. These commands also accept the --config option.
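The start and stop ordering for a larger cluster can be summarized with a dry-run helper that only prints the commands and the host each one should be run on. The function name cluster_commands and the CONFIG_DIR variable are hypothetical; nothing is executed against a cluster.

```shell
# Print the start or stop commands in the order described above.
# CONFIG_DIR is optional; when set it should be an absolute path.
cluster_commands() {
  action="$1"
  # Single quotes keep $HADOOP_INSTALL literal in the printed command.
  bin='$HADOOP_INSTALL/hadoop/bin'
  opts=""
  if [ -n "${CONFIG_DIR:-}" ]; then
    opts=" --config $CONFIG_DIR"
  fi
  case "$action" in
    start)
      echo "on the Namenode host:   $bin/start-dfs.sh$opts"
      echo "on the Jobtracker host: $bin/start-mapred.sh$opts"
      ;;
    stop)
      echo "on the Jobtracker host: $bin/stop-mapred.sh$opts"
      echo "on the Namenode host:   $bin/stop-dfs.sh$opts"
      ;;
    *)
      echo "usage: cluster_commands start|stop" >&2
      return 1
      ;;
  esac
}

cluster_commands start
cluster_commands stop
```

Stopping reverses the start order: Map/Reduce is shut down before HDFS.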
