Hadoop configuration file parameters explained

Generally speaking, Hadoop ships with three files of default parameters: core-default.xml, hdfs-default.xml and mapred-default.xml.

The site-specific files that users need to configure are core-site.xml, hdfs-site.xml and mapred-site.xml. The meaning of the relevant parameters is explained below.


1. core-site.xml
[node1 conf]$ cat core-site.xml

<property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.0.75:9000/</value>
    <description>URI of NameNode.</description>
</property>

--fs.default.name: the default filesystem URI.
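Once fs.default.name is set, root and relative paths resolve against this URI, and the fully qualified form remains equivalent (a quick sketch, assuming the NameNode at 192.168.0.75:9000 is up):

[node1 conf]$ hadoop fs -ls /                              # resolved against fs.default.name
[node1 conf]$ hadoop fs -ls hdfs://192.168.0.75:9000/      # same listing, fully qualified URI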

<property>
    <name>hadoop.tmp.dir</name>
    <value>/data1/tmp</value>
    <description>Temp dir.</description>
</property>

--hadoop.tmp.dir: base directory for temporary files; many other Hadoop paths default to subdirectories of it.

<property>
    <name>dfs.hosts.exclude</name>
    <value>/home/ocdc/hadoop-ocdc/conf/excludes</value>
    <description>List of excluded DataNodes.</description>
</property>

-- dfs.hosts.exclude: the DataNode blacklist file; hosts listed here are excluded (decommissioned) from the cluster.
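To take a DataNode out of service, its hostname is appended to the excludes file and the NameNode is told to re-read it (a sketch; "node5" is a hypothetical hostname):

[node1 conf]$ echo "node5" >> /home/ocdc/hadoop-ocdc/conf/excludes
[node1 conf]$ hadoop dfsadmin -refreshNodes      # the NameNode starts decommissioning node5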

<property>
    <name>fs.default.name0</name>
    <value>hdfs://0.0.0.0:9000/</value>
</property>

--fs.default.name0: Avatar Hadoop setting, the URI of the primary NameNode; here it is bound to the wildcard address 0.0.0.0.

<property>
    <name>fs.default.name1</name>
    <value>hdfs://192.168.0.69:9000/</value>
</property>

--fs.default.name1: URI of the standby NameNode.

2. hdfs-site.xml


[node1 conf]$ cat hdfs-site.xml

<property>
    <name>dfs.name.dir</name>
    <value>/home/ocdc/hadoop-ocdc/data/namenode</value>
    <description>File fsimage location.
        If this is a comma-delimited list of directories then the name
        table is replicated in all of the directories, for redundancy.
    </description>
</property>

--dfs.name.dir: where the NameNode keeps the fsimage (name table); the default is ${hadoop.tmp.dir}/dfs/name. If several comma-delimited directories are configured, a copy of the data is kept in each of them.
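For redundancy the value can therefore be a comma-delimited list, for example a local disk plus an NFS mount (the second path below is purely illustrative, not taken from this cluster):

<property>
    <name>dfs.name.dir</name>
    <value>/home/ocdc/hadoop-ocdc/data/namenode,/mnt/nfs/namenode</value>
</property>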

<property>
    <name>dfs.name.edits.dir</name>
    <value>/home/ocdc/hadoop-ocdc/data/editlog</value>
    <description>File edits location.
        If this is a comma-delimited list of directories then the name
        table is replicated in all of the directories, for redundancy.
    </description>
</property>

--dfs.name.edits.dir: where the NameNode stores the edit log; multiple comma-delimited directories can be configured for redundancy.

<property>
    <name>dfs.data.dir</name>
    <value>/data01/data,/data02/data,/data03/data,/data04/data,/data05/data</value>
    <description>Determines where on the local filesystem an DFS data node
        should store its blocks.  If this is a comma-delimited
        list of directories, then data will be stored in all named
        directories, typically on different devices.
        Directories that do not exist are ignored.
    </description>
</property>

--dfs.data.dir: directories on the DataNode where block data is stored. Multiple directories can be configured to spread data across devices; directories that do not exist are ignored. Here five directories are used, as the df output below shows:
[hadoop@node1 ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_wtoc3-lv_root
                       25G  3.3G   20G  15% /
tmpfs                  32G   88K   32G   1% /dev/shm
/dev/cciss/c0d0p1     485M   38M  422M   9% /boot
/dev/mapper/vg_wtoc300-lv_data1
                      917G 1023M  870G   1% /data1
/dev/mapper/vg_wtoc301-lv_data2
                      917G  3.1G  868G   1% /data2
/dev/mapper/vg_wtoc3-lv_home
                       80G   13G   63G  17% /home
/dev/mapper/mpathe    474G  381G   69G  85% /data02
/dev/mapper/mpathf    474G  379G   71G  85% /data03
tmpfs                  16G  3.4G   13G  22% /dev/flare
/dev/mapper/mpathd    474G  378G   72G  85% /data01
fuse_dfs               12T  8.9T  2.8T  77% /home/ocdc/fuse-dfs
/dev/mapper/mpathg    474G  373G   77G  83% /data04
/dev/mapper/mpathh    474G  373G   77G  83% /data05


<property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50010</value>
    <description>The address where the datanode server will listen to.
    If the port is 0 then the server will start on a free port.
    </description>
</property>

--dfs.datanode.address: the address the DataNode data-transfer server listens on.
[hadoop@node1 ~]$ netstat -an|grep 50010|grep LISTEN
tcp        0      0 ::ffff:192.168.0.70:50010     :::*                        LISTEN

<property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:50075</value>
    <description>The datanode http server address and port.
    If the port is 0 then the server will start on a free port.
    </description>
</property>

--dfs.datanode.http.address: address and port of the DataNode HTTP server; the default port is 50075.

[hadoop@node1 ~]$ netstat -an|grep 50075
tcp        0      0 ::ffff:192.168.0.70:50075     :::*                        LISTEN


<property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:50020</value>
    <description>The datanode ipc server address and port.
    If the port is 0 then the server will start on a free port.
    </description>
</property>

--dfs.datanode.ipc.address: address and port of the DataNode IPC (RPC) server; the default port is 50020.


<property>
    <name>dfs.datanode.handler.count</name>
    <value>20</value>
    <description>The number of server threads for the datanode.</description>
</property>

--dfs.datanode.handler.count: number of server threads on the DataNode.


<property>
    <name>dfs.namenode.handler.count</name>
    <value>20</value>
    <description>The number of server threads for the namenode.</description>
</property>

--dfs.namenode.handler.count: number of server threads on the NameNode.

<property>
    <name>dfs.web.ugi</name>
    <value>hadoop,hadoop</value>
    <description>The user account used by the web interface.
    Syntax: USERNAME,GROUP1,GROUP2, ...
    </description>
</property>

--dfs.web.ugi: user name and groups the web interface runs as; note that the value here clearly does not match the accounts actually configured on these hosts.

<property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
    </description>
</property>

--dfs.replication: default number of block replicas; the Hadoop default is 3, here it is lowered to 2. The actual replication factor can also be set per file at creation time.
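The replication factor can be adjusted per file as well, either after the fact or at write time (a sketch; /user/ocdc/test.txt and local.txt are hypothetical paths):

[hadoop@node1 ~]$ hadoop fs -setrep -w 3 /user/ocdc/test.txt                  # raise this file to 3 replicas and wait
[hadoop@node1 ~]$ hadoop fs -D dfs.replication=1 -put local.txt /user/ocdc/   # write a file with only 1 replica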

<property>
    <name>dfs.replication.min</name>
    <value>1</value>
    <description>Minimal block replication.</description>
</property>

--dfs.replication.min: minimum number of block replicas.

<property>
    <name>dfs.support.append</name>
    <value>true</value>
    <description>set if hadoop support append</description>
</property>

--dfs.support.append: whether appending to existing files is allowed; the default is false.

<property>
    <name>fs.checkpoint.period</name>
    <value>30</value>
    <description>The number of seconds between two periodic checkpoints.</description>
</property>

---fs.checkpoint.period: interval between checkpoints in seconds; the default is 3600 seconds, here it is set to 30.

<property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:50070</value>
</property>

---dfs.http.address: HTTP address and port of the NameNode web UI.

<property>
    <name>fs.checkpoint.dir</name>
    <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
    <description>Determines where on the local filesystem the DFS secondary
        name node should store the temporary images to merge.
        If this is a comma-delimited list of directories then the image is
        replicated in all of the directories for redundancy.
    </description>
</property>

--fs.checkpoint.dir: local directory where the secondary NameNode stores the temporary images it merges; a comma-delimited list replicates the image in each directory.

<property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>10485760</value>
    <description>Specifies the maximum bandwidth that each datanode can utilize for the balancing purpose in term of the number of bytes per second.</description>
</property>

--dfs.balance.bandwidthPerSec: maximum bandwidth, in bytes per second, that each DataNode may use while rebalancing; 10485760 bytes/s is 10 MB/s.
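This cap only matters while the balancer is running; a typical invocation looks like the following (the 10% threshold is a per-cluster judgment call):

[hadoop@node1 ~]$ hadoop balancer -threshold 10      # move blocks until every node is within 10% of the cluster average utilization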

<property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
</property>

--dfs.datanode.max.xcievers: upper limit on the number of block transceivers (files being read or written) a DataNode handles at the same time, somewhat analogous to the Linux open-file limit (nofile).
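Since each transceiver also consumes file descriptors, the OS open-file limit for the hadoop user should be raised accordingly (a sketch; 65536 is just a commonly used value, pick what fits the site):

[hadoop@node1 ~]$ ulimit -n            # show the current open-file limit
[hadoop@node1 ~]$ ulimit -n 65536      # raise it for this shell; make it permanent via /etc/security/limits.conf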

<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

--dfs.permissions: whether permissions are checked on file operations; false disables permission checking.

<property>
    <name>fs.hdfs.impl.disable.cache</name>
    <value>false</value>
</property>

----fs.hdfs.impl.disable.cache: whether to disable caching of the FileSystem instances that FileSystem.get() returns for hdfs:// URIs; false (the default) keeps the cache enabled.

<property>
    <name>dfs.name.dir.shared0</name>
    <value>/namenode/namenode0</value>
</property>

<property>
    <name>dfs.name.dir.shared1</name>
    <value>/namenode/namenode1</value>
</property>

<property>
    <name>dfs.name.edits.dir.shared0</name>
    <value>/namenode/editlog0</value>
</property>

<property>
    <name>dfs.name.edits.dir.shared1</name>
    <value>/namenode/editlog1</value>
</property>

---Besides the standard Hadoop settings, this file contains the Avatar-specific settings dfs.name.dir.shared0, dfs.name.edits.dir.shared0, dfs.name.dir.shared1 and dfs.name.edits.dir.shared1, which are the image and edit-log directories for AvatarNode0 and AvatarNode1 respectively. All of these directories live on the NFS share. When the primary NameNode runs on AvatarNode0 it writes its edit log to dfs.name.edits.dir.shared0 and the standby NameNode on AvatarNode1 reads it; conversely, when the primary runs on AvatarNode1 it writes to dfs.name.edits.dir.shared1 and the standby on AvatarNode0 reads it.

<property>
    <name>dfs.http.address0</name>
    <value>0.0.0.0:50070</value>
</property>

<property>
    <name>dfs.http.address1</name>
    <value>192.168.0.69:50070</value>
</property>

---0.0.0.0 means the HTTP interface can be reached through any network interface on the host.

3. mapred-site.xml

The third configuration file:
[node1 conf]$ cat mapred-site.xml

<property>
    <name>mapred.job.tracker</name>
    <value>192.168.0.75:9002</value>
    <description>The host and port that the MapReduce job tracker runs
        at.  If "local", then jobs are run in-process as a single map
        and reduce task.
    </description>
</property>

----mapred.job.tracker: host (or IP) and port of the JobTracker.

<property>
    <name>mapred.job.tracker.http.address</name>
    <value>192.168.0.75:50030</value>
    <description>The job tracker http server address and port the server will listen on.
        If the port is 0 then the server will start on a free port.
    </description>
</property>

---mapred.job.tracker.http.address: address and port the JobTracker HTTP server listens on.

<property>
    <name>mapred.job.tracker.handler.count</name>
    <value>2</value>
    <description>The number of server threads for the JobTracker. This should be roughly
        4% of the number of tasktracker nodes.
    </description>
</property>

--mapred.job.tracker.handler.count: number of JobTracker server threads; the rule of thumb is roughly 4% of the number of TaskTracker nodes.

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
</property>

--mapred.tasktracker.map.tasks.maximum: maximum number of map tasks a TaskTracker runs simultaneously.

<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
    <description>The maximum number of reduce tasks that will be run
        simultaneously by a task tracker.
    </description>
</property>

--mapred.tasktracker.reduce.tasks.maximum: maximum number of reduce tasks a TaskTracker runs simultaneously.
--Together these two parameters define the map and reduce slot counts, i.e. they control how many tasks can run concurrently on each TaskTracker.


<property>
    <name>mapred.task.tracker.report.address</name>
    <value>127.0.0.1:0</value>
    <description>The interface and port that task tracker server listens on.
        Since it is only connected to by the tasks, it uses the local interface.
        EXPERT ONLY. Should only be changed if your host does not have the loopback
        interface.</description>
</property>

--mapred.task.tracker.report.address: interface and port the TaskTracker report server listens on; it is only contacted by local tasks, so it should stay at 127.0.0.1 unless the host has no loopback interface.

<property>
    <name>mapred.local.dir</name>
    <value>${hadoop.tmp.dir}/mapred/local</value>
    <description>The local directory where MapReduce stores intermediate
        data files.  May be a comma-separated list of
        directories on different devices in order to spread disk i/o.
        Directories that do not exist are ignored.
    </description>
</property>

--mapred.local.dir: local directory where MapReduce stores intermediate data; multiple directories on different devices can be listed to spread disk I/O, and directories that do not exist are ignored.

<property>
    <name>mapred.system.dir</name>
    <value>${hadoop.tmp.dir}/mapred/system</value>
    <description>The shared directory where MapReduce stores control files.</description>
</property>

--mapred.system.dir: shared HDFS path where the MapReduce framework stores its control (system) files.

<property>
    <name>mapred.temp.dir</name>
    <value>${hadoop.tmp.dir}/mapred/temp</value>
    <description>A shared directory for temporary files.</description>
</property>

--mapred.temp.dir: shared directory for MapReduce temporary files.

<property>
    <name>mapred.map.tasks</name>
    <value>1</value>
    <description>The default number of map tasks per job.
        Ignored when mapred.job.tracker is "local".
    </description>
</property>

--mapred.map.tasks: default number of map tasks per job.

<property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
    <description>The default number of reduce tasks per job. Typically set to 99%
        of the cluster's reduce capacity, so that if a node fails the reduces can
        still be executed in a single wave.
        Ignored when mapred.job.tracker is "local".
    </description>
</property>

---mapred.reduce.tasks: default number of reduce tasks per job.
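Both defaults are normally overridden per job rather than cluster-wide, for example through the generic -D options (a sketch; the jar, class and paths are hypothetical, and the main class is assumed to use ToolRunner):

[hadoop@node1 ~]$ hadoop jar my-job.jar MyJob -D mapred.reduce.tasks=8 /input /output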

<property>
    <name>hadoop.job.history.user.location</name>
    <value>none</value>
    <description>User can specify a location to store the history files of
        a particular job. If nothing is specified, the logs are stored in
        output directory. The files are stored in "_logs/history/" in the directory.
        User can stop logging by giving the value "none".
    </description>
</property>

---hadoop.job.history.user.location: where per-job history files are written (by default under _logs/history/ in the job output directory); the value "none" used here disables this logging.

<property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
</property>

--mapred.compress.map.output: whether map output is compressed before it is written to disk and shuffled to the reducers.
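When map output compression is enabled, the codec is selected with a companion property; a minimal sketch using the built-in DefaultCodec (whether a different codec pays off depends on the CPU/IO trade-off of the cluster):

<property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>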

<property>
    <name>mapred.tasktracker.expiry.interval</name>
    <value>30000</value>
</property>

---mapred.tasktracker.expiry.interval: time in milliseconds after which a TaskTracker that has not sent a heartbeat is declared lost; here 30000 ms, i.e. 30 seconds.

<property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
    <description>How many tasks to run per jvm. If set to -1, there is
        no limit.
    </description>
</property>

---mapred.job.reuse.jvm.num.tasks: how many tasks may reuse a single JVM; -1 means no limit.

<property>
    <name>mapred.task.timeout</name>
    <value>90000000</value>
    <description>The number of milliseconds before a task will be
        terminated if it neither reads an input, writes an output, nor
        updates its status string.
    </description>
</property>

---mapred.task.timeout: a task that neither reads input, writes output, nor updates its status for this many milliseconds is killed. Here 90000000 / 1000 / 60 / 60 = 25 hours; the default is 10 minutes.

<property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
</property>

---mapred.reduce.parallel.copies: number of parallel transfers a reduce task uses during the copy (shuffle) phase; the default is 5.

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
</property>

---mapred.child.java.opts: JVM options passed to each child task JVM; -Xmx sets the maximum heap. Reduces are usually the memory-hungry side, but avoid setting this too high; if a job needs more than about 2 GB per task, the program design itself is worth optimizing first.
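Note that this heap interacts with the slot counts above: with 7 map slots and 1 reduce slot per TaskTracker and -Xmx1024m, task JVMs alone can claim roughly 8 GB of heap per node. The value can also be overridden for a single job instead of cluster-wide (a sketch; jar, class and paths are hypothetical, and the main class is assumed to use ToolRunner):

[hadoop@node1 ~]$ hadoop jar my-job.jar MyJob -D mapred.child.java.opts=-Xmx512m /input /output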



--- Settings related to the FAIR scheduler

---The default scheduler is FIFO; alternatives are the Capacity Scheduler (developed by Yahoo!) and the Fair Scheduler (developed by Facebook).
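For reference, on Hadoop 1.x the Fair Scheduler is enabled by pointing the JobTracker at its scheduler class; a minimal sketch (the allocation-file path is illustrative):

<property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/home/ocdc/hadoop-ocdc/conf/fair-scheduler.xml</value>
</property>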




References
http://blog.csdn.net/yangjl38/article/details/7583374
http://hadoop.apache.org/docs/r1.0.4/cluster_setup.html#Configuring+the+Hadoop+Daemons
