HDFS: Configuration Property Checklist

1 Data Storage

dfs.namenode.name.dir

Storage directory for the fsimage and edits files

Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
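
For example, a minimal hdfs-site.xml entry that replicates the name table across two directories for redundancy (the paths below are purely illustrative; in practice they would sit on separate physical disks or an NFS mount):

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
  </property>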

dfs.datanode.data.dir

Directories where the DataNode stores block data on the local filesystem

Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. The directories should be tagged with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS storage policies. The default storage type will be DISK if the directory does not have a storage type tagged explicitly. Directories that do not exist will be created if local filesystem permission allows.
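
As a sketch, an hdfs-site.xml entry mixing tagged and untagged directories (paths and tags are illustrative); the untagged directory falls back to the default DISK storage type:

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>[SSD]/data/1/dfs/dn,[DISK]/data/2/dfs/dn,/data/3/dfs/dn</value>
  </property>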

dfs.replication

Block replication factor; default 3

More replicas mean better fault tolerance, at the cost of more disk space.

Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.

dfs.blocksize

Block size; default 128 MB

Too small >> a single file is split into many blocks >> the NameNode must keep more metadata, consuming more memory and increasing lookup time

Too large >> each block takes longer to transfer over the network, and a failed transfer is more expensive to retry

The default block size for new files, in bytes. You can use the following suffix (case insensitive): k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.), or provide the complete size in bytes (such as 134217728 for 128 MB).
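
For illustration, both properties can be set in hdfs-site.xml, and dfs.blocksize accepts the suffixes listed above (the values here are examples, not recommendations):

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>256m</value>
  </property>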

dfs.datanode.du.reserved

Reserved disk space per DataNode volume; default 0

Reserved space in bytes per volume. Always leave this much space free for non dfs use. Specific storage type based reservation is also supported. The property can be followed with corresponding storage types ([ssd]/[disk]/[archive]/[ram_disk]) for clusters with heterogeneous storage. For example, reserved space for RAM_DISK storage can be configured using the property 'dfs.datanode.du.reserved.ram_disk'. If a specific storage type reservation is not configured then dfs.datanode.du.reserved will be used. Supports multiple size unit suffixes (case insensitive), as described in dfs.blocksize. Note: In case of using tune2fs to set reserved-blocks-percentage, or other filesystem tools, then you can possibly run into out of disk errors because hadoop will not check those external tool configurations.
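
An illustrative reservation of 10 GB per volume, plus a smaller reservation for RAM_DISK volumes via the storage-type-specific variant mentioned above (the amounts are examples only):

  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>10g</value>
  </property>
  <property>
    <name>dfs.datanode.du.reserved.ram_disk</name>
    <value>1g</value>
  </property>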

dfs.datanode.failed.volumes.tolerated

Number of failed volumes a DataNode tolerates before shutting down; default 0

The number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shutdown. The value should be greater than or equal to -1; -1 represents a minimum of 1 valid volume.
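
For instance, on DataNodes with many disks it may be acceptable to keep serving after a single disk failure (the value 1 below is purely illustrative):

  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value>
  </property>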

2 HDFS Permissions

dfs.permissions.enabled

Whether HDFS permission checking (UGO) is enforced

If “true”, enable permission checking in HDFS. If “false”, permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.
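
A sketch of turning permission checking off, e.g. on a throwaway test cluster (the value shown is an example; the default is true):

  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>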

3 DataNode Scanning

dfs.blockreport.intervalMsec

Interval at which DataNodes send block reports; default 21600000 ms (6 hours)

Determines block reporting interval in milliseconds.

dfs.block.scanner.volume.bytes.per.second

Block scanner I/O throttle; default 1048576 bytes/s (1 MB/s)

If this is 0, the DataNode’s block scanner will be disabled. If this is positive, this is the number of bytes per second that the DataNode’s block scanner will try to scan from each volume.

dfs.datanode.scan.period.hours

Full block scan period; default 504 hours (3 weeks)

If this is positive, the DataNode will not scan any individual block more than once in the specified scan period. If this is negative, the block scanner is disabled. If this is set to zero, then the default value of 504 hours or 3 weeks is used. Prior versions of HDFS incorrectly documented that setting this key to zero will disable the block scanner.

dfs.datanode.directoryscan.interval

Interval at which the DataNode scans its data directories to check on-disk blocks against in-memory metadata; default 21600 s (6 hours)

Interval in seconds for Datanode to scan data directories and reconcile the difference between blocks in memory and on the disk. Supports multiple time unit suffixes (case insensitive), as described in dfs.heartbeat.interval. If no time unit is specified then seconds is assumed.

dfs.datanode.directoryscan.threads

Number of directory scanner threads; default 1

The number of threads in the thread pool used to compile reports for volumes in parallel.
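
Putting the scanning-related properties together, a hypothetical hdfs-site.xml fragment that simply makes the documented defaults explicit (shown for illustration only):

  <property>
    <name>dfs.blockreport.intervalMsec</name>
    <value>21600000</value>
  </property>
  <property>
    <name>dfs.block.scanner.volume.bytes.per.second</name>
    <value>1048576</value>
  </property>
  <property>
    <name>dfs.datanode.scan.period.hours</name>
    <value>504</value>
  </property>
  <property>
    <name>dfs.datanode.directoryscan.interval</name>
    <value>21600</value>
  </property>
  <property>
    <name>dfs.datanode.directoryscan.threads</name>
    <value>1</value>
  </property>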

4 Block Recovery

dfs.namenode.replication.work.multiplier.per.iteration

Multiplier for the number of block replication transfers scheduled per heartbeat; default 2

Note: Advanced property. Change with caution. This determines the total amount of block transfers to begin in parallel at a DN, for replication, when such a command list is being sent over a DN heartbeat by the NN. The actual number is obtained by multiplying this multiplier with the total number of live nodes in the cluster. The result number is the number of blocks to begin transfers immediately for, per DN heartbeat. This number can be any positive, non-zero integer.

dfs.namenode.replication.max-streams

Limit on block replication streams, excluding highest-priority ones; default 2

Hard limit for the number of replication streams other than those with highest-priority.

dfs.namenode.replication.max-streams-hard-limit

Upper limit on all block replication streams; default 4

Hard limit for all replication streams.
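
The multiplier is applied to the number of live DataNodes: with 10 live nodes and the default multiplier of 2, for example, up to 20 block transfers are scheduled per heartbeat interval. A sketch of raising all three knobs to speed up re-replication (the values are illustrative, not recommendations):

  <property>
    <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
    <value>4</value>
  </property>
  <property>
    <name>dfs.namenode.replication.max-streams</name>
    <value>4</value>
  </property>
  <property>
    <name>dfs.namenode.replication.max-streams-hard-limit</name>
    <value>8</value>
  </property>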

5 DataNode Transfer Threads

dfs.datanode.max.transfer.threads

Number of threads a DataNode uses for reading and writing blocks; default 4096

  • Specifies the maximum number of threads to use for transferring data in and out of the DN.

  • dfs.datanode.max.transfer.threads is the number of DataXceiver threads, which are used for transferring blocks via the DTP (data transfer protocol). Block data is large and a transfer takes some time: one thread serves one block read, and the thread can only be reused once the whole block has been transferred. If many clients request blocks at the same time, more threads are needed. Each write connection uses 2 threads, so this number should be larger for write-bound applications. [java - Threads in Hadoop - Stack Overflow]
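
A write-heavy cluster may therefore want a larger thread pool; for example (the value below is illustrative only):

  <property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>8192</value>
  </property>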

6 Safe Mode

dfs.namenode.replication.min

Minimum replication factor; default 1

Minimal block replication.

dfs.namenode.safemode.threshold-pct

Safe mode threshold; default 0.999f

Specifies the percentage of blocks that should satisfy the minimal replication requirement defined by dfs.namenode.replication.min. Values less than or equal to 0 mean not to wait for any particular percentage of blocks before exiting safemode. Values greater than 1 will make safe mode permanent.
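
As a worked example of the threshold: with dfs.namenode.replication.min = 1 and the default threshold of 0.999, the NameNode leaves safe mode only once at least 99.9% of blocks have at least one reported replica. A sketch that simply makes the defaults explicit:

  <property>
    <name>dfs.namenode.replication.min</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.safemode.threshold-pct</name>
    <value>0.999f</value>
  </property>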

7 HA Configuration

dfs.nameservices

the logical name for this new nameservice


  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>

dfs.ha.namenodes.[nameservice ID]

unique identifiers for each NameNode in the nameservice


  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2,nn3</value>
  </property>

dfs.namenode.rpc-address.[nameservice ID].[name node ID]

the fully-qualified RPC address for each NameNode to listen on


  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>machine1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>machine2.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn3</name>
    <value>machine3.example.com:8020</value>
  </property>

dfs.namenode.http-address.[nameservice ID].[name node ID]

the fully-qualified HTTP address for each NameNode to listen on


  <property>
    <name>dfs.namenode.http-address.mycluster.nn1</name>
    <value>machine1.example.com:9870</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.nn2</name>
    <value>machine2.example.com:9870</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.nn3</name>
    <value>machine3.example.com:9870</value>
  </property>

dfs.namenode.shared.edits.dir

the location of the shared storage directory


  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
  </property>

dfs.client.failover.proxy.provider.[nameservice ID]

the Java class that HDFS clients use to contact the Active NameNode


  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>

dfs.ha.fencing.methods

a list of scripts or Java classes which will be used to fence the Active NameNode during a failover

It is critical for correctness of the system that only one NameNode be in the Active state at any given time. Thus, during a failover, we first ensure that the Active NameNode is either in the Standby state, or the process has terminated, before transitioning another NameNode to the Active state. In order to do this, you must configure at least one fencing method. These are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded. There are two methods which ship with Hadoop: shell and sshfence. For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.


  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>shell(/bin/true)</value>
  </property>

fs.defaultFS

the default path prefix used by the Hadoop FS client when none is given


  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
