http://www.cnblogs.com/rilley/archive/2012/02/13/2349858.html
http://www.cnblogs.com/licheng/archive/2011/11/08/2241854.html
http://www.blogjava.net/ivanwan/archive/2011/01/21/343328.html
Today I needed to remove two datanodes from a Hadoop cluster. To avoid disrupting the jobs that were running, the nodes had to be removed dynamically (decommissioned). The procedure is recorded below:
1. Before a node is removed from the cluster, its data must be replicated to other nodes (decommissioning).
Add the following to the hdfs-site.xml configuration file on the NameNode:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/hadoop/conf/excludes</value>
</property>
Explanation:
dfs.hosts.exclude: names the file listing the nodes to be decommissioned.
/home/hadoop/hadoop/conf/excludes: the full path and name of that file; here it is named excludes.
2. Create the excludes file (touch excludes) in the directory configured in step 1, listing one node to remove per line:
cloud4
cloud5
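Step 2 can be sketched as a short shell snippet. It defaults to a throwaway temporary directory so it is safe to try; point CONF_DIR at the path configured in dfs.hosts.exclude (here /home/hadoop/hadoop/conf) to apply it for real:

```shell
# Write the excludes file named by dfs.hosts.exclude, one hostname per line.
# Defaults to a temporary directory; override CONF_DIR on a real cluster.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}
printf '%s\n' cloud4 cloud5 > "$CONF_DIR/excludes"
cat "$CONF_DIR/excludes"
```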
3. Run: hadoop dfsadmin -refreshNodes (I installed via yum; the hadoop directory will be in a different place for other installation methods). This command reloads the dfs.hosts and dfs.hosts.exclude settings dynamically, so no NameNode restart is needed.
After the command completes, the removed DataNodes disappear, but their TaskTracker processes keep running and have to be stopped by hand.
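A hedged sketch of this step. The HADOOP_CMD and DAEMON_CMD variables are my own scaffolding: they default to echo, so the snippet only prints what it would run; set them to the real hadoop and hadoop-daemon.sh commands (the latter on each removed node) to execute for real:

```shell
# Dry-run by default: swap 'echo ...' for the real commands on a live cluster.
HADOOP_CMD=${HADOOP_CMD:-echo hadoop}
# Reload dfs.hosts / dfs.hosts.exclude without restarting the NameNode.
$HADOOP_CMD dfsadmin -refreshNodes
# On each decommissioned node, the TaskTracker must be stopped manually.
DAEMON_CMD=${DAEMON_CMD:-echo hadoop-daemon.sh}
$DAEMON_CMD stop tasktracker
```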
4. Check progress with bin/hadoop dfsadmin -report; the output looks like this:
Configured Capacity: 17721082527744 (16.12 TB)
Present Capacity: 16806607028262 (15.29 TB)
DFS Remaining: 14996775104512 (13.64 TB)
DFS Used: 1809831923750 (1.65 TB)
DFS Used%: 10.77%
Under replicated blocks: 6788
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 6 (6 total, 0 dead)

Name: 192.168.1.5:50010
Decommission Status : Normal
Configured Capacity: 2953511657472 (2.69 TB)
DFS Used: 265079108972 (246.87 GB)
Non DFS Used: 150286670484 (139.97 GB)
DFS Remaining: 2538145878016 (2.31 TB)
DFS Used%: 8.98%
DFS Remaining%: 85.94%
Last contact: Thu Sep 08 10:12:45 CST 2011

Name: 192.168.1.8:50010
Decommission Status : Decommission in progress
Configured Capacity: 2953511657472 (2.69 TB)
DFS Used: 228590288896 (212.89 GB)
Non DFS Used: 150240718848 (139.92 GB)
DFS Remaining: 2574680649728 (2.34 TB)
DFS Used%: 7.74%
DFS Remaining%: 87.17%
Last contact: Thu Sep 08 10:12:45 CST 2011

Name: 192.168.1.7:50010
Decommission Status : Normal
Configured Capacity: 2953511657472 (2.69 TB)
DFS Used: 266826599821 (248.5 GB)
Non DFS Used: 150259458675 (139.94 GB)
DFS Remaining: 2536425598976 (2.31 TB)
DFS Used%: 9.03%
DFS Remaining%: 85.88%
Last contact: Thu Sep 08 10:12:46 CST 2011

Name: 192.168.1.9:50010
Decommission Status : Decommission in progress
Configured Capacity: 2953511657472 (2.69 TB)
DFS Used: 226060701696 (210.54 GB)
Non DFS Used: 150240718848 (139.92 GB)
DFS Remaining: 2577210236928 (2.34 TB)
DFS Used%: 7.65%
DFS Remaining%: 87.26%
Last contact: Thu Sep 08 10:12:45 CST 2011

Name: 192.168.1.4:50010
Decommission Status : Normal
Configured Capacity: 2953524240384 (2.69 TB)
DFS Used: 553202110857 (515.21 GB)
Non DFS Used: 163197603447 (151.99 GB)
DFS Remaining: 2237124526080 (2.03 TB)
DFS Used%: 18.73%
DFS Remaining%: 75.74%
Last contact: Thu Sep 08 10:12:46 CST 2011

Name: 192.168.1.6:50010
Decommission Status : Normal
Configured Capacity: 2953511657472 (2.69 TB)
DFS Used: 270073113508 (251.53 GB)
Non DFS Used: 150250329180 (139.93 GB)
DFS Remaining: 2533188214784 (2.3 TB)
DFS Used%: 9.14%
DFS Remaining%: 85.77%
Last contact: Thu Sep 08 10:12:44 CST 2011
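To watch decommission progress without scanning the whole report, the relevant lines can be filtered out. The snippet below runs grep over a fragment of the report shown above (embedded as a here-doc so it is runnable anywhere); on a live cluster you would pipe the real command output instead:

```shell
# Extract node name and decommission status from dfsadmin -report output.
# On a real cluster:
#   hadoop dfsadmin -report | grep -E '^(Name|Decommission Status)'
grep -E '^(Name|Decommission Status)' <<'EOF'
Name: 192.168.1.8:50010
Decommission Status : Decommission in progress
Configured Capacity: 2953511657472 (2.69 TB)
Name: 192.168.1.9:50010
Decommission Status : Decommission in progress
DFS Used%: 7.65%
EOF
```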
5. The command in step 4 shows the status of each node being removed; for example, 192.168.1.9 showing
Decommission Status : Decommissioned
means that node has finished replicating its data to the other nodes. If the status still reads "Decommission Status : Decommission in progress", replication is ongoing.
The node removal is now complete.
Lessons learned
Before pulling nodes, first stop any programs that write data into Hadoop; otherwise data keeps being replicated onto the nodes being removed, and the decommission never finishes.
Adding nodes
1. Edit the hosts file
Same as for an ordinary datanode: add the NameNode's IP.
2. Edit conf/slaves on the NameNode
Add the new node's IP or hostname.
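The slaves-file edit can be sketched like this. It writes to a temporary file by default so it is safe to run; SLAVES_FILE should point at the NameNode's real conf/slaves, and the hostname cloud6 is a made-up example:

```shell
# Append the new node to conf/slaves unless it is already listed.
SLAVES_FILE=${SLAVES_FILE:-$(mktemp)}
NEW_NODE=${NEW_NODE:-cloud6}   # hypothetical hostname for illustration
grep -qx "$NEW_NODE" "$SLAVES_FILE" 2>/dev/null || \
  printf '%s\n' "$NEW_NODE" >> "$SLAVES_FILE"
cat "$SLAVES_FILE"
```

The grep guard makes the edit idempotent, so rerunning the script never duplicates an entry.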
3. On the new node's machine, start the services (DataNode and TaskTracker).
4. Rebalance blocks:
[root@slave-004 hadoop]# ./bin/start-balancer.sh
1) Without balancing, the cluster will place all new data on the new node, which hurts MapReduce efficiency.
2) A balance threshold can be set; the default is 10%. A lower value makes the nodes more evenly balanced, but the balancing takes longer.
3) The balancer's bandwidth can be set; the default is only 1 MB/s:
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>1048576</value>
  <description>Specifies the maximum amount of bandwidth that each datanode can utilize for the balancing purpose in terms of the number of bytes per second.</description>
</property>
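The balancer invocation with an explicit threshold can be sketched as below. It is a dry run by default (BALANCER defaults to echo; point it at the real bin/start-balancer.sh to execute), and it also checks that the 1048576 bytes/s default above really is 1 MB/s:

```shell
# Dry-run: swap 'echo ...' for the real script on a live cluster.
BALANCER=${BALANCER:-echo ./bin/start-balancer.sh}
# -threshold 5: rebalance until each node's usage is within 5% of the
# cluster mean (tighter than the 10% default, but slower to converge).
$BALANCER -threshold 5
# dfs.balance.bandwidthPerSec is in bytes/s; 1048576 B/s == 1 MB/s.
echo $((1048576 / 1024 / 1024))
```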
Note:
1. Make sure the firewall on the slave is disabled;
2. Make sure the new slave's IP has been added to /etc/hosts on the master and on all other slaves, and conversely add the master's and the other slaves' IPs to the new slave's /etc/hosts.
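The hosts-file bookkeeping can be sketched like this. It writes to a temporary file by default; the IP/hostname pair is hypothetical, so substitute the new slave's real address and run it against /etc/hosts on every machine:

```shell
# Add the new slave to a hosts file unless an entry already exists.
HOSTS_FILE=${HOSTS_FILE:-$(mktemp)}
NEW_IP=${NEW_IP:-192.168.1.10}   # hypothetical address
NEW_HOST=${NEW_HOST:-cloud6}     # hypothetical hostname
grep -q "[[:space:]]$NEW_HOST\$" "$HOSTS_FILE" 2>/dev/null || \
  printf '%s\t%s\n' "$NEW_IP" "$NEW_HOST" >> "$HOSTS_FILE"
cat "$HOSTS_FILE"
```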
Removing nodes
1. Cluster configuration
Edit the conf/hdfs-site.xml file:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/data/soft/hadoop/conf/excludes</value>
  <description>Names a file that contains a list of hosts that are not permitted to connect to the namenode. The full pathname of the file must be specified. If the value is empty, no hosts are excluded.</description>
</property>
2. Decide which machines to take offline
The file named by dfs.hosts.exclude lists the machines to decommission, one per line (for example, the cloud4 and cloud5 entries shown earlier). Listed hosts are blocked from connecting to the NameNode.
3. Force a configuration reload
Run hadoop dfsadmin -refreshNodes; the NameNode will then migrate blocks off the excluded nodes in the background.
4. Shut down the nodes
Once the background block migration has finished, the decommissioned machines can be shut down safely.
hadoop dfsadmin -report shows which nodes are still connected to the cluster.
5. Edit the excludes file again
Once the machines have been decommissioned, they can be removed from the excludes file.
If you log in to a decommissioned machine, you will find the DataNode process is gone, but the TaskTracker is still running and has to be stopped by hand (hadoop-daemon.sh stop tasktracker).