Author: hovlj_1130 | Free to republish, but when you do, please credit the original source, author, and copyright notice with a hyperlink.
http://hi.baidu.com/hovlj_1130/blog/item/e8fe89c3e9a67e160ff47755.html

#Preparation
Scenario: install the Ganglia monitoring tool on a Hadoop cluster, collecting cpu, memory, disk, process, and network data as well as Hadoop-specific metrics.
server: hadoopbj01~hadoopbj08
hadoop-version: hadoop-0.20.2
ganglia-version: ganglia-3.1.1
Preparation: on the Ganglia server machine that will run gmetad (called ganglia_m below), run rhn_register (a Red Hat registration code is required) so that yum install works; otherwise the dependency packages will torture you half to death.

step1: (in ganglia_m)
#Install the base dependency packages
yum -y install apr-devel apr-util check-devel cairo-devel pango-devel libxml2-devel \
rpmbuild glib2-devel dbus-devel freetype-devel fontconfig-devel gcc-c++ expat-devel \
python-devel libXrender-devel pcre pcre-devel

step2: (in ganglia_m)
yum install libconfuse libconfuse-devel

step3: (in ganglia_m)
#Install RRDTool
wget http://dag.wieers.com/rpm/packages/rpmforge-release/rpmforge-release-0.3.6-1.el5.rf.i386.rpm
rpm -ivh rpmforge-release-0.3.6-1.el5.rf.i386.rpm
#This drops some yum repository definitions into /etc/yum.repos.d
yum install rrdtool
yum install rrdtool-devel
which rrdtool
ldconfig -p | grep rrd # make sure you have the new rrdtool libraries linked

step4: (in ganglia_m; covers sub-steps 0~4)
#Sub-step 0: install Ganglia
cd /tmp/
wget http://sourceforge.net/projects/ganglia/files/ganglia%20monitoring%20core/3.1.1%20\(Wien\)/ganglia-3.1.1.tar.gz/download
tar zxvf ganglia*gz
cd ganglia-3.1.1/
./configure --with-gmetad
make -j8
make install

#Sub-step 1: install the command-line files
cd /tmp/ganglia-3.1.1/ # you should already be in this directory
mkdir -p /var/www/html/ganglia/ # make sure you have apache installed
cp -a web/* /var/www/html/ganglia/ # this is the web interface
cp gmetad/gmetad.init /etc/rc.d/init.d/gmetad # startup script
cp gmond/gmond.init /etc/rc.d/init.d/gmond
mkdir /etc/ganglia # where config files go
gmond -t | tee /etc/ganglia/gmond.conf # generate initial gmond config
cp gmetad/gmetad.conf /etc/ganglia/ # initial gmetad configuration
mkdir -p /var/lib/ganglia/rrds # place where RRDTool graphs will be stored
chown nobody:nobody /var/lib/ganglia/rrds # make sure RRDTool can write here
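The gmetad data_source line in sub-step 2 and the /tmp/mynodes list in step 5 below both spell out hadoopbj01~hadoopbj08 by hand. For a larger cluster it is less error-prone to generate them; here is a sketch (the hadoopbjNN naming scheme, and the assumption that hadoopbj01 doubles as ganglia_m, are taken from this setup):

```shell
# Build the host list hadoopbj01..hadoopbj08 (naming scheme of this cluster;
# adjust the printf format and the range for yours).
nodes=""
for n in $(seq 1 8); do
  nodes="$nodes hadoopbj$(printf '%02d' "$n")"
done

# gmetad group line, as it would appear in /etc/ganglia/gmetad.conf:
echo "data_source \"hadoop-bj\"$nodes"

# Node list for the step-5 deployment loop; hadoopbj01 (ganglia_m itself)
# is excluded because its gmond is configured locally.
for n in $(seq 2 8); do printf 'hadoopbj%02d\n' "$n"; done > /tmp/mynodes
cat /tmp/mynodes
```

The generated /tmp/mynodes can feed the `for i in \`cat /tmp/mynodes\`` loop in step 5 unchanged.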
chkconfig --add gmetad # make sure gmetad starts up at boot time
chkconfig --add gmond # make sure gmond starts up at boot time

#Sub-step 2: edit gmond.conf and gmetad.conf under /etc/ganglia
vi gmetad.conf
#Add the machines to monitor and a name for the monitoring group. The line below creates the group "hadoop-bj", monitoring hadoopbj01~hadoopbj08:
data_source "hadoop-bj" hadoopbj01 hadoopbj02 hadoopbj03 hadoopbj04 hadoopbj05 hadoopbj06 hadoopbj07 hadoopbj08
vi gmond.conf
#Now edit /etc/ganglia/gmond.conf to name the cluster. The cluster we defined above is "hadoop-bj", so change name = "unspecified" to name = "hadoop-bj".
#Our udp_send_channel and udp_recv_channel both keep the default mcast_join address and port; you may of course change them.
cluster {
  name = "hadoop-bj"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}

#Sub-step 3: watch out for multihomed machines
#In my cluster, eth0 carries the host's public IP address, but the monitoring server talks to the nodes on the private cluster network through eth1, so I need to make sure the multicast traffic Ganglia uses is bound to eth1. That is done by creating the file /etc/sysconfig/network-scripts/route-eth1 with the content "239.2.11.71 dev eth1".
#Then restart the network with service network restart and confirm the routing table sends this IP out through eth1. Note: 239.2.11.71 is Ganglia's default multicast channel; change it here if you use a different channel or add more channels.
touch /etc/sysconfig/network-scripts/route-eth1
echo "239.2.11.71 dev eth1">>/etc/sysconfig/network-scripts/route-eth1
service network restart

#Sub-step 4: start everything on the management server
service gmond start
service gmetad start
service httpd restart

step5: (in ganglia_m)
#Use a script to configure the nodes to be monitored. Note that here ganglia_m is also hadoopbj01 and its gmond is already configured, so the script below only needs to handle hadoopbj02~hadoopbj08.
touch /tmp/mynodes
vi /tmp/mynodes
hadoopbj02
hadoopbj03
hadoopbj04
hadoopbj05
hadoopbj06
hadoopbj07
hadoopbj08
touch /tmp/configure-all-ganglia
vi /tmp/configure-all-ganglia
for i in `cat /tmp/mynodes`; do
  scp /usr/sbin/gmond $i:/usr/sbin/gmond
  ssh $i "mkdir -p /etc/ganglia/"
  scp /etc/ganglia/gmond.conf $i:/etc/ganglia/
  scp /etc/init.d/gmond $i:/etc/init.d/
  scp /usr/lib64/libganglia-3.1.1.so.0 $i:/usr/lib64/
  scp /lib64/libexpat.so.0 $i:/lib64/
  scp /usr/lib64/libconfuse.so.0 $i:/usr/lib64/
  scp /usr/lib64/libapr-1.so.0 $i:/usr/lib64/
  scp -r /usr/lib64/ganglia $i:/usr/lib64/
  scp /etc/sysconfig/network-scripts/route-eth1 $i:/etc/sysconfig/network-scripts/
  ssh $i "service network restart"
  ssh $i "service gmond start"
  ssh $i "chkconfig --add gmond"
done
sh /tmp/configure-all-ganglia
#Check how things look at:
http://hadoopbj01/ganglia/

step6: (in ganglia_m)
#We use hadoop-0.20.2 here; for ganglia-3.1.1 to collect Hadoop data, Hadoop must be patched with HADOOP-4675.
#Below we patch Hadoop.
#The patching, and the "ant package" rebuild further down, are best verified in a test environment first; then simply copy the resulting hadoop core jar into the production environment to replace the old one.
#Download HADOOP-4675-v9.patch from https://issues.apache.org/jira/browse/HADOOP-4675
cp HADOOP-4675-v9.patch HADOOP-4675-v9.patch.bak
vi HADOOP-4675-v9.patch
#The downloaded patch needs some edits before it applies; the diff below shows them. After patching, build a new hadoop core jar that contains the new org.apache.hadoop.metrics.ganglia.GangliaContext31 class next to the existing org.apache.hadoop.metrics.ganglia.GangliaContext.
#(Why doesn't the downloaded patch apply directly? Does everyone hit this? It cost me a lot of fiddling: the paths in the patch point at src/java/, while the hadoop-0.20.2 tarball keeps this code under src/core/, hence the path rewrites below.)
diff HADOOP-4675-v9.patch HADOOP-4675-v9.patch.bak
237c237
< Index: src/core/org/apache/hadoop/metrics/ganglia/GangliaContext.java
---
> Index: src/java/org/apache/hadoop/metrics/ganglia/GangliaContext.java
239,240c239,240
< --- src/core/org/apache/hadoop/metrics/ganglia/GangliaContext.java (revision 771522)
< +++ src/core/org/apache/hadoop/metrics/ganglia/GangliaContext.java (working copy)
---
> --- src/java/org/apache/hadoop/metrics/ganglia/GangliaContext.java (revision 771522)
> +++ src/java/org/apache/hadoop/metrics/ganglia/GangliaContext.java (working copy)
325c325
< Index: src/core/org/apache/hadoop/metrics/ganglia/GangliaContext31.java
---
> Index: src/java/org/apache/hadoop/metrics/ganglia/GangliaContext31.java
327,328c327,328
< --- src/core/org/apache/hadoop/metrics/ganglia/GangliaContext31.java (revision 0)
< +++ src/core/org/apache/hadoop/metrics/ganglia/GangliaContext31.java (revision 0)
---
> --- src/java/org/apache/hadoop/metrics/ganglia/GangliaContext31.java (revision 0)
> +++ src/java/org/apache/hadoop/metrics/ganglia/GangliaContext31.java (revision 0)
cd $HADOOP_HOME
patch -p0 < HADOOP-4675-v9.patch
vi build.xml
#Comment out lines 904 and 908:
#(I'm not familiar with ant; a colleague said it doesn't matter that lines 904 and 908 fail to build, but why? Ideally you would write your own build.xml that just produces a hadoop core jar containing org.apache.hadoop.metrics.ganglia.GangliaContext31; I can't write one, so I use the bundled build.xml.)
<target name="forrest.check" unless="forrest.home" depends="java5.check">
  <!--fail message="'forrest.home' is not defined. Please pass -Dforrest.home=<base of Apache Forrest installation> to Ant on the command-line." /-->
</target>
<target name="java5.check" unless="java5.home">
  <!--fail message="'java5.home' is not defined. Forrest requires Java 5. Please pass -Djava5.home=<base of Java 5 distribution> to Ant on the command-line." /-->
</target>

ant package
#When ant package succeeds, it creates a build directory holding the newly compiled hadoop core jar.
#Replace the original jar with the new one that contains org.apache.hadoop.metrics.ganglia.GangliaContext31:
cp $HADOOP_HOME/hadoop-0.20.2-core.jar $HADOOP_HOME/hadoop-0.20.2-core.jar.bak
cp $HADOOP_HOME/build/hadoop-0.20.3-dev-core.jar $HADOOP_HOME/
#Update hadoop-metrics.properties to collect Hadoop data with the new GangliaContext31 class:
vi $HADOOP_HOME/conf/hadoop-metrics.properties

# Configuration of the "dfs" context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=239.2.11.71:8649

# Configuration of the "mapred" context for null
#mapred.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the "mapred" context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log
# Configuration of the "mapred" context for ganglia
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=239.2.11.71:8649

# Configuration of the "jvm" context for null
#jvm.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the "jvm" context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log
# Configuration of the "jvm" context for ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=239.2.11.71:8649
# Configuration of the "rpc" context for ganglia
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=239.2.11.71:8649

#Restart the Hadoop cluster
#on the namenode
$HADOOP_HOME/bin/stop-dfs.sh
#on the jobtracker
$HADOOP_HOME/bin/stop-mapred.sh
#on the namenode
$HADOOP_HOME/bin/start-dfs.sh
#on the jobtracker
$HADOOP_HOME/bin/start-mapred.sh

#Restart the Ganglia daemons
service gmond restart
service gmetad restart
service httpd restart

#Barring surprises, Ganglia should now be collecting Hadoop metrics.

Reference URLs:
#http://wiki.yepn.net/ganglia#%E9%85%8D%E7%BD%AE_node%E7%9A%84hadoop%E7%9B%91%E6%8E%A7
#http://www.pginjp.org/modules/newbb/viewtopic.php?topic_id=1235&forum=22
#http://www.pginjp.org/modules/newbb/viewtopic.php?topic_id=1234&forum=7
#http://www.ibm.com/developerworks/cn/linux/l-ganglia-nagios-1/index.html
#sed usage: http://hi.baidu.com/hovlj_1130/blog/item/e9721b7b31e8dbe02e73b3b2.html
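A closing sanity check worth running before the restarts: a typo in hadoop-metrics.properties raises no error, it just silently keeps metrics from reaching Ganglia. The sketch below verifies that all four contexts name GangliaContext31; it runs here against a demo copy in /tmp, but on a real cluster you would point F at $HADOOP_HOME/conf/hadoop-metrics.properties instead.

```shell
# Demo copy of the relevant lines (on a real cluster, set F to
# $HADOOP_HOME/conf/hadoop-metrics.properties instead).
F=/tmp/hadoop-metrics.properties.demo
cat > "$F" <<'EOF'
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
EOF

# Each metrics context must name GangliaContext31, or its data never
# reaches gmond.
for ctx in dfs mapred jvm rpc; do
  if grep -q "^${ctx}\.class=org\.apache\.hadoop\.metrics\.ganglia\.GangliaContext31$" "$F"; then
    echo "$ctx: ok"
  else
    echo "$ctx: NOT using GangliaContext31"
  fi
done
```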