Overview


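   This Hadoop plugin monitors the NameNode and JobTracker of a Hadoop cluster. Hadoop is the leading, de facto standard system for distributed big data processing. Yet companies such as Yahoo (reputed to run very large Hadoop clusters), Facebook, and Groupon appear to use only two monitoring solutions: Ganglia and OpenTSDB. If you read their documentation, you will find that both are tightly coupled to Hadoop and are sensitive to the Hadoop version, libraries, and similar details.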

   The goal of this Hadoop plugin is to work out of the box, without installing any software on an already running Hadoop cluster or Zabbix server. Sounds too good to be true? Read on.


Installation and Configuration


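   The plugin works by scraping information from the web UIs of the Hadoop NameNode and JobTracker. You do not need to add or change any Hadoop configuration parameters or restart your Hadoop cluster. Once you have downloaded the plugin, it takes no more than 5 minutes to get it running.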
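   To make the scraping approach concrete, here is a minimal Python sketch. It is not the plugin's code (the plugin consists of shell scripts that call curl). The sample HTML, the label text, and the helper names fetch_page and extract_metric are assumptions made for illustration; the real status pages differ between Hadoop versions.

```python
import re
from urllib.request import urlopen

# Hypothetical fragment resembling a NameNode status page; the real markup
# differs between Hadoop versions.
SAMPLE_HTML = """
<tr><td>Live Nodes</td><td>:</td><td>12</td></tr>
<tr><td>Dead Nodes</td><td>:</td><td>1</td></tr>
"""


def fetch_page(url, timeout=10):
    """Download a status page over HTTP, much as curl would."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_metric(html, label):
    """Return the first integer that follows `label`, ignoring HTML tags."""
    text = re.sub(r"<[^>]+>", " ", html)  # strip tags, keep the visible text
    match = re.search(re.escape(label) + r"\D*(\d+)", text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Works offline on the sample; pass fetch_page(url) output for a live page.
    print(extract_metric(SAMPLE_HTML, "Live Nodes"))
    print(extract_metric(SAMPLE_HTML, "Dead Nodes"))
```

   Because the values are read from pages the cluster already serves, nothing has to change on the Hadoop side, which matches the no-install goal described above.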

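   The plugin calls the curl command-line tool, so curl must be installed first. On the Zabbix server, log in as root (the default password is zabbix) and run yast -i curl. Note that although the curl package is very small, yast will spend a few minutes refreshing the package repositories. Next, download the Hadoop plugin, which consists of 2 shell scripts and 2 template XML files, from http://mikoomi.googlecode.com/svn/plugins/. On the Zabbix server, create the directory /etc/zabbix/externalscripts and copy the shell scripts into it.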
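   The copy step can be sketched in Python as follows. The install_scripts helper is hypothetical, and marking the scripts executable is an assumption (the text above only says to copy them); the script file names are deliberately not spelled out because the text does not name them.

```python
import shutil
import stat
from pathlib import Path


def install_scripts(source_dir, target_dir="/etc/zabbix/externalscripts"):
    """Copy every *.sh file from source_dir into target_dir.

    The target directory is created if it does not exist. Each copied
    script also gets execute permission (an assumption, see the lead-in).
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    installed = []
    for script in sorted(Path(source_dir).glob("*.sh")):
        destination = target / script.name
        shutil.copy2(script, destination)
        mode = destination.stat().st_mode
        # Add execute permission for owner, group, and others.
        destination.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        installed.append(destination)
    return installed
```

   Calling install_scripts("/path/to/downloaded/plugin") as root would place the scripts in the default directory; the template XML files are skipped because they are imported through the Zabbix frontend instead.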

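   After that, open a browser and download the NameNode and JobTracker template files from http://mikoomi.googlecode.com/svn/plugins/. Open a new browser window or tab and log in to the Zabbix frontend (the default username is admin and the password is zabbix).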

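   Proceed as follows:

   Go to Configuration >> Templates

   Click the "Import Template" button in the upper-right corner of the window

   In the "Import file" dialog, locate and select the template file you just downloaded.

   Upload the template

   You are now ready to monitor your Hadoop cluster. Instructions follow: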


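Monitoring Your Hadoop Cluster

   Follow the steps below:

   Monitoring the NameNode

   Log in to the Zabbix frontend and click Configuration >> Hosts in the navigation bar

   Click the "Create Host" button in the upper-right corner

   Fill in the host details as prompted. Name: a name of your choice (in Zabbix every monitored entity is called a host, although it may be a physical host, a service, an application, or even a cluster).

   Then click the "Add" button on the "Templates" tab.

   A list of templates appears; select "Template_Hadoop_NameNode"

   On the "Macros" tab, add the following macros: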

       {$HADOOP_NAMENODE_HOST}

       {$HADOOP_NAMENODE_METRICS_PORT}

       {$ZABBIX_NAME}

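   The value of {$HADOOP_NAMENODE_HOST} should be the hostname or fully qualified hostname of the NameNode server (it must be reachable over the network, for example by ping). The value of {$HADOOP_NAMENODE_METRICS_PORT} is the port of the NameNode web UI. Finally, {$ZABBIX_NAME} is the name of the NameNode host entity that you defined earlier in the Zabbix frontend.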
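   As an illustration of how the three macros fit together, the sketch below validates them and builds the URL a scraper would request. The namenode_url helper is hypothetical, and the page name dfshealth.jsp and port 50070 are assumptions based on Hadoop 1.x defaults; check what your NameNode actually serves.

```python
REQUIRED_MACROS = (
    "{$HADOOP_NAMENODE_HOST}",
    "{$HADOOP_NAMENODE_METRICS_PORT}",
    "{$ZABBIX_NAME}",
)


def namenode_url(macros, page="dfshealth.jsp"):
    """Validate the host macros and build the NameNode web UI URL.

    `page` is an assumed default; adjust it for your Hadoop version.
    """
    missing = [name for name in REQUIRED_MACROS if not macros.get(name)]
    if missing:
        raise ValueError("missing macros: " + ", ".join(missing))
    port = int(macros["{$HADOOP_NAMENODE_METRICS_PORT}"])
    if not 1 <= port <= 65535:
        raise ValueError("port out of range: %d" % port)
    host = macros["{$HADOOP_NAMENODE_HOST}"]
    return "http://%s:%d/%s" % (host, port, page)


# Example values only; replace them with your own host, port, and name.
EXAMPLE_MACROS = {
    "{$HADOOP_NAMENODE_HOST}": "namenode.example.com",
    "{$HADOOP_NAMENODE_METRICS_PORT}": "50070",
    "{$ZABBIX_NAME}": "Hadoop NameNode",
}

if __name__ == "__main__":
    print(namenode_url(EXAMPLE_MACROS))
```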

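   Similarly, the steps for setting up JobTracker monitoring are as follows:

   Monitoring the JobTracker

   Log in to the Zabbix frontend and click Configuration >> Hosts in the navigation bar

   Click the "Create Host" button in the upper-right corner

   Fill in the host details as prompted. Name: a name of your choice (in Zabbix every monitored entity is called a host, although it may be a physical host, a service, an application, or even a cluster).

   Then click the "Add" button on the "Templates" tab.

   A list of templates appears; select "Template_Hadoop_JobTracker"

   On the "Macros" tab, add the following macros: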

       {$HADOOP_JOBTRACKER_HOST}

       {$HADOOP_JOBTRACKER_METRICS_PORT}

       {$ZABBIX_NAME}

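   The value of {$HADOOP_JOBTRACKER_HOST} should be the hostname or fully qualified hostname of the JobTracker server (it must be reachable over the network, for example by ping). The value of {$HADOOP_JOBTRACKER_METRICS_PORT} is the port of the JobTracker web UI. Finally, {$ZABBIX_NAME} is the name of the JobTracker host entity that you defined earlier in the Zabbix frontend.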
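   Before relying on the macro values, it can help to confirm that the Zabbix server can actually reach the configured host and port. The sketch below is one way to do that with a plain TCP connection; is_reachable is a hypothetical helper, not part of the plugin, and a successful connection only shows that something is listening, not that it is the Hadoop web UI.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

   Run it from the Zabbix server with the same host and port you entered in the macros, because that is the machine that will make the requests.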


NameNode Metrics

Configured Cluster Storage

Configured Max. Heap Size (GB)

Hadoop Version

NameNode Process Heap Size (GB)

NameNode Start Time

Number of Dead Nodes

Number of Decommissioned Nodes

Number of Files and Directories in HDFS

Number of HDFS Blocks Used

Number of Live Nodes

Number of Under-Replicated Blocks

Ping Check

Storage Unit

Total % of Storage Available

Total % of Storage Used

Total Storage Available

Total Storage Used by DFS

Total Storage Used by non-DFS

Least (min) Node-level non-DFS Storage Used

Least (min) Node-level Storage Configured

Least (min) Node-level Storage Free

Least (min) Node-level Storage Free %

Least (min) Node-level Storage Used

Least (min) Node-level Storage Used %

Most (max) Node-level non-DFS Storage Used

Most (max) Node-level Storage Configured

Most (max) Node-level Storage Free

Most (max) Node-level Storage Free %

Most (max) Node-level Storage Used

Most (max) Node-level Storage Used %

Node-level Storage Unit of Measure

Node with Least (min) Node-level non-DFS Storage Used

Node with Least (min) Node-level Storage Configured

Node with Least (min) Node-level Storage Free

Node with Least (min) Node-level Storage Free %

Node with Least (min) Node-level Storage Used

Node with Least (min) Node-level Storage Used %

Node with Most (max) Node-level non-DFS Storage Used

Node with Most (max) Node-level Storage Configured

Node with Most (max) Node-level Storage Free

Node with Most (max) Node-level Storage Free %

Node with Most (max) Node-level Storage Used

Node with Most (max) Node-level Storage Used %
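The node-level items above come in pairs: an extreme value and the node that holds it. A minimal Python sketch of that pairing is shown below; the node_extremes helper, the field names, and the sample figures are all invented for illustration and do not come from the plugin.

```python
def node_extremes(nodes, field):
    """Return ((node, value) at the minimum, (node, value) at the maximum)."""
    if not nodes:
        raise ValueError("no DataNode rows to summarise")
    lowest = min(nodes, key=lambda name: nodes[name][field])
    highest = max(nodes, key=lambda name: nodes[name][field])
    return (lowest, nodes[lowest][field]), (highest, nodes[highest][field])


# Made-up per-node figures, keyed by a hypothetical node name.
SAMPLE_NODES = {
    "datanode1": {"used_pct": 41.5, "free_pct": 58.5},
    "datanode2": {"used_pct": 72.0, "free_pct": 28.0},
    "datanode3": {"used_pct": 18.2, "free_pct": 81.8},
}

if __name__ == "__main__":
    print(node_extremes(SAMPLE_NODES, "used_pct"))
```

Reporting the node name alongside the value is what lets an operator go straight to the fullest or emptiest DataNode instead of searching for it.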


JobTracker Metrics

Average Task Capacity Per Node

Hadoop Version

JobTracker Start Time

JobTracker State

Map Task Capacity

Number of Blacklisted Nodes

Number of Excluded Nodes

Number of Jobs Completed

Number of Jobs Failed

Number of Jobs Retired

Number of Jobs Running

Number of Jobs Submitted

Number of Map Tasks Running

Number of Nodes in Hadoop Cluster

Number of Reduce Tasks Running

Occupied Map Slots

Occupied Reduce Slots

Reduce Task Capacity

Reserved Map Slots

Reserved Reduce Slots
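Several of these items are related arithmetically. The sketch below shows one plausible reading, assuming that Average Task Capacity Per Node is the combined map and reduce capacity divided by the number of nodes; the plugin may simply report whatever value the JobTracker page displays. Both helper names are hypothetical, and the slot utilisation figure is a derived value that is not in the list above.

```python
def average_task_capacity(map_capacity, reduce_capacity, node_count):
    """Combined map and reduce slot capacity divided by the number of nodes."""
    if node_count <= 0:
        return 0.0  # avoid dividing by zero when no nodes are registered
    return (map_capacity + reduce_capacity) / node_count


def slot_utilisation(occupied, capacity):
    """Percentage of slots in use (0 when there is no capacity)."""
    return 100.0 * occupied / capacity if capacity else 0.0
```

For example, a cluster of 10 nodes with a map capacity of 40 and a reduce capacity of 20 has an average capacity of 6 tasks per node under this assumption.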

Pre-canned NameNode Triggers

Less than 20% free space available on the cluster

NameNode was restarted

No monitoring data received for the last 10 minutes

One or more nodes have become alive or restarted

One or more nodes have become dead

One or more nodes have been added to the decommissioned list

One or more nodes have been removed from the decommissioned list

The number of live nodes has been reduced

The number of live nodes has increased

There has been a reduction in the number of under-replicated blocks

There has been an increase in the number of under-replicated blocks

Less than 20% free space available on one or more nodes in the cluster
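The logic behind a few of these triggers can be sketched in Python as a comparison of two consecutive samples. This is not how Zabbix evaluates its trigger expressions; the namenode_alerts helper and the sample field names are assumptions for illustration only.

```python
def namenode_alerts(previous, current, min_free_pct=20.0):
    """Compare two samples and return the messages of triggers that would fire."""
    alerts = []
    if current["free_pct"] < min_free_pct:
        alerts.append(
            "Less than %g%% free space available on the cluster" % min_free_pct
        )
    if current["live_nodes"] < previous["live_nodes"]:
        alerts.append("The number of live nodes has been reduced")
    elif current["live_nodes"] > previous["live_nodes"]:
        alerts.append("The number of live nodes has increased")
    if current["dead_nodes"] > previous["dead_nodes"]:
        alerts.append("One or more nodes have become dead")
    return alerts


if __name__ == "__main__":
    before = {"free_pct": 35.0, "live_nodes": 10, "dead_nodes": 0}
    after = {"free_pct": 18.5, "live_nodes": 9, "dead_nodes": 1}
    for message in namenode_alerts(before, after):
        print(message)
```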

Pre-canned JobTracker Triggers

No monitoring data received for the last 10 minutes

One or more jobs have failed

One or more nodes have become blacklisted

One or more nodes have been added to the exclude list

One or more nodes have been added to the Hadoop cluster

One or more nodes have been removed from the blacklisted nodes

One or more nodes have been removed from the exclude list

One or more nodes have been removed from the Hadoop cluster

The JobTracker was restarted
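Two of these triggers depend on time rather than on counts. The following Python sketch shows the idea under simple assumptions: is_stale and was_restarted are hypothetical helpers, and a restart is inferred only from a change in the reported JobTracker start time.

```python
from datetime import datetime, timedelta


def is_stale(last_received, now, max_age=timedelta(minutes=10)):
    """Return True when no sample has arrived within max_age."""
    return now - last_received > max_age


def was_restarted(previous_start, current_start):
    """Return True when the reported start time differs from the last one seen."""
    return current_start != previous_start


if __name__ == "__main__":
    last = datetime(2012, 1, 1, 12, 0)
    print(is_stale(last, datetime(2012, 1, 1, 12, 11)))
    print(was_restarted("start time A", "start time B"))
```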