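Once the installation described in "Ganglia 3.6.0 Installation Steps (Including the Python Module)" is complete, Ganglia can monitor machine-level information such as load, CPU, and disk usage. To go one step further and monitor a Hadoop cluster through Ganglia, you need to set up Hadoop's metrics configuration file and make a few changes to the Ganglia configuration.
The HADOOP_PATH/etc/hadoop/ directory contains two metrics configuration files: hadoop-metrics.properties and hadoop-metrics2.properties.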
1. hadoop-metrics.properties
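This is the configuration file of Hadoop's original metrics framework (org.apache.hadoop.metrics), used for integration with older Ganglia releases. The message format changed significantly between Ganglia 3.0 and 3.1 and is not backward compatible, which is why the file offers both GangliaContext (Ganglia 3.0) and GangliaContext31 (Ganglia 3.1 and later).
Example: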
# Configuration of the "dfs" context for null
##dfs.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "dfs" context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the "dfs" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=Mas2:8649

# Configuration of the "mapred" context for null
##mapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "mapred" context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the "mapred" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=Mas2:8649

# Configuration of the "jvm" context for null
#jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "jvm" context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the "jvm" context for ganglia
# jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=Mas2:8649

# Configuration of the "rpc" context for null
##rpc.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "rpc" context for file
#rpc.class=org.apache.hadoop.metrics.file.FileContext
#rpc.period=10
#rpc.fileName=/tmp/rpcmetrics.log

# Configuration of the "rpc" context for ganglia
# rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=Mas2:8649

# Configuration of the "ugi" context for null
##ugi.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "ugi" context for file
#ugi.class=org.apache.hadoop.metrics.file.FileContext
#ugi.period=10
#ugi.fileName=/tmp/ugimetrics.log

# Configuration of the "ugi" context for ganglia
# ugi.class=org.apache.hadoop.metrics.ganglia.GangliaContext
ugi.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
ugi.period=10
ugi.servers=Mas2:8649
2. hadoop-metrics2.properties
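This is the configuration file of the metrics2 framework, used to integrate Hadoop with Ganglia 3.1 and later (this article uses this file).
Example (enable the entries that match the role of the node: namenode, datanode, or another daemon):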
#
#   Licensed to the Apache Software Foundation (ASF) under one or more
#   contributor license agreements.  See the NOTICE file distributed with
#   this work for additional information regarding copyright ownership.
#   The ASF licenses this file to You under the Apache License, Version 2.0
#   (the "License"); you may not use this file except in compliance with
#   the License.  You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.
#

# syntax: [prefix].[source|sink].[instance].[options]
# See javadoc of package-info.java for org.apache.hadoop.metrics2 for details

*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
# default sampling period, in seconds
*.period=10

# The namenode-metrics.out will contain metrics from all context
#namenode.sink.file.filename=namenode-metrics.out
# Specifying a special sampling period for namenode:
#namenode.sink.*.period=8

#datanode.sink.file.filename=datanode-metrics.out

# the following example split metrics of different
# context to different sinks (in this case files)
#jobtracker.sink.file_jvm.context=jvm
#jobtracker.sink.file_jvm.filename=jobtracker-jvm-metrics.out
#jobtracker.sink.file_mapred.context=mapred
#jobtracker.sink.file_mapred.filename=jobtracker-mapred-metrics.out

#tasktracker.sink.file.filename=tasktracker-metrics.out

#maptask.sink.file.filename=maptask-metrics.out

#reducetask.sink.file.filename=reducetask-metrics.out

##
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10

*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40

#namenode.sink.ganglia.servers=Mas2:8649
#resourcemanager.sink.ganglia.servers=namenode1:8649

datanode.sink.ganglia.servers=namenode1:8649
#nodemanager.sink.ganglia.servers=namenode1:8649

#maptask.sink.ganglia.servers=namenode1:8649
#reducetask.sink.ganglia.servers=namenode1:8649
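Because only the uncommented `[prefix].sink.ganglia.servers` entries take effect, it can help to list which daemons will actually report to Ganglia before restarting anything. The following is a minimal sketch in Python, not part of Hadoop or Ganglia: the function names `parse_properties` and `ganglia_targets` are hypothetical, and the parser is deliberately simplified (it only understands `key=value` lines and `#` comments, not the full Java properties syntax).

```python
def parse_properties(text):
    """Parse simple key=value properties text, skipping blank and '#' lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only; values such as slope/dmax contain '=' too.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def ganglia_targets(props):
    """Map each daemon prefix to the gmond endpoints its Ganglia sink sends to."""
    suffix = ".sink.ganglia.servers"
    return {
        key[: -len(suffix)]: [s.strip() for s in value.split(",")]
        for key, value in props.items()
        if key.endswith(suffix)
    }


# A shortened excerpt of the hadoop-metrics2.properties example above.
SAMPLE = """\
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
#namenode.sink.ganglia.servers=Mas2:8649
datanode.sink.ganglia.servers=namenode1:8649
"""

targets = ganglia_targets(parse_properties(SAMPLE))
```

With the excerpt above, only the datanode entry is reported, since the namenode line is commented out. To check a real file, read its contents and pass the text to `parse_properties`.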
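On the Ganglia side, two files need to be adjusted: gmetad.conf and gmond.conf.
1. gmetad.conf
gmetad collects the monitoring data reported by the individual nodes.
Example (it is enough to enable the options that are uncommented in the sample below):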
# This is an example of a Ganglia Meta Daemon configuration file
#                http://ganglia.sourceforge.net/
#
#
#-------------------------------------------------------------------------------
# Setting the debug_level to 1 will keep daemon in the forground and
# show only error messages. Setting this value higher than 1 will make
# gmetad output debugging information and stay in the foreground.
# default: 0
# debug_level 10
#
#-------------------------------------------------------------------------------
# What to monitor. The most important section of this file.
#
# The data_source tag specifies either a cluster or a grid to
# monitor. If we detect the source is a cluster, we will maintain a complete
# set of RRD databases for it, which can be used to create historical
# graphs of the metrics. If the source is a grid (it comes from another gmetad),
# we will only maintain summary RRDs for it.
#
# Format:
# data_source "my cluster" [polling interval] address1:port addreses2:port ...
#
# The keyword 'data_source' must immediately be followed by a unique
# string which identifies the source, then an optional polling interval in
# seconds. The source will be polled at this interval on average.
# If the polling interval is omitted, 15sec is asssumed.
#
# If you choose to set the polling interval to something other than the default,
# note that the web frontend determines a host as down if its TN value is less
# than 4 * TMAX (20sec by default). Therefore, if you set the polling interval
# to something around or greater than 80sec, this will cause the frontend to
# incorrectly display hosts as down even though they are not.
#
# A list of machines which service the data source follows, in the
# format ip:port, or name:port. If a port is not specified then 8649
# (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655 1.3.4.8

#data_source "my cluster" localhost
data_source "my cluster" 10 Mas1:8650 Mas2:8650 Sla1:8650 Sla2:8650

#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# Old Default RRA: Keep 1 hour of metrics at 15 second resolution. 1 day at 6 minute
# RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
#      "RRA:AVERAGE:0.5:5760:374"
# New Default RRA
# Keep 5856 data points at 15 second resolution assuming 15 second (default) polling. That's 1 day
# Two weeks of data points at 1 minute resolution (average)
#RRAs "RRA:AVERAGE:0.5:1:5856" "RRA:AVERAGE:0.5:4:20160" "RRA:AVERAGE:0.5:40:52704"

#
#-------------------------------------------------------------------------------
# Scalability mode. If on, we summarize over downstream grids, and respect
# authority tags. If off, we take on 2.5.0-era behavior: we do not wrap our output
# in <GRID></GRID> tags, we ignore all <GRID> tags we see, and always assume
# we are the "authority" on data source feeds. This approach does not scale to
# large groups of clusters, but is provided for backwards compatibility.
# default: on
# scalable off
#
#-------------------------------------------------------------------------------
# The name of this Grid. All the data sources above will be wrapped in a GRID
# tag with this name.
# default: unspecified
# gridname "MyGrid"
#
#-------------------------------------------------------------------------------
# The authority URL for this grid. Used by other gmetads to locate graphs
# for our data sources. Generally points to a ganglia/
# website on this machine.
# default: "http://hostname/ganglia/",
#   where hostname is the name of this machine, as defined by gethostname().
# authority "http://mycluster.org/newprefix/"
#
#-------------------------------------------------------------------------------
# List of machines this gmetad will share XML with. Localhost
# is always trusted.
# default: There is no default value
# trusted_hosts 127.0.0.1 169.229.50.165 my.gmetad.org
#
#-------------------------------------------------------------------------------
# If you want any host which connects to the gmetad XML to receive
# data, then set this value to "on"
# default: off
# all_trusted on
#
#-------------------------------------------------------------------------------
# If you don't want gmetad to setuid then set this to off
# default: on
# setuid off
#
#-------------------------------------------------------------------------------
# User gmetad will setuid to (defaults to "nobody")
# default: "nobody"
# setuid_username "nobody"
#
#-------------------------------------------------------------------------------
# Umask to apply to created rrd files and grid directory structure
# default: 0 (files are public)
# umask 022
#
#-------------------------------------------------------------------------------
# The port gmetad will answer requests for XML
# default: 8651
xml_port 8651
#
#-------------------------------------------------------------------------------
# The port gmetad will answer queries for XML. This facility allows
# simple subtree and summation views of the XML tree.
# default: 8652
interactive_port 8652
#
#-------------------------------------------------------------------------------
# The number of threads answering XML requests
# default: 4
# server_threads 10
#
#-------------------------------------------------------------------------------
# Where gmetad stores its round-robin databases
# default: "/var/lib/ganglia/rrds"
rrd_rootdir "/var/www/html/rrds"
#
#-------------------------------------------------------------------------------
# List of metric prefixes this gmetad will not summarize at cluster or grid level.
# default: There is no default value
# unsummarized_metrics diskstat CPU
#
#-------------------------------------------------------------------------------
# In earlier versions of gmetad, hostnames were handled in a case
# sensitive manner
# If your hostname directories have been renamed to lower case,
# set this option to 0 to disable backward compatibility.
# From version 3.2, backwards compatibility will be disabled by default.
# default: 1 (for gmetad < 3.2)
# default: 0 (for gmetad >= 3.2)
case_sensitive_hostnames 0

#-------------------------------------------------------------------------------
# It is now possible to export all the metrics collected by gmetad directly to
# graphite by setting the following attributes.
#
# The hostname or IP address of the Graphite server
# default: unspecified
# carbon_server "my.graphite.box"
#
# The port and protocol on which Graphite is listening
# default: 2003
# carbon_port 2003
#
# default: tcp
# carbon_protocol udp
#
# **Deprecated in favor of graphite_path** A prefix to prepend to the
# metric names exported by gmetad. Graphite uses dot-
# separated paths to organize and refer to metrics.
# default: unspecified
# graphite_prefix "datacenter1.gmetad"
#
# A user-definable graphite path. Graphite uses dot-
# separated paths to organize and refer to metrics.
# For reverse compatibility graphite_prefix will be prepended to this
# path, but this behavior should be considered deprecated.
# This path may include 3 variables that will be replaced accordingly:
# %s -> source (cluster name)
# %h -> host (host name)
# %m -> metric (metric name)
# default: graphite_prefix.%s.%h.%m
# graphite_path "datacenter1.gmetad.%s.%h.%m

# Number of milliseconds gmetad will wait for a response from the graphite server
# default: 500
# carbon_timeout 500

#-------------------------------------------------------------------------------
# Memcached configuration (if it has been compiled in)
# Format documentation at http://docs.libmemcached.org/libmemcached_configuration.html
# default: ""
# memcached_parameters "--SERVER=127.0.0.1"
#
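The `data_source` line is the one setting in gmetad.conf that has to match the cluster, so it is worth checking how it breaks down. The sketch below is an illustration in Python, not gmetad's own parser: `parse_data_source` is a hypothetical helper that applies the rules stated in the comments above (quoted source name, optional polling interval defaulting to 15 seconds, port defaulting to 8649). It assumes host entries are names or IPv4 addresses and treats a leading all-digit token as the interval.

```python
import shlex

DEFAULT_POLL_SECONDS = 15  # "If the polling interval is omitted, 15sec is assumed"
DEFAULT_GMOND_PORT = 8649  # "If a port is not specified then 8649 ... is assumed"


def parse_data_source(line):
    """Split a data_source line into (name, interval, [(host, port), ...])."""
    tokens = shlex.split(line)  # shlex keeps the quoted source name as one token
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError("not a data_source line: %r" % line)
    name, rest = tokens[1], tokens[2:]
    interval = DEFAULT_POLL_SECONDS
    if rest[0].isdigit():  # optional polling interval in seconds
        interval, rest = int(rest[0]), rest[1:]
    hosts = []
    for item in rest:
        host, sep, port = item.partition(":")
        hosts.append((host, int(port) if sep else DEFAULT_GMOND_PORT))
    return name, interval, hosts


# The data_source line used in this article.
name, interval, hosts = parse_data_source(
    'data_source "my cluster" 10 Mas1:8650 Mas2:8650 Sla1:8650 Sla2:8650'
)
```

Note that every host here is polled on port 8650, so each gmond has to expose its XML on that port through its tcp_accept_channel. The source name should also match the cluster name configured in gmond.conf ("my cluster" in this article).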
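2. gmond.conf
gmond.conf controls how this machine sends its monitoring data, which cluster it belongs to, and related settings (gmond can use multicast; in this example the monitoring data is sent to a specified address instead).
Example (the parts that need to change are lines 29 to 72 of the file, which span the sections from cluster through tcp_accept_channel):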
/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = yes
  user = nobody
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  allow_extra_data = yes
  host_dmax = 86400 /*secs. Expires (removes from web interface) hosts in 1 day */
  host_tmax = 20 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  # By default gmond will use reverse DNS resolution when displaying your hostname
  # Uncommeting following value will override that value.
  # override_hostname = "mywebserver.domain.com"
  # If you are not using multicast this value should be set to something other than 0.
  # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable
  send_metadata_interval = 0 /*secs */

}

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
  name = "my cluster"
  owner = "nobody"
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
   used to only support having a single channel */
udp_send_channel {
  bind_hostname = yes # Highly recommended, soon to be default.
                      # This option tells gmond to use a source address
                      # that resolves to the machine's hostname. Without
                      # this, the metrics may appear to come from any
                      # interface and the DNS names associated with
                      # those IPs will be used to create the RRDs.
  # mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  # mcast_join = 239.2.11.71
  port = 8649
  # bind = 239.2.11.71
  # retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8650
  # If you want to gzip XML output
  gzip_output = no
}

/* Channel to receive sFlow datagrams */
#udp_recv_channel {
#  port = 6343
#}

/* Optional sFlow settings */
#sflow {
# udp_port = 6343
# accept_vm_metrics = yes
# accept_jvm_metrics = yes
# multiple_jvm_instances = no
# accept_http_metrics = yes
# multiple_http_instances = no
# accept_memcache_metrics = yes
# multiple_memcache_instances = no
#}

/* Each metrics module that is referenced by gmond must be specified and
   loaded. If the module has been statically linked with gmond, it does
   not require a load path. However all dynamically loadable modules must
   include a load path. */
modules {
  module {
    name = "core_metrics"
  }
  module {
    name = "cpu_module"
    path = "modcpu.so"
  }
  module {
    name = "disk_module"
    path = "moddisk.so"
  }
  module {
    name = "load_module"
    path = "modload.so"
  }
  module {
    name = "mem_module"
    path = "modmem.so"
  }
  module {
    name = "net_module"
    path = "modnet.so"
  }
  module {
    name = "proc_module"
    path = "modproc.so"
  }
  module {
    name = "sys_module"
    path = "modsys.so"
  }
}

/* The old internal 2.5.x metric array has been replaced by the following
   collection_group directives. What follows is the default behavior for
   collecting and sending metrics that is as close to 2.5.x behavior as
   possible. */

/* This collection group will cause a heartbeat (or beacon) to be sent every
   20 seconds. In the heartbeat is the GMOND_STARTED data which expresses
   the age of the running gmond. */
collection_group {
  collect_once = yes
  time_threshold = 20
  metric {
    name = "heartbeat"
  }
}

/* This collection group will send general info about this host every
   1200 secs.
   This information doesn't change between reboots and is only collected
   once. */
collection_group {
  collect_once = yes
  time_threshold = 1200
  metric {
    name = "cpu_num"
    title = "CPU Count"
  }
  metric {
    name = "cpu_speed"
    title = "CPU Speed"
  }
  metric {
    name = "mem_total"
    title = "Memory Total"
  }
  /* Should this be here? Swap can be added/removed between reboots. */
  metric {
    name = "swap_total"
    title = "Swap Space Total"
  }
  metric {
    name = "boottime"
    title = "Last Boot Time"
  }
  metric {
    name = "machine_type"
    title = "Machine Type"
  }
  metric {
    name = "os_name"
    title = "Operating System"
  }
  metric {
    name = "os_release"
    title = "Operating System Release"
  }
  metric {
    name = "location"
    title = "Location"
  }
}

/* This collection group will send the status of gexecd for this host
   every 300 secs.*/
/* Unlike 2.5.x the default behavior is to report gexecd OFF. */
collection_group {
  collect_once = yes
  time_threshold = 300
  metric {
    name = "gexec"
    title = "Gexec Status"
  }
}

/* This collection group will collect the CPU status info every 20 secs.
   The time threshold is set to 90 seconds. In honesty, this
   time_threshold could be set significantly higher to reduce
   unneccessary network chatter. */
collection_group {
  collect_every = 20
  time_threshold = 90
  /* CPU status */
  metric {
    name = "cpu_user"
    value_threshold = "1.0"
    title = "CPU User"
  }
  metric {
    name = "cpu_system"
    value_threshold = "1.0"
    title = "CPU System"
  }
  metric {
    name = "cpu_idle"
    value_threshold = "5.0"
    title = "CPU Idle"
  }
  metric {
    name = "cpu_nice"
    value_threshold = "1.0"
    title = "CPU Nice"
  }
  metric {
    name = "cpu_aidle"
    value_threshold = "5.0"
    title = "CPU aidle"
  }
  metric {
    name = "cpu_wio"
    value_threshold = "1.0"
    title = "CPU wio"
  }
  metric {
    name = "cpu_steal"
    value_threshold = "1.0"
    title = "CPU steal"
  }
  /* The next two metrics are optional if you want more detail...
     ... since they are accounted for in cpu_system.
  metric {
    name = "cpu_intr"
    value_threshold = "1.0"
    title = "CPU intr"
  }
  metric {
    name = "cpu_sintr"
    value_threshold = "1.0"
    title = "CPU sintr"
  }
  */
}

collection_group {
  collect_every = 20
  time_threshold = 90
  /* Load Averages */
  metric {
    name = "load_one"
    value_threshold = "1.0"
    title = "One Minute Load Average"
  }
  metric {
    name = "load_five"
    value_threshold = "1.0"
    title = "Five Minute Load Average"
  }
  metric {
    name = "load_fifteen"
    value_threshold = "1.0"
    title = "Fifteen Minute Load Average"
  }
}

/* This group collects the number of running and total processes */
collection_group {
  collect_every = 80
  time_threshold = 950
  metric {
    name = "proc_run"
    value_threshold = "1.0"
    title = "Total Running Processes"
  }
  metric {
    name = "proc_total"
    value_threshold = "1.0"
    title = "Total Processes"
  }
}

/* This collection group grabs the volatile memory metrics every 40 secs and
   sends them at least every 180 secs. This time_threshold can be increased
   significantly to reduce unneeded network traffic. */
collection_group {
  collect_every = 40
  time_threshold = 180
  metric {
    name = "mem_free"
    value_threshold = "1024.0"
    title = "Free Memory"
  }
  metric {
    name = "mem_shared"
    value_threshold = "1024.0"
    title = "Shared Memory"
  }
  metric {
    name = "mem_buffers"
    value_threshold = "1024.0"
    title = "Memory Buffers"
  }
  metric {
    name = "mem_cached"
    value_threshold = "1024.0"
    title = "Cached Memory"
  }
  metric {
    name = "swap_free"
    value_threshold = "1024.0"
    title = "Free Swap Space"
  }
}

collection_group {
  collect_every = 40
  time_threshold = 300
  metric {
    name = "bytes_out"
    value_threshold = 4096
    title = "Bytes Sent"
  }
  metric {
    name = "bytes_in"
    value_threshold = 4096
    title = "Bytes Received"
  }
  metric {
    name = "pkts_in"
    value_threshold = 256
    title = "Packets Received"
  }
  metric {
    name = "pkts_out"
    value_threshold = 256
    title = "Packets Sent"
  }
}

/* Different than 2.5.x default since the old config made no sense */
collection_group {
  collect_every = 1800
  time_threshold = 3600
  metric {
    name = "disk_total"
    value_threshold = 1.0
    title = "Total Disk Space"
  }
}

collection_group {
  collect_every = 40
  time_threshold = 180
  metric {
    name = "disk_free"
    value_threshold = 1.0
    title = "Disk Space Available"
  }
  metric {
    name = "part_max_used"
    value_threshold = 1.0
    title = "Maximum Disk Space Used"
  }
}

include ("/usr/local/ganglia/etc/conf.d/*.conf")
This completes the configuration of Hadoop and Ganglia. Restart the Hadoop and Ganglia processes, and monitoring is in place.
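After the restart, one way to confirm that Hadoop metrics are reaching gmond is to read the XML that gmond serves on its tcp_accept_channel (port 8650 in the gmond.conf above) and list the metric names per host. The following Python sketch is only an illustration under those assumptions: `fetch_gmond_xml` and `metrics_by_host` are hypothetical helper names, and the host name in the usage comment has to be replaced with a gmond that is reachable from where the script runs.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8650, timeout=5.0):
    """Read the XML dump gmond writes to any client of its tcp_accept_channel."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after sending the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def metrics_by_host(xml_bytes):
    """Return {host name: sorted metric names} from a gmond XML dump."""
    root = ET.fromstring(xml_bytes)
    return {
        host.get("NAME"): sorted(m.get("NAME") for m in host.iter("METRIC"))
        for host in root.iter("HOST")
    }


# Usage (requires a running gmond; "Mas2" is one of the hosts in this article):
#   for host, names in metrics_by_host(fetch_gmond_xml("Mas2")).items():
#       print(host, len(names), "metrics")
```

If only the machine-level metrics from the collection groups (cpu_user, load_one, mem_free, and so on) show up for a host, the Hadoop daemons on that host are probably not sending to the address and port configured in hadoop-metrics2.properties.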