This topic describes how to set up your Hadoop deployment to correctly collect metrics with the Splunk App for HadoopOps. We identify some of the root causes of problems you can face and give you the information you need to resolve them. Metrics collection issues can result from:

* Hadoop not being configured correctly to collect metrics data (verify that the hadoop-metrics.properties file is set up correctly and that FileContext is in place to log metrics to files).
* Splunk inputs not being configured correctly to collect the metrics data (see the inputs.conf examples later in this topic).

Hadoop exposes metrics that can be used to measure performance and events related to Hadoop services. These metrics are a valuable resource for monitoring performance, monitoring cluster health, and debugging system problems. You can configure the Hadoop daemons to collect this data and handle it using plug-ins. The Hadoop daemons NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker all expose runtime metrics. The Metrics1 system is used by Apache Hadoop 0.20.0 and CDH3 (which is based on version 0.20.2). The new Metrics2 system replaces the previous Metrics1 system and, among other improvements, supports sending metrics to multiple output plug-ins (sinks) and filtering metrics by source, context, and record.
When setting up Hadoop metrics collection, use a metrics system and servlet that work with the Hadoop version you have installed. The following table lists the supported metrics systems and servlets for each Hadoop version.
Hadoop version and metrics system reference
Hadoop version | Metrics system | Servlets |
---|---|---|
Apache Hadoop 0.20.2 | metrics1 | metrics |
CDH3u1+ | metrics1 | metrics and jmx |
Apache Hadoop 0.20.203 | metrics2 | not available |
Apache Hadoop 0.20.205 | metrics2 | jmx |
Apache Hadoop 1.1.x (HDP 1.2) | metrics2 | jmx |
Apache Hadoop 2.0.x (CDH4 with YARN) | metrics2 | jmx |
CDH4 with MR1 | metrics1 | JobTracker/TaskTracker - metrics and jmx |
CDH4 with MR1 | metrics2 | HDFS - jmx |
Note: CDH refers to Cloudera's Distribution Including Apache Hadoop. HDP refers to the Hortonworks Data Platform.
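If you are not sure which metrics system your cluster uses, here is a quick check, a sketch that assumes HADOOP_CONF_DIR points at your Hadoop configuration directory. Note that some distributions ship both properties files; the Hadoop version in the table above determines which one the daemons actually read.

# List the metrics configuration files present, then confirm the release
# against the version table above.
ls -l $HADOOP_CONF_DIR/hadoop-metrics*.properties
hadoop version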
You can collect metrics data and feed it into Splunk using the Splunk App for HadoopOps. Hadoop exposes the metrics through one of the following two servlet endpoints:

* /metrics. The metrics servlet works only with metrics1.
* /jmx. The JMX servlet works with both metrics1 and metrics2.

The JMX JSON servlet is generally the preferred method. However, for HadoopOps we recommend that you configure metrics collection using /metrics when you run the old metrics1 system, because JMX support in metrics1 is limited.

To collect metrics in Splunk as scripted inputs, configure your input files to use the correct ports. The following table lists the default ports that the Hadoop daemons use.
Hadoop default web server ports
Service | Default port |
---|---|
namenode | 50070 |
datanode | 50075 |
jobtracker | 50030 |
tasktracker | 50060 |
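Before configuring the Splunk inputs, it can help to confirm that each daemon's embedded web server answers on its port. A minimal smoke test, assuming the daemons run locally on the default ports (swap /metrics for /jmx on metrics2 clusters):

# Each request should return metrics text (or JSON from /jmx) rather than an error page.
curl -s http://127.0.0.1:50070/metrics | head    # namenode
curl -s http://127.0.0.1:50075/metrics | head    # datanode
curl -s http://127.0.0.1:50030/metrics | head    # jobtracker
curl -s http://127.0.0.1:50060/metrics | head    # tasktracker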
To set up HadoopOps to collect metrics:

1. Configure the Hadoop daemons to expose metrics by editing the hadoop-metrics.properties (metrics1) or hadoop-metrics2.properties (metrics2) file, as shown in the examples below.
2. Configure Splunk_TA_hadoopops/local/inputs.conf to collect the metrics with the `hadoop_metrics` sourcetype, as either scripted inputs or file inputs.
3. Reload the scripted inputs so that the changes take effect (see the example call after these steps):

curl -k https://<host>:<port>/services/data/inputs/script/_reload
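For example, against a forwarder's default management port, the reload call might look like the following; the port and credentials here are placeholders for your own deployment:

# 8089 is the default Splunk management port; -k skips certificate validation.
curl -k -u admin:changeme https://localhost:8089/services/data/inputs/script/_reload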
The following examples show how you can configure Hadoop and Splunk to collect metrics data for mapping to the Splunk App for HadoopOps.
This is a sample <HADOOP_CONF_DIR>/hadoop-metrics.properties
file showing the configuration for use with the metrics1 system. Note that later assignments of a property override earlier ones, so as written the FileContext settings take effect.

dfs.class = org.apache.hadoop.metrics.spi.NoEmitMetricsContext
mapred.class = org.apache.hadoop.metrics.spi.NoEmitMetricsContext
jvm.class = org.apache.hadoop.metrics.spi.NoEmitMetricsContext
#ugi.class = org.apache.hadoop.metrics.spi.NoEmitMetricsContext

dfs.class = org.apache.hadoop.metrics.file.FileContext
dfs.period=10
dfs.fileName=/tmp/dfsmetrics.log

mapred.class = org.apache.hadoop.metrics.file.FileContext
mapred.period=10
mapred.fileName=/tmp/mapredmetrics.log

jvm.class = org.apache.hadoop.metrics.file.FileContext
jvm.period=10
jvm.fileName=/tmp/jvmmetrics.log

#ugi.class = org.apache.hadoop.metrics.file.FileContext
#ugi.period=10
#ugi.fileName=/tmp/ugimetrics.log
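After you restart the daemons with this configuration, FileContext should begin appending records at each configured period. A quick way to confirm, assuming the /tmp file names from the sample above:

# Metrics records should appear every 10 seconds (the configured period).
tail -f /tmp/dfsmetrics.log /tmp/mapredmetrics.log /tmp/jvmmetrics.log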
This is a sample <HADOOP_CONF_DIR>/hadoop-metrics2.properties file showing the configuration for use with the metrics2 system.
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
*.period=10
namenode.sink.file.filename=namenode-metrics.out
datanode.sink.file.filename=datanode-metrics.out
jobtracker.sink.file.filename=jobtracker-metrics.out
tasktracker.sink.file.filename=tasktracker-metrics.out
maptask.sink.file.filename=maptask-metrics.out
reducetask.sink.file.filename=reducetask-metrics.out
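Note that the file names in this sample are relative paths, so FileSink resolves them against each daemon's working directory (typically the directory the daemon was started from). One way to locate the output files, assuming HADOOP_HOME points at your installation:

# Search the install tree for the metrics2 FileSink output files.
find $HADOOP_HOME -name '*-metrics.out' 2>/dev/null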
This is a sample Splunk_TA_hadoopops/local/inputs.conf
file showing how to collect metrics as a scripted input using the metrics1 system. Use the host IP address and port number specific to your cluster configuration.

# namenode
[script://./bin/hadoopmon_metrics.sh http://127.0.0.1:50070/metrics]
disabled = 0
interval = 10
sourcetype = hadoop_metrics
index = hadoopmon_metrics

# secondary_namenode
#[script://./bin/hadoopmon_metrics.sh http://127.0.0.1:50090/metrics]
#disabled = 0
#interval = 10
#sourcetype = hadoop_metrics
#index = hadoopmon_metrics

# datanode
[script://./bin/hadoopmon_metrics.sh http://127.0.0.1:50075/metrics]
disabled = 0
interval = 10
sourcetype = hadoop_metrics
index = hadoopmon_metrics

# jobtracker
[script://./bin/hadoopmon_metrics.sh http://127.0.0.1:50030/metrics]
disabled = 0
interval = 10
sourcetype = hadoop_metrics
index = hadoopmon_metrics

# tasktracker
[script://./bin/hadoopmon_metrics.sh http://127.0.0.1:50060/metrics]
disabled = 0
interval = 10
sourcetype = hadoop_metrics
index = hadoopmon_metrics
This is a sample Splunk_TA_hadoopops/local/inputs.conf
file showing how to collect metrics as a scripted input using the metrics2 system. Use the host IP address and port number specific to your cluster configuration.

# namenode
[script://./bin/hadoopmon_metrics.sh http://127.0.0.1:50070/jmx]
disabled = 0
interval = 10
sourcetype = hadoop_metrics
index = hadoopmon_metrics

# secondary_namenode
#[script://./bin/hadoopmon_metrics.sh http://127.0.0.1:50090/jmx]
#disabled = 0
#interval = 10
#sourcetype = hadoop_metrics
#index = hadoopmon_metrics

# datanode
[script://./bin/hadoopmon_metrics.sh http://127.0.0.1:50075/jmx]
disabled = 0
interval = 10
sourcetype = hadoop_metrics
index = hadoopmon_metrics

# jobtracker
[script://./bin/hadoopmon_metrics.sh http://127.0.0.1:50030/jmx]
disabled = 0
interval = 10
sourcetype = hadoop_metrics
index = hadoopmon_metrics

# tasktracker
[script://./bin/hadoopmon_metrics.sh http://127.0.0.1:50060/jmx]
disabled = 0
interval = 10
sourcetype = hadoop_metrics
index = hadoopmon_metrics
Note: For the /jmx endpoint, you can add '?qry=Hadoop:*' after '/jmx' to filter out non-Hadoop metrics. Some versions use a lowercase domain, '?qry=hadoop:*'.
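For example, with the URL quoted so the shell does not interpret the ? and * characters:

# Returns only MBeans in the Hadoop domain; use hadoop:* on versions with a lowercase domain.
curl -s 'http://127.0.0.1:50070/jmx?qry=Hadoop:*'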
This is a sample Splunk_TA_hadoopops/local/inputs.conf
file showing how to collect metrics using a file input and the metrics1 system.
[monitor://<absolute_path_to_dfs_metrics_output_file>]
disabled = 0
sourcetype=hadoop_metrics
index=hadoopmon_metrics

[monitor://<absolute_path_to_mapred_metrics_output_file>]
disabled = 0
sourcetype=hadoop_metrics
index=hadoopmon_metrics

[monitor://<absolute_path_to_jvm_metrics_output_file>]
disabled = 0
sourcetype=hadoop_metrics
index=hadoopmon_metrics
This is a sample Splunk_TA_hadoopops/local/inputs.conf
file showing how to collect metrics using a file input and the metrics2 system.
[monitor://<absolute_path_to_namenode_metrics_output_file>]
disabled = 0
sourcetype=hadoop_metrics
index=hadoopmon_metrics

[monitor://<absolute_path_to_datanode_metrics_output_file>]
disabled = 0
sourcetype=hadoop_metrics
index=hadoopmon_metrics

[monitor://<absolute_path_to_jobtracker_metrics_output_file>]
disabled = 0
sourcetype=hadoop_metrics
index=hadoopmon_metrics

[monitor://<absolute_path_to_tasktracker_metrics_output_file>]
disabled = 0
sourcetype=hadoop_metrics
index=hadoopmon_metrics

[monitor://<absolute_path_to_maptask_metrics_output_file>]
disabled = 0
sourcetype=hadoop_metrics
index=hadoopmon_metrics

[monitor://<absolute_path_to_reducetask_metrics_output_file>]
disabled = 0
sourcetype=hadoop_metrics
index=hadoopmon_metrics
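Whichever input style you use, you can confirm that events are arriving with a quick search from the Splunk CLI. A sketch, assuming a local Splunk instance and placeholder credentials:

# Show a few recent metrics events from the HadoopOps index.
$SPLUNK_HOME/bin/splunk search 'index=hadoopmon_metrics sourcetype=hadoop_metrics | head 5' -auth admin:changeme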
If you have correctly configured metrics collection in the Splunk App for HadoopOps, the metrics data populates the dashboards. The events displayed differ depending on the collection method you have configured for your app. The following examples show sample data for each of the following configurations:

* the /metrics endpoint
* the /jmx endpoint
* file inputs
If the data does not come in as expected, see the troubleshooting examples later in this topic.

Sample events from the /metrics endpoint:

Node Type | Example |
---|---|
namenode | context=dfs,sub_context=namenode,hostName=xxx,sessionId=,AddBlockOps=11,CreateFileOps=12,... |
datanode | context=dfs,sub_context=datanode,hostName=xxx,sessionId=,blockChecksumOp_avg_time=0,... |
jobtracker | context=mapred,sub_context=jobtracker,hostName=xxx,sessionId=,blacklisted_maps=0,... |
tasktracker | context=mapred,sub_context=tasktracker,hostName=xxx,sessionId=,failedDirs=0,mapTaskSlots=1,... |
Sample events from the /jmx endpoint:

Node Type | Example |
---|---|
namenode | { "beans" : [ { "name" : "Hadoop:service=NameNode,name=MetricsSystem,sub=Stats", "modelerType" : "MetricsSystem,sub=Stats", "tag.context" : "metricssystem", "tag.hostName" : "xxx", ... }, { "name" : "Hadoop:service=NameNode,name=FSNamesystemMetrics", "modelerType" : "FSNamesystemMetrics", "tag.context" : "dfs", "tag.hostName" : "xxx", "FilesTotal" : 8683, "BlocksTotal" : 8659, "CapacityTotalGB" : 288, "CapacityUsedGB" : 111, "CapacityRemainingGB" : 153, ...}, ...] } |
datanode | { "beans" : [ { "name" : "Hadoop:service=DataNode,name=DataNode", "modelerType" : "DataNode", "tag.context" : "dfs", "tag.sessionId" : null, "tag.hostName" : "xxx", "bytes_written" : 33654041158, "bytes_read" : 3063741372, "blocks_written" : 16858, "blocks_read" : 4044, ... }, ... ] } |
jobtracker | { "beans" : [ { "name" : "Hadoop:service=JobTracker,name=JobTrackerMetrics", "modelerType" : "JobTrackerMetrics", "tag.context" : "mapred", "tag.sessionId" : "", "tag.hostName" : "xxx", "map_slots" : 8, "reduce_slots" : 8, "blacklisted_maps" : 0, "blacklisted_reduces" : 0, "maps_launched" : 13064, "maps_completed" : 13047, "maps_failed" : 17, "reduces_launched" : 8, "reduces_completed" : 8, "reduces_failed" : 0, "jobs_submitted" : 9, "jobs_completed" : 9, ... }, { "name" : "Hadoop:service=JobTracker,name=QueueMetrics,q=default", "modelerType" : "QueueMetrics,q=default", "tag.context" : "mapred", "tag.sessionId" : "", "tag.Queue" : "default", "tag.hostName" : "xxx", "maps_launched" : 13064, "maps_completed" : 13047, "maps_failed" : 17, "reduces_launched" : 8, "reduces_completed" : 8, "reduces_failed" : 0, "jobs_submitted" : 9, "jobs_completed" : 9, ... }, ...] } |
tasktracker | { "beans" : [ { "name" : "Hadoop:service=TaskTracker,name=TaskTrackerMetrics", "modelerType" : "TaskTrackerMetrics", "tag.context" : "mapred", "tag.sessionId" : "", "tag.hostName" : "xxx", "maps_running" : 0, "reduces_running" : 0, "mapTaskSlots" : 2, "reduceTaskSlots" : 2, "failedDirs" : 0, "tasks_completed" : 78, "tasks_failed_timeout" : 0, "tasks_failed_ping" : 0 }, { "name" : "Hadoop:service=TaskTracker,name=ShuffleServerMetrics", "modelerType" : "ShuffleServerMetrics", "tag.context" : "mapred", "tag.sessionId" : "", "tag.hostName" : "xxx", "shuffle_handler_busy_percent" : 0.0, "shuffle_output_bytes" : 4689439468, "shuffle_failed_outputs" : 0, "shuffle_success_outputs" : 5248, "shuffle_exceptions_caught" : 0 }, ... ] } |
Sample events collected using file inputs:

Node Type | Example |
---|---|
namenode | dfs.namenode: hostName=xxx, sessionId=, AddBlockOps=2019, CreateFileOps=2022, ... |
datanode | dfs.datanode: hostName=xxx, sessionId=, blockChecksumOp_avg_time=0, blockChecksumOp_num_ops=0,... |
jobtracker | mapred.jobtracker: hostName=xxx, sessionId=, blacklisted_maps=0, blacklisted_reduces=0,... |
tasktracker | mapred.tasktracker: hostName=xxx, sessionId=, failedDirs=0, mapTaskSlots=2,... |
The following examples show the output from a CDH3u3 Hadoop cluster and from Apache Hadoop 1.1.1.
CDH3u3 is based on Apache Hadoop 0.20.2 and uses the old metrics1 system.

# hadoop version
Hadoop 0.20.2-cdh3u3
Subversion git://ubuntu-slave01/var/lib/jenkins/workspace/CDH3-Selective/build/cdh3/hadoop20/0.20.2-cdh3u3/source -r 318bc781117fa276ae81a3d111f5eeba0020634f
Compiled by jenkins on Tue Mar 20 13:26:06 PDT 2012
From source with checksum fc0b509a5d10a59ca4a620ed2f321480
Apache Hadoop 1.1.1 uses the new metrics2 system.

$ hadoop version
Hadoop 1.1.1
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1411108
Compiled by hortonfo on Mon Nov 19 10:48:11 UTC 2012
From source with checksum 9be520e845cf135867fb8b927a40affb
I'm using the REST API and the metrics1 system. I'm not getting all of my data.
Possible problems:

* The hadoop-metrics.properties file specifies NullContext for one or more contexts, so the metrics are updated but the data is never exposed. Use NoEmitMetricsContext instead so that the /metrics servlet can serve the data.

For comparison, the following output is from a correctly configured namenode:
# curl http://127.0.0.1:50070/metrics
dfs
  FSDirectory {hostName=cdh1,sessionId=}:
    files_deleted=2
  FSNamesystem {hostName=cdh1,sessionId=}:
    BlockCapacity=2097152
    BlocksTotal=162
    CapacityRemainingGB=90
    CapacityTotalGB=133
    CapacityUsedGB=38
    CorruptBlocks=0
    ExcessBlocks=0
    FilesTotal=37
    MissingBlocks=0
    PendingDeletionBlocks=0
    PendingReplicationBlocks=0
    ScheduledReplicationBlocks=0
    TotalLoad=3
    UnderReplicatedBlocks=0
  namenode {hostName=cdh1,sessionId=}:
    AddBlockOps=2
    CreateFileOps=2
    DeleteFileOps=2
    FileInfoOps=13
    FilesAppended=0
    FilesCreated=2
    FilesInGetListingOps=114
    FilesRenamed=0
    GetBlockLocations=0
    GetListingOps=64
    JournalTransactionsBatchedInSync=0
    SafemodeTime=55260
    Syncs_avg_time=0
    Syncs_num_ops=10
    Transactions_avg_time=0
    Transactions_num_ops=12
    blockReport_avg_time=0
    blockReport_num_ops=38
    fsImageLoadTime=9934
jvm
  metrics {hostName=cdh1,processName=NameNode,sessionId=}:
    gcCount=41
    gcTimeMillis=288
    logError=0
    logFatal=0
    logInfo=202
    logWarn=0
...
The problem in this case is that NullContext is dropping all the metrics data: the context and record names appear, but they carry no values. Replace NullContext with NoEmitMetricsContext in hadoop-metrics.properties and restart the daemons.

# curl http://127.0.0.1:50070/metrics
dfs
  FSDirectory
  FSNamesystem
  namenode
jvm
  metrics
rpc
  detailed-metrics
  metrics
ugi
  ugi
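A quick way to confirm this condition is to look for NullContext bindings in the properties file, assuming HADOOP_CONF_DIR points at your Hadoop configuration directory:

# Any uncommented NullContext line here means that context's data is being dropped.
grep -n 'NullContext' $HADOOP_CONF_DIR/hadoop-metrics.properties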
/jmx - output from a correctly configured namenode:

# curl http://127.0.0.1:50070/jmx?qry=*adoop:*
{
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=MetricsSystem,sub=Stats",
    "modelerType" : "MetricsSystem,sub=Stats",
    "tag.context" : "metricssystem",
    "tag.hostName" : "funtoo",
    "num_sources" : 6,
    "num_sinks" : 0,
    "snapshot_num_ops" : 0,
    "snapshot_avg_time" : 0.0,
    "snapshot_stdev_time" : 0.0,
    "snapshot_imin_time" : 3.4028234663852886E38,
    "snapshot_imax_time" : 1.401298464324817E-45,
    "snapshot_min_time" : 3.4028234663852886E38,
    "snapshot_max_time" : 1.401298464324817E-45,
    "publish_num_ops" : 0,
    "publish_avg_time" : 0.0,
    "publish_stdev_time" : 0.0,
    "publish_imin_time" : 3.4028234663852886E38,
    "publish_imax_time" : 1.401298464324817E-45,
    "publish_min_time" : 3.4028234663852886E38,
    "publish_max_time" : 1.401298464324817E-45,
    "dropped_pub_all" : 0
  }, {
    "name" : "Hadoop:service=NameNode,name=FSNamesystemMetrics",
    "modelerType" : "FSNamesystemMetrics",
    "tag.context" : "dfs",
    "tag.hostName" : "funtoo",
    "FilesTotal" : 8683,
    "BlocksTotal" : 8659,
    "CapacityTotalGB" : 288,
    "CapacityUsedGB" : 111,
    "CapacityRemainingGB" : 153,
    "TotalLoad" : 4,
    "CorruptBlocks" : 0,
    "ExcessBlocks" : 0,
    "PendingDeletionBlocks" : 0,
    "PendingReplicationBlocks" : 0,
    "UnderReplicatedBlocks" : 23,
    "ScheduledReplicationBlocks" : 0,
    "MissingBlocks" : 0,
    "BlockCapacity" : 2097152
  }, {
    "name" : "Hadoop:service=NameNode,name=NameNodeInfo",
    "modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
    "Threads" : 27,
    "HostName" : "funtoo",
    "Used" : 119565756416,
    "Version" : "1.1.1, r1411108",
    "Total" : 309426700288,
    "UpgradeFinalized" : true,
    "Free" : 164132498944,
    "Safemode" : "",
    "NonDfsUsedSpace" : 25728444928,
    "PercentUsed" : 38.641056,
    "PercentRemaining" : 53.044064,
    "TotalBlocks" : 8659,
    "TotalFiles" : 8683,
    "LiveNodes" : "{\"fun1\":{\"usedSpace\":31294746624,\"lastContact\":1},\"solaris\":{\"usedSpace\":30706014208,\"lastContact\":2},\"freebsd\":{\"usedSpace\":27930959872,\"lastContact\":1},\"fedora\":{\"usedSpace\":29634035712,\"lastContact\":1}}",
    "DeadNodes" : "{}",
    "DecomNodes" : "{}",
    "NameDirStatuses" : "{\"failed\":{},\"active\":{\"/home/hadoop/hadoop-1.1.1/libexec/../tmp/hadoop-hadoop/dfs/name\":\"IMAGE_AND_EDITS\"}}"
  },
  ... ]
}
I have configured my app for Hadoop metrics collection, but I'm not getting the metrics data. The possible problems are:

* The scripted inputs are not enabled.
* The collection script cannot reach the metrics REST endpoint.
Use the following command to check what you have enabled:

Splunk_TA_hadoopops/bin/hopsconfig.sh --list-all --auth <username>:<password>

The following output is a result of running this command:
*** Splunk> Splunk_TA_hadoopops command-line setup > SHOW INPUT STATUS ***

Scripted Inputs:
0) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/hadoopmon_cpu.sh
   enabled: ***   disabled:   interval: 30
1) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/hadoopmon_df.sh
   enabled: ***   disabled:   interval: 300
2) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/hadoopmon_dfsreport.sh
   enabled:   disabled: ***
3) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/hadoopmon_fsckreport.sh
   enabled:   disabled: ***
4) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/hadoopmon_iostat.sh
   enabled: ***   disabled:   interval: 60
5) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/hadoopmon_metrics.sh http://127.0.0.1:50070/jmx?qry=Hadoop:*
   enabled: ***   disabled:   interval: 10
6) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/hadoopmon_metrics.sh http://127.0.0.1:50090/jmx?qry=Hadoop:*
   enabled: ***   disabled:   interval: 10
7) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/hadoopmon_ps.sh
   enabled: ***   disabled:   interval: 30
8) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/hadoopmon_top.sh
   enabled: ***   disabled:   interval: 60
9) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/hadoopmon_vmstat.sh
   enabled: ***   disabled:   interval: 60
10) /home/hadoop/splunkforwarder/etc/apps/Splunk_TA_hadoopops/bin/introspect.sh
   enabled: ***   disabled:   interval: -1

Monitor Inputs:
11) /home/hadoop/hadoop-1.1.1/conf/*.xml
   enabled: ***   disabled:
12) /home/hadoop/hadoop-1.1.1/logs/hadoop-hadoop-namenode-funtoo*
   enabled: ***   disabled:
13) /home/hadoop/hadoop-1.1.1/logs/hadoop-hadoop-secondarynamenode-funtoo*
   enabled: ***   disabled:
Run 'hadoopmon_metrics.sh' manually to ensure that the script can get data from the REST endpoint. The following output is a result of running the command:
# Splunk_TA_hadoopops/bin/hadoopmon_metrics.sh http://127.0.0.1:50070/jmx?qry=Hadoop:*
{
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=MetricsSystem,sub=Stats",
    "modelerType" : "MetricsSystem,sub=Stats",
    "tag.context" : "metricssystem",
    "tag.hostName" : "funtoo",
    "num_sources" : 6,
    "num_sinks" : 0,
    "snapshot_num_ops" : 0,
    "snapshot_avg_time" : 0.0,
    "snapshot_stdev_time" : 0.0,
    "snapshot_imin_time" : 3.4028234663852886E38,
    "snapshot_imax_time" : 1.401298464324817E-45,
    "snapshot_min_time" : 3.4028234663852886E38,
    "snapshot_max_time" : 1.401298464324817E-45,
    "publish_num_ops" : 0,
    "publish_avg_time" : 0.0,
    "publish_stdev_time" : 0.0,
    "publish_imin_time" : 3.4028234663852886E38,
    ...
  },
  ... ]
}
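If the script returns data but events still do not appear correctly in Splunk, one further check is to make sure the endpoint returns well-formed JSON. A sketch using Python's built-in pretty-printer:

# json.tool exits non-zero and prints an error if the response does not parse.
curl -s 'http://127.0.0.1:50070/jmx?qry=Hadoop:*' | python -m json.tool | head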