计数器:计数器是用来记录job的执行进度和状态的。它的作用可以理解为日志。我们通常可以在程序的某个位置插入计数器,用来记录数据或者进度的变化情况,它比日志更便利进行分析。
1. 内置计数器
Hadoop其实内置了很多计数器,那么这些计数器在哪看呢?
我们先来看下最简单的wordcount程序。
HDFS上的源文件:
[hadoop@master logfile]$ hadoop fs -cat /MR_Counter/diary
Today is 2016-3-22
I study mapreduce counter
I realized that mapreduce counter is simple
WordCount.java:
package com.oner.mr.mrcounter;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static void main(String[] args) throws IOException,
InterruptedException, URISyntaxException, ClassNotFoundException {
Path inPath = new Path("/MR_Counter/");// 输入目录
Path outPath = new Path("/MR_Counter/out");// 输出目录
Configuration conf = new Configuration();
// conf.set("fsdefaultFS", "hdfs://master:9000");
FileSystem fs = FileSystem.get(new URI("hdfs://master:9000"), conf,
"hadoop");
if (fs.exists(outPath)) {// 如果输出目录已存在,则删除
fs.delete(outPath, true);
}
Job job = Job.getInstance(conf);
job.setJarByClass(WordCount.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);
job.waitForCompletion(true);
}
public static class MyMapper extends
Mapper {
private static Text k = new Text();
private static LongWritable v = new LongWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] words = line.split(" ");
for (String word : words) {
k.set(word);
context.write(k, v);
}
}
}
public static class MyReducer extends
Reducer {
private static LongWritable v = new LongWritable();
@Override
protected void reduce(Text key, Iterable values,
Reducer.Context context)
throws IOException, InterruptedException {
int sum = 0;
for (LongWritable value : values) {
sum += value.get();
}
v.set(sum);
context.write(key, v);
}
}
}
打成jar包后执行: hadoop jar wc.jar com.oner.mr.mrcounter.WordCount
发现有如下信息(注释部分是自己加的):
16/03/22 14:25:30 INFO mapreduce.Job: Counters: 49 // 表示本次job共49个计数器
File System Counters // 文件系统计数器
FILE: Number of bytes read=235
FILE: Number of bytes written=230421
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=189
HDFS: Number of bytes written=86
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters // 作业计数器
Launched map tasks=1 // 启动的map数为1
Launched reduce tasks=1 // 启动的reduce数为1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=12118
Total time spent by all reduces in occupied slots (ms)=11691
Total time spent by all map tasks (ms)=12118
Total time spent by all reduce tasks (ms)=11691
Total vcore-seconds taken by all map tasks=12118
Total vcore-seconds taken by all reduce tasks=11691
Total megabyte-seconds taken by all map tasks=12408832
Total megabyte-seconds taken by all reduce tasks=11971584
Map-Reduce Framework //MapReduce框架计数器
Map input records=3
Map output records=14
Map output bytes=201
Map output materialized bytes=235
Input split bytes=100
Combine input records=0
Combine output records=0
Reduce input groups=10
Reduce shuffle bytes=235
Reduce input records=14
Reduce output records=10
Spilled Records=28
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=331
CPU time spent (ms)=2820
Physical memory (bytes) snapshot=306024448
Virtual memory (bytes) snapshot=1690583040
Total committed heap usage (bytes)=136122368
Shuffle Errors // Shuffle错误计数器
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters // 文件输入格式计数器
Bytes Read=89 // Map从HDFS上读取的字节数,共89个字节
File Output Format Counters // 文件输出格式计数器
Bytes Written=86 //Reduce输出到HDFS上的字节数,共86个字节
上面的信息就是内置计数器的一些信息,包括:
文件系统计数器(File System Counters)
作业计数器(Job Counters)
MapReduce框架计数器(Map-Reduce Framework)
Shuffle 错误计数器(Shuffle Errors)
文件输入格式计数器(File Output Format Counters)
文件输出格式计数器(File Input Format Counters)
2. 自定义计数器
Hadoop也支持自定义计数器,在Hadoop2.x中可以使用Context的getCounter()方法(其实是接口TaskAttemptContext的方法,Context继承了该接口)得到自定义计数器。
public Counter getCounter(Enum> counterName):Get the Counter for the given counterName
public Counter getCounter(String groupName, String counterName):Get the Counter for the given groupName and counterName
由此可见,可以通过枚举或者字符串来得到计数器。
计数器常见的方法有几下几个:
String getName():Get the name of the counter
String getDisplayName():Get the display name of the counter
long getValue():Get the current value
void setValue(long value):Set this counter by the given value
void increment(long incr):Increment this counter by the given value
假设现在要在控制台输出源文件中的一些敏感词的个数,这里设定“mapreduce”为敏感词,该如何做呢?
package com.oner.mr.mrcounter;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static void main(String[] args) throws IOException,
InterruptedException, URISyntaxException, ClassNotFoundException {
Path inPath = new Path("/MR_Counter/");// 输入目录
Path outPath = new Path("/MR_Counter/out");// 输出目录
Configuration conf = new Configuration();
// conf.set("fsdefaultFS", "hdfs://master:9000");
FileSystem fs = FileSystem.get(new URI("hdfs://master:9000"), conf,
"hadoop");
if (fs.exists(outPath)) {// 如果输出目录已存在,则删除
fs.delete(outPath, true);
}
Job job = Job.getInstance(conf);
job.setJarByClass(WordCount.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);
job.waitForCompletion(true);
}
public static class MyMapper extends
Mapper {
private static Text k = new Text();
private static LongWritable v = new LongWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Counter sensitiveCounter = context.getCounter("Sensitive Words:",
"mapreduce");// 创建一个组是Sensitive Words,名是mapreduce的计数器
String line = value.toString();
String[] words = line.split(" ");
for (String word : words) {
if (word.equalsIgnoreCase("mapreduce")) {//如果出现了mapreduce,则计数器值加1
sensitiveCounter.increment(1L);
}
k.set(word);
context.write(k, v);
}
}
}
public static class MyReducer extends
Reducer {
private static LongWritable v = new LongWritable();
@Override
protected void reduce(Text key, Iterable values,
Reducer.Context context)
throws IOException, InterruptedException {
int sum = 0;
for (LongWritable value : values) {
sum += value.get();
}
v.set(sum);
context.write(key, v);
}
}
}
打成jar包后重新执行,发现控制台中确实多了一组计数器Sensitive Words:,其中有一个名叫mapreduce的计数器,值为2。