Hadoop workshop homework.
Since I am an Intellij Idea guy now (I shifted to Intellij Idea from Eclipse several months ago because Intellij Idea is much much better than Eclipse now). Currently Intellij does't have any Hadoop plugins, so I package the output into a jar file, then copy the jar (cdh4-example.jar) into Hadoop cluster with scp.
[root@n1 hadoop-examples]# scp [email protected]:~/prog/hadoop/cdh4-examples/cdh4-examples.jar . Password: cdh4-examples.jar
First example: WordCount01. Code as below:
package wc; import java.io.IOException; import java.text.DateFormat; import java.text.SimpleDateFormat; import java.util.Date; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; public class WordCount01 extends Configured implements Tool { public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } @Override public int run(String[] args) throws Exception { Configuration conf = getConf(); //conf.set("mapred.job.tracker", "192.168.1.201:9001"); Job job = new Job(conf, "word count"); job.setJarByClass(WordCount01.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); System.out.println("Task name: " + job.getJobName()); System.out.println("Task success? " + (job.isSuccessful() ? "Yes" : "No")); return job.isSuccessful() ? 0 : 1; } public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: wordcount <in> <out>"); System.exit(2); } DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); Date start = new Date(); int res = ToolRunner.run(new Configuration(), new WordCount01(), args); Date end = new Date(); float time = (float) ((end.getTime() - start.getTime()) / 60000.0); System.out.println("Task start: " + formatter.format(start)); System.out.println("Task end: " + formatter.format(end)); System.out.println("Time elapsed: " + String.valueOf(time) + " minutes."); System.exit(res); } }
Output:
[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar wc.WordCount01 Shakespeare.txt output 13/07/12 23:02:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 13/07/12 23:02:57 INFO input.FileInputFormat: Total input paths to process : 1 13/07/12 23:02:59 INFO mapred.JobClient: Running job: job_201307122107_0006 13/07/12 23:03:00 INFO mapred.JobClient: map 0% reduce 0% 13/07/12 23:03:32 INFO mapred.JobClient: map 26% reduce 0% 13/07/12 23:03:36 INFO mapred.JobClient: map 100% reduce 0% 13/07/12 23:03:49 INFO mapred.JobClient: map 100% reduce 100% 13/07/12 23:03:56 INFO mapred.JobClient: Job complete: job_201307122107_0006 13/07/12 23:03:56 INFO mapred.JobClient: Counters: 32 13/07/12 23:03:56 INFO mapred.JobClient: File System Counters 13/07/12 23:03:56 INFO mapred.JobClient: FILE: Number of bytes read=2151353 13/07/12 23:03:56 INFO mapred.JobClient: FILE: Number of bytes written=2933308 13/07/12 23:03:56 INFO mapred.JobClient: FILE: Number of read operations=0 13/07/12 23:03:56 INFO mapred.JobClient: FILE: Number of large read operations=0 13/07/12 23:03:56 INFO mapred.JobClient: FILE: Number of write operations=0 13/07/12 23:03:56 INFO mapred.JobClient: HDFS: Number of bytes read=10185958 13/07/12 23:03:56 INFO mapred.JobClient: HDFS: Number of bytes written=707043 13/07/12 23:03:56 INFO mapred.JobClient: HDFS: Number of read operations=2 13/07/12 23:03:56 INFO mapred.JobClient: HDFS: Number of large read operations=0 13/07/12 23:03:56 INFO mapred.JobClient: HDFS: Number of write operations=1 13/07/12 23:03:56 INFO mapred.JobClient: Job Counters 13/07/12 23:03:56 INFO mapred.JobClient: Launched map tasks=1 13/07/12 23:03:56 INFO mapred.JobClient: Launched reduce tasks=1 13/07/12 23:03:56 INFO mapred.JobClient: Data-local map tasks=1 13/07/12 23:03:56 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=32808 13/07/12 23:03:56 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=9469 13/07/12 23:03:56 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 13/07/12 23:03:56 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 13/07/12 23:03:56 INFO mapred.JobClient: Map-Reduce Framework 13/07/12 23:03:56 INFO mapred.JobClient: Map input records=417884 13/07/12 23:03:56 INFO mapred.JobClient: Map output records=1612019 13/07/12 23:03:56 INFO mapred.JobClient: Map output bytes=15218645 13/07/12 23:03:56 INFO mapred.JobClient: Input split bytes=117 13/07/12 23:03:56 INFO mapred.JobClient: Combine input records=1852684 13/07/12 23:03:56 INFO mapred.JobClient: Combine output records=306113 13/07/12 23:03:56 INFO mapred.JobClient: Reduce input groups=65448 13/07/12 23:03:56 INFO mapred.JobClient: Reduce shuffle bytes=470288 13/07/12 23:03:56 INFO mapred.JobClient: Reduce input records=65448 13/07/12 23:03:56 INFO mapred.JobClient: Reduce output records=65448 13/07/12 23:03:56 INFO mapred.JobClient: Spilled Records=371561 13/07/12 23:03:56 INFO mapred.JobClient: CPU time spent (ms)=9970 13/07/12 23:03:56 INFO mapred.JobClient: Physical memory (bytes) snapshot=288149504 13/07/12 23:03:56 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3245924352 13/07/12 23:03:56 INFO mapred.JobClient: Total committed heap usage (bytes)=142344192 Task name: word count Tash success? Yes Task start: 2013-07-12 23:02:53 Task end: 2013-07-12 23:03:56 Time elapsed: 1.046 minutes.
Note that in this example, we used a Combiner class, which is the same to the Reducer class(they must be the same). The combiner is used as local aggregator to reduce the data copied from mapper to reducer. We can see the difference between WordCount01 with WordCount02 below, from which the Combiner will be removed.
The second example:
package wc; import java.io.IOException; import java.text.DateFormat; import java.text.SimpleDateFormat; import java.util.Date; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; public class WordCount02 extends Configured implements Tool { public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } @Override public int run(String[] args) throws Exception { Configuration conf = getConf(); //conf.set("mapred.job.tracker", "192.168.1.201:9001"); Job job = new Job(conf, "word count"); job.setJarByClass(WordCount02.class); job.setMapperClass(TokenizerMapper.class); // job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); System.out.println("Task name: " + job.getJobName()); System.out.println("Task success? " + (job.isSuccessful() ? "Yes" : "No")); return job.isSuccessful() ? 0 : 1; } public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: wordcount <in> <out>"); System.exit(2); } DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); Date start = new Date(); int res = ToolRunner.run(new Configuration(), new WordCount02(), args); Date end = new Date(); float time = (float) ((end.getTime() - start.getTime()) / 60000.0); System.out.println("Task start: " + formatter.format(start)); System.out.println("Task end: " + formatter.format(end)); System.out.println("Time elapsed: " + String.valueOf(time) + " minutes."); System.exit(res); } }
Output:
[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar wc.WordCount02 Shakespeare.txt output 13/07/12 23:16:20 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 13/07/12 23:16:20 INFO input.FileInputFormat: Total input paths to process : 1 13/07/12 23:16:24 INFO mapred.JobClient: Running job: job_201307122107_0007 13/07/12 23:16:25 INFO mapred.JobClient: map 0% reduce 0% 13/07/12 23:16:42 INFO mapred.JobClient: map 100% reduce 0% 13/07/12 23:17:03 INFO mapred.JobClient: map 100% reduce 67% 13/07/12 23:17:06 INFO mapred.JobClient: map 100% reduce 100% 13/07/12 23:17:12 INFO mapred.JobClient: Job complete: job_201307122107_0007 13/07/12 23:17:12 INFO mapred.JobClient: Counters: 32 13/07/12 23:17:12 INFO mapred.JobClient: File System Counters 13/07/12 23:17:12 INFO mapred.JobClient: FILE: Number of bytes read=3871328 13/07/12 23:17:12 INFO mapred.JobClient: FILE: Number of bytes written=5564882 13/07/12 23:17:12 INFO mapred.JobClient: FILE: Number of read operations=0 13/07/12 23:17:12 INFO mapred.JobClient: FILE: Number of large read operations=0 13/07/12 23:17:12 INFO mapred.JobClient: FILE: Number of write operations=0 13/07/12 23:17:12 INFO mapred.JobClient: HDFS: Number of bytes read=10185958 13/07/12 23:17:12 INFO mapred.JobClient: HDFS: Number of bytes written=707043 13/07/12 23:17:12 INFO mapred.JobClient: HDFS: Number of read operations=2 13/07/12 23:17:12 INFO mapred.JobClient: HDFS: Number of large read operations=0 13/07/12 23:17:12 INFO mapred.JobClient: HDFS: Number of write operations=1 13/07/12 23:17:12 INFO mapred.JobClient: Job Counters 13/07/12 23:17:12 INFO mapred.JobClient: Launched map tasks=1 13/07/12 23:17:12 INFO mapred.JobClient: Launched reduce tasks=1 13/07/12 23:17:12 INFO mapred.JobClient: Data-local map tasks=1 13/07/12 23:17:12 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=20788 13/07/12 23:17:12 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=20552 13/07/12 23:17:12 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 13/07/12 23:17:12 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 13/07/12 23:17:12 INFO mapred.JobClient: Map-Reduce Framework 13/07/12 23:17:12 INFO mapred.JobClient: Map input records=417884 13/07/12 23:17:12 INFO mapred.JobClient: Map output records=1612019 13/07/12 23:17:12 INFO mapred.JobClient: Map output bytes=15218645 13/07/12 23:17:12 INFO mapred.JobClient: Input split bytes=117 13/07/12 23:17:12 INFO mapred.JobClient: Combine input records=0 13/07/12 23:17:12 INFO mapred.JobClient: Combine output records=0 13/07/12 23:17:12 INFO mapred.JobClient: Reduce input groups=65448 13/07/12 23:17:12 INFO mapred.JobClient: Reduce shuffle bytes=1382689 13/07/12 23:17:12 INFO mapred.JobClient: Reduce input records=1612019 13/07/12 23:17:12 INFO mapred.JobClient: Reduce output records=65448 13/07/12 23:17:12 INFO mapred.JobClient: Spilled Records=4836057 13/07/12 23:17:12 INFO mapred.JobClient: CPU time spent (ms)=11730 13/07/12 23:17:12 INFO mapred.JobClient: Physical memory (bytes) snapshot=275365888 13/07/12 23:17:12 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2282356736 13/07/12 23:17:12 INFO mapred.JobClient: Total committed heap usage (bytes)=91426816 Task name: word count Task success? Yes Task start: 2013-07-12 23:16:18 Task end: 2013-07-12 23:17:12 Time elapsed: 0.8969 minutes.
Note the difference between WordCount02 and WordCount01, in WordCount02, the combiner was removed, so the reducer input records increased, as the job output shown:
Reduce input records=1612019
In WordCount01, which was run with combiner:
Reduce input records=65448
The third example:
package wc; import java.io.IOException; import java.text.DateFormat; import java.text.SimpleDateFormat; import java.util.Date; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; public class WordCount03 extends Configured implements Tool { public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } @Override public int run(String[] args) throws Exception { Configuration conf = getConf(); //conf.set("mapred.job.tracker", "192.168.1.201:9001"); Job job = new Job(conf, "word count"); job.setJarByClass(WordCount03.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setNumReduceTasks(2); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); System.out.println("Task name: " + job.getJobName()); System.out.println("Task success? " + (job.isSuccessful() ? "Yes" : "No")); return job.isSuccessful() ? 0 : 1; } public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: wordcount <in> <out>"); System.exit(2); } DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); Date start = new Date(); int res = ToolRunner.run(new Configuration(), new WordCount03(), args); Date end = new Date(); float time = (float) ((end.getTime() - start.getTime()) / 60000.0); System.out.println("Task start: " + formatter.format(start)); System.out.println("Task end: " + formatter.format(end)); System.out.println("Time elapsed: " + String.valueOf(time) + " minutes."); System.exit(res); } }
Output:
[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar wc.WordCount03 Shakespeare.txt output 13/07/12 23:18:53 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 13/07/12 23:18:53 INFO input.FileInputFormat: Total input paths to process : 1 13/07/12 23:18:55 INFO mapred.JobClient: Running job: job_201307122107_0008 13/07/12 23:18:56 INFO mapred.JobClient: map 0% reduce 0% 13/07/12 23:19:16 INFO mapred.JobClient: map 70% reduce 0% 13/07/12 23:19:19 INFO mapred.JobClient: map 100% reduce 0% 13/07/12 23:19:33 INFO mapred.JobClient: map 100% reduce 50% 13/07/12 23:19:34 INFO mapred.JobClient: map 100% reduce 100% 13/07/12 23:19:39 INFO mapred.JobClient: Job complete: job_201307122107_0008 13/07/12 23:19:39 INFO mapred.JobClient: Counters: 32 13/07/12 23:19:39 INFO mapred.JobClient: File System Counters 13/07/12 23:19:39 INFO mapred.JobClient: FILE: Number of bytes read=3042226 13/07/12 23:19:39 INFO mapred.JobClient: FILE: Number of bytes written=3277040 13/07/12 23:19:39 INFO mapred.JobClient: FILE: Number of read operations=0 13/07/12 23:19:39 INFO mapred.JobClient: FILE: Number of large read operations=0 13/07/12 23:19:39 INFO mapred.JobClient: FILE: Number of write operations=0 13/07/12 23:19:39 INFO mapred.JobClient: HDFS: Number of bytes read=10185958 13/07/12 23:19:39 INFO mapred.JobClient: HDFS: Number of bytes written=707043 13/07/12 23:19:39 INFO mapred.JobClient: HDFS: Number of read operations=2 13/07/12 23:19:39 INFO mapred.JobClient: HDFS: Number of large read operations=0 13/07/12 23:19:39 INFO mapred.JobClient: HDFS: Number of write operations=2 13/07/12 23:19:39 INFO mapred.JobClient: Job Counters 13/07/12 23:19:39 INFO mapred.JobClient: Launched map tasks=1 13/07/12 23:19:39 INFO mapred.JobClient: Launched reduce tasks=2 13/07/12 23:19:39 INFO mapred.JobClient: Data-local map tasks=1 13/07/12 23:19:39 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=25064 13/07/12 23:19:39 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=21033 13/07/12 23:19:39 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 13/07/12 23:19:39 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 13/07/12 23:19:39 INFO mapred.JobClient: Map-Reduce Framework 13/07/12 23:19:39 INFO mapred.JobClient: Map input records=417884 13/07/12 23:19:39 INFO mapred.JobClient: Map output records=1612019 13/07/12 23:19:39 INFO mapred.JobClient: Map output bytes=15218645 13/07/12 23:19:39 INFO mapred.JobClient: Input split bytes=117 13/07/12 23:19:39 INFO mapred.JobClient: Combine input records=1852684 13/07/12 23:19:39 INFO mapred.JobClient: Combine output records=306113 13/07/12 23:19:39 INFO mapred.JobClient: Reduce input groups=65448 13/07/12 23:19:39 INFO mapred.JobClient: Reduce shuffle bytes=503734 13/07/12 23:19:39 INFO mapred.JobClient: Reduce input records=65448 13/07/12 23:19:39 INFO mapred.JobClient: Reduce output records=65448 13/07/12 23:19:39 INFO mapred.JobClient: Spilled Records=371561 13/07/12 23:19:39 INFO mapred.JobClient: CPU time spent (ms)=13840 13/07/12 23:19:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=395960320 13/07/12 23:19:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3918856192 13/07/12 23:19:39 INFO mapred.JobClient: Total committed heap usage (bytes)=162201600 Task name: word count Task success? Yes Task start: 2013-07-12 23:18:50 Task end: 2013-07-12 23:19:39 Time elapsed: 0.80191666 minutes.
In the third example, the number of reduce tasks was changed to 2, so there will be two reduce output file:
-rw-r--r-- 3 root supergroup 353110 2013-07-12 23:19 output/part-r-00000 -rw-r--r-- 3 root supergroup 353933 2013-07-12 23:19 output/part-r-00001
In WordCount01 and WordCount02, there were only one reduce output file generated. The reason is that in my cluster, the property of mapreduce: mapred.tasktracker.reduce.tasks.maximum was set to 1, which means there will be only 1 reduce tasks that a TaskTracker can run simultaneously. But if you specify the number of reduce tasks with org.apache.hadoop.conf.Configuration, the default configuration will be overwritten by the number you specified.