WordCount Analysis

1. Create a new Java project, then copy the examples folder from /home/hadoop/hadoop-1.0.4/src.

Create a new folder named src in the project and paste the copied sources into it.

If you get "Error: Could not find or load main class", right-click the src folder --> Build Path --> Use as Source Folder.

2. Copy hadoop-1.0.4-eclipse-plugin.jar into Eclipse's plugins directory, then restart Eclipse.

3. Set the Hadoop installation directory and configure the Hadoop location.


4. Attach the Hadoop source code to the project so you can browse it freely.

5. Java heap space error

java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

The allocation that fails is in MapOutputBuffer's constructor (MapTask.java), where the map-side sort buffer is sized from io.sort.mb:

int maxMemUsage = sortmb << 20;
int recordCapacity = (int)(maxMemUsage * recper);
recordCapacity -= recordCapacity % RECSIZE;
kvbuffer = new byte[maxMemUsage - recordCapacity];

So we should tune the value of io.sort.mb to avoid this. The machines I run on are fairly low-spec: three nodes, each with 512 MB of memory. I did not set this parameter in core-site.xml; for this job I set it directly in the driver code:

conf.set("io.sort.mb", "10");
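
To get a feel for the numbers, here is a rough sketch of the buffer sizing shown above. The 100 and 0.05 figures are my assumption of the stock Hadoop 1.0.4 defaults for io.sort.mb and io.sort.record.percent, so treat the exact values as illustrative only:

public class SortBufferMath {
    public static void main(String[] args) {
        int sortmb = 100;                                  // assumed default io.sort.mb
        float recper = 0.05f;                              // assumed default io.sort.record.percent
        int maxMemUsage = sortmb << 20;                    // 104,857,600 bytes (~100 MB)
        int recordCapacity = (int) (maxMemUsage * recper); // ~5 MB reserved for record metadata
        System.out.println(maxMemUsage - recordCapacity);  // ~99,614,720 bytes for kvbuffer
        // A single ~100 MB allocation easily blows a small JVM heap on a 512 MB node;
        // setting io.sort.mb to 10 shrinks kvbuffer to roughly 10 MB.
    }
}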

6. Sample test data for WordCount:

10
9
8
7
6
5
4
3
2
1
line1
line3
line2
line5
Line4



The output file after running is:



1        1
10       1
2        1
3        1
4        1
5        1
6        1
7        1
8        1
9        1
line1    2
line2    2
line3    2
line4    2
line5    2
line6    1

There is also a _SUCCESS file, which indicates that the job completed successfully.

You can see that the output is sorted. The sort is done according to the key type; in the WordCount example the key is Text (a string type).
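
Incidentally, this also explains why "10" appears before "2" in the listing above: Text keys are compared byte by byte (lexicographically), not numerically. A small illustration, not part of the original example:

import org.apache.hadoop.io.Text;

public class TextOrderDemo {
    public static void main(String[] args) {
        // Text compares its bytes lexicographically, so "10" sorts before "2",
        // matching the order of the reduce output shown above.
        System.out.println(new Text("10").compareTo(new Text("2")) < 0); // prints true
    }
}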

7. The WordCount example does not handle the case where the output directory already exists. To make testing easier, we add the following code to deal with the directory:

Path outPath = new Path(args[1]);
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
if (dfs.exists(outPath)) {
    dfs.delete(outPath, true);
}

8. Why are the WordCount demo's mapper and reducer classes both static? Is that really necessary?

If we remove the static keyword from the mapper class and run the job, we get the following exception:

java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.examples.WordCount$TokenizerMapper.<init>()
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoSuchMethodException: org.apache.hadoop.examples.WordCount$TokenizerMapper.<init>()
    at java.lang.Class.getConstructor0(Class.java:2730)

At this point the mapper has become a (non-static) inner class of the WordCount class, so the reflection helper cannot find a no-argument constructor and cannot instantiate it.

The fix is to make the mapper a non-inner class: move it out of the WordCount class, either to the top level of the same file or into a separate file, and the job runs fine again.

As you can see, the example keeps things as simple as possible by putting everything in one class, and marking the nested classes static is enough to make them run correctly. If your mapper and reducer are not particularly complex, this design is perfectly reasonable; if they are complex, it is better to pull them out into their own classes.
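
Here is a minimal sketch of why the non-static version fails. This is plain Java reflection, not Hadoop code, and the class names are made up for illustration:

public class Outer {
    class Inner {}          // non-static: its only constructor takes an Outer instance
    static class Nested {}  // static: has a genuine no-argument constructor

    public static void main(String[] args) throws Exception {
        // Works: a static nested class can be created through its no-arg constructor,
        // which is what Hadoop's ReflectionUtils.newInstance() relies on.
        Object ok = Nested.class.getDeclaredConstructor().newInstance();
        // Throws java.lang.NoSuchMethodException: Outer$Inner.<init>(),
        // the same failure shown in the stack trace above.
        Object fails = Inner.class.getDeclaredConstructor().newInstance();
    }
}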

9. By default, when we run or debug directly inside Eclipse, the job does not execute on the Hadoop cluster; it is simulated inside the local process, which makes debugging convenient. You can see output in the console mentioning LocalJobRunner rather than the JobTracker.

This is also why, even when we set the number of reduce tasks to more than 1, we still see only a single part-00000 style file in the output directory: LocalJobRunner supports only one reduce task.

Still, being able to finish the code here and then have it actually run on the cluster is very useful; some errors simply do not show up until the code runs on a real cluster (there is a lot to watch out for when writing distributed applications).

Submitting to the cluster essentially means packaging your code into a jar file and submitting that jar, so that is what needs to be done here.

I use spork's EJob class to do this; if you are familiar with the approach you can write your own, following http://www.cnblogs.com/spork/archive/2010/04/21/1717592.html. Follow that article and then make a few small adjustments in the driver code.
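
For reference, these are the driver lines (taken from the full listing in section 11 below) that handle the packaging and submission. The mapred.job.tracker setting is commented out in that listing; uncomment it if you want the job to go to the JobTracker instead of LocalJobRunner (the host name and paths are from my own setup):

File jarFile = EJob.createTempJar("bin");            // pack the compiled classes under bin/ into a temporary jar
EJob.addClasspath("/home/hadoop/hadoop-1.0.4/conf"); // put the Hadoop conf directory on the classpath
Thread.currentThread().setContextClassLoader(EJob.getClassLoader());

Configuration conf = new Configuration();
// conf.set("mapred.job.tracker", "namenode:9001");  // uncomment to submit to the cluster's JobTracker
Job job = new Job(conf, "word count");
((JobConf) job.getConfiguration()).setJar(jarFile.toString()); // tell Hadoop which jar to ship with the job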

10. What if I want all words whose first letter is less than 'N' to be handled by the first reduce task, and everything else to be output by the second reduce task?

Write your own Partitioner class. The default is HashPartitioner; a simple custom implementation, set on the job, is all that is needed, as shown by the MyPartitioner class in the full listing below.

11. The complete, modified WordCount source code:

package org.apache.hadoop.examples;

import java.io.File;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @SuppressWarnings("unused")
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    if (false) {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    } else {
      String s = value.toString();
      String[] words = s.split("\\s+");
      for (int i = 0; i < words.length; i++) {
        words[i] = words[i].replaceAll("[^\\w]", "");
        // System.out.println(words[i]);
        word.set(words[i].toUpperCase());
        if (words[i].length() > 0)
          context.write(word, one);
      }
    }
  }
}

public class WordCount {

  public static class MyPartitioner<K, V> extends Partitioner<K, V> {

    public int getPartition(K key, V value, int numReduceTasks) {
      if (key.toString().toUpperCase().toCharArray()[0] < 'N') return 0;
      else return 1;
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    args = "hdfs://namenode:9000/user/hadoop/englishwords hdfs://namenode:9000/user/hadoop/out".split(" ");

    File jarFile = EJob.createTempJar("bin");
    EJob.addClasspath("/home/hadoop/hadoop-1.0.4/conf");
    // conf.set("mapred.job.tracker", "namenode:9001");
    ClassLoader classLoader = EJob.getClassLoader();
    Thread.currentThread().setContextClassLoader(classLoader);

    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }

    // drop the output directory if it already exists
    Path outPath = new Path(args[1]);
    FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
    if (dfs.exists(outPath)) {
      dfs.delete(outPath, true);
    }

    conf.set("io.sort.mb", "10");
    Job job = new Job(conf, "word count");

    ((JobConf) job.getConfiguration()).setJar(jarFile.toString());
    job.setNumReduceTasks(2); // use two reduce tasks to process the output
    job.setPartitionerClass(MyPartitioner.class);

    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
