Today I finally got the weather-data example from Hadoop: The Definitive Guide running on a Hadoop cluster, so I'm writing the steps down.
I had searched Baidu/Google and never found a write-up that spells out every single step of running your own MapReduce job on a cluster. After a fair bit of blind, headless-fly fumbling it finally worked, and I'm pretty pleased...
1 Prepare the weather data (a simplified version of the data in the book: substring(5, 9) of each line is the year and substring(15, 19) is the temperature; for example, in aaaaa1990aaaaaa0039a the year is 1990 and the temperature is 0039)
aaaaa1990aaaaaa0039a
bbbbb1991bbbbbb0040a
ccccc1992cccccc0040c
ddddd1993dddddd0043d
eeeee1994eeeeee0041e
aaaaa1990aaaaaa0031a
bbbbb1991bbbbbb0020a
ccccc1992cccccc0030c
ddddd1993dddddd0033d
eeeee1994eeeeee0031e
aaaaa1990aaaaaa0041a
bbbbb1991bbbbbb0040a
ccccc1992cccccc0040c
ddddd1993dddddd0043d
eeeee1994eeeeee0041e
aaaaa1990aaaaaa0044a
bbbbb1991bbbbbb0045a
ccccc1992cccccc0041c
ddddd1993dddddd0023d
eeeee1994eeeeee0041e
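As a quick sanity check of those offsets (purely illustrative; cut counts characters starting at 1, so columns 6-9 and 16-19 correspond to Java's substring(5, 9) and substring(15, 19) used in the mapper below):
echo "aaaaa1990aaaaaa0039a" | cut -c6-9     # prints 1990, the year
echo "aaaaa1990aaaaaa0039a" | cut -c16-19   # prints 0039, the temperature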
2 Write the map and reduce functions plus the driver (Job)
Keeping it simple, as follows:
package hadoop.test;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int MISSING = 9999;

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Our data is a simplified version of the book's weather records:
            // substring(5, 9) is the year, substring(15, 19) is the temperature.
            String year = line.substring(5, 9);
            int airTemperature = Integer.parseInt(line.substring(15, 19));
            if (airTemperature != MISSING) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    static class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Keep the highest temperature seen for this year.
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        try {
            Job job = new Job();
            job.setJarByClass(MaxTemperature.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(MaxTemperatureMapper.class);
            job.setReducerClass(MaxTemperatureReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
3 Package the code from step 2 as HadoopTest.jar and put it in some local directory, e.g. /home/hadoop/Documents/
Then export HADOOP_CLASSPATH=/home/hadoop/Documents/
(When packaging, pick the main class; without one, running the jar seems to fail. Eclipse's export dialog has a Main Class option; a command-line alternative is sketched right after this note.
Otherwise, when you run hadoop jar, you have to give the fully qualified main class name, package included, after the jar,
e.g. hadoop jar /home/hadoop/Documents/HadoopTest.jar hadoop.test.MaxTemperature /user/hadoop/temperature output
)
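If you'd rather not go through Eclipse, something like the following should also produce a jar with Main-Class recorded in its manifest (just a sketch: the src/ and bin/ layout and the hadoop-core jar path are assumptions about your setup, so adjust them to match yours):
mkdir -p bin
javac -classpath $HADOOP_HOME/hadoop-core-*.jar -d bin src/hadoop/test/MaxTemperature.java   # assumes Hadoop 1.x jar naming
jar cfe /home/hadoop/Documents/HadoopTest.jar hadoop.test.MaxTemperature -C bin .            # "e" writes Main-Class into the manifest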
4 Upload the data to be analyzed to HDFS
hadoop dfs -put /home/hadoop/Documents/temperature ./temperature
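The relative path ./temperature lands under the hadoop user's HDFS home directory, which should be the /user/hadoop/temperature path used below; a quick check:
hadoop dfs -ls ./temperature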
5 Run it
hadoop jar /home/hadoop/Documents/HadoopTest.jar /user/hadoop/temperature output
This isn't quite the command in the book, but there the book is running in local mode. I also don't know what export HADOOP_CLASSPATH=/home/hadoop/Documents/ is actually for; running hadoop jar HadoopTest.jar /user/hadoop/temperature output (without the full path to the jar) does not work. Why exactly is something to keep digging into; this will do for now.
Here HadoopTest.jar is on the local filesystem, the data file temperature to be analyzed is on HDFS, and the output is also produced on HDFS; output is a directory.
hadoop@hadoop1:~$ hadoop dfs -cat ./output/part-r-00000
1990 44
1991 45
1992 41
1993 43
1994 41
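If you want the result back on the local filesystem, something like this should do it (the local target path is just an example):
hadoop dfs -get ./output/part-r-00000 /home/hadoop/Documents/max_temperature.txt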