1. Environment: CentOS 6.5, 32-bit; development is done on Linux.
2. The core code is as follows:
Since we need to compute each user's upstream traffic, downstream traffic, and total traffic, it is natural to represent the Reducer's output value as a bean, and to use the user's phone number as the key.
Because this bean is transmitted over the network, it must be serializable: it has to implement the Writable interface and override write(DataOutput out) and readFields(DataInput in). write performs serialization and readFields performs deserialization, and the order in which fields are written in write must exactly match the order in which they are read back in readFields.
2.1 The Mapper class.
package com.npf.hadoop;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    private FlowBean flowBean = new FlowBean();
    private Text phoneNumKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // each input record is one tab-separated line of the traffic log
        String line = value.toString();
        String[] fields = StringUtils.split(line, "\t");
        // phone number is the second field; up/down flow are the 3rd- and 2nd-to-last fields
        String phoneNumber = fields[1];
        long upFlow = Long.valueOf(fields[fields.length - 3]);
        long downFlow = Long.valueOf(fields[fields.length - 2]);
        flowBean.setPhoneNumber(phoneNumber);
        flowBean.setUpFlow(upFlow);
        flowBean.setDownFlow(downFlow);
        flowBean.setSumFlow(upFlow + downFlow);
        phoneNumKey.set(phoneNumber);
        // emit <phone number, flow bean>
        context.write(phoneNumKey, flowBean);
    }
}
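For reference, the Mapper assumes tab-separated log lines in which the phone number is the second field and the upstream/downstream byte counts are the third-to-last and second-to-last fields. A purely illustrative line with made-up values, arranged to match the field positions the code relies on (the real log may contain more columns in between), might look like:

1363157985066	13726230503	00-FD-07-A4-72-B8:CMCC	120.196.100.82	24	27	2481	24681	200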
2.2 The Reducer class.

package com.npf.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    private FlowBean flowBean = new FlowBean();

    @Override
    protected void reduce(Text key, Iterable<FlowBean> iterable, Context context) throws IOException, InterruptedException {
        // accumulate the up/down flow of all records belonging to this phone number
        long upFlow = 0L;
        long downFlow = 0L;
        long sumFlow = 0L;
        for (FlowBean bean : iterable) {
            upFlow = upFlow + bean.getUpFlow();
            downFlow = downFlow + bean.getDownFlow();
        }
        sumFlow = upFlow + downFlow;
        flowBean.setPhoneNumber(key.toString());
        flowBean.setDownFlow(downFlow);
        flowBean.setUpFlow(upFlow);
        flowBean.setSumFlow(sumFlow);
        context.write(key, flowBean);
    }
}
2.3 The FlowBean class.

package com.npf.hadoop;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable {

    private String phoneNumber;
    private long upFlow;
    private long downFlow;
    private long sumFlow;

    public String getPhoneNumber() { return phoneNumber; }
    public void setPhoneNumber(String phoneNumber) { this.phoneNumber = phoneNumber; }

    public long getUpFlow() { return upFlow; }
    public void setUpFlow(long upFlow) { this.upFlow = upFlow; }

    public long getDownFlow() { return downFlow; }
    public void setDownFlow(long downFlow) { this.downFlow = downFlow; }

    public long getSumFlow() { return sumFlow; }
    public void setSumFlow(long sumFlow) { this.sumFlow = sumFlow; }

    // serialization: the write order here must match the read order in readFields
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(phoneNumber);
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    // deserialization: read the fields back in exactly the same order
    @Override
    public void readFields(DataInput in) throws IOException {
        phoneNumber = in.readUTF();
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }
}
2.4 The runner (the job's main entry point).

package com.npf.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FlowCountRunner {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        job.setJarByClass(FlowCountRunner.class);

        // mapper and its output types
        job.setMapperClass(FlowCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // reducer and the job's final output types
        job.setReducerClass(FlowCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("hdfs://devcitibank:9000/flowCountJob/srcdata"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://devcitibank:9000/flowCountJob/outputdata"));

        job.waitForCompletion(true);
    }
}
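One practical note not mentioned above: FileOutputFormat refuses to run if the output directory already exists, so when re-running the job the previous output has to be removed first, for example:

hadoop fs -rm -r /flowCountJob/outputdata

(older Hadoop releases use hadoop fs -rmr instead).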
3. Using Eclipse, we package the program into a jar under the /root directory and name it flowcount.jar.
4. OK, let's verify that the jar exists under /root/.
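A quick check from the shell:

ls -l /root/flowcount.jar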
5. Verify that the Hadoop cluster is up.
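One simple check, run on the cluster nodes, is jps, which lists the running Java processes; which daemons you should see depends on the Hadoop version, e.g. NameNode and DataNode for HDFS plus JobTracker/TaskTracker, or ResourceManager/NodeManager on YARN:

jps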
6. Verify that the input files we need to process are present under /flowCountJob/srcdata on the cluster.
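For example, using the HDFS shell:

hadoop fs -ls /flowCountJob/srcdata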
7. Submit flowcount.jar to the Hadoop cluster for processing.
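The exact submission command is not shown above; assuming the jar and class names used in this post, it would look roughly like this (the fully qualified main class can be omitted if it was recorded in the jar's manifest when exporting from Eclipse):

hadoop jar /root/flowcount.jar com.npf.hadoop.FlowCountRunner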
8. After the job finishes successfully, we check the result on the Hadoop cluster.
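The result files live in the output directory configured in the runner; assuming the paths above, they can be listed and viewed with:

hadoop fs -ls /flowCountJob/outputdata
hadoop fs -cat /flowCountJob/outputdata/part-r-00000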
Here we find that the result is not in the form we want. That is because our Java bean (FlowBean) does not override toString(), and TextOutputFormat writes each value using its toString() representation. So let's override FlowBean's toString() method:
@Override
public String toString() {
    return upFlow + " " + downFlow + " " + sumFlow;
}

Then we rebuild the jar, resubmit it to the Hadoop cluster, and check the result again.
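With this toString() in place, each output line is the phone number, a tab (added by TextOutputFormat between key and value), and then the three space-separated flow values. Continuing the illustrative numbers used earlier, a line would look roughly like:

13726230503	2481 24681 27162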
9. The source code is hosted on GitHub: https://github.com/HadoopOrganization/HadoopMapReducerFlowCount