Continuing from "A Simple MapReduce Program for Per-User Mobile Traffic Statistics on Hadoop 2.4.1 (Part 2)": we now have a new requirement. Records must be handled by different reducers according to the province that each user's phone number belongs to.
1. Environment: CentOS 6.5, 32-bit; development is done on Linux.
2. The core code is as follows:
2.1 The Mapper class. It parses each input line, pulls out the phone number and the upstream/downstream traffic fields, and emits the phone number as the key with a populated FlowBean as the value.

package com.npf.hadoop.partition;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.npf.hadoop.FlowBean;

public class FlowCountPartitionMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    // Reused across map() calls so we do not allocate new objects per record.
    private FlowBean flowBean = new FlowBean();
    private Text phoneNumKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = StringUtils.split(line, "\t");
        // The phone number is the second field; the traffic counters sit at
        // fixed offsets from the end of the record.
        String phoneNumber = fields[1];
        long upFlow = Long.valueOf(fields[fields.length - 3]);
        long downFlow = Long.valueOf(fields[fields.length - 2]);
        flowBean.setPhoneNumber(phoneNumber);
        flowBean.setUpFlow(upFlow);
        flowBean.setDownFlow(downFlow);
        flowBean.setSumFlow(upFlow + downFlow);
        phoneNumKey.set(phoneNumber);
        context.write(phoneNumKey, flowBean);
    }
}

2.2 The Reducer class. It accumulates the upstream and downstream traffic for each phone number and writes one summary FlowBean per key.
package com.npf.hadoop.partition;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import com.npf.hadoop.FlowBean;

public class FlowCountPartitionReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    private FlowBean flowBean = new FlowBean();

    @Override
    protected void reduce(Text key, Iterable<FlowBean> iterable, Context context) throws IOException, InterruptedException {
        long upFlow = 0L;
        long downFlow = 0L;
        long sumFlow = 0L;
        // Sum the traffic of every record that shares this phone number.
        for (FlowBean bean : iterable) {
            upFlow = upFlow + bean.getUpFlow();
            downFlow = downFlow + bean.getDownFlow();
        }
        sumFlow = upFlow + downFlow;
        flowBean.setPhoneNumber(key.toString());
        flowBean.setDownFlow(downFlow);
        flowBean.setUpFlow(upFlow);
        flowBean.setSumFlow(sumFlow);
        context.write(key, flowBean);
    }
}
2.3 The Partitioner class. Numbers starting with 136 go to reducer 0, numbers starting with 137 to reducer 1, 138 to reducer 2, 139 to reducer 3, and all other numbers to reducer 4.
package com.npf.hadoop.partition;

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartition<KEY, VALUE> extends Partitioner<KEY, VALUE> {

    // Static lookup table from phone-number prefix to reducer number,
    // built once when the class is loaded.
    private static Map<String, Integer> areaMap = new HashMap<String, Integer>();

    static {
        areaMap.put("136", 0);
        areaMap.put("137", 1);
        areaMap.put("138", 2);
        areaMap.put("139", 3);
    }

    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // Route by the first three digits; unknown prefixes fall back to reducer 4.
        Integer code = areaMap.get(key.toString().substring(0, 3));
        return code == null ? 4 : code;
    }
}

2.4 The FlowCountPartitionRunner class. It wires the mapper, reducer, input/output formats, and the custom partitioner into a job, and sets the number of reduce tasks to 5 to match the five partitions.
package com.npf.hadoop.partition;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.npf.hadoop.FlowBean;

public class FlowCountPartitionRunner {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        job.setJarByClass(FlowCountPartitionRunner.class);

        job.setMapperClass(FlowCountPartitionMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        job.setReducerClass(FlowCountPartitionReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Plug in the custom partitioner and run one reducer per partition.
        job.setPartitionerClass(MyPartition.class);
        job.setNumReduceTasks(5);

        FileInputFormat.setInputPaths(job, new Path("hdfs://devcitibank:9000/flowCountJob/srcdata"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://devcitibank:9000/flowCountJob/outputdatapar"));

        job.waitForCompletion(true);
    }
}

3. We use Eclipse to package the program into a jar under the /root directory, naming it flowpartitioncount.jar.
4. OK, let's verify that the jar is present in the /root/ directory.
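A quick check from the shell is enough here:

ls -l /root/flowpartitioncount.jar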
5. Verify that the Hadoop cluster is up.
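One way to confirm this is running jps on the cluster nodes (a sketch; which daemons appear on which host depends on your deployment):

jps
# on a healthy Hadoop 2.x cluster, expect daemons such as
# NameNode, DataNode, ResourceManager, and NodeManager in the listing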
6. Verify that the files we need to process exist under the /flowCountJob/srcdata directory on the cluster.
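A minimal check, assuming the local hadoop client is configured against the devcitibank cluster:

hadoop fs -ls /flowCountJob/srcdata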
7. Submit flowpartitioncount.jar to the Hadoop cluster for processing.
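Assuming hadoop is on the PATH of the node holding the jar, the submission looks like this, with the main class being the runner shown above:

hadoop jar /root/flowpartitioncount.jar com.npf.hadoop.partition.FlowCountPartitionRunner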
8. After the job completes successfully, we check the results on the Hadoop cluster. Since we run 5 reducer instances, five output files, numbered 0 through 4, are produced.
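Listing the output directory configured in the runner should show the five part files (plus the standard _SUCCESS marker):

hadoop fs -ls /flowCountJob/outputdatapar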
Let's look at each of the five files in turn (a command sketch follows the list).
part-r-00000
part-r-00001
part-r-00002
part-r-00003
part-r-00004
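Each file can be printed with hadoop fs -cat, for example:

hadoop fs -cat /flowCountJob/outputdatapar/part-r-00000

and likewise for part-r-00001 through part-r-00004.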
9. The source code is hosted on GitHub: https://github.com/HadoopOrganization/HadoopMapReducerFlowCount