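Sample data:
Field-by-field explanation of one record:
1363157993044 (access timestamp) 18211575961 (phone number) 94-71-AC-CD-E6-18:CMCC-EASY (MAC address) 120.196.100.99 (IP address) iface.qiyi.com (site name) 视频网站 (site category, "video site") 15 12 1527 (upstream traffic) 2106 (downstream traffic) 200 (status code)
Requirement: extract and aggregate the data above to compute, for each user, the upstream traffic, downstream traffic, and total traffic for one day (note: a user is likely to have multiple records within a single day).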
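To make the record layout concrete, here is a minimal sketch of how the three fields needed for the statistics map onto zero-based indices after splitting a record. It assumes the fields are tab-separated (the job's mapper splits on "\t"); the class name RecordParseSketch and the ASCII placeholder for the category field are hypothetical, used only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the original job.
class RecordParseSketch {

    /** Returns {phone, upstream, downstream} from one tab-separated record. */
    static String[] extract(String line) {
        String[] fields = line.split("\t");
        // Index 1 = phone number, 8 = upstream traffic, 9 = downstream traffic
        return new String[] { fields[1], fields[8], fields[9] };
    }

    public static void main(String[] args) {
        // The sample record from above, rebuilt with tabs between fields.
        // "video-site" is an ASCII placeholder for the Chinese category label.
        String line = String.join("\t",
                "1363157993044", "18211575961", "94-71-AC-CD-E6-18:CMCC-EASY",
                "120.196.100.99", "iface.qiyi.com", "video-site",
                "15", "12", "1527", "2106", "200");
        System.out.println(Arrays.toString(extract(line)));
        // prints [18211575961, 1527, 2106]
    }
}
```

If the real input uses a different separator, only the argument to split would need to change.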
Use the custom Writable data type DataBean, created in the previous section, as the common value type.
Worked example of the process:
13888888888 5000 7000
13888888888 2000 1000
13888888888 2000 2000
13888888888 3000 3000
13899999999 3000 4000
13899999999 3000 4000
Map phase:
context.write(13888888888, DataBean("", 5000, 7000));
context.write(13888888888, DataBean("", 2000, 1000));
context.write(13888888888, DataBean("", 2000, 2000));
context.write(13888888888, DataBean("", 3000, 3000));
...
Reduce phase:
<13888888888, {DataBean("", 5000, 7000), DataBean("", 2000, 1000), DataBean("", 2000, 2000), DataBean("", 3000, 3000)}>
<13899999999, {DataBean("", 3000, 4000), DataBean("", 3000, 4000)}>
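The grouping and summing shown above can be imitated in plain Java, without Hadoop, to check the arithmetic by hand. This is only a sketch of the idea: a TreeMap stands in for the shuffle step that groups values by key, and the class name FlowSumSketch and its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the map -> shuffle -> reduce flow; no Hadoop involved.
class FlowSumSketch {

    // The six example records: phone, upstream, downstream (space-separated here).
    static final String[] SAMPLE = {
        "13888888888 5000 7000", "13888888888 2000 1000",
        "13888888888 2000 2000", "13888888888 3000 3000",
        "13899999999 3000 4000", "13899999999 3000 4000"
    };

    /** Returns phone -> {upSum, downSum, total}, with keys in sorted order. */
    static Map<String, long[]> sumByPhone(String[] records) {
        // "Map" + "shuffle": emit (phone, {up, down}) and group the pairs by phone
        Map<String, List<long[]>> grouped = new TreeMap<>();
        for (String record : records) {
            String[] f = record.split(" ");
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>())
                   .add(new long[] { Long.parseLong(f[1]), Long.parseLong(f[2]) });
        }
        // "Reduce": sum each group's upstream and downstream values
        Map<String, long[]> totals = new TreeMap<>();
        for (Map.Entry<String, List<long[]>> e : grouped.entrySet()) {
            long up = 0;
            long down = 0;
            for (long[] pair : e.getValue()) {
                up += pair[0];
                down += pair[1];
            }
            totals.put(e.getKey(), new long[] { up, down, up + down });
        }
        return totals;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, long[]> e : sumByPhone(SAMPLE).entrySet()) {
            long[] t = e.getValue();
            System.out.println(e.getKey() + " " + t[0] + " " + t[1] + " " + t[2]);
        }
        // prints:
        // 13888888888 12000 13000 25000
        // 13899999999 6000 8000 14000
    }
}
```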
The MapReduce program:
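import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// DataBean is the custom Writable from the previous section (same package).
public class DataCount {

    public static class DCMapper extends Mapper<LongWritable, Text, Text, DataBean> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Receive one line of input
            String line = value.toString();
            // Split it into tab-separated fields
            String[] fields = line.split("\t");
            // Extract the useful fields: phone number, upstream, downstream
            String tel = fields[1];
            long up = Long.parseLong(fields[8]);
            long down = Long.parseLong(fields[9]);
            // Assemble the DataBean
            DataBean bean = new DataBean(tel, up, down);
            // Emit (phone number, bean)
            context.write(new Text(tel), bean);
        }
    }

    public static class DCReducer extends Reducer<Text, DataBean, Text, DataBean> {
        @Override
        protected void reduce(Text key, Iterable<DataBean> values, Context context)
                throws IOException, InterruptedException {
            long upSum = 0;
            long downSum = 0;
            // Sum the traffic of all records that share this phone number
            for (DataBean bean : values) {
                upSum += bean.getUpPayLoad();
                downSum += bean.getDownPayLoad();
            }
            // The key already carries the phone number, so the bean's own is left empty
            DataBean bean = new DataBean("", upSum, downSum);
            context.write(key, bean);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(DataCount.class);

        job.setMapperClass(DCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DataBean.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        job.setReducerClass(DCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DataBean.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit with a non-zero code if the job fails
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}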
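Define the DataBean class with the fields phone number, upstream traffic, downstream traffic, and total traffic, and generate getters and setters; the parameterized constructor sets the total traffic to upstream + downstream.
Define the DataCount class with two nested classes, DCMapper and DCReducer; implement the business logic in their map and reduce methods, and configure the job in the main method.
Package the project and export it to /root/examples.jar.
Upload the source data file to the /data.doc directory.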
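For readers without the previous section at hand, the DataBean described in the first step might look roughly like the sketch below. It is an assumption, not the original class: it is named DataBeanSketch, it omits the setters for brevity, and it leaves out "implements org.apache.hadoop.io.Writable" so that it compiles without Hadoop. The write and readFields methods use the same signatures that the Writable interface declares, and getTel, getTotalPayLoad, the toString format, and the roundTrip helper are hypothetical additions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only: in a real job this class would declare
// "implements org.apache.hadoop.io.Writable".
class DataBeanSketch {
    private String tel = "";
    private long upPayLoad;
    private long downPayLoad;
    private long totalPayLoad;

    // No-arg constructor: Hadoop creates Writable instances reflectively
    DataBeanSketch() { }

    DataBeanSketch(String tel, long upPayLoad, long downPayLoad) {
        this.tel = tel;
        this.upPayLoad = upPayLoad;
        this.downPayLoad = downPayLoad;
        this.totalPayLoad = upPayLoad + downPayLoad; // total = upstream + downstream
    }

    // Serialize the fields; readFields must read them back in the same order
    public void write(DataOutput out) throws IOException {
        out.writeUTF(tel);
        out.writeLong(upPayLoad);
        out.writeLong(downPayLoad);
        out.writeLong(totalPayLoad);
    }

    public void readFields(DataInput in) throws IOException {
        tel = in.readUTF();
        upPayLoad = in.readLong();
        downPayLoad = in.readLong();
        totalPayLoad = in.readLong();
    }

    public String getTel() { return tel; }
    public long getUpPayLoad() { return upPayLoad; }
    public long getDownPayLoad() { return downPayLoad; }
    public long getTotalPayLoad() { return totalPayLoad; }

    // Assumed output format: upstream, downstream, total, separated by tabs
    @Override
    public String toString() {
        return upPayLoad + "\t" + downPayLoad + "\t" + totalPayLoad;
    }

    /** Hypothetical helper: serializes a bean to bytes and reads it back. */
    static DataBeanSketch roundTrip(DataBeanSketch bean) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            bean.write(new DataOutputStream(buffer));
            DataBeanSketch copy = new DataBeanSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        DataBeanSketch copy = roundTrip(new DataBeanSketch("18211575961", 1527, 2106));
        System.out.println(copy.getTel() + "\t" + copy);
    }
}
```

The round trip in main mirrors what happens when a bean travels from a mapper to a reducer: the fields are written to a byte stream and then read back into a fresh instance.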
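Notes:
Before running, check whether the output directory already exists. If it does, delete it before running the MR job; otherwise the job fails because FileOutputFormat refuses to write to an existing output directory.
Execution result: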
13480253104 180 180 360
13502468823 7335 110349 117684
13560436666 1116 954 2070
13560439658 2034 5892 7926
13602846565 1938 2910 4848
13660577991 6960 690 7650
13719199419 240 0 240
13726230503 2481 24681 27162
13726238888 2481 24681 27162
13760778710 120 120 240
13826544101 264 0 264
13922314466 3008 3720 6728
13925057413 11058 48243 59301
13926251106 240 0 240
13926435656 132 1512 1644
15013685858 3659 3538 7197
15920133257 3156 2936 6092
15989002119 1938 180 2118
18211575961 1527 2106 3633
18320173382 9531 2412 11943
84138413 4116 1432 5548
Compared against the source data, the results above are correct.