MapReduce Programming: Computing Per-User Network Traffic with DataCount

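Sample data:
[Image 1: screenshot of the raw data file]
Fields of one raw record:
1363157993044 (access time) 18211575961 (phone number) 94-71-AC-CD-E6-18:CMCC-EASY (MAC address) 120.196.100.99 (IP address) iface.qiyi.com (site name) 视频网站 (site category, here "video site") 15 12 1527 (upstream traffic) 2106 (downstream traffic) 200 (status code)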
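The field layout above can be turned into a small parsing routine. The following is a minimal sketch, assuming the fields are tab-separated and that the phone number, upstream traffic, and downstream traffic sit at zero-based positions 1, 8, and 9 (the positions the mapper in this article reads). The class `RecordParserSketch`, its nested `Traffic` type, and the choice to return `null` for malformed lines are illustrative assumptions, not part of the original program.

```java
// Hypothetical helper for illustration; not part of the original DataCount job.
final class RecordParserSketch {

    /** Phone number plus the two traffic counters taken from one record. */
    static final class Traffic {
        final String phone;
        final long up;
        final long down;

        Traffic(String phone, long up, long down) {
            this.phone = phone;
            this.up = up;
            this.down = down;
        }
    }

    /**
     * Parses one tab-separated record. Returns null when the line has fewer
     * than 10 fields or the traffic columns are not numeric.
     */
    static Traffic parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 10) {
            return null;
        }
        try {
            long up = Long.parseLong(fields[8].trim());
            long down = Long.parseLong(fields[9].trim());
            return new Traffic(fields[1].trim(), up, down);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The sample record from the article, joined with tabs.
        String line = String.join("\t",
                "1363157993044", "18211575961", "94-71-AC-CD-E6-18:CMCC-EASY",
                "120.196.100.99", "iface.qiyi.com", "video",
                "15", "12", "1527", "2106", "200");
        Traffic t = parse(line);
        System.out.println(t.phone + " " + t.up + " " + t.down);
        // prints: 18211575961 1527 2106
    }
}
```

Rejecting short or non-numeric lines up front is one way to keep a single bad record from failing the whole job; whether to skip or count such lines is a design choice left open here.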

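Requirement: extract and aggregate the data above to compute each user's upstream traffic, downstream traffic, and total traffic for one day. (Note: a user is likely to have several records within a single day.)
  Use DataBean, the custom Writable data type created in the previous section, as the common value type.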

Worked example of the implementation (columns: phone number, upstream traffic, downstream traffic):

13888888888 5000 7000
13888888888 2000 1000
13888888888 2000 2000
13888888888 3000 3000
13899999999 3000 4000
13899999999 3000 4000

Map phase (one key/value pair is emitted per record):
context.write("13888888888", DataBean("", 5000, 7000));
context.write("13888888888", DataBean("", 2000, 1000));
context.write("13888888888", DataBean("", 2000, 2000));
context.write("13888888888", DataBean("", 3000, 3000));
...

Reduce phase (each phone number arrives together with all of its DataBean values):
<13888888888, {DataBean("", 5000, 7000), DataBean("", 2000, 1000), DataBean("", 2000, 2000), DataBean("", 3000, 3000)}>
<13899999999, {DataBean("", 3000, 4000), DataBean("", 3000, 4000)}>
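To make the two phases concrete, the grouping and summing can be simulated in plain Java without Hadoop. This is only a sketch of the data flow, assuming whitespace-separated "phone up down" lines like the six sample records above; the class name `FlowSumSketch` and the use of `long[]` in place of DataBean are illustrative choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map, shuffle, and reduce steps; no Hadoop involved.
final class FlowSumSketch {

    /** Groups {up, down} pairs by phone number, then sums each group. */
    static Map<String, long[]> sumByPhone(List<String> lines) {
        // "Map" and "shuffle": emit <phone, {up, down}> and group the values by key.
        Map<String, List<long[]>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>())
                   .add(new long[] {Long.parseLong(f[1]), Long.parseLong(f[2])});
        }
        // "Reduce": one output per key, holding {up, down, total}.
        Map<String, long[]> result = new TreeMap<>();
        for (Map.Entry<String, List<long[]>> e : grouped.entrySet()) {
            long up = 0;
            long down = 0;
            for (long[] v : e.getValue()) {
                up += v[0];
                down += v[1];
            }
            result.put(e.getKey(), new long[] {up, down, up + down});
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "13888888888 5000 7000",
                "13888888888 2000 1000",
                "13888888888 2000 2000",
                "13888888888 3000 3000",
                "13899999999 3000 4000",
                "13899999999 3000 4000");
        for (Map.Entry<String, long[]> e : sumByPhone(lines).entrySet()) {
            long[] v = e.getValue();
            System.out.println(e.getKey() + " " + v[0] + " " + v[1] + " " + v[2]);
        }
        // prints:
        // 13888888888 12000 13000 25000
        // 13899999999 6000 8000 14000
    }
}
```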

Writing the MapReduce program:

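import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// DataBean is the custom Writable from the previous section; it is assumed to live in the same package.
public class DataCount {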

    public static class DCMapper extends Mapper<LongWritable, Text, Text, DataBean>{

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Read one line of input
            String line = value.toString();
            // Split the line on tabs
            String[] fields = line.split("\t");
            // Extract the fields we need: phone number, upstream and downstream traffic
            String tel = fields[1];
            long up = Long.parseLong(fields[8]);
            long down = Long.parseLong(fields[9]);
            // Build the DataBean
            DataBean bean = new DataBean(tel, up, down);
            // Emit <phone number, bean>
            context.write(new Text(tel), bean);
        }

    }

    public static class DCReducer extends Reducer<Text, DataBean, Text, DataBean>{

        @Override
        protected void reduce(Text key, Iterable<DataBean> values, Context context)
                throws IOException, InterruptedException {
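            // Sum upstream and downstream traffic over all records of this phone number
            long upSum = 0;
            long downSum = 0;
            for (DataBean bean : values) {
                upSum += bean.getUpPayLoad();
                downSum += bean.getDownPayLoad();
            }
            // The key already carries the phone number, so the bean's phone field stays empty
            DataBean bean = new DataBean("", upSum, downSum);
            context.write(key, bean);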
        }


    }


    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(DataCount.class);

        job.setMapperClass(DCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DataBean.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        job.setReducerClass(DCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DataBean.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Wait for the job to finish and exit with a non-zero status if it failed
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }


}

Developing the Hadoop program with Maven

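  Define the DataBean class with four properties: phone number, upstream traffic, downstream traffic, and total traffic. Generate getters and setters; in the parameterized constructor, total traffic is set to upstream + downstream.
  Define the DataCount class with two nested classes, DCMapper and DCReducer. Implement the business logic in their map and reduce methods, and configure the job in the main method.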
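Since DataBean itself comes from the previous section and is not listed here, the following is a hedged sketch of what such a class might look like, not the author's actual code. It is named `DataBeanSketch` to make that clear. Only `getUpPayLoad` and `getDownPayLoad` appear in this article; the field names, `getTotalPayLoad`, the tab-separated `toString` format, and the `roundTrip` helper are assumptions. To keep the sketch runnable without Hadoop on the classpath, it does not declare `implements Writable`, but `write(DataOutput)` and `readFields(DataInput)` are the two methods that interface requires. Setters are omitted for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only. In a real job this class would declare
// "implements org.apache.hadoop.io.Writable".
final class DataBeanSketch {
    private String tel;
    private long upPayLoad;
    private long downPayLoad;
    private long totalPayLoad;

    // Hadoop instantiates Writables reflectively, so a no-arg constructor is needed.
    DataBeanSketch() {
    }

    DataBeanSketch(String tel, long upPayLoad, long downPayLoad) {
        this.tel = tel;
        this.upPayLoad = upPayLoad;
        this.downPayLoad = downPayLoad;
        this.totalPayLoad = upPayLoad + downPayLoad; // total = upstream + downstream
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(tel);
        out.writeLong(upPayLoad);
        out.writeLong(downPayLoad);
        out.writeLong(totalPayLoad);
    }

    // Deserialize in exactly the same order as write().
    public void readFields(DataInput in) throws IOException {
        tel = in.readUTF();
        upPayLoad = in.readLong();
        downPayLoad = in.readLong();
        totalPayLoad = in.readLong();
    }

    long getUpPayLoad() { return upPayLoad; }
    long getDownPayLoad() { return downPayLoad; }
    long getTotalPayLoad() { return totalPayLoad; }

    // Assumed output format: upstream, downstream, total, separated by tabs.
    @Override
    public String toString() {
        return upPayLoad + "\t" + downPayLoad + "\t" + totalPayLoad;
    }

    /** Serializes and deserializes a bean in memory, mimicking the map-to-reduce transfer. */
    static DataBeanSketch roundTrip(DataBeanSketch bean) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            bean.write(new DataOutputStream(buffer));
            DataBeanSketch copy = new DataBeanSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new DataBeanSketch("18211575961", 1527, 2106)));
    }
}
```

The important design point is that `readFields` must read the fields in the same order and with the same types that `write` emitted; a mismatch corrupts every value after the first wrong field.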

Package the project and export it to /root/examples.jar.
Upload the source data file to the /data.doc directory.

Run the following command from the /root directory (important: this is where examples.jar was exported):
[Image 2: screenshot of the command]

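Note:
Before running, check whether the output directory already exists. If it does, delete it first and then run the MR job; otherwise the job fails because the output directory already exists.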

Results (columns: phone number, upstream traffic, downstream traffic, total traffic):
13480253104 180 180 360
13502468823 7335 110349 117684
13560436666 1116 954 2070
13560439658 2034 5892 7926
13602846565 1938 2910 4848
13660577991 6960 690 7650
13719199419 240 0 240
13726230503 2481 24681 27162
13726238888 2481 24681 27162
13760778710 120 120 240
13826544101 264 0 264
13922314466 3008 3720 6728
13925057413 11058 48243 59301
13926251106 240 0 240
13926435656 132 1512 1644
15013685858 3659 3538 7197
15920133257 3156 2936 6092
15989002119 1938 180 2118
18211575961 1527 2106 3633
18320173382 9531 2412 11943
84138413 4116 1432 5548
Comparing these figures against the source data confirms that the results are correct.
