利用MapReduce仿QQ音乐实现"今日推荐歌曲"系统

大数据无处不在,推荐系统无处不在。

QQ音乐的今日推荐歌曲;人人网的好友推荐;新浪微博的你可能感觉兴趣的人;优酷,土豆的电影推荐;豆瓣的图书推荐;大从点评的餐饮推荐;世纪佳缘的相亲推荐;天际网的职业推荐等都用到了大数据。
今天利用MapReduce简单写个仿QQ音乐的推荐系统,希望能给在座各位在工作中或面试中一点帮助!转载请注明出处:Michael孟良

今日推荐歌曲

原理:

通过历史对歌曲操作记录,计算得出每首歌相对其他歌曲同时出现在同一用户的次数,每件歌曲都有自己相对全部歌曲的同现列表,用户会对部分歌曲有过点击,收藏等实际操作,经过计算会得到用户对这部分歌曲的评分向量列表。
使用用户评分向量列表中的分值:
依次乘以每首歌同现列表中该分值的代表歌曲的同现值
求和便是该歌曲的推荐向量

具体算法:

推荐系统——协同过滤(Collaborative Filtering)算法

UserCF
基于用户的协同过滤,通过不同用户对歌曲的评分来评测用户之间的相似性,基于用户之间的相似性做出推荐。简单来讲就是:给用户推荐和他兴趣相似的其他用户喜欢的歌曲。


基于用户的协同过滤

基于用户的协同过滤

-ItemCF
基于item的协同过滤,通过用户对不同item的评分来评测item之间的相似性,基于item之间的相似性做出推荐。简单来讲就是:给用户推荐和他之前喜欢的歌曲相似的歌曲。


基于item的协同过滤

基于item的协同过滤

—Co-occurrence Matrix(同现矩阵)和User Preference Vector(用户评分向量)相乘得到的这个Recommended Vector(推荐向量)

—基于全量数据的统计,产生同现矩阵

1.体现歌曲间的关联性
2.每首歌都有自己对其他全部歌曲的关联性(每件歌曲的特征)

用户评分向量体现的是用户对一些歌曲的评分

任一歌曲需要:

1.用户评分向量乘以基于该歌曲的其他歌曲关联值
2.求和得出针对该歌曲的推荐向量
3.排序取TopN即可
公式

同现矩阵乘以用户评分

好了,逻辑知道了,怎么把它搬到代码中,我们一步一步来:

数据:

item_id,user_id,action,vtime
i161,u2625,click,2018/8/18 15:03
i161,u2626,click,2018/8/23 22:40
i161,u2627,click,2018/8/25 19:09
i161,u2628,click,2018/8/28 21:35
i161,u2629,click,2018/8/27 16:33
i161,u2630,click,2018/8/5 18:45
i161,u2631,click,2018/8/29 16:57
i161,u2632,click,2018/8/24 21:58
i161,u2633,click,2018/8/25 22:41
i161,u2634,click,2018/8/16 13:30
i161,u2635,click,2018/8/20 9:23
i161,u2636,click,2018/8/21 1:00
i161,u2637,click,2018/8/24 22:51
...
...
...
歌曲id:item_id
用户id:user_id
对歌曲操作:action
操作时间:vtime

代码:

启动类StartRun

public class StartRun {

    public static void main(String[] args) {
        Configuration config = new Configuration(true);
        
        config.set("mapreduce.framework.name", "local");
        config.set("mapreduce.app-submission.cross-platform", "true");
//      config.set("fs.defaultFS", "hdfs://node1:8020");
//      config.set("yarn.resourcemanager.hostname", "node3");
        
        // 所有mr的输入和输出目录定义在map集合中
        Map paths = new HashMap();
        paths.put("Step1Input", "/user/root/m_log");
        paths.put("Step1Output", "/data/itemcf/output/step1");
        paths.put("Step2Input", paths.get("Step1Output"));
        paths.put("Step2Output", "/data/itemcf/output/step2");
        paths.put("Step3Input", paths.get("Step2Output"));
        paths.put("Step3Output", "/data/itemcf/output/step3");
        paths.put("Step4Input1", paths.get("Step2Output"));
        paths.put("Step4Input2", paths.get("Step3Output"));
        paths.put("Step4Output", "/data/itemcf/output/step4");
        paths.put("Step5Input", paths.get("Step4Output"));
        paths.put("Step5Output", "/data/itemcf/output/step5");
        paths.put("Step6Input", paths.get("Step5Output"));
        paths.put("Step6Output", "/data/itemcf/output/step6");

        Step1.run(config, paths);
//      Step2.run(config, paths);
//      Step3.run(config, paths);
//      Step4.run(config, paths);
//      Step5.run(config, paths);
//      Step6.run(config, paths);
    }

    public static Map R = new HashMap();
    static {
        R.put("click", 1);
        R.put("share", 2);
        R.put("like", 3);
        R.put("download", 4);
    }
}

我们分为6步:
第一步:

  public class Step1 {

    public static boolean run(Configuration config, Map paths) {
        try {
            FileSystem fs = FileSystem.get(config);
            Job job = Job.getInstance(config);
            
            job.setJobName("step1");

//          config.set("mapred.jar", "D:\\MR\\item.jar");

            job.setJarByClass(Step1.class);
            job.setMapperClass(Step1_Mapper.class);
            job.setReducerClass(Step1_Reducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);

            FileInputFormat.addInputPath(job, new Path(paths.get("Step1Input")));
            
            Path outpath = new Path(paths.get("Step1Output"));
            if (fs.exists(outpath)) {
                fs.delete(outpath, true);
            }
            FileOutputFormat.setOutputPath(job, outpath);

            boolean f = job.waitForCompletion(true);
            return f;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    static class Step1_Mapper extends Mapper {

        
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (key.get() != 0) {
                context.write(value, NullWritable.get());
            }
        }
    }

    static class Step1_Reducer extends Reducer {

        @Override
        protected void reduce(Text key, Iterable i, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}

第一步去重,一进一出,没什么好说的。

第二步:
public class Step2 {

    public static boolean run(Configuration config, Map paths) {
        try {
            FileSystem fs = FileSystem.get(config);
            Job job = Job.getInstance(config);
            
            job.setJobName("step2");
            job.setJarByClass(StartRun.class);
            job.setMapperClass(Step2_Mapper.class);
            job.setReducerClass(Step2_Reducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(paths.get("Step2Input")));
            Path outpath = new Path(paths.get("Step2Output"));
            if (fs.exists(outpath)) {
                fs.delete(outpath, true);
            }
            FileOutputFormat.setOutputPath(job, outpath);

            boolean f = job.waitForCompletion(true);
            return f;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    static class Step2_Mapper extends Mapper {

        // 如果使用:用戶+歌曲的id,同时作为输出key,更好
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            //i1,u2723,click,2014/9/14 9:31
            String[] tokens = value.toString().split(",");
            String item = tokens[0];
            String user = tokens[1];
            String action = tokens[2];
            Text k = new Text(user);
            Integer rv = StartRun.R.get(action);
            // if(rv!=null){
            Text v = new Text(item + ":" + rv.intValue());
            //u2750 i160:1
            context.write(k, v);
        }
    }

    static class Step2_Reducer extends Reducer {

        @Override
        protected void reduce(Text key, Iterable i, Context context)
                throws IOException, InterruptedException {
            Map r = new HashMap();

            //迭代同一用户关注的歌曲的id
            for (Text value : i) {
                //u2750 i160:1
                String[] vs = value.toString().split(":");
                String item = vs[0];
                Integer action = Integer.parseInt(vs[1]);
                action = ((Integer) (r.get(item) == null ? 0 : r.get(item))).intValue() + action;
                r.put(item, action);
            }
            StringBuffer sb = new StringBuffer();
            for (Entry entry : r.entrySet()) {
                sb.append(entry.getKey() + ":" + entry.getValue().intValue() + ",");
            }
            
            //u2756 i105:1,i79:1,i341:1,i319:1,i332:1,i160:1,i342:1,i94:1,
            context.write(key, new Text(sb.toString()));
        }
    }
}

第二步,按用户分组,计算所有歌曲的id出现的组合列表,得到用户对歌曲的id的喜爱度得分矩阵。
运行结果:


第二步

第三步:

     public class Step3 {

    private final static Text K = new Text();
    private final static IntWritable V = new IntWritable(1);

    public static boolean run(Configuration config, Map paths) {
        try {
            FileSystem fs = FileSystem.get(config);
            Job job = Job.getInstance(config);
            job.setJobName("step3");
            job.setJarByClass(StartRun.class);
            job.setMapperClass(Step3_Mapper.class);
            job.setReducerClass(Step3_Reducer.class);
            job.setCombinerClass(Step3_Reducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);

            FileInputFormat
                    .addInputPath(job, new Path(paths.get("Step3Input")));
            Path outpath = new Path(paths.get("Step3Output"));
            if (fs.exists(outpath)) {
                fs.delete(outpath, true);
            }
            FileOutputFormat.setOutputPath(job, outpath);

            boolean f = job.waitForCompletion(true);
            return f;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    // 第二个MR执行的结果--作为本次MR的输入  样本: u2837  i541:1,i331:1,i314:1,i125:1,
    static class Step3_Mapper extends
            Mapper {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().split("\t");
            String[] items = tokens[1].split(",");
            //嵌套循环,每一个歌曲与其他歌曲组合输出一次,val的值为1
            //WC的思维逻辑
            for (int i = 0; i < items.length; i++) {
                String itemA = items[i].split(":")[0];
                for (int j = 0; j < items.length; j++) {
                    String itemB = items[j].split(":")[0];
                    K.set(itemA + ":" + itemB);
                    context.write(K, V);
                }
            }

        }
    }


    static class Step3_Reducer extends
            Reducer {

        @Override
        protected void reduce(Text key, Iterable i, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : i) {
                sum = sum + v.get();
            }
            V.set(sum);
            context.write(key, V);
            //  执行结果
//          i100:i181   1
//          i100:i184   2
        }
    }
}

第三步,对歌曲id组合列表进行计数,建立歌曲id的同现矩阵。
运行加过如下:


第三步

第四步:

public class Step4 {

public static boolean run(Configuration config, Map paths) {
    try {
        FileSystem fs = FileSystem.get(config);
        Job job = Job.getInstance(config);
        job.setJobName("step4");
        job.setJarByClass(StartRun.class);
        job.setMapperClass(Step4_Mapper.class);
        job.setReducerClass(Step4_Reducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job,
                new Path[] { new Path(paths.get("Step4Input1")),
                        new Path(paths.get("Step4Input2")) });
        Path outpath = new Path(paths.get("Step4Output"));
        if (fs.exists(outpath)) {
            fs.delete(outpath, true);
        }
        FileOutputFormat.setOutputPath(job, outpath);

        boolean f = job.waitForCompletion(true);
        return f;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return false;
}

static class Step4_Mapper extends Mapper {
    private String flag;// A同现矩阵 or B得分矩阵

    // 每个maptask,初始化时调用一次
    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        flag = split.getPath().getParent().getName();// 判断读的数据集

        System.out.println(flag + "**********************");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = Pattern.compile("[\t,]").split(value.toString());

        if (flag.equals("step3")) {// 同现矩阵
            // 样本:  i100:i181   1
            //       i100:i184  2
            String[] v1 = tokens[0].split(":");
            String itemID1 = v1[0];
            String itemID2 = v1[1];
            String num = tokens[1];

            Text k = new Text(itemID1);// 以前一个歌曲id为key 比如i100
            Text v = new Text("A:" + itemID2 + "," + num);// A:i109,1
            // 样本:  i100    A:i181,1
            context.write(k, v);

        } else if (flag.equals("step2")) {// 用户对歌曲id喜爱得分矩阵
            // 样本:  u24  i64:1,i218:1,i185:1,
            String userID = tokens[0];
            for (int i = 1; i < tokens.length; i++) {
                String[] vector = tokens[i].split(":");
                String itemID = vector[0];// 歌曲idid
                String pref = vector[1];// 喜爱分数

                Text k = new Text(itemID); // 以歌曲id为key 比如:i100
                Text v = new Text("B:" + userID + "," + pref); // B:u401,2
                // 样本:  i64 B:u24,1
                context.write(k, v);
            }
        }
    }
}

static class Step4_Reducer extends Reducer {
    @Override
    protected void reduce(Text key, Iterable values, Context context)
            throws IOException, InterruptedException {
        // A同现矩阵 or B得分矩阵
        // 某一个歌曲id,针对它和其他所有歌曲id的同现次数,都在mapA集合中
        Map mapA = new HashMap();
        //和该歌曲id(key中的itemID)同现的其他歌曲id的同现集合//
        //其他歌曲idID为map的key,同现数字为值
        
        Map mapB = new HashMap();
        //该歌曲id(key中的itemID),所有用户的推荐权重分数

        for (Text line : values) {
            String val = line.toString();
            if (val.startsWith("A:")) {// 表示歌曲id同现数字// 样本:  i100    A:i181,1
                String[] kv = Pattern.compile("[\t,]").split(val.substring(2));
                try {
                    mapA.put(kv[0], Integer.parseInt(kv[1]));//mapA:"i181" -> "1"
                } catch (Exception e) {
                    e.printStackTrace();
                }

            } else if (val.startsWith("B:")) {// 样本:  i64   B:u24,1
                String[] kv = Pattern.compile("[\t,]").split(
                        val.substring(2));
                try {
                    mapB.put(kv[0], Integer.parseInt(kv[1]));//mapB:"u24" -> "1"
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        
        double result = 0;
        //同现矩阵A
        Iterator iter = mapA.keySet().iterator();
        //MR原语特征,这里只有一种歌曲的同现列表
        while (iter.hasNext()) {
            String mapk = iter.next();// itemID

            int num = mapA.get(mapk).intValue();
            Iterator iterb = mapB.keySet().iterator();
            //MR原语特征,这里是所有用户的同一歌曲的评分,迭代之
            while (iterb.hasNext()) {//迭代用户名
                String mapkb = iterb.next();// userID
                int pref = mapB.get(mapkb).intValue();
                //注意这里的计算思维理解:
                    //针对A歌曲
                    //使用用户对A歌曲的分值
                    //逐一乘以与A歌曲有同现的歌曲的次数
                    //但是计算推荐向量的时候需要的是A歌曲同现的歌曲,用同现次数乘以各自的分值
                result = num * pref;// 矩阵乘法相乘计算

                                    //Text k = new Text(mapkb);
                                    //Text v = new Text(mapk + "," + result);
                                    //结果样本:  u2723      i9,8.0
                                    //context.write(k, v);
                
                Text k = new Text(mapkb+","+mapk);
                Text v = new Text( key.toString() + "," + result);
                //key:101
                //  结果样本:   u3,101   101,4.0   *
                //  结果样本:   u3,102   101,4.0   
                //  结果样本:   u3,103   101,4.0
                //key:102
                //  结果样本:   u3,101   102,4.0   *
                //  结果样本:   u3,102   102,4.0   
                //  结果样本:   u3,103   102,4.0
                
                context.write(k, v);
            }
        }
    }
}

}
第四步比较飘

  • 把同现矩阵和得分矩阵相乘
  • 利用MR原语特征,按歌曲分组
  • 这样相同歌曲的同现列表和所有用户对该歌曲的评分进到一个reduce中

这一步也可以理解为,之前逻辑我们是打横计算的,现在我们先打竖把所有参数计算好,把整个矩阵填好,第五步再打横的一条数据一条数据技术出来:


先把整个矩阵的数计算出来

运算结果:


第四步结果

第五步:

    public class Step5 {
    private final static Text K = new Text();
    private final static Text V = new Text();

    public static boolean run(Configuration config, Map paths) {
        try {
            FileSystem fs = FileSystem.get(config);
            Job job = Job.getInstance(config);
            job.setJobName("step5");
            job.setJarByClass(StartRun.class);
            job.setMapperClass(Step5_Mapper.class);
            job.setReducerClass(Step5_Reducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);

            FileInputFormat
                    .addInputPath(job, new Path(paths.get("Step5Input")));
            Path outpath = new Path(paths.get("Step5Output"));
            if (fs.exists(outpath)) {
                fs.delete(outpath, true);
            }
            FileOutputFormat.setOutputPath(job, outpath);

            boolean f = job.waitForCompletion(true);
            return f;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    static class Step5_Mapper extends Mapper {

        /**
         * 原封不动输出
         */
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            //  样本:  u2723:101  i9,8.0
            String[] tokens = Pattern.compile("[\t]").split(value.toString());
            Text k = new Text(tokens[0]);// 用户为key
            Text v = new Text(tokens[1] );
            context.write(k, v);
        }
    }

    static class Step5_Reducer extends Reducer {
        @Override
        protected void reduce(Text key, Iterable values, Context context)
                throws IOException, InterruptedException {
            Map map = new HashMap();// 结果
            Double score = 0.0;
            for (Text line : values) {// i9,4.0
                String[] tokens = line.toString().split(",");
//              String itemID = tokens[0];
                score += Double.parseDouble(tokens[1]);

            }

            String[] tmp = StringUtils.split(key.toString(),',');
            key.set(tmp[0]);
            Text v = new Text(tmp[1]+","+String.valueOf(score));
            context.write(key, v);

        }
        // 样本:  u13 i9,5.0
    }
}
  • 把相乘之后的矩阵相加获得结果矩阵
  • 还是按用户分组,将该用户所有歌曲的推荐向量求和
    运行结果:


    第五步结果

第六步:

    public class Step6 {
    private final static Text K = new Text();
    private final static Text V = new Text();

    public static boolean run(Configuration config, Map paths) {
        try {
            FileSystem fs = FileSystem.get(config);
            Job job = Job.getInstance(config);
            job.setJobName("step6");
            job.setJarByClass(StartRun.class);
            job.setMapperClass(Step6_Mapper.class);
            job.setReducerClass(Step6_Reducer.class);
            job.setSortComparatorClass(NumSort.class);
            job.setGroupingComparatorClass(UserGroup.class);
            job.setMapOutputKeyClass(PairWritable.class);
            job.setMapOutputValueClass(Text.class);

            FileInputFormat
                    .addInputPath(job, new Path(paths.get("Step6Input")));
            Path outpath = new Path(paths.get("Step6Output"));
            if (fs.exists(outpath)) {
                fs.delete(outpath, true);
            }
            FileOutputFormat.setOutputPath(job, outpath);

            boolean f = job.waitForCompletion(true);
            return f;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    static class Step6_Mapper extends
            Mapper {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = Pattern.compile("[\t,]").split(value.toString());
            String u = tokens[0];
            String item = tokens[1];
            String num = tokens[2];
            PairWritable k = new PairWritable();
            k.setUid(u);
            k.setNum(Double.parseDouble(num));
            V.set(item + ":" + num);
            context.write(k, V);
        }
    }

    static class Step6_Reducer extends Reducer {
        @Override
        protected void reduce(PairWritable key, Iterable values,
                Context context) throws IOException, InterruptedException {
            int i = 0;
            StringBuffer sb = new StringBuffer();
            for (Text v : values) {
                if (i == 10)
                    break;
                sb.append(v.toString() + ",");
                i++;
            }
            K.set(key.getUid());
            V.set(sb.toString());
            context.write(K, V);
        }

    }

    static class PairWritable implements WritableComparable {

        private String uid;
        private double num;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(uid);
            out.writeDouble(num);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            this.uid = in.readUTF();
            this.num = in.readDouble();
        }

        @Override
        public int compareTo(PairWritable o) {
            int r = this.uid.compareTo(o.getUid());
            if (r == 0) {
                return Double.compare(this.num, o.getNum());
            }
            return r;
        }

        public String getUid() {
            return uid;
        }

        public void setUid(String uid) {
            this.uid = uid;
        }

        public double getNum() {
            return num;
        }

        public void setNum(double num) {
            this.num = num;
        }

    }

    static class NumSort extends WritableComparator {
        public NumSort() {
            super(PairWritable.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            PairWritable o1 = (PairWritable) a;
            PairWritable o2 = (PairWritable) b;

            int r = o1.getUid().compareTo(o2.getUid());
            if (r == 0) {
                return -Double.compare(o1.getNum(), o2.getNum());
            }
            return r;
        }
    }

    static class UserGroup extends WritableComparator {
        public UserGroup() {
            super(PairWritable.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            PairWritable o1 = (PairWritable) a;
            PairWritable o2 = (PairWritable) b;
            return o1.getUid().compareTo(o2.getUid());
        }
    }
}

第六步,按照推荐得分降序排序,每个用户列出10个推荐歌曲。
运行结果:


最终结果

最终结果,key作为用户id,value为推荐分数由高到低的歌曲ID。

后语:

1.第四步可能有点绕,看代码可能看不懂,如果是这样子的话我建议你拿支笔,在纸上写写那张表的计算过程,找找其中规律,就发现,妙啊!
2.用户的action操作评分是我暂时乱写的,具体要看你公司业务需要,设计具体操作的分值。
2.生产中肯定不会这么简单,例如要考虑一开始用户没数据时个推荐什么歌曲;推荐歌曲里不应该有已收藏的歌曲,但收藏这个操作的分值不能低,所以要去到下一步的过滤。。。具体怎么优化欢迎大家留言~


代码下载地址:https://github.com/MichaelYipInGitHub/RecommendDemo


Thanks for watching!

你可能感兴趣的:(利用MapReduce仿QQ音乐实现"今日推荐歌曲"系统)