About this blog:
The content here is shared for learning; questions and comments are welcome.
About the author:
Developer: 杨洪 (ellende)
blog: http://blog.csdn.net/ellende
email: [email protected]
Please credit the source when reposting. Parts of this post draw on other blogs online; if anything infringes your rights, please contact me.
How the User-Based Collaborative Filtering (UserCF) Recommendation Algorithm Works
User-based collaborative filtering measures how similar users are to one another from the ratings they give to items, and then makes recommendations based on that user-to-user similarity. Put simply: recommend to a user the items that other users with similar tastes have liked.
1. Raw input data
Each line is userID,itemID,rating:
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
2. Building the user-item matrix
101 102 103 104 105 106 107
[1,] 5.0 3.0 2.5 0.0 0.0 0 0
[2,] 2.0 2.5 5.0 2.0 0.0 0 0
[3,] 2.5 0.0 0.0 4.0 4.5 0 5
[4,] 5.0 0.0 3.0 4.5 0.0 4 0
[5,] 4.0 3.0 2.0 4.0 3.5 4 0
3. Euclidean similarity matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 0.0000000 0.6076560 0.2857143 1.0000000 1.0000000
[2,] 0.6076560 0.0000000 0.6532633 0.5568464 0.7761999
[3,] 0.2857143 0.6532633 0.0000000 0.5634581 1.0000000
[4,] 1.0000000 0.5568464 0.5634581 0.0000000 1.0000000
[5,] 1.0000000 0.7761999 1.0000000 1.0000000 0.0000000
Calculation:
similarity = n / (1 + sqrt(sum((Xi - Yi)^2)))
That is: take the element-wise difference of the two users' rating vectors, square and sum the differences, take the square root, and add 1; n is the number of valid differences, i.e. the number of items both users have rated.
For example, the similarity between user 1 and user 2:
(5.0-2.0)^2 + (3.0-2.5)^2 + (2.5-5.0)^2 = 15.5 // only 3 element pairs enter the sum, because a difference is computed only where both users have a non-zero rating
3/(1+sqrt(15.5)) = 0.607656
And the similarity between user 1 and user 4:
(5.0-5.0)^2 + (2.5-3.0)^2 = 0.25
2/(1+sqrt(0.25)) = 1.333333; since this is greater than 1, the table above caps it at 1.000000 (the program drops this cap, which is why the step 2 output below shows 1.3333333)
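For reference alongside the worked numbers above, here is a minimal, self-contained Java sketch of this similarity computation. It is separate from the MapReduce code below; the class name, the fixed item order 101-107, and the convention that 0.0 means "not rated" are assumptions made just for this illustration.
public class SimilarityDemo {
    // full rating vectors over items 101..107; 0.0 means the user did not rate the item
    static double similarity(double[] a, double[] b) {
        double sum = 0.0;
        int n = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > 0 && b[i] > 0) {            // only items rated by both users count
                sum += Math.pow(a[i] - b[i], 2);
                n++;
            }
        }
        return n / (1 + Math.sqrt(sum));           // may exceed 1, matching the uncapped program output
    }
    public static void main(String[] args) {
        double[] user1 = {5.0, 3.0, 2.5, 0.0, 0.0, 0.0, 0.0};
        double[] user2 = {2.0, 2.5, 5.0, 2.0, 0.0, 0.0, 0.0};
        System.out.println(similarity(user1, user2)); // prints 0.6076560..., as in the matrix above
    }
}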
4. Nearest-neighbor matrix
From the Euclidean similarity matrix, take each user's top 2 most similar users:
top1 top2
[1,] 4 5
[2,] 5 3
[3,] 5 2
[4,] 1 5
[5,] 1 3
For example, user 1's similarities in descending order: 4[1.0], 5[1.0], 2[0.607], 3[0.285], 1[0.0]
5. Recommendation matrix, using user 1 as an example
User 1's two most similar users are users 4 and 5; their rows of the rating matrix are:
101 102 103 104 105 106 107
1 5.0 3.0 2.5 0.0 0.0 0.0 0.0
4 5.0 0.0 3.0 4.5 0.0 4 0
5 4.0 3.0 2.0 4.0 3.5 4 0
Remove the items user 1 has already bought (101, 102, 103); the remaining items user 1 has not bought are the candidates for recommendation. The recommendation matrix:
101 102 103 104 105 106 107
4 0 0 0 4.5 0.0 4 0
5 0 0 0 4.0 3.5 4 0
6. Recommendation result, using user 1 as an example
Scores for the items user 1 has not bought:
104[(4.5+4)/2=4.25], 106[(4+4)/2=4], 105[(0+3.5)/2=1.75], 107[(0+0)/2=0]
Finally, the top 2 items are recommended:
Recommended item  Item score
[1] "104" "4.25"
[2] "106" "4"
7. Code implementation
The implementation is a parallel MapReduce algorithm on Hadoop. There are few parallel implementations of UserCF online, so this one was written as an exercise. It is split into 5 steps:
Step 1: reorganize the raw input so the Euclidean similarity matrix can be computed from it.
Step 2: from the step 1 output, compute the Euclidean similarity matrix.
Step 3: from the step 2 output, find each user's top 2 most similar users.
Step 4: from the step 3 output plus the raw data, score the items each user has not bought using that user's 2 nearest neighbors; the output is the recommendation matrix, i.e. each user's candidate items with the average of the neighbors' ratings.
Step 5: from the step 4 output, pick each user's top 3 recommended items by that average score.
Main source files:
1) HdfsDAO.java: a utility for HDFS operations that wraps the common HDFS commands with the Hadoop API; see the article "Hadoop编程调用HDFS" (calling HDFS from Hadoop code).
2) UserCFHadoop.java: the main entry point; it configures the directories and runs the steps in order.
3) UserCF_Step1.java: implementation of step 1
4) UserCF_Step2.java: implementation of step 2
5) UserCF_Step3.java: implementation of step 3
6) UserCF_Step4.java: implementation of step 4
7) UserCF_Step5.java: implementation of step 5
Runtime environment:
1) CentOS 6.5
2) Hadoop 2.7.2
3) Java SDK 1.7.0_79
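Assuming the classes are packaged into a jar (usercf.jar below is just a placeholder name) and HDFS/YARN are running, the whole pipeline can be launched with the standard hadoop jar command, e.g.:
hadoop jar usercf.jar recommend.code1.recommend.UserCFHadoop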
The main code is listed below:
1)HdfsDAO.java
package recommend.code1.hdfs;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapred.JobConf;
public class HdfsDAO {
private static final String HDFS = "hdfs://localhost:9000/";
public HdfsDAO(Configuration conf) {
this(HDFS, conf);
}
public HdfsDAO(String hdfs, Configuration conf) {
this.hdfsPath = hdfs;
this.conf = conf;
}
private String hdfsPath;
private Configuration conf;
public static void main(String[] args) throws IOException {
JobConf conf = config();
HdfsDAO hdfs = new HdfsDAO(conf);
hdfs.mkdirs("/tmp/new");
hdfs.copyFile("/home/yj/HadoopFile/userFile/small.csv", "/tmp/new");
hdfs.ls("/tmp/new");
}
public static JobConf config(){
JobConf conf = new JobConf(HdfsDAO.class);
conf.setJobName("HdfsDAO");
conf.addResource("classpath:/hadoop/core-site.xml");
conf.addResource("classpath:/hadoop/hdfs-site.xml");
conf.addResource("classpath:/hadoop/mapred-site.xml");
return conf;
}
public void mkdirs(String folder) throws IOException {
Path path = new Path(folder);
FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
if (!fs.exists(path)) {
fs.mkdirs(path);
System.out.println("Create: " + folder);
}
fs.close();
}
public void rmr(String folder) throws IOException {
Path path = new Path(folder);
FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
fs.deleteOnExit(path);
System.out.println("Delete: " + folder);
fs.close();
}
public void ls(String folder) throws IOException {
Path path = new Path(folder);
FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
FileStatus[] list = fs.listStatus(path);
System.out.println("ls: " + folder);
System.out.println("==========================================================");
for (FileStatus f : list) {
System.out.printf("name: %s, folder: %s, size: %d\n", f.getPath(), f.isDir(), f.getLen());
}
System.out.println("==========================================================");
fs.close();
}
public void createFile(String file, String content) throws IOException {
FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
byte[] buff = content.getBytes();
FSDataOutputStream os = null;
try {
os = fs.create(new Path(file));
os.write(buff, 0, buff.length);
System.out.println("Create: " + file);
} finally {
if (os != null)
os.close();
}
fs.close();
}
public void copyFile(String local, String remote) throws IOException {
FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
fs.copyFromLocalFile(new Path(local), new Path(remote));
System.out.println("copy from: " + local + " to " + remote);
fs.close();
}
public void download(String remote, String local) throws IOException {
Path path = new Path(remote);
FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
fs.copyToLocalFile(path, new Path(local));
System.out.println("download: from" + remote + " to " + local);
fs.close();
}
public void cat(String remoteFile) throws IOException {
Path path = new Path(remoteFile);
FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
FSDataInputStream fsdis = null;
System.out.println("cat: " + remoteFile);
try {
fsdis =fs.open(path);
IOUtils.copyBytes(fsdis, System.out, 4096, false);
} finally {
IOUtils.closeStream(fsdis);
fs.close();
}
}
public void location() throws IOException {
// String folder = hdfsPath + "create/";
// String file = "t2.txt";
// FileSystem fs = FileSystem.get(URI.create(hdfsPath), new
// Configuration());
// FileStatus f = fs.getFileStatus(new Path(folder + file));
// BlockLocation[] list = fs.getFileBlockLocations(f, 0, f.getLen());
//
// System.out.println("File Location: " + folder + file);
// for (BlockLocation bl : list) {
// String[] hosts = bl.getHosts();
// for (String host : hosts) {
// System.out.println("host:" + host);
// }
// }
// fs.close();
}
}
2)UserCFHadoop.java
package recommend.code1.recommend;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
public class UserCFHadoop {
public static final String HDFS = "hdfs://localhost:9000";
public static final Pattern DELIMITER = Pattern.compile("[\t,]");
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
Map<String, String> path = new HashMap<String, String>();
path.put("data", "/home/yj/HadoopFile/userFile/item.csv");// local data file
path.put("input_file", HDFS + "/user/yj/input/userCF/");// HDFS working directory
path.put("input_step1", path.get("input_file") + "/data");
path.put("output_step1", path.get("input_file") + "/step1");
path.put("input_step2", path.get("output_step1"));
path.put("output_step2", path.get("input_file") + "/step2");
path.put("input_step3", path.get("output_step2"));
path.put("output_step3", path.get("input_file") + "/step3");
path.put("input1_step4", path.get("output_step3"));
path.put("input2_step4", path.get("input_step1"));
path.put("output_step4", path.get("input_file") + "/step4");
path.put("input_step5", path.get("output_step4"));
path.put("output_step5", path.get("input_file") + "/step5");
try
{
UserCF_Step1.run(path);
UserCF_Step2.run(path);
UserCF_Step3.run(path);
UserCF_Step4.run(path);
UserCF_Step5.run(path);
}
catch (Exception e)
{
e.printStackTrace();
}
System.exit(0);
}
public static Configuration config() {// remote configuration of the Hadoop cluster
Configuration conf = new Configuration();
return conf;
}
}
3)UserCF_Step1.java
package recommend.code1.recommend;
//import hadoop.myMapreduce.martrix.MainRun;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import recommend.code1.hdfs.HdfsDAO;
public class UserCF_Step1 {
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
String[] tokens = UserCFHadoop.DELIMITER.split(values.toString());
if (tokens.length >= 3)
{
Text k = new Text(tokens[1]);//itemid
Text v = new Text(tokens[0] + "," + tokens[2]);//userid + score
context.write(k, v);
}
}
}
public static class MyReducer extends Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Map<String, String> map = new HashMap<String, String>();//userid -> rating for this item
for (Text line : values) {
String val = line.toString();
String[] vlist = UserCFHadoop.DELIMITER.split(val);
if (vlist.length >= 2)
{
map.put(vlist[0], vlist[1]);
}
}
Iterator<String> iterA = map.keySet().iterator();
while (iterA.hasNext())
{
String k1 = iterA.next();
String v1 = map.get(k1);
Iterator<String> iterB = map.keySet().iterator();
while (iterB.hasNext())
{
String k2 = iterB.next();
String v2 = map.get(k2);
context.write(new Text(k1 + "," + k2), new Text(v1 + "," + v2));
}
}
}
}
public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = UserCFHadoop.config();
String input = path.get("input_step1");
String output = path.get("output_step1");
HdfsDAO hdfs = new HdfsDAO(UserCFHadoop.HDFS, conf);
hdfs.rmr(path.get("input_file"));
hdfs.rmr(input);
hdfs.mkdirs(input);
hdfs.copyFile(path.get("data"), input);
Job job = Job.getInstance(conf, "UserCF_Step1 job");
job.setJarByClass(UserCF_Step1.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(input));// single input dataset
FileOutputFormat.setOutputPath(job, new Path(output));
System.out.println("input : " + input);
System.out.println("output: " + output);
if (!job.waitForCompletion(true))
{
System.out.println("main run stop!");
return;
}
System.out.println("main run successfully!");
}
}
4)UserCF_Step2.java
package recommend.code1.recommend;
//import hadoop.myMapreduce.martrix.MainRun;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.lang.Math;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import recommend.code1.hdfs.HdfsDAO;
public class UserCF_Step2 {
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
String[] tokens = UserCFHadoop.DELIMITER.split(values.toString());
if (tokens.length >= 4)
{
Text k = new Text(tokens[0] + "," + tokens[1]);
Text v = new Text(tokens[2] + "," + tokens[3]);
context.write(k, v);
}
}
}
public static class MyReducer extends Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
double sum = 0.0;
double similarity = 0.0;
int num = 0;
for (Text line : values) {
String val = line.toString();
String[] vlist = UserCFHadoop.DELIMITER.split(val);
if (vlist.length >= 2)
{
sum += Math.pow((Double.parseDouble(vlist[0]) - Double.parseDouble(vlist[1])), 2);
num += 1;
}
}
if (sum > 0.00000001)
{
similarity = (double)num / (1 + Math.sqrt(sum));
}
// if (similarity > 1.0)
// {
// similarity = 1.0;
// }
context.write(key, new Text(String.format("%.7f", similarity)));
}
}
public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = UserCFHadoop.config();
String input = path.get("input_step2");
String output = path.get("output_step2");
Job job = Job.getInstance(conf, "UserCF_Step2 job");
job.setJarByClass(UserCF_Step2.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(input));// single input dataset
FileOutputFormat.setOutputPath(job, new Path(output));
System.out.println("input : " + input);
System.out.println("output: " + output);
if (!job.waitForCompletion(true))
{
System.out.println("main run stop!");
return;
}
System.out.println("main run successfully!");
}
}
5)UserCF_Step3.java
package recommend.code1.recommend;
//import hadoop.myMapreduce.martrix.MainRun;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.lang.Math;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import recommend.code1.hdfs.HdfsDAO;
public class UserCF_Step3 {
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
String[] tokens = UserCFHadoop.DELIMITER.split(values.toString());
if (tokens.length >= 3)
{
Text k = new Text(tokens[0]);
Text v = new Text(tokens[1] + "," + tokens[2]);
context.write(k, v);
}
}
}
public static class MyReducer extends Reducer<Text, Text, Text, Text> {
private final int NEIGHBORHOOD_NUM = 2;
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Map<Double, String> map = new HashMap<Double, String>();//similarity -> userid (users with identical similarity values would overwrite each other)
for (Text line : values) {
String val = line.toString();
String[] vlist = UserCFHadoop.DELIMITER.split(val);
if (vlist.length >= 2)
{
map.put(Double.parseDouble(vlist[1]), vlist[0]);
}
}
List<Double> list = new ArrayList<Double>();
Iterator<Double> iter = map.keySet().iterator();
while (iter.hasNext()) {
Double similarity = iter.next();
list.add(similarity);
}
//sort the similarities with a comparator
Collections.sort(list, new Comparator<Double>() {
//descending order
public int compare(Double o1, Double o2) {
return o2.compareTo(o1);
}
});
// for (int i = 0; i < NEIGHBORHOOD_NUM && i < list.size(); i++)
// {
// context.write(key, new Text(map.get(list.get(i)) + "," + String.format("%.7f", list.get(i))));
// }
String v = "";
for (int i = 0; i < NEIGHBORHOOD_NUM && i < list.size(); i++)
{
v += "," + map.get(list.get(i)) + "," + String.format("%.7f", list.get(i));
}
context.write(key, new Text(v.substring(1)));
}
}
public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = UserCFHadoop.config();
String input = path.get("input_step3");
String output = path.get("output_step3");
Job job = Job.getInstance(conf, "UserCF_Step3 job");
job.setJarByClass(UserCF_Step3.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(input));// single input dataset
FileOutputFormat.setOutputPath(job, new Path(output));
System.out.println("input : " + input);
System.out.println("output: " + output);
if (!job.waitForCompletion(true))
{
System.out.println("main run stop!");
return;
}
System.out.println("main run successfully!");
}
}
6)UserCF_Step4.java
package recommend.code1.recommend;
//import hadoop.myMapreduce.martrix.MainRun;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.lang.Math;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import recommend.code1.hdfs.HdfsDAO;
public class UserCF_Step4 {
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
private String flag;// A:step3 or B:data
private int itemNum = 7;// number of items; item IDs 101-107 are hard-coded for this demo dataset
@Override
protected void setup(Context context) throws IOException, InterruptedException {
FileSplit split = (FileSplit) context.getInputSplit();
flag = split.getPath().getParent().getName();// identify which input dataset this split comes from
System.out.println(flag);
}
@Override
public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
String[] tokens = UserCFHadoop.DELIMITER.split(values.toString());
int itemIndex = 100;
if (flag.equals("step3")) {
for (int i = 1; i <= itemNum; i++)
{
Text k = new Text(Integer.toString(itemIndex + i));//itemid
Text v = new Text("A:" + tokens[0] + "," + tokens[1] + "," + tokens[3]);
context.write(k, v);
// System.out.println(k.toString() + " " + v.toString());
}
} else if (flag.equals("data")) {
Text k = new Text(tokens[1]);//itemid
Text v = new Text("B:" + tokens[0] + "," + tokens[2]);//userid + score
context.write(k, v);
// System.out.println(k.toString() + " " + v.toString());
}
}
}
public static class MyReducer extends Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Map<String, String> mapA = new HashMap<String, String>();//userid -> its 2 nearest neighbors (from step3)
Map<String, String> mapB = new HashMap<String, String>();//userid -> rating of this item (from the raw data)
for (Text line : values) {
String val = line.toString();
if (val.startsWith("A:")) {
String[] kv = UserCFHadoop.DELIMITER.split(val.substring(2));
mapA.put(kv[0], kv[1] + "," + kv[2]);
} else if (val.startsWith("B:")) {
String[] kv = UserCFHadoop.DELIMITER.split(val.substring(2));
mapB.put(kv[0], kv[1]);
}
}
Iterator<String> iterA = mapA.keySet().iterator();
while (iterA.hasNext())
{
String userId = iterA.next();
if (!mapB.containsKey(userId))//only score items the user has not bought; items already bought are not recommended
{
String simiStr = mapA.get(userId);
String[] simi = UserCFHadoop.DELIMITER.split(simiStr);
if (simi.length >= 2)
{
double simiVal1 = mapB.containsKey(simi[0]) ? Double.parseDouble(mapB.get(simi[0])) : 0;
double simiVal2 = mapB.containsKey(simi[1]) ? Double.parseDouble(mapB.get(simi[1])) : 0;
double score = (simiVal1 + simiVal2) / 2;
context.write(new Text(userId), new Text(key.toString() + "," + String.format("%.2f", score)));
}
}
}
}
}
public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = UserCFHadoop.config();
String input1 = path.get("input1_step4");
String input2 = path.get("input2_step4");
String output = path.get("output_step4");
Job job = Job.getInstance(conf, "UserCF_Step4 job");
job.setJarByClass(UserCF_Step4.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(input1), new Path(input2));// load the 2 input datasets
FileOutputFormat.setOutputPath(job, new Path(output));
System.out.println("input1: " + input1);
System.out.println("input2: " + input2);
System.out.println("output: " + output);
if (!job.waitForCompletion(true))
{
System.out.println("main run stop!");
return;
}
System.out.println("main run successfully!");
}
}
7)UserCF_Step5.java
package recommend.code1.recommend;
//import hadoop.myMapreduce.martrix.MainRun;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.lang.Math;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import recommend.code1.hdfs.HdfsDAO;
public class UserCF_Step5 {
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
String[] tokens = UserCFHadoop.DELIMITER.split(values.toString());
if (tokens.length >= 3)
{
Text k = new Text(tokens[0]);
Text v = new Text(tokens[1] + "," + tokens[2]);
context.write(k, v);
}
}
}
public static class MyReducer extends Reducer<Text, Text, Text, Text> {
private final int RECOMMENDER_NUM = 3;
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Map<Double, String> map = new HashMap<Double, String>();//score -> itemid (items with identical scores would overwrite each other)
for (Text line : values) {
String val = line.toString();
String[] vlist = UserCFHadoop.DELIMITER.split(val);
if (vlist.length >= 2)
{
map.put(Double.parseDouble(vlist[1]), vlist[0]);
}
}
List<Double> list = new ArrayList<Double>();
Iterator<Double> iter = map.keySet().iterator();
while (iter.hasNext()) {
Double similarity = iter.next();
list.add(similarity);
}
//sort the scores with a comparator
Collections.sort(list, new Comparator<Double>() {
//descending order
public int compare(Double o1, Double o2) {
return o2.compareTo(o1);
}
});
String v = "";
for (int i = 0; i < RECOMMENDER_NUM && i < list.size(); i++)
{
if (list.get(i).compareTo(new Double(0.001)) > 0)
{
v += "," + map.get(list.get(i)) + "[" + String.format("%.2f", list.get(i)) + "]";
}
}
if (!v.isEmpty())
{
context.write(key, new Text(v.substring(1)));
}
else
{
context.write(key, new Text("none"));
}
}
}
public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = UserCFHadoop.config();
String input = path.get("input_step5");
String output = path.get("output_step5");
Job job = Job.getInstance(conf, "UserCF_Step5 job");
job.setJarByClass(UserCF_Step5.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(input));// single input dataset
FileOutputFormat.setOutputPath(job, new Path(output));
System.out.println("input : " + input);
System.out.println("output: " + output);
if (!job.waitForCompletion(true))
{
System.out.println("main run stop!");
return;
}
System.out.println("main run successfully!");
}
}
1) Original input data item.csv
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
2) Step 1 output
3,3 2.5,2.5
3,2 2.5,2.0
3,1 2.5,5.0
3,5 2.5,4.0
3,4 2.5,5.0
2,3 2.0,2.5
2,2 2.0,2.0
2,1 2.0,5.0
2,5 2.0,4.0
2,4 2.0,5.0
1,3 5.0,2.5
1,2 5.0,2.0
1,1 5.0,5.0
1,5 5.0,4.0
1,4 5.0,5.0
5,3 4.0,2.5
5,2 4.0,2.0
5,1 4.0,5.0
5,5 4.0,4.0
5,4 4.0,5.0
4,3 5.0,2.5
4,2 5.0,2.0
4,1 5.0,5.0
4,5 5.0,4.0
4,4 5.0,5.0
2,2 2.5,2.5
2,1 2.5,3.0
2,5 2.5,3.0
1,2 3.0,2.5
1,1 3.0,3.0
1,5 3.0,3.0
5,2 3.0,2.5
5,1 3.0,3.0
5,5 3.0,3.0
2,2 5.0,5.0
2,1 5.0,2.5
2,5 5.0,2.0
2,4 5.0,3.0
1,2 2.5,5.0
1,1 2.5,2.5
1,5 2.5,2.0
1,4 2.5,3.0
5,2 2.0,5.0
5,1 2.0,2.5
5,5 2.0,2.0
5,4 2.0,3.0
4,2 3.0,5.0
4,1 3.0,2.5
4,5 3.0,2.0
4,4 3.0,3.0
3,3 4.0,4.0
3,2 4.0,2.0
3,5 4.0,4.0
3,4 4.0,4.5
2,3 2.0,4.0
2,2 2.0,2.0
2,5 2.0,4.0
2,4 2.0,4.5
5,3 4.0,4.0
5,2 4.0,2.0
5,5 4.0,4.0
5,4 4.0,4.5
4,3 4.5,4.0
4,2 4.5,2.0
4,5 4.5,4.0
4,4 4.5,4.5
3,3 4.5,4.5
3,5 4.5,3.5
5,3 3.5,4.5
5,5 3.5,3.5
5,5 4.0,4.0
5,4 4.0,4.0
4,5 4.0,4.0
4,4 4.0,4.0
3,3 5.0,5.0
3) Step 2 output
1,1 0.0000000
1,2 0.6076560
1,3 0.2857143
1,4 1.3333333
1,5 1.4164079
2,1 0.6076560
2,2 0.0000000
2,3 0.6532633
2,4 0.5568464
2,5 0.7761999
3,1 0.2857143
3,2 0.6532633
3,3 0.0000000
3,4 0.5634581
3,5 1.0703675
4,1 1.3333333
4,2 0.5568464
4,3 0.5634581
4,4 0.0000000
4,5 1.6000000
5,1 1.4164079
5,2 0.7761999
5,3 1.0703675
5,4 1.6000000
5,5 0.0000000
4) Step 3 output
1 5,1.4164079,4,1.3333333
2 5,0.7761999,3,0.6532633
3 5,1.0703675,2,0.6532633
4 5,1.6000000,1,1.3333333
5 4,1.6000000,1,1.4164079
5) Step 4 output
3 102,2.75
4 102,3.00
3 103,3.50
1 104,4.25
2 105,4.00
1 105,1.75
4 105,1.75
3 106,2.00
2 106,2.00
1 106,4.00
2 107,2.50
1 107,0.00
5 107,0.00
4 107,0.00
6) Step 5 output
1 104[4.25],106[4.00],105[1.75]
2 105[4.00],107[2.50],106[2.00]
3 103[3.50],102[2.75],106[2.00]
4 102[3.00],105[1.75]
5 none