Recommender Systems in Practice: The Mahout Recommendation Framework (Java)

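1. Introduction to Mahout

1.1 Overview

According to the description on Baidu, Mahout is an open-source project of the Apache Software Foundation (ASF). It provides scalable implementations of classic machine learning algorithms and aims to help developers build intelligent applications more quickly and easily. Mahout includes implementations for clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. By building on the Apache Hadoop libraries, Mahout can also scale out effectively in the cloud.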

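1.2 History

Mahout has grown up alongside Hadoop. At first it made it possible to run many machine learning algorithms on Hadoop, but its MapReduce-based implementations came to be seen as too slow, and it was set aside for a while. About a year later, around 2014 (the 0.9 release), the project announced that it would stop accepting new algorithms written for MapReduce and would move toward Spark; by the end of 2014 that policy was in effect. Hadoop itself went through the transition from 1.0 to 2.0, and Mahout changed along with it. From 2015 onward, with the 0.10 through 0.12 releases, Mahout was developed as a data mining and machine learning library on top of platforms such as Spark, Flink, and H2O. Changing platforms did not mean dropping MapReduce entirely: the existing MapReduce algorithms remained, but no new ones were accepted.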

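1.3 Features

Scalability: Mahout itself is only a machine learning library, not a platform. H2O, by contrast, is a complete platform for machine learning and predictive analytics. As a library, Mahout relies on HDFS for storage and on YARN in the Hadoop platform for scheduling, so its storage and compute capacity scale with the underlying cluster.

Fault tolerance: Mahout runs on compute engines such as MapReduce, Spark, and Flink, and inherits their fault tolerance, including mechanisms such as speculative execution and recovery from failed tasks.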

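1.4 Components

Mahout is an important part of the Hadoop ecosystem. If Hadoop is the elephant, Mahout is the elephant driver (the literal meaning of "mahout") who steers it and decides what work it does. It is one of Hadoop's major components and has grown up together with it.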

1.5 Commonly used data mining algorithms implemented in Mahout

Clustering algorithms

Classification algorithms

Other algorithms

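1.6 Mahout back-end compute engines

MapReduce and Spark: Mahout originally supported MapReduce and later moved to Spark while keeping the existing MapReduce code. Spark's main strengths are in-memory computation, DAG-based scheduling, simple and easy-to-use operators, and support for several programming languages.

H2O: a platform for machine learning and predictive analytics with its own algorithm library. H2O can also be integrated with Hadoop.

Flink: supports both stream processing and batch processing.

Different algorithms support different compute engines.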

1.7 Mahout architecture

Mahout architecture: low-level

Algorithm architecture provided by Mahout

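2. Mahout recommendation algorithms

2.1 Introduction to Mahout recommenders

Collaborative filtering framework, part 1

Collaborative filtering uses historical data (ratings, clicks, purchases, and so on) as the basis for recommendations.

User-based: recommends items by finding similar users. Because user behavior changes frequently, this approach is hard to scale.

Item-based: recommends items by computing the similarity between items. Items change slowly, so the similarity matrix can be computed offline. (Originated at Amazon.)

MF-based: factorizes the original user-item matrix into smaller matrices, identifies latent factors, and uses them to explain user behavior. (Originated with the Netflix Prize.)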

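Collaborative filtering framework, part 2

Collaborative filtering through SVD (Singular Value Decomposition) matrix factorization

Collaborative filtering based on ALS (alternating least squares)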
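To make the "alternating" part of ALS concrete, here is a minimal sketch in plain Java. It assumes a single latent factor (rank 1), a small dense rating matrix in which 0 means "not rated", and a hypothetical class name `AlsRankOneSketch`; it illustrates the idea only and is not Mahout's ALS implementation. With the item factors fixed, each user factor has a closed-form least-squares solution, and vice versa, so the regularized error never increases from one sweep to the next.

```java
/** Rank-1 ALS sketch on a small dense rating matrix; 0 means "not rated". */
class AlsRankOneSketch {

    /** Regularized squared error over the observed (non-zero) entries. */
    static double objective(double[][] r, double[] p, double[] q, double lambda) {
        double sum = 0.0;
        for (int u = 0; u < r.length; u++) {
            for (int i = 0; i < r[u].length; i++) {
                if (r[u][i] != 0.0) {
                    double err = r[u][i] - p[u] * q[i];
                    sum += err * err;
                }
            }
        }
        for (double v : p) sum += lambda * v * v;
        for (double v : q) sum += lambda * v * v;
        return sum;
    }

    /** One sweep: solve every user factor with q fixed, then every item factor with p fixed. */
    static void sweep(double[][] r, double[] p, double[] q, double lambda) {
        for (int u = 0; u < p.length; u++) {
            double num = 0.0;
            double den = lambda;
            for (int i = 0; i < q.length; i++) {
                if (r[u][i] != 0.0) {
                    num += r[u][i] * q[i];
                    den += q[i] * q[i];
                }
            }
            p[u] = den == 0.0 ? 0.0 : num / den;
        }
        for (int i = 0; i < q.length; i++) {
            double num = 0.0;
            double den = lambda;
            for (int u = 0; u < p.length; u++) {
                if (r[u][i] != 0.0) {
                    num += r[u][i] * p[u];
                    den += p[u] * p[u];
                }
            }
            q[i] = den == 0.0 ? 0.0 : num / den;
        }
    }

    public static void main(String[] args) {
        double[][] ratings = {{5, 3, 0}, {4, 0, 1}, {0, 2, 1}}; // made-up toy data
        double[] userFactors = {1, 1, 1};
        double[] itemFactors = {1, 1, 1};
        for (int sweep = 0; sweep < 5; sweep++) {
            sweep(ratings, userFactors, itemFactors, 0.1);
            System.out.println("objective after sweep " + (sweep + 1) + ": "
                    + objective(ratings, userFactors, itemFactors, 0.1));
        }
        // The estimated preference of user u for item i is userFactors[u] * itemFactors[i].
    }
}
```

A real factorization uses several latent factors per user and item, which turns each closed-form division into a small linear system, but the alternation is the same.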

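2.2 Mahout recommender architecture

2.3 Building a recommender with Mahout

Input and output

Input: raw data (user preferences)

Output: estimated user preferences

Steps

Step 1: Map the raw data into a Mahout DataModel

Step 2: Tune the recommender components

The similarity component, the neighborhood component, and so on

Step 3: Compute the estimated preference values used for ranking

Step 4: Evaluate the recommendations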

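2.4 Mahout recommender components

Mahout's key abstractions are expressed as Java interfaces:

DataModel: maps raw data into a Mahout-compatible form

UserSimilarity: computes the similarity between two users

ItemSimilarity: computes the similarity between two items

UserNeighborhood: defines which users count as "near" a given user

Recommender: implements a specific recommendation algorithm and produces recommendations (including training, prediction, and so on)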

Recommender component: DataModel

Recommender component: UserSimilarity

Similarity example: Tanimoto coefficient (TanimotoCoefficientSimilarity in Mahout)
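The Tanimoto coefficient ignores rating values and only looks at which items two users have in common: the size of the intersection divided by the size of the union. A minimal plain-Java sketch, assuming each user is represented by a set of item IDs; the class and method names are hypothetical, and returning 0 for two empty sets is a convention chosen here.

```java
import java.util.HashSet;
import java.util.Set;

class TanimotoSketch {

    /** Convenience helper that builds a set of item IDs. */
    static Set<Long> items(long... ids) {
        Set<Long> set = new HashSet<>();
        for (long id : ids) {
            set.add(id);
        }
        return set;
    }

    /** |A intersect B| / |A union B| over the item sets of two users. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumption: no data means no measurable similarity
        }
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Two shared items (102, 103) out of four distinct items overall.
        System.out.println(tanimoto(items(101, 102, 103), items(102, 103, 104))); // prints 0.5
    }
}
```

This kind of measure suits data without ratings, such as clicks or purchases.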

Similarity example: cosine similarity
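Cosine similarity treats each user's ratings over the co-rated items as a vector and measures the angle between the two vectors. A minimal plain-Java sketch, assuming both arrays hold ratings for the same items in the same order; the names are hypothetical and this is not Mahout's UncenteredCosineSimilarity code.

```java
class CosineSketch {

    /** dot(x, y) / (|x| * |y|); NaN when either vector has zero length. */
    static double cosine(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        if (normX == 0.0 || normY == 0.0) {
            return Double.NaN;
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    public static void main(String[] args) {
        // Ratings that differ only by a constant factor point in the same direction.
        System.out.println(cosine(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
        System.out.println(cosine(new double[] {3, 4}, new double[] {4, 3}));
    }
}
```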

Pearson correlation vs. Euclidean distance
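The difference between the two measures shows up when two users agree on the ordering of items but rate on different scales. A minimal plain-Java sketch under stated assumptions: both arrays hold ratings for the same co-rated items, and the distance is turned into a similarity with the common 1 / (1 + d) convention, which is not necessarily the exact formula Mahout uses. The names are hypothetical.

```java
class PearsonVsEuclideanSketch {

    /** Pearson correlation; NaN when either user's ratings have zero variance. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
            varY += (y[i] - meanY) * (y[i] - meanY);
        }
        if (varX == 0.0 || varY == 0.0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Similarity derived from Euclidean distance d as 1 / (1 + d). */
    static double euclideanSimilarity(double[] x, double[] y) {
        double sumSquares = 0.0;
        for (int i = 0; i < x.length; i++) {
            sumSquares += (x[i] - y[i]) * (x[i] - y[i]);
        }
        return 1.0 / (1.0 + Math.sqrt(sumSquares));
    }

    public static void main(String[] args) {
        double[] generous = {3, 4, 5}; // same taste, higher ratings
        double[] strict = {1, 2, 3};   // same taste, lower ratings
        System.out.println("Pearson:   " + pearson(generous, strict));
        System.out.println("Euclidean: " + euclideanSimilarity(generous, strict));
    }
}
```

Pearson treats these two users as perfectly correlated because it removes each user's mean, while the distance-based measure penalizes the constant offset of 2 on every item.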


Recommender component: UserNeighborhood
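A neighborhood is a rule for choosing which users are close enough to influence a recommendation. Two common rules are "the N most similar users" and "every user above a similarity threshold", which correspond in spirit to Mahout's NearestNUserNeighborhood and ThresholdUserNeighborhood. A minimal plain-Java sketch, assuming users are identified by array index and similarities are precomputed in a matrix; the names and data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NeighborhoodSketch {

    /** Indices of the n users most similar to target, most similar first. */
    static List<Integer> nearestN(double[][] sim, int target, int n) {
        List<Integer> others = new ArrayList<>();
        for (int u = 0; u < sim.length; u++) {
            if (u != target && !Double.isNaN(sim[target][u])) {
                others.add(u);
            }
        }
        others.sort(Comparator.comparingDouble((Integer u) -> sim[target][u]).reversed());
        return new ArrayList<>(others.subList(0, Math.min(n, others.size())));
    }

    /** Indices of all users whose similarity to target is at least threshold. */
    static List<Integer> aboveThreshold(double[][] sim, int target, double threshold) {
        List<Integer> result = new ArrayList<>();
        for (int u = 0; u < sim.length; u++) {
            if (u != target && sim[target][u] >= threshold) {
                result.add(u);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] sim = {
            {1.0, 0.9, 0.2, 0.5},
            {0.9, 1.0, 0.1, 0.3},
            {0.2, 0.1, 1.0, 0.4},
            {0.5, 0.3, 0.4, 1.0}
        };
        System.out.println(nearestN(sim, 0, 2));         // prints [1, 3]
        System.out.println(aboveThreshold(sim, 0, 0.4)); // prints [1, 3]
    }
}
```

A fixed N guarantees a neighborhood of predictable size, while a threshold guarantees a minimum quality and may return few or no neighbors for unusual users.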

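As the components above show, Mahout offers a large number of CF-based recommenders:

Different recommendation algorithms

Different neighborhood definitions

Different similarity measures

Evaluating all of these algorithm combinations by hand is very time-consuming.

Mahout provides tools for comparing the effect of different algorithm combinations.

Mahout provides standard interfaces for recommender evaluation.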

2.5 Recommender evaluation

Mahout provides several ways to evaluate a recommender.

1. Prediction-based measures:

MAE (Mean Absolute Error)

RMSE (Root Mean Square Error)

Classes: AverageAbsoluteDifferenceRecommenderEvaluator (MAE), RMSRecommenderEvaluator (RMSE)

Method: evaluate()

Parameters:

Recommender implementation (supplied through a RecommenderBuilder)

DataModel implementation

Training set size (e.g. 70%)

Percentage of the data to use in the evaluation (a smaller percentage for fast prototyping)
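Both prediction-based measures compare held-out actual ratings with the ratings the recommender estimates for the same user-item pairs. A minimal plain-Java sketch of the two formulas, with hypothetical names and made-up numbers:

```java
class ErrorMetricsSketch {

    /** Mean absolute error: the average of |actual - predicted|. */
    static double mae(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Root mean square error: the square root of the average squared difference. */
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = {4, 3, 5};
        double[] predicted = {3, 3, 3};
        System.out.println("MAE:  " + mae(actual, predicted));  // errors 1, 0, 2 average to 1.0
        System.out.println("RMSE: " + rmse(actual, predicted)); // sqrt(5/3), larger than the MAE
    }
}
```

RMSE is never smaller than MAE on the same data, and the gap grows when a few predictions are far off, because squaring weights large errors more heavily.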

2. IR-based measures:

Precision, Recall, F1-measure

NDCG (a ranking measure)

Class: GenericRecommenderIRStatsEvaluator

Method: evaluate()

Parameters:

Recommender implementation (supplied through a RecommenderBuilder)

DataModel implementation

Relevance threshold (mean + standard deviation)

Percentage of the data to use in the evaluation (a smaller percentage for fast prototyping)
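The IR-based measures treat recommendation as retrieval: for one user, compare the top-N recommended items with the set of items considered relevant (for example, items the user rated at or above the relevance threshold). A minimal plain-Java sketch with hypothetical names and made-up item IDs:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class IrMetricsSketch {

    /** Number of recommended items that are relevant. */
    static int hits(List<Long> recommended, Set<Long> relevant) {
        int count = 0;
        for (Long item : recommended) {
            if (relevant.contains(item)) {
                count++;
            }
        }
        return count;
    }

    /** Fraction of the recommended items that are relevant. */
    static double precision(List<Long> recommended, Set<Long> relevant) {
        return recommended.isEmpty() ? 0.0 : (double) hits(recommended, relevant) / recommended.size();
    }

    /** Fraction of the relevant items that were recommended. */
    static double recall(List<Long> recommended, Set<Long> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) hits(recommended, relevant) / relevant.size();
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        return precision + recall == 0.0 ? 0.0 : 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        List<Long> recommended = Arrays.asList(1L, 2L, 3L, 4L, 5L);          // top 5
        Set<Long> relevant = new HashSet<>(Arrays.asList(2L, 5L, 9L, 10L)); // 4 relevant items
        double p = precision(recommended, relevant); // 2 hits out of 5 recommended
        double r = recall(recommended, relevant);    // 2 hits out of 4 relevant
        System.out.println(p + " " + r + " " + f1(p, r));
    }
}
```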

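3. Mahout recommendation in practice

Example 1: preferences

Requirement:

Create user-item preference data and print it.

Implementation:

Create the data with GenericUserPreferenceArray

Store the data through the PreferenceArray interface

The code is as follows: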

package com.zdd.example;

import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class CreatePreferenceArray {

    private CreatePreferenceArray() {
    }

    public static void main(String[] args) {
        // Holds two preferences; every entry in the array belongs to the same user.
        PreferenceArray user1Pref = new GenericUserPreferenceArray(2);
        user1Pref.setUserID(0, 1L);
        // Entry 0: item 101 rated 3.0
        user1Pref.setItemID(0, 101L);
        user1Pref.setValue(0, 3.0f);
        // Entry 1: item 102 rated 4.0
        user1Pref.setItemID(1, 102L);
        user1Pref.setValue(1, 4.0f);
        // A single entry can be read back as a Preference object.
        Preference pref = user1Pref.get(1);
        System.out.println(user1Pref);
    }
}

Output:

GenericUserPreferenceArray[userID:1,{101=3.0,102=4.0}]

This shows that user 1 rated items 101 and 102 with 3.0 and 4.0, respectively.

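Example 2: data model

A PreferenceArray stores the preferences of a single user. How should the preferences of all users be stored? In a HashMap? No.

Mahout introduces a data structure optimized for recommendation workloads: FastByIDMap.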
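One reason a map specialized for long IDs can be leaner than a HashMap keyed by Long is that it can keep keys as primitive long values in a plain array, with no boxing and no per-entry node objects. The following is a minimal sketch of that idea only, assuming linear probing and a reserved sentinel key; the class name is hypothetical and this is not Mahout's actual FastByIDMap implementation.

```java
import java.util.Arrays;

class LongKeyMapSketch<V> {

    // Assumption: this one key value is reserved to mark an empty slot.
    private static final long EMPTY = Long.MIN_VALUE;

    private long[] keys;
    private Object[] values;
    private int size;

    LongKeyMapSketch(int capacity) {
        keys = new long[Math.max(4, capacity)];
        Arrays.fill(keys, EMPTY);
        values = new Object[keys.length];
    }

    /** Finds the slot holding key, or the empty slot where it would go. */
    private int slotFor(long key) {
        int index = (Long.hashCode(key) & 0x7FFFFFFF) % keys.length;
        while (keys[index] != EMPTY && keys[index] != key) {
            index = (index + 1) % keys.length; // linear probing
        }
        return index;
    }

    void put(long key, V value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        if ((size + 1) * 2 > keys.length) {
            grow(); // keep the table at most half full so probing always terminates
        }
        int index = slotFor(key);
        if (keys[index] == EMPTY) {
            keys[index] = key;
            size++;
        }
        values[index] = value;
    }

    @SuppressWarnings("unchecked")
    V get(long key) {
        int index = slotFor(key);
        return keys[index] == key ? (V) values[index] : null;
    }

    int size() {
        return size;
    }

    @SuppressWarnings("unchecked")
    private void grow() {
        long[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new long[oldKeys.length * 2];
        Arrays.fill(keys, EMPTY);
        values = new Object[keys.length];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) {
                put(oldKeys[i], (V) oldValues[i]);
            }
        }
    }

    public static void main(String[] args) {
        LongKeyMapSketch<String> prefs = new LongKeyMapSketch<>(4);
        prefs.put(1L, "preferences of user 1");
        prefs.put(2L, "preferences of user 2");
        System.out.println(prefs.get(2L) + ", size=" + prefs.size());
    }
}
```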

Requirement:

Load FastByIDMap data into a GenericDataModel.

Code:

package com.zdd.example;

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class CreateGenericDataModel {

    private CreateGenericDataModel() {
    }

    public static void main(String[] args) {
        // Maps each user ID to that user's PreferenceArray.
        FastByIDMap<PreferenceArray> preferences = new FastByIDMap<>();

        // User 1: item 101 rated 3.0, item 102 rated 4.0
        PreferenceArray user1Pref = new GenericUserPreferenceArray(2);
        user1Pref.setUserID(0, 1L);
        user1Pref.setItemID(0, 101L);
        user1Pref.setValue(0, 3.0f);
        user1Pref.setItemID(1, 102L);
        user1Pref.setValue(1, 4.0f);

        // User 2: the same two items with the same ratings
        PreferenceArray user2Pref = new GenericUserPreferenceArray(2);
        user2Pref.setUserID(0, 2L);
        user2Pref.setItemID(0, 101L);
        user2Pref.setValue(0, 3.0f);
        user2Pref.setItemID(1, 102L);
        user2Pref.setValue(1, 4.0f);

        preferences.put(1L, user1Pref);
        preferences.put(2L, user2Pref);

        // Wrap the map in an in-memory DataModel.
        DataModel model = new GenericDataModel(preferences);
        System.out.println(model);
        System.out.println(preferences);
    }
}

Output:

GenericDataModel[users:1,2]

{1=GenericUserPreferenceArray[userID:1,{101=3.0,102=4.0}],2=GenericUserPreferenceArray[userID:2,{101=3.0,102=4.0}]}

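Example 3: Recommender

Requirement: use user-based collaborative filtering to recommend 20 items to user 1.

Implementation:

1. Read the input file with FileDataModel

2. Compute similarity with PearsonCorrelationSimilarity

3. Build the recommendation engine with GenericUserBasedRecommender

Input data: ua.base (a training split from the MovieLens 100K dataset; each line holds a user ID, an item ID, a rating, and a timestamp)

Code: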

package com.zdd.example;

import org.apache.mahout.cf.taste.impl.model.file.*;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.impl.neighborhood.*;
import org.apache.mahout.cf.taste.impl.recommender.*;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.*;
import org.apache.mahout.cf.taste.neighborhood.*;
import org.apache.mahout.cf.taste.recommender.*;

import java.io.File;
import java.util.List;

public class RecommenderIntro {

    public static void main(String[] args) throws Exception {
        // Step 1: load the ratings file into a DataModel.
        DataModel model = new FileDataModel(new File("data/ua.base"));
        // Step 2: user-user similarity based on Pearson correlation.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // The neighborhood is the 100 users most similar to the target user.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);
        // Step 3: build the user-based recommender.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Recommend 20 items to user 1.
        List<RecommendedItem> recommendedItems = recommender.recommend(1, 20);
        for (RecommendedItem recommendedItem : recommendedItems) {
            System.out.println(recommendedItem);
        }
    }
}

[Figure: recommendation results]

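Example 4: evaluating the recommender (1)

Requirement:

Evaluate the quality of the recommender from Example 3.

Implementation:

Evaluate the model with AverageAbsoluteDifferenceRecommenderEvaluator and RMSRecommenderEvaluator

Supply the recommender under evaluation through a RecommenderBuilder

The code is as follows: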

package com.zdd.example;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

import java.io.File;

public class EvaluatorIntro {

    private EvaluatorIntro() {
    }

    public static void main(String[] args) throws Exception {
        final DataModel model = new FileDataModel(new File("data/ua.base"));
        // One evaluator per metric: MAE and RMSE.
        RecommenderEvaluator maeEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        RecommenderEvaluator rmseEvaluator = new RMSRecommenderEvaluator();
        // The evaluator calls this builder with the training portion of the data.
        RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel model) throws TasteException {
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);
                return new GenericUserBasedRecommender(model, neighborhood, similarity);
            }
        };
        // 0.7 means 70% of the data is used for training; 1.0 means all users take part in the evaluation.
        double mae = maeEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
        double rmse = rmseEvaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
        System.out.println(mae);
        System.out.println(rmse);
    }
}

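Output:

0.8522242111918109

1.0888589811454357

The mean absolute error is about 0.85 and the root mean square error is about 1.09. For a dataset of this modest size, the result is acceptable. Because evaluate() splits the data randomly, the exact values vary from run to run.

To compare similarity measures, change the line inside buildRecommender() that creates the similarity. The code above uses Pearson correlation:

UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

To use Euclidean distance instead:

UserSimilarity similarity = new EuclideanDistanceSimilarity(model);

To use cosine similarity instead:

UserSimilarity similarity = new UncenteredCosineSimilarity(model);

Both classes live in org.apache.mahout.cf.taste.impl.similarity and need a matching import.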

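Example 5: evaluating the recommender (2)

Requirement:

Evaluate the quality of the recommender from Example 3 with IR metrics.

Implementation:

Run the evaluation with RecommenderIRStatsEvaluator

The code is as follows: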

package com.zdd.example;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.*;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

import java.io.File;

public class IREvaluatorIntro {

    public static void main(String[] args) throws Exception {
        final DataModel model = new FileDataModel(new File("data/ua.base"));
        RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
        // Same recommender configuration as in Example 3.
        RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel model) throws TasteException {
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);
                return new GenericUserBasedRecommender(model, neighborhood, similarity);
            }
        };
        // 5 is the number of recommendations evaluated per user; 1.0 means all users take part.
        // The relevance threshold is the preference value at or above which an item counts as relevant.
        // CHOOSE_THRESHOLD tells the evaluator to pick that threshold automatically during the computation.
        IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 5,
                GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
        System.out.println(stats.getPrecision());
        System.out.println(stats.getRecall());
        System.out.println(stats.getF1Measure());
    }
}

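Output:

0.011523687580025595

0.011523687580025595

0.011523687580025593

All three metrics are quite low. This is likely because our data sample is still small. The next example works with a somewhat larger dataset, the MovieLens movie ratings.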

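Example 6: a MovieLens recommender

Requirement:

Build a movie recommender on the MovieLens 1M dataset.

Steps:

Implement a DataModel for the MovieLens dataset

Implement item-based and user-based collaborative filtering and save the results

The code is split across three files: 1. data preprocessing, 2. item-based implementation, 3. user-based implementation

1. Data preprocessing: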

package com.zdd.MovieLens;

import org.apache.commons.io.Charsets;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.common.iterator.FileLineIterable;

import java.io.*;
import java.util.regex.Pattern;

public class MovieLensDataModel extends FileDataModel {

    private static final String COLON_DELIMITER = "::";
    private static final Pattern COLON_DELIMITER_PATTERN = Pattern.compile(COLON_DELIMITER);

    public MovieLensDataModel(File ratingsFile) throws IOException {
        // Convert the "::"-delimited file to CSV before handing it to FileDataModel.
        super(convertFile(ratingsFile));
    }

    private static File convertFile(File originalFile) throws IOException {
        File resultFile = new File(System.getProperty("java.io.tmpdir"), "ratings.csv");
        // Start from a clean file on every run.
        if (resultFile.exists()) {
            resultFile.delete();
        }
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(resultFile), Charsets.UTF_8)) {
            for (String line : new FileLineIterable(originalFile, false)) {
                // Drop the last field (the timestamp) and keep user, movie, rating.
                int lastIndex = line.lastIndexOf(COLON_DELIMITER);
                if (lastIndex < 0) {
                    throw new IOException("Invalid data!");
                }
                String subLine = line.substring(0, lastIndex);
                String convertedSubLine = COLON_DELIMITER_PATTERN.matcher(subLine).replaceAll(",");
                writer.write(convertedSubLine);
                writer.write('\n');
            }
        } catch (IOException ioe) {
            // Do not leave a partially written file behind.
            resultFile.delete();
            throw ioe;
        }
        return resultFile;
    }
}
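The per-line conversion at the heart of convertFile can be tried in isolation. A minimal plain-Java sketch, assuming the ratings.dat layout user::movie::rating::timestamp; the class name, method name, and sample line are illustrative only.

```java
class RatingsLineSketch {

    /** Converts "user::movie::rating::timestamp" into "user,movie,rating". */
    static String toCsv(String line) {
        int lastIndex = line.lastIndexOf("::");
        if (lastIndex < 0) {
            throw new IllegalArgumentException("Invalid data: " + line);
        }
        // Cut off the trailing timestamp, then swap the remaining delimiters.
        return line.substring(0, lastIndex).replace("::", ",");
    }

    public static void main(String[] args) {
        System.out.println(toCsv("1::1193::5::978300760")); // prints 1,1193,5
    }
}
```

Dropping the timestamp keeps the converted file to the three columns the recommender actually needs.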

2. Item-based implementation:

package com.zdd.MovieLens;

import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.precompute.FileSimilarItemsWriter;
import org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.cf.taste.similarity.precompute.BatchItemSimilarities;
import org.apache.mahout.cf.taste.similarity.precompute.SimilarItemsWriter;

import java.io.File;

public class BatchItemSimilaritiesMovieLens {

    private BatchItemSimilaritiesMovieLens() {
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Needs MovieLens 1M dataset as argument!");
            System.exit(-1);
        }
        File resultFile = new File(System.getProperty("java.io.tmpdir"), "similarities.csv");
        DataModel dataModel = new MovieLensDataModel(new File(args[0]));
        ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
        ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);
        // 5 is the number of similar items to compute for each item.
        BatchItemSimilarities batchItemSimilarities = new MultithreadedBatchItemSimilarities(recommender, 5);
        SimilarItemsWriter writer = new FileSimilarItemsWriter(resultFile);
        // Use one thread per available processor and allow at most 1 hour.
        int numSimilarities = batchItemSimilarities.computeItemSimilarities(
                Runtime.getRuntime().availableProcessors(), 1, writer);
        System.out.println("Computed " + numSimilarities + " similarities for " + dataModel.getNumItems()
                + " items and saved them to " + resultFile.getAbsolutePath());
    }
}

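Running the code prints the following:

[Figure: console output]

The output shows 18530 similar-item results for 3706 items, that is, 5 per item.

Two files, ratings.csv and similarities.csv, are also created in the C:\Users\ADMINI~1\AppData\Local\Temp directory.

Contents of similarities.csv:

[Figure: contents of similarities.csv]

3. User-based implementation: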

package com.zdd.MovieLens;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

public class UserRecommenderMovieLens {

    private UserRecommenderMovieLens() {
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Needs MovieLens 1M dataset as argument!");
            System.exit(-1);
        }
        File resultFile = new File(System.getProperty("java.io.tmpdir"), "userRcomed.csv");
        DataModel dataModel = new MovieLensDataModel(new File(args[0]));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, dataModel);
        Recommender recommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
        // Cache results so repeated requests are not recomputed.
        Recommender cachingRecommender = new CachingRecommender(recommender);

        // Evaluate: RMSE with 90% of the data for training, using 50% of the users.
        RMSRecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
        RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel dataModel) throws TasteException {
                UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
                UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, dataModel);
                return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
            }
        };
        double score = evaluator.evaluate(recommenderBuilder, null, dataModel, 0.9, 0.5);
        System.out.println("RMSE score is " + score);

        // Write the top 2 recommendations for every user, one user per line.
        try (PrintWriter writer = new PrintWriter(resultFile)) {
            // This loop relies on user IDs running contiguously from 1 to getNumUsers().
            for (int userID = 1; userID <= dataModel.getNumUsers(); userID++) {
                List<RecommendedItem> recommendedItems = cachingRecommender.recommend(userID, 2);
                StringBuilder line = new StringBuilder(userID + " : ");
                for (RecommendedItem recommendedItem : recommendedItems) {
                    line.append(recommendedItem.getItemID()).append(':')
                            .append(recommendedItem.getValue()).append(',');
                }
                // Remove the trailing comma, if any.
                if (line.charAt(line.length() - 1) == ',') {
                    line.setLength(line.length() - 1);
                }
                writer.write(line.toString());
                writer.write('\n');
            }
        } catch (IOException ioe) {
            resultFile.delete();
            throw ioe;
        }
        System.out.println("Recommended for " + dataModel.getNumUsers() + " users and saved them to "
                + resultFile.getAbsolutePath());
    }
}

Running the code gives the following result:

RMSE score is 1.0747072266152768

Recommended for 6040 users and saved them to C:\Users\ADMINI~1\AppData\Local\Temp\userRcomed.csv

The first lines of userRcomed.csv look like this:

1 : 32:5.0,28:5.0

2 : 2726:5.0,2607:5.0

3 : 2624:5.0,1262:5.0

With the movie dataset, each user is recommended the 2 items with the highest estimated ratings.
