Mahout version: 0.7; Hadoop version: 1.0.4; JDK: 1.7.0_25 (64-bit).
Learning is always a process that is painful and joyful at the same time...
Today I will give a brief introduction to Collaborative Filtering with ALS-WR in Mahout. If you ask me what this algorithm is, the most I can tell you right now is that it is a recommendation algorithm (ALS-WR stands for Alternating Least Squares with Weighted-λ-Regularization, a matrix factorization approach); beyond that I don't know much yet. This post mainly follows the official introduction, Collaborative Filtering with ALS-WR.
This post is a hands-on walkthrough: first get the algorithm running without worrying about the implementation details, observe what happens, and only then analyze how it actually works. According to the official documentation, this example is driven by the script examples/bin/factorize-movielens-1M.sh, so let's open that file and take a look:
#
# Instructions:
#
# Before using this script, you have to download and extract the Movielens 1M dataset
# from http://www.grouplens.org/node/73
#
# To run: change into the mahout directory and type:
#  examples/bin/factorize-movielens-1M.sh /path/to/ratings.dat

if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
  echo "This script runs the Alternating Least Squares Recommender on the Grouplens data set (size 1M)."
  echo "Syntax: $0 /path/to/ratings.dat\n"
  exit
fi

if [ $# -ne 1 ]
then
  echo -e "\nYou have to download the Movielens 1M dataset from http://www.grouplens.org/node/73 before"
  echo -e "you can run this example. After that extract it and supply the path to the ratings.dat file.\n"
  echo -e "Syntax: $0 /path/to/ratings.dat\n"
  exit -1
fi

MAHOUT="../../bin/mahout"

WORK_DIR=/tmp/mahout-work-${USER}
echo "creating work directory at ${WORK_DIR}"
mkdir -p ${WORK_DIR}/movielens

echo "Converting ratings..."
cat $1 |sed -e s/::/,/g| cut -d, -f1,2,3 > ${WORK_DIR}/movielens/ratings.csv

# create a 90% percent training set and a 10% probe set
$MAHOUT splitDataset --input ${WORK_DIR}/movielens/ratings.csv --output ${WORK_DIR}/dataset \
    --trainingPercentage 0.9 --probePercentage 0.1 --tempDir ${WORK_DIR}/dataset/tmp

# run distributed ALS-WR to factorize the rating matrix defined by the training set
$MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
    --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065

# compute predictions against the probe set, measure the error
$MAHOUT evaluateFactorization --input ${WORK_DIR}/dataset/probeSet/ --output ${WORK_DIR}/als/rmse/ \
    --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ --tempDir ${WORK_DIR}/als/tmp

# compute recommendations
$MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
    --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
    --numRecommendations 6 --maxRating 5

# print the error
echo -e "\nRMSE is:\n"
cat ${WORK_DIR}/als/rmse/rmse.txt
echo -e "\n"
echo -e "\nSample recommendations:\n"
shuf ${WORK_DIR}/recommendations/part-m-00000 |head
echo -e "\n\n"

echo "removing work directory"
rm -rf ${WORK_DIR}

Here we can see five operations in total: (1) convert the raw data into the format we need; (2) split the dataset; (3) run parallel ALS; (4) evaluate the model; (5) produce the recommendations. Let's go through them one by one:
(1) Convert the data. Download the raw MovieLens Data Sets (the 1M dataset is used here). After extracting the archive, open ratings.dat and you can see data like this:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368

Then use the Linux command cat ratings.dat | sed -e s/::/,/g | cut -d, -f1,2,3 > ratings.csv to convert the data into the following form:
1,1193,5
1,661,3
1,914,3
1,3408,4
1,2355,5
1,1197,3
1,1287,5
1,2804,5
1,594,4
1,919,4

A quick note on the data formats: each line of ratings.dat has the structure UserID::MovieID::Rating::Timestamp; after conversion each line has the structure UserID,MovieID,Rating.
Then upload the generated ratings.csv to the HDFS file system in preparation for the next step.
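For example (a minimal sketch; the input/ directory is simply chosen to match the -i path that splitDataset will use in the next step):

hadoop fs -mkdir input
hadoop fs -put ratings.csv input/ratings.csv
# verify the upload
hadoop fs -ls input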
(2) Split the dataset into a training set and a probe (test) set: go to the Mahout root directory and use the splitDataset command. Its parameters are listed below:
usage: <command> [Generic Options] [Job-Specific Options]

Generic Options:
 -archives <paths>             comma separated archives to be unarchived on the compute machines.
 -conf <configuration file>    specify an application configuration file
 -D <property=value>           use value for given property
 -files <paths>                comma separated files to be copied to the map reduce cluster
 -fs <local|namenode:port>     specify a namenode
 -jt <local|jobtracker:port>   specify a job tracker
 -libjars <paths>              comma separated jar files to include in the classpath.
 -tokenCacheFile <tokensFile>  name of the file with the tokens

Job-Specific Options:
  --input (-i) input                            Path to job input directory.
  --output (-o) output                          The directory pathname for output.
  --trainingPercentage (-t) trainingPercentage  percentage of the data to use as training set (default: 0.9)
  --probePercentage (-p) probePercentage        percentage of the data to use as probe set (default: 0.1)
  --help (-h)                                   Print out help
  --tempDir tempDir                             Intermediate output directory
  --startPhase startPhase                       First phase to run
  --endPhase endPhase                           Last phase to run

The command is: ./mahout splitDataset -i input/ratings.csv -o output/als -t 0.9 -p 0.1 --tempDir temp. Once it has finished you can see that it ran three jobs in total, producing three sets of output: (a) apparently a conversion of the raw data, with 1,000,209 map input records and 1,000,209 output records; (b) generation of the training set: 1,000,209 input records, 900,362 output records; (c) generation of the probe set: 1,000,209 input records, 99,847 output records. (Note that 900,362 + 99,847 = 1,000,209, so the two splits exactly cover the input.)
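Before moving on, it is worth a quick check that the split landed where the later steps expect it (the trainingSet/ and probeSet/ subdirectory names come from the example script above):

hadoop fs -ls output/als
# among the job bookkeeping files, this should list:
#   output/als/trainingSet
#   output/als/probeSet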
(3) Parallel ALS: the command is ./mahout parallelALS. First look at its usage and parameters:
usage: <command> [Generic Options] [Job-Specific Options]

Generic Options: (identical to those shown for splitDataset above)

Job-Specific Options:
  --input (-i) input                    Path to job input directory.
  --output (-o) output                  The directory pathname for output.
  --lambda lambda                       regularization parameter
  --implicitFeedback implicitFeedback   data consists of implicit feedback?
  --alpha alpha                         confidence parameter (only used on implicit feedback)
  --numFeatures numFeatures             dimension of the feature space
  --numIterations numIterations         number of iterations
  --help (-h)                           Print out help
  --tempDir tempDir                     Intermediate output directory
  --startPhase startPhase               First phase to run
  --endPhase endPhase                   Last phase to run

Then run: ./mahout parallelALS -i output/als/trainingSet -o output/als/als --tempDir temp/als --numFeatures 20 --numIterations 10 --lambda 0.065
13/10/03 21:27:24 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 0/10)
13/10/03 21:27:50 INFO als.ParallelALSFactorizationJob: Recomputing M (iteration 0/10)
13/10/03 21:28:20 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 1/10)
...
13/10/03 21:35:51 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 9/10)
13/10/03 21:36:17 INFO als.ParallelALSFactorizationJob: Recomputing M (iteration 9/10)

When it finishes, the output directory contains three folders, M, U, and userRatings, while the temp directory contains U0~U8, M0~M8, M--1, averageRatings, and itemRatings.
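Those alternating "Recomputing U / Recomputing M" log lines are the heart of ALS-WR: hold one factor matrix fixed and recompute the other by solving many small regularized least-squares problems, then swap roles. As a sketch (the notation below is mine, following the Zhou et al. ALS-WR paper this algorithm comes from, not Mahout's source), the job factorizes the rating matrix by minimizing

\min_{U,M} \sum_{(i,j) \in I} \left( r_{ij} - u_i^{\top} m_j \right)^2
  + \lambda \left( \sum_i n_{u_i} \lVert u_i \rVert^2 + \sum_j n_{m_j} \lVert m_j \rVert^2 \right)

and, with M fixed, each user's feature vector has the closed-form update

u_i = \left( M_{I_i} M_{I_i}^{\top} + \lambda \, n_{u_i} E \right)^{-1} M_{I_i} \, r_i

where I_i is the set of items rated by user i, n_{u_i} = |I_i|, M_{I_i} collects the feature columns of those items, r_i holds user i's ratings, and E is the k×k identity with k = --numFeatures (20 here). The item vectors m_j are updated symmetrically with U fixed; --numIterations (10) is the number of U/M sweeps, and --lambda (0.065) is the λ above.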
(4) Evaluate the model: the Mahout command used is evaluateFactorization. First look at its usage and parameters:
usage: <command> [Generic Options] [Job-Specific Options]

Generic Options: (identical to those shown for splitDataset above)

Job-Specific Options:
  --input (-i) input            Path to job input directory.
  --userFeatures userFeatures   path to the user feature matrix
  --itemFeatures itemFeatures   path to the item feature matrix
  --output (-o) output          The directory pathname for output.
  --help (-h)                   Print out help
  --tempDir tempDir             Intermediate output directory
  --startPhase startPhase       First phase to run
  --endPhase endPhase           Last phase to run

Run it with: ./mahout evaluateFactorization -i output/als/probeSet -o output/rmse --userFeatures output/als/als/U --itemFeatures output/als/als/M --tempDir temp/rmse. After the command finishes, the root mean squared error can be read from output/rmse/rmse.txt on HDFS: 0.8548619405669956. (That feels like quite a small error, doesn't it?)
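For context, the value in rmse.txt is the standard root mean squared error over the probe set P, where the predicted rating is the dot product of the learned factor vectors (notation as in the sketch above):

RMSE = \sqrt{ \frac{1}{|P|} \sum_{(i,j) \in P} \left( r_{ij} - \hat{r}_{ij} \right)^2 },
  \qquad \hat{r}_{ij} = u_i^{\top} m_j

Since MovieLens ratings are on a 1-5 scale, an RMSE of about 0.85 means the predictions on held-out data are off by a bit less than one star on average, which is a perfectly reasonable result for this setup.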
(5) Recommend: the command used for recommendation is recommendfactorized. Its usage and parameters are:
usage: <command> [Generic Options] [Job-Specific Options]

Generic Options: (identical to those shown for splitDataset above)

Job-Specific Options:
  --input (-i) input                        Path to job input directory.
  --userFeatures userFeatures               path to the user feature matrix
  --itemFeatures itemFeatures               path to the item feature matrix
  --numRecommendations numRecommendations   number of recommendations per user
  --maxRating maxRating                     maximum rating available
  --output (-o) output                      The directory pathname for output.
  --help (-h)                               Print out help
  --tempDir tempDir                         Intermediate output directory
  --startPhase startPhase                   First phase to run
  --endPhase endPhase                       Last phase to run

Run it with: ./mahout recommendfactorized -i output/als/als/userRatings -o output/recommendations --userFeatures output/als/als/U --itemFeatures output/als/als/M --numRecommendations 6 --maxRating 5. When it finishes, the terminal shows 6,040 map output records, which exactly matches the number of users in the dataset, and the recommendation output can be inspected on the HDFS file system:
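For example (the part-m-00000 file name comes from the example script above; depending on your cluster configuration there may be additional part files):

hadoop fs -cat output/recommendations/part-m-00000 | head

Each line should pair a user ID with its six recommended items and their predicted ratings, capped at the --maxRating of 5; I am describing the general shape of the output here rather than quoting an exact line.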
Share, grow, be happy.
If you repost this, please credit the original blog: http://blog.csdn.net/fansy1990