Mahout Recommender System Source Code Notes (1): Preprocessing with PreparePreferenceMatrixJob


Hadoop series:

Due to time constraints, I am starting with a reading of the recommender source code on the distributed (Hadoop) side.

These notes are based on apache-mahout-distribution-0.12.2-src.

First, here is the code structure of the Taste recommender in Mahout:

  • taste
    • common
    • eval
    • hadoop
    • impl
      • model
      • neighborhood
      • recommender
      • similarity
    • model
    • neighborhood
    • recommender
    • similarity

The important folders are the following:

model: the various models that hold the preference data; the data structures they rely on, such as FastByIDMap, live in common.
similarity: the different similarity functions.
neighborhood: the methods for computing neighboring users/items; there are only two, one based on a similarity threshold and one based on the top N neighbors.
recommender: the recommender implementations.
impl: the concrete implementations of all the external interfaces.
hadoop: the core classes where the recommender actually runs its computation as Hadoop MapReduce jobs.
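
To see how these packages fit together, here is a minimal sketch using the non-distributed Taste API (the file name, user ID, and parameter values are placeholders, not taken from the Mahout source): it wires a model, a similarity, a neighborhood, and a recommender into a small user-based recommender.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteQuickStart {
  public static void main(String[] args) throws Exception {
    // model: load userID,itemID,pref lines from a local CSV file (placeholder path)
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // similarity: one of the functions from the similarity package
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // neighborhood: the top-N variant from the neighborhood package
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    // recommender: a user-based recommender from the recommender package
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // recommend 5 items for user 123 (placeholder user ID)
    List<RecommendedItem> recommendations = recommender.recommend(123L, 5);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}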

The entry class of the Hadoop-based Taste recommender is RecommenderJob.java under hadoop.item.
The main function of RecommenderJob.java is as follows:


  public static void main(String[] args) throws Exception {

    ToolRunner.run(new Configuration(), new RecommenderJob(), args);
  }

Tracing in, we can see that ToolRunner.run() calls back into the run() method of the RecommenderJob class.

This is where execution of the recommender begins.
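
For context, the String[] args passed to run() are simply the command-line options. A purely illustrative launch (all paths and values are placeholders; the option names are the ones defined by the addOption() calls below) could look like this:

    // Illustrative only: launching RecommenderJob programmatically with the same
    // options that would normally be given on the command line (paths are placeholders).
    String[] args = {
        "--input", "/user/demo/ratings.csv",          // text lines of userID,itemID[,pref]
        "--output", "/user/demo/recommendations",
        "--similarityClassname", "SIMILARITY_COSINE", // one of the VectorSimilarityMeasures
        "--numRecommendations", "10",
        "--booleanData", "false",
        "--tempDir", "/user/demo/temp"
    };
    ToolRunner.run(new Configuration(), new RecommenderJob(), args);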

RecommenderJob.run() first sets up the user-facing Options and then converts them into variables it can use.
The signature of addOption() is (optionName, shortName, description, defaultValue). With this format in mind, the option definitions below are easy to follow.

/* package org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run */

    addInputOption();
    addOutputOption();
    addOption("numRecommendations", "n", "Number of recommendations per user",
            String.valueOf(AggregateAndRecommendReducer.DEFAULT_NUM_RECOMMENDATIONS));
    addOption("usersFile", null, "File of users to recommend for", null);
    addOption("itemsFile", null, "File of items to recommend for", null);
    addOption("filterFile", "f", "File containing comma-separated userID,itemID pairs. Used to exclude the item from "
            + "the recommendations for that user (optional)", null);
    addOption("userItemFile", "uif", "File containing comma-separated userID,itemID pairs (optional). "
            + "Used to include only these items into recommendations. "
            + "Cannot be used together with usersFile or itemsFile", null);
    addOption("booleanData", "b", "Treat input as without pref values", Boolean.FALSE.toString());
    addOption("maxPrefsPerUser", "mxp",
            "Maximum number of preferences considered per user in final recommendation phase",
            String.valueOf(UserVectorSplitterMapper.DEFAULT_MAX_PREFS_PER_USER_CONSIDERED));
    addOption("minPrefsPerUser", "mp", "ignore users with less preferences than this in the similarity computation "
            + "(default: " + DEFAULT_MIN_PREFS_PER_USER + ')', String.valueOf(DEFAULT_MIN_PREFS_PER_USER));
    addOption("maxSimilaritiesPerItem", "m", "Maximum number of similarities considered per item ",
            String.valueOf(DEFAULT_MAX_SIMILARITIES_PER_ITEM));
    addOption("maxPrefsInItemSimilarity", "mpiis", "max number of preferences to consider per user or item in the "
            + "item similarity computation phase, users or items with more preferences will be sampled down (default: "
        + DEFAULT_MAX_PREFS + ')', String.valueOf(DEFAULT_MAX_PREFS));
    addOption("similarityClassname", "s", "Name of distributed similarity measures class to instantiate, " 
            + "alternatively use one of the predefined similarities (" + VectorSimilarityMeasures.list() + ')', true);
    addOption("threshold", "tr", "discard item pairs with a similarity value below this", false);
    addOption("outputPathForSimilarityMatrix", "opfsm", "write the item similarity matrix to this path (optional)",
        false);
    addOption("randomSeed", null, "use this seed for sampling", false);
    addFlag("sequencefileOutput", null, "write the output into a SequenceFile instead of a text file");

    //After the options have been defined, parse the user-supplied String args[] into them

    Map<String, List<String>> parsedArgs = parseArguments(args);
    if (parsedArgs == null) {
      return -1;
    }

    //Pull each option value out for use in this method
    Path outputPath = getOutputPath();
    int numRecommendations = Integer.parseInt(getOption("numRecommendations"));
    String usersFile = getOption("usersFile");
    String itemsFile = getOption("itemsFile");
    String filterFile = getOption("filterFile");
    String userItemFile = getOption("userItemFile");
    boolean booleanData = Boolean.valueOf(getOption("booleanData"));
    int maxPrefsPerUser = Integer.parseInt(getOption("maxPrefsPerUser"));
    int minPrefsPerUser = Integer.parseInt(getOption("minPrefsPerUser"));
    int maxPrefsInItemSimilarity = Integer.parseInt(getOption("maxPrefsInItemSimilarity"));
    int maxSimilaritiesPerItem = Integer.parseInt(getOption("maxSimilaritiesPerItem"));
    String similarityClassname = getOption("similarityClassname");
    double threshold = hasOption("threshold")
        ? Double.parseDouble(getOption("threshold")) : RowSimilarityJob.NO_THRESHOLD;
    long randomSeed = hasOption("randomSeed")
        ? Long.parseLong(getOption("randomSeed")) : RowSimilarityJob.NO_FIXED_RANDOM_SEED;


    Path prepPath = getTempPath(DEFAULT_PREPARE_PATH);
    Path similarityMatrixPath = getTempPath("similarityMatrix");
    Path explicitFilterPath = getTempPath("explicitFilterPath");
    Path partialMultiplyPath = getTempPath("partialMultiply");

    AtomicInteger currentPhase = new AtomicInteger();

    int numberOfUsers = -1;

    /* The purpose of each variable above can be found in the option descriptions added during the option initialization. */

Once all of the user-supplied values are in place (defaults are used where nothing was specified), we come to the first job that is run: PreparePreferenceMatrixJob().

    //shouldRunNextPhase() is a little hard to understand at first glance.
    //Tracing it shows that it is Mahout's internal phase/resume mechanism:
    //the whole task is divided into phases, one per job, and currentPhase is
    //incremented each time a phase is reached. If the task crashes in some job,
    //the run can be restarted from that phase (via the startPhase/endPhase options)
    //instead of from scratch. (A simplified sketch of the function follows after
    //this code block.)

    if (shouldRunNextPhase(parsedArgs, currentPhase)) {
      ToolRunner.run(getConf(), new PreparePreferenceMatrixJob(), new String[]{
      //Here the variables of the current run() are handed over to the PreparePreferenceMatrixJob that is about to be executed.
        "--input", getInputPath().toString(),
        "--output", prepPath.toString(),
        "--minPrefsPerUser", String.valueOf(minPrefsPerUser),
        "--booleanData", String.valueOf(booleanData),
        "--tempDir", getTempPath().toString(),
      });

      numberOfUsers = HadoopUtil.readInt(new Path(prepPath, PreparePreferenceMatrixJob.NUM_USERS), getConf());
    }
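
As promised above, here is a simplified sketch of what shouldRunNextPhase() (inherited from AbstractJob) does; this is an approximation of the idea, not the exact Mahout source:

// Simplified sketch (not the exact Mahout source): every call consumes one phase number.
// The optional --startPhase / --endPhase arguments let a failed run be resumed from, or
// stopped at, a given phase, so already-completed jobs are skipped.
static boolean shouldRunNextPhaseSketch(Map<String, List<String>> parsedArgs, AtomicInteger currentPhase) {
  int phase = currentPhase.getAndIncrement();
  String start = parsedArgs.containsKey("--startPhase") ? parsedArgs.get("--startPhase").get(0) : null;
  String end = parsedArgs.containsKey("--endPhase") ? parsedArgs.get("--endPhase").get(0) : null;
  boolean skip = (start != null && phase < Integer.parseInt(start))
      || (end != null && phase > Integer.parseInt(end));
  return !skip;   // run this phase only if it lies inside the [startPhase, endPhase] window
}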

Next, we can see that the job invokes PreparePreferenceMatrixJob(). Following it in, the code looks like this:

public int run(String[] args) throws Exception {

    addInputOption();
    addOutputOption();
    addOption("minPrefsPerUser", "mp", "ignore users with less preferences than this "
            + "(default: " + DEFAULT_MIN_PREFS_PER_USER + ')', String.valueOf(DEFAULT_MIN_PREFS_PER_USER));
    addOption("booleanData", "b", "Treat input as without pref values", Boolean.FALSE.toString());
    addOption("ratingShift", "rs", "shift ratings by this value", "0.0");

    Map<String, List<String>> parsedArgs = parseArguments(args);
    if (parsedArgs == null) {
      return -1;
    }

    int minPrefsPerUser = Integer.parseInt(getOption("minPrefsPerUser"));
    boolean booleanData = Boolean.valueOf(getOption("booleanData"));
    float ratingShift = Float.parseFloat(getOption("ratingShift"));

    // As above, the options are initialized from the String args[] passed in when launching the job,
    // and the variables used by this method are then read from them, just as in RecommenderJob, so we won't repeat that.

    //Run the first job of PreparePreferenceMatrixJob:
    //map item IDs to hashed internal index values
    Job itemIDIndex = prepareJob(getInputPath(), getOutputPath(ITEMID_INDEX), TextInputFormat.class,
            ItemIDIndexMapper.class, VarIntWritable.class, VarLongWritable.class, ItemIDIndexReducer.class,
            VarIntWritable.class, VarLongWritable.class, SequenceFileOutputFormat.class);
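    // Note: prepareJob(...) is a convenience helper inherited from AbstractJob; its arguments
    // are, in order: input path, output path, input format, mapper class, map output key class,
    // map output value class, reducer class, reduce output key class, reduce output value class,
    // and output format. (The third job below uses a shorter overload that omits the two format
    // arguments.)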
    itemIDIndex.setCombinerClass(ItemIDIndexReducer.class);
    boolean succeeded = itemIDIndex.waitForCompletion(true);
    if (!succeeded) {
      return -1;
    }

    //Run the second job of PreparePreferenceMatrixJob:
    //convert the user preferences into vectors
    Job toUserVectors = prepareJob(getInputPath(),
                                   getOutputPath(USER_VECTORS),
                                   TextInputFormat.class,
                                   ToItemPrefsMapper.class,
                                   VarLongWritable.class,
                                   booleanData ? VarLongWritable.class : EntityPrefWritable.class,
                                   ToUserVectorsReducer.class,
                                   VarLongWritable.class,
                                   VectorWritable.class,
                                   SequenceFileOutputFormat.class);
    toUserVectors.getConfiguration().setBoolean(RecommenderJob.BOOLEAN_DATA, booleanData);
    toUserVectors.getConfiguration().setInt(ToUserVectorsReducer.MIN_PREFERENCES_PER_USER, minPrefsPerUser);
    toUserVectors.getConfiguration().set(ToEntityPrefsMapper.RATING_SHIFT, String.valueOf(ratingShift));
    succeeded = toUserVectors.waitForCompletion(true);
    if (!succeeded) {
      return -1;
    }


    //Collect and record the number of users
    //we will see how this counter is maintained on the MapReduce side in a moment
    int numberOfUsers = (int) toUserVectors.getCounters().findCounter(ToUserVectorsReducer.Counters.USERS).getValue();
    HadoopUtil.writeInt(numberOfUsers, getOutputPath(NUM_USERS), getConf());


    //Run the third job of PreparePreferenceMatrixJob:
    //build the rating matrix
    Job toItemVectors = prepareJob(getOutputPath(USER_VECTORS), getOutputPath(RATING_MATRIX),
            ToItemVectorsMapper.class, IntWritable.class, VectorWritable.class, ToItemVectorsReducer.class,
            IntWritable.class, VectorWritable.class);
    toItemVectors.setCombinerClass(ToItemVectorsReducer.class);

    succeeded = toItemVectors.waitForCompletion(true);
    if (!succeeded) {
      return -1;
    }

    return 0;
  }
}

This preparation job contains three smaller jobs: 1 (ItemIDIndexMapper, ItemIDIndexReducer), 2 (ToItemPrefsMapper, ToUserVectorsReducer) and 3 (ToItemVectorsMapper, ToItemVectorsReducer). Next we will step into each of the three MapReduce jobs and see what they actually do.
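
Before stepping in, here is an overview of how data flows through the three jobs (the names in parentheses are the output-path constants used in the code above):

input text (userID,itemID[,pref])
-> job 1 (ItemIDIndexMapper, ItemIDIndexReducer): [item index , itemID] (ITEMID_INDEX)
input text (userID,itemID[,pref])
-> job 2 (ToItemPrefsMapper, ToUserVectorsReducer): [userID , Vector<item index , pref>] (USER_VECTORS), plus the user count (NUM_USERS)
USER_VECTORS
-> job 3 (ToItemVectorsMapper, ToItemVectorsReducer): [item index , Vector<user index , pref>] (RATING_MATRIX)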

(ItemIDIndexMapper, ItemIDIndexReducer) builds the internal index for items.
Input: the default key (the line's offset), a line of input text (the raw data)
Output: the item's internal index, itemID
Code:

public final class ItemIDIndexMapper extends Mapper<LongWritable,Text, VarIntWritable, VarLongWritable> {

  private boolean transpose;

  private final VarIntWritable indexWritable = new VarIntWritable();
  private final VarLongWritable itemIDWritable = new VarLongWritable();

  @Override

  //Read the job configuration; transpose controls whether items and users are swapped.
  protected void setup(Context context) {
    Configuration jobConf = context.getConfiguration();
    transpose = jobConf.getBoolean(ToEntityPrefsMapper.TRANSPOSE_USER_ITEM, false);
  }

  @Override
  protected void map(LongWritable key,
                     Text value,
                     Context context) throws IOException, InterruptedException {

    //Split the input line into tokens: String[0]: userID, String[1]: itemID, String[2]: pref
    String[] tokens = TasteHadoopUtils.splitPrefTokens(value.toString());

    //Decide whether to transpose depending on whether the job is item-based or user-based
    long itemID = Long.parseLong(tokens[transpose ? 0 : 1]);

    //Convert the itemID into an internal index in the range 0 to 0x7FFFFFFE via TasteHadoopUtils.idToIndex() (sketched after the reducer below), then emit it
    int index = TasteHadoopUtils.idToIndex(itemID);
    indexWritable.set(index);
    itemIDWritable.set(itemID);
    context.write(indexWritable, itemIDWritable);
  }  
}

public final class ItemIDIndexReducer extends Reducer<VarIntWritable, VarLongWritable, VarIntWritable,VarLongWritable> {

  private final VarLongWritable minimumItemIDWritable = new VarLongWritable();

  @Override
  protected void reduce(VarIntWritable index,
                        Iterable<VarLongWritable> possibleItemIDs,
                        Context context) throws IOException, InterruptedException {
    //Several itemIDs can hash to the same internal index; this reducer resolves such collisions by keeping only the smallest itemID for each index.
    long minimumItemID = Long.MAX_VALUE;
    for (VarLongWritable varLongWritable : possibleItemIDs) {
      long itemID = varLongWritable.get();
      if (itemID < minimumItemID) {
        minimumItemID = itemID;
      }
    }
    if (minimumItemID != Long.MAX_VALUE) {
      minimumItemIDWritable.set(minimumItemID);
      context.write(index, minimumItemIDWritable);
    }
  }

}
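
Several of these jobs rely on TasteHadoopUtils.idToIndex(), used in the mapper above. Its job is simply to hash a 64-bit ID down to a non-negative int index; a simplified sketch of the idea (not the exact Mahout source):

// Simplified sketch of the idea behind TasteHadoopUtils.idToIndex() (not the exact Mahout
// source): hash a 64-bit ID down to a non-negative int index. Because this is a hash,
// two different itemIDs can collide on the same index, which is exactly the situation
// ItemIDIndexReducer resolves above by keeping only the smallest itemID per index.
static int idToIndexSketch(long id) {
  return 0x7FFFFFFF & Long.hashCode(id);
}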

(ItemIDIndexMapper, ItemIDIndexReducer) transforms the text input roughly as follows:

map:
[offset , "userID,itemID,pref"]
-> [index , itemID]

combine & reduce:
merge to obtain a single [index , itemID] per index
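
For example (hypothetical IDs; the real index values come from the hash), the input lines 3,101,4.5 and 7,101,2.0 both produce [idToIndex(101) , 101] in the map phase, and the combine/reduce phase collapses them into a single [idToIndex(101) , 101] record. If some other itemID happened to hash to the same index, only the smaller of the two itemIDs would survive.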

(ToItemPrefsMapper, ToUserVectorsReducer) converts the user preferences into vectors.
Input: the default key (the line's offset), a line of input text (the raw data)
Output: [ userID , Vector< itemID , pref > ] (the value here is Mahout's own sparse vector structure)
or [ userID , Vector< itemID > ]
Code:

public abstract class ToEntityPrefsMapper extends Mapper<LongWritable,Text, VarLongWritable,VarLongWritable> {
    ...
    //The setup and other preprocessing steps are not important here, so we skip straight to the map function
  @Override
  public void map(LongWritable key,
                  Text value,
                  Context context) throws IOException, InterruptedException {
    //Split the line of input text into tokens and read out the userID and itemID
    String[] tokens = DELIMITER.split(value.toString());
    long userID = Long.parseLong(tokens[0]);
    long itemID = Long.parseLong(tokens[1]);

    //Depending on the configuration (item-based vs. user-based), decide whether to swap userID and itemID
    if (itemKey ^ transpose) {
      // If using items as keys, and not transposing items and users, then users are items!
      // Or if not using items as keys (users are, as usual), but transposing items and users,
      // then users are items! Confused?
      long temp = userID;
      userID = itemID;
      itemID = temp;
    }

    //booleanData indicates whether the input data carries explicit preference values.
    //An explicit preference means something like a 1-5 rating of an item; a like/upvote-style signal has no preference value.
    //Without preference values the output takes the form [ userID , itemID ].
    if (booleanData) {
      context.write(new VarLongWritable(userID), new VarLongWritable(itemID));
    } else {
      float prefValue = tokens.length > 2 ? Float.parseFloat(tokens[2]) + ratingShift : 1.0f;
      context.write(new VarLongWritable(userID), new EntityPrefWritable(itemID, prefValue));
    }
  }

}

public final class ToUserVectorsReducer extends Reducer<VarLongWritable,VarLongWritable,VarLongWritable,VectorWritable> {
  ... 

  @Override
  protected void reduce(VarLongWritable userID,
                        Iterable<VarLongWritable> itemPrefs,
                        Context context) throws IOException, InterruptedException {
    //Initialize the sparse vector that will hold the (itemID, pref) pairs
    Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
    for (VarLongWritable itemPref : itemPrefs) {
      //Convert the itemID into its internal index
      int index = TasteHadoopUtils.idToIndex(itemPref.get());

      //If the data is boolean, the preference value is fixed at 1.0.
      //One might wonder how both < itemID , pref > (an EntityPrefWritable) and a bare itemID (a VarLongWritable) can arrive here as the same parameter type:
      //following EntityPrefWritable we find that it is in fact a subclass of VarLongWritable,
      //inheriting the long value (used as the itemID) and adding a new prefValue field that holds the preference.
      //instanceof checks whether the object on the left is of the type on the right.
      float value = itemPref instanceof EntityPrefWritable ? ((EntityPrefWritable) itemPref).getPrefValue() : 1.0f;
      userVector.set(index, value);
    }

    if (userVector.getNumNondefaultElements() >= minPreferences) {
      userVectorWritable.set(userVector);
      userVectorWritable.setWritesLaxPrecision(true);
      //The counter used here counts the number of users; it is incremented once for every userID that is reduced.
      context.getCounter(Counters.USERS).increment(1);
      context.write(userID, userVectorWritable);
    }
  }

}

Steps:

map:
[offset , "userID,itemID,pref"]
-> String[] tokens (tokens[0]: userID, tokens[1]: itemID, tokens[2]: pref)
-> if (booleanData) (no explicit preference values): [userID , itemID]
   else: [userID , (itemID , pref)]

reduce:
[userID , list(itemID , pref)] or [userID , list(itemID)]
-> [userID , vectorWritable(itemID , pref)] or [userID , vectorWritable(itemID , 1.0)]

It also counts the number of users via the USERS counter.
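
For example (hypothetical data), the input lines 3,101,4.5 and 3,205,3.0 are grouped under userID 3 in the reduce phase and become [3 , vectorWritable{ idToIndex(101):4.5 , idToIndex(205):3.0 }]; with booleanData the same lines become [3 , vectorWritable{ idToIndex(101):1.0 , idToIndex(205):1.0 }].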

Next we look at (ToItemVectorsMapper, ToItemVectorsReducer), which builds the rating matrix.
Input: [ userID , Vector< itemID , pref > ] (the output of the (ToItemPrefsMapper, ToUserVectorsReducer) job)
Output: [ itemID , Vector< userID , pref > ]
This MapReduce is fairly simple: it essentially swaps userID and itemID, so we won't dwell on it.
Code:


public class ToItemVectorsMapper extends Mapper<VarLongWritable,VectorWritable,IntWritable,VectorWritable> {

  private final IntWritable itemID = new IntWritable();
  private final VectorWritable itemVectorWritable = new VectorWritable();

  @Override
  protected void map(VarLongWritable rowIndex, VectorWritable vectorWritable, Context ctx)
    throws IOException, InterruptedException {
    Vector userRatings = vectorWritable.get();

    int column = TasteHadoopUtils.idToIndex(rowIndex.get());

    itemVectorWritable.setWritesLaxPrecision(true);

    Vector itemVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 1);
    for (Vector.Element elem : userRatings.nonZeroes()) {
      itemID.set(elem.index());
      itemVector.setQuick(column, elem.get());
      itemVectorWritable.set(itemVector);
      ctx.write(itemID, itemVectorWritable);
      // reset vector for reuse
      itemVector.setQuick(elem.index(), 0.0);
    }
  }

}

public class ToItemVectorsReducer extends Reducer<IntWritable,VectorWritable,IntWritable,VectorWritable> {

  private final VectorWritable merged = new VectorWritable();

  @Override
  protected void reduce(IntWritable row, Iterable<VectorWritable> vectors, Context ctx)
    throws IOException, InterruptedException {

    merged.setWritesLaxPrecision(true);
    merged.set(VectorWritable.mergeToVector(vectors.iterator()));
    ctx.write(row, merged);
  }
}

Steps:

map:
[userID , vectorWritable(itemID , pref)]
-> [itemID , vector(userID , pref)]

reducer:
[itemID , vector(userID , pref)]
-> [itemID , vectorWritable(userID , pref)]

The rating matrix we end up with looks as though userID and itemID have simply swapped places, and this is tied to how the recommender works: picture a user-item rating matrix whose rows are items and whose columns are users. With records of the form [itemID , vectorWritable(userID , pref)] it is easy to locate and traverse that matrix by row and column.
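
For example (hypothetical data, writing raw IDs instead of their hashed indexes for readability), the two user vectors

[1 , vector{ 101:4.0 , 102:3.0 }]
[2 , vector{ 101:5.0 }]

become the item vectors

[101 , vector{ 1:4.0 , 2:5.0 }]
[102 , vector{ 1:3.0 }]

so every row of the resulting matrix corresponds to an item and every column to a user, which is the layout the following similarity computation works on.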

That is the full structure of PreparePreferenceMatrixJob(): in short, it preprocesses the input data so that the similarity computation that follows can work with it.

Please credit the source when reposting: http://blog.csdn.net/Utopia_1919/article/details/51832471
