Mahout推荐算法之ItemBased

Mahout推荐之ItemBased

一、   算法原理

(一)    基本原理

如下图评分矩阵所示:行为user,列为item.

Mahout推荐算法之ItemBased_第1张图片

图(1)


    该算法的原理:

1.  计算Item之间的相似度。

2.  对用户U做推荐


公式(一)

Map tmp ;

Map tmp1 ;

 

for(item a  in userRatedItems){

  rate  =userforItemRate(a)

  ListsimItem =getSimItem(a);

  For(Jin simItem){

    Item b =j;

    Simab=sim(a,b);

    Tmp.add(b,Tmp .get(b)+simab*rate)

tmp1.add(b, tmp1.get(b)+simab)

 

}

}

Maptmp2=temp/temp1

Sortbyval(tmp2)

return topK(tmp2,k)

 

(二)    相似度计算

1.  Cos相似度


公式(二)

2.  皮尔逊相似度


公式(三)

3.  调整的cos相似度


公式(四)

(三)    采样

计算全量的itemPair之间的相似度耗费大量的时间,也是没有必要的,所以需要采样,减小计算量。

二、   单机模式实现

(一)    候选Item搜索

计算所有Item Pair之间的相似度在单机模式下是不现实的,需要在海量的候选集中搜索出一部分最有可能的候选集用于计算。Mahout提供了4中候选Item选择策略。

1.  AllSimilarItemsCandidateItemsStrategy

@Override

  FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

    FastIDSet candidateItemIDs = new FastIDSet();

    for (long itemID : preferredItemIDs) {

      candidateItemIDs.addAll(similarity.allSimilarItemIDs(itemID));

    }

    candidateItemIDs.removeAll(preferredItemIDs);

    return candidateItemIDs;

  }

 

2.  AllUnknownItemsCandidateItemsStrategy

@Override

  protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

    FastIDSet possibleItemIDs = new FastIDSet(dataModel.getNumItems());

    LongPrimitiveIterator allItemIDs = dataModel.getItemIDs();

    while (allItemIDs.hasNext()) {

      possibleItemIDs.add(allItemIDs.nextLong());

    }

    possibleItemIDs.removeAll(preferredItemIDs);

    return possibleItemIDs;

  }

 

3.  PreferredItemsNeighborhoodCandidateItemsStrategy

  @Override

  protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

    FastIDSet possibleItemsIDs = new FastIDSet();

    for (long itemID : preferredItemIDs) {

      PreferenceArray itemPreferences = dataModel.getPreferencesForItem(itemID);

      int numUsersPreferringItem = itemPreferences.length();

      for (int index = 0; index < numUsersPreferringItem; index++) {

        possibleItemsIDs.addAll(dataModel.getItemIDsFromUser(itemPreferences.getUserID(index)));

      }

    }

    possibleItemsIDs.removeAll(preferredItemIDs);

    return possibleItemsIDs;

  }

 

4.  SamplingCandidateItemsStrategy

private static int computeMaxFrom(int factor, int numThings) {

    if (factor == NO_LIMIT_FACTOR) {

      return MAX_LIMIT;

    }

    long max = (long) (factor * (1.0 + Math.log(numThings) / LOG2));

    return max > MAX_LIMIT ? MAX_LIMIT : (int) max;

  }

 

  @Override

  protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

    LongPrimitiveIterator preferredItemIDsIterator = new LongPrimitiveArrayIterator(preferredItemIDs);

    if (preferredItemIDs.length > maxItems) {

      double samplingRate = (double) maxItems / preferredItemIDs.length;

//      log.info("preferredItemIDs.length {}, samplingRate {}", preferredItemIDs.length, samplingRate);

      preferredItemIDsIterator =

          new SamplingLongPrimitiveIterator(preferredItemIDsIterator, samplingRate);

    }

    FastIDSet possibleItemsIDs = new FastIDSet();

    while (preferredItemIDsIterator.hasNext()) {

      long itemID = preferredItemIDsIterator.nextLong();

      PreferenceArray prefs = dataModel.getPreferencesForItem(itemID);

      int prefsLength = prefs.length();

      if (prefsLength > maxUsersPerItem) {

        Iterator<Preference> sampledPrefs =

            new FixedSizeSamplingIterator<Preference>(maxUsersPerItem, prefs.iterator());

        while (sampledPrefs.hasNext()) {

          addSomeOf(possibleItemsIDs, dataModel.getItemIDsFromUser(sampledPrefs.next().getUserID()));

        }

      } else {

        for (int i = 0; i < prefsLength; i++) {

          addSomeOf(possibleItemsIDs, dataModel.getItemIDsFromUser(prefs.getUserID(i)));

        }

      }

    }

    possibleItemsIDs.removeAll(preferredItemIDs);

    return possibleItemsIDs;

  }

 

  private void addSomeOf(FastIDSet possibleItemIDs, FastIDSet itemIDs) {

    if (itemIDs.size() > maxItemsPerUser) {

      LongPrimitiveIterator it =

          new SamplingLongPrimitiveIterator(itemIDs.iterator(), (double) maxItemsPerUser / itemIDs.size());

      while (it.hasNext()) {

        possibleItemIDs.add(it.nextLong());

      }

    } else {

      possibleItemIDs.addAll(itemIDs);

    }

  }

 

(二)    估值

protected float doEstimatePreference(long userID, PreferenceArray preferencesFromUser, long itemID)

    throws TasteException {

    double preference = 0.0;

    double totalSimilarity = 0.0;

    int count = 0;

    double[] similarities = similarity.itemSimilarities(itemID, preferencesFromUser.getIDs());

    for (int i = 0; i < similarities.length; i++) {

      double theSimilarity = similarities[i];

      if (!Double.isNaN(theSimilarity)) {

        // Weights can be negative!

        preference += theSimilarity * preferencesFromUser.getValue(i);

        totalSimilarity += theSimilarity;

        count++;

      }

    }

    // Throw out the estimate if it was based on no data points, of course, but also if based on

    // just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.

    // The reason is that in this case the estimate is, simply, the user's rating for one item

    // that happened to have a defined similarity. The similarity score doesn't matter, and that

    // seems like a bad situation.

    if (count <= 1) {

      return Float.NaN;

    }

    float estimate = (float) (preference / totalSimilarity);

    if (capper != null) {

      estimate = capper.capEstimate(estimate);

    }

    return estimate;

  }

 

(三)    推荐

1.  根据历史评分列表推荐

这种推荐方式根据用户之前产生过评分的item做推荐,推荐结果按照估计值的大小排序。

@Override

  public List<RecommendedItem> recommend(long userID, int howMany, IDRescorer rescorer) throws TasteException {

    Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");

    log.debug("Recommending items for user ID '{}'", userID);

    PreferenceArray preferencesFromUser = getDataModel().getPreferencesFromUser(userID);

    if (preferencesFromUser.length() == 0) {

      return Collections.emptyList();

    }

    FastIDSet possibleItemIDs = getAllOtherItems(userID, preferencesFromUser);

    TopItems.Estimator<Long> estimator = new Estimator(userID, preferencesFromUser);

    List<RecommendedItem> topItems = TopItems.getTopItems(howMany, possibleItemIDs.iterator(), rescorer,

      estimator);

    log.debug("Recommendations are: {}", topItems);

    return topItems;

  }

 

2.  Because推荐

这种推荐方式用于实时推荐。

@Override

  public List<RecommendedItem> recommendedBecause(long userID, long itemID, int howMany) throws TasteException {

    Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");

    DataModel model = getDataModel();

    TopItems.Estimator<Long> estimator = new RecommendedBecauseEstimator(userID, itemID);

    PreferenceArray prefs = model.getPreferencesFromUser(userID);

    int size = prefs.length();

    FastIDSet allUserItems = new FastIDSet(size);

    for (int i = 0; i < size; i++) {

      allUserItems.add(prefs.getItemID(i));

    }

    allUserItems.remove(itemID);

    return TopItems.getTopItems(howMany, allUserItems.iterator(), null, estimator);

  }

 

//估值方法

@Override

public double estimate(Long itemID) throws TasteException {

      Float pref = getDataModel().getPreferenceValue(userID, itemID);

      if (pref == null) {

        return Float.NaN;

      }

      double similarityValue = similarity.itemSimilarity(recommendedItemID, itemID);

      return (1.0 + similarityValue) * pref;

    }

 

三、   MapReduce模式实现

(一)    将偏好文件转换成评分矩阵(PreparePreferenceMatrixJob)

(二)    计算共现矩阵相似度(RowSimilarityJob)

(三)    挑选最相似的K个Item

(四)    用户偏好向量和相似降维后的共现矩阵做乘法

(五)    过滤制定的user\titem

(六)    生成最终的推荐结果

四、   实例演示

1.  单机模式

1)  批量推荐

DataModel  dataModel = new FileDataModel(new File("p/pereference"));

 

ItemSimilarity  similarity  = new PearsonCorrelationSimilarity(dataModel);

 

ItemBasedRecommender  recommender = new GenericItemBasedRecommender(dataModel,similarity );

 

System.out.println(recommender.recommend(10, 10));

 

2)  Because推荐

DataModel  dataModel = new FileDataModel(new File("p/pereference"));

 

ItemSimilarity  similarity  = new PearsonCorrelationSimilarity(dataModel);

 

ItemBasedRecommender  recommender = new GenericItemBasedRecommender(dataModel,similarity );

 

System.out.println(recommender.recommendedBecause(10, 10328, 100));

 

2.  MapReduce模式

API

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(args)

--input

偏好数据路径,文本文件。格式 userid\t itemid\t preference

--output

推荐结果路径

-- numRecommendations

推荐个数

--usersFile

需要做出推荐的user,默认全部做推荐

--itemsFile

需要做出推荐的item,默认全部做推荐

--filterFile

文件格式文本,userid\itemid 。目的是给userid的用户不要推荐itemid的item

--booleanData

是否是布尔数据

--maxPrefsPerUser

最大偏好值

--minPrefsPerUser

最小偏好值

--maxSimilaritiesPerItem

给每一个Item计算最多的相似item数目

--maxPrefsPerUserInItemSimilarity

ItemSimilarity估计item相似度时,对每一个user最多偏好数目

--similarityClassname

SIMILARITY_PEARSON_CORRELATION、SIMILARITY_COOCCURRENCE、SIMILARITY_LOGLIKELIHOOD、SIMILARITY_TANIMOTO_COEFFICIENT、SIMILARITY_CITY_BLOCK、SIMILARITY_COSINE、SIMILARITY_EUCLIDEAN_DISTANCE

--threshold

删除低于该阈值的item对

--outputPathForSimilarityMatrix

指定生成的item相似矩阵路径,文本文件,格式为 itemA \t itemB \t 相似值

    实例

String  [] args ={"--input","p",

"--output","recommender",

"--numRecommendations","10",

"--outputPathForSimilarityMatrix","simMatrix",

"--similarityClassname","SIMILARITY_PEARSON_CORRELATION"}

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(args);

 

五、   参考文献

1.  M.Deshpandeand G. Karypis. Item-based top-n recommendation algorithms.

2.  B.M.Sarwar, G. Karypis, J.A. Konstan, and J. Reidl. Item-based collaborativefiltering recommendation algorithms.

3.  Item-based collaborative filtering

4.  Accuratelycomputing running variance

 

你可能感兴趣的:(Mahout,机器学习,推荐系统)