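Before studying Mahout's recommender algorithms, we first need to understand how Mahout abstracts recommendation data. Start with Preference, the most basic abstraction: a Preference object generally represents a single userID, itemID, and preference score. At the implementation level this begins with the Preference interface: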
    /**
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements. See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License. You may obtain a copy of the License at
     *
     * http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */

    package org.apache.mahout.cf.taste.model;

    /**
     * A {@link Preference} encapsulates an item and a preference value, which indicates the strength of the
     * preference for it. {@link Preference}s are associated to users.
     */
    public interface Preference {

      /** @return ID of user who prefers the item */
      long getUserID();

      /** @return item ID that is preferred */
      long getItemID();

      /**
       * @return strength of the preference for that item. Zero should indicate "no preference either way";
       * positive values indicate preference and negative values indicate dislike
       */
      float getValue();

      /**
       * Sets the strength of the preference for this item
       *
       * @param value
       *          new preference
       */
      void setValue(float value);

    }
In Mahout, a Preference object represents one user's score (degree of liking) for one item. The class we usually work with directly is GenericPreference, which implements the Preference interface:
    package org.apache.mahout.cf.taste.impl.model;

    import java.io.Serializable;

    import org.apache.mahout.cf.taste.model.Preference;

    import com.google.common.base.Preconditions;

    /**
     * A simple {@link Preference} encapsulating an item and preference value.
     */
    public class GenericPreference implements Preference, Serializable {

      private final long userID;
      private final long itemID;
      private float value;

      public GenericPreference(long userID, long itemID, float value) {
        Preconditions.checkArgument(!Float.isNaN(value), "NaN value");
        this.userID = userID;
        this.itemID = itemID;
        this.value = value;
      }

      @Override
      public long getUserID() {
        return userID;
      }

      @Override
      public long getItemID() {
        return itemID;
      }

      @Override
      public float getValue() {
        return value;
      }

      @Override
      public void setValue(float value) {
        Preconditions.checkArgument(!Float.isNaN(value), "NaN value");
        this.value = value;
      }

      @Override
      public String toString() {
        return "GenericPreference[userID: " + userID + ", itemID:" + itemID + ", value:" + value + ']';
      }

    }

At this point you might guess that Mahout uses a container class to hold many Preference objects. For reasons of storage and efficiency, however, Mahout chose a better way to represent a collection of Preferences.
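As shown in the figure below:
The Chinese edition mistranslates this point. Figure 3.2 shows the structure of GenericUserPreferenceArray, which needs only a single userID. GenericItemPreferenceArray, by contrast, is organized along the item dimension and holds only a single itemID. Which one you use depends on whether you are doing user-based or item-based CF.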
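To make the item-keyed layout concrete, here is a minimal sketch of the idea. It is not Mahout's GenericItemPreferenceArray; the class name ItemPrefsSketch and its methods are hypothetical, chosen only to show that the item ID is stored once while user IDs and values sit in two parallel arrays.

```java
// Hypothetical illustration only; not Mahout's GenericItemPreferenceArray.
final class ItemPrefsSketch {
    private final long itemID;     // stored once for the whole array
    private final long[] userIDs;  // one entry per preference
    private final float[] values;  // parallel to userIDs

    ItemPrefsSketch(long itemID, int size) {
        this.itemID = itemID;
        this.userIDs = new long[size];
        this.values = new float[size];
    }

    // Records that userID gave this item the given preference value.
    void set(int i, long userID, float value) {
        userIDs[i] = userID;
        values[i] = value;
    }

    long getItemID(int i) { return itemID; }      // identical for every index
    long getUserID(int i) { return userIDs[i]; }
    float getValue(int i) { return values[i]; }
    int length() { return userIDs.length; }
}
```

A user-keyed array is the mirror image: one user ID, with item IDs and values in the parallel arrays.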
The figure above makes it clear that Mahout's PreferenceArray represents a collection of Preferences more efficiently. The interface is as follows:
    package org.apache.mahout.cf.taste.model;

    import java.io.Serializable;

    /**
     * An alternate representation of an array of {@link Preference}. Implementations, in theory, can produce a
     * more memory-efficient representation.
     */
    public interface PreferenceArray extends Cloneable, Serializable, Iterable<Preference> {

      /**
       * @return size of length of the "array"
       */
      int length();

      /**
       * @param i
       *          index
       * @return a materialized {@link Preference} representation of the preference at i
       */
      Preference get(int i);

      /**
       * Sets preference at i from information in the given {@link Preference}
       *
       * @param i
       * @param pref
       */
      void set(int i, Preference pref);

      /**
       * @param i
       *          index
       * @return user ID from preference at i
       */
      long getUserID(int i);

      /**
       * Sets user ID for preference at i.
       *
       * @param i
       *          index
       * @param userID
       *          new user ID
       */
      void setUserID(int i, long userID);

      /**
       * @param i
       *          index
       * @return item ID from preference at i
       */
      long getItemID(int i);

      /**
       * Sets item ID for preference at i.
       *
       * @param i
       *          index
       * @param itemID
       *          new item ID
       */
      void setItemID(int i, long itemID);

      /**
       * @return all user or item IDs
       */
      long[] getIDs();

      /**
       * @param i
       *          index
       * @return preference value from preference at i
       */
      float getValue(int i);

      /**
       * Sets preference value for preference at i.
       *
       * @param i
       *          index
       * @param value
       *          new preference value
       */
      void setValue(int i, float value);

      /**
       * @return independent copy of this object
       */
      PreferenceArray clone();

      /**
       * Sorts underlying array by user ID, ascending.
       */
      void sortByUser();

      /**
       * Sorts underlying array by item ID, ascending.
       */
      void sortByItem();

      /**
       * Sorts underlying array by preference value, ascending.
       */
      void sortByValue();

      /**
       * Sorts underlying array by preference value, descending.
       */
      void sortByValueReversed();

      /**
       * @param userID
       *          user ID
       * @return true if array contains a preference with given user ID
       */
      boolean hasPrefWithUserID(long userID);

      /**
       * @param itemID
       *          item ID
       * @return true if array contains a preference with given item ID
       */
      boolean hasPrefWithItemID(long itemID);

    }
The GenericUserPreferenceArray implementation:
    package org.apache.mahout.cf.taste.impl.model;

    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    import com.google.common.base.Function;
    import com.google.common.collect.Iterators;
    import org.apache.mahout.cf.taste.model.Preference;
    import org.apache.mahout.cf.taste.model.PreferenceArray;
    import org.apache.mahout.common.iterator.CountingIterator;

    /**
     * Like {@link GenericItemPreferenceArray} but stores preferences for one user (all user IDs the same) rather
     * than one item.
     *
     * This implementation maintains two parallel arrays, of item IDs and values. The idea is to save allocating
     * {@link Preference} objects themselves. This saves the overhead of {@link Preference} objects but also
     * duplicating the user ID value.
     *
     * @see BooleanUserPreferenceArray
     * @see GenericItemPreferenceArray
     * @see GenericPreference
     */
    public final class GenericUserPreferenceArray implements PreferenceArray {

      private static final int ITEM = 1;
      private static final int VALUE = 2;
      private static final int VALUE_REVERSED = 3;

      private final long[] ids;
      private long id;
      private final float[] values;

      public GenericUserPreferenceArray(int size) {
        this.ids = new long[size];
        values = new float[size];
        this.id = Long.MIN_VALUE; // as a sort of 'unspecified' value
      }

      public GenericUserPreferenceArray(List<? extends Preference> prefs) {
        this(prefs.size());
        int size = prefs.size();
        long userID = Long.MIN_VALUE;
        for (int i = 0; i < size; i++) {
          Preference pref = prefs.get(i);
          if (i == 0) {
            userID = pref.getUserID();
          } else {
            if (userID != pref.getUserID()) {
              throw new IllegalArgumentException("Not all user IDs are the same");
            }
          }
          ids[i] = pref.getItemID();
          values[i] = pref.getValue();
        }
        id = userID;
      }

      /**
       * This is a private copy constructor for clone().
       */
      private GenericUserPreferenceArray(long[] ids, long id, float[] values) {
        this.ids = ids;
        this.id = id;
        this.values = values;
      }

      @Override
      public int length() {
        return ids.length;
      }

      @Override
      public Preference get(int i) {
        return new PreferenceView(i);
      }

      @Override
      public void set(int i, Preference pref) {
        id = pref.getUserID();
        ids[i] = pref.getItemID();
        values[i] = pref.getValue();
      }

      @Override
      public long getUserID(int i) {
        return id;
      }

      /**
       * {@inheritDoc}
       *
       * Note that this method will actually set the user ID for all preferences.
       */
      @Override
      public void setUserID(int i, long userID) {
        id = userID;
      }

      @Override
      public long getItemID(int i) {
        return ids[i];
      }

      @Override
      public void setItemID(int i, long itemID) {
        ids[i] = itemID;
      }

      /**
       * @return all item IDs
       */
      @Override
      public long[] getIDs() {
        return ids;
      }

      @Override
      public float getValue(int i) {
        return values[i];
      }

      @Override
      public void setValue(int i, float value) {
        values[i] = value;
      }

      @Override
      public void sortByUser() { }

      @Override
      public void sortByItem() {
        lateralSort(ITEM);
      }

      @Override
      public void sortByValue() {
        lateralSort(VALUE);
      }

      @Override
      public void sortByValueReversed() {
        lateralSort(VALUE_REVERSED);
      }

      @Override
      public boolean hasPrefWithUserID(long userID) {
        return id == userID;
      }

      @Override
      public boolean hasPrefWithItemID(long itemID) {
        for (long id : ids) {
          if (itemID == id) {
            return true;
          }
        }
        return false;
      }

      private void lateralSort(int type) {
        //Comb sort: http://en.wikipedia.org/wiki/Comb_sort
        int length = length();
        int gap = length;
        boolean swapped = false;
        while (gap > 1 || swapped) {
          if (gap > 1) {
            gap /= 1.247330950103979; // = 1 / (1 - 1/e^phi)
          }
          swapped = false;
          int max = length - gap;
          for (int i = 0; i < max; i++) {
            int other = i + gap;
            if (isLess(other, i, type)) {
              swap(i, other);
              swapped = true;
            }
          }
        }
      }

      private boolean isLess(int i, int j, int type) {
        switch (type) {
          case ITEM:
            return ids[i] < ids[j];
          case VALUE:
            return values[i] < values[j];
          case VALUE_REVERSED:
            return values[i] > values[j];
          default:
            throw new IllegalStateException();
        }
      }

      private void swap(int i, int j) {
        long temp1 = ids[i];
        float temp2 = values[i];
        ids[i] = ids[j];
        values[i] = values[j];
        ids[j] = temp1;
        values[j] = temp2;
      }

      @Override
      public GenericUserPreferenceArray clone() {
        return new GenericUserPreferenceArray(ids.clone(), id, values.clone());
      }

      @Override
      public int hashCode() {
        return (int) (id >> 32) ^ (int) id ^ Arrays.hashCode(ids) ^ Arrays.hashCode(values);
      }

      @Override
      public boolean equals(Object other) {
        if (!(other instanceof GenericUserPreferenceArray)) {
          return false;
        }
        GenericUserPreferenceArray otherArray = (GenericUserPreferenceArray) other;
        return id == otherArray.id && Arrays.equals(ids, otherArray.ids) && Arrays.equals(values, otherArray.values);
      }

      @Override
      public Iterator<Preference> iterator() {
        return Iterators.transform(new CountingIterator(length()),
          new Function<Integer, Preference>() {
            @Override
            public Preference apply(Integer from) {
              return new PreferenceView(from);
            }
          });
      }

      @Override
      public String toString() {
        if (ids == null || ids.length == 0) {
          return "GenericUserPreferenceArray[{}]";
        }
        StringBuilder result = new StringBuilder(20 * ids.length);
        result.append("GenericUserPreferenceArray[userID:");
        result.append(id);
        result.append(",{");
        for (int i = 0; i < ids.length; i++) {
          if (i > 0) {
            result.append(',');
          }
          result.append(ids[i]);
          result.append('=');
          result.append(values[i]);
        }
        result.append("}]");
        return result.toString();
      }

      private final class PreferenceView implements Preference {

        private final int i;

        private PreferenceView(int i) {
          this.i = i;
        }

        @Override
        public long getUserID() {
          return GenericUserPreferenceArray.this.getUserID(i);
        }

        @Override
        public long getItemID() {
          return GenericUserPreferenceArray.this.getItemID(i);
        }

        @Override
        public float getValue() {
          return values[i];
        }

        @Override
        public void setValue(float value) {
          values[i] = value;
        }

      }

    }
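The lateralSort method above is a comb sort. Comb sort is still based on bubble sort; the difference is that it compares and swaps elements a fixed distance (the gap) apart, much as Shell sort does.
The gap starts at the array length divided by 1.3, rounded down. Each later pass divides the previous gap by 1.3 again until the gap is down to 3, after which it decreases by 1 per pass. The Mahout source uses a shrink factor of 1.247330950103979 instead of 1.3, which produces the same gap sequence for the example below.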
Suppose the array to sort is [8 4 3 7 6 5 2 1].
First pass: the array length is 8 and 8 ÷ 1.3 ≈ 6, so the gap is 6. Compare 8 with 2 and 4 with 1; both pairs are swapped, giving
[2 1 3 7 6 5 8 4]
Second pass: the gap becomes 6 ÷ 1.3 ≈ 4. Compare 2 with 6, 1 with 5, 3 with 8, and 7 with 4. Only 7 and 4 need to be swapped, giving
[2 1 3 4 6 5 8 7]
Third pass: the gap becomes 3; no swaps.
Fourth pass: the gap becomes 2; no swaps.
Fifth pass: the gap becomes 1; three swaps (2 and 1, 6 and 5, 8 and 7), giving
[1 2 3 4 5 6 7 8]
One more pass at gap 1 finds nothing to swap, so the sort ends with the array in order: [1 2 3 4 5 6 7 8].
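The walkthrough above can be reproduced with a small standalone sketch. It sorts a plain int array with a shrink factor of 1.3, as in the walkthrough, rather than Mahout's parallel arrays and its 1.247330950103979 factor; the class and method names are made up for illustration.

```java
import java.util.Arrays;

// Standalone comb sort sketch; names are illustrative, not from Mahout.
final class CombSortSketch {

    // Sorts the array in place, ascending.
    static void combSort(int[] a) {
        int gap = a.length;
        boolean swapped = false;
        // Keep going while the gap is still shrinking or the last pass swapped something.
        while (gap > 1 || swapped) {
            if (gap > 1) {
                gap = (int) (gap / 1.3); // shrink the gap, rounding down
            }
            swapped = false;
            for (int i = 0; i + gap < a.length; i++) {
                if (a[i + gap] < a[i]) {
                    int tmp = a[i];
                    a[i] = a[i + gap];
                    a[i + gap] = tmp;
                    swapped = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] data = {8, 4, 3, 7, 6, 5, 2, 1};
        combSort(data);
        System.out.println(Arrays.toString(data)); // prints [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```

For this input the gaps are 6, 4, 3, 2, 1, followed by one more pass at gap 1 that makes no swaps and ends the loop.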
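Wikipedia has a vivid animation of the process: http://en.wikipedia.org/wiki/Comb_sort
Much of the sorting code in this source is well written and worth studying.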
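Mahout's designers consider the design complexity of PreferenceArray worthwhile because it reduces memory consumption by roughly 75%. Having seen the PreferenceArray design, we find that Mahout does not use Java's native Map and Set containers either. With storage efficiency in mind, it builds its own data structures, FastByIDMap and FastIDSet. With large data sets, saving a little storage on every record clearly adds up to a very large saving. This does not mean that Java's containers are inefficient or badly designed: Mahout redesigned these structures for its own specific use case, whereas HashSet, HashMap, and similar Java classes are designed as general-purpose containers.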
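The figure of roughly 75% can be sanity-checked with back-of-the-envelope arithmetic. The sketch below rests on an assumption that is not taken from the source code above: that one Preference object costs about 48 bytes on a 64-bit JVM once the object header, padding, and the reference to it are counted. The 12 bytes for the array layout follow from the parallel arrays themselves (one 8-byte long ID plus one 4-byte float value per preference). The class name is hypothetical.

```java
// Rough estimate only; OBJECT_BYTES is an assumed figure, not a measurement.
final class PreferenceMemoryEstimate {
    // Assumption: approximate cost of one Preference object plus the reference to it.
    static final int OBJECT_BYTES = 48;
    // One long (8 bytes) in the ID array plus one float (4 bytes) in the value array.
    static final int ARRAY_BYTES = 8 + 4;

    // Fraction of memory saved per preference by the parallel-array layout.
    static double savings() {
        return 1.0 - (double) ARRAY_BYTES / OBJECT_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(savings()); // prints 0.75
    }
}
```

The real per-object cost depends on the JVM and its settings, so the result is an order-of-magnitude check, not a measurement.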
FastByIDMap is also hash-based, but it uses linear probing rather than separate chaining.
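A short digression on how hash tables handle overflow (collisions):
1. Linear probing
Compute the hash address of the element to insert. If the slot at that address is empty, insert the new element there. If the element hashes to a slot that is already full, another slot has to be found; the simplest approach is to put the element in the nearest slot that is not full.
2. Separate chaining
Elements are allowed to share the same hash address; elements with the same address are linked together in a linked list.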
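As a rough illustration of the first strategy, here is a minimal linear-probing table with primitive long keys. It is a sketch under stated assumptions, not FastByIDMap: the class name, the fixed capacity, the lack of removal and resizing, and the use of Long.MIN_VALUE as the "empty slot" marker are all choices made for this example.

```java
import java.util.Arrays;

// Illustrative open-addressing map from long keys to float values; not Mahout code.
final class LongProbingMapSketch {
    private static final long EMPTY = Long.MIN_VALUE; // marks an unused slot
    private final long[] keys;
    private final float[] values;
    private int size;

    LongProbingMapSketch(int capacity) {
        if (capacity < 2) {
            throw new IllegalArgumentException("capacity must be at least 2");
        }
        keys = new long[capacity];
        values = new float[capacity];
        Arrays.fill(keys, EMPTY);
    }

    // Returns the slot holding key, or the first empty slot met while probing.
    int slotOf(long key) {
        int i = (int) Math.floorMod(key, (long) keys.length);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) % keys.length; // collision: step to the next slot
        }
        return i;
    }

    void put(long key, float value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        int i = slotOf(key);
        if (keys[i] == EMPTY) {
            // One slot always stays empty so that probing is guaranteed to stop.
            if (size == keys.length - 1) {
                throw new IllegalStateException("table full");
            }
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    // Returns the stored value, or null if the key is absent.
    Float get(long key) {
        if (key == EMPTY) {
            return null;
        }
        int i = slotOf(key);
        if (keys[i] == key) {
            return values[i];
        }
        return null;
    }

    int size() { return size; }
}
```

With a capacity of 8, keys 1, 9, and 17 all hash to slot 1, so they end up in slots 1, 2, and 3. No per-entry objects or links are allocated, which is the storage advantage over chaining.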
HashMap resolves collisions with separate chaining. Let's look at its source:
    /**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with key, or
     *         null if there was no mapping for key.
     *         (A null return can also indicate that the map
     *         previously associated null with key.)
     */
    public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }
The key part is the addEntry method:
    /**
     * Adds a new entry with the specified key, value and hash code to
     * the specified bucket. It is the responsibility of this
     * method to resize the table if appropriate.
     *
     * Subclass overrides this to alter the behavior of put method.
     */
    void addEntry(int hash, K key, V value, int bucketIndex) {
        Entry<K,V> e = table[bucketIndex];
        table[bucketIndex] = new Entry<>(hash, key, value, e);
        if (size++ >= threshold)
            resize(2 * table.length);
    }

The code of this method is simple, but it embodies a design decision: the new Entry is always placed at index bucketIndex of the table array. If an Entry already exists at that index, the new Entry points to the existing one, which produces an Entry chain. If there is no Entry at that index (the variable e in the code above is null), the new Entry points to null and no chain is formed. As long as there are no hash collisions and no linked list has formed, HashMap lookups are fast: get() locates the element directly. Once a chain exists, a bucket holds a chain of Entries rather than a single Entry, and the lookup has to walk the chain in order until it finds the Entry it wants. If that Entry sits at the very end of the chain (it was the first one put into the bucket), the whole chain must be traversed to find it.
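A newly created HashMap has a default load factor of 0.75, a compromise between time and space costs. A larger load factor reduces the memory used by the hash table (the Entry array) but increases lookup time, and lookup is the most frequent operation (both get() and put() perform one). A smaller load factor speeds up lookups but increases the memory used by the hash table.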
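The trade-off can be made concrete with a little arithmetic. The sketch below mirrors the resize rule in the addEntry code shown earlier (the table doubles once the entry count exceeds capacity × load factor). The class and method names are invented for illustration, and the starting capacity of 16 is java.util.HashMap's default initial capacity rather than something stated in the text above.

```java
// Arithmetic sketch of the capacity/load-factor trade-off; not HashMap code.
final class LoadFactorSketch {

    // Number of entries a table of this capacity holds before it must grow.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    // Table capacity after inserting the given number of entries, doubling as needed.
    static int capacityAfter(int entries, int initialCapacity, float loadFactor) {
        if (initialCapacity <= 0 || loadFactor <= 0) {
            throw new IllegalArgumentException("capacity and load factor must be positive");
        }
        int capacity = initialCapacity;
        while (entries > threshold(capacity, loadFactor)) {
            capacity *= 2;
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f));          // prints 12
        System.out.println(capacityAfter(100, 16, 0.75f)); // prints 256
        System.out.println(capacityAfter(100, 16, 1.0f));  // prints 128
    }
}
```

For 100 entries, a load factor of 0.75 ends with a 256-slot table while 1.0 ends with 128 slots: half the table memory, at the price of fuller buckets and longer chains to walk.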
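In Mahout, keys are always primitive long values rather than objects, because long saves memory and improves performance.
Mahout's Set implementation is not backed by a Map underneath. FastByIDMap can behave like a cache because it has a maximum size; beyond that size, infrequently used entries are removed.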
In Mahout, the top-level abstraction wrapping recommender input data is DataModel, an interface defined as follows:
    package org.apache.mahout.cf.taste.model;

    import java.io.Serializable;

    import org.apache.mahout.cf.taste.common.Refreshable;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.common.FastIDSet;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

    /**
     * Implementations represent a repository of information about users and their associated {@link Preference}s
     * for items.
     */
    public interface DataModel extends Refreshable, Serializable {

      /**
       * @return all user IDs in the model, in order
       * @throws TasteException
       *           if an error occurs while accessing the data
       */
      LongPrimitiveIterator getUserIDs() throws TasteException;

      /**
       * @param userID
       *          ID of user to get prefs for
       * @return user's preferences, ordered by item ID
       * @throws org.apache.mahout.cf.taste.common.NoSuchUserException
       *           if the user does not exist
       * @throws TasteException
       *           if an error occurs while accessing the data
       */
      PreferenceArray getPreferencesFromUser(long userID) throws TasteException;

      /**
       * @param userID
       *          ID of user to get prefs for
       * @return IDs of items user expresses a preference for
       * @throws org.apache.mahout.cf.taste.common.NoSuchUserException
       *           if the user does not exist
       * @throws TasteException
       *           if an error occurs while accessing the data
       */
      FastIDSet getItemIDsFromUser(long userID) throws TasteException;

      /**
       * @return a {@link LongPrimitiveIterator} of all item IDs in the model, in order
       * @throws TasteException
       *           if an error occurs while accessing the data
       */
      LongPrimitiveIterator getItemIDs() throws TasteException;

      /**
       * @param itemID
       *          item ID
       * @return all existing {@link Preference}s expressed for that item, ordered by user ID, as an array
       * @throws org.apache.mahout.cf.taste.common.NoSuchItemException
       *           if the item does not exist
       * @throws TasteException
       *           if an error occurs while accessing the data
       */
      PreferenceArray getPreferencesForItem(long itemID) throws TasteException;

      /**
       * Retrieves the preference value for a single user and item.
       *
       * @param userID
       *          user ID to get pref value from
       * @param itemID
       *          item ID to get pref value for
       * @return preference value from the given user for the given item or null if none exists
       * @throws org.apache.mahout.cf.taste.common.NoSuchUserException
       *           if the user does not exist
       * @throws TasteException
       *           if an error occurs while accessing the data
       */
      Float getPreferenceValue(long userID, long itemID) throws TasteException;

      /**
       * Retrieves the time at which a preference value from a user and item was set, if known.
       * Time is expressed in the usual way, as a number of milliseconds since the epoch.
       *
       * @param userID user ID for preference in question
       * @param itemID item ID for preference in question
       * @return time at which preference was set or null if no preference exists or its time is not known
       * @throws org.apache.mahout.cf.taste.common.NoSuchUserException if the user does not exist
       * @throws TasteException if an error occurs while accessing the data
       */
      Long getPreferenceTime(long userID, long itemID) throws TasteException;

      /**
       * @return total number of items known to the model. This is generally the union of all items preferred by
       *         at least one user but could include more.
       * @throws TasteException
       *           if an error occurs while accessing the data
       */
      int getNumItems() throws TasteException;

      /**
       * @return total number of users known to the model.
       * @throws TasteException
       *           if an error occurs while accessing the data
       */
      int getNumUsers() throws TasteException;

      /**
       * @param itemID item ID to check for
       * @return the number of users who have expressed a preference for the item
       * @throws TasteException if an error occurs while accessing the data
       */
      int getNumUsersWithPreferenceFor(long itemID) throws TasteException;

      /**
       * @param itemID1 first item ID to check for
       * @param itemID2 second item ID to check for
       * @return the number of users who have expressed a preference for the items
       * @throws TasteException if an error occurs while accessing the data
       */
      int getNumUsersWithPreferenceFor(long itemID1, long itemID2) throws TasteException;

      /**
       * Sets a particular preference (item plus rating) for a user.
       *
       * @param userID
       *          user to set preference for
       * @param itemID
       *          item to set preference for
       * @param value
       *          preference value
       * @throws org.apache.mahout.cf.taste.common.NoSuchItemException
       *           if the item does not exist
       * @throws org.apache.mahout.cf.taste.common.NoSuchUserException
       *           if the user does not exist
       * @throws TasteException
       *           if an error occurs while accessing the data
       */
      void setPreference(long userID, long itemID, float value) throws TasteException;

      /**
       * Removes a particular preference for a user.
       *
       * @param userID
       *          user from which to remove preference
       * @param itemID
       *          item to remove preference for
       * @throws org.apache.mahout.cf.taste.common.NoSuchItemException
       *           if the item does not exist
       * @throws org.apache.mahout.cf.taste.common.NoSuchUserException
       *           if the user does not exist
       * @throws TasteException
       *           if an error occurs while accessing the data
       */
      void removePreference(long userID, long itemID) throws TasteException;

      /**
       * @return true if this implementation actually stores and returns distinct preference values;
       *         that is, if it is not a 'boolean' DataModel
       */
      boolean hasPreferenceValues();

      /**
       * @return the maximum preference value that is possible in the current problem domain being evaluated. For
       * example, if the domain is movie ratings on a scale of 1 to 5, this should be 5. While a
       * {@link org.apache.mahout.cf.taste.recommender.Recommender} may estimate a preference value above 5.0, it
       * isn't "fair" to consider that the system is actually suggesting an impossible rating of, say, 5.4 stars.
       * In practice the application would cap this estimate to 5.0. Since evaluators evaluate
       * the difference between estimated and actual value, this at least prevents this effect from unfairly
       * penalizing a {@link org.apache.mahout.cf.taste.recommender.Recommender}
       */
      float getMaxPreference();

      /**
       * @see #getMaxPreference()
       */
      float getMinPreference();

    }
The simplest DataModel implementation is the GenericDataModel class. Here is a short usage example:
    FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();
    PreferenceArray prefsForUser1 = new GenericUserPreferenceArray(10);
    prefsForUser1.setUserID(0, 1L);
    prefsForUser1.setItemID(0, 101L);
    prefsForUser1.setValue(0, 3.0f);
    prefsForUser1.setItemID(1, 102L);
    prefsForUser1.setValue(1, 4.5f);
    //...(8 more)
    preferences.put(1L, prefsForUser1);
    DataModel model = new GenericDataModel(preferences);
    System.out.println(model);

Of course, this is only the simplest usage. In practice we more often read data from a file-based model or from a database, and Mahout provides designs for those scenarios as well.