First, create the database and the corresponding table in MySQL.
mysql> create database mahout;
Query OK, 1 row affected (0.00 sec)

mysql> use mahout;
Database changed

mysql> create table intro(
    -> uid varchar(20) not null,
    -> iid varchar(50) not null,
    -> val varchar(50) not null,
    -> time varchar(50) default null
    -> );
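Note: the computation consumes a lot of resources, so it is advisable to add indexes and to set tuning parameters in my.ini. (This walkthrough skips both, since the goal here is only to get things working.)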
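Insert the data. This uses the data from the first recommender example in Mahout in Action. Note that the blank lines in the file must be deleted first, otherwise MySQL complains that a column cannot be null.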
mysql> load data local infile 'D:/intro.csv' replace into table intro fields terminated by ',' lines terminated by '\n' (@col1,@col2,@col3) set uid=@col1,iid=@col2,val=@col3;
Query OK, 21 rows affected (0.19 sec)
Records: 21 Deleted: 0 Skipped: 0 Warnings: 0
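Deleting the blank lines by hand gets tedious for larger files. A minimal sketch of doing it in Java, using only the standard library, is shown here. The class name and the command-line paths are made up for illustration; nothing in Mahout or MySQL requires this step to be done this way.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CsvBlankLineStripper {

    /** Returns the input lines with empty and whitespace-only lines removed. */
    public static List<String> stripBlankLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java CsvBlankLineStripper <input.csv> <output.csv>
        if (args.length < 2) {
            System.err.println("usage: CsvBlankLineStripper <input.csv> <output.csv>");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        List<String> cleaned = stripBlankLines(Files.readAllLines(in, StandardCharsets.UTF_8));
        // Join with '\n' explicitly so the file matches "lines terminated by '\n'"
        // in the load data statement, regardless of the platform line separator.
        String content = String.join("\n", cleaned) + "\n";
        Files.write(out, content.getBytes(StandardCharsets.UTF_8));
    }
}
```

The cleaned output file can then be passed to load data local infile in place of the original.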
Take a look at the data.
mysql> select * from intro;
+-----+-----+-----+------+
| uid | iid | val | time |
+-----+-----+-----+------+
| 1   | 101 | 5.0 | NULL |
| 1   | 102 | 3.0 | NULL |
| 1   | 103 | 2.5 | NULL |
| 2   | 101 | 2.0 | NULL |
| 2   | 102 | 2.5 | NULL |
| 2   | 103 | 5.0 | NULL |
| 2   | 104 | 2.0 | NULL |
| 3   | 101 | 2.5 | NULL |
| 3   | 104 | 4.0 | NULL |
| 3   | 105 | 4.5 | NULL |
| 3   | 107 | 5.0 | NULL |
| 4   | 101 | 5.0 | NULL |
| 4   | 103 | 3.0 | NULL |
| 4   | 104 | 4.5 | NULL |
| 4   | 106 | 4.0 | NULL |
| 5   | 101 | 4.0 | NULL |
| 5   | 102 | 3.0 | NULL |
| 5   | 103 | 2.0 | NULL |
| 5   | 104 | 4.0 | NULL |
| 5   | 105 | 3.5 | NULL |
| 5   | 106 | 4.0 | NULL |
+-----+-----+-----+------+
21 rows in set (0.00 sec)
Next comes the actual program. It is written quite simply, mainly just to get the functionality working.
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.JDBCDataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

public class MysqlJDBCRecommender {
    public static void main(String[] args) throws Exception {
        // Plain, unpooled data source pointing at the local "mahout" database.
        MysqlDataSource dataSource = new MysqlDataSource();
        dataSource.setServerName("localhost");
        dataSource.setUser("root");
        dataSource.setPassword("toor");
        dataSource.setDatabaseName("mahout");

        // Table name, then the user, item, preference and timestamp column names.
        JDBCDataModel dataModel = new MySQLJDBCDataModel(dataSource, "intro", "uid", "iid", "val", "time");
        DataModel model = dataModel;

        // User-based recommender: Pearson similarity, neighborhood of the 2 nearest users.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Ask for up to 3 recommendations for user 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }
    }
}
Output:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/D:/java%e9%a9%b1%e5%8a%a8/mahout0.7/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/D:/java%e9%a9%b1%e5%8a%a8/mahout0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/D:/java%e9%a9%b1%e5%8a%a8/mahout0.7/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/12/07 13:56:41 WARN jdbc.AbstractJDBCDataModel: You are not using ConnectionPoolDataSource. Make sure your DataSource pools connections to the database itself, or database performance will be severely reduced.
RecommendedItem[item:104, value:4.257081]
RecommendedItem[item:106, value:4.0]
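To see where 4.257081 and 4.0 come from, the calculation can be redone by hand. The following is a plain-Java sketch, not Mahout's code: it computes the Pearson correlation over the items two users have both rated, then estimates a rating as the similarity-weighted average over the neighbors. The class and method names are my own, and the two nearest neighbors of user 1 (users 4 and 5) are passed in explicitly rather than searched for. The rule that an estimate needs at least two contributing neighbors is an assumption based on my reading of the Mahout 0.7 source; if it holds, it would explain why item 105 is missing and only two items come back although three were requested.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UserBasedSketch {

    // Ratings copied from the intro table: user -> (item -> rating).
    static final Map<Integer, Map<Integer, Double>> RATINGS = new LinkedHashMap<>();

    static void rate(int user, int item, double value) {
        RATINGS.computeIfAbsent(user, u -> new LinkedHashMap<>()).put(item, value);
    }

    static {
        rate(1, 101, 5.0); rate(1, 102, 3.0); rate(1, 103, 2.5);
        rate(2, 101, 2.0); rate(2, 102, 2.5); rate(2, 103, 5.0); rate(2, 104, 2.0);
        rate(3, 101, 2.5); rate(3, 104, 4.0); rate(3, 105, 4.5); rate(3, 107, 5.0);
        rate(4, 101, 5.0); rate(4, 103, 3.0); rate(4, 104, 4.5); rate(4, 106, 4.0);
        rate(5, 101, 4.0); rate(5, 102, 3.0); rate(5, 103, 2.0); rate(5, 104, 4.0);
        rate(5, 105, 3.5); rate(5, 106, 4.0);
    }

    /** Pearson correlation over the items both users rated; NaN when undefined. */
    public static double pearson(int a, int b) {
        Map<Integer, Double> ra = RATINGS.get(a);
        Map<Integer, Double> rb = RATINGS.get(b);
        int n = 0;
        double sumA = 0, sumB = 0;
        for (Map.Entry<Integer, Double> e : ra.entrySet()) {
            Double other = rb.get(e.getKey());
            if (other != null) {
                n++;
                sumA += e.getValue();
                sumB += other;
            }
        }
        if (n == 0) {
            return Double.NaN;
        }
        double meanA = sumA / n, meanB = sumB / n;
        double sxy = 0, sxx = 0, syy = 0;
        for (Map.Entry<Integer, Double> e : ra.entrySet()) {
            Double other = rb.get(e.getKey());
            if (other == null) {
                continue;
            }
            double da = e.getValue() - meanA;
            double db = other - meanB;
            sxy += da * db;
            sxx += da * da;
            syy += db * db;
        }
        double denom = Math.sqrt(sxx * syy);
        return denom == 0.0 ? Double.NaN : sxy / denom;
    }

    /** Similarity-weighted average of the neighbors' ratings for one item. */
    public static double estimate(int user, int item, int[] neighbors) {
        double weighted = 0, total = 0;
        int count = 0;
        for (int nb : neighbors) {
            Double rating = RATINGS.get(nb).get(item);
            double sim = pearson(user, nb);
            if (rating != null && !Double.isNaN(sim)) {
                weighted += sim * rating;
                total += sim;
                count++;
            }
        }
        // Assumption: an estimate backed by fewer than two neighbors is discarded.
        return count <= 1 ? Double.NaN : weighted / total;
    }

    public static void main(String[] args) {
        for (int other = 2; other <= 5; other++) {
            System.out.printf("similarity(1, %d) = %.4f%n", other, pearson(1, other));
        }
        int[] neighbors = {4, 5}; // the two users most similar to user 1
        for (int item : new int[] {104, 105, 106}) {
            System.out.printf("estimate(1, %d) = %.6f%n", item, estimate(1, item, neighbors));
        }
    }
}
```

With users 4 and 5 as neighbors (similarities of 1.0 and roughly 0.9449), item 104 works out to (1.0 × 4.5 + 0.9449 × 4.0) / 1.9449, which is about 4.2571, and item 106 to exactly 4.0 because both neighbors rated it 4.0. That agrees with the two RecommendedItem lines in the output.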
Recommendations from the MySQLJDBCDataModel API documentation
A JDBCDataModel backed by a MySQL database and accessed via JDBC. It may work with other JDBC databases. By default, this class assumes that there is a DataSource available under the JNDI name "jdbc/taste", which gives access to a database with a "taste_preferences" table with the following schema:
user_id | item_id | preference |
---|---|---|
987 | 123 | 0.9 |
987 | 456 | 0.1 |
654 | 123 | 0.2 |
654 | 789 | 0.3 |
preference must have a type compatible with the Java float type. user_id and item_id should be compatible with long type (BIGINT). For example, the following command sets up a suitable table in MySQL, complete with primary key and indexes:
CREATE TABLE taste_preferences (
  user_id BIGINT NOT NULL,
  item_id BIGINT NOT NULL,
  preference FLOAT NOT NULL,
  PRIMARY KEY (user_id, item_id),
  INDEX (user_id),
  INDEX (item_id)
)
The table may optionally have a timestamp column whose type is compatible with Java long.
See the notes in AbstractJDBCDataModel regarding using connection pooling. It's pretty vital to performance.
Some experimentation suggests that MySQL's InnoDB engine is faster than MyISAM for these kinds of applications. While MyISAM is the default and, I believe, generally considered the lighter-weight and faster of the two engines, my guess is the row-level locking of InnoDB helps here. Your mileage may vary.
Here are some key settings that can be tuned for MySQL, and suggested size for a data set of around 1 million elements:
innodb_buffer_pool_size=64M
myisam_sort_buffer_size=64M
query_cache_limit=64M
query_cache_min_res_unit=512K
query_cache_type=1
query_cache_size=64M
Also consider setting some parameters on the MySQL Connector/J driver:
cachePreparedStatements = true
cachePrepStmts = true
cacheResultSetMetadata = true
alwaysSendSetIsolation = false
elideSetAutoCommits = true
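One common way to pass driver settings to Connector/J is as query parameters on the JDBC URL. The sketch below only builds such a URL string from the property names listed above, copied verbatim from the documentation; I have not checked each of them against a particular driver version. The class and method names are hypothetical, and the host and database are the ones used in the program earlier in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class ConnectorJUrlSketch {

    /** Builds a jdbc:mysql URL with the given properties appended as a query string. */
    public static String buildUrl(String host, String database, Map<String, String> props) {
        String base = "jdbc:mysql://" + host + "/" + database;
        if (props.isEmpty()) {
            return base;
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : props.entrySet()) {
            query.add(e.getKey() + "=" + e.getValue());
        }
        return base + "?" + query;
    }

    /** The driver settings suggested in the documentation, in the order listed. */
    public static Map<String, String> suggestedProps() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("cachePreparedStatements", "true");
        props.put("cachePrepStmts", "true");
        props.put("cacheResultSetMetadata", "true");
        props.put("alwaysSendSetIsolation", "false");
        props.put("elideSetAutoCommits", "true");
        return props;
    }

    public static void main(String[] args) {
        // Host and database match the example program; adjust for your setup.
        System.out.println(buildUrl("localhost", "mahout", suggestedProps()));
    }
}
```

The resulting string could then be given to the driver in place of the separate server and database name settings.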
Thanks to Amila Jayasooriya for contributing MySQL notes above as part of Google Summer of Code 2007.