Deal with relational data using libFM with blocks

Original English post: https://thierrysilbermann.wordpress.com/2015/09/17/deal-with-relational-data-using-libfm-with-blocks/

train.libfm

5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20

and test.libfm

0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1

I will merge them, which makes the whole process easier.

dataset.libfm

5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1

So if we want to use the block structure, we will have 5 files:

  • rel_user.libfm (features 0, 1 and 6-8 are user features)

0 0:1 6:1
0 1:1 8:1

But in fact you can avoid having the feature IDs broken up like this (0-1, 6-8). We can compress them, so that 0-1 -> 0-1 and 6-8 -> 2-4:

0 0:1 2:1
0 1:1 4:1

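  • rel_product.libfm (features 2-5 and 9 are product features). In the same way, we can compress the IDs: the first listing uses the original IDs, the second the compressed ones (2-5 -> 0-3 and 9 -> 4).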

0 2:1 9:12.5
0 3:1 9:20
0 4:1 9:78
0 5:1

0 0:1 4:12.5
0 1:1 4:20
0 2:1 4:78
0 3:1

  • rel_user.train (this is now the mapping: the first 3 lines point to the first line of rel_user.libfm | /!\ we are using 0-based indexing)

0
0
0
1
1

  • rel_product.train (the mapping)

0
1
2
0
1

  • y.train, which contains only the ratings

5
5
4
1
1
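The five files above can be derived mechanically from the merged dataset. The following is only a minimal sketch of that idea, not part of libFM: the helper name build_block and the variable names are made up for illustration, and the feature IDs of each block are assumed to be known in advance.

```python
def build_block(rows, feature_ids):
    """Extract one block from libFM-format rows.

    rows: strings such as "5 0:1 2:1 6:1 9:12.5"
    feature_ids: the feature IDs that belong to this block
    Returns (block_lines, mapping): the distinct block rows with compressed
    feature IDs, and for every input row the 0-based index of its block row.
    """
    # Compress the IDs of this block to 0..n-1, keeping their order
    new_id = {old: new for new, old in enumerate(sorted(feature_ids))}
    block_lines, index_of, mapping = [], {}, []
    for row in rows:
        pairs = [token.split(":") for token in row.split()[1:]]
        kept = " ".join(
            "{0}:{1}".format(new_id[int(fid)], value)
            for fid, value in pairs
            if int(fid) in new_id
        )
        if kept not in index_of:
            # First time this user/product appears: add a row to the block file
            index_of[kept] = len(block_lines)
            block_lines.append("0 " + kept)
        mapping.append(index_of[kept])
    return block_lines, mapping


# The merged dataset from above: 5 training rows followed by 2 test rows
dataset = [
    "5 0:1 2:1 6:1 9:12.5",
    "5 0:1 3:1 6:1 9:20",
    "4 0:1 4:1 6:1 9:78",
    "1 1:1 2:1 8:1 9:12.5",
    "1 1:1 3:1 8:1 9:20",
    "0 1:1 4:1 8:1 9:78",
    "0 0:1 5:1 6:1",
]
n_train = 5

user_lines, user_map = build_block(dataset, [0, 1, 6, 7, 8])
product_lines, product_map = build_block(dataset, [2, 3, 4, 5, 9])
targets = [row.split()[0] for row in dataset]

print("\n".join(user_lines))     # content of rel_user.libfm
print(user_map[:n_train])        # content of rel_user.train
print(product_map[:n_train])     # content of rel_product.train
print(targets[:n_train])         # content of y.train
```

Because the blocks are built from the merged dataset, the entries of each mapping after the first n_train rows are the ones that belong to the test set.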

Almost done…

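Now you need to create the .x and .xt files for the user block and the product block. For this you need the tools that are available in libFM's bin/ directory once you have compiled them.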

./bin/convert --ifile rel_user.libfm --ofilex rel_user.x --ofiley rel_user.y

You are forced to use the --ofiley flag even though rel_user.y will never be used. You can delete it every time.

Then

./bin/transpose --ifile rel_user.x --ofile rel_user.xt
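Since the same two commands have to be repeated for every block, they can be scripted. This is a small sketch, assuming the tools live in ./bin, the flags are spelled with two ASCII hyphens, and the block files are named <block>.libfm. The helper names block_commands and run_all are made up for illustration.

```python
import subprocess


def block_commands(block, bin_dir="./bin"):
    """Return the convert and transpose command lines for one block."""
    return [
        [bin_dir + "/convert", "--ifile", block + ".libfm",
         "--ofilex", block + ".x", "--ofiley", block + ".y"],
        [bin_dir + "/transpose", "--ifile", block + ".x",
         "--ofile", block + ".xt"],
    ]


def run_all(blocks, bin_dir="./bin"):
    """Run convert, then transpose, for every block; stop on the first error."""
    for block in blocks:
        for command in block_commands(block, bin_dir):
            subprocess.run(command, check=True)


# Example (requires the compiled libFM tools and the .libfm block files):
# run_all(["rel_user", "rel_product"])
```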

Now you can do the same thing for the test set. Because we merged the train and test datasets at the beginning, we only need to generate rel_user.test, rel_product.test and y.test.

At this point, you will have a lot of files: (rel_user.train, rel_user.test, rel_user.x, rel_user.xt, rel_product.train, rel_product.test, rel_product.x, rel_product.xt, y.train, y.test)
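With this many files it is easy to forget one. The following sketch lists the files expected for a set of blocks and reports the ones that are missing. The helper names are made up for illustration, and the file naming simply follows the list above.

```python
from pathlib import Path


def expected_files(blocks, target="y"):
    """List the file names expected for the given blocks and target name."""
    names = []
    for block in blocks:
        names += [block + "." + ext for ext in ("train", "test", "x", "xt")]
    names += [target + ".train", target + ".test"]
    return names


def missing_files(blocks, directory=".", target="y"):
    """Return the expected files that do not exist in `directory`."""
    return [name for name in expected_files(blocks, target)
            if not (Path(directory) / name).exists()]


print(expected_files(["rel_user", "rel_product"]))
```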

And run:

./bin/libFM -task r -train y.train -test y.test --relation rel_user,rel_product -out output

It's a bit overkill for this problem, but I hope you get the point.

Now a real example

For this example, I will use ml-1m.zip, which you can get from the MovieLens dataset page (1 million ratings).

ratings.dat (sample) / Format: UserID::MovieID::Rating::Timestamp

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291

movies.dat (sample) / Format: MovieID::Title::Genres

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama

users.dat (sample) / Format: UserID::Gender::Age::Occupation::Zip-code

1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455

I will create three different models.

  1. The simplest libFM files, trained without blocks. I'll use these features: UserID, MovieID
  2. Regular libFM files, trained without blocks. I'll use these features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)
  3. libFM files trained with blocks. I'll use the same features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)

Models 1 and 2 can be created using the following code:

# -*- coding: utf-8 -*-
__author__ = 'Silbermann Thierry'
__license__ = 'WTFPL'

import pandas as pd
import numpy as np


def create_libfm(w_filename, model_lvl=1):

    # Load the data
    file_ratings = 'ratings.dat'
    data_ratings = pd.read_csv(file_ratings, delimiter='::', engine='python',
                               names=['UserID', 'MovieID', 'Ratings', 'Timestamp'])

    # movies.dat is assumed here to be Latin-1 encoded rather than UTF-8
    file_movies = 'movies.dat'
    data_movies = pd.read_csv(file_movies, delimiter='::', engine='python',
                              encoding='latin-1',
                              names=['MovieID', 'Name', 'Genre_list'])

    file_users = 'users.dat'
    data_users = pd.read_csv(file_users, delimiter='::', engine='python',
                             names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'])

    # Transform data: drop the unused columns
    data_movies = data_movies.drop(['Name'], axis=1)
    data_users = data_users.drop(['ZipCode'], axis=1)

    # Attach the user and movie attributes to every rating, so that
    # row i of `data` holds all the features of rating i
    data = data_ratings.drop(['Timestamp'], axis=1)
    data = data.merge(data_users, on='UserID', how='left')
    data = data.merge(data_movies, on='MovieID', how='left')

    print('Data loaded')

    # Map the data: every feature gets its own contiguous range of IDs.
    # The whole genre list of a movie is treated as one categorical value.
    feat = ['UserID', 'MovieID']
    if model_lvl > 1:
        feat.extend(['Gender', 'Age', 'Occupation', 'Genre_list'])

    offset = 0
    dict_array = []
    for feature_name in feat:
        uniq = np.unique(data[feature_name])
        dict_array.append({key: value + offset
                           for value, key in enumerate(uniq)})
        offset += len(uniq)

    print('Mapping done')

    # Create libFM file: "<rating> <id>:1 <id>:1 ..." for every rating
    with open(w_filename, 'w') as w:
        for row in data[['Ratings'] + feat].itertuples(index=False):
            s = "{0}".format(row[0])
            for index_feat, value in enumerate(row[1:]):
                s += " {0}:1".format(dict_array[index_feat][value])
            s += '\n'
            w.write(s)


if __name__ == '__main__':
    create_libfm('model1.libfm', 1)
    create_libfm('model2.libfm', 2)
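The mapping step in the script is easier to follow on a toy example. The sketch below uses plain Python and made-up helper names (build_feature_maps, to_libfm_line). It only illustrates the idea of giving each feature its own contiguous range of IDs, not the exact output of the script above.

```python
def build_feature_maps(columns):
    """Give every distinct value of every column its own feature ID.

    columns: dict mapping a column name to the list of its raw values.
    Returns (maps, n_features), where maps[name][raw_value] is the feature ID.
    """
    maps, offset = {}, 0
    for name, values in columns.items():
        uniq = sorted(set(values))
        maps[name] = {raw: offset + k for k, raw in enumerate(uniq)}
        offset += len(uniq)  # the next column starts after this one
    return maps, offset


def to_libfm_line(target, row, maps):
    """Format one example as '<target> <id>:1 <id>:1 ...'."""
    return str(target) + "".join(
        " {0}:1".format(maps[name][value]) for name, value in row.items()
    )


# Three toy ratings: two users and two movies
columns = {"UserID": [1, 1, 2], "MovieID": [1193, 661, 1193]}
maps, n_features = build_feature_maps(columns)
line = to_libfm_line(5, {"UserID": 1, "MovieID": 1193}, maps)
print(maps)
print(line)
```

Here the two users take IDs 0-1 and the two movies take IDs 2-3, so no ID is shared between a user and a movie.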


So you end up with two files, model1.libfm and model2.libfm. You just need to split each of them in two to create a training set file and a test set file, which I'll call train_m1.libfm and test_m1.libfm (same thing for model 2: train_m2.libfm and test_m2.libfm).
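The post does not say how the files are split. One possible way is a random split, sketched below. The 80/20 ratio, the fixed seed and the helper names are assumptions made for illustration.

```python
import random


def split_lines(lines, test_fraction=0.2, seed=0):
    """Shuffle the lines and return (train_lines, test_lines)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_file(src, train_path, test_path, test_fraction=0.2, seed=0):
    """Split a libFM file into a training file and a test file."""
    with open(src) as f:
        lines = f.readlines()
    train, test = split_lines(lines, test_fraction, seed)
    with open(train_path, "w") as f:
        f.writelines(train)
    with open(test_path, "w") as f:
        f.writelines(test)


# Example:
# split_file("model1.libfm", "train_m1.libfm", "test_m1.libfm")
# split_file("model2.libfm", "train_m2.libfm", "test_m2.libfm")
```

Using the same seed for both files keeps the same ratings in the test set of model 1 and model 2, since both files list the ratings in the same order.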

Then you run libFM like this:

./libFM -train train_m1.libfm -test test_m1.libfm -task r -iter 20 -method mcmc -dim '1,1,8' -out output_m1

But I guess you already know how to do that.
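To compare the models afterwards, you can compute the RMSE from the prediction file yourself. This sketch assumes that the file written by libFM contains one prediction per line, in the same order as the test file. The helper names are made up for illustration.

```python
import math


def read_targets(libfm_path):
    """Read the target (first token of every line) from a libFM text file."""
    with open(libfm_path) as f:
        return [float(line.split()[0]) for line in f if line.strip()]


def read_predictions(output_path):
    """Read one prediction per line (assumed format of the output file)."""
    with open(output_path) as f:
        return [float(line) for line in f if line.strip()]


def rmse(targets, predictions):
    """Root mean squared error between two equally long, non-empty lists."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("targets and predictions must be non-empty "
                         "and have the same length")
    squared = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return math.sqrt(squared / len(targets))


# Example:
# print(rmse(read_targets("test_m1.libfm"), read_predictions("output_m1")))
```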

