Original blog post (English): https://thierrysilbermann.wordpress.com/2015/09/17/deal-with-relational-data-using-libfm-with-blocks/
train.libfm
5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
and test.libfm
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1
I'll merge them, which makes the rest of the process easier.
dataset.libfm
5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1
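The merge itself is just a concatenation. The one thing worth keeping is the number of training rows, so the two parts can be told apart again later. A minimal sketch, where `merge_libfm` is a hypothetical helper name and the rows are passed as lists of strings instead of being read from disk:

```python
def merge_libfm(train_lines, test_lines):
    """Concatenate train and test rows.

    Returns the merged rows and the number of training rows, which is
    needed later to cut the mapping files back into train and test parts.
    """
    train = [line.strip() for line in train_lines if line.strip()]
    test = [line.strip() for line in test_lines if line.strip()]
    return train + test, len(train)


train_rows = [
    "5 0:1 2:1 6:1 9:12.5",
    "5 0:1 3:1 6:1 9:20",
    "4 0:1 4:1 6:1 9:78",
    "1 1:1 2:1 8:1 9:12.5",
    "1 1:1 3:1 8:1 9:20",
]
test_rows = [
    "0 1:1 4:1 8:1 9:78",
    "0 0:1 5:1 6:1",
]

dataset, n_train = merge_libfm(train_rows, test_rows)
```

With real files you would pass `open('train.libfm')` and `open('test.libfm')` instead of the two lists.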
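So suppose we want to use the block structure.
We will have 5 files:
rel_user.libfm (features 0-1 and 6-8 are user features)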
0 0:1 6:1
0 1:1 8:1
In fact you can avoid having the feature IDs broken into separate ranges like (0-1, 6-8) by recompressing them, so that (0-1 -> 0-1 and 6-8 -> 2-4):
0 0:1 2:1
0 1:1 4:1
rel_product.libfm (features 2-5 and 9 are product features)
0 2:1 9:12.5
0 3:1 9:20
0 4:1 9:78
0 5:1
Same thing here, we can recompress the IDs, this time with (2-5 -> 0-3 and 9 -> 4):
0 0:1 4:12.5
0 1:1 4:20
0 2:1 4:78
0 3:1
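rel_user.train (the mapping: line i is the 0-based row of rel_user.libfm used by training example i, so the first 3 examples point to the first user)
0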
0
0
1
1
rel_product.train (the same kind of mapping, into rel_product.libfm)
0
1
2
0
1
y.train (the ratings only)
5
5
4
1
1
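The five files above can also be derived mechanically from dataset.libfm. The sketch below is one way to do it under stated assumptions: `parse_row` and `build_block` are hypothetical helper names, the sets of user and product feature IDs are given by hand, and block rows are numbered in order of first appearance. It also produces the test-side mappings, since the blocks are built from the merged file.

```python
def parse_row(line):
    """Split a libFM row into its target and a list of (feature_id, value) strings."""
    target, *pairs = line.split()
    return target, [tuple(pair.split(":")) for pair in pairs]


def build_block(rows, feature_ids):
    """Build one relational block from parsed rows.

    Returns (block_lines, mapping). block_lines holds each distinct feature
    combination once, with feature ids recompressed to 0..n-1. mapping[i] is
    the 0-based index of the block line used by row i.
    """
    new_id = {old: new for new, old in enumerate(sorted(feature_ids))}
    block_lines, seen, mapping = [], {}, []
    for _, pairs in rows:
        kept = [(new_id[int(f)], v) for f, v in pairs if int(f) in new_id]
        # Block rows carry a dummy target of 0, as in the files above
        line = " ".join(["0"] + ["{0}:{1}".format(f, v) for f, v in kept])
        if line not in seen:
            seen[line] = len(block_lines)
            block_lines.append(line)
        mapping.append(seen[line])
    return block_lines, mapping


dataset = [
    "5 0:1 2:1 6:1 9:12.5",
    "5 0:1 3:1 6:1 9:20",
    "4 0:1 4:1 6:1 9:78",
    "1 1:1 2:1 8:1 9:12.5",
    "1 1:1 3:1 8:1 9:20",
    "0 1:1 4:1 8:1 9:78",
    "0 0:1 5:1 6:1",
]
n_train = 5  # the first 5 rows come from train.libfm

rows = [parse_row(line) for line in dataset]
user_block, user_map = build_block(rows, {0, 1, 6, 7, 8})
product_block, product_map = build_block(rows, {2, 3, 4, 5, 9})

files = {
    "rel_user.libfm": user_block,
    "rel_product.libfm": product_block,
    "rel_user.train": user_map[:n_train],
    "rel_user.test": user_map[n_train:],
    "rel_product.train": product_map[:n_train],
    "rel_product.test": product_map[n_train:],
    "y.train": [target for target, _ in rows[:n_train]],
    "y.test": [target for target, _ in rows[n_train:]],
}
```

Each entry of `files` can then be written to disk with one value per line.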
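Almost done...
Now you need to create the .x and .xt files for the user block and the product block. For this you need the tools that ship with libFM, which end up in /bin/ after you compile them.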
./bin/convert --ifile rel_user.libfm --ofilex rel_user.x --ofiley rel_user.y
You are forced to use the --ofiley flag even though rel_user.y will never be used. You can delete that file every time.
then
./bin/transpose --ifile rel_user.x --ofile rel_user.xt
Now you can do the same thing for the test set. Because we merged the train and test datasets at the beginning, the blocks already cover the test rows, so we only need to generate rel_user.test, rel_product.test and y.test.
At this point you will have a lot of files: rel_user.train, rel_user.test, rel_user.x, rel_user.xt, rel_product.train, rel_product.test, rel_product.x, rel_product.xt, y.train, y.test
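Since the convert and transpose steps are identical for every block, they are easy to script. A minimal sketch, assuming the tools live in ./bin and are called with double-hyphen flags. `block_commands` and `prepare_blocks` are hypothetical helper names, not part of libFM:

```python
import os
import subprocess


def block_commands(block, bin_dir="./bin"):
    """Return the convert and transpose command lines for one block."""
    convert = [bin_dir + "/convert", "--ifile", block + ".libfm",
               "--ofilex", block + ".x", "--ofiley", block + ".y"]
    transpose = [bin_dir + "/transpose", "--ifile", block + ".x",
                 "--ofile", block + ".xt"]
    return [convert, transpose]


def prepare_blocks(blocks, bin_dir="./bin"):
    """Run convert and transpose for every block, then drop the unused .y file."""
    for block in blocks:
        for command in block_commands(block, bin_dir):
            subprocess.run(command, check=True)
        os.remove(block + ".y")


# Only print the commands here; running them needs the compiled libFM tools
for command in block_commands("rel_product"):
    print(" ".join(command))
```

Calling `prepare_blocks(["rel_user", "rel_product"])` from the directory that holds the .libfm block files would then produce the .x and .xt files for both blocks.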
And run:
./bin/libFM -task r -train y.train -test y.test --relation rel_user,rel_product -out output
It's a bit overkill for this problem, but I hope you get the point.
Now a real example
For this example I'll use ml-1m.zip, which you can download from the MovieLens dataset (1 million ratings).
ratings.dat (sample) / Format: UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
movies.dat (sample) / Format: MovieID::Title::Genres
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
users.dat (sample) / Format: UserID::Gender::Age::Occupation::Zip-code
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
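All three files share the same `::`-separated layout, so one small parser covers them. The sketch below is only an illustration of the format, with `parse_record` as a hypothetical helper name. It works on sample lines copied from above rather than on the real files:

```python
def parse_record(line, names):
    """Split one '::'-separated line into a dict keyed by the given field names."""
    values = line.rstrip("\n").split("::")
    return dict(zip(names, values))


rating = parse_record("1::1193::5::978300760",
                      ["UserID", "MovieID", "Rating", "Timestamp"])
movie = parse_record("1::Toy Story (1995)::Animation|Children's|Comedy",
                     ["MovieID", "Title", "Genres"])
user = parse_record("1::F::1::10::48067",
                    ["UserID", "Gender", "Age", "Occupation", "Zip-code"])

# The genres of a movie are themselves separated by '|'
genres = movie["Genres"].split("|")
```

All values come back as strings. Convert the rating to a number only if you need to compute with it.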
I'll create three different models.
# -*- coding: utf-8 -*-
__author__ = 'Silbermann Thierry'
__license__ = 'WTFPL'

import pandas as pd
import numpy as np


def create_libfm(w_filename, model_lvl=1):

    # Load the data (latin-1 because the ml-1m files are not valid UTF-8)
    file_ratings = 'ratings.dat'
    data_ratings = pd.read_csv(file_ratings, delimiter='::', engine='python',
                               encoding='latin-1',
                               names=['UserID', 'MovieID', 'Ratings', 'Timestamp'])

    file_movies = 'movies.dat'
    data_movies = pd.read_csv(file_movies, delimiter='::', engine='python',
                              encoding='latin-1',
                              names=['MovieID', 'Name', 'Genre_list'])

    file_users = 'users.dat'
    data_users = pd.read_csv(file_users, delimiter='::', engine='python',
                             encoding='latin-1',
                             names=['UserID', 'Genre', 'Age', 'Occupation', 'ZipCode'])

    # Transform data
    ratings = data_ratings['Ratings']
    data_ratings = data_ratings.drop(['Ratings', 'Timestamp'], axis=1)
    data_movies = data_movies.drop(['Name'], axis=1)
    data_users = data_users.drop(['ZipCode'], axis=1)

    # Attach the user and movie attributes to every rating row, so that all
    # features of a rating are read from the same row. A left merge keeps the
    # rows in the order of ratings.dat.
    data = data_ratings.merge(data_users, on='UserID', how='left')
    data = data.merge(data_movies, on='MovieID', how='left')

    print('Data loaded')

    # Map the data
    offset_array = [0]
    dict_array = []

    feat = ['UserID', 'MovieID']
    if model_lvl > 1:
        # 'Genre' is the user's gender. 'Genre_list' is used as a single
        # categorical value per distinct combination of genres.
        feat.extend(['Genre', 'Age', 'Occupation', 'Genre_list'])

    for feature_name in feat:
        uniq = np.unique(data[feature_name])
        offset_array.append(len(uniq) + offset_array[-1])
        # The offset is added here, once, so that the index ranges of
        # different features never overlap
        dict_array.append({key: value + offset_array[-2]
                           for value, key in enumerate(uniq)})

    print('Mapping done')

    # Create libFM file: one row per rating, one "index:1" entry per feature
    columns = [data[name].map(mapping) for name, mapping in zip(feat, dict_array)]
    with open(w_filename, 'w') as w:
        for row in zip(ratings, *columns):
            s = "{0}".format(row[0])
            for index in row[1:]:
                s += " {0}:1".format(index)
            w.write(s + '\n')


if __name__ == '__main__':
    create_libfm('model1.libfm', 1)
    create_libfm('model2.libfm', 2)
So you end up with two files, model1.libfm and model2.libfm. You just need to split each of them in two to create a training set and a test set, which I'll call train_m1.libfm and test_m1.libfm (same thing for model2: train_m2.libfm and test_m2.libfm).
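The post does not say how the split is done, so the sketch below is only one possibility: a shuffled split with a fixed seed. The helper name `split_rows`, the 20% test fraction and the seed are all assumptions made for illustration.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and cut them into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def split_file(path, train_path, test_path, test_fraction=0.2, seed=0):
    """Split a libFM file on disk into a training file and a test file."""
    with open(path) as source:
        train, test = split_rows(source.readlines(), test_fraction, seed)
    with open(train_path, 'w') as out:
        out.writelines(train)
    with open(test_path, 'w') as out:
        out.writelines(test)


# Small in-memory demonstration with made-up rows
rows = ["{0} {1}:1 {2}:1\n".format(1 + i % 5, i, 10 + i) for i in range(10)]
train, test = split_rows(rows)
```

For the files produced above, the call would be `split_file('model1.libfm', 'train_m1.libfm', 'test_m1.libfm')`, and likewise for model2.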
Then you run libFM like this:
./libFM -train train_m1.libfm -test test_m1.libfm -task r -iter 20 -method mcmc -dim '1,1,8' -out output_m1
But I guess you already know how to do that.