CTR Prediction Papers in Practice (4): Field-aware Factorization Machines for CTR Prediction

FFM Model in Practice

The code is written in Python 3.5 with TensorFlow 1.10.1. All of the blog's code has been uploaded to GitHub, so feel free to follow and star!

1. Dataset

Dataset: MovieLens

This post uses the MovieLens-100K data, which includes u.item, u.user, ua.base, and ua.test. The format of u.item is:

movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western |

The format of u.user is:

user id | age | gender | occupation | zip code

The format of ua.base and ua.test is:

user id | item id | rating | timestamp

Ratings equal to 5 are treated as click data and ratings below 5 as non-click data, which turns the task into a binary classification problem.
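
Concretely, the label construction is a one-line mapping; a small illustration with toy ratings, mirroring the apply call used in the preprocessing code below:

import pandas as pd

ratings = pd.Series([5, 3, 4, 5])
clicks = ratings.apply(lambda r: 1 if int(r) == 5 else 0)
print(clicks.tolist())  # [1, 0, 0, 1]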

2. FFM

2.1 Data Preprocessing

To use the FFM model, the data first has to be arranged into a matrix. This post uses pandas to build the input matrix and to one-hot encode the labels.

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

def onehot_encoder(labels, NUM_CLASSES):
    # Map raw labels to integer ids 0..NUM_CLASSES-1.
    enc = LabelEncoder()
    labels = enc.fit_transform(labels)
    labels = labels.astype(np.int32)
    batch_size = tf.size(labels)
    labels = tf.expand_dims(labels, 1)
    indices = tf.expand_dims(tf.range(0, batch_size, 1), 1)
    concated = tf.concat([indices, labels], 1)
    # Scatter 1.0 at each (row, label) position to build the one-hot matrix.
    onehot_labels = tf.sparse_to_dense(concated, tf.stack([batch_size, NUM_CLASSES]), 1.0, 0.0)
    with tf.Session() as sess:
        return sess.run(onehot_labels)
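
Spinning up a TensorFlow session just to one-hot encode labels is heavyweight; an equivalent, session-free version using only NumPy would be (a sketch; onehot_encoder_np is not part of the original code):

import numpy as np
from sklearn.preprocessing import LabelEncoder

def onehot_encoder_np(labels, num_classes):
    # Map raw labels to integer ids, then index rows of an identity matrix.
    ids = LabelEncoder().fit_transform(labels)
    return np.eye(num_classes, dtype=np.float32)[ids]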

def load_dataset():
    header = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
    df_user = pd.read_csv('data/u.user', sep='|', names=header)
    header = ['item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children',
            'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 
            'Thriller', 'War', 'Western']
    df_item = pd.read_csv('data/u.item', sep='|', names=header, encoding="ISO-8859-1")
    df_item = df_item.drop(columns=['title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown'])
    
    # Bucket ages into decades, then one-hot encode the categorical user columns.
    df_user['age'] = pd.cut(df_user['age'], [0,10,20,30,40,50,60,70,80,90,100], labels=['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100'])
    df_user = pd.get_dummies(df_user, columns=['gender', 'occupation', 'age'])
    df_user = df_user.drop(columns=['zip_code'])
    
    user_features = df_user.columns.values.tolist()
    movie_features = df_item.columns.values.tolist()
    cols = user_features + movie_features
    
    # Map each feature column index to one of the 5 fields the model expects.
    # The original post does not show this mapping; the grouping here
    # (user_id / gender / occupation / age / item) is one plausible choice.
    feature2field = {}
    for idx, col in enumerate(cols):
        if col == 'user_id':
            feature2field[idx] = 0
        elif col.startswith('gender'):
            feature2field[idx] = 1
        elif col.startswith('occupation'):
            feature2field[idx] = 2
        elif col.startswith('age'):
            feature2field[idx] = 3
        else:  # item_id and the genre columns
            feature2field[idx] = 4
    
    header = ['user_id', 'item_id', 'rating', 'timestamp']
    df_train = pd.read_csv('data/ua.base', sep='\t', names=header)
    # Rating 5 -> click (1), anything lower -> non-click (0).
    df_train['rating'] = df_train.rating.apply(lambda x: 1 if int(x) == 5 else 0)
    df_train = df_train.merge(df_user, on='user_id', how='left') 
    df_train = df_train.merge(df_item, on='item_id', how='left')
    
    df_test = pd.read_csv('data/ua.test', sep='\t', names=header)
    df_test['rating'] = df_test.rating.apply(lambda x: 1 if int(x) == 5 else 0)
    df_test = df_test.merge(df_user, on='user_id', how='left') 
    df_test = df_test.merge(df_item, on='item_id', how='left')
    train_labels = onehot_encoder(df_train['rating'].astype(np.int32), 2)
    test_labels = onehot_encoder(df_test['rating'].astype(np.int32), 2)
    return df_train[cols].values, train_labels, df_test[cols].values, test_labels, feature2field
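
Each group of one-hot columns derived from the same original attribute shares a field, so the gender columns form one field, the occupation columns another, and so on; together with user_id and the item columns this yields the 5 fields the model below is configured with. The exact grouping is an assumption, since the original post only ships this mapping in the GitHub code.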

2.2 Hyperparameter Configuration

Configure the network, data paths, training set, and related settings.

    x_train, y_train, x_test, y_test, feature2field = load_dataset()
    # hyperparameters
    num_classes = 2
    lr = 0.01
    batch_size = 128
    k = 8                     # latent vector dimension
    reg_l1 = 2e-2
    reg_l2 = 0
    feature_length = x_train.shape[1]
    # minibatch generators for the training and test sets
    batch_gen = batch_generator([x_train, y_train], batch_size)
    test_batch_gen = batch_generator([x_test, y_test], batch_size)
    # initialize the FFM model (5 = number of fields)
    model = FFM(num_classes, k, 5, lr, batch_size, feature_length, reg_l1, reg_l2, feature2field)
    # build graph for model
    model.build_graph()
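
batch_generator comes from the accompanying GitHub code and is not shown in the post; a minimal sketch of the behavior assumed here (an endless stream of shuffled minibatches) looks like this:

import numpy as np

def batch_generator(data, batch_size):
    # data = [x, y]; yields aligned, shuffled minibatches indefinitely.
    x, y = data
    n = x.shape[0]
    while True:
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            yield x[batch], y[batch]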

2.3 Building the FFM Model

FFM formula:
Suppose a sample's $n$ features belong to $f$ fields; then FFM's second-order terms involve $nf$ latent vectors. In the FM model, each feature has only one latent vector, so FM can be viewed as the special case of FFM in which all features belong to a single field. From FFM's field-aware property, the model equation follows:

$$y(\mathbf{x}) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \langle \mathbf{v}_{i, f_j}, \mathbf{v}_{j, f_i} \rangle x_i x_j$$

where $f_j$ is the field that the $j$-th feature belongs to. If each latent vector has length $k$, FFM has $nfk$ second-order parameters, far more than FM's $nk$. Moreover, because the latent vectors depend on fields, FFM's second-order term cannot be simplified the way FM's can, so its prediction complexity is $O(kn^2)$.

In the inner product $\langle \mathbf{v}_{i,f_j}, \mathbf{v}_{j,f_i} \rangle$, the subscript $f_j$ associates feature $i$ with the field of feature $j$, while $f_i$ associates feature $j$ with the field of feature $i$. In other words, FM's interactions are purely between features, whereas FFM's interactions are between a feature and a field.

Precisely because FM is the special case of FFM where all features share one field, FM represents each feature with a single latent vector, while FFM represents each feature with $f$ latent vectors, one per field.
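
To make the field-aware lookup concrete, here is a toy NumPy sketch (the numbers are made up, not from the post) that evaluates the second-order term for three features in two fields:

import numpy as np

n, f, k = 3, 2, 4                    # 3 features, 2 fields, latent dimension 4
field = [0, 0, 1]                    # feature index -> field index
v = np.random.randn(n, f, k) * 0.01  # one latent vector per (feature, field) pair
x = np.array([1.0, 0.5, 2.0])        # a toy sample

# sum over feature pairs of <v[i, field[j]], v[j, field[i]]> * x_i * x_j
second_order = sum(
    np.dot(v[i, field[j]], v[j, field[i]]) * x[i] * x[j]
    for i in range(n) for j in range(i + 1, n)
)
print(second_order)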

Building the FFM model:
With the inputs in hand, we use TensorFlow to define the model. The objective has two parts, the linear terms and the feature-interaction terms; the interaction terms directly use the form derived above.


class FFM(object):
    def __init__(self, num_classes, k, field, lr, batch_size, feature_length, reg_l1, reg_l2, feature2field):
        self.num_classes = num_classes
        self.k = k                          # latent vector dimension
        self.field = field                  # number of fields
        self.lr = lr                        # learning rate
        self.batch_size = batch_size
        self.p = feature_length             # number of features
        self.reg_l1 = reg_l1                # L1 regularization strength (unused below)
        self.reg_l2 = reg_l2                # L2 regularization strength (unused below)
        self.feature2field = feature2field  # feature index -> field index

    def add_input(self):
        self.X = tf.placeholder('float32', [None, self.p])
        self.y = tf.placeholder('float32', [None, self.num_classes])
        self.keep_prob = tf.placeholder('float32')

    def inference(self):
        with tf.variable_scope('linear_layer'):
            w0 = tf.get_variable('w0', shape=[self.num_classes],
                                initializer=tf.zeros_initializer())
        self.w = tf.get_variable('w', shape=[self.p, self.num_classes],
                                 initializer=tf.truncated_normal_initializer(mean=0,stddev=0.01))
            self.linear_terms = tf.add(tf.matmul(self.X, self.w), w0)

        with tf.variable_scope('interaction_layer'):
            self.v = tf.get_variable('v', shape=[self.p, self.field, self.k],
                                initializer=tf.truncated_normal_initializer(mean=0, stddev=0.01))
            self.interaction_terms = tf.constant(0, dtype='float32')
            # Field-aware pairwise term: <v_{i,f_j}, v_{j,f_i}> * x_i * x_j,
            # matching the formula above (note the crossed field indices).
            # This Python loop builds O(p^2) graph ops, which is slow for wide inputs.
            for i in range(self.p):
                for j in range(i+1, self.p):
                    self.interaction_terms += tf.multiply(
                        tf.reduce_sum(tf.multiply(self.v[i, self.feature2field[j]], self.v[j, self.feature2field[i]])),
                        tf.multiply(self.X[:, i], self.X[:, j]))
        self.interaction_terms = tf.reshape(self.interaction_terms, [-1, 1])
        self.y_out = self.linear_terms + self.interaction_terms
        if self.num_classes == 2:
            self.y_out_prob = tf.nn.sigmoid(self.y_out)
        elif self.num_classes > 2:
            self.y_out_prob = tf.nn.softmax(self.y_out)
            
    def add_loss(self):
        if self.num_classes == 2:
            cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=self.y, logits=self.y_out)
        elif self.num_classes > 2:
            cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=self.y, logits=self.y_out)
        mean_loss = tf.reduce_mean(cross_entropy)
        self.loss = mean_loss
        tf.summary.scalar('loss', self.loss)

    def add_accuracy(self):
        # accuracy
        self.correct_prediction = tf.equal(tf.cast(tf.argmax(self.y_out,1), tf.float32), tf.cast(tf.argmax(self.y,1), tf.float32))
        self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32))
        # add summary to accuracy
        tf.summary.scalar('accuracy', self.accuracy)

    def train(self):
        self.global_step = tf.Variable(0, trainable=False)
        optimizer = tf.train.AdagradOptimizer(self.lr) 
        extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(extra_update_ops):
            self.train_op = optimizer.minimize(self.loss, global_step=self.global_step)

    def build_graph(self):
        self.add_input()
        self.inference()
        self.add_loss()
        self.add_accuracy()
        self.train()
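
The training loop that produces the log below is not included in the post; a minimal sketch of how one might drive the graph (the iteration count and logging cadence are assumptions) is:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for it in range(10000):  # iteration count is an assumption
        x_batch, y_batch = next(batch_gen)
        feed = {model.X: x_batch, model.y: y_batch, model.keep_prob: 1.0}
        loss, acc, _ = sess.run([model.loss, model.accuracy, model.train_op], feed_dict=feed)
        if it % 500 == 0:
            print('Iteration %d: minibatch loss = %.4f, accuracy = %.4f' % (it, loss, acc))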

Results

start training...
2019-03-15 10:22:02,967 : INFO : Iteration 4000: with minibatch training loss = 0.5671271085739136 and accuracy of 0.75
Epoch 1, Overall loss = 0.508
2019-03-15 10:22:40,027 : INFO : Iteration 4500: with minibatch training loss = 0.5328309535980225 and accuracy of 0.78125
Epoch 2, Overall loss = 0.507
2019-03-15 10:23:04,017 : INFO : Iteration 5000: with minibatch training loss = 0.4927006959915161 and accuracy of 0.8046875
2019-03-15 10:23:28,660 : INFO : Iteration 5500: with minibatch training loss = 0.5169863104820251 and accuracy of 0.78125
Epoch 3, Overall loss = 0.507
2019-03-15 10:23:53,208 : INFO : Iteration 6000: with minibatch training loss = 0.5515714883804321 and accuracy of 0.765625
Epoch 4, Overall loss = 0.506
2019-03-15 10:24:24,179 : INFO : Iteration 6500: with minibatch training loss = 0.5365687608718872 and accuracy of 0.765625
...
