This experiment is designed on the premise that users have distinct preferences for the layout, color, and style of product cover images.
The general approach:
- Build a user/cover-image two-tower model
- Train the model on product impression and click logs
- Export the user features and the product cover-image features from the trained model
- Store each of them in ES
- At query time, fetch the user's features by user id and run a dot-product query against the product features in ES
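For the ES side of the steps above, the features field has to be mapped as a dense_vector before any documents are written, otherwise the dot-product script query later on will not work. A minimal sketch of such a mapping (the index name user, the field name features, and the 512-dimensional size are assumptions chosen to match the towers below):

```python
import json

def feature_index_mapping(dims):
    """Mapping for an index that stores an embedding in a dense_vector field."""
    return {
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "features": {"type": "dense_vector", "dims": dims}
            }
        }
    }

# PUT this mapping when creating the index, e.g.:
#   requests.put("http://localhost:9200/user", json=feature_index_mapping(512))
body = json.dumps(feature_index_mapping(512))
```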
Step 1: Sorting out the user features
To get the whole pipeline working, we first pick a few typical user features from the existing data to walk through the processing details
- user_id: user id (string)
- user_login_days_7: number of days the user logged in during the past 7 days (numeric)
- user_tags: user tags (array of strings)
user_id and user_login_days_7 are scalar values, while user_tags is a variable-length text array that needs special handling
Step 2: Loading the data
Model training has three stages: data processing, feature engineering, and model training. For the data-processing part we use pandas
logs_df = pd.read_csv('d:/data/近七天模板点击曝光.csv')  # impression/click logs for the past 7 days
product_df = pd.read_csv('d:/data/商品信息.csv')  # product information
user_df = pd.read_csv('d:/data/用户信息.csv')  # user information
Merge the three datasets
df = logs_df.merge(user_df, how='inner', on='user_id').merge(product_df, how='inner', on='product_id')
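On toy data, this chained inner merge keeps only the log rows whose user_id and product_id both appear in the dimension tables; a minimal sketch with made-up values:

```python
import pandas as pd

logs_df = pd.DataFrame({"user_id": ["u1", "u1", "u2"],
                        "product_id": ["p1", "p2", "p9"],
                        "click": [1, 0, 1]})
user_df = pd.DataFrame({"user_id": ["u1", "u2"], "user_login_days_7": [3, 5]})
product_df = pd.DataFrame({"product_id": ["p1", "p2"], "title": ["a", "b"]})

df = logs_df.merge(user_df, how='inner', on='user_id') \
            .merge(product_df, how='inner', on='product_id')
# "p9" has no match in product_df, so its log row is dropped by the inner join
print(len(df))  # 2
```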
Feature engineering
The main focus is handling the string features: the text is segmented into words first, and the text vectors produced by a word2vec model are then used as the weights of the user_tags embedding layer
The complete code:
import jieba
import numpy as np
import pandas as pd
from PIL import Image
from gensim.models import word2vec
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

df = pd.read_csv('d:/data/xxx.csv')
tags = []
uids = []
logins = []
imgs = []
labels = []
for idx, row in df.iterrows():
    # segment the tag text with jieba
    tags.append(list(jieba.cut(row['tags'], cut_all=False)))
    # load the cover image
    img = Image.open("d:/data/h5_cover/" + str(row["id"]) + ".jpg")
    img = img.resize((224, 224), Image.BILINEAR)
    imgs.append(np.asarray(img))
    labels.append(row['click'])
    uids.append(row['uid'])
    logins.append(row['user_login_days_7'])
MAX_WORDS=10
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(tags)
vocab = tokenizer.word_index  # vocabulary: the integer id assigned to each word
w2v_model = word2vec.Word2Vec(tags, workers=4, vector_size=100,min_count=1,window=4)
# force the word vectors to unit length (deprecated in newer gensim versions)
w2v_model.init_sims(replace=True)
embedding_matrix = np.zeros((len(vocab)+1, 100))  # word ids start at 1, so add one row; row 0 is reserved for padding
for word, i in vocab.items():
    try:
        # look up the word vector and place it into the weight matrix
        embedding_vector = w2v_model.wv[str(word)]
        embedding_matrix[i] = embedding_vector
    except KeyError:  # skip words that are missing from the word2vec vocabulary
        print(word)
        continue
data_seq = tokenizer.texts_to_sequences(tags)
tags_seq = pad_sequences(data_seq, maxlen=MAX_WORDS)
# user ids are strings, so encode them as integer ids
tokenizer = Tokenizer()
tokenizer.fit_on_texts(uids)
data_seq = tokenizer.texts_to_sequences(uids)
uids_seq = pad_sequences(data_seq)
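texts_to_sequences replaces each word with its id from word_index, and pad_sequences then left-pads (or left-truncates) every row to the same length with 0, which is why row 0 of the embedding matrix is left as zeros. A small numpy illustration of that padding behavior (a re-implementation for clarity, not the keras API itself):

```python
import numpy as np

def pad_left(seqs, maxlen):
    # left-pad with 0 and keep only the last `maxlen` ids, like keras pad_sequences defaults
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for i, s in enumerate(seqs):
        s = s[-maxlen:]
        out[i, maxlen - len(s):] = s
    return out

print(pad_left([[4, 2], [7, 1, 3, 9]], maxlen=3))
# [[0 4 2]
#  [1 3 9]]
```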
Building the dataset
Concatenate the processed features into a new DataFrame
data = {'uid_seq': uids_seq.flatten(), 'logins':logins, 'tags_seq': tags_seq.tolist(), 'img':imgs}
data = pd.DataFrame(data=data)
Convert the labels into a type the model can consume
labels = np.asarray(labels)
Splitting the dataset
Use sklearn's built-in train_test_split to randomly split the data into training and test sets at a 4:1 ratio
train_X, test_X, train_y, test_y = train_test_split(data, labels, test_size=0.2)
Building the model
The key step is constructing the two-tower model. The image feature extractor is the classic VGG16 with max pooling, which turns each product cover image into a 512-dimensional feature vector; the user side concatenates several feature branches and fuses them with a final fully connected layer
import pandas as pd
import numpy as np
import keras
from keras import layers
from keras.applications.vgg16 import VGG16
import matplotlib.pyplot as plt
from keras.models import Sequential, Model
from sklearn.model_selection import train_test_split
# use the word2vec weight matrix built in the feature-engineering step; row 0 is the padding id
tags_embedder = layers.Embedding(len(vocab)+1, 100, input_length=10,
    weights=[embedding_matrix], trainable=False, name='tags')  # frozen, not trained further
uid_embedding_matrix = np.zeros((num_user_id, 100))  # num_user_id: number of distinct user ids
uid_embedder = layers.Embedding(num_user_id, 100, input_length=1,
    weights=[uid_embedding_matrix], trainable=True, name='uid')  # trained
user_login_embedding_matrix = np.zeros((num_user_login_days_7, 2))  # num_user_login_days_7: cardinality of the login-days feature
user_login_embedder = layers.Embedding(num_user_login_days_7, 2, input_length=1,
    weights=[user_login_embedding_matrix], trainable=True, name='logins')  # trained
# build the model
user_tags_seq = Sequential()
user_tags_seq.add(tags_embedder)
user_tags_seq.add(layers.Conv1D(1024, 3, padding='same', activation='relu'))  # 1-D convolution, kernel size 3
user_tags_seq.add(layers.MaxPool1D(4, padding='same'))
user_tags_seq.add(layers.Conv1D(512, 3, padding='same', activation='relu'))  # 1-D convolution
user_tags_seq.add(layers.Flatten())  # flatten
user_tags_seq.add(layers.Dropout(0.3))
user_tags_seq.add(layers.Dense(256, activation='relu', name="user_tags"))
uid_seq = Sequential()
uid_seq.add(uid_embedder)
user_login_seq = Sequential()
user_login_seq.add(user_login_embedder)
user_combined = layers.concatenate([uid_seq.output, user_login_seq.output])
user_vector = layers.Dense(512, activation='relu')(user_combined)
user_vector = layers.Flatten()(user_vector)
user_vector = layers.Dropout(0.2)(user_vector)
user_vector = layers.Dense(256, activation='relu',
kernel_regularizer='l2')(user_vector)
user_combined_seq = layers.concatenate([user_tags_seq.output, user_vector])
user_model = layers.Dense(512, activation='relu', name='user_embedding')(user_combined_seq)
vgg_model = VGG16(weights='imagenet', input_shape=(224, 224, 3), pooling='max', include_top=False)
for layer in vgg_model.layers:
    layer.trainable = False  # freeze the pretrained VGG weights so they are not retrained
img_model = layers.Dropout(0.5)(vgg_model.output)
img_model = layers.Dense(1024, activation='relu')(img_model)
img_model = layers.Dense(512, name='img_vectors')(img_model)  # must match the user tower's output dimension for the dot product
#combinedInput = layers.Multiply()([img_model, user_model])
#y = layers.Dense(1, activation='linear')(combinedInput)
y = layers.Dot(axes=1)([img_model, user_model])  # img_model and user_model are tensors, not models, so no .output
y = layers.Dense(1, activation='sigmoid')(y)
model = Model(inputs=[uid_seq.input, user_login_seq.input, user_tags_seq.input, vgg_model.input], outputs=[y])
from keras.utils import plot_model
plot_model(model, to_file='model.png', show_shapes=True)
Model training
The loss is mean squared error, accuracy is tracked as a reference metric, and the optimizer is Adam (friendly to beginners, since it adjusts the learning rate on its own)
opt = keras.optimizers.Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss="mean_squared_error", metrics=['mae', 'acc'], optimizer=opt)
train_inputs = [ np.array(list(train_X['uid_seq'])), train_X['logins'], np.array(list(train_X['tags_seq'])), np.array(list(train_X['img']))]
test_inputs = [ np.array(list(test_X['uid_seq'])), test_X['logins'], np.asarray(list(test_X['tags_seq'])), np.asarray(list(test_X['img']))]
# train the model
print("[INFO] training model...")
model.fit(
train_inputs, train_y,
validation_data=(test_inputs, test_y),
epochs=10, batch_size=8, verbose=1)
Exporting the features and storing them in ES
User features
import requests
user_layer_model = Model(
    inputs=[model.input[0], model.input[1], model.input[2]],
    outputs=model.get_layer("user_embedding").output  # the user tower's output layer
)
# user_embeddings = []
# kept simple: users and products are not deduplicated here
for index, row in user_df.iterrows():
    user_id = row["user_id"]
    user_input = [
        np.reshape(row["uid_seq"], [1, 1]),
        np.reshape(row["login_days"], [1, 1]),
        np.reshape(row["tags_seq"], [1, 10])
    ]
    user_embedding = user_layer_model(user_input)
    data = '''{{
        "id":"{}",
        "tags":"{}",
        "logins":{},
        "features":{}
    }}'''.format(user_id, row["tags"], row["login_days"], list(user_embedding.numpy().flatten()))
    headers = {
        "Content-Type": "application/json"
    }
    # print(json.dumps(data))
    response = requests.post("http://localhost:9200/user/_doc/" + user_id, data=data.encode(), headers=headers)
    print(response.content)
    # embedding_str = ",".join([str(x) for x in user_embedding.numpy().flatten()])
    # user_embeddings.append([user_id, embedding_str])
# df_user_embedding = pd.DataFrame(user_embeddings, columns=["user_id", "user_embedding"])
# df_user_embedding.head()
Product cover-image features
Same procedure as above
ES query
First fetch the stored user-tower features user_features by user id, then use script_score to compute the dot product against the item-tower features stored in the product index
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": """
double value = dotProduct(params.query_vector, 'img_features');
return sigmoid(1, Math.E, -value);
""",
"params": {
"query_vector": {user_features}
}
}
}
}
}
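The query above can be issued from Python by substituting the user's stored vector for the {user_features} placeholder; a sketch assuming the document layout from the export step and a local ES (the product index name is an assumption):

```python
import json

def knn_query(user_features):
    # script_score query: dot product between the user vector and each item's img_features
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "double value = dotProduct(params.query_vector, 'img_features'); "
                              "return sigmoid(1, Math.E, -value);",
                    "params": {"query_vector": user_features}
                }
            }
        }
    }

body = json.dumps(knn_query([0.1, 0.2, 0.3]))
# POST to the product index, e.g. with requests:
#   requests.post("http://localhost:9200/product/_search", json=knn_query(user_features))
```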