Wide & Deep by Google: Model Notes

Sources:

http://baijiahao.baidu.com/s?id=1582203712490596565&wfr=spider&for=pc

https://blog.csdn.net/a819825294/article/details/71080472


Notes on the paper "Wide & Deep Learning for Recommender Systems"

[Figure 1. Wide & Deep model structure]

The paper jointly trains a wide model (leftmost in Figure 1) and a deep model (rightmost) to balance memorization and generalization, and proposes the Wide & Deep model (middle).

The wide part is an LR (logistic regression) model, which relies on engineered feature transformations (transformed features) to give the model memorization ability. The deep part is a DNN that embeds sparse and previously unseen feature combinations into low-dimensional dense vectors to give the model generalization ability.


[Figure 2. System architecture of the app recommender pipeline]


1. The Wide Component

The wide component is a generalized linear model:

y = w^T x + b

where x = [x_1, x_2, ..., x_d] is a vector of d features, w = [w_1, w_2, ..., w_d] are the corresponding model parameters, and b is the bias. The feature set consists of raw input features and transformed features (cross-product transformations). The paper mentions:

Raw features: user features (e.g., country, language, demographics), contextual features (e.g., device, hour of the day, day of the week), and impression features (e.g., app age, historical statistics of an app).

Transformed features (cross-product transformation):

φ_k(x) = ∏_{i=1}^{d} x_i^{c_ki},   c_ki ∈ {0, 1}

where φ_k is the k-th transformation, representing a user-defined combination of features, and c_ki = 1 indicates that the i-th feature is part of the k-th transformation (0 otherwise).

The wide model uses many kinds of cross-product feature transformations to memorize specific feature combinations, but its limitation is that it cannot generalize to combinations that never appeared in the training data, and the cross features themselves must be engineered manually.

As the paper puts it: "This captures the interactions between the binary features, and adds nonlinearity to the generalized linear model."

Summary: the input to the wide model is the cross product between the apps the user has installed (installation) and the apps shown to the user (impression).
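
A minimal sketch of the cross-product transformation above, assuming binary one-hot features; the feature names and the helper function are illustrative, not from the paper:

import numpy as np

def cross_product_transformation(x, c):
    # phi_k(x) = prod_i x_i ** c_ki for binary features x_i in {0, 1}:
    # the product is 1 only when every feature selected by the mask c is 1.
    return int(np.all(x[c == 1] == 1))

# Example features: [gender=female, language=en, installed_app=netflix]
x = np.array([1, 1, 0])
mask = np.array([1, 1, 0])   # AND(gender=female, language=en)
print(cross_product_transformation(x, mask))  # -> 1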


2. The Deep Component

The rightmost structure in Figure 1 is an MLP with three hidden layers: an embedding layer first, then two neural-network layers, and finally a softmax output. Categorical features (e.g., "installed a video app", "impressed a music app") are high-dimensional and sparse; the deep component maps them to low-dimensional dense vectors (dense embeddings), concatenates them with the continuous features (user age, number of app installs, etc.), and feeds the result into the MLP.

The paper mentions that the embedding dimensions are on the order of O(10) to O(100) and that the embedding vectors are randomly initialized. Each hidden layer is defined as:

a^(l+1) = f(W^(l) a^(l) + b^(l))

where l is the layer index, f is the activation function (ReLU), and a^(l), b^(l), and W^(l) are the activations, bias, and weights of the l-th layer. Through these low-dimensional embeddings, the neural network generalizes better to feature combinations not seen during training.

Summary: the input to the embedding-based deep model is the categorical features (turned into embeddings) concatenated with the continuous features.
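
A minimal numpy sketch of the deep component's forward pass, assuming a single categorical feature with a 4-id vocabulary, an 8-dimensional embedding, and two ReLU hidden layers; all names, shapes, and values are illustrative placeholders:

import numpy as np

rng = np.random.default_rng(0)

vocab_size, emb_dim = 4, 8
# Randomly initialized embedding table, one row per categorical id.
embedding_table = rng.normal(0, 1.0 / np.sqrt(emb_dim), size=(vocab_size, emb_dim))

def relu(z):
    return np.maximum(z, 0.0)

def deep_forward(cat_id, continuous, layers):
    # Embedding lookup turns the sparse id into a dense vector,
    # which is then concatenated with the continuous features.
    a = np.concatenate([embedding_table[cat_id], continuous])
    # a^(l+1) = f(W^(l) a^(l) + b^(l)) for each hidden layer.
    for W, b in layers:
        a = relu(W @ a + b)
    return a  # final hidden activation, fed to the output unit

continuous = np.array([0.3, 0.7])   # e.g. normalized age and number of installs
layers = [(rng.normal(size=(16, emb_dim + 2)), np.zeros(16)),
          (rng.normal(size=(8, 16)), np.zeros(8))]
print(deep_forward(cat_id=2, continuous=continuous, layers=layers).shape)  # (8,)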


*Details on embedding column (source: https://blog.csdn.net/heyc861221/article/details/80131369)

A sparse feature column, once converted into a dense vector through an embedding, can be used as input to the deep model. As mentioned earlier, one weakness of a cross column is its generalization to the test set; an embedding column turns the discrete feature into a continuous vector and learns the vector representation of each feature value from the labels, much like matrix factorization learns latent factor vectors for items or a word-vector model learns word embeddings.

The interface of embedding_column:

def embedding_column(sparse_id_column, dimension, combiner=None, initializer=None,
                     ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None,
                     trainable=True)

The corresponding class is _EmbeddingColumn:

def __new__(cls, sparse_id_column, dimension, combiner="mean", initializer=None,
            ckpt_to_load_from=None, tensor_name_in_ckpt=None, shared_embedding_name=None,
            shared_vocab_size=None, max_norm=None, trainable=True):

sparse_id_column is a SparseColumn or WeightedSparseColumn object, and dimension is the dimensionality of the embedding vectors. Each feature value of the SparseColumn corresponds to an integer id, and each integer id corresponds to a dimension-dimensional float vector in the embedding column. The combiner parameter specifies how feature vectors are normalized within a single example, and the initializer parameter specifies the initialization function for the vectors; by default they are initialized from a truncated normal distribution (mean = 0, stddev = 1 / sqrt(length of sparse id column)). max_norm caps the L2 norm of each example's feature vector: embedding_vector = embedding_vector * max_norm / L2_norm(embedding_vector).

An illustration of the embedding column:

[Figure 3. embedding column illustration]

As shown above, take sparse_column_with_keys(column_name='gender', keys=['female', 'male']) as an example. Suppose female maps to id = 0 and male maps to id = 1, and each id corresponds to one 6-dimensional float vector in the embedding feature. In the training data, when the gender feature takes the value 'female', the DNN input layer receives the vector for id = 0 (via tf.embedding_lookup_sparse). embedding_column also takes a trainable parameter that specifies whether the feature embeddings are updated according to the model's training error.
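
A minimal sketch of what this lookup does conceptually, assuming a 6-dimensional embedding table with one row per id; the table values are random placeholders:

import numpy as np

# One 6-dimensional row per id: row 0 = 'female', row 1 = 'male'.
gender_embedding = np.random.default_rng(0).normal(size=(2, 6))
id_by_key = {"female": 0, "male": 1}

def lookup(gender_value):
    # Conceptually what the DNN input layer receives for this feature.
    return gender_embedding[id_by_key[gender_value]]

print(lookup("female"))  # the 6-dimensional vector for id = 0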


3. Joint Training of Wide & Deep Model

The paper combines the wide and deep parts by summing their log odds with learned weights and training them jointly. The prediction is:

P(Y = 1 | x) = σ(w_wide^T [x, φ(x)] + w_deep^T a^(l_f) + b)

where Y is the binary label indicating whether the user downloaded the app, σ is the sigmoid function, φ(x) are the cross-product transformations constructed in the wide component, x are the raw features of the app impression, a^(l_f) is the final hidden-layer activation of the deep component, b is the bias term, and w_wide and w_deep are the parameter vectors of the wide and deep parts, respectively.
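
A minimal sketch of the joint prediction, with randomly initialized placeholder weights; the shapes are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wide_deep_predict(x_wide, a_deep, w_wide, w_deep, b):
    # P(Y=1|x) = sigmoid(w_wide . [x, phi(x)] + w_deep . a^(l_f) + b)
    # x_wide: raw features concatenated with their cross-product transformations
    # a_deep: final hidden-layer activation of the deep component
    return sigmoid(w_wide @ x_wide + w_deep @ a_deep + b)

rng = np.random.default_rng(0)
x_wide = np.array([1.0, 0.0, 1.0, 1.0])   # [x, phi(x)], illustrative
a_deep = rng.normal(size=8)               # output of the last hidden layer
print(wide_deep_predict(x_wide, a_deep,
                        w_wide=rng.normal(size=4),
                        w_deep=rng.normal(size=8),
                        b=0.0))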


4. SYSTEM IMPLEMENTATION

The app recommender system consists of three stages: data generation, model training, and model serving.

Data Generation

Label: the target is app acquisition; the label is 1 if the user downloaded the app and 0 otherwise.

Vocabularies: categorical features are mapped to integer ids; continuous real values are first normalized to [0, 1] using the cumulative distribution function (CDF) and then discretized into buckets (see the sketch below).
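
A minimal numpy sketch of this normalization, assuming the empirical CDF is computed from the training values themselves; the values and bucket count are illustrative:

import numpy as np

def cdf_normalize(values, n_buckets=10):
    # Map each value to its empirical CDF in [0, 1], then discretize into buckets.
    ranks = values.argsort().argsort()            # rank of each value
    cdf = ranks / (len(values) - 1)               # empirical CDF in [0, 1]
    buckets = np.minimum((cdf * n_buckets).astype(int), n_buckets - 1)
    return cdf, buckets

ages = np.array([18.0, 22.0, 35.0, 35.0, 47.0, 62.0])
cdf, buckets = cdf_normalize(ages, n_buckets=4)
print(cdf)      # [0.  0.2 0.4 0.6 0.8 1. ]
print(buckets)  # bucket index per value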

Model Training

Training uses 500 billion examples. The input layer takes continuous features and categorical features (see Figure 4 in the paper). Retraining uses a warm-start strategy: the embeddings and linear model weights are read from the previous model to initialize the new one. Whenever new training data arrives, training continues from the existing model, reducing computational cost and training time.

Model Serving

Once the model is trained and tuned, it is loaded into the recommendation engine. For each query request, the ranking system receives the candidate list from the retrieval system together with each app's features, computes a score for every app with the Wide & Deep model, and sorts the apps from highest to lowest score. The paper also mentions using smaller batches and parallelism to improve the engine's serving performance.


Complete code (https://www.tensorflow.org/tutorials/wide_and_deep):

# -*- coding: utf-8 -*-
from __future__ import print_function  # must precede all other imports

import tempfile
import urllib  # Python 2; on Python 3 use urllib.request.urlretrieve below
import warnings

import numpy as np
import pandas as pd
import tensorflow as tf

warnings.filterwarnings("ignore")

# Categorical base columns.
gender = tf.contrib.layers.sparse_column_with_keys(column_name="gender", keys=["Female", "Male"])
race = tf.contrib.layers.sparse_column_with_keys(column_name="race", keys=["Amer-Indian-Eskimo", "Asian-Pac-Islander", "Black", "Other", "White"])
education = tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship", hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass", hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation", hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country", hash_bucket_size=1000)

# Continuous base columns.
age = tf.contrib.layers.real_valued_column("age")
age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")

wide_columns = [
  gender, native_country, education, occupation, workclass, relationship, age_buckets,
  tf.contrib.layers.crossed_column([education, occupation], hash_bucket_size=int(1e4)),
  tf.contrib.layers.crossed_column([native_country, occupation], hash_bucket_size=int(1e4)),
  tf.contrib.layers.crossed_column([age_buckets, education, occupation], hash_bucket_size=int(1e6))]

deep_columns = [
  tf.contrib.layers.embedding_column(workclass, dimension=8),
  tf.contrib.layers.embedding_column(education, dimension=8),
  tf.contrib.layers.embedding_column(gender, dimension=8),
  tf.contrib.layers.embedding_column(relationship, dimension=8),
  tf.contrib.layers.embedding_column(native_country, dimension=8),
  tf.contrib.layers.embedding_column(occupation, dimension=8),
  age, education_num, capital_gain, capital_loss, hours_per_week]

model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])

# Define the column names for the data sets.
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
  "marital_status", "occupation", "relationship", "race", "gender",
  "capital_gain", "capital_loss", "hours_per_week", "native_country", "income_bracket"]
LABEL_COLUMN = 'label'
CATEGORICAL_COLUMNS = ["workclass", "education", "marital_status", "occupation",
                       "relationship", "race", "gender", "native_country"]
CONTINUOUS_COLUMNS = ["age", "education_num", "capital_gain", "capital_loss",
                      "hours_per_week"]

# Download the training and test data to temporary files.
# Alternatively, you can download them yourself and change train_file and
# test_file to your own paths.
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
urllib.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)

# Read the training and test data sets into Pandas dataframe.
df_train = pd.read_csv(train_file.name, names=COLUMNS, skipinitialspace=True)
df_test = pd.read_csv(test_file.name, names=COLUMNS, skipinitialspace=True, skiprows=1)
df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)

def input_fn(df):
  # Creates a dictionary mapping from each continuous feature column name (k) to
  # the values of that column stored in a constant Tensor.
  continuous_cols = {k: tf.constant(df[k].values)
                     for k in CONTINUOUS_COLUMNS}
  # Creates a dictionary mapping from each categorical feature column name (k)
  # to the values of that column stored in a tf.SparseTensor.
  categorical_cols = {k: tf.SparseTensor(
      indices=[[i, 0] for i in range(df[k].size)],
      values=df[k].values,
      dense_shape=[df[k].size, 1])
                      for k in CATEGORICAL_COLUMNS}
  # Merges the two dictionaries into one (works on both Python 2 and 3).
  feature_cols = dict(continuous_cols)
  feature_cols.update(categorical_cols)
  # Converts the label column into a constant Tensor.
  label = tf.constant(df[LABEL_COLUMN].values)
  # Returns the feature columns and the label.
  return feature_cols, label

def train_input_fn():
  return input_fn(df_train)

def eval_input_fn():
  return input_fn(df_test)

print('df_train shape:',np.array(df_train).shape)
print('df_test shape:',np.array(df_test).shape)

m.fit(input_fn=train_input_fn, steps=200)
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))
