Adaptive Softmax

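In natural language processing, when the vocabulary is large, the embedding accounts for most of a model's parameters. In a machine translation task, for example, a vocabulary on the order of $10^5$ entries (the example below uses $2^{17}$) with an embedding dimension of 1024 produces on the order of 100 million parameters. If the embedding matrix and the softmax projection matrix are not shared, that adds roughly as many parameters again.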

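This causes two common problems:

1. The large parameter count directly increases GPU memory usage in production, which limits how many processes can be deployed on a single node, and cloud GPUs are expensive.
2. Word frequencies in natural language follow Zipf's law: a small fraction of the words accounts for most of the total occurrences. As a result, low-frequency words cannot be trained adequately.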

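Facebook proposed Adaptive Softmax in "Efficient softmax approximation for GPUs", which addresses both problems well. The basic idea is to sort words by their frequency in the corpus from high to low and split them into groups, giving high-frequency groups a large embedding dimension and low-frequency groups a small one. Following this idea, I implemented a version of Adaptive Softmax on top of the Modality class in tensor2tensor.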

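The implementation has two parts: embedding the vocabulary indices, and mapping hidden vectors to a probability distribution over the vocabulary. Adaptive Softmax needs three hyperparameters: vocab_size, cutoffs, and embedding_size. Example values: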

vocab_size = 2**17
cutoffs = [0, 2**14, 2**15, 2**16, 2**17]
embedding_size = [1024, 512, 256, 128]

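Here vocab_size is the vocabulary size, i.e. the total number of subwords built by SubwordTextEncoder. cutoffs gives the boundaries that split the vocabulary indices into groups; in this example there are four groups of sizes $2^{14}$, $2^{14}$, $2^{15}$, and $2^{16}$. Because subwords are sorted by frequency from high to low, the groups get embedding_size values of 1024, 512, 256, and 128 respectively. Note that the code below prepends 0 and appends vocab_size to the cutoffs it reads from the model hparams, so the hparams only need the interior boundaries [2**14, 2**15, 2**16]. It also derives the per-group embedding sizes from div_val and min_hidden_dim rather than from an explicit embedding_size list.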
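To make the grouping concrete, here is a minimal sketch of how the example cutoffs partition the vocabulary. The helper name `locate` is hypothetical and not part of tensor2tensor; it only shows which group, which row inside that group, and which embedding size a given index maps to.

```python
from bisect import bisect_right

vocab_size = 2**17
cutoffs = [0, 2**14, 2**15, 2**16, 2**17]
embedding_size = [1024, 512, 256, 128]


def locate(index):
    """Return (group, row within the group, embedding size) for a vocabulary index."""
    if not 0 <= index < vocab_size:
        raise ValueError("index out of range")
    # The group is the last boundary that is <= index.
    group = bisect_right(cutoffs, index) - 1
    return group, index - cutoffs[group], embedding_size[group]


# Group sizes are the differences between consecutive boundaries.
group_sizes = [high - low for low, high in zip(cutoffs[:-1], cutoffs[1:])]
```

For example, `locate(2**14)` falls on the first row of the second group, so it gets a 512-dimensional embedding.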

index to embedding

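The embedding stage converts indices to embeddings, so we first declare the embedding matrices and then use a separate projection matrix per group to map embeddings of different sizes to the same hidden dimension. Override the _get_weights method of the Modality class: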

def _get_weights(self, hidden_dim=None):
    # Full group boundaries: 0 and vocab_size are added around the hparams cutoffs.
    cutoffs = [0] + self._model_hparams.cutoffs + [self._vocab_size]
    if hidden_dim is None:
        hidden_dim = self._body_input_depth
    embeddings, projections = [], []
    for i in range(len(cutoffs) - 1):
        # Shrink the embedding size by div_val per group, floored at min_hidden_dim.
        embed_dim = max(hidden_dim // (self._model_hparams.div_val ** i),
                        self._model_hparams.min_hidden_dim)

        # Embedding matrix of group i: [group_size, embed_dim].
        var_name = "weights_embed_%d" % i
        embeddings.append(tf.get_variable(name=var_name,
                                          shape=[cutoffs[i + 1] - cutoffs[i], embed_dim],
                                          initializer=tf.random_normal_initializer(0.0, hidden_dim ** -0.5)))

        # Projection of group i back to the hidden dimension: [embed_dim, hidden_dim].
        var_name = "weights_proj_%d" % i
        projections.append(tf.get_variable(name=var_name,
                                           shape=[embed_dim, hidden_dim],
                                           initializer=tf.random_normal_initializer(0.0, hidden_dim ** -0.5)))
    return embeddings, projections

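div_val=2 is the factor by which embedding_size shrinks from one group to the next, starting from embedding_size = hidden_dim = 1024. min_hidden_dim=128 sets the smallest embedding_size so that it does not shrink too far. The embedding matrices returned by _get_weights and their projection matrices have shapes $[2^{14}, 1024]$ and $[1024, 1024]$, $[2^{14}, 512]$ and $[512, 1024]$, $[2^{15}, 256]$ and $[256, 1024]$, $[2^{16}, 128]$ and $[128, 1024]$. From these shapes, the parameter count of Adaptive Softmax comes to roughly 44 million, about two thirds less than the original $2^{17} \times 1024 \approx 134$ million.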
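The arithmetic can be checked with a short sketch that mirrors the shapes created in _get_weights. The function name `adaptive_param_count` and its argument names are my own, and the count covers only the embedding and projection matrices (not kernel_cluster or the rest of the model).

```python
def adaptive_param_count(boundaries, hidden_dim, div_val, min_dim):
    """Count embedding plus projection parameters for every group."""
    total = 0
    for i in range(len(boundaries) - 1):
        group_size = boundaries[i + 1] - boundaries[i]
        embed_dim = max(hidden_dim // (div_val ** i), min_dim)
        total += group_size * embed_dim  # embedding matrix [group_size, embed_dim]
        total += embed_dim * hidden_dim  # projection matrix [embed_dim, hidden_dim]
    return total


boundaries = [0, 2**14, 2**15, 2**16, 2**17]
adaptive = adaptive_param_count(boundaries, hidden_dim=1024, div_val=2, min_dim=128)
dense = 2**17 * 1024  # a single full-size embedding matrix

print(adaptive, dense, round(1 - adaptive / dense, 3))
# 43909120 134217728 0.673
```

The projection matrices add under 2 million parameters, so almost all of the saving comes from the smaller embeddings of the low-frequency groups.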

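With the weights ready, an index is converted to an embedding by using tf.gather to fetch the corresponding embedding row and then projecting it to the hidden dimension. The embeddings of the different groups are combined by building, for each group, a mask of the indices that fall inside its cutoffs range, and summing the masked results.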

def bottom_simple(self, x, name, reuse):
    hidden_dim = self._body_input_depth
    cutoffs = [0] + self._model_hparams.cutoffs + [self._vocab_size]
    with tf.variable_scope(name, reuse=reuse) as vs:
        # Ensure the inputs are 3-D
        if len(x.get_shape()) == 4:
            x = tf.squeeze(x, axis=3)
        while len(x.get_shape()) < 3:
            x = tf.expand_dims(x, axis=-1)
        embeddings, projections = self._get_weights()
        x = common_layers.dropout_no_scaling(
            x, 1.0 - self._model_hparams.symbol_dropout)
        x_shape = common_layers.shape_list(x)
        x = tf.reshape(x, [-1, x_shape[-1]])
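        # x is [N, 1] after the reshape, so this is an [N, hidden_dim] zero accumulator.
        ret = tf.tile(tf.zeros_like(x, dtype=vs.dtype), [1, hidden_dim])
        for i in range(len(cutoffs) - 1):
            low, high = cutoffs[i], cutoffs[i + 1]
            # 1.0 where the index belongs to group i, 0.0 elsewhere.
            mask = tf.cast(low <= x, vs.dtype) * tf.cast(x < high, vs.dtype)
            # Out-of-group indices are redirected to row 0; the mask removes them again below.
            selected = tf.squeeze(tf.gather(embeddings[i], (x - low) * tf.cast(mask, x.dtype)), 1)
            projected = tf.matmul(selected, projections[i])
            ret += projected * mask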

        ret = tf.reshape(ret, x_shape[:-1] + [1, hidden_dim])
        x = tf.reshape(x, x_shape)
        if self._model_hparams.multiply_embedding_mode == "sqrt_depth":
            ret *= self._body_input_depth ** 0.5
        ret *= tf.expand_dims(tf.cast(tf.not_equal(x, 0), dtype=vs.dtype), -1)
        return ret
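The mask-and-sum lookup is easier to follow on a toy vocabulary in plain Python. Everything in this sketch is made up for illustration (the tiny tables, the sizes, and the `lookup` helper), and it leaves out dropout, padding, and the sqrt_depth scaling; it only shows why redirecting out-of-group indices to row 0 is harmless.

```python
hidden_dim = 2
cutoffs = [0, 2, 4]  # two groups: indices 0-1 and 2-3

# Group 0 uses 2-dimensional embeddings, group 1 uses 1-dimensional ones.
embeddings = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0], [3.0]],
]
# One projection per group, shaped [embed_dim, hidden_dim].
projections = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0]],
]


def lookup(ids):
    ret = [[0.0] * hidden_dim for _ in ids]
    for g in range(len(cutoffs) - 1):
        low, high = cutoffs[g], cutoffs[g + 1]
        for n, idx in enumerate(ids):
            mask = int(low <= idx < high)
            # Out-of-group indices read row 0, which the mask discards below.
            row = embeddings[g][(idx - low) * mask]
            projected = [sum(row[k] * projections[g][k][j] for k in range(len(row)))
                         for j in range(hidden_dim)]
            ret[n] = [r + p * mask for r, p in zip(ret[n], projected)]
    return ret


print(lookup([0, 1, 2, 3]))
# [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
```

Each index receives a contribution from exactly one group, so the sum over groups equals a per-index lookup followed by that group's projection.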

hidden state to probability distribution

[Figure: Adaptive Softmax]

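The probability distribution over the whole vocabulary is computed in two steps. First compute the distribution over the head $V_h$, which has $2^{14} + 3$ dimensions; its last 3 dimensions are the probabilities of groups $V_1$, $V_2$, and $V_3$. Then compute the distributions inside groups $V_1$, $V_2$, and $V_3$ separately. The probability of vocabulary index $i$ is: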
$P(i) = \begin{cases} P_{V_h}(i), & 0 \le i < 2^{14} \\ P_{V_h}(2^{14}) \cdot P_{V_1}(i - 2^{14}), & 2^{14} \le i < 2^{15} \\ P_{V_h}(2^{14} + 1) \cdot P_{V_2}(i - 2^{15}), & 2^{15} \le i < 2^{16} \\ P_{V_h}(2^{14} + 2) \cdot P_{V_3}(i - 2^{16}), & 2^{16} \le i < 2^{17} \end{cases}$
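The piecewise formula can be checked on a toy example. This is a sketch in plain Python with made-up logits and sizes (2 head words and two tail groups), and `full_distribution` is a hypothetical helper; it shows that multiplying each group's probability by the within-group distribution yields a valid distribution over the whole vocabulary.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def full_distribution(head_logits, tail_logits, head_size):
    """head_logits holds head_size word logits followed by one logit per tail group."""
    head = softmax(head_logits)
    probs = head[:head_size]  # head words keep their head probability
    for k, logits in enumerate(tail_logits):
        group_prob = head[head_size + k]  # probability of entering tail group k
        probs += [group_prob * p for p in softmax(logits)]
    return probs


# Toy sizes (assumed): 2 head words, then tail groups of 2 and 3 words.
head_logits = [0.5, -1.0, 0.2, 0.1]
tail_logits = [[1.0, 2.0], [0.0, 0.3, -0.7]]
probs = full_distribution(head_logits, tail_logits, head_size=2)
```

The 7 resulting probabilities sum to 1, and the probabilities inside each tail group sum to that group's entry in the head distribution.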

def top(self, body_output, _):

    scope_name = "shared"
    reuse = tf.AUTO_REUSE

    body_output_shape = common_layers.shape_list(body_output)
    body_output = tf.reshape(body_output, [-1, body_output_shape[-1]])

    hidden_dim = self._body_input_depth
    cutoffs = [0] + self._model_hparams.cutoffs + [self._vocab_size]
    cluster_num = len(cutoffs) - 2
    with tf.variable_scope(scope_name, reuse=reuse):
        kernel_cluster = tf.get_variable(name="kernel-cluster",
                                         shape=[hidden_dim, cluster_num],
                                         initializer=tf.random_normal_initializer(0.0, hidden_dim ** -0.5))
        embeddings, projections = self._get_weights(body_output_shape[-1])

        cluster_probs = None
        outputs = []
        for i in range(len(cutoffs) - 1):
            cluster_input = tf.matmul(body_output, projections[i], transpose_b=True)
            cluster_output = tf.matmul(cluster_input, embeddings[i], transpose_b=True)

            if cluster_probs is None:
                cluster_probs = tf.matmul(cluster_input, kernel_cluster)
                cluster_output = tf.concat([cluster_output, cluster_probs], axis=-1)
                cluster_output = tf.nn.log_softmax(cluster_output, axis=-1)
                cluster_probs = cluster_output[..., -cluster_num:]
                cluster_output = cluster_output[..., :-cluster_num]
            else:
                cluster_output = tf.nn.log_softmax(cluster_output, axis=-1)
                cluster_output = cluster_output + tf.expand_dims(cluster_probs[..., i - 1], axis=-1)

            outputs.append(cluster_output)
        probs = tf.concat(outputs, axis=-1)
        return tf.reshape(probs, body_output_shape[:-1] + [1, self._vocab_size])

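kernel_cluster is the projection matrix that produces the probabilities of groups $V_1$, $V_2$, and $V_3$. To avoid multiplying probabilities, tf.nn.log_softmax is used so that the products become sums, which keeps the computation numerically stable.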
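A small sketch of why the log form helps, in plain Python with made-up extreme logits. The `log_softmax` below uses the usual max-shift trick; I am assuming it behaves like tf.nn.log_softmax for this purpose and have not compared it against TensorFlow's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax via the max-shift trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


# Made-up logits: 2 head words plus 1 tail group, and 3 words inside that group.
head = log_softmax([1000.0, 0.0, 999.0])
tail = log_softmax([-2000.0, 0.0, 1.0])

# log P(word) = log P_head(group) + log P_group(word): a sum instead of a product.
log_p = head[2] + tail[0]

# In probability space the same word underflows to exactly 0.0.
underflowed = math.exp(tail[0])
```

Here `log_p` stays finite (about -2002.6), while the probability itself is too small for a float and its logarithm could not be taken afterwards.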

note

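Because log_softmax is used, the output of top is the log of the probability distribution. The logits-based cross-entropy loss functions in TensorFlow therefore cannot be used as usual when computing the loss; the loss has to be computed manually from targets.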
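Computing the loss manually amounts to picking out the log-probability of each target and averaging the negatives. The sketch below is plain Python rather than TensorFlow so that it stays self-contained; the `nll_loss` helper is hypothetical, and treating id 0 as padding is an assumption carried over from the padding mask in bottom_simple.

```python
import math


def nll_loss(log_probs, targets, pad_id=0):
    """Mean negative log-likelihood over non-padding positions.

    log_probs: one list of log-probabilities per position.
    targets: one target index per position; positions equal to pad_id are skipped.
    """
    total, count = 0.0, 0
    for row, target in zip(log_probs, targets):
        if target == pad_id:
            continue
        total -= row[target]
        count += 1
    return total / max(count, 1)


# Three positions over a 3-word vocabulary; the last target is padding.
log_probs = [
    [math.log(0.5), math.log(0.25), math.log(0.25)],
    [math.log(0.1), math.log(0.2), math.log(0.7)],
    [math.log(0.8), math.log(0.1), math.log(0.1)],
]
loss = nll_loss(log_probs, targets=[1, 2, 0])
```

In this toy case the loss is the mean of -log(0.25) and -log(0.7), because the padded position is ignored.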
