ALBERT is essentially a lightweight version of BERT: an ALBERT configuration comparable to BERT-large has only about 1/18 as many parameters and trains roughly 1.7 times faster.
ALBERT makes three main improvements over BERT, which together shrink the overall parameter count, speed up training, and improve model quality.
The first improvement is factorized embedding parameterization: the large vocabulary embedding matrix is decomposed into two smaller matrices, decoupling the hidden size from the embedding size. This makes it much easier to grow the hidden layer without significantly increasing the number of embedding parameters. (Instead of projecting the one-hot vectors directly into a hidden space of size H, the model first projects them into a low-dimensional embedding space of size E and then projects that into the hidden space. The factorization reduces the embedding parameters from O(V × H) to O(V × E + E × H), which is a large saving when H is much bigger than E.)
The second improvement is cross-layer parameter sharing, which prevents the parameter count from growing with the depth of the network.
The third improvement is sentence-order prediction (SOP), which replaces NSP. Its positive examples are built the same way as NSP's, but its negative examples are constructed by taking two consecutive sentences from the same document and swapping their order. Because the two sentences then still share the same topic, what the model learns is mostly inter-sentence coherence. SOP is a sentence-level prediction task focused on coherence, designed to fix the ineffective next-sentence prediction (NSP) loss of the original BERT.
In BERT, the token embedding parameter matrix has size V × H, where V is the vocabulary size and H is the hidden size, so the embedding cost is O(V × H).
To reduce this, ALBERT inserts an intermediate embedding of size E into the mapping, which lowers the embedding parameter count from O(V × H) to O(V × E + E × H), with E ≪ H.
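To make the saving concrete, here is a quick back-of-the-envelope sketch using the sizes from the AlBertConfig defined further down (V = 30000, E = 128, H = 4096); the numbers are only illustrative.
# Rough embedding parameter counts, before and after factorization.
V, H, E = 30000, 4096, 128        # vocab size, hidden size, embedding size
bert_style = V * H                # one big V x H embedding matrix
albert_style = V * E + E * H      # V x E lookup followed by an E x H projection
print(f"V x H         = {bert_style:,}")                      # 122,880,000
print(f"V x E + E x H = {albert_style:,}")                    # 4,364,288
print(f"reduction     = {bert_style / albert_style:.1f}x")    # ~28x fewer embedding parameters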
This works because each backward pass only updates the embedding parameters of the tokens that actually appear in the batch; the rest of the table is untouched. Moreover, tokens do not interact with one another during this first projection; they only interact later, in the attention layers. This is what is meant by "sparsely updated". If tokens never interact at this stage, there is no need to represent them with a very high-dimensional vector, so a small intermediate embedding dimension is introduced.
ALBERT's parameter sharing is applied across all of the Transformer sub-modules, which brings the attention/feed-forward parameter count down from O(12 × L × H × H) to O(12 × H × H), where L is the number of layers and H is the hidden size.
Parameter sharing cuts the parameter count dramatically. The sharing can cover the feed-forward layers, the attention layers, or both; sharing only the attention parameters degrades accuracy the least.
ALBERT does this because the individual layers turn out to learn very similar things, so the authors tried sharing their parameters across layers.
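A minimal sketch of the pattern (this is not the implementation below, which lives in TFAlBertTransformer, and the sizes are made up): instead of building L distinct blocks, a single block object is created once and called L times, so its weights are reused at every depth.
import tensorflow as tf
from tensorflow.keras import layers
hidden_size, num_layers = 768, 12
# BERT style would create 12 independent blocks, i.e. roughly 12 x H x H weights just for these layers.
# ALBERT style: one block reused at every layer, i.e. a single H x H weight matrix (plus bias).
shared_block = layers.Dense(hidden_size, name="shared_block")
x = tf.random.normal((2, 16, hidden_size))    # (batch, seq_len, hidden)
h = x
for _ in range(num_layers):
    h = shared_block(h)                       # the same weights are applied at all 12 depths
print("parameters in the shared block:", shared_block.count_params())   # 768*768 + 768 = 590,592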
The reason NSP works poorly is that it is too easy. It bundles topic prediction together with coherence prediction, and topic prediction is far simpler than coherence prediction; it also overlaps with what the masked-LM loss already learns.
SOP is still a simple task. Its positive examples are the same as NSP's (deciding whether two sentences follow one another in order), while its negative examples ask whether the two sentences have been reversed.
The SOP positives are selected exactly as in BERT (two consecutive segments from the same document). The negatives, unlike BERT's randomly sampled segments, are the same two consecutive segments from one document but with their order swapped. This removes the topic-prediction shortcut and forces the model to focus on modeling inter-sentence coherence.
NSP: next-sentence prediction; positive = two adjacent sentences, negative = two randomly chosen sentences.
SOP: sentence-order prediction; positive = two adjacent sentences in their original order, negative = the same two adjacent sentences with their order swapped.
Here is an SOP example (see the sketch after it for how such pairs can be generated):
Positive: 1. Zhu Yuanzhang founded the Ming dynasty. 2. Zhu Yuanzhang executed Lan Yu.
Negative: 1. Zhu Yuanzhang executed Lan Yu. 2. Zhu Yuanzhang founded the Ming dynasty.
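A minimal sketch of how SOP training pairs could be built from the sentences of one document (my own illustration, not code from the ALBERT repository; the helper name make_sop_example and the 50/50 split are assumptions):
import random
def make_sop_example(sentences, i):
    # Build one SOP pair from sentences[i] and sentences[i + 1] of the same document.
    # Returns (segment_a, segment_b, label): label 1 = original order, label 0 = swapped order.
    first, second = sentences[i], sentences[i + 1]
    if random.random() < 0.5:
        return first, second, 1    # positive: natural order
    return second, first, 0        # negative: the same two sentences, order swapped
sentences = ["Zhu Yuanzhang founded the Ming dynasty.", "Zhu Yuanzhang executed Lan Yu."]
print(make_sop_example(sentences, 0))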
Paper: https://arxiv.org/abs/1909.11942
Code (1): https://github.com/google-research/ALBERT
Code (2): https://github.com/google-research/google-research
import math
from dataclasses import dataclass
from typing import Optional, Union, Tuple
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from transformers import PretrainedConfig, shape_list
from transformers.activations_tf import get_tf_activation
from transformers.modeling_tf_utils import unpack_inputs, TFModelInputType, get_initializer
from transformers.tf_utils import stable_softmax
from transformers.utils import ModelOutput
@dataclass
class TFBaseModelOutput(ModelOutput):
    # Returned by the encoder (TFAlBertTransformer) when return_dict=True.
    last_hidden_state: tf.Tensor = None
    hidden_states: Optional[Tuple[tf.Tensor]] = None
    attentions: Optional[Tuple[tf.Tensor]] = None
@dataclass
class TFBaseModelOutputWithPooling(ModelOutput):
    # Returned by the full model: the sequence output plus the pooled representation of the first ([CLS]) token.
    last_hidden_state: tf.Tensor = None
    pooler_output: tf.Tensor = None
    hidden_states: Optional[Tuple[tf.Tensor]] = None
    attentions: Optional[Tuple[tf.Tensor]] = None
class AlBertConfig(PretrainedConfig):
model_type = "albert"
def __init__(
self,
vocab_size=30000,
embedding_size=128,
hidden_size=4096,
num_hidden_layers=12,
num_hidden_groups=1,
num_attention_heads=64,
intermediate_size=16384,
inner_group_num=1,
hidden_act="gelu_new",
hidden_dropout_prob=0,
attention_probs_dropout_prob=0,
max_position_embeddings=512,
type_vocab_size=2,
initializer_range=0.02,
layer_norm_eps=1e-12,
classifier_dropout_prob=0.1,
position_embedding_type="absolute",
pad_token_id=0,
bos_token_id=2,
eos_token_id=3,
**kwargs
):
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
self.vocab_size = vocab_size
self.embedding_size = embedding_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_hidden_groups = num_hidden_groups
self.num_attention_heads = num_attention_heads
self.inner_group_num = inner_group_num
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.classifier_dropout_prob = classifier_dropout_prob
self.position_embedding_type = position_embedding_type
class TFAlBertModel(layers.Layer):
def __init__(self, config: AlBertConfig, *inputs, **kwargs):
        super().__init__(*inputs, **kwargs)
        # Keep a reference to the config; the @unpack_inputs decorator and downstream code expect self.config to exist.
        self.config = config
        self.albert = TFAlBertMainLayer(config, name="albert")
@unpack_inputs
def call(
self,
input_ids: Optional[TFModelInputType] = None,
        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
training: Optional[bool] = False
) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]:
outputs = self.albert(
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
training=training
)
return outputs
class TFAlBertMainLayer(layers.Layer):
def __init__(self, config: AlBertConfig, add_pooling_layer: bool = True, **kwargs):
super().__init__(**kwargs)
self.config = config
self.embeddings = TFAlBertEmbeddings(config, name="embeddings")
self.encoder = TFAlBertTransformer(config, name="encoder")
self.pooler = layers.Dense(
units=config.hidden_size,
kernel_initializer=get_initializer(config.initializer_range),
activation="tanh",
name="pooler"
) if add_pooling_layer else None
@unpack_inputs
def call(
self,
input_ids: Optional[TFModelInputType] = None,
attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
training: Optional[bool] = False
) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]:
if input_ids is not None and inputs_embeds is not None:
raise ValueError("你不能同时指定input_ids和inputs_embeds这两个参数!")
elif input_ids is not None:
input_shape = shape_list(input_ids)
elif inputs_embeds is not None:
input_shape = shape_list(inputs_embeds)[: -1]
else:
raise ValueError("你需要指定input_ids或者inputs_embeds两个参数中的一个!")
        # If attention_mask is None, default it to all ones: 1 means the token is attended to, 0 means it is masked out.
        # Below, (1 - mask) * -10000 turns masked positions into large negative biases that are added to the attention scores.
if attention_mask is None:
attention_mask = tf.fill(dims=input_shape, value=1)
if token_type_ids is None:
token_type_ids = tf.fill(dims=input_shape, value=0)
embedding_output = self.embeddings(
input_ids=input_ids,
position_ids=position_ids,
token_type_ids=token_type_ids,
inputs_embeds=inputs_embeds,
training=training
)
        # Expand attention_mask from shape (batch_size, seq_len) to (batch_size, 1, 1, seq_len)
        # so that broadcasting lets it combine with attention scores of shape (batch_size, num_heads, seq_len, seq_len).
extended_attention_mask = tf.reshape(attention_mask, (input_shape[0], 1, 1, input_shape[1]))
extended_attention_mask = tf.cast(extended_attention_mask, dtype=embedding_output.dtype)
one_cst = tf.constant(1.0, dtype=embedding_output.dtype)
ten_thousand_cst = tf.constant(-10000.0, dtype=embedding_output.dtype)
extended_attention_mask = tf.multiply(tf.subtract(one_cst, extended_attention_mask), ten_thousand_cst)
if head_mask is not None:
raise NotImplementedError
else:
head_mask = [None] * self.config.num_hidden_layers
encoder_outputs = self.encoder(
hidden_states=embedding_output,
attention_mask=extended_attention_mask,
head_mask=head_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
training=training
)
sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output[:, 0]) if self.pooler is not None else None
if not return_dict:
return (sequence_output, pooled_output, ) + encoder_outputs[1:]
return TFBaseModelOutputWithPooling(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions
)
class TFAlBertEmbeddings(layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.vocab_size = config.vocab_size
self.type_vocab_size = config.type_vocab_size
self.embedding_size = config.embedding_size
self.max_position_embeddings = config.max_position_embeddings
self.initializer_range = config.initializer_range
self.LayerNorm = layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
self.dropout = layers.Dropout(rate=config.hidden_dropout_prob)
def build(self, input_shape):
with tf.name_scope("word_embeddings"):
self.weight = self.add_weight(
name="weight",
shape=[self.vocab_size, self.embedding_size],
initializer=get_initializer(self.initializer_range),
)
with tf.name_scope("token_type_embeddings"):
self.token_type_embeddings = self.add_weight(
name="embeddings",
shape=[self.type_vocab_size, self.embedding_size],
initializer=get_initializer(self.initializer_range),
)
with tf.name_scope("position_embeddings"):
self.position_embeddings = self.add_weight(
name="embeddings",
shape=[self.max_position_embeddings, self.embedding_size],
initializer=get_initializer(self.initializer_range),
)
super().build(input_shape)
def call(
self,
input_ids=None,
position_ids=None,
token_type_ids=None,
inputs_embeds=None,
past_key_values_length=0,
training=False
):
if input_ids is None and inputs_embeds is None:
raise ValueError("Need to provide either `input_ids` or `input_embeds`.")
if input_ids is not None:
inputs_embeds = tf.gather(params=self.weight, indices=input_ids)
input_shape = shape_list(inputs_embeds)[: -1]
if token_type_ids is None:
token_type_ids = tf.fill(dims=input_shape, value=0)
if position_ids is None:
position_ids = tf.expand_dims(
                tf.range(start=past_key_values_length, limit=input_shape[1] + past_key_values_length), axis=0
)
position_embeds = tf.gather(params=self.position_embeddings, indices=position_ids)
token_type_embeds = tf.gather(params=self.token_type_embeddings, indices=token_type_ids)
final_embeddings = inputs_embeds + position_embeds + token_type_embeds
final_embeddings = self.LayerNorm(inputs=final_embeddings)
final_embeddings = self.dropout(inputs=final_embeddings, training=training)
return final_embeddings
class TFAlBertTransformer(layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.num_hidden_layers = config.num_hidden_layers
self.num_hidden_groups = config.num_hidden_groups
self.layers_per_group = int(config.num_hidden_layers / config.num_hidden_groups)
self.embedding_hidden_mapping_in = layers.Dense(
units=config.hidden_size,
kernel_initializer=get_initializer(config.initializer_range),
name="embedding_hidden_mapping_in"
)
self.albert_layer_groups = [
TFAlBertLayerGroup(config, name=f"albert_layer_groups_._{i}") for i in range(config.num_hidden_groups)
]
def call(
self,
hidden_states=None,
attention_mask=None,
head_mask=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
training=False
):
all_attentions = () if output_attentions else None
all_hidden_states = () if output_hidden_states else None
        # Factorized embedding parameterization
        # In BERT the hidden size equals the word-embedding size (embedding_size == hidden_size).
        # In ALBERT the two differ, with embedding_size < hidden_size, so this dense layer maps
        # (batch_size, seq_len, embedding_size) -> (batch_size, seq_len, hidden_size).
hidden_states = self.embedding_hidden_mapping_in(hidden_states)
        # Cross-layer parameter sharing
        # Layers in the same group reuse the same parameters as the input flows through the stack.
        # In BERT, every encoder layer gets its own attention and feed-forward sub-layers (new objects per layer).
        # In ALBERT, all layers reuse the same attention and feed-forward sub-layer objects.
        # Note: one could share only the attention or only the feed-forward parameters; here both are shared.
for i in range(self.num_hidden_layers):
group_idx = int(i / (self.num_hidden_layers / self.num_hidden_groups))
layer_group_output = self.albert_layer_groups[group_idx](
hidden_states=hidden_states,
attention_mask=attention_mask,
head_mask=head_mask[group_idx * self.layers_per_group: (group_idx + 1) * self.layers_per_group],
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
training=training
)
hidden_states = layer_group_output[0]
if output_attentions:
all_attentions = all_attentions + layer_group_output[-1]
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states, )
        if not return_dict:
            return tuple(v for v in [hidden_states, all_hidden_states, all_attentions] if v is not None)
        # With return_dict=True, wrap the encoder outputs in a dataclass so callers can access them by name.
        return TFBaseModelOutput(
            last_hidden_state=hidden_states,
            hidden_states=all_hidden_states,
            attentions=all_attentions
        )
class TFAlBertLayerGroup(layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.albert_layers = [
TFAlBertLayer(config, name=f"albert_layers_._{i}") for i in range(config.inner_group_num)
]
def call(
self,
hidden_states=None,
attention_mask=None,
head_mask=None,
output_attentions=None,
output_hidden_states=None,
training=False,
):
layer_hidden_states = () if output_hidden_states else None
layer_attentions = () if output_attentions else None
for layer_index, albert_layer in enumerate(self.albert_layers):
if output_hidden_states:
layer_hidden_states = layer_hidden_states + (hidden_states, )
layer_output = albert_layer(
hidden_states=hidden_states,
attention_mask=attention_mask,
head_mask=head_mask[layer_index],
output_attentions=output_attentions,
training=training
)
hidden_states = layer_output[0]
if output_attentions:
layer_attentions = layer_attentions + layer_output[1]
if output_hidden_states:
layer_hidden_states = layer_hidden_states + (hidden_states,)
return tuple(v for v in [hidden_states, layer_hidden_states, layer_attentions] if v is not None)
class TFAlBertLayer(layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.attention = TFAlBertAttention(config, name="attention")
self.ffn = layers.Dense(
units=config.intermediate_size,
kernel_initializer=get_initializer(config.initializer_range),
name="ffn"
)
if isinstance(config.hidden_act, str):
self.activation = get_tf_activation(config.hidden_act)
else:
self.activation = config.hidden_act
self.ffn_output = layers.Dense(
            units=config.hidden_size,  # project back from intermediate_size to hidden_size for the residual connection
kernel_initializer=get_initializer(config.initializer_range),
name="ffn_output"
)
self.full_layer_layer_norm = layers.LayerNormalization(
epsilon=config.layer_norm_eps,
name="full_layer_layer_norm"
)
self.dropout = layers.Dropout(rate=config.hidden_dropout_prob)
def call(
self,
hidden_states=None,
attention_mask=None,
head_mask=None,
output_attentions=None,
training=False
):
# self-attention
        attention_outputs = self.attention(
            input_tensor=hidden_states,
            attention_mask=attention_mask,
            head_mask=head_mask,
            output_attentions=output_attentions,
            training=training
        )
# ffn
ffn_output = self.ffn(attention_outputs[0])
ffn_output = self.activation(ffn_output)
ffn_output = self.ffn_output(ffn_output)
ffn_output = self.dropout(ffn_output, training=training)
hidden_states = self.full_layer_layer_norm(ffn_output + attention_outputs[0])
outputs = (hidden_states, ) + attention_outputs[1:]
return outputs
class TFAlBertAttention(layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
if config.hidden_size % config.num_attention_heads != 0:
raise ValueError(
f"The hidden size ({config.hidden_size}) is not a multiple of the number "
f"of attention heads ({config.num_attention_heads})"
)
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.sqrt_att_head_size = math.sqrt(self.attention_head_size)
self.output_attentions = config.output_attentions
self.query = tf.keras.layers.Dense(
units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="query"
)
self.key = tf.keras.layers.Dense(
units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="key"
)
self.value = tf.keras.layers.Dense(
units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="value"
)
self.dense = tf.keras.layers.Dense(
units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
)
self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
self.attention_dropout = tf.keras.layers.Dropout(rate=config.attention_probs_dropout_prob)
self.output_dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
def transpose_for_scores(self, tensor: tf.Tensor, batch_size: int) -> tf.Tensor:
tensor = tf.reshape(tensor=tensor, shape=(batch_size, -1, self.num_attention_heads, self.attention_head_size))
return tf.transpose(tensor, perm=[0, 2, 1, 3])
def call(
self,
input_tensor=None,
attention_mask=None,
head_mask=None,
output_attentions=None,
training=False
):
batch_size = shape_list(input_tensor)[0]
mixed_query_layer = self.query(inputs=input_tensor)
mixed_key_layer = self.key(inputs=input_tensor)
mixed_value_layer = self.value(inputs=input_tensor)
query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)
key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)
value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)
# Take the dot product between "query" and "key" to get the raw attention scores.
# (batch size, num_heads, seq_len_q, seq_len_k)
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
dk = tf.cast(self.sqrt_att_head_size, dtype=attention_scores.dtype)
attention_scores = tf.divide(attention_scores, dk)
if attention_mask is not None:
# Apply the attention mask is (precomputed for all layers in TFAlbertModel call() function)
attention_scores = tf.add(attention_scores, attention_mask)
# Normalize the attention scores to probabilities.
attention_probs = stable_softmax(logits=attention_scores, axis=-1)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.attention_dropout(inputs=attention_probs, training=training)
# Mask heads if we want to
if head_mask is not None:
attention_probs = tf.multiply(attention_probs, head_mask)
context_layer = tf.matmul(attention_probs, value_layer)
context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])
# (batch_size, seq_len_q, all_head_size)
context_layer = tf.reshape(tensor=context_layer, shape=(batch_size, -1, self.all_head_size))
self_outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
hidden_states = self_outputs[0]
hidden_states = self.dense(inputs=hidden_states)
hidden_states = self.output_dropout(inputs=hidden_states, training=training)
attention_output = self.LayerNorm(inputs=hidden_states + input_tensor)
# add attentions if we output them
outputs = (attention_output,) + self_outputs[1:]
return outputs
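To close, a hedged usage sketch of the classes above: it assumes the transformers imports at the top of this post resolve in your environment (the private-module paths can shift between transformers versions), and it uses a deliberately small configuration with arbitrary sizes so the forward pass runs quickly.
# Build a tiny config and run one forward pass on dummy token ids.
config = AlBertConfig(
    vocab_size=30000,
    embedding_size=128,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=512,
)
model = TFAlBertModel(config)
input_ids = tf.constant([[2, 15, 27, 343, 3]])        # (batch_size=1, seq_len=5)
attention_mask = tf.ones_like(input_ids)
outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
print(outputs.last_hidden_state.shape)                # (1, 5, 256)
print(outputs.pooler_output.shape)                    # (1, 256)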