1. 讲下BERT。
-
双向二阶段预训练模型-word-piece。
- Special Token:
[CLS]
、[SEP]
。 - BERT_Base
(12 layers)
、BERT_Large(24 layers)
。 - Pre-training:
Task #1: Masked LM
、Task #2: Next Sentence Prediction
Special Token:
start and end tokens (
、,) delimiter token ($)
。-
Fine-tuning:
Two Sentence Classification
、Single Sentence Classification
、Question Answering
、Single Sentence Tagging
。
2. 能否实现下Word Piece?忘记步骤了,换成实现一下从若干文件中生成一个词典,即word2idx和idx2word。BPE算法。
WordPiece算法可以看作是BPE的变种。不同点在于,WordPiece基于概率生成新的subword而不是下一最高频字节对。
算法:
- 准备足够大的训练语料
- 确定期望的subword词表大小
- 将单词拆分成字符序列
- 基于第3步数据训练语言模型
- 从所有可能的subword单元中选择加入语言模型后能最大程度地增加训练数据概率的单元作为新的单元
- 重复第5步直到达到第2步设定的subword词表大小或概率增量低于某一阈值
class WordpieceTokenizer(object):
"""Runs WordPiece tokenization."""
def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
Byte Pair Encoding:
def bpe(self, token):
if token in self.cache:
return self.cache[token]
word = tuple(token)
pairs = get_pairs(word)
if not pairs:
return token
while True:
bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
if bigram not in self.bpe_ranks:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
except ValueError:
new_word.extend(word[i:])
break
else:
new_word.extend(word[i:j])
i = j
if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
new_word.append(first + second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
else:
pairs = get_pairs(word)
word = " ".join(word)
self.cache[token] = word
return word
def _tokenize(self, text):
""" Tokenize a string. """
bpe_tokens = []
for token in re.findall(self.pat, text):
token = "".join(
self.byte_encoder[b] for b in token.encode("utf-8")
) # Maps all our bytes to unicode strings, avoiding controle tokens of the BPE (spaces in our case)
bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
return bpe_tokens
# '想要有直升机\n想要和你飞到宇宙去\n想要和你融化在一起\n融化在宇宙里\n我每天每天每'
# 这个数据集有6万多个字符。为了打印方便,我们把换行符替换成空格,然后仅使用前1万个字符来训练模型。
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[0:10000]
idx_to_char = list(set(corpus_chars))
char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
vocab_size = len(char_to_idx)
vocab_size # 1027
3. 讲下bert,讲着讲着面试官打断了我,说你帮我估算下一层bert大概有多少参数量。
讲下BERT:如问题1。
# BertEmbeddings:
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
# BertAttention(BertSelfAttention+ BertSelfOutput):
#### BertSelfAttention
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
#### BertSelfOutput
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# BertIntermediate
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
# BertOutput
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# BertPooler
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
4. bert里add&norm是什么以及作用。
- add:Residual connection
(残差连接)
,主要是为了避免模型较深时,在进行反向传播时,梯度消失等问题。 - norm:Layer Normalization
(层归一化)
,为了解决网络中数据分布变化大,学习过程慢的问题。
5. 了不了解bert的扩展模型,roberta,SpanBERT,XLM,albert;介绍几个除了bert之外的模型。
roberta:
SpanBERT:
XLM:
albert:
transformer-xl(extra-long)
:
xlnet:
6. bert源码里mask部分在哪个模块;bert如何mask。
Multi-Head Attention:(BertSelfAttention)
class BertSelfAttention(nn.Module):
def __init__(self, config):
super().__init__()
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (config.hidden_size, config.num_attention_heads)
)
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
output_attentions=False,
):
mixed_query_layer = self.query(hidden_states)
# If this is instantiated as a cross-attention module, the keys
# and values come from an encoder; the attention mask needs to be
# such that the encoder's padding tokens are not attended to.
if encoder_hidden_states is not None:
mixed_key_layer = self.key(encoder_hidden_states)
mixed_value_layer = self.value(encoder_hidden_states)
attention_mask = encoder_attention_mask
else:
mixed_key_layer = self.key(hidden_states)
mixed_value_layer = self.value(hidden_states)
query_layer = self.transpose_for_scores(mixed_query_layer)
key_layer = self.transpose_for_scores(mixed_key_layer)
value_layer = self.transpose_for_scores(mixed_value_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
# Apply the attention mask is (precomputed for all layers in BertModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
# Mask heads if we want to
if head_mask is not None:
attention_probs = attention_probs * head_mask
context_layer = torch.matmul(attention_probs, value_layer)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(*new_context_layer_shape)
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
return outputs
7. 估计一下bert的参数量。
估算下bert的参数量:如问题3。
- : L=12, H=768, A=12, Total Parameters=110M。
- : L=24, H=1024, A=16, Total Parameters=340M。
8. roberta和bert在预训练时的不同。
- Static vs. Dynamic Masking
- Model Input Format and Next Sentence Prediction
9. 介绍下roberta,为什么选用wwm。
同上:
Whole Word Masking (wwm),暂翻译为全词Mask或整词Mask,是谷歌在2019年5月31日发布的一项BERT的升级版本,主要更改了原预训练阶段的训练样本生成策略。 简单来说,原有基于WordPiece的分词方式会把一个完整的词切分成若干个子词,在生成训练样本时,这些被分开的子词会随机被mask。 在全词Mask中,如果一个完整的词的部分WordPiece子词被mask,则同属该词的其他部分也会被mask,即全词Mask。
需要注意的是,这里的mask指的是广义的mask(替换成[MASK];保持原词汇;随机替换成另外一个词),并非只局限于单词替换成[MASK]
标签的情况。 更详细的说明及样例请参考:#4
同理,由于谷歌官方发布的BERT-base, Chinese
中,中文是以字为粒度进行切分,没有考虑到传统NLP中的中文分词(CWS)。 我们将全词Mask的方法应用在了中文中,使用了中文维基百科(包括简体和繁体)进行训练,并且使用了哈工大LTP作为分词工具,即对组成同一个词的汉字全部进行Mask。
下述文本展示了全词Mask
的生成样例。 注意:为了方便理解,下述例子中只考虑替换成[MASK]标签的情况。
10. BERT、GPT、ELMO之间的区别(模型结构、训练方式)。
- BERT-双向两阶段预训练模型:Pre-training
(MLM、NSP)
+Fine-Tuning(...)
- GPT-单向两阶段预训练模型:Pre-training
(LM)
+Fine-Tuning(...)
- ELMO-两个方向预训练模型:Pre-training
(两个单向LM)
+supervised NLP tasks(使用LSTM各层表征)
11. BERT为什么只用Transformer的Encoder而不用Decoder。
BERT在Pre-training过程中使用的Masked Language Model(AE)
,联合上下文信息预测被[MASK]掉的标记。而Decoder采用的一种单向的语言模型(LM)
。所以BERT使用Encoder,而不是用Decoder。
12. xlnet和bert有啥不同。自回归&&自编码的知识,其中解释了xlnet排列语言模型以及双流attention。
bert:自回归;xlnet:自回归&&自编码。