Source code link
Pretraining code link (by the original author)
The model architecture was analyzed in an earlier post, so the analysis here starts from the training process.
training_args = TrainingArguments(
    output_dir='record',
    num_train_epochs=num_train_epochs,
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    save_steps=save_steps,
    logging_steps=500,
    save_total_limit=5,
    prediction_loss_only=True,
    seed=seed
)
This produces the following training_args:
training_args =
output_dir:record
overwrite_output_dir:False
do_train:False
do_eval:False
do_predict:False
evaluation_strategy:IntervalStrategy.NO
prediction_loss_only:True
per_device_train_batch_size:32
per_device_eval_batch_size:8
per_gpu_train_batch_size:None
per_gpu_eval_batch_size:None
gradient_accumulation_steps:1
Next, define the Trainer:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)
Then calling trainer.train() starts the training process.
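The data_collator handed to the Trainer is not defined in this excerpt; it is presumably constructed along the following lines (a sketch only, with the vocab path as a placeholder rather than the script's real value):

from transformers import BertTokenizer, DataCollatorForLanguageModeling

# Placeholder path; the original script points at nezha-chinese-base's vocab.txt.
tokenizer = BertTokenizer.from_pretrained('nezha-chinese-base/vocab.txt')
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,               # masked language modeling
    mlm_probability=0.15,   # matches the default field shown below
)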
During training, each batch first goes through data_collator.py, which is where mask_tokens eventually gets called.
Start with the DataCollatorForLanguageModeling class itself:
class DataCollatorForLanguageModeling:
    tokenizer: PreTrainedTokenizerBase
    mlm: bool = True
    mlm_probability: float = 0.15
    pad_to_multiple_of: Optional[int] = None

    def __post_init__(self):
        if self.mlm and self.tokenizer.mask_token is None:
            raise ValueError(
                "This tokenizer does not have a mask token which is necessary for masked language modeling. "
                "You should pass `mlm=False` to train on causal language modeling instead."
            )
These are the fields set during initialization (already covered earlier), so we go straight to the class's __call__ method:
def __call__(
    self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
) -> Dict[str, torch.Tensor]:
    if isinstance(examples[0], (dict, BatchEncoding)):
        batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
    else:
        batch = {"input_ids": _collate_batch(examples, self.tokenizer, pad_to_multiple_of=self.pad_to_multiple_of)}

    # If special token mask has been preprocessed, pop it from the dict.
    special_tokens_mask = batch.pop("special_tokens_mask", None)
    # special_tokens_mask = None
    if self.mlm:
        # special_tokens_mask = None
        batch["input_ids"], batch["labels"] = self.mask_tokens(
            batch["input_ids"], special_tokens_mask=special_tokens_mask
        )
    else:
        labels = batch["input_ids"].clone()
        if self.tokenizer.pad_token_id is not None:
            labels[labels == self.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
    return batch
The examples passed into __call__ look like this:
[{'input_ids': tensor([ 101, 169, 107, 10539, 142, 8231, 107, 131, 107, 146,
8189, 8168, 9402, 8156, 8177, 8660, 8154, 8408, 8921, 8148,
............
100, 100, 107, 171, 117, 169, 107, 10539, 107, 102])},
............
{'input_ids': tensor([ 101, 169, 107, 10539, 142, 8231, 107, 131, 107, 8424,
8161, 12540, 12675, 8154, 10696, 9647, 8168, 8849, 8139, 9355,
............
100, 8123, 118, 8143, 100, 100, 100, 100, 100, 102])},
............]
Next the pad function is called:
batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
Notice that the data above contains a lot of 100s, which suggests something happened during the earlier tokenization step.
Look back at the tokenization done earlier in LineByLineTextDataset:
batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
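For context, the relevant part of LineByLineTextDataset looks roughly like this (paraphrased from the transformers source; details may vary slightly between versions):

import torch
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer

class LineByLineTextDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
        # Read the corpus line by line, dropping empty lines.
        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
        # Each line is tokenized independently; this is the call quoted above.
        batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
        self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in batch_encoding["input_ids"]]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return self.examples[i]

This is why every element later handed to the collator is a dict holding a single input_ids tensor.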
In the dataset's initialization the tokenizer parameter is annotated as
tokenizer: PreTrainedTokenizer
so self.tokenizer.pad in the collator resolves to the pad method of the PreTrainedTokenizer hierarchy (pad itself is implemented on the parent class, PreTrainedTokenizerBase). In other words, the PreTrainedTokenizer machinery is what actually performs the tokenization and padding.
The real tokenizer is created in main():
tokenizer = BertTokenizer.from_pretrained(vocab_file)
Printing it out, however, the object is still displayed as a PreTrainedTokenizer:
tokenizer =
PreTrainedTokenizer(name_or_path=
'/home/xiaoguzai/数据/nezha-chinese-base/vocab.txt', vocab_size=21128,
model_max_len=1000000000000000019884624838656, is_fast=False,
padding_side='right', special_tokens={'unk_token': '[UNK]',
'sep_token': '[SEP]', 'pad_token': '[PAD]',
'cls_token': '[CLS]', 'mask_token': '[MASK]'})
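The printout says PreTrainedTokenizer even though a BertTokenizer was built, because in this version of transformers the repr string is hard-coded in the base class (it only distinguishes fast from slow tokenizers). Checking the actual type makes this clear; a quick sketch, reusing the tokenizer above:

from transformers import PreTrainedTokenizer

print(type(tokenizer))                              # <class '...BertTokenizer'>
print(isinstance(tokenizer, PreTrainedTokenizer))   # True: BertTokenizer subclasses PreTrainedTokenizer
print(tokenizer)                                    # the repr always begins with 'PreTrainedTokenizer(...)'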
So the calls here land in the tokenizer's own methods (the same machinery that DataCollatorForWholeWordMask would also go through). The observed call trace is:
PreTrainedTokenizer tokenize
PreTrainedTokenizer split_on_tokens
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer convert_tokens_to_ids
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
......
The next line of the dataset then goes through exactly the same call sequence (tokenize → split_on_tokens → five split_on_token calls → convert_tokens_to_ids → repeated _convert_token_to_id_with_added_voc), and this pattern repeats for every line.
One question here: why does calling the tokenizer object directly end up inside PreTrainedTokenizer's tokenize function? The answer is that tokenizer(lines, ...) goes through the class's __call__ method, which works its way down to tokenize, much like calling model(input_ids) on a built model dispatches to its forward computation.
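The dispatch relies on Python's __call__ protocol rather than on any special binding; a minimal illustration of the same pattern (a toy class, not the transformers code):

class MiniTokenizer:
    def __call__(self, text):
        # Calling the object like a function lands here, and this method
        # delegates to tokenize(), just as PreTrainedTokenizerBase.__call__
        # eventually reaches tokenize() via encode_plus.
        return self.tokenize(text)

    def tokenize(self, text):
        return text.split()

tok = MiniTokenizer()
print(tok("hello world"))   # ['hello', 'world'] -- tokenize() was reached without being called explicitly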
Step into PreTrainedTokenizer's tokenize function to follow the call:
def tokenize(self, text: TextInput, **kwargs) -> List[str]:
The text passed in here is:
{"text_id": "e225b9fd36b8914f42c188fc92e8918f", "query": "河南省巩义市新华路街道办事处桐和街6号钢苑新区3号楼一单元", "candidate": [{"text": "巩义市桐和街", "label": "不匹配"},
{"text": "桐和街依家小店", "label": "不匹配"}, {"text": "桐和街CHANG六LIULIU", "label": "不匹配"}, {"text": "桐和街佳乐钢琴", "label": "不匹配"},
{"text": "世博领秀城南门桐和街囍饭食堂", "label": "不匹配"}]}
Next, all_special_tokens_extended is built:
all_special_tokens_extended = dict(
    (str(t), t) for t in self.all_special_tokens_extended if isinstance(t, AddedToken)
)
Printing it shows that it is empty:
all_special_tokens_extended = {}
Then prepare_for_tokenization is called:
text, kwargs = self.prepare_for_tokenization(text, **kwargs)
Both come back unchanged, and kwargs remains {}.
Next comes
if hasattr(self, "do_lower_case") and self.do_lower_case:
    ......
When do_lower_case is set, this branch lower-cases everything except the special tokens; it is not taken in this run, so the text passes through unchanged.
Next come the split_on_tokens and split_on_token helpers:
no_split_token = self.unique_no_split_tokens
tokenized_text = split_on_tokens(no_split_token, text)
Stepping into split_on_tokens (shown below with my added print statement):
def split_on_tokens(tok_list, text):
    print('PreTrainedTokenizer split_on_tokens')
    if not text.strip():
        return []
    if not tok_list:
        return self._tokenize(text)

    tokenized_text = []
    text_list = [text]
    for tok in tok_list:
        tokenized_text = []
        for sub_text in text_list:
            if sub_text not in self.unique_no_split_tokens:
                tokenized_text.extend(split_on_token(tok, sub_text))
            else:
                tokenized_text.append(sub_text)
        text_list = tokenized_text

    return list(
        itertools.chain.from_iterable(
            (
                self._tokenize(token) if token not in self.unique_no_split_tokens else [token]
                for token in tokenized_text
            )
        )
    )
Apart from the added print, the body is unchanged from the library version; the key part is the final return:
return list(
    itertools.chain.from_iterable(
        (
            self._tokenize(token) if token not in self.unique_no_split_tokens else [token]
            for token in tokenized_text
        )
    )
)
Here
self.unique_no_split_tokens = ['[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]']
None of these tokens appear literally in the raw text, so the whole string is ultimately handed to self._tokenize, and the resulting token list is:
['{', '"', 'text', '_', 'id', '"', ':', '"', 'e2', '##25', '##b', '##9', '##f', '##d', '##36', '##b', '##89', '##14', '##f', '##42', '##c', '##18', '##8', '##fc', '##92', '##e', '##89', '##18', '##f', '"', ',', '"', 'q', '##ue', '##ry', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]',
'[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '6', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '3', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]',
'"', ',', '"', 'can', '##di', '##da', '##te', '"', ':', '[', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', '{', '"', 'text', '"', ':', '"',
'[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', 'chang', '[UNK]', 'liu', '##li', '##u', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',',
'{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',',
'{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ']', '}']
As you can see, every Chinese character has been replaced by the '[UNK]' token.
For example, for the input string
['{"text_id": "e225b9fd36b8914f42c188fc92e8918f",
"query": "河南省巩义市新华路街道办事处桐和街6号钢苑新区3号楼一单元",
"candidate": [{"text": "巩义市桐和街", "label": "不匹配"}, {"text": "桐和街依家小店", "label": "不匹配"},
{"text": "桐和街chang六liuliu", "label": "不匹配"}, {"text": "桐和街佳乐钢琴", "label": "不匹配"},
{"text": "世博领秀城南门桐和街囍饭食堂", "label": "不匹配"}]}']
the resulting token list is (the first line below is the added print):
PreTrainedTokenizer split_on_tokens
['{', '"', 'text', '_', 'id', '"', ':', '"', 'e2', '##25', '##b', '##9', '##f', '##d', '##36', '##b', '##89', '##14', '##f', '##42', '##c', '##18', '##8', '##fc', '##92', '##e', '##89', '##18', '##f', '"', ',', '"', 'q', '##ue', '##ry', '"', ':', '"', '[UNK]',
'[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '6', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '3', '[UNK]', '[UNK]',
'[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'can', '##di', '##da', '##te', '"', ':', '[', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', '{', '"',
'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', 'chang', '[UNK]', 'liu',
'##li', '##u', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]',
'"', '}', ',', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ']', '}']
ids =
[169, 107, 10539, 142, 8231, 107, 131, 107, 12357, 8743, 8204, 8160, 8189, 8168, 9159, 8204, 9402, 8717, 8189, 9240, 8177, 8662, 8156, 9717, 9595, 8154, 9402, 8662, 8189, 107, 117,
107, 159, 8803, 8449, 107, 131, 107, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 127, 100, 100, 100, 100, 100, 124, 100, 100, 100, 100, 100, 107, 117,
107, 9109, 9172, 8521, 8299, 107, 131, 138, 169, 107, 10539, 107, 131, 107, 100, 100, 100, 100, 100, 100, 107, 117, 107, 11441, 8472, 107, 131, 107, 100, 100, 100, 107, 171, 117, 169, 107,
10539, 107, 131, 107, 100, 100, 100, 100, 100, 100, 100, 107, 117, 107, 11441, 8472, 107, 131, 107, 100, 100, 100, 107, 171, 117, 169, 107, 10539, 107, 131, 107, 100, 100, 100, 11680, 100,
12306, 8636, 8207, 107, 117, 107, 11441, 8472, 107, 131, 107, 100, 100, 100, 107, 171, 117, 169, 107, 10539, 107, 131, 107, 100, 100, 100, 100, 100, 100, 100, 107, 117, 107, 11441, 8472, 107,
131, 107, 100, 100, 100, 107, 171, 117, 169, 107, 10539, 107, 131, 107, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 107, 117, 107, 11441, 8472, 107, 131, 107, 100, 100, 100, 107, 171, 140, 171]
From this output it is clear that all of the Chinese content has been mapped to [UNK], whose id is 100 in this vocab (which is exactly where the many 100s in the input_ids come from), while the English/ASCII content is preserved.
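A quick way to confirm the id-to-token mapping observed above (a sketch reusing the same tokenizer; the expected values are read off the outputs above):

print(tokenizer.convert_tokens_to_ids(['[UNK]', '[CLS]', '[SEP]', '[MASK]']))
# [100, 101, 102, 103] -- so the runs of 100 are [UNK] placeholders for the Chinese characters
print(tokenizer.convert_ids_to_tokens([169, 107, 10539]))
# ['{', '"', 'text'] -- the ASCII/JSON structure survives tokenization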
Next, look at how the training data is processed. With the Trainer defined earlier (model, training_args, data_collator, dataset), running trainer.train() first invokes DataCollatorForLanguageModeling.__call__ for each batch, shown here with my debugging prints and the captured outputs left as comments:
def __call__(
    self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
) -> Dict[str, torch.Tensor]:
    print('data/data_collator.py __call__')
    # Handle dict or lists with proper padding and conversion to tensor.
    print('data_collator examples = ')
    print(examples)
    print('#########################')
    if isinstance(examples[0], (dict, BatchEncoding)):
        # this first if branch is the one taken here
        batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
        # pad is the implementation inherited from PreTrainedTokenizerBase
        print('|||self.tokenizer = |||')
        print(self.tokenizer)
        print('---self.pad_to_multiple_of---')
        print(self.pad_to_multiple_of)
        r"""
        self.tokenizer = PreTrainedTokenizer(name_or_path='/home/...vocab.txt',
            special_tokens={'unk_token':'[UNK]','sep_token':'[SEP]',...'mask_token':'[MASK]'})
        """
    else:
        print('situation2')
        batch = {"input_ids": _collate_batch(examples, self.tokenizer, pad_to_multiple_of=self.pad_to_multiple_of)}
    print('999batch = 999')
    print(batch)
    r"""
    batch =
    {'input_ids': tensor(
        [[101, 169, ..., 102],
         ................
         [101, 169, ..., 102]]),
     'attention_mask': tensor(
        [[1, 1, ..., 1, 1]
    """
    # batch['input_ids'].shape = ([32, 90])
    # batch['attention_mask'].shape = ([32, 90])
    print('99999999999999')
    r"""
    batch =
    {'input_ids': tensor(
        [[  101,   169,   107,  ..., 10539,   107,   102],
         [  101,   169,   107,  ...,   100,   100,   102],
         [  101,   169,   107,  ...,   100,   100,   102],
         ...,
         [  101,   169,   107,  ...,   100,   100,   102],
         [  101,   169,   107,  ...,   100,   100,   102],
         [  101,   169,   107,  ...,   117,   169,   102]]),
     'attention_mask': tensor(
        [[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]])}
    """
    # If special token mask has been preprocessed, pop it from the dict.
    special_tokens_mask = batch.pop("special_tokens_mask", None)
    # special_tokens_mask = None
    r"""
    special_tokens_mask =
    [[1, 0, 0, ..., 0, 0, 1],
     [1, 0, 0, ..., 1, 1, 1],
     ...............
     [1, 0, 0, ..., 0, 0, 1]]
    """
    if self.mlm:
        # special_tokens_mask = None
        batch["input_ids"], batch["labels"] = self.mask_tokens(
            batch["input_ids"], special_tokens_mask=special_tokens_mask
        )
        r"""
        ***batch = ***
        {'input_ids': tensor(
            [[  101,   169,   107,  ..., 10539,   107,   102],
             [  101,   169,   107,  ...,   100,   100,   102],
             [  101,   169,   107,  ...,   100,   100,   102],
             ...,
             [  101,   169,   107,  ...,   100,   100,   102],
             [  101,   169,   107,  ...,   100,   100,   102],
             [  101,   169,   107,  ...,   103,   169,   102]]),
         'attention_mask': tensor(
            [[1, 1, 1,  ..., 1, 1, 1],
             [1, 1, 1,  ..., 1, 1, 1],
             [1, 1, 1,  ..., 1, 1, 1],
             ...,
             [1, 1, 1,  ..., 1, 1, 1],
             [1, 1, 1,  ..., 1, 1, 1],
             [1, 1, 1,  ..., 1, 1, 1]]),
         'labels': tensor(
            [[-100, -100, -100,  ..., -100,  107, -100],
             [-100, -100, -100,  ..., -100, -100, -100],
             [-100, -100, -100,  ..., -100, -100, -100],
             ...,
             [-100, -100, -100,  ..., -100, -100, -100],
             [-100, -100, -100,  ..., -100, -100, -100],
             [-100, -100, -100,  ...,  117, -100, -100]])}
        """
    else:
        labels = batch["input_ids"].clone()
        if self.tokenizer.pad_token_id is not None:
            labels[labels == self.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
    return batch
The first thing printed is:
data_collator examples =
[{'input_ids': tensor([ 101, 169, 107, 10539, 142, 8231, 107, 131, 107, 8360,
8717, 8139, 9099, 8168, 9267, 8157, 8177, 9446, 8177, 9419,
8510, 10340, 10696, 8129, 11008, 8160, 8204, 8849, 8152, 8139,
...........
107, 10539, 107, 131, 107, 100, 100, 100, 100, 102])},
......
{'input_ids': tensor([ 101, 169, 107, 10539, 142, 8231, 107, 131, 107, 9226,
9102, 9039, 8854, 8748, 8159, 9717, 8204, 8189, 9242, 8168,
8189, 11219, 11414, 8148, 8154, 9102, 9410, 8157, 8139, 107,
117, 107, 159, 8803, 8449, 107, 131, 107, 100, 100,
100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
100, 100, 100, 100, 100, 100, 100, 100, 100, 127,
100, 100, 100, 100, 100, 107, 117, 107, 9109, 9172,
8521, 8299, 107, 131, 138, 169, 107, 10539, 107, 131,
107, 100, 100, 100, 100, 100, 100, 107, 171, 102])}]
Then the following is printed:
self.pad_to_multiple_of = None
and tokenizer.pad returns the batch together with its attention_mask:
batch =
{'input_ids': tensor(
[[ 101, 169, 107, ..., 10539, 107, 102],
[ 101, 169, 107, ..., 100, 100, 102],
[ 101, 169, 107, ..., 100, 100, 102],
...,
[ 101, 169, 107, ..., 100, 100, 102],
[ 101, 169, 107, ..., 100, 100, 102],
[ 101, 169, 107, ..., 117, 169, 102]]),
'attention_mask': tensor(
[[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]])}
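The attention_mask is produced by tokenizer.pad, which pads every example up to the longest sequence in the batch and marks padded positions with 0. A minimal sketch of the same call on two made-up short sequences (reusing the same tokenizer):

import torch

examples = [
    {"input_ids": torch.tensor([101, 169, 107, 102])},
    {"input_ids": torch.tensor([101, 169, 102])},
]
padded = tokenizer.pad(examples, return_tensors="pt")
print(padded["input_ids"])      # the shorter row is padded with the [PAD] id (0 in this vocab)
print(padded["attention_mask"]) # 1 for real tokens, 0 for padding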
Next comes
special_tokens_mask = batch.pop("special_tokens_mask", None)
Since the batch above has no "special_tokens_mask" key, pop falls back to the default, so
special_tokens_mask = None
Execution then enters
if self.mlm:
    # special_tokens_mask = None
    batch["input_ids"], batch["labels"] = self.mask_tokens(
        batch["input_ids"], special_tokens_mask=special_tokens_mask
    )
which means stepping into self.mask_tokens:
def mask_tokens(
    self, inputs: torch.Tensor, special_tokens_mask: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    print('data/data_collator.py mask_tokens')
    """
    Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
    """
    labels = inputs.clone()
    # We sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
    r"""
    labels = tensor(
        [[  101,   169,   107,  ..., 10539,   107,   102],
         [  101,   169,   107,  ...,   100,   100,   102],
         [  101,   169,   107,  ...,   100,   100,   102],
         ...,
         [  101,   169,   107,  ...,   100,   100,   102],
         [  101,   169,   107,  ...,   100,   100,   102],
         [  101,   169,   107,  ...,   117,   169,   102]])
    """
    probability_matrix = torch.full(labels.shape, self.mlm_probability)
    r"""
    probability_matrix =
    tensor([[0.1500, 0.1500, ...],
            [0.1500, 0.1500, ...],
            ..................
    """
    if special_tokens_mask is None:
        special_tokens_mask = [
            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
        ]
        special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
    else:
        special_tokens_mask = special_tokens_mask.bool()

    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # We only compute loss on masked tokens

    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

    # 10% of the time, we replace masked input tokens with random word
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]

    # The rest of the time (10% of the time) we keep the masked input tokens unchanged
    return inputs, labels
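Before stepping through this line by line, one piece of arithmetic is worth spelling out: of the 15% of positions selected for masking, 80% become [MASK]; of the remaining 20%, half (hence the Bernoulli probability 0.5) are swapped for a random token, i.e. 0.2 × 0.5 = 10% overall, and the last 10% are left unchanged. A toy standalone sketch of the same selection logic (not the collator itself):

import torch

torch.manual_seed(0)
seq = torch.randint(1000, 2000, (1, 12))             # pretend token ids
prob = torch.full(seq.shape, 0.15)                   # 15% of positions are candidates
masked = torch.bernoulli(prob).bool()

replaced = torch.bernoulli(torch.full(seq.shape, 0.8)).bool() & masked             # ~80% of masked -> [MASK]
random_ = torch.bernoulli(torch.full(seq.shape, 0.5)).bool() & masked & ~replaced  # half of the rest -> random id
kept = masked & ~replaced & ~random_                                               # remaining ~10% stay unchanged
print(masked.sum().item(), replaced.sum().item(), random_.sum().item(), kept.sum().item())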
Stepping through it: first, labels is cloned from inputs:
labels = inputs.clone()
labels = tensor(
[[ 101, 169, 107, ..., 10539, 107, 102],
[ 101, 169, 107, ..., 100, 100, 102],
[ 101, 169, 107, ..., 100, 100, 102],
...,
[ 101, 169, 107, ..., 100, 100, 102],
[ 101, 169, 107, ..., 100, 100, 102],
[ 101, 169, 107, ..., 117, 169, 102]]
)
Next the probability matrix is built:
probability_matrix = torch.full(labels.shape, self.mlm_probability)
which gives
probability_matrix =
tensor([[0.1500,0.1500,...],
[0.1500,0.1500,...],
..................
[0.1500,0.1500,...]])
Next, look at how special_tokens_mask is computed (masked_indices is then sampled from the probability matrix once these special positions have been zeroed out):
if special_tokens_mask is None:
    special_tokens_mask = [
        self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    ]
    print('special_tokens_mask1 = ')
    print(special_tokens_mask)
    special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
    print('special_tokens_mask2 = ')
    print(special_tokens_mask)
else:
    special_tokens_mask = special_tokens_mask.bool()
The printed output is:
special_tokens_mask1 =
[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
.....................
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1]]
and the corresponding special_tokens_mask2 is:
special_tokens_mask2 =
tensor([[ True, False, False, ..., False, False, True],
...,
[ True, False, False, ..., False, False, True]])
Building special_tokens_mask1 goes through the tokenizer's get_special_tokens_mask function:
special_tokens_mask = [
    self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
]
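Note that with already_has_special_tokens=True, get_special_tokens_mask in this transformers version flags every id contained in tokenizer.all_special_ids. Since [UNK] is itself a special token, all the [UNK] positions (the masked-out Chinese characters) are also set to 1, which is why the printed mask has long runs of 1s in the middle rather than only at the [CLS]/[SEP] ends; those positions are then excluded from MLM masking. A small check, reusing the same tokenizer and ids taken from the outputs above:

print(tokenizer.all_special_ids)
# expected to contain 0 ([PAD]), 100 ([UNK]), 101 ([CLS]), 102 ([SEP]), 103 ([MASK])
print(tokenizer.get_special_tokens_mask([101, 169, 100, 107, 102], already_has_special_tokens=True))
# expected [1, 0, 1, 0, 1] -- [CLS], '{', [UNK], '"', [SEP]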
The next post will continue working through this code.