Hugging Face is a chatbot startup headquartered in New York. It is best known for its focus on NLP, its large open-source community, and in particular Transformers, the pretrained-model library for natural language processing it open-sourced on GitHub, which was originally called pytorch-pretrained-bert.
Installation
pip install transformers
If you hit the error "ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on", you need to download the vocab file and the pretrained model yourself and put them in the corresponding locations. The download links are:
PRETRAINED_VOCAB_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
}
PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz",
    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz",
    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz",
    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz",
}
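If the S3 links above are still reachable, the files can also be fetched programmatically. A minimal sketch for bert-base-uncased (the local file names and target directory here are illustrative choices, not part of the original post):

import tarfile
import urllib.request

vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt"
model_url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz"

# Download the vocab file as-is.
urllib.request.urlretrieve(vocab_url, "bert-base-uncased-vocab.txt")

# Download the model archive and unpack it; the extracted directory should
# contain bert_config.json and pytorch_model.bin.
urllib.request.urlretrieve(model_url, "bert-base-uncased.tar.gz")
with tarfile.open("bert-base-uncased.tar.gz", "r:gz") as tar:
    tar.extractall("bert-base-uncased")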
Then point from_pretrained at the local files (a raw string avoids backslash-escape problems in Windows paths), for example:
BertTokenizer.from_pretrained(r'D:\code\EMNLP-2019-master\data\bert-base-uncased-vocab.txt')
BertModel.from_pretrained(r'D:\code\EMNLP-2019-master\data\bert-base-uncased')
Loading and running the model
Load the downloaded model and vocab. (The snippets below use the older pytorch-pretrained-bert API, which is where the import name comes from.)
import torch
from pytorch_pretrained_bert import BertModel, BertTokenizer
import numpy as np
# Load BERT's tokenizer from the downloaded vocab file
tokenizer = BertTokenizer.from_pretrained('your/path/bert-base-uncased-vocab.txt')
# Load the BERT model; this directory should contain the bert_config.json config file and the pytorch_model.bin weights file
bert = BertModel.from_pretrained('your/path/bert-base-uncased/')
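If you only want to extract features with the pretrained weights (as in the snippet below) rather than train, it can help to switch the model to eval mode so dropout is disabled; a small optional step:

# Optional: eval mode disables dropout so the extracted features are deterministic.
bert.eval()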
Input sentence
s = "I'm not sure, this can work, lol -.-"
tokens = tokenizer.tokenize(s)
print("\\".join(tokens))
# "i\\'\\m\\not\\sure\\,\\this\\can\\work\\,\\lo\\##l\\-\\.\\-"
# Do we also need to add the special tokens, like this?
# tokens = ["[CLS]"] + tokens + ["[SEP]"]
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
print(ids.shape)
# torch.Size([1, 15])
result = bert(ids, output_all_encoded_layers=True)
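On the question in the comment above: BERT was pretrained with [CLS] and [SEP] wrapped around every sequence, so if you want the pooled [CLS] output to be meaningful it is generally advisable to add them. A minimal sketch, reusing the tokens from above:

# Wrap the token list with BERT's special tokens before converting to ids.
tokens_with_special = ["[CLS]"] + tokens + ["[SEP]"]
ids_with_special = torch.tensor([tokenizer.convert_tokens_to_ids(tokens_with_special)])
print(ids_with_special.shape)
# torch.Size([1, 17])  (15 word pieces + [CLS] + [SEP])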
Output
result = (
[encoder_0_output, encoder_1_output, ..., encoder_11_output],
pool_output
)
With output_all_encoded_layers=True, the outputs of all 12 Transformer encoder layers are returned in the first element of result; it is a list in which each encoder_output has size [batch_size, sequence_length, hidden_size]. The second element, pool_output, has size [batch_size, hidden_size]; it is the output of the pooler layer, described in the paper as the hidden state of [CLS], which already encodes the information of the whole input sentence. If output_all_encoded_layers is set to False, the first element of result is no longer a list but just encoder_11_output, a tensor of size [batch_size, sequence_length, hidden_size] that can be taken as BERT's representation of this sentence.
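These shapes can be verified directly; a minimal sketch, reusing the ids and result computed above:

# Unpack the tuple returned with output_all_encoded_layers=True.
encoded_layers, pooled_output = result
print(len(encoded_layers))       # 12, one output per encoder layer
print(encoded_layers[-1].shape)  # torch.Size([1, 15, 768])
print(pooled_output.shape)       # torch.Size([1, 768])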
For fine-tuning on the current task there are two options: freeze BERT's parameters and train only the layers added on top, or fine-tune the whole model end to end. The custom model below takes the first approach; a sketch of the second follows after the class.
import torch.nn as nn

class CustomModel(nn.Module):
    def __init__(self, bert_path, n_other_features, n_hidden):
        super().__init__()
        # Load BERT and freeze its parameters
        self.bert = BertModel.from_pretrained(bert_path)
        for param in self.bert.parameters():
            param.requires_grad = False
        self.output = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(768 + n_other_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1)
        )

    def forward(self, seqs, features):
        _, pooled = self.bert(seqs, output_all_encoded_layers=False)
        concat = torch.cat([pooled, features], dim=1)
        logits = self.output(concat)
        return logits
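A minimal sketch of the second option, fine-tuning end to end. The idea is simply not to freeze BERT; here this is shown by re-enabling gradients on a CustomModel instance and giving the optimizer both parameter groups (the learning rates and the Adam optimizer are illustrative assumptions, not values from the original post):

import torch.optim as optim

model = CustomModel('your/path/bert-base-uncased/', 10, 512)
for param in model.bert.parameters():
    param.requires_grad = True   # undo the freezing done in __init__

optimizer = optim.Adam([
    {'params': model.bert.parameters(), 'lr': 2e-5},    # small lr for pretrained weights
    {'params': model.output.parameters(), 'lr': 1e-3},  # larger lr for the new head
])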
Test
s = "I'm not sure, this can work, lol -.-"
tokens = tokenizer.tokenize(s)
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
# print(ids)
# tensor([[1045, 1005, 1049, 2025, 2469, 1010, 2023, 2064, 2147, 1010, 8840, 2140,
# 1011, 1012, 1011]])
model = CustomModel('your/path/bert-base-uncased/', 10, 512)
outputs = model(ids, torch.rand(1, 10))
# print(outputs)
# tensor([[0.1127]], grad_fn=<AddmmBackward>)
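Since the head produces a single logit, a typical next step is a binary-classification training step. A minimal sketch with a dummy label (the BCEWithLogitsLoss criterion, the Adam optimizer, and the label are assumptions for illustration, not from the original post):

# One illustrative training step on the frozen-BERT model.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3
)

label = torch.tensor([[1.0]])            # dummy target for the single example
logits = model(ids, torch.rand(1, 10))   # forward pass
loss = criterion(logits, label)          # binary cross-entropy on the logit

optimizer.zero_grad()
loss.backward()
optimizer.step()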