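When reading papers and reproducing their code, I often need to use BERT pretrained models, but transformers downloads them automatically from servers hosted abroad, which is very slow. This post summarizes how to use BERT pretrained models.
First, install the transformers library: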
pip install transformers
BertModel and BertTokenizer from transformers are the classes used most often:
from transformers import BertModel, BertTokenizer
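The import itself downloads nothing. Calling from_pretrained with a model name such as 'bert-base-chinese' is what fetches the files from the official Hugging Face servers and saves them in the C:\Users\Admin\.cache\torch\transformers folder (newer transformers releases cache under .cache\huggingface instead).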
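To see which cache folder actually exists on a given machine, the lookup can be sketched as below. The function name candidate_cache_dirs is hypothetical, and the two huggingface locations are my assumption about where later transformers releases keep their cache; only the first path comes from this post.

```python
from pathlib import Path


def candidate_cache_dirs(home=None):
    """Return default cache folders that different transformers releases may use."""
    home = Path(home) if home is not None else Path.home()
    return [
        home / ".cache" / "torch" / "transformers",        # older releases (path quoted in this post)
        home / ".cache" / "huggingface" / "transformers",  # assumption: early 4.x releases
        home / ".cache" / "huggingface" / "hub",           # assumption: recent releases
    ]


if __name__ == "__main__":
    # Print each candidate folder and whether it is present on this machine.
    for folder in candidate_cache_dirs():
        status = "found" if folder.is_dir() else "missing"
        print(f"{status:8} {folder}")
```

Whichever folder is reported as found is where automatically downloaded files end up, so it is also the place to clean when a download was interrupted.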
If your connection is unreliable, download the files manually. The pages for two commonly used BERT pretrained models are:
bert-base-uncased : https://huggingface.co/bert-base-uncased
bert-base-chinese : https://huggingface.co/bert-base-chinese
After downloading, put the files in any folder you like.
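Before pointing from_pretrained at that folder, it can help to check that the download is complete. The sketch below is a hypothetical helper, not part of transformers; the file names are an assumption based on what the two model pages above list (a config, a vocabulary, and one weights file).

```python
from pathlib import Path

# Assumption: file names as they appear in the Hugging Face repositories listed above.
REQUIRED_FILES = ("config.json", "vocab.txt")
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")  # either one is enough


def missing_model_files(model_dir):
    """Return the names of files that still need to be downloaded into model_dir."""
    folder = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
    if not any((folder / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing
```

An empty list means the folder looks ready to load; otherwise the returned names say what is still missing.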
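At run time the tokenizer files are still downloaded and cached automatically (they are small, so the download succeeds), while the model itself is loaded from your local path.
A simple usage example: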
from transformers import BertModel, BertTokenizer
model_name = 'bert-base-chinese'
model_path = ''  # put the path to your local bert-base-chinese folder here
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_path)
print(tokenizer.encode('新手村村民零零一')) #[101, 3173, 2797, 3333, 3333, 3696, 7439, 7439, 671, 102]
sen_code = tokenizer.encode_plus('这个故事没有终点', "正如星空没有彼岸")
print(sen_code)
print(tokenizer.convert_ids_to_tokens(sen_code['input_ids']))
Output:
[101, 3173, 2797, 3333, 3333, 3696, 7439, 7439, 671, 102]
{'input_ids': [101, 6821, 702, 3125, 752, 3766, 3300, 5303, 4157, 102, 3633, 1963, 3215, 4958, 3766, 3300, 2516, 2279, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', '这', '个', '故', '事', '没', '有', '终', '点', '[SEP]', '正', '如', '星', '空', '没', '有', '彼', '岸', '[SEP]']
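The three fields of the printed dictionary follow a simple layout, which can be reproduced by hand. This is only a sketch of the layout, not the tokenizer's actual implementation: build_pair_encoding is a hypothetical helper, and the ids are copied from the output above.

```python
CLS_ID = 101  # [CLS], as in the output above
SEP_ID = 102  # [SEP], as in the output above


def build_pair_encoding(ids_a, ids_b):
    """Lay out two token-id lists the way the printed encode_plus result is laid out."""
    first = [CLS_ID] + list(ids_a) + [SEP_ID]   # segment 0: [CLS] sentence A [SEP]
    second = list(ids_b) + [SEP_ID]             # segment 1: sentence B [SEP]
    input_ids = first + second
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * len(input_ids),  # no padding, so every position is attended to
    }


sentence_a = [6821, 702, 3125, 752, 3766, 3300, 5303, 4157]    # 这个故事没有终点
sentence_b = [3633, 1963, 3215, 4958, 3766, 3300, 2516, 2279]  # 正如星空没有彼岸
encoding = build_pair_encoding(sentence_a, sentence_b)
print(encoding["token_type_ids"])
```

The first [SEP] belongs to segment 0 and the final one to segment 1, which is why the output shows ten zeros followed by nine ones.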
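The older pytorch_pretrained_bert package can be used in a similar way. Install it with:
pip install pytorch_pretrained_bert
bert-base-uncased must be downloaded in advance and placed in a folder of your choice.
Usage example: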
from pytorch_pretrained_bert import BertModel, BertTokenizer
import torch.nn as nn
import numpy as np
import torch
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained('D:\\导师任务\\useful-libs\\bert-base-uncased\\vocab.txt')
model = BertModel.from_pretrained('D:/导师任务/useful-libs/bert-base-uncased/')
s = "The first step is as good as half over"
tokens = tokenizer.tokenize(s)
print(tokens)
tokens = ["[CLS]"] + tokens + ["[SEP]"]
print(tokens)
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])  # note the enclosing []: it adds the batch dimension
print(ids)
all_layers_all_words, pooled = model(ids, output_all_encoded_layers=True)
print(len(all_layers_all_words))
Output:
['the', 'first', 'step', 'is', 'as', 'good', 'as', 'half', 'over']
['[CLS]', 'the', 'first', 'step', 'is', 'as', 'good', 'as', 'half', 'over', '[SEP]']
tensor([[ 101, 1996, 2034, 3357, 2003, 2004, 2204, 2004, 2431, 2058, 102]])
12
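The steps from sentence to model input (tokenize, add [CLS] and [SEP], map to ids, wrap in a batch) can be mimicked without any library. The sketch below uses a toy vocabulary that contains only the tokens and ids printed above, and a plain whitespace split in place of the real WordPiece tokenizer; encode_as_batch is a hypothetical name.

```python
# Toy vocabulary: only the tokens and ids printed in the output above.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102, "the": 1996, "first": 2034, "step": 3357,
    "is": 2003, "as": 2004, "good": 2204, "half": 2431, "over": 2058,
}


def encode_as_batch(sentence, vocab=TOY_VOCAB):
    """Lower-case, split on whitespace, add special tokens, return a batch of one."""
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    ids = [vocab[token] for token in tokens]  # raises KeyError for words outside the toy vocabulary
    return [ids]  # outer list = batch dimension, so the shape is (1, sequence_length)


batch = encode_as_batch("The first step is as good as half over")
print(len(batch), len(batch[0]))
```

The nested list has one row of eleven ids, matching the shape of the printed tensor; that extra outer level is what the [] around convert_tokens_to_ids provides.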