[Python] Fixing from_pretrained() Errors When Loading a BERT Model with Transformers

Contents

  • Setting up the development environment
  • OSError: Can't load config for 'xxxxxx'. If you were trying
  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
  • Can't load the configuration of 'xxxxxx'.
  • Loading model from pytorch_pretrained_bert into transformers library
  • ERROR: No matching distribution found for boto3
  • Missing key(s) in state_dict: "bert.embeddings.position_ids".

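Setting up the development environment

Install Miniconda on an Ubuntu server, then connect to it for remote development through VSCode, PyCharm, or Gateway.

Recommended reading: Running Python programs in VSCode with a virtual environment

Install PyTorch, Scikit-Learn, Transformers, and the other libraries you need.

Recommended reading: Installing the GPU builds of TensorFlow and PyTorch with Conda

Note: to install Scikit-Learn, run pip install scikit-learn, not pip install sklearn.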
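The confusion behind this note is that the name given to pip (scikit-learn) differs from the name used in import statements (sklearn). A minimal sketch, using only the standard library, for checking which import names are available in the active environment. The helper name is_importable is made up for this illustration:

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # These are import names, not pip names: scikit-learn is imported as sklearn.
    for name in ("sklearn", "torch", "transformers"):
        print(name, "ok" if is_importable(name) else "missing")
```

Running this inside the Conda environment selected in the IDE is a quick way to confirm that the remote interpreter is the one the packages were installed into.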

OSError: Can't load config for 'xxxxxx'. If you were trying

The first error:
OSError: Can't load config for 'xxxxxx'. If you were trying

Following this blog post, I manually downloaded the files for bert-base-uncased, but loading still failed.

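UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The next error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

According to articles online, this error generally means that a binary file was read with the wrong encoding or in the wrong mode (as text rather than as bytes). I was not able to fix it within this project.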
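The mechanism can be reproduced in isolation. The sketch below is not taken from the project: it writes a small pickle file (pickle protocol 2 and later start with the byte 0x80), reads it back as UTF-8 text to trigger the error, and then reads it correctly in binary mode. Whether the file in the failing project was pickle-based is an assumption; the file name weights.bin and both function names are made up for this illustration.

```python
import pickle
import tempfile
from pathlib import Path


def text_read_error(path):
    """Try to read a file as UTF-8 text; return the error message, or None."""
    try:
        Path(path).read_text(encoding="utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


def demo():
    """Write a pickle file, read it the wrong way, then the right way."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "weights.bin"  # hypothetical file name
        path.write_bytes(pickle.dumps({"w": [1.0, 2.0]}, protocol=2))
        error = text_read_error(path)  # wrong: binary data decoded as text
        with open(path, "rb") as fh:  # right: open in binary mode
            restored = pickle.load(fh)
    return error, restored


if __name__ == "__main__":
    error, restored = demo()
    print(error)
    print(restored)
```

If the traceback points at a file being opened in text mode, switching that call to binary mode (or passing the path to the loader that expects it) is the usual direction to look in.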

Can't load the configuration of 'xxxxxx'.

Can't load the configuration of 'xxxxxx'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'xxxxxx' is the correct path to a directory containing a config.json file

Use the following script to download the bert-base-uncased model from the Hugging Face repository and save it locally:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained('./bert/')
model.save_pretrained('./bert/')
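Since the error message asks for "a directory containing a config.json file", it can help to check the save directory before pointing from_pretrained() at it. A minimal sketch using only the standard library; the helper name missing_files is made up here, and only config.json is checked because that is the one file the error message names:

```python
from pathlib import Path


def missing_files(model_dir, required=("config.json",)):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("./bert/")
    if missing:
        print("Not a loadable model directory, missing:", missing)
    else:
        print("config.json found; ./bert/ can be passed to from_pretrained()")
```

Once the check passes, the same path used in save_pretrained() above ('./bert/') is the one to pass to from_pretrained().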

Loading model from pytorch_pretrained_bert into transformers library

The official Hugging Face Discussion thread "Loading model from pytorch_pretrained_bert into transformers library" contains this reply:

Hi. This is probably caused by the transformer verison. You might downgrade your transformer version from 4.4 to 2.8 with pip install transformers==2.8.0

I therefore tried downgrading transformers to 2.8.0.

First, check the installed transformers version:
pip show transformers

The output shows version 4.26.1:
Name: transformers
Version: 4.26.1
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache
Location: xxxxxxxxxxxxxxxxxxxxxxx
Requires: filelock, huggingface-hub, importlib-metadata, numpy, packaging, pyyaml, regex, requests, tokenizers, tqdm
Required-by:
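The same information can be read from inside Python, which is convenient when a script needs to branch on the installed version. A minimal sketch using the standard-library importlib.metadata module (Python 3.8+); the helper names version_tuple and installed_version are made up for this illustration, and the parsing is deliberately naive (it keeps only the leading digits of each component):

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.26.1' into (4, 26, 1); suffixes such as 'rc4' are ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("transformers")
    if found is None:
        print("transformers is not installed")
    else:
        print(found, "is 3.0.0 or newer:", version_tuple(found) >= (3, 0, 0))
```

Tuples compare element by element, so (2, 8, 0) sorts before (3, 0, 0) and (4, 26, 1) after it.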

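Install transformers 2.8.0 directly over the existing installation:
pip install transformers==2.8.0

This produced a long chain of errors, one line of which was:
During handling of the above exception, another exception occurred:

Uninstall transformers:
pip uninstall transformers

Then install again:
pip install transformers==2.8.0

This time the error was: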
ERROR: Could not find a version that satisfies the requirement boto3 (from transformers) (from versions: none)
ERROR: No matching distribution found for boto3

ERROR: No matching distribution found for boto3

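To resolve this, I followed a StackOverflow answer and installed the boto3 library.

First, check whether boto3 is already installed:
pip show boto3

Output:
WARNING: Package(s) not found: boto3

So it had not been installed.

Then install boto3:
pip install boto3

Then install transformers 2.8.0: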
pip install transformers==2.8.0

Missing key(s) in state_dict: "bert.embeddings.position_ids".

After installing transformers 2.8.0, running the program failed with:
Missing key(s) in state_dict: "bert.embeddings.position_ids".

Following this blog post, with slight modifications, I added the following code:

cudnn.benchmark = True

It still failed, now with:
TypeError: 'BertTokenizer' object is not callable

I found a related GitHub Issue, "TypeError: 'BertTokenizer' object is not callable #53", whose reply states:

Transformers fails "TypeError: 'BertTokenizer' object is not callable" if the installed version is <3.0.0

So I decided to upgrade to 3.0.0:
pip install transformers==3.0.0

This succeeded. Part of the output:
Installing collected packages: tokenizers, transformers
Attempting uninstall: tokenizers
Found existing installation: tokenizers 0.5.2
Uninstalling tokenizers-0.5.2:
Successfully uninstalled tokenizers-0.5.2
Attempting uninstall: transformers
Found existing installation: transformers 2.8.0
Uninstalling transformers-2.8.0:
Successfully uninstalled transformers-2.8.0
Successfully installed tokenizers-0.8.0rc4 transformers-3.0.0
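For code that has to run on both sides of the 3.0.0 boundary, one option is to test whether the tokenizer object is callable and fall back otherwise. This is a sketch under assumptions, not the fix used in this post: to my knowledge, calling the tokenizer directly was introduced in transformers 3.0.0 and encode_plus() is available in the 2.x releases, but check this against the version you have. The helper name encode_text is made up for this illustration.

```python
def encode_text(tokenizer, text):
    """Encode text with whichever entry point the tokenizer object offers."""
    if callable(tokenizer):
        # Newer releases: the tokenizer object itself can be called.
        return tokenizer(text)
    # Older releases: fall back to the encode_plus() method.
    return tokenizer.encode_plus(text)
```

Usage would look like encode_text(tokenizer, "hello world"), with tokenizer loaded from './bert/' as in the download script. Upgrading, as done above, remains the simpler route when nothing pins the old version.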

The program now runs and produces results, along with the following output:
Some weights of the model checkpoint at ./bert/ were not used when initializing BertModel: ['embeddings.position_ids']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
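Both this warning and the earlier "Missing key(s) in state_dict" error come down to a mismatch between the parameter names the model class expects and the names stored in the checkpoint. The sketch below illustrates the comparison with plain Python sets; it does not call PyTorch or Transformers, and the key lists in the example are illustrative rather than read from a real checkpoint. The helper name diff_state_dict_keys is made up here.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Compare expected parameter names with the names found in a checkpoint.

    Returns (missing, unexpected):
      missing    - names the model expects but the checkpoint lacks
      unexpected - names in the checkpoint that the model does not use
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


if __name__ == "__main__":
    # Illustrative key names only.
    model = ["embeddings.word_embeddings.weight"]
    checkpoint = ["embeddings.word_embeddings.weight", "embeddings.position_ids"]
    print(diff_state_dict_keys(model, checkpoint))
```

In this post the checkpoint in ./bert/ was saved with transformers 4.26.1 and loaded with 3.0.0, so an extra 'embeddings.position_ids' entry in the checkpoint that the older model class does not use is consistent with the message above.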
