1. Fine-tuning a transformer on an MLM task (multi-GPU), a run under nohup died with the error below; search results call it a nohup bug.
{'loss': 1.5461, 'learning_rate': 3.933343085625122e-05, 'epoch': 0.64}
21%|██▏ | 35000/164064 [2:03:12<7:30:17, 4.78it/s]
[INFO|trainer.py:2700] 2022-12-28 20:39:56,894 >> Saving model checkpoint to /data//models/myhugBert30w2/checkpoint-35000
[INFO|configuration_utils.py:447] 2022-12-28 20:39:56,895 >> Configuration saved in /data//models/myhugBert30w2/checkpoint-35000/config.json
[INFO|modeling_utils.py:1702] 2022-12-28 20:39:57,345 >> Model weights saved in /data//models/myhugBert30w2/checkpoint-35000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2157] 2022-12-28 20:39:57,346 >> tokenizer config file saved in /data//models/myhugBert30w2/checkpoint-35000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2164] 2022-12-28 20:39:57,346 >> Special tokens file saved in /data//models/myhugBert30w2/checkpoint-35000/special_tokens_map.json
22%|██▏ | 36805/164064 [2:09:35<7:35:49, 4.65it/s]
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33543 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33544 closing signal SIGHUP
Further searching says running inside tmux instead solves it; a sketch follows.
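A minimal tmux workflow (the session name is arbitrary and the launch command is a placeholder for the real training script):

tmux new -s mlm_train        # open a detachable session
# inside it, launch training as usual, e.g. torchrun run_mlm.py ...
# detach with Ctrl-b d; the job survives the SSH disconnect
tmux attach -t mlm_train     # reattach later to check progress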
2. Running a sentence-transformers model through transformers, with repeated inference calls to get sentence embeddings, hits this bug:
tokenization_utils_fast.py", line 429, in _batch_encode_plus
encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
Search results blame a too-new transformers version. I don't think that's it; running the model directly through sentence-transformers instead works. [solved]
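This TypeError is commonly triggered by a non-string value (None, NaN, etc.) slipping into the batch, so filtering the inputs is a cheap sanity check. A minimal sketch of the direct sentence-transformers path (model name and sentences are placeholders):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder model
sentences = ["first sentence", None, "second sentence"]
clean = [s for s in sentences if isinstance(s, str)]  # drop non-string values
embeddings = model.encode(clean)
print(embeddings.shape)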
3. wc -l and pandas read_csv disagree on row count: wc -l reports 4M lines while pandas reads 3.8M rows, a gap of almost 200k. That's way too big, and after pandas reads and re-saves the file, wc -l still reports 4M. Damn, something's off. (The usual culprit is quoted fields containing embedded newlines: pandas folds a multi-line quoted field into a single row, while wc -l counts every physical newline.)
Reading with open() instead and saving the resulting list with pickle works, as below. [solved]
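A minimal sketch of that workaround (paths are placeholders):

import pickle

# readlines() counts physical lines, i.e. the same thing wc -l counts
with open("data.csv", "r", encoding="utf-8") as f:
    lines = f.readlines()
print(len(lines))

with open("lines.pkl", "wb") as f:
    pickle.dump(lines, f)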
4. Installing syngec fails as follows:
python setup.py egg_info did not run successfully.
error: metadata-generation-failed
ERROR: Failed building wheel for jsonnet
Installing jsonnet from conda-forge first (conda install -c conda-forge jsonnet) and then rerunning the install solves it. [confirmed for seq2edit; untested for syngec]
5. Installing wget and similar packages locally on a Mac via Homebrew fails with:
fatal: not in a git directory
Error: Command failed with exit 128: git
The following solves it:
git config --global --add safe.directory /opt/homebrew/Library/Taps/homebrew/homebrew-core
git config --global --add safe.directory /opt/homebrew/Library/Taps/homebrew/homebrew-cask
6. ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /data/envs/ddddc/lib/python3.7/site-packages/opencc/clib/opencc_clib.cpython-37m-x86_64-linux-gnu.so)
Check first; 1.3.8 is indeed missing:
sudo strings /lib64/libstdc++.so.6 | grep 'CXXABI'
CentOS has no apt, so the Ubuntu-style install doesn't apply.
Download libstdc++ from: https://pan.baidu.com/s/14ErRZGP_fxDQcaO3VFYmcA?pwd=mrp6 (extraction code: mrp6)
Then proceed as follows: unzip the download and put the extracted libstdc++.so.6.0.26 into the directory below.
cd /usr/lib64
ls | grep libstdc
# back up the old library
mv libstdc++.so.6 libstdc++.so.6_bak
# recreate the symlink
ln -s libstdc++.so.6.0.26 libstdc++.so.6
Re-run the strings query from above; CXXABI_1.3.8 should now be listed.
7. Installing filelock==3.7.1 and gpustat==0.6.0 hits problem 4 again, since this is still the syngec project.
Commenting those two libraries out (i.e. simply not installing them) works, as below.
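That is, in the project's dependency list (assuming a standard requirements.txt):

# requirements.txt: skip these two instead of deleting the lines
# filelock==3.7.1
# gpustat==0.6.0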
8. FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
Pinning scikit-learn to version 1.0 or below keeps the old call working; alternatively, switch to the new API, as sketched below.
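The rename is a drop-in change, since get_feature_names_out has existed since scikit-learn 1.0 (CountVectorizer here is just an example estimator):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["a sample document", "another sample document"])
# names = vec.get_feature_names()     # deprecated in 1.0, removed in 1.2
names = vec.get_feature_names_out()   # replacement available from 1.0 on
print(names)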
9. To install an older numpy: pip install numpy==1.10.1 --use-pep517
10. dlopen(/Users/Library/Python/3.9/lib/python/site-packages/pandas/_libs/interval.cpython-39-darwin.so, 0x0002): tried: '/Users/Library/Python/3.9/lib/python/site-packages/pandas/_libs/interval.cpython-39-darwin.so' (mach-o file, but is an incompatible architecture (have (arm64), need (x86_64)))
Switch into an x86_64 shell: arch -x86_64 $SHELL
then reinstall the affected libraries, e.g. as below.
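For example (pandas stands in for whichever package shipped the wrong-architecture binary):

arch -x86_64 $SHELL                                       # start an x86_64 shell under Rosetta
python -c "import platform; print(platform.machine())"    # sanity check: should print x86_64
pip install --force-reinstall pandas                      # re-pull a wheel matching the architecture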
11. pip install seqeval fails.
Installing setuptools-scm first (pip install setuptools-scm) and then retrying works.
12. json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 18 (char 17)
A double-quote vs. single-quote problem; a small repro below.
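A minimal repro (hypothetical strings): an unescaped quote inside a value produces exactly this "Expecting ',' delimiter" error, while a single-quoted dict fails with a different message and can be parsed with ast.literal_eval instead:

import ast, json

bad = '{"text": "she said "hi""}'           # unescaped inner double quotes
try:
    json.loads(bad)
except json.JSONDecodeError as e:
    print(e)                                # Expecting ',' delimiter: ...

good = '{"text": "she said \\"hi\\""}'      # inner quotes escaped
print(json.loads(good))                     # {'text': 'she said "hi"'}
print(ast.literal_eval("{'text': 'ok'}"))   # Python-literal fallback for single quotes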
13. TorchServe requires Java installed:
java.lang.NoSuchMethodError: java.nio.file.Files.readString(Ljava/nio/file/Path;)Ljava/lang/String;
Installing 1.8 turned out not to work; it needs 11. Remove 1.8 first:
yum -y remove java-1.8.0-openjdk*
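then install Java 11 (assuming a CentOS-style yum setup; package names can vary by repo):

yum -y install java-11-openjdk java-11-openjdk-devel
java -version    # should now report 11.x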
14. Training an NER task on TF 2.4:
Cause: module 'gast' has no attribute 'Index'
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
Traceback (most recent call last):
File "/data/transformers/examples/tensorflow/token-classification/run_ner.py", line 592, in
main() #
File "/data/transformers/examples/tensorflow/token-classification/run_ner.py", line 451, in main
model.compile(optimizer=optimizer, jit_compile=training_args.xla)
File "/data/python3.9/site-packages/transformers/modeling_tf_utils.py", line 1412, in compile
super().compile(
File "/data/python3.9/site-packages/tensorflow/python/keras/engine/training.py", line 534, in compile
self._validate_compile(optimizer, metrics, **kwargs)
File "/data/python3.9/site-packages/tensorflow/python/keras/engine/training.py", line 2519, in _validate_compile
raise TypeError('Invalid keyword argument(s) in `compile`: %s' %
TypeError: Invalid keyword argument(s) in `compile`: {'jit_compile'}
After upgrading tf-gpu from 2.4.1 to 2.5.1:
AttributeError: module 'numpy' has no attribute 'object'
Uninstalled numpy, pandas, tf-gpu and tf, installed tf-gpu 2.6; still broken, and my torch environment got wrecked along the way. Damn. (See the note below.)
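For what it's worth, np.object is an alias that numpy deprecated in 1.20 and removed in 1.24, so this AttributeError generally means numpy is too new for the TF build. A less destructive retry is a fresh environment that leaves the torch setup alone (env name arbitrary; not verified against this exact version mix):

python -m venv tf26_env && source tf26_env/bin/activate
pip install tensorflow-gpu==2.6.0 "numpy<1.24"    # TF 2.6 itself pins numpy to ~1.19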
15. RuntimeError: Failed to import transformers.models.bert.modeling_bert because of the following error (look up to see its traceback):
cannot import name dataclass_transform
It later turned out to be a deepspeed problem; uninstalling and reinstalling it fixed things, as below.
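i.e. (for the record, dataclass_transform itself comes from typing_extensions, so upgrading that package is another angle if the reinstall alone doesn't help):

pip uninstall -y deepspeed
pip install deepspeed
# alternative worth trying: pip install -U typing_extensions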