A log of bugs from NLP work

1. While fine-tuning a transformer on an MLM task (multi-GPU run), the job died under nohup with the error below; search results say it is a nohup bug.

{'loss': 1.5461, 'learning_rate': 3.933343085625122e-05, 'epoch': 0.64}
21%|██▏ | 35000/164064 [2:03:12<7:30:17, 4.78it/s]
[INFO|trainer.py:2700] 2022-12-28 20:39:56,894 >> Saving model checkpoint to /data//models/myhugBert30w2/checkpoint-35000
[INFO|configuration_utils.py:447] 2022-12-28 20:39:56,895 >> Configuration saved in /data//models/myhugBert30w2/checkpoint-35000/config.json
[INFO|modeling_utils.py:1702] 2022-12-28 20:39:57,345 >> Model weights saved in /data//models/myhugBert30w2/checkpoint-35000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2157] 2022-12-28 20:39:57,346 >> tokenizer config file saved in /data//models/myhugBert30w2/checkpoint-35000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2164] 2022-12-28 20:39:57,346 >> Special tokens file saved in /data//models/myhugBert30w2/checkpoint-35000/special_tokens_map.json
22%|██▏ | 36805/164064 [2:09:35<7:35:49, 4.65it/s]
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33543 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33544 closing signal SIGHUP

According to the search results, running the job inside tmux instead of nohup solves it.
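If tmux isn't an option, one alternative sketch (my own assumption, not from the torch docs) is to explicitly ignore SIGHUP at the top of the training script. nohup is supposed to detach the process from the terminal, but a launcher's child processes can reinstall the default handler:

```python
import signal

# Explicitly ignore SIGHUP so the job survives the controlling terminal
# going away; nohup should already arrange this, but child processes
# spawned by a launcher may reset the handler.
signal.signal(signal.SIGHUP, signal.SIG_IGN)

print(signal.getsignal(signal.SIGHUP) is signal.SIG_IGN)  # True
```

Whether torch.distributed.elastic later re-registers its own handler is not guaranteed, so tmux remains the safer fix.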

2. Running a sentence-transformers model through transformers, computing sentence embeddings over repeated inference calls, hit the following bug:

tokenization_utils_fast.py", line 429, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Search results blame a too-new transformers version; I don't believe that. I tried running sentence-transformers directly instead, which solved it.
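For reference, this TypeError from encode_batch usually means a non-string (a None, or a NaN read in via pandas) slipped into the input batch. A minimal sketch of a sanitizer to run before calling the model's encode method (the helper name is mine):

```python
import math

def clean_batch(texts):
    """Drop entries the fast tokenizer cannot encode: None, NaN, non-str."""
    cleaned = []
    for t in texts:
        if t is None:
            continue
        if isinstance(t, float) and math.isnan(t):  # NaN from pandas
            continue
        cleaned.append(str(t))
    return cleaned

print(clean_batch(["hello world", None, float("nan"), 42]))  # ['hello world', '42']
```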

3. wc -l and pandas read_csv disagree on the row count of the same file: wc -l reports 4,000,000 lines while pandas reads 3,800,000, a gap of nearly 200,000 — far too large. And after reading with pandas and saving the data back out, wc -l still reports 4,000,000. Something strange is going on.

Workaround: read the file with open() instead, and store the resulting list with pickle. [Solved]
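A likely explanation for the gap (my reading, not verified against the original file): newline characters embedded inside quoted CSV fields. wc -l counts raw newlines, while pandas counts logical records, so one multi-line field collapses several raw lines into a single row. A stdlib sketch of the effect, plus the with open + pickle workaround:

```python
import csv
import io
import pickle

# One quoted field containing an embedded newline:
# two raw lines, but a single logical CSV record.
raw = 'id,text\n1,"line one\nline two"\n'

raw_lines = raw.count("\n")                   # what `wc -l` counts: 3
records = list(csv.reader(io.StringIO(raw)))  # what pandas counts: 2 (header + 1 row)
print(raw_lines, len(records))                # 3 2

# The workaround above: read raw lines yourself and pickle the list.
lines = raw.splitlines()
blob = pickle.dumps(lines)
```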

4. Installing syngec failed with:

python setup.py egg_info did not run successfully.

error: metadata-generation-failed  

ERROR: Failed building wheel for jsonnet

After installing jsonnet via conda install -c conda-forge jsonnet, rerunning the install succeeds. [Confirmed for seq2edit; untested for syngec]

5. Installing wget and other packages locally on a Mac (Homebrew):

fatal: not in a git directory

Error: Command failed with exit 128: git — the following commands fix it:

git config --global --add safe.directory /opt/homebrew/Library/Taps/homebrew/homebrew-core
git config --global --add safe.directory /opt/homebrew/Library/Taps/homebrew/homebrew-cask

6, ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /data/envs/ddddc/lib/python3.7/site-packages/opencc/clib/opencc_clib.cpython-37m-x86_64-linux-gnu.so)

First verify; CXXABI_1.3.8 is indeed missing:

sudo strings /lib64/libstdc++.so.6 | grep 'CXXABI'

CentOS has no apt, so the Ubuntu-style instructions don't apply.

Download libstdc++ from: https://pan.baidu.com/s/14ErRZGP_fxDQcaO3VFYmcA?pwd=mrp6 (extraction code: mrp6)

Then follow the steps below.

# unzip the downloaded archive and put the library in this directory
cd /usr/lib64
ls | grep libstdc

# back up the old library
mv libstdc++.so.6 libstdc++.so.6_bak

# recreate the symlink
ln -s libstdc++.so.6.0.26 libstdc++.so.6

Run the strings query above again to confirm the new version is present.

7. Installing filelock==3.7.1 and gpustat==0.6.0 hit problem 4 again — it's still the syngec project.

Commenting out those two packages (i.e., not installing them) works.

8. FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.

Pinning scikit-learn at version 1.0 or below works; on newer versions, switch to get_feature_names_out as the warning says.
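The forward-compatible fix, rather than pinning: call get_feature_names_out (added in scikit-learn 1.0, the only option from 1.2 on). A sketch with CountVectorizer that falls back for older versions:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["the cat sat", "the dog ran"])

# get_feature_names() was deprecated in 1.0 and removed in 1.2;
# use get_feature_names_out() where available.
if hasattr(vec, "get_feature_names_out"):
    names = list(vec.get_feature_names_out())
else:
    names = vec.get_feature_names()

print(names)  # ['cat', 'dog', 'ran', 'sat', 'the']
```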

9. Installing an older numpy: pip install numpy==1.10.1 --use-pep517

10. dlopen(/Users/Library/Python/3.9/lib/python/site-packages/pandas/_libs/interval.cpython-39-darwin.so, 0x0002): tried: '/Users/Library/Python/3.9/lib/python/site-packages/pandas/_libs/interval.cpython-39-darwin.so' (mach-o file, but is an incompatible architecture (have (arm64), need (x86_64)))

Switch to an x86 shell: arch -x86_64 $SHELL

Then reinstall the affected packages.

11. pip install seqeval fails.

Install setuptools-scm first (pip install setuptools-scm), then install seqeval.

12. json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 18 (char 17)

Single quotes vs. double quotes: JSON requires double-quoted strings.
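A quick illustration: json.loads rejects single-quoted strings. If the producer of the data can't be fixed, ast.literal_eval parses the Python-style literal safely, after which you can re-dump valid JSON:

```python
import ast
import json

bad = "{'name': 'Alice', 'tags': ['nlp']}"  # single quotes: invalid JSON

try:
    json.loads(bad)
except json.JSONDecodeError as e:
    print("json.loads failed:", e)

# Python-style literals parse safely with ast.literal_eval.
obj = ast.literal_eval(bad)
good = json.dumps(obj)
print(json.loads(good))  # {'name': 'Alice', 'tags': ['nlp']}
```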

13. TorchServe requires Java:

java.lang.NoSuchMethodError: java.nio.file.Files.readString(Ljava/nio/file/Path;)Ljava/lang/String;

Installed Java 1.8 and found it isn't enough — Java 11 is required (Files.readString was added in Java 11). Remove the old version first:

yum -y remove java-1.8.0-openjdk*

14. Training an NER task with TF 2.4:

Cause: module 'gast' has no attribute 'Index'

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.

Traceback (most recent call last):
  File "/data/transformers/examples/tensorflow/token-classification/run_ner.py", line 592, in <module>
    main()
  File "/data/transformers/examples/tensorflow/token-classification/run_ner.py", line 451, in main
    model.compile(optimizer=optimizer, jit_compile=training_args.xla)
  File "/data/python3.9/site-packages/transformers/modeling_tf_utils.py", line 1412, in compile
    super().compile(
  File "/data/python3.9/site-packages/tensorflow/python/keras/engine/training.py", line 534, in compile
    self._validate_compile(optimizer, metrics, **kwargs)
  File "/data/python3.9/site-packages/tensorflow/python/keras/engine/training.py", line 2519, in _validate_compile
    raise TypeError('Invalid keyword argument(s) in `compile`: %s' %
TypeError: Invalid keyword argument(s) in `compile`: {'jit_compile'}

After upgrading tf-gpu from 2.4.1 to 2.5.1:

AttributeError: module 'numpy' has no attribute 'object'

Uninstalled numpy, pandas, tf-gpu and tf, then installed tf-gpu 2.6 — still broken, and it wrecked my torch environment in the process. Ugh.
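On the numpy error specifically: the np.object alias was deprecated in numpy 1.20 and removed in 1.24, which is where AttributeError: module 'numpy' has no attribute 'object' comes from. Old code generally just needs the builtin object (or a numpy<1.24 pin); a sketch:

```python
import numpy as np

# np.object was removed in numpy 1.24; the builtin `object` is the
# drop-in replacement in dtype declarations.
arr = np.array(["a", 1, None], dtype=object)  # instead of dtype=np.object
print(arr.dtype)  # object
```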

15. RuntimeError: Failed to import transformers.models.bert.modeling_bert because of the following error (look up to see its traceback):

cannot import name dataclass_transform

It turned out to be a deepspeed problem; uninstalling and reinstalling deepspeed fixed it.
