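This post records the main steps I took to get wav2vec running in Fairseq; I hope it is useful to others.
The basic process is as follows.
# Original command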
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext $ext --valid-percent $valid
# After patching, the module form can be used instead
python -m fairseq.examples.wav2vec.wav2vec_manifest /data/Corpus/BostenAI/BSTPlan0 --dest ~/Documents/Projects/Fairseq/Corpus/BSTPlan0 --ext wav --valid-percent 0.1
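To use the module form, add an __init__.py file to the wav2vec directory. The official guide installs with --editable, so uninstall and then reinstall without --editable.
# pip install --editable ./ only creates a link: the package is not copied into site-packages but linked to the source tree, which is a handy trick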
~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq.egg-link
# Its contents:
~/Documents/workspace/fairseq
.
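The link file shown above can also be read from a script, for instance to confirm which source tree an environment points at. A minimal sketch, assuming the two-line layout shown here (source directory first, then a relative path); read_egg_link is a hypothetical helper, not part of pip or Fairseq:

```python
from pathlib import Path


def read_egg_link(link_file):
    """Return the source directory recorded in an .egg-link file.

    Assumes the layout shown above: the first non-empty line is the
    path of the linked source tree (hypothetical helper).
    """
    for line in Path(link_file).read_text().splitlines():
        line = line.strip()
        if line:
            return Path(line).expanduser()
    raise ValueError(f"{link_file} is empty")
```

If the file does not exist at all, the package was most likely installed normally rather than linked.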
# A working example
python ~/Documents/workspace/fairseq/examples/wav2vec/wav2vec_manifest.py \
/data/Corpus/BostenAI/BSTPlan0 \
--dest ~/Documents/Projects/Fairseq/Corpus/BSTPlan0 \
--ext wav --valid-percent 0.1
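Hard to believe, but wav2vec_manifest is a single-process program. I could not put up with that and wrote a parallel version, only to find that the hard disk is the main bottleneck, so I will not post the code (it is ugly!).
It even left the HDD with bad sectors!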
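For readers who want to try the same idea, here is a rough sketch of a concurrent manifest builder. It is not the version mentioned above. It assumes the manifest layout of one root directory line followed by relative_path<TAB>frame_count lines, swaps in the standard-library wave module (so it only handles WAV files), and skips the train/valid split. build_manifest and count_frames are hypothetical names:

```python
import os
import wave
from concurrent.futures import ThreadPoolExecutor


def count_frames(path):
    """Read the frame count from a WAV header (stdlib only)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes()


def build_manifest(root, dest_tsv, ext="wav", workers=8):
    """Write a manifest: root dir on line 1, then 'relative_path<TAB>frames'."""
    root = os.path.realpath(root)
    files = sorted(
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
        if name.endswith("." + ext)
    )
    # Threads are enough here: the work is I/O bound, which matches the
    # observation above that the disk is the bottleneck.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        frames = list(pool.map(count_frames, files))
    with open(dest_tsv, "w") as out:
        print(root, file=out)
        for path, n in zip(files, frames):
            print(f"{os.path.relpath(path, root)}\t{n}", file=out)
    return len(files)
```

On a spinning disk a small workers value is probably the safer choice, since many concurrent header reads turn into random seeks.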
# After installation the fairseq-hydra-train command is available
fairseq-hydra-train \
task.data=/data/Temp/SLR12/ \
--config-dir ~/Documents/workspace/fairseq/examples/wav2vec/config/pretraining \
--config-name d1 \
--restore_file checkpoint_last.pt \
--tensorboard_logdir outputs --save_dir
fairseq-hydra-train task.data=/data/Temp/SLR12/ checkpoint.finetune_from_model=~/Documents/Projects/Fairseq/outputs/2021-03-03/18-48-53/checkpoints/checkpoint_last.pt common.tensorboard_logdir=~/Documents/Projects/Fairseq/tensorboard --config-dir ~/Documents/workspace/fairseq/examples/wav2vec/config/pretraining --config-name d1
Notes on the key parameters:
Output after running:
2021-03-04 15:54:23 | INFO | fairseq.checkpoint_utils | loading pretrained model from ~/Documents/Projects/Fairseq/outputs/2021-03-03/18-48-53/checkpoints/checkpoint_last.pt: optimizer, lr scheduler, meters, dataloader will be reset
2021-03-04 15:55:03 | INFO | fairseq.trainer | loaded checkpoint ~/Documents/Projects/Fairseq/outputs/2021-03-03/18-48-53/checkpoints/checkpoint_last.pt (epoch 103 @ 0 updates)
2021-03-04 15:55:03 | INFO | fairseq.optim.adam | using FusedAdam
2021-03-04 15:55:04 | INFO | fairseq.trainer | loading train data for epoch 1
2021-03-04 15:55:04 | INFO | fairseq.data.audio.raw_audio_dataset | loaded 28219, skipped 24 samples
2021-03-04 15:55:04 | WARNING | fairseq.logging.progress_bar | tensorboard not found, please install with: pip install tensorboardX
2021-03-04 15:55:04 | INFO | fairseq.trainer | begin training epoch 1
2021-03-04 15:55:27 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2021-03-04 15:55:40 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 32.0
2021-03-04 15:55:52 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 16.0
2021-03-04 15:56:04 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 8.0
2021-03-04 15:56:46 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 4.0
Next, reproducing the basic example from the official guide
Fine-tuning a model requires parallel audio and labels file, as well as a vocabulary file in fairseq format.
A letter vocabulary can be downloaded here.
An example script that generates labels for the Librispeech dataset from the tsv file produced by wav2vec_manifest.py can be used as follows:
split=train
$ python libri_labels.py /path/to/tsv --output-dir /output/dir --output-name $split
Fine-tuning on 100h of Librispeech with letter targets:
$ fairseq-hydra-train \
distributed_training.distributed_port=$PORT \
task.data=/path/to/data \
model.w2v_path=/path/to/model.pt \
--config-dir /path/to/fairseq-py/examples/wav2vec/config/finetuning \
--config-name base_100h
There are other config files in the config/finetuning directory that can be used to fine-tune on other splits.
You can specify the right config via the --config-name parameter.
Note: you can simulate 24 GPUs by using k GPUs and adding command line parameters (before --config-dir)
distributed_training.distributed_world_size=k
+optimization.update_freq='[x]'
where x = 24/k
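The arithmetic in that note can be sketched as a tiny helper that builds the two overrides for a given k. grad_accum_overrides is a hypothetical name used only for illustration; the override strings are the ones quoted in the note:

```python
def grad_accum_overrides(k, target_gpus=24):
    """Return the two overrides for simulating target_gpus GPUs on k GPUs."""
    if k <= 0 or target_gpus % k != 0:
        raise ValueError(f"{target_gpus} is not divisible by k={k}")
    x = target_gpus // k  # gradient accumulation steps per update
    return [
        f"distributed_training.distributed_world_size={k}",
        f"+optimization.update_freq='[{x}]'",
    ]
```

With a single GPU this gives update_freq 24, and with 8 GPUs it gives 3.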
Decoding with a language model during training requires wav2letter python bindings.
If you want to use a language model, add +criterion.wer_args='[/path/to/kenlm, /path/to/lexicon, 2, -1]' to the command line.
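Basic conclusion: the example can be made to run, but there are quite a few problems.
Notes: a few points need emphasis:
1. The pretrained model downloaded from the web has already been trained with CTC, so it cannot be loaded by following the official guide. !!!!! IMPORTANT !!!!!
2. Running correctly requires patching the code; the ast call in the argument-parsing stage has a serious problem.
Create the dataset following the guide's example. Fine-tuning appears to use the dev-other dataset, so files such as train.tsv and train.ltr have to be created with the scripts; if they are missing, create them by hand.
# Declare the variable; standard example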
split=train
$ python ~/Documents/workspace/fairseq/examples/wav2vec/libri_labels.py ~/Documents/Projects/Fairseq/Corpus/BSTPlan1 --output-dir ~/Documents/Projects/Fairseq/Corpus/BSTPlan1 --output-name $split
# Install-path-independent example, after patching
# Create the labels for train.tsv
$ python -m fairseq.examples.wav2vec.libri_labels ~/Documents/Projects/Fairseq/Corpus/BSTPlan1/train.tsv --output-dir ~/Documents/Projects/Fairseq/Corpus/BSTPlan1 --output-name train
# Create train.ltr
$ python -m fairseq.examples.wav2vec.libri_labels /data/Cache/SLR12/train.tsv --output-dir ~/Documents/Projects/Fairseq/Corpus/SLR21 --output-name train
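For reference, the .ltr file holds one line per utterance, with letters separated by spaces and | marking word boundaries. A minimal sketch of that conversion, assuming this format; to_letter_target is a hypothetical name and not the function inside libri_labels.py:

```python
def to_letter_target(transcript):
    """Turn 'HELLO WORLD' into 'H E L L O | W O R L D |'."""
    words = transcript.strip().split()
    # Join words with '|', then put a space between every character.
    return " ".join("|".join(words)) + " |"
```

This is handy for creating the label file by hand when the corpus is not laid out like Librispeech.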
A working fine-tuning script:
#Fine-tuning on 100h of Librispeech with letter targets:
#There are other config files in the config/finetuning directory that can be used to fine-tune on other splits.
$ CUDA_VISIBLE_DEVICES=2 fairseq-hydra-train task.data=/data/Cache/SLR12/dev-other/ \
common.tensorboard_logdir=~/Documents/Projects/Fairseq/tensorboard \
model.w2v_path=~/Documents/Research/fairseq/model/wav2vec_vox_960h_pl.pt.zip \
distributed_training.distributed_world_size=1 optimization.update_freq=[24] \
distributed_training.device_id=2 \
--config-dir ~/Documents/workspace/fairseq/examples/wav2vec/config/finetuning \
--config-name vox_960h
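# Parameter notes
# task.data: data directory
# common.tensorboard_logdir: enables TensorBoard; those who use it know why
# model.w2v_path: path to the pretrained model
# CUDA_VISIBLE_DEVICES=2: see the bug notes below (3090 and 2080Ti issue)
# optimization.update_freq=[24]: note this is [24], not '[24]'; after the patch '[24]' should work too
NOTES: the debugging below may include bugs that only show up along the way; with the correct script they may never appear.
Symptom: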
Traceback (most recent call last):
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq_cli/hydra_train.py", line 38, in hydra_main
distributed_utils.call_main(cfg, pre_main)
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/distributed_utils.py", line 334, in call_main
main(cfg, **kwargs)
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq_cli/train.py", line 69, in main
task.load_dataset(valid_sub_split, combine=False, epoch=1)
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/tasks/audio_pretraining.py", line 137, in load_dataset
self.datasets[split] = FileAudioDataset(
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/data/audio/raw_audio_dataset.py", line 158, in __init__
with open(manifest_path, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/data/Cache/SLR12/dev_other.tsv'
The main cause is that the dataset files were never generated. Generate them:
# Generate /data/Cache/SLR12/dev_other.tsv
python -m fairseq.examples.wav2vec.wav2vec_manifest_parallel /data/Cache/SLR12/dev-other/LibriSpeech/dev-other --dest /data/Cache/SLR12/dev-other --ext flac --valid-percent 0
# Generate the ltr labels
python -m fairseq.examples.wav2vec.libri_labels /data/Cache/SLR12/dev-other/dev_other.tsv --output-dir /data/Cache/SLR12/dev-other/ --output-name dev_other
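Both this error and the next one are plain missing-file errors, so a quick check before launching training saves a round trip. A minimal sketch, assuming the data directory needs a manifest, a label file and a dictionary named after the split and label type as the error messages in this post suggest; missing_finetune_files is a hypothetical helper:

```python
import os


def missing_finetune_files(data_dir, split, labels="ltr"):
    """List the files fine-tuning is expected to need but that are absent."""
    # Assumed naming: <split>.tsv, <split>.<labels>, dict.<labels>.txt
    expected = [f"{split}.tsv", f"{split}.{labels}", f"dict.{labels}.txt"]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(data_dir, name))
    ]
```

An empty list means all three files are in place for that split.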
Then another problem appeared. I copied in the dict.ltr.txt resource; it could also be generated with a script, but I took the lazy route.
FileNotFoundError: [Errno 2] No such file or directory: '/data/Cache/SLR12/dev_other/dict.ltr.txt'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
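For completeness, generating the dictionary by script could look roughly like this. It is a sketch that assumes the dictionary is a text file with one "symbol count" pair per line; write_letter_dict is a hypothetical helper, not a Fairseq tool:

```python
from collections import Counter


def write_letter_dict(ltr_path, dict_path):
    """Count the symbols in a .ltr file and write 'symbol count' lines."""
    counts = Counter()
    with open(ltr_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    with open(dict_path, "w", encoding="utf-8") as out:
        # Most frequent first; ties broken alphabetically for stable output.
        for symbol, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            print(f"{symbol} {n}", file=out)
    return len(counts)
```

A dictionary produced this way will generally not have the same symbol order as the downloadable one, so it should not be paired with a model that was fine-tuned against a different dictionary.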
Following the note in the official guide:
Note: you can simulate 24 GPUs by using k GPUs and adding command line parameters (before --config-dir)
distributed_training.distributed_world_size=k
+optimization.update_freq='[x]'
where x = 24/k
the following problem occurs:
Traceback (most recent call last):
....
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/models/wav2vec/wav2vec2_asr.py", line 281, in __init__
state = checkpoint_utils.load_checkpoint_to_cpu(cfg.w2v_path, arg_overrides)
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/checkpoint_utils.py", line 237, in load_checkpoint_to_cpu
state = _upgrade_state_dict(state)
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/checkpoint_utils.py", line 468, in _upgrade_state_dict
state["cfg"] = convert_namespace_to_omegaconf(state["args"])
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/dataclass/utils.py", line 335, in convert_namespace_to_omegaconf
overrides, deletes = override_module_args(args)
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/dataclass/utils.py", line 278, in override_module_args
_override_attr(k, FairseqConfig.__dataclass_fields__[k].type, args)
File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/dataclass/utils.py", line 224, in _override_attr
val = ast.literal_eval(val)
File "~/anaconda3/envs/lSrv09/lib/python3.9/ast.py", line 105, in literal_eval
return _convert(node_or_string)
File "~/anaconda3/envs/lSrv09/lib/python3.9/ast.py", line 104, in _convert
return _convert_signed_num(node)
File "~/anaconda3/envs/lSrv09/lib/python3.9/ast.py", line 78, in _convert_signed_num
return _convert_num(node)
File "~/anaconda3/envs/lSrv09/lib/python3.9/ast.py", line 69, in _convert_num
_raise_malformed_node(node)
File "~/anaconda3/envs/lSrv09/lib/python3.9/ast.py", line 66, in _raise_malformed_node
raise ValueError(f'malformed node or string: {node!r}')
ValueError: malformed node or string: <ast.Name object at 0x7fad3c167640>
The main cause is a bug in how the ast library interacts with Fairseq's use of typing.Optional.
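The ValueError at the bottom of the traceback is what ast.literal_eval raises when it is handed a string that is a bare identifier rather than a Python literal. A minimal reproduction, independent of Fairseq; the input "letter" is only an illustrative value and not necessarily the one stored in the checkpoint:

```python
import ast


def try_literal(text):
    """Return the parsed literal, or the ValueError message if parsing fails."""
    try:
        return ast.literal_eval(text)
    except ValueError as err:
        return f"ValueError: {err}"


print(try_literal("[24]"))    # a list literal parses into a list
print(try_literal("'[24]'"))  # a quoted literal parses into a str
print(try_literal("letter"))  # a bare name is not a literal and fails
```

So any string-valued field that reaches this call by mistake, because its type was not recognized as a string or enum, fails in exactly this way.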
Tracing: 1. Stepping through gives the following, which shows the argument is being passed incorrectly; changing it to [24] fails as well.
>>> Error merging override optimization.update_freq='[24]'
Invalid value assigned : str is not a subclass of ListConfig or list.
full_key: optimization.update_freq
reference_type=OptimizationConfig
object_type=OptimizationConfig
>>> field_type
typing.Optional[fairseq.dataclass.constants.Choices]
# Traced to fairseq/dataclass/utils.py:
if (
    isinstance(val, str)
    and not val.startswith("${")  # not interpolation
    and field_type != str
    and (not inspect.isclass(field_type) or not issubclass(field_type, Enum))  # not choices enum
):
    # upgrade old models that stored complex parameters as string
    val = ast.literal_eval(val)
# The check here goes wrong: not inspect.isclass(field_type) is True, which leads to the bad evaluation
# The cause seems to be that inspect does not recognize typing.Optional
# Patch as follows: ~/Documents/workspace/fairseq/fairseq/dataclass/utils.py
def interpret_dc_type(field_type):
    if isinstance(field_type, str):
        raise RuntimeError("field should be a type")
    if field_type == Any:
        return str
    typestring = str(field_type)
    if re.match(r"(typing.|^)Union\[(.*), NoneType\]$", typestring):
        return field_type.__args__[0]
    # The change: also recognize typing.Optional
    # linger>>>>
    if re.match(r"(typing.|^)Optional\[*(.*)?\]$", typestring):
        return field_type.__args__[0]
    # linger<<<<
    return field_type
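The patch above matches on the printed form of the type, and that form appears to be what changed: on Python 3.9 an Optional[X] prints as typing.Optional[X] instead of typing.Union[X, NoneType]. A sketch of an alternative that inspects the type itself and so does not depend on how it prints. This is not what Fairseq ships; unwrap_optional is a hypothetical name:

```python
from typing import Optional, Union, get_args, get_origin


def unwrap_optional(field_type):
    """Return X for Optional[X] (that is, Union[X, None]); otherwise unchanged."""
    if get_origin(field_type) is Union:
        args = get_args(field_type)
        non_none = [a for a in args if a is not type(None)]
        # Only unwrap the two-member form Union[X, None].
        if len(args) == 2 and len(non_none) == 1:
            return non_none[0]
    return field_type


print(unwrap_optional(Optional[int]))
```

get_origin and get_args need Python 3.8 or later; the X | None spelling from newer Python versions is not handled by this sketch.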
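This came from misunderstanding the official documentation; see the section Fine-tune a pre-trained model with CTC. That section does not say clearly what model.w2v_path should be, so I pointed it at the pretrained model downloaded from the web, which caused a circular-reference error.
The circular reference breaks loading. It cost me a lot of time in last year's tests, and another 7 days of debugging this time. Truly the one trick the master keeps to himself! Documentation from big companies has its pitfalls too. I will skip the detailed analysis of the loop; too tiring!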
# torch.nn.modules.module.ModuleAttributeError: 'Wav2VecCtc' object has no attribute 'remove_pretraining_modules'
# In that case, just give it an empty method
# In class Wav2VecCtc, fairseq/fairseq/models/wav2vec/wav2vec2_asr.py
def remove_pretraining_modules(self):
    """Linger: work around ModuleAttributeError: 'Wav2VecCtc' object has no attribute 'remove_pretraining_modules'."""
    pass
Note: this change is actually pointless; the real issue is the difference between reloading a model and loading it for the first time.
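One way to avoid this trap is to look at what kind of model a checkpoint holds before passing it as model.w2v_path. The sketch below works on the dictionary returned by loading the checkpoint. It rests on assumptions that should be verified against your own files: that the model name lives under cfg.model._name in newer checkpoints or args.arch in older ones, and that CTC models carry "ctc" in that name. Both function names are hypothetical:

```python
def checkpoint_model_name(state):
    """Best-effort lookup of the model name stored in a loaded checkpoint dict."""
    cfg = state.get("cfg")
    if cfg is not None and cfg.get("model") is not None:
        model = cfg["model"]
        # Assumption: newer checkpoints keep the name under cfg.model._name.
        if hasattr(model, "get"):
            name = model.get("_name")
        else:
            name = getattr(model, "_name", None)
        if name:
            return name
    # Assumption: older checkpoints keep an argparse Namespace with .arch.
    return getattr(state.get("args"), "arch", None)


def looks_finetuned(state):
    """True when the stored name suggests a CTC model, not a pretraining one."""
    name = checkpoint_model_name(state)
    return name is not None and "ctc" in name
```

With PyTorch installed, state would come from torch.load(path, map_location="cpu"); a True result suggests the file is already fine-tuned and is the wrong input for model.w2v_path.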
Using my own pretrained model in place of the downloaded one leads to the following configuration error:
omegaconf.errors.ConfigAttributeError: Key 'normalize' is not in struct
full_key: task.normalize
reference_type=Any
object_type=dict
No way around it: either disable the check in the code or add the config entry.
File "~/Documents/workspace/fairseq/fairseq/data/data_utils.py", line 303, in batch_by_size
from fairseq.data.data_utils_fast import (
File "fairseq/data/data_utils_fast.pyx", line 1, in init fairseq.data.data_utils_fast
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
# NumPy version problem
pip install --upgrade numpy
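The following error used to appear and went away after setting CUDA_VISIBLE_DEVICES=2. My analysis is that it comes from using a 3090 and a 2080Ti together; the 3090 sits in slot 0, after all.
# Error message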
File "~/anaconda3/envs/lSrv08/lib/python3.8/site-packages/fairseq/models/wav2vec/wav2vec2.py", line 460, in forward
unmasked_features = features.clone()
RuntimeError: CUDA error: no kernel image is available for execution on the device
# It went away after using the following command
CUDA_VISIBLE_DEVICES=2 fairseq-hydra-train task.data=/data/Cache/SLR12/dev-other/ common.tensorboard_logdir=~/Documents/Projects/Fairseq/tensorboard model.w2v_path=~/Documents/Projects/Fairseq/outputs/2021-03-04/15-54-15/checkpoints/checkpoint_last.pt distributed_training.distributed_world_size=1 optimization.update_freq=[24] distributed_training.device_id=2 --config-dir ~/Documents/workspace/fairseq/examples/wav2vec/config/finetuning --config-name vox_100h
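A possible explanation for the "no kernel image" message is that the installed PyTorch build contains no kernels compiled for one of the GPUs, which would fit the fact that hiding the 3090 made it go away. The sketch below compares a list of compiled architectures with a device's compute capability. It is a simplification written for this post, not a PyTorch API; has_kernel_for is a hypothetical name:

```python
def has_kernel_for(arch_list, capability):
    """Check whether any compiled 'sm_XY' arch can run on a device capability.

    Simplification: binaries are treated as usable when the major version
    matches and the compiled minor version is not newer; PTX entries
    ('compute_XY') and suffixed names are ignored.
    """
    major, minor = capability
    for arch in arch_list:
        digits = arch[3:]
        if not arch.startswith("sm_") or not digits.isdigit():
            continue
        a_major, a_minor = int(digits[:-1]), int(digits[-1])
        if a_major == major and a_minor <= minor:
            return True
    return False
```

With PyTorch installed, the inputs would come from torch.cuda.get_arch_list() and torch.cuda.get_device_capability(i) for each device index i.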
With that, the CTC fine-tuning example from the guide is finally running.