wav2vec Training Notes

This post records the main steps I went through to get Fairseq's wav2vec running. I hope it is useful to others.

Key Conclusions

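  1. With modifications, the code can be made to run, unlike my attempt in December 2020.
  2. The term "pretrained" is ambiguous. The pretrained models linked from the Fairseq guide have already been fine-tuned; they are not the raw audio-pretraining models. Passing one of them directly as the model to fine-tune causes a circular load and does not work. This was the core reason my two attempts failed.
  3. Code quality from big companies cannot be taken for granted either.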

The basic process follows.

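Contents

  • wav2vec Training Notes
  • Key Conclusions
  • 1. Pretraining Example
    • 1.1 Data Preparation
      • Tip 1: Patching to support module-style invocation
      • Tip 2: Patching to load data with multiple processes
    • 1.2 Initial Training
      • Tip 3: If training is interrupted, you can pick it up again
  • 2. Training and Testing a Speech Recognition Model
      • Fine-tune a pre-trained model with CTC:
    • 2.1 Preparing the Data
    • 2.2 Fine-tuning the Model
      • Bug 1: Data argument problem
      • Bug 2: Argument passing when simulating multiple GPUs
      • Bug 3: Misunderstanding what "pretrained model" means
        • Bug 3.1: remove_pretraining_modules error
        • Bug 3.2: normalize error
      • Bug 4: NumPy version problem
      • Bug 5: Problems from using a 3090 and a 2080 Ti together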

1. Pretraining Example

1.1 Data Preparation

# Original command
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext $ext --valid-percent $valid
# After patching, the module form can be used
python -m fairseq.examples.wav2vec.wav2vec_manifest /data/Corpus/BostenAI/BSTPlan0 --dest ~/Documents/Projects/Fairseq/Corpus/BSTPlan0 --ext wav --valid-percent 0.1

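Tip 1: Patching to support module-style invocation

To use the module form, add an __init__.py file to the wav2vec example directory. Because the official guide installs with --editable, you need to uninstall Fairseq and reinstall it without --editable.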

# pip install --editable ./ creates a link instead of copying the package into site-packages, which is a handy trick:
~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq.egg-link
# Its contents are:
~/Documents/workspace/fairseq
.
# An example that works
python ~/Documents/workspace/fairseq/examples/wav2vec/wav2vec_manifest.py  \
/data/Corpus/BostenAI/BSTPlan0 \
--dest ~/Documents/Projects/Fairseq/Corpus/BSTPlan0 \
--ext wav --valid-percent 0.1
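
As a rough illustration of the __init__.py step from Tip 1, the sketch below marks every directory between a source root and a leaf directory as a Python package. The helper name `ensure_package_chain` and the temporary directory layout are assumptions made for this example; they are not part of Fairseq.

```python
import tempfile
from pathlib import Path


def ensure_package_chain(root, leaf):
    """Create an empty __init__.py in each directory below root, down to leaf.

    Returns the list of files that were newly created.
    """
    root, leaf = Path(root).resolve(), Path(leaf).resolve()
    rel = leaf.relative_to(root)  # raises ValueError if leaf is not under root
    created = []
    current = root
    for part in rel.parts:
        current = current / part
        init_file = current / "__init__.py"
        if not init_file.exists():
            init_file.touch()
            created.append(init_file)
    return created


# Demo on a throwaway directory that mimics examples/wav2vec (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    leaf = Path(tmp) / "examples" / "wav2vec"
    leaf.mkdir(parents=True)
    created = ensure_package_chain(tmp, leaf)
    created_names = [p.relative_to(Path(tmp).resolve()).as_posix() for p in created]
    second_run = ensure_package_chain(tmp, leaf)  # nothing left to create
```

Running the helper a second time creates nothing, so it is safe to call repeatedly.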

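Tip 2: Patching to load data with multiple processes

Surprisingly, wav2vec_manifest is a single-process program. I could not put up with that and wrote a parallel version, only to find that the hard disk was the main bottleneck. I am not posting that code; it is ugly.
It even left the HDD with bad sectors!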
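
The parallel version itself is not shown in this post. Purely as a sketch of the idea, and not the author's code, the example below builds manifest lines concurrently. It makes several assumptions: it uses a thread pool instead of processes (the work is I/O-bound, matching the disk bottleneck noted above), it reads frame counts with the standard-library `wave` module, so it only handles .wav files, and it assumes the manifest layout is a root-directory line followed by `relative_path<TAB>frames` lines. The names `count_frames`, `build_manifest_lines`, and `write_silence` are hypothetical.

```python
import tempfile
import wave
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_frames(path):
    """Return the number of audio frames in a .wav file."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes()


def build_manifest_lines(root, ext="wav", workers=8):
    """Scan root recursively and return manifest lines (assumed layout)."""
    root = Path(root)
    files = sorted(root.rglob(f"*.{ext}"))
    # pool.map preserves input order, so frames[i] belongs to files[i].
    with ThreadPoolExecutor(max_workers=workers) as pool:
        frames = list(pool.map(count_frames, files))
    lines = [str(root)]
    lines += [f"{f.relative_to(root).as_posix()}\t{n}" for f, n in zip(files, frames)]
    return lines


def write_silence(path, n_frames, rate=16000):
    """Write a mono 16-bit .wav file containing n_frames of silence."""
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)


# Demo on two tiny generated files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "spk1").mkdir()
    write_silence(root / "a.wav", 160)
    write_silence(root / "spk1" / "b.wav", 320)
    manifest = build_manifest_lines(root)
```

On a spinning disk, raising `workers` may not help and could make things worse, which is consistent with the experience described above.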

1.2 Initial Training

# fairseq-hydra-train is available after installation
fairseq-hydra-train \
    task.data=/data/Temp/SLR12/ \
    --config-dir ~/Documents/workspace/fairseq/examples/wav2vec/config/pretraining \
    --config-name d1 \
    --restore_file checkpoint_last.pt \
    --tensorboard_logdir outputs  --save_dir 

Tip 3: If training is interrupted, you can pick it up again

fairseq-hydra-train     task.data=/data/Temp/SLR12/  checkpoint.finetune_from_model=~/Documents/Projects/Fairseq/outputs/2021-03-03/18-48-53/checkpoints/checkpoint_last.pt  common.tensorboard_logdir=~/Documents/Projects/Fairseq/tensorboard  --config-dir ~/Documents/workspace/fairseq/examples/wav2vec/config/pretraining     --config-name d1

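Notes: key parameters

  1. Override parameters must come before the config options (--config-dir and --config-name);
  2. The option for loading an earlier checkpoint is checkpoint.finetune_from_model (as the log below shows, it resets the optimizer, LR scheduler, meters, and dataloader);
  3. TensorBoard logging is turned on with common.tensorboard_logdir.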
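
The ordering rule in the notes above can be sketched as a small command builder that always emits the key=value overrides before --config-dir and --config-name. The helper `build_hydra_train_command` and the /path/to/... values are placeholders invented for this example.

```python
import shlex


def build_hydra_train_command(overrides, config_dir, config_name):
    """Return the argument list with overrides placed before the config options."""
    cmd = ["fairseq-hydra-train"]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    cmd += ["--config-dir", config_dir, "--config-name", config_name]
    return cmd


cmd = build_hydra_train_command(
    {
        "task.data": "/data/Temp/SLR12/",
        "checkpoint.finetune_from_model": "/path/to/checkpoint_last.pt",  # placeholder
        "common.tensorboard_logdir": "/path/to/tensorboard",  # placeholder
    },
    config_dir="/path/to/fairseq/examples/wav2vec/config/pretraining",  # placeholder
    config_name="d1",
)
print(shlex.join(cmd))
```

Building the list in one place makes it harder to append an override after the config options by accident.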

Output after running:

2021-03-04 15:54:23 | INFO | fairseq.checkpoint_utils | loading pretrained model from ~/Documents/Projects/Fairseq/outputs/2021-03-03/18-48-53/checkpoints/checkpoint_last.pt: optimizer, lr scheduler, meters, dataloader will be reset
2021-03-04 15:55:03 | INFO | fairseq.trainer | loaded checkpoint ~/Documents/Projects/Fairseq/outputs/2021-03-03/18-48-53/checkpoints/checkpoint_last.pt (epoch 103 @ 0 updates)
2021-03-04 15:55:03 | INFO | fairseq.optim.adam | using FusedAdam
2021-03-04 15:55:04 | INFO | fairseq.trainer | loading train data for epoch 1
2021-03-04 15:55:04 | INFO | fairseq.data.audio.raw_audio_dataset | loaded 28219, skipped 24 samples
2021-03-04 15:55:04 | WARNING | fairseq.logging.progress_bar | tensorboard not found, please install with: pip install tensorboardX
2021-03-04 15:55:04 | INFO | fairseq.trainer | begin training epoch 1
2021-03-04 15:55:27 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2021-03-04 15:55:40 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 32.0
2021-03-04 15:55:52 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 16.0
2021-03-04 15:56:04 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 8.0
2021-03-04 15:56:46 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 4.0

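2. Training and Testing a Speech Recognition Model

The goal here is to reproduce the basic example from the official guide, quoted below.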

Fine-tune a pre-trained model with CTC:

Fine-tuning a model requires parallel audio and labels file, as well as a vocabulary file in fairseq format.
A letter vocabulary can be downloaded here.
An example script that generates labels for the Librispeech dataset from the tsv file produced by wav2vec_manifest.py can be used as follows:

split=train
$ python libri_labels.py /path/to/tsv --output-dir /output/dir --output-name $split

Fine-tuning on 100h of Librispeech with letter targets:

$ fairseq-hydra-train \
    distributed_training.distributed_port=$PORT \
    task.data=/path/to/data \
    model.w2v_path=/path/to/model.pt \
    --config-dir /path/to/fairseq-py/examples/wav2vec/config/finetuning \
    --config-name base_100h

There are other config files in the config/finetuning directory that can be used to fine-tune on other splits.
You can specify the right config via the --config-name parameter.

Note: you can simulate 24 GPUs by using k GPUs and adding command line parameters (before --config-dir)
distributed_training.distributed_world_size=k +optimization.update_freq='[x]' where x = 24/k

Decoding with a language model during training requires wav2letter python bindings.
If you want to use a language model, add +criterion.wer_args='[/path/to/kenlm, /path/to/lexicon, 2, -1]' to the command line.

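Bottom line: the example can be made to run, but there are quite a few problems.

Notes: a few points need emphasis:

1. The pretrained models available for download have already been fine-tuned with CTC, so they cannot be loaded the way the official guide describes. IMPORTANT!

2. Getting it to run correctly requires patching the code; the ast-based evaluation in the argument-parsing stage has a serious problem.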

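2.1 Preparing the Data

Build the dataset following the guide's example. Fine-tuning appears to use the dev-other set. The scripts are needed to create train.tsv, train.ltr, and the related files; if a file is missing, create it by hand.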

# Declare the variable; this is the standard example
split=train
$ python ~/Documents/workspace/fairseq/examples/wav2vec/libri_labels.py ~/Documents/Projects/Fairseq/Corpus/BSTPlan1 --output-dir ~/Documents/Projects/Fairseq/Corpus/BSTPlan1 --output-name $split
# After patching, an example that does not depend on the install path
# Generate labels from train.tsv
$ python -m fairseq.examples.wav2vec.libri_labels ~/Documents/Projects/Fairseq/Corpus/BSTPlan1/train.tsv --output-dir ~/Documents/Projects/Fairseq/Corpus/BSTPlan1 --output-name train
# Generate train.ltr
$ python -m fairseq.examples.wav2vec.libri_labels /data/Cache/SLR12/train.tsv --output-dir ~/Documents/Projects/Fairseq/Corpus/SLR21 --output-name train

2.2 Fine-tuning the Model

A fine-tuning command that works:

#Fine-tuning on 100h of Librispeech with letter targets: 
#There are other config files in the config/finetuning directory that can be used to fine-tune on other splits.
$ CUDA_VISIBLE_DEVICES=2  fairseq-hydra-train   task.data=/data/Cache/SLR12/dev-other/ \
    common.tensorboard_logdir=~/Documents/Projects/Fairseq/tensorboard \
    model.w2v_path=~/Documents/Research/fairseq/model/wav2vec_vox_960h_pl.pt.zip \
    distributed_training.distributed_world_size=1 optimization.update_freq=[24] \
    distributed_training.device_id=2 \
    --config-dir ~/Documents/workspace/fairseq/examples/wav2vec/config/finetuning \
    --config-name vox_960h
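# Parameter notes
# task.data: the data directory
# common.tensorboard_logdir: enables TensorBoard; you will appreciate it once you use it
# model.w2v_path: path to the pretrained model
# CUDA_VISIBLE_DEVICES=2: see Bug 5 (the 3090 and 2080 Ti problem)
# optimization.update_freq=[24]: written as [24], not '[24]'; after the patch, '[24]' should work as well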
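
The value 24 comes from the guide's rule x = 24/k with k = 1 GPU. A minimal sketch of that arithmetic follows; the helper names `update_freq_for` and `gpu_overrides` are made up for illustration, and the override strings simply mirror the command above.

```python
def update_freq_for(available_gpus, simulated_gpus=24):
    """Return x such that available_gpus * x == simulated_gpus."""
    if available_gpus <= 0:
        raise ValueError("available_gpus must be positive")
    if simulated_gpus % available_gpus != 0:
        raise ValueError("simulated_gpus must be divisible by available_gpus")
    return simulated_gpus // available_gpus


def gpu_overrides(available_gpus, simulated_gpus=24):
    """Build the two overrides used to simulate a larger GPU count."""
    x = update_freq_for(available_gpus, simulated_gpus)
    return [
        f"distributed_training.distributed_world_size={available_gpus}",
        f"optimization.update_freq=[{x}]",
    ]


single_gpu = gpu_overrides(1)
eight_gpus = gpu_overrides(8)
```

With one GPU the gradient is accumulated over 24 steps per update; with eight GPUs, over 3.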

NOTE: the debugging steps below include errors that came up along the way; with the correct command above, some of them may never appear.

Bug 1: Data argument problem

Symptom:

Traceback (most recent call last):
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq_cli/hydra_train.py", line 38, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/distributed_utils.py", line 334, in call_main
    main(cfg, **kwargs)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq_cli/train.py", line 69, in main
    task.load_dataset(valid_sub_split, combine=False, epoch=1)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/tasks/audio_pretraining.py", line 137, in load_dataset
    self.datasets[split] = FileAudioDataset(
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/data/audio/raw_audio_dataset.py", line 158, in __init__
    with open(manifest_path, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/data/Cache/SLR12/dev_other.tsv'

The cause is that the dataset files had not been generated. Generate the manifest and label files:

# Generate /data/Cache/SLR12/dev_other.tsv
python -m fairseq.examples.wav2vec.wav2vec_manifest_parallel /data/Cache/SLR12/dev-other/LibriSpeech/dev-other --dest /data/Cache/SLR12/dev-other --ext flac --valid-percent 0
# Generate the .ltr labels
python -m fairseq.examples.wav2vec.libri_labels /data/Cache/SLR12/dev-other/dev_other.tsv --output-dir /data/Cache/SLR12/dev-other/ --output-name dev_other

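Another error then appears, shown below. It is fixed by copying dict.ltr.txt into the data directory. The file could also be generated with a script, but I took the shortcut.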

FileNotFoundError: [Errno 2] No such file or directory: '/data/Cache/SLR12/dev_other/dict.ltr.txt'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
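
Both errors in this section are missing files, so a pre-flight check before launching training can save a round trip. The sketch below assumes the data directory should contain `<split>.tsv`, `<split>.ltr`, and `dict.ltr.txt`, which is inferred from the error messages and commands above and is not a documented contract. The helper `missing_data_files` is hypothetical.

```python
import tempfile
from pathlib import Path


def missing_data_files(data_dir, split, label="ltr"):
    """Return the expected file names that are absent from data_dir."""
    data_dir = Path(data_dir)
    expected = [f"{split}.tsv", f"{split}.{label}", f"dict.{label}.txt"]
    return [name for name in expected if not (data_dir / name).is_file()]


# Demo: only the manifest exists, so the label file and dictionary are reported.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dev_other.tsv").write_text("/some/root\n")
    missing = missing_data_files(tmp, "dev_other")
```

An empty list means every expected file was found.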

Bug 2: Argument passing when simulating multiple GPUs

The official documentation says:

Note: you can simulate 24 GPUs by using k GPUs and adding command line parameters (before --config-dir)
distributed_training.distributed_world_size=k +optimization.update_freq='[x]' where x = 24/k

Following it produces the error below:

Traceback (most recent call last):
  ....
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/models/wav2vec/wav2vec2_asr.py", line 281, in __init__
    state = checkpoint_utils.load_checkpoint_to_cpu(cfg.w2v_path, arg_overrides)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/checkpoint_utils.py", line 237, in load_checkpoint_to_cpu
    state = _upgrade_state_dict(state)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/checkpoint_utils.py", line 468, in _upgrade_state_dict
    state["cfg"] = convert_namespace_to_omegaconf(state["args"])
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/dataclass/utils.py", line 335, in convert_namespace_to_omegaconf
    overrides, deletes = override_module_args(args)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/dataclass/utils.py", line 278, in override_module_args
    _override_attr(k, FairseqConfig.__dataclass_fields__[k].type, args)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/site-packages/fairseq/dataclass/utils.py", line 224, in _override_attr
    val = ast.literal_eval(val)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/ast.py", line 105, in literal_eval
    return _convert(node_or_string)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/ast.py", line 104, in _convert
    return _convert_signed_num(node)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/ast.py", line 78, in _convert_signed_num
    return _convert_num(node)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/ast.py", line 69, in _convert_num
    _raise_malformed_node(node)
  File "~/anaconda3/envs/lSrv09/lib/python3.9/ast.py", line 66, in _raise_malformed_node
    raise ValueError(f'malformed node or string: {node!r}')
ValueError: malformed node or string: <ast.Name object at 0x7fad3c167640>

The main cause is a bad interaction between the ast module and the way Fairseq handles typing.Optional annotations.
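
The ValueError in the traceback is what ast.literal_eval raises when it is handed a bare word instead of a Python literal. The sketch below reproduces that behaviour in isolation; the string `letter` is a hypothetical stand-in for whatever value the checkpoint actually stored.

```python
import ast


def try_literal_eval(text):
    """Evaluate text as a Python literal and report success or the error type."""
    try:
        return ("ok", ast.literal_eval(text))
    except ValueError as exc:
        return ("error", type(exc).__name__)


# A list literal and a quoted string parse; a bare identifier does not.
results = {text: try_literal_eval(text) for text in ["[24]", "'letter'", "letter"]}
```

A bare identifier parses to an ast.Name node, which matches the `<ast.Name object ...>` in the error message.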

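Tracing: 1. Tracing the override gives the message below, which shows that the argument is not being passed correctly; changing it to [24] fails as well.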

>>> Error merging override optimization.update_freq='[24]'
Invalid value assigned : str is not a subclass of ListConfig or list.
	full_key: optimization.update_freq
	reference_type=OptimizationConfig
	object_type=OptimizationConfig
  2. Tracing further, the problem lies in how Fairseq passes the argument:
>>> field_type
typing.Optional[fairseq.dataclass.constants.Choices]
# Traced to fairseq/dataclass/utils.py:
        if (
            isinstance(val, str)
            and not val.startswith("${")  # not interpolation
            and field_type != str
            and (not inspect.isclass(field_type) or not issubclass(field_type, Enum))  # not choices enum
        ):
            # upgrade old models that stored complex parameters as string
            val = ast.literal_eval(val)
            
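# The condition here goes wrong: not inspect.isclass(field_type) is True, so ast.literal_eval is called and fails
# The cause appears to be that inspect does not recognize typing.Optional[...] as a class
# Patch, in ~/Documents/workspace/fairseq/fairseq/dataclass/utils.py: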
def interpret_dc_type(field_type):
    if isinstance(field_type, str):
        raise RuntimeError("field should be a type")

    if field_type == Any:
        return str

    typestring = str(field_type)
    if re.match(r"(typing.|^)Union\[(.*), NoneType\]$", typestring):
        return field_type.__args__[0]
    # The change is here: also recognize typing.Optional
    #linger>>>>
    if re.match(r"(typing.|^)Optional\[*(.*)?\]$", typestring):
        return field_type.__args__[0]        
    #linger<<<<
    return field_type
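
The patch above matches on the string form of the type. As an alternative sketch, not the author's patch and not Fairseq code, the same unwrapping can be done with typing.get_origin and typing.get_args, which avoids depending on how a given Python version prints Optional. The helper `unwrap_optional` and the `Color` enum are hypothetical.

```python
import inspect
from enum import Enum
from typing import Optional, Union, get_args, get_origin


def unwrap_optional(field_type):
    """Return X for Optional[X]; return any other type unchanged."""
    if get_origin(field_type) is Union:
        args = [a for a in get_args(field_type) if a is not type(None)]
        if len(args) == 1:
            return args[0]
    return field_type


class Color(Enum):
    RED = "red"


# Optional[Color] is not a class, so an isclass/issubclass check skips it.
optional_is_class = inspect.isclass(Optional[Color])
unwrapped = unwrap_optional(Optional[Color])
```

After unwrapping, the issubclass(field_type, Enum) test from the traced code would see the Enum class itself.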

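Bug 3: Misunderstanding what "pretrained model" means

This one comes from misreading the official documentation; see the "Fine-tune a pre-trained model with CTC" section. That section does not say clearly what model.w2v_path should point to. I used a pretrained model downloaded from the web directly, which led to a circular-reference error.

The circular reference breaks loading. It cost me a long time in last year's tests and another 7 days of debugging this time. This really is the master keeping one trick to himself! Documentation from big companies has its traps too. I will not write up the detailed analysis of the loop; I am worn out.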

Bug 3.1: remove_pretraining_modules error

# torch.nn.modules.module.ModuleAttributeError: 'Wav2VecCtc' object has no attribute 'remove_pretraining_modules'
# In that case, just give it an empty method
# In class Wav2VecCtc, fairseq/fairseq/models/wav2vec/wav2vec2_asr.py:
    def remove_pretraining_modules(self):
        """Linger: no-op added to work around ModuleAttributeError:
        'Wav2VecCtc' object has no attribute 'remove_pretraining_modules'."""
        pass

Note: this change does not really solve anything. The underlying issue is the difference between a model that is being reloaded (already fine-tuned) and one that is being loaded for the first time.

Bug 3.2: normalize error

Using my own pretrained model in place of the downloaded one leads to the following configuration error:

omegaconf.errors.ConfigAttributeError: Key 'normalize' is not in struct
	full_key: task.normalize
	reference_type=Any
	object_type=dict

There is no clean way around this: either disable the check in the code or add the missing config entry.

Bug 4: NumPy version problem

  File "~/Documents/workspace/fairseq/fairseq/data/data_utils.py", line 303, in batch_by_size
    from fairseq.data.data_utils_fast import (
  File "fairseq/data/data_utils_fast.pyx", line 1, in init fairseq.data.data_utils_fast
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
# This is a NumPy version problem
pip install --upgrade numpy

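Bug 5: Problems from using a 3090 and a 2080 Ti together

The following error appeared at one point and went away once I set CUDA_VISIBLE_DEVICES=2. My analysis is that it comes from having a 3090 and a 2080 Ti in the same machine; the 3090 sits in slot 0.

# Error message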
File "~/anaconda3/envs/lSrv08/lib/python3.8/site-packages/fairseq/models/wav2vec/wav2vec2.py", line 460, in forward
    unmasked_features = features.clone()
RuntimeError: CUDA error: no kernel image is available for execution on the device
# The error went away with the following command
CUDA_VISIBLE_DEVICES=2 fairseq-hydra-train   task.data=/data/Cache/SLR12/dev-other/     common.tensorboard_logdir=~/Documents/Projects/Fairseq/tensorboard     model.w2v_path=~/Documents/Projects/Fairseq/outputs/2021-03-04/15-54-15/checkpoints/checkpoint_last.pt     distributed_training.distributed_world_size=1 optimization.update_freq=[24]     distributed_training.device_id=2     --config-dir ~/Documents/workspace/fairseq/examples/wav2vec/config/finetuning     --config-name vox_100h

With that, the CTC fine-tuning example from the guide is finally running.
