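GeForce RTX 3090: Pitfalls Running an Open-Source TensorFlow ASR Project

Background

After buying a new GPU, a GeForce RTX 3090, I could not wait to try it out.

Trying a project

I decided to try a Chinese ASR project: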

https://github.com/nl8590687/ASRT_SpeechRecognition

 

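I naively assumed this would be quick and easy. Following the project's introduction, I installed tensorflow 1.13, and then, following various Chinese-language sites and TensorFlow's own hints, I installed cuda10 and cudnn7.6. That is where the nightmare began. I had set off in the wrong direction, and frantically patching over the mistake only took me further off course: I kept finding DLL files missing and started scouring the web for them.

For example, files such as cudart64_100.dll were missing. I even found the resource below, downloaded the files, set the environment variables, and believed everything was fine. Resource link: https://download.mersenne.ca/CUDA-DLLs/CUDA-10.0

Configuring files and downloading cudnn and cuda took almost a whole day. The program then ran for an afternoon at a snail's pace. I opened Task Manager and saw GPU usage at 5%. That hurt.

In the end I used cuda_11.1 and cudnn-v8.0430 together with a newer version of tensorflow, and modified the keras source code so that it would run on the GPU. The specific pitfalls follow.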

Errors encountered

a. NumPy version error on Windows

 fails to pass a sanity check due to a bug in the windows runtime. See this issue for more informati

Fix: pip install numpy==1.19.3 -i https://pypi.tuna.tsinghua.edu.cn/simple

 

b. Missing DLL files

ImportError: Could not find 'cudart64_100.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Download and install CUDA 10.0 from this URL: CUDA Toolkit 9.0 Downloads

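Attempted fix:

I downloaded cuda 10 and tensorflow 1.14.1, and the problem persisted. Version mismatches are a killer. Because the 3090 is so new, I followed many people's suggestions and looked up the matching tensorflow versions, ran into the problem of AVX2 not being usable, and installed 2.3.0 and other versions. Still no luck.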
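Before downloading individual DLLs from third-party sites, it can help to check which directories on PATH actually contain the file that TensorFlow is asking for. The snippet below is a minimal sketch of such a check; `find_on_path` is a hypothetical helper written for this post, not part of TensorFlow or CUDA.

```python
import os


def find_on_path(filename, path=None):
    """Return every directory listed in PATH that contains `filename`.

    `path` defaults to the current PATH environment variable; passing it
    explicitly makes the function easy to try out on a made-up search path.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        directory = directory.strip('"')  # PATH entries are sometimes quoted on Windows
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # The DLL name is the one from the ImportError above.
    print(find_on_path("cudart64_100.dll"))
```

An empty list means no directory on PATH contains the file. More than one entry means several CUDA installations are visible at once, which is worth cleaning up before going further.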

 

c. Testing the GPU

import tensorflow as tf
tf.test.is_gpu_available()

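It still returned False.

Only after downloading the latest cuda_11.1 and cudnn-v8.0430 and adding the extracted cudnn files to the PATH environment variable did I finally see some light, although it still reported that the CPU was not supported.

The 3090 requires CUDA 11, so I went back and uninstalled CUDA 10.

Note that when uninstalling, you need to remove every CUDA component with version 10 in its name [Control Panel > Programs > Uninstall a program].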
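To see which CUDA and cuDNN versions the installed TensorFlow build expects, and whether it can see the GPU, something like the following can be run. It is a sketch that assumes TensorFlow 2.1 or later for `tf.config.list_physical_devices` and 2.3 or later for `tf.sysconfig.get_build_info`; `describe_gpu_support` is a hypothetical helper name.

```python
def describe_gpu_support():
    """Summarise what the installed TensorFlow build expects and can see."""
    try:
        import tensorflow as tf
    except ImportError:
        # Also raised on Windows when a required CUDA DLL cannot be loaded.
        return {"tensorflow": None}

    info = {"tensorflow": tf.__version__}
    # Non-deprecated replacement for tf.test.is_gpu_available().
    info["gpus"] = [device.name for device in tf.config.list_physical_devices("GPU")]
    # get_build_info() is missing on older builds, so look it up defensively.
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    if get_build_info is not None:
        build = get_build_info()
        info["cuda_version"] = build.get("cuda_version")
        info["cudnn_version"] = build.get("cudnn_version")
    return info


if __name__ == "__main__":
    print(describe_gpu_support())
```

If the reported CUDA version does not match the toolkit that is installed, the GPU list will typically come back empty.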

 

d. Installing tf-nightly-gpu

import tensorflow as tf
tf.test.is_gpu_available()

This time it returned True.

e. Running python train_mspeech.py crashed immediately

 File "D:\ASR_project\asr\SpeechModel251.py", line 44, in __init__
    self._model, self.base_model = self.CreateModel()
  File "D:\ASR_project\asr\SpeechModel251.py", line 73, in CreateModel
    layer_h1 = Conv2D(32, (3,3), use_bias=False, activation='relu', padding='same', kernel_initializer='he_normal')(input_data) # 卷积层
  File "D:\ASR_project\asr\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 75, in symbolic_fn_wrapper
    return func(*args, **kwargs)
  File "D:\ASR_project\asr\venv\lib\site-packages\keras\engine\base_layer.py", line 446, in __call__
    self.assert_input_compatibility(inputs)
  File "D:\ASR_project\asr\venv\lib\site-packages\keras\engine\base_layer.py", line 310, in assert_input_compatibility
    K.is_keras_tensor(x)
  File "D:\ASR_project\asr\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 695, in is_keras_tensor
    if not is_tensor(x):
  File "D:\ASR_project\asr\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 703, in is_tensor
    return isinstance(x, tf_ops._TensorLike) or tf_ops.is_dense_tensor_like(x)
AttributeError: module 'tensorflow.python.framework.ops' has no attribute '_TensorLike'

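The legendary incompatibility between tensorflow versions?

I tried updating the keras version and nearly lost it. The only option left was to patch the source code, which was torture, so I went looking for a solution in the tensorflow issues on GitHub.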

NMazzatenta commented on 27 Apr
I had the same issue. TF 2.1 built from source + keras 2.3.1 in conda environment. Solved by modifying file "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py" at line 704.
Before:
return isinstance(x, tf_ops._TensorLike) or tf_ops.is_dense_tensor_like(x)
After:
return isinstance(x, tf_ops._TENSOR_LIKE_TYPES) or tf_ops.is_dense_tensor_like(x)

Don't know if it is the right thing to do, but I got things running in this way.


It still failed. Opening ops.py showed that there was no _TensorLike attribute at all. I fixed it by modifying the source as follows:

[Image 1]
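The exact change from the screenshot is not reproduced in text here. As a rough illustration of the idea only, a patched `is_tensor` could look the attribute up defensively instead of hard-coding one name. In the sketch below, `tensor_like_types` is a hypothetical helper, and the fallback type passed in (for example `tf.Tensor`) is an assumption rather than the author's actual patch.

```python
def tensor_like_types(tf_ops, fallback):
    """Pick whichever tensor-like type registry this TensorFlow build exposes.

    `tf_ops` is the module `tensorflow.python.framework.ops`.
    `fallback` is returned when neither private name exists.
    """
    # Names taken from the traceback and from the GitHub comment quoted above.
    for name in ("_TensorLike", "_TENSOR_LIKE_TYPES"):
        types = getattr(tf_ops, name, None)
        if types is not None:
            return types
    return fallback


# Hypothetical use inside keras/backend/tensorflow_backend.py:
#     return isinstance(x, tensor_like_types(tf_ops, tf.Tensor))
```

The sketch only covers the attribute named in the traceback. Other private helpers used on the same line may need the same treatment, which is one more reason to move to the keras bundled with tensorflow rather than keep patching.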

 

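That was finally solved, but by now I was calm about it. I knew other code problems were bound to follow, and sure enough I was not disappointed.

f. Another error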

WARNING:tensorflow:From train_mspeech.py:23: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

Traceback (most recent call last):
  File "D:\ASR_project\asr\venv\lib\site-packages\keras\engine\base_layer.py", line 310, in assert_input_compatibility
    K.is_keras_tensor(x)
  File "D:\ASR_project\asr\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 697, in is_keras_tensor
    str(type(x)) + '`. '
ValueError: Unexpectedly found an instance of type ``. Expected a symbolic tensor instance.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_mspeech.py", line 46, in 
    ms = ModelSpeech(datapath)
  File "D:\ASR_project\asr\SpeechModel251.py", line 44, in __init__
    self._model, self.base_model = self.CreateModel()
  File "D:\ASR_project\asr\SpeechModel251.py", line 73, in CreateModel
    layer_h1 = Conv2D(32, (3,3), use_bias=False, activation='relu', padding='same', kernel_initializer='he_normal')(input_data) # 卷积层
  File "D:\ASR_project\asr\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 75, in symbolic_fn_wrapper
    return func(*args, **kwargs)
  File "D:\ASR_project\asr\venv\lib\site-packages\keras\engine\base_layer.py", line 446, in __call__
    self.assert_input_compatibility(inputs)
  File "D:\ASR_project\asr\venv\lib\site-packages\keras\engine\base_layer.py", line 316, in assert_input_compatibility
    str(inputs) + '. All inputs to the layer '
  File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\keras\engine\keras_tensor.py", line 332, in __repr__
    layer = self._keras_history.layer
AttributeError: 'tuple' object has no attribute 'layer'

 

The fix is to switch to the keras that ships with tensorflow. OK, time for a project-wide find and replace.

Reference: AttributeError: 'tuple' object has no attribute 'layer' (blog.csdn.net)

[Image 2]
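The find and replace can also be scripted. The following is a minimal sketch that rewrites the common import forms in a piece of source text; `rewrite_keras_imports` is a hypothetical helper, and it deliberately leaves unrelated packages such as keras_preprocessing alone. Anything beyond plain import lines still needs a manual check.

```python
import re

# "from keras..." at the start of a line, followed by "." or whitespace.
_FROM = re.compile(r"^([ \t]*)from[ \t]+keras(?=[.\s])", re.MULTILINE)
# "import keras.<submodule>"
_IMPORT_SUB = re.compile(r"^([ \t]*)import[ \t]+keras\.", re.MULTILINE)
# bare "import keras" (optionally followed by "as ...")
_IMPORT = re.compile(r"^([ \t]*)import[ \t]+keras\b(?!\.)", re.MULTILINE)


def rewrite_keras_imports(source):
    """Point standalone-keras imports at tensorflow.keras instead."""
    source = _FROM.sub(r"\1from tensorflow.keras", source)
    source = _IMPORT_SUB.sub(r"\1import tensorflow.keras.", source)
    source = _IMPORT.sub(r"\1from tensorflow import keras", source)
    return source


if __name__ == "__main__":
    print(rewrite_keras_imports("from keras.layers import Conv2D\n"))
```

Mixing objects from the standalone keras package with objects from tensorflow.keras in one model is what produces errors like the one above, so the replacement has to cover every file in the project.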

 

 

 

g. Running the project then produced an out-of-memory error. With a brand-new 3090 that seemed unlikely, so I adjusted the GPU parameters.

config.gpu_options.per_process_gpu_memory_fraction = 0.95
# config.gpu_options.allow_growth=True  # do not grab all GPU memory up front; allocate on demand?

I also lowered batch_size from 64.
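For context, the `config` object in the two lines above is a `tf.compat.v1.ConfigProto` that gets passed to the session. The sketch below shows one way the pieces could fit together; `apply_gpu_options` and `make_session` are hypothetical helper names, and TensorFlow is imported lazily so that the option logic can be tried without a GPU.

```python
def apply_gpu_options(config, allow_growth=True, memory_fraction=None):
    """Set the two gpu_options fields discussed above on a ConfigProto-like object."""
    if memory_fraction is not None:
        if not 0.0 < memory_fraction <= 1.0:
            raise ValueError("memory_fraction must be in (0, 1]")
        # Upper bound on the share of GPU memory this process may use.
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    # Allocate GPU memory on demand instead of reserving it all at start-up.
    config.gpu_options.allow_growth = allow_growth
    return config


def make_session():
    """Create a TF1-style session with these options and register it with Keras."""
    import tensorflow as tf  # imported here so the helper above works without TF

    config = apply_gpu_options(tf.compat.v1.ConfigProto(), allow_growth=True)
    session = tf.compat.v1.Session(config=config)
    tf.compat.v1.keras.backend.set_session(session)
    return session
```

Neither option makes a model fit that is genuinely too large for the card. If out-of-memory errors persist, reducing batch_size is still the main lever.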

 

h. Sweet: training finally starts, and it is fast

[Image 3]

 

Seeing the GPU actually being used made me very happy. Training speed finally shot up, which was thrilling compared with the CPU-bound start.

 

Then the first round finally finished, and then it crashed again.

alhost/replica:0/task:0/device:GPU:0 with 23347 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
2020-11-10 09:56:23.572753: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Traceback (most recent call last):
  File "train_mspeech.py", line 49, in 
    ms.TrainModel(datapath, epoch = 50, batch_size = 32, save_step = 500)
  File "D:\ASR_project\asr\SpeechModel251.py", line 187, in TrainModel
    self.TestModel(self.datapath, str_dataset='train', data_count = 4)
  File "D:\ASR_project\asr\SpeechModel251.py", line 250, in TestModel
    pre = self.Predict(data_input, data_input.shape[0] // 8)
  File "D:\ASR_project\asr\SpeechModel251.py", line 326, in Predict
    r1 = r[0][0].eval(session=tf.compat.v1.Session())
  File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 1258, in eval
    "eval is not supported when eager execution is enabled, "
NotImplementedError: eval is not supported when eager execution is enabled, is .numpy() what you're looking for?

In big letters: "what you're looking for?" Ouch!!!

 

More searching. Supposedly adding this line fixes it:

tf.compat.v1.disable_eager_execution()
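This call has to run before any graphs, ops, or tensors are created, so it belongs at the very top of the training script. The error message also points at the other route: leave eager execution on and call `.numpy()` instead of `.eval()`. The helper below is a sketch that handles both modes; `tensor_value` is a hypothetical name, and the tensorflow module is passed in as an argument purely so the logic can be exercised without TensorFlow installed.

```python
def tensor_value(tensor, tf):
    """Return the concrete value of `tensor` in either execution mode.

    `tf` is the imported tensorflow module.
    """
    if tf.executing_eagerly():
        # Eager tensors already hold their value.
        return tensor.numpy()
    # Graph mode: evaluate inside a session, as the project's Predict() does.
    with tf.compat.v1.Session() as session:
        return tensor.eval(session=session)
```

In the project's Predict method this would stand in for the `r[0][0].eval(session=tf.compat.v1.Session())` line from the traceback.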

It ran. I was finally about to witness the miracle, and then: out of memory, and the machine froze. Froze!!

 

...I very nearly lost my composure...

 

More pitfalls: I lowered batch_size to 16.

I also added config.gpu_options.allow_growth=True

 

[Image 4]

After that everything looked normal. To be continued...
