After buying a new GPU, a GeForce RTX 3090, I couldn't wait to try it out.
I decided to try a Chinese ASR project:
https://github.com/nl8590687/ASRT_SpeechRecognition
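I naively assumed it would be quick and painless. Following the project's README, I installed tensorflow 1.13. Then, following some Chinese websites and tensorflow's own hints, I installed cuda10 and cudnn7.6. That is where the nightmare began. I had started in the wrong direction, and frantically patching each error only took me further off course. Various DLL files turned out to be missing, so I started hunting for them online.
For example, files like cudart64_100.dll were missing. I even found the resource below, downloaded the files, set the environment variables, and thought everything was fine. Link: https://download.mersenne.ca/CUDA-DLLs/CUDA-10.0
Configuring things and downloading cudnn and cuda took almost a whole day. Then the program ran for an entire afternoon at a snail's pace. I opened Task Manager: GPU usage was 5% ... I took ten thousand points of damage.
In the end I used cuda_11.1 and cudnn-v8.0430, together with a newer tensorflow build, and patched the keras source so it would run on the GPU. The pitfalls, in order: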
a. numpy version error on Windows
fails to pass a sanity check due to a bug in the windows runtime. See this issue for more informati
Fix: pip install numpy==1.19.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
b. Missing DLL files
ImportError: Could not find 'cudart64_100.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Download and install CUDA 10.0 from this URL: CUDA Toolkit 9.0 Downloads
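Attempted fix:
I downloaded cuda 10 and tensorflow 1.14.1, and the problem was still there. Wrong versions are a killer. Because the 3090 is so new, I followed various people's advice, looked up which tensorflow versions match which CUDA versions, ran into the AVX2-not-supported issue, and installed 2.3.0 and other versions. Still no luck.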
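Since the error is about a DLL not being found in any %PATH% directory, one quick check is to scan PATH yourself and see what is actually there. A minimal sketch: `find_on_path` is a hypothetical helper name, and cudart64_100.dll is simply the file name from the error above (other CUDA versions ship differently named files).

```python
import os


def find_on_path(filenames, path=None):
    """Map each file name to the first PATH directory that contains it, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # Skip empty entries, which show up when PATH has a trailing separator.
    directories = [d for d in path.split(os.pathsep) if d]
    found = {}
    for name in filenames:
        found[name] = next(
            (d for d in directories if os.path.isfile(os.path.join(d, name))),
            None,
        )
    return found
```

Calling `find_on_path(["cudart64_100.dll"])` returns a dict whose value is the directory that holds the file, or None if no PATH entry contains it. A None here means copying DLLs around will not help until the directory is added to PATH or the matching CUDA version is installed.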
c. Testing the GPU
import tensorflow as tf
tf.test.is_gpu_available()
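Still returns False.
Only after downloading the latest cuda_11.1 and cudnn-v8.0430 and adding the extracted cudnn files to the PATH environment variable did I finally see some light, although it still reported that the CPU was not supported.
The 3090 needs cuda 11, so I went back and removed cuda10.
Note that when uninstalling you need to remove every CUDA package with version 10 in its name [Control Panel > Programs > Uninstall a program].
d. Install tf-nightly-gpu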
import tensorflow as tf
tf.test.is_gpu_available()
This time it returns True.
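As a side note, tf.test.is_gpu_available() is marked deprecated in TensorFlow 2.x in favor of tf.config.list_physical_devices('GPU'). A small sketch of the newer check, assuming TensorFlow 2.1 or later; `visible_gpus` is a made-up helper name.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see (empty list if none)."""
    # Imported inside the function so that a missing or broken TensorFlow
    # install fails here, at the point of the check, with a clear traceback.
    import tensorflow as tf

    return [device.name for device in tf.config.list_physical_devices("GPU")]
```

An empty list corresponds to the False result from the older call. Unlike the boolean, the list also tells you how many devices were detected.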
e. Running python train_mspeech.py crashes immediately
File "D:\ASR_project\asr\SpeechModel251.py", line 44, in __init__
self._model, self.base_model = self.CreateModel()
File "D:\ASR_project\asr\SpeechModel251.py", line 73, in CreateModel
layer_h1 = Conv2D(32, (3,3), use_bias=False, activation='relu', padding='same', kernel_initializer='he_normal')(input_data) # 卷积层
File "D:\ASR_project\asr\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 75, in symbolic_fn_wrapper
return func(*args, **kwargs)
File "D:\ASR_project\asr\venv\lib\site-packages\keras\engine\base_layer.py", line 446, in __call__
self.assert_input_compatibility(inputs)
File "D:\ASR_project\asr\venv\lib\site-packages\keras\engine\base_layer.py", line 310, in assert_input_compatibility
K.is_keras_tensor(x)
File "D:\ASR_project\asr\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 695, in is_keras_tensor
if not is_tensor(x):
File "D:\ASR_project\asr\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 703, in is_tensor
return isinstance(x, tf_ops._TensorLike) or tf_ops.is_dense_tensor_like(x)
AttributeError: module 'tensorflow.python.framework.ops' has no attribute '_TensorLike'
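The legendary incompatibility between tensorflow versions?
I tried upgrading keras and nearly lost it. The only option left was to patch the source, which was torture, so I went looking through the tensorflow GitHub issues for a solution.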
NMazzatenta commented on 27 Apr:
I had the same issue. TF 2.1 built from source + keras 2.3.1 in conda environment. Solved by modifying file "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py" at line 704.
Before:
return isinstance(x, tf_ops._TensorLike) or tf_ops.is_dense_tensor_like(x)
After:
return isinstance(x, tf_ops._TENSOR_LIKE_TYPES) or tf_ops.is_dense_tensor_like(x)
Don't know if it is the right thing to do, but I got things running in this way.
It still failed. Opening ops.py showed that it has no _TensorLike attribute. I patched the source to get past it.
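One way to write this kind of patch without hard-coding a single private attribute is to probe for whichever name the installed tensorflow exposes. This is only a sketch of the idea, not the exact change made here: `tensor_like_types` and this `is_tensor` are hypothetical helpers, the two attribute names are the ones quoted in the issue comment above, and private names like these can disappear between versions.

```python
from types import SimpleNamespace  # only used to build stand-ins when experimenting


def tensor_like_types(tf_ops):
    """Return a tuple of tensor-like base classes found on tf_ops, possibly empty."""
    for name in ("_TensorLike", "_TENSOR_LIKE_TYPES"):
        candidate = getattr(tf_ops, name, None)
        if candidate is None:
            continue
        # Older releases expose a single class, later ones a tuple of classes.
        return candidate if isinstance(candidate, tuple) else (candidate,)
    return ()


def is_tensor(x, tf_ops):
    """Version-tolerant variant of the check that failed in the traceback."""
    types = tensor_like_types(tf_ops)
    if types and isinstance(x, types):
        return True
    # Fall back to the second half of the original expression, if it exists.
    dense_check = getattr(tf_ops, "is_dense_tensor_like", None)
    return bool(dense_check and dense_check(x))
```

In the keras backend file from the traceback, `tf_ops` would be the already imported tensorflow.python.framework.ops module. If neither private name exists, the helper returns False instead of raising AttributeError. Depending on the caller, that may just move the failure somewhere else.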
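That was finally solved, but by then I was calm about it: I knew other code problems were waiting, and sure enough I wasn't disappointed.
f. The next error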
WARNING:tensorflow:From train_mspeech.py:23: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.
Traceback (most recent call last):
File "D:\ASR_project\asr\venv\lib\site-packages\keras\engine\base_layer.py", line 310, in assert_input_compatibility
K.is_keras_tensor(x)
File "D:\ASR_project\asr\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 697, in is_keras_tensor
str(type(x)) + '`. '
ValueError: Unexpectedly found an instance of type ``. Expected a symbolic tensor instance.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train_mspeech.py", line 46, in <module>
ms = ModelSpeech(datapath)
File "D:\ASR_project\asr\SpeechModel251.py", line 44, in __init__
self._model, self.base_model = self.CreateModel()
File "D:\ASR_project\asr\SpeechModel251.py", line 73, in CreateModel
layer_h1 = Conv2D(32, (3,3), use_bias=False, activation='relu', padding='same', kernel_initializer='he_normal')(input_data) # 卷积层
File "D:\ASR_project\asr\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 75, in symbolic_fn_wrapper
return func(*args, **kwargs)
File "D:\ASR_project\asr\venv\lib\site-packages\keras\engine\base_layer.py", line 446, in __call__
self.assert_input_compatibility(inputs)
File "D:\ASR_project\asr\venv\lib\site-packages\keras\engine\base_layer.py", line 316, in assert_input_compatibility
str(inputs) + '. All inputs to the layer '
File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\keras\engine\keras_tensor.py", line 332, in __repr__
layer = self._keras_history.layer
AttributeError: 'tuple' object has no attribute 'layer'
The fix is to switch to the keras that ships with tensorflow. OK, time for a project-wide find and replace.
Reference: AttributeError: 'tuple' object has no attribute 'layer' (blog.csdn.net)
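The project-wide replace boils down to turning standalone keras imports into tensorflow.keras imports. A rough sketch of that rewrite as a function over source text; `rewrite_keras_imports` is a hypothetical name, and it only handles the two most common import forms, so review the result by hand.

```python
import re

# "from keras.layers import X" / "from keras import backend as K"
_FROM_KERAS = re.compile(r"^([ \t]*)from keras(\.|[ \t])", re.MULTILINE)
# A bare "import keras" on its own line
_IMPORT_KERAS = re.compile(r"^([ \t]*)import keras[ \t]*$", re.MULTILINE)


def rewrite_keras_imports(source):
    """Rewrite standalone-keras imports to use the keras bundled with tensorflow."""
    source = _FROM_KERAS.sub(r"\1from tensorflow.keras\2", source)
    # "from tensorflow import keras" keeps later "keras.xyz" references working.
    return _IMPORT_KERAS.sub(r"\1from tensorflow import keras", source)
```

Applied to each .py file in the project (read the text, rewrite, write it back), this covers the bulk of the edits. Packages that merely start with the same letters, such as keras_preprocessing, are left alone because the pattern requires a dot or whitespace right after `keras`.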
g. Running the project produced, of all things, out of memory. On a brand-new 3090 that seemed unlikely, so I adjusted the GPU settings.
config.gpu_options.per_process_gpu_memory_fraction = 0.95
# config.gpu_options.allow_growth=True  # don't grab all GPU memory up front, allocate on demand?
I also reduced batch_size from 64.
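For context, those config lines belong to a TF1-style session setup. Here is a sketch of the whole thing through the tf.compat.v1 API, which is what the deprecation warning in section f points to. `build_gpu_config` and `install_session` are hypothetical names, and whether tf.compat.v1.keras is available depends on the installed TensorFlow build.

```python
def build_gpu_config(memory_fraction=None, allow_growth=False):
    """Build a TF1-style ConfigProto with the GPU memory options set."""
    import tensorflow as tf  # local import: fail at call time if TF is missing

    config = tf.compat.v1.ConfigProto()
    if memory_fraction is not None:
        # Upper bound on the share of GPU memory this process may claim.
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    # When True, memory is allocated on demand instead of reserved up front.
    config.gpu_options.allow_growth = allow_growth
    return config


def install_session(config):
    """Create a session from the config and register it with the bundled keras."""
    import tensorflow as tf

    session = tf.compat.v1.Session(config=config)
    tf.compat.v1.keras.backend.set_session(session)
    return session
```

With these, the setting above would read `install_session(build_gpu_config(memory_fraction=0.95))`, and the commented-out alternative becomes `build_gpu_config(allow_growth=True)`.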
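h. Sweet: training starts and it's fast
Seeing the GPU actually being used made me very happy. Training speed shot up, a thrill compared with the CPU-only crawl at the start.
Then the first round finally finished, and it fell over again.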
alhost/replica:0/task:0/device:GPU:0 with 23347 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
2020-11-10 09:56:23.572753: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Traceback (most recent call last):
File "train_mspeech.py", line 49, in <module>
ms.TrainModel(datapath, epoch = 50, batch_size = 32, save_step = 500)
File "D:\ASR_project\asr\SpeechModel251.py", line 187, in TrainModel
self.TestModel(self.datapath, str_dataset='train', data_count = 4)
File "D:\ASR_project\asr\SpeechModel251.py", line 250, in TestModel
pre = self.Predict(data_input, data_input.shape[0] // 8)
File "D:\ASR_project\asr\SpeechModel251.py", line 326, in Predict
r1 = r[0][0].eval(session=tf.compat.v1.Session())
File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 1258, in eval
"eval is not supported when eager execution is enabled, "
NotImplementedError: eval is not supported when eager execution is enabled, is .numpy() what you're looking for?
Those big words, "what you're looking for?" Ouch!!!
More searching. Supposedly adding this line fixes it:
tf.compat.v1.disable_eager_execution()
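The error message itself hints at the other route: in eager mode, read the value with .numpy() instead of .eval(). A sketch that copes with both modes; `tensor_to_numpy` is a hypothetical helper, not part of the project, and the graph-mode branch simply mirrors the call shown in the traceback.

```python
def tensor_to_numpy(tensor):
    """Return the tensor's value as a numpy array in eager or graph mode."""
    import tensorflow as tf  # local import: fail at call time if TF is missing

    if tf.executing_eagerly():
        # Eager tensors already hold their value.
        return tensor.numpy()
    # Graph mode: evaluate inside a session, as the original Predict code did.
    with tf.compat.v1.Session() as session:
        return tensor.eval(session=session)
```

In Predict this would replace the `r[0][0].eval(...)` call. If you go with tf.compat.v1.disable_eager_execution() instead, it has to run at program start, before any graphs, ops, or tensors are created.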
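It ran, and I was about to witness a miracle, and then: out of memory. The machine froze. Froze!!
... I nearly lost my mind ...
On to the next pitfall: I lowered batch_size a bit, to 16.
And added config.gpu_options.allow_growth=True
Now everything looks normal. To be continued ...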