+ 关键字:`数据类型`,`dtype`
+ 问题描述:使用PTB数据集训练词向量模型,设置输入层的`dtype`参数值为`float32`,在启动训练的时候出现张量类型错误。
+ 报错信息:
```
37 num_epochs=1,
38 event_handler=event_handler,
---> 39 feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in train(self, num_epochs, event_handler, reader, feed_order)
403 else:
404 self._train_by_executor(num_epochs, event_handler, reader,
--> 405 feed_order)
406
407 def test(self, reader, feed_order):
/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in _train_by_executor(self, num_epochs, event_handler, reader, feed_order)
481 exe = executor.Executor(self.place)
482 reader = feeder.decorate_reader(reader, multi_devices=False)
--> 483 self._train_by_any_executor(event_handler, exe, num_epochs, reader)
484
485 def _train_by_any_executor(self, event_handler, exe, num_epochs, reader):
/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in _train_by_any_executor(self, event_handler, exe, num_epochs, reader)
510 fetch_list=[
511 var.name
--> 512 for var in self.train_func_outputs
513 ])
514 else:
/usr/local/lib/python3.5/dist-packages/paddle/fluid/executor.py in run(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
468
469 self._feed_data(program, feed, feed_var_name, scope)
--> 470 self.executor.run(program.desc, scope, 0, True, True)
471 outs = self._fetch_data(fetch_list, fetch_var_name, scope)
472 if return_numpy:
EnforceNotMet: Tensor holds the wrong type, it holds f at [/paddle/paddle/fluid/framework/tensor_impl.h:29]
PaddlePaddle Call Stacks:
```
+ 问题复现:在使用`fluid.layers.data`接口定义网络的输出层,设置每个输入层的`name`为单独的名称,`shape`的值为`[1]`且设置`dtype`的值为`float32`,启动训练的时候就会出现该错误。错误代码如下:
```python
first_word = fluid.layers.data(name='firstw', shape=[1], dtype='float32')
second_word = fluid.layers.data(name='secondw', shape=[1], dtype='float32')
third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='float32')
fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='float32')
```
+ 解决问题:因为PTB数据集下训练的时候,已经把单词转换成整数,所以输入的数据应该是整数而不是浮点数字,出现的错误也是因为这个原因。正确代码如下:
```python
first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
```
+ 问题拓展:PaddlePaddle的输入层数据类型有`float`、`int`、`uint`、`bool`,但是就没有字符串类型,所以训练数据都会转换成相应数据类型,所以在PTB数据集数据中也是把字符串的单词转换成整型。
+ 问题分析:编写神经网络时,获得编写任何程序时,细节都是重要的,细节不正确就会导致程序运行不起来,而深度学习的编程中,类型不正确是常出现的错误,要避免这类错误,你需要熟悉你使用的训练数据的数据类型,如果不熟悉,此时最好的方法就是在使用时打印一下数据的类型与shape,方便编写出正确的fluid.layers.data
## `已审阅` 2.问题:设置向量表征类型为整型时训练报错
+ 关键字:`数据类型`,`词向量`
+ 问题描述:定义N-gram神经网络训练PTB数据集时,使用PaddlePaddle内置的`fluid.layers.embedding`接口计算词向量,当设置该数据类型为`int64`时报错。
+ 报错信息:
```
31 # optimizer=fluid.optimizer.SGD(learning_rate=0.001),
32 optimizer_func=optimizer_func,
---> 33 place=place)
34
35 trainer.train(
/usr/local/lib/python3.5/dist-packages/paddle/fluid/contrib/trainer.py in __init__(self, train_func, optimizer_func, param_path, place, parallel, checkpoint_config)
280 with self._prog_and_scope_guard():
281 exe = executor.Executor(place)
--> 282 exe.run(self.startup_program)
283
284 if self.checkpoint_cfg and self.checkpoint_cfg.load_serial is not None:
/usr/local/lib/python3.5/dist-packages/paddle/fluid/executor.py in run(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
468
469 self._feed_data(program, feed, feed_var_name, scope)
--> 470 self.executor.run(program.desc, scope, 0, True, True)
471 outs = self._fetch_data(fetch_list, fetch_var_name, scope)
472 if return_numpy:
EnforceNotMet: op uniform_random does not have kernel for data_type[int64_t]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN] at [/paddle/paddle/fluid/framework/operator.cc:733]
PaddlePaddle Call Stacks:
```
+ 问题复现:使用`fluid.layers.embedding`接口定义词向量时,设置参数`dtype`的值为`int64`,`size`为`[数据的单词数量, 词向量维度]`,在训练的时候就会报这个错误。错误代码如下:
```python
embed_first = fluid.layers.embedding(
input=first_word,
size=[dict_size, EMBED_SIZE],
dtype='int64',
is_sparse=is_sparse,
param_attr='shared_w')
embed_second = fluid.layers.embedding(
input=second_word,
size=[dict_size, EMBED_SIZE],
dtype='int64',
is_sparse=is_sparse,
param_attr='shared_w')
embed_third = fluid.layers.embedding(
input=third_word,
size=[dict_size, EMBED_SIZE],
dtype='int64',
is_sparse=is_sparse,
param_attr='shared_w')
embed_fourth = fluid.layers.embedding(
input=fourth_word,
size=[dict_size, EMBED_SIZE],
dtype='int64',
is_sparse=is_sparse,
param_attr='shared_w')
```
+ 解决问题:输入层的数据类型虽然是`int64`,但是词向量的数据类型是`float32`。用户可能是理解误以为词向量的数据类型也许是`int64`,所以才会导致错误。正确代码如下:
```python
embed_first = fluid.layers.embedding(
input=first_word,
size=[dict_size, EMBED_SIZE],
dtype='float32',
is_sparse=is_sparse,
param_attr='shared_w')
embed_second = fluid.layers.embedding(
input=second_word,
size=[dict_size, EMBED_SIZE],
dtype='float32',
is_sparse=is_sparse,
param_attr='shared_w')
embed_third = fluid.layers.embedding(
input=third_word,
size=[dict_size, EMBED_SIZE],
dtype='float32',
is_sparse=is_sparse,
param_attr='shared_w')
embed_fourth = fluid.layers.embedding(
input=fourth_word,
size=[dict_size, EMBED_SIZE],
dtype='float32',
is_sparse=is_sparse,
param_attr='shared_w')
```
+ 问题拓展:词向量模型可将一个 one-hot vector映射到一个维度更低的实数向量(embedding vector),如`embedding(母亲节)=[0.3,4.2,−1.5,...]`,`embedding(康乃馨)=[0.2,5.6,−2.3,...]`。在这个映射到的实数向量表示中,希望两个语义(或用法)上相似的词对应的词向量“更像”。
+ 问题分析:NLP中,词向量技术是比较底层的计算,是很多上层技术的支撑,如RNN、LSTM等,输入都是经过词向量嵌入后的向量,将词编码成相应要保持其语义信息是更好的,即将词编程稠密向量,one-hot独热向量虽然简单,但会编码维度灾难与语义鸿沟的问题。