Keywords: PTB dataset, data dimensions
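Problem description: The training data is created with paddle.dataset.imikolov.train, the PTB dataset interface that PaddlePaddle provides for word-vector training. Training on this data then fails with an error reporting that the number of fields in the data is incorrect.
Error message: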
in train(use_cuda, train_program, params_dirname)
37 num_epochs=1,
38 event_handler=event_handler,
---> 39 feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
/opt/conda/envs/py35-paddle1.0.0/lib/python3.5/site-packages/paddle/fluid/contrib/trainer.py in train(self, num_epochs, event_handler, reader, feed_order)
403 else:
404 self._train_by_executor(num_epochs, event_handler, reader,
--> 405 feed_order)
406
407 def test(self, reader, feed_order):
/opt/conda/envs/py35-paddle1.0.0/lib/python3.5/site-packages/paddle/fluid/contrib/trainer.py in _train_by_executor(self, num_epochs, event_handler, reader, feed_order)
481 exe = executor.Executor(self.place)
482 reader = feeder.decorate_reader(reader, multi_devices=False)
--> 483 self._train_by_any_executor(event_handler, exe, num_epochs, reader)
484
485 def _train_by_any_executor(self, event_handler, exe, num_epochs, reader):
/opt/conda/envs/py35-paddle1.0.0/lib/python3.5/site-packages/paddle/fluid/contrib/trainer.py in _train_by_any_executor(self, event_handler, exe, num_epochs, reader)
494 for epoch_id in epochs:
495 event_handler(BeginEpochEvent(epoch_id))
--> 496 for step_id, data in enumerate(reader()):
497 if self.__stop:
498 if self.checkpoint_cfg:
/opt/conda/envs/py35-paddle1.0.0/lib/python3.5/site-packages/paddle/fluid/data_feeder.py in __reader_creator__()
275 if not multi_devices:
276 for item in reader():
--> 277 yield self.feed(item)
278 else:
279 num = self._get_number_of_places_(num_places)
/opt/conda/envs/py35-paddle1.0.0/lib/python3.5/site-packages/paddle/fluid/data_feeder.py in feed(self, iterable)
189 assert len(each_sample) == len(converter), (
190 "The number of fields in data (%s) does not match " +
--> 191 "len(feed_list) (%s)") % (len(each_sample), len(converter))
192 for each_converter, each_slot in six.moves.zip(converter,
193 each_sample):
AssertionError: The number of fields in data (7) does not match len(feed_list) (5)
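The last frame of the traceback shows a plain field-count comparison. The sketch below mimics that check in plain Python to make the failure condition concrete. The function name check_fields and the word ids are made up for illustration, and the code is a simplified stand-in rather than PaddlePaddle's actual implementation.

```python
def check_fields(sample, feed_order):
    """Simplified stand-in for the field-count assertion seen in the traceback."""
    assert len(sample) == len(feed_order), (
        "The number of fields in data (%s) does not match "
        "len(feed_list) (%s)") % (len(sample), len(feed_order))


feed_order = ['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw']
five_gram = (1, 2, 3, 4, 5)          # made-up word ids, 5 fields
seven_gram = (1, 2, 3, 4, 5, 6, 7)   # made-up word ids, 7 fields

check_fields(five_gram, feed_order)  # 5 fields vs 5 names: passes silently

try:
    check_fields(seven_gram, feed_order)  # 7 fields vs 5 names: fails
except AssertionError as err:
    message = str(err)
    print(message)
```

Each sample must carry exactly one field per name in feed_order, so a 7-field sample cannot be fed into 5 inputs.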
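Reproducing the problem: Build a dictionary with paddle.dataset.imikolov.build_dict, then pass it to paddle.dataset.imikolov.train to create the training data, with the parameter n set to 7. Starting the training then raises the error above. The failing code is:

import paddle

word_dict = paddle.dataset.imikolov.build_dict()
# n=7: every sample carries 7 word ids, but feed_order below names only 5 inputs
train_reader = paddle.batch(paddle.dataset.imikolov.train(word_dict, 7), 64)
# trainer and event_handler are assumed to be defined earlier in the script
trainer.train(
    reader=train_reader,
    num_epochs=1,
    event_handler=event_handler,
    feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])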
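Solution: feed_order lists only 5 inputs, one of which is the label (nextw), whereas the training data was created with a length of 7, so the number of fields in each sample does not match the number of inputs. The n parameter of paddle.dataset.imikolov.train should be set to 5. The corrected code is:

import paddle

word_dict = paddle.dataset.imikolov.build_dict()
# n=5: 4 context words plus 1 label word, matching the 5 names in feed_order
train_reader = paddle.batch(paddle.dataset.imikolov.train(word_dict, 5), 64)
# trainer and event_handler are assumed to be defined earlier in the script
trainer.train(
    reader=train_reader,
    num_epochs=1,
    event_handler=event_handler,
    feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])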
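A mismatch of this kind can also be caught before training starts. The sketch below is a hypothetical helper, not part of PaddlePaddle. It assumes the reader is an unbatched reader creator, meaning a callable that returns an iterable of samples, and it compares the width of the first sample with feed_order. The fake_ngram_reader function is a made-up stand-in so that the example runs without the dataset.

```python
def assert_reader_matches_feed_order(reader_creator, feed_order):
    """Pull one sample from the reader and compare its width to feed_order."""
    first_sample = next(iter(reader_creator()))
    if len(first_sample) != len(feed_order):
        raise ValueError(
            "reader yields %d fields per sample but feed_order names %d"
            % (len(first_sample), len(feed_order)))
    return len(first_sample)


def fake_ngram_reader(n):
    """Hypothetical stand-in for an n-gram reader creator; ids are made up."""
    def reader():
        yield tuple(range(n))
    return reader


feed_order = ['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw']

# A 5-field reader agrees with the 5 names in feed_order.
width = assert_reader_matches_feed_order(fake_ngram_reader(5), feed_order)

# A 7-field reader is rejected with an explicit error.
try:
    assert_reader_matches_feed_order(fake_ngram_reader(7), feed_order)
except ValueError as err:
    mismatch = str(err)
```

In a real script the same helper would presumably be called on the reader before it is wrapped by paddle.batch, because a batched reader yields lists of samples rather than single samples.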
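Further notes: The paddle.dataset.imikolov.train interface lets you choose how many words each sample contains. If you change this number, you must also change the number of word-vector inputs in the network and the feed_order argument passed to the training interface, so that all three agree.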
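One way to keep these values in agreement is to derive them all from a single constant. The sketch below is illustrative only: ngram_windows is a simplified sliding-window generator, not code taken from PaddlePaddle, and the input names produced by make_feed_order are hypothetical. The network would need data layers with the same names.

```python
def ngram_windows(word_ids, n):
    """Yield every run of n consecutive ids as one n-field sample."""
    for end in range(n, len(word_ids) + 1):
        yield tuple(word_ids[end - n:end])


def make_feed_order(n):
    """Hypothetical naming: n - 1 context inputs followed by the label."""
    return ['word_%d' % i for i in range(n - 1)] + ['nextw']


N = 7  # single source of truth for the sample width
sentence_ids = [10, 11, 12, 13, 14, 15, 16, 17]  # made-up word ids

samples = list(ngram_windows(sentence_ids, N))
feed_order = make_feed_order(N)

# Every sample has exactly as many fields as feed_order has names.
widths_match = all(len(sample) == len(feed_order) for sample in samples)
```

With N defined once, changing the window size updates both the sample width and the list of input names together, leaving only the network definition to adjust.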