Why does a trained sentiment analysis model predict the same result for every sentence?

  • Keywords: data dictionary, character encoding

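  • Problem description: A model was trained on the IMDB dataset with a recurrent neural network. When this model is used to predict sentences, the predicted result is essentially the same whether the sentence is positive or negative.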

  • Error output:

[[5146, 5146, 5146, 5146, 5146, 5146], [5146, 5146, 5146, 5146, 5146], [5146, 5146, 5146, 5146]]
Predict probability of  0.54538333  to be positive and  0.45461673  to be negative for review ' read the book forget the movie '
Predict probability of  0.54523355  to be positive and  0.45476642  to be negative for review ' this is a great movie '
Predict probability of  0.54504114  to be positive and  0.45495886  to be negative for review ' this is very bad '
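  • Reproducing the problem: At prediction time, a predictor is created with the Inferencer interface. Each sentence is split into a list of words, word_dict.get(words, UNK) converts each word to its id using the dataset dictionary, and these ids are fed to the predictor. Every prediction comes out wrong. The faulty code is as follows: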
inferencer = Inferencer(
    infer_func=partial(inference_program, word_dict),
    param_path=params_dirname,
    place=place)
reviews_str = ['read the book forget the movie', 'this is a great movie', 'this is very bad']
reviews = [c.split() for c in reviews_str]
UNK = word_dict['<unk>']
lod = []
for c in reviews:
    lod.append([word_dict.get(words, UNK) for words in c])
print(lod)
base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
results = inferencer.infer({'words': tensor_words})
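The printed id list, where every entry is 5146, suggests that no word is being found in the dictionary. A minimal sketch of how this could be checked is shown below. It uses a small hypothetical stand-in, toy_word_dict, instead of the real word_dict, and a hypothetical helper unk_ratio. The bytes keys are an assumption about how the real dictionary is stored under Python 3.

```python
# Hypothetical stand-in for word_dict. Bytes keys for the words are an
# assumption about how the real dictionary is stored under Python 3.
toy_word_dict = {b'this': 0, b'is': 1, b'great': 2, b'movie': 3, '<unk>': 4}
UNK = toy_word_dict['<unk>']


def unk_ratio(tokens, word_dict, unk_id):
    """Return the fraction of tokens that fall back to the unknown-word id."""
    ids = [word_dict.get(t, unk_id) for t in tokens]
    return sum(1 for i in ids if i == unk_id) / len(ids)


tokens = 'this is a great movie'.split()

# Inspect the type of the first dictionary key (str or bytes).
key_type = type(next(iter(toy_word_dict)))
print(key_type)

# Compare lookups with str tokens and with UTF-8 encoded tokens.
str_ratio = unk_ratio(tokens, toy_word_dict, UNK)
bytes_ratio = unk_ratio([t.encode('utf-8') for t in tokens], toy_word_dict, UNK)
print(str_ratio, bytes_ratio)
```

If the ratio for str tokens is 1.0 while the ratio for encoded tokens is much lower, the lookup keys and the dictionary keys have different types.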
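  • Solution: The cause is a character encoding mismatch. When word_dict.get(words, UNK) looks up each word, no key matches (presumably because the dictionary keys are UTF-8 encoded bytes while the words split from the sentence are str), so the program treats every word as <unk> and each sentence becomes a list of the id for <unk>, which is 5146 in the output above. Each word must be encoded as UTF-8 before the lookup, for example word_dict.get(words.encode('utf-8'), UNK). The corrected code is as follows: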
inferencer = Inferencer(
    infer_func=partial(inference_program, word_dict),
    param_path=params_dirname,
    place=place)
reviews_str = ['read the book forget the movie', 'this is a great movie', 'this is very bad']
reviews = [c.split() for c in reviews_str]
UNK = word_dict['<unk>']
lod = []
for c in reviews:
    lod.append([word_dict.get(words.encode('utf-8'), UNK) for words in c])
print(lod)
base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
results = inferencer.infer({'words': tensor_words})
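
If the same prediction script has to run against dictionaries keyed by either str or bytes, the lookup could be wrapped in a small helper. The sketch below is only an illustration: to_ids is a hypothetical function, not part of PaddlePaddle, and the two toy dictionaries stand in for the real word_dict.

```python
def to_ids(sentence, word_dict, unk_id):
    """Map a sentence to word ids for dictionaries keyed by str or by bytes."""
    ids = []
    for token in sentence.split():
        if token in word_dict:
            # The dictionary is keyed by str.
            ids.append(word_dict[token])
        else:
            # Fall back to the UTF-8 bytes form, then to the unknown-word id.
            ids.append(word_dict.get(token.encode('utf-8'), unk_id))
    return ids


# Toy dictionaries (hypothetical), one keyed by bytes and one keyed by str.
bytes_dict = {b'great': 0, b'movie': 1, '<unk>': 2}
str_dict = {'great': 0, 'movie': 1, '<unk>': 2}

ids_from_bytes = to_ids('great movie tonight', bytes_dict, bytes_dict['<unk>'])
ids_from_str = to_ids('great movie tonight', str_dict, str_dict['<unk>'])
print(ids_from_bytes, ids_from_str)
```

The resulting id lists can then be passed to fluid.create_lod_tensor in the same way as lod in the code above.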