NLP: A Simple Attention-Based English-Chinese Translation System

Building a translation model with attention

I recently followed the NMT tutorial and put together a simple attention framework. The basic functionality works, but beam search still has some bugs.

Repository: https://github.com/audier/my_deep_project/tree/master/NLP/2.translation

This post is the Jupyter notebook example under that path:
https://github.com/audier/my_deep_project/blob/master/NLP/2.translation/HowToUseModel.ipynb

The example uses the modeling framework to build and compare two simple models:

  • a char-level (per-character) attention model
  • a jieba word-segmentation attention model

The conclusion is that encoding the data with a word-segmentation tool works better than encoding it one character at a time.
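For intuition, here is how the two schemes split the same Chinese sentence (using the real jieba library; the sentence is taken from the inference examples below):

import jieba

sentence = '为什么是我?'
print(list(sentence))        # char tokens: one symbol per character, ['为', '什', '么', ...]
print(jieba.lcut(sentence))  # jieba tokens: characters grouped into words, e.g. '为什么' typically stays together

Word-level tokens give the decoder fewer, more meaningful symbols to predict per sentence, which is one plausible reason the jieba model below produces more fluent output.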

1. Modeling with char tokenization

train

Parameter settings: utils

  • data: cmn.txt
  • tokenization: char
  • sentences: 200
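GenData itself lives in the repo's utils.py; as a rough sketch of what this step amounts to (cmn.txt is the tab-separated Tatoeba English-Chinese pair file; load_pairs below is a hypothetical stand-in, not the repo's actual code):

import jieba

def load_pairs(path='cmn.txt', tokenize='char', num_sentences=200):
    # hypothetical stand-in for GenData's loading step: read the first
    # num_sentences tab-separated pairs and tokenize the Chinese side
    en_sents, ch_sents = [], []
    with open(path, encoding='utf-8') as f:
        for line in list(f)[:num_sentences]:
            en, ch = line.strip().split('\t')[:2]  # English \t Chinese (extra fields ignored)
            en_sents.append(en.split())            # English side: whitespace tokens (illustrative)
            ch_sents.append(list(ch) if tokenize == 'char' else jieba.lcut(ch))
    return en_sents, ch_sents

The id2en / id2ch vocabularies used below would then be built from these token lists.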

Parameter settings: train

  • output dir: logs/char
  • epochs: 50
  • unit type: lstm
  • units num: 256
  • num layers: 2
  • attention: luong
  • optimizer: sgd
  • learning rate: 1
  • keep prob: 0.8
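The exact fields live in the repo's params.py; here is a minimal, hypothetical sketch of how the settings above could map onto create_hparams (only out_dir, batch_size, keepprob, encoder_vocab_size and decoder_vocab_size are confirmed by the snippets in this post; the other names are guesses):

import tensorflow as tf  # TensorFlow 1.x, as the restore logs below suggest

def create_hparams():
    # hypothetical sketch; the repo's params.py likely defines these and more fields
    return tf.contrib.training.HParams(
        out_dir='logs/char',     # checkpoint / log directory, overridden per experiment
        epochs=50,
        unit_type='lstm',
        num_units=256,
        num_layers=2,
        attention='luong',
        optimizer='sgd',
        learning_rate=1.0,
        keepprob=0.8,
        batch_size=32,           # placeholder; the repo's default is not shown
        encoder_vocab_size=0,    # filled in from the data, as in the code below
        decoder_vocab_size=0,
    )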
# data-processing utilities
from utils import GenData
# model hyperparameter definitions
from params import create_hparams
# model definition
from model import BaseModel

data = GenData('cmn.txt','char',200)
param = create_hparams()
param.out_dir = 'logs/char'
param.encoder_vocab_size = len(data.id2en)
param.decoder_vocab_size = len(data.id2ch)

model = BaseModel(param, 'train')
model.train(data)
restore model from  logs/char_2_256
INFO:tensorflow:Restoring parameters from logs/char_2_256/model
epochs 0 : average loss =  2.077897276878357
epochs 5 : average loss =  1.812596092224121
epochs 10 : average loss =  1.3171268725395202
epochs 15 : average loss =  1.208145541548729
epochs 20 : average loss =  1.0813179659843444
epochs 25 : average loss =  0.9847841136157512
epochs 30 : average loss =  0.6884907965362072
epochs 35 : average loss =  0.5871514855325222
epochs 40 : average loss =  0.6516741314530372
epochs 45 : average loss =  0.6519982025399804
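For reference, the Luong-attention decoder cell that BaseModel presumably builds follows the standard tf.contrib.seq2seq pattern; the sketch below is an assumption based on the listed settings (lstm, 2 layers, 256 units, luong, keep prob 0.8), not the repo's actual code:

import tensorflow as tf

def build_attention_cell(encoder_outputs, source_lengths,
                         num_units=256, num_layers=2, keep_prob=0.8):
    # 2-layer LSTM decoder with dropout, wrapped with Luong attention over the encoder outputs
    cells = [tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell(num_units),
                                           output_keep_prob=keep_prob)
             for _ in range(num_layers)]
    decoder_cell = tf.nn.rnn_cell.MultiRNNCell(cells)

    # Luong (multiplicative) attention scores each encoder state against the decoder state
    attention = tf.contrib.seq2seq.LuongAttention(
        num_units, memory=encoder_outputs, memory_sequence_length=source_lengths)

    # every decoding step now reads a context vector computed from the encoder outputs
    return tf.contrib.seq2seq.AttentionWrapper(
        decoder_cell, attention, attention_layer_size=num_units)

During training, a TrainingHelper feeds the ground-truth previous token at each step; inference swaps in a greedy helper, sketched at the end of the next subsection.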

infer

Parameter settings: utils

  • same as train

Parameter settings: infer

  • decoding method: greedy (beam search still has bugs)

Set param.batch_size = 1.

Create the model with BaseModel(param, 'infer').

Call model.inference(data) to run inference interactively.

from utils import GenData
from params import create_hparams
from model import BaseModel


data = GenData('cmn.txt','char',200)
param = create_hparams()
param.out_dir = 'logs/char'
param.encoder_vocab_size = len(data.id2en)
param.decoder_vocab_size = len(data.id2ch)

# changes required in infer mode
param.batch_size = 1
param.keepprob = 1

model = BaseModel(param, 'infer')
model.inference(data)
restore model from  logs/char_2_256
INFO:tensorflow:Restoring parameters from logs/char_2_256/model
input english: Why me?
output chinese: 为什么是我?
input english: Ask Tom.
output chinese: 去问汤姆。
input english: Call us.
output chinese: 联系我们。
input english: Humor me.
output chinese: 你就随了我
input english: Can I help?
output chinese: 我可以幫忙嗎?
input english: Hey, relax.
output chinese: 嘿,放松点。
input english: I eat here.
output chinese: 我我這裡裡。
input english: exit
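The greedy decoding that model.inference relies on presumably follows the standard tf.contrib.seq2seq pattern as well; a hedged sketch (the argument names are stand-ins, not the repo's actual identifiers):

import tensorflow as tf

def build_greedy_decoder(decoder_cell, initial_state, embedding_decoder,
                         projection_layer, sos_id, eos_id, batch_size=1, max_len=50):
    # feed the argmax token of each step back in as the next input
    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
        embedding_decoder,
        start_tokens=tf.fill([batch_size], sos_id),
        end_token=eos_id)
    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, initial_state, output_layer=projection_layer)
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
        decoder, maximum_iterations=max_len)
    return outputs.sample_id  # [batch, time] greedy token ids

Beam search would instead tile the encoder state and use tf.contrib.seq2seq.BeamSearchDecoder; that is the part the post reports as still buggy.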

2. Modeling with jieba word segmentation

train

Parameter settings: utils

  • data: cmn.txt
  • tokenization: jieba
  • sentences: 200

Parameter settings: train

  • output dir: logs/jieba
  • epochs: 50
  • unit type: lstm
  • units num: 256
  • num layers: 2
  • attention: luong
  • optimizer: sgd
  • learning rate: 1
  • keep prob: 0.8
# data-processing utilities
from utils import GenData
# model hyperparameter definitions
from params import create_hparams
# model definition
from model import BaseModel

data = GenData('cmn.txt','jieba',200)
param = create_hparams()
param.out_dir = 'logs/jieba'
param.encoder_vocab_size = len(data.id2en)
param.decoder_vocab_size = len(data.id2ch)

model = BaseModel(param, 'train')
model.train(data)
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Hongwen\AppData\Local\Temp\jieba.cache
Loading model cost 0.692 seconds.
Prefix dict has been built succesfully.


restore model from  logs/jieba_2_256
INFO:tensorflow:Restoring parameters from logs/jieba_2_256/model
epochs 0 : average loss =  2.4543039417266845
epochs 5 : average loss =  2.010868504047394
epochs 10 : average loss =  1.6857356426119805
epochs 15 : average loss =  1.561958097219467
epochs 20 : average loss =  1.2476714611053468
epochs 25 : average loss =  0.9543516409397125
epochs 30 : average loss =  0.7893321856856346
epochs 35 : average loss =  1.002643452435732
epochs 40 : average loss =  0.6801241132616996
epochs 45 : average loss =  0.6159672170877457

infer

Parameter settings: utils

  • same as train

Parameter settings: infer

  • decoding method: greedy (beam search still has bugs)

Set param.batch_size = 1.

Create the model with BaseModel(param, 'infer').

Call model.inference(data) to run inference interactively.

from utils import GenData
from params import create_hparams
from model import BaseModel


data = GenData('cmn.txt','jieba',200)
param = create_hparams()
param.out_dir = 'logs/jieba'
param.encoder_vocab_size = len(data.id2en)
param.decoder_vocab_size = len(data.id2ch)

# changes required in infer mode
param.batch_size = 1
param.keepprob = 1

model = BaseModel(param, 'infer')
model.inference(data)
restore model from  logs/jieba_2_256
INFO:tensorflow:Restoring parameters from logs/jieba_2_256/model
input english: Why me?
output chinese: 为什么是我?
input english: Ask Tom.
output chinese: 去问汤姆。
input english: Call us.
output chinese: 联系跟着我们。
input english: Humor me.
output chinese: 你就随了我的意吧。
input english: Can I help?
output chinese: 我可以幫忙嗎?
input english: Hey, relax.
output chinese: 嘿,放松点。
input english: I eat here.
output chinese: 我在這裡吃。
input english: exit
