Building Chinese Paraphrase Training Data for the Pointer-Generator Summarization Network

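First, I'd like to recommend a paper: https://arxiv.org/abs/1704.04368. It introduces a text summarization network; I won't go into the details here. There is also a Zhihu article that introduces it, https://zhuanlan.zhihu.com/p/27272224; for everything else, please read the paper yourself.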

The open-source pointer-generator project is at https://github.com/abisee/pointer-generator. We want to use it for Chinese paraphrasing, so let's first look at how it handles English text summarization.

The GitHub page provides the model's output on the test set. Let's take the first article and see how it does:

This is the original article:

washington ( cnn ) president barack obama says he is `` absolutely committed to making sure '' israel maintains
 a military advantage over iran . his comments to the new york times , published on sunday , come amid 
criticism from israeli prime minister benjamin netanyahu of the deal that the united states and five other world
 powers struck with iran . tehran agreed to halt the country 's nuclear ambitions , and in exchange , western 
powers would drop sanctions that have hurt the iran 's economy . obama said he understands and respects 
netanyahu 's stance that israel is particularly vulnerable and does n't `` have the luxury of testing these 
propositions '' in the deal . `` but what i would say to them is that not only am i absolutely committed to making
 sure they maintain their qualitative military edge , and that they can deter any potential future attacks , but 
what i 'm willing to do is to make the kinds of commitments that would give everybody in the neighborhood ,
 including iran , a clarity that if israel were to be attacked by any state , that we would stand by them , '' obama 
said . that , he said , should be `` sufficient to take advantage of this once-in-a-lifetime opportunity to see 
whether or not we can at least take the nuclear issue off the table , '' he said . the framework negotiators 
announced last week would see iran reduce its centrifuges from 19,000 to 5,060 , limit the extent to which 
uranium necessary for nuclear weapons can be enriched and increase inspections . the talks over a final draft 
are scheduled to continue until june 30 . but netanyahu and republican critics in congress have complained that iran 
wo n't have to shut down its nuclear facilities and that the country 's leadership is n't trustworthy enough for the 
inspections to be as valuable as obama says they are . obama said even if iran ca n't be trusted , there 's still a case to 
be made for the deal . `` in fact , you could argue that if they are implacably opposed to us , all the more reason for 
us to want to have a deal in which we know what they 're doing and that , for a long period of time , we can prevent 
them from having a nuclear weapon , '' obama said .

This is the reference summary provided for the article:

1. in an interview with the new york times , president obama says he understands israel feels particularly vulnerable .
2. obama calls the nuclear deal with iran a `` once-in-a-lifetime opportunity '' .
3. israeli prime minister benjamin netanyahu and many u.s. republicans warn that iran can not be trusted .

And this is the output of the pointer-generator model:

1. president barack obama says he is `` absolutely committed to making sure '' israel maintains a military advantage 
over iran .
2. obama said he understands and respects netanyahu 's stance that israel is particularly vulnerable and 
does n't `` have the luxury of testing these propositions '' .

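The result is decent: both sentences are fluent and are taken almost verbatim from the article, and the second overlaps with the first point of the reference summary.
The GitHub repository also provides a pretrained model as well as the English training data, so you can train the model yourself. Next, let's look at how its training data is processed.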

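The page https://github.com/abisee/cnn-dailymail provides the code for processing the English data, and it also provides already-processed data. Let's download the CNN Stories dataset and open one of the files. The structure is simple: the article comes first, and the reference summary comes at the end, with each summary sentence marked by @highlight. The reference summary looks like this (the article itself is too long to include):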

@highlight

A new tour in Taipei, Taiwan, allows tourists to do four-hour ride-alongs in local taxis

@highlight

Tourists go wherever local fares hire the cabs to go

@highlight

The appeal is going to unexpected locations and meeting chatty locals

@highlight

One English tourist was invited to a Taiwanese family dinner by a passenger in his taxi
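
As an illustration of this layout, here is a minimal sketch that splits the raw text of one .story file into article lines and highlight sentences. It loosely follows the approach in the repository's make_datafiles.py, but the function name `parse_story` and the sample text are hypothetical and only for demonstration.

```python
def parse_story(text):
    """Split the raw text of a .story file into (article_lines, highlights)."""
    article_lines = []
    highlights = []
    next_is_highlight = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("@highlight"):
            # Everything after the first marker belongs to the summary
            next_is_highlight = True
        elif next_is_highlight:
            highlights.append(line)
        else:
            article_lines.append(line)
    return article_lines, highlights


# Hypothetical miniature story, not taken from the dataset
sample = (
    "First sentence of the article.\n\n"
    "Second sentence of the article.\n\n"
    "@highlight\n\nFirst summary point\n\n"
    "@highlight\n\nSecond summary point\n"
)
article, summary = parse_story(sample)
print(article)
print(summary)
```

The article lines can then be joined into one string, and each highlight becomes one sentence of the reference summary.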

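Training then feeds in the article and the reference summary, and the model learns to generate that summary on its own. Now that we have a Chinese parallel corpus, using it for Chinese paraphrasing is just as simple. My Chinese corpus is in the following format (the text must be word-segmented first, for example with HIT's LTP):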

这 并 不 奇怪 
没什么 奇怪 的 
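
To make the layout concrete, the sketch below groups a flat list of lines into (sentence, paraphrase) pairs. The function name `pair_lines` is hypothetical, and two choices here are assumptions rather than part of the original script: blank lines are dropped before pairing, and an odd number of remaining lines is treated as an error.

```python
def pair_lines(lines):
    """Group a flat list of lines into (article, abstract) pairs."""
    # Assumption: blank lines carry no meaning and can be dropped
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    if len(cleaned) % 2 != 0:
        raise ValueError(
            "expected an even number of non-empty lines, got %d" % len(cleaned)
        )
    # Lines 0, 2, 4, ... are sentences; lines 1, 3, 5, ... are their paraphrases
    return [(cleaned[i], cleaned[i + 1]) for i in range(0, len(cleaned), 2)]


corpus = ["这 并 不 奇怪", "没什么 奇怪 的"]
pairs = pair_lines(corpus)
print(pairs)
```

Checking the line count up front helps catch a misaligned corpus early, since a single missing line would otherwise shift every later pair.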

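So all we need to do is feed the first sentence of each pair as the article and the second sentence as the reference summary. However, we cannot process the corpus directly with the make_datafiles.py from the link above. Why? The English data stores one article and its reference summary per .story file, whereas we have a single text file containing many sentence pairs. The code only needs small changes: take care not to put all the first sentences into one article and all the second sentences into one abstract; each pair must be stored as a separate example. Here is my code: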

import os
import struct
import collections
from tensorflow.core.example import example_pb2

# We use these two symbols to delimit the summary sentences in the .bin data files
SENTENCE_START = '<s>'
SENTENCE_END = '</s>'

train_file = './train/train.txt'
val_file = './val/val.txt'
test_file = './test/test.txt'
finished_files_dir = './finished_files'
chunks_dir = os.path.join(finished_files_dir, "chunked")

VOCAB_SIZE = 200000
CHUNK_SIZE = 1000  # number of examples per chunk, used when chunking the data


def chunk_file(set_name):
    in_file = os.path.join(finished_files_dir, '%s.bin' % set_name)
    print(in_file)
    chunk = 0
    finished = False
    with open(in_file, "rb") as reader:  # the input file is closed automatically
        while not finished:
            chunk_fname = os.path.join(chunks_dir, '%s_%03d.bin' % (set_name, chunk))  # new chunk file
            with open(chunk_fname, 'wb') as writer:
                for _ in range(CHUNK_SIZE):
                    len_bytes = reader.read(8)  # 8-byte length prefix
                    if not len_bytes:
                        finished = True
                        break
                    str_len = struct.unpack('q', len_bytes)[0]
                    example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
                    writer.write(struct.pack('q', str_len))
                    writer.write(struct.pack('%ds' % str_len, example_str))
            chunk += 1


def chunk_all():
    # Create a directory to hold the chunks
    if not os.path.isdir(chunks_dir):
        os.mkdir(chunks_dir)
    # Split each dataset into chunks
    for set_name in ['train', 'val', 'test']:
        print("Splitting %s data into chunks..." % set_name)
        chunk_file(set_name)
    print("Saved chunked data in %s" % chunks_dir)


def read_text_file(text_file):
    lines = []
    with open(text_file, "r", encoding='utf-8') as f:
        for line in f:
            lines.append(line.strip())
    return lines


def write_to_bin(input_file, out_file, makevocab=False):
    if makevocab:
        vocab_counter = collections.Counter()

    with open(out_file, 'wb') as writer:
        # Read the input text file; even-numbered lines become the article,
        # odd-numbered lines the abstract (line numbers start at 0)
        lines = read_text_file(input_file)
        for i, new_line in enumerate(lines):
            if i % 2 == 0:
                article = lines[i]
            if i % 2 != 0:
                abstract = "%s %s %s" % (SENTENCE_START, lines[i], SENTENCE_END)

                # Write to a tf.Example
                tf_example = example_pb2.Example()
                tf_example.features.feature['article'].bytes_list.value.extend([bytes(article, encoding='utf-8')])
                tf_example.features.feature['abstract'].bytes_list.value.extend([bytes(abstract, encoding='utf-8')])
                tf_example_str = tf_example.SerializeToString()
                str_len = len(tf_example_str)
                writer.write(struct.pack('q', str_len))
                writer.write(struct.pack('%ds' % str_len, tf_example_str))

                # If requested, count the tokens for the vocab
                if makevocab:
                    art_tokens = article.split(' ')
                    abs_tokens = abstract.split(' ')
                    abs_tokens = [t for t in abs_tokens if
                                  t not in [SENTENCE_START, SENTENCE_END]]   # remove these symbols from the vocab
                    tokens = art_tokens + abs_tokens
                    tokens = [t.strip() for t in tokens]  # strip surrounding whitespace
                    tokens = [t for t in tokens if t != ""]  # remove empty tokens
                    vocab_counter.update(tokens)

    print("Finished writing file %s\n" % out_file)

    # Write the vocab to file
    if makevocab:
        print("Writing vocab file...")
        with open(os.path.join(finished_files_dir, "vocab"), 'w', encoding='utf-8') as writer:
            for word, count in vocab_counter.most_common(VOCAB_SIZE):
                writer.write(word + ' ' + str(count) + '\n')
        print("Finished writing vocab file")


if __name__ == '__main__':

    os.makedirs(finished_files_dir, exist_ok=True)

    # Read the text files, do a little postprocessing, then write to .bin files
    write_to_bin(test_file, os.path.join(finished_files_dir, "test.bin"))
    write_to_bin(val_file, os.path.join(finished_files_dir, "val.bin"))
    write_to_bin(train_file, os.path.join(finished_files_dir, "train.bin"), makevocab=True)

    chunk_all()
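
The .bin files written above are a sequence of length-prefixed records: an 8-byte length packed with the struct format 'q', followed by that many bytes of payload. The sketch below shows this framing in isolation, using an in-memory buffer and placeholder byte strings in place of serialized tf.Example messages. The helper names `write_record` and `read_records` are hypothetical.

```python
import io
import struct


def write_record(stream, payload):
    """Write one record: 8-byte length prefix, then the payload bytes."""
    stream.write(struct.pack('q', len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield the payload of each length-prefixed record until end of stream."""
    while True:
        len_bytes = stream.read(8)
        if not len_bytes:
            break  # end of stream
        str_len = struct.unpack('q', len_bytes)[0]
        yield stream.read(str_len)


# Placeholder payloads; in the real files each one is a serialized tf.Example
buf = io.BytesIO()
for payload in [b"first example", "第二个".encode("utf-8")]:
    write_record(buf, payload)

buf.seek(0)
records = list(read_records(buf))
print(len(records), "records read back")
```

With TensorFlow installed, each payload read from a real .bin file could presumably be turned back into an example with `example_pb2.Example.FromString(payload)`, which is a handy way to spot-check that the article and abstract were written as intended.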

Here is my project directory:

[Image: screenshot of the project directory]

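Just put the files at the corresponding paths in the expected format. The generated .bin files are written to the finished_files folder and can then be used to train pointer-generator. The output that pointer-generator eventually produces is the Chinese paraphrase we are after.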

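That's it for this post. There isn't much to it, really: I just wanted to recommend pointer-generator and share an idea for applying a text summarization tool to Chinese paraphrasing. I'm a newcomer to natural language processing and still have a lot to learn from everyone, so feel free to get in touch.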

 
