Chinese summarization datasets commonly used in research papers

(1) Short text

1) LCSTS (Harbin Institute of Technology)

Loading:

import pandas as pd
from datasets import Dataset


def read_lcsts(path):
    # Tag lines (<doc ...>, <summary>, </summary>, ...) match the separator regex
    # and become NaN in column 0, so dropna() keeps only the text lines, which
    # alternate summary, document, summary, document, ...
    # on_bad_lines='warn' replaces warn_bad_lines/error_bad_lines (removed in pandas 2.0).
    lines = pd.read_table(path, header=None, sep='<[/d|/s|do|su|sh][^a].*>',
                          engine='python', on_bad_lines='warn', encoding='utf-8')
    lines = lines[0].dropna().reset_index(drop=True)
    pairs = pd.concat([lines[1::2].reset_index(drop=True), lines[::2].reset_index(drop=True)], axis=1)
    pairs.columns = ['document', 'summary']
    return pairs


# Both splits currently read PART_III.txt; point the first call at another part
# if a separate training split is intended.
lcsts_part_1 = read_lcsts(r'D:\softwares\zwj\nlp\project\long-document\datasets\PART_III.txt')
lcsts_part_2 = read_lcsts(r'D:\softwares\zwj\nlp\project\long-document\datasets\PART_III.txt')

dataset_train = Dataset.from_pandas(lcsts_part_1, preserve_index=False).shuffle(seed=42)
dataset_valid = Dataset.from_pandas(lcsts_part_2, preserve_index=False).shuffle(seed=42)
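
An alternative way to pull the pairs out is to match the tags directly with the standard `re` module instead of using them as a pandas separator. The sketch below assumes each record looks like `<doc id=...> <summary>...</summary> <short_text>...</short_text> </doc>`; that layout, the name `parse_lcsts`, and the sample text are assumptions for illustration, not something the script above confirms.

```python
import re

# Assumed layout: a <summary> block followed by a <short_text> block in each <doc>.
RECORD_RE = re.compile(
    r"<summary>(.*?)</summary>\s*<short_text>(.*?)</short_text>",
    re.DOTALL,
)


def parse_lcsts(text):
    """Return a list of {'document', 'summary'} dicts found in `text`."""
    return [
        {"document": doc.strip(), "summary": summ.strip()}
        for summ, doc in RECORD_RE.findall(text)
    ]


# Hypothetical two-record sample in the assumed layout.
sample = """<doc id=0>
    <human_label>4</human_label>
    <summary>
        摘要一
    </summary>
    <short_text>
        正文一。
    </short_text>
</doc>
<doc id=1>
    <human_label>3</human_label>
    <summary>
        摘要二
    </summary>
    <short_text>
        正文二。
    </short_text>
</doc>
"""

records = parse_lcsts(sample)
for record in records:
    print(record)
```

If the layout holds, the resulting list of dicts could be handed to `Dataset.from_list` in recent `datasets` releases, which avoids relying on the summary and document lines strictly alternating.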

(2) Medium length

1) TTNews, the single-document news summarization test set from NLPCC 2017

2) CNewSum from ByteDance (NLPCC 2021)

Conversion script:

# coding=utf-8
import json

import jsonlines
from datasets import load_dataset

json_data_path = r'./test.simple.anno.label.jsonl'

data = []
with jsonlines.open(json_data_path) as reader:
    for obj in reader:
        # Each field is iterated and its items concatenated into one string.
        article = ''.join(str(piece) for piece in obj['article'])
        summary = ''.join(str(piece) for piece in obj['summary'])
        data.append({'content': article, 'title': summary})

# Write all records as a single JSON array, keeping Chinese characters unescaped.
with open('./CNewSum_test_original.json', 'w', encoding='UTF-8') as f:
    json.dump(data, f, indent=4, sort_keys=False, ensure_ascii=False)

# With a plain list of files, load_dataset places the data under the 'train' split.
dataset_train = load_dataset('json', data_files=[r'./CNewSum_test_original.json'])
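
The same flattening step can be sketched with only the standard library, which is handy when `jsonlines` is not installed. The helper names `flatten` and `convert_lines` and the sample record are hypothetical; the field names `article`, `summary`, `content`, and `title` are the ones used in the script above. As an extra assumption, a field that is already a plain string is passed through unchanged.

```python
import json


def flatten(field):
    """Join a list of pieces into one string; pass plain strings through."""
    if isinstance(field, str):
        return field
    return "".join(str(piece) for piece in field)


def convert_lines(lines):
    """Turn JSON Lines records into {'content', 'title'} dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj = json.loads(line)
        records.append({
            "content": flatten(obj["article"]),
            "title": flatten(obj["summary"]),
        })
    return records


# Hypothetical sample: one record plus a blank line.
sample_lines = [
    json.dumps({"article": ["第一句。", "第二句。"], "summary": "摘要"}, ensure_ascii=False),
    "",
]
converted = convert_lines(sample_lines)
print(converted)
```

Because `convert_lines` accepts any iterable of strings, an open file handle (`open(path, encoding='utf-8')`) can be passed in directly.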

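(3) Long text

1) CLTS from NLPCC 2020. The quality of this dataset is poor, however: a large share of the summaries are copied verbatim from the article body.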
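
One rough way to check how extractive a summary is: measure what fraction of its character n-grams also appear in the article. The sketch below is only an illustration of that idea; the function names, the choice of n = 4, and the sample strings are assumptions, and it is not a standard metric.

```python
def char_ngrams(text, n):
    """Set of all character n-grams in `text` (empty if text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def copied_fraction(summary, document, n=4):
    """Fraction of the summary's distinct character n-grams that occur in the document."""
    summary_grams = char_ngrams(summary, n)
    if not summary_grams:
        return 0.0
    document_grams = char_ngrams(document, n)
    return len(summary_grams & document_grams) / len(summary_grams)


# Hypothetical examples: one summary lifted from the text, one reworded.
document = "今天上午市政府召开新闻发布会，介绍了新的交通规划。"
extractive = "市政府召开新闻发布会"
abstractive = "官方公布交通新方案"

print(copied_fraction(extractive, document))
print(copied_fraction(abstractive, document))
```

A value close to 1 means the summary is almost entirely copied from the article; averaging this over a sample of pairs gives a quick impression of how extractive a dataset is.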
