First, import torchtext:
```python
from torchtext import data
```
Set up a `Field`, which "pre-configures" how the input text data will be processed:
```python
question = data.Field(sequential=True, fix_length=20, pad_token='0')
label = data.Field(sequential=False, use_vocab=False)
```
| `sequential` | `tokenize` | `fix_length` | `pad_first` | `tensor_type` | `lower` |
|---|---|---|---|---|---|
| Whether the data is a sequence | Tokenizer | Fixed text length | Whether to pad on the left | Tensor type | Whether to lowercase English characters |
Take `question` as an example: the text length is fixed at 20, so inputs longer than 20 tokens are truncated and shorter ones are padded with `pad_token`. `sequential` indicates whether the input is sequential text. If True, the input is a sequence and is split into tokens with `tokenize` (the default is whitespace splitting via `str.split`; Spacy can also be used); if False, the input is already tokenized or needs no tokenization. When working with Chinese text, you can also supply a custom tokenizer to segment it:
```python
import jieba

def chinese_tokenizer(text):
    return [tok for tok in jieba.lcut(text)]

question = data.Field(sequential=True, tokenize=chinese_tokenizer, fix_length=20)
```
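To sanity-check the tokenizer together with the `fix_length`/`pad_token` behaviour described above, here is a minimal sketch (it assumes the legacy torchtext `Field` API, where `preprocess` tokenizes a single raw string and `pad` pads or truncates a minibatch of token lists; the sentence is borrowed from the sample output shown later):

```python
# Minimal sketch: Field.preprocess tokenizes a raw string, Field.pad
# pads/truncates a list of token lists to fix_length.
tokens = question.preprocess('世界上为什么有好人和坏人')
print(tokens)                  # e.g. ['世界', '上', '为什么', '有', '好人', '和', '坏人']
print(question.pad([tokens]))  # padded to length 20 with the field's pad_token
```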
Read the training, validation, and test sets in one call: `path` is the directory, while `train`, `validation`, and `test` are the file names. Per the docstring, `splits()` will "Create train-test(-valid?) splits from the instance's examples, and return Datasets for train, validation, and test splits in that order, if the splits are provided."
```python
train, val, test = data.TabularDataset.splits(
    path='./',
    train='train.json',
    validation='val.json',
    test='test.json',
    format='json',
    fields={'question': ('question', question),
            'label': ('label', label)})
```
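For reference, `TabularDataset` with `format='json'` reads one JSON object per line (JSON Lines). A hypothetical sketch of how such a `train.json` could be produced, using the two sample questions that appear in the output further down:

```python
# Hypothetical sketch: write records in the JSON Lines layout that
# TabularDataset(format='json') reads -- one JSON object per line.
import json

samples = [
    {'question': '世界上为什么有好人和坏人', 'label': 'generate'},
    {'question': '为什么有坏人有好人呀', 'label': 'generate'},
]
with open('train.json', 'w', encoding='utf-8') as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + '\n')
```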
Check that the data has been read in correctly (the jieba Chinese tokenizer is used here):
```python
for i in range(0, len(val)):
    print(vars(val[i]))
```
Sample output:
```
{'question': ['世界', '上', '为什么', '有', '好人', '和', '坏人'], 'label': 'generate'}
{'question': ['为什么', '有', '坏人', '有', '好人', '呀'], 'label': 'generate'}
```
```python
import os
from torchtext.vocab import Vectors

cache = '.vector_cache'
if not os.path.exists(cache):
    os.mkdir(cache)
vectors = Vectors(name=configs.embedding_path, cache=cache)
question.build_vocab(train, val, test, min_freq=5, vectors=vectors)
```
`build_vocab` extracts, from the pretrained vectors, the vectors of the words that appear in the current corpus, forming the corpus's Vocab (vocabulary). It builds the vocabulary and maps tokens to integer ids (`.vocab.vectors` holds the word vectors associated with this vocabulary). torchtext also ships several pretrained word vectors. `max_size` caps the vocabulary size, and `min_freq` sets the minimum frequency a token must reach to be included.
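Once `build_vocab` has run, the vocabulary can be inspected directly through the standard `Vocab` attributes (`itos`, `stoi`, `vectors`); a quick sketch:

```python
# Quick inspection sketch (legacy torchtext Vocab attributes).
print(len(question.vocab))           # vocabulary size
print(question.vocab.itos[:10])      # first few tokens (index -> string)
print(question.vocab.stoi['为什么'])   # string -> index (unknown tokens map to <unk>)
print(question.vocab.vectors.shape)  # (vocabulary size, embedding dim)
```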
```python
train_iter = data.Iterator(dataset=train, batch_size=256, shuffle=True, sort_within_batch=False, repeat=False, device=configs.device)
val_iter = data.Iterator(dataset=val, batch_size=256, shuffle=False, sort=False, repeat=False, device=configs.device)
test_iter = data.Iterator(dataset=test, batch_size=256, shuffle=False, sort=False, repeat=False, device=configs.device)
```
| `dataset` | `batch_size` | `batch_size_fn` | `sort_key` | `train` | `repeat` | `shuffle` | `sort` | `sort_within_batch` |
|---|---|---|---|---|---|---|---|---|
| Dataset to load | Batch size | Function producing a dynamic batch size | Key to sort by | Whether this is the training set | Whether to repeat the iterator across epochs | Whether to shuffle the data | Whether to sort the data | Whether to sort within each batch |
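Iterating over these objects yields batches whose attributes are named after the declared fields. A minimal consumption sketch (with the default `batch_first=False`, `batch.question` comes out with shape `(fix_length, batch_size)`; since `label` was declared with `use_vocab=False`, the labels are expected to already be numeric):

```python
# Minimal sketch: each Batch exposes the declared fields as attributes.
for batch in train_iter:
    x = batch.question   # LongTensor of token ids, shape (fix_length, batch_size)
    y = batch.label      # labels, shape (batch_size,)
    print(x.shape, y.shape)
    break
```

The complete loading script below puts all of these pieces together.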
```python
import codecs
import jieba
import os

from config import config
from torchtext import data, datasets
from torchtext.vocab import Vectors


def chinese_tokenizer(text):
    return [tok for tok in jieba.lcut(text)]


def load_data(configs):
    TEXT = data.Field(sequential=True, tokenize=chinese_tokenizer, fix_length=20)
    LABEL = data.Field(sequential=False, use_vocab=False)

    train, val, test = data.TabularDataset.splits(
        path=configs.file_path,
        train=configs.train,
        validation=configs.val,
        test=configs.test,
        format='json',
        fields={
            'question': ('question', TEXT),
            'label': ('label', LABEL)
        }
    )

    # for i in range(0, len(val)):
    #     print(vars(val[i]))
    print('Read {} success, {} texts in total.'.format(configs.train, len(train)))
    print('Read {} success, {} texts in total.'.format(configs.val, len(val)))
    print('Read {} success, {} texts in total.\n'.format(configs.test, len(test)))

    cache = '.vector_cache'
    if not os.path.exists(cache):
        os.mkdir(cache)
    vectors = Vectors(name=configs.embedding_path, cache=cache)
    print('load word2vec vectors from {}.'.format(configs.embedding_path))
    TEXT.build_vocab(train, val, test, min_freq=5, vectors=vectors)

    train_iter = data.Iterator(dataset=train, batch_size=configs.batch_size, shuffle=True,
                               sort_within_batch=False, repeat=False, device=configs.device)
    val_iter = data.Iterator(dataset=val, batch_size=configs.batch_size, shuffle=False,
                             sort=False, repeat=False, device=configs.device)
    test_iter = data.Iterator(dataset=test, batch_size=configs.batch_size, shuffle=False,
                              sort=False, repeat=False, device=configs.device)

    return train_iter, val_iter, test_iter, len(TEXT.vocab), TEXT.vocab.vectors
```
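A typical way to consume the returned values is to initialise an `nn.Embedding` layer of the downstream model with the pretrained vectors. A minimal sketch, reusing the `config` object assumed by the script above:

```python
# Minimal usage sketch: build an embedding layer from the vectors
# returned by load_data; `config` is the same config object as above.
import torch.nn as nn

from config import config

train_iter, val_iter, test_iter, vocab_size, vectors = load_data(config)

embedding = nn.Embedding(vocab_size, vectors.size(1))
embedding.weight.data.copy_(vectors)   # initialise with the pretrained word2vec vectors
```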