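The last part of this lesson briefly introduces how to use fastai for natural language processing (NLP). It covers two tasks: predicting the words that follow a piece of text (language modeling) and classifying text.
Course notebook: https://www.kaggle.com/hortonhearsafoo/fast-ai-v3-lesson-3-imdb
# Preprocessing
# Import the fastai library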
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.text import *
# Download the sample dataset. The destination was changed so that it is downloaded into the current folder
path = untar_data(URLs.IMDB_SAMPLE, dest="./")
path.ls()
[WindowsPath('imdb_sample/data_save.pkl'),
WindowsPath('imdb_sample/texts.csv')]
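# Read the downloaded dataset file. Each row consists of a label (positive or negative), the text, and a flag marking whether the row belongs to the validation set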
df = pd.read_csv(path/'texts.csv')
df.head()
# View the text of the second review
df['text'][1]
'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.
But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.
The result is that there is much cruelty and inhumanity in the situation and this is very unpleasant to remember and to see on the screen. But it is never painted as a black-and-white case. There is baseness and nobility on both sides, and also the hope for change in the younger generation.
There is redemption of a sort, in the end, when Puro has to make a hard choice between a man who has ruined her life, but also truly loved her, and her family which has disowned her, then later come looking for her. But by that point, she has no option that is without great pain for her.
This film carries the message that both Muslims and Hindus have their grave faults, and also that both can be dignified and caring people. The reality of partition makes that realisation all the more wrenching, since there can never be real reconciliation across the India/Pakistan border. In that sense, it is similar to "Mr & Mrs Iyer".
In the end, we were glad to have seen the film, even though the resolution was heartbreaking. If the UK and US could deal with their own histories of racism with this kind of frankness, they would certainly be better off.'
# Build the training dataset from the CSV file
data_lm = TextDataBunch.from_csv(path, 'texts.csv', num_workers=0)
# data_lm
# Save the dataset we created so that it can be loaded directly the next time we train
data_lm.save()
# Load the previously saved dataset from disk. Recent fastai versions changed the function for loading a dataset, so load_data is used here instead
data = load_data(path)
# data = TextDataBunch.load(path)
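# Show a batch from the dataset. The data has already been preprocessed (cleaned) automatically
# The preprocessing mainly covers tokenization, handling HTML tags, converting upper case to lower case, and handling special symbols
# Some tokens start with xx. During cleaning, fastai counts every word
# and builds a vocabulary from the most frequent ones (the top 60,000 by default); each word must also occur at least a minimum number of times
# Words outside the vocabulary are marked as unknown. The text is also split into sections, and information such as where a text starts and ends is marked
# The unknown-word and section markers are all encoded in the dataset, generally as tokens that start with xx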
data.show_batch()
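The vocabulary rule described above can be sketched in plain Python. This is only an illustration, not fastai's implementation: `build_vocab` and `SPECIAL_TOKENS` are hypothetical names, the toy corpus is made up, and `min_freq=2` is an assumed default.

```python
from collections import Counter

# A few special tokens placed at the start of the vocabulary (subset, for illustration).
SPECIAL_TOKENS = ["xxunk", "xxpad", "xxbos"]

def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2):
    """Return an index-to-token list: special tokens first, then frequent tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(SPECIAL_TOKENS)
    for tok, freq in counts.most_common(max_vocab):
        # Keep a token only if it is frequent enough and not already listed.
        if freq >= min_freq and tok not in itos:
            itos.append(tok)
    return itos

# Hypothetical toy corpus, already tokenized.
texts = [["xxbos", "the", "movie", "was", "great"],
         ["xxbos", "the", "movie", "was", "boring"]]
itos = build_vocab(texts)
print(itos)
```

In this toy corpus "great" and "boring" each occur once, so they fall below `min_freq` and would later be encoded as `xxunk`.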
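# The dataset also assigns an index to every token automatically and stores the tokens in the list vocab.itos, which maps an index to its token
# vocab.stoi provides the reverse lookup, from a token to its index
# The first ten tokens of the list are shown below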
data.vocab.itos[:10]
['xxunk',
'xxpad',
'xxbos',
'xxeos',
'xxfld',
'xxmaj',
'xxup',
'xxrep',
'xxwrep',
'the']
# Show a sample that has already been tokenized and cleaned
data.train_ds[0][0]
Text xxbos xxmaj just do n't bother . i thought i would see a movie with great xxunk and action .
xxmaj but it grows boring and terribly predictable after the interesting start . xxmaj in the middle of the film you have a little social drama and all tension is lost because it slows down the speed . xxmaj towards the end the it gets better but not really great . i think the director took this movie just too serious . xxmaj in such a kind of a movie even if u do n't care about the plot at least you want some nice action . i nearly dozed off in the middle / main part of it . xxmaj rating 3 / 10 .
xxunk .
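The `xxmaj` and `xxup` markers in the output above record the original capitalization after the text has been lower-cased. A minimal sketch of that idea follows; `mark_caps` is a hypothetical helper written for this note, and it assumes non-empty tokens and simplified rules rather than fastai's exact ones.

```python
def mark_caps(tokens):
    """Lower-case tokens, inserting xxup / xxmaj markers to record the original casing."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            # Entirely upper-case word: mark it with xxup.
            out += ["xxup", tok.lower()]
        elif tok[0].isupper() and tok[1:].islower():
            # Capitalized word: mark it with xxmaj.
            out += ["xxmaj", tok.lower()]
        else:
            out.append(tok.lower())
    return out

marked = mark_caps(["Just", "do", "n't", "bother", ".", "GREAT", "movie"])
print(" ".join(marked))
```

A single capital letter such as "I" matches neither rule in this sketch and is simply lower-cased, which is consistent with the "i thought i would" in the sample above.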
# Show the data that is actually used for training, which has been encoded as numbers
data.train_ds[0][0].data[:10]
array([ 2, 5, 58, 60, 37, 946, 11, 19, 212, 19], dtype=int64)
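The mapping between tokens and indices can be illustrated with a tiny, made-up vocabulary. The names `numericalize` and `textify` below are hypothetical helpers for this sketch; only the `itos` / `stoi` idea comes from the notes above.

```python
# Hypothetical miniature vocabulary; index 0 is reserved for unknown tokens.
itos = ["xxunk", "xxpad", "xxbos", "the", "movie", "was", "great"]
stoi = {tok: idx for idx, tok in enumerate(itos)}

def numericalize(tokens):
    """Map tokens to indices, falling back to index 0 (xxunk) for unseen tokens."""
    return [stoi.get(tok, 0) for tok in tokens]

def textify(ids):
    """Map indices back to tokens and join them into a string."""
    return " ".join(itos[i] for i in ids)

ids = numericalize(["xxbos", "the", "movie", "was", "superb"])
decoded = textify(ids)
print(ids, decoded)
```

Because "superb" is not in the vocabulary, it cannot be recovered when decoding; it comes back as `xxunk`.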
# The dataset above was built with the default method and parameters
# To configure the dataset more flexibly, we can build it step by step with the data block API
data = (TextList.from_csv(path, 'texts.csv', cols='text')
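        .split_from_df(col=2)  # split the data; col is the index of the column that flags the validation set (is_valid)
        .label_from_df(cols=0)  # label each item from the given label column
        .databunch(num_workers=0))  # build the DataBunch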
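To make the column indices concrete, here is a rough standard-library sketch of what splitting on column 2 and labelling from column 0 amount to. The three rows are hypothetical and only mimic the assumed column order of texts.csv (label, text, is_valid); this is not how fastai does it internally.

```python
import csv
import io

# Hypothetical rows in the assumed column order of texts.csv: label, text, is_valid.
RAW = """label,text,is_valid
negative,Just don't bother.,False
positive,A well-made film.,False
positive,Glad to have seen it.,True
"""

rows = list(csv.reader(io.StringIO(RAW)))[1:]  # skip the header row

# Column 2 decides the split, column 1 is the input text, column 0 is the label.
train = [(row[1], row[0]) for row in rows if row[2] != "True"]
valid = [(row[1], row[0]) for row in rows if row[2] == "True"]
print(len(train), len(valid))
```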
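# Define the batch size, i.e. the number of samples in each training batch. Memory (or GPU memory) use is high, so a fairly small value is recommended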
bs=48
# Download the full dataset
path = untar_data(URLs.IMDB)
path.ls()
# List the training set files
(path/'train').ls()
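# Build the dataset for the language model with fastai's data block API
# The texts are shuffled and concatenated back together,
# the labels are ignored, and the words that follow each position become the learning target for it
# What we ultimately train is a predictive model: given the start of a sentence, it predicts what comes next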
data_lm = (TextList.from_folder(path)
#Inputs: all the text files in path
.filter_by_folder(include=['train', 'test', 'unsup'])
#We may have other temp folders that contain text files so we only keep what's in train, test and unsup
.random_split_by_pct(0.1)
#We randomly split and keep 10% (10,000 reviews) for validation
.label_for_lm()
#We want to do a language model so we label accordingly
.databunch(bs=bs))
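The idea that a language model's target is simply its input shifted by one token can be sketched as follows. `lm_batches` is a hypothetical helper and the token ids are made up; fastai's own batching is more involved than this.

```python
def lm_batches(token_ids, seq_len):
    """Yield (input, target) pairs where the target is the input shifted by one token."""
    for start in range(0, len(token_ids) - 1, seq_len):
        x = token_ids[start:start + seq_len]
        y = token_ids[start + 1:start + seq_len + 1]
        if len(x) == len(y):  # drop a trailing chunk that has no full target
            yield x, y

# Hypothetical token ids for two reviews, concatenated into one stream.
stream = [2, 5, 58, 60, 37] + [2, 9, 11, 19]
pairs = list(lm_batches(stream, seq_len=4))
for x, y in pairs:
    print(x, "->", y)
```

No labels are needed here: the text itself supplies the targets, which is why the unlabelled `unsup` folder can be used as well.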
# Save the dataset we built
data_lm.save('tmp_lm')
# Load the dataset, using the new function
# data_lm = TextLMDataBunch.load(path, 'tmp_lm', bs=bs)
data_lm = load_data(path, 'tmp_lm', bs=bs)
# Show a batch from the dataset
data_lm.show_batch()
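# Define the learner. By default it is an RNN, and we start from a pretrained model,
# which was trained on a Wikipedia corpus (WikiText-103). drop_mult=0.3 is a multiplier applied to the model's dropout values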
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0.3)
# Find a suitable learning rate
learn.lr_find()
# Plot the learning-rate curve, leaving out the last 15 recorded points
learn.recorder.plot(skip_end=15)
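# Train with a suitable learning rate. moms sets the range of the momentum:
# while the learning rate rises, the momentum goes from 0.8 down to 0.7, and while the learning rate falls, it goes from 0.7 back up to 0.8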
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
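The opposite movement of learning rate and momentum during one cycle can be sketched like this. It is a simplified illustration under stated assumptions: cosine interpolation, a warm-up share `pct_start=0.3`, a starting learning rate of `lr_max / div`, and a very small final learning rate are all assumed here and may differ from what fastai actually uses.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle(step, total_steps, lr_max=1e-2, moms=(0.8, 0.7), pct_start=0.3, div=25.0):
    """Return (learning rate, momentum) for a step; pct_start and div are assumed values."""
    warm = int(total_steps * pct_start)
    if step < warm:
        # Phase 1: learning rate rises while momentum falls.
        pct = step / warm
        return cos_anneal(lr_max / div, lr_max, pct), cos_anneal(moms[0], moms[1], pct)
    # Phase 2: learning rate falls while momentum rises again.
    pct = (step - warm) / (total_steps - warm)
    return cos_anneal(lr_max, lr_max / (div * 1e4), pct), cos_anneal(moms[1], moms[0], pct)

schedule = [one_cycle(step, 100) for step in range(101)]
for step in (0, 30, 100):
    lr, mom = schedule[step]
    print(step, lr, mom)
```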
# Save the trained model
learn.save('fit_head')
# Load the model
learn.load('fit_head');
# After fine-tuning the model, we can go further and retrain all of its parameters
# Unfreeze the model parameters
learn.unfreeze()
# Retrain all layers. The course notebook comments this out because the training time didn't fit in a single Kernel session
# learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))
# Save the model after the further training
learn.save('fine_tuned')
# Load the model
learn.load('fine_tuned');
# Define the input text
TEXT = "i liked this movie because"
# Define the parameters: the number of words to predict and the number of sentences to generate
N_WORDS = 40
N_SENTENCES = 2
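# Print all the predictions. The temperature is set to 0.75; the larger this value, the more random the predictions. Each prediction is returned as a text string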
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
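The effect of the temperature can be illustrated with a small standard-library sketch. The scores below are hypothetical, and `softmax_with_temperature` is a helper written for this note to show the general idea, not fastai's sampling code.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words
sharp = softmax_with_temperature(logits, 0.75)
flat = softmax_with_temperature(logits, 2.0)

# Sample the next word index from the sharper distribution.
rng = random.Random(0)
next_idx = rng.choices(range(len(sharp)), weights=sharp)[0]
print(sharp, flat, next_idx)
```

With the lower temperature the most likely word takes a larger share of the probability, so repeated sampling produces less varied text.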
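# Finally, save the encoder, the part of the language model that turns text into an internal representation, so that the classifier can load and reuse it later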
learn.save_encoder('fine_tuned_enc')
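# Next, train a classifier for the reviews
# First download the labelled review data; the label takes one of two values, positive or negative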
path = untar_data(URLs.IMDB)
# Build the dataset manually. The vocabulary of the language-model dataset saved earlier is reused here to map the words in the texts to indices
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
#grab all the text files in path
# Set the folder used as the validation set
.split_by_folder(valid='test')
#split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
# Label each text according to the folder it is in
.label_from_folder(classes=['neg', 'pos'])
#label them all with their folders
.databunch(bs=bs))
# Save the dataset
data_clas.save('tmp_clas')
# Load the previously saved dataset. The method in the original notebook is outdated, so the new one is used
# data_clas = TextClasDataBunch.load(path, 'tmp_clas', bs=bs)
data_clas = load_data(path, 'tmp_clas', bs=bs)
# Show a batch from the dataset
data_clas.show_batch()
# Create the RNN classifier learner
learn = text_classifier_learner(data_clas, drop_mult=0.5)
# Load the encoder saved earlier
learn.load_encoder('fine_tuned_enc')
# Freeze the parameters of everything before the last layer group
learn.freeze()
# Find a learning rate
learn.lr_find()
# Plot the learning-rate curve
learn.recorder.plot()
# Train for one epoch with a suitable learning rate
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))
# Save the model from the first round of training
learn.save('first')
# Load the model from the first round
learn.load('first');
# Freeze the parameters of everything before the last two layer groups
learn.freeze_to(-2)
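The gradual unfreezing used here can be pictured with a small sketch. The function below is a hypothetical stand-in that only reports which layer groups would train, and the group names are made up; it ignores details such as special handling of particular layer types.

```python
def freeze_to(layer_groups, n):
    """Return trainable flags: groups before index n are frozen, the rest train."""
    cut = n if n >= 0 else len(layer_groups) + n
    return {name: i >= cut for i, name in enumerate(layer_groups)}

# Hypothetical layer-group names, ordered from input to output.
groups = ["embedding", "rnn_1", "rnn_2", "rnn_3", "classifier_head"]

last_one = freeze_to(groups, -1)    # only the head trains
last_two = freeze_to(groups, -2)    # the head and rnn_3 train
last_three = freeze_to(groups, -3)  # the head, rnn_3 and rnn_2 train
print(last_two)
```

Training the last groups first and then unfreezing earlier ones step by step lets the newly added head settle before the pretrained layers are changed.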
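# Train for one epoch with a different learning rate. It is now passed as a slice, so the earliest layer group gets the smallest learning rate and the last group the largest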
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))
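One way to read `slice(1e-2/(2.6**4), 1e-2)` is as the two ends of a range of per-group learning rates. The sketch below spreads the values geometrically between the ends; both the geometric spacing and `n_groups=5` are assumptions made for illustration, and `layer_group_lrs` is a hypothetical helper.

```python
def layer_group_lrs(lo, hi, n_groups):
    """Spread learning rates geometrically from lo (first group) to hi (last group)."""
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * ratio ** i for i in range(n_groups)]

# Assumed: five layer groups, with the same bounds as the slice above.
lrs = layer_group_lrs(1e-2 / (2.6 ** 4), 1e-2, n_groups=5)
for i, lr in enumerate(lrs):
    print(i, lr)
```

Under these assumptions each group's learning rate is 2.6 times that of the group before it, which would explain the `2.6**4` in the lower bound.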
# Save the model from the second round of training
learn.save('second')
# Load the model from the second round
learn.load('second');
# Freeze the parameters of everything before the last three layer groups
learn.freeze_to(-3)
# Adjust the learning rate and train for one more epoch
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))
# Save the model from the third round of training
learn.save('third')
# Load the model from the third round
learn.load('third');
# Unfreeze all parameters
learn.unfreeze()
# Adjust the learning rate and train for two epochs; this time all parameters are trained
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))
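# Use the trained model to predict on an input sentence. It returns the predicted label together with the probability of each class, in the order the classes were defined (neg, pos)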
learn.predict("I really loved that movie, it was awesome!")
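Reading the class probabilities comes down to matching each value with the class at the same position. A minimal sketch, with made-up probabilities that are not the output of any real model:

```python
classes = ["neg", "pos"]  # same order as label_from_folder(classes=['neg', 'pos'])
probs = [0.02, 0.98]      # hypothetical values, not a real model output

# Pick the index with the highest probability and look up its class name.
best = max(range(len(probs)), key=probs.__getitem__)
label = classes[best]
print(label, probs[best])
```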