Author | Aaron Abrahamson
Translated by | VK
Source | Towards Data Science
Training a text generation model on Dune
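Dune is the story of a feudal society in the distant future. It follows a duke and his family, who are forced to become the stewards of the desert planet Arrakis. Frank Herbert published this classic in 1965. Almost any modern science fiction can be traced back to some element of Dune.
I recently finished Dune Messiah, the sequel to Dune, and have just started Children of Dune, the third book in the series. Herbert originally wrote six books, and his son later wrote a whole pile more. I haven't read those.
I have been exploring text generation models, and I thought it would be fun to try one on Dune. Many "classic" machine learning models are used for prediction and clustering. Generative modeling instead lets a model create new data that resembles the training data it learned from. A recent example of what generative modeling can do is StyleGAN; take a look at this video (https://www.youtube.com/watch?v=kSLJriaOumA).
Here is a link to the Colab notebook I used for this project (https://drive.google.com/file/d/15Z7SNBnBL12acmUGvvMLQ-OoMspb-B5k/view?usp=sharing).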
The process
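- Get a corpus of text data.
- Clean the data. I had some unicode characters, and the word "page" appeared at every page break, which is useless. Every chapter opens with an excerpt from an in-world memoir or book, and I decided to take those out. I also dropped the second half of each chapter to cut down on processing time.
- Tokenize. This means removing punctuation, lowercasing the text, and splitting the long string into individual words. The model learns the order and frequency of these word tokens. Note that for this kind of NLP task we do not remove stop words.
- Build the model. Be sure to use an LSTM layer, and make the output layer the size of the vocabulary. Essentially, the model classifies what the next word is likely to be, given a short piece of input text.
- Train the model. Keras recommends at least 20 epochs; I ran 33.
- Generate text. I show some of the model's output below.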
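The cleaning and tokenization steps above can be sketched in plain Python. This is a minimal illustration, not the code from the notebook: the function names (`clean_text`, `tokenize`, `build_training_pairs`), the window size, and the sample sentence are all assumptions made for the example.

```python
import re
import string

def clean_text(raw):
    """Drop non-ASCII characters and stray 'page' markers (hypothetical rules)."""
    text = raw.encode("ascii", errors="ignore").decode("ascii")
    # Remove the standalone word "page" left behind by page breaks.
    # Caveat: this also removes legitimate uses of the word.
    return re.sub(r"\bpage\b", " ", text, flags=re.IGNORECASE)

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace. Stop words are kept."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

def build_training_pairs(tokens, window=3):
    """Pair each run of `window` word ids with the id of the word that follows."""
    vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
    ids = [vocab[t] for t in tokens]
    pairs = [(ids[i:i + window], ids[i + window])
             for i in range(len(ids) - window)]
    return vocab, pairs

# Made-up sample text standing in for the real corpus.
sample = "The spice must flow. page The spice extends life."
tokens = tokenize(clean_text(sample))
vocab, pairs = build_training_pairs(tokens, window=3)
print(tokens)
print(len(vocab), "distinct words,", len(pairs), "training pairs")
```

Each pair is one classification example: the input is a short run of word ids, and the target is the id of the next word, which is why the model's output layer needs one unit per vocabulary entry.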
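Chapter 1: The Baron
I wanted to test the model after a little training to see what it would produce. The seed word was "Baron", a despicable antagonist in the book.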
‘Baron The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron’
It just goes on like that. Not good at all.
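One plausible explanation for this kind of loop is greedy decoding: if generation always picks the single most probable next word, two words that point at each other repeat forever. The source does not say how the notebook picks words, so this is an assumption. The sketch below uses a toy next-word table with made-up probabilities in place of a trained model, and compares a greedy pick with temperature sampling, a common alternative.

```python
import math
import random

def greedy_pick(probs):
    """Return the index of the most probable next word."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index after rescaling the distribution by a temperature."""
    # Work in log space; the small epsilon avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

# Toy next-word table standing in for a trained model (made-up numbers).
VOCAB = ["baron", "the", "of", "spice"]
NEXT_WORD_PROBS = {
    "baron": [0.05, 0.50, 0.40, 0.05],
    "the":   [0.60, 0.05, 0.05, 0.30],
    "of":    [0.10, 0.70, 0.05, 0.15],
    "spice": [0.20, 0.30, 0.40, 0.10],
}

def generate(seed, n_words, pick):
    """Extend `seed` by n_words, choosing each next word with `pick`."""
    words = [seed]
    for _ in range(n_words):
        probs = NEXT_WORD_PROBS[words[-1]]
        words.append(VOCAB[pick(probs)])
    return " ".join(words)

rng = random.Random(0)  # fixed seed so runs are repeatable
greedy_text = generate("baron", 6, greedy_pick)
sampled_text = generate(
    "baron", 6, lambda p: sample_with_temperature(p, 1.0, rng)
)
print(greedy_text)   # baron the baron the baron the baron
print(sampled_text)
```

A temperature near zero behaves like the greedy pick, while higher values spread the choice across less likely words, which trades repetition for more variety and more nonsense.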
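The model after 33 epochs does much better, but it still gets stuck in loops and just keeps spitting out assorted nouns. Here is the output for the seed word "Spice":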
The Spice Itself Stood Out To The Left Wall The Fremen Seeker Followed The Chains The Troop Was A Likely Shadow And The Natural Place Of The Great Places That Was A Subtle City Of The Room'S Features That The Man Was A Master Of The Cavern The Growing The Bronze The Sliding Hand
Here is the output for "Paul" (the protagonist):
Paul Stood Unable To The Duke And The Reverend Mother Ramallo To The Guard Captain And The Man Looked At Him And The Child Was A Relief One Of The Fremen Had Been In The Doorway And The Fedaykin Control Them To Be Like The Spice Diet Out Of The Wind And The Duke Said I Am The Fremen To Get The Banker Said When The Emperor Asked His Fingers Nefud I Know You Can Take The Duchy Of Government The Sist The Duke Said He Turned To The Hand Beside The Table The Baron Asked The Emperor Will Hold
Here is the output for "She looked":
'She Looked At The Transparent End Of The Table Saw A Small Board In The Room And The Way Of The Old Woman He Had Been Sent By The Wind Of The Duke And The Worms They Had Seen The Waters Of The Desert And The Sandworms The Troop Had Been Subtly Prepared By The Wind Of The Worm Had Been Subtly Always In The Deep Sinks Of The Women And The Duke Had Been Given Last Of Course But The Others Had Been In The Fremen Had Been Shaped On The Light Of The Light Of The Hall Had Had Seen'
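Thoughts and next steps
I think this definitely shows progress. I would like to train it to at least 100 epochs, but it is slow going. Each epoch takes about 11 minutes, so the full run would take more than 18 hours. I need a better computer.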
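As a quick check of that estimate, here is the arithmetic using only the figures quoted in the text (about 11 minutes per epoch, 33 epochs already run, a target of 100). The variable names are just for illustration, and the "remaining" figure assumes training could resume from the 33-epoch model, which the source does not confirm.

```python
MINUTES_PER_EPOCH = 11   # approximate figure from the text
EPOCHS_DONE = 33
TARGET_EPOCHS = 100

# Full run from scratch: 100 epochs * 11 min = 1100 min.
total_hours = TARGET_EPOCHS * MINUTES_PER_EPOCH / 60

# Remaining time if training resumes from epoch 33 (an assumption).
remaining_hours = (TARGET_EPOCHS - EPOCHS_DONE) * MINUTES_PER_EPOCH / 60

print(f"full run: {total_hours:.2f} h, remaining: {remaining_hours:.2f} h")
```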
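Finally, I want to add that the irony of doing this is not lost on me. In the Dune universe, at some point in the ancient past, "thinking machines" rebelled against humanity and nearly wiped it out. By the time of the book, computers have been replaced by "mentats": humans bred and trained to mimic the computational power of computers.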
Original article: https://towardsdatascience.com/the-text-must-flow-3bb4edff7b5b