In this section we apply the recurrent neural networks introduced earlier, $\text{RNN}$, $\text{LSTM}$ and $\text{GRU}$, to a concrete example. The point is not only the models themselves: the example also serves to revisit the $\text{One-hot}$ vector.
Natural language ($\text{Natural Language}$) is the primary tool of human communication and thought; Chinese and English are both examples. In essence it is a discrete symbol system. Taking one language as an example, the set of words available to us, the vocabulary $\mathcal V$, can be written as:
$$\mathcal V = \{\omega_1,\omega_2,\cdots,\omega_{|\mathcal V|}\}$$
where:
$|\mathcal V|$ denotes the number of words in the vocabulary $\mathcal V$;
$\omega_i \ (i=1,2,\cdots,|\mathcal V|)$ denotes one specific word.
The first problem we face is how to represent these words: they must be converted into numeric features that a machine learning model can work with. This is the original motivation for expressing words as numeric vectors, known as word representation ($\text{Word Representation}$).
$\text{One-hot}$ representation ($\text{One-hot Representation}$) is the simplest and most direct such representation. Taking the vocabulary $\mathcal V$ above as an example, each word $\omega^{(i)} \ (i=1,2,\cdots,|\mathcal V|)$ can be represented by a vector of length $|\mathcal V|$:
where $i=1,2,\cdots,|\mathcal V|$ encodes the position of each word within $\mathcal V$. Taking $\omega^{(1)},\omega^{(2)}$ as examples, their $\text{One-hot}$ vectors have a $1$ at positions $1$ and $2$ respectively, and $0$ everywhere else:
$$\omega^{(1)} = (1,0,0,\cdots,0)^T \quad \omega^{(2)} = (0,1,0,\cdots,0)^T$$
The advantage of the $\text{One-hot}$ vector: every word in the vocabulary $\mathcal V$ has its own vector, and the conversion loses no feature information.
The disadvantages of the $\text{One-hot}$ vector: it is discrete, and it is a purely local representation. These two flaws make it very hard to express the similarity ($\text{Similarity}$) between words. Take any two distinct words $\omega^{(i)},\omega^{(j)}$ from $\mathcal V$, with $i,j \in \{1,2,\cdots,|\mathcal V|\}$ and $i \neq j$; their inner product satisfies $[\omega^{(i)}]^T \cdot \omega^{(j)} = 0$ without exception, so the cosine similarity is $0$ as well. All word vectors are mutually orthogonal, and no relation between words is captured.
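To see this concretely, here is a minimal numeric check (a sketch using $\text{NumPy}$; the two vectors are hypothetical $\text{One-hot}$ encodings in a $5$-word vocabulary):
import numpy as np

# Hypothetical one-hot vectors of two different words in a 5-word vocabulary.
w_i = np.array([1, 0, 0, 0, 0])
w_j = np.array([0, 0, 1, 0, 0])

InnerProduct = w_i @ w_j
CosineSim = InnerProduct / (np.linalg.norm(w_i) * np.linalg.norm(w_j))
print(InnerProduct, CosineSim)   # 0 0.0 -> orthogonal: no similarity is expressed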
That is a brief account of $\text{One-hot}$. One misconception, however, needs correcting: similarity between words and sequence information are different things and must be kept apart.
Below, we use $\text{One-hot}$ vectors together with a recurrent neural network model to predict a character sequence. The complete code is attached at the end of the article. We first generate the data set:
def GetTxtFile(WritePath, SeqInput, RepeatNum=500):
    """
    :param WritePath: D:\code_work\MachineLearning/FlareData.txt
    :param SeqInput: "Deep learning is to learn the internal laws and presentation levels of sample data."
    :param RepeatNum: 500
    :return: None (writes FlareData.txt)
    """
    # Write the same sentence RepeatNum times, one sentence per line.
    with open(WritePath, "w", encoding="UTF-8") as f:
        for _ in range(RepeatNum):
            f.write(SeqInput)
            f.write("\n")
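Calling it with the values from its docstring generates the data set file:
# Generate FlareData.txt: the sentence repeated 500 times, one per line.
SeqInput = "Deep learning is to learn the internal laws and presentation levels of sample data."
GetTxtFile(r"D:\code_work\MachineLearning/FlareData.txt", SeqInput, RepeatNum=500)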
The generated file is simply this sentence repeated $500$ times. The sentence itself was picked arbitrarily from the web; repeating it has one purpose: to minimize the diversity of the feature distribution in this example. (You could also try a longer excerpt of real text instead.)
Diversity here means: however you pick a span of text, the set of characters that can follow it is always limited. Since we mainly care about whether the model can learn the sequence information, we keep the feature distribution deliberately simple.
One character is treated as one vector unit. First we tidy up the data format, deduplicate ($\text{Set}$) to collect every character that occurs, and build dictionaries between characters and their indices. The code is as follows:
def GetStringDict(SeqPath):
    def ReadData(SeqPath):
        # Read the whole file and flatten line breaks into spaces.
        with open(SeqPath, encoding="UTF-8") as f:
            Data = f.read().replace("\n", " ").replace("\r", " ")
        return Data

    def DelRepeat(Data):
        # Deduplicate: every distinct character becomes one token.
        letters = list(set(Data))
        return letters

    def GetLetterDict(LetterList):
        # Build index -> character and character -> index dictionaries.
        IndexLetterDict = {i: j for i, j in enumerate(LetterList)}
        LetterIndexDict = {j: i for i, j in enumerate(LetterList)}
        return IndexLetterDict, LetterIndexDict

    Data = ReadData(SeqPath)
    LetterList = DelRepeat(Data)
    IndextoLetter, LettertoIndex = GetLetterDict(LetterList)
    return Data, IndextoLetter, LettertoIndex
The returned mapping dictionaries look as follows. (Note that Python's set has no fixed ordering, so the concrete character-to-index assignment can differ from run to run.)
# IndextoLetter -> length:20
{0: 't', 1: 'g', 2: '.', 3: 'm', 4: 'D', 5: 'a', 6: 'w', 7: 'v', 8: 'r', 9: 'p', 10: ' ', 11: 'e', 12: 'n', 13: 'h', 14: 'o', 15: 'f', 16: 'l', 17: 's', 18: 'd', 19: 'i'}
# LettertoIndex
{'t': 0, 'g': 1, '.': 2, 'm': 3, 'D': 4, 'a': 5, 'w': 6, 'v': 7, 'r': 8, 'p': 9, ' ': 10, 'e': 11, 'n': 12, 'h': 13, 'o': 14, 'f': 15, 'l': 16, 's': 17, 'd': 18, 'i': 19}
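The two dictionaries are inverses of each other. As a quick check (a sketch assuming SeqPath points at the FlareData.txt generated above), a round-trip lookup recovers every character:
Data, IndextoLetter, LettertoIndex = GetStringDict(SeqPath)
# letter -> index -> letter is an identity for every character in the data.
assert all(IndextoLetter[LettertoIndex[c]] == c for c in set(Data))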
We grab character data with a window of $20$ characters: following the recurrent-network formulation, the sequence information of $20$ time steps is used to predict the output at the next time step.
The corresponding code is below; the window size $\text{Slide}$ is $20$, and the moving stride defaults to $1$:
def ExtractData(Data, Slide):
    # Slide a window of length Slide over the text with stride 1;
    # each window is a sample, the character right after it is the label.
    x = list()
    y = list()
    for i in range(len(Data) - Slide):
        x.append([a for a in Data[i:i+Slide]])
        y.append(Data[i+Slide])
    return x, y
The first $5$ results from the data are shown below; the first line is the original data, for comparison.
Deep learning is to learn the internal laws and presentation levels of sample data.
Token:['D', 'e', 'e', 'p', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'i', 's', ' ', 't', 'o', ' ']
Label:l
----------------------------------------------------------------------------------------------------------
Token:['e', 'e', 'p', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'i', 's', ' ', 't', 'o', ' ', 'l']
Label:e
----------------------------------------------------------------------------------------------------------
Token:['e', 'p', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'i', 's', ' ', 't', 'o', ' ', 'l', 'e']
Label:a
----------------------------------------------------------------------------------------------------------
Token:['p', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'i', 's', ' ', 't', 'o', ' ', 'l', 'e', 'a']
Label:r
----------------------------------------------------------------------------------------------------------
Token:[' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'i', 's', ' ', 't', 'o', ' ', 'l', 'e', 'a', 'r']
Label:n
The windows grabbed above are the raw form of the samples and labels. We now convert each raw character feature to its corresponding $\text{Index}$; the converted $\text{Index}$ features then serve as the feature information of the $\text{One-hot}$ vectors:
def LettertoIndexData(x, y, LettertoIndex):
    # Map every character in the samples and labels to its dictionary index.
    xtoIndex = list()
    ytoIndex = list()
    for i in range(len(x)):
        xtoIndex.append([LettertoIndex[Letter] for Letter in x[i]])
        ytoIndex.append([LettertoIndex[Letter] for Letter in y[i]])
    return xtoIndex, ytoIndex
Corresponding to the $5$ windows grabbed above, the $\text{Index}$ features are shown below. (These indices come from a different run than the dictionaries printed earlier; as noted, the set ordering, and hence the mapping, changes between runs.)
[15, 14, 14, 3, 4, 11, 14, 5, 18, 2, 9, 2, 13, 4, 9, 1, 4, 7, 0, 4]
[11]
-------------------------------------------------------------------
[14, 14, 3, 4, 11, 14, 5, 18, 2, 9, 2, 13, 4, 9, 1, 4, 7, 0, 4, 11]
[14]
-------------------------------------------------------------------
[14, 3, 4, 11, 14, 5, 18, 2, 9, 2, 13, 4, 9, 1, 4, 7, 0, 4, 11, 14]
[5]
-------------------------------------------------------------------
[3, 4, 11, 14, 5, 18, 2, 9, 2, 13, 4, 9, 1, 4, 7, 0, 4, 11, 14, 5]
[18]
-------------------------------------------------------------------
[4, 11, 14, 5, 18, 2, 9, 2, 13, 4, 9, 1, 4, 7, 0, 4, 11, 14, 5, 18]
[2]
-------------------------------------------------------------------
Only xtoIndex is converted to the $\text{One-hot}$ vector format; ytoIndex is used directly as the classification labels.
def GetOneHot(IndexToken, NumLetters):
    # The one-hot length must be the vocabulary size (NumLetters), not the
    # window size; in this example both happen to equal 20.
    assert IndexToken < NumLetters
    OneHotInit = np.zeros(NumLetters, dtype=np.int16)
    OneHotInit[IndexToken] = 1
    return OneHotInit

def DataProcessing(Data, LettertoIndex, Slide=20):
    LetterX, Lettery = ExtractData(Data, Slide)
    IndexTokenX, IndexTokeny = LettertoIndexData(LetterX, Lettery, LettertoIndex)
    Label = list(np.array(IndexTokeny).flatten())
    OnehotToken = list()
    for SubSlide in IndexTokenX:
        OnehotSlideToken = list()
        for i in SubSlide:
            OnehotResult = GetOneHot(i, len(LettertoIndex))
            OnehotSlideToken.append(OnehotResult)
        OnehotToken.append(OnehotSlideToken)
    return np.array(OnehotToken), Label
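As a sanity check (a sketch using the Data and LettertoIndex obtained above), the returned tensor holds one $\text{Slide} \times |\mathcal V|$ one-hot matrix per window:
X, y = DataProcessing(Data, LettertoIndex, Slide=20)
print(X.shape)   # (len(Data) - 20, 20, 20): windows x time steps x vocabulary size
print(len(y))    # len(Data) - 20: one label index per window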
Here is one window's $\text{Index}$ feature information and the corresponding $\text{One-hot}$ vector result:
# xtoIndex;20
[10, 4, 4, 14, 15, 7, 4, 0, 8, 2, 18, 2, 12, 15, 18, 17, 15, 3, 19, 15]
# One-hot Result;(20,20)
[[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]]
Here we use $\text{Keras}$ to build an $\text{LSTM}$ neural network model:
the first layer is an $\text{LSTM}$ with a $\text{ReLU}$ activation;
the second layer is a fully connected layer with a $\text{Softmax}$ activation.
from keras.models import Sequential
from keras.layers import Dense, LSTM

def GetModel(X_train, NumLetters):
    model = Sequential()
    # LSTM over (Slide, vocabulary size) one-hot sequences; ReLU replaces the default tanh.
    model.add(LSTM(units=20, input_shape=(X_train.shape[1], X_train.shape[2]), activation="relu"))
    # Softmax over the vocabulary: predicting the next character is a classification task.
    model.add(Dense(units=NumLetters, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
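As a sanity check on the architecture: with $20$ units, an input dimension of $20$ (the vocabulary size) and $4$ gate blocks, the standard $\text{Keras}$ $\text{LSTM}$ parameter formula gives
$$\underbrace{4 \times 20 \times (20 + 20 + 1)}_{\text{LSTM}} + \underbrace{20 \times 20 + 20}_{\text{Dense}} = 3280 + 420 = 3700$$
trainable parameters, which model.summary() should confirm.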
As a test, $\text{NewLetters}$ takes a fragment of the original data, and we observe the model's output:
from sklearn.model_selection import train_test_split

def Console(SeqPath):
    def GetyTrainOneHot(y_train, NumLetters):
        # One-hot encode the labels for categorical cross-entropy.
        yTrainList = list()
        for i in y_train:
            OneHotResult = GetOneHot(i, NumLetters)
            yTrainList.append(OneHotResult)
        return np.array(yTrainList)

    Data, IndextoLetter, LettertoIndex = GetStringDict(SeqPath=SeqPath)
    NumLetters = len(LettertoIndex)
    X, y = DataProcessing(Data, LettertoIndex)
    X_Train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=10)
    y_train_category = GetyTrainOneHot(y_train, NumLetters)
    Model = GetModel(X_Train, NumLetters)
    Model.fit(X_Train, y_train_category, batch_size=500, epochs=10)
    NewLetters = "is to learn the internal laws and presentation levels of sample data."
    xNew, yNew = DataProcessing(NewLetters, LettertoIndex)
    yNewPred = [np.argmax(i) for i in Model.predict(xNew)]
    print("".join([IndextoLetter[i] for i in yNewPred]))
The training output is as follows:
Epoch 1/10
76/76 [==============================] - 2s 10ms/step - loss: 2.7461 - accuracy: 0.1332
Epoch 2/10
76/76 [==============================] - 1s 9ms/step - loss: 2.4212 - accuracy: 0.2519
Epoch 3/10
76/76 [==============================] - 1s 9ms/step - loss: 1.5785 - accuracy: 0.5274
Epoch 4/10
76/76 [==============================] - 1s 9ms/step - loss: 0.8844 - accuracy: 0.7459
Epoch 5/10
76/76 [==============================] - 1s 9ms/step - loss: 0.4660 - accuracy: 0.8764
Epoch 6/10
76/76 [==============================] - 1s 10ms/step - loss: 0.4800 - accuracy: 0.9009
Epoch 7/10
76/76 [==============================] - 1s 10ms/step - loss: 0.7330 - accuracy: 0.7894
Epoch 8/10
76/76 [==============================] - 1s 10ms/step - loss: 0.0548 - accuracy: 1.0000
Epoch 9/10
76/76 [==============================] - 1s 10ms/step - loss: 0.0254 - accuracy: 1.0000
Epoch 10/10
76/76 [==============================] - 1s 10ms/step - loss: 0.0154 - accuracy: 1.0000
As the log shows, the model learns the sequence information and converges. Comparing the test input with the model's prediction:
# input
is to learn the internal laws and presentation levels of sample data.
# Predict
2/2 [==============================] - 0s 2ms/step
rnal laws and presentation levels of sample data.
The two strings differ by exactly one window of characters:
# length:20
is to learn the inte
In other words: from the current window (the sequence information inside it), the model predicts the next character (the next time step), which is exactly the idea behind recurrent neural networks.
# Slide = 20
for i in range(0, xNew.shape[0] - 20):
    print(NewLetters[i:i+20], " predict new letter is: ", IndextoLetter[yNewPred[i]])
The returned results are as follows:
is to learn the inte  predict new letter is:  r
s to learn the inter  predict new letter is:  n
 to learn the intern  predict new letter is:  a
to learn the interna  predict new letter is:  l
o learn the internal  predict new letter is:
 learn the internal   predict new letter is:  l
learn the internal l  predict new letter is:  a
earn the internal la  predict new letter is:  w
arn the internal law  predict new letter is:  s
rn the internal laws  predict new letter is:
n the internal laws   predict new letter is:  a
 the internal laws a  predict new letter is:  n
the internal laws an  predict new letter is:  d
he internal laws and  predict new letter is:
e internal laws and   predict new letter is:  p
 internal laws and p  predict new letter is:  r
internal laws and pr  predict new letter is:  e
nternal laws and pre  predict new letter is:  s
ternal laws and pres  predict new letter is:  e
ernal laws and prese  predict new letter is:  n
rnal laws and presen  predict new letter is:  t
nal laws and present  predict new letter is:  a
al laws and presenta  predict new letter is:  t
l laws and presentat  predict new letter is:  i
 laws and presentati  predict new letter is:  o
laws and presentatio  predict new letter is:  n
aws and presentation  predict new letter is:
ws and presentation   predict new letter is:  l
s and presentation l  predict new letter is:  e
Complete code:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM
from sklearn.model_selection import train_test_split


def GetTxtFile(WritePath, SeqInput, RepeatNum=500):
    """
    :param WritePath: D:\code_work\MachineLearning/FlareData.txt
    :param SeqInput: "Deep learning is to learn the internal laws and presentation levels of sample data."
    :param RepeatNum: 500
    :return: None (writes FlareData.txt)
    """
    with open(WritePath, "w", encoding="UTF-8") as f:
        for _ in range(RepeatNum):
            f.write(SeqInput)
            f.write("\n")


def GetStringDict(SeqPath):
    def ReadData(SeqPath):
        with open(SeqPath, encoding="UTF-8") as f:
            Data = f.read().replace("\n", " ").replace("\r", " ")
        return Data

    def DelRepeat(Data):
        letters = list(set(Data))
        return letters

    def GetLetterDict(LetterList):
        IndexLetterDict = {i: j for i, j in enumerate(LetterList)}
        LetterIndexDict = {j: i for i, j in enumerate(LetterList)}
        return IndexLetterDict, LetterIndexDict

    Data = ReadData(SeqPath)
    LetterList = DelRepeat(Data)
    IndextoLetter, LettertoIndex = GetLetterDict(LetterList)
    return Data, IndextoLetter, LettertoIndex


def ExtractData(Data, Slide):
    x = list()
    y = list()
    for i in range(len(Data) - Slide):
        x.append([a for a in Data[i:i+Slide]])
        y.append(Data[i+Slide])
    return x, y


def LettertoIndexData(x, y, LettertoIndex):
    xtoIndex = list()
    ytoIndex = list()
    for i in range(len(x)):
        xtoIndex.append([LettertoIndex[Letter] for Letter in x[i]])
        ytoIndex.append([LettertoIndex[Letter] for Letter in y[i]])
    return xtoIndex, ytoIndex


def GetOneHot(IndexToken, NumLetters):
    assert IndexToken < NumLetters
    OneHotInit = np.zeros(NumLetters, dtype=np.int16)
    OneHotInit[IndexToken] = 1
    return OneHotInit


def DataProcessing(Data, LettertoIndex, Slide=20):
    LetterX, Lettery = ExtractData(Data, Slide)
    IndexTokenX, IndexTokeny = LettertoIndexData(LetterX, Lettery, LettertoIndex)
    Label = list(np.array(IndexTokeny).flatten())
    OnehotToken = list()
    for SubSlide in IndexTokenX:
        OnehotSlideToken = list()
        for i in SubSlide:
            OnehotResult = GetOneHot(i, len(LettertoIndex))
            OnehotSlideToken.append(OnehotResult)
        OnehotToken.append(OnehotSlideToken)
    return np.array(OnehotToken), Label


def GetModel(X_train, NumLetters):
    model = Sequential()
    model.add(LSTM(units=20, input_shape=(X_train.shape[1], X_train.shape[2]), activation="relu"))
    model.add(Dense(units=NumLetters, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


def Console(SeqPath):
    def GetyTrainOneHot(y_train, NumLetters):
        yTrainList = list()
        for i in y_train:
            OneHotResult = GetOneHot(i, NumLetters)
            yTrainList.append(OneHotResult)
        return np.array(yTrainList)

    Data, IndextoLetter, LettertoIndex = GetStringDict(SeqPath=SeqPath)
    NumLetters = len(LettertoIndex)
    X, y = DataProcessing(Data, LettertoIndex)
    X_Train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=10)
    y_train_category = GetyTrainOneHot(y_train, NumLetters)
    Model = GetModel(X_Train, NumLetters)
    Model.fit(X_Train, y_train_category, batch_size=500, epochs=10)
    NewLetters = "is to learn the internal laws and presentation levels of sample data."
    xNew, yNew = DataProcessing(NewLetters, LettertoIndex)
    yNewPred = [np.argmax(i) for i in Model.predict(xNew)]
    print("".join([IndextoLetter[i] for i in yNewPred]))
    # Slide = 20
    for i in range(0, xNew.shape[0] - 20):
        print(NewLetters[i:i+20], " predict new letter is: ", IndextoLetter[yNewPred[i]])


if __name__ == '__main__':
    # The data file is assumed to have been generated beforehand via GetTxtFile.
    SeqPath = r"D:\code_work\MachineLearning/FlareData.txt"
    Console(SeqPath)