LSTM
Study outline
https://blog.csdn.net/phdongou/article/details/113515466
https://www.jianshu.com/p/2f17a3c62cde
The meaning of the units parameter in Keras --> the number of hidden units in this layer
How many parameters does one cell have in total?
Assume units = 64
According to the figure above we can work it out: with units = 64 the hidden vector a is 64-dimensional, and the input vector x is 28-dimensional, so their concatenation is a 92-dimensional vector. For the multiplication to be valid, Wf must therefore have shape (64, 92), and by the same reasoning the other three boxes have the same shape. The first layer therefore has 64x92x4 + 64x4 parameters.
If we also count the parameters outside the cell: assuming 10 classes, the final output a(t) of shape (64, 1) additionally needs a W of shape (10, 64) and a b of shape (10, 1).
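As a sanity check, the same count can be reproduced with Keras itself. This is a minimal sketch assuming units = 64, a 28-dimensional input, 28 time steps, 10 classes, and a TensorFlow/Keras installation; the layer parameter counts printed by model.summary() should match the formula.

# Minimal sketch: verify the LSTM parameter count 4*(units*(units+input_dim)+units).
# units=64, input_dim=28, 10 classes follow the note above; 28 time steps is an assumption.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

units, input_dim, timesteps, num_classes = 64, 28, 28, 10

model = Sequential([
    LSTM(units, input_shape=(timesteps, input_dim)),  # 4*(64*(64+28) + 64) = 23808 parameters
    Dense(num_classes, activation="softmax"),          # 10*64 + 10 = 650 parameters
])

manual_lstm = 4 * (units * (units + input_dim) + units)
manual_dense = num_classes * units + num_classes
print(manual_lstm, manual_dense)   # 23808 650
model.summary()                    # the per-layer counts should match the manual formula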
Human thinking is persistent and depends on the past.
Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.
Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.
Recurrent Neural Networks have loops.
In the above diagram, a chunk of neural network, A, looks at some input x(t) and outputs a value h(t). A loop allows information to be passed from one step of the network to the next.
The loop in the diagram above may look a bit puzzling at first. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:
An unrolled recurrent neural network.
This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural network to use for such data.
And they certainly are used! In the last few years, there have been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.
Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.
One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.
Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.
But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.
Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.
In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.
Thankfully, LSTMs don’t have this problem!
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
The repeating module in a standard RNN contains a single layer.
LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
The repeating module in an LSTM contains four interacting layers.
Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
An LSTM has three of these gates, to protect and control the cell state.
The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at h(t-1) and x(t), and outputs a number between 0 and 1 for each number in the cell state C(t-1). A 1 represents “completely keep this” while a 0 represents “completely get rid of this.”
Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, C̃(t), that could be added to the state. In the next step, we’ll combine these two to create an update to the state.
In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.
It’s now time to update the old cell state, C(t-1), into the new cell state C(t). The previous steps already decided what to do, we just need to actually do it.
We multiply the old state by f(t), forgetting the things we decided to forget earlier. Then we add i(t) * C̃(t). These are the new candidate values, scaled by how much we decided to update each state value.
In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
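To tie the walkthrough together, here is a minimal NumPy sketch of a single LSTM time step following the gates described above; the sizes (hidden size 4, input size 3) and the random weights are illustrative assumptions.

# Minimal sketch of one LSTM step: forget, input, and output gates plus the candidate C̃(t).
# Sizes and random weights are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, inp = 4, 3
rng = np.random.default_rng(0)
Wf, Wi, Wc, Wo = (rng.standard_normal((hidden, hidden + inp)) for _ in range(4))
bf = bi = bc = bo = np.zeros(hidden)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])   # concatenate [h(t-1), x(t)]
    f = sigmoid(Wf @ z + bf)            # forget gate
    i = sigmoid(Wi @ z + bi)            # input gate
    C_tilde = np.tanh(Wc @ z + bc)      # candidate values
    C = f * C_prev + i * C_tilde        # new cell state
    o = sigmoid(Wo @ z + bo)            # output gate
    h = o * np.tanh(C)                  # new hidden state
    return h, C

h, C = lstm_step(rng.standard_normal(inp), np.zeros(hidden), np.zeros(hidden))
print(h.shape, C.shape)   # (4,) (4,)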
Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!
Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.
LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…
Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!
https://blog.csdn.net/weixin_44731100/article/details/99976214
The warning explains that input_dim and input_length are deprecated and can be replaced by the input_shape argument, and that output_dim is likewise deprecated and replaced by units. These arguments only differ between Keras versions.
https://blog.csdn.net/cskywit/article/details/87436879
Examples of preparing data for an LSTM
1. Single-feature input example: a sequence with multiple time steps and only one feature
from numpy import array
data = array([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0])
# reshape the 1D input sequence into the 3D shape Keras expects: (1 sample, 10 time steps, 1 feature)
data = data.reshape((1,10,1))
print(data.shape)
Output:
(1, 10, 1)
The data can now be fed into an LSTM, for example:
from keras.models import Sequential
from keras.layers import LSTM
model = Sequential()
model.add(LSTM(32, input_shape=(10, 1)))
2. Multi-feature input example: consider the two parallel series below:
series 1: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
series 2: 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1
from numpy import array
data = array([
[0.1,1.0],
[0.2,0.9],
[0.3,0.8],
[0.4,0.7],
[0.5,0.6],
[0.6,0.5],
[0.7,0.4],
[0.8,0.3],
[0.9,0.2],
[1.0,0.1]
])
# reshape the data to (1 sample, 10 time steps, 2 features)
data = data.reshape(1,10,2)
print(data.shape)
Output:
(1, 10, 2)
The data can now be fed into an LSTM:
model = Sequential()
model.add(LSTM(32,input_shape=(10,2)))
Summary of LSTM input in Keras:
The LSTM input data must be three-dimensional.
The three input dimensions are: samples, time steps, features.
The LSTM input layer is defined via the input_shape argument on the first hidden layer.
The input_shape argument takes a tuple of two values: time steps and features.
The number of samples is assumed to be 1 or more.
Use the reshape function on a numpy array to convert 1D or 2D data into the 3D shape an LSTM expects; its argument is a tuple giving the new shape.
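As a small illustration of this summary, the sketch below (with made-up values) reshapes a 2D array of shape (samples, features) into the 3D shape (samples, 1, features), i.e., one time step per sample.

# Minimal sketch: turn a 2D array (samples, features) into the 3D shape (samples, time_steps, features).
# The data values here are made up for illustration.
from numpy import array

data = array([[0.1, 1.0],
              [0.2, 0.9],
              [0.3, 0.8]])          # shape (3, 2): 3 samples, 2 features
data = data.reshape((data.shape[0], 1, data.shape[1]))
print(data.shape)                   # (3, 1, 2): 3 samples, 1 time step, 2 features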
https://www.jianshu.com/p/3edff278f021
LSTM structure diagram
The figure seems to show three cells, but it is actually the same cell repeated across different time steps, i.e., the state of one cell at different times. Let's take the middle cell as an example.
The four yellow rectangles are ordinary neural network hidden layers: the first, second, and fourth use the sigmoid activation, and the third uses tanh. The input X(t) at time t is concatenated with the output h(t-1) from time t-1 and fed into the cell. You can think of it as the concatenated input being fed into each of the four yellow rectangles, where each performs the same computation as an ordinary neural network layer (a matrix multiplication). Everything related to memory is controlled by the gate structures (values between 0 and 1), and besides the raw input, the output of the previous step, h(t-1), is also fed in. Viewed this way, an LSTM is similar to an ordinary neural network, just with extra pieces attached to the input and output. The cell can roughly be split into two horizontal lines: the upper line controls long-term memory and the lower line controls short-term memory. Based on these references I drew a diagram of the LSTM, shown below:
Input and output
Input
In Keras, the LSTM input shape is (samples, time_steps, input_dim), where samples is the number of samples, time_steps is the number of time steps, and input_dim is the dimensionality of each time step. For example, suppose a dataset has four attributes (A, B, C, D), the label we want to predict is D, and there are N samples. If the time step is 1, the input shape is (N, 1, 4); a single sample looks like [A(t-1), B(t-1), C(t-1), D(t-1)] and its label is [D(t)]. If the time step is 2, the input shape is (N, 2, 4); a single sample looks like [[A(t-2), B(t-2), C(t-2), D(t-2)], [A(t-1), B(t-1), C(t-1), D(t-1)]].
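A minimal sketch of how such windowed samples might be built; the array data and the helper make_windows are illustrative assumptions, not code from the cited article.

# Minimal sketch: build (N, time_steps, 4) samples and [D(t)] labels with a sliding window.
# `data` (rows of [A, B, C, D]) and `make_windows` are illustrative assumptions.
import numpy as np

def make_windows(data, time_steps):
    X, y = [], []
    for t in range(time_steps, len(data)):
        X.append(data[t - time_steps:t])   # e.g. rows t-2 and t-1 when time_steps=2
        y.append(data[t, 3])               # label is D(t), the 4th column
    return np.array(X), np.array(y)

data = np.arange(40, dtype=float).reshape(10, 4)   # 10 rows of [A, B, C, D]
X, y = make_windows(data, time_steps=2)
print(X.shape, y.shape)   # (8, 2, 4) (8,)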
Output
As for the LSTM output in Keras, two arguments matter when building the network. One is output_dim (now called units), which sets the output dimensionality and thereby determines the size of the weight matrices inside the four yellow rectangles. The other is the optional return_sequences, which controls whether the LSTM returns the whole time series or only the last step: with return_sequences=True it returns a 3D tensor of shape (samples, time_steps, output_dim), and with return_sequences=False it returns a 2D tensor of shape (samples, output_dim). For example, with input shape (N, 2, 8) and output_dim=32, return_sequences=True returns (N, 2, 32), while return_sequences=False returns (N, 32), i.e., only the last output of the sequence.
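A minimal sketch of the two output shapes, assuming a recent tf.keras where the old output_dim argument is spelled units; the input shape (N, 2, 8) with N=5 and the 32 units follow the example above.

# Minimal sketch: return_sequences=True vs. False, matching the (N, 2, 8) / 32-unit example above.
# Assumes tf.keras, where the old output_dim argument is now called units.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

x = np.random.rand(5, 2, 8).astype("float32")   # N=5 samples, 2 time steps, 8 features

seq = Sequential([LSTM(32, return_sequences=True, input_shape=(2, 8))])
last = Sequential([LSTM(32, return_sequences=False, input_shape=(2, 8))])

print(seq.predict(x, verbose=0).shape)    # (5, 2, 32)
print(last.predict(x, verbose=0).shape)   # (5, 32)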
Building a multi-layer LSTM network is quite convenient: just stack the layers with Sequential(). When stacking LSTM layers, note the following points (see the sketch after this list):
The first LSTM layer must be given the input_shape argument.
Set return_sequences=True for the first N-1 LSTM layers, so that each layer passes the outputs of all time steps on to the next layer, and set return_sequences=False for the last layer (if only the final output is needed).
From the second layer on, input_shape does not need to be specified, because the size of the current layer's weight matrices can be derived from the previous layer's output_dim and the current layer's output_dim.
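A minimal stacked-LSTM sketch following the three points above; the layer sizes, input shape, and the final Dense layer are illustrative assumptions.

# Minimal sketch of stacking LSTM layers with Sequential(), following the three points above.
# Layer sizes, the input shape, and the final Dense layer are illustrative assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(10, 2)),  # first layer: specify input_shape
    LSTM(32, return_sequences=True),                        # intermediate layers: return_sequences=True
    LSTM(16, return_sequences=False),                       # last LSTM: only the final output
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()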
Author: Manfestain
Link: https://www.jianshu.com/p/3edff278f021
https://mp.weixin.qq.com/s/7dsjfcOfm9uPheJrmB0Ghw
Word2Vec mathematical derivation
https://blog.csdn.net/weixin_36723038/article/details/121068776
I. The concept of word2vec
word2vec is a tool that converts words into vector form. It reduces text-processing problems to vector operations in a vector space, using distances in that space to represent semantic similarity between texts.
word2vec was fairly mainstream before 2018, but its standing has declined with the arrival of BERT and GPT-2.
II. How word2vec is implemented
+ One-hot
Simply put, every word in the vocabulary is given a uniform code with the help of a word list, and each word occupies one position in the word space.
For example, "话筒" (microphone) is represented as [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 …]
The vector's dimensionality is the vocabulary size; exactly one dimension is 1 and all the others are 0.
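A minimal sketch of one-hot encoding against a small vocabulary; the vocabulary here is made up for illustration.

# Minimal sketch: one-hot encode a word against a vocabulary (the vocabulary is made up).
import numpy as np

vocab = ["我", "喜欢", "话筒", "音乐", "唱歌"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0   # only one dimension is 1, the rest stay 0
    return vec

print(one_hot("话筒"))   # [0. 0. 1. 0. 0.]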
+ Embedding
This requires not only a vocabulary but also a training corpus; a language model is trained based on the context in which each word appears.
Different corpora and different training parameters yield different models.
Embedding word vectors map words to points in a high-dimensional space; the distance between points can be taken as the distance between the corresponding words, i.e., their syntactic and semantic similarity.
1. CBOW - Continuous Bag-Of-Words Model
Predict the current word from its context; it is like removing one word from a sentence and asking you to guess what it was.
Mathematically, the CBOW model is equivalent to multiplying a bag-of-words vector by an embedding matrix to obtain a continuous embedding vector, which is where the name comes from.
Suppose the corpus is a single sentence: "Real dream is the other shore of reality." We set a sliding window window=2, i.e., two words on each side of the center word are taken as its context words.
1. Before training, the raw text must first be turned into training samples. The figure below shows the process of generating training data from the raw corpus.
In the figure, the word shaded in blue is the center word, and the training data is generated around it. Each training sample consists of several input features and one output, where the inputs are the features and the output is the label. A sketch of this generation step follows below.
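A minimal sketch of generating CBOW training pairs (context words as input, center word as label) with window=2 from the example sentence; the helper cbow_samples is hypothetical.

# Minimal sketch: generate CBOW training samples (context words -> center word) with window=2.
sentence = "Real dream is the other shore of reality".split()

def cbow_samples(tokens, window=2):
    samples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        samples.append((context, center))   # inputs are the context words, label is the center word
    return samples

for context, center in cbow_samples(sentence)[:3]:
    print(context, "->", center)
# ['dream', 'is'] -> Real
# ['Real', 'is', 'the'] -> dream
# ['Real', 'dream', 'the', 'other'] -> is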
2. Skip-gram
Predict the context from the current word; it is like being given one word and asked to guess which words are likely to appear before and after it.
III. Where word2vec is used
Typically, the word2vec output is fed directly into the input layer of a neural network, and the network then performs tasks such as part-of-speech tagging, syntactic parsing, and sentiment analysis.
-----------------------------------------------------------------
Wireshark
By default the packet list shows the following fields: No. (packet number), Time, Source and Destination (source and destination IP), Protocol, Length (packet length), Info, and so on. Columns can also be added, removed, or modified to define the fields you need, for example adding the commonly used Time Delta, Stream, TTL, etc.
https://blog.csdn.net/llzhang_fly/article/details/108676070
At the TCP layer there is a FLAGS field with the following flags: SYN, FIN, ACK, PSH, RST, URG.
The first five are the ones useful in day-to-day analysis; their meanings are:
SYN: establish a connection; FIN: close a connection; ACK: acknowledgment; PSH: data is being pushed; RST: reset the connection.
The flag bits are the TCP control flags, of which there are six: SYN (synchronize, establish a connection), ACK (acknowledgment), PSH (push data), FIN (finish), RST (reset), URG (urgent). There are also the Sequence number and Acknowledgment number fields.
https://zhuanlan.zhihu.com/p/439614017 (important article)
Wireshark packet capture
The local machine sends a request to the IP 61.135.185.32; the capture of this process is shown below.
Three-way handshake
(Client) Packet 1: May I establish a connection with you?
seq=0, marking a brand-new start. There is no ack, because no connection exists yet, so there is no notion of how much data has been received from the other side. Len=0, meaning no data is carried; this is just a TCP packet asking to open a connection.
(Server) Packet 2: Got it, we can connect, come on over.
seq=0. ack=1 implies two things: first, I received your seq=0 connection request; second, please start sending me data from seq=1. Len=0, again no data.
(Client) Packet 3: OK, let's connect then.
seq=1, answering the packet above: I really will start transmitting from seq=1. ack=1, meaning I received your seq=0 agreement to connect; please also start sending me data from seq=1. Len=0.
The three-way handshake is complete and the connection is established.
To summarize the three-way handshake:
the initial seq values are both 0;
ack = the other side's previous seq + 1;
seq equals the other side's last ack number.
Data transmission
(Client) Packet 4: I want your home page.
The client sends an HTTP request; the HTTP request is handled by TCP, handed down to the IP layer, and then sent out by the network card... Note the contents of the TCP segment in frame 4:
seq=1, because no data was transmitted last time the seq number does not change; it is still packet 3's seq=1, len=0.
ack=1, telling the server: if you send data, start from seq=1.
len=77, the number of data bytes I am transmitting this time.
(Server) Packet 5: OK, I have received your request.
seq=1, as required by packet 4's ack.
ack=78, where ack = packet 4's seq + packet 4's len = 1 + 77 = 78, meaning: client, if you send again, start from seq=78.
len=0.
(Server) Packet 6: Here, the data for you.
Packets 5 and 6 are both sent by the server, and no packet was received in between, so naturally their seq and ack are the same:
seq=1
ack=78
len=1440, the length of the data.
(Client) Packet 7: Got it.
seq=78: you asked me to send from 78, so I send from 78.
ack=1441, where 1441 = packet 6's seq + packet 6's len = 1 + 1440, meaning I have received it.
len=0.
The sender's packets carry seq and len; how does the receiver tell the sender that the data has been received? The answer lies in the receiver's ack = the sender's seq + the sender's len. That is the overall picture.
The special case is the three-way handshake: there the client and server send packets with len=0, and then the other side's ack is not seq + 0 but seq + 1.
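A tiny sketch of the rule just stated, checked against the packets in this capture; expected_ack is a hypothetical helper, not part of any library.

# Tiny sketch: the receiver's ack equals the sender's seq plus the sender's len,
# except during the handshake where len == 0 and the ack is seq + 1.
def expected_ack(peer_seq, peer_len):
    return peer_seq + peer_len if peer_len > 0 else peer_seq + 1

print(expected_ack(0, 0))      # 1    -> ack in packet 2 (handshake)
print(expected_ack(1, 77))     # 78   -> ack in packet 5
print(expected_ack(1, 1440))   # 1441 -> ack in packet 7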
Putting it together with the TCP state diagram
https://www.bilibili.com/read/cv15076127/
I have finally finished copying out the option bytes; there really are a lot of them. I used to think they were not important, but now that I have worked through them, coming back to the capture makes the meaning of each field clear.
TCP: indicates the TCP protocol.
Length: 74 indicates the packet is 74 bytes long.
56739 -> 443: indicates the packet is sent from source port 56739 to destination port 443.
[SYN] indicates this is a TCP synchronization request, the first step of the TCP handshake.
Seq=0: the sequence number in the TCP protocol, 0 here.
In TCP the sequence number in the first SYN packet is random, and the sequence number in the first SYN+ACK packet is also random; to make them easier to read, Wireshark displays relative values, initializing both random values to 0, and the later sequence and acknowledgment numbers accumulate on top of them.
Win=29200: the receive window of the side sending this segment; a field in the TCP header.
Len=0: the length of the data portion of this TCP segment.
MSS=1460: maximum segment size, the maximum length of the data field in each TCP segment; it does not include the header length. A field in the TCP header options.
WS=128: window scale factor. It can only be set during connection establishment and cannot change during the connection; the new window value = the window value in the header multiplied by 2 to the power of the scale factor, used because the 16-bit window field is not large enough. A field in the TCP options. ?? The book says the window scale option is 3 bytes, one of which is the shift count S, with S at most 14, so the effective window width grows from 16 to (16 + S) bits; so how can WS be 128 here??
Answer: the 128 here means the window is scaled by a factor of 128, i.e., S = 7 and 2^7 = 128, which satisfies S <= 14. The detailed TCP section in the lower pane of Wireshark explains this (128 is the computed result; the raw value in the option is 7).
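A quick check of the scaling arithmetic from the answer above: the option carries the shift count S, Wireshark shows the multiplier 2^S, and later segments' advertised windows are multiplied by it; the window value 229 below is a hypothetical example.

# Window scale arithmetic from the Q&A above: the option carries the shift count S,
# Wireshark shows the multiplier 2**S, and later segments' windows are scaled by it.
S = 7
print(2 ** S)               # 128, the WS value shown, and S = 7 <= 14
advertised = 229            # hypothetical window field of a later segment
print(advertised * 2 ** S)  # 29312 bytes effective window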
SACK_PERM=1: selective acknowledgment permitted. A field in the TCP options.
https://blog.csdn.net/qq_23350817/article/details/125278377
To change the timestamp display format in the packet list: View --> Time Display Format --> Date and Time of Day. After the change the format looks like this:
Original article: https://blog.csdn.net/llzhang_fly/article/details/108676070
5. Seeing many TCP Retransmission entries in Wireshark
TCP retransmissions usually appear because the target host's port is not open and listening; they are rarely caused by poor network conditions.
If the data sent is not acknowledged within a certain time period (a multiple of the RTT), it is retransmitted to the remote host. The retransmission timeout starts at the RTT and doubles with each retransmission, and it is always bounded by the configured minimum and maximum RTO. If the configured total retransmission timeout has elapsed since the data was first transmitted, the connection is closed, i.e., the state is set to CLOSED. Note that when a socket is closed, a reset is sent in response to any packet received for that port.
When a timeout occurs, all unacknowledged data in the output window is resent. The data is repacketized, so the packets will not be identical to the originals. For example, if a packet with 10 bytes of data is sent, then a packet with 30 bytes of data, and the first packet is lost, there are 40 bytes of unacknowledged data in the output window; when the timeout occurs, all 40 bytes are sent in a single packet (assuming the MSS is at least 40).
If three duplicate acknowledgments are received, the fast retransmit algorithm retransmits the TCP data without waiting for a timeout. RTIP-32 also implements the NewReno fast recovery algorithm defined in RFC 2582.
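A minimal sketch of the retransmission-timeout behaviour described above: the RTO starts at the measured RTT, doubles on each retransmission, and stays clamped between configured minimum and maximum values; all the concrete numbers are assumptions.

# Minimal sketch of exponential RTO backoff: start at the RTT, double per retransmission,
# and clamp between configured minimum and maximum values (all numbers are assumptions).
def rto_schedule(rtt, min_rto, max_rto, retries):
    rto = min(max(rtt, min_rto), max_rto)
    schedule = [rto]
    for _ in range(retries):
        rto = min(rto * 2, max_rto)
        schedule.append(rto)
    return schedule

print(rto_schedule(rtt=0.2, min_rto=0.2, max_rto=8.0, retries=6))
# [0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 8.0]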
Reposted from: https://blog.csdn.net/lemontree1945/article/details/88581516
A similar scenario: when a client connects to a server whose TLS certificate is misconfigured, it gets no reply after contacting the server, i.e., the SYN packet is sent but there is no response.
https://blog.csdn.net/qq_35273499/article/details/79098689
http://www.hankcs.com/nlp/word-vector-representations-word2vec.html
CS224n Notes 2: Word vector representations: word2vec - 码农场 (hankcs.com)