Contents
1. N-gram Language Models
2. Recurrent Neural Networks
2.1 RNN Unrolled
2.2 RNN Training
2.3 (Simple) RNN for Language Model
2.4 RNN Language Model: Training
2.5 RNN Language Model: Generation
3. Long Short-term Memory Networks
3.1 Language Model… Solved?
3.2 Long Short-term Memory (LSTM)
3.3 Gating Vector
3.4 Simple RNN vs. LSTM
3.5 LSTM: Forget Gate
3.6 LSTM: Input Gate
3.7 LSTM: Update Memory Cell
3.8 LSTM: Output Gate
3.9 LSTM: Summary
4. Applications
4.1 Shakespeare Generator
4.2 Wikipedia Generator
4.3 Code Generator
4.4 Deep-Speare
4.5 Text Classification
4.6 Sequence Labeling
4.7 Variants
4.8 Multi-layer LSTM
4.9 Bidirectional LSTM
5. Final Words
- Can be implemented using counts (with smoothing)
- Can be implemented using feed-forward neural networks
- Generates sentences like (trigram model):
- Problem: limited context
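The count-based variant can be sketched as follows. This is a minimal illustration, not the notes' own implementation: it assumes add-k smoothing (the notes only say "with smoothing"), and the toy corpus, the `<s>`/`</s>` padding tokens, and the function names are all hypothetical.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-token contexts over tokenised sentences."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        # Pad so the first real token also has a two-token context.
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return trigrams, contexts, vocab

def trigram_prob(word, context, trigrams, contexts, vocab, k=1.0):
    """Add-k smoothed P(word | context); context is the previous two tokens."""
    return (trigrams[context + (word,)] + k) / (contexts[context] + k * len(vocab))

# Hypothetical toy corpus, for illustration only.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
trigrams, contexts, vocab = train_trigram(corpus)
p_sat = trigram_prob("sat", ("the", "cat"), trigrams, contexts, vocab)
p_dog = trigram_prob("dog", ("the", "cat"), trigrams, contexts, vocab)
```

The limited-context problem is visible in the signature of `trigram_prob`: whatever came before the previous two tokens cannot influence the prediction.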
- Allow representation of arbitrarily sized inputs
- Core idea: process the input sequence one element at a time by applying a recurrence formula
- Use a state vector to represent the context that has been processed so far
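A minimal sketch of such a recurrence is given here. It assumes the common form h_t = tanh(W_hh h_{t-1} + W_xh x_t + b); the notes do not spell out the formula at this point, so the activation, the weight names, and the toy values are assumptions made for illustration.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(h_prev, x, W_hh, W_xh, b):
    """One recurrence step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    from_state = matvec(W_hh, h_prev)
    from_input = matvec(W_xh, x)
    return [math.tanh(s + i + bias) for s, i, bias in zip(from_state, from_input, b)]

def rnn_encode(inputs, W_hh, W_xh, b):
    """Apply the same step to every element; works for any sequence length."""
    h = [0.0] * len(b)  # initial state vector
    for x in inputs:
        h = rnn_step(h, x, W_hh, W_xh, b)
    return h

# Hypothetical weights: state size 2, input size 3.
W_hh = [[0.5, -0.1], [0.2, 0.3]]
W_xh = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
b = [0.0, 0.1]
short_seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_seq = short_seq * 5
h_short = rnn_encode(short_seq, W_hh, W_xh, b)
h_long = rnn_encode(long_seq, W_hh, W_xh, b)
```

Both sequences are summarised by a state vector of the same size, which is what lets the model accept inputs of arbitrary length.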
- An unrolled RNN is just a very deep neural network
- But parameters are shared across all time steps
- To train an RNN, we just need to create the unrolled computation graph for a given input sequence
- And use the backpropagation algorithm to compute gradients as usual
- This procedure is called backpropagation through time (BPTT)
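To make the procedure concrete, here is a small sketch of backpropagation through time for a scalar RNN. The model h_t = tanh(w h_{t-1} + u x_t), the squared-error loss on the final state, and all values are assumptions chosen to keep the example short. Because w and u are shared across time steps, their gradients are summed over every step of the unrolled graph.

```python
import math

def forward(w, u, xs):
    """Unroll the scalar RNN; hs[t] is the state after reading xs[t - 1]."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, target):
    """Squared error between the final state and a target value."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt(w, u, xs, target):
    """Gradients of the loss w.r.t. the shared parameters w and u."""
    hs = forward(w, u, xs)
    grad_w = grad_u = 0.0
    dh = hs[-1] - target  # dL/dh_T
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)  # back through tanh at step t
        grad_w += da * hs[t - 1]      # shared w: accumulate over steps
        grad_u += da * xs[t - 1]      # shared u: accumulate over steps
        dh = da * w                   # pass the gradient to h_{t-1}
    return grad_w, grad_u

# Hypothetical values, then a finite-difference check of the result.
w, u, target = 0.8, 0.5, 0.3
xs = [1.0, -0.5, 0.25, 2.0]
grad_w, grad_u = bptt(w, u, xs, target)
eps = 1e-6
numeric_w = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
numeric_u = (loss(w, u + eps, xs, target) - loss(w, u - eps, xs, target)) / (2 * eps)
```

The analytic gradients should agree with the finite-difference estimates `numeric_w` and `numeric_u` up to numerical error.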
- In principle, an RNN can model unbounded context
- But can it actually capture long-range dependencies in practice?
- No… due to “vanishing gradients”
- Gradients from later steps diminish quickly as they are backpropagated to earlier steps
- Earlier inputs therefore contribute very little to the parameter updates
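The effect can be illustrated with a scalar tanh RNN, h_t = tanh(w h_{t-1} + u x_t), which is an assumed toy model rather than anything specified in these notes. The gradient of the final state with respect to an earlier state is a product of per-step factors w (1 - h_k^2); when |w| < 1, every factor has magnitude below 1, so the product shrinks roughly geometrically with distance.

```python
import math

def gradient_to_earlier_steps(w, u, xs):
    """Return |d h_T / d h_t| for t = T, T-1, ..., 0 in a scalar tanh RNN."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    grads = [1.0]  # d h_T / d h_T
    for t in range(len(xs), 0, -1):
        # Chain rule: one more factor w * (1 - h_t^2) per step back in time.
        grads.append(grads[-1] * w * (1.0 - hs[t] ** 2))
    return [abs(g) for g in grads]

# Hypothetical 50-step input and weights.
xs = [1.0, -1.0] * 25
grads = gradient_to_earlier_steps(w=0.9, u=0.5, xs=xs)
nearest, farthest = grads[1], grads[-1]
```

Here `grads[k]` is the gradient magnitude k steps back from the end, so `farthest` (the gradient reaching the initial state) is far smaller than `nearest`.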
- LSTM was introduced to address vanishing gradients
- Core idea: have "memory cells" that preserve gradients across time
- Access to the memory cells is controlled by "gates"
- For each input, a gate decides:
- A gate g is a vector
- g is multiplied component-wise with a vector v to determine how much of the information in v to keep
- Use sigmoid function to produce g:
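The gating idea can be sketched in a few lines. In this sketch the pre-activation values fed to the sigmoid are hypothetical inputs; in an LSTM they would be computed from the current input and the previous state, which is not shown here.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gate(pre_activations):
    """Produce a gating vector g with every component in (0, 1)."""
    return [sigmoid(z) for z in pre_activations]

def apply_gate(g, v):
    """Component-wise product: g_i near 0 drops v_i, g_i near 1 keeps it."""
    return [gi * vi for gi, vi in zip(g, v)]

# Hypothetical pre-activations: strongly negative, neutral, strongly positive.
g = gate([-10.0, 0.0, 10.0])
v = [2.0, 2.0, 2.0]
kept = apply_gate(g, v)
```

With these values the first component of `v` is almost entirely dropped, the second is halved, and the third is kept almost unchanged.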
- Training data = all works of Shakespeare
- Model: character RNN, hidden dimension = 512
- Training data = 100 MB of raw Wikipedia data
Pros
- Can capture long-range context
- Flexible, just like feed-forward networks
Cons
- Slower than feed-forward networks because of sequential processing
- In practice, does not capture long-range dependencies very well (evident when generating very long text)
- In practice, also does not stack well (multi-layer LSTM)
- Less popular nowadays due to the emergence of more advanced architectures