Hierarchical Attention Network (HAN)

Earlier approaches represented text with sparse lexical features, or used CNNs and RNNs/LSTMs.

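A hierarchical attention network can capture the hierarchical structure of a document: 1. Documents have a hierarchical structure (words form sentences, sentences form the document), so we first build sentence representations and then aggregate them into a document representation. 2. Different words and sentences in a document play different roles, and their importance depends heavily on context; for example, the same word can mean different things in different documents.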

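The paper therefore uses two levels of attention, one over words and one over sentences, so that the model can pay more or less attention to individual words and sentences when it builds the document representation.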

Unlike earlier work, this paper uses context to discover when a token is relevant, rather than simply filtering tokens out of context.

Architecture diagram

HAN consists of four parts: a word sequence encoder, a word-level attention layer, a sentence encoder, and a sentence-level attention layer.

1. sequence encoder (GRU-based)

A GRU uses a gating mechanism to track the state of a sequence, without a separate memory cell.

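There are two types of gates, the reset gate r_t and the update gate z_t, which together control how information is written into the state. At time t, the new state is computed as: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t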

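This is a linear interpolation between the previous state h_{t-1} and the candidate state h̃_t computed from the new input. The gate z_t decides how much past information is kept and how much new information is added. z_t is updated as: z_t = σ(W_z x_t + U_z h_{t-1} + b_z)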

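x_t is the sequence vector at time t. The candidate state h̃_t is computed in a way similar to a traditional RNN: h̃_t = tanh(W_h x_t + r_t ⊙ (U_h h_{t-1}) + b_h)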

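r_t is the reset gate, which controls how much the past state contributes to the candidate state. If r_t is 0, the previous state is forgotten. The reset gate is updated as: r_t = σ(W_r x_t + U_r h_{t-1} + b_r)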
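The GRU update described in this section can be sketched in NumPy as follows. This is only an illustrative sketch: the parameter dictionary, its key names (`Wz`, `Uz`, ...), the helper `init_gru_params`, and the toy dimensions are assumptions made for the example, not something taken from the paper's code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update. `p` is a dict of weights (key names are illustrative)."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])             # update gate z_t
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])             # reset gate r_t
    h_cand = np.tanh(p["Wh"] @ x_t + r * (p["Uh"] @ h_prev) + p["bh"])  # candidate state
    return (1.0 - z) * h_prev + z * h_cand                              # linear interpolation

def init_gru_params(input_dim, hidden_dim, rng):
    """Hypothetical helper: small random weights, zero biases."""
    p = {}
    for g in "zrh":
        p["W" + g] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p["U" + g] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p["b" + g] = np.zeros(hidden_dim)
    return p

rng = np.random.default_rng(0)
params = init_gru_params(input_dim=4, hidden_dim=3, rng=rng)
xs = rng.normal(size=(5, 4))   # a toy sequence of 5 input vectors
h = np.zeros(3)                # initial state
for x in xs:
    h = gru_step(x, h, params)
```

When the update gate is close to 0 the state is carried over almost unchanged, and when the reset gate is close to 0 the candidate state depends only on the current input.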

2. hierarchical attention: use the hierarchical structure to build the document-level vector

2.1 word encoder

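First, the words are embedded into vectors through an embedding matrix. A bidirectional GRU then produces an annotation for each word by summarizing the information from both directions around it.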

We obtain the annotation for a given word w_it by concatenating the forward hidden state →h_it and the backward hidden state ←h_it, i.e. h_it = [→h_it ; ←h_it].

(The word embeddings are used directly as input here; a more complete model could use a GRU to build the word vectors themselves.)
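A minimal sketch of the word encoder, assuming NumPy: look up the embeddings, run one recurrence left to right and another right to left, and concatenate the two hidden states at each position. To keep the example short, a plain tanh recurrence stands in for the GRU; `make_tanh_step`, `run_rnn`, `bidirectional_annotations`, and all sizes are hypothetical names and values chosen for illustration.

```python
import numpy as np

def run_rnn(step, xs, h0):
    """Apply `step(x, h) -> h` along a sequence and return every hidden state."""
    h, out = h0, []
    for x in xs:
        h = step(x, h)
        out.append(h)
    return np.stack(out)

def bidirectional_annotations(xs, fwd_step, bwd_step, hidden_dim):
    h0 = np.zeros(hidden_dim)
    fwd = run_rnn(fwd_step, xs, h0)              # reads the first word to the last
    bwd = run_rnn(bwd_step, xs[::-1], h0)[::-1]  # reads last to first, then re-aligned
    return np.concatenate([fwd, bwd], axis=1)    # h_it = [forward ; backward]

def make_tanh_step(rng, input_dim, hidden_dim):
    """Stand-in for a GRU step (assumption made for brevity)."""
    W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    return lambda x, h: np.tanh(W @ x + U @ h)

rng = np.random.default_rng(0)
vocab_size, emb_dim, hid = 10, 4, 3
W_e = rng.normal(size=(vocab_size, emb_dim))   # embedding matrix
fwd_step = make_tanh_step(rng, emb_dim, hid)
bwd_step = make_tanh_step(rng, emb_dim, hid)

word_ids = np.array([1, 5, 2, 7])              # one sentence of 4 word ids
xs = W_e[word_ids]                             # embedded words, shape (4, emb_dim)
H = bidirectional_annotations(xs, fwd_step, bwd_step, hid)   # shape (4, 2 * hid)
```

Each row of `H` sees the whole sentence: the first half of the row summarizes the words up to that position, the second half the words from that position onward.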

2.2 Word Attention

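An attention mechanism extracts the words that matter most to the meaning of the sentence, then aggregates the representations of those words into a sentence vector.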

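1. First, the word annotation h_it is fed through a one-layer MLP to get u_it, its hidden representation: u_it = tanh(W_w h_it + b_w)

2. Then the importance of each word is measured as the similarity between u_it and a word-level context vector u_w, and a softmax turns it into a normalized importance weight α_it: α_it = exp(u_it^T u_w) / Σ_t exp(u_it^T u_w)

3. The sentence vector s_i is the weighted sum of the word annotations: s_i = Σ_t α_it h_it

4. The context vector u_w can be seen as a high-level representation of the fixed query "which word is informative"; it is randomly initialized and learned jointly with the rest of the model.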
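The attention steps above can be sketched in NumPy as below. The function name `attention_pool`, the variable names, and the random toy inputs are assumptions for illustration; in the real model the parameters `W_w`, `b_w`, and `u_w` are trained, not sampled.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())   # subtract the max for numerical stability
    return e / e.sum()

def attention_pool(H, W, b, u_ctx):
    """H: (T, d) annotations. Returns the pooled vector (d,) and the weights (T,)."""
    U = np.tanh(H @ W.T + b)      # step 1: hidden representation of each annotation
    alpha = softmax(U @ u_ctx)    # step 2: similarity to the context vector, normalized
    return alpha @ H, alpha       # step 3: weighted sum of the annotations

rng = np.random.default_rng(0)
T, d, k = 5, 6, 4                 # 5 words, annotation size 6, attention size 4
H = rng.normal(size=(T, d))       # word annotations of one sentence
W_w = rng.normal(size=(k, d))
b_w = np.zeros(k)
u_w = rng.normal(size=k)          # word-level context vector

s_i, alpha = attention_pool(H, W_w, b_w, u_w)
```

The weights `alpha` are positive and sum to 1, so `s_i` is a convex combination of the word annotations; a zero context vector gives uniform weights and reduces the pooling to a plain average.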

2.3 sentence encoder

2.4 sentence attention

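v is the document vector. It is computed like the sentence vector, one level up: each sentence annotation h_i gets a weight α_i from a softmax over its similarity to a sentence-level context vector u_s, and v = Σ_i α_i h_i.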
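A rough sketch of how the two attention levels compose, assuming NumPy: word attention turns each sentence into a vector, and sentence attention turns those vectors into the document vector v. To stay short, the sketch skips both encoders (the sentence vectors are used directly as sentence annotations), which is a simplification and not what the full model does. The helper `attend` and all parameter values are illustrative.

```python
import numpy as np

def attend(H, W, b, u_ctx):
    """Attention pooling over the rows of H; returns (pooled vector, weights)."""
    scores = np.tanh(H @ W.T + b) @ u_ctx
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ H, alpha

rng = np.random.default_rng(0)
d, k = 6, 5
word_params = (rng.normal(size=(k, d)), np.zeros(k), rng.normal(size=k))  # W_w, b_w, u_w
sent_params = (rng.normal(size=(k, d)), np.zeros(k), rng.normal(size=k))  # W_s, b_s, u_s

# A toy document with 3 sentences of 4, 2 and 5 words.
# Each row stands for one word annotation.
doc = [rng.normal(size=(T, d)) for T in (4, 2, 5)]

# Word level: one sentence vector per sentence.
S = np.stack([attend(H_i, *word_params)[0] for H_i in doc])   # shape (3, d)

# Sentence level. Assumption: the sentence encoder is skipped, so S is used as-is.
v, alpha_sent = attend(S, *sent_params)                       # document vector, shape (d,)
```

Because the pooling works on a variable number of rows, sentences of different lengths and documents with different numbers of sentences need no padding in this sketch.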

2.5 document classification
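A minimal sketch of the classification step, assuming NumPy: the document vector v goes through a linear layer and a softmax, and training minimizes the negative log-likelihood of the correct labels. The names `classify`, `nll`, `W_c`, `b_c` and the toy data are assumptions for illustration.

```python
import numpy as np

def classify(v, W_c, b_c):
    """Class probabilities: softmax(W_c v + b_c)."""
    logits = W_c @ v + b_c
    e = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return e / e.sum()

def nll(probs, labels):
    """Negative log-likelihood of the correct labels, summed over documents."""
    return -sum(np.log(p[j]) for p, j in zip(probs, labels))

rng = np.random.default_rng(0)
d, num_classes = 6, 4
W_c = rng.normal(scale=0.1, size=(num_classes, d))
b_c = np.zeros(num_classes)

doc_vectors = rng.normal(size=(3, d))   # stand-ins for three document vectors v
labels = [0, 2, 1]                      # toy gold labels

probs = [classify(v, W_c, b_c) for v in doc_vectors]
loss = nll(probs, labels)
```

With all-zero weights the prediction is uniform, so the loss per document starts at log(num_classes) and should fall as training proceeds.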