Paper: https://arxiv.org/pdf/2010.16056v2.pdf
Code (including datasets): https://github.com/cuhksz-nlp/R2Gen/
treat the input from a radiology image as the source sequence $\mathbf{X}=\{\mathbf{X}_1,\mathbf{X}_2,...,\mathbf{X}_S\}$, $\mathbf{X}_S \in \mathbb{R}^d$
Visual Extractor
given a radiology image $Img$
process:
$\{\mathbf{X}_1,\mathbf{X}_2,...,\mathbf{X}_S\}=f_v(Img)$
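As a concrete illustration (not the repo's actual code), here is a minimal sketch of the visual extractor $f_v$, assuming a pretrained ResNet-101 backbone and a recent torchvision; the class name and shapes are illustrative.

```python
import torch
import torchvision

# A minimal sketch of f_v, assuming a pretrained ResNet-101 backbone (illustrative, not the repo's API).
class VisualExtractor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        # drop the average pooling and classification head, keep the conv feature maps
        self.backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, img):                              # img: (B, 3, H, W)
        feat = self.backbone(img)                        # (B, d, h, w), d = 2048 for ResNet-101
        B, d, h, w = feat.shape
        # flatten the spatial grid into S = h*w patch features {X_1, ..., X_S}, each in R^d
        return feat.view(B, d, h * w).permute(0, 2, 1)   # (B, S, d)
```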
the standard encoder from the Transformer
process:
$\{\mathbf{h}_1,\mathbf{h}_2,...,\mathbf{h}_S\}=f_e(\mathbf{X}_1,\mathbf{X}_2,...,\mathbf{X}_S)$
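For illustration, a minimal sketch of $f_e$ built from PyTorch's standard Transformer encoder; the layer sizes here are assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 3               # illustrative sizes, not the paper's exact setting
proj = nn.Linear(2048, d_model)                      # project patch features X_s into the model dimension
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_layers,
)

def f_e(X):                                          # X: (B, S, 2048) patch features from f_v
    return encoder(proj(X))                          # hidden states {h_1, ..., h_S}: (B, S, d_model)
```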
introduce an extra memory module (Relational Memory) to the Transformer by improving the original layer normalization with MCLN for each decoding layer
Transformer introduction: https://zhuanlan.zhihu.com/p/82312421
process:
$y_t=f_d(\mathbf{h}_1,...,\mathbf{h}_S,\text{MCLN}(\text{RM}(y_1,...,y_{t-1})))$
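A minimal sketch of one decoding step under this formulation, assuming `rm`, `mcln`, `decoder_layer`, and `vocab_proj` are modules along the lines of the sketches further below; all names and shapes are illustrative.

```python
import torch

def decode_step(h, y_prefix_emb, M_prev, rm, mcln, decoder_layer, vocab_proj):
    # h: encoder outputs (B, S, d); y_prefix_emb: embeddings of y_1..y_{t-1}, shape (B, t-1, d)
    M_t = rm(M_prev, y_prefix_emb[:, -1:])           # update the relational memory with y_{t-1}
    dec_out = decoder_layer(y_prefix_emb, h)         # masked self-attention + cross-attention over h
    dec_out = mcln(dec_out, M_t.flatten(1))          # memory-driven conditional layer normalization
    logits = vocab_proj(dec_out[:, -1])              # scores for the next token y_t
    return torch.log_softmax(logits, dim=-1), M_t    # log p(y_t | y_1..y_{t-1}, Img), updated memory
```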
entire generation process can be formalized as a recursive application of the chain rule:
$p(Y|Img)=\prod_{t=1}^T p(y_t|y_1,...,y_{t-1},Img)$
model - maximize $p(Y|Img)$ through the negative conditional log-likelihood of $Y$:
$\theta^*=\arg\max_\theta\sum^T_{t=1}\log p(y_t|y_1,...,y_{t-1},Img;\theta)$
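A minimal sketch of this objective under teacher forcing; `pad_id` and the shapes are illustrative assumptions.

```python
import torch.nn.functional as F

def nll_loss(logits, targets, pad_id=0):
    # logits: (B, T, vocab_size) for y_1..y_T under teacher forcing; targets: (B, T) gold token ids
    # cross-entropy over the flattened sequence is exactly the negative conditional log-likelihood above
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,                         # do not penalize padded positions
    )
```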
Relational Memory - modeling patternized information
relevant images may share similar patterns in their reports
$H$ sets of queries, keys and values via 3 linear transformations
$\mathbf{Q}=\mathbf{M}_{t-1}\cdot\mathbf{W}_\mathbf{q}$
$\mathbf{K}=[\mathbf{M}_{t-1};\mathbf{y}_{t-1}]\cdot\mathbf{W}_\mathbf{k}$
$\mathbf{V}=[\mathbf{M}_{t-1};\mathbf{y}_{t-1}]\cdot\mathbf{W}_\mathbf{v}$
Multi-head attention is used to model $\mathbf{Q},\mathbf{K},\mathbf{V}$ so as to depict relations of different patterns
result:
$\mathbf{Z}=\text{softmax}(\mathbf{Q}\mathbf{K}^\mathrm{T}/\sqrt{d_k})\cdot\mathbf{V}$
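A minimal single-head sketch of this query/key/value construction and attention (the paper uses $H$ heads); parameter names and shapes are illustrative.

```python
import torch

def memory_attention(M_prev, y_prev, W_q, W_k, W_v):
    # M_prev: memory M_{t-1}, (B, num_slots, d); y_prev: embedding of y_{t-1}, (B, 1, d)
    d_k = W_k.size(-1)
    concat = torch.cat([M_prev, y_prev], dim=1)      # [M_{t-1}; y_{t-1}]
    Q = M_prev @ W_q                                 # queries come from the memory only
    K = concat @ W_k
    V = concat @ W_v
    attn = torch.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return attn @ V                                  # Z: (B, num_slots, d_k)
```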
Considering that the relational memory is performed in a recurrent manner along with the decoding process, it potentially suffers from gradient vanishing and exploding
solution: introduce residual connections and a gate mechanism
$\tilde{\mathbf{M}}_t=f_{mlp}(\mathbf{Z}+\mathbf{M}_{t-1})+\mathbf{Z}+\mathbf{M}_{t-1}$
$f_{mlp}(\cdot)$: multi-layer perceptron (MLP)
forget & input gates: balance the inputs from $\mathbf{M}_{t-1}$ and $\mathbf{y}_{t-1}$
formalized as:
$\mathbf{G}_t^f=\mathbf{Y}_{t-1}\mathbf{W}^f+\text{tanh}(\mathbf{M}_{t-1})\cdot\mathbf{U}^f$
$\mathbf{G}_t^i=\mathbf{Y}_{t-1}\mathbf{W}^i+\text{tanh}(\mathbf{M}_{t-1})\cdot\mathbf{U}^i$
final output of the gate mechanism:
$\mathbf{M}_t=\sigma(\mathbf{G}_t^f)\odot\mathbf{M}_{t-1}+\sigma(\mathbf{G}^i_t)\odot\text{tanh}(\tilde{\mathbf{M}}_t)$
$\odot$: Hadamard product
Hadamard product reference: https://baike.baidu.com/item/%E5%93%88%E8%BE%BE%E7%8E%9B%E7%A7%AF/18894493?fr=aladdin
$\sigma$: sigmoid function
$\mathbf{M}_t$: output of the entire relational memory module at step $t$
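A minimal sketch of the residual connection and gate mechanism above, under the same shape assumptions as the attention sketch; `mlp` and the gate weights are illustrative parameters.

```python
import torch

def memory_update(Z, M_prev, Y_prev, mlp, W_f, U_f, W_i, U_i):
    # Z, M_prev: (B, num_slots, d); Y_prev: embedding of y_{t-1} expanded to (B, num_slots, d)
    M_tilde = mlp(Z + M_prev) + Z + M_prev           # residual branch: M~_t = MLP(Z + M_{t-1}) + Z + M_{t-1}
    G_f = Y_prev @ W_f + torch.tanh(M_prev) @ U_f    # forget gate
    G_i = Y_prev @ W_i + torch.tanh(M_prev) @ U_i    # input gate
    # gated combination of the old memory and the candidate memory gives M_t
    return torch.sigmoid(G_f) * M_prev + torch.sigmoid(G_i) * torch.tanh(M_tilde)
```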
MLP: used to predict a change $\Delta\gamma_t$ on $\gamma$ from $\mathbf{m}_t$, and update it via:
$\Delta\gamma_t=f_{mlp}(\mathbf{m}_t)$
$\tilde{\gamma}_t=\gamma+\Delta\gamma_t$
$\Delta\beta_t$ and $\tilde{\beta}_t$ are obtained in the same way:
$\Delta\beta_t=f_{mlp}(\mathbf{m}_t)$
$\tilde{\beta}_t=\beta+\Delta\beta_t$
then the predicted $\tilde{\beta}_t$ and $\tilde{\gamma}_t$ are applied to the mean and variance results of the multi-head self-attention over the previously generated outputs:
$f_{mcln}(\mathbf{r})=\tilde{\gamma}_t\odot\frac{\mathbf{r}-\mu}{v}+\tilde{\beta}_t$, where $\mu$ and $v$ are the mean and standard deviation of $\mathbf{r}$
Memory-driven Conditional Layer Normalization (MCLN)
3 MCLNs in each Transformer decoding layer
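A minimal sketch of an MCLN module, assuming $\mathbf{m}_t$ is the flattened relational memory at step $t$ and $\mathbf{r}$ is a multi-head self-attention output; module and parameter names are illustrative, and the two MLPs for $\Delta\gamma_t$ and $\Delta\beta_t$ are kept separate (no shared parameters).

```python
import torch
import torch.nn as nn

class MCLN(nn.Module):
    def __init__(self, d_model, d_memory, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # ordinary LayerNorm scale γ
        self.beta = nn.Parameter(torch.zeros(d_model))   # ordinary LayerNorm shift β
        self.mlp_gamma = nn.Linear(d_memory, d_model)    # predicts Δγ_t from m_t
        self.mlp_beta = nn.Linear(d_memory, d_model)     # predicts Δβ_t from m_t (separate parameters)
        self.eps = eps

    def forward(self, r, m_t):
        # r: (B, T, d_model) self-attention output; m_t: (B, d_memory) flattened memory at step t
        gamma_t = (self.gamma + self.mlp_gamma(m_t)).unsqueeze(1)   # γ~_t = γ + Δγ_t
        beta_t = (self.beta + self.mlp_beta(m_t)).unsqueeze(1)      # β~_t = β + Δβ_t
        mu = r.mean(-1, keepdim=True)
        std = r.std(-1, unbiased=False, keepdim=True)
        return gamma_t * (r - mu) / (std + self.eps) + beta_t
```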
datasets: IU X-RAY & MIMIC-CXR
baselines:
BASE: vanilla Transformer
BASE+RM: the relational memory is directly concatenated to the output of the Transformer ahead of the softmax at each time step
learning rates: 5e-5 for the visual extractor and 1e-4 for the other parameters
for MCLN: use two MLPs to obtain $\Delta\gamma_t$ and $\Delta\beta_t$, where they do not share parameters
hyper-parameters & generation results
$|S|\in\{1,2,3,4\}$: numbers of memory slots
memory provides more detailed information for the generation process
2 important factors to enhance radiology report generation: the relational memory (RM) and applying it to layer normalization (MCLN)
BASE+RM+MCLN: covers almost all of the necessary medical terms in the ground-truth reports
the intermediate image-text correspondences for several words from the multi-head attentions in the first layer of the decoders:
our model: improves the interaction between the images and the generated texts
our model is able to generate long reports with necessary medical terms and meaningful image-text attention mappings
Reference: https://blog.csdn.net/c9Yv2cf9I06K2A9E/article/details/114695686