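Paper: https://arxiv.org/pdf/2106.06963.pdf
Reference: https://blog.csdn.net/qq_45645521/article/details/123493075
Prior knowledge: these persimmons have turned red, so they must be ripe.
Posterior knowledge: I just ate one of the persimmons, and it is fully ripe.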
Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED)
first examine the abnormal regions
assign the disease topic tags
include modules:
directly applying image captioning approaches to radiology images has problems:
encoder-decoder framework - translates the image to a single descriptive sentence
radiology report generation - aims to generate a long paragraph - consists of multiple structural sentences
explore and distill the posterior and prior knowledge for accurate radiology report generation
PoKE (Posterior Knowledge Explorer) + PrKE (Prior Knowledge Explorer) + MKD (Multi-domain Knowledge Distiller)
$$\text{PoKE}:\{I,T\}\to I'; \\ \text{PrKE}:\{I',W_{\text{Pr}}\}\to W'_{\text{Pr}};\ \{I',G_{\text{Pr}}\}\to G'_{\text{Pr}} \\ \text{MKD}:\{I',W'_{\text{Pr}},G'_{\text{Pr}}\}\to R$$
$I$: adopt the ResNet-152 to extract 2048 $7\times 7$ image feature maps, which are further projected into 512 $7\times 7$ feature maps, resulting in $I=\{i_1,i_2,...,i_{N_1}\}\in \mathbb{R}^{N_1 \times d}$ $(N_1=49,\ d=512)$
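To make the shapes concrete, here is a minimal NumPy sketch of the flatten-and-project step. The random array stands in for the ResNet-152 output, and `W_proj` is a hypothetical projection matrix (random here, learned in a real model); neither comes from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the ResNet-152 output: 2048 feature maps of size 7x7.
feature_maps = rng.standard_normal((2048, 7, 7))

# Flatten the 7x7 grid into N_1 = 49 region vectors of dimension 2048.
regions = feature_maps.reshape(2048, 49).T  # (49, 2048)

# Hypothetical projection 2048 -> d = 512 (random here, learned in practice).
W_proj = rng.standard_normal((2048, 512)) * 0.01
I = regions @ W_proj

print(I.shape)  # (49, 512)
```

Each row of `I` corresponds to one spatial location of the 7x7 grid, which is why $N_1=49$.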
$T$: topic bag (common abnormality topics or findings)
$W_{\text{Pr}}$: the reports of the top-$N_K$ retrieved images are returned and encoded as $W_{\text{Pr}}=\{R_1,R_2,...,R_{N_K}\}\in\mathbb{R}^{N_K\times d}$
$G_{\text{Pr}}$: the prior medical knowledge used by PrKE
The MHA consists of n parallel heads and each head is defined as a scaled dot-product attention:
$$\text{Att}_i(X,Y)=\text{softmax}\left(\frac{X\text{W}_i^\text{Q}(Y\text{W}_i^\text{K})^T}{\sqrt{d_n}}\right)Y\text{W}_i^\text{V} \\ \text{MHA}(X,Y)=[\text{Att}_1(X,Y);...;\text{Att}_n(X,Y)]\text{W}^{\text{O}}$$
$X\in\mathbb{R}^{l_x \times d}$: the Query matrix
$Y\in\mathbb{R}^{l_y \times d}$: the Key/Value matrix
$\text{W}_i^\text{Q},\text{W}_i^\text{K},\text{W}_i^\text{V}\in\mathbb{R}^{d\times d_n}$, $\text{W}^\text{O}\in \mathbb{R}^{d\times d}$: learnable parameters
$d_n=d/n$
$[\cdot;\cdot]$: concatenation operation
Concatenation operation: https://blog.csdn.net/Frank_LJiang/article/details/104333272
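The MHA definition above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the weights are random, the head count `n = 8` is an assumed value, and the per-head projections are stacked into arrays of shape `(n, d, d_n)` for convenience.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mha(X, Y, W_q, W_k, W_v, W_o):
    """X: (l_x, d) queries; Y: (l_y, d) keys/values.
    W_q, W_k, W_v: (n, d, d_n) per-head projections; W_o: (d, d)."""
    n, d, d_n = W_q.shape
    heads = []
    for i in range(n):
        Q, K, V = X @ W_q[i], Y @ W_k[i], Y @ W_v[i]
        weights = softmax(Q @ K.T / np.sqrt(d_n))  # (l_x, l_y)
        heads.append(weights @ V)                  # (l_x, d_n)
    # Concatenate the n heads along the feature axis, then apply W^O.
    return np.concatenate(heads, axis=-1) @ W_o    # (l_x, d)

rng = np.random.default_rng(0)
d, n = 512, 8          # n = 8 is an assumed head count
d_n = d // n
W_q, W_k, W_v = (rng.standard_normal((n, d, d_n)) * 0.02 for _ in range(3))
W_o = rng.standard_normal((d, d)) * 0.02

X = rng.standard_normal((49, d))  # e.g. 49 image regions as queries
Y = rng.standard_normal((20, d))  # e.g. 20 items as keys/values
out = mha(X, Y, W_q, W_k, W_v, W_o)
print(out.shape)  # (49, 512)
```

The output always has as many rows as the query matrix, regardless of how many key/value rows are supplied.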
$$\text{FFN}(x)=\max(0,x\text{W}_\text{f}+\text{b}_\text{f})\text{W}_\text{ff}+\text{b}_\text{ff}$$
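As a quick check of the formula, the feed-forward network is a ReLU layer followed by a linear layer. The tiny weights below are made-up values chosen so the result can be verified by hand.

```python
import numpy as np

def ffn(x, W_f, b_f, W_ff, b_ff):
    # max(0, x W_f + b_f) W_ff + b_ff
    return np.maximum(0.0, x @ W_f + b_f) @ W_ff + b_ff

x = np.array([[1.0, -2.0]])
W_f = np.eye(2)                  # identity, so the ReLU sees x directly
b_f = np.zeros(2)
W_ff = np.array([[1.0], [1.0]])  # sums the two hidden units
b_ff = np.array([0.5])

y = ffn(x, W_f, b_f, W_ff, b_ff)
print(y)  # [[1.5]]
```

The negative entry is zeroed by the ReLU, so the output is 1 + 0 + 0.5 = 1.5.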
apply MHA to correlate the posterior and prior knowledge for the input radiology image, and to distill useful knowledge for generating accurate reports
extract the posterior knowledge from the input image (abnormal regions)
$$\hat{T}=\text{FFN}(\text{MHA}(I,T)); \\ \hat{I}=\text{FFN}(\text{MHA}(\hat{T},I));$$
the image features $I\in\mathbb{R}^{N_1\times d}$ are first used to find the most relevant topics and filter out the irrelevant topics, resulting in $\hat{T}\in\mathbb{R}^{N_1\times d}$. Then the attended topics $\hat{T}$ are further used to mine topic-related image features $\hat{I}\in\mathbb{R}^{N_1\times d}$
use the abnormality topics contained in the topic bag to locate the abnormal regions in the image
align the attended abnormal regions with the relevant topics
since $\hat{I}$ and $\hat{T}$ are aligned, we directly add them up to acquire the posterior knowledge of the input image:
$$I'=\text{LayerNorm}(\hat{I}+\hat{T})$$
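The three PoKE equations can be traced end to end with a small NumPy sketch. Several simplifications are assumed here: a single attention head stands in for MHA, FFN biases and the learnable LayerNorm gain/bias are omitted, all weights are random, `d = 64` is used instead of 512, and the topic bag size of 20 is a made-up value.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # small d for the demo; the notes use d = 512

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_attention():
    # Single-head stand-in for MHA(X, Y) with its own random projections.
    W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    def attend(X, Y):
        weights = softmax((X @ W_q) @ (Y @ W_k).T / np.sqrt(d))
        return weights @ (Y @ W_v)
    return attend

def make_ffn():
    W_f, W_ff = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(2))
    return lambda x: np.maximum(0.0, x @ W_f) @ W_ff  # biases omitted

def layer_norm(x, eps=1e-5):
    # LayerNorm without the learnable gain and bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

attend_topics, ffn_topics = make_attention(), make_ffn()
attend_image, ffn_image = make_attention(), make_ffn()

def poke(I, T):
    T_hat = ffn_topics(attend_topics(I, T))    # (N_1, d): topics relevant to each region
    I_hat = ffn_image(attend_image(T_hat, I))  # (N_1, d): topic-related image features
    return layer_norm(I_hat + T_hat)           # I'

I = rng.standard_normal((49, d))  # N_1 = 49 image regions
T = rng.standard_normal((20, d))  # hypothetical topic bag with 20 topics
I_prime = poke(I, T)
print(I_prime.shape)  # (49, 64)
```

Because the image features act as the query in the first attention step, $\hat{T}$ has one row per image region, which is what makes the element-wise sum $\hat{I}+\hat{T}$ well defined.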
PrKE consists of a Prior Working Experience component and a Prior Medical Knowledge component
$$W'_{\text{Pr}}=\text{FFN}(\text{MHA}(I',W_{\text{Pr}})) \\ G'_{\text{Pr}}=\text{FFN}(\text{MHA}(I',G_{\text{Pr}}))$$
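By processing the posterior knowledge from PoKE with these two components, we obtain the prior knowledge for the abnormal regions of the input image.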
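A shape-level sketch of the two PrKE components may help here. It is deliberately simplified: a single attention head with identity projections stands in for MHA, the FFN is left out, and the sizes `N_K = 5` and `N_G = 20` as well as all inputs are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N_1 = 64, 49   # small d for the demo; the notes use d = 512
N_K, N_G = 5, 20  # hypothetical: number of retrieved reports / knowledge entries

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(X, Y):
    # Single-head attention with identity projections (simplified MHA, FFN omitted).
    return softmax(X @ Y.T / np.sqrt(X.shape[-1])) @ Y

I_prime = rng.standard_normal((N_1, d))  # posterior knowledge from PoKE
W_pr = rng.standard_normal((N_K, d))     # encoded retrieved reports
G_pr = rng.standard_normal((N_G, d))     # encoded prior medical knowledge

W_pr_prime = cross_attend(I_prime, W_pr)  # Prior Working Experience
G_pr_prime = cross_attend(I_prime, G_pr)  # Prior Medical Knowledge
print(W_pr_prime.shape, G_pr_prime.shape)  # (49, 64) (49, 64)
```

Both outputs have $N_1$ rows because $I'$ is the query in both cases, so they line up row by row with $I'$ even though the two knowledge sources differ in size.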
MKD performs as a decoder that generates the final radiology report
take the embedding of the current input word $x_t=w_t+e_t$ as input:
$$h_t = \text{MHA}(x_t,x_{1:t})$$
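The equation says that the word at step $t$ attends only to the prefix $x_{1:t}$. The sketch below checks that computing this step by step gives the same result as one pass with a causal mask. It assumes $w_t$ and $e_t$ are a word embedding and a position embedding (random here), and it uses a single head with identity projections in place of MHA.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(X, Y):
    # Single-head attention with identity projections (simplified MHA).
    return softmax(X @ Y.T / np.sqrt(X.shape[-1])) @ Y

rng = np.random.default_rng(0)
L, d = 5, 16
w = rng.standard_normal((L, d))  # hypothetical word embeddings w_t
e = rng.standard_normal((L, d))  # hypothetical position embeddings e_t
x = w + e                        # x_t = w_t + e_t

# Step by step: the query x_t only sees x_1..x_t.
h_steps = np.vstack([attend(x[t:t + 1], x[:t + 1]) for t in range(L)])

# Same computation in one pass, masking out future positions.
scores = x @ x.T / np.sqrt(d)
scores[np.triu_indices(L, k=1)] = -np.inf
h_masked = softmax(scores) @ x

print(np.allclose(h_steps, h_masked))  # True
```

The masked form is how this is usually computed during training, since all positions can be processed at once.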
Then employ the proposed Adaptive Distilling Attention (ADA) to distill the useful and correlated knowledge:
$$h_t'=\text{ADA}(h_t,I',G'_{\text{Pr}},W'_{\text{Pr}})$$
Finally, $h_t'$ is passed to an FFN and a linear layer to predict the next word:
$$y_t\sim p_t=\text{softmax}(\text{FFN}(h'_t)\text{W}_p+\text{b}_p)$$
train the PPKED by minimizing the cross-entropy loss:
$$L_{\text{CE}}(\theta)=-\sum_{i=1}^{N_R}\log\left(p_\theta(y_i^*\mid y_{1:i-1}^*)\right)$$
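For a concrete feel of the loss, the sketch below evaluates it on a two-word report with a three-word vocabulary. The probabilities and target ids are made-up numbers; in training, each row would be the model's predicted distribution given the ground-truth prefix.

```python
import numpy as np

def cross_entropy(probs, targets):
    """probs: (N_R, vocab), row i is p_theta(. | y*_{1:i-1});
    targets: (N_R,) ground-truth word ids y*_i."""
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    # Pick the probability assigned to the ground-truth word at each step.
    picked = probs[np.arange(len(targets)), targets]
    return -np.log(picked).sum()

probs = [[0.5, 0.25, 0.25],
         [0.1, 0.8, 0.1]]
targets = [0, 1]

loss = cross_entropy(probs, targets)
print(round(loss, 4))  # 0.9163
```

The value is $-(\ln 0.5 + \ln 0.8)$, so the loss shrinks as the model puts more probability on the ground-truth words.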
make the model adaptively learn to distill correlated knowledge:
$$\text{ADA}(h_t,I',G'_{\text{Pr}},W'_{\text{Pr}})=\text{MHA}(h_t,I'+\lambda_1\odot G'_{\text{Pr}}+\lambda_2\odot W'_{\text{Pr}}) \\ \lambda_1,\lambda_2 = \sigma(h_t\text{W}_h\oplus(I'\text{W}_I+G'_{\text{Pr}}\text{W}_G+W'_{\text{Pr}}\text{W}_W))$$
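One way to read the ADA equations is as a pair of learned gates that decide, per image region, how much of each prior knowledge source to mix into the attention memory. The sketch below follows that reading under explicit assumptions: the gate matrices are taken to have shape `(d, 2)`, $\oplus$ is interpreted as broadcast addition of the `(1, 2)` decoder term to the `(N_1, 2)` memory term, a single head with identity projections replaces MHA, and all values are random.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attend(X, Y):
    # Single-head attention with identity projections (simplified MHA).
    return softmax(X @ Y.T / np.sqrt(X.shape[-1])) @ Y

def ada(h_t, I_p, G_p, W_p, W_h, W_I, W_G, W_W):
    # Gate logits: (1, 2) broadcast-added to (N_1, 2).
    gate_logits = h_t @ W_h + (I_p @ W_I + G_p @ W_G + W_p @ W_W)
    lam = sigmoid(gate_logits)
    lam1, lam2 = lam[:, :1], lam[:, 1:]     # each (N_1, 1), values in (0, 1)
    memory = I_p + lam1 * G_p + lam2 * W_p  # (N_1, d)
    return attend(h_t, memory), lam1, lam2

rng = np.random.default_rng(0)
N_1, d = 49, 64  # small d for the demo; the notes use d = 512
h_t = rng.standard_normal((1, d))
I_p, G_p, W_p = (rng.standard_normal((N_1, d)) for _ in range(3))
W_h, W_I, W_G, W_W = (rng.standard_normal((d, 2)) * 0.1 for _ in range(4))

h_new, lam1, lam2 = ada(h_t, I_p, G_p, W_p, W_h, W_I, W_G, W_W)
print(h_new.shape, lam1.shape, lam2.shape)  # (1, 64) (49, 1) (49, 1)
```

When both prior sources are zero, the memory reduces to $I'$ and ADA falls back to plain attention over the posterior knowledge.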
datasets: IU-Xray and MIMIC-CXR
PoKE can better recognize abnormalities
based on the Transformer Decoder equipped with the proposed Adaptive Distilling Attention
the results prove the authors' arguments and verify the effectiveness of the proposed approach in alleviating the data bias problem by exploring and distilling posterior and prior knowledge