[Paper Close Reading] GATE: Graph CCA for Temporal Self-Supervised Learning for Label-Efficient fMRI Analysis

Original paper: GATE: Graph CCA for Temporal Self-Supervised Learning for Label-Efficient fMRI Analysis | IEEE Journals & Magazine | IEEE Xplore

Paper code: https://github.com/LarryUESTC/GATE

The English below is typed entirely by hand, summarizing and paraphrasing the original paper. Unavoidable spelling and grammar mistakes may occur; if you spot any, corrections in the comments are welcome! This post is closer to personal notes, so read with caution!

Contents

1. TL;DR

1.1. Thoughts

1.2. Framework Figure

2. Section-by-Section Close Reading

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.3.1. Disease Prediction on fMRI Data

2.3.2. GCNs for Disease Prediction on fMRI Data

2.3.3. Self-Supervised Learning

2.4. Method

2.4.1. Multi-View fMRI Dynamic Functional Connectivity Generation

2.4.2. Graph Embedding

2.4.3. Objective Function

2.5. Theoretical Motivation and Analysis on CCA Loss

2.6. Experiments

2.6.1. Experimental Setup

2.6.2. Results and Analysis

2.6.3. Ablation Study

2.7. Discussion

2.7.1. The Needs of Label Efficiency for fMRI

2.7.2. Graph Learning for Neuroimaging

2.7.3. Technical Contributions of Our Work

2.8. Conclusion

3. Supplementary Knowledge

3.1. Transductive learning

3.2. AdamW optimizer

3.3. Unknown

4. Reference List


1. TL;DR

1.1. Thoughts

(1) Although the authors say it is a population graph, why does the experiment section (Section V) say that each node is an ROI?

(2) Why is it that every paper involving encoding/decoding turns into pure math? Not much model design here, just a big math show.

(3) Why can't I find the FTD dataset anywhere?

1.2. Framework Figure

[Image 1: framework diagram of the paper]

2. Section-by-Section Close Reading

2.1. Abstract

        ①They designed a self-supervised learning (SSL) structure to optimize GCNs, which they call Graph CCA for Temporal sElf-supervised learning on fMRI analysis (GATE)

        ②Traditional models always rely on plenty of labeled data, and they might be influenced by mislabeled data

        ③Their training is based on fMRI dynamic functional connectivity (FC)

        ④The model is first pre-trained with SSL on an unlabeled fMRI population graph and then fine-tuned

spurious  adj. false; fake; fallacious; based on mistaken ideas (or ways of thinking)

2.2. Introduction

        ①The sliding-window method is widely used to capture dynamic FC

        ②Previous works rely on time-consuming labeling

        ③Contrastive-based SSL, reconstruction-based SSL, and similarity-based SSL are the three main categories of SSL strategies. Similarity-based SSL is chosen for their approach.

        ④Challenge 1 for similarity-based SSL: designing the data augmentations. The augmented views should keep label-relevant information while having low coupling with spurious features.

        ⑤Challenge 2: designing the corresponding consistency loss function, which needs to maximize the consistency of correlated signals.

        ⑥The authors first augment the fMRI data and generate two views from the BOLD signals:

[Image 2: two augmented views generated from the BOLD signals]

where each node denotes a subject; SSL is used to capture temporal information. A GCN encoder is then adopted to obtain the embedding matrices of the two views, and finally a Canonical Correlation Analysis (CCA) based objective is applied

        ⑦Contributions: 1) high label efficiency, 2) tackling spurious labels in dynamic FC with a self-designed GCN-based CCA regularization, 3) a theoretical analysis, 4) ablation studies.

2.3. Related Work

2.3.1. Disease Prediction on fMRI Data

(1)Medical imaging approach examples:

        ①Magnetic Resonance Imaging (MRI)

        ② Computed Tomography (CT)

        ③Positron Emission Tomography (PET)

        ④X-ray

(2)Structural MRI and functional MRI

        ①sMRI: nodes are anatomical brain regions, edges are the anatomical connections (topology) between them

        ②fMRI: nodes are functional regions of the brain, edges are correlations between nodes. Additionally, fMRI captures dynamic changes over short time scales

2.3.2. GCNs for Disease Prediction on fMRI Data

(1)Population graph-based models

        ①Classification based on population

        ②Nodes are subjects and edges are similarities between subjects

(2)Brain region graph based models

        ①Classification based on brain regions

        ②Nodes are brain regions and edges are functional or structural connections among brain regions

2.3.3. Self-Supervised Learning

        ①Contrastive-based SSL: increases the similarity between local and global representations by tuning positive and negative sample pairs; it mostly relies on negative samples. However, it is not suitable for datasets with a small number of samples or classes.

        ②Reconstruction-based SSL: compresses the input into a low-dimensional representation and reconstructs the original high-dimensional input from it

        ③Similarity-based SSL: benefits from the coupling between multiple views of the same data.

2.4. Method

        ①Key components of GATE: 1) Dynamic FC augmentation, 2) GCN encoder, 3) Objective function 

        ②Training procedure: 1) unsupervised pre-training, 2) supervised fine-tuning with labeled data

        ③The whole framework

[Image 3: the overall GATE framework]

2.4.1. Multi-View fMRI Dynamic Functional Connectivity Generation

        The main characteristics of the data are kept, but predictions may vary due to spurious features.

(1)Dynamic Functional Connectivity:

        ①Sliding window method is used for capturing temporal information

        ②BOLD signals S_{i}\in \mathbb{R}^{R\times T}, where R denotes the number of Regions Of Interest (ROIs) in the fMRI of the i-th subject and T denotes the length of the BOLD time series

        ③The FC matrix F_{i}\in \mathbb{R}^{R\times R} is calculated by Pearson’s correlation between the matched BOLD segments of each pair of ROIs

        ④Then they flatten the upper triangle of F_{i} into a vector x_{i}

        ⑤Population graph G=\left \{ X,A \right \}, where X is the node feature matrix (each node feature x_{i} comes from the flattened FC matrix of one subject) and A encodes the similarities between subjects

        ⑥Size (length) of the sliding window: L

        ⑦Step (stride) of the sliding window: s (a minimal sketch of this pipeline follows below)
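
To make the pipeline above concrete, here is a minimal NumPy sketch (function and variable names are mine, not from the authors' repository): it slides a window of length L with stride s over one subject's BOLD matrix, computes a Pearson FC matrix per window, and flattens the upper triangle including the diagonal, which for the BASC-122 atlas gives the 7503-dimensional features mentioned in the experiments.

```python
import numpy as np

def sliding_window_fc(bold, L=30, s=15):
    """Per-window dynamic FC features for one subject.

    bold: (R, T) BOLD matrix (R ROIs, T time points).
    Returns M = floor((T - L) / s) + 1 flattened FC vectors.
    """
    R, T = bold.shape
    M = (T - L) // s + 1
    iu = np.triu_indices(R, k=0)            # upper triangle incl. diagonal: R(R+1)/2 values
    feats = []
    for m in range(M):
        seg = bold[:, m * s : m * s + L]    # (R, L) BOLD sub-segment S_i^m
        F = np.corrcoef(seg)                # (R, R) Pearson correlation FC matrix
        feats.append(F[iu])                 # flatten the upper triangle to x_i
    return feats
```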

(2)Step Window Augmentation (S-A):

        ①There are M sub-segments, where M=\left \lfloor \frac{T-L}{s} \right \rfloor+1 (e.g., T=200, L=30, s=15 gives M=12); hence \left \{ S_{i}^{1},...,S_{i}^{M} \right \} is the set of sub-segments for one subject.

        ②S-A randomly selects G^a=\{\mathbf{X}^m,\mathbf{A}^m\} as the first view and takes a neighboring window G^b=\{\mathbf{X}^{m\pm1},\mathbf{A}^{m\pm1}\} as the other view

(3)Multi-Scale Window Augmentation (M-A):

        ①Choose two different window sizes: l_{a},l_{b}

        ②Two views are then obtained: G^a=\{\mathbf{X}^{m,l_a},\mathbf{A}^{m,l_a}\} and G^b=\{\mathbf{X}^{m,l_{b}},\mathbf{A}^{m,l_{b}}\} (see the sketch below)
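
A sketch of the two augmentations under the same assumptions (it reuses sliding_window_fc from the sketch above; the exact pairing logic in the authors' code may differ, and the per-view population graphs A^m are omitted for brevity):

```python
import random

def step_window_views(bold, L=30, s=15):
    """S-A sketch: one window and an adjacent window give the two views."""
    feats = sliding_window_fc(bold, L=L, s=s)     # M per-window FC vectors
    m = random.randrange(len(feats))
    nb = m + 1 if m + 1 < len(feats) else m - 1   # neighboring sub-segment m±1
    return feats[m], feats[nb]

def multi_scale_views(bold, scales=(10, 20, 30, 40, 50), s=15):
    """M-A sketch: the same window position at two different window lengths."""
    la, lb = random.sample(scales, 2)             # two distinct window sizes
    va = sliding_window_fc(bold, L=la, s=s)
    vb = sliding_window_fc(bold, L=lb, s=s)
    m = random.randrange(min(len(va), len(vb)))
    return va[m], vb[m]
```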

2.4.2. Graph Embedding

        ①They adopt a GCN as their encoder; the propagation rule of the l-th layer:

\mathbf{H}^{(l+1)}=\sigma\left(\mathbf{D}^{-\frac12}\mathbf{A}\mathbf{D}^{-\frac12}\mathbf{H}^{(l)}\boldsymbol{\Theta}^{(l)}\right)

where \mathbf{D} denotes the diagonal degree matrix of \mathbf{A}, \boldsymbol{\Theta}^{(l)} is the trainable weight matrix of layer l, and \mathbf{H}^{(l)} is the feature matrix of all subjects at layer l

        ②The normalized embeddings of the two views: \mathbf{Z}^a=f(\mathbf{X}^a,\mathbf{A}^a) and \mathbf{Z}^b=f(\mathbf{X}^b,\mathbf{A}^b) (a sketch of one propagation step follows below)
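
A sketch of one propagation step matching the formula above (PyTorch assumed, dense A for simplicity; the helper name is mine):

```python
import torch
import torch.nn.functional as F

def gcn_layer(H, A, Theta):
    """One GCN step: H^{(l+1)} = ELU(D^{-1/2} A D^{-1/2} H^{(l)} Theta^{(l)})."""
    deg = A.sum(dim=1)                                 # diagonal of the degree matrix D
    d_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)
    A_norm = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)
    return F.elu(A_norm @ H @ Theta)                   # ELU: the activation ψ(·) used in the paper
```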

2.4.3. Objective Function

        ①Reconstruction-based SSL may overfit to scattered noise

        ②GATE instead ignores negative samples and avoids reconstruction

        ③Maximize the correlation between the embedding matrices of the two views

        ④Input-consistency regularization loss:

\mathcal{L}=-\frac1N\sum_{i=1}^N\frac{\left\langle\mathbf{z}_i^a,\mathbf{z}_i^b\right\rangle}{\|\mathbf{z}_i^a\|\left\|\mathbf{z}_i^b\right\|}+\gamma\sum_{v=a,b}\|(\mathbf{Z}^v)^\top\mathbf{Z}^v-\mathbf{I}\|_F^2

where \left\langle \cdot,\cdot \right\rangle is the dot product operator, \gamma denotes the trade-off coefficient, v denotes one of the views (a or b), and Z^{v} denotes the embedding matrix of view v. The first term is the consistency term: it keeps each subject's embeddings aligned across the two views. The second term is a regularization term that keeps the embedding dimensions uncorrelated, preventing feature collapse. A PyTorch sketch of this loss appears after this list.

        ⑤For fine-tuning, A is replaced with the identity matrix I

        ⑥Activation: ELU, denoted as \psi(\cdot)
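
A direct transcription of this loss into PyTorch (a sketch, assuming Z^a and Z^b are the normalized view embeddings from Sec. 2.4.2; \gamma=0.2 follows the experimental setup):

```python
import torch
import torch.nn.functional as F

def gate_loss(Za, Zb, gamma=0.2):
    """Input-consistency regularization loss.

    Za, Zb: (N, d) embedding matrices of the two views.
    Term 1 maximizes per-subject cosine consistency across views;
    term 2 pushes Z^T Z toward I so embedding dimensions stay uncorrelated.
    """
    d = Za.shape[1]
    consistency = -F.cosine_similarity(Za, Zb, dim=1).mean()
    I = torch.eye(d, device=Za.device)
    decorrelation = sum(torch.norm(Z.t() @ Z - I, p='fro') ** 2 for Z in (Za, Zb))
    return consistency + gamma * decorrelation
```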

2.5. Theoretical Motivation and Analysis on CCA Loss

        ①Adding the input-consistency regularization decreases the correlation between spurious features and the true label, and improves performance

        ②The CCA function:

\begin{aligned}&\max_f\mathcal{L}_{\mathrm{CCA}}:=\mathbb{E}_{G^a,G^b}[f(G^a)^\top f(G^b)]\\&\text{s.t. }\Sigma_{f(G^a),f(G^a)}=\Sigma_{f(G^b),f(G^b)}=\mathbf{I}\end{aligned}

where f is a normalized non-linear embedding and \Sigma denotes a covariance matrix, e.g., \Sigma_{f(G^a),f(G^a)}=\mathbb{E}_{G^a}[f(G^a)f(G^a)^\top]

        ③Connection between CCA and generalization error of downstream tasks:

\begin{aligned}\mathcal{L}_{U,V,S}:=\max_{\left\|h\right\|_{L^2(G^b)}=1}\left\|\mathcal{T}h-\mathcal{T}_kh\right\|_{L^2(G^a)}\\\mathcal{T}_kh:=\sum_{i=1}^k\sigma_i\left<v_i,h\right>_{L^2(G^b)}u_i=f^\top\mathbf{w}\end{aligned}

where \mathcal{T} denotes the representation operator and \mathcal{T}_{k} its best rank-k approximation; U=[u_1,\ldots,u_k] and V=[v_1,\ldots,v_k] collect the top-k singular vectors of the Singular Value Decomposition (SVD) of \mathcal{T}, \sigma_i are the corresponding singular values, and \mathbf{w} stacks the coefficients \langle v_i,h\rangle

        ④General theorem for non-linear CCA, which presents the approximation error: 

\begin{aligned}e_{apx}(f)&:=\min_\mathbf{W}{\mathbb{E}_{G^a}[\|h^*(G^a)-\mathbf{W}^\top f(G^a)\|^2]}\\&\leq2\mathbb{E}_y[\|h^*-\mathcal{R}w_{b,y}\|^2+\|(\mathcal{R}-\mathcal{T}_k)w_{b,y}\|^2]\end{aligned}

where h^{\ast} is the optimal function for predicting Y, and \sigma_i:=\mathbb{E}_{G^a,G^b}[f_i(G^a)f_i(G^b)]

        ⑤Denote:

(\mathcal{T}_k\circ h_y)(g_a):=\sum_{i=1}^k\sigma_i\mathbb{E}[f_i(G^b)h_y(G^b)]f_i(g_a)

(\mathcal{R}\circ h_y)(g_a):=\mathbb{E}_Y[\mathbb{E}_{G^b}[h_y(G^b)|Y]|G^a=g_a]

\mathbb{E}[h_y(G^b)|Y=y]=\mathbf{1}(Y=y)

        ⑥Upper bound of excess risk of downstream task:

(Honestly, I cannot get through a single word of this math, and the formatting is genuinely strange.)

2.6. Experiments

2.6.1. Experimental Setup

(1)ABIDE dataset:

        ①Datasets: Autism brain imaging data exchange (ABIDE) I/II

        ②Object: healthy control (HC) vs. autism spectrum disorder (ASD) patients (classification task 1)

        ③Samples: 485 ASD and 544 HCs in ABIDE

        ④Atlas: Bootstrap Analysis of Stable Cluster parcellation with 122 ROIs (BASC-122)

        ⑤Node: ROI

        ⑥Edge: Pearson’s correlation between the time series of BOLD signals of their ROIs

        ⑦Dimension: 7503 (= 122\times123/2, the flattened upper triangle of the 122\times122 FC matrix, diagonal included)

(2)FTD

        ①Dataset: Frontotemporal dementia (FTD) 

        ②Object: HC vs. FTD patients (classification task 2)

        ③Samples: 86 HCs and 95 FTD patients

        ④Pre-process: DPARSF

        ⑤Number of ROI: 116

(3)Graph Construction:

        ①Construct a similarity graph S\in \mathbb{R}^{n\times n} from low-dimensional, discriminative features extracted from the raw images, where n denotes the number of nodes in the population graph. This mitigates the influence of noise, redundant features, and the curse of dimensionality brought by high-dimensional features.

        ②Then construct the phenotypic graph matrix \tilde{S} from phenotypic information such as gender, age, or genetic data

        ③Get the initial graph A=S\odot \tilde{S}, where \odot denotes element-wise (Hadamard) multiplication

        ④Only the top-k edges of each node are kept

        ⑤Add the identity matrix I to A (self-loops): I+A\rightarrow A. A minimal sketch of steps ①-⑤ follows below.
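
A minimal sketch of steps ① through ⑤ (PyTorch; the value of k and the re-symmetrization after top-k sparsification are my assumptions, not stated in the notes):

```python
import torch

def build_population_graph(S, S_pheno, k=10):
    """S: (n, n) feature-similarity matrix; S_pheno: (n, n) phenotypic matrix."""
    A = S * S_pheno                                   # A = S ⊙ S̃ (Hadamard product)
    vals, idx = A.topk(k, dim=1)                      # keep each node's top-k edges
    A = torch.zeros_like(A).scatter_(1, idx, vals)
    A = torch.maximum(A, A.t())                       # re-symmetrize after top-k
    return A + torch.eye(A.shape[0])                  # A <- A + I (self-loops)
```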

(4)Comparison Methods (adopt the same window):

        ①Methods without SSL: vanilla GCN, GAT, SAC-GCN

        ②Contrastive-based SSL: DGI, MVGRL

        ③Similarity-based SSL: BGRL, CCA-SSG

(5) Implementation Details:

        ①Optimizer: AdamW

        ②Learning rate: 0.001

        ③\gamma =0.2

        ④In S-A, L is 30 and s is 15

        ⑤In M-A, l_{a} and l_{b} are randomly chosen from \left \{ 10,20,30,40,50 \right \}

        ⑥Labeled data: 20%, 206 in ABIDE, 36 in FTD

        ⑦Validation: 5-fold cross validation 

(6)Performance Evaluation:

        ①Evaluation metrics: accuracy, area under the ROC curve (AUC), precision, recall, F1 score

        ②The higher these metrics, the better the performance

2.6.2. Results and Analysis

        ①Comparison table:

[Image 4: comparison with baseline methods on ABIDE and FTD]

        ②Then they vary the proportion of labeled data from 10% to 80%:

[Image 5: performance under different proportions of labeled data]

2.6.3. Ablation Study

(1)Effectiveness of Dynamic FC Augmentation:

        M-A and S-A significantly enhance performance:

[Image 6: ablation results of the dynamic FC augmentations]

(2)Effectiveness of Different SSL Strategies:

        They compare contrastive-based SSL (CL), reconstruction-based SSL (Re), and their model, changing the objective function to the InfoNCE loss with randomly selected negative samples for CL, and to an MSE loss with an extra decoder for Re:

[Image 7: comparison of different SSL strategies (CL, Re, GATE)]

(3)Different Dimensional Embedding:

        The embedding dimension is chosen from {16, 32, 64, 128, 256, 512, 1024}; performance roughly peaks at 256 for ABIDE and 128 for FTD:

[Image 8: performance under different embedding dimensions]

Hence they chose 256 in all experiments, since low dimensionality lacks representation ability while high dimensionality consumes more computation time

(4)Effectiveness of γ in the Objective Function:

        γ in the objective function yields stable performance for values in 0.1-0.8:

[Image 9: performance under different γ values]

(5)Effectiveness of Fine-Tuning and Graph:

        GATE without fine-tuning or without the graph (replacing the original A with I) in SSL:

[Image 10: ablation of fine-tuning and graph structure]

Fine-tuning serves to exploit the correctly labeled data, and the graph structure serves to provide common biomarkers

(6)Low-Rank Representation:

        As low-rank representations provide common biomarkers, GATE is able to reduce the upper bound of the excess risk for downstream tasks. Here is the comparison between GATE and the vanilla GCN:

[Image 11: low-rank representation comparison between GATE and vanilla GCN]

(7)Parameter Sensitivity Analysis:

        They study whether GATE is sensitive to sliding-window parameters, such as window length, step size, or the gap between multiple windows:

[Images 12-13: sensitivity analysis of sliding-window parameters]

2.7. Discussion

2.7.1. The Needs of Label Efficiency for fMRI

        Generally, the more labeled data, the higher the accuracy. However, obtaining plenty of labeled images is clearly a big challenge. The designed GATE achieves excellent performance with only 20% of labels: its accuracy is similar to that of a vanilla GCN trained with 50%.

2.7.2. Graph Learning for Neuroimaging

        GATE better extracts associations between subjects, and then maximizes the correlation between views.

2.7.3. Technical Contributions of Our Work

        ①The SSL strategy produces multiple coupled views of an fMRI BOLD signal

        ②A pre-training and fine-tuning scheme

2.8. Conclusion

        GATE, which operates on a population graph, achieves high accuracy with small amounts of labeled data and in noisy environments

3. Supplementary Knowledge

3.1. Transductive learning

Related link: 转导学习 transductive learning_TBYourHero的博客-CSDN博客

3.2. AdamW optimizer

(1) AdamW adds weight decay regularization on top of the Adam optimizer, which amounts to directly decaying the original weights at each step (a minimal usage sketch follows below)

(2) Related link: 【优化器】(六) AdamW原理 & pytorch代码解析_Lcm_Tech的博客-CSDN博客
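
A minimal usage sketch (the encoder layer and the weight_decay value are illustrative assumptions; only the learning rate 0.001 comes from the paper's setup):

```python
import torch

# hypothetical encoder layer, sized to match the 7503-d flattened FC features
model = torch.nn.Linear(7503, 256)

# AdamW decouples weight decay from Adam's gradient-based update:
# weights are shrunk directly each step instead of adding an L2 term to the loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
```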

3.3. Unknown

(1)corruption function

(2)feature collapse

4. Reference List

Peng, L. et al. (2022) 'GATE: Graph CCA for Temporal Self-Supervised Learning for Label-Efficient fMRI Analysis', IEEE Transactions on Medical Imaging, vol. 42, no. 2, pp. 391-402. doi: 10.1109/TMI.2022.3201974
