The development of decision support systems for pathology and their deployment in clinical practice have been hindered by the need for large manually annotated datasets. To overcome this problem, we present a multiple instance learning-based deep learning system that uses only the reported diagnoses as labels for training, thereby avoiding expensive and time-consuming pixel-wise manual annotations. We evaluated this framework at scale on a dataset of 44,732 whole slide images from 15,187 patients without any form of data curation. Tests on prostate cancer, basal cell carcinoma and breast cancer metastases to axillary lymph nodes resulted in areas under the curve above 0.98 for all cancer types. Its clinical application would allow pathologists to exclude 65–75% of slides while retaining 100% sensitivity. Our results show that this system has the ability to train accurate classification models at unprecedented scale, laying the foundation for the deployment of computational decision support systems in clinical practice.
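Hardware: the high-performance computing cluster at Memorial Sloan Kettering Cancer Center (MSK), consisting of seven NVIDIA DGX-1 compute nodes, each with 8 V100 Volta GPUs and 8 TB of SSD storage. Each model was trained on a single GPU.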
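2. Skin cancer: basal cell carcinoma (BCC) classification
For the prostate cancer and axillary lymph node breast cancer metastasis datasets, the ground-truth labels were taken from the Laboratory Information System (LIS). For the skin basal cell carcinoma dataset, trained experts reviewed each case and manually assigned its final binary label.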
The complete pipeline for the MIL classification algorithm (Fig. 1c) comprises the following steps:
(1) tiling of each slide in the dataset (for each epoch, which consists of an entire pass through the training data);
(2) a complete inference pass through all of the data;
(3) intra-slide ranking of instances;
(4) model learning based on the top-ranked instance for each slide.
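The four steps above can be sketched for a single epoch as follows. This is a minimal illustration, not the authors' implementation: tiling (step 1) is assumed to have already produced the bags, `predict_positive` is a hypothetical stand-in for the network's positive-class output, and the function only returns the (tile, label) pairs that the learning step would then train on.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Tile = Sequence[float]  # hypothetical stand-in for an image tile


def rank_and_select(
    bags: Dict[str, List[Tile]],
    labels: Dict[str, int],
    predict_positive: Callable[[Tile], float],
    k: int = 1,
) -> List[Tuple[Tile, int]]:
    """Steps 2-4 for one epoch; returns the (tile, slide_label) training pairs."""
    training_set: List[Tuple[Tile, int]] = []
    for slide_id, tiles in bags.items():
        # Step 2: inference pass over every tile of the slide.
        probs = [predict_positive(t) for t in tiles]
        # Step 3: rank the tiles within the slide by positive probability.
        ranked = sorted(range(len(tiles)), key=lambda j: probs[j], reverse=True)
        # Step 4: keep the top-k tiles; each one inherits the slide-level label.
        for j in ranked[:k]:
            training_set.append((tiles[j], labels[slide_id]))
    return training_set


# Toy usage: the "model" simply reads the first value of each tile.
toy_bags = {"s1": [[0.1], [0.9], [0.4]], "s2": [[0.2], [0.3]]}
toy_labels = {"s1": 1, "s2": 0}
selected = rank_and_select(toy_bags, toy_labels, lambda t: t[0], k=1)
```

In a real training loop the selected pairs would be passed to the optimizer, and the ranking would be recomputed with the updated model at the next epoch.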
Magnification | Tile overlap (training & validation) | Tile overlap (test) |
5× | 67% | 80% |
10× | 50% | 80% |
20× | 0% | 80% |
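To make the overlap percentages concrete, the sketch below converts a fractional overlap into a tiling stride and lists tile origins on a regular grid. The tile size is left as a parameter because these notes do not state it. The grid layout and the rounding rule are assumptions made for illustration only.

```python
from typing import List, Tuple


def tile_stride(tile_size: int, overlap: float) -> int:
    """Pixel step between neighbouring tiles for a fractional overlap in [0, 1)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    return max(1, round(tile_size * (1.0 - overlap)))


def tile_origins(width: int, height: int, tile_size: int, overlap: float) -> List[Tuple[int, int]]:
    """Top-left corners of all tiles that fit entirely inside the image."""
    stride = tile_stride(tile_size, overlap)
    xs = range(0, width - tile_size + 1, stride)
    ys = range(0, height - tile_size + 1, stride)
    return [(x, y) for y in ys for x in xs]


# With a hypothetical 224-pixel tile, 50% overlap gives a 112-pixel stride.
stride_10x = tile_stride(224, 0.5)
```

A higher overlap at test time yields a denser grid of predictions per slide, at the cost of more tiles to evaluate.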
Bags: $B=\{B_{s_i}: i=1,2,\dots,n\}$
$B_{s_i}=\{b_{i,1},b_{i,2},\dots,b_{i,m_i}\}$
Given a tiling strategy, we produce bags $B=\{B_{s_i}: i=1,2,\dots,n\}$, where $B_{s_i}=\{b_{i,1},b_{i,2},\dots,b_{i,m_i}\}$ is the bag for slide $s_i$ containing $m_i$ total tiles.
The model is a function $f_\theta$ with current parameters $\theta$ that maps input tiles $b_{i,j}$ to class probabilities for the 'negative' and 'positive' classes. Given our bags $B$, we obtain a list of vectors $O=\{o_i: i=1,2,\dots,n\}$, one for each slide $s_i$, containing the probabilities of class 'positive' for each tile $b_{i,j}$, $j=1,2,\dots,m_i$, in $B_{s_i}$. We then obtain the index $k_i$ of the tile within each slide that shows the highest probability of being 'positive': $k_i=\operatorname{argmax}(o_i)$.
This is the most stringent version of MIL, but we can relax the standard MIL assumption by introducing a hyper-parameter $K$ and assuming that at least $K$ discriminative tiles exist in positive slides. For $K=1$, the highest-ranking tile in bag $B_{s_i}$ is then $b_{i,k}$, with $k=k_i$. The output of the network $\hat{y}_i=f_\theta(b_{i,k})$ can then be compared to $y_i$, the target of slide $s_i$, through the cross-entropy loss $l$ as in equation (1). Similarly, if $K>1$, all selected tiles from a slide share the same target $y_i$ and the loss can be computed with equation (1) for each one of the $K$ tiles:
$$l = -w_1\,[y_i \log(\hat{y}_i)] - w_0\,[(1-y_i)\log(1-\hat{y}_i)] \qquad (1)$$
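Equation (1) can be checked numerically with a few lines of Python. The sketch below is only an illustration: the clipping constant `eps` is an added safeguard against `log(0)` and is not part of the equation, and the example weights are arbitrary.

```python
import math


def weighted_cross_entropy(y_true: int, y_pred: float,
                           w1: float = 1.0, w0: float = 1.0,
                           eps: float = 1e-12) -> float:
    """Loss of equation (1) for one tile: w1 weights the positive term, w0 the negative term."""
    p = min(max(y_pred, eps), 1.0 - eps)  # keep log() finite (assumption, not in eq. 1)
    return -w1 * y_true * math.log(p) - w0 * (1 - y_true) * math.log(1.0 - p)


# A positive slide whose top tile scores 0.5, with the positive class weighted 2x.
loss = weighted_cross_entropy(1, 0.5, w1=2.0)
```

Raising `w1` relative to `w0` penalizes missed positives more heavily, which pushes the model towards higher sensitivity.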
The features extracted are:
(1) total count of tiles with probability ≥0.5;
(2–11) ten-bin histogram of tile probability;
(12–30) count of connected components for a probability threshold of 0.1 of size in the ranges 1–10, 11–15, 16–20, 21–25, 26–30, 31–40, 41–50, 51–60, 61–70 and >70, respectively;
(31–40) ten-bin local histogram with a window of size 3×3 aggregated by max-pooling;
(41–50) ten-bin local histogram with a window of size 3×3 aggregated by averaging;
(51–60) ten-bin local histogram with a window of size 5×5 aggregated by max-pooling;
(61–70) ten-bin local histogram with a window of size 5×5 aggregated by averaging;
(71–80) ten-bin local histogram with a window of size 7×7 aggregated by max-pooling;
(81–90) ten-bin local histogram with a window of size 7×7 aggregated by averaging;
(91–100) ten-bin local histogram with a window of size 9×9 aggregated by max-pooling;
(101–110) ten-bin local histogram with a window of size 9×9 aggregated by averaging;
(111–120) ten-bin histogram of all tissue edge tiles;
(121–130) ten-bin local histogram of edges with a linear window of size 3×3 aggregated by max-pooling;
(131–140) ten-bin local histogram of edges with a linear window of size 3×3 aggregated by averaging;
(141–150) ten-bin local histogram of edges with a linear window of size 5×5 aggregated by max-pooling;
(151–160) ten-bin local histogram of edges with a linear window of size 5×5 aggregated by averaging;
(161–170) ten-bin local histogram of edges with a linear window of size 7×7 aggregated by max-pooling;
(171–180) ten-bin local histogram of edges with a linear window of size 7×7 aggregated by averaging.
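As an illustration of the simplest entries in this list, the sketch below computes features 1 to 11 from the tile probabilities of one slide. It assumes equal-width bins over [0, 1] and raw counts rather than normalized frequencies. Neither detail is stated in these notes, and the spatial features (connected components, local and edge histograms) are not covered.

```python
from typing import List, Sequence


def global_probability_features(tile_probs: Sequence[float], n_bins: int = 10) -> List[int]:
    """Feature 1 (count of tiles with p >= 0.5) followed by features 2-11 (ten-bin histogram)."""
    count_positive = sum(1 for p in tile_probs if p >= 0.5)
    hist = [0] * n_bins
    for p in tile_probs:
        # Equal-width bins; p == 1.0 is placed in the last bin.
        bin_index = min(int(p * n_bins), n_bins - 1)
        hist[bin_index] += 1
    return [count_positive] + hist


features = global_probability_features([0.05, 0.55, 0.95, 1.0])
```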
Model: $f$, composed of two parts:
Feature extractor $f_F$: maps pixel space to representation space
Linear classifier $f_C$: projects the representation onto class probabilities
The ordered sequence of vector representations $e=e_1,e_2,\dots,e_S$, together with a state vector $h$, is the input to the RNN.
At each recurrent step $i=1,2,\dots,S$ of the forward pass, the state vector $h_i$ is updated as:
$$h_i=\mathrm{ReLU}(W_e e_i + W_h h_{i-1}+b) \qquad (2)$$
$W_e$ and $W_h$ are the weights of the RNN. At step $i=S$ the slide is classified as $o=W_o h_S$, where $W_o$ maps the state vector to class probabilities.
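The recurrence in equation (2) and the final readout can be sketched in plain Python as below. This is a minimal illustration with hand-set weights. The zero initial state $h_0$ is an assumption, and the returned vector $o$ is the raw readout, to which a softmax would typically be applied to obtain normalized probabilities.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for small dense matrices stored as nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rnn_slide_output(e: List[Vector], W_e: Matrix, W_h: Matrix, b: Vector, W_o: Matrix) -> Vector:
    """Apply equation (2) over the ordered tile representations, then return o = W_o h_S."""
    h = [0.0] * len(b)  # initial state h_0, assumed to be zero
    for e_i in e:
        pre = [a + c + d for a, c, d in zip(matvec(W_e, e_i), matvec(W_h, h), b)]
        h = [max(0.0, v) for v in pre]  # ReLU
    return matvec(W_o, h)


# Toy example with a one-dimensional state and two steps.
o = rnn_slide_output([[1.0], [2.0]], W_e=[[1.0]], W_h=[[0.5]], b=[0.0], W_o=[[1.0], [-1.0]])
```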
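2. Given models trained at different magnifications, $f_{20\times}, f_{10\times}, f_{5\times}$, the $S$ most interesting tiles of a slide are obtained by averaging the predictions for tiles that share the same center pixel but are extracted at different magnifications.
The inputs at step $i$ are then $e_{20\times,i}, e_{10\times,i}, e_{5\times,i}$, which are fed to the RNN together with the previous state vector $h_{i-1}$. The state update becomes:
$$h_i=\mathrm{ReLU}(W_{20\times}e_{20\times,i}+W_{10\times}e_{10\times,i}+W_{5\times}e_{5\times,i}+W_h h_{i-1}+b) \qquad (3)$$
In all experiments the state vector had 128 dimensions, the number of recurrent steps was $S=10$, and the positive class was weighted to place more emphasis on the sensitivity of the model. Training used the cross-entropy loss and mini-batch stochastic gradient descent (SGD) with a batch size of 256.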
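The multi-scale tile selection can be illustrated with a short sketch. It assumes that the per-magnification predictions are already aligned, so that index `j` refers to the same center pixel at every magnification. The dictionary keys and the function name are hypothetical.

```python
from typing import Dict, List, Sequence


def top_s_by_mean_prediction(preds_by_scale: Dict[str, Sequence[float]], s: int = 10) -> List[int]:
    """Indices of the s center pixels with the highest prediction averaged over magnifications."""
    scales = list(preds_by_scale)
    n = len(preds_by_scale[scales[0]])
    if any(len(preds_by_scale[m]) != n for m in scales):
        raise ValueError("all magnifications must cover the same center pixels")
    mean = [sum(preds_by_scale[m][j] for m in scales) / len(scales) for j in range(n)]
    # Sort center pixels from most to least suspicious and keep the first s.
    return sorted(range(n), key=lambda j: mean[j], reverse=True)[:s]


preds = {"20x": [0.9, 0.1, 0.5], "10x": [0.7, 0.2, 0.6], "5x": [0.8, 0.3, 0.1]}
chosen = top_s_by_mean_prediction(preds, s=2)
```

The representations of the chosen tiles at each magnification would then be fed, step by step, into the update of equation (3).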
For each experiment, the best setting was chosen as the one with the minimum balanced error on the validation set.
Architecture | Balanced error (difference relative to ResNet34) |
ResNet34 | 0 |
AlexNet | +0.0738 |
VGG11BN | -0.003 |
ResNet18 | +0.025 |
ResNet101 | +0.0265 |
DenseNet201 | +0.0085 |
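For reference, the sketch below computes a balanced error under the common definition, the mean of the false negative rate and the false positive rate. These notes do not spell out the formula, so treat that definition as an assumption. The helper `best_setting` and its example values are purely illustrative.

```python
from typing import Dict, Sequence


def balanced_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of false negative rate and false positive rate for binary labels in {0, 1}."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = sum(1 for t in y_true if t == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return 0.5 * (false_neg / positives + false_pos / negatives)


def best_setting(errors: Dict[str, float]) -> str:
    """Name of the setting with the lowest validation balanced error."""
    return min(errors, key=errors.get)


err = balanced_error([1, 1, 0, 0], [1, 0, 0, 0])
```

Unlike plain accuracy, this measure is not dominated by the majority class, which matters when most slides are negative.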
Dimensionality reduction to two dimensions was performed with t-distributed stochastic neighbor embedding (t-SNE).