Small Datasets of Medical Ultrasound Images

Do you see cancer in the mammogram above? If you’re struggling, don’t worry, you’re not alone. Biomedical imagery is a domain where computer vision and artificial intelligence could be better suited to outperform human judgement. A recent Google study that used over 25,000 medical images supports this claim; they were able to build a machine learning model for breast cancer detection that outperformed humans, presumably, in part, because breast cancer detection is difficult, even for trained professionals. Machines often excel at these texture-based challenges. (In case you are curious, there’s no cancer in that mammogram.)

But before computer vision can broadly assist in evaluating biomedical imagery, there’s a data problem to solve: many possible biomedical applications have access to only a few hundred labeled images. If machine learning researchers have only hundreds, and not tens of thousands, of biomedical images, can a useful, predictive tool still be built?

In this series of posts, we will empirically explore some of the options and tools that data scientists can use when working on extremely small biomedical imagery datasets. Our focus will be on classification tasks that do not require segmentation; for example, these datasets could be for identifying if there is a tumor in a brain scan, not where the tumor is in the brain scan. This series will explore questions such as:

How applicable is transfer learning from existing, general-purpose models? Do ImageNet models (trained to distinguish between a thousand classes of things like cats and dogs) help us detect differences in cell-based photographs?
If not, can we build a general-purpose “CellNet” model that can be used successfully for transfer learning for cell-based biomedical images?
What kind of data augmentation and pre-processing works for this biomedical domain?
What other approaches, such as dataset purification or using only high-confidence predictions for image triage, could we employ to make these models, built off ultra-small image datasets, more usable in real-world settings?

Does Transfer Learning Work for Ultra-small Biomedical Datasets?

In this first post, we’re going to tackle the common problem of limited training data by examining how, when, and why ImageNet-based transfer learning can be used effectively (or not). Transfer learning refers to the idea that a large, pre-trained model can be reused on a new dataset, recycling the learned parameters and their weights for a new classification task. For example, you can download a model that was pre-trained on millions of images from ImageNet to predict common objects. Then you can replace its final layer to predict, for example, four different types of white blood cells, instead of birds or cars.

These pre-trained models have learned to recognize lower-level features like straight lines versus curves, which could help distinguish the outlines of a cat’s whiskers from a dog’s snout. With transfer learning, these simple features can then be recycled when trying to differentiate different blood cell types, obviating the need for thousands of blood cell images and countless hours of model retraining. That’s the theory at least.

ImageNet Large Scale Visual Recognition Challenge,’ 2015.

How well transfer learning works depends, in part, on the similarity between a dataset and ImageNet, above (assuming you’re using a model built off of ImageNet, like pre-trained vgg or ResNet). On one hand, transfer learning seems to work well for many biomedical applications. On the other hand, it often doesn’t. Recent NeurIPS work by Raghu and colleagues from Cornell and Google Brain hypothesized that, rather than using transfer learning from an entire general-purpose, large-scale model built off ImageNet, data scientists may be better off recycling only the lower layers of these pre-trained models. The upper layers of the model can then be simplified and locally trained. By recycling lower-level features, one would expect to keep the benefits of a model already trained to recognize shallow features like lines and curves, which may be especially important given the fine-level detail found in biomedical imagery. We’ll evaluate that approach with our datasets here.

Ultra-small Biomedical Datasets: the Good, the Bad, and the Ugly

To investigate the feasibility of transfer learning for ultra-small biomedical image datasets, we set up experiments to classify open-source benchmark imagery, using pre-trained vgg16, ResNet18, and two custom-designed Convolutional Neural Network (CNN) architectures. Below are examples from the nine biomedical, mostly cell-based datasets we used to benchmark different models. We provide further details and results for each benchmark in the sections below.

Examples from nine biomedical imagery datasets we used as benchmarks in our experiments; dataset sizes and class imbalances are detailed below for each dataset individually.

For all benchmarks, we did our best to ensure that our models learn what we want them to learn (a cell shape, for example) and not circumstantial artifacts (like a background color). Given the limited size of our datasets, we applied basic data augmentation during training: flipping, rotating, normalizing, and color jittering each image to try to help with generalization. Because classes were often imbalanced, we balanced classes during training through a WeightedRandomSampler passed to the DataLoader. In a future post, we will explore in detail how data pre-processing affects these types of models.
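
The class balancing step can be sketched without any framework: give each image a sampling weight equal to one over the size of its class. The snippet below is an illustration under that assumption, using the per-class counts quoted for the first Murphy Lab benchmark later in this post; the helper name inverse_frequency_weights is ours. A weight list like this is what would typically be handed to torch.utils.data.WeightedRandomSampler.

```python
from collections import Counter

# Per-class image counts quoted for the first Murphy Lab benchmark in this post.
class_counts = [35, 80, 13, 16, 12, 12, 60, 40, 17]

# One integer label per image: 35 zeros, then 80 ones, and so on.
labels = [cls for cls, n in enumerate(class_counts) for _ in range(n)]

def inverse_frequency_weights(labels):
    """Return one sampling weight per example: 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

weights = inverse_frequency_weights(labels)

# Total sampling mass per class is now the same for every class, so a sampler
# that draws examples in proportion to `weights` sees each class equally
# often on average, regardless of how many images the class has.
mass_per_class = Counter()
for y, w in zip(labels, weights):
    mass_per_class[y] += w

# With PyTorch this list could be used as, for example:
#   WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
```

Sampling with replacement means images from the 12-image classes are repeated many times per epoch, which is one reason the augmentations above matter.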

Images were either cropped or resized, depending on which was better for feasibility and performance, so that each image conformed to the 224x224 input size used with ImageNet. Notably, we also fed these resized images into our custom CNNs, and it remains an open question whether other bespoke CNNs built on non-resized images could work better. (We will explore this in later posts.) Our entire experimentation source code is available on git, while the images can be downloaded from their original sources linked in the sections below.

Full and Partial Transfer Learning versus CNNs from Scratch

We sought to answer the following questions as we investigated how transfer learning could be used to build predictive models on these small but relatively homogeneous datasets:

Is ImageNet-based transfer learning better for this type of data than a CNN built from scratch?
Is reusing only lower layers of a pre-trained model, as some researchers suggest for this domain, more effective?
How fragile are individual models built using subsets of training data during cross-validation? Should we employ a voting ensemble method due to our ultra-small datasets?

To answer these questions, we used pre-trained vgg16 and ResNet18, and two CNNs built from scratch (without pre-training on ImageNet) to set up the following experimental models:

Models we used in our experiments.

For each dataset, we kept a global holdout of 10% of the entire dataset to compare performance across the models above. The remaining data was used for 5-fold cross-validation, repeated four times, giving us 20 models for each dataset-model pair. We then calculated the average weighted accuracy across all 20 models on both the holdout test set and a local test set used in each cross-validation. When training the models, we used the same batch size (32), while we calculated a reasonable learning rate and epochs for each benchmark to allow all models to finish learning, but not take too long to run 20 trials for each of the six architectures per benchmark. Typically, this was around 20–30 epochs and a learning rate of 1e-3, unless otherwise noted below. We did not freeze the layers of the models.
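
The splitting scheme above (a 10% global holdout, then 5-fold cross-validation repeated four times) can be sketched in plain Python. This is an illustrative version only: the function name and seed are ours, and the splits are not stratified by class, which the real experiments may well have done.

```python
import random

def holdout_and_repeated_kfold(n_items, holdout_frac=0.10, k=5, repeats=4, seed=0):
    """Split item indices into a global holdout plus k * repeats CV splits."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)

    # The holdout is carved off once and never used for training.
    n_holdout = max(1, round(n_items * holdout_frac))
    holdout, rest = indices[:n_holdout], indices[n_holdout:]

    splits = []
    for _ in range(repeats):
        shuffled = rest[:]
        rng.shuffle(shuffled)  # a fresh partition for every repeat
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            local_test = folds[i]
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            splits.append((train, local_test))
    return holdout, splits

# 285 is the total image count of the first Murphy Lab benchmark in this post.
holdout, splits = holdout_and_repeated_kfold(n_items=285)
```

Each of the 20 (train, local_test) pairs yields one trained model, and all 20 models are scored on the same untouched holdout.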

We also used a voting ensemble model that combined the predictions of all twenty models on the holdout test sets. Notably, we didn’t expect these models to perform well out-of-the-box on our benchmarks; our goal was to make basic comparisons. A data scientist trying to build an effective model for a specific dataset would devote significant time to optimizing such models. These experiments are meant to be used as a broad guideline for future experiments related to small biomedical datasets.
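
A voting ensemble of this kind can be as simple as a per-image majority vote over the class predictions of the individual models. The sketch below assumes hard (argmax) votes and uses toy predictions from three hypothetical models; the post does not specify whether its ensemble used hard votes or averaged probabilities.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model class predictions into one prediction per image.

    predictions_per_model[m][i] is the class that model m predicts for image i.
    Ties go to the class that reached the top count first in model order.
    """
    n_images = len(predictions_per_model[0])
    voted = []
    for i in range(n_images):
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Toy example: three models, four holdout images.
preds = [
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 0],
]
ensemble = majority_vote(preds)  # [0, 1, 2, 1]
```

No single model above is right about every image if the true labels are [0, 1, 2, 1], yet the vote is, which is the effect the ensemble is meant to exploit.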

Cell Shape Detection and Transfer Learning

Several of our datasets involved classifying what cell type (or shape) a culture slide belonged to; these are common idioms and tasks in the biomedical domain, whether used for classifying proteins or sub-cellular structures. Learning to identify these types of low-level textures and patterns is fundamental in the biomedical classification domain; however, it’s unclear how much overlap there is between these features and what ImageNet-based models are prepared to transfer.

Therefore, we trialed our two ImageNet-based models (vgg16 and ResNet18), both in full and using only their lower layers, against two CNN models (one deeper, one shallower) without transfer learning; a discussion of the results is presented for each dataset below. To reduce cognitive load, we’re only showing the results on the holdout test set for each benchmark, which were almost always in line with each model’s test set performance during its cross-validation trials. We also show the voting ensemble model performance on each holdout, where each of the 20 models trained through cross-validation votes on the classification.

We first tested transfer learning on a sub-cellular protein classification challenge. From the perspective of a computer vision model, we can imagine this problem as trying to distinguish between different patterns of cell shapes:

Confocal images of sub-cellular proteins from the Murphy Lab, with [35, 80, 13, 16, 12, 12, 60, 40, 17] images per class.

Performance of various models across the sub-cellular protein challenge on the holdout dataset

In this dataset, we resized these 1024x1024 pixel images to be 224x224 (to avoid cropping out salient structures), applied the basic transformations mentioned earlier, and trained the five models above, using each of the 20 models generated during cross-validation to create a voting ensemble model. Notably, transfer learning from ImageNet seemed to help here, although there wasn’t a large difference between using the whole architecture, versus only the lower layers.

The same setup was used for another dataset for classifying sub-cellular structures in 1024x1024 pixel images, with similar trends:

CHO cell images from the Murphy Lab, with [78, 70, 98, 34, 52] images per class.

Performance of various models across the sub-cellular structure challenge on the holdout dataset

However, in this experiment, all transfer learning models approach near-perfect performance, especially with voting, presumably because there are fewer classes, with more images per class available to train on. This problem seems trivial even for untrained humans, and machines also easily learn these differences, although the general caveat of witnessing such high performance on such a small dataset always leaves the nagging question about model generalization.

Moving on to another sub-cellular protein classification dataset, we trialed our models against similar protein staining as on the images above, this time on epithelial cells from a small, random subset of a COVID-19 dataset of 1024x1024 pixel images:

Different stains of epithelial cells from the RxRx19 SARS-CoV-2 dataset, with [455, 505, 495, 449, 487] randomly-sampled images for each class.

We resized the 1024x1024 pixel images to 224x224, rather than cropping them, because resizing yielded better results; we also trained for only 10 epochs with a learning rate of 1e-4. This dataset is also easy for untrained humans to classify correctly, so it’s not surprising that the models above performed so well, especially because they had hundreds of images per class to train on:

Performance of various models across the Covid19 protein stain challenge on the holdout dataset

All the models performed quite well, hinting that the combination of task and number of images per class has reached a point of saturation.

How about another cell shape challenge, with many classes but few images per class? We predict that classifying different cell shapes, which we manually annotated from a Kaggle data science bowl, will go similarly to our first benchmark, with some lift from transfer learning. Even though the shape labels in this dataset aren’t biologically relevant, there are other applications where data scientists try to build models to differentiate cell shape and/or contents. These images varied widely in size, from 256x256 to 1024x1024 pixels, so we chose to greyscale and resize them to 224x224, as resized images are more likely to end up with similar cell sizes (making it more interesting to try to classify them based on cell shape):

Different cell shapes from a Kaggle 2018 data science bowl, with [16, 92, 75, 157, 31, 70, 62, 39] images per class.

Performance of various models across the Kaggle shapes challenge on the holdout dataset

Transfer learning appears to be beneficial on this dataset, and performance is surprisingly good given the limited class sizes. Why? One difference between this dataset and the others we’ve seen so far is that we manually labelled these images with a shape, and the images originally came from different instruments and datasets in the Kaggle bowl competition; our task here was to identify cell shapes, but these Kaggle images were originally meant for nucleus segmentation. Therefore, one possibility is that our models were picking up on background noise that happened to be correlated with the cell shape due to the original image collection. Another theory is that the differences between the classes here are more subtle than in the previous two, more obvious datasets, and perhaps transfer learning is particularly suited to teasing them apart.

Transfer learning for cell shape detection in cell cultures

Overall, we learned that for this type of cell-shape-based classification, transfer learning with either ImageNet-based model (vgg or ResNet) outperformed other CNN-based approaches. Part of the reason transfer learning works well for these problems may be that identifying different cell shapes on a slide of multiple cells is similar to identifying different textures. This is something that ImageNet-based models are thought to be good at, and perhaps we’re able to recycle their lower-level features in these experiments. We also observed the power of voting ensemble methods to outpace the prediction accuracy of any individual model. In fact, a voting algorithm seemed necessary when using a model built with only lower layers of the original vgg or ResNet architecture, due to variance in individual model performance.

In our next blog post, we’ll continue this set of experiments on pre-segmented cells, as well as other types of biomedical datasets. We hypothesize that transfer learning from ImageNet may be less useful under these circumstances, given that these newer datasets might be more difficult to translate into texture challenges for our models. Stay tuned to see if these conclusions hold on other types of biomedical datasets, next!

Thanks to my colleagues Felipe Mejia, John Speed Meyers, and Vishal Sandesara

Thank you to the Murphy Lab for making available many of the datasets used here. For further information, see their paper: X. Chen, M. Velliste, S. Weinstein, J.W. Jarvik and R.F. Murphy (2003). Location proteomics — Building subcellular location trees from high resolution 3D fluorescence microscope images of randomly-tagged proteins. Proc. SPIE 4962: 298–306.

Translated from: https://gab41.lab41.org/transfer-learning-for-classification-in-ultra-small-biomedical-datasets-2d332ae87bfb