Integrating Image and Tabular Data for Deep Learning

I recently participated in the SIIM-ISIC Melanoma Classification competition on Kaggle, in which participants are asked to identify melanoma in images of skin lesions. Interestingly, the organizers also provide metadata about the patient and the anatomic site in addition to the image. In essence, we have both image data and structured (tabular) data for each example. For the image, we can use a CNN-based model, and for the tabular data, we can use embeddings and fully connected layers, as explored in my previous posts on UFC and League of Legends predictions. It is easy to build a separate model for each data modality. But what if we want to build a joint model that trains on both data modalities simultaneously? There are inspiring discussions in the competition forum, including this thread. In this post, I will demonstrate how to integrate the two data modalities and train a joint deep learning model using fastai and the image_tabular library, which I created specifically for these tasks.

The SIIM-ISIC Dataset

The SIIM-ISIC Melanoma Classification dataset can be downloaded here. The training set consists of 32542 benign images and 584 malignant melanoma images, so the dataset is extremely imbalanced. The picture below shows one example from each class. Judging from these examples, malignant lesions seem to be larger and more diffuse than benign ones.
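
To make the imbalance concrete, the short calculation below uses only the two counts quoted above. The class weight at the end is one common way to rebalance a loss function; it is shown as an illustration and is an assumption, not something this post's model is stated to use.

```python
# Class counts quoted in the text above.
n_benign = 32542
n_malignant = 584
n_total = n_benign + n_malignant

# Fraction of training images that are malignant.
positive_rate = n_malignant / n_total

# Accuracy of a trivial classifier that always predicts "benign".
always_benign_accuracy = n_benign / n_total

# Assumed rebalancing weight for the positive class: negatives per positive.
pos_weight = n_benign / n_malignant

print(f"total images:           {n_total}")
print(f"malignant fraction:     {positive_rate:.4f}")
print(f"always-benign accuracy: {always_benign_accuracy:.4f}")
print(f"negatives per positive: {pos_weight:.1f}")
```

Fewer than 2% of the training images are malignant, so a model that always answers "benign" is already about 98% accurate. This is why a ranking metric such as ROC AUC is more informative than accuracy for this task.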

Benign versus malignant

As mentioned above, metadata are available in addition to the images, as shown below:

Metadata as a Pandas dataframe

We can perform some basic analysis to investigate whether some of these features are associated with the target. Interestingly, males are more likely to have malignant melanoma than females, and age also seems to be a risk factor for malignant melanoma, as shown below. In addition, the frequency of malignant melanoma differs between anatomic sites, with the head/neck showing the highest malignancy rate. These features therefore contain useful information, and combining them with the images could help our model make better predictions. This makes sense, as doctors will probably not only examine images of skin lesions but also consider additional factors in order to make a diagnosis.
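
The kind of per-group comparison described above can be sketched as follows. The rows are toy values made up for illustration and are not taken from the competition data; only the idea (the fraction of malignant cases within each category) reflects the analysis in this post.

```python
from collections import defaultdict

# Toy rows, made up for illustration only; NOT taken from the competition data.
rows = [
    {"sex": "male", "target": 1},
    {"sex": "male", "target": 0},
    {"sex": "male", "target": 0},
    {"sex": "female", "target": 0},
    {"sex": "female", "target": 0},
    {"sex": "female", "target": 1},
    {"sex": "female", "target": 0},
    {"sex": "female", "target": 0},
]


def malignancy_rate_by(rows, column):
    """Return {category: fraction of rows with target == 1} for one metadata column."""
    counts = defaultdict(lambda: [0, 0])  # category -> [n_malignant, n_total]
    for row in rows:
        counts[row[column]][0] += row["target"]
        counts[row[column]][1] += 1
    return {category: pos / total for category, (pos, total) in counts.items()}


rates = malignancy_rate_by(rows, "sex")
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.2f}")
```

With the real metadata in a Pandas dataframe, the same summary would likely be a one-line group-by on the categorical column followed by the mean of `target`, assuming the column names match those in the competition files.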

Metadata features are associated with the target

The Approach

Our approach to integrating both image and tabular data is very similar to the one taken by the winners of the ISIC 2019 Skin Lesion Classification Challenge as described in their paper and shown in the picture below. Basically, we first load the image and tabular data for each sample, which are fed into a CNN model and a fully connected neural network, respectively. Subsequently, the outputs from the two networks will be concatenated and fed into an additional fully connected neural network to generate final predictions.
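
The two-branch design described above can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the image_tabular implementation: the class name `JointModel`, the tiny CNN standing in for a real backbone, and all layer sizes are hypothetical, and the tabular branch takes only numeric inputs (categorical features would first pass through embeddings).

```python
import torch
import torch.nn as nn


class JointModel(nn.Module):
    """Illustrative two-branch network; not the image_tabular implementation."""

    def __init__(self, n_tab_features=4, img_out=16, tab_out=8, n_classes=2):
        super().__init__()
        # Image branch: a tiny CNN standing in for a real backbone such as a ResNet.
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, img_out),
        )
        # Tabular branch: a small fully connected network.
        self.tabular_branch = nn.Sequential(
            nn.Linear(n_tab_features, tab_out),
            nn.ReLU(),
        )
        # Head: takes the concatenated features and produces class scores.
        self.head = nn.Sequential(
            nn.Linear(img_out + tab_out, 32),
            nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, image, tabular):
        img_feat = self.image_branch(image)
        tab_feat = self.tabular_branch(tabular)
        joint = torch.cat([img_feat, tab_feat], dim=1)  # concatenate per sample
        return self.head(joint)


model = JointModel()
images = torch.randn(5, 3, 32, 32)  # batch of 5 random "images"
tabular = torch.randn(5, 4)         # batch of 5 rows with 4 numeric features
logits = model(images, tabular)
print(logits.shape)
```

Because the concatenation happens inside `forward`, a single loss on the head's output sends gradients into both branches, which is what makes the training joint rather than two separate models.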

N. Gessert, M. Nielsen, and M. Shaikh et al. / MethodsX 7 (2020) 100864

Implementation with the image_tabular library

To implement the idea, we will be using PyTorch and fastai. More specifically, we will use fastai to load the image and tabular data and package them into fastai LabelLists.

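# imports for the fastai v1 data block API used below
from fastai.vision import *
from fastai.tabular import *

# load image data using train_df and prepare fastai LabelLists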
image_data = (ImageList.from_df(train_df, path=data_path, cols="image_name",
                               folder="train_128", suffix=".jpg")
              .split_by_idx(val_idx)
              .label_from_df(cols="target")
              .transform(tfms, size=size))


# add test data so that we can make predictions
test_image_data = ImageList.from_df(test_df, path=data_path, cols="image_name",
                                    folder="test_128", suffix=".jpg")
image_data.add_test(test_image_data)


tab_data = (TabularList.from_df(train_df, path=data_path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(val_idx)
                           .label_from_df(cols=dep_var))


# add the tabular test data, reusing the processor fitted on the training set
tab_data.add_test(TabularList.from_df(test_df, cat_names=cat_names, cont_names=cont_names,
                                      processor=tab_data.train.x.processor))

Next, we will integrate the two data modalities using the image_tabular library, which can be installed by running:

pip install image_tabular

We will use the get_imagetabdatasets function from image_tabular to integrate image and tabular LabelLists.

integrate_train, integrate_valid, integrate_test = get_imagetabdatasets(image_data, tab_data)


# package train, valid, and test datasets into a fastai databunch
db = DataBunch.create(integrate_train, integrate_valid, integrate_test,
                      path=data_path, bs=bs)

The databunch contains both image and tabular data and is ready to be used for training and prediction.
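
Conceptually, the integrated dataset has to return the image and the tabular row for the same sample together with one label. The sketch below shows that idea with a hypothetical `PairedDataset` class and stand-in data; it is not necessarily how image_tabular implements it.

```python
class PairedDataset:
    """Pairs item i of an image dataset with item i of a tabular dataset.

    Illustrative only; this is not necessarily how image_tabular is implemented.
    """

    def __init__(self, image_items, tabular_items):
        if len(image_items) != len(tabular_items):
            raise ValueError("image and tabular datasets must have the same length")
        self.image_items = image_items
        self.tabular_items = tabular_items

    def __len__(self):
        return len(self.image_items)

    def __getitem__(self, i):
        image, image_label = self.image_items[i]
        tabular, tabular_label = self.tabular_items[i]
        # Both views describe the same sample, so their labels must agree.
        if image_label != tabular_label:
            raise ValueError(f"label mismatch at index {i}")
        return (image, tabular), image_label


# Stand-in data: strings instead of image tensors, tuples instead of tabular rows.
image_items = [("img_0", 0), ("img_1", 1)]
tabular_items = [(("male", 45.0), 0), (("female", 60.0), 1)]

paired = PairedDataset(image_items, tabular_items)
(x_image, x_tabular), y = paired[1]
print(x_image, x_tabular, y)
```

Index alignment is the key requirement. In the code above, both LabelLists are built from the same `train_df` and split with the same `val_idx`, which keeps the two modalities in the same order.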

Once the data is ready, we can then move on to build the model. First, we need to create a CNN model, resnet50 in this case, and a tabular model using fastai. We will treat sex and anatomic site as categorical features and represent them using embeddings in the tabular model.

# cnn model for images, use Resnet50 as an example
cnn_arch = models.resnet50


# cnn_out_sz is the output size of the cnn model that will be concatenated with tabular model output
cnn_out_sz = 256


# use fastai functions to get a cnn model
image_data_db = image_data.databunch()
image_data_db.c = cnn_out_sz
cnn_learn = cnn_learner(image_data_db, cnn_arch, ps=0.2)
cnn_model = cnn_learn.model


# get embedding sizes of categorical data
emb_szs = tab_data.train.get_emb_szs()


# output size of the tabular model that will be concatenated with cnn model output
tab_out_sz = 8


# use fastai functions to get a tabular model
tabular_model = TabularModel(emb_szs, len(cont_names), out_sz=tab_out_sz, layers=[8], ps=0.2)
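
For context on the embedding sizes: to my understanding, fastai chooses a default width for each categorical variable from its number of distinct values using the heuristic below. Treat the formula as an assumption and compare it with what `get_emb_szs()` actually returns for your data. The cardinalities in the loop are hypothetical.

```python
def emb_sz_rule(n_cat):
    """Embedding width for a categorical variable with n_cat distinct values."""
    return min(600, round(1.6 * n_cat ** 0.56))


# Hypothetical cardinalities, chosen only to show how the rule scales.
for n_cat in (3, 7, 100, 10_000):
    print(f"{n_cat:>6} categories -> embedding size {emb_sz_rule(n_cat)}")
```

Low-cardinality features such as sex or anatomic site therefore get embeddings only a few units wide, which is consistent with the small tabular output size (8) chosen above.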

We are now ready to build a joint model, again using the image_tabular library. We can customize the fully connected layers by specifying the layers parameter.

# get an integrated model that combines the two components and concatenates their outputs,
# which then pass through additional fully connected layers
integrate_model = CNNTabularModel(cnn_model,
                                  tabular_model,
                                  layers = [cnn_out_sz + tab_out_sz, 32],
                                  ps=0.2,
                                  out_sz=2).to(device)

Finally, we can pack everything into a fastai learner and train the joint model.

# package everything in a fastai learner, add auc roc score as a metric
learn = Learner(db, integrate_model, metrics=[accuracy, ROCAUC()], loss_func=loss_func)


# train
learn.fit_one_cycle(10, 1e-4)


# unfreeze all layer groups to train the entire model using differential learning rates
learn.unfreeze()
learn.fit_one_cycle(5, slice(1e-6, 1e-4))
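
As far as I understand fastai v1, passing `slice(start, stop)` as the learning rate assigns the smallest rate to the earliest layer group and the largest to the last, with the rates spaced by a constant factor. The sketch below reproduces that spacing in plain Python. The helper name `geometric_lrs` and the choice of three layer groups are assumptions for illustration; how many groups the joint model has depends on how it is split, which this post does not state.

```python
def geometric_lrs(start, stop, n_groups):
    """Return n_groups learning rates spaced by a constant factor from start to stop."""
    if n_groups == 1:
        # With a single group there is nothing to spread; use the largest rate.
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]


# Three layer groups is an assumption for illustration.
lrs = geometric_lrs(1e-6, 1e-4, 3)
for group, lr in enumerate(lrs):
    print(f"layer group {group}: lr = {lr:.1e}")
```

The intent is that pretrained early layers, which already encode generic image features, change slowly, while the newly added layers near the output can adapt faster.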

The entire workflow is detailed in this Jupyter notebook.

Results

The model achieved a ROC AUC score of about 0.87 on the validation set after training for 15 epochs. I subsequently submitted the predictions made by the trained model on the test set to Kaggle and got a public score of 0.864. There is definitely much room for improvement.
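
For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen malignant example receives a higher score than a randomly chosen benign one. The sketch below computes it directly from that definition. The labels and scores are toy values made up for illustration, and the quadratic loop is meant for clarity, not for large datasets.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels and scores, made up for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because the metric depends only on the ordering of the scores, it is unaffected by the heavy class imbalance in the way accuracy is.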

Kaggle public score

Summary

In this post, we used fastai and image_tabular to integrate image and tabular data and built a joint model trained on both data modalities simultaneously. As noted above, there are many opportunities for further improvement. For example, we can try more advanced CNN architectures such as ResNeXt. Another question is how many neurons we should allocate to the image and tabular branches before concatenation. In other words, how should we decide the relative importance or weights of the two data modalities? I hope this can serve as a framework for further experimentation and improvement.
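
As a starting point for that question, the split used in this post can be quantified with the two branch output sizes defined earlier. This is simple arithmetic on those values, not a measure of how much each modality actually influences the predictions.

```python
# Branch output sizes used earlier in this post.
cnn_out_sz = 256
tab_out_sz = 8

joint_width = cnn_out_sz + tab_out_sz
tabular_share = tab_out_sz / joint_width

print(f"joint feature width: {joint_width}")
print(f"tabular share:       {tabular_share:.1%}")
```

The tabular branch contributes only about 3% of the concatenated features here. Since the weights of the final layers are learned, a small share of inputs does not by itself imply a small influence, so the ratio is best treated as a hyperparameter to tune on the validation set.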

Source Code

The source code of image_tabular and jupyter notebooks for the SIIM-ISIC Melanoma Classification competition can be found here.

Acknowledgments

The image_tabular library relies on the fantastic fastai library and was inspired by the code of John F. Wu.

I am an immunologist with bioinformatics and programming skills. I am interested in data analysis, machine learning, and deep learning.

Website: www.ytian.me
Blog: https://medium.com/@yuan_tian
LinkedIn: https://www.linkedin.com/in/ytianimmune/
Twitter: https://twitter.com/_ytian_

Translated from: https://towardsdatascience.com/integrating-image-and-tabular-data-for-deep-learning-9281397c7318
