How is Artificial Intelligence used in the medical domain? One of the clichéd answers to this type of question is lung cancer detection. But really, how many of you have ever seen lung image data before? Or even a simple Jupyter kernel going through the preprocessing steps on this type of data? To be honest, despite its status as a classic data science example, it's not a project one can simply undertake. It's nothing like the Boston house pricing example you can easily find on Kaggle.
But honestly, it's not as hard as you think. With just some effort and time, I can guarantee you can do it. You will learn more than you would from projects on tabular data: how to process images, manage mask and image files, load image files, and much more!
In this article, I would like to go through the procedure for starting your very first lung cancer detection project. I started this project when I was a newbie to Python, and I had a hard time going through other people's GitHub repositories and code online. I hope my explanation can help those who are just starting their research or project in lung cancer detection.
The whole procedure is divided into three steps: preprocessing the data, training a segmentation model, and training a classification model. Here, I will only talk about downloading and preprocessing the data. You will need a working computer and at least 130 GB of free storage (you don't need to download the whole dataset if you just want a glimpse of it). In later parts of this article, I will go through the model construction. Let's begin!
1. Download the Data
Of course, you will need lung images to start your cancer detection project. You might be expecting PNG, JPEG, or some other common image format, but lung images come from CT scans and take a different form: DICOM (Digital Imaging and Communications in Medicine). It's a widely used format in the medical domain.
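To get a first feel for the format, you can open a single slice with the pydicom library (pip install pydicom). The file path below is just a placeholder for wherever your download lands:

import pydicom

# One slice of one patient's CT scan (placeholder path; adjust to your download)
ds = pydicom.dcmread("LIDC-IDRI/LIDC-IDRI-0001/scan/000001.dcm")

print(ds.PatientID)          # e.g. "LIDC-IDRI-0001"
print(ds.Modality)           # "CT"
print(ds.pixel_array.shape)  # typically (512, 512) per slice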
We will use the open-source LIDC-IDRI dataset, which contains the DICOM files for each patient. First, visit the website and click the search button.
We only need the CT images for our training. The whole dataset consists of 1,010 patients and takes up 125 GB of storage. If this is too heavy for your device, just select the number of patients you can afford and download those.
Check out the next steps to see where your data should be located after downloading.
2. Clone the Preprocessing Code
Go to my GitHub and clone the repository into the directory you are working in. Save the LIDC-IDRI dataset under the folder "LIDC-IDRI" in the cloned repository.
git clone https://github.com/jaeho3690/LIDC-IDRI-Preprocessing.git
3. Set up the Pylidc Configuration
Pylidc is a library for easily querying the LIDC-IDRI database. It will help you create mask images for the lung nodules. On the website, you will find installation instructions. Make sure to follow them, as the whole codebase depends on this setup.
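For reference, the pylidc documentation asks for a configuration file named .pylidcrc in your home directory (pylidc.conf on Windows) that tells the library where the DICOM data lives. A minimal example, assuming the dataset sits inside the cloned repository, might look like this:

# ~/.pylidcrc (or C:\Users\<User>\pylidc.conf on Windows)
[dicom]
path = /path/to/LIDC-IDRI-Preprocessing/LIDC-IDRI
warn = True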
4. Script Explanation
Most of the explanations of my code are on GitHub, but I will elaborate on them here.
1. config_file_create.py
This Python script creates a configuration file, lung.conf, which contains the directory settings and some hyperparameter settings for the Pylidc library. A configuration file manages all the wordy directories and extra settings you need to run the code. Keeping them in a separate configuration file makes it easy to debug and change settings.
You can use the given settings as they are, or change them as you wish. For Pylidc's hyperparameter settings, you can find more information in the documentation.
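If you are curious what the script does under the hood, a file like lung.conf can be written with Python's built-in configparser module. The section and option names below are illustrative, not necessarily the exact ones in my script:

import configparser

config = configparser.ConfigParser()
config['prepare_dataset'] = {
    'LIDC_DICOM_PATH': './LIDC-IDRI',   # where the downloaded DICOM data lives
    'MASK_PATH': './data/Mask',         # where nodule masks will be written
    'IMAGE_PATH': './data/Image',       # where segmented lung images will be written
}
config['pylidc'] = {
    'confidence_level': '0.5',  # annotator consensus level for combining masks
}

with open('lung.conf', 'w') as f:
    config.write(f)

The other scripts can then read these values back with config.read('lung.conf') instead of hard-coding paths everywhere.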
python config_file_create.py
2. prepare_dataset.py
Running this Python script will first segment the lung regions from the DICOM dataset and save each segmented lung image together with its corresponding mask image. When I first started this project, I confused the segmentation of lung regions with the segmentation of lung nodules. Segmenting the lung region, as the words suggest, means keeping only the lung regions from the DICOM data. This is done to reduce the search area for the model. You could use a dedicated segmentation model just for this, but simple K-Means clustering plus morphological operations are enough (utils.py contains the algorithm needed; a sketch of the idea follows below). Segmenting a lung nodule means finding prospective lung cancer in the lung image. For that you would need to train a segmentation model such as a U-Net (I will cover this in Part 2, but you can already find the repository on my GitHub. I still need some time to edit it, but it works fine on my computer). Make sure you distinguish the two!
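To make the lung-region step concrete, here is a minimal sketch of the K-Means-plus-morphology idea, using scikit-learn and scikit-image. It illustrates the principle rather than reproducing utils.py exactly, and it assumes 512×512 CT slices:

import numpy as np
from sklearn.cluster import KMeans
from skimage import measure, morphology

def segment_lung_region(ct_slice):
    # ct_slice: 2D numpy array of Hounsfield units, assumed 512x512
    img = (ct_slice - np.mean(ct_slice)) / np.std(ct_slice)

    # Cluster intensities of the central region into two groups:
    # dark air/lung pixels vs. brighter soft tissue
    middle = img[100:400, 100:400]
    kmeans = KMeans(n_clusters=2, n_init=10).fit(middle.reshape(-1, 1))
    threshold = np.mean(kmeans.cluster_centers_)
    binary = img < threshold

    # Morphological cleanup: erosion removes small specks,
    # dilation fills out the lung fields
    binary = morphology.erosion(binary, morphology.disk(2))
    binary = morphology.dilation(binary, morphology.disk(10))

    # Keep only connected components that look like lungs
    # (away from the image border, not spanning the whole slice)
    labels = measure.label(binary)
    mask = np.zeros_like(binary, dtype=np.uint8)
    for region in measure.regionprops(labels):
        min_r, min_c, max_r, max_c = region.bbox
        if max_r - min_r < 475 and max_c - min_c < 475 and min_r > 40 and max_r < 472:
            mask[labels == region.label] = 1

    return mask * ct_slice  # zero out everything outside the lung region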
After segmenting the lung region, each lung image and its corresponding mask file is saved in .npy format. The .npy format is NumPy's file format, often used for saving matrices or N-dimensional arrays. Some patients in the LIDC-IDRI dataset have only very small nodules or non-nodules, so they have no masks. I treat these as a "Clean" dataset (let me know if there is an official term); it will be used for validation purposes in the classification stage. Random slices of this Clean dataset are saved under the Clean folder.
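Saving and loading these arrays is a one-liner each in NumPy (the file name below is only an example, not the exact naming scheme of the script):

import numpy as np

lung_slice = np.random.rand(512, 512)           # stand-in for a segmented slice
np.save('0001_NI000_slice000.npy', lung_slice)  # example file name
restored = np.load('0001_NI000_slice000.npy')
assert np.array_equal(lung_slice, restored)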
Not only does this script save image files, it also creates a meta.csv file containing information about each nodule: the slice number, the nodule number, the malignancy of the nodule, and the directories of both the image and the mask.
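You can inspect the resulting file with pandas; the column names below are indicative, so check the actual meta.csv for the exact ones:

import pandas as pd

meta = pd.read_csv('meta.csv')
print(meta.columns)                       # e.g. patient_id, nodule_no, slice_no, malignancy, ...
print(meta['malignancy'].value_counts())  # distribution of malignancy labels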
python prepare_dataset.py
3. notebook/make_label.ipynb
This Jupyter notebook edits the meta.csv file created by prepare_dataset.py. It creates the extra labels needed to annotate and distinguish each nodule. I also carry out the train/validation/test split here. If the split were done during model training, as in most other machine learning projects, adjacent nodule slices would very likely end up spread across the train, validation, and test sets. I consider this a form of "cheating," since adjacent images are very similar to one another. The split should therefore be done nodule-wise or patient-wise (see the sketch below). We use this CSV file later in model training.
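One common way to implement such a leakage-free split is scikit-learn's GroupShuffleSplit, grouping slices by patient ID. This is a sketch of the principle, not the exact code in the notebook, and it assumes meta.csv has a patient_id column:

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

meta = pd.read_csv('meta.csv')

# Hold out 20% of patients as the test set; grouping guarantees that
# no patient's slices appear in more than one split
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
trainval_idx, test_idx = next(gss.split(meta, groups=meta['patient_id']))
trainval, test = meta.iloc[trainval_idx], meta.iloc[test_idx]

# Split the remainder into train and validation, again patient-wise
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, val_idx = next(gss_val.split(trainval, groups=trainval['patient_id']))
train, val = trainval.iloc[train_idx], trainval.iloc[val_idx]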
Conclusion
Overall, I have explained most of what you need to start your very first lung cancer detection project. I plan to write the segmentation and classification tutorials later, after refining some code in my repository. I hope you find this article useful. Thanks!
GitHub: https://github.com/jaeho3690/LIDC-IDRI-Preprocessing
Source: https://medium.com/analytics-vidhya/how-to-start-your-very-first-lung-cancer-detection-project-using-python-part-1-3ab490964aae