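Three lung nodule competitions can currently be found online: Tianchi, LUNA16, and Kaggle.
The forums of all three sites have plenty of source code for lung nodule recognition and detection. As a newcomer, I want to start by getting a rough picture of how the data provided by the three sites is similar and where it differs.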
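Tianchi Medical AI Competition:
The competition dataset provides several thousand low-dose lung CT scans (mhd format) from high-risk patients. Each scan is a 3D image made up of a series of axial slices of the chest cavity. The number of 2D slices per scan varies with the scanner, the slice thickness, and the patient. The mhd file has a header that contains the necessary information about the patient ID, as well as scan parameters such as the slice thickness.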
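To make the header idea concrete, here is a minimal sketch of reading an .mhd header, assuming the usual MetaImage layout of plain-text `Key = Value` lines. The sample header and all of its values are made up for illustration. Only common MetaImage keys such as `DimSize` and `ElementSpacing` are used, and the helper names `parse_mhd_header` and `scan_geometry` are my own.

```python
# Hypothetical header text; every value here is invented for illustration.
SAMPLE_MHD_HEADER = """\
ObjectType = Image
NDims = 3
BinaryData = True
Offset = -200.0 -150.0 -300.0
ElementSpacing = 0.5 0.5 1.0
DimSize = 512 512 300
ElementType = MET_SHORT
ElementDataFile = LKDS_00001.raw
"""


def parse_mhd_header(text):
    """Return the 'Key = Value' lines of an .mhd header as a dict of strings."""
    header = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        header[key.strip()] = value.strip()
    return header


def scan_geometry(header):
    """Extract the slice count and voxel spacing (x, y, z order) from a header dict."""
    dim_x, dim_y, dim_z = (int(v) for v in header["DimSize"].split())
    sp_x, sp_y, sp_z = (float(v) for v in header["ElementSpacing"].split())
    return {
        "num_slices": dim_z,
        "pixel_spacing_mm": (sp_x, sp_y),
        "slice_spacing_mm": sp_z,
    }


header = parse_mhd_header(SAMPLE_MHD_HEADER)
geometry = scan_geometry(header)
print(geometry)
```

Note that the z component of `ElementSpacing` is the distance between slices, which is often but not always the same as the nominal slice thickness.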
Nodule size | 5-10mm | 10-30mm |
Share | 50% | 50% |
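b) Apart from the nodules that underwent pathological analysis, all other nodules were labeled and confirmed by three doctors.
2. Data format:
1) CT images: mhd format
2) Nodule annotations: csv file giving each nodule's position and size (mm)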
seriesuid | coordX | coordY | coordZ | diameter_mm |
LKDS_00001 | -100.56 | 67.26 | -231.81 | 6.44 |
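3. Slice thickness (mm)
All CT scans have a slice thickness of less than 2 mm.
Note that the Tianchi dataset has not been made public yet, so it probably cannot be downloaded online.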
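The annotation format described above can be read with a few lines of Python. This is only a sketch under several assumptions: that the csv is comma-separated with the column names shown in the table, that the coordinates are world coordinates in mm, and that the scan axes are aligned with the world axes (no rotation). The origin and spacing passed to `world_to_voxel` are made-up values that would normally come from the scan's mhd header.

```python
import csv
import io

# The single sample row from the annotation table, written as comma-separated text
# (assumption: the real file is a regular comma-separated csv).
SAMPLE_CSV = """seriesuid,coordX,coordY,coordZ,diameter_mm
LKDS_00001,-100.56,67.26,-231.81,6.44
"""


def read_annotations(text):
    """Parse annotation rows into dicts with numeric centre and diameter."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "seriesuid": row["seriesuid"],
            "center_mm": (float(row["coordX"]),
                          float(row["coordY"]),
                          float(row["coordZ"])),
            "diameter_mm": float(row["diameter_mm"]),
        })
    return rows


def world_to_voxel(center_mm, origin_mm, spacing_mm):
    """Convert a world position (mm) to the nearest voxel index per axis.

    Assumes the image axes are aligned with the world axes.
    """
    return tuple(round((c - o) / s)
                 for c, o, s in zip(center_mm, origin_mm, spacing_mm))


annotations = read_annotations(SAMPLE_CSV)
# Hypothetical origin and spacing, for illustration only.
voxel = world_to_voxel(annotations[0]["center_mm"],
                       origin_mm=(-200.0, -150.0, -300.0),
                       spacing_mm=(0.5, 0.5, 1.0))
print(annotations[0]["seriesuid"], voxel)
```

The conversion matters because the csv stores positions in millimetres, while cropping a patch around a nodule needs array indices.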
LUNA16:
For this challenge, we use the publicly available LIDC/IDRI database. This data uses the Creative Commons Attribution 3.0 Unported License. We excluded scans with a slice thickness greater than 2.5 mm. In total, 888 CT scans are included. The LIDC/IDRI database also contains annotations which were collected during a two-phase annotation process using 4 experienced radiologists. Each radiologist marked lesions they identified as non-nodule, nodule < 3 mm, and nodules >= 3 mm. See this publication for the details of the annotation process. The reference standard of our challenge consists of all nodules >= 3 mm accepted by at least 3 out of 4 radiologists. Annotations that are not included in the reference standard (non-nodules, nodules < 3 mm, and nodules annotated by only 1 or 2 radiologists) are referred as irrelevant findings. The list of irrelevant findings is provided inside the evaluation script package (annotations_excluded.csv).
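In short: the challenge uses the publicly available LIDC/IDRI database.
Only scans with a slice thickness of at most 2.5 mm are kept, 888 CT scans in total.
The reference standard is the set of nodules >= 3 mm accepted by at least 3 of the 4 radiologists.
Everything outside the reference standard (non-nodules, nodules < 3 mm, and nodules marked by only 1 or 2 radiologists) counts as an irrelevant finding.
The list of irrelevant findings is in annotations_excluded.csv inside the evaluation script package.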
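The reference-standard rule is simple enough to write down as code. The sketch below is only my own restatement of the rule quoted above. The `Finding` class and its field names are hypothetical and do not correspond to any file format shipped with LUNA16.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """A hypothetical record for one annotated finding."""
    kind: str            # "nodule>=3mm", "nodule<3mm", or "non-nodule"
    n_radiologists: int  # how many of the 4 radiologists marked it


def in_reference_standard(finding):
    """True for nodules >= 3 mm accepted by at least 3 of 4 radiologists."""
    return finding.kind == "nodule>=3mm" and finding.n_radiologists >= 3


def is_irrelevant(finding):
    """Everything outside the reference standard is an irrelevant finding."""
    return not in_reference_standard(finding)


examples = [
    Finding("nodule>=3mm", 4),  # reference standard
    Finding("nodule>=3mm", 2),  # too few radiologists: irrelevant
    Finding("nodule<3mm", 4),   # too small: irrelevant
    Finding("non-nodule", 3),   # not a nodule: irrelevant
]
labels = [in_reference_standard(f) for f in examples]
print(labels)
```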
Data is available on the download page. The data is structured as follows:
Additional data includes:
Note: The dataset is used for both training and testing dataset. To allow easier reproducibility, please use the given subsets for training the algorithm for 10-folds cross-validation.
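The cross-validation scheme in the note can be sketched as follows: each of the ten given subsets is held out once for testing while the other nine are used for training. The folder names `subset0` to `subset9` are an assumption about how the download is organized, and `ten_fold_splits` is a name I made up.

```python
def ten_fold_splits(subsets):
    """Yield (train_subsets, test_subset) pairs, holding out one subset per fold."""
    for i, test_subset in enumerate(subsets):
        train_subsets = subsets[:i] + subsets[i + 1:]
        yield train_subsets, test_subset


# Assumed folder names for the ten predefined subsets.
SUBSETS = [f"subset{i}" for i in range(10)]

folds = list(ten_fold_splits(SUBSETS))
for train_subsets, test_subset in folds:
    print("test on", test_subset, "train on", len(train_subsets), "subsets")
```

Using the predefined subsets instead of a random split is what makes results comparable between different people's experiments.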
Kaggle:
In this dataset, you are given over a thousand low-dose CT images from high-risk patients in DICOM format. Each image contains a series with multiple axial slices of the chest cavity. Each image has a variable number of 2D slices, which can vary based on the machine taking the scan and patient.
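In short: the dataset contains over a thousand low-dose CT scans from high-risk patients, in DICOM format.
Each scan is a series of axial slices of the chest cavity.
The number of 2D slices differs from scan to scan, depending on the scanner and the patient.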
The DICOM files have a header that contains the necessary information about the patient id, as well as scan parameters such as the slice thickness.
So the DICOM header carries the patient ID along with scan parameters such as the slice thickness.
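Here is a small sketch of how those header fields might be used once the slices of one scan are loaded. In practice the slice objects would come from a DICOM reader such as `pydicom.dcmread`, which exposes header fields as attributes like `PatientID`, `SliceThickness`, and `ImagePositionPatient`. To keep the example runnable without any DICOM files, it uses `SimpleNamespace` stand-ins with made-up values, and the helper names are my own.

```python
from types import SimpleNamespace


def order_slices(slices):
    """Sort slices by their z position (third value of ImagePositionPatient)."""
    return sorted(slices, key=lambda s: float(s.ImagePositionPatient[2]))


def summarize_series(slices):
    """Collect the patient ID, slice count, and thickness info for one scan."""
    ordered = order_slices(slices)
    positions = [float(s.ImagePositionPatient[2]) for s in ordered]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return {
        "patient_id": ordered[0].PatientID,
        "num_slices": len(ordered),
        "slice_thickness_mm": float(ordered[0].SliceThickness),
        # Distance between neighbouring slices, computed from positions.
        "slice_gap_mm": min(gaps) if gaps else None,
    }


# Stand-ins for three slices of one hypothetical scan, deliberately out of order.
fake_slices = [
    SimpleNamespace(PatientID="patient_a", SliceThickness="2.5",
                    ImagePositionPatient=[-150.0, -150.0, z])
    for z in (-95.0, -100.0, -97.5)
]
summary = summarize_series(fake_slices)
print(summary)
```

Sorting by position is worth doing explicitly because the file order inside a patient directory is not guaranteed to match the anatomical order.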
The competition task is to create an automated method capable of determining whether or not the patient will be diagnosed with lung cancer within one year of the date the scan was taken. The ground truth labels were confirmed by pathology diagnosis.
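So the task is to build an automated method that predicts whether the patient will be diagnosed with lung cancer within one year of the scan. The ground-truth labels were all confirmed by pathology diagnosis.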
The images in this dataset come from many sources and will vary in quality. For example, older scans were imaged with less sophisticated equipment. You should expect the stage 2 data to be, on the whole, more recent and higher quality than the stage 1 data (generally having thinner slice thickness). Ideally, your algorithm should perform well across a range of image quality.
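In short: the images come from many sources and vary in quality; for example, older scans were taken with less sophisticated equipment. The stage 2 scans should on the whole be more recent and of higher quality than the stage 1 scans (generally with thinner slices). (Note: stage 1 and stage 2 can be found on the Kaggle competition page.) Ideally, your algorithm should perform well across this whole range of image quality.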
Notes:
These are competition logistics; a quick skim is enough.
Each patient id has an associated directory of DICOM files. The patient id is found in the DICOM header and is identical to the patient name. The exact number of images will differ from case to case, varying according in the number of slices. Images were compressed as .7z files due to the large size of the dataset.
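Since every patient has their own directory, a quick sanity check after extracting the archive is to count the slices per patient. The sketch below assumes the extracted layout is one sub-directory per patient ID with files ending in `.dcm`. Both the layout and the extension are assumptions on my part, and the demo builds a throwaway directory tree with empty placeholder files.

```python
import tempfile
from pathlib import Path


def slices_per_patient(root):
    """Map each patient directory name to the number of .dcm files inside it."""
    counts = {}
    for patient_dir in sorted(Path(root).iterdir()):
        if patient_dir.is_dir():
            counts[patient_dir.name] = sum(1 for _ in patient_dir.glob("*.dcm"))
    return counts


# Build a small fake dataset: two hypothetical patients with 3 and 5 slices.
with tempfile.TemporaryDirectory() as tmp:
    for patient_id, n_slices in (("patient_a", 3), ("patient_b", 5)):
        patient_dir = Path(tmp) / patient_id
        patient_dir.mkdir()
        for i in range(n_slices):
            (patient_dir / f"{i:04d}.dcm").touch()  # empty placeholder file
    demo_counts = slices_per_patient(tmp)

print(demo_counts)
```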
The DICOM standard is complex and there are a number of different tools to work with DICOM files. You may find the following resources helpful for managing the competition data:
My biggest takeaway so far is that Tianchi and LUNA16 both use mhd data while Kaggle uses DICOM data; in other words, the CT images come in different formats.