TianChi/LUNA16/Kaggle lung cancer: comparing the characteristics of the pulmonary nodule datasets






1. Data volume: 1,000 patients, all of whom have nodules. Data quality (nodule diameter):
 5-10mm  10-30mm
 50%  50%


2. Data format:

 seriesuid  coordX  coordY  coordZ  diameter_mm
 LKDS_00001  -100.56  67.26  -231.81  6.44
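
As a minimal sketch, the annotation layout above could be read as follows. It assumes the file is comma-separated with the header shown; the second sample row and the helper names `load_annotations` and `size_bucket` are made up for illustration. The two size ranges are the ones listed in item 1.

```python
import csv
import io

# Sample content in the column layout shown above.
# Assumption: comma-separated; the second row is invented for illustration.
SAMPLE = """seriesuid,coordX,coordY,coordZ,diameter_mm
LKDS_00001,-100.56,67.26,-231.81,6.44
LKDS_00002,12.30,-45.10,-120.00,14.20
"""


def load_annotations(text):
    """Parse annotation rows into dicts with numeric coordinates and diameter."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "seriesuid": rec["seriesuid"],
            "center_mm": (float(rec["coordX"]),
                          float(rec["coordY"]),
                          float(rec["coordZ"])),
            "diameter_mm": float(rec["diameter_mm"]),
        })
    return rows


def size_bucket(diameter_mm):
    """Assign a nodule to one of the two size ranges listed in item 1."""
    if 5 <= diameter_mm < 10:
        return "5-10mm"
    if 10 <= diameter_mm <= 30:
        return "10-30mm"
    return "other"


annotations = load_annotations(SAMPLE)
buckets = [size_bucket(a["diameter_mm"]) for a in annotations]
```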

3. Slice thickness (mm)



For this challenge, we use the publicly available LIDC/IDRI database. This data uses the Creative Commons Attribution 3.0 Unported License. We excluded scans with a slice thickness greater than 2.5 mm. In total, 888 CT scans are included. The LIDC/IDRI database also contains annotations which were collected during a two-phase annotation process using 4 experienced radiologists. Each radiologist marked lesions they identified as non-nodule, nodule < 3 mm, and nodules >= 3 mm. See this publication for the details of the annotation process. The reference standard of our challenge consists of all nodules >= 3 mm accepted by at least 3 out of 4 radiologists. Annotations that are not included in the reference standard (non-nodules, nodules < 3 mm, and nodules annotated by only 1 or 2 radiologists) are referred to as irrelevant findings. The list of irrelevant findings is provided inside the evaluation script package (annotations_excluded.csv).
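
The reference-standard rule described above can be sketched as a simple filter. This is only an illustration: the record fields `category` and `n_radiologists` are hypothetical and do not reflect the actual LIDC/IDRI annotation format.

```python
# Hypothetical findings; field names are assumptions made for this sketch.
findings = [
    {"id": "a", "category": "nodule>=3mm", "n_radiologists": 4},
    {"id": "b", "category": "nodule>=3mm", "n_radiologists": 2},
    {"id": "c", "category": "nodule<3mm", "n_radiologists": 4},
    {"id": "d", "category": "non-nodule", "n_radiologists": 3},
]


def in_reference_standard(finding, min_agreement=3):
    """True for nodules >= 3 mm accepted by at least 3 of the 4 radiologists."""
    return (finding["category"] == "nodule>=3mm"
            and finding["n_radiologists"] >= min_agreement)


reference = [f["id"] for f in findings if in_reference_standard(f)]
# Everything else counts as an irrelevant finding.
irrelevant = [f["id"] for f in findings if not in_reference_standard(f)]
```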


Data is available on the download page. The data is structured as follows:

  • subset0.zip to subset9.zip: 10 zip files which contain all CT images
  • annotations.csv: csv file that contains the annotations used as reference standard for the 'nodule detection' track
  • sampleSubmission.csv: an example of a submission file in the correct format
  • candidates_V2.csv: csv file that contains the candidate locations for the ‘false positive reduction’ track

Additional data includes:

  • evaluation script: the evaluation script that is used in the LUNA16 framework
  • lung segmentation: a directory that contains the lung segmentation for CT images computed using automatic algorithms
  • additional_annotations.csv: csv file that contains additional nodule annotations from our observer study. The file will be available soon

Note: The dataset is used for both training and testing. To allow easier reproducibility, please use the given subsets to train the algorithm with 10-fold cross-validation.
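
One way to lay out the 10-fold cross-validation over the given subsets is sketched below. The subset names follow the zip files listed above (subset0 to subset9); the function name `cross_validation_folds` is made up for this example.

```python
def cross_validation_folds(n_subsets=10):
    """Return (training subsets, held-out subset) pairs, one per fold."""
    folds = []
    for held_out in range(n_subsets):
        train = [f"subset{i}" for i in range(n_subsets) if i != held_out]
        folds.append((train, f"subset{held_out}"))
    return folds


folds = cross_validation_folds()
for train_subsets, test_subset in folds:
    # Train on nine subsets, evaluate on the remaining one.
    print(test_subset, "<-", ", ".join(train_subsets))
```

Each subset serves as the test set exactly once, so every scan receives a prediction from a model that never saw it during training.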


In this dataset, you are given over a thousand low-dose CT images from high-risk patients in DICOM format. Each image contains a series with multiple axial slices of the chest cavity. Each image has a variable number of 2D slices, which can vary based on the machine taking the scan and patient.




The DICOM files have a header that contains the necessary information about the patient id, as well as scan parameters such as the slice thickness.
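
A minimal sketch of pulling these header fields out of one scan follows. It assumes each slice header exposes the standard DICOM keywords `PatientID` and `SliceThickness` as attributes, as the datasets returned by pydicom do; the stand-in objects and the helper name `summarize_headers` are illustrative only.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_headers(headers):
    """Return (patient id, most common slice thickness in mm) for one scan."""
    patient_ids = {h.PatientID for h in headers}
    if len(patient_ids) != 1:
        raise ValueError(f"expected one patient id, found {sorted(patient_ids)}")
    # SliceThickness may arrive as a string-like value, so convert to float.
    thickness_counts = Counter(float(h.SliceThickness) for h in headers)
    thickness_mm = thickness_counts.most_common(1)[0][0]
    return patient_ids.pop(), thickness_mm


# Stand-in headers for illustration. With pydicom installed, real headers can be
# loaded with pydicom.dcmread(path, stop_before_pixels=True) for each file.
demo = [SimpleNamespace(PatientID="patient_a", SliceThickness="2.5")
        for _ in range(3)]
patient_id, thickness_mm = summarize_headers(demo)
```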


The competition task is to create an automated method capable of determining whether or not the patient will be diagnosed with lung cancer within one year of the date the scan was taken. The ground truth labels were confirmed by pathology diagnosis.


The images in this dataset come from many sources and will vary in quality. For example, older scans were imaged with less sophisticated equipment. You should expect the stage 2 data to be, on the whole, more recent and higher quality than the stage 1 data (generally having thinner slice thickness). Ideally, your algorithm should perform well across a range of image quality.



  • Use of external data is permitted in this competition, provided the data is freely available. If you are using a source of external data, you must post the source to the official external data forum thread no later than one week prior to the deadline of the first stage.
  • This is a two-stage competition. In order to appear on the final competition leaderboard and receive ranking points, your team must make a submission during both stages of the competition.
  • Due to the large file size, Kaggle is beta testing use of BitTorrent as an alternate means of download. The image archives are encrypted in order to prevent outside access. Please do not share the decryption password. The large stage1.7z archive hosted on BitTorrent is the same as the version available for direct download.



File Descriptions

Each patient id has an associated directory of DICOM files. The patient id is found in the DICOM header and is identical to the patient name. The exact number of images will differ from case to case, varying according to the number of slices. Images were compressed as .7z files due to the large size of the dataset.

  • stage1.7z - contains all images for the first stage of the competition, including both the training and test set. This file is also hosted on BitTorrent.
  • stage1_labels.csv - contains the cancer ground truth for the stage 1 training set images
  • stage1_sample_submission.csv - shows the submission format for stage 1. You should also use this file to determine which patients belong to the leaderboard set of stage 1.
  • sample_images.7z - a smaller subset of the full dataset, provided for people who wish to preview the images before downloading the large file.
  • data_password.txt - contains the decryption key for the image files
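
The split between training and leaderboard patients described above could be derived roughly as follows. This sketch assumes both CSV files have the columns `id` and `cancer`, which is not stated in the descriptions here; the patient ids and the helper name `read_column_map` are invented for illustration.

```python
import csv
import io

# Assumption: both files use the columns "id" and "cancer"; rows are invented.
LABELS = """id,cancer
patient_a,0
patient_b,1
"""
SAMPLE_SUBMISSION = """id,cancer
patient_c,0.5
"""


def read_column_map(text, key, value, cast):
    """Map the key column to the cast value column for every CSV row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key]: cast(row[value]) for row in reader}


# Ground truth for the stage 1 training patients.
train_labels = read_column_map(LABELS, "id", "cancer", int)
# Patients listed in the sample submission form the stage 1 leaderboard set.
leaderboard_ids = sorted(read_column_map(SAMPLE_SUBMISSION, "id", "cancer", float))
```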

The DICOM standard is complex and there are a number of different tools to work with DICOM files. You may find the following resources helpful for managing the competition data:

  • The lite version of OsiriX is useful for viewing images on OSX.
  • pydicom: A package for working with images in Python.
  • oro.dicom: A package for working with images in R.
  • Mango: A useful DICOM viewer for Windows users.

