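P.S. I had already worked with these CSV files a little, but I kept no notes and found I had forgotten almost all of it, so writing things down really does matter. The DSB2017 grt team's code happens to use rather unusual CSV files, so I converted the Tianchi dataset's CSVs into the layout they use. Keep going.
1. Their shorter.csv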
000,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860
001,1.3.6.1.4.1.14519.5.2.1.6279.6001.100332161840553388986847034053
002,1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793540579077826395208
003,1.3.6.1.4.1.14519.5.2.1.6279.6001.100530488926682752765845212286
004,1.3.6.1.4.1.14519.5.2.1.6279.6001.100620385482151095585000946543
005,1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016233746780170740405
006,1.3.6.1.4.1.14519.5.2.1.6279.6001.100684836163890911914061745866
007,1.3.6.1.4.1.14519.5.2.1.6279.6001.100953483028192176989979435275
008,1.3.6.1.4.1.14519.5.2.1.6279.6001.101228986346984399347858840086
009,1.3.6.1.4.1.14519.5.2.1.6279.6001.102133688497886810253331438797
010,1.3.6.1.4.1.14519.5.2.1.6279.6001.102681962408431413578140925249
Each row above is an index followed by the series name.
Implementation code:
# coding=UTF-8
import os
import pandas as pd

# I also moved the validation set into the training set, naming the folders train_subset15 to train_subset19.
tianchi_raw = '/media/pacs/0000E2850005C030/DcmData/xlc/tanchi/sharelink4184761691-814629355569975/天池大赛肺部结节智能诊断/train/'
# Collect the paths of all subset folders, e.g. tianchi_raw + 'train_subset15'.
subsetdirs = [os.path.join(tianchi_raw, f) for f in os.listdir(tianchi_raw)
              if f.startswith('train_subset') and os.path.isdir(os.path.join(tianchi_raw, f))]
namelist = []
for subsetdir in subsetdirs:
    for filename in os.listdir(subsetdir):
        if filename.endswith('.mhd'):
            namelist.append(filename[:-4])  # strip the '.mhd' extension
# When building the DataFrame from a dict, the 'name' key is required; otherwise an error is raised.
save_name = pd.DataFrame({'name': namelist})
save_name.to_csv('shorter.csv', header=False, index=True)
The result is as follows:
0,LKDS-00001
1,LKDS-00003
2,LKDS-00004
3,LKDS-00005
4,LKDS-00007
5,LKDS-00011
6,LKDS-00013
7,LKDS-00015
8,LKDS-00016
9,LKDS-00019
10,LKDS-00020
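The team's file uses zero-padded three-digit indices (000, 001, ...), while the output above uses plain integers. Below is a minimal sketch of one way to produce the padded form, assuming the same train_subset* folder layout. The helper name `build_shorter` and the column names are hypothetical, and sorting the names is my own addition (directory listing order is not guaranteed), so the numbering will differ from the per-folder order of the code above. Whether the downstream code actually needs the padding is not something I have verified.

```python
import glob
import os
import pandas as pd

def build_shorter(root, out_csv='shorter.csv'):
    """Write 'index,name' rows for every .mhd file under root/train_subset*/."""
    pattern = os.path.join(root, 'train_subset*', '*.mhd')
    # sorted() keeps the numbering reproducible across runs and machines
    names = sorted(os.path.splitext(os.path.basename(p))[0] for p in glob.glob(pattern))
    ids = [f'{i:03d}' for i in range(len(names))]  # '000', '001', ...
    table = pd.DataFrame({'id': ids, 'name': names})
    # the ids are already a column, so the DataFrame index is not written
    table.to_csv(out_csv, header=False, index=False)
    return table
```

Calling `build_shorter(tianchi_raw)` would write shorter.csv in the current directory, using the `tianchi_raw` path defined earlier.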
Code for another approach:
# coding=UTF-8
import os
import pandas as pd

tianchi_raw = '/media/pacs/0000E2850005C030/DcmData/xlc/tanchi/sharelink4184761691-814629355569975/天池大赛肺部结节智能诊断/train/'
df = pd.DataFrame(columns=['seriesuid'])
subsetdirs = [os.path.join(tianchi_raw, f) for f in os.listdir(tianchi_raw)
              if f.startswith('train_subset') and os.path.isdir(os.path.join(tianchi_raw, f))]
ii = 1
for subsetdir in subsetdirs:
    for filename in os.listdir(subsetdir):
        if filename.endswith('.mhd'):
            data = {'seriesuid': filename[:-4]}
            index = pd.Index(data=[ii], name='id')  # define the index
            dfn = pd.DataFrame(data, index=index)
            # Concatenates by rows by default; pass axis=1 to concatenate by columns.
            df = pd.concat([df, dfn], ignore_index=True)
            ii = ii + 1
df.to_csv('annotations2.csv', header=False, index=True)
The code above was my first version, and looking back it is pretty silly: it builds one DataFrame (that is, one table) per name and then concatenates them all. Hahaha.
2. Their lunaqualified.csv
5,-24.014,192.1,-391.08,8.1433
5,2.4415,172.46,-405.49,18.545
5,90.932,149.03,-426.54,18.209
5,89.541,196.41,-515.07,16.381
7,81.51,54.957,-150.35,10.362
10,105.06,19.825,-91.247,21.09
12,-124.83,127.25,-473.06,10.466
14,-106.9,21.923,-126.92,9.7453
16,2.2638,33.526,-170.64,7.1685
17,-70.551,66.359,-160.94,6.6422
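The file name is replaced by the corresponding index from the first file. (The remaining columns are x, y, z and diameter d.)
A note: the two steps implied in their file, rounding the values to a few digits and filtering out nodules smaller than 6 mm, are ones I am not going to do here.
Implementation code
Step 1: concatenate the two annotation files (training set and validation set) to get annotations_together.csv
# coding=UTF-8
import glob
import pandas as pd

# I renamed the train and val annotations.csv files and put them in one folder.
csv_files = glob.glob('/media/pacs/0000E2850005C030/DcmData/xlc/tanchi/sharelink4184761691-814629355569975/天池大赛肺部结节智能诊断/csv/合并/*.csv')
df = pd.DataFrame(columns=['seriesuid', 'coordX', 'coordY', 'coordZ', 'diameter_mm'])
for csv in csv_files:
    df = pd.merge(df, pd.read_csv(csv), how='outer')
df.to_csv('annotations_together.csv', header=True, index=False)
The merged file looks like this: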
seriesuid,coordX,coordY,coordZ,diameter_mm
LKDS-00375,-122.003793556,128.088202005,384.529998779,7.77904231077
LKDS-00640,69.8244009958,103.039681448,251.599975586,23.8006292592
LKDS-00728,93.1056798986,163.855363176,225.5,11.0826543246
LKDS-00095,115.437994164,-153.882553652,-104.800001383,8.40507666939
LKDS-00807,52.6415211306,15.0564420021,69.5354003906,12.3348918533
...
LKDS-00161,-91.2077242944,-129.558625252,32.6999982595,13.9877103298
LKDS-00864,77.1092168414,4.14245411706,174.5,14.4626808554
LKDS-00570,-75.3992919922,238.30329895,194.724975586,11.2967527665
LKDS-00570,96.3452785326,217.390879755,269.724975586,4.60107130749
LKDS-00010,-111.182779948,217.531738281,-275.400024414,4.43397444007
Using pd.concat is also very convenient:
# coding=UTF-8
import glob
import pandas as pd

# I renamed the train and val annotations.csv files and put them in one folder.
csv_files = glob.glob('/media/pacs/0000E2850005C030/DcmData/xlc/tanchi/sharelink4184761691-814629355569975/天池大赛肺部结节智能诊断/csv/合并/*.csv')
df = pd.DataFrame(columns=['seriesuid', 'coordX', 'coordY', 'coordZ', 'diameter_mm'])
for csv in csv_files:
    df = pd.concat([df, pd.read_csv(csv)], axis=0)
df.to_csv('annotations_together.csv', header=True, index=False)
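Since pd.concat accepts a list of frames, the loop and the empty starting DataFrame can be dropped. Here is a minimal sketch, assuming every CSV in the folder has the same five columns. The helper name `merge_annotations` and its parameters are hypothetical, and `ignore_index=True` is my own choice so that the row labels run 0..n-1 in the merged table.

```python
import glob
import os
import pandas as pd

def merge_annotations(csv_dir, out_csv='annotations_together.csv'):
    """Stack every CSV in csv_dir (same columns assumed) into one file."""
    csv_files = sorted(glob.glob(os.path.join(csv_dir, '*.csv')))
    frames = [pd.read_csv(path) for path in csv_files]
    # one concat call over the whole list; raises ValueError if no CSV was found
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_csv, header=True, index=False)
    return merged
```

If `out_csv` is written into `csv_dir` itself, a second run would pick it up as an input, so it is safer to write it somewhere else.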
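Step 2: using the shorter.csv obtained in part 1, replace the names in the 'seriesuid' column of annotations_together.csv (from the previous step) with the indices from shorter.csv.
# coding=UTF-8
import pandas as pd

# Note: read_csv treats the first line as a header by default, so the first line of
# shorter.csv is never matched below (see the P.S. at the end of this post).
shh = pd.read_csv('shorter.csv')
att = pd.read_csv('annotations_together.csv')
print(att['seriesuid'][1243])
print(len(att))
# Two ways to modify a value by direct assignment; reference: https://blog.csdn.net/dark_tone/article/details/80179644
# df.at[0, 'city'] = 'Tianjin'
# or use .loc, which has the same effect
# df.loc[0, 'city'] = 'Tianjin'
print(shh.loc[0][1])  # the file has no header row, so I index by position; replacing loc with at does not work
for i in range(len(att)):
    for j in range(len(shh)):
        if shh.loc[j][1] == att['seriesuid'][i]:
            # assign through .loc to avoid chained assignment on a copy
            att.loc[i, 'seriesuid'] = shh.loc[j][0]
            break
att.to_csv('lunaqualified.csv', header=False, index=False)
The resulting lunaqualified.csv: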
221,-122.003793556,128.088202005,384.529998779,7.77904231077
374,69.8244009958,103.039681448,251.599975586,23.8006292592
428,93.1056798986,163.855363176,225.5,11.0826543246
58,115.437994164,-153.882553652,-104.800001383,8.40507666939
475,52.6415211306,15.0564420021,69.5354003906,12.3348918533
475,-44.7023808214,66.1236872439,100.535400391,8.28980791179
475,-108.547683716,-14.5947265625,116.535400391,7.23274492721
475,-119.902752776,6.93833234441,174.535400391,10.6162775597
86,-129.004266036,-145.044870477,1973.20001185,13.6076118378
86,-129.482627467,-145.365182977,1973.80001187,13.9016228368
All done, hehe, although it took quite a while. As the saying goes, the palest ink beats the best memory. Hahaha.
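The nested loop in step 2 compares every annotation against every row of shorter.csv. A dictionary lookup through `Series.map` does the same replacement in one pass. This is a minimal sketch, assuming shorter.csv was written with header=False as in part 1. The helper name `names_to_ids` and the column names 'id' and 'name' are hypothetical.

```python
import pandas as pd

def names_to_ids(annotations, shorter_csv='shorter.csv'):
    """Return a copy of annotations with seriesuid names replaced by indices."""
    # header=None: the first line of shorter.csv is data, not column names
    shorter = pd.read_csv(shorter_csv, header=None, names=['id', 'name'])
    lookup = dict(zip(shorter['name'], shorter['id']))
    out = annotations.copy()
    # names missing from shorter.csv become NaN instead of being left unchanged
    out['seriesuid'] = out['seriesuid'].map(lookup)
    return out
```

For example, `names_to_ids(pd.read_csv('annotations_together.csv')).to_csv('lunaqualified.csv', header=False, index=False)` would write the same kind of file as the loop version. Checking `out['seriesuid'].isna().any()` afterwards is a cheap way to spot names that had no match.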
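3. A quick idea for filtering out nodules smaller than 6 mm: df[df['diameter_mm']>=6] is all it takes.
Addition: it turned out to be simple enough, so here is the code as well.
# coding=UTF-8
import pandas as pd

att = pd.read_csv('annotations_together.csv')
aa = att[att['diameter_mm'] >= 6]  # keep nodules with a diameter of at least 6 mm
aa.to_csv('sift.csv', header=True, index=False)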
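As a result, the number of Tianchi nodules dropped from 1244 to 843 (over the 800 CTs of the preliminary round, training and validation combined).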
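The other step I skipped in part 2, trimming the digits, could be combined with the filter. The sketch below is only a guess at the team's format: their sample rows (8.1433, 18.545, 192.1) look like five significant figures, so I assume '%.5g' here, and I have not confirmed how they actually produced their file. The helper name `filter_and_format` and its parameters are hypothetical.

```python
import pandas as pd

def filter_and_format(df, min_diameter=6.0, float_format='%.5g'):
    """Drop small nodules and return header-less CSV text with trimmed floats."""
    kept = df[df['diameter_mm'] >= min_diameter]
    # float_format applies only to float columns; integer ids are written unchanged
    return kept.to_csv(header=False, index=False, float_format=float_format)
```

The function returns the CSV text, so writing it out is a matter of `open('lunaqualified.csv', 'w').write(...)`, or of passing a path to `to_csv` instead.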
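P.S. This is honest original work; if it helped you, don't forget to give it a like.
P.S. (added): the step 2 code in part 2 above contained a small mistake that cost me more than three hours.
Take shh.loc[j][0]: when j=0, shh.loc[0][0] returns the first value of the second line of shorter.csv, and the first line can never be reached. This is because pd.read_csv treats the first line of the file as the header by default. So in the part 1 code,
save_name.to_csv('shorter.csv',header=False,index=True) should be changed to header=True.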
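The pitfall is easy to reproduce without touching any files. The sketch below uses three made-up rows in the shorter.csv layout and reads them twice. Passing header=None is an alternative to rewriting the file with a header; the column names 'id' and 'name' are my own labels.

```python
import io
import pandas as pd

raw = "0,LKDS-00001\n1,LKDS-00003\n2,LKDS-00004\n"

# Default: the first line is consumed as column names, so one data row is lost.
default_read = pd.read_csv(io.StringIO(raw))
# header=None: every line is data; names= supplies the column labels.
explicit_read = pd.read_csv(io.StringIO(raw), header=None, names=['id', 'name'])

print(len(default_read), list(default_read.columns))   # 2 ['0', 'LKDS-00001']
print(len(explicit_read), explicit_read.iloc[0, 1])    # 3 LKDS-00001
```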