Table of Contents
2022 数维杯 (Shuwei Cup) International Mathematical Modeling Contest, Problem C
Data Description
Task 1
Data Analysis
Task 2
Selecting the brain-structure and behavioral features
Merging the two tables
Random forest classification
Random forest parameter output
Task 3
Reading in the data and dropping the ecog columns
Merging and encoding the two tables
Clustering into the three major classes AD, CN, and MCI
Sub-clustering MCI into three subclasses (SMC, EMCI, LMCI)
Task 4
Importing the raw data
Preprocessing the raw data
Visualizing the remaining features
Analyzing the CN, MCI, and Dementia tables separately
Task 5
A few notes:
All of the columns above are each participant's basic information and cannot be treated as features (except AGE, which is a feature, and DX, which is our label).
PT stands for participant.
These PT columns can be used as features; they describe each participant's basic profile.
Score bands (the standard MMSE interpretation; a minimal banding helper is sketched below):
Score 27-30: normal
Score < 27: cognitive impairment
21-26: mild
10-20: moderate
0-9: severe
As for the remaining columns, in my view there is no need to dwell on their exact clinical meaning; it is enough to know that they are measurements produced by various techniques, and they are the key features we will use to build our models.
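As a small illustration (my own addition, not part of the original pipeline), the score bands above translate directly into a lookup helper:

def mmse_band(score):
    """Map an MMSE-style score (0-30) to the severity bands listed above."""
    if score >= 27:
        return 'normal'
    if score >= 21:
        return 'mild'
    if score >= 10:
        return 'moderate'
    return 'severe'

print(mmse_band(28))  # normal
print(mmse_band(15))  # moderate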
Preprocess the feature indicators in the attached data and study the correlation between the data features and the Alzheimer's disease diagnosis.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import missingno
df = pd.read_csv('../datasets/ADNIMERGE_New.csv')
df.head()
First, a look at the data.
We can observe the following:
At this patient's first visit, on 2005-09-12, VISCODE is marked 'bl' and M is 0. At the second visit, on 2006-03-13, VISCODE is 'm06' and M is 6, and the two dates are exactly six months apart. After verifying several more cases, we can infer that 'bl' marks a participant's baseline (first) visit, where M = 0, and that 'mXX' marks a follow-up visit X months after baseline, where M = X. In plain terms, the M column records how many months after baseline each visit took place, and VISCODE is simply the visit code for the same schedule.
That is as much as we can read off the table for now.
With the analysis above, the data-processing plan has taken shape; the mind map is shown below:
Visualize the missing values in the full dataset

# overall missing-value matrix for the full dataset
missingno.matrix(df, figsize=(30, 5))
Check the original dimensions
df.shape
Check whether RID contradicts PTID or SITE; if so, remove those rows

df['RID1'] = df['PTID'].apply(lambda x: int(x.split('_S_')[-1]))
df = df.reset_index(drop=True)
a = []
for i in range(len(df)):
    if df['RID'][i] != df['RID1'][i]:
        print(df['RID'][i])
    else:
        a.append('row %d is consistent' % i)
len(a)
--> 16222
df['SITE1'] = df['PTID'].apply(lambda x: int(x.split('_S_')[0]))
b = []
for i in range(len(df)):
    if df['SITE'][i] != df['SITE1'][i]:
        print(df['SITE'][i])
    else:
        b.append('row %d is consistent' % i)
len(b)
--> 16222
Check whether VISCODE contradicts M; if so, remove those rows

df.loc[df['VISCODE'] == 'bl', 'VISCODE'] = '0'
df['VISCODE'] = df['VISCODE'].apply(lambda x: int(x.split('m')[-1]))
c = []
for i in range(len(df)):
    if df['VISCODE'][i] != df['M'][i]:
        print(df['VISCODE'][i])
    else:
        c.append('row %d is consistent' % i)
len(c)
--> 16222
These two anomaly checks follow directly from the data analysis above.
Fill missing values within each RID
df = pd.read_csv('../datasets/ADNIMERGE_New.csv')
# Only the baseline-suffixed (_bl/_BL) columns and the per-participant demographics
# follow this pattern, so we only fill those.
bl_cols = ['IMAGEUID_bl','Ventricles_bl','Hippocampus_bl','WholeBrain_bl','Entorhinal_bl',
           'Fusiform_bl','MidTemp_bl','ICV_bl','ABETA_bl','TAU_bl','PTAU_bl','FDG_bl','mPACCdigit_bl','mPACCtrailsB_bl',
           'TRABSCOR_bl','DIGITSCOR_bl','LDELTOTAL_BL','RAVLT_perc_forgetting_bl','RAVLT_forgetting_bl','RAVLT_learning_bl',
           'RAVLT_immediate_bl','MMSE_bl','ADASQ4_bl','ADAS13_bl','ADAS11_bl','CDRSB_bl','EXAMDATE_bl','DX_bl','AGE',
           'PTGENDER','PTEDUCAT','PTETHCAT','PTRACCAT','PTMARRY','APOE4']
# Note: looping over RIDs and calling fillna(inplace=True) on df[df['RID']==i][cols],
# as the original code did, operates on a copy and never touches df, so we
# forward/backward fill per participant with groupby + transform instead.
df[bl_cols] = df.groupby('RID')[bl_cols].transform(lambda s: s.ffill().bfill())
df = df.drop('RID', axis=1)
Split the data into columns with the bl suffix and columns without

df = df[df['VISCODE']=='bl']

We first pull out all rows with VISCODE == 'bl'. Why? Because for a given RID the bl-suffixed feature values are identical across all visits, so to cut training cost and guard against overfitting we train only on the baseline rows. The bl rows together with the bl columns therefore form a new table (a quick check of this claim follows below).
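A quick spot-check (my own addition) of the claim that *_bl values are constant within each RID, using one representative column; if the observation holds, this prints True:

# My addition: within every RID group a _bl column should have at most one distinct value.
df_check = pd.read_csv('../datasets/ADNIMERGE_New.csv')
print((df_check.groupby('RID')['Hippocampus_bl'].nunique(dropna=True) <= 1).all())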
c = []
for i in list(df.columns):
    if i.split('_')[-1] == 'bl' or i.split('_')[-1] == 'BL':
        c.append(i)
df_or = df.drop(c, axis=1)
b = ['PTGENDER','PTEDUCAT','PTETHCAT','PTRACCAT','PTMARRY','APOE4','M','update_stamp','AGE']
for i in b:
    c.append(i)
df_bl = df[c]
df_bl.shape, df_or.shape
--> ((2425, 60), (2425, 59))
Visualize the missing values of each table separately
With bl columns:
Without bl columns:
Process the two tables
For features that are assay values, fill missing entries with 0 first

# 'hanliang' (含量) = assay-value columns
hanliang = ['FDG_bl','PIB_bl','AV45_bl','FBB_bl','ABETA_bl','TAU_bl','PTAU_bl','CDRSB_bl','ADAS11_bl','ADAS13_bl','ADASQ4_bl','MMSE_bl','LDELTOTAL_BL','DIGITSCOR_bl','TRABSCOR_bl','FAQ_bl','MOCA_bl','RAVLT_immediate_bl','RAVLT_learning_bl','RAVLT_forgetting_bl','RAVLT_perc_forgetting_bl']
df_bl[hanliang] = df_bl[hanliang].fillna(0)
# df_or holds the unsuffixed columns, so strip the _bl/_BL suffix there
# (indexing df_or with the _bl names, as in the original line, would raise a KeyError)
hanliang_or = [x.rsplit('_', 1)[0] for x in hanliang]
df_or[hanliang_or] = df_or[hanliang_or].fillna(0)

My view is that a missing assay value means the component was simply not detected, i.e., its level is effectively zero, hence the 0 fill.
Drop the rows where DX / DX_bl is null

df_bl = df_bl.drop(df_bl[df_bl['DX_bl'].isna()].index, axis=0)
df_or = df_or.drop(df_or[df_or['DX'].isna()].index, axis=0)  # same filter on the unsuffixed table

This is a supervised task: with the label missing it would turn into an unsupervised problem, which does not match the brief.
Encode the FSVERSION / FSVERSION_bl columns

df_bl.loc[df_bl['FSVERSION_bl']=='Cross-Sectional FreeSurfer (FreeSurfer Version 4.3)','FLDSTRENG_bl'] = 1
df_bl.loc[df_bl['FSVERSION_bl']=='Cross-Sectional FreeSurfer (5.1)','FLDSTRENG_bl'] = 2
df_bl.loc[df_bl['FSVERSION_bl']=='Cross-Sectional FreeSurfer (6.0)','FLDSTRENG_bl'] = 3
# df_or has no _bl columns; the target is FLDSTRENG (the original wrote FLDSTRENG_bl,
# which would only create a new, unused column)
df_or.loc[df_or['FSVERSION']=='Cross-Sectional FreeSurfer (FreeSurfer Version 4.3)','FLDSTRENG'] = 1
df_or.loc[df_or['FSVERSION']=='Cross-Sectional FreeSurfer (5.1)','FLDSTRENG'] = 2
df_or.loc[df_or['FSVERSION']=='Cross-Sectional FreeSurfer (6.0)','FLDSTRENG'] = 3

Why label encoding? A model cannot consume raw strings, so to use this feature we must map it to numbers, and since the FreeSurfer versions are ordered (4.3 < 5.1 < 6.0) label encoding is a better fit than one-hot encoding.
Analyze and process IMAGEUID_bl

df1 = df[df['FLDSTRENG_bl']=='1.5 Tesla MRI']
# Only one 1.5T row has IMAGEUID_bl above 200,000, so drop it and take 200,000 as the threshold.
# Compute the index before dropping; in the original order the second drop was a no-op.
bad_idx = df1[df1['IMAGEUID_bl'] > 200000].index
df1 = df1.drop(bad_idx, axis=0)
df_bl = df_bl.drop(bad_idx, axis=0, errors='ignore')
df2 = df[df['FLDSTRENG_bl']=='3 Tesla MRI']
df3 = df_bl[df_bl['FLDSTRENG_bl']==3]
# use 200,000 and 800,000 as thresholds
df_bl.loc[(df_bl['IMAGEUID_bl'] > 200000) & (df_bl['IMAGEUID_bl'] < 800000), 'FLDSTRENG_bl'] = 2
df_bl.loc[df_bl['IMAGEUID_bl'] < 200000, 'FLDSTRENG_bl'] = 1
df_bl.loc[df_bl['IMAGEUID_bl'] > 800000, 'FLDSTRENG_bl'] = 3

The code for the table without bl columns is not repeated here; it mirrors the above.
Drop the remaining rows where FLDSTRENG_bl is missing

# In the remaining rows, FLDSTRENG_bl, FSVERSION_bl and IMAGEUID_bl are missing simultaneously.
df_bl = df_bl.drop(df_bl[df_bl['FLDSTRENG_bl'].isna()].index, axis=0)

Drop the remaining rows where mPACCdigit_bl is missing

# In the remaining two rows, mPACCdigit_bl and mPACCtrailsB_bl are missing simultaneously.
df_bl = df_bl.drop(df_bl[df_bl['mPACCdigit_bl'].isna()].index, axis=0)

For brain measurements that still contain missing values, fill with 0

# fill with 0
a = ['Ventricles_bl','Hippocampus_bl','WholeBrain_bl','Entorhinal_bl','Fusiform_bl','MidTemp_bl','ICV_bl']
df_bl[a] = df_bl[a].fillna(0)

My reasoning: any value still missing at this stage can only mean the measurement was not obtained, so I fill with 0.
Split the bl table into one with and one without the ecog columns

ecog = ['EcogPtMem_bl','EcogPtLang_bl','EcogPtVisspat_bl','EcogPtPlan_bl','EcogPtOrgan_bl','EcogPtDivatt_bl','EcogPtTotal_bl',
        'EcogSPMem_bl','EcogSPLang_bl','EcogSPVisspat_bl','EcogSPPlan_bl','EcogSPOrgan_bl','EcogSPDivatt_bl','EcogSPTotal_bl']
df_bl_notec = df_bl.drop(ecog, axis=1)
# keep only the rows with no missing ecog values
df_bl_ec = df_bl[df_bl[ecog].notna().all(axis=1)]
Drop the rows with missing APOE4 or AGE

df_bl_notec = df_bl_notec.dropna(subset=['APOE4', 'AGE'])
df_bl_ec = df_bl_ec.dropna(subset=['APOE4', 'AGE'])
At this point all missing and anomalous values have been handled, leaving four clean datasets (bl with ecog, bl without ecog, non-bl with ecog, non-bl without ecog).
Visualization of the processed results (a programmatic check follows the captions below):
The two tables with ecog:
The two tables without ecog:
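As a quick programmatic check (my own addition) that the cleaned tables really contain no remaining missing values:

# My addition: confirm the two bl tables have no remaining NaNs.
# The df_or tables can be checked the same way.
for name, d in [('df_bl_ec', df_bl_ec), ('df_bl_notec', df_bl_notec)]:
    print(name, 'remaining NaNs:', int(d.isna().sum().sum()))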
Write out the tables from the first processing pass
df_or_notec.to_csv('../mid_data/df_or_notec.csv',index=False)
df_or_ec.to_csv('../mid_data/df_or_ec.csv',index=False)
df_bl_notec.to_csv('../mid_data/df_bl_notec.csv',index=False)
df_bl_ec.to_csv('../mid_data/df_bl_ec.csv',index=False)
Continue processing the four tables
DX_map = {
'CN': 1,
'SMC': 2,
'EMCI': 3,
'LMCI': 4,
'AD': 5,
}
F_map = {
'Cross-Sectional FreeSurfer (5.1)':2,
'Cross-Sectional FreeSurfer (FreeSurfer Version 4.3)':1,
'Cross-Sectional FreeSurfer (6.0)':3
}
df_bl_ec['DX_bl'] = df_bl_ec['DX_bl'].map(DX_map)
df_bl_ec['FSVERSION_bl'] = df_bl_ec['FSVERSION_bl'].map(F_map)
Drop irrelevant columns
df_bl_ec = df_bl_ec.drop(['EXAMDATE_bl','Years_bl','Month_bl','M','update_stamp'],axis=1)
Drop 'Unknown' rows
df_bl_ec = df_bl_ec.drop(df_bl_ec[df_bl_ec['PTETHCAT'] =='Unknown'].index,axis=0)
df_bl_ec = df_bl_ec.drop(df_bl_ec[df_bl_ec['PTRACCAT'] =='Unknown'].index,axis=0)
df_bl_ec = df_bl_ec.drop(df_bl_ec[df_bl_ec['PTMARRY'] =='Unknown'].index,axis=0)
One-hot encode the discrete (nominal) features

# One-hot encode each nominal feature, merge the dummy columns back in on the index,
# and drop the original column; this preserves the original step-by-step behavior.
for col in ['PTETHCAT', 'PTRACCAT', 'PTMARRY', 'PTGENDER']:
    one_hot_df = pd.get_dummies(df_bl_ec[col])
    df_bl_ec = pd.merge(df_bl_ec, one_hot_df, left_index=True, right_index=True)
    df_bl_ec = df_bl_ec.drop(col, axis=1)
Preprocess the special (censored) features

# ABETA/TAU/PTAU store censored assay values as strings such as '>1700' or '<8';
# first replace those sentinels with values just beyond the detection limits...
df_bl_ec.loc[df_bl_ec['ABETA_bl']=='>1700','ABETA_bl'] = 1701
df_bl_ec.loc[df_bl_ec['PTAU_bl']=='<8','PTAU_bl'] = 7
df_bl_ec.loc[df_bl_ec['ABETA_bl']=='<200','ABETA_bl'] = 199
df_bl_ec.loc[df_bl_ec['PTAU_bl']=='>120','PTAU_bl'] = 121
df_bl_ec.loc[df_bl_ec['TAU_bl']=='>1300','TAU_bl'] = 1301
df_bl_ec.loc[df_bl_ec['TAU_bl']=='<80','TAU_bl'] = 79
# ...then clip the remaining numeric values to the same bounds
df_bl_ec.loc[df_bl_ec['ABETA_bl'].astype(float)>1700,'ABETA_bl'] = 1701
df_bl_ec.loc[df_bl_ec['PTAU_bl'].astype(float)<8,'PTAU_bl'] = 7
df_bl_ec.loc[df_bl_ec['ABETA_bl'].astype(float)<200,'ABETA_bl'] = 199
df_bl_ec.loc[df_bl_ec['PTAU_bl'].astype(float)>120,'PTAU_bl'] = 121
df_bl_ec.loc[df_bl_ec['TAU_bl'].astype(float)>1300,'TAU_bl'] = 1301
df_bl_ec.loc[df_bl_ec['TAU_bl'].astype(float)<80,'TAU_bl'] = 79
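One extra step worth adding (my own addition): after the string sentinels are replaced these columns still have object dtype, so a final cast keeps them numeric for the later corr() and model fitting:

# My addition: make the censored columns fully numeric.
for col in ['ABETA_bl', 'TAU_bl', 'PTAU_bl']:
    df_bl_ec[col] = df_bl_ec[col].astype(float)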
This completes the second processing pass; write out the data
df_or_notec.to_csv('../mid_data/df_or_notec_encoding.csv',index=False)
df_or_ec.to_csv('../mid_data/df_or_ec_encoding.csv',index=False)
df_bl_notec.to_csv('../mid_data/df_bl_notec_encoding.csv',index=False)
df_bl_ec.to_csv('../mid_data/df_bl_ec_encoding.csv',index=False)
Correlation analysis on the twice-processed bl tables
corr()
Two heatmaps are drawn: the left one includes the ecog features, the right one does not.
# draw heatmaps for the two datasets
plt.style.use('seaborn-whitegrid')  # plotting style; named 'seaborn-v0_8-whitegrid' in matplotlib >= 3.6
fig = plt.figure(figsize=(20, 10))
# first heatmap
plt.subplot(1, 2, 1)
mask = np.zeros_like(df_bl_ec.corr(), dtype=bool)  # np.bool was removed in recent NumPy; use plain bool
mask[np.triu_indices_from(mask)] = True
sns.heatmap(df_bl_ec.corr(),
            vmin=-1, vmax=1,
            square=True,
            cmap=sns.color_palette("RdBu_r", 100),
            mask=mask,
            linewidths=.5)
# second heatmap
plt.subplot(1, 2, 2)
mask = np.zeros_like(df_bl_notec.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(df_bl_notec.corr(),
            vmin=-1, vmax=1,
            square=True,
            cmap=sns.color_palette("RdBu_r", 100),
            mask=mask,
            linewidths=.5)
Random forest feature importance
Task 1 is not finished: corr() alone does not answer it. I fit a random forest to the data and inspect the feature importances.
# df_bl_ec
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import r2_score  # for goodness-of-fit assessment
label = df_bl_ec['DX_bl']
train = df_bl_ec.drop(['DX_bl'], axis=1)
# build the random forest model
model = RandomForestClassifier()
model.fit(train, label)  # fit
predictions = model.predict(train)  # predict
print("train r2:%.3f" % r2_score(label, model.predict(train)))  # evaluate
--> r2:1.00
The fit on the training data is essentially perfect, so we do not tune the random forest further.
Inspect the feature importances directly

feature_list = list(train.columns)
importances = list(model.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]  # pair variable names with importances
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)  # sort descending
[print('Variable: {:12} Importance: {}'.format(*pair)) for pair in feature_importances]  # print the details
The above covers only one of the two tables; see the code for the other. We then visualize the feature importances of both tables (a minimal plotting sketch follows the captions below).
With ecog:
Without ecog:
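The plotting code is not shown in the original; a minimal sketch (my addition) of how such a chart could be drawn from the pairs computed above:

# My addition: horizontal bar chart of the sorted importances.
names, values = zip(*feature_importances)
plt.figure(figsize=(8, 12))
plt.barh(names[::-1], values[::-1])  # most important feature on top
plt.xlabel('Importance')
plt.title('Random forest feature importances')
plt.tight_layout()
plt.show()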
Design an intelligent diagnosis of Alzheimer's disease using the attached brain-structure and cognitive-behavioral features.
df_bl_ec = pd.read_csv('../mid_data/df_bl_ec_encoding.csv')
df_bl_notec = pd.read_csv('../mid_data/df_bl_notec_encoding.csv')
# select the brain-structure and behavioral features
f_col = ['IMAGEUID_bl','Ventricles_bl','Hippocampus_bl','WholeBrain_bl','Entorhinal_bl','Fusiform_bl','MidTemp_bl','ICV_bl',
         'RAVLT_immediate_bl','RAVLT_learning_bl','RAVLT_forgetting_bl','RAVLT_perc_forgetting_bl','LDELTOTAL_BL','DX_bl',
         'MOCA_bl','mPACCdigit_bl','mPACCtrailsB_bl']
df_bl_ec = df_bl_ec[f_col]
df_bl_notec = df_bl_notec[f_col]
# neither feature group involves the ecog columns, so the two tables can be stacked
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
df_bl = pd.concat([df_bl_ec, df_bl_notec])
df_bl = df_bl.reset_index(drop=True)
df_bl.to_csv('../mid_data/df_bl_merge.csv')
This problem is a multi-class classification task, with five classes here. Since the data are structured (tabular), we choose a tree model rather than deep learning. For evaluation we use r2, recall, and F1; these three metrics should be introduced in the paper (standard definitions below).
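For reference (my addition), the standard definitions, with TP/FN the per-class true-positive and false-negative counts and \hat{y}_i, \bar{y} the predicted and mean labels; the code below uses micro averaging for recall and macro averaging for F1:

\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}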
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import r2_score  # for goodness-of-fit assessment
label=df_bl['DX_bl']
train=df_bl.drop(['DX_bl'],axis=1)
train_x,test_x,train_y,test_y=train_test_split(train,label,test_size=0.3,random_state=4396)
model=RandomForestClassifier()
model.fit(train_x,train_y)
predictions= model.predict(test_x)
print("train r2:%.3f"%r2_score(train_y,model.predict(train_x)))
print("test r2:%.3f"%r2_score(test_y,predictions))
print("train Recall:%.3f"%metrics.recall_score(train_y,model.predict(train_x),average='micro'))
print("test Recall:%.3f"%metrics.recall_score(test_y,predictions,average='micro'))
print("train F1:%.3f"%metrics.f1_score(train_y,model.predict(train_x),average='macro'))
print("test F1:%.3f"%metrics.f1_score(test_y,predictions,average='macro'))
model.estimators_
model.estimators_[0].random_state
--> 466110319
First, cluster the samples into the three major classes CN, MCI, and AD. Then, within MCI, refine the clustering into the three subclasses (SMC, EMCI, and LMCI).
df_bl_ec = pd.read_csv('../mid_data/df_bl_ec_encoding.csv')
df_bl_notec = pd.read_csv('../mid_data/df_bl_notec_encoding.csv')
# To make use of more data in the clustering, drop the ecog columns first and then
# merge the two tables.
df_bl_ec = df_bl_ec.drop(['EcogPtMem_bl',
    'EcogPtLang_bl', 'EcogPtVisspat_bl', 'EcogPtPlan_bl', 'EcogPtOrgan_bl',
    'EcogPtDivatt_bl', 'EcogPtTotal_bl', 'EcogSPMem_bl', 'EcogSPLang_bl',
    'EcogSPVisspat_bl', 'EcogSPPlan_bl', 'EcogSPOrgan_bl',
    'EcogSPDivatt_bl', 'EcogSPTotal_bl'], axis=1)
df = pd.concat([df_bl_ec, df_bl_notec])  # DataFrame.append was removed in pandas 2.0
df = df.reset_index(drop=True)
# three-class label: 1 = CN, 2 = MCI (covering SMC/EMCI/LMCI), 3 = AD
df.loc[df['DX_bl']==1,'DX_three'] = 1
df.loc[df['DX_bl']==2,'DX_three'] = 2
df.loc[df['DX_bl']==3,'DX_three'] = 2
df.loc[df['DX_bl']==4,'DX_three'] = 2
df.loc[df['DX_bl']==5,'DX_three'] = 3
# subclass label within MCI: 2 = SMC, 3 = EMCI, 4 = LMCI
df.loc[df['DX_bl']==2,'DX_son'] = 2
df.loc[df['DX_bl']==3,'DX_son'] = 3
df.loc[df['DX_bl']==4,'DX_son'] = 4
Here we use the KMeans clustering algorithm.
First, find the optimal K:

from sklearn.cluster import KMeans
## search for the best K
x = df[df.columns.difference(['DX_bl','DX_three','DX_son'])]
inertia = []
for i in range(1, 10):
    km = KMeans(n_clusters=i)  # the n_jobs argument was removed from KMeans in scikit-learn 1.0
    km.fit(x)
    inertia.append(km.inertia_)  # within-cluster sum of squared errors
plt.figure(figsize=(12, 6))
plt.plot(range(1, 10), inertia)
plt.show()
## elbow rule: pick K where the inertia curve bends
First clustering
We choose k = 2 for the first pass.

km = KMeans(n_clusters=2)
y_means = km.fit_predict(x)
df['Kmeans_three'] = y_means
df.loc[df['Kmeans_three']==1,'Kmeans_three'] = 2
df_1 = df[df['Kmeans_three']==0].copy()  # copy so the relabelling below does not write through a view of df
Second search for the optimal K
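The elbow-search code for this second pass is not shown in the original; a minimal sketch (my addition), restricted to the df_1 subset and reusing the feature-selection pattern from above:

# My addition: repeat the elbow search on the df_1 subset.
x1 = df_1[df_1.columns.difference(['DX_bl', 'DX_three', 'DX_son', 'Kmeans_three'])]
inertia = []
for i in range(1, 10):
    km = KMeans(n_clusters=i)
    inertia.append(km.fit(x1).inertia_)  # within-cluster sum of squared errors
plt.plot(range(1, 10), inertia)
plt.show()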
Second clustering

km = KMeans(n_clusters=2)
# fit on the df_1 subset (x1 from the elbow search above); fitting on the full x,
# as the original code did, yields a label vector that does not match df_1's length
y_means = km.fit_predict(x1)
df_1['Kmeans_three'] = y_means
df_1.loc[df_1['Kmeans_three']==0,'Kmeans_three'] = 3
df.loc[df_1.index, 'Kmeans_three'] = df_1['Kmeans_three']  # write back without chained indexing
df['Kmeans_three'].value_counts()
This is the distribution after clustering into the three classes AD, CN, and MCI.
Extract the MCI cluster and search for the best K again

df_MCI = df[df['Kmeans_three']==2].copy()  # copy so new columns can be added safely
## search for the best K on the MCI subset
x = df_MCI[df_MCI.columns.difference(['DX_bl','DX_three','DX_son'])]
inertia = []
for i in range(1, 10):
    km = KMeans(n_clusters=i)
    km.fit(x)
    inertia.append(km.inertia_)  # within-cluster sum of squared errors
plt.figure(figsize=(12, 6))
plt.plot(range(1, 10), inertia)
plt.show()
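The final sub-clustering fit is not shown in the original either; a minimal sketch (my addition), assuming the elbow plot above points to k = 3 for the three subclasses:

# My addition: split the MCI cluster into three subclasses, assuming k = 3.
km = KMeans(n_clusters=3)
df_MCI['Kmeans_son'] = km.fit_predict(x)  # x is the MCI feature matrix built above
print(df_MCI['Kmeans_son'].value_counts())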
The label distribution after sub-clustering:
The same sample in the attachment has features recorded at different time points. Analyze the features in relation to these time points to reveal the underlying patterns, i.e., how the different disease classes evolve over time.
df = pd.read_csv('../datasets/ADNIMERGE_New.csv')
df = df.drop(['COLPROT','ORIGPROT','PTID','SITE','VISCODE'],axis=1)
c = []
for i in list(df.columns):
    if i.split('_')[-1] == 'bl' or i.split('_')[-1] == 'BL':
        c.append(i)
b = ['PTGENDER','PTEDUCAT','PTETHCAT','PTRACCAT','PTMARRY','APOE4','update_stamp','AGE']
for i in b:
    c.append(i)
df_or = df.drop(c,axis=1)
# for assay-value features, fill missing values with 0
hanliang = ['FDG','PIB','AV45','FBB','ABETA',
'TAU','PTAU','CDRSB','ADAS11','ADAS13','ADASQ4',
'MMSE','LDELTOTAL','DIGITSCOR','TRABSCOR','FAQ',
'MOCA','RAVLT_immediate','RAVLT_learning',
'RAVLT_forgetting','RAVLT_perc_forgetting']
df_or[hanliang] = df_or[hanliang].fillna(0)
df_or = df_or.drop(df_or[df_or['DX'].isna()].index,axis=0)
# encode field strength from the FreeSurfer version; df_or has no _bl columns,
# so the target is FLDSTRENG (the original wrote FLDSTRENG_bl, an unused new column)
df_or.loc[df_or['FSVERSION']=='Cross-Sectional FreeSurfer (FreeSurfer Version 4.3)','FLDSTRENG'] = 1
df_or.loc[df_or['FSVERSION']=='Cross-Sectional FreeSurfer (5.1)','FLDSTRENG'] = 2
df_or.loc[df_or['FSVERSION']=='Cross-Sectional FreeSurfer (6.0)','FLDSTRENG'] = 3
df1 = df[df['FLDSTRENG']=='1.5 Tesla MRI']
df2 = df[df['FLDSTRENG']=='3 Tesla MRI']
# errors='ignore' because some of these rows were already dropped with the missing DX labels
df_or = df_or.drop(df1[df1['IMAGEUID']> 200000].index,axis=0,errors='ignore')
# use 200,000 and 800,000 as thresholds
df_or.loc[(df_or['IMAGEUID']> 200000) & (df_or['IMAGEUID']< 800000),'FLDSTRENG'] = 2
df_or.loc[df_or['IMAGEUID']< 200000,'FLDSTRENG'] = 1
df_or.loc[df_or['IMAGEUID']> 800000,'FLDSTRENG'] = 3
df_or = df_or.drop(df_or[df_or['FLDSTRENG'].isna()].index,axis=0)
df_or = df_or.drop(df_or[df_or['mPACCdigit'].isna()].index,axis=0)
# fill with 0
a = ['Ventricles','Hippocampus','WholeBrain','Entorhinal','Fusiform',
'MidTemp','ICV']
df_or[a] = df_or[a].fillna(0)
ecog = ['EcogPtMem','EcogPtLang','EcogPtVisspat','EcogPtPlan','EcogPtOrgan',
'EcogPtDivatt','EcogPtTotal',
'EcogSPMem','EcogSPLang','EcogSPVisspat','EcogSPPlan','EcogSPOrgan',
'EcogSPDivatt','EcogSPTotal']
df_or_notec = df_or.drop(ecog,axis=1)
# replace censored assay strings (e.g. '>1700', '<8') with boundary values,
# then clip the remaining numeric values to the same bounds
df_or_notec.loc[df_or_notec['ABETA']=='>1700','ABETA'] = 1701
df_or_notec.loc[df_or_notec['ABETA']=='<200','ABETA'] = 199
df_or_notec.loc[df_or_notec['PTAU']=='<8','PTAU'] = 7
df_or_notec.loc[df_or_notec['PTAU']=='>120','PTAU'] = 121
df_or_notec.loc[df_or_notec['TAU']=='>1300','TAU'] = 1301
df_or_notec.loc[df_or_notec['TAU']=='<80','TAU'] = 79
df_or_notec.loc[df_or_notec['ABETA'].astype(float)>1700,'ABETA'] = 1701
df_or_notec.loc[df_or_notec['PTAU'].astype(float)<8,'PTAU'] = 7
df_or_notec.loc[df_or_notec['ABETA'].astype(float)<200,'ABETA'] = 199
df_or_notec.loc[df_or_notec['PTAU'].astype(float)>120,'PTAU'] = 121
df_or_notec.loc[df_or_notec['TAU'].astype(float)>1300,'TAU'] = 1301
df_or_notec.loc[df_or_notec['TAU'].astype(float)<80,'TAU'] = 79
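As with the bl table earlier (my own addition), a final cast makes these columns fully numeric so that the distribution plots below receive numeric series:

# My addition: make the censored columns fully numeric.
for col in ['ABETA', 'TAU', 'PTAU']:
    df_or_notec[col] = df_or_notec[col].astype(float)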
df_or_notec = df_or_notec.drop(['Month','EXAMDATE'],axis=1)
df_or_notec = df_or_notec.reset_index(drop = True)
col= ['FDG', 'PIB', 'AV45', 'FBB', 'ABETA', 'TAU', 'PTAU', 'CDRSB',
'ADAS11', 'ADAS13', 'ADASQ4', 'MMSE', 'RAVLT_immediate',
'RAVLT_learning', 'RAVLT_forgetting', 'RAVLT_perc_forgetting',
'LDELTOTAL', 'DIGITSCOR', 'TRABSCOR', 'FAQ', 'MOCA', 'IMAGEUID', 'Ventricles', 'Hippocampus', 'WholeBrain',
'Entorhinal', 'Fusiform', 'MidTemp', 'ICV', 'mPACCdigit',
'mPACCtrailsB']
import math

def plot_distribution(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    # setup done; now draw one distribution plot per column
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        g = sns.distplot(dataset[column])  # deprecated in recent seaborn; sns.histplot(..., kde=True) is the replacement
        plt.xticks(rotation=25)

df_or_notec_draw = df_or_notec[col]
import warnings
warnings.filterwarnings('ignore')
plot_distribution(df_or_notec_draw, cols=3, width=20, height=20, hspace=1.3, wspace=0.75)
col1 = ['mPACCtrailsB','mPACCdigit','ICV','MidTemp','Fusiform',
        'Entorhinal','WholeBrain','Hippocampus','Ventricles','IMAGEUID',
        'TRABSCOR','RAVLT_forgetting','RAVLT_immediate','ADAS13','ADAS11']

def plot(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    # setup done; now draw one line plot per column
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        g = sns.lineplot(x=dataset[column].index, y=dataset[column])  # keyword arguments required in seaborn >= 0.12
        plt.xticks(rotation=25)
df_CN = df_or_notec[df_or_notec['DX']=='CN']
df_MCI = df_or_notec[df_or_notec['DX']=='MCI']
df_DE = df_or_notec[df_or_notec['DX']=='Dementia']
CN
df_CN_draw = df_CN.groupby('M')[col1].mean()
plot(df_CN_draw,cols=3,width=20,height=20,hspace=1,wspace=0.4)
MCI
df_MCI_draw = df_MCI.groupby('M')[col1].mean()
plot(df_MCI_draw,cols=3,width=20,height=20,hspace=1,wspace=0.4)
Dementia
df_DE_draw = df_DE.groupby('M')[col1].mean()
plot(df_DE_draw,cols=3,width=20,height=20,hspace=1,wspace=0.4)
For the early-intervention and diagnostic criteria of the five classes CN, SMC, EMCI, LMCI, and AD, please consult the relevant literature.
This write-up is shared mainly for learning and exchange. It represents only my own approach to the problem and may well contain mistakes; corrections from experienced readers are very welcome!