This is my second Kaggle project; the first was Digit Recognition, which interested readers can find here. The Python environment and required libraries were covered in detail in that post, so I won't repeat them. In Digit Recognition the features were all of a single type (int grayscale values), so no feature engineering was needed. For the Titanic project, feature engineering is probably the most critical part, and it has a direct impact on the final accuracy. Since this is not a computer vision problem, a neural network would likely not classify very well, so the plan is to train several machine-learning classifiers separately and then ensemble their predictions (I use a bagging-style approach).
First, it is worth describing the data layout (the project page is here). train.csv contains the labeled data, where the label is the "Survived" variable (0 = died, 1 = survived). The features are listed below (a minimal loading sketch follows the list):
1. pclass: int; the passenger's social class (3 levels: 1 = upper, 2 = middle, 3 = lower)
2. name: string; the passenger's name
3. sex: string; either "male" or "female"
4. age: int; the passenger's age
5. sibsp: int; the passenger's siblings/spouses aboard
6. parch: int; the passenger's parents/children aboard
7. ticket: mixed string/int; the passenger's ticket serial number
8. fare: float; the passenger's ticket price
9. cabin: mixed string/int; the passenger's cabin number
10. embarked: string; the passenger's port of embarkation (three values: "C", "Q", "S")
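For orientation, here is a minimal sketch of loading the data and checking where values are missing, assuming the standard Kaggle file layout (the actual loading code is not shown in this post):

import pandas as pd

# load the Kaggle Titanic training data and inspect the raw column types
df = pd.read_csv('train.csv')
print(df.dtypes)          # a mix of int64 / float64 / object columns
print(df.isnull().sum())  # Age, Cabin and Embarked have missing values in
                          # train.csv; Fare has one missing value in test.csv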
It is easy to see that, because the raw features differ in type, the engineered features will inevitably be a mix: some are better treated as nominal (pclass, sex, etc.) and others as scalar (age, fare, etc.), so the final feature set will contain both. Since random forests are among the best-performing methods on mixed nominal/scalar features, we use a random forest as one baseline classifier.
I also built four other classifiers, each of which sees only part of the features. This has two benefits. First, some classifiers only perform well on certain feature types, so we feed them only those features. For example, GBDT (Gradient Boosting Decision Tree) suits scalar features and handles sparse features poorly, so it receives only scalar features, with nominal features converted to scalar (and further to binary where needed). Second, each classifier effectively "remembers" only part of the features; the bias this introduces trades off against each classifier's high variance. When building the classifiers I deliberately choose parameter settings that allow high variance, and the final ensemble votes the classifiers together, a bagging-like way of combining the candidates.
In total I built five classifiers; each one, with the training features it receives:
1. Random Forest Classifier: receives binary and scalar features
2. Gradient Boosting Decision Tree Classifier: receives scalar features
3. Support Vector Machine Classifier: receives binary features
4. Adaptive Boosting (AdaBoost) Decision Tree Classifier: receives binary features
5. Logistic Regression Classifier: receives binary and scalar features
The final output is the majority decision of the five classifiers' votes (a minimal sketch of this feature routing and voting follows).
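The Titanic_* modules used at the end of this post are not shown here, so the following is only a sketch of the idea, assuming the engineered frame marks scaled columns with a '_scaled' suffix (a naming assumption for illustration): binary dummy columns go to some models, scaled scalar columns to others, and the {0,1} predictions are combined by a weighted sign vote.

import numpy as np

def route_features(df, kinds):
    # pick binary dummy columns and/or '_scaled' scalar columns (naming assumed)
    cols = []
    if 'binary' in kinds:
        cols += [c for c in df.columns if set(df[c].dropna().unique()) <= {0, 1}]
    if 'scalar' in kinds:
        cols += [c for c in df.columns if str(c).endswith('_scaled')]
    return df[cols]

def weighted_vote(predictions, weights):
    # map each {0,1} prediction vector to {-1,+1}; a positive weighted sum
    # of the votes means 'survived'
    signed = [np.where(p == 1, 1, -1) for p in predictions]
    total = sum(w * s for w, s in zip(weights, signed))
    return np.where(total <= 0, 0, 1)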
The feature-engineering code follows. First, missing "Fare" values are filled with the median fare of the corresponding passenger class:
# replace missing Fare values with the median of the corresponding Pclass
df.loc[(df.Fare.isnull()) & (df.Pclass == 1), 'Fare'] = np.median(df[df['Pclass'] == 1]['Fare'].dropna())
df.loc[(df.Fare.isnull()) & (df.Pclass == 2), 'Fare'] = np.median(df[df['Pclass'] == 2]['Fare'].dropna())
df.loc[(df.Fare.isnull()) & (df.Pclass == 3), 'Fare'] = np.median(df[df['Pclass'] == 3]['Fare'].dropna())
def setMissingAges(df):
    """Impute missing ages with a random forest regression on the other features."""
    age_df = df[['Age', 'Embarked', 'Fare', 'Parch', 'SibSp', 'Title_id', 'Pclass', 'Names', 'CabinLetter']]
    knownAge = age_df[df.Age.notnull()]
    unknownAge = age_df[df.Age.isnull()]
    y = knownAge.values[:, 0]
    X = knownAge.values[:, 1:]
    rfr = RandomForestRegressor(n_estimators=2000, n_jobs=-1)
    # train the regressor on the passengers whose age is known
    rfr.fit(X, y)
    predictedAges = rfr.predict(unknownAge.values[:, 1:])
    df.loc[df.Age.isnull(), 'Age'] = predictedAges  # .loc avoids chained assignment
    return df
def processPclass(df, keep_binary=False, keep_scaled=False):
    # fill in the missing values with the median class
    df.loc[df.Pclass.isnull(), 'Pclass'] = df['Pclass'].median()
    # create binary (dummy) features
    if keep_binary:
        df = pd.concat([df, pd.get_dummies(df['Pclass']).rename(columns=lambda x: 'Pclass_' + str(x))], axis=1)
    if keep_scaled:
        scaler = preprocessing.StandardScaler()
        df['Pclass_scaled'] = scaler.fit_transform(df[['Pclass']])[:, 0]  # 2-D input for newer sklearn
    return df
The main way to convert a scalar feature to binary is binning: first discretize it into a nominal feature, then expand that into binary features via dummy variables. The binning splits the scalar's range into the union of N disjoint left-open, right-closed intervals with equal frequency. For example, "age" becomes 4 nominal values "age_(0,21]", "age_(21,28]", "age_(28,38]", "age_(38,80]", with an equal number of samples in each interval. The implementation (a small qcut demo follows the function):
def processAge(df, keep_binary=False, keep_bins=False, keep_scaled=False):
    df = setMissingAges(df)
    if keep_bins:
        # bin into quantiles and create binary features from the bins
        df['Age_bin'] = pd.qcut(df['Age'], 4)
        if keep_binary:
            # have a dedicated feature for children
            df['isChild'] = np.where(df.Age < 13, 1, 0)
            df = pd.concat([df, pd.get_dummies(df['Age_bin']).rename(columns=lambda x: 'Age_' + str(x))], axis=1)
        del df['Age_bin']  # drop the intermediate nominal column
    if keep_scaled:
        scaler = preprocessing.StandardScaler()
        df['Age_scaled'] = scaler.fit_transform(df[['Age']])[:, 0]
    return df
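To see the equal-frequency intervals concretely, here is a quick illustrative run of pd.qcut on a toy age series (the interval edges quoted above are the ones produced on the real data):

import pandas as pd

# toy data: qcut splits the values into four intervals with equal counts
ages = pd.Series([2, 18, 22, 25, 30, 35, 40, 70])
print(pd.qcut(ages, 4))  # each of the 4 intervals holds exactly 2 values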
def processEmbarked(df, keep_binary=False, keep_scaled=False):
    # replace missing values with the most common port
    df.loc[df.Embarked.isnull(), 'Embarked'] = df.Embarked.dropna().mode().values[0]
    # turn the port letters into numbers
    df['Embarked'] = pd.factorize(df['Embarked'])[0]
    # create binary features for each port
    if keep_binary:
        df = pd.concat([df, pd.get_dummies(df['Embarked']).rename(columns=lambda x: 'Embarked_' + str(x))], axis=1)
    if keep_scaled:
        scaler = preprocessing.StandardScaler()
        df['Embarked_scaled'] = scaler.fit_transform(df[['Embarked']])[:, 0]
    return df
Several hidden features can be mined from the string fields:
1. Some passengers' "name" field gives an additional name in parentheses; how many names a person has may itself be a marker of social status.
2. The "name" field contains a title such as "Mr", "Miss", "Mrs", which directly reflects status, education, sex and age. These are extracted with a regular expression into a nominal feature, which is then also made into binary and scalar versions.
3. The first letter of the "Cabin" field encodes the deck, which may carry important information, e.g. distance from the boat deck affecting survival odds.
4. The digits of the "Cabin" field give the room number, which may also carry important information, e.g. if the rooms on some deck sat next to the electrical room, a leak causing electrical faults could kill everyone around it.
5. The leading letters of the "Ticket" field encode a country/issuer code; differences between countries directly affect a person's status and degree of gentility.
6. Many "Ticket" numbers are identical, presumably family tickets, which directly reflects family size. This may already be captured by "parch" + "sibsp", but it is added anyway; removing redundant features is a later step.
The code that extracts the hidden features from the "name" field:
def processName(df, keep_binary=False, keep_bins=False, keep_scaled=False):
    """
    Parameters:
        keep_binary: include 'Title_Mr', 'Title_Mrs', ...
        keep_scaled && keep_bins: include 'Names_scaled', 'Title_id_scaled'
    Note: the string feature 'Name' can be deleted afterwards
    """
    # how many different names does each person have? -> feature 'Names'
    df['Names'] = df['Name'].map(lambda x: len(re.split('\\(', x)))
    # what is each person's title?
    df['Title'] = df['Name'].map(lambda x: re.compile(", (.*?)\.").findall(x)[0])
    # group low-occurring, related titles together
    df.loc[df.Title.isin(['Mr', 'Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col', 'Sir', 'Dona']), 'Title'] = 'Mr'
    df.loc[df.Title.isin(['Master']), 'Title'] = 'Master'
    df.loc[df.Title.isin(['Countess', 'Mme', 'Mrs', 'Lady', 'the Countess']), 'Title'] = 'Mrs'
    df.loc[df.Title.isin(['Mlle', 'Ms', 'Miss']), 'Title'] = 'Miss'
    df.loc[(df.Title.isin(['Dr'])) & (df['Sex'] == 'male'), 'Title'] = 'Mr'
    df.loc[(df.Title.isin(['Dr'])) & (df['Sex'] == 'female'), 'Title'] = 'Mrs'
    df.loc[(df.Title.isnull()) & (df['Sex'] == 'male'), 'Title'] = 'Master'
    df.loc[(df.Title.isnull()) & (df['Sex'] == 'female'), 'Title'] = 'Miss'
    # build binary features
    if keep_binary:
        df = pd.concat([df, pd.get_dummies(df['Title']).rename(columns=lambda x: 'Title_' + str(x))], axis=1)
    # scaled version of the name count
    if keep_scaled:
        scaler = preprocessing.StandardScaler()
        df['Names_scaled'] = scaler.fit_transform(df[['Names']])[:, 0]
    if keep_bins:
        df['Title_id'] = pd.factorize(df['Title'])[0] + 1
    if keep_bins and keep_scaled:
        scaler = preprocessing.StandardScaler()
        df['Title_id_scaled'] = scaler.fit_transform(df[['Title_id']])[:, 0]
    del df['Name']
    return df
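As a quick sanity check (not part of the pipeline), here is what the regexes extract from two names that actually appear in train.csv:

import re

names = ['Braund, Mr. Owen Harris',
         'Cumings, Mrs. John Bradley (Florence Briggs Thayer)']
title_re = re.compile(", (.*?)\.")
for n in names:
    # title between the comma and the first period; name count from '(' splits
    print(title_re.findall(n)[0], len(re.split('\\(', n)))
# -> Mr 1
# -> Mrs 2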
# Utility method: get the alphabetical component of 'Cabin'
def getCabinLetter(cabin):
    match = re.compile("([a-zA-Z]+)").search(cabin)
    if match:
        return match.group(0)
    else:
        return 'U'
# Utility method: get the numerical component of 'Cabin'
def getCabinNumber(cabin):
    match = re.compile("([0-9]+)").search(cabin)
    if match:
        return match.group(0)
    else:
        return 0
def processCabin(df, keep_binary=False, keep_scaled=False):
    # replace missing values with "U0" (unknown letter, room 0)
    df.loc[df.Cabin.isnull(), 'Cabin'] = 'U0'
    # create a feature for the alphabetical part of the cabin number
    df['CabinLetter'] = df['Cabin'].map(lambda x: getCabinLetter(x))
    # turn the letter into a number because this important feature is needed to regress the age
    df['CabinLetter'] = pd.factorize(df['CabinLetter'])[0]
    # create binary features for each cabin letter
    if keep_binary:
        cletters = pd.get_dummies(df['CabinLetter']).rename(columns=lambda x: 'CabinLetter_' + str(x))
        df = pd.concat([df, cletters], axis=1)
    if keep_scaled:
        # create a feature for the numerical part of the cabin number
        df['CabinNumber'] = df['Cabin'].map(lambda x: getCabinNumber(x)).astype(int) + 1
        # scale the numbers so they can be used as continuous features
        scaler = preprocessing.StandardScaler()
        df['CabinNumber_scaled'] = scaler.fit_transform(df[['CabinNumber']])[:, 0]
        df['CabinLetter_scaled'] = scaler.fit_transform(df[['CabinLetter']])[:, 0]
        del df['CabinNumber']
        del df['CabinLetter']
    return df
### Utility method: get the alphabetical prefix of 'Ticket'
def getTicketPrefix(ticket):
    match = re.compile("([a-zA-Z\.\/]+)").search(ticket)
    if match:
        return match.group(0)
    else:
        return 'U'
### Utility method: get the numerical component of 'Ticket'
def getTicketNumber(ticket):
    match = re.compile("([0-9]+)").search(ticket)
    if match:
        return match.group(0)
    else:
        return '0'
### Generate features from 'Ticket'
def processTicket(df, keep_binary=False, keep_bins=False, keep_scaled=False):
    df['TicketPrefix'] = df['Ticket'].map(lambda x: getTicketPrefix(x.upper()))
    df['TicketPrefix'] = df['TicketPrefix'].map(lambda x: re.sub('[\.?\/?]', '', x))
    df['TicketPrefix'] = df['TicketPrefix'].map(lambda x: re.sub('STON', 'SOTON', x))
    df['TicketNumber'] = df['Ticket'].map(lambda x: getTicketNumber(x))
    df['TicketNumberStart'] = df['TicketNumber'].map(lambda x: x[0]).astype(int)
    if keep_binary:
        numberstart = pd.get_dummies(df['TicketNumberStart']).rename(columns=lambda x: 'TicketNumberStart_' + str(x))
        df = pd.concat([df, numberstart], axis=1)
    if keep_bins:
        # shift by 1 to help the interaction-feature step
        df['TicketPrefix_id'] = pd.factorize(df['TicketPrefix'])[0] + 1
    if keep_scaled:
        scaler = preprocessing.StandardScaler()
        # TicketNumber is still a string here, so cast it before scaling
        df['TicketNumber_scaled'] = scaler.fit_transform(df[['TicketNumber']].astype(int))[:, 0]
        df['TicketPrefix_id_scaled'] = scaler.fit_transform(df[['TicketPrefix_id']])[:, 0]
    del df['Ticket'], df['TicketNumber'], df['TicketPrefix'], df['TicketNumberStart'], df['TicketPrefix_id']
    return df
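Tracing the helpers on one real ticket string from train.csv ("STON/O2. 3101282") shows what the prefix normalization does:

t = 'STON/O2. 3101282'                # a real ticket value from train.csv
p = getTicketPrefix(t.upper())        # 'STON/O'
p = re.sub('[\.?\/?]', '', p)         # 'STONO'  (dots and slashes removed)
p = re.sub('STON', 'SOTON', p)        # 'SOTONO' (Southampton prefix unified)
print(p, getTicketNumber(t))          # SOTONO 3101282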
Next, pairwise interaction features are generated automatically from the scaled numeric features (this fragment sits inside a larger processing function, hence the keep_interactive_auto flag):
if keep_interactive_auto:
    numerics = df[['Names_scaled', 'SibSp_scaled', 'Parch_scaled', 'TicketPrefix_id_scaled', 'Fare_scaled', 'CabinNumber_scaled',
                   'Pclass_scaled', 'Title_id_scaled', 'TicketNumber_scaled', 'CabinLetter_scaled', 'Embarked_scaled', 'Age_scaled']]
    # print "\nFeatures used for automated feature generation:\n", numerics.head(10)
    new_fields_count = 0
    # pair every scaled feature with every other (squares on the diagonal)
    for i in range(0, numerics.columns.size):
        for j in range(0, numerics.columns.size):
            if i <= j:
                name = str(numerics.columns.values[i]) + '*' + str(numerics.columns.values[j])
                df = pd.concat([df, pd.Series(numerics.iloc[:, i] * numerics.iloc[:, j], name=name)], axis=1)
                new_fields_count += 1
            if i < j:
                name = str(numerics.columns.values[i]) + '+' + str(numerics.columns.values[j])
                df = pd.concat([df, pd.Series(numerics.iloc[:, i] + numerics.iloc[:, j], name=name)], axis=1)
                new_fields_count += 1
            if not i == j:
                name = str(numerics.columns.values[i]) + '/' + str(numerics.columns.values[j])
                df = pd.concat([df, pd.Series(numerics.iloc[:, i] / numerics.iloc[:, j], name=name)], axis=1)
                name = str(numerics.columns.values[i]) + '-' + str(numerics.columns.values[j])
                df = pd.concat([df, pd.Series(numerics.iloc[:, i] - numerics.iloc[:, j], name=name)], axis=1)
                new_fields_count += 2
# drop one of every pair of features whose Spearman correlation exceeds 0.9
df_corr = df.drop(['Survived', 'PassengerId'], axis=1).corr(method='spearman')
mask = np.ones(df_corr.columns.size) - np.eye(df_corr.columns.size)  # zero the diagonal
df_corr = df_corr * mask
drops = []
for col in df_corr.columns.values:
    # skip columns already marked for dropping
    if np.in1d([col], drops):
        continue
    corr = df_corr.index[abs(df_corr[col]) > 0.9].values
    drops = np.union1d(drops, corr)
# print "\nDropping", drops.shape[0], "highly correlated features"
df.drop(drops, axis=1, inplace=True)
That wraps up the feature processing: roughly 200 mutually uncorrelated features (binary ones plus scaled scalar ones) are generated for the different classifiers to choose from. Next, a rough random-forest fit ranks the features by importance, keeping only those above 30% of the maximum importance:
print "\nRough fitting a RandomForest to determine feature importance...."
forest=RandomForestClassifier(oob_score=True,n_estimators=10000,n_jobs=-1)
forest.fit(X,y)
feature_importance=forest.feature_importances_
feature_importance=100.0*(feature_importance/feature_importance.max())
#print "Feature importances:\n", feature_importance
fi_threshold=30
important_idx=np.where(feature_importance>fi_threshold)[0]
important_features=features_list[important_idx]
#print "\n", important_features.shape[0], "Important features(>", fi_threshold, "percent of max importance)...\n",important_features
sorted_idx=np.argsort(feature_importance[important_idx])[::-1]
#plot feature importance
pos=np.arange(sorted_idx.shape[0])+0.5
plt.subplot(1,2,2)
plt.barh(pos,feature_importance[important_idx][sorted_idx[::-1]],align='center')
plt.yticks(pos,important_features[sorted_idx[::-1]])
plt.xlabel('Relative Importance')
plt.title('Feature Importance')
plt.draw()
plt.show()
The random forest hyperparameters are then tuned with a randomized search over the important features:
sqrtfeat = int(np.sqrt(X.shape[1]))
params_test = {"n_estimators": [10000],
               "max_features": np.rint(np.linspace(sqrtfeat, sqrtfeat, 3)).astype(int),
               "min_samples_split": np.rint(np.linspace(X.shape[0]*0.01, X.shape[0]*0.2, 30)).astype(int)}
print "Hyperparameter optimization using RandomizedSearchCV..."
rand_search = RandomizedSearchCV(forest, param_distributions=params_test, n_jobs=7, cv=4, n_iter=100)
rand_search.fit(X, y)
best_params = report(rand_search.grid_scores_)
params = best_params
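The report helper used above is not shown in this excerpt; a minimal sketch compatible with the old grid_scores_ API (scikit-learn < 0.18, matching the code here) could look like this (the author's actual helper is an assumption):

from operator import itemgetter

def report(grid_scores, n_top=3):
    # grid_scores_ entries are (parameters, mean_validation_score, cv_validation_scores)
    top = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, s in enumerate(top):
        print("Rank {0}: mean {1:.4f} (std {2:.4f}) {3}".format(
            i + 1, s.mean_validation_score, s.cv_validation_scores.std(), s.parameters))
    return top[0].parameters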
Finally, the five classifiers (each built in its own module) predict separately, and their votes are combined:
test_ids, ret1, w1 = rf.Titanic_rf()
test_ids, ret2, w2 = gbdt.Titanic_gbdt()
test_ids, ret3, w3 = svc.Titanic_svc()
test_ids, ret4, w4 = adbst.Titanic_adbst()
test_ids, ret5, w5 = lg.Titanic_lg()
# map {0,1} predictions to {-1,+1} so the weighted sum acts as a signed vote
ret1 = np.where(ret1 == 1, 1, -1)
ret2 = np.where(ret2 == 1, 1, -1)
ret3 = np.where(ret3 == 1, 1, -1)
ret4 = np.where(ret4 == 1, 1, -1)
ret5 = np.where(ret5 == 1, 1, -1)
# weighted vote (the random forest gets a small extra weight); positive sum -> survived
votes = (w1 + 0.03)*ret1 + w2*ret2 + w3*ret3 + w4*ret4 + w5*ret5
votes = np.where(votes <= 0, 0, 1)
submission=np.asarray(zip(test_ids,votes)).astype(int)
#ensure passenger IDs in ascending order
output=submission[submission[:,0].argsort()]
predict_file=open(path+"predict.csv",'wb')
file_object=csv.writer(predict_file)
file_object.writerow(["PassengerId","Survived"])
file_object.writerows(output)
predict_file.close()
print 'Done'
Submitting the resulting predictions to Kaggle gives an accuracy of 80.04%, currently ranked 350/1828, roughly top 20%. On my own validation set the accuracy reaches about 84%, so the model is still overfitting; the final bagging did not bring the variance down as hoped, and I clearly still have things to learn. The choice of hidden features may also not be good enough; suggestions are very welcome. The Python code for this project is on my GitHub. That's all.