As one of the most classic beginner cases on Kaggle, Titanic was the natural choice for me to practice on and get familiar with Kaggle. With the guidance of 唐宇迪's walkthrough videos and Kaggle's intro videos, I did my best to understand and memorize the material, typed out the code once myself, and got a score for my submission on Kaggle. (There's a file-format issue, so this will be updated later.) My approach was to watch the videos while jotting the key points down in a Notebook, then work through those key points, the reasoning, and the code, and finally type the whole thing out independently, so I also wanted to record and share it with everyone here on CSDN.
Without further ado, let's get started!
I'll skip the background here (honestly because I learned this last week and have already forgotten most of it Ծ‸Ծ). I'm putting the dataset here for anyone who needs it (I downloaded it directly from the official site myself, so it's reliable!!)
Link: https://pan.baidu.com/s/1WN2TAvhVsOyIwDmqf1Th8Q
Extraction code: xstc
#Read in the data, print the first 5 rows, and generate descriptive statistics
import pandas as pd
titanic=pd.read_csv(r'E:\Google\下载\train.csv',engine='python')
print(titanic.head())
titanic.describe()
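Before filling anything, it helps to see exactly which columns have missing values. A minimal check (my own addition, assuming the same titanic DataFrame as above):
#Count missing values per column; in train.csv it's Age, Cabin and Embarked
print(titanic.isnull().sum())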
#Fill the missing values in the Age column
titanic["Age"]=titanic["Age"].fillna(titanic["Age"].median())
#Remember the parentheses on median()!! This thing tortured me for a whole afternoon: without them, the linear regression below keeps throwing TypeError: float() argument must be a string or a number, not 'method'. It also taught me that when an error pops up somewhere, the real cause may be earlier in the code.
titanic.head()
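To make the parentheses trap concrete, here is a minimal sketch (my own illustration) of what each expression actually is:
#Without parentheses you get the bound method object, not a number
print(type(titanic["Age"].median))   # <class 'method'>
print(type(titanic["Age"].median())) # <class 'numpy.float64'>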
print(titanic["Sex"].unique())
#Machine learning algorithms generally can't handle string values, so convert string to int
#Locate the samples and assign values
titanic.loc[titanic["Sex"]=="male","Sex"]=0
titanic.loc[titanic["Sex"]=="female","Sex"]=0
#Fill the missing Embarked values and convert to int
titanic["Embarked"]=titanic["Embarked"].fillna('S')
titanic.loc[titanic['Embarked']=="S","Embarked"]=0
titanic.loc[titanic['Embarked']=="C","Embarked"]=1
titanic.loc[titanic['Embarked']=="Q","Embarked"]=2
print(titanic.head())
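Filling with 'S' is reasonable because Southampton is by far the most common port of embarkation; you can confirm this yourself (run it before the numeric conversion above):
#'S' dominates the Embarked counts, so it is a sensible fill value
print(titanic["Embarked"].value_counts())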
#Import linear regression and cross-validation
from sklearn.linear_model import LinearRegression
#from sklearn.cross_validation import KFold #the old module was removed; KFold now lives in model_selection
from sklearn.model_selection import KFold
#Select the features to feed into the algorithm later
predictors=["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
#Use alg = the linear regression model
alg=LinearRegression()
#.shape is an (m,n) tuple, so .shape[0] gives the number of samples; 3-fold cross-validation means splitting the m samples into 3 equal parts
#kf=KFold(titanic.shape[0],n_folds=3,random_state=1) #old sklearn.cross_validation API
#Note: random_state only takes effect when shuffle=True; newer sklearn raises an error if you pass it with shuffle=False
kf = KFold(n_splits=3)
#Read the input data and take out the feature columns for the training rows
#Take out the label values, i.e. the Survived column
#Fit the linear regression on the current fold's data
#Predict on the test fold
#Finally collect the fold results into predictions
predictions=[]
for train, test in kf.split(titanic):
#for train,test in kf.split(titanic["Survived"]):
    train_predictors=(titanic[predictors].iloc[train,:])
    train_target=titanic["Survived"].iloc[train]
    alg.fit(train_predictors,train_target)
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)
#The predictions are in three separate numpy arrays. Concatenate them into one.
#Map the values in the [0,1] interval to class labels
#Compute the accuracy
import numpy as np
predictions=np.concatenate(predictions,axis=0)
predictions[predictions> .5]=1
predictions[predictions<= .5]=0
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print(accuracy)
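A slightly tidier way to get the same number (my own equivalent one-liner): the mean of a boolean array is exactly the fraction of correct predictions.
#Equivalent accuracy computation
accuracy = (predictions == titanic["Survived"]).mean()
print(accuracy)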
#Using logistic regression
#sklearn.cross_validation was removed; cross_val_score now lives in sklearn.model_selection
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
alg=LogisticRegression(random_state=1)
scores=cross_val_score(alg,titanic[predictors],titanic["Survived"],cv=3)
print(scores.mean())
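Since the whole point is a Kaggle submission, here is a hedged sketch of how the fitted logistic model could produce the submission file. The titanic_test DataFrame and its preprocessing are my assumptions, not code from the video: test.csv needs exactly the same Age/Sex/Embarked treatment as train.csv (plus filling its one missing Fare):
#Assumed: titanic_test is test.csv after the same preprocessing as above
#titanic_test["Fare"]=titanic_test["Fare"].fillna(titanic_test["Fare"].median())
alg.fit(titanic[predictors],titanic["Survived"])
test_pred=alg.predict(titanic_test[predictors])
submission=pd.DataFrame({"PassengerId":titanic_test["PassengerId"],"Survived":test_pred})
submission.to_csv("titanic_submission.csv",index=False)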
#Random forest: bootstrap sampling with replacement + random feature selection, with many decision trees voting
#Import and set the parameters: min_samples_split is the minimum number of samples required to split a node, min_samples_leaf is the minimum number of samples at a leaf
#Cross-validation
#Note the change here: cross_val_score() and KFold() should both come from sklearn.model_selection now, not the old sklearn.cross_validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
predictors=["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
alg=RandomForestClassifier(random_state=1,n_estimators=10,min_samples_split=2,min_samples_leaf=1)
kf=KFold(n_splits=3) #random_state only matters with shuffle=True; passing it with shuffle=False errors in newer sklearn
scores=cross_val_score(alg,titanic[predictors],titanic["Survived"],cv=kf)
print(scores.mean())
(My run here only scored about 0.67, and this write-up isn't finished yet; to be continued. One obvious knob to try first is sketched below.)
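When I pick this back up, the first thing to try is simply a bigger forest; a quick sketch (these parameter values are my own guesses, not from the video):
#Same pipeline, just more and slightly regularized trees
alg=RandomForestClassifier(random_state=1,n_estimators=50,min_samples_split=4,min_samples_leaf=2)
scores=cross_val_score(alg,titanic[predictors],titanic["Survived"],cv=kf)
print(scores.mean())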