小编一般喜欢在在代码的刚开始便导入所有要使用的库,这样可以直观明了的将自己的思路展现出来,并且对代码的布局也更加美观。
下面是泰坦尼克号需要使用的库:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
其中关于matplotlib,numpy,pandas工具的使用可以参考小编写的例外一篇博客
机器学习入门基本使用工具(保姆式教学):matplotlib,numpy,pandas这一篇就够了
这里是泰坦尼克号的数据集链接:https://pan.baidu.com/s/1C8mRRSkSdBsVRZ5zv_zCbA?pwd=w45i
提取码:w45i
这里导入5个模型,用来比较各种模型之间的差距
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
#训练数据集
train_data = pd.read_csv(r"D:\data\python\taitanic\train.csv")
#测试数据集
test_data = pd.read_csv(r"D:\data\python\taitanic\test.csv")
>>>print('训练数据集:', train_data.shape, '测试数据集:', test_data.shape)
>训练数据集: (891, 12) 测试数据集: (418, 11)
>>>train_data.isnull().sum()
>PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
查看上船港口数
>>>plt.figure(figsize=(10,5),dpi=100)
>>>train_data['Embarked'].value_counts().plot(kind='bar')
>>>plt.savefig(r"D:\data\python\exercise\test2\1.png")
可以看到从S港口上船的人数最多,而数据只缺失两个,所以选取频率最高的填充到空白数据中
>>>train_data['Embarked'] = train_data['Embarked'].fillna('S')
>>>train_data.isnull().sum()
>PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
处理空白年龄(Age),使用平均值填充
>>>train_data['Age'] = train_data['Age'].fillna(int(train_data['Age'].mean()))
>>>train_data.isnull().sum()
>PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
至此除了Cabin(船舱号)其他缺失值已经补充完毕.Cabin这一列数据值缺失过多选择填充会导致得到的数据过于片面,因此,选择删去这一列
>>>train_data.drop(columns = 'Cabin', axis=1,inplace=True)
>>>train_data.isnull().sum()
>PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Embarked 0
dtype: int64
至此,训练集的缺失值已经处理完毕
>>>train_data.head()
>
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S
使用get_dummies进行one-hot编码,列名前缀是Pclass
>>>pclassdf1 = pd.DataFrame()
>>>pclassdf1 = pd.get_dummies(train_data['Pclass'] , prefix='Pclass' )
>>>train_data = pd.concat([train_data, pclassdf1], axis=1)
>>>train_data.drop('Pclass',axis=1, inplace=True)
>>>train_data.head()
PassengerId Survived Name Sex Age SibSp Parch Ticket Fare Embarked Pclass_1 Pclass_2 Pclass_3
0 1 0 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 0 0 1
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 1 0 0
2 3 1 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 0 0 1
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 1 0 0
4 5 0 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 0 0 1
>>>def gettitle(name):
>>> str1 = name.split(',')[1] #Mr. Owen Harris
>>> str2 = str1.split('.')[0]#Mr
>>> str3 = str2.strip()
>>> return str3
#存放提取后的特征
>>>titledf1 = pd.DataFrame()
#map函数:对Series每个数据应用自定义的函数计算
>>>titledf1['Title'] = train_data['Name'].map(gettitle)
#查看titledf的种类
>>>titledf1['Title'].value_counts()
>Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Col 2
Major 2
Mlle 2
Capt 1
Lady 1
Jonkheer 1
Mme 1
Sir 1
the Countess 1
Ms 1
Don 1
Name: Title, dtype: int64
#姓名中头衔字符串与定义头衔类别的对应关系
>>>title_mapdict = {
>>> "Capt": "Officer",
>>> "Col": "Officer",
>>> "Major": "Officer",
>>> "Jonkheer": "Royalty",
>>> "Don": "Royalty",
>>> "Sir": "Royalty",
>>> "Dr": "Officer",
>>> "Rev": "Officer",
>>> "the Countess":"Royalty",
>>> "Dona": "Royalty",
>>> "Mme": "Mrs",
>>> "Mlle": "Miss",
>>> "Ms": "Mrs",
>>> "Mr": "Mr",
>>> "Mrs": "Mrs",
>>> "Miss": "Miss",
>>> "Master": "Master",
>>> "Lady": "Royalty"
>>> }
#map函数:对Series每个数据应用自定义的函数计算
>>>titledf1['Title'] = titledf1['Title'].map(title_mapdict)
#使用get_dummies进行one-hot编码
>>>titledf1 = pd.get_dummies(titledf1['Title'])
>>>train_data = pd.concat([train_data, titledf1], axis=1)
>>>train_data.drop('Name',axis=1, inplace=True)
>>>train_data.head()
>
PassengerId Survived Sex Age SibSp Parch Ticket Fare Embarked Pclass_1 Pclass_2 Pclass_3 Master Miss Mr Mrs Officer Royalty
0 1 0 male 22.0 1 0 A/5 21171 7.2500 S 0 0 1 0 0 1 0 0 0
1 2 1 female 38.0 1 0 PC 17599 71.2833 C 1 0 0 0 0 0 1 0 0
2 3 1 female 26.0 0 0 STON/O2. 3101282 7.9250 S 0 0 1 0 1 0 0 0 0
3 4 1 female 35.0 1 0 113803 53.1000 S 1 0 0 0 0 0 1 0 0
4 5 0 male 35.0 0 0 373450 8.0500 S 0 0 1 0 0 1 0 0 0
#处理性别
>>>sex_mapdict = {'male': 1, 'female': 0}
#map函数:对Series每个数据应用自定义的函数计算
>>>train_data['Sex'] = train_data['Sex'].map(sex_mapdict)
>>>sexdf1 = pd.DataFrame()
#使用get_dummies进行one-hot编码,产生虚拟变量(dummy variables),列名前缀是Sex
>>>sexdf1 = pd.get_dummies( train_data['Sex'], prefix='Sex')
>>>train_data = pd.concat([train_data, sexdf1], axis=1)
>>>train_data.drop('Sex',axis=1, inplace=True)
>>>train_data.head()
>
PassengerId Survived Age SibSp Parch Ticket Fare Embarked Pclass_1 Pclass_2 Pclass_3 Master Miss Mr Mrs Officer Royalty Sex_0 Sex_1
0 1 0 22.0 1 0 A/5 21171 7.2500 S 0 0 1 0 0 1 0 0 0 0 1
1 2 1 38.0 1 0 PC 17599 71.2833 C 1 0 0 0 0 0 1 0 0 1 0
2 3 1 26.0 0 0 STON/O2. 3101282 7.9250 S 0 0 1 0 1 0 0 0 0 1 0
3 4 1 35.0 1 0 113803 53.1000 S 1 0 0 0 0 0 1 0 0 1 0
4 5 0 35.0 0 0 373450 8.0500 S 0 0 1 0 0 1 0 0 0 0 1
>>>plt.figure(figsize=(18,4))
>>>train_data['Age']=train_data['Age'].astype(np.int)
>>>average_age=train_data[['Age','Survived']].groupby('Age',as_index=False).mean()
>>>sns.barplot(x='Age',y='Survived',data=average_age,palette='BuPu')
>>>plt.savefig(r"D:\data\python\exercise\test2\2.png")
>>>agedf1 = pd.DataFrame(train_data['Age'])
>>>agedf1['Age_baby'] = agedf1['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
>>>agedf1['Age_youth'] = agedf1['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
>>>agedf1['Age_middle'] = agedf1['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
>>>agedf1['Age_old'] = agedf1['Age'].map(lambda s: 1 if 60 <= s else 0)
>>>train_data = pd.concat([train_data,agedf1], axis=1)
>>>train_data.drop('Age',axis=1, inplace=True)
>>>train_data.head()
>
PassengerId Survived SibSp Parch Ticket Fare Embarked Pclass_1 Pclass_2 Pclass_3 ... Mr Mrs Officer Royalty Sex_0 Sex_1 Age_baby Age_youth Age_middle Age_old
0 1 0 1 0 A/5 21171 7.2500 S 0 0 1 ... 1 0 0 0 0 1 0 1 0 0
1 2 1 1 0 PC 17599 71.2833 C 1 0 0 ... 0 1 0 0 1 0 0 0 1 0
2 3 1 0 0 STON/O2. 3101282 7.9250 S 0 0 1 ... 0 0 0 0 1 0 0 1 0 0
3 4 1 1 0 113803 53.1000 S 1 0 0 ... 0 1 0 0 1 0 0 0 1 0
4 5 0 0 0 373450 8.0500 S 0 0 1 ... 1 0 0 0 0 1 0 0 1 0
#存放家庭信息
>>>familydf1 = pd.DataFrame()
#家庭人数
>>>familydf1['FamilySize'] = train_data['Parch'] + train_data['SibSp'] + 1
#家庭类别
#if 条件为真的时候返回if前面内容,否则返回0
>>>familydf1['Family_Single'] = familydf1['FamilySize'].map(lambda s: 1 if s == 1 else 0)
>>>familydf1['Family_Small'] = familydf1['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
>>>familydf1['Family_Large'] = familydf1['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
>>>familydf1.drop('FamilySize',axis=1, inplace=True)
>>>train_data = pd.concat([train_data, familydf1], axis=1)
>>>train_data.drop('Parch',axis=1, inplace=True)
>>>train_data.drop('SibSp',axis=1, inplace=True)
>>>train_data.head()
>
PassengerId Survived Ticket Fare Embarked Pclass_1 Pclass_2 Pclass_3 Master Miss ... Royalty Sex_0 Sex_1 Age_baby Age_youth Age_middle Age_old Family_Single Family_Small Family_Large
0 1 0 A/5 21171 7.2500 S 0 0 1 0 0 ... 0 0 1 0 1 0 0 0 1 0
1 2 1 PC 17599 71.2833 C 1 0 0 0 0 ... 0 1 0 0 0 1 0 0 1 0
2 3 1 STON/O2. 3101282 7.9250 S 0 0 1 0 1 ... 0 1 0 0 1 0 0 1 0 0
3 4 1 113803 53.1000 S 1 0 0 0 0 ... 0 1 0 0 0 1 0 0 1 0
4 5 0 373450 8.0500 S 0 0 1 0 0 ... 0 0 1 0 0 1 0 1 0 0
船票在这里没什么用处,选择删除
train_data.drop('Ticket',axis=1, inplace=True)
>>>faredf1=pd.DataFrame(train_data['Fare'])
>>>faredf1['Fare_First']=faredf1['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
>>>faredf1['Fare_Second']=faredf1['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
>>>faredf1['Fare_Third']=faredf1['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
>>>train_data = pd.concat([train_data,faredf1], axis=1)
>>>train_data.drop('Fare',axis=1, inplace=True)
>>>train_data.head()
>
PassengerId Survived Embarked Pclass_1 Pclass_2 Pclass_3 Master Miss Mr Mrs ... Age_baby Age_youth Age_middle Age_old Family_Single Family_Small Family_Large Fare_First Fare_Second Fare_Third
0 1 0 S 0 0 1 0 0 1 0 ... 0 1 0 0 0 1 0 0 0 1
1 2 1 C 1 0 0 0 0 0 1 ... 0 0 1 0 0 1 0 1 0 0
2 3 1 S 0 0 1 0 1 0 0 ... 0 1 0 0 1 0 0 0 0 1
3 4 1 S 1 0 0 0 0 0 1 ... 0 0 1 0 0 1 0 1 0 0
4 5 0 S 0 0 1 0 0 1 0 ... 0 0 1 0 1 0 0 0 0 1
#处理上船港口
>>>embarkeddf1 = pd.DataFrame()
#使用get_dummies进行one-hot编码,产生虚拟变量(dummy variables),列名前缀是Embarked
>>>embarkeddf1 = pd.get_dummies(train_data['Embarked'], prefix='Embarked')
>>>train_data = pd.concat([train_data, embarkeddf1], axis=1)
>>>train_data.drop('Embarked',axis=1, inplace=True)
>>>train_data.head()
>
PassengerId Survived Pclass_1 Pclass_2 Pclass_3 Master Miss Mr Mrs Officer ... Age_old Family_Single Family_Small Family_Large Fare_First Fare_Second Fare_Third Embarked_C Embarked_Q Embarked_S
0 1 0 0 0 1 0 0 1 0 0 ... 0 0 1 0 0 0 1 0 0 1
1 2 1 1 0 0 0 0 0 1 0 ... 0 0 1 0 1 0 0 1 0 0
2 3 1 0 0 1 0 1 0 0 0 ... 0 1 0 0 0 0 1 0 0 1
3 4 1 1 0 0 0 0 0 1 0 ... 0 0 1 0 1 0 0 0 0 1
4 5 0 0 0 1 0 0 1 0 0 ... 0 1 0 0 0 0 1 0 0 1
至此所有的训练集数据均处理完了
按照如上操作再处理测试集,为了方便预测结果需要将测试集的数据处理的与训练集相同
首先,查看缺失的数量
>>>test_data.isnull().sum()
>PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
#处理测试集数据
>>>test_data['Age'] = test_data['Age'].fillna(int(test_data['Age'].mean()))
>>>test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
>>>test_data.drop(columns = 'Cabin', axis=1,inplace=True)
>>>test_data.isnull().sum()
>PassengerId 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Embarked 0
dtype: int64
#处理年龄
>>>agedf2 = pd.DataFrame(test_data['Age'])
>>>agedf2['Age_baby'] = agedf2['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
>>>agedf2['Age_youth'] = agedf2['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
>>>agedf2['Age_middle'] = agedf2['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
>>>agedf2['Age_old'] = agedf2['Age'].map(lambda s: 1 if 60 <= s else 0)
>>>test_data = pd.concat([test_data, agedf2], axis=1)
>>>test_data.drop('Age',axis=1, inplace=True)
#处理性别
>>>test_data['Sex'] = test_data['Sex'].map(sex_mapdict)
>>>sexdf2 = pd.DataFrame()
>>>sexdf2 = pd.get_dummies( test_data['Sex'], prefix='Sex')
>>>test_data = pd.concat([test_data, sexdf2], axis=1)
>>>test_data.drop('Sex',axis=1, inplace=True)
#处理上船港口
>>>embarkeddf2 = pd.DataFrame()
>>>embarkeddf2 = pd.get_dummies(test_data['Embarked'], prefix='Embarked')
>>>test_data = pd.concat([test_data, embarkeddf2], axis=1)
>>>test_data.drop('Embarked',axis=1, inplace=True)
#处理用户阶级
>>>pclassdf2 = pd.DataFrame()
>>>pclassdf2 = pd.get_dummies( test_data['Pclass'] , prefix='Pclass' )
>>>test_data = pd.concat([test_data, pclassdf2], axis=1)
>>>test_data.drop('Pclass',axis=1, inplace=True)
#存放提取后的特征
>>>titledf2 = pd.DataFrame()
>>>titledf2['Title'] = test_data['Name'].map(gettitle)
>>>titledf2['Title'] = titledf2['Title'].map(title_mapdict)
>>>titledf2 = pd.get_dummies(titledf2['Title'])
>>>test_data = pd.concat([test_data, titledf2], axis=1)
>>>test_data.drop('Name',axis=1, inplace=True)
#存放家庭信息
>>>familydf2 = pd.DataFrame()
>>>familydf2['FamilySize'] = test_data['Parch'] + test_data['SibSp'] + 1
>>>familydf2['Family_Single'] = familydf2['FamilySize'].map(lambda s: 1 if s == 1 else 0)
>>>familydf2['Family_Small'] = familydf2['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
>>>familydf2['Family_Large'] = familydf2['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
>>>familydf2.drop('FamilySize',axis=1, inplace=True)
>>>test_data = pd.concat([test_data, familydf2], axis=1)
>>>test_data.drop('Parch',axis=1, inplace=True)
>>>test_data.drop('SibSp',axis=1, inplace=True)
#处理船票
>>>faredf2=pd.DataFrame(test_data['Fare'])
>>>faredf2['Fare_First']=faredf2['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
>>>faredf2['Fare_Second']=faredf2['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
>>>faredf2['Fare_Third']=faredf2['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
>>>test_data = pd.concat([test_data, faredf2], axis=1)
>>>test_data.drop('Fare',axis=1, inplace=True)
>>>test_data.drop('Ticket',axis=1, inplace=True)
>>>test_data.head()
>
PassengerId Age_baby Age_youth Age_middle Age_old Sex_0 Sex_1 Embarked_C Embarked_Q Embarked_S ... Mr Mrs Officer Royalty Family_Single Family_Small Family_Large Fare_First Fare_Second Fare_Third
0 892 0 1 0 0 0 1 0 1 0 ... 1 0 0 0 1 0 0 0 0 1
1 893 0 0 1 0 1 0 0 0 1 ... 0 1 0 0 0 1 0 0 0 1
2 894 0 0 0 1 0 1 0 1 0 ... 1 0 0 0 1 0 0 0 0 1
3 895 0 1 0 0 0 1 0 0 1 ... 1 0 0 0 1 0 0 0 0 1
4 896 0 1 0 0 1 0 0 0 1 ... 0 1 0 0 0 1 0 0 1 0
>>>corrdf = train_data.corr()
> PassengerId Survived Pclass_1 Pclass_2 Pclass_3 Master Miss Mr Mrs Officer ... Age_old Family_Single Family_Small Family_Large Fare_First Fare_Second Fare_Third Embarked_C Embarked_Q Embarked_S
PassengerId 1.000000 -0.005007 0.034303 -0.000086 -0.029486 -0.026151 -0.067846 0.038850 0.010197 0.055299 ... 0.006611 0.057462 -0.028976 -0.057055 0.022603 -0.028772 -0.002661 -0.001205 -0.033606 0.022204
Survived -0.005007 1.000000 0.285904 0.093349 -0.322308 0.085221 0.332795 -0.549199 0.344935 -0.031316 ... -0.040857 -0.203367 0.279855 -0.125147 0.254274 0.066213 -0.270267 0.168240 0.003650 -0.149683
Pclass_1 0.034303 0.285904 1.000000 -0.288585 -0.626738 -0.084700 0.021958 -0.097288 0.091483 0.104919 ... 0.166443 -0.113364 0.168568 -0.092945 0.683722 -0.177457 -0.458268 0.296423 -0.155342 -0.161921
Pclass_2 -0.000086 0.093349 -0.288585 1.000000 -0.565210 0.009903 -0.027381 -0.088569 0.125093 0.084401 ... -0.022555 -0.039070 0.104546 -0.117721 -0.153508 0.404474 -0.259749 -0.125416 -0.127301 0.189980
Pclass_3 -0.029486 -0.322308 -0.626738 -0.565210 1.000000 0.064918 0.003366 0.155907 -0.180630 -0.159089 ... -0.125051 0.129472 -0.230325 0.175890 -0.464164 -0.176287 0.606245 -0.153329 0.237449 -0.015104
Master -0.026151 0.085221 -0.084700 0.009903 0.064918 1.000000 -0.110602 -0.254903 -0.088394 -0.031131 ... -0.037588 -0.267024 0.102668 0.324136 0.063844 0.099667 -0.144515 -0.035225 0.010478 0.024264
Miss -0.067846 0.332795 0.021958 -0.027381 0.003366 -0.110602 1.000000 -0.599803 -0.207996 -0.073253 ... -0.071973 -0.050402 -0.007684 0.111105 0.077733 -0.008436 -0.044849 0.037613 0.168720 -0.139126
Mr 0.038850 -0.549199 -0.097288 -0.088569 0.155907 -0.254903 -0.599803 1.000000 -0.479363 -0.168826 ... 0.066390 0.396920 -0.292792 -0.223221 -0.201253 -0.168402 0.316688 -0.072567 -0.078338 0.112870
Mrs 0.010197 0.344935 0.091483 0.125093 -0.180630 -0.088394 -0.207996 -0.479363 1.000000 -0.058544 ... -0.013465 -0.357826 0.365088 0.014670 0.121511 0.168896 -0.255565 0.066101 -0.091121 -0.000565
Officer 0.055299 -0.031316 0.104919 0.084401 -0.159089 -0.031131 -0.073253 -0.168826 -0.058544 1.000000 ... 0.069897 0.035074 -0.015279 -0.039269 0.056671 0.058263 -0.101410 -0.008034 0.012618 -0.000902
Royalty 0.031602 0.033391 0.132798 -0.038324 -0.083230 -0.016287 -0.038324 -0.088324 -0.030628 -0.010787 ... -0.013024 -0.000414 0.011568 -0.020544 0.055989 -0.018161 -0.061567 0.079020 -0.023105 -0.054685
Sex_0 -0.042939 0.543351 0.098013 0.064746 -0.137143 -0.159934 0.691548 -0.867334 0.552686 -0.089228 ... -0.072063 -0.303646 0.260747 0.102954 0.161102 0.116775 -0.230803 0.082853 0.074115 -0.119224
Sex_1 0.042939 -0.543351 -0.098013 -0.064746 0.137143 0.159934 -0.691548 0.867334 -0.552686 0.089228 ... 0.072063 0.303646 -0.260747 -0.102954 -0.161102 -0.116775 0.230803 -0.082853 -0.074115 0.119224
Age_baby -0.026833 0.122978 -0.128886 0.028373 0.087957 0.623234 0.214762 -0.340037 -0.114928 -0.044477 ... -0.053701 -0.349033 0.172907 0.352281 0.071520 0.142821 -0.189052 0.002974 -0.038734 0.021770
Age_youth -0.003044 -0.091170 -0.231081 -0.042761 0.233902 -0.249201 0.021126 0.188732 -0.083297 -0.111574 ... -0.235597 0.228904 -0.152384 -0.159111 -0.205044 -0.128593 0.305773 -0.023941 0.134290 -0.063535
Age_middle 0.024216 0.039188 0.290830 0.039091 -0.282395 -0.121518 -0.142926 -0.013824 0.177937 0.143963 ... -0.034638 -0.050408 0.079582 -0.049864 0.157938 0.065555 -0.212457 0.032550 -0.125563 0.050502
Age_old 0.006611 -0.040857 0.166443 -0.022555 -0.125051 -0.037588 -0.071973 0.066390 -0.013465 0.069897 ... 1.000000 0.045377 -0.035810 -0.021206 0.090135 -0.044792 -0.033278 0.001665 -0.005860 0.002229
Family_Single 0.057462 -0.203367 -0.113364 -0.039070 0.129472 -0.267024 -0.050402 0.396920 -0.357826 0.035074 ... 0.045377 1.000000 -0.859931 -0.336825 -0.339394 -0.240334 0.506354 -0.095298 0.086464 0.029074
Family_Small -0.028976 0.279855 0.168568 0.104546 -0.230325 0.102668 -0.007684 -0.292792 0.365088 -0.015279 ... -0.035810 -0.859931 1.000000 -0.190940 0.228243 0.242009 -0.411264 0.158586 -0.087093 -0.084120
Family_Large -0.057055 -0.125147 -0.092945 -0.117721 0.175890 0.324136 0.111105 -0.223221 0.014670 -0.039269 ... -0.021206 -0.336825 -0.190940 1.000000 0.231664 0.015760 -0.215131 -0.109274 -0.005620 0.099265
Fare_First 0.022603 0.254274 0.683722 -0.153508 -0.464164 0.063844 0.077733 -0.201253 0.121511 0.056671 ... 0.090135 -0.339394 0.228243 0.231664 1.000000 -0.408891 -0.497615 0.237676 -0.168737 -0.102027
Fare_Second -0.028772 0.066213 -0.177457 0.404474 -0.176287 0.099667 -0.008436 -0.168402 0.168896 0.058263 ... -0.044792 -0.240334 0.242009 0.015760 -0.408891 1.000000 -0.551912 -0.033551 -0.026076 0.045802
Fare_Third -0.002661 -0.270267 -0.458268 -0.259749 0.606245 -0.144515 -0.044849 0.316688 -0.255565 -0.101410 ... -0.033278 0.506354 -0.411264 -0.215131 -0.497615 -0.551912 1.000000 -0.166809 0.187891 0.027891
Embarked_C -0.001205 0.168240 0.296423 -0.125416 -0.153329 -0.035225 0.037613 -0.072567 0.066101 -0.008034 ... 0.001665 -0.095298 0.158586 -0.109274 0.237676 -0.033551 -0.166809 1.000000 -0.148258 -0.782742
Embarked_Q -0.033606 0.003650 -0.155342 -0.127301 0.237449 0.010478 0.168720 -0.078338 -0.091121 0.012618 ... -0.005860 0.086464 -0.087093 -0.005620 - 0.168737 -0.026076 0.187891 -0.148258 1.000000 -0.499421
Embarked_S 0.022204 -0.149683 -0.161921 0.189980 -0.015104 0.024264 -0.139126 0.112870 -0.000565 -0.000902 ... 0.002229 0.029074 -0.084120 0.099265 -0.102027 0.045802 0.027891 -0.782742 -0.499421 1.000000
>>>corrdf['Survived'].round(4).abs().sort_values(ascending=False)
>Survived 1.0000
Mr 0.5492
Sex_1 0.5434
Sex_0 0.5434
Mrs 0.3449
Miss 0.3328
Pclass_3 0.3223
Pclass_1 0.2859
Family_Small 0.2799
Fare_Third 0.2703
Fare_First 0.2543
Family_Single 0.2034
Embarked_C 0.1682
Embarked_S 0.1497
Family_Large 0.1251
Age_baby 0.1230
Pclass_2 0.0933
Age_youth 0.0912
Master 0.0852
Fare_Second 0.0662
Age_old 0.0409
Age_middle 0.0392
Royalty 0.0334
Officer 0.0313
PassengerId 0.0050
Embarked_Q 0.0037
Name: Survived, dtype: float64
>>>source_y = train_data.Survived
>>>source_x = train_data.drop(['Survived'],axis=1)
>>>train_x, test_x, train_y, test_y = train_test_split(source_x, source_y , train_size=0.8,test_size=0.2)
>>>model_lr = LogisticRegression()
>>>model_lr.fit(train_x, train_y)
>>>model_rfc = RandomForestClassifier()
>>>model_rfc.fit(train_x, train_y)
>>>model_svm = SVC()
>>>model_svm.fit(train_x, train_y)
>>>model_knn = KNeighborsClassifier()
>>>model_knn.fit(train_x, train_y)
>>>model_dtree = DecisionTreeClassifier()
>>>model_dtree.fit(train_x, train_y)
>>>accuracy_lr = model_lr.score(test_x,test_y)
>>>print("逻辑回归的测试结果:", accuracy_lr)
>逻辑回归的测试结果: 0.7988826815642458
>>>accuracy_rfc = model_rfc.score(test_x,test_y)
>>>print("随机森林的测试结果:", accuracy_rfc)
>随机森林的测试结果: 0.770949720670391
>>>accuracy_svm = model_svm.score(test_x,test_y)
>>>print("支持向量机的测试结果:", accuracy_svm)
>支持向量机的测试结果: 0.5698324022346368
>>>accuracy_knn = model_knn.score(test_x,test_y)
>>>print("K最近邻分类器的测试结果:", accuracy_knn)
>K最近邻分类器的测试结果: 0.553072625698324
>>>accuracy_dtree= model_dtree.score(test_x,test_y)
>>>print("决策树模型的测试结果:", accuracy_dtree)
>决策树模型的测试结果: 0.7541899441340782
>>>import seaborn as sns
>>>import matplotlib.pyplot as plt
>>>sns.set(rc={'figure.figsize':(15,6)})
>>>accuracys = [accuracy_lr, accuracy_rfc, accuracy_svm, accuracy_knn, accuracy_dtree]
>>>models = ['Logistic', 'RF', 'SVM', 'KNN', 'Dtree']
>>>bar = sns.barplot(x=models, y=accuracys)
# 显示数值标签
>>>for x, y in enumerate(accuracys):
>>> plt.text(x, y, '%s'% round(y,3), ha='center')
>>>plt.xlabel("Model")
>>>plt.ylabel("Accuracy")
>>>plt.savefig(r"D:\data\python\exercise\test2\3.png")
>>>plt.show()
>>>pred_lr = model_lr.predict(pred_x)
>>>pred_lr = pred_lr.astype(int)
>>>passenger_id = test_data.iloc[:, 0]
#逻辑回归的预测结果
>>>preddf1 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_lr})
>>>preddf1.to_csv(r'D:\data\python\taitanic\titanic_pred_model_lr.csv', index=False)
>>>pred_rfc = model_rfc.predict(pred_x)
>>>pred_rfc = pred_rfc.astype(int)
#随机森林的预测结果
>>>preddf2 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_rfc})
>>>preddf2.to_csv(r'D:\data\python\taitanic\titanic_pred_model_rfc.csv', index=False)
>>>pred_svm = model_svm.predict(pred_x)
>>>pred_svm = pred_svm.astype(int)
#支持向量机的预测结果
>>>preddf3 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_svm})
>>>preddf3.to_csv(r'D:\data\python\taitanic\titanic_pred_model_svm.csv', index=False)
>>>pred_knn = model_knn.predict(pred_x)
>>>pred_knn = pred_knn.astype(int)
#K最近邻分类器的预测结果
>>>preddf4 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_knn})
>>>preddf4.to_csv(r'D:\data\python\taitanic\titanic_pred_model_knn.csv', index=False)
>>>pred_dtree= model_dtree.predict(pred_x)
>>>pred_dtree = pred_dtree.astype(int)
#决策树模型的预测结果
>>>preddf5 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_dtree})
>>>preddf5.to_csv(r'D:\data\python\taitanic\titanic_pred_model_dtree.csv', index=False)
建议使用jupyter notebook完成以下内容
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
from sklearn.linear_model import LogisticRegression # 逻辑回归
from sklearn.ensemble import RandomForestClassifier # 随机森林
from sklearn.svm import SVC # 支持向量机
from sklearn.neighbors import KNeighborsClassifier # K最近邻
from sklearn.tree import DecisionTreeClassifier # 决策树
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#训练数据集
train_data = pd.read_csv(r"D:\data\python\taitanic\train.csv")
#测试数据集
test_data = pd.read_csv(r"D:\data\python\taitanic\test.csv")
print('训练数据集:', train_data.shape, '测试数据集:', test_data.shape)
train_data.isnull().sum()
plt.figure(figsize=(10,5),dpi=100)
train_data['Embarked'].value_counts().plot(kind='bar')
plt.savefig(r"D:\data\python\exercise\test2\1.png")
#选取频率最高的填充到空白数据中
train_data['Embarked'] = train_data['Embarked'].fillna('S')
train_data.isnull().sum()
#处理空白年龄(Age),使用平均值填充
train_data['Age'] = train_data['Age'].fillna(int(train_data['Age'].mean()))
train_data.isnull().sum()
#至此除了Cabin(船舱号)其他缺失值已经补充完毕.Cabin这一列数据值缺失过多选择填充会导致得到的数据过于片面,因此,选择删去这一列
train_data.drop(columns = 'Cabin', axis=1,inplace=True)
train_data.head()
#处理用户阶级
pclassdf1 = pd.DataFrame()
#使用get_dummies进行one-hot编码,列名前缀是Pclass
pclassdf1 = pd.get_dummies( train_data['Pclass'] , prefix='Pclass' )
train_data = pd.concat([train_data, pclassdf1], axis=1)
train_data.drop('Pclass',axis=1, inplace=True)
train_data.head()
def gettitle(name):
str1 = name.split(',')[1] #Mr. Owen Harris
str2 = str1.split('.')[0]#Mr
str3 = str2.strip()
return str3
#存放提取后的特征
titledf1 = pd.DataFrame()
#map函数:对Series每个数据应用自定义的函数计算
titledf1['Title'] = train_data['Name'].map(gettitle)
#查看titledf的种类
titledf1['Title'].value_counts()
#姓名中头衔字符串与定义头衔类别的对应关系
title_mapdict = {
"Capt": "Officer",
"Col": "Officer",
"Major": "Officer",
"Jonkheer": "Royalty",
"Don": "Royalty",
"Sir": "Royalty",
"Dr": "Officer",
"Rev": "Officer",
"the Countess":"Royalty",
"Dona": "Royalty",
"Mme": "Mrs",
"Mlle": "Miss",
"Ms": "Mrs",
"Mr": "Mr",
"Mrs": "Mrs",
"Miss": "Miss",
"Master": "Master",
"Lady": "Royalty"
}
#map函数:对Series每个数据应用自定义的函数计算
titledf1['Title'] = titledf1['Title'].map(title_mapdict)
#使用get_dummies进行one-hot编码
titledf1 = pd.get_dummies(titledf1['Title'])
train_data = pd.concat([train_data, titledf1], axis=1)
train_data.drop('Name',axis=1, inplace=True)
train_data.head()
#处理性别
sex_mapdict = {'male': 1, 'female': 0}
#map函数:对Series每个数据应用自定义的函数计算
train_data['Sex'] = train_data['Sex'].map(sex_mapdict)
sexdf1 = pd.DataFrame()
#使用get_dummies进行one-hot编码,产生虚拟变量(dummy variables),列名前缀是Sex
sexdf1 = pd.get_dummies( train_data['Sex'], prefix='Sex')
train_data = pd.concat([train_data, sexdf1], axis=1)
train_data.drop('Sex',axis=1, inplace=True)
train_data.head()
plt.figure(figsize=(18,4))
train_data['Age']=train_data['Age'].astype(np.int)
average_age=train_data[['Age','Survived']].groupby('Age',as_index=False).mean()
sns.barplot(x='Age',y='Survived',data=average_age,palette='BuPu')
plt.savefig(r"D:\data\python\exercise\test2\2.png")
agedf1 = pd.DataFrame(train_data['Age'])
agedf1['Age_baby'] = agedf1['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
agedf1['Age_youth'] = agedf1['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
agedf1['Age_middle'] = agedf1['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
agedf1['Age_old'] = agedf1['Age'].map(lambda s: 1 if 60 <= s else 0)
train_data = pd.concat([train_data,agedf1], axis=1)
train_data.drop('Age',axis=1, inplace=True)
train_data.head()
#存放家庭信息
familydf1 = pd.DataFrame()
#家庭人数
familydf1['FamilySize'] = train_data['Parch'] + train_data['SibSp'] + 1
#家庭类别
#if 条件为真的时候返回if前面内容,否则返回0
familydf1['Family_Single'] = familydf1['FamilySize'].map(lambda s: 1 if s == 1 else 0)
familydf1['Family_Small'] = familydf1['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
familydf1['Family_Large'] = familydf1['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
familydf1.drop('FamilySize',axis=1, inplace=True)
train_data = pd.concat([train_data, familydf1], axis=1)
train_data.drop('Parch',axis=1, inplace=True)
train_data.drop('SibSp',axis=1, inplace=True)
train_data.head()
#船票在这里没什么用处,选择删除
train_data.drop('Ticket',axis=1, inplace=True)
faredf1=pd.DataFrame(train_data['Fare'])
faredf1['Fare_First']=faredf1['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
faredf1['Fare_Second']=faredf1['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
faredf1['Fare_Third']=faredf1['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
train_data = pd.concat([train_data,faredf1], axis=1)
train_data.drop('Fare',axis=1, inplace=True)
train_data.head()
#处理上船港口
embarkeddf1 = pd.DataFrame()
#使用get_dummies进行one-hot编码,产生虚拟变量(dummy variables),列名前缀是Embarked
embarkeddf1 = pd.get_dummies(train_data['Embarked'], prefix='Embarked')
train_data = pd.concat([train_data, embarkeddf1], axis=1)
train_data.drop('Embarked',axis=1, inplace=True)
train_data.head()
test_data.isnull().sum()
#处理测试集数据
test_data['Age'] = test_data['Age'].fillna(int(test_data['Age'].mean()))
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
test_data.drop(columns = 'Cabin', axis=1,inplace=True)
test_data.isnull().sum()
agedf2 = pd.DataFrame(test_data['Age'])
agedf2['Age_baby'] = agedf2['Age'].map(lambda s: 1 if 0 <= s < 15 else 0)
agedf2['Age_youth'] = agedf2['Age'].map(lambda s: 1 if 15 <= s < 35 else 0)
agedf2['Age_middle'] = agedf2['Age'].map(lambda s: 1 if 35 <= s <=60 else 0)
agedf2['Age_old'] = agedf2['Age'].map(lambda s: 1 if 60 <= s else 0)
test_data = pd.concat([test_data, agedf2], axis=1)
test_data.drop('Age',axis=1, inplace=True)
#处理性别
test_data['Sex'] = test_data['Sex'].map(sex_mapdict)
sexdf2 = pd.DataFrame()
sexdf2 = pd.get_dummies( test_data['Sex'], prefix='Sex')
test_data = pd.concat([test_data, sexdf2], axis=1)
test_data.drop('Sex',axis=1, inplace=True)
#处理上船港口
embarkeddf2 = pd.DataFrame()
embarkeddf2 = pd.get_dummies(test_data['Embarked'], prefix='Embarked')
test_data = pd.concat([test_data, embarkeddf2], axis=1)
test_data.drop('Embarked',axis=1, inplace=True)
#处理用户阶级
pclassdf2 = pd.DataFrame()
pclassdf2 = pd.get_dummies( test_data['Pclass'] , prefix='Pclass' )
test_data = pd.concat([test_data, pclassdf2], axis=1)
test_data.drop('Pclass',axis=1, inplace=True)
#存放提取后的特征
titledf2 = pd.DataFrame()
titledf2['Title'] = test_data['Name'].map(gettitle)
titledf2['Title'] = titledf2['Title'].map(title_mapdict)
titledf2 = pd.get_dummies(titledf2['Title'])
test_data = pd.concat([test_data, titledf2], axis=1)
test_data.drop('Name',axis=1, inplace=True)
#存放家庭信息
familydf2 = pd.DataFrame()
familydf2['FamilySize'] = test_data['Parch'] + test_data['SibSp'] + 1
familydf2['Family_Single'] = familydf2['FamilySize'].map(lambda s: 1 if s == 1 else 0)
familydf2['Family_Small'] = familydf2['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
familydf2['Family_Large'] = familydf2['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
familydf2.drop('FamilySize',axis=1, inplace=True)
test_data = pd.concat([test_data, familydf2], axis=1)
test_data.drop('Parch',axis=1, inplace=True)
test_data.drop('SibSp',axis=1, inplace=True)
#处理船票
faredf2=pd.DataFrame(test_data['Fare'])
faredf2['Fare_First']=faredf2['Fare'].map(lambda f: 1 if f >= 30 and f <= 870 else 0)
faredf2['Fare_Second']=faredf2['Fare'].map(lambda f: 1 if f >= 12 and f < 30 else 0)
faredf2['Fare_Third']=faredf2['Fare'].map(lambda f: 1 if f >= 3 and f < 12 else 0)
test_data = pd.concat([test_data, faredf2], axis=1)
test_data.drop('Fare',axis=1, inplace=True)
test_data.drop('Ticket',axis=1, inplace=True)
test_data.head()
corrdf = train_data.corr()
corrdf
corrdf['Survived'].round(4).abs().sort_values(ascending=False)
source_y = train_data.Survived
source_x = train_data.drop(['Survived'],axis=1)
train_x, test_x, train_y, test_y = train_test_split(source_x,
source_y , train_size=0.8,test_size=0.2)
pred_x=test_data
model_lr = LogisticRegression()
model_lr.fit(train_x, train_y)
pred_lr = model_lr.predict(pred_x)
pred_lr = pred_lr.astype(int)
accuracy_lr = model_lr.score(test_x,test_y)
print("逻辑回归的测试结果:", accuracy_lr)
model_rfc = RandomForestClassifier()
model_rfc.fit(train_x, train_y)
pred_rfc = model_rfc.predict(pred_x)
pred_rfc = pred_rfc.astype(int)
accuracy_rfc = model_rfc.score(test_x,test_y)
print("随机森林的预试结果:", accuracy_rfc)
model_svm = SVC()
model_svm.fit(train_x, train_y)
pred_svm = model_svm.predict(pred_x)
pred_svm = pred_svm.astype(int)
accuracy_svm = model_svm.score(test_x,test_y)
print("支持向量机的测试结果:", accuracy_svm)
model_knn = KNeighborsClassifier()
model_knn.fit(train_x, train_y)
pred_knn = model_knn.predict(pred_x)
pred_knn = pred_knn.astype(int)
accuracy_knn = model_knn.score(test_x,test_y)
print("K最近邻分类器的测试结果:", accuracy_knn)
model_dtree = DecisionTreeClassifier()
model_dtree.fit(train_x, train_y)
pred_dtree= model_dtree.predict(pred_x)
pred_dtree = pred_dtree.astype(int)
accuracy_dtree= model_dtree.score(test_x,test_y)
print("决策树模型的测试结果:", accuracy_dtree)
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize':(15,6)}) # 设置画布大小
accuracys = [accuracy_lr, accuracy_rfc, accuracy_svm, accuracy_knn, accuracy_dtree]
models = ['Logistic', 'RF', 'SVM', 'KNN', 'Dtree']
bar = sns.barplot(x=models, y=accuracys)
# 显示数值标签
for x, y in enumerate(accuracys):
plt.text(x, y, '%s'% round(y,3), ha='center')
plt.xlabel("Model")
plt.ylabel("Accuracy")
plt.savefig(r"D:\data\python\exercise\test2\3.png")
plt.show()
#数据框:乘客id,预测生存情况的值
passenger_id = test_data.iloc[:, 0]
#逻辑回归的预测结果
preddf1 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_lr})
preddf1.to_csv(r'D:\data\python\taitanic\titanic_pred_model_lr.csv', index=False)
#随机森林的预测结果
preddf2 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_rfc})
preddf2.to_csv(r'D:\data\python\taitanic\titanic_pred_model_rfc.csv', index=False)
#支持向量机的预测结果
preddf3 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_svm})
preddf3.to_csv(r'D:\data\python\taitanic\titanic_pred_model_svm.csv', index=False)
#K最近邻分类器的预测结果
preddf4 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_knn})
preddf4.to_csv(r'D:\data\python\taitanic\titanic_pred_model_knn.csv', index=False)
#决策树模型的预测结果
preddf5 = pd.DataFrame({'PassengerId': passenger_id,'Survived': pred_dtree})
preddf5.to_csv(r'D:\data\python\taitanic\titanic_pred_model_dtree.csv', index=False)
以上就是泰坦尼克号的生存预测分析。
如果你觉得这篇文章对你有用,建议点赞收藏。
欢迎各位读者指正错误,请在评论区留言。或者发表自己的看法,小编不胜感激。