Analysis goal:
Analyze the training set to determine what kinds of passengers were more likely to survive, and predict the survival probability of the passengers in the dataset.
# Field descriptions
# PassengerId: passenger ID
# Survived: whether the passenger survived (the analysis target); 1 = survived, 0 = died
# Pclass: ticket class, 1/2/3, with 1 being the highest
# Name: passenger name
# Sex: sex
# Age: age
# SibSp: number of siblings and spouses traveling with the passenger (same-generation relatives)
# Parch: number of parents and children traveling with the passenger (relatives of a different generation)
# Ticket: ticket number
# Fare: ticket fare
# Cabin: cabin number
# Embarked: port of embarkation
#   S = Southampton, England (starting point)
#   C = Cherbourg, France (intermediate stop)
#   Q = Queenstown, Ireland (intermediate stop)
1. Load the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the data
titanic_train=pd.read_csv("./titanic_train.csv")
titanic_test=pd.read_csv("./titanic_test.csv")
2. Data exploration
titanic_train.head()  # first 5 rows
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic_test.head()
|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
titanic_train.info()
titanic_test.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
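Both info() calls show gaps that matter for the later feature-engineering step: Age is missing in both sets, Cabin is mostly empty, Embarked has two missing values in the training set and Fare one in the test set. A minimal sketch for tallying these gaps explicitly (standard pandas calls only, using the DataFrames already loaded):
# Count the missing values per column in each dataset
for name, df in [('train', titanic_train), ('test', titanic_test)]:
    missing = df.isnull().sum()
    print(name, 'columns with missing values:')
    print(missing[missing > 0].sort_values(ascending=False))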
titanic_train.describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PassengerId | 891.0 | 446.000000 | 257.353842 | 1.00 | 223.5000 | 446.0000 | 668.5 | 891.0000 |
| Survived | 891.0 | 0.383838 | 0.486592 | 0.00 | 0.0000 | 0.0000 | 1.0 | 1.0000 |
| Pclass | 891.0 | 2.308642 | 0.836071 | 1.00 | 2.0000 | 3.0000 | 3.0 | 3.0000 |
| Age | 714.0 | 29.699118 | 14.526497 | 0.42 | 20.1250 | 28.0000 | 38.0 | 80.0000 |
| SibSp | 891.0 | 0.523008 | 1.102743 | 0.00 | 0.0000 | 0.0000 | 1.0 | 8.0000 |
| Parch | 891.0 | 0.381594 | 0.806057 | 0.00 | 0.0000 | 0.0000 | 0.0 | 6.0000 |
| Fare | 891.0 | 32.204208 | 49.693429 | 0.00 | 7.9104 | 14.4542 | 31.0 | 512.3292 |
titanic_train.describe(include=['O']).T
|   | count | unique | top | freq |
|---|---|---|---|---|
| Name | 891 | 891 | Ilett, Miss. Bertha | 1 |
| Sex | 891 | 2 | male | 577 |
| Ticket | 891 | 681 | 347082 | 7 |
| Cabin | 204 | 147 | B96 B98 | 4 |
| Embarked | 889 | 3 | S | 644 |
t_t_pclass=titanic_train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False)
t_t_pclass.mean().sort_values(by='Survived', ascending=False)
|   | Pclass | Survived |
|---|---|---|
| 0 | 1 | 0.629630 |
| 1 | 2 | 0.472826 |
| 2 | 3 | 0.242363 |
titanic_train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
|   | Sex | Survived |
|---|---|---|
| 0 | female | 0.742038 |
| 1 | male | 0.188908 |
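Sex and class can also be examined together with the same groupby pattern; a short sketch:
# Survival rate for each (Sex, Pclass) combination, highest first
titanic_train[['Sex', 'Pclass', 'Survived']].groupby(['Sex', 'Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)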
import seaborn as sns
import matplotlib.pyplot as plt
age_map = sns.FacetGrid(titanic_train, col='Survived')
age_map.map(plt.hist, 'Age', bins=20)
grid = sns.FacetGrid(titanic_train, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()
3. Feature engineering
combine = [titanic_train, titanic_test]
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)
titanic_train.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
for dataset in combine:
    dataset['Age'].fillna(dataset['Age'].median(), inplace=True)
    dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace=True)
    dataset['Fare'].fillna(dataset['Fare'].median(), inplace=True)
titanic_train.drop(['PassengerId','Cabin', 'Ticket'], axis=1, inplace = True)
titanic_train.info()
titanic_test.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Name 891 non-null object
3 Sex 891 non-null int32
4 Age 891 non-null float64
5 SibSp 891 non-null int64
6 Parch 891 non-null int64
7 Fare 891 non-null float64
8 Embarked 891 non-null object
dtypes: float64(2), int32(1), int64(4), object(2)
memory usage: 59.3+ KB
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null int32
4 Age 418 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 418 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int32(1), int64(4), object(4)
memory usage: 34.4+ KB
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    dataset['IsAlone'] = 1
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0  # .loc on the frame avoids the chained-assignment SettingWithCopyWarning
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)
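pd.qcut splits Fare at its quartiles, so each fare band holds roughly the same number of passengers, whereas pd.cut splits Age into five equal-width intervals. A tiny illustration of the difference on made-up numbers (not the Titanic columns):
values = pd.Series([1, 2, 3, 4, 5, 100])
print(pd.qcut(values, 2))  # equal-frequency bins: three values fall in each bin
print(pd.cut(values, 2))   # equal-width bins: the outlier 100 sits alone in the upper bin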
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
label = LabelEncoder()
for dataset in combine:
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])
titanic_train.head()
|   | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | FamilySize | IsAlone | FareBin | AgeBin | Sex_Code | Embarked_Code | AgeBin_Code | FareBin_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | 7.2500 | S | 2 | 0 | (-0.001, 7.91] | (16.0, 32.0] | 0 | 2 | 1 | 0 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | 71.2833 | C | 2 | 0 | (31.0, 512.329] | (32.0, 48.0] | 1 | 0 | 2 | 3 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | 7.9250 | S | 1 | 1 | (7.91, 14.454] | (16.0, 32.0] | 1 | 2 | 1 | 1 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 53.1000 | S | 2 | 0 | (31.0, 512.329] | (32.0, 48.0] | 1 | 2 | 2 | 3 |
| 4 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 8.0500 | S | 1 | 1 | (7.91, 14.454] | (32.0, 48.0] | 0 | 2 | 2 | 1 |
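To confirm how LabelEncoder turned the interval bins into ordinal codes, the bins can be cross-tabulated against their codes; a quick check:
# Each FareBin / AgeBin interval should map to exactly one code, in ascending order
print(pd.crosstab(titanic_train['FareBin'], titanic_train['FareBin_Code']))
print(pd.crosstab(titanic_train['AgeBin'], titanic_train['AgeBin_Code']))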
4. Model development
Target = ['Survived']
titanic_train_x_bin = ['Sex','Pclass', 'Embarked_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
titanic_train_xy_bin = Target + titanic_train_x_bin
print('Bin X Y: ', titanic_train_xy_bin, '\n')
#Model Algorithms
from sklearn import linear_model
#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(
    titanic_train[titanic_train_x_bin], titanic_train[Target], test_size=0.3, random_state=0)
print("Data1 Shape: {}".format(titanic_train.shape))
print("Train1 Shape: {}".format(train1_x_bin.shape))
print("Test1 Shape: {}".format(test1_x_bin.shape))
train1_x_bin.head()
Bin X Y: ['Survived', 'Sex', 'Pclass', 'Embarked_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
Data1 Shape: (891, 17)
Train1 Shape: (623, 6)
Test1 Shape: (268, 6)
|     | Sex | Pclass | Embarked_Code | FamilySize | AgeBin_Code | FareBin_Code |
|---|---|---|---|---|---|---|
| 857 | 0 | 1 | 2 | 1 | 3 | 2 |
| 52  | 1 | 1 | 0 | 2 | 3 | 3 |
| 386 | 0 | 3 | 2 | 8 | 0 | 3 |
| 124 | 0 | 1 | 2 | 2 | 3 | 3 |
| 578 | 1 | 3 | 0 | 2 | 1 | 2 |
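The 70/30 split above is purely random in how it assigns rows; since only about 38% of passengers survived, passing stratify= keeps that class ratio identical in both parts. A possible variant of the call above (same train_test_split function, one extra argument):
train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(
    titanic_train[titanic_train_x_bin], titanic_train[Target],
    test_size=0.3, random_state=0, stratify=titanic_train[Target])
Re-running with stratification would of course shift the exact numbers reported below slightly.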
from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(train1_x_bin, train1_y_bin.values.ravel())  # ravel() gives fit() the 1-D y it expects
Y_pred = logreg.predict(test1_x_bin)
print(logreg.intercept_)  # intercept
coef = pd.DataFrame(logreg.coef_).T  # coefficients
columns = pd.DataFrame(train1_x_bin.columns, columns=['A'])
result = pd.concat([columns,coef], axis=1)
result = result.rename(columns={'A': 'Attribute', 0: 'Coefficients'})
result
[1.82313942]
|   | Attribute | Coefficients |
|---|---|---|
| 0 | Sex | 2.605530 |
| 1 | Pclass | -0.859693 |
| 2 | Embarked_Code | -0.254773 |
| 3 | FamilySize | -0.290905 |
| 4 | AgeBin_Code | -0.568369 |
| 5 | FareBin_Code | 0.219886 |
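Because logistic regression is linear in the log-odds, exponentiating a coefficient turns it into an odds ratio: exp(2.61) ≈ 13.5, meaning that with the other features held fixed, being female multiplies the estimated odds of survival by roughly 13 to 14. A one-line sketch that adds this column to the table above (the OddsRatio name is just for illustration):
result['OddsRatio'] = np.exp(result['Coefficients'])  # odds ratio per one-unit increase in each feature
result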
5. Model evaluation
acc_train = round(logreg.score(train1_x_bin, train1_y_bin) * 100, 2)
acc_train
78.81
# View summary of common classification metrics
print("------------------------Metrices---------------------")
print(metrics.classification_report(y_true = test1_y_bin,y_pred = Y_pred))
------------------------Metrices---------------------
              precision    recall  f1-score   support

           0       0.85      0.84      0.84       168
           1       0.74      0.75      0.74       100

    accuracy                           0.81       268
   macro avg       0.79      0.79      0.79       268
weighted avg       0.81      0.81      0.81       268
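The classification report summarizes rates; a confusion matrix shows the raw counts behind them. A minimal sketch using the metrics module imported above:
# Rows are actual classes (0 = died, 1 = survived), columns are predicted classes
print(metrics.confusion_matrix(test1_y_bin, Y_pred))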
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
probs = logreg.predict_proba(test1_x_bin)
preds = probs[:,1]
#---find the FPR, TPR, and threshold---
fpr, tpr, threshold = roc_curve(test1_y_bin, preds)
#---find the area under the curve---
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate (TPR)')
plt.xlabel('False Positive Rate (FPR)')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc = 'lower right')
plt.show()
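The figures above all come from a single 70/30 split. As a sanity check, k-fold cross-validation averages the score over several different splits and is less sensitive to where the split happens to fall; a short sketch using the model_selection module imported earlier:
# 5-fold cross-validated accuracy on the binned feature set
cv_scores = model_selection.cross_val_score(linear_model.LogisticRegression(),
                                            titanic_train[titanic_train_x_bin],
                                            titanic_train[Target].values.ravel(),
                                            cv=5, scoring='accuracy')
print(cv_scores.mean(), cv_scores.std())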
The analysis supports the following conclusions:
Sex: women had a much higher survival rate, 74.2%, versus 18.9% for men.
Age: infants under 4 had a relatively high survival rate, while most passengers aged 15-25 did not survive.
Class: most first-class passengers survived, whereas most third-class passengers did not.
The logistic regression model reaches 78.81% accuracy on the training split and about 81% on the test split, with an AUC of 0.86, so it predicts passenger survival reasonably well.
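Finally, since the stated goal includes predicting survival probabilities for the passengers in titanic_test, the fitted model can be applied to the same binned features built there in the feature-engineering step. A minimal sketch (the Survived_Prob column name is just for illustration):
# Predicted probability of survival (class 1) for each test passenger
test_probs = logreg.predict_proba(titanic_test[titanic_train_x_bin])[:, 1]
submission = pd.DataFrame({'PassengerId': titanic_test['PassengerId'],
                           'Survived_Prob': test_probs})
submission.head()
One caveat: the quantile bins and LabelEncoder were fitted separately on the train and test sets above, so the resulting codes are only strictly comparable if the bin edges happen to coincide; fitting them on the training set and reusing them for the test set would be the safer pattern.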