Titanic Code Analysis

Load the training and test data

import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
test.head()
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S
g_sub = pd.read_csv('gender_submission.csv')
g_sub.info()

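Feature meanings:

PassengerId: passenger ID (an identifier with no predictive meaning)
Survived: whether the passenger survived (1 = survived, 0 = did not survive); this is the target variable
Pclass: ticket class (1st, 2nd, or 3rd)
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)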

A look at the basic information of the sample submission file

RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Survived 418 non-null int64
dtypes: int64(2)
memory usage: 6.7 KB

g_sub.head()
PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1
# pd.concat stacks train and test; Survived becomes NaN for the test rows
union = pd.concat([train, test], ignore_index=True)
union.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

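We find:

Total passengers (records): 1309
Total features: 11 (12 columns including the target Survived)
Missing values: Age, Fare (one record), Cabin, and Embarked; Survived is empty only for the 418 test rows, which have no labels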
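The missing columns above were read off the info() output by eye. A minimal sketch of counting them directly is given here, using a small hypothetical frame (the values are illustrative stand-ins, not rows from train.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined frame; values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.2833, np.nan, 53.1],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", None],
    "Pclass": [3, 1, 3, 1],
})

# Count missing entries per column and keep only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to the combined frame, the same two lines would list every column with gaps, which makes it harder to overlook a single missing value such as the one in Fare.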

union.describe()

A look at the basic statistics of the data

PassengerId Survived Pclass Age SibSp Parch Fare
count 1309.000000 891.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000
mean 655.000000 0.383838 2.294882 29.881138 0.498854 0.385027 33.295479
std 378.020061 0.486592 0.837836 14.413493 1.041658 0.865560 51.758668
min 1.000000 0.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 328.000000 0.000000 2.000000 21.000000 0.000000 0.000000 7.895800
50% 655.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 982.000000 1.000000 3.000000 39.000000 1.000000 0.000000 31.275000
max 1309.000000 1.000000 3.000000 80.000000 8.000000 9.000000 512.329200

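We can see:

The mean age is 29.9 and the median is 28, which suggests that most passengers were young adults
The survival rate is 38.4% (computed over the 891 labeled training rows)
The median Pclass is 3 and the 25th percentile is 2, so at least half of the passengers travelled in 3rd class and roughly a quarter at most in 1st class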
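The survival rate and the class mix can also be computed explicitly rather than inferred from describe(). The sketch below uses a hypothetical toy frame; the column names match the dataset, the values do not:

```python
import numpy as np
import pandas as pd

# Toy frame: NaN in Survived marks unlabeled (test) rows.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, np.nan, np.nan],
    "Pclass":   [3, 1, 3, 1, 3, 3, 2],
})

# mean() skips NaN, so the rate is taken over labeled rows only.
survival_rate = toy["Survived"].mean()

# Share of passengers in each class, ordered by class number.
class_share = toy["Pclass"].value_counts(normalize=True).sort_index()

print(f"survival rate: {survival_rate:.1%}")
print(class_share)
```

Because mean() ignores NaN, the rate is computed over the labeled rows only, which is why the 38.4% figure describes the training set rather than all 1309 passengers.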

union['Age'] = union['Age'].fillna(union['Age'].median())
union['Fare'] = union['Fare'].fillna(union['Fare'].mean())
union = union.drop(['Name', 'Cabin'], axis=1)
union['Embarked'].value_counts()

For the port of embarkation, we fill missing values with the most frequent value

S    914
C    270
Q    123
Name: Embarked, dtype: int64

We fill the missing ports of embarkation with "S", the most common value
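Instead of reading the most frequent port off value_counts() and hard-coding it, the value could be looked up with mode(). A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Toy column with two missing ports; values are illustrative only.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q", "S", None]})

# mode() ignores missing values and returns the most frequent entries;
# take the first one in case of ties.
most_common = toy["Embarked"].mode().iloc[0]

# Assign the result back instead of relying on inplace=True.
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(most_common, toy["Embarked"].isna().sum())
```

This keeps the fill value in sync with the data if the input file changes.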

union['Embarked'] = union['Embarked'].fillna('S')
union.info()

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Sex          1309 non-null   object 
 4   Age          1309 non-null   float64
 5   SibSp        1309 non-null   int64  
 6   Parch        1309 non-null   int64  
 7   Ticket       1309 non-null   object 
 8   Fare         1309 non-null   float64
 9   Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(3)
memory usage: 102.4+ KB
union.head()
   PassengerId  Survived  Pclass  Sex     Age   SibSp  Parch  Ticket            Fare     Embarked
0  1            0.0       3       male    22.0  1      0      A/5 21171         7.2500   S
1  2            1.0       1       female  38.0  1      0      PC 17599          71.2833  C
2  3            1.0       3       female  26.0  0      0      STON/O2. 3101282  7.9250   S
3  4            1.0       1       female  35.0  1      0      113803            53.1000  S
4  5            0.0       3       male    35.0  0      0      373450            8.0500   S

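After this processing, no feature has missing values (Survived is still empty for the test rows, as expected), so the simple preprocessing is done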

train = union.loc[0:890]
test = union.loc[891:]

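Split the combined dataset back into training and test sets (.loc slicing includes both endpoints, so rows 0 to 890 are the 891 training records)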
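The split above relies on the training rows occupying positions 0 to 890. An alternative that does not depend on row positions is to split on whether the label is present. This is a sketch on a hypothetical toy frame, not the approach used in this post:

```python
import numpy as np
import pandas as pd

# Toy combined frame: labeled rows first, unlabeled rows after.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 892, 893],
    "Survived": [0.0, 1.0, 1.0, np.nan, np.nan],
})

# Rows with a label belong to the training set, the rest to the test set.
is_labeled = toy["Survived"].notna()
train_part = toy[is_labeled].copy()
test_part = toy[~is_labeled].copy()

print(len(train_part), len(test_part))
```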

x_train = train.drop(['Survived','Ticket'], axis=1)
x_test = test.drop(['Survived','Ticket'], axis=1)
y_train = train['Survived']
y_test = test['Survived']

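Split the data into training features, test features, training target, and test target. Note that y_test contains only NaN, because the test set has no Survived labels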

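# DictVectorizer one-hot encodes string columns (Sex, Embarked) and passes numeric columns through
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(x_train.to_dict(orient='records'))
X_test = vec.transform(x_test.to_dict(orient='records'))
vec.get_feature_names_out().tolist()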

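Call fit_transform(X) on the training set and transform(X) on the test set, so that both are encoded with the feature mapping learned from the training data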

['Age',
 'Embarked=C',
 'Embarked=Q',
 'Embarked=S',
 'Fare',
 'Parch',
 'PassengerId',
 'Pclass',
 'Sex=female',
 'Sex=male',
 'SibSp']
X_train
array([[22.,  0.,  0., ...,  0.,  1.,  1.],
       [38.,  1.,  0., ...,  1.,  0.,  1.],
       [26.,  0.,  0., ...,  1.,  0.,  0.],
       ...,
       [28.,  0.,  0., ...,  1.,  0.,  1.],
       [26.,  1.,  0., ...,  0.,  1.,  0.],
       [32.,  0.,  1., ...,  0.,  1.,  0.]])
from sklearn.tree import DecisionTreeClassifier

Import DecisionTreeClassifier from sklearn.tree

dec = DecisionTreeClassifier(max_depth=5)
dec.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5)
y_predict = dec.predict(X_test).astype(np.int64)
y_predict
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0],
      dtype=int64)
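Since the test set has no labels, these predictions cannot be scored locally. The train_test_split function imported at the top is not used in this post; one possible use is to hold out part of the training data for validation. The sketch below runs on synthetic data (the arrays X and y are made up, not Titanic features), with max_depth=5 as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded training features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# Hold out 25% of the labeled rows, keeping the class ratio similar.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X_fit, y_fit)

# score() returns mean accuracy on the held-out rows.
val_accuracy = clf.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

With the real data, X and y would be X_train and y_train from above; the held-out accuracy then gives a rough estimate before submitting.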
passenger_id = test['PassengerId']
sub = {'PassengerId': passenger_id, 'Survived': y_predict}
submission = pd.DataFrame(sub)
submission.to_csv("sub.csv", index=False)
submission
PassengerId Survived
891 892 0
892 893 0
893 894 0
894 895 0
895 896 1
... ... ...
1304 1305 0
1305 1306 1
1306 1307 0
1307 1308 0
1308 1309 0

418 rows × 2 columns

