Read in the training and test data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
test.head()
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
g_sub = pd.read_csv('gender_submission.csv')
g_sub.info()
Feature meanings:
PassengerId: passenger ID (carries no predictive information)
Survived: whether the passenger survived (1 = survived, 0 = did not survive); this is the target feature
Pclass: ticket class (1st, 2nd, or 3rd class)
SibSp: number of siblings/spouses aboard
Parch: number of parents/children aboard
Embarked: port of embarkation
Take a look at the basic information of the data.
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Survived 418 non-null int64
dtypes: int64(2)
memory usage: 6.7 KB
g_sub.head()
| | PassengerId | Survived |
|---|---|---|
0 | 892 | 0 |
1 | 893 | 1 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 1 |
union = pd.concat([train, test], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
union.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Survived 891 non-null float64
2 Pclass 1309 non-null int64
3 Name 1309 non-null object
4 Sex 1309 non-null object
5 Age 1046 non-null float64
6 SibSp 1309 non-null int64
7 Parch 1309 non-null int64
8 Ticket 1309 non-null object
9 Fare 1308 non-null float64
10 Cabin 295 non-null object
11 Embarked 1307 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
We can see that:
Total number of passengers (records): 1309
Total number of features: 11
Columns with missing values: Age, Fare, Cabin, and Embarked
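As a quick check, the per-column missing counts can be listed directly; a small optional sketch:
union.isnull().sum()  # Age, Fare, Cabin and Embarked have non-zero counts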
union.describe()
Take a look at the basic statistics of the data.
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
count | 1309.000000 | 891.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 |
mean | 655.000000 | 0.383838 | 2.294882 | 29.881138 | 0.498854 | 0.385027 | 33.295479 |
std | 378.020061 | 0.486592 | 0.837836 | 14.413493 | 1.041658 | 0.865560 | 51.758668 |
min | 1.000000 | 0.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 |
25% | 328.000000 | 0.000000 | 2.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
50% | 655.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 982.000000 | 1.000000 | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 |
max | 1309.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 |
From the statistics we can see:
The average age is about 29.9, so most passengers were young adults
The survival rate is 38.4% (computed over the 891 labelled training rows, since the test labels are NaN)
There are far more 2nd- and 3rd-class passengers than 1st-class
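These observations can be verified directly on the combined frame; a minimal sketch (mean() skips the NaN test labels, so the survival rate is computed over the 891 training rows only):
union['Pclass'].value_counts()  # far more 2nd/3rd-class passengers than 1st-class
union['Survived'].mean()        # about 0.384
union['Age'].mean()             # about 29.9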
# Fill missing ages with the median and the one missing fare with the mean;
# assigning back avoids relying on inplace=True on a selected column.
union['Age'] = union['Age'].fillna(union['Age'].median())
union['Fare'] = union['Fare'].fillna(union['Fare'].mean())
# Drop Name and Cabin: Cabin is mostly missing, and Name is not used here.
union = union.drop(['Name', 'Cabin'], axis=1)
union['Embarked'].value_counts()
For the port of embarkation, we fill the missing values with the most frequent value.
S 914
C 270
Q 123
Name: Embarked, dtype: int64
We fill the missing embarkation ports with the most common value, 'S'.
union['Embarked'] = union['Embarked'].fillna('S')
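Hard-coding 'S' works here, but the most frequent port can also be looked up programmatically; an equivalent optional sketch (most_common_port is an illustrative name, not from the original code):
most_common_port = union['Embarked'].mode()[0]  # 'S' for this dataset
union['Embarked'] = union['Embarked'].fillna(most_common_port)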
union.info()
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Survived 891 non-null float64
2 Pclass 1309 non-null int64
3 Sex 1309 non-null object
4 Age 1309 non-null float64
5 SibSp 1309 non-null int64
6 Parch 1309 non-null int64
7 Ticket 1309 non-null object
8 Fare 1309 non-null float64
9 Embarked 1309 non-null object
dtypes: float64(3), int64(4), object(3)
memory usage: 102.4+ KB
union.head()
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S |
1 | 2 | 1.0 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
2 | 3 | 1.0 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
3 | 4 | 1.0 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S |
4 | 5 | 0.0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S |
After this processing there are no missing values left; the basic data preprocessing is done.
train = union.loc[0:890]
test = union.loc[891:]
Split the combined dataset back into its training and test parts (.loc slicing is label-based and inclusive, so 0:890 selects the 891 training rows).
x_train = train.drop(['Survived','Ticket'], axis=1)
x_test = test.drop(['Survived','Ticket'], axis=1)
y_train = train['Survived']
y_test = test['Survived']
Split into training-set features, test-set features, training-set targets, and test-set targets (the test-set targets are all NaN, since the Kaggle test labels are not provided).
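Because the test set carries no labels, y_test cannot be used to score the model locally. The train_test_split imported at the top is one way to carve a validation set out of the training data; a minimal sketch (the names x_tr, x_val, y_tr, y_val are illustrative):
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)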
vec = DictVectorizer(sparse=False)  # renamed from `dict` so the built-in is not shadowed
X_train = vec.fit_transform(x_train.to_dict(orient='records'))
X_test = vec.transform(x_test.to_dict(orient='records'))
vec.get_feature_names()  # get_feature_names_out() in newer scikit-learn
fit_transform(X) is called on the training features and transform(X) on the test features, so both share the same learned feature vocabulary.
['Age',
'Embarked=C',
'Embarked=Q',
'Embarked=S',
'Fare',
'Parch',
'PassengerId',
'Pclass',
'Sex=female',
'Sex=male',
'SibSp']
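To make the one-hot expansion concrete, here is a tiny standalone sketch with toy records (not from the Titanic data): numeric fields pass through unchanged, while each string value becomes its own indicator column.
from sklearn.feature_extraction import DictVectorizer
toy = [{'Sex': 'male', 'Age': 22.0}, {'Sex': 'female', 'Age': 38.0}]
toy_vec = DictVectorizer(sparse=False)
toy_vec.fit_transform(toy)   # [[22., 0., 1.], [38., 1., 0.]]
toy_vec.get_feature_names()  # ['Age', 'Sex=female', 'Sex=male']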
X_train
array([[22., 0., 0., ..., 0., 1., 1.],
[38., 1., 0., ..., 1., 0., 1.],
[26., 0., 0., ..., 1., 0., 0.],
...,
[28., 0., 0., ..., 1., 0., 1.],
[26., 1., 0., ..., 0., 1., 0.],
[32., 0., 1., ..., 0., 1., 0.]])
from sklearn.tree import DecisionTreeClassifier
Import DecisionTreeClassifier from sklearn.tree.
dec = DecisionTreeClassifier(max_depth=5)
dec.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5)
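Because the real test labels are unavailable, cross-validation on the training matrix gives a rough accuracy estimate before submitting; an optional sketch:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(DecisionTreeClassifier(max_depth=5), X_train, y_train, cv=5)
scores.mean()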
y_predict = dec.predict(X_test).astype(np.int64)
y_predict
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0,
1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1,
1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0],
dtype=int64)
ids = test['PassengerId']  # renamed from `id` so the built-in is not shadowed
sub = {'PassengerId': ids, 'Survived': y_predict}
submission = pd.DataFrame(sub)
submission.to_csv("sub.csv", index=False)
submission
| | PassengerId | Survived |
|---|---|---|
891 | 892 | 0 |
892 | 893 | 0 |
893 | 894 | 0 |
894 | 895 | 0 |
895 | 896 | 1 |
... | ... | ... |
1304 | 1305 | 0 |
1305 | 1306 | 1 |
1306 | 1307 | 0 |
1307 | 1308 | 0 |
1308 | 1309 | 0 |
418 rows × 2 columns
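As a final sanity check (an optional sketch), the saved file should have the same shape and columns as gender_submission.csv:
check = pd.read_csv('sub.csv')
check.shape          # expected: (418, 2)
list(check.columns)  # expected: ['PassengerId', 'Survived']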