Kaggle竞赛:泰坦尼克号灾难数据分析简单案例

Kaggle竞赛:泰坦尼克号灾难数据分

https://www.kaggle.com/c/titanic

  • 目标确定:根据已有数据预测未知旅客生死
  • 数据准备
    • 数据获取,载入训练集csv、测试集csv
    • 数据清洗,补齐或抛弃缺失值,数据类型变换(字符串转数字)
    • 数据重构,根据需要重新构造数据(重组数据,构建新特征)
  • 数据分析
    • 描述性分析,画图,直观分析
    • 探索性分析,机器学习模型
  • 成果输出:csv文件上传得到正确率和排名

载入库

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

数据获取

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head() # 显示头几行数据
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
test.head()# 显示头几行数据
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

数据概览

train.shape, test.shape # 查看数据的行数,列数
((891, 12), (418, 11))
train.info() # 查看具体信息字段

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
test.info()

RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

train.csv 具体数据格式

  • PassengerId 乘客ID
  • Survived 是否幸存。0遇难,1幸存
  • Pclass 船舱等级,1Upper,2Middle,3Lower
  • Name 姓名,object——————————
  • Sex 性别,object—————————
  • Age 年龄 缺失177——m————————
  • SibSp 兄弟姐妹及配偶个数
  • Parch 父母或子女个数
  • Ticket 乘客的船票号,object————————
  • Fare 乘客的船票价
  • Cabin 乘客所在舱位,object,缺失687———————
  • Embarked 乘客登船口岸,object,缺失3————————
train.head() # head()方法查看头部几行信息,如果打train则返回所有数据列表
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S



数据清洗

缺失过多或无关值抛弃

# .loc 通过自定义索引获取数据 , 其中 .loc[:,:]中括号里面逗号前面的表示行,逗号后面的表示列
train2 = train.loc[:,['PassengerId','Survived','Pclass','Sex','Age','SibSp','Parch','Fare']]
test2 = test.loc[:, ['PassengerId','Pclass','Sex','Age','SibSp','Parch','Fare']]
train2.head()
PassengerId Survived Pclass Sex Age SibSp Parch Fare
0 1 0 3 male 22.0 1 0 7.2500
1 2 1 1 female 38.0 1 0 71.2833
2 3 1 3 female 26.0 0 0 7.9250
3 4 1 1 female 35.0 1 0 53.1000
4 5 0 3 male 35.0 0 0 8.0500
test2.head()
PassengerId Pclass Sex Age SibSp Parch Fare
0 892 3 male 34.5 0 0 7.8292
1 893 3 female 47.0 1 0 7.0000
2 894 2 male 62.0 0 0 9.6875
3 895 3 male 27.0 0 0 8.6625
4 896 3 female 22.0 1 1 12.2875
train2.info(), test2.info()
test2.info()

RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           417 non-null float64
dtypes: float64(2), int64(4), object(1)
memory usage: 22.9+ KB


填充年龄空值

age = train2['Age'].median() # 年龄中位数
age
28.0
train2['Age'].isnull() # 空值转bool值
0      False
1      False
2      False
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17      True
18     False
19      True
20     False
21     False
22     False
23     False
24     False
25     False
26      True
27     False
28      True
29      True
       ...  
861    False
862    False
863     True
864    False
865    False
866    False
867    False
868     True
869    False
870    False
871    False
872    False
873    False
874    False
875    False
876    False
877    False
878     True
879    False
880    False
881    False
882    False
883    False
884    False
885    False
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
train2.loc[train2['Age'].isnull(), 'Age'] = age # 为train2年龄为空值的填充年龄中位数
train2.info()
test2.loc[test2['Age'].isnull(), 'Age'] = age # 为test2中年龄为空值的数据填充年龄中位数
test2.info()

RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           417 non-null float64
dtypes: float64(2), int64(4), object(1)
memory usage: 22.9+ KB


填充船票价格空值

#取众数填充船票价格 Fare

Fare = test2['Fare'].mode()
Fare

test2.loc[test['Fare'].isnull(),'Fare'] = Fare[0]

train2.info(),test2.info()
train2.head()
PassengerId Survived Pclass Sex Age SibSp Parch Fare
0 1 0 3 male 22.0 1 0 7.2500
1 2 1 1 female 38.0 1 0 71.2833
2 3 1 3 female 26.0 0 0 7.9250
3 4 1 1 female 35.0 1 0 53.1000
4 5 0 3 male 35.0 0 0 8.0500


数据类型转换

train2.dtypes,test2.dtypes # 列数据类型
(PassengerId      int64
 Survived         int64
 Pclass           int64
 Sex             object
 Age            float64
 SibSp            int64
 Parch            int64
 Fare           float64
 dtype: object, PassengerId      int64
 Pclass           int64
 Sex             object
 Age            float64
 SibSp            int64
 Parch            int64
 Fare           float64
 dtype: object)

性别转换成整型数据

train2['Sex'] = train2['Sex'].map({'female':0, 'male':1}).astype(int)
test2['Sex'] = test2['Sex'].map({'female': 0, 'male': 1}).astype(int)
train2.head()
PassengerId Survived Pclass Sex Age SibSp Parch Fare
0 1 0 3 1 22.0 1 0 7.2500
1 2 1 1 0 38.0 1 0 71.2833
2 3 1 3 0 26.0 0 0 7.9250
3 4 1 1 0 35.0 1 0 53.1000
4 5 0 3 1 35.0 0 0 8.0500


数据重构

将SibSp、Parch特征构建两个新特征

  • 家庭人口总数 familysize
  • 是否单身 isalone
train2.loc[:,'SibSp'] #兄妹个数
train2.loc[:,'Parch'] #父母子女个数

train2['familysize'] = train2.loc[:,'SibSp'] + train2.loc[:,'Parch'] + 1
test2['familysize'] = test2.loc[:,'SibSp'] + test2.loc[:,'Parch'] + 1
train2.head()
PassengerId Survived Pclass Sex Age SibSp Parch Fare familysize
0 1 0 3 1 22.0 1 0 7.2500 2
1 2 1 1 0 38.0 1 0 71.2833 2
2 3 1 3 0 26.0 0 0 7.9250 1
3 4 1 1 0 35.0 1 0 53.1000 2
4 5 0 3 1 35.0 0 0 8.0500 1
train2['isalone'] = 0
train2.loc[train2['familysize'] == 1,'isalone'] = 1
train2.head()
PassengerId Survived Pclass Sex Age SibSp Parch Fare familysize isalone
0 1 0 3 1 22.0 1 0 7.2500 2 0
1 2 1 1 0 38.0 1 0 71.2833 2 0
2 3 1 3 0 26.0 0 0 7.9250 1 1
3 4 1 1 0 35.0 1 0 53.1000 2 0
4 5 0 3 1 35.0 0 0 8.0500 1 1


数据重构后的最终数据

train3 = train2.loc[:,['PassengerId','Survived','Pclass','Sex','Age','Fare','familysize','isalone']]
train3.head()
test3 = test2.loc[:,['PassengerId','Pclass','Sex','Age','Fare','familysize','isalone']]
test3.head()
PassengerId Pclass Sex Age Fare familysize isalone
0 892 3 1 34.5 7.8292 1 NaN
1 893 3 0 47.0 7.0000 2 NaN
2 894 2 1 62.0 9.6875 1 NaN
3 895 3 1 27.0 8.6625 1 NaN
4 896 3 0 22.0 12.2875 3 NaN




数据分析

描述性分析

#单身存活率
d = train3[['isalone', 'Survived']].groupby(['isalone']).mean()
d
# d.loc[0,'Survived']
Survived
isalone
0 0.505650
1 0.303538
#单身与否死亡率

plt.bar(
    [0,1],
    [1-d.loc[0,'Survived'],1-d.loc[1,'Survived']],
    0.5,
    color='r',
    alpha=0.5,
)

plt.xticks([0,1],['notalone','alone'])

plt.show()

Kaggle竞赛:泰坦尼克号灾难数据分析简单案例_第1张图片

#男性女性存活率
n = train3[['Sex', 'Survived']].groupby(['Sex']).mean()
n
Survived
Sex
0 0.742038
1 0.188908
# 不同性别死亡率条形图

plt.bar(
    [0,1],
    [1-n.loc[0,'Survived'],1-n.loc[1,'Survived']],
    0.5,
    color='g',
    alpha=0.7
)

plt.xticks([0,1],['female','male'])

plt.show()

Kaggle竞赛:泰坦尼克号灾难数据分析简单案例_第2张图片

#仓位存活率
c = train3[['Pclass', 'Survived']].groupby(['Pclass']).mean()
c
Survived
Pclass
1 0.629630
2 0.472826
3 0.242363
#三等仓位死亡率条形图

plt.bar(
    [0,1,2],
    [1-c.loc[1,'Survived'],1-c.loc[2,'Survived'],1-c.loc[3,'Survived']],
    0.5,
    color='b',
    alpha=0.7
)

plt.xticks([0,1,2],[1,2,3])

plt.show()

Kaggle竞赛:泰坦尼克号灾难数据分析简单案例_第3张图片

#年龄存活率
age = train3[['Age', 'Survived']].groupby(['Age']).mean()
age
Survived
Age
0.42 1.000000
0.67 1.000000
0.75 1.000000
0.83 1.000000
0.92 1.000000
1.00 0.714286
2.00 0.300000
3.00 0.833333
4.00 0.700000
5.00 1.000000
6.00 0.666667
7.00 0.333333
8.00 0.500000
9.00 0.250000
10.00 0.000000
11.00 0.250000
12.00 1.000000
13.00 1.000000
14.00 0.500000
14.50 0.000000
15.00 0.800000
16.00 0.352941
17.00 0.461538
18.00 0.346154
19.00 0.360000
20.00 0.200000
20.50 0.000000
21.00 0.208333
22.00 0.407407
23.00 0.333333
44.00 0.333333
45.00 0.416667
45.50 0.000000
46.00 0.000000
47.00 0.111111
48.00 0.666667
49.00 0.666667
50.00 0.500000
51.00 0.285714
52.00 0.500000
53.00 1.000000
54.00 0.375000
55.00 0.500000
55.50 0.000000
56.00 0.500000
57.00 0.000000
58.00 0.600000
59.00 0.000000
60.00 0.500000
61.00 0.000000
62.00 0.500000
63.00 1.000000
64.00 0.000000
65.00 0.000000
66.00 0.000000
70.00 0.000000
70.50 0.000000
71.00 0.000000
74.00 0.000000
80.00 1.000000

88 rows × 1 columns

#不同年龄存活率

plt.figure(2, figsize=(20,5))
plt.bar(
    age.index,
    age.values,
    0.5,
    color='r',
    alpha=0.7
)
# plt.axis([0,80,0,20])
plt.xticks(age.index,rotation=90)

plt.show()

Kaggle竞赛:泰坦尼克号灾难数据分析简单案例_第4张图片

#票价存活率
fare = train3[['Fare', 'Survived']].groupby(['Fare']).mean()
fare
Survived
Fare
0.0000 0.066667
4.0125 0.000000
5.0000 0.000000
6.2375 0.000000
6.4375 0.000000
6.4500 0.000000
6.4958 0.000000
6.7500 0.000000
6.8583 0.000000
6.9500 0.000000
6.9750 0.500000
7.0458 0.000000
7.0500 0.000000
7.0542 0.000000
7.1250 0.000000
7.1417 1.000000
7.2250 0.250000
7.2292 0.266667
7.2500 0.076923
7.3125 0.000000
7.4958 0.333333
7.5208 0.000000
7.5500 0.250000
7.6292 0.000000
7.6500 0.250000
7.7250 0.000000
7.7292 0.000000
7.7333 0.500000
7.7375 0.500000
7.7417 0.000000
80.0000 1.000000
81.8583 1.000000
82.1708 0.500000
83.1583 1.000000
83.4750 0.500000
86.5000 1.000000
89.1042 1.000000
90.0000 0.750000
91.0792 1.000000
93.5000 1.000000
106.4250 0.500000
108.9000 0.500000
110.8833 0.750000
113.2750 0.666667
120.0000 1.000000
133.6500 1.000000
134.5000 1.000000
135.6333 0.666667
146.5208 1.000000
151.5500 0.500000
153.4625 0.666667
164.8667 1.000000
211.3375 1.000000
211.5000 0.000000
221.7792 0.000000
227.5250 0.750000
247.5208 0.500000
262.3750 1.000000
263.0000 0.500000
512.3292 1.000000

248 rows × 1 columns

plt.figure(2, figsize=(20,5))
plt.bar(
    fare.index,
    fare.values,
    0.5,
    color='r',
    alpha=0.7
)
# plt.axis([0,80,0,20])
plt.xticks(fare.index,rotation=90)

plt.show()

Kaggle竞赛:泰坦尼克号灾难数据分析简单案例_第5张图片



得出结论

# 单身死亡率70%
jieguo = pd.DataFrame(np.arange(0,418),index=test3.loc[:,'PassengerId'])
jieguo.loc[:,0] = 1
jieguo.head()
0
PassengerId
892 1
893 1
894 1
895 1
896 1
jieguo.loc[test3[test3.loc[:,'isalone'] == 1].loc[:,'PassengerId'].values] = 0 #单身死
jieguo.head()
0
PassengerId
892 1
893 1
894 1
895 1
896 1



输出结论

jieguo.to_csv('isalone.csv')
#判断:男性全死,女性全活,三等仓全死
new3 = pd.DataFrame(np.arange(0,418),index=test3.loc[:,'PassengerId'].values)
new3[0] = 0 #默认全死
new3.head()
0
892 0
893 0
894 0
895 0
896 0
new3.loc[test3[test3.loc[:,'Sex'] == 0].loc[:,'PassengerId'].values] = 1 #女性活
new3.head()
0
892 0
893 1
894 0
895 0
896 1
new3.loc[test2[test2.loc[:,'Pclass'] == 3].loc[:,'PassengerId'].values] = 0 #三等仓死
new3.head()
0
892 0
893 0
894 0
895 0
896 0
#写入csv上传
new3.to_csv('cangwei-xingbie.csv')#判断:男性全死,女性全活,三等仓全死


机器学习建模

train3.head()
PassengerId Survived Pclass Sex Age Fare familysize isalone
0 1 0 3 1 22.0 7.2500 2 0
1 2 1 1 0 38.0 71.2833 2 0
2 3 1 3 0 26.0 7.9250 1 1
3 4 1 1 0 35.0 53.1000 2 0
4 5 0 3 1 35.0 8.0500 1 1
from sklearn import neighbors,datasets
x = train3.loc[:,['Pclass','Sex','familysize']]
y = train3.loc[:,'Survived'] #生死

clf = neighbors.KNeighborsClassifier(n_neighbors = 20)
clf.fit(x,y) #knn训练
clf
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=20, p=2,
           weights='uniform')
#knn预测
z = clf.predict(test3.loc[:,['Pclass','Sex','familysize']])
z
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 0], dtype=int64)
# 构造表
s = np.arange(892, 1310)
s
results = pd.DataFrame(z, index=s)
results.head()
0
892 0
893 0
894 0
895 0
896 1
# 写入csv上传
results.to_csv('Titanic_knn.csv')

你可能感兴趣的:(数据分析)