One Approach to the Titanic Problem

This post borrows some code ideas from "Titanic Data Science Solutions" and from the article "First Kaggle Project: Titanic".

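Steps

  • Problem statement
  • Importing and cleaning the data
  • Feature engineering
    • Handling Age
    • Handling Fare
    • Handling Embarked ('S' is the most common value)
    • Encoding Sex as 0/1
    • One-hot encoding the port of embarkation (Embarked) and dropping the original variable
    • One-hot encoding the passenger class (Pclass) and dropping the original variable
    • Name
      • Extracting the title from the name
      • One-hot encoding the title and dropping the original variable
  • Family size (on board)
  • Dropping Cabin (too many missing values), Ticket (messy information), and PassengerId (not needed)
  • Splitting back into train and test
  • Machine learning
    • Computing the scores
    • Results
  • Some findings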

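Problem statement

What kind of people were more likely to survive on the Titanic?

Importing and cleaning the data

The data used in this post is available on Kaggle. We load it with pandas.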

import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

Missing values severely limit how the data can be used, so we first call .info() to see which columns have them.

train_df.info()

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

In the train dataset, Age, Cabin, and Embarked have missing values that need to be filled in.

test_df.info()

RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

In the test dataset, Age, Fare, and Cabin have missing values that need to be filled in.
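As a complement to reading the .info() output, the missing entries can be counted per column directly. The sketch below runs on a small made-up frame (toy_df and its values are illustrative, not the Kaggle data) so that it is self-contained.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Embarked': ['S', 'C', None, 'S'],
})

# Count missing entries per column, largest first
missing_counts = toy_df.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

Applied to train_df or test_df, the same two calls would list only the per-column gaps instead of the full .info() report.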

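Combine the two datasets so that both are cleaned in the same way.

# ignore_index=True gives every row a unique label after stacking
combine = pd.concat([train_df, test_df], axis=0, ignore_index=True)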

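Feature engineering

import numpy as np
# Fix the seed so the random imputation below is reproducible
# (any fixed integer works; 0 is an arbitrary choice)
np.random.seed(0)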
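Calling np.random.seed() with no argument reseeds from fresh system entropy, so only a fixed integer makes runs repeatable. A minimal sketch of that point, using NumPy's Generator API (np.random.default_rng, which this post does not otherwise use) and made-up distribution parameters:

```python
import numpy as np

# Two generators built from the same fixed seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# The same sequence of normal draws comes out of both
draws_a = rng_a.normal(loc=30.0, scale=14.0, size=5)
draws_b = rng_b.normal(loc=30.0, scale=14.0, size=5)
same = np.array_equal(draws_a, draws_b)
print(same)
```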

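Handling Age

Fill the missing ages with random draws from a normal distribution whose mean is np.mean() and whose standard deviation is np.std() of the observed ages.

# Draw one value per missing Age from N(mean, std) of the observed ages
age_missing = combine['Age'].isna()
combine.loc[age_missing, 'Age'] = np.random.normal(
    np.mean(combine['Age']), np.std(combine['Age']), age_missing.sum())
# Optional rounding of the filled values (see the findings at the end):
# combine.loc[age_missing, 'Age'] = combine.loc[age_missing, 'Age'].round()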
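The imputation idea can be isolated into a small helper. This is only a sketch: fill_with_normal_draws is a hypothetical name, the ages are invented, and the random generator is passed in explicitly so the result is reproducible.

```python
import numpy as np
import pandas as pd

def fill_with_normal_draws(values, rng):
    """Return a copy of `values` with NaNs replaced by N(mean, std) draws."""
    filled = values.copy()
    missing = filled.isna()
    # mean() and std() skip NaN, so they describe the observed values only
    draws = rng.normal(filled.mean(), filled.std(ddof=0), missing.sum())
    filled[missing] = draws
    return filled

ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0])
filled_ages = fill_with_normal_draws(ages, np.random.default_rng(0))
print(filled_ages)
```

Because a normal distribution is unbounded, some draws can fall below zero, which is worth checking for quantities such as Age and Fare.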

Handling Fare

# Same approach as for Age: one normal draw per missing Fare
fare_missing = combine['Fare'].isna()
combine.loc[fare_missing, 'Fare'] = np.random.normal(
    np.mean(combine['Fare']), np.std(combine['Fare']), fare_missing.sum())
# Optional rounding of the filled values (see the findings at the end):
# combine.loc[fare_missing, 'Fare'] = combine.loc[fare_missing, 'Fare'].round()

Handling Embarked ('S' is the most common value)

combine['Embarked'] = combine['Embarked'].fillna('S')
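Rather than hard-coding 'S', the most frequent value can be looked up with Series.mode(). A minimal sketch on invented port codes:

```python
import pandas as pd

# Toy Embarked column with two gaps; the values are illustrative
embarked = pd.Series(['S', 'C', None, 'S', 'Q', 'S', None])

# mode() ignores missing entries and returns the most frequent value(s)
most_common = embarked.mode()[0]
embarked_filled = embarked.fillna(most_common)
print(most_common, embarked_filled.tolist())
```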

Encoding Sex as 0/1

sex = {'male': 1, 'female': 0}
combine['Sex'] = combine['Sex'].map(sex)

One-hot encode the port of embarkation (Embarked) and drop the original variable

data_Embark = pd.get_dummies(combine['Embarked'], prefix = 'Embarked')
combine = pd.concat([data_Embark, combine], axis = 1)
combine = combine.drop('Embarked', axis = 1)

One-hot encode the passenger class (Pclass) and drop the original variable

data_Pclass = pd.get_dummies(combine['Pclass'], prefix = 'Pclass')
combine = pd.concat([data_Pclass, combine], axis = 1)
combine = combine.drop('Pclass', axis = 1)
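The encode, concatenate, and drop steps above can also be done in a single call, because pd.get_dummies accepts a columns argument and removes the encoded columns itself. A sketch on a made-up frame (toy and its values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Pclass': [3, 1, 2, 3],
    'Fare': [7.25, 71.28, 8.46, 8.05],
})

# Encode both columns at once; the originals are replaced by indicator columns
encoded = pd.get_dummies(toy, columns=['Embarked', 'Pclass'], dtype=int)
print(encoded.columns.tolist())
```

The one difference from the step-by-step version is column order: here the indicator columns are appended after the remaining columns rather than placed in front.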

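Name

Extracting the title from the name

# The title is the word that ends with a period, e.g. "Mr." or "Miss."
combine['NameTitle'] = combine['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# Group uncommon titles under 'Rare'
rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major',
               'Rev', 'Sir', 'Jonkheer', 'Dona']
combine['NameTitle'] = combine['NameTitle'].replace(rare_titles, 'Rare')
# Merge variant spellings into the common titles
combine['NameTitle'] = combine['NameTitle'].replace(['Mlle', 'Ms'], 'Miss')
combine['NameTitle'] = combine['NameTitle'].replace('Mme', 'Mrs')
combine = combine.drop('Name', axis=1)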
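To see what the regular expression picks up, here is a small sketch on a handful of names written in the dataset's "Surname, Title. Given names" format. Only part of the title mapping is reproduced, for brevity.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
    'Sagesser, Mlle. Emma',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture the first run of letters that follows a space and ends with a period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
# Apply a subset of the grouping used in this post
titles = titles.replace(['Mlle', 'Ms'], 'Miss').replace(['Countess'], 'Rare')
print(titles.tolist())
```

In the last name, "the" is skipped because it is not followed by a period, so the match lands on "Countess".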

One-hot encode the title and drop the original variable

data_NameTitle = pd.get_dummies(combine['NameTitle'], prefix = 'NameTitle')
combine = pd.concat([data_NameTitle, combine], axis = 1)
combine = combine.drop('NameTitle', axis = 1)

Family size (on board)

combine['FamilySize'] = combine['SibSp'] + combine['Parch'] + 1 
combine = combine.drop(['SibSp', 'Parch'], axis = 1)
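SibSp counts siblings and spouses aboard and Parch counts parents and children aboard, so the + 1 adds the passenger themself. A tiny worked example with invented counts:

```python
import pandas as pd

# Three hypothetical passengers: with one spouse, alone, and in a large family
toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# Family size = relatives aboard + the passenger
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1
toy = toy.drop(['SibSp', 'Parch'], axis=1)
print(toy['FamilySize'].tolist())
```

A passenger travelling alone therefore gets FamilySize 1, not 0.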

Drop Cabin (too many missing values), Ticket (messy information), and PassengerId (not needed).

combine = combine.drop(['Cabin', 'PassengerId', 'Ticket'], axis = 1)

Splitting back into train and test

train = combine[combine['Survived'].notna()]
test = combine[combine['Survived'].isna()].drop('Survived', axis=1)

X_train = train.drop('Survived', axis = 1)
Y_train = train['Survived']
X_test = test
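The split works because the test file has no Survived column, so after concatenation its rows carry NaN there. A minimal sketch of that round trip on two made-up frames (toy_train and toy_test are illustrative):

```python
import pandas as pd

toy_train = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, 38.0]})
toy_test = pd.DataFrame({'Age': [34.5, 47.0, 62.0]})

# Stacking fills the missing Survived column of the test rows with NaN
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

# Rows with a label are training rows; rows without one are test rows
train_part = combined[combined['Survived'].notna()]
test_part = combined[combined['Survived'].isna()].drop('Survived', axis=1)
print(len(train_part), len(test_part))
```

One side effect is that Survived becomes a float column after stacking, since NaN cannot be stored in an integer column. It can be cast back with astype(int) on the training part if integer labels are preferred.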

Machine learning

from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

svc = SVC()
svc.fit(X_train, Y_train)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)

knn = KNeighborsClassifier(n_neighbors = 33)
knn.fit(X_train, Y_train)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)


linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
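Every score above is computed on the same rows the model was fitted on. For a less optimistic estimate, one option (not used in this post) is k-fold cross-validation with scikit-learn's cross_val_score. The sketch below uses synthetic data from make_classification as a stand-in for X_train and Y_train, so its numbers say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the rows the tree was fitted on
train_acc = tree.fit(X, y).score(X, y)

# Accuracy on held-out folds: each fold is scored by a tree that never saw it
cv_scores = cross_val_score(tree, X, y, cv=5)
print(round(train_acc * 100, 2), round(cv_scores.mean() * 100, 2))
```

With the real data, the call would take X_train and Y_train in place of X and y.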

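Computing the scores

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})

Results

Each score is the accuracy (in percent) on the training set itself, since every model is scored with score(X_train, Y_train). The near-perfect values for Random Forest and Decision Tree therefore mostly show how closely those models fit the training rows, not how well they would do on unseen passengers.

   Model                         Score
0  Support Vector Machines       88.55
1  KNN                           72.62
2  Random Forest                 99.10
3  Naive Bayes                   79.91
4  Perceptron                    58.59
5  Stochastic Gradient Descent   73.51
6  Linear SVC                    82.04
7  Decision Tree                 99.10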
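The table is easier to read when ranked. A short sketch that rebuilds the frame from the scores listed above and sorts it with DataFrame.sort_values:

```python
import pandas as pd

# Scores copied from the results table
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Random Forest',
              'Naive Bayes', 'Perceptron', 'Stochastic Gradient Descent',
              'Linear SVC', 'Decision Tree'],
    'Score': [88.55, 72.62, 99.10, 79.91, 58.59, 73.51, 82.04, 99.10],
})

# Highest score first; a stable sort keeps tied rows in their original order
ranked = models.sort_values(by='Score', ascending=False, kind='mergesort')
print(ranked.to_string(index=False))
```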

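Some findings

  1. When filling in missing values, rounding the filled Age and Fare values does not improve the accuracy score.
  2. Age and Fare can also be one-hot encoded, but as with rounding this does not help: the accuracy score goes down.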
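The post does not show how Age and Fare were turned into one-hot features. One common way, assumed here, is to cut the continuous values into bands with pd.cut and then encode the bands. The bin edges and labels below are hypothetical.

```python
import pandas as pd

ages = pd.Series([4.0, 17.0, 29.0, 45.0, 71.0])

# Hypothetical age bands; the post does not say which bins were tried
age_bands = pd.cut(ages, bins=[0, 12, 18, 40, 60, 100],
                   labels=['child', 'teen', 'adult', 'middle', 'senior'])

# One indicator column per band
age_dummies = pd.get_dummies(age_bands, prefix='Age', dtype=int)
print(age_dummies)
```

Binning throws away the ordering and spacing within each band, which may be one reason this kind of encoding did not help here.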
