I'm a beginner, so please point out any mistakes you find. Let's keep at it and improve together!
About the data and the code:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
Load the data
train_df = pd.read_csv('data/泰坦尼克号生存率/train.csv')
test_df = pd.read_csv('data/泰坦尼克号生存率/test.csv')
combine = [train_df, test_df]
# Feature (column) names and the first five samples
print(train_df.columns.values)
train_df.head()
# Check how many values are missing in each dataset
train_df.info()
print('_'*50)
test_df.info()
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB
__________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
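Conclusions:
Missing data:
Training data: Cabin is mostly missing (only 204 of 891 values present), Age is partly missing (714 of 891 present), and Embarked is missing 2 values.
Test data: Cabin is missing the most (91 of 418 present), followed by Age (332 of 418 present); Fare is missing 1 value.
Data types:
Training data: 7 numeric columns (5 int64, 2 float64) + 5 object columns
Test data: 6 numeric columns (4 int64, 2 float64) + 5 object columns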
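The counts above were read off the `info()` output by hand. As a minimal sketch, the same information can be computed directly with `isnull()`. The frame `toy_df` and its values are made up for illustration and stand in for `train_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df; the values are invented.
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Number of missing values per column, largest first.
missing_counts = toy_df.isnull().sum().sort_values(ascending=False)

# Fraction of rows that are missing in each column.
missing_share = toy_df.isnull().mean()

# Number of columns per dtype (the "numeric + object" split).
dtype_counts = toy_df.dtypes.value_counts()

print(missing_counts)
print(missing_share)
print(dtype_counts)
```

On the real data, replacing `toy_df` with `train_df` or `test_df` should give the per-column gaps without counting by eye.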
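Handling missing data
Start with Embarked, which has the fewest missing values:
# Only two values are missing, so almost any method would work; to keep it simple, fill in the most frequent value
freq_port = train_df.Embarked.dropna().mode()[0]  # the most frequent value of the feature
freq_port
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)  # where the value is missing, fill in the most frequent one
train_df.info()
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# Group by Embarked and take the mean of Survived, i.e. the survival rate for each port of embarkation.
The output shows that Embarked is now completely filled in and that port C has the highest survival rate (about 0.55).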
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.339009
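To make the Embarked step easier to follow, here is a minimal sketch of mode imputation followed by a per-group survival rate. The frame `toy_df` and its values are invented for illustration; only the pandas calls mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the Titanic files.
toy_df = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S', 'Q', np.nan],
    'Survived': [0, 1, 1, 0, 1, 0],
})

# mode() ignores NaN by default and returns a Series (there may be ties),
# so [0] picks the first of the most frequent values.
most_common = toy_df['Embarked'].mode()[0]

# Fill the gaps with that value.
toy_df['Embarked'] = toy_df['Embarked'].fillna(most_common)

# Mean of a 0/1 column per group is the survival rate of that group.
rates = toy_df.groupby('Embarked')['Survived'].mean()
print(most_common)
print(rates)
```

Because Survived is coded 0/1, its group mean can be read directly as the share of survivors per port.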
Age: mean imputation
# Mean age computed from the training set (NaN values are skipped by mean())
age_mean = train_df['Age'].mean()
age_mean
for dataset in combine:
    dataset['Age'] = dataset['Age'].fillna(age_mean)
train_df.info()
train_df[['Age', 'Survived']].groupby(['Age'], as_index=False).mean().sort_values(by='Survived', ascending=False)
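The mean-imputation step can be sketched on toy data as follows. The names `toy_train` and `toy_test` and their values are hypothetical; the sketch assumes the fill value should come from the training set only and then be reused for the test set, which is a common convention rather than something the data requires.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train_df and test_df.
toy_train = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 50.0]})

# Compute the statistic once, on the training data; mean() skips NaN.
train_age_mean = toy_train['Age'].mean()

# Reuse the same value for both frames so they are treated consistently.
for frame in (toy_train, toy_test):
    frame['Age'] = frame['Age'].fillna(train_age_mean)

print(train_age_mean)
print(toy_train['Age'].tolist())
print(toy_test['Age'].tolist())
```

Computing the mean before the loop matters: a loop variable such as `dataset` still refers to the last frame of the previous loop, so using it outside the loop would silently pick that frame's mean.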
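Cabin can simply be dropped; Name, Ticket, and (for the training set only) PassengerId are dropped along with it: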
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape
test_df = test_df.drop(['Ticket','Cabin'], axis=1)
train_df = train_df.drop(['Ticket','Cabin'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape
#train_df.head()
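A small sketch of why `combine` is rebuilt after each `drop`: `DataFrame.drop` returns a new frame by default, so a list built earlier still points at the old objects. The frames `toy_train` and `toy_test` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames with the same kind of columns as the Titanic data.
toy_train = pd.DataFrame({
    'PassengerId': [1, 2],
    'Survived': [0, 1],
    'Name': ['A', 'B'],
    'Ticket': ['t1', 't2'],
    'Cabin': [None, 'C85'],
    'Fare': [7.25, 71.28],
})
toy_test = toy_train.drop(columns=['Survived'])

old_combine = [toy_train, toy_test]

# drop() returns new frames; the originals referenced by old_combine are unchanged.
cols_to_drop = ['Name', 'Ticket', 'Cabin']
toy_train = toy_train.drop(columns=cols_to_drop + ['PassengerId'])
toy_test = toy_test.drop(columns=cols_to_drop)

# Rebuild the list so later loops operate on the reduced frames.
new_combine = [toy_train, toy_test]

print(old_combine[0].shape, new_combine[0].shape)
print(list(toy_train.columns), list(toy_test.columns))
```

The test frame keeps PassengerId here, matching the code above, where only the training set drops it.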
For