Titanic (Survival Prediction) --- (1)

I'm a beginner, so if you spot a problem, please point it out. Let's keep at it and improve together!
About the data and the code:

# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

Reading the data

train_df = pd.read_csv('data/泰坦尼克号生存率/train.csv')
test_df = pd.read_csv('data/泰坦尼克号生存率/test.csv')
combine = [train_df, test_df]
# Column (feature) names and the first five rows
print(train_df.columns.values)
train_df.head()
# Check each dataset for missing values
train_df.info()
print('_'*50)
test_df.info()
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB
__________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)

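Conclusions:
Missing data:

Training set: Cabin is mostly missing (only 204 of 891 values present), Age is partly missing (714 of 891 present), and Embarked is missing 2 values.
Test set: Cabin is missing more than Age (91 and 332 of 418 values present, respectively), and Fare is missing 1 value.

Data types:
Training set: 7 numeric columns (2 float64 + 5 int64) + 5 object columns
Test set: 6 numeric columns (2 float64 + 4 int64) + 5 object columns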
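The counts above were read off the `info()` output by hand. As a rough sketch, the same information can be obtained directly with `isnull()`. The small `toy_df` below is a made-up stand-in for `train_df` (its values are assumptions for illustration, not the real Titanic data), so the snippet runs without the CSV files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number of missing values per column, largest first
missing_count = toy_df.isnull().sum().sort_values(ascending=False)

# Fraction of rows that are missing in each column
missing_ratio = toy_df.isnull().mean()

print(missing_count)
print(missing_ratio)
```

Running the same two expressions on `train_df` and `test_df` should list the columns with missing values without having to subtract the non-null counts by hand.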

Handling the missing data

Methods for handling missing data
Start with Embarked, which has the fewest missing values:

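# Only two values are missing, so most imputation methods would work; to keep it simple, fill in the most frequent value
freq_port = train_df.Embarked.dropna().mode()[0]  # the most frequent port of embarkation
freq_port
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)  # replace missing values with the most frequent port

train_df.info()
# Group by Embarked and take the mean of Survived, i.e. the survival rate for each port
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

The output shows that Embarked is now completely filled in, and that port C has the highest survival rate.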
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB

	Embarked 	Survived
0 	C 	0.553571
1 	Q 	0.389610
2 	S 	0.339009
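To make the two steps above easier to check by hand, here is a minimal sketch of mode imputation followed by a per-group survival rate on a tiny made-up table. `toy_df`, `most_common_port`, and `rate_by_port` are names introduced only for this illustration, and the values are not from the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny made-up table (hypothetical values) with one missing port
toy_df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S", "Q", "C", "S"],
    "Survived": [0, 1, 1, 0, 0, 1, 1],
})

# mode() ignores NaN by default and returns a Series; [0] takes the first mode
most_common_port = toy_df["Embarked"].mode()[0]
toy_df["Embarked"] = toy_df["Embarked"].fillna(most_common_port)

# Survived is 0/1, so its mean within a group is that group's survival rate
rate_by_port = (
    toy_df.groupby("Embarked", as_index=False)["Survived"]
    .mean()
    .sort_values(by="Survived", ascending=False)
)
print(rate_by_port)
```

Because `Survived` only takes the values 0 and 1, the group mean is exactly the share of survivors in that group, which is why the table above can be read as a survival rate per port.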

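Age is filled in by mean imputation

# Mean age computed from the training set (NaN values are skipped by mean())
age_mean = train_df['Age'].mean()
age_mean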

for dataset in combine:
    dataset['Age'] = dataset['Age'].fillna(age_mean)
    
train_df.info()    
train_df[['Age', 'Survived']].groupby(['Age'], as_index=False).mean().sort_values(by='Survived', ascending=False)
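A small sketch of mean imputation on made-up data may help here. It assumes the mean is taken from the training set only and then reused for the test set, which is a common convention rather than something the dataset requires. `toy_train`, `toy_test`, and `train_age_mean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up ages (hypothetical values) with one gap in each set
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 60.0]})

# Compute the statistic on the training set only; mean() skips NaN
train_age_mean = toy_train["Age"].mean()

# Reuse the same training-set mean for both sets
for df in (toy_train, toy_test):
    df["Age"] = df["Age"].fillna(train_age_mean)

print(train_age_mean)
print(toy_train["Age"].tolist())
print(toy_test["Age"].tolist())
```

Using one value for both sets keeps the test data from influencing the preprocessing, so both sets are filled in exactly the same way.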

Cabin can simply be dropped

  • Too much of the data is missing
  • The feature has little relation to the survival rate
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)

combine = [train_df, test_df]
train_df.shape, test_df.shape

test_df = test_df.drop(['Ticket','Cabin'], axis=1)
train_df = train_df.drop(['Ticket','Cabin'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape
#train_df.head()
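The decision to drop Cabin was made by eye. As one possible way to make the "too much missing data" rule explicit, the sketch below drops every column whose missing fraction exceeds a cutoff. The cutoff `MAX_MISSING_RATIO = 0.5` and the toy data are assumptions chosen for illustration, not values taken from this post.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

MAX_MISSING_RATIO = 0.5  # hypothetical cutoff, pick what suits your data

# Fraction of missing values per column
missing_ratio = toy_df.isnull().mean()

# Columns whose missing fraction is above the cutoff
cols_to_drop = missing_ratio[missing_ratio > MAX_MISSING_RATIO].index.tolist()

reduced_df = toy_df.drop(columns=cols_to_drop)
print(cols_to_drop)
print(reduced_df.columns.tolist())
```

In the real training set, Cabin is missing in 687 of 891 rows (about 77%), so a cutoff like this would drop it, while Age (about 20% missing) would be kept.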

Normalizing the data

For
