Getting Started with Kaggle: Titanic (SVM)

Parts of this article are adapted from an excellent guide on the official site.

The framework of machine learning:

1、 define the problem
2、 get the data
3、 prepare the data
4、 perform exploratory analysis
5、 choose the appropriate model
6、 validate the model
7、 optimize the parameters

Data preparation

The data can be processed with either MATLAB or Python.
Here I use Python.

Tips for data analysis:

1、 Correct. Some columns can hold values that are plainly invalid, such as an age of 180, so check the data for them.
2、 Complete. Some algorithms cannot handle NULL values, so check for missing data as well.
3、 Create. Create new columns or parameters that are more meaningful.
4、 Convert. Convert columns into a format the model can use.
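
The four tips above can be sketched on a toy table. This is only an illustration under stated assumptions: the values are made up (they are not taken from the Titanic files), and the cutoff of 120 for an impossible age is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, not from the Titanic files).
df = pd.DataFrame({
    "Age": [22.0, 180.0, np.nan, 35.0],
    "SibSp": [1, 0, 0, 2],
    "Parch": [0, 0, 1, 1],
    "Sex": ["male", "female", "female", "male"],
})

# 1. Correct: treat impossible ages as missing (the 120 cutoff is an assumption).
df.loc[df["Age"] > 120, "Age"] = np.nan

# 2. Complete: fill missing ages with the median of the remaining values.
df["Age"] = df["Age"].fillna(df["Age"].median())

# 3. Create: derive a new, more meaningful column.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# 4. Convert: encode a text column as integer codes.
df["Sex_Code"] = LabelEncoder().fit_transform(df["Sex"])

print(df)
```

Correcting before completing matters here: if the age of 180 were left in place, it would pull the median used to fill the missing value.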

Format conversion is the most important task at this stage.
After pandas reads the data, the result is a DataFrame.
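
A quick way to see this is to read a tiny CSV and inspect the types. The sketch below uses an in-memory string as a stand-in for train.csv; the rows are made up for illustration.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for train.csv (values are made up).
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,\n"
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))            # the table as a whole is a DataFrame
print(type(df["Age"]))     # a single column is a Series
print(df.isnull().sum())   # count of missing values per column
```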

Data processing

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
path_train = 'D:\\AboutWork\\机器学习\\Example1_titanic\\train.csv'
path_test = 'D:\\AboutWork\\机器学习\\Example1_titanic\\test.csv'
raw_data = pd.read_csv(path_train)
test_data = pd.read_csv(path_test)
data_cleaner = [raw_data, test_data]
print(raw_data.sample(10))
print(raw_data.isnull().sum())
     PassengerId  Survived  Pclass  \
817          818         0       2   
54            55         0       1   
566          567         0       3   
769          770         0       3   
186          187         1       3   
208          209         1       3   
559          560         1       3   
340          341         1       2   
358          359         1       3   
483          484         1       3   

                                                Name     Sex   Age  SibSp  \
817                               Mallet, Mr. Albert    male  31.0      1   
54                    Ostby, Mr. Engelhart Cornelius    male  65.0      0   
566                             Stoytcheff, Mr. Ilia    male  19.0      0   
769                 Gronnestad, Mr. Daniel Danielsen    male  32.0      0   
186  O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey)  female   NaN      1   
208                        Carr, Miss. Helen "Ellen"  female  16.0      0   
559     de Messemaeker, Mrs. Guillaume Joseph (Emma)  female  36.0      1   
340                   Navratil, Master. Edmond Roger    male   2.0      1   
358                             McGovern, Miss. Mary  female   NaN      0   
483                           Turkula, Mrs. (Hedwig)  female  63.0      0   

     Parch           Ticket     Fare Cabin Embarked  
817      1  S.C./PARIS 2079  37.0042   NaN        C  
54       1           113509  61.9792   B30        C  
566      0           349205   7.8958   NaN        S  
769      0             8471   8.3625   NaN        S  
186      0           370365  15.5000   NaN        Q  
208      0           367231   7.7500   NaN        Q  
559      0           345572  17.4000   NaN        S  
340      1           230080  26.0000    F2        S  
358      0           330931   7.8792   NaN        Q  
483      0             4134   9.5875   NaN        S  
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
raw_data = pd.read_csv(path_train)
test_data = pd.read_csv(path_test)
data_cleaner = [raw_data, test_data]
for data in data_cleaner:
    data['Age'] = data['Age'].fillna(data['Age'].median())  # fill missing ages with the median
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])  # fill with the most frequent port
    data.drop(['Ticket', 'Cabin', 'PassengerId'], axis=1, inplace=True)  # drop columns
print(raw_data.sample(10))
print(raw_data.isnull().sum())
     Survived  Pclass                                              Name  \
620         0       3                               Yasbeck, Mr. Antoni   
176         0       3                     Lefebre, Master. Henry Forbes   
568         0       3                               Doharr, Mr. Tannous   
863         0       3                 Sage, Miss. Dorothy Edith "Dolly"   
742         1       1             Ryerson, Miss. Susan Parker "Suzette"   
748         0       1                         Marvin, Mr. Daniel Warner   
279         1       3                  Abbott, Mrs. Stanton (Rosa Hunt)   
206         0       3                        Backstrom, Mr. Karl Alfred   
871         1       1  Beckwith, Mrs. Richard Leonard (Sallie Monypeny)   
317         0       2                              Moraweck, Dr. Ernest   

        Sex   Age  SibSp  Parch      Fare Embarked  
620    male  27.0      1      0   14.4542        C  
176    male  28.0      3      1   25.4667        S  
568    male  28.0      0      0    7.2292        C  
863  female  28.0      8      2   69.5500        S  
742  female  21.0      2      2  262.3750        C  
748    male  19.0      1      0   53.1000        S  
279  female  35.0      1      1   20.2500        S  
206    male  32.0      1      0   15.8500        S  
871  female  47.0      1      1   52.5542        S  
317    male  54.0      0      0   14.0000        S  
Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
for data in data_cleaner:
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
    data['IsAlone'] = 1
    data.loc[data['FamilySize'] > 1, 'IsAlone'] = 0  # index with DataFrame.loc directly; chained indexing may not modify data
    data['FareBin'] = pd.qcut(data['Fare'], 4)  # qcut: 4 quantile-based bins with roughly equal row counts
    data['AgeBin'] = pd.cut(data['Age'].astype(float), 5)  # cut: 5 equal-width bins, each (max - min) / 5 wide
print(raw_data.info())
print(raw_data.isnull().sum())

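pandas.qcut() splits the data into bins based on quantiles: the bin edges are computed from the data so that each bin holds roughly the same number of rows.
pandas.cut() also splits the data into bins, but the bins have equal width, namely (max - min) / n.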
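
The difference is easiest to see on skewed data. The sketch below uses a made-up series with one large outlier; it is an illustration, not part of the Titanic processing.

```python
import pandas as pd

# Skewed toy values (made up for illustration).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

equal_width = pd.cut(values, 4)    # 4 bins of equal width over [min, max]
equal_count = pd.qcut(values, 4)   # 4 bins holding roughly equal numbers of rows

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With cut, the outlier stretches the range so that almost every value lands in the first bin. With qcut, the edges follow the quartiles and the rows are spread evenly, which is why it suits a skewed column such as Fare.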


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Survived    891 non-null    int64   
 1   Pclass      891 non-null    int64   
 2   Name        891 non-null    object  
 3   Sex         891 non-null    object  
 4   Age         891 non-null    float64 
 5   SibSp       891 non-null    int64   
 6   Parch       891 non-null    int64   
 7   Fare        891 non-null    float64 
 8   Embarked    891 non-null    object  
 9   FamilySize  891 non-null    int64   
 10  IsAlone     891 non-null    int64   
 11  FareBin     891 non-null    category
 12  AgeBin      891 non-null    category
dtypes: category(2), float64(2), int64(6), object(3)
memory usage: 78.9+ KB
None
Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Fare          0
Embarked      0
FamilySize    0
IsAlone       0
FareBin       0
AgeBin        0
dtype: int64
label = LabelEncoder()
for data in data_cleaner:  # encode categorical columns as integer codes (0, 1, ...)
    data['Sex_Code'] = label.fit_transform(data['Sex'])
    data['Embarked_Code'] = label.fit_transform(data['Embarked'])
    data['AgeBin_Code'] = label.fit_transform(data['AgeBin'])
    # data['FareBin_Code'] = label.fit_transform(data['FareBin'])
raw_data_x = ['Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
raw_data_calc = ['Sex_Code', 'Pclass', 'Embarked_Code', 'SibSp', 'Parch', 'Age', 'Fare']
raw_data_xy = ['Survived'] + raw_data_x
print(raw_data_xy)
['Survived', 'Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
raw_data['FareBin_Code'] = label.fit_transform(raw_data['FareBin'])  # note: the test data is not transformed here
raw_data_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
raw_data_bin = ['Survived'] + raw_data_x_bin
raw_data_dummy = pd.get_dummies(raw_data[raw_data_x])
raw_data_x_dummy = raw_data_dummy.columns.tolist()
raw_data_xy_dummy = ['Survived'] + raw_data_x_dummy
print('Dummy X Y: ', raw_data_xy_dummy, '\n')
Dummy X Y:  ['Survived', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S'] 
raw_data.describe(include = 'all')
          Survived      Pclass                         Name   Sex         Age       SibSp  \
count   891.000000  891.000000                          891   891  891.000000  891.000000
unique         NaN         NaN                          891     2         NaN         NaN
top            NaN         NaN  Lundahl, Mr. Johan Svensson  male         NaN         NaN
freq           NaN         NaN                            1   577         NaN         NaN
mean      0.383838    2.308642                          NaN   NaN   29.361582    0.523008
std       0.486592    0.836071                          NaN   NaN   13.019697    1.102743
min       0.000000    1.000000                          NaN   NaN    0.420000    0.000000
25%       0.000000    2.000000                          NaN   NaN   22.000000    0.000000
50%       0.000000    3.000000                          NaN   NaN   28.000000    0.000000
75%       1.000000    3.000000                          NaN   NaN   35.000000    1.000000
max       1.000000    3.000000                          NaN   NaN   80.000000    8.000000

             Parch        Fare  Embarked  FamilySize     IsAlone         FareBin  \
count   891.000000  891.000000       891  891.000000  891.000000             891
unique         NaN         NaN         3         NaN         NaN               4
top            NaN         NaN         S         NaN         NaN  (7.91, 14.454]
freq           NaN         NaN       646         NaN         NaN             224
mean      0.381594   32.204208       NaN    1.904602    0.602694             NaN
std       0.806057   49.693429       NaN    1.613459    0.489615             NaN
min       0.000000    0.000000       NaN    1.000000    0.000000             NaN
25%       0.000000    7.910400       NaN    1.000000    0.000000             NaN
50%       0.000000   14.454200       NaN    1.000000    1.000000             NaN
75%       0.000000   31.000000       NaN    2.000000    1.000000             NaN
max       6.000000  512.329200       NaN   11.000000    1.000000             NaN

                  AgeBin    Sex_Code  Embarked_Code  AgeBin_Code  FareBin_Code
count                891  891.000000     891.000000   891.000000    891.000000
unique                 5         NaN            NaN          NaN           NaN
top     (16.336, 32.252]         NaN            NaN          NaN           NaN
freq                 523         NaN            NaN          NaN           NaN
mean                 NaN    0.647587       1.536476     1.290685      1.497194
std                  NaN    0.477990       0.791503     0.812620      1.118156
min                  NaN    0.000000       0.000000     0.000000      0.000000
25%                  NaN    0.000000       1.000000     1.000000      0.500000
50%                  NaN    1.000000       2.000000     1.000000      1.000000
75%                  NaN    1.000000       2.000000     2.000000      2.000000
max                  NaN    1.000000       2.000000     4.000000      3.000000
train_x, test_x, train_y, test_y = model_selection.train_test_split(raw_data[raw_data_calc], raw_data['Survived'], random_state=0)
train_x_bin, test_x_bin, train_y_bin, test_y_bin = model_selection.train_test_split(raw_data[raw_data_x_bin], raw_data['Survived'], random_state=0)
train_x_dummy, test_x_dummy, train_y_dummy, test_y_dummy = model_selection.train_test_split(raw_data_dummy[raw_data_x_dummy], raw_data['Survived'], random_state = 0)
train_x_bin.head()
     Sex_Code  Pclass  Embarked_Code  FamilySize  AgeBin_Code  FareBin_Code
105         1       3              2           1            1             0
68          0       3              2           7            1             1
253         1       3              2           2            1             2
320         1       3              2           1            1             0
706         0       2              2           1            2             1
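
The title names SVM, and the splits above are what such a model would be trained on. As a rough sketch only, the following fits scikit-learn's SVC on a split made the same way. Everything here is an assumption for illustration: the feature matrix is synthetic rather than the Titanic table, and the scaling step, the RBF kernel, and the default hyperparameters are not tuned choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded feature table (not the Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # label depends on the first two features

train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)

# Scale the features, then fit an RBF-kernel SVM with default hyperparameters.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(train_x, train_y)

accuracy = model.score(test_x, test_y)
print(f"held-out accuracy: {accuracy:.3f}")
```

Scaling is included because SVMs are sensitive to feature ranges; on the real data, columns such as Fare and Pclass live on very different scales.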

Data visualization
