Titanic Project Study Notes

I have been studying ML and DL for a while now. Going forward I plan to record my Kaggle learning and practice here, to build up technical experience, and also to use this space for AI-related articles.

Titanic competition page:

https://www.kaggle.com/c/titanic/kernels?sortBy=voteCount&group=everyone&pageSize=20&language=Python&competitionId=3136

I worked through two of the kernels there:

1. https://www.kaggle.com/startupsci/titanic-data-science-solutions

The first kernel is quite detailed. It lays out an approach to the problem, then validates that approach step by step while building up the feature engineering; each engineered feature is examined with visualizations. Finally, several machine learning methods are trained, used for prediction, and compared against each other.

It is therefore well suited to readers who are not yet familiar with machine learning in practice.
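That model-comparison step boils down to fitting several scikit-learn classifiers on the same engineered features and ranking them by training accuracy. Below is a minimal sketch of the idea (not the kernel's exact code); `compare_models` is a hypothetical helper, and the caller is expected to pass in the prepared feature matrix and labels.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def compare_models(X_train, Y_train):
    """Fit a few classifiers on the same features and report their training accuracy."""
    models = {
        'Logistic Regression': LogisticRegression(),
        'Support Vector Machine': SVC(),
        'Random Forest': RandomForestClassifier(n_estimators=100),
        'KNN': KNeighborsClassifier(n_neighbors=3),
    }
    for name, model in models.items():
        model.fit(X_train, Y_train)
        # Training accuracy, which is what the kernel uses for its comparison table
        print('{}: {:.4f}'.format(name, model.score(X_train, Y_train)))
```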

Below is a summary of what I learned from studying this kernel:

# Summary

1. Split the features into numerical and categorical (string) ones, then look at each group's distribution and its relationship with Survived; describe() gives an overview of the overall distribution.

2. Also look at how combinations of features relate to Survived. After inspecting the distributions you can form hypotheses; in this example, Survived is assumed to correlate with sex, passenger class, age, and so on.

3. Drop features: for example, features with many missing values, or features with little relation to Survived, such as Name and PassengerId.

4. Fill missing values: for numerical features the median is recommended; for categorical features, the most frequent category.

5. Add new features. This takes experience; in this example FamilySize, IsAlone, etc. are added.

6. Map categorical features to numerical values.

# Tips

A short sketch of these calls follows after this list.

- train_df.describe() computes statistics for the numerical columns: effectively each column is sorted and the min, 25%, 50%, 75% quantiles, etc. are calculated.

- train_df.describe(include=['O']) computes statistics for string (object) columns: count, unique, top, freq.

- sns.FacetGrid: use histograms to analyse numerical features against Survived, and point plots (sns.pointplot) for categorical features.
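To make these tips concrete, here is a minimal exploratory sketch in the spirit of the first kernel. It only assumes the competition's train.csv in ./data/ (as in the code further below); the specific columns shown (Age, Pclass, Sex) are just illustrative choices.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train_df = pd.read_csv('./data/train.csv')

# Numerical columns: count, mean, std, min, 25%/50%/75% quantiles, max
print(train_df.describe())

# Object (string) columns: count, unique, top, freq
print(train_df.describe(include=['O']))

# Survival rate per passenger class: a quick look at a categorical feature vs. Survived
print(train_df[['Pclass', 'Survived']].groupby('Pclass', as_index=False).mean())

# Histogram of a numerical feature (Age), split by Survived; drop missing ages first
g = sns.FacetGrid(train_df.dropna(subset=['Age']), col='Survived')
g.map(plt.hist, 'Age', bins=20)

# Point plot of survival rate by Pclass and Sex (categorical features)
sns.pointplot(x='Pclass', y='Survived', hue='Sex', data=train_df)
plt.show()
```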


2. https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

The second kernel mostly lets the code speak for itself. Its feature engineering is similar to the first kernel's, so once you are familiar with the first one, the second is much easier to follow; it is also a good opportunity to reflect on and summarize how to learn from such kernels.

Overall, building good feature engineering requires being fairly comfortable with visual exploratory analysis.

The code published online contains a few errors; the corrected feature engineering code is below:

# Load in our libraries

import pandas as pd

import numpy as np

import random as rnd

import re

# visualization

import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline

# machine learning

from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier

from sklearn.svm import SVC, LinearSVC

from sklearn.neighbors import KNeighborsClassifier

from sklearn.naive_bayes import GaussianNB

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, ExtraTreesClassifier)

# KFold has moved from sklearn.cross_validation to sklearn.model_selection in newer scikit-learn
from sklearn.model_selection import KFold

train_df = pd.read_csv('./data/train.csv')

test_df = pd.read_csv('./data/test.csv')

combine = [train_df, test_df]

train_df['Name_length'] = train_df['Name'].apply(len)

test_df['Name_length'] = test_df['Name'].apply(len)

# Feature that tells whether a passenger had a cabin on the Titanic

train_df['Has_Cabin'] = train_df["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

test_df['Has_Cabin'] = test_df["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

# train_df.info()

# test_df.info()

# Drop unused columns (note: combine still references the original DataFrames, which are dropped again inside the loop below)

train_df = train_df.drop(['Cabin', 'PassengerId', 'Ticket'], axis=1)

test_df = test_df.drop(['Cabin', 'PassengerId'], axis=1)

# print(train_df.describe())

# print(train_df.info())

# print(train_df)

def get_title(name):

    # Search for a title (Mr, Mrs, Miss, ...): a word preceded by a space and followed by a period
    title_search = re.search(r' ([A-Za-z]+)\.', name)

    # If the title exists, extract and return it.

    if title_search:

        return title_search.group(1)

    return ""

full_df = []

for dataset in combine:

    # Handle the object (categorical) columns

    dataset = dataset.drop(['Cabin', 'PassengerId', 'Ticket'], axis=1)


    #Title

    dataset['Title'] = dataset['Name'].apply(get_title)

    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')

    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')

    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

    # Mapping titles

    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

    dataset['Title'] = dataset['Title'].map(title_mapping)

    dataset['Title'] = dataset['Title'].fillna(0)

    dataset = dataset.drop('Name',axis=1)



#    print dataset['Title'].groupby('Title')

    dataset['Sex'] = dataset['Sex'].map({'male':0, 'female':1}).astype(int)

    dataset['Embarked'] = dataset['Embarked'].fillna('S').map({'S':0, 'C':1, 'Q':2}).astype(int)


    # Handle the numerical columns

    dataset['Pclass'] = dataset['Pclass'].fillna(dataset['Pclass'].mean())

    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

    # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc

    dataset['IsAlone'] = 0

    dataset.loc[dataset['FamilySize']==1,'IsAlone'] = 1

    dataset = dataset.drop(['SibSp'], axis=1)



    # Handle the Fare column: from describe(), the 25%/50%/75% quantiles are 7.9104, 14.4542 and 31

    #sns.distplot(train_df['Fare'], bins=10, rug=True);

    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].dropna().mean())

    dataset.loc[dataset['Fare']<7.9104, 'Fare'] = 0

    dataset.loc[(dataset['Fare']>=7.9104) & (dataset['Fare']<14.4542), 'Fare'] = 1

    dataset.loc[(dataset['Fare']>=14.4542) & (dataset['Fare']<31), 'Fare'] = 2

    dataset.loc[dataset['Fare']>=31, 'Fare'] = 3

    dataset['Fare'] = dataset['Fare'].astype(int)



    # Handle the Age column

    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].dropna().median())

#    sns.distplot(dataset['Age'], bins=10, rug=True);

    # Mapping Age

    dataset.loc[ dataset['Age'] <= 16, 'Age']       = 0

    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1

    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2

    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3

    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4

    dataset['Age'] = dataset['Age'].astype(int)


    full_df.append(dataset)

#    print(dataset.head(3))

#    print('end ******')

train, test = full_df

# Some useful parameters which will come in handy later on

ntrain = train.shape[0]

ntest = test.shape[0]

SEED = 0 # for reproducibility

NFOLDS = 5 # set folds for out-of-fold prediction

# Newer KFold API: n_splits instead of (n, n_folds); random_state only takes effect with shuffle=True
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)

rf_params = {

    'n_jobs': -1,

    'n_estimators': 500,

    'warm_start': True,

    #'max_features': 0.2,

    'max_depth': 6,

    'min_samples_leaf': 2,

    'max_features' : 'sqrt',

    'verbose': 0

}

# Extra Trees Parameters

et_params = {

    'n_jobs': -1,

    'n_estimators':500,

    #'max_features': 0.5,

    'max_depth': 8,

    'min_samples_leaf': 2,

    'verbose': 0

}

# AdaBoost parameters

ada_params = {

    'n_estimators': 500,

    'learning_rate' : 0.75

}

# Gradient Boosting parameters

gb_params = {

    'n_estimators': 500,

    #'max_features': 0.2,

    'max_depth': 5,

    'min_samples_leaf': 2,

    'verbose': 0

}

# Support Vector Classifier parameters

svc_params = {

    'kernel' : 'linear',

    'C' : 0.025

}

# Class to extend the Sklearn classifier

class SklearnHelper(object):

    def __init__(self, clf, seed=0, params=None):

        params['random_state'] = seed

        self.clf = clf(**params)

    def train(self, x_train, y_train):

        self.clf.fit(x_train, y_train)

    def predict(self, x):

        return self.clf.predict(x)


    def fit(self,x,y):

        return self.clf.fit(x,y)


    def feature_importances(self, x, y):

        # Fit, print and also return the importances, so the assignments at the end of the script capture them
        print(self.clf.fit(x, y).feature_importances_)
        return self.clf.feature_importances_


# Out-of-fold (OOF) predictions: train on K-1 folds and predict the held-out fold, so the
# first-level predictions used as second-level features never come from data the model was
# trained on; test-set predictions are averaged over the K fold models.

def get_oof(clf, x_train, y_train, x_test):

    oof_train = np.zeros((ntrain,))

    oof_test = np.zeros((ntest,))

    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):

        x_tr = x_train[train_index]

        y_tr = y_train[train_index]

        x_te = x_train[test_index]

        clf.train(x_tr, y_tr)

        # Predictions on the held-out fold become this classifier's training-set meta-feature
        oof_train[test_index] = clf.predict(x_te)

        # Predictions on the full test set from the model trained on this fold
        oof_test_skf[i, :] = clf.predict(x_test)

    # Average the per-fold test-set predictions to form the test-set meta-feature
    oof_test[:] = oof_test_skf.mean(axis=0)

    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

print('*' * 50)

print(rf_params)


# Create 5 objects, one for each of our 5 first-level models

rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)

et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)

ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)

gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)

svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

# Create Numpy arrays of train, test and target ( Survived) dataframes to feed into our models

y_train = train['Survived'].ravel()

train = train.drop(['Survived'], axis=1)

x_train = train.values # Creates an array of the train data

x_test = test.values # Creates an array of the test data

# Create our OOF train and test predictions. These base results will be used as new features

et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees

rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest

ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost

gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost

svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier

print("Training is complete")

rf_feature = rf.feature_importances(x_train,y_train)

et_feature = et.feature_importances(x_train, y_train)

ada_feature = ada.feature_importances(x_train, y_train)

gb_feature = gb.feature_importances(x_train,y_train)
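The pasted code stops after the first-level out-of-fold predictions. In the original kernel, these OOF arrays are concatenated into new feature matrices and fed to a second-level XGBoost model; the sketch below continues from the variables above in that spirit, with hyperparameter values that are illustrative rather than tuned.

```python
# Second-level (stacking) model: the base models' OOF predictions become the new features.
import numpy as np
import xgboost as xgb

x_train_stack = np.concatenate((et_oof_train, rf_oof_train, ada_oof_train,
                                gb_oof_train, svc_oof_train), axis=1)
x_test_stack = np.concatenate((et_oof_test, rf_oof_test, ada_oof_test,
                               gb_oof_test, svc_oof_test), axis=1)

gbm = xgb.XGBClassifier(
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
)
gbm.fit(x_train_stack, y_train)

# Final predictions of the stacked ensemble on the test set
stacked_predictions = gbm.predict(x_test_stack)
```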
