kaggle系列(一、Titanic入门比赛)

Table of Contents

  • 1  背景介绍
  • 2  数据导入与分析
    • 2.1  导入有用的包
    • 2.2  导入数据
    • 2.3  去除离群点
    • 2.4  连接训练数据和测试数据
    • 2.5  查看缺失值
  • 3  特征分析与数据前处理
    • 3.1  数值变量
      • 3.1.1  Explore SibSp feature vs Survived
      • 3.1.2  Explore Parch feature vs Survived
      • 3.1.3  Explore Age distibution
        • 3.1.3.1  Age缺失值填补
      • 3.1.4  Explore Fare distribution
    • 3.2  分类变量
      • 3.2.1  sex
      • 3.2.2  Explore Pclass vs Survived
      • 3.2.3  Explore Pclass vs Survived by Sex
      • 3.2.4  Explore Embarked vs Survived
      • 3.2.5  Explore Pclass vs Embarked
  • 4  特征工程
    • 4.1  name
    • 4.2  SibSp,Parch
    • 4.3  Embarked
    • 4.4  Cabin
    • 4.5  Ticket
    • 4.6  Pclass
    • 4.7  Age
    • 4.8  Fare
    • 4.9  PassengerId
  • 5  基本模型求解
    • 5.1  简单模型
      • 5.1.1  KNN
      • 5.1.2  Logistic regression
      • 5.1.3  Naive Bayes
      • 5.1.4  SVC
    • 5.2  单一集成模型
      • 5.2.1  Random Forest
      • 5.2.2  ExtraTrees
      • 5.2.3  Gradient boosting
      • 5.2.4  xgboost
      • 5.2.5  Plot learning curves
      • 5.2.6  Feature importance of tree based classifiers
    • 5.3  多种模型集成方法
      • 5.3.1  四个集成模型投票法
      • 5.3.2  stacking法
  • 6  总结
  • 7  参考资料

背景介绍

发生在1912年的泰坦尼克事件,导致船上2224名游客阵亡1502(我们的男主角也牺牲了),我们掌握船上乘客的一些数据以及一部分乘客是否获救的信息。我们希望能通过探索这些数据,发现一些不为人知的秘密,顺便预测下另外一部分乘客是否能够获救!

数据导入与分析

导入有用的包

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from collections import Counter

from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier,\
      ExtraTreesClassifier,VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve

导入数据

In [2]:
train = pd.read_csv("C:/Code/Kaggle/Titanic/train.csv")
test = pd.read_csv("C:/Code/Kaggle/Titanic/test.csv")
IDtest = test["PassengerId"]

去除离群点

In [3]:
def detect_outliers(df,n,features):
    outlier_indices = []
    for col in features:
        Q1 = np.percentile(df[col],25)
        Q3 = np.percentile(df[col],75)
        IQR = Q3 - Q1
        outlier_step = 1.5 * IQR
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)

    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v>n)
    return multiple_outliers
Outliers_to_drop = detect_outliers(train,2,["Age","SibSp","Parch","Fare"])
In [4]:
train.loc[Outliers_to_drop]
Out[4]:
  PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.00 C23 C25 C27 S
88 89 1 1 Fortune, Miss. Mabel Helen female 23.0 3 2 19950 263.00 C23 C25 C27 S
159 160 0 3 Sage, Master. Thomas Henry male NaN 8 2 CA. 2343 69.55 NaN S
180 181 0 3 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.55 NaN S
201 202 0 3 Sage, Mr. Frederick male NaN 8 2 CA. 2343 69.55 NaN S
792 793 0 3 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.55 NaN S
324 325 0 3 Sage, Mr. George John Jr male NaN 8 2 CA. 2343 69.55 NaN S
846 847 0 3 Sage, Mr. Douglas Bullen male NaN 8 2 CA. 2343 69.55 NaN S
341 342 1 1 Fortune, Miss. Alice Elizabeth female 24.0 3 2 19950 263.00 C23 C25 C27 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.55 NaN S
In [5]:
train = train.drop(Outliers_to_drop,axis=0).reset_index(drop=True)

连接训练数据和测试数据

In [6]:
train_len = len(train)
dataset = pd.concat([train,test], axis=0).reset_index(drop=True)
dataset.tail()
Out[6]:
  Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
1294 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236
1295 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758
1296 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262
1297 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309
1298 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668

查看缺失值

In [7]:
#dataset = dataset.fillna(np.nan)
dataset.isnull().sum()
Out[7]:
Age             256
Cabin          1007
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64
In [8]:
train.info()
train.isnull().sum()

RangeIndex: 881 entries, 0 to 880
Data columns (total 12 columns):
PassengerId    881 non-null int64
Survived       881 non-null int64
Pclass         881 non-null int64
Name           881 non-null object
Sex            881 non-null object
Age            711 non-null float64
SibSp          881 non-null int64
Parch          881 non-null int64
Ticket         881 non-null object
Fare           881 non-null float64
Cabin          201 non-null object
Embarked       879 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 82.7+ KB
Out[8]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            170
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          680
Embarked         2
dtype: int64
In [9]:
train.describe()
Out[9]:
  PassengerId Survived Pclass Age SibSp Parch Fare
count 881.000000 881.000000 881.000000 711.000000 881.000000 881.000000 881.000000
mean 446.713961 0.385925 2.307605 29.731603 0.455165 0.363224 31.121566
std 256.617021 0.487090 0.835055 14.547835 0.871571 0.791839 47.996249
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 226.000000 0.000000 2.000000 20.250000 0.000000 0.000000 7.895800
50% 448.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.000000 1.000000 3.000000 38.000000 1.000000 0.000000 30.500000
max 891.000000 1.000000 3.000000 80.000000 5.000000 6.000000 512.329200

特征分析与数据前处理

数值变量

In [10]:
g = sns.heatmap(train[["Survived","SibSp","Parch","Age","Fare"]].corr(), annot=True, fmt=".2f", cmap = "coolwarm")

Explore SibSp feature vs Survived

In [11]:
g = sns.factorplot(x="SibSp",y="Survived",data=train,kind="bar")
g = g.set_ylabels("survival probability")

Explore Parch feature vs Survived

In [12]:
g  = sns.factorplot(x="Parch",y="Survived",data=train,kind="bar")
g = g.set_ylabels("survival probability")

Explore Age distibution

In [13]:
g = sns.kdeplot(train["Age"][(train["Survived"] == 0) & (train["Age"].notnull())], color="Red", shade = True)
g = sns.kdeplot(train["Age"][(train["Survived"] == 1) & (train["Age"].notnull())], ax =g, color="Blue", shade= True)
g.set_xlabel("Age")
g.set_ylabel("Frequency")
g = g.legend(["Not Survived","Survived"])

Age缺失值填补

我们发现数据集中总共有256个缺失值,并且年龄对存活率有着不小的影响(小孩更容易存活),所以我们需要保留这个特征,并且将缺失值进行填充

(As we see, Age column contains 256 missing values in the whole dataset. Since there is subpopulations that have more chance to survive (children for example), it is preferable to keep the age feature and to impute the missing values. To adress this problem, i looked at the most correlated features with Age (Sex, Parch , Pclass and SibSP).)

填充缺失值的三种方法

Completing a numerical continuous feature

Now we should start estimating and completing features with missing or null values. We will first do this for the Age feature.

We can consider three methods to complete a numerical continuous feature.

A simple way is to generate random numbers between mean and standard deviation.

More accurate way of guessing missing values is to use other correlated features. In our case we note correlation among Age, Gender, and Pclass. Guess Age values using median values for Age across sets of Pclass and Gender feature combinations. So, median Age for Pclass=1 and Gender=0, Pclass=1 and Gender=1, and so on...

Combine methods 1 and 2. So instead of guessing age values based on median, use random numbers between mean and standard deviation, based on sets of Pclass and Gender combinations.

Method 1 and 3 will introduce random noise into our models. The results from multiple executions might vary. We will prefer method 2.

In [14]:
# Explore Age vs Sex, Parch , Pclass and SibSP
g = sns.factorplot(y="Age",x="Sex",data=dataset,kind="box")
g = sns.factorplot(y="Age",x="Sex",hue="Pclass", data=dataset,kind="box")
g = sns.factorplot(y="Age",x="Parch", data=dataset,kind="box")
g = sns.factorplot(y="Age",x="SibSp", data=dataset,kind="box")
In [15]:
# convert Sex into categorical value 0 for male and 1 for female
dataset["Sex"] = dataset["Sex"].map({"male": 0, "female":1})
g = sns.heatmap(dataset[["Age","Sex","SibSp","Parch","Pclass"]].corr(),cmap="coolwarm",annot=True)
In [16]:
# Filling missing value of Age 

## Fill Age with the median age of similar rows according to Pclass, Parch and SibSp
# Index of NaN age rows
index_NaN_age = list(dataset["Age"][dataset["Age"].isnull()].index)

for i in index_NaN_age :
    age_med = dataset["Age"].median()
    age_pred = dataset["Age"][((dataset['SibSp'] == dataset.iloc[i]["SibSp"]) & (dataset['Parch'] == dataset.iloc[i]["Parch"]) & (dataset['Pclass'] == dataset.iloc[i]["Pclass"]))].median()
    if not np.isnan(age_pred) :
        dataset['Age'].iloc[i] = age_pred
    else :
        dataset['Age'].iloc[i] = age_med
dataset.tail()
Out[16]:
  Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
1294 25.0 NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 0 0 NaN A.5. 3236
1295 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 1 0 NaN PC 17758
1296 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 0 0 NaN SOTON/O.Q. 3101262
1297 25.0 NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 0 0 NaN 359309
1298 16.0 NaN C 22.3583 Peter, Master. Michael J 1 1309 3 0 1 NaN 2668
In [17]:
g = sns.factorplot(x="Survived", y = "Age",data = train, kind="box")
g = sns.factorplot(x="Survived", y = "Age",data = train, kind="violin")

Explore Fare distribution

In [18]:
#Fill Fare missing values with the median value
dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())

g = sns.distplot(dataset["Fare"], color="m")
g = g.legend(loc="best")
In [19]:
# Apply log to Fare to reduce skewness distribution
dataset["Fare"] = dataset["Fare"].map(lambda i: np.log(i) if i > 0 else 0)
g = sns.distplot(dataset["Fare"], color="b")
g = g.legend(loc="best")

分类变量

sex

In [20]:
g = sns.factorplot(x="Sex",y="Survived",data=train,kind="bar")
g = g.set_ylabels("Survival Probability")
In [21]:
train[["Sex","Survived"]].groupby('Sex').mean()
Out[21]:
  Survived
Sex  
female 0.747573
male 0.190559

Explore Pclass vs Survived

In [22]:
g = sns.factorplot(x="Pclass",y="Survived",data=train,kind="bar", size = 6 , 
palette = "muted")
g = g.set_ylabels("survival probability")

Explore Pclass vs Survived by Sex

In [23]:
g = sns.factorplot(x="Pclass", y="Survived", hue="Sex", data=train,
                  size=6, kind="bar", palette="muted")
g = g.set_ylabels("survival probability")

Explore Embarked vs Survived

In [24]:
#Fill Embarked nan values of dataset set with 'S' most frequent value
dataset["Embarked"] = dataset["Embarked"].fillna("S")

g = sns.factorplot(x="Embarked", y="Survived",  data=train,
                   size=6, kind="bar", palette="muted")
g = g.set_ylabels("survival probability")

Explore Pclass vs Embarked

In [25]:
# Explore Pclass vs Embarked 
g = sns.factorplot("Pclass", col="Embarked",  data=train,
                   size=6, kind="count", palette="muted")
g = g.set_ylabels("Count")

特征工程

name

In [26]:
dataset["Name"].head()
Out[26]:
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
In [27]:
# Get Title from Name
dataset_title = [i.split(",")[1].split(".")[0].strip() for i in dataset["Name"]]
dataset["Title"] = pd.Series(dataset_title)
dataset["Title"].head()
Out[27]:
0      Mr
1     Mrs
2    Miss
3     Mrs
4      Mr
Name: Title, dtype: object
In [28]:
g = sns.countplot(x="Title",data=dataset)
g = plt.setp(g.get_xticklabels(), rotation=45)
In [29]:
# Convert to categorical values Title 
dataset["Title"] = dataset["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
dataset["Title"] = dataset["Title"].map({"Master":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Mr":2, "Rare":3})
dataset["Title"] = dataset["Title"].astype(int)
In [30]:
g = sns.countplot(dataset["Title"])
g = g.set_xticklabels(["Master","Miss/Ms/Mme/Mlle/Mrs","Mr","Rare"])
In [31]:
g = sns.factorplot(x="Title",y="Survived",data=dataset[:train_len],kind="bar")
g = g.set_xticklabels(["Master","Miss-Mrs","Mr","Rare"])
g = g.set_ylabels("survival probability")
In [32]:
# Drop Name variable
dataset.drop(labels = ["Name"], axis = 1, inplace = True)   #inplace为True时返回None,为默认False时返回dataset
In [33]:
# convert to indicator values Title and Embarked 
Title_dummies = pd.get_dummies(dataset['Title'],prefix='Title')
dataset = dataset.join(Title_dummies).drop(['Title'],axis=1)
#dataset = pd.get_dummies(dataset, columns = ["Title"])
dataset.drop(['Title_3'],axis=1,inplace=True) #这里去掉存活率最低的一列(冗余特征)

SibSp,Parch

In [34]:
# Create a family size descriptor from SibSp and Parch
dataset["Fsize"] = dataset["SibSp"] + dataset["Parch"] + 1
g = sns.factorplot(x="Fsize",y="Survived",data = dataset)
g = g.set_ylabels("Survival Probability")
In [35]:
# Create new feature of family size
dataset['Single'] = dataset['Fsize'].map(lambda s: 1 if s == 1 else 0)
dataset['SmallF'] = dataset['Fsize'].map(lambda s: 1 if  s == 2  else 0)
dataset['MedF'] = dataset['Fsize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
dataset['LargeF'] = dataset['Fsize'].map(lambda s: 1 if s >= 5 else 0)
In [36]:
dataset.drop(['Fsize','SibSp','Parch'],axis=1,inplace=True)
In [37]:
dataset.columns
Out[37]:
Index([u'Age', u'Cabin', u'Embarked', u'Fare', u'PassengerId', u'Pclass',
       u'Sex', u'Survived', u'Ticket', u'Title_0', u'Title_1', u'Title_2',
       u'Single', u'SmallF', u'MedF', u'LargeF'],
      dtype='object')

Embarked

In [38]:
dataset[:train_len][['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Embarked', ascending=True)
Out[38]:
  Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.341195
In [39]:
#dataset = pd.get_dummies(dataset, columns = ["Embarked"], prefix="Em")
Embarked_dummies = pd.get_dummies(dataset['Embarked'], prefix='Em')
dataset = dataset.join(Embarked_dummies).drop(['Embarked'],axis=1)
dataset.drop(['Em_S'],axis=1,inplace=True)
In [40]:
dataset.columns
Out[40]:
Index([u'Age', u'Cabin', u'Fare', u'PassengerId', u'Pclass', u'Sex',
       u'Survived', u'Ticket', u'Title_0', u'Title_1', u'Title_2', u'Single',
       u'SmallF', u'MedF', u'LargeF', u'Em_C', u'Em_Q'],
      dtype='object')

Cabin

In [41]:
dataset["Cabin"].head()
Out[41]:
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
In [42]:
dataset["Cabin"].describe()
Out[42]:
count                 292
unique                186
top       B57 B59 B63 B66
freq                    5
Name: Cabin, dtype: object
In [43]:
dataset["Cabin"].isnull().sum()
Out[43]:
1007
In [44]:
dataset["Cabin"][dataset["Cabin"].notnull()].head()
Out[44]:
1      C85
3     C123
6      E46
10      G6
11    C103
Name: Cabin, dtype: object
In [45]:
# Replace the Cabin number by the type of cabin 'X' if not
dataset["Cabin"] = pd.Series([i[0] if not pd.isnull(i) else 'X' for i in dataset['Cabin'] ])
In [46]:
g = sns.countplot(dataset["Cabin"],order=['A','B','C','D','E','F','G','T','X'])
In [47]:
g = sns.factorplot(y="Survived",x="Cabin",data=dataset[:train_len],kind="bar",order=['A','B','C','D','E','F','G','T','X'])
g = g.set_ylabels("Survival Probability")
In [48]:
dataset = pd.get_dummies(dataset, columns = ["Cabin"],prefix="Cabin")
dataset.drop(['Cabin_T'],axis=1,inplace=True)

Ticket

In [49]:
dataset["Ticket"].head()
Out[49]:
0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object
In [50]:
## Treat Ticket by extracting the ticket prefix. When there is no prefix it returns X. 

Ticket = []
for i in list(dataset.Ticket):
    if not i.isdigit() :
        Ticket.append(i.replace(".","").replace("/","").strip().split(' ')[0]) #Take prefix
    else:
        Ticket.append("X")
        
dataset["Ticket"] = Ticket
dataset["Ticket"].head()
Out[50]:
0        A5
1        PC
2    STONO2
3         X
4         X
Name: Ticket, dtype: object
In [51]:
dataset[:train_len][['Ticket', 'Survived']].groupby(['Ticket'], as_index=False).mean().sort_values(by='Ticket', ascending=True)
Out[51]:
  Ticket Survived
0 A4 0.000000
1 A5 0.095238
2 AS 0.000000
3 C 0.400000
4 CA 0.411765
5 CASOTON 0.000000
6 FC 0.000000
7 FCC 0.800000
8 Fa 0.000000
9 LINE 0.250000
10 PC 0.650000
11 PP 0.666667
12 PPP 0.500000
13 SC 1.000000
14 SCA4 0.000000
15 SCAH 0.666667
16 SCOW 0.000000
17 SCPARIS 0.428571
18 SCParis 0.500000
19 SOC 0.166667
20 SOP 0.000000
21 SOPP 0.000000
22 SOTONO2 0.000000
23 SOTONOQ 0.133333
24 SP 0.000000
25 STONO 0.416667
26 STONO2 0.500000
27 SWPP 1.000000
28 WC 0.100000
29 WEP 0.333333
30 X 0.382979
In [52]:
dataset = pd.get_dummies(dataset, columns = ["Ticket"], prefix="T")
dataset.drop(['T_A4'],axis=1,inplace=True)

Pclass

In [53]:
dataset[:train_len][['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Pclass', ascending=True)
Out[53]:
  Pclass Survived
0 1 0.629108
1 2 0.472826
2 3 0.245868
In [54]:
# Create categorical values for Pclass
dataset["Pclass"] = dataset["Pclass"].astype("category")
dataset = pd.get_dummies(dataset, columns = ["Pclass"],prefix="Pc")
dataset.drop(['Pc_3'],axis=1,inplace=True)

Age

In [55]:
dataset['Age']=dataset['Age'].astype(int)
dataset['AgeBand'] = pd.cut(dataset['Age'], 5)
dataset[:train_len][['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
Out[55]:
  AgeBand Survived
0 (-0.08, 16.0] 0.532110
1 (16.0, 32.0] 0.340336
2 (32.0, 48.0] 0.412037
3 (48.0, 64.0] 0.434783
4 (64.0, 80.0] 0.090909
In [56]:
dataset.tail()
Out[56]:
  Age Fare PassengerId Sex Survived Title_0 Title_1 Title_2 Single SmallF ... T_STONO T_STONO2 T_STONOQ T_SWPP T_WC T_WEP T_X Pc_1 Pc_2 AgeBand
1294 25 2.085672 1305 0 NaN 0 0 1 1 0 ... 0 0 0 0 0 0 0 0 0 (16.0, 32.0]
1295 39 4.690430 1306 1 NaN 0 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 (32.0, 48.0]
1296 38 1.981001 1307 0 NaN 0 0 1 1 0 ... 0 0 0 0 0 0 0 0 0 (32.0, 48.0]
1297 25 2.085672 1308 0 NaN 0 0 1 1 0 ... 0 0 0 0 0 0 1 0 0 (16.0, 32.0]
1298 16 3.107198 1309 0 NaN 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 (-0.08, 16.0]

5 rows × 61 columns

In [57]:
dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
dataset.loc[ dataset['Age'] > 64, 'Age'] = 4 

Age_dummies = pd.get_dummies(dataset['Age'], prefix='Age')
dataset=dataset.join(Age_dummies).drop(['Age','AgeBand'],axis=1)
dataset.drop(['Age_4'],axis=1,inplace=True)

Fare

In [58]:
dataset['FareBand'] = pd.cut(dataset['Fare'], 4)
dataset[:train_len][['FareBand','Survived']].groupby(['FareBand'],as_index=False).mean().sort_values(by='FareBand', ascending=True)
Out[58]:
  FareBand Survived
0 (-0.00624, 1.56] 0.062500
1 (1.56, 3.119] 0.288719
2 (3.119, 4.679] 0.517007
3 (4.679, 6.239] 0.750000
In [59]:
dataset.columns
Out[59]:
Index([u'Fare', u'PassengerId', u'Sex', u'Survived', u'Title_0', u'Title_1',
       u'Title_2', u'Single', u'SmallF', u'MedF', u'LargeF', u'Em_C', u'Em_Q',
       u'Cabin_A', u'Cabin_B', u'Cabin_C', u'Cabin_D', u'Cabin_E', u'Cabin_F',
       u'Cabin_G', u'Cabin_X', u'T_A', u'T_A5', u'T_AQ3', u'T_AQ4', u'T_AS',
       u'T_C', u'T_CA', u'T_CASOTON', u'T_FC', u'T_FCC', u'T_Fa', u'T_LINE',
       u'T_LP', u'T_PC', u'T_PP', u'T_PPP', u'T_SC', u'T_SCA3', u'T_SCA4',
       u'T_SCAH', u'T_SCOW', u'T_SCPARIS', u'T_SCParis', u'T_SOC', u'T_SOP',
       u'T_SOPP', u'T_SOTONO2', u'T_SOTONOQ', u'T_SP', u'T_STONO', u'T_STONO2',
       u'T_STONOQ', u'T_SWPP', u'T_WC', u'T_WEP', u'T_X', u'Pc_1', u'Pc_2',
       u'Age_0', u'Age_1', u'Age_2', u'Age_3', u'FareBand'],
      dtype='object')
In [60]:
dataset.loc[ dataset['Fare'] <= 1.56, 'Fare'] = 0
dataset.loc[(dataset['Fare'] > 1.56) & (dataset['Fare'] <= 3.119), 'Fare'] = 1
dataset.loc[(dataset['Fare'] > 3.119) & (dataset['Fare'] <= 4.679), 'Fare']   = 2
dataset.loc[ dataset['Fare'] > 4.679, 'Fare'] = 3
Fare_dummies = pd.get_dummies(dataset['Fare'], prefix='Fare')
dataset = dataset.join(Fare_dummies).drop(['Fare','FareBand'],axis=1)
In [61]:
dataset.columns
Out[61]:
Index([u'PassengerId', u'Sex', u'Survived', u'Title_0', u'Title_1', u'Title_2',
       u'Single', u'SmallF', u'MedF', u'LargeF', u'Em_C', u'Em_Q', u'Cabin_A',
       u'Cabin_B', u'Cabin_C', u'Cabin_D', u'Cabin_E', u'Cabin_F', u'Cabin_G',
       u'Cabin_X', u'T_A', u'T_A5', u'T_AQ3', u'T_AQ4', u'T_AS', u'T_C',
       u'T_CA', u'T_CASOTON', u'T_FC', u'T_FCC', u'T_Fa', u'T_LINE', u'T_LP',
       u'T_PC', u'T_PP', u'T_PPP', u'T_SC', u'T_SCA3', u'T_SCA4', u'T_SCAH',
       u'T_SCOW', u'T_SCPARIS', u'T_SCParis', u'T_SOC', u'T_SOP', u'T_SOPP',
       u'T_SOTONO2', u'T_SOTONOQ', u'T_SP', u'T_STONO', u'T_STONO2',
       u'T_STONOQ', u'T_SWPP', u'T_WC', u'T_WEP', u'T_X', u'Pc_1', u'Pc_2',
       u'Age_0', u'Age_1', u'Age_2', u'Age_3', u'Fare_0.0', u'Fare_1.0',
       u'Fare_2.0', u'Fare_3.0'],
      dtype='object')
In [62]:
dataset.drop(['Fare_0.0'],axis=1,inplace=True)

PassengerId

In [63]:
# Drop useless variables 
dataset.drop(labels = ["PassengerId"], axis = 1, inplace = True)
In [64]:
dataset.head()
Out[64]:
  Sex Survived Title_0 Title_1 Title_2 Single SmallF MedF LargeF Em_C ... T_X Pc_1 Pc_2 Age_0 Age_1 Age_2 Age_3 Fare_1.0 Fare_2.0 Fare_3.0
0 0 0.0 0 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0 1 0 0
1 1 1.0 0 1 0 0 1 0 0 1 ... 0 1 0 0 0 1 0 0 1 0
2 1 1.0 0 1 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 1 0 0
3 1 1.0 0 1 0 0 1 0 0 0 ... 1 1 0 0 0 1 0 0 1 0
4 0 0.0 0 0 1 1 0 0 0 0 ... 1 0 0 0 0 1 0 1 0 0

5 rows × 64 columns

In [65]:
dataset.columns
Out[65]:
Index([u'Sex', u'Survived', u'Title_0', u'Title_1', u'Title_2', u'Single',
       u'SmallF', u'MedF', u'LargeF', u'Em_C', u'Em_Q', u'Cabin_A', u'Cabin_B',
       u'Cabin_C', u'Cabin_D', u'Cabin_E', u'Cabin_F', u'Cabin_G', u'Cabin_X',
       u'T_A', u'T_A5', u'T_AQ3', u'T_AQ4', u'T_AS', u'T_C', u'T_CA',
       u'T_CASOTON', u'T_FC', u'T_FCC', u'T_Fa', u'T_LINE', u'T_LP', u'T_PC',
       u'T_PP', u'T_PPP', u'T_SC', u'T_SCA3', u'T_SCA4', u'T_SCAH', u'T_SCOW',
       u'T_SCPARIS', u'T_SCParis', u'T_SOC', u'T_SOP', u'T_SOPP', u'T_SOTONO2',
       u'T_SOTONOQ', u'T_SP', u'T_STONO', u'T_STONO2', u'T_STONOQ', u'T_SWPP',
       u'T_WC', u'T_WEP', u'T_X', u'Pc_1', u'Pc_2', u'Age_0', u'Age_1',
       u'Age_2', u'Age_3', u'Fare_1.0', u'Fare_2.0', u'Fare_3.0'],
      dtype='object')

基本模型求解

In [66]:
## Separate train dataset and test dataset

train = dataset[:train_len]
test = dataset[train_len:]
test.drop(labels=["Survived"],axis = 1,inplace=True)
In [67]:
## Separate train features and label 

train["Survived"] = train["Survived"].astype(int)

Y_train = train["Survived"]

X_train = train.drop(labels = ["Survived"],axis = 1)

简单模型

1.KNN

2.Logistic regression

3.Naive_bayes

4.SVC

In [68]:
# Cross validate model with Kfold stratified cross val
kfold = StratifiedKFold(n_splits=10)

KNN

In [69]:
k_range=list([16,18])
knn_param_grid={'n_neighbors' : k_range}
gridKNN = GridSearchCV(KNeighborsClassifier(),param_grid = knn_param_grid, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)
gridKNN.fit(X_train,Y_train)
print(gridKNN.best_estimator_)
print(gridKNN.best_score_)
Fitting 10 folds for each of 2 candidates, totalling 20 fits
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=16, p=2,
           weights='uniform')
0.821793416572
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    4.0s finished

Logistic regression

In [70]:
LR_param_grid={'penalty' : ['l1', 'l2'], 'C' : [0.001,0.01,0.1,1,10,100]}
gridLR = GridSearchCV(LogisticRegression(),param_grid = LR_param_grid, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)
gridLR.fit(X_train,Y_train)
print(gridLR.best_estimator_)
print(gridLR.best_score_)
Fitting 10 folds for each of 12 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    4.6s
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.822928490352
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    7.6s finished

Naive Bayes

In [71]:
from sklearn.naive_bayes import GaussianNB 
GaussianNB=GaussianNB()
GaussianNB.fit(X_train, Y_train)
NB_score=cross_val_score(GaussianNB,X_train,Y_train, cv = kfold,scoring = "accuracy").mean()
print(NB_score)
#不知怎么回事,计算有问题
0.431307456588

SVC

In [72]:
C=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
gamma=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
kernel=['rbf','linear']
SVC_param_grid={'kernel':kernel,'C':C,'gamma':gamma}
gridSVC = GridSearchCV(SVC(),param_grid = SVC_param_grid, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)
gridSVC.fit(X_train,Y_train)
print(gridSVC.best_estimator_)
print(gridSVC.best_score_)
Fitting 10 folds for each of 240 candidates, totalling 2400 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.5s
[Parallel(n_jobs=-1)]: Done 615 tasks      | elapsed:   12.2s
[Parallel(n_jobs=-1)]: Done 1315 tasks      | elapsed:   20.5s
[Parallel(n_jobs=-1)]: Done 2215 tasks      | elapsed:   31.1s
[Parallel(n_jobs=-1)]: Done 2385 out of 2400 | elapsed:   33.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 2400 out of 2400 | elapsed:   33.2s finished
SVC(C=0.6, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.3, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
0.83314415437

经过简单的调参处理,支持向量机的交叉验证预测效果最好!最终调参后的本地正确率为0.83314415437!

In [73]:
test_Survived = pd.Series(gridSVC.best_estimator_.predict(test), name="Survived")

results_SVC = pd.concat([IDtest,test_Survived],axis=1)

results_SVC.to_csv("SVC_predict.csv",index=False)

单一集成模型

1 Random Forest

2 Extra Trees

3 Gradient Boosting

4 xgboost

Random Forest

In [74]:
# RFC Parameters tunning 
RFC = RandomForestClassifier()


## Search grid for optimal parameters
rf_param_grid = {"n_estimators" :[300, 500],
                 "max_depth": [8, 15],
              "min_samples_split": [2, 5, 10],
              "min_samples_leaf": [1, 2, 5],
              "max_features": ['log2', 'sqrt']}


gsRFC = GridSearchCV(RFC,param_grid = rf_param_grid, cv=5, scoring="accuracy", n_jobs= -1, verbose = 1)

gsRFC.fit(X_train,Y_train)

RFC_best = gsRFC.best_estimator_

# Best score
gsRFC.best_score_ ,RFC_best
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   14.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   59.0s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:  2.1min finished
Out[74]:
(0.83200908059023837,
 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=8, max_features='log2', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False))
In [93]:
test_Survived = pd.Series(RFC_best.predict(test), name="Survived")

results_RFC_best = pd.concat([IDtest,test_Survived],axis=1)

results_RFC_best.to_csv("RFC_best.csv",index=False)

ExtraTrees

In [75]:
#ExtraTrees 
ExtC = ExtraTreesClassifier()


## Search grid for optimal parameters
ex_param_grid = {"max_depth": [8, 15],
              "max_features": ['log2', 'sqrt'],
              "min_samples_split": [2,5, 10],
              "min_samples_leaf": [1, 2, 5],
              "n_estimators" :[300, 500]}


gsExtC = GridSearchCV(ExtC,param_grid = ex_param_grid, cv=5, scoring="accuracy", n_jobs= -1, verbose = 1)

gsExtC.fit(X_train,Y_train)

ExtC_best = gsExtC.best_estimator_

# Best score
gsExtC.best_score_, ExtC_best
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   13.7s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   54.9s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:  1.7min finished
Out[75]:
(0.82973893303064694,
 ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=8, max_features='log2', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=False))

Gradient boosting

In [76]:
# Gradient boosting tunning

GBC = GradientBoostingClassifier()
gb_param_grid = {
              'learning_rate': [0.1, 0.05, 0.01],
              'max_depth': [3, 5, 10],
              'min_samples_leaf': [50,100,150],
             'max_features' :['sqrt','log2']
              }

gsGBC = GridSearchCV(GBC,param_grid = gb_param_grid,cv=5, scoring="accuracy", n_jobs= 4, verbose = 1)

gsGBC.fit(X_train,Y_train)

GBC_best = gsGBC.best_estimator_

# Best score
gsGBC.best_score_,GBC_best
Fitting 5 folds for each of 54 candidates, totalling 270 fits
[Parallel(n_jobs=4)]: Done  52 tasks      | elapsed:    3.1s
[Parallel(n_jobs=4)]: Done 270 out of 270 | elapsed:    7.2s finished
Out[76]:
(0.8161180476730987,
 GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.1, loss='deviance', max_depth=10,
               max_features='sqrt', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=50, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=100,
               presort='auto', random_state=None, subsample=1.0, verbose=0,
               warm_start=False))

xgboost

In [77]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

## Search grid for optimal parameters
xgb_param_grid = {"learning_rate": [0.01,0.5,1.0],
                  "n_estimators" : [300,500],
              "gamma": [0.1, 0.5,1.0],
              "max_depth": [3, 5, 10],
              "min_child_weight": [1, 3],
              "subsample" : [0.8,1.0],
                 "colsample_bytree" : [0.8,1.0]}


gridxgb = GridSearchCV(XGBClassifier(),param_grid = xgb_param_grid, cv=5, scoring="accuracy", n_jobs= -1, verbose = 1)

gridxgb.fit(X_train,Y_train)

gridxgb_best = gridxgb.best_estimator_

# Best score
gridxgb.best_score_
Fitting 5 folds for each of 432 candidates, totalling 2160 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    6.8s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   35.8s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed:  6.7min finished
Out[77]:
0.82973893303064694
In [78]:
print(gridxgb_best)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.8, gamma=1.0, learning_rate=0.01,
       max_delta_step=0, max_depth=3, min_child_weight=3, missing=None,
       n_estimators=300, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=1.0)
In [92]:
test_Survived = pd.Series(gridxgb_best.predict(test), name="Survived")

results_gridxgb = pd.concat([IDtest,test_Survived],axis=1)

results_gridxgb.to_csv("gridxgb.csv",index=False)

Plot learning curves

In [79]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve"""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

g = plot_learning_curve(gsRFC.best_estimator_,"RF mearning curves",X_train,Y_train,cv=kfold)
g = plot_learning_curve(gsExtC.best_estimator_,"ExtraTrees learning curves",X_train,Y_train,cv=kfold)
g = plot_learning_curve(gsGBC.best_estimator_,"GradientBoosting learning curves",X_train,Y_train,cv=kfold)
g = plot_learning_curve(gridxgb.best_estimator_,"XGBoost learning curves",X_train,Y_train,cv=kfold)

Feature importance of tree based classifiers

In [80]:
nrows = ncols = 2
fig, axes = plt.subplots(nrows = nrows, ncols = ncols, sharex="all", figsize=(15,15))

names_classifiers = [("ExtraTrees",ExtC_best),("RandomForest",RFC_best),("GradientBoosting",GBC_best),("XGBoost", gridxgb_best)]

nclassifier = 0
for row in range(nrows):
    for col in range(ncols):
        name = names_classifiers[nclassifier][0]
        classifier = names_classifiers[nclassifier][1]
        indices = np.argsort(classifier.feature_importances_)[::-1][:40]
        g = sns.barplot(y=X_train.columns[indices][:40],x = classifier.feature_importances_[indices][:40] , orient='h',ax=axes[row][col])
        g.set_xlabel("Relative importance",fontsize=12)
        g.set_ylabel("Features",fontsize=12)
        g.tick_params(labelsize=9)
        g.set_title(name + " feature importance")
        nclassifier += 1
In [81]:
test_Survived_RFC = pd.Series(RFC_best.predict(test), name="RFC")
test_Survived_ExtC = pd.Series(ExtC_best.predict(test), name="ExtC")
test_Survived_GBC = pd.Series(GBC_best.predict(test), name="GBC")
test_Survived_xgb = pd.Series(gridxgb_best.predict(test), name="xgb")

# Concatenate all classifier results
ensemble_results = pd.concat([test_Survived_RFC,test_Survived_ExtC,test_Survived_GBC, test_Survived_xgb],axis=1)


g= sns.heatmap(ensemble_results.corr(),annot=True)

多种模型集成方法

四个集成模型投票法

In [82]:
votingC = VotingClassifier(estimators=[('rfc', RFC_best), ('extc', ExtC_best),
 ('gbc',GBC_best), ('xgb', gridxgb_best)], voting='soft', n_jobs=-1)

votingC = votingC.fit(X_train, Y_train)
In [83]:
test_Survived = pd.Series(votingC.predict(test), name="Survived")

results_votingC = pd.concat([IDtest,test_Survived],axis=1)

results_votingC.to_csv("ensemble_python_voting.csv",index=False)

stacking法

In [90]:
#第一层
class Ensemble_stacking1(object):
    def __init__(self, n_folds, base_models):
        self.n_folds = n_folds
        self.base_models = base_models
    def get_data_to2(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)
        folds = StratifiedKFold(n_splits=self.n_folds, shuffle=True, random_state=2016).split(X,y)
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):
            S_test_i = np.zeros((T.shape[0], self.n_folds))
            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]
                # y_holdout = y[test_idx]
                clf.fit(X_train, y_train)
                y_pred = clf.predict(X_holdout)[:]
                S_train[test_idx, i] = y_pred
                S_test_i[:, j] = clf.predict(T)[:]
            S_test[:, i] = S_test_i.mean(1)
        return S_train, S_test

#第二层
xgb2_param_grid = {"learning_rate": [0.01,0.5],
                  "n_estimators" : [300,500],
              "gamma": [0.1, 0.5,1.0],
              "max_depth": [3, 5, 10],
              "min_child_weight": [1, 3 , 5, 7],
              "subsample" : [0.8,1.0],
                 "colsample_bytree" : [0.6,0.8]}


gridxgb2 = GridSearchCV(XGBClassifier(),param_grid = xgb2_param_grid, cv=5, scoring="accuracy", n_jobs= -1, verbose = 1)


S_train, S_test = Ensemble_stacking1(5, [RFC_best,  ExtC_best, GBC_best, gridxgb]).get_data_to2(X_train, Y_train, test)
gridxgb2.fit(S_train,Y_train)

gridxgb2_best = gridxgb2.best_estimator_
print(gridxgb2.best_score_)
Fitting 5 folds for each of 576 candidates, totalling 2880 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    8.3s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   13.1s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   19.8s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   28.4s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:   39.7s
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:   54.5s
0.829738933031
[Parallel(n_jobs=-1)]: Done 2880 out of 2880 | elapsed:  1.1min finished
In [91]:
test_Survived = pd.Series(gridxgb2.predict(S_test), name="Survived")

results_stacking = pd.concat([IDtest,test_Survived],axis=1)

results_stacking.to_csv("ensemble_python_stacking.csv",index=False)

总结

  • 学完基本的机器学习理论知识后,在kaggle上找了Titanic比赛练练手,看了不少大神在kernel上了分享,学到了不少知识。本着尽快熟悉数据处理操作和各种机器学习算法应用的原则,自己重新实现了一遍比赛的基本流程,并且加了一些自己的想法进去。许多细节处理的特别粗糙,特征工程处理和模型调参由于时间的原因还不够完善,也是接下来要去认真学习的地方。刚接触Python没多长时间,代码写的也比较乱,还要多学习学习别人的代码风格,尝试总结出一套规范的流程来。
  • 最后将结果提交到kaggle上,分数没有达到0.8,这说明模型有点过拟合了。一方面特征工程做的还不够好,另一方面模型调参比较粗糙。如果再花时间进行模型调参,应该可以获得不错的成绩。本想着进一步优化模型,但是本次练习的目的也主要是熟悉和入门kaggle比赛,更多细节还需要学习优胜者们分享的解决方案,所以还是等以后有时间再来深究吧

参考资料

https://www.kaggle.com/helgejo/an-interactive-data-science-tutorial/notebook

https://www.kaggle.com/omarelgabry/a-journey-through-titanic

https://www.kaggle.com/startupsci/titanic-data-science-solutions

https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

https://www.kaggle.com/ash316/eda-to-prediction-dietanic

https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling

http://prozhuchen.com/2016/12/28/CCF%E5%A4%A7%E8%B5%9B%E6%90%9C%E7%8B%97%E7%94%A8%E6%88%B7%E7%94%BB%E5%83%8F%E6%80%BB%E7%BB%93/

  • 记录科研点滴,丰富研究生生活!

你可能感兴趣的:(机器学习)