Titanic: Machine Learning from Disaster - Summary

Data Analysis

# -*- coding: utf-8 -*-
# First, open the training data file and convert categorical values into numeric codes (some features can simply be ignored)
import pandas
titanic_train = pandas.read_csv('train.csv')
titanic_train.loc[titanic_train["Sex"] == "male", "Sex"] = 0 
titanic_train.loc[titanic_train["Sex"] == "female", "Sex"] = 1
titanic_train["Embarked"] = titanic_train["Embarked"].fillna("S")
titanic_train.loc[titanic_train["Embarked"] == "S", "Embarked"] = 0
titanic_train.loc[titanic_train["Embarked"] == "C", "Embarked"] = 1
titanic_train.loc[titanic_train["Embarked"] == "Q", "Embarked"] = 2
# describe() returns the summary statistics of the training data
statistics = titanic_train.describe()
print(statistics)
'''
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
'''

A careful look at the data reveals:

  • The count for "Age" is 714, lower than every other attribute, which means "Age" has missing values that need to be filled.

Data Preprocessing

1) Delete the tuples directly: remove the objects (tuples/records) that contain missing attribute values, leaving a fully complete table.
2) Special-value filling (treating missing attribute values as special values)
Treat a missing value as a special attribute value in its own right, distinct from any other value. For example, filling every missing entry with "unknown" can seriously distort the data, so this approach is generally not recommended.
3) Mean/mode imputation (mean/mode completer)
Split the attributes into numeric and non-numeric and handle them separately: if the missing value is numeric, fill it with the mean of that attribute over all the other objects; if it is non-numeric, follow the statistical notion of the mode and fill it with the value that occurs most frequently for that attribute among all the other objects (a small pandas sketch follows this list).
4) Fill with the most probable value
The value can be determined by regression, by inference tools based on Bayesian formalism, or by decision-tree induction. These methods estimate model parameters directly rather than predicting the missing value itself; compared with the previous approaches, they use most of the information in the existing data to infer the missing values.
5) Keep the missing data as-is: do no processing on the missing values and mine directly on the dataset, even though it contains many missing entries.
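
As a minimal sketch of strategy 3 (mean/mode imputation) in pandas, using hypothetical toy columns rather than the Titanic data:

import pandas as pd
df = pd.DataFrame({"age": [22.0, None, 35.0], "city": ["S", None, "S"]})
# Numeric attribute: fill with the mean of the observed values
df["age"] = df["age"].fillna(df["age"].mean())
# Non-numeric attribute: fill with the mode (the most frequent observed value)
df["city"] = df["city"].fillna(df["city"].mode()[0])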

How should missing data be handled in machine learning? There are currently three classes of approaches:

1) Replace missing values with the mean, median, a quantile, the mode, a random value, and so on. The effect is mediocre, because it amounts to adding noise by hand.
2) Predict the missing variable with a model built from the other variables. Slightly better than method 1, but with a fundamental flaw: if the other variables are unrelated to the missing one, the prediction is meaningless; and if the prediction is very accurate, the variable probably did not need to be included in the model at all. In practice the result usually falls somewhere in between.
3) The most precise approach is to map the variable into a higher-dimensional space. For gender, with the possible values male, female, and missing, map it to three variables: is-male, is-female, is-missing; continuous variables can be handled the same way. The CTR-prediction models at Google and Baidu reportedly preprocess every variable like this, reaching hundreds of millions of dimensions. The advantage is that all the information in the raw data is preserved, with no need to worry about missing values or linear inseparability; the drawbacks are a large increase in computation, and that it only works well with very large samples, otherwise the data becomes too sparse and the results are poor (a small sketch of this encoding follows).
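
A minimal sketch of approach 3 with pandas, on toy gender data (not the Titanic columns): get_dummies with dummy_na=True expands one variable with missing values into one 0/1 indicator column per category plus one for "missing".

import pandas as pd
s = pd.Series(["male", "female", None, "male"])
# One 0/1 column per category, plus an extra indicator column for missing values
dummies = pd.get_dummies(s, dummy_na=True)
print(dummies)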

When does data need to be standardized before analysis?
It mainly depends on whether the model is invariant to scaling:

1) For some models, scaling the dimensions non-uniformly changes the optimal solution, e.g. SVM. For such models, unless the dimensions already have similar ranges, standardization is required; otherwise the model parameters will be dominated by the features with much larger (or much smaller) ranges (a small pipeline sketch follows this list).
2) For other models, non-uniform scaling of the dimensions leaves the optimal solution equivalent, e.g. logistic regression. For these, standardization should not change the optimal solution in theory. In practice, however, the problem is solved with iterative algorithms, and if the objective function is too "elongated" the iterations may converge very slowly or not at all. So even for scale-invariant models it is best to standardize the data as well.
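
A minimal sketch of point 1, assuming scikit-learn is available: wrap the scaler and the SVM in a pipeline so the standardization fitted on the training data is applied consistently; X_train and y_train are placeholders, not variables defined in this post.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize each feature before the SVM so no wide-range feature dominates the fit
model = make_pipeline(StandardScaler(), SVC())
# model.fit(X_train, y_train)
# model.predict(X_test)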

What normalization methods are commonly used in machine learning, and what kind of data is each suited to?
Two common ones (both are sketched below):
1) Min-max normalization, e.g. mapping the maximum to 1 and the minimum to -1, or the maximum to 1 and the minimum to 0. Suitable for data that already lies in a bounded range.
2) Mean-variance normalization (standardization), which usually maps the mean to 0 and the variance to 1. Suitable for data without obvious bounds, and it is also less affected by outliers.
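
A minimal sketch of both methods with scikit-learn, on a tiny toy column:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1.0], [5.0], [10.0]])
# Min-max normalization: rescale into [0, 1]
print(MinMaxScaler().fit_transform(data).ravel())
# Mean-variance normalization: zero mean, unit variance
print(StandardScaler().fit_transform(data).ravel())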

Why does feature scaling make gradient descent converge better?
Without normalization the feature ranges differ wildly, so the objective function is "elongated"; after normalization the objective becomes "round", and the gradient points much more directly toward the minimum.
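
A small numerical sketch of the "elongated vs. round" picture, on synthetic toy features: for a least-squares objective the Hessian is proportional to X^T X, and its condition number measures how elongated the level sets are.

import numpy as np

rng = np.random.RandomState(0)
# Two features with wildly different ranges
X = np.c_[rng.uniform(0, 1, 500), rng.uniform(0, 1000, 500)]

def condition_number(X):
    # Condition number of X^T X / n, the Hessian of a least-squares objective
    return np.linalg.cond(X.T.dot(X) / len(X))

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(condition_number(X))      # very large: elongated objective, slow gradient descent
print(condition_number(X_std))  # close to 1: round objective, fast convergence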

Should feature normalization be applied to the whole matrix or to each feature dimension separately?
Normalizing the whole matrix at once is an isotropic rescaling, so it achieves nothing.
Normalizing each dimension separately loses the per-dimension variance information, but the correlation coefficients between dimensions are preserved (a small check is sketched below). If the dimensions already share the same unit, it is best not to normalize at all, so as to keep as much information as possible.
If the dimensions have different units, each dimension needs to be normalized separately first.
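
A quick sketch, on toy synthetic data, of the claim that per-dimension scaling preserves the correlation between dimensions:

import numpy as np

rng = np.random.RandomState(1)
a = rng.normal(0, 1, 1000)
b = 300.0 * a + rng.normal(0, 500, 1000)   # correlated with a, but on a much larger scale
X = np.c_[a, b]

X_col = (X - X.mean(axis=0)) / X.std(axis=0)   # per-column standardization
# The correlation coefficient between the two columns is unchanged by the scaling
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])
print(np.corrcoef(X_col[:, 0], X_col[:, 1])[0, 1])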

  • We therefore apply the following processing to the data:
    • Fill the missing values of "Age" with the median of "Age".
    • Normalize "Age" (a step not mentioned above).
    • Create four new features: "FamilySize", "NameLength", "Title", "FamilyId".
# "Age"属性的缺失值用"Age"的均值进行填充。
# "Age"进行归一化(这一步前面没有提到)。
# 创建特征:"FamilySize","NameLength"
from sklearn import preprocessing
def normalize_series(series):
    # Reshape the column values into a 2-D float array, as the scaler expects
    x = series.values.astype(float).reshape(-1, 1)
    # MinMaxScaler rescales the values into the [0, 1] range
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    # Return a Series aligned with the original index so it can be assigned back to a column
    return pandas.Series(x_scaled.ravel(), index=series.index)

# Fill the missing value of "Age" 
titanic_train["Age"] = titanic_train["Age"].fillna(titanic_train["Age"].median())
# normalize the "Age"
titanic_train["Age"] = normalize_series(titanic_train["Age"])
# Generating a familysize column
titanic_train["FamilySize"] = titanic_train["SibSp"] + titanic_train["Parch"]
# The .apply method generates a new series "NameLength"
titanic_train["NameLength"] = titanic_train["Name"].apply(lambda x: len(x))
# Create the "Title" feature
import re
# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title. Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
# Get all the titles and print how often each one occurs.
titles = titanic_train["Name"].apply(get_title)
# print pandas.value_counts(titles) #print and observe the title rule
# Map each title to an integer. Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k,v in title_mapping.items():
    titles[titles == k] = v
# Add in the title column.
titanic_train["Title"] = titles
# Create the "FamilyId" feature
import operator
# A dictionary mapping family name to id
family_id_mapping = {}
# A function to get the id given a row
def get_family_id(row):
    # Find the last name by splitting on a comma
    last_name = row["Name"].split(",")[0]
    # Create the family id
    family_id = "{0}{1}".format(last_name, row["FamilySize"])
    # Look up the id in the mapping
    if family_id not in family_id_mapping:
        if len(family_id_mapping) == 0:
            current_id = 1
        else:
            # Get the maximum id from the mapping and add one to it if we don't have an id
            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
        family_id_mapping[family_id] = current_id
    return family_id_mapping[family_id]

# Get the family ids with the apply method
family_ids = titanic_train.apply(get_family_id, axis=1)
# There are a lot of family ids, so we'll compress all of the families under 3 members into one code.
family_ids[titanic_train["FamilySize"] < 3] = -1
titanic_train["FamilyId"] = family_ids
# normalize the "NameLength" and "FamilyId"
titanic_train["NameLength"] = normalize_series(titanic_train["NameLength"])
titanic_train["FamilyId"] = normalize_series(titanic_train["FamilyId"])
print(titanic_train.describe())
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  891.000000  891.000000   
mean    446.000000    0.383838    2.308642    0.363679    0.523008   
std     257.353842    0.486592    0.836071    0.163605    1.102743   
min       1.000000    0.000000    1.000000    0.000000    0.000000   
25%     223.500000    0.000000    2.000000    0.271174    0.000000   
50%     446.000000    0.000000    3.000000    0.346569    0.000000   
75%     668.500000    1.000000    3.000000    0.434531    1.000000   
max     891.000000    1.000000    3.000000    1.000000    8.000000   

            Parch        Fare  FamilySize  NameLength    FamilyId  
count  891.000000  891.000000  891.000000  891.000000  891.000000  
mean     0.381594   32.204208    0.904602    0.213789    0.024026  
std      0.806057   49.693429    1.613459    0.132594    0.110231  
min      0.000000    0.000000    0.000000    0.000000    0.000000  
25%      0.000000    7.910400    0.000000    0.114286    0.000000  
50%      0.000000   14.454200    0.000000    0.185714    0.000000  
75%      0.000000   31.000000    1.000000    0.257143    0.000000  
max      6.000000  512.329200   10.000000    1.000000    1.000000  

Feature Selection

We now have quite a few features, but some of them may be irrelevant to the prediction or redundant, so let's first run a feature relevance analysis:

# select features
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "NameLength", "Title", "FamilyId"]
# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic_train[predictors], titanic_train["Survived"])
# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)
# Plot the scores.
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
# Pick only the five best features.
predictors = ["Pclass", "Sex", "Fare", "NameLength", "Title"]

Prediction

1) Linear regression

  • Making Predictions
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.cross_validation import KFold

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset. It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic_train.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using to train the algorithm. Note how we only take the rows in the train folds.
    train_predictors = (titanic_train[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic_train["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic_train[predictors].iloc[test,:])
    predictions.append(test_predictions)
  • Evaluating Error
import numpy as np
# The predictions are in three separate numpy arrays. Concatenate them into one. 
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0
accuracy = .0
for i in range(len(predictions)):
    if predictions[i] == titanic_train["Survived"][i]:
        accuracy += 1.
accuracy /= len(predictions)
print(accuracy)
  • output: 0.784511784512

2) Logistic Regression

from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic_train[predictors], titanic_train["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
  • output: 0.776655443322

3) Random forest

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic_train[predictors], titanic_train["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
  • output: 0.81593714927

4) SVM

# SVM (support vector machine)
from sklearn import cross_validation
from sklearn import svm
X = titanic_train[predictors]
y = titanic_train["Survived"]
clf = svm.SVC()
clf.fit(X, y)  
scores = cross_validation.cross_val_score(clf, titanic_train[predictors], titanic_train["Survived"], cv=3)
#Take the mean of the scores (because we have one for each fold)
print(scores.mean())
  • output: 0.777777777778

So far we have only tried a few basic classifiers, and their performance is quite close. In fact, while doing this I found that using all of the features gave higher accuracy, so do not construct features blindly; the feature selection above is just meant as a demonstration. Next we adopt a boosting-style approach: build multiple classifiers and combine them linearly to improve the classification performance.

5) Boosting

from sklearn.ensemble import GradientBoostingClassifier

# The algorithms we want to ensemble.
# Here both algorithms use the same predictor set; each entry could be given its own feature list.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),predictors],
    [LogisticRegression(random_state=1),predictors]
]

# Initialize the cross validation folds
kf = KFold(titanic_train.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    train_target = titanic_train["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic_train[predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.  
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic_train[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

# Compute accuracy by comparing to the training data.
accuracy = sum(predictions == titanic_train["Survived"]) / float(len(predictions))
print(accuracy)
  • output: 0.817059483726

The ensemble above combines GradientBoostingClassifier and LogisticRegression, each with a weight of 1/2. Overall, this combination gives the best performance.

Submitting the Results

1) Apply the same preprocessing to the test data:

# do the same with the test data
import pandas
titanic_test = pandas.read_csv('test.csv')
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0 
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median()) #train data has no missing data
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

# Fill the missing value of "Age" 
titanic_test["Age"] = titanic_test["Age"].fillna(titanic_train["Age"].median())
# normalize the "Age"
titanic_test["Age"] = normalize_series(titanic_test["Age"])
# Generating a familysize column
titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]
# The .apply method generates a new series "NameLength"
titanic_test["NameLength"] = titanic_test["Name"].apply(lambda x: len(x))
# First, we'll add titles to the test set.
titles = titanic_test["Name"].apply(get_title)
# We're adding the Dona title to the mapping, because it's in the test set, but not the training set
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}
for k,v in title_mapping.items():
    titles[titles == k] = v
titanic_test["Title"] = titles
# Now we can add family ids.
# We'll use the same ids that we did earlier.
family_ids = titanic_test.apply(get_family_id, axis=1)
family_ids[titanic_test["FamilySize"] < 3] = -1
titanic_test["FamilyId"] = family_ids
# normalize the "NameLength" and "FamilyId"
titanic_test["NameLength"] = normalize_series(titanic_test["NameLength"])
titanic_test["FamilyId"] = normalize_series(titanic_test["FamilyId"])

2) Predict and save the results as a .csv file for submission.

algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],
    [LogisticRegression(random_state=1), predictors]
]

full_predictions = []
for alg, predictors in algorithms:
    # Fit the algorithm using the full training data.
    alg.fit(titanic_train[predictors], titanic_train["Survived"])
    # Predict using the test dataset. We have to convert all the columns to floats to avoid an error.
    predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]
    full_predictions.append(predictions)

# The gradient boosting classifier generates better predictions, so we weight it higher.
predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4
predictions[predictions <= .5] = 0
predictions[predictions > .5] = 1
predictions = predictions.astype(int)
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })
submission.to_csv("kaggle.csv", index=False)


With that, the predictions are submitted to Kaggle and one full experiment is complete.
