Data Classification Prediction with the Titanic Dataset

A brief background on the Titanic disaster: on April 15, 1912, the luxury liner Titanic, carrying 1,316 passengers and 891 crew members, struck an iceberg and sank; the disaster is considered one of the ten worst catastrophes of the 20th century. In 1985, the wreck of the Titanic was found two and a half miles down on the floor of the North Atlantic. In the film, American explorer Lovett (played by Bill Paxton) dives to the wreck and finds a drawing on a cabin wall. His discovery immediately draws the attention of an elderly woman (played by Gloria Stuart): 101-year-old Rose, who says she is the girl in the drawing. Aboard the expedition ship, Rose begins to recount what happened on the voyage. The young aristocrat Rose (Kate Winslet) and the penniless painter Jack (Leonardo DiCaprio) fall in love in defiance of social convention, but on the calm night of April 14, 1912, the Titanic strikes an iceberg. The "unsinkable" ship is doomed, and Rose and Jack's budding love faces a life-or-death test that ultimately separates them forever. After telling this heartbreaking story, the aged Rose drops the priceless "Heart of the Ocean" necklace into the sea, leaving it to rest with Jack and their love.

This is a film I have watched many times; it is long, but it rewards rewatching. Kaggle hosts the related dataset at https://www.kaggle.com/c/titanic. The training data it provides has the following features: passenger ID 'PassengerId', whether the passenger survived 'Survived', ticket class 'Pclass', name 'Name', sex 'Sex', age 'Age', number of siblings/spouses aboard 'SibSp', number of parents/children aboard 'Parch', ticket number 'Ticket', fare 'Fare', cabin number 'Cabin', and port of embarkation 'Embarked'. The task is to train a model on this data and predict whether each passenger in the test set survived; the test data has exactly the same columns as the training data except that the 'Survived' column is missing.
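As a first step, here is a quick way to load the training data and see which columns have missing values (assuming train.csv has been downloaded from the competition page above):

import pandas as pd

train_df = pd.read_csv('train.csv')
# .info() prints the non-null count per column; in train.csv the
# Age, Cabin, and Embarked columns all contain missing values
train_df.info()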

The approach: first preprocess the training data, i.e. fill in missing values and apply a LabelEncoder to Pclass, Sex, and Embarked; then choose a suitable classification model and train it; then use the trained model to predict survival on the test data; and finally submit the predictions to get a score.

Below is part of the code, along with screenshots of the results.
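The helper get_raw_data() is not included in the post; judging from the column indices used below (0 = Pclass, 1 = Sex, 6 = Embarked), a sketch consistent with it might look like this:

import pandas as pd

def get_raw_data():
    # Sketch only: select seven features plus the target from train.csv,
    # dropping Name, Ticket, and Cabin
    df = pd.read_csv('train.csv')
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    return df[features].copy(), df['Survived']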

import logging
import numpy as np
from sklearn.preprocessing import Imputer, LabelEncoder  # Imputer: scikit-learn < 0.22

def get_process_train_data():
    train_data, target_data = get_raw_data()
    # Count passengers per port of embarkation; the most frequent port is
    # used to fill the missing 'Embarked' values
    embarkeds = train_data.groupby('Embarked')['Embarked'].count()
    logging.debug(embarkeds)
    train_data['Embarked'] = train_data['Embarked'].fillna(embarkeds.idxmax())
    # Label-encode Pclass and Sex
    train_data_value = train_data.values
    pclass_le = LabelEncoder()
    train_data_value[:, 0] = pclass_le.fit_transform(train_data_value[:, 0])
    sex_le = LabelEncoder()
    train_data_value[:, 1] = sex_le.fit_transform(train_data_value[:, 1])
    # 'Embarked' had missing values; they were filled above, so it can now
    # be label-encoded as well
    embarked_le = LabelEncoder()
    train_data_value[:, 6] = embarked_le.fit_transform(train_data_value[:, 6])
    # According to .info(), 'Age' has missing values; replace them with the column mean
    imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
    imr = imr.fit(train_data_value)
    train_data_value = imr.transform(train_data_value)

    target_data_value = target_data.values.astype(np.int32)

    logging.debug(train_data_value)
    logging.debug(target_data_value)
    return train_data_value, target_data_value
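A quick sanity check of the processed arrays (the expected shapes assume the seven-feature get_raw_data() sketch above):

X, y = get_process_train_data()
print(X.shape, y.shape)  # for Kaggle's train.csv this should be (891, 7) and (891,)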

Train on the training data with several classification methods and pick the model that performs best.

The classification algorithms I used:

import numpy as np
from sklearn.cross_validation import train_test_split, StratifiedKFold  # scikit-learn < 0.18 API
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import train_data

def do_category():
    train_data_value, train_target_value = train_data.get_process_train_data()
    X_train, X_test, y_train, y_test = train_test_split(train_data_value, train_target_value, test_size=0.40, random_state=1)

    # Decision tree
    tree = DecisionTreeClassifier(criterion='gini', max_depth=5, splitter='random')
    tree = tree.fit(X_train, y_train)
    y_train_pred = tree.predict(X_train)
    y_test_pred = tree.predict(X_test)
    tree_train = accuracy_score(y_train, y_train_pred)
    tree_test = accuracy_score(y_test, y_test_pred)
    print('Decision tree train/test accuracies %.4f/%.4f' % (tree_train, tree_test))

    # Random forest (min_samples_split must be >= 2 in current scikit-learn)
    forest = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
    random_forest = forest.fit(X_train, y_train)
    y_train_pred = random_forest.predict(X_train)
    y_test_pred = random_forest.predict(X_test)
    forest_train = accuracy_score(y_train, y_train_pred)
    forest_test = accuracy_score(y_test, y_test_pred)
    print('Random Forest train/test accuracies %.4f/%.4f' % (forest_train, forest_test))

    # Extra trees
    extra = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
    extra_tree = extra.fit(X_train, y_train)
    y_train_pred = extra_tree.predict(X_train)
    y_test_pred = extra_tree.predict(X_test)
    extra_train = accuracy_score(y_train, y_train_pred)
    extra_test = accuracy_score(y_test, y_test_pred)
    print('Extra Trees train/test accuracies %.4f/%.4f' % (extra_train, extra_test))

    # Logistic regression
    logreg = LogisticRegression(C=1e5)
    logreg.fit(X_train, y_train)
    y_train_pred = logreg.predict(X_train)
    y_test_pred = logreg.predict(X_test)
    logreg_train = accuracy_score(y_train, y_train_pred)
    logreg_test = accuracy_score(y_test, y_test_pred)
    print('Logistic Regression train/test accuracies %.4f/%.4f' % (logreg_train, logreg_test))

    # Gaussian naive Bayes
    bayes = GaussianNB()
    bayes.fit(X_train, y_train)
    y_train_pred = bayes.predict(X_train)
    y_test_pred = bayes.predict(X_test)
    bayes_train = accuracy_score(y_train, y_train_pred)
    bayes_test = accuracy_score(y_test, y_test_pred)
    print('Gaussian Naive Bayes train/test accuracies %.4f/%.4f' % (bayes_train, bayes_test))

    # Ensemble: soft voting over logistic regression, random forest, and naive Bayes
    clf1 = LogisticRegression(random_state=1)
    clf2 = RandomForestClassifier(random_state=1)
    clf3 = GaussianNB()
    voting_class = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
                            voting='soft',
                            weights=[1, 1, 1])
    vote = voting_class.fit(X_train, y_train)
    y_train_pred = vote.predict(X_train)
    y_test_pred = vote.predict(X_test)
    vote_train = accuracy_score(y_train, y_train_pred)
    vote_test = accuracy_score(y_test, y_test_pred)
    print('Ensemble Classifier train/test accuracies %.4f/%.4f' % (vote_train, vote_test))

def do_category_kfold():
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=None)
    train_data_value, target_data_value = train_data.get_process_train_data()
    # scikit-learn < 0.18 API: StratifiedKFold takes y and n_folds directly
    kfold = StratifiedKFold(y=target_data_value, n_folds=10, random_state=1)
    scores = []
    for k, (train, test) in enumerate(kfold):
        tree.fit(train_data_value[train], target_data_value[train])
        score = tree.score(train_data_value[test], target_data_value[test])
        scores.append(score)
        print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k + 1, np.bincount(target_data_value[train]), score))
    print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
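Note that on scikit-learn 0.18 and later, the cross-validation classes moved to sklearn.model_selection and the StratifiedKFold constructor changed, so the y=/n_folds= form above no longer works there. An equivalent sketch for the newer API:

from sklearn.model_selection import StratifiedKFold

def do_category_kfold_new_api():
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=None)
    train_data_value, target_data_value = train_data.get_process_train_data()
    # n_splits replaces n_folds; shuffle=True makes random_state take effect
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    scores = []
    for k, (train, test) in enumerate(kfold.split(train_data_value, target_data_value)):
        tree.fit(train_data_value[train], target_data_value[train])
        scores.append(tree.score(train_data_value[test], target_data_value[test]))
        print('Fold: %s, Acc: %.3f' % (k + 1, scores[-1]))
    print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))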


Tune the parameters of the best-performing classifier with GridSearchCV:

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection.GridSearchCV in >= 0.18

def get_best_parameter():
    train_data_value, target_data_value = train_data.get_process_train_data()
    # The grid below overrides criterion, splitter, and max_depth
    tree = DecisionTreeClassifier()
    param_grid = [{'criterion': ['gini', 'entropy'],
                   'splitter': ['best', 'random'],
                   'max_depth': [1, 5, 10, None]
                   }]
    gs = GridSearchCV(estimator=tree,
                      param_grid=param_grid,
                      scoring='accuracy')
    gs = gs.fit(train_data_value, target_data_value)
    print(gs.best_score_)
    print(gs.best_params_)

The output:

0.820426487093
{'splitter': 'random', 'max_depth': 5, 'criterion': 'gini'}
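One detail worth noting: with the default refit=True, GridSearchCV refits the best parameter combination on the full training set, so the tuned model could also be reused directly instead of re-creating the tree by hand in the prediction step below:

best_tree = gs.best_estimator_  # already refit on all the training data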
Finally, make predictions and write out the CSV file in the format Kaggle requires:

import pandas as pd
import test_data

def predict():
    raw_test_data = test_data.get_raw_test_data()
    test_data_value = test_data.get_process_test_data()
    # Of the models validated above, the decision tree scored best, so it is
    # used for the final prediction, with the parameters found by grid search
    train_data_value, target_data_value = train_data.get_process_train_data()
    tree = DecisionTreeClassifier(criterion='gini', max_depth=5, splitter='random')
    tree = tree.fit(train_data_value, target_data_value)
    # Predict with the fitted model
    test_target_value = tree.predict(test_data_value)
    print(test_target_value)

    result_df = pd.DataFrame(columns=['PassengerId', 'Survived'])
    result_df['PassengerId'] = raw_test_data['PassengerId']
    result_df['Survived'] = test_target_value
    result_df.to_csv('gendermodel.csv', index=False)
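The helpers get_raw_test_data() and get_process_test_data() in the test_data module are not shown in the post; a sketch consistent with the training pipeline (my reconstruction, not the original code) might look like this:

import pandas as pd
from sklearn.preprocessing import Imputer, LabelEncoder  # Imputer: scikit-learn < 0.22

FEATURES = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

def get_raw_test_data():
    # Kaggle's test.csv, kept whole so PassengerId is available for the submission
    return pd.read_csv('test.csv')

def get_process_test_data():
    df = get_raw_test_data()[FEATURES].copy()
    # test.csv has no missing Embarked values, but fill defensively with the mode
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    values = df.values
    for col in (0, 1, 6):  # Pclass, Sex, Embarked
        values[:, col] = LabelEncoder().fit_transform(values[:, col])
    # Age and Fare do have missing values in test.csv; mean-impute them
    imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
    return imr.fit_transform(values)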


Upload the generated file to Kaggle to get the prediction score.

[Figure 1: Kaggle submission score]


I also analyzed survival by age group; the results are in the screenshot below.

[Figure 2: survival rate by age group]
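The post only shows the resulting chart; a sketch of how such an age-group survival analysis could be computed with pandas (my reconstruction, not the original code):

import pandas as pd

train_df = pd.read_csv('train.csv')
# Bucket ages into ten-year bands and compute the survival rate per band
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
train_df['AgeBand'] = pd.cut(train_df['Age'], bins=bins)
print(train_df.groupby('AgeBand')['Survived'].agg(['count', 'mean']))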

As for what it shows, I'll leave the interpretation to you.
