A brief note on the Titanic disaster: on April 15, 1912, the luxury liner Titanic, carrying 1,316 passengers and 891 crew members, struck an iceberg and sank; the disaster is counted among the ten worst of the 20th century. In 1985 the wreck of the Titanic was found on the floor of the North Atlantic, two and a half miles down. In the film, the American explorer Lovett (Bill Paxton) dives to the wreck and finds a drawing on a cabin wall. His discovery immediately catches the attention of an elderly woman (Gloria Stuart): 101-year-old Rose claims that she is the girl in the drawing. In the submersible cabin, Rose begins to recount what happened aboard the ship. The young society girl Rose (Kate Winslet) and the penniless artist Jack (Leonardo DiCaprio) fall in love in defiance of social convention, but on the calm night of April 14, 1912, the Titanic strikes an iceberg. The "unsinkable" Titanic is doomed, and Rose and Jack's newly blossomed love must face a test of life and death that leaves them parted forever. After the aged Rose finishes telling this heartbreaking love story, she drops the priceless "Heart of the Ocean" necklace into the sea, letting it rest on the ocean floor forever with Jack and their love.
This is a movie I have watched many times; although it is long, it truly rewards rewatching. Kaggle hosts a related competition at https://www.kaggle.com/c/titanic. The training data it provides has the following features: passenger ID 'PassengerId', whether the passenger survived 'Survived', ticket class 'Pclass', name 'Name', sex 'Sex', age 'Age', number of siblings/spouses aboard 'SibSp', number of parents/children aboard 'Parch', ticket number 'Ticket', fare 'Fare', cabin number 'Cabin', and port of embarkation 'Embarked'. The task is to train a model on this training data and then predict whether each passenger in the test data survived; the test data has the same columns as the training data except that the 'Survived' column is missing.
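Before any modeling it is worth inspecting the columns and the missing values. The sketch below uses a tiny hand-made DataFrame with the same column names as Kaggle's train.csv purely for illustration; in practice you would load the real file with `pd.read_csv('train.csv')`.

```python
import pandas as pd
import numpy as np

# Tiny stand-in for Kaggle's train.csv (same column names, made-up rows);
# replace with: train = pd.read_csv('train.csv')
train = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived':    [0, 1, 1, 0],
    'Pclass':      [3, 1, 3, 2],
    'Name':        ['Braund', 'Cumings', 'Heikkinen', 'Allen'],
    'Sex':         ['male', 'female', 'female', 'male'],
    'Age':         [22.0, 38.0, np.nan, 35.0],
    'SibSp':       [1, 1, 0, 0],
    'Parch':       [0, 0, 0, 0],
    'Ticket':      ['A/5 21171', 'PC 17599', 'STON/O2', '373450'],
    'Fare':        [7.25, 71.28, 7.92, 8.05],
    'Cabin':       [np.nan, 'C85', np.nan, np.nan],
    'Embarked':    ['S', 'C', 'S', 'S'],
})

# Count missing values per column -- Age and Cabin have gaps in this sample,
# which mirrors the real dataset.
missing = train.isnull().sum()
print(missing[missing > 0])
```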
The approach: first preprocess the training data (fill in missing values; apply LabelEncoder to ticket class, sex, and port of embarkation), then choose a suitable classification model and train it, use the trained model to predict survival on the test data, and finally submit the predictions to get a score.
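The LabelEncoder step mentioned above simply maps each distinct string category to an integer code, which is what the tree-based models below expect. A minimal example:

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes to string categories;
# classes are sorted alphabetically, so 'female' -> 0, 'male' -> 1.
le = LabelEncoder()
codes = le.fit_transform(['male', 'female', 'female', 'male'])
print(codes)        # [1 0 0 1]
print(le.classes_)  # ['female' 'male']
```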
Part of the code and screenshots of the results are below.
```python
def get_process_train_data():
    train_data, target_data = get_raw_data()
    # count values of Embarked, used to decide the fill value for its missing entries
    embarkeds = train_data.groupby('Embarked')['Embarked'].count()
    logging.debug(embarkeds)
    # encode Pclass and Sex
    train_data_value = train_data.values
    pclass_le = LabelEncoder()
    train_data_value[:, 0] = pclass_le.fit_transform(train_data_value[:, 0])
    sex_le = LabelEncoder()
    train_data_value[:, 1] = sex_le.fit_transform(train_data_value[:, 1])
    # Embarked has missing values; they need to be filled before the LabelEncoder step
    embarked_le = LabelEncoder()
    train_data_value[:, 6] = embarked_le.fit_transform(train_data_value[:, 6])
    # according to info(), Age has missing values; replace them with the column mean
    imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
    imr = imr.fit(train_data_value)
    train_data_value = imr.transform(train_data_value)
    target_data_value = target_data.values.astype(np.int32)
    logging.debug(train_data_value)
    logging.debug(target_data_value)
    return train_data_value, target_data_value
```
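A note on the imputation step: the `Imputer` class used above was removed in scikit-learn 0.22. If you are on a recent version, `SimpleImputer` from `sklearn.impute` is the replacement (it has no `axis` parameter and always imputes column-wise). A sketch on made-up data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative feature matrix: column 0 is age (one value missing), column 1 is fare.
X = np.array([[22.0, 7.25],
              [np.nan, 71.28],
              [35.0, 8.05]])

# SimpleImputer replaces the deprecated Imputer in scikit-learn >= 0.22.
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imr.fit_transform(X)
print(X_filled)  # the missing age becomes the column mean, 28.5
```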
Here are the classification algorithms I tried:
```python
def do_category():
    train_data_value, train_target_value = train_data.get_process_train_data()
    X_train, X_test, y_train, y_test = train_test_split(
        train_data_value, train_target_value, test_size=0.40, random_state=1)

    # decision tree
    tree = DecisionTreeClassifier(criterion='gini', max_depth=5, splitter='random')
    tree = tree.fit(X_train, y_train)
    y_train_pred = tree.predict(X_train)
    y_test_pred = tree.predict(X_test)
    tree_train = accuracy_score(y_train, y_train_pred)
    tree_test = accuracy_score(y_test, y_test_pred)
    print('Decision tree train/test accuracies %.4f/%.4f' % (tree_train, tree_test))

    # random forest (min_samples_split must be >= 2 in recent scikit-learn)
    forest = RandomForestClassifier(n_estimators=10, max_depth=None,
                                    min_samples_split=2, random_state=0)
    random_forest = forest.fit(X_train, y_train)
    y_train_pred = random_forest.predict(X_train)
    y_test_pred = random_forest.predict(X_test)
    tree_train = accuracy_score(y_train, y_train_pred)
    tree_test = accuracy_score(y_test, y_test_pred)
    print('Random Forest train/test accuracies %.4f/%.4f' % (tree_train, tree_test))

    # extra trees
    extra = ExtraTreesClassifier(n_estimators=10, max_depth=None,
                                 min_samples_split=2, random_state=0)
    extra_tree = extra.fit(X_train, y_train)
    y_train_pred = extra_tree.predict(X_train)
    y_test_pred = extra_tree.predict(X_test)
    tree_train = accuracy_score(y_train, y_train_pred)
    tree_test = accuracy_score(y_test, y_test_pred)
    print('Extra Trees train/test accuracies %.4f/%.4f' % (tree_train, tree_test))

    # logistic regression
    logreg = linear_model.LogisticRegression(C=1e5)
    logreg.fit(X_train, y_train)
    y_train_pred = logreg.predict(X_train)
    y_test_pred = logreg.predict(X_test)
    tree_train = accuracy_score(y_train, y_train_pred)
    tree_test = accuracy_score(y_test, y_test_pred)
    print('Logistic Regression train/test accuracies %.4f/%.4f' % (tree_train, tree_test))

    # naive Bayes
    bayes = GaussianNB()
    bayes.fit(X_train, y_train)
    y_train_pred = bayes.predict(X_train)
    y_test_pred = bayes.predict(X_test)
    tree_train = accuracy_score(y_train, y_train_pred)
    tree_test = accuracy_score(y_test, y_test_pred)
    print('Gaussian Naive Bayes train/test accuracies %.4f/%.4f' % (tree_train, tree_test))

    # ensemble: soft-voting over three of the models above
    clf1 = LogisticRegression(random_state=1)
    clf2 = RandomForestClassifier(random_state=1)
    clf3 = GaussianNB()
    voting_class = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
                                    voting='soft', weights=[1, 1, 1])
    vote = voting_class.fit(X_train, y_train)
    y_train_pred = vote.predict(X_train)
    y_test_pred = vote.predict(X_test)
    vote_train = accuracy_score(y_train, y_train_pred)
    vote_test = accuracy_score(y_test, y_test_pred)
    print('Ensemble Classifier train/test accuracies %.4f/%.4f' % (vote_train, vote_test))


def do_category_kfold():
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=None)
    data_target_value, data_result_value = train_data.get_process_train_data()
    # note: this is the pre-0.18 StratifiedKFold API from sklearn.cross_validation
    kfold = StratifiedKFold(y=data_result_value, n_folds=10, random_state=1)
    scores = []
    for k, (train, test) in enumerate(kfold):
        tree.fit(data_target_value[train], data_result_value[train])
        score = tree.score(data_target_value[test], data_result_value[test])
        scores.append(score)
        print('Fold: %s, Class dist.: %s, Acc: %.3f'
              % (k + 1, np.bincount(data_result_value[train]), score))
    print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
```
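The manual k-fold loop in `do_category_kfold` can also be collapsed into a single `cross_val_score` call, which stratifies automatically for classifiers and works on current scikit-learn versions. Synthetic data stands in for the processed Titanic features here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 7 processed Titanic feature columns.
X, y = make_classification(n_samples=200, n_features=7, random_state=1)

tree = DecisionTreeClassifier(criterion='entropy', random_state=1)
# cv=10 gives the same 10-fold stratified scheme as the manual loop.
scores = cross_val_score(tree, X, y, cv=10, scoring='accuracy')
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```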
```python
def get_best_parameter():
    data_target_value, data_result_value = train_data.get_process_train_data()
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=None)
    # grid of decision-tree hyperparameters to search over
    param_grid = [{'criterion': ['gini', 'entropy'],
                   'splitter': ['best', 'random'],
                   'max_depth': [1, 5, 10, None]}]
    gs = GridSearchCV(estimator=tree, param_grid=param_grid, scoring='accuracy')
    gs = gs.fit(data_target_value, data_result_value)
    print(gs.best_score_)
    print(gs.best_params_)
```
0.820426487093
{'splitter': 'random', 'max_depth': 5, 'criterion': 'gini'}
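Instead of copying `best_params_` by hand into a fresh classifier (as the prediction code below does), `GridSearchCV` keeps the refit winner in `best_estimator_`, which can be used directly. A sketch on synthetic data standing in for the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed Titanic training matrix.
X, y = make_classification(n_samples=150, n_features=7, random_state=0)

param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, None]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  param_grid, scoring='accuracy', cv=5)
gs.fit(X, y)

# best_estimator_ is already refit on the full training data,
# so there is no need to retype the winning parameters.
best_tree = gs.best_estimator_
preds = best_tree.predict(X)
print(gs.best_params_, len(preds))
```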
Finally, make the predictions and write out the csv file in the format Kaggle requires:
```python
def predict():
    raw_test_data = test_data.get_raw_test_data()
    test_data_value = test_data.get_process_test_data()
    # of the models validated above, the decision tree scored best, so use it here,
    # with the parameters found by the grid search
    train_data_value, target_data_value = train_data.get_process_train_data()
    tree = DecisionTreeClassifier(criterion='gini', max_depth=5, splitter='random')
    tree = tree.fit(train_data_value, target_data_value)
    # predict with the fitted model
    test_target_value = tree.predict(test_data_value)
    print(test_target_value)
    result_df = pd.DataFrame(columns=['PassengerId', 'Survived'])
    result_df['PassengerId'] = raw_test_data['PassengerId']
    result_df['Survived'] = test_target_value
    result_df.to_csv('gendermodel.csv', index=False,
                     header=['PassengerId', 'Survived'])
```
I also ran an analysis of survival by age band; the results are in the screenshot.
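One way such an age-band analysis could be done is with `pd.cut` plus a groupby; the rows below are made up for illustration, and the bin edges and labels are my own choices rather than the ones used for the screenshot.

```python
import pandas as pd

# Illustrative rows only; the real analysis would use the full train.csv.
df = pd.DataFrame({
    'Age':      [4, 10, 22, 30, 45, 60, 8, 35, 70, 28],
    'Survived': [1, 1,  0,  1,  0,  0, 1,  0,  0,  1],
})

# Bucket ages into bands, then compute the survival rate per band.
bands = pd.cut(df['Age'], bins=[0, 12, 18, 40, 60, 100],
               labels=['child', 'teen', 'adult', 'middle-aged', 'elderly'])
rate = df.groupby(bands, observed=False)['Survived'].mean()
print(rate)
```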
What it shows, I leave for you to interpret.