Since I have only just started writing this blog, my knowledge and experience are limited, so if anything here is incorrect I would be very grateful for readers' corrections; I also hope to use this platform to keep study notes, reviewing the old to learn the new.
In the introductory project, you studied the Titanic survival data and were able to make predictions about which passengers survived. In that project, you built a decision tree by hand which, at each stage, picked the feature most correlated with survival. Luckily, this is exactly how decision trees work! In this lab, we will make this process much faster by implementing a decision tree in sklearn.
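To make "pick the most informative feature" concrete, here is a minimal sketch of how a tree scores a candidate split. This is illustrative only: the helpers gini and split_gain are names I made up here, while sklearn's DecisionTreeClassifier computes Gini impurity internally by default.

# Illustrative sketch only: how a decision tree scores a candidate split
import numpy as np

def gini(labels):
    # Gini impurity of a set of 0/1 labels: 2 * p * (1 - p)
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2 * p * (1 - p)

def split_gain(labels, mask):
    # Impurity reduction from splitting the labels by a boolean mask;
    # the tree greedily picks the feature and threshold maximizing this
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

# For example, splitting four passengers by sex, where both women survived
# and both men did not, yields the maximum possible gain of 0.5:
# split_gain(np.array([1, 1, 0, 0]), np.array([True, True, False, False]))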
We will start by loading the dataset and displaying the first few rows.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
# Pretty display for notebooks
%matplotlib inline
# Set a random seed (this seeds Python's random module only; we also pass random_state to sklearn calls below)
import random
random.seed(42)
# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)
# Print the first few entries of the RMS Titanic data
display(full_data.head())
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
These are the various features available for each passenger:

- Survived: Outcome of survival (0 = No; 1 = Yes)
- Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- Name: Name of the passenger
- Sex: Sex of the passenger
- Age: Age of the passenger (some entries contain NaN)
- SibSp: Number of siblings and spouses of the passenger aboard
- Parch: Number of parents and children of the passenger aboard
- Ticket: Ticket number of the passenger
- Fare: Fare paid by the passenger
- Cabin: Cabin number of the passenger (some entries contain NaN)
- Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
Since we are interested in each passenger's survival, we can remove the Survived feature from this dataset and store it in a separate variable, outcomes. We will use these outcomes as our prediction targets.

Run the code cell below to remove the Survived feature from the dataset and store it in outcomes.
# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)
# Show the new dataset with 'Survived' removed
display(features_raw.head())
|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
The very same sample of the Titanic data now shows the DataFrame with the Survived feature removed. Note that features_raw (the passenger data) and outcomes (the survival outcomes) are now paired. That means for any passenger features_raw.loc[i], they have the survival outcome outcomes[i].
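As a quick sanity check (a hypothetical snippet, not part of the lab), we can confirm this pairing for the first passenger:

# Hypothetical check: the first passenger and their survival outcome
print(features_raw.loc[0, 'Name'])  # Braund, Mr. Owen Harris
print(outcomes[0])                  # 0, i.e. did not survive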
Now let's preprocess the data. First, we'll one-hot encode the features.
features = pd.get_dummies(features_raw)
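If one-hot encoding is unfamiliar, here is a toy illustration of what pd.get_dummies does; the tiny DataFrame below is made up purely for demonstration:

# Illustration only: each category value becomes its own 0/1 column
example = pd.DataFrame({'Sex': ['male', 'female', 'female']})
print(pd.get_dummies(example))
#    Sex_female  Sex_male
# 0           0         1
# 1           1         0
# 2           1         0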
And now we'll fill in any blanks with zeroes.
features = features.fillna(0.0)
display(features.head())
|   | PassengerId | Pclass | Age | SibSp | Parch | Fare | Name_Abbing, Mr. Anthony | Name_Abbott, Mr. Rossmore Edward | Name_Abbott, Mrs. Stanton (Rosa Hunt) | Name_Abelson, Mr. Samuel | ... | Cabin_F G73 | Cabin_F2 | Cabin_F33 | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 5 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 1730 columns
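Filling every blank with 0 is the simplest choice and is what this lab does. As an aside (my own suggestion, not part of the lab), a common alternative for a numeric column like Age is to impute the median instead:

# Alternative, not used here: fill missing ages with the median age instead of 0
# features_raw['Age'].fillna(features_raw['Age'].median())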
Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)
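A quick check on the split sizes (a hypothetical snippet; with 891 passengers, an 80/20 split reserves 179 rows for testing):

# Hypothetical check of the 80/20 split
print(X_train.shape, X_test.shape)  # (712, 1730) (179, 1730)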
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier
# TODO: Define the classifier, and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
Now let's see how the model did. Let's calculate the accuracy over both the training and the testing sets.
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
The training accuracy is 1.0
The test accuracy is 0.810055865922
Our training accuracy is very high, but the testing accuracy is lower. We may be overfitting a bit.
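One way to see the memorization directly (an aside using sklearn's fitted tree_ attribute) is to inspect how large the unconstrained tree grew:

# The unconstrained tree keeps splitting until its leaves are pure,
# which is why it classifies the training set perfectly
print(model.tree_.max_depth)   # depth of the fitted tree
print(model.tree_.node_count)  # number of nodes in the tree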
Now it's your turn to shine! Train a new model, and try to specify some parameters in order to improve the testing accuracy, such as:

- max_depth
- min_samples_leaf
- min_samples_split

You can use your intuition, use trial and error, or even better, use Grid Search (a sketch follows below)!
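For reference, a minimal Grid Search sketch using sklearn's GridSearchCV might look like the following; the parameter ranges are illustrative guesses on my part, not the lab's official answer:

# A sketch of hyperparameter tuning with grid search (illustrative ranges)
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [4, 6, 8, 10],
    'min_samples_leaf': [2, 4, 6, 8],
    'min_samples_split': [2, 6, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)            # best parameter combination found
print(grid.score(X_test, y_test))   # accuracy of the refit best model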
**Challenge:** Try to get to 85% accuracy on the testing set. If you'd like a hint, take a look at the solutions notebook that follows.
# Training the model
model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=6, min_samples_split=10)
model.fit(X_train, y_train)
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculating accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
The training accuracy is 0.870786516854
The test accuracy is 0.854748603352