This post uses the same dataset as the previous one: US individual income data. The previous post closed by mentioning the random forest algorithm, so this post introduces random forests.
As noted earlier, decision trees tend to overfit, and one of the most effective remedies is the random forest. A random forest is an ensemble model (Ensemble Models): ensemble methods combine several models to build a single model with higher accuracy. A random forest, then, combines several different decision trees into a forest and balances the trees' outputs to produce a final prediction.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
import numpy
# train and test are the income DataFrames prepared in the previous post.
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]
# Two trees with different constraints, so their predictions differ.
clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=75)
clf.fit(train[columns], train["high_income"])
clf2 = DecisionTreeClassifier(random_state=1, max_depth=6)
clf2.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
predictions = clf2.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
'''
0.784853009097
0.771031199892
'''
When we have several classifiers, we can arrange their predictions as a matrix, one row per sample and one column per classifier, and then produce the final prediction by majority vote (there are, of course, many other ways to combine predictions). For voting we want more than two classifiers, and preferably an odd number: with an even count we would also need a rule to break ties. A small sketch of majority voting follows.
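As a quick illustration (a minimal sketch with made-up 0/1 predictions, not output from our income models), majority voting over three classifiers could look like this:

import numpy
# Hypothetical predictions from three classifiers: one row per sample, one column per classifier.
votes = numpy.array([
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
])
# A sample is labeled 1 when more than half of the classifiers vote 1.
majority = (votes.sum(axis=1) > votes.shape[1] / 2).astype(int)
print(majority)
''' [0 1 0] '''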
Since we only built two decision trees above, we'll combine them a different way: instead of predict, we call predict_proba, which returns each sample's probability of belonging to class 0 and class 1, like this:
# Probability that each test sample belongs to class 1, from each tree.
predictions = clf.predict_proba(test[columns])[:,1]
predictions2 = clf2.predict_proba(test[columns])[:,1]
# Average the two probability estimates, then round to get a 0/1 prediction.
combined = (predictions + predictions2) / 2
rounded = numpy.round(combined)
print(roc_auc_score(test["high_income"], rounded))
''' 0.789959895266 '''
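The blended prediction scores an AUC of about 0.790, better than either tree on its own (0.785 and 0.771): even this tiny two-tree ensemble beats its individual members.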
The more "diverse", or dissimilar, the models used to construct an ensemble, the stronger the combined predictions will be (assuming that all models have about the same accuracy).
A random forest is an ensemble of decision trees. If we change nothing, every tree in the ensemble comes out identical and performance doesn't improve; the "random" in random forest refers to the randomness injected into the trees. Random forests make two modifications: bagging and random feature subsets. The result is a classifier built from many decision trees whose output class is the mode of the classes output by the individual trees. We start with bagging: each tree trains on a random sample of the training rows ("bag"), drawn with replacement.
tree_count = 10
# Each "bag" will have 60% of the number of original rows.
bag_proportion = .6
predictions = []
for i in range(tree_count):
    # We select 60% of the rows from train, sampling with replacement.
    # We set a random state to ensure we'll be able to replicate our results.
    # We set it to i instead of a fixed value so we don't get the same sample every loop.
    # That would make all of our trees the same.
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)
    # Fit a decision tree model to the "bag".
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=75)
    clf.fit(bag[columns], bag["high_income"])
    # Using the model, make predictions on the test data.
    predictions.append(clf.predict_proba(test[columns])[:,1])
# Average the per-tree probabilities, then round to get 0/1 predictions.
combined = numpy.sum(predictions, axis=0) / tree_count
rounded = numpy.round(combined)
print(roc_auc_score(test["high_income"], rounded))
''' 0.785415640465 '''
The second modification is random feature subsets: when searching for the best split, a node only considers a randomly chosen subset of the columns. To see how this works, we return to the small hand-built ID3 tree from two missions ago (calc_information_gain is defined there) and have find_best_column pick between two randomly selected columns:

import pandas
# Create the dataset that we used 2 missions ago.
data = pandas.DataFrame([
    [0,4,20,0],
    [0,4,60,2],
    [0,5,40,1],
    [1,4,25,1],
    [1,5,35,2],
    [1,5,55,1]
])
data.columns = ["high_income", "employment", "age", "marital_status"]
# Set a random seed to make results reproducible.
numpy.random.seed(1)
# The dictionary to store our tree.
tree = {}
nodes = []

def find_best_column(data, target_name, columns):
    information_gains = []
    # Select two columns randomly.
    cols = numpy.random.choice(columns, 2)
    for col in cols:
        information_gain = calc_information_gain(data, col, "high_income")
        information_gains.append(information_gain)
    highest_gain_index = information_gains.index(max(information_gains))
    # Get the highest gain by indexing cols.
    highest_gain = cols[highest_gain_index]
    return highest_gain

# The function to construct an ID3 decision tree.
def id3(data, target, columns, tree):
    unique_targets = pandas.unique(data[target])
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]
    if len(unique_targets) == 1:
        if 0 in unique_targets:
            tree["label"] = 0
        elif 1 in unique_targets:
            tree["label"] = 1
        return
    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()
    tree["column"] = best_column
    tree["median"] = column_median
    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    split_dict = [["left", left_split], ["right", right_split]]
    for name, split in split_dict:
        tree[name] = {}
        id3(split, target, columns, tree[name])

id3(data, "high_income", ["employment", "age", "marital_status"], tree)
print(tree)
'''
{'median': 37.5, 'number': 1,
 'left': {'median': 4.0, 'number': 2,
          'left': {'median': 22.5, 'number': 3,
                   'left': {'label': 0, 'number': 4},
                   'right': {'label': 1, 'number': 5}, 'column': 'age'},
          'right': {'label': 1, 'number': 6}, 'column': 'employment'},
 'right': {'median': 55.0, 'number': 7,
           'left': {'median': 47.5, 'number': 8,
                    'left': {'label': 0, 'number': 9},
                    'right': {'label': 1, 'number': 10}, 'column': 'age'},
           'right': {'label': 0, 'number': 11}, 'column': 'age'},
 'column': 'age'}
'''
# We'll build 10 trees
tree_count = 10
# Each "bag" will have 70% of the number of original rows.
bag_proportion = .7
predictions = []
for i in range(tree_count):
    # We select 70% of the rows from train, sampling with replacement.
    # We set a random state to ensure we'll be able to replicate our results.
    # We set it to i instead of a fixed value so we don't get the same sample every time.
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)
    # Fit a decision tree model to the "bag".
    # splitter="random" picks split points at random, and max_features="sqrt"
    # considers only a random subset of sqrt(n_features) columns at each split.
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=75, splitter="random", max_features="sqrt")
    clf.fit(bag[columns], bag["high_income"])
    # Using the model, make predictions on the test data.
    predictions.append(clf.predict_proba(test[columns])[:,1])
combined = numpy.sum(predictions, axis=0) / tree_count
rounded = numpy.round(combined)
print(roc_auc_score(test["high_income"], rounded))
''' 0.789767997764 '''
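With random feature subsets added, the ensemble's AUC rises from 0.7854 (bagging alone) to 0.7898, without touching anything else.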
scikit-learn packages all of this up in the RandomForestClassifier class, where n_estimators is the number of trees to build:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=1, min_samples_leaf=75)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
''' 0.791634978035 '''
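A side note: roc_auc_score is usually fed probability scores rather than hard 0/1 labels, which gives a finer-grained AUC. A minimal variant of the evaluation above (same clf, same test data) would be:

# Probability of the positive class for each test row.
scores = clf.predict_proba(test[columns])[:, 1]
print(roc_auc_score(test["high_income"], scores))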
Like a single decision tree, a random forest can be tuned for higher accuracy. The first four useful parameters are shared with DecisionTreeClassifier (min_samples_leaf, min_samples_split, max_depth, max_leaf_nodes); the fifth, n_estimators, is specific to the forest and controls how many trees get built:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=75)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
''' 0.793788646293 '''
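Going from 10 trees to 150 lifts the AUC from 0.7916 to 0.7938; adding still more trees tends to buy less and less.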
One of the random forest's big draws is that it overfits far less than a single decision tree; compare the AUC on the training set with the AUC on the test set:

clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=75)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(train[columns])
print(roc_auc_score(train["high_income"], predictions))
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))
'''
0.794137608735
0.793788646293
'''
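The train AUC (0.7941) is nearly identical to the test AUC (0.7938): the gap that signals overfitting has essentially vanished.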
The advantages of random forests: they are very accurate (together with neural networks and gradient boosted trees, random forests are among the best-performing algorithms), and, as we just saw, they resist overfitting.
The disadvantages: they are hard to interpret, and building and querying many trees takes longer than working with a single tree.
Weighing the two sides: a random forest is a great choice when accuracy is paramount and you don't need to explain the model's decisions; when speed (time complexity) matters or the decisions must be explainable, a single decision tree is the better fit.