Kaggle Tutorial: Intro to Machine Learning (Study Notes)

[Jump to] Table of contents for the "Kaggle Tutorial: Intro to Machine Learning" series

>> Decision Trees

  • Overview: a decision tree is a decision-analysis method that, given the probabilities of various outcomes, builds a tree of choices to estimate the probability that the expected net present value is greater than or equal to zero, and thereby evaluates a project's risk and feasibility. It is an intuitive, graphical application of probability analysis; because the branching diagram resembles the branches of a tree, it is called a decision tree.
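To make the idea concrete, here is a minimal sketch (not part of the course code) of a depth-1 decision tree fit to synthetic data; all names and values are illustrative:

```python
# A tiny decision tree on synthetic data: one feature, one split.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(40, 1)), axis=0)  # one feature in [0, 10]
y = np.where(X[:, 0] < 5, 100.0, 200.0)                # step-shaped target

tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X, y)

# A depth-1 tree learns a single split (here near x = 5) and predicts
# the mean of the training targets on each side of the split.
print(tree.predict([[2.0], [8.0]]))  # → [100. 200.]
```

Deeper trees simply repeat this splitting step recursively, carving the feature space into finer and finer regions.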

>> As an example, let's look at housing-price data from Melbourne, Australia.

import numpy as np
import pandas as pd

# Path to the data file
melbourne_file_path = 'data/melb_data.csv'

# Read the data into a DataFrame named melbourne_data, using Date as the index
melbourne_data = pd.read_csv(melbourne_file_path, index_col="Date")
print(melbourne_data.shape)

# Print summary statistics
melbourne_data.describe()
'''
Rooms Price Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt Lattitude Longtitude Propertycount
count 13580.000000 1.358000e+04 13580.000000 13580.000000 13580.000000 13580.000000 13518.000000 13580.000000 7130.000000 8205.000000 13580.000000 13580.000000 13580.000000
mean 2.937997 1.075684e+06 10.137776 3105.301915 2.914728 1.534242 1.610075 558.416127 151.967650 1964.684217 -37.809203 144.995216 7454.417378
std 0.955748 6.393107e+05 5.868725 90.676964 0.965921 0.691712 0.962634 3990.669241 541.014538 37.273762 0.079260 0.103916 4378.581772
min 1.000000 8.500000e+04 0.000000 3000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1196.000000 -38.182550 144.431810 249.000000
25% 2.000000 6.500000e+05 6.100000 3044.000000 2.000000 1.000000 1.000000 177.000000 93.000000 1940.000000 -37.856822 144.929600 4380.000000
50% 3.000000 9.030000e+05 9.200000 3084.000000 3.000000 1.000000 2.000000 440.000000 126.000000 1970.000000 -37.802355 145.000100 6555.000000
75% 3.000000 1.330000e+06 13.000000 3148.000000 3.000000 2.000000 2.000000 651.000000 174.000000 1999.000000 -37.756400 145.058305 10331.000000
max 10.000000 9.000000e+06 48.100000 3977.000000 20.000000 8.000000 10.000000 433014.000000 44515.000000 2018.000000 -37.408530 145.526350 21650.000000
'''
# Show the first 5 rows
melbourne_data.head()
# Column glossary: Suburb; Address; Rooms (number of rooms);
# Type (property type); Price; Method (sale method);
# SellerG (selling agent); Distance (distance from the CBD);
# Postcode; Bedroom2 (bedrooms); Bathroom (bathrooms);
# Car (car spaces); Landsize (land size); BuildingArea (building area);
# YearBuilt (year built); CouncilArea (local council area);
# Lattitude (latitude); Longtitude (longitude);
# Regionname (region name); Propertycount (properties in the suburb)
'''
Suburb Address Rooms Type Price Method SellerG Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
Date
3/12/2016 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 2.5 3067.0 2.0 1.0 1.0 202.0 NaN NaN Yarra -37.7996 144.9984 Northern Metropolitan 4019.0
4/02/2016 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 2.5 3067.0 2.0 1.0 0.0 156.0 79.0 1900.0 Yarra -37.8079 144.9934 Northern Metropolitan 4019.0
4/03/2017 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 2.5 3067.0 3.0 2.0 0.0 134.0 150.0 1900.0 Yarra -37.8093 144.9944 Northern Metropolitan 4019.0
4/03/2017 Abbotsford 40 Federation La 3 h 850000.0 PI Biggin 2.5 3067.0 3.0 2.0 1.0 94.0 NaN NaN Yarra -37.7969 144.9969 Northern Metropolitan 4019.0
4/06/2016 Abbotsford 55a Park St 4 h 1600000.0 VB Nelson 2.5 3067.0 3.0 1.0 2.0 120.0 142.0 2014.0 Yarra -37.8072 144.9941 Northern Metropolitan 4019.0
'''

1. Select data for modeling

# dropna drops rows containing missing values ("na" stands for "not available")
melbourne_data = melbourne_data.dropna(axis=0)
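Before dropping rows it helps to see how many values are actually missing per column. A minimal sketch of what `dropna(axis=0)` does, on a small made-up frame (the real melb_data.csv is not loaded here):

```python
# Two columns with one NaN each: only rows with no NaN at all survive dropna.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Price":        [1480000.0, 1035000.0, np.nan],
    "BuildingArea": [np.nan,    79.0,      150.0],
})
print(df.isnull().sum())    # NaN count per column (1 in each)
clean = df.dropna(axis=0)   # drop every row that contains a missing value
print(len(clean))           # → 1 (only the middle row has no NaN)
```

On the Melbourne data this step discards a large fraction of the rows (BuildingArea and YearBuilt have thousands of NaNs), which is worth keeping in mind when comparing results.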

2. Choose the prediction target and select features

# Save the sale price into the variable y
y = melbourne_data.Price

# Features: rooms, bathrooms, land size, latitude, longitude
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

X.describe()
'''
Rooms Bathroom Landsize Lattitude Longtitude
count 13580.000000 13580.000000 13580.000000 13580.000000 13580.000000
mean 2.937997 1.534242 558.416127 -37.809203 144.995216
std 0.955748 0.691712 3990.669241 0.079260 0.103916
min 1.000000 0.000000 0.000000 -38.182550 144.431810
25% 2.000000 1.000000 177.000000 -37.856822 144.929600
50% 3.000000 1.000000 440.000000 -37.802355 145.000100
75% 3.000000 2.000000 651.000000 -37.756400 145.058305
max 10.000000 8.000000 433014.000000 -37.408530 145.526350
'''
X.head()
'''
Rooms Bathroom Landsize Lattitude Longtitude
Date
3/12/2016 2 1.0 202.0 -37.7996 144.9984
4/02/2016 2 1.0 156.0 -37.8079 144.9934
4/03/2017 3 2.0 134.0 -37.8093 144.9944
4/03/2017 3 2.0 94.0 -37.7969 144.9969
4/06/2016 4 1.0 120.0 -37.8072 144.9941
'''

3. Build the model

  • Create your first model with the scikit-learn library
from sklearn.tree import DecisionTreeRegressor # decision tree

# Define the model. Setting random_state ensures the same result on every run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit the model
melbourne_model.fit(X, y)

'''
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')
'''

4. Make predictions

print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))
print(y.head())

'''
The predictions are
[1480000. 1035000. 1465000.  850000. 1600000.]
Date
3/12/2016    1480000.0
4/02/2016    1035000.0
4/03/2017    1465000.0
4/03/2017     850000.0
4/06/2016    1600000.0
Name: Price, dtype: float64
'''

5. Model validation: MAE

  • Start with a metric called the Mean Absolute Error (MAE).
  • The smaller the MAE, the better.
  • Note that the predictions in step 4 matched the actual prices exactly because the model was scored on its own training data; the "in-sample" MAE computed below is therefore overly optimistic.
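The metric itself is simple: MAE = mean(|y - y_hat|), the average of the absolute prediction errors. A quick hand check with made-up numbers (not from the housing data):

```python
# Compute MAE by hand and confirm it matches sklearn's implementation.
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

manual = np.mean(np.abs(y_true - y_pred))   # (10 + 10 + 30) / 3
print(manual)                               # → 16.666...
print(mean_absolute_error(y_true, y_pred))  # same value
```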
from sklearn.metrics import mean_absolute_error # import the MAE function

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

'''
MAE: 62509.0528227786
'''

6. Split the data into training and validation sets with train_test_split

from sklearn.model_selection import train_test_split

# Split the data into training and validation sets, each with features and a target
# The split is driven by a random number generator; passing a value for
# random_state guarantees the same split on every run
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define the model
melbourne_model = DecisionTreeRegressor()
# Fit the model
melbourne_model.fit(train_X, train_y)

# Predict prices on the validation data
val_predictions = melbourne_model.predict(val_X)
print("MAE:", mean_absolute_error(val_y, val_predictions))

'''
MAE: 247974.12793323514
'''
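As a side note, `train_test_split` holds out 25% of the rows for validation by default; the `test_size` parameter makes this explicit. A small sketch with toy arrays (the data here is illustrative):

```python
# Default split is 75% train / 25% validation; test_size overrides it.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.arange(100)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
print(len(train_X), len(val_X))   # → 75 25

train_X, val_X, train_y, val_y = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(train_X), len(val_X))   # → 80 20
```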

7. Try different models

Underfitting and overfitting

  • Overfitting: too many leaf nodes; the model matches the training data almost perfectly but generalizes poorly to new data.
  • Underfitting: too few leaf nodes; the model fails to capture important features and patterns in the data.
from sklearn.metrics import mean_absolute_error # mean absolute error
from sklearn.tree import DecisionTreeRegressor # decision tree

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Compare the MAE for different values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

'''
Max leaf nodes: 5  		 Mean Absolute Error:  354662
Max leaf nodes: 50  		 Mean Absolute Error:  266447
Max leaf nodes: 500  		 Mean Absolute Error:  231301
Max leaf nodes: 5000  		 Mean Absolute Error:  249163
'''
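Once the candidate MAEs are computed, the winning leaf count can be picked programmatically with `min()`. A self-contained sketch on synthetic data; the dataset and the `get_mae` helper here are illustrative stand-ins for the Melbourne split above:

```python
# Score several max_leaf_nodes values and keep the one with the lowest MAE.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) * 100 + rng.normal(0, 10, size=500)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

candidates = [5, 50, 500, 5000]
scores = {n: get_mae(n) for n in candidates}
best = min(scores, key=scores.get)   # leaf count with the lowest validation MAE
print(best, scores[best])
```

On the real housing data the same selection would pick max_leaf_nodes=500, the sweet spot between the underfit and overfit trees in the output above.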

>> Random Forests

  • Decision trees leave you with a dilemma. A deep tree with many leaves will overfit, because each prediction comes from only the handful of training examples at its leaf. A shallow tree with few leaves will underfit, because it cannot capture enough of the variation in the original data.
  • A random forest uses many trees and predicts by averaging the predictions of its component trees. It generally achieves better predictive accuracy than a single decision tree.
from sklearn.ensemble import RandomForestRegressor # random forest
from sklearn.metrics import mean_absolute_error # MAE, mean absolute error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print("MAE:", mean_absolute_error(val_y, melb_preds))

'''
MAE: 191525.59192369733
'''

>> Conclusion

  • There is likely still room for improvement, but this is a big step up from the best decision tree's error of roughly 250,000.
  • You can tune parameters to improve random forest performance, just as we varied max_leaf_nodes for a single decision tree.
  • One of the best properties of random forest models, however, is that they usually work reasonably well even without such tuning.
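For instance, one simple tuning loop varies `n_estimators` (the number of trees) and keeps the value with the lowest validation MAE. A self-contained sketch on synthetic data; the parameter grid is illustrative, not from the course:

```python
# Tune a random forest's n_estimators by validation MAE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(400, 2))
y = X[:, 0] * 50 + X[:, 1] * 20 + rng.normal(0, 5, size=400)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

results = {}
for n in [10, 50, 100]:
    model = RandomForestRegressor(n_estimators=n, random_state=1)
    model.fit(train_X, train_y)
    results[n] = mean_absolute_error(val_y, model.predict(val_X))

best_n = min(results, key=results.get)   # tree count with the lowest MAE
print(best_n, round(results[best_n], 2))
```

The same pattern extends to other parameters such as max_depth or min_samples_leaf, at the cost of a longer loop.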
