09-29 Intro to Machine Learning

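Your First Machine Learning Model

Reading Data with pandas

The first step in machine learning is preparing the data. For tabular data such as .csv files, we use the pandas library.

Load the data and view its summary statistics: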

import pandas as pd

iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)
home_data.describe()
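
To try the same calls without the Kaggle file, here is a minimal sketch that reads a small in-memory CSV instead. The column names follow the course dataset, but the three rows and the names csv_text, sample_data and summary are hypothetical placeholders chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for train.csv (placeholder values)
csv_text = """LotArea,YearBuilt,SalePrice
8000,2000,200000
9500,1975,180000
11000,2001,220000
"""

# pd.read_csv accepts a file path or any file-like object
sample_data = pd.read_csv(io.StringIO(csv_text))

# describe() reports count, mean, std, min, quartiles and max per numeric column
summary = sample_data.describe()
print(summary)
```

In a notebook the last expression of a cell is displayed automatically, which is why the original snippet calls describe() without print. In a plain script, wrap it in print() as above.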

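Selecting the Data You Need

Use home_data.columns to list the names of all columns in the table.
One of them, SalePrice, holds the house price we want to predict. Assign it to y: y = home_data.SalePrice
Next, pick some features related to the house price and assign them to X.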

feature_names = ['LotArea','YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr',
                 'TotRmsAbvGrd']

# Select data corresponding to features in feature_names
X = home_data[feature_names]
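
The two selection styles used above behave slightly differently, which the following sketch illustrates on a hypothetical miniature table (df and its values are made up for the example). Dot access only works for column names that are valid Python identifiers, so a column such as 1stFlrSF has to be selected with brackets.

```python
import pandas as pd

# Hypothetical miniature of the housing table (placeholder values)
df = pd.DataFrame({
    "LotArea": [8000, 9500, 11000],
    "1stFlrSF": [850, 1250, 900],
    "SalePrice": [200000, 180000, 220000],
})

# List every column name
print(df.columns.tolist())

# Dot access works because SalePrice is a valid Python identifier
y = df.SalePrice

# Bracket access works for any name, including ones that start with a digit
first_floor = df["1stFlrSF"]

# A list of names selects several columns and returns a DataFrame
X = df[["LotArea", "1stFlrSF"]]

print(type(y).__name__, type(X).__name__)
```

A single column comes back as a Series and a list of columns as a DataFrame, which matches what the model expects for y and X respectively.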

Check whether the selected data contains any anomalous values:

print(X.describe())

# print the top few lines
print(X.head())
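
One kind of anomaly that is easy to miss in the describe() output is missing data. As a hedged sketch (the frame X_demo and its values are hypothetical, with one gap inserted on purpose), missing entries can be counted per column like this:

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with one deliberately missing LotArea value
X_demo = pd.DataFrame({
    "LotArea": [8000, 9500, np.nan, 11000],
    "YearBuilt": [2000, 1975, 2001, 1850],
})

# Number of missing entries in each column
missing_counts = X_demo.isnull().sum()
print(missing_counts)

# The 'count' row of describe() excludes missing values, so a count
# lower than the number of rows points to the same problem
counts = X_demo.describe().loc["count"]
print(counts)
```

The min and max rows of describe() are worth a glance too, since implausible extremes (a negative area, a construction year in the future) would show up there.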

Creating (and Training) the Model

This step is straightforward. We use a decision tree model from sklearn:

from sklearn.tree import DecisionTreeRegressor
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit the model
iowa_model.fit(X, y)

Predicting with the Model

predictions = iowa_model.predict(X)
print(predictions)
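
Note that these predictions are made on the same rows the model was trained on. The sketch below, on synthetic data (X_toy, y_toy and toy_model are hypothetical names, and the targets are pure noise), suggests why such in-sample predictions say little about model quality: an unconstrained decision tree can simply memorize its training rows when every row has distinct feature values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# 50 rows with distinct feature values and targets that are pure noise
X_toy = np.arange(50).reshape(-1, 1)
y_toy = rng.normal(size=50)

# With default settings the tree keeps splitting until each leaf holds one row
toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(X_toy, y_toy)

# Predicting on the training rows reproduces the training targets
in_sample_predictions = toy_model.predict(X_toy)
print(np.allclose(in_sample_predictions, y_toy))
```

There is nothing to learn from noise, yet the in-sample fit looks perfect. This is the reason for evaluating on data the model has not seen.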

Model Validation

Splitting the Data into Training and Validation Sets

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
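
As a small illustration of what this call does, the sketch below splits 100 synthetic rows (X_toy and y_toy are hypothetical). When neither test_size nor train_size is given, train_test_split holds out 25% of the rows, and a fixed random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 2 features each, plus one target per row
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Default split: 75% of the rows for training, 25% for validation
train_X, val_X, train_y, val_y = train_test_split(X_toy, y_toy, random_state=1)
print(train_X.shape, val_X.shape)

# Repeating the call with the same random_state yields the same split
_, val_X_again, _, _ = train_test_split(X_toy, y_toy, random_state=1)
print(np.array_equal(val_X, val_X_again))
```

Rows are shuffled before splitting, and no row ends up in both sets.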

This time, fit only on the training set:

iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data.
iowa_model.fit(train_X,train_y)

Then predict on the validation set:

val_predictions = iowa_model.predict(val_X)

Computing the Error Between Actual and Predicted Values

from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)

# Print the validation MAE
print(val_mae)
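
Mean absolute error is the average of the absolute differences between actual and predicted values. The sketch below computes it by hand on three hypothetical prices (actual and predicted are made-up arrays) and compares the result with mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted sale prices
actual = np.array([200000, 150000, 300000])
predicted = np.array([210000, 140000, 330000])

# Absolute errors are 10000, 10000 and 30000; their mean is the MAE
manual_mae = np.mean(np.abs(actual - predicted))

# The library function should give the same number
library_mae = mean_absolute_error(actual, predicted)
print(manual_mae, library_mae)
```

Because the error is expressed in the same unit as the target, the MAE can be read directly as "on average, the prediction is off by about this many dollars".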
