| Training sample | Diameter (inches) | Price (dollars) |
| --- | --- | --- |
| 1 | 6 | 7 |
| 2 | 8 | 9 |
| 3 | 10 | 13 |
| 4 | 14 | 17.5 |
| 5 | 18 | 18 |
from sklearn.linear_model import LinearRegression
# Training data: pizza diameters (inches) and prices (dollars)
X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
model = LinearRegression()
model.fit(X, y)
# predict() expects a 2D array: one row per sample, one column per feature
print('Predict the price by linear regression: $%.2f' % model.predict([[12]])[0][0])
# prints: Predict the price by linear regression: $13.68
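To see exactly which line was fitted, we can also print the learned parameters; `coef_` and `intercept_` are standard attributes of a fitted LinearRegression. A minimal sketch (the variable names are mine):

from sklearn.linear_model import LinearRegression
X = [[6], [8], [10], [14], [18]]
y = [7, 9, 13, 17.5, 18]
model = LinearRegression().fit(X, y)
# The fitted line is: price = intercept_ + coef_[0] * diameter
print('Intercept:', model.intercept_)
print('Slope:', model.coef_[0])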
| Test sample | Diameter (inches) | Price (dollars) | Predicted price (dollars) |
| --- | --- | --- | --- |
| 1 | 8 | 11 | 9.7759 |
| 2 | 9 | 8.5 | 10.7522 |
| 3 | 11 | 15 | 12.7048 |
| 4 | 16 | 18 | 17.5863 |
| 5 | 12 | 11 | 13.6811 |
First, we compute the total sum of squares, where $\bar{y}$ is the mean of the prices $y$, $y_i$ is the price of the $i$-th sample, and $n$ is the number of samples:

$$SS_{tot} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$

Next, we compute the residual sum of squares, where $f(x_i)$ is the model's prediction for the $i$-th sample:

$$SS_{res} = \sum_{i=1}^{n}(y_i - f(x_i))^2$$

Finally, we compute R-squared:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
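To make these definitions concrete, here is a minimal sketch that computes R-squared by hand with NumPy, fitting on the training set above and evaluating on the five test samples (the variable names are mine):

import numpy as np
from sklearn.linear_model import LinearRegression
X_train = [[6], [8], [10], [14], [18]]
y_train = [7, 9, 13, 17.5, 18]
X_test = [[8], [9], [11], [16], [12]]
y_test = np.array([11, 8.5, 15, 18, 11])
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
ss_res = np.sum((y_test - predictions) ** 2)    # residual sum of squares
print('R-squared: %.4f' % (1 - ss_res / ss_tot))

scikit-learn's `score` method performs the same computation directly: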
from sklearn.linear_model import LinearRegression
X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
model = LinearRegression()
model.fit(X, y)
X_test = [[8], [9], [11], [16], [12]]
y_test = [[11], [8.5], [15], [18], [11]]
# score() returns R-squared on the test set (about 0.66 here)
print(model.score(X_test, y_test))
| Training sample | Diameter (inches) | Number of toppings | Price (dollars) |
| --- | --- | --- | --- |
| 1 | 6 | 2 | 7 |
| 2 | 8 | 1 | 9 |
| 3 | 10 | 0 | 13 |
| 4 | 14 | 2 | 17.5 |
| 5 | 18 | 0 | 18 |

| Test sample | Diameter (inches) | Number of toppings | Price (dollars) |
| --- | --- | --- | --- |
| 1 | 8 | 2 | 11 |
| 2 | 9 | 0 | 8.5 |
| 3 | 11 | 2 | 15 |
| 4 | 16 | 2 | 18 |
| 5 | 12 | 0 | 11 |
from sklearn.linear_model import LinearRegression
# Each sample now has two features: diameter and number of toppings
X = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y = [[7], [9], [13], [17.5], [18]]
model = LinearRegression()
model.fit(X, y)
X_test = [[8, 2], [9, 0], [11, 2], [16, 2], [12, 0]]
y_test = [[11], [8.5], [15], [18], [11]]
predictions = model.predict(X_test)
for i, prediction in enumerate(predictions):
    print('Predicted: %s, Target: %s' % (prediction, y_test[i]))
print('R-squared: %.2f' % model.score(X_test, y_test))
The output is:
Predicted: [ 10.0625], Target: [11]
Predicted: [ 10.28125], Target: [8.5]
Predicted: [ 13.09375], Target: [15]
Predicted: [ 18.14583333], Target: [18]
Predicted: [ 13.3125], Target: [11]
R-squared: 0.77
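It is also instructive to inspect the parameters the model learned; with two features, there is one weight per feature plus an intercept. A minimal self-contained sketch:

from sklearn.linear_model import LinearRegression
X = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y = [7, 9, 13, 17.5, 18]
model = LinearRegression().fit(X, y)
# One learned weight per feature (diameter, number of toppings), plus a bias
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)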
| Training sample | Diameter (inches) | Price (dollars) |
| --- | --- | --- |
| 1 | 6 | 7 |
| 2 | 8 | 9 |
| 3 | 10 | 13 |
| 4 | 14 | 17.5 |
| 5 | 18 | 18 |

| Test sample | Diameter (inches) | Price (dollars) |
| --- | --- | --- |
| 1 | 6 | 8 |
| 2 | 8 | 12 |
| 3 | 11 | 15 |
| 4 | 16 | 18 |
from sklearn.preprocessing import PolynomialFeatures
X_train = [[6], [8], [10], [14], [18]]
quadratic_featurizer = PolynomialFeatures(2)  # expand features up to degree 2
X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
print(X_train_quadratic)
The output is:
[[ 1. 6. 36.]
[ 1. 8. 64.]
[ 1. 10. 100.]
[ 1. 14. 196.]
[ 1. 18. 324.]]
As you can see, the transform simply expands the input: each value x becomes the row [1, x, x²], i.e., the powers of x from degree 0 up to degree 2.
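The same expansion works with more than one input feature, in which case cross terms appear as well. A minimal sketch (the two-feature sample here is made up purely for illustration):

from sklearn.preprocessing import PolynomialFeatures
X = [[2, 3]]  # one hypothetical sample with features a=2, b=3
print(PolynomialFeatures(2).fit_transform(X))
# Output columns are 1, a, b, a^2, a*b, b^2:
# [[1. 2. 3. 4. 6. 9.]]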
# coding=utf-8
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt

def run_plt():
    plt.figure()
    plt.title('Price-Size')
    plt.xlabel('Size')
    plt.ylabel('Price')
    plt.axis([0, 25, 0, 25])
    plt.grid(True)
    return plt

plt = run_plt()
X_train = [[6], [8], [10], [14], [18]]
y_train = [[7], [9], [13], [17.5], [18]]
X_test = [[6], [8], [11], [16]]
y_test = [[8], [12], [15], [18]]
# 100 evenly spaced points over [0, 25], used to draw the fitted curves
xx = np.linspace(0, 25, 100)
xx_input = xx.reshape(xx.shape[0], 1)  # reshape into a 100x1 column matrix, same layout as X_train
print(xx_input)
plt.plot(X_train, y_train, 'b.')  # plot the training points
plt.plot(X_test, y_test, 'r.')    # plot the test points
# Fit a simple (first-degree) linear regression and draw the resulting straight line
regressor = LinearRegression()
regressor.fit(X_train, y_train)
yy = regressor.predict(xx_input)
plt.plot(xx, yy, 'c-')  # the line from simple linear regression
print('Linear Regression r-squared', regressor.score(X_test, y_test))
# Quadratic (second-degree) regression
quadratic_featurizer = PolynomialFeatures(2)
X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
X_test_quadratic = quadratic_featurizer.transform(X_test)
print(X_train_quadratic)  # the transformed input
regressor_quadratic = LinearRegression()
regressor_quadratic.fit(X_train_quadratic, y_train)
xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))
# Plot the quadratic regression curve; the extra features are all derived
# from the single original input
plt.plot(xx, regressor_quadratic.predict(xx_quadratic), 'm-')
print('Polynomial Regression r-squared', regressor_quadratic.score(X_test_quadratic, y_test))
plt.show()
[[ 0. ]
[ 0.25252525]
...
[ 24.74747475]
[ 25. ]]
Linear Regression r-squared 0.80972679770766498
[[ 1. 6. 36.]
[ 1. 8. 64.]
[ 1. 10. 100.]
[ 1. 14. 196.]
[ 1. 18. 324.]]
Polynomial Regression r-squared 0.86754436563451076
Note:
The R-squared of the quadratic fit is higher than that of the linear fit. However, if we instead evaluate on the test set from the previous example, namely
X = [8, 9, 11, 16, 12]
Y = [11, 8.5, 15, 18, 11]
then the linear fit actually scores a higher R-squared than the quadratic fit.
In other words, R-squared depends on both the training set and the test set, and we cannot simply assume that a higher-degree fit always outperforms a lower-degree one.
Later we will look at why evaluating a model on a single test set is unreliable, and how splitting the data into multiple folds makes the evaluation more trustworthy; the sketch below previews that idea.
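As a preview, here is a minimal sketch using scikit-learn's `cross_val_score`, which fits and scores the model on several train/test splits (folds). Pooling the ten pizza samples from this section into one data set is my own choice, purely for illustration:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
X = [[6], [8], [10], [14], [18], [8], [9], [11], [16], [12]]
y = [7, 9, 13, 17.5, 18, 11, 8.5, 15, 18, 11]
# 3 folds: each fold takes one turn as the test set
scores = cross_val_score(LinearRegression(), X, y, cv=3)
print(scores)         # one R-squared value per fold
print(scores.mean())  # the average across folds

With so few samples the per-fold scores vary widely, which is exactly the point: averaging over several folds gives a more stable estimate than any single split.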
# Quadratic regression, as in the previous script
quadratic_featurizer = PolynomialFeatures(2)
X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
X_test_quadratic = quadratic_featurizer.transform(X_test)
regressor_quadratic = LinearRegression()
regressor_quadratic.fit(X_train_quadratic, y_train)
xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))
plt.plot(xx, regressor_quadratic.predict(xx_quadratic), 'm-')
# Cubic (third-degree) regression, for comparison
cubic_featurizer = PolynomialFeatures(3)
X_train_cubic = cubic_featurizer.fit_transform(X_train)
X_test_cubic = cubic_featurizer.transform(X_test)
regressor_cubic = LinearRegression()
regressor_cubic.fit(X_train_cubic, y_train)
xx_cubic = cubic_featurizer.transform(xx.reshape(xx.shape[0], 1))
plt.plot(xx, regressor_cubic.predict(xx_cubic))
print(X_train_cubic)
print('2 Polynomial r-squared', regressor_quadratic.score(X_test_quadratic, y_test))
print('3 Polynomial r-squared', regressor_cubic.score(X_test_cubic, y_test))
plt.show()
Quadratic regression r-squared 0.867544458591
Cubic regression r-squared 0.835692454062
# Quadratic regression again, as a baseline
quadratic_featurizer = PolynomialFeatures(2)
X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
X_test_quadratic = quadratic_featurizer.transform(X_test)
regressor_quadratic = LinearRegression()
regressor_quadratic.fit(X_train_quadratic, y_train)
xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))
plt.plot(xx, regressor_quadratic.predict(xx_quadratic), 'm-')
# Seventh-degree regression, to show what happens with too high a degree
seventh_featurizer = PolynomialFeatures(7)
X_train_seventh = seventh_featurizer.fit_transform(X_train)
X_test_seventh = seventh_featurizer.transform(X_test)
regressor_seventh = LinearRegression()
regressor_seventh.fit(X_train_seventh, y_train)
xx_seventh = seventh_featurizer.transform(xx.reshape(xx.shape[0], 1))
plt.plot(xx, regressor_seventh.predict(xx_seventh))
print('2 Polynomial r-squared', regressor_quadratic.score(X_test_quadratic, y_test))
print('7 Polynomial r-squared', regressor_seventh.score(X_test_seventh, y_test))
plt.show()
Quadratic regression r-squared 0.867544458591
Seventh-degree regression r-squared 0.487942421984
The seventh-degree model passes almost exactly through the five training points, yet scores far worse on the test set than the quadratic model: it has overfit the training data.
The multiple linear regression model can be written in matrix form:

$$Y = X\beta$$

where $Y$ is the column vector of the training set's response variable and $\beta$ is the column vector of model parameters. $X$, called the design matrix, is the $m \times n$ matrix of the training set's explanatory variables, where $m$ is the number of training samples and $n$ is the number of explanatory variables.
In our case $n$ is 2, i.e., the learning algorithm estimates three parameter values: two feature coefficients and one intercept.
For $Y = X\beta$ there is no such thing as matrix division (see any linear algebra text), so we solve for $\beta$ using the transpose and the inverse:

$$\beta = (X^{T}X)^{-1}X^{T}Y$$
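A minimal NumPy sketch of this formula, applied to the two-feature pizza data above; the leading column of ones in the design matrix plays the role of the intercept:

import numpy as np
# Design matrix: a column of ones (for the intercept) plus the two features
X = np.array([[1, 6, 2], [1, 8, 1], [1, 10, 0], [1, 14, 2], [1, 18, 0]])
Y = np.array([[7], [9], [13], [17.5], [18]])
# beta = (X^T X)^{-1} X^T Y
beta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(beta)  # intercept first, then the two feature coefficients

In practice, np.linalg.lstsq (or LinearRegression itself) is preferred over forming the inverse explicitly, since it is numerically more stable.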