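"Least squares" in one sentence: a mathematical optimization method that finds the best-fitting function for a set of data by minimizing the sum of squared errors.
Least squares for a linear model can be implemented in several ways: solving for the closed-form solution directly with matrix operations, using the sklearn package (see: "Learning linear regression with scikit-learn and pandas" and "Solving multiple linear regression problems with scikit-learn"), or using the leastsq function in scipy (see: "How to use leastsq function from scipy.optimize").
This post uses scipy's leastsq function. The code is as follows.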
from scipy.optimize import leastsq
import numpy as np


def main():
    # data provided: each row is [1 (intercept term), x1, x2, x3]
    x = np.array([[1, 50, 5, 200], [1, 50, 5, 400], [1, 50, 5, 600], [1, 50, 5, 800], [1, 50, 5, 1000],
                  [1, 50, 10, 200], [1, 50, 10, 400], [1, 50, 10, 600], [1, 50, 10, 800], [1, 50, 10, 1000],
                  [1, 60, 5, 200], [1, 60, 5, 400], [1, 60, 5, 600], [1, 60, 5, 800], [1, 60, 5, 1000],
                  [1, 60, 10, 200], [1, 60, 10, 400], [1, 60, 10, 600], [1, 60, 10, 800], [1, 60, 10, 1000],
                  [1, 70, 5, 200], [1, 70, 5, 400], [1, 70, 5, 600], [1, 70, 5, 800], [1, 70, 5, 1000],
                  [1, 70, 10, 200], [1, 70, 10, 400]])
    y = np.array([7.434, 3.011, 1.437, 0.6728, 0.00036,
                  5.518, 2.556, 1.341, 0.6824, 0.0001,
                  18.22, 7.344, 4.066, 1.799, 1.218,
                  16.11, 9.448, 4.752, 2.245, 1.539,
                  18.14, 12.88, 7.29, 3.449, 2.533,
                  15.76, 16.24])
    # here, create lambda functions for the line fit
    # tpl is a tuple that contains the parameters of the fit
    funcLine = lambda tpl, x: np.dot(x, tpl)
    # func is a placeholder for funcLine, funcQuad or whatever
    # function we would like to fit
    func = funcLine
    # ErrorFunc is the difference between func and the y "experimental" data
    ErrorFunc = lambda tpl, x, y: func(tpl, x) - y
    # tplInitial contains the "first guess" of the parameters
    tplInitial = [1.0, 1.0, 1.0, 1.0]
    # leastsq finds the set of parameters in the tuple tpl that minimizes
    # the sum of squares of ErrorFunc = yfit - yExperimental.
    # It returns the solution and an integer status flag (1 to 4 means a solution was found).
    tplFinal, ier = leastsq(ErrorFunc, tplInitial, args=(x, y))
    print('linear fit', tplFinal)
    print(funcLine(tplFinal, x))


if __name__ == "__main__":
    main()
# tplFinal values
[-8.43371266 0.3787503 0.11744081 -0.01485372]
# predicted y values
[ 8.12026253 5.1495184 2.17877428 -0.79196984 -3.76271396
8.70746659 5.73672247 2.76597835 -0.20476577 -3.17550989
11.90776557 8.93702145 5.96627733 2.99553321 0.02478909
12.49496964 9.52422552 6.5534814 3.58273728 0.61199315
15.69526862 12.7245245 9.75378038 6.78303626 3.81229214
16.28247269 13.31172857]
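As a cross-check on these coefficients, the closed-form route mentioned at the start of the post can be sketched with NumPy's np.linalg.lstsq. This is only an illustrative sketch, not the code used in the post: the helper name build_data is made up here, and the data are the same 27 runs as in the listing, rebuilt from their grid pattern.

```python
import numpy as np


def build_data():
    """Rebuild the post's design matrix and targets (hypothetical helper)."""
    # The runs follow a grid over x1, x2, x3 and stop after the first two
    # x3 values of the last (x1=70, x2=10) group, giving 27 rows.
    rows = [[1, x1, x2, x3]
            for x1 in (50, 60, 70)
            for x2 in (5, 10)
            for x3 in (200, 400, 600, 800, 1000)][:27]
    X = np.array(rows, dtype=float)
    y = np.array([7.434, 3.011, 1.437, 0.6728, 0.00036,
                  5.518, 2.556, 1.341, 0.6824, 0.0001,
                  18.22, 7.344, 4.066, 1.799, 1.218,
                  16.11, 9.448, 4.752, 2.245, 1.539,
                  18.14, 12.88, 7.29, 3.449, 2.533,
                  15.76, 16.24])
    return X, y


X, y = build_data()

# Solve min ||X @ beta - y||^2 directly; rcond=None selects the default cutoff.
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print("coefficients:", beta)
print("rank of X:", rank)
```

If the data were copied correctly, the printed coefficients should agree closely with the tplFinal values above, since both approaches minimize the same sum of squared residuals.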
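a) The results show that the linear model does not fit the data particularly well: several predicted values are negative even though every observed y is positive. A more complex model, such as a quadratic curve, is worth trying next.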
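b) Before fitting a line, first look at the given data yourself to see whether x and y are related in any obvious way. The data above clearly come from a controlled-variable experiment, so start by inspecting how each independent variable (feature) relates to the dependent variable (target). You will quickly notice that x3 (200, 400, 600, ...) is negatively correlated with y. That at least gives you a rough idea of what to expect from the fit.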
c) Using scipy's leastsq function for this does not feel all that convenient; next time it may be worth trying the sklearn package.
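Notes a) and b) can be explored with a few lines of NumPy. The sketch below first computes the Pearson correlation between each feature and y, then adds a squared x3 term as one possible "more complex model" and compares the residual sum of squares (RSS) with the plain linear fit. Choosing the square of x3 as the extra term is an assumption made for illustration, not something the analysis above prescribes, and the helper name rss is made up here.

```python
import numpy as np

# Rebuild the 27 runs from the listing: columns are [1, x1, x2, x3].
X = np.array([[1, x1, x2, x3]
              for x1 in (50, 60, 70)
              for x2 in (5, 10)
              for x3 in (200, 400, 600, 800, 1000)][:27], dtype=float)
y = np.array([7.434, 3.011, 1.437, 0.6728, 0.00036,
              5.518, 2.556, 1.341, 0.6824, 0.0001,
              18.22, 7.344, 4.066, 1.799, 1.218,
              16.11, 9.448, 4.752, 2.245, 1.539,
              18.14, 12.88, 7.29, 3.449, 2.533,
              15.76, 16.24])

# Note b): Pearson correlation of each feature column with y.
corr = {}
for name, col in zip(("x1", "x2", "x3"), X[:, 1:].T):
    corr[name] = np.corrcoef(col, y)[0, 1]
    print(f"corr({name}, y) = {corr[name]:+.3f}")


def rss(design, target):
    """Residual sum of squares of the least-squares fit (hypothetical helper)."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = design @ coef - target
    return float(resid @ resid)


# Note a): augment the design matrix with x3 squared (illustrative choice).
X_quad = np.column_stack([X, X[:, 3] ** 2])

rss_linear = rss(X, y)
rss_quad = rss(X_quad, y)
print("RSS, linear model:      ", rss_linear)
print("RSS, with x3**2 column: ", rss_quad)
```

The augmented model can never have a larger RSS than the linear one, because the linear model is the special case where the extra coefficient is zero. What matters is therefore how large the drop is, not merely that there is one.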