In this homework, you will investigate multivariate linear regression using Gradient Descent and Stochastic Gradient Descent. You will also examine how the cost function, the convergence of gradient descent, overfitting, and the learning rate relate to one another.
Download the file “dataForTraining.txt” from the attached folder “Homework 2”. It is a training dataset of apartment prices in Haizhu District, Guangzhou, Guangdong, China. It contains 50 training instances, one per line, in three whitespace-separated columns: the first is the apartment size in square meters, the second is the distance to the Double-Duck-Mountain Vocational Technical College in kilometers, and the third is the corresponding sale price in units of 10,000 RMB (万元; the English wording of the assignment says billion RMB). Build a multivariate linear regression model from the 50 training instances, trained with (stochastic) gradient descent and scripted in any programming language, to predict apartment prices. For evaluation, also download “dataForTesting.txt” from the same folder; it has the same format as the training file (size, distance, total price).
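The assignment allows either batch or stochastic gradient descent. As a rough illustration of the stochastic variant, the sketch below updates the weights from one sample at a time, visiting the samples in a new random order each epoch. Everything in it is an assumption made for illustration: the function names `sgd_linear_regression` and `sse`, the synthetic data standing in for the training file, the bias column of ones, and the learning rate and epoch count.

```python
import numpy as np


def sgd_linear_regression(features, targets, lr, epochs, seed=0):
    """Fit linear-regression weights by SGD, one sample per update."""
    rng = np.random.default_rng(seed)
    weight = np.zeros(features.shape[1])
    for _ in range(epochs):
        # Visit the samples in a fresh random order each epoch.
        for i in rng.permutation(len(targets)):
            residual = targets[i] - features[i] @ weight
            weight += lr * residual * features[i]
    return weight


def sse(features, weight, targets):
    """Sum of squared errors of the linear model."""
    return float(((features @ weight - targets) ** 2).sum())


# Synthetic stand-in for the 50-row training file (all values are made up).
rng = np.random.default_rng(42)
size = rng.uniform(50.0, 150.0, 50)
distance = rng.uniform(0.5, 10.0, 50)
features = np.column_stack([size, distance, np.ones(50)])  # last column: intercept
targets = features @ np.array([6.0, -70.0, 80.0])

initial_sse = sse(features, np.zeros(3), targets)
weight = sgd_linear_regression(features, targets, lr=0.00004, epochs=500)
final_sse = sse(features, weight, targets)
print(initial_sse, final_sse)
```

Because the features are not rescaled, the intercept converges much more slowly than the other two weights, so the final error is far below the starting error but not zero.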
Exercise 1: How many parameters does this linear regression model need? Use Gradient Descent to obtain the optimal parameters. Before training, set the number of iterations to 1500000, the learning rate to 0.00015, and the initial values of all parameters to 0.0. During training, every 100000 iterations (at 100000, 200000, …, 1500000), compute the training error and the testing error under the current parameters, and report them in a figure (drawn by hand or with any software). What do the plots show? Briefly analyze them.
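For reference, the model and update rule that the exercise appears to assume can be written out as follows. This is a sketch under stated assumptions: the symbols $\theta_0, \theta_1, \theta_2$, $m$ and $\alpha$ are not named in the assignment, and the halved mean-squared-error cost is the conventional choice, not something the text specifies. With two features (size $x_1$, distance $x_2$) plus an intercept, the model has three parameters.

```latex
\begin{align}
h_\theta(x) &= \theta_0 + \theta_1 x_1 + \theta_2 x_2 \\
J(\theta) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
\theta_j &\leftarrow \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m}
  \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)},
  \qquad j = 0, 1, 2, \quad x_0^{(i)} = 1
\end{align}
```

Here $m = 50$ and $\alpha = 0.00015$. The `gradient` function in the script averages $(y^{(i)} - h_\theta(x^{(i)}))\,x_j^{(i)}$ over the training samples and the update adds it, which is the same step. The `loss` function reports the plain sum of squared errors, which for the training set equals $2mJ(\theta)$.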
Iterations | Training error | Testing error |
100000 | 3350.439763530956 | 1244.6358567507064 |
200000 | 752.9741700253077 | 1194.5523365857432 |
300000 | 290.80670041460706 | 1258.7206255364906 |
400000 | 208.57316842851702 | 1300.964391608669 |
500000 | 193.94134289436053 | 1321.4839068994477 |
600000 | 191.33789983055019 | 1330.6198746479934 |
700000 | 190.87466878868773 | 1334.559078113643 |
800000 | 190.79224601511947 | 1336.2359151769367 |
900000 | 190.77758051780003 | 1336.9459412688884 |
1000000 | 190.7749710835127 | 1337.245924411146 |
1100000 | 190.77450678645073 | 1337.372548293513 |
1200000 | 190.77442417400073 | 1337.4259757329712 |
1300000 | 190.77440947475455 | 1337.4485150846626 |
1400000 | 190.77440685931563 | 1337.458023064512 |
1500000 | 190.7744063939496 | 1337.4620337845404 |
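One way to read the table above is to look for the checkpoint with the lowest testing error and to check how each error moves afterwards. The short sketch below does that on the table values, rounded to two decimals; the variable names are chosen here for illustration.

```python
# Errors copied from the table above, rounded to two decimals.
iterations = [100000 * k for k in range(1, 16)]
train_error = [3350.44, 752.97, 290.81, 208.57, 193.94, 191.34, 190.87, 190.79,
               190.78, 190.77, 190.77, 190.77, 190.77, 190.77, 190.77]
test_error = [1244.64, 1194.55, 1258.72, 1300.96, 1321.48, 1330.62, 1334.56, 1336.24,
              1336.95, 1337.25, 1337.37, 1337.43, 1337.45, 1337.46, 1337.46]

# Checkpoint with the lowest testing error.
best = min(range(len(iterations)), key=lambda k: test_error[k])
print(iterations[best], test_error[best])  # 200000 1194.55

# Training error never increases from one checkpoint to the next ...
train_non_increasing = all(a >= b for a, b in zip(train_error, train_error[1:]))
# ... while testing error does not decrease again after its minimum.
test_rising_after_best = all(
    a <= b for a, b in zip(test_error[best:], test_error[best + 1:])
)
print(train_non_increasing, test_rising_after_best)  # True True
```

The training error falls steadily and levels off near 190.77, while the testing error is lowest at 200000 iterations and then climbs to about 1337.46. This pattern is consistent with overfitting: continued training keeps fitting the 50 training instances more closely without improving predictions on the test set.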
import numpy as np
import matplotlib.pyplot as plt


def loss(train_data, weight, train_real):
    """Sum of squared errors between predicted and real prices."""
    result = np.matmul(train_data, weight.T)
    return ((result - train_real) ** 2).sum()


def gradient(train_data, weight, train_real):
    """Descent direction: (real - predicted) * feature, averaged over the samples."""
    result = np.matmul(train_data, weight.T)
    residual = train_real - result
    return np.matmul(residual, train_data) / train_data.shape[0]


def load_data(path):
    # The bodies of the original parsing loops were lost. This reconstruction
    # assumes each row is "size distance price" and that a constant column of
    # ones is appended for the intercept (its position is a guess).
    raw = np.loadtxt(path)
    features = np.column_stack([raw[:, :2], np.ones(len(raw))])
    return features, raw[:, 2]


filename = "./dataForTraining.txt"
filename1 = "./dataForTesting.txt"

# input the train data and the test data
train_data, train_real = load_data(filename)
train_test_data, train_test_real = load_data(filename1)

# set the parameters
weight = np.array([0.0, 0.0, 0.0])
lr = 0.00015
x = []  # iteration counts
y = []  # training error at each checkpoint
z = []  # testing error at each checkpoint

for num in range(1, 1500001):
    losses = loss(train_data, weight, train_real)
    real_losses = loss(train_test_data, weight, train_test_real)
    gra = gradient(train_data, weight, train_real)
    # gradient() already points downhill, so the step is added.
    weight = weight + (gra * lr)
    if num % 100000 == 0:
        print(losses, real_losses)
        x.append(num)
        y.append(losses)
        z.append(real_losses)

plt.plot(x, y, label="training error")
plt.plot(x, z, label="testing error")
plt.xlabel("iterations")
plt.ylabel("sum of squared errors")
plt.legend()
plt.show()
Exercise 2: Now change the learning rate to a different value, for instance 0.0002 (you may keep the same number of iterations or change it), and train the model again. What do you find? Briefly summarize your findings.
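To see why the choice of learning rate matters, it can help to compare a step size below and above the stability limit of plain gradient descent. The sketch below is illustrative only: the data are synthetic, not the homework files, and the names `run_gd` and `lr_limit` are chosen here. For a squared-error cost, batch gradient descent is stable only when the learning rate is below 2 divided by the largest eigenvalue of XᵀX / m.

```python
import numpy as np


def run_gd(features, targets, lr, steps):
    """Batch gradient descent from zero weights; returns the final sum of squared errors."""
    weight = np.zeros(features.shape[1])
    m = len(targets)
    for _ in range(steps):
        residual = targets - features @ weight
        weight += lr * (features.T @ residual) / m
    return float(((features @ weight - targets) ** 2).sum())


# Synthetic data with the same shape as the homework files (values are made up).
rng = np.random.default_rng(0)
size = rng.uniform(50.0, 150.0, 50)
distance = rng.uniform(0.5, 10.0, 50)
features = np.column_stack([size, distance, np.ones(50)])
targets = features @ np.array([6.0, -70.0, 80.0]) + rng.normal(0.0, 10.0, 50)

# Stability limit for this quadratic cost: lr < 2 / lambda_max(X^T X / m).
largest = np.linalg.eigvalsh(features.T @ features / 50).max()
lr_limit = 2.0 / largest

initial = float((targets ** 2).sum())  # error with all-zero weights
stable = run_gd(features, targets, lr=0.5 * lr_limit, steps=200)
unstable = run_gd(features, targets, lr=1.5 * lr_limit, steps=200)
print(lr_limit, stable < initial, unstable > initial)
```

Because apartment sizes are on the order of 100, the largest eigenvalue is large and the usable learning rates are small. The limit for the real training file depends on its actual values and is not computed here.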
Iterations | Training error | Testing error |
100000 | 12763.688614677096 | 1996.5022175657937 |
200000 | 9092.749685611183 | 1666.1554987348031 |
300000 | 6493.622143912071 | 1454.12970883651 |
400000 | 4653.367073328151 | 1322.4111037375815 |
500000 | 3350.414959405551 | 1244.634551279207 |
600000 | 2427.88838234995 | 1202.595415819055 |
700000 | 1774.7137366511224 | 1183.7935711665805 |
800000 | 1312.247799739428 | 1179.7061231510397 |
900000 | 984.8089681623431 | 1184.5742349763213 |
1000000 | 752.973107686592 | 1194.5524000926828 |
1100000 | 588.8268274679675 | 1207.1130324317212 |
1200000 | 472.6066652621286 | 1220.6307298687423 |
1300000 | 390.3195364574001 | 1234.0928269411381 |
1400000 | 332.05794536688643 | 1246.89858605016 |
1500000 | 290.80710765138787 | 1258.7204926660925 |
import numpy as np
import matplotlib.pyplot as plt


def loss(train_data, weight, train_real):
    """Sum of squared errors between predicted and real prices."""
    result = np.matmul(train_data, weight.T)
    return ((result - train_real) ** 2).sum()


def gradient(train_data, weight, train_real):
    """Descent direction: (real - predicted) * feature, averaged over the samples."""
    result = np.matmul(train_data, weight.T)
    residual = train_real - result
    return np.matmul(residual, train_data) / train_data.shape[0]


def load_data(path):
    # The bodies of the original parsing loops were lost. This reconstruction
    # assumes each row is "size distance price" and that a constant column of
    # ones is appended for the intercept (its position is a guess).
    raw = np.loadtxt(path)
    features = np.column_stack([raw[:, :2], np.ones(len(raw))])
    return features, raw[:, 2]


filename = "./dataForTraining.txt"
filename1 = "./dataForTesting.txt"

# input the train data and the test data
train_data, train_real = load_data(filename)
train_test_data, train_test_real = load_data(filename1)

# set the parameters
weight = np.array([0.0, 0.0, 0.0])
lr = 0.00003  # the changed learning rate for this exercise
x = []  # iteration counts
y = []  # training error at each checkpoint
z = []  # testing error at each checkpoint

for num in range(1, 1500001):
    losses = loss(train_data, weight, train_real)
    real_losses = loss(train_test_data, weight, train_test_real)
    gra = gradient(train_data, weight, train_real)
    # gradient() already points downhill, so the step is added.
    weight = weight + (gra * lr)
    if num % 100000 == 0:
        print(losses, real_losses)
        x.append(num)
        y.append(losses)
        z.append(real_losses)

plt.plot(x, y, label="training error")
plt.plot(x, z, label="testing error")
plt.xlabel("iterations")
plt.ylabel("sum of squared errors")
plt.legend()
plt.show()
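The script for this exercise sets `lr = 0.00003`, one fifth of the 0.00015 used in Exercise 1. A quick way to compare the two runs is to pair checkpoints whose product of learning rate and iteration count is the same. The sketch below does this with values copied from the two tables; the variable names are chosen here for illustration.

```python
# (learning rate, iterations, training error, testing error), copied from the two tables.
exercise_1 = [
    (0.00015, 100000, 3350.439763530956, 1244.6358567507064),
    (0.00015, 200000, 752.9741700253077, 1194.5523365857432),
    (0.00015, 300000, 290.80670041460706, 1258.7206255364906),
]
exercise_2 = [
    (0.00003, 500000, 3350.414959405551, 1244.634551279207),
    (0.00003, 1000000, 752.973107686592, 1194.5524000926828),
    (0.00003, 1500000, 290.80710765138787, 1258.7204926660925),
]

comparison = []
for a, b in zip(exercise_1, exercise_2):
    # lr * iterations matches within each pair, up to float rounding.
    same_budget = abs(a[0] * a[1] - b[0] * b[1]) < 1e-9
    train_gap = abs(a[2] - b[2])
    test_gap = abs(a[3] - b[3])
    comparison.append((same_budget, train_gap, test_gap))
    print(same_budget, round(train_gap, 4), round(test_gap, 4))
```

In each pair the two runs agree to within about 0.03 in both errors. On this evidence, the smaller learning rate follows nearly the same path but needs about five times as many iterations to reach the same point. After 1500000 iterations it has only reached the state that the Exercise 1 run reached at 300000, and its training error is still falling.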