Target function for the linear regression: y = 3x + 20
Program that generates the test data:
genData.py
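#!/usr/bin/python
# Emit one sample per line in the form "<label> 1:<x>", where label = 3*x + 20.
for i in range(0, 100):
    label = 3 * i + 20
    feature = "1:" + str(i)
    print("%d %s" % (label, feature))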
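The generated data is shown below. Each line holds the label followed by one feature written as index:value.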
20 1:0
23 1:1
26 1:2
29 1:3
32 1:4
35 1:5
38 1:6
41 1:7
44 1:8
47 1:9
50 1:10
53 1:11
56 1:12
59 1:13
62 1:14
65 1:15
68 1:16
71 1:17
74 1:18
77 1:19
80 1:20
83 1:21
86 1:22
89 1:23
92 1:24
95 1:25
98 1:26
101 1:27
104 1:28
107 1:29
110 1:30
113 1:31
116 1:32
119 1:33
122 1:34
125 1:35
128 1:36
131 1:37
134 1:38
137 1:39
140 1:40
143 1:41
146 1:42
149 1:43
152 1:44
155 1:45
158 1:46
161 1:47
164 1:48
167 1:49
170 1:50
173 1:51
176 1:52
179 1:53
182 1:54
185 1:55
188 1:56
191 1:57
194 1:58
197 1:59
200 1:60
203 1:61
206 1:62
209 1:63
212 1:64
215 1:65
218 1:66
221 1:67
224 1:68
227 1:69
230 1:70
233 1:71
236 1:72
239 1:73
242 1:74
245 1:75
248 1:76
251 1:77
254 1:78
257 1:79
260 1:80
263 1:81
266 1:82
269 1:83
272 1:84
275 1:85
278 1:86
281 1:87
284 1:88
287 1:89
290 1:90
293 1:91
296 1:92
299 1:93
302 1:94
305 1:95
308 1:96
311 1:97
314 1:98
317 1:99
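Each data line can be split into a label and a sparse feature map. The following is a minimal sketch of that parsing step, written for Python 3; the helper name parse_line is hypothetical and is not part of the original scripts.

```python
def parse_line(line):
    """Split '<label> <index>:<value> ...' into (label, {index: value})."""
    cols = line.strip().split()
    label = float(cols[0])
    features = {}
    for item in cols[1:]:
        # every feature is written as "index:value"
        idx, value = item.split(":")
        features[int(idx)] = float(value)
    return label, features


label, features = parse_line("23 1:1")
print(label, features)
```

Storing the index as an integer keeps later comparisons numeric, so index 10 sorts after index 9.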
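The training program uses the squared error (summed over all samples) as the loss function and searches for the optimal solution with a batch, gradient-descent-style update in which every iteration uses all samples to update each parameter. It reads the data from standard input.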
trainLinear.py
#!/usr/bin/python
import sys

instance = {}      # sample index -> label
all_feature = {}   # "sample_index,feature_index" -> feature value
index = 0          # number of samples read so far
max_index = 0      # largest feature index seen

# read data: each line is "<label> <feature_index>:<feature_value> ..."
for line in sys.stdin:
    cols = line.strip().split(" ")
    label = cols[0]
    for features in cols[1:]:
        feature_index, feature_value = features.split(":")
        # compare as integers; as strings, "10" would sort before "9"
        if max_index < int(feature_index):
            max_index = int(feature_index)
        all_feature[str(index) + "," + feature_index] = float(feature_value)
    instance[index] = float(label)
    index = index + 1

# init param: weight[0] is the constant term, weight[n] belongs to feature n
print("feature count : " + str(max_index))
weight = {}
for i in range(0, max_index + 1):
    weight[i] = 100

iter_maxno = 10000
for iter_no in range(0, iter_maxno):
    print("Iter " + str(iter_no) + ":")
    # predict every sample with the current weights and sum the squared errors
    loss = 0
    guess_label = {}
    for m in range(0, index):
        guess_label[m] = weight[0]
        for n in range(1, max_index + 1):
            guess_label[m] = guess_label[m] + all_feature[str(m) + "," + str(n)] * weight[n]
        diff = guess_label[m] - instance[m]
        loss = loss + diff * diff
    print("weight " + " ".join(str(weight[n]) for n in range(0, max_index + 1)) + " loss : " + str(loss))
    # update param: every weight is recomputed from the predictions made above
    for n in range(0, max_index + 1):
        new_weight = 0
        if n == 0:
            # constant term: average of (label - guess + old constant)
            for m in range(0, index):
                new_weight = new_weight + instance[m] - guess_label[m] + weight[n]
            new_weight = float(new_weight) / float(index)
        else:
            # feature weight: normalise by the sum of the feature values
            new_weight_div = 0
            for m in range(0, index):
                new_weight_div = new_weight_div + all_feature[str(m) + "," + str(n)]
            for m in range(0, index):
                x = all_feature[str(m) + "," + str(n)]
                if x != 0:
                    new_weight = new_weight + (instance[m] - guess_label[m] + weight[n] * x)
            new_weight = new_weight / new_weight_div
        weight[n] = new_weight
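For comparison, here is a minimal sketch of textbook batch gradient descent with an explicit learning rate on the same data. It is not the update used in trainLinear.py. The function name batch_gradient_descent, the learning rate, the iteration count and the rescaling of x are all assumptions made for this illustration, and the code is written for Python 3.

```python
def batch_gradient_descent(xs, ys, lr=1.0, iters=2000):
    """Fit y = b + w * x by batch gradient descent on half the mean squared error."""
    n = len(xs)
    # Assumption: rescale x into [-1, 1] so that one learning rate suits both parameters.
    scale = float(max(abs(x) for x in xs)) or 1.0
    zs = [x / scale for x in xs]
    b, w = 0.0, 0.0
    for _ in range(iters):
        # residuals of the current model over the whole batch
        errs = [b + w * z - y for z, y in zip(zs, ys)]
        grad_b = sum(errs) / n
        grad_w = sum(e * z for e, z in zip(errs, zs)) / n
        # step against the gradient
        b -= lr * grad_b
        w -= lr * grad_w
    return b, w / scale  # undo the rescaling of x


xs = list(range(100))
ys = [3 * x + 20 for x in xs]
b, w = batch_gradient_descent(xs, ys)
print(b, w)
```

Without the rescaling, x ranges up to 99 and the learning rate would have to be far smaller for the iteration to stay stable, which makes convergence much slower.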
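Iteration log. Each log line prints the constant term first and then the weight of feature 1. As the iterations proceed, the loss alternates between a higher and a lower value from one iteration to the next but shrinks overall, and the weight and the constant term gradually approach the target values.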
Iter 0:
weight 100 100 loss : 3166909150.0
Iter 1:
weight -4781.5 1.4 loss : 2382336561.0
Iter 2:
weight 99.2 99.03 loss : 3103887657.91
Iter 3:
weight -4733.485 1.416 loss : 2334928063.44
Iter 4:
weight 98.408 98.0697 loss : 3042120293.52
Iter 5:
weight -4685.95015 1.43184 loss : 2288462994.97
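The back-and-forth pattern in these first iterations can be read off the update rule in trainLinear.py. The following is a sketch of that derivation for this particular data set, where w_0 is the constant term, w_1 the weight of feature 1, primes denote the values after one iteration, and both updates use the predictions computed from the old weights. The symbols e_0 and e_1 are introduced here for illustration only.

```latex
\hat y_m = w_0 + w_1 x_m

w_0' = \frac{1}{N}\sum_{m}\bigl(y_m - \hat y_m + w_0\bigr) = \bar y - w_1 \bar x,
\qquad
w_1' = \frac{\sum_{x_m \neq 0}\bigl(y_m - \hat y_m + w_1 x_m\bigr)}{\sum_m x_m}
     = \frac{\sum_{x_m \neq 0}\,(y_m - w_0)}{\sum_m x_m}

% data: y_m = 3 x_m + 20, x_m = 0, \dots, 99, so \bar x = 49.5 and \sum_m x_m = 4950
w_0' = 20 + 49.5\,(3 - w_1),
\qquad
w_1' = 3 + \frac{20 - w_0}{50}

% errors relative to the target: e_0 = w_0 - 20, e_1 = w_1 - 3
e_0' = -49.5\, e_1, \qquad e_1' = -\frac{e_0}{50}
\;\Longrightarrow\;
e_0'' = 0.99\, e_0, \qquad e_1'' = 0.99\, e_1
```

Both errors flip sign on every iteration and shrink by a factor of 0.99 every two iterations. This agrees with the log: starting from (100, 100), one update gives (-4781.5, 1.4) and the next gives (99.2, 99.03).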
After more than 4,000 iterations, the parameters are already very close to the target values of 20 and 3:
Iter 4431:
weight 19.9999989688 2.99999999966 loss : 1.09879476907e-10
Iter 4432:
weight 20.000000017 3.00000002062 loss : 1.43159254874e-10
Iter 4433:
weight 19.9999989791 2.99999999966 loss : 1.07692932005e-10
Iter 4434:
weight 20.0000000168 3.00000002042 loss : 1.40310420629e-10
Iter 4435:
weight 19.9999989893 2.99999999966 loss : 1.05549867471e-10
Iter 4436:
weight 20.0000000167 3.00000002021 loss : 1.37518185703e-10
Iter 4437:
weight 19.9999989994 2.99999999967 loss : 1.03449354712e-10
Iter 4438:
weight 20.0000000165 3.00000002001 loss : 1.34781495804e-10
Iter 4439:
weight 19.9999990095 2.99999999967 loss : 1.01390698953e-10
Iter 4440:
weight 20.0000000163 3.00000001981 loss : 1.32099425001e-10
Iter 4441:
weight 19.9999990194 2.99999999967 loss : 9.93730623639e-11
Iter 5586:
weight 20.0000000001 3.00000000006 loss : 1.31534809522e-15
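As an independent cross-check of the values the iteration converges to, the same line can be fitted with the closed-form least squares solution for a single feature. This is a sketch written for Python 3; the function name fit_closed_form is hypothetical and is not part of the original scripts.

```python
def fit_closed_form(xs, ys):
    """Ordinary least squares for y = b + w * x with a single feature."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    # the fitted line passes through the point (mean_x, mean_y)
    b = mean_y - w * mean_x
    return b, w


xs = list(range(100))
ys = [3 * x + 20 for x in xs]
b, w = fit_closed_form(xs, ys)
print(b, w)
```

Because the generated data contains no noise, the fitted constant and slope should match 20 and 3 up to floating-point rounding, the same values the iterative script approaches.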