Linear Regression

Target function for the linear regression: y = 3x + 20


Program that generates the test data:

genData.py

#!/usr/bin/python
# Prints 100 samples of y = 3x + 20, one per line, as "label index:value".
for i in range(0, 100):
    label = 3 * i + 20
    feature = "1:" + str(i)
    print(str(label) + " " + feature)


The generated data is as follows:

20 1:0
23 1:1
26 1:2
29 1:3
32 1:4
35 1:5
38 1:6
41 1:7
44 1:8
47 1:9
50 1:10
53 1:11
56 1:12
59 1:13
62 1:14
65 1:15
68 1:16
71 1:17
74 1:18
77 1:19
80 1:20
83 1:21
86 1:22
89 1:23
92 1:24
95 1:25
98 1:26
101 1:27
104 1:28
107 1:29
110 1:30
113 1:31
116 1:32
119 1:33
122 1:34
125 1:35
128 1:36
131 1:37
134 1:38
137 1:39
140 1:40
143 1:41
146 1:42
149 1:43
152 1:44
155 1:45
158 1:46
161 1:47
164 1:48
167 1:49
170 1:50
173 1:51
176 1:52
179 1:53
182 1:54
185 1:55
188 1:56
191 1:57
194 1:58
197 1:59
200 1:60
203 1:61
206 1:62
209 1:63
212 1:64
215 1:65
218 1:66
221 1:67
224 1:68
227 1:69
230 1:70
233 1:71
236 1:72
239 1:73
242 1:74
245 1:75
248 1:76
251 1:77
254 1:78
257 1:79
260 1:80
263 1:81
266 1:82
269 1:83
272 1:84
275 1:85
278 1:86
281 1:87
284 1:88
287 1:89
290 1:90
293 1:91
296 1:92
299 1:93
302 1:94
305 1:95
308 1:96
311 1:97
314 1:98
317 1:99
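Each line printed by genData.py holds the label first, followed by a space and an `index:value` pair, which is the layout the training program reads. A minimal sketch of how one such line could be parsed; the helper name `parse_line` is hypothetical and does not appear in the original programs:

```python
def parse_line(line):
    """Split 'label index:value ...' into (label, {index: value})."""
    cols = line.strip().split(" ")
    label = float(cols[0])
    features = {}
    for item in cols[1:]:
        # each remaining column is one "index:value" pair
        idx, value = item.split(":")
        features[int(idx)] = float(value)
    return label, features


print(parse_line("23 1:1"))
```

Keeping the feature index as an integer avoids comparing numbers with strings later on.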


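Training program. It uses the squared error as the loss function (the printed loss is the sum of squared errors over all samples) and searches for the optimal solution with batch gradient descent, meaning every update is computed from all training samples.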
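As a reference for the terms used here, this is a sketch of the squared-error loss and of the textbook batch gradient descent step for a model with a single feature. The symbols w_0 (constant term), w_1 (weight), M (number of samples) and the learning rate \eta are notation assumed for this note and are not taken from the program:

```latex
\begin{aligned}
\hat{y}_m &= w_0 + w_1 x_m \\
L(w_0, w_1) &= \sum_{m=1}^{M} (\hat{y}_m - y_m)^2 \\
\frac{\partial L}{\partial w_0} &= 2 \sum_{m=1}^{M} (\hat{y}_m - y_m), \qquad
\frac{\partial L}{\partial w_1} = 2 \sum_{m=1}^{M} (\hat{y}_m - y_m)\, x_m \\
w_j &\leftarrow w_j - \eta \, \frac{\partial L}{\partial w_j}, \quad j = 0, 1
\end{aligned}
```

"Batch" refers to the sums running over all M samples before any weight is changed.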

trainLinear.py



#!/usr/bin/python
# Reads the training data from stdin, e.g. python genData.py | python trainLinear.py
import sys

instance = {}      # sample number -> label (kept as a string, converted on use)
all_feature = {}   # "sample,feature_index" -> feature value
index = 0          # number of samples read so far
max_index = 0      # largest feature index seen

# read data: each line is "label index:value [index:value ...]"
for line in sys.stdin:
    cols = line.strip().split(" ")
    label = cols[0]
    for features in cols[1:]:
        feature_index, feature_value = features.split(":")
        # compare as integers; the original compared an int with a string
        if max_index < int(feature_index):
            max_index = int(feature_index)
        all_feature[str(index) + "," + feature_index] = float(feature_value)
    instance[index] = label
    index = index + 1

# init param: weight[0] is the constant term, weight[n] belongs to feature n
print("feature count : " + str(max_index))
weight = {}
for i in range(0, max_index + 1):
    weight[i] = 100

iter_maxno = 10000
for iter_no in range(0, iter_maxno):
    print("Iter " + str(iter_no) + ":")
    # guess: predict every sample with the current weights and accumulate the loss
    loss = 0
    guess_label = {}
    for m in range(0, index):
        guess_label[m] = weight[0]
        for n in range(1, max_index + 1):
            guess_label[m] = guess_label[m] + all_feature[str(m) + "," + str(n)] * weight[n]
        loss = loss + (guess_label[m] - float(instance[m])) * (guess_label[m] - float(instance[m]))
    print("weight " + " ".join(str(weight[n]) for n in range(0, max_index + 1)) + " loss : " + str(loss))

    # update param: every weight is recomputed from the guesses made above
    for n in range(0, max_index + 1):
        new_weight = 0
        if n == 0:
            for m in range(0, index):
                new_weight = new_weight + float(instance[m]) - guess_label[m] + weight[n]
            new_weight = float(new_weight) / float(index)
        else:
            new_weight_div = 0
            for m in range(0, index):
                new_weight_div = new_weight_div + all_feature[str(m) + "," + str(n)]
            for m in range(0, index):
                if all_feature[str(m) + "," + str(n)] != 0:
                    new_weight = new_weight + (float(instance[m]) - guess_label[m] + weight[n] * all_feature[str(m) + "," + str(n)])
            new_weight = new_weight / new_weight_div
        weight[n] = new_weight
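For comparison, here is a minimal sketch of textbook batch gradient descent with an explicit learning rate on the same data. As far as can be read from the code, trainLinear.py uses an averaging update without a learning rate, so this is a different variant and not the author's method. The learning rate, the iteration count and the rescaling of the feature by 1/100 are assumptions chosen for this example; without rescaling, a feature range of 0 to 99 would require a much smaller learning rate.

```python
# Textbook batch gradient descent for y = w0 + w1 * x (illustrative sketch).
# Assumptions: learning rate, iteration count and feature scaling are chosen
# for this example and do not come from trainLinear.py.

xs = [float(i) for i in range(100)]
ys = [3.0 * x + 20.0 for x in xs]          # same data as genData.py

scale = 100.0                              # rescale x into [0, 1)
zs = [x / scale for x in xs]


def batch_gradient_descent(zs, ys, learning_rate=1.0, iterations=2000):
    w0, w1 = 0.0, 0.0
    m = float(len(zs))
    for _ in range(iterations):
        # residuals of the whole batch under the current weights
        errors = [w0 + w1 * z - y for z, y in zip(zs, ys)]
        # gradient of (mean squared error) / 2 with respect to w0 and w1
        grad_w0 = sum(errors) / m
        grad_w1 = sum(e * z for e, z in zip(errors, zs)) / m
        # both weights are updated from the same set of residuals
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1


w0, w1_scaled = batch_gradient_descent(zs, ys)
w1 = w1_scaled / scale                     # undo the feature scaling
print("intercept %.6f, slope %.6f" % (w0, w1))
```

The intercept and slope should approach 20 and 3, the values used to generate the data.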


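Iteration log. As the iterations proceed, the loss shrinks overall: it oscillates between even and odd iterations, but each of the two sequences decreases. The weight and the constant term gradually approach the values used to generate the data (3 and 20).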

Iter 0:
weight 100 100 loss : 3166909150.0
Iter 1:
weight -4781.5 1.4 loss : 2382336561.0
Iter 2:
weight 99.2 99.03 loss : 3103887657.91
Iter 3:
weight -4733.485 1.416 loss : 2334928063.44
Iter 4:
weight 98.408 98.0697 loss : 3042120293.52
Iter 5:
weight -4685.95015 1.43184 loss : 2288462994.97
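The first update in the log can be checked by hand. With both initial weights at 100 and the labels 3x + 20 from genData.py (x = 0, ..., 99), the residual of each sample is r = y - \hat{y} = -97x - 80. The update as read from trainLinear.py adds the mean residual to the constant term, and adds to the weight the residual sum over samples with a non-zero feature divided by the feature sum. The symbols w_0 and w_1 are notation assumed for this note:

```latex
\begin{aligned}
\sum_{x=0}^{99} r &= -97 \cdot 4950 - 80 \cdot 100 = -488150 \\
w_0 &\leftarrow 100 + \frac{-488150}{100} = -4781.5 \\
w_1 &\leftarrow 100 + \frac{-488150 + 80}{4950} = 100 - 98.6 = 1.4
\end{aligned}
```

Both values match the weights printed at iteration 1. The term +80 removes the residual of the sample with x = 0, which the program skips when updating the weight.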


After more than 4000 iterations the weights are already very close to the target values:

Iter 4431:
weight 19.9999989688 2.99999999966 loss : 1.09879476907e-10
Iter 4432:
weight 20.000000017 3.00000002062 loss : 1.43159254874e-10
Iter 4433:
weight 19.9999989791 2.99999999966 loss : 1.07692932005e-10
Iter 4434:
weight 20.0000000168 3.00000002042 loss : 1.40310420629e-10
Iter 4435:
weight 19.9999989893 2.99999999966 loss : 1.05549867471e-10
Iter 4436:
weight 20.0000000167 3.00000002021 loss : 1.37518185703e-10
Iter 4437:
weight 19.9999989994 2.99999999967 loss : 1.03449354712e-10
Iter 4438:
weight 20.0000000165 3.00000002001 loss : 1.34781495804e-10
Iter 4439:
weight 19.9999990095 2.99999999967 loss : 1.01390698953e-10
Iter 4440:
weight 20.0000000163 3.00000001981 loss : 1.32099425001e-10
Iter 4441:
weight 19.9999990194 2.99999999967 loss : 9.93730623639e-11

Iter 5586:
weight 20.0000000001 3.00000000006 loss : 1.31534809522e-15
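As a cross-check on the values the iteration converges to, the least-squares line for this data can also be computed in closed form. This is a minimal sketch, assuming the data is regenerated in memory the same way genData.py does; the variable names are illustrative and not part of the original programs:

```python
# Closed-form least squares for a single feature (illustrative cross-check).
xs = [float(i) for i in range(100)]
ys = [3.0 * x + 20.0 for x in xs]          # same data as genData.py

n = float(len(xs))
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x); the intercept follows from the means
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print("intercept %.6f, slope %.6f" % (intercept, slope))
```

Both values should agree with the weights at the end of the log, up to rounding.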
