https://aistudio.baidu.com/aistudio/education/group/info/1978
https://aistudio.baidu.com/aistudio/projectdetail/1774774
Binary classification is one of the most fundamental problems in machine learning. In this tutorial you will learn how to implement a linear binary classifier that predicts, from personal attributes, whether a person's annual income exceeds $50,000. We approach the task with two methods, logistic regression and a generative model; try to understand and compare the design ideas behind the two.
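Both methods end up scoring a sample with the same linear form (general background, not specific to this notebook):

P(y=1 | x) = σ(wᵀx + b), where σ(z) = 1/(1 + e^(−z))

Logistic regression learns w and b by gradient descent on the cross-entropy loss, while the generative model assumes each class follows a Gaussian with a shared covariance and computes w and b in closed form; both are implemented below.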
Implementing the binary classification task:
This dataset is derived from the Census-Income (KDD) Data Set in the UCI Machine Learning Repository, with some processing applied. To make training easier, we removed some unnecessary attributes and slightly rebalanced the ratio of positive to negative labels. In fact, only the three processed files X_train, Y_train, and X_test are used during training; the raw files train.csv and test.csv can give you some additional information.
Feature format
The project data is stored under the work/data/ directory.
None
# Now it's your turn!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data1 = pd.read_csv('work/data/train.csv',header=None,encoding='big5')
print(data1.shape)
data1.head()
(54257, 42)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | id | age | class of worker | detailed industry recode | detailed occupation recode | education | wage per hour | enroll in edu inst last wk | marital stat | major industry code | ... | country of birth father | country of birth mother | country of birth self | citizenship | own business or self employed | fill inc questionnaire for veteran's admin | veterans benefits | weeks worked in year | year | y |
1 | 0 | 33 | Private | 34 | 26 | Masters degree(MA MS MEng MEd MSW MBA) | 0 | Not in universe | Married-civilian spouse present | Finance insurance and real estate | ... | China | China | Taiwan | Foreign born- Not a citizen of U S | 2 | Not in universe | 2 | 52 | 95 | 50000+. |
2 | 1 | 63 | Private | 7 | 22 | Some college but no degree | 0 | Not in universe | Never married | Manufacturing-durable goods | ... | ? | ? | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 52 | 95 | - 50000. |
3 | 2 | 71 | Not in universe | 0 | 0 | 7th and 8th grade | 0 | Not in universe | Married-civilian spouse present | Not in universe or children | ... | Germany | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
4 | 3 | 43 | Local government | 43 | 10 | Bachelors degree(BA AB BS) | 0 | Not in universe | Married-civilian spouse present | Education | ... | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 52 | 95 | - 50000. |
5 rows × 42 columns
data2 = pd.read_csv('work/data/X_train', header=None, encoding='big5')
data2.head()
data2.to_numpy()  # note: returns a copy; the result is discarded here
print(data2[0:10])
(output condensed: the wide frame wraps across many lines when printed; because the file was read with header=None, row 0 holds the original column names such as "id", "age", "Private", and "weeks worked in year", and the remaining rows are the numeric one-hot encoded features. [10 rows x 511 columns])
# Load the training features X and labels Y
with open('work/data/X_train') as f:
    next(f)  # skip the header row
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)  # drop the id column
with open('work/data/Y_train') as f:
    next(f)  # skip the header row
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype=float)
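As an aside, the same two files can each be loaded with a single pandas call; a minimal equivalent sketch, assuming (as the manual parsing above implies) that the first row is a header and the first column is an id:

import pandas as pd
X_train = pd.read_csv('work/data/X_train').iloc[:, 1:].to_numpy(dtype=float)  # drop the id column
Y_train = pd.read_csv('work/data/Y_train').iloc[:, 1].to_numpy(dtype=float)   # keep only the label column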
Data preprocessing
print(X_train.shape)
print(Y_train.shape)
print(Y_train)
(54256, 510)
(54256,)
[1. 0. 0. ... 0. 0. 0.]
Merge features and labels into a single matrix so that later shuffling for stochastic gradient descent keeps the rows aligned.
Y_train = Y_train.reshape(-1,1)  # reshape to 54256 x 1
print(Y_train.shape)
data = np.concatenate((X_train, Y_train), axis=1)  # merged matrix, convenient for shuffling
print(data.shape)
(54256, 1)
(54256, 511)
Following the blog by @变化hyq: first shuffle the data, then standardize the training and validation sets each with their own statistics (i.e., the training set uses the training mean, and the validation set uses the validation mean).
# Shuffle the training data by row
def shuffle(x):
    return np.random.permutation(x)  # returns a row-permuted copy
shuffle_data = shuffle(data)  # shuffle the dataset first
print(shuffle_data)
[[37. 1. 0. ... 1. 0. 0.]
[58. 1. 0. ... 1. 0. 0.]
[26. 1. 0. ... 0. 1. 0.]
...
[47. 0. 0. ... 1. 0. 0.]
[19. 1. 0. ... 0. 1. 0.]
[20. 1. 0. ... 0. 1. 0.]]
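If you want the shuffle to be reproducible across runs, a seeded generator can stand in for the bare np.random call; an optional sketch (the seed value 0 is an arbitrary choice):

rng = np.random.default_rng(0)        # fixed seed for reproducibility (illustrative)
shuffle_data = rng.permutation(data)  # row-permuted copy, same behavior as np.random.permutation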
Split into training and validation sets
t_len = int(len(data)*0.9)  # split 9:1 into training and validation sets
train_set = shuffle_data[:t_len,:]  # rows are samples, columns are the features plus the label
vali_set = shuffle_data[t_len:,:]
print(train_set.shape)
print(vali_set.shape)
(48830, 511)
(5426, 511)
Standardize the training and validation sets
# Normalizing
train_set_mean = np.mean(train_set[:,0:-1], axis=0)
print(train_set_mean.shape)
train_set_std = np.std(train_set[:,0:-1], axis=0).reshape(1,-1)
print(train_set_std.shape)
vali_set_mean = np.mean(vali_set[:,0:-1], axis=0)
vali_set_std = np.std(vali_set[:,0:-1], axis=0).reshape(1,-1)
train_set[:,0:-1] = (train_set[:,0:-1] - train_set_mean)/(train_set_std + 1e-8)  # the small constant avoids dividing by a zero std
vali_set[:,0:-1] = (vali_set[:,0:-1] - vali_set_mean)/(vali_set_std + 1e-8)
(510,)
(1, 510)
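Note that the prediction step at the end standardizes X_test with the training statistics, while the validation set above uses its own mean and std. If you prefer the more common convention of a single shared feature scale, an alternative (illustrative; it departs from the blog's choice) is to replace the vali_set line above with:

vali_set[:,0:-1] = (vali_set[:,0:-1] - train_set_mean)/(train_set_std + 1e-8)  # reuse TRAINING statistics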
# Sigmoid function
def sigmoid(z):
    # clip z to [-10, 10] so np.exp never overflows for large |z|
    return 1/(1.0 + np.exp(-np.clip(z, -10, 10)))
# Cross-entropy loss
def ComputeCost(X, y):  # X: predicted probabilities, y: true labels
    loss = -np.sum(y*np.log(X) + (1-y)*np.log(1-X))
    return loss
# Accuracy
def ComputeAcc(y_p, y_t):  # y_p: 0/1 predictions, y_t: true labels
    acc = 1 - np.mean(np.abs(y_p - y_t))
    return acc
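A quick sanity check of the helpers on toy values (illustrative numbers, not project data):

y_true = np.array([[1.], [0.], [1.]])
y_prob = np.array([[0.9], [0.2], [0.6]])
print(ComputeCost(y_prob, y_true))           # ≈ 0.839, i.e. -(ln 0.9 + ln 0.8 + ln 0.6)
print(ComputeAcc(np.round(y_prob), y_true))  # 1.0: all three rounded predictions match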
# Mini-batch gradient descent
# print(train_set.shape)
dim = 511
batch = 1000  # 1000 samples per batch
batch_times = int(t_len/batch)  # number of batches per epoch
lr = 0.007  # learning rate
iteration = 501  # number of iterations
adagrad = np.zeros([dim,1])  # accumulated squared gradients for the AdaGrad update
eps = 1e-10  # avoids a zero denominator in the AdaGrad update
w = np.zeros([dim,1])
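For reference, the inner loop below applies AdaGrad to the cross-entropy gradient; written out in the notation of the code (standard formulas, with eps inside the square root exactly as the code places it):

g_t = xᵀ(σ(x·w_t) − y_hat)
w_{t+1} = w_t − lr · g_t / sqrt(Σ_{τ≤t} g_τ² + eps)

so each weight gets its own effective learning rate, which shrinks as its squared gradients accumulate.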
Now run gradient descent. In testing, re-shuffling the dataset at every iteration barely improves accuracy and costs a lot of time.
# Start training
train_set = shuffle(train_set)  # shuffle the training set
train_set_x = train_set[:,0:-1]
train_set_x = np.concatenate((train_set_x, np.ones([t_len,1])), axis=1)  # append a constant column for the bias
train_set_y = train_set[:,-1]
vali_set = shuffle(vali_set)  # shuffle the validation set
vali_set_x = vali_set[:,0:-1]
vali_set_x = np.concatenate((vali_set_x, np.ones([len(data)-t_len,1])), axis=1)
vali_set_y = vali_set[:,-1]
vali_set_y = vali_set_y.reshape(-1,1)
for t in range(iteration):
    y = np.zeros([batch,1])      # predictions for the current mini-batch
    y_hat = np.zeros([batch,1])  # labels for the current mini-batch
    for b in range(batch_times):
        x = train_set_x[batch*b:batch*(b+1),:]
        x = x.reshape(batch,-1)
        y_hat = train_set_y[batch*b:batch*(b+1)]
        y_hat = y_hat.reshape(batch,1)  # true labels
        y = sigmoid(np.dot(x,w))        # predicted probabilities
        err = y - y_hat
        gradient = np.dot(x.transpose(), err)
        adagrad += gradient**2
        w = w - lr * gradient/np.sqrt(adagrad + eps)  # AdaGrad update
    if(t % 100 == 0):
        y_predict = sigmoid(np.dot(vali_set_x, w))
        loss_test = ComputeCost(y_predict, vali_set_y)/(len(data)-t_len)
        acc_test = ComputeAcc(np.round(y_predict), vali_set_y)  # validation accuracy; np.round rounds to 0/1
        # note: y and y_hat still hold only the LAST mini-batch here, so the
        # "training" numbers below cover 1000 samples (and the loss is divided
        # by the full t_len, which is why it looks so small)
        loss = ComputeCost(y, y_hat)/t_len
        acc = ComputeAcc(np.round(y), y_hat)
        print(str(t) + " training cross-entropy: " + str(loss))
        print(str(t) + " training accuracy: " + str(acc))
        print(str(t) + " validation cross-entropy: " + str(loss_test))
        print(str(t) + " validation accuracy: " + str(acc_test))
print(w[0:10])
0 training cross-entropy: 0.010911902493790744
0 training accuracy: 0.759
0 validation cross-entropy: 0.5247057318994045
0 validation accuracy: 0.7736822705492075
200 training cross-entropy: 0.006509445262417623
200 training accuracy: 0.873
200 validation cross-entropy: 0.3202077619856097
200 validation accuracy: 0.8761518614080354
400 training cross-entropy: 0.006031300922069907
400 training accuracy: 0.879
400 validation cross-entropy: 0.2975839189616665
400 validation accuracy: 0.879653520088463
500 training cross-entropy: 0.005918204175830794
500 training accuracy: 0.882
500 validation cross-entropy: 0.29214399402380054
500 validation accuracy: 0.88020641356432
[[ 0.37061675]
[ 0.02173027]
[ 0.13248705]
[-0.07897371]
[ 0.02843867]
[-0.02160288]
[-0.01915566]
[ 0.01999223]
[ 0.01932086]
[-0.07371981]]
# Prediction
with open('work/data/X_test') as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)
X_test = (X_test - train_set_mean)/(train_set_std + 1e-8)  # standardize with the TRAINING statistics (same epsilon as above)
X_test = np.concatenate((X_test, np.ones([len(X_test),1])), axis=1)  # append the bias column
print(X_test.shape)
predict = np.round(sigmoid(np.dot(X_test, w)))
(27622, 511)
# Save the predictions
import csv
with open('submit_logistic.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'label']
    print(header)
    csv_writer.writerow(header)
    for i in range(len(predict)):  # write every test prediction (the original looped only to 240)
        row = ['id_' + str(i), int(predict[i][0])]
        csv_writer.writerow(row)
        if(i < 10):
            print(row)
['id', 'label']
['id_0', 0]
['id_1', 0]
['id_2', 0]
['id_3', 0]
['id_4', 0]
['id_5', 1]
['id_6', 0]
['id_7', 1]
['id_8', 0]
['id_9', 0]
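The same submission file can also be written with pandas in a few lines; an equivalent sketch under the same format assumptions:

import pandas as pd
sub = pd.DataFrame({'id': ['id_' + str(i) for i in range(len(predict))],
                    'label': predict[:, 0].astype(int)})
sub.to_csv('submit_logistic.csv', index=False)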
import numpy as np
import pandas as pd
with open('work/data/X_train') as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)
with open('work/data/Y_train') as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype=float)
Y_train = Y_train.reshape(-1,1)
data = np.concatenate((X_train, Y_train), axis=1)
print(data.shape)
(54256, 511)
# Shuffle the data
def shuffle(x):
    return np.random.permutation(x)
shuffle_data = shuffle(data)  # shuffle the dataset first
print(shuffle_data)
[[43. 1. 0. ... 1. 0. 0.]
[68. 0. 0. ... 0. 1. 0.]
[67. 0. 0. ... 0. 1. 0.]
...
[21. 0. 0. ... 0. 1. 0.]
[ 3. 0. 0. ... 1. 0. 0.]
[16. 0. 0. ... 1. 0. 0.]]
# Split into training and validation sets
train_len = int(len(shuffle_data)*0.7)  # 7:3 split between training and validation
train_data = shuffle_data[:train_len,:]
vail_data = shuffle_data[train_len:,:]
print(train_data.shape)
print(vail_data.shape)
(37979, 511)
(16277, 511)
# Standardize the training and validation sets separately
# note: the last column is the true label y and must not be standardized
train_mean = np.mean(train_data[:,0:-1], axis=0)
train_std = np.std(train_data[:,0:-1], axis=0).reshape(1,-1)
train_data[:,0:-1] = (train_data[:,0:-1] - train_mean)/(train_std + 1e-8)
vail_mean = np.mean(vail_data[:,0:-1], axis=0)
vail_std = np.std(vail_data[:,0:-1], axis=0).reshape(1,-1)
vail_data[:,0:-1] = (vail_data[:,0:-1] - vail_mean)/(vail_std + 1e-8)  # same epsilon as the training set
# Training set: split by class
e0 = (train_data[:,-1] == 0)  # mask for rows whose true label is 0
e1 = (train_data[:,-1] == 1)  # mask for rows whose true label is 1
train_0 = train_data[e0]
train_0 = train_0[:,0:-1]  # drop the label column
train_1 = train_data[e1]
train_1 = train_1[:,0:-1]
# Validation set: split by class
e0 = (vail_data[:,-1] == 0)
e1 = (vail_data[:,-1] == 1)
vail_0 = vail_data[e0]
vail_0 = vail_0[:,0:-1]  # drop the label column
vail_1 = vail_data[e1]
vail_1 = vail_1[:,0:-1]
# Class means for the training set
train_mean_0 = np.mean(train_0, axis=0)  # μ0
train_mean_1 = np.mean(train_1, axis=0)  # μ1
# Class means for the validation set
vail_mean_0 = np.mean(vail_0, axis=0)  # μ0
vail_mean_1 = np.mean(vail_1, axis=0)  # μ1
Each class covariance is estimated as Σ = (1/N) · (X − μ)ᵀ (X − μ), where N is the number of samples in the class, ᵀ is the transpose, and · is matrix multiplication. (The code below uses the maximum-likelihood 1/N rather than the unbiased 1/(N−1).)
# Training set: compute Σ
# First compute Σ0 and Σ1 for each class
train_cov0 = (train_0 - train_mean_0).T.dot(train_0 - train_mean_0)/len(train_0)
train_cov1 = (train_1 - train_mean_1).T.dot(train_1 - train_mean_1)/len(train_1)
# Shared covariance: the sample-weighted average of the two
train_cov = (len(train_0)*train_cov0 + len(train_1)*train_cov1)/len(train_data)
# Validation set: compute Σ
# First compute Σ0 and Σ1 for each class
vail_cov0 = (vail_0 - vail_mean_0).T.dot(vail_0 - vail_mean_0)/len(vail_0)
vail_cov1 = (vail_1 - vail_mean_1).T.dot(vail_1 - vail_mean_1)/len(vail_1)
# Shared covariance
vail_cov = (len(vail_0)*vail_cov0 + len(vail_1)*vail_cov1)/len(vail_data)
# Training set: invert Σ via SVD (the covariance can be near-singular with one-hot features)
u, s, v = np.linalg.svd(train_cov, full_matrices=False)
train_inv_cov = np.matmul(v.T * 1/s, u.T)  # Σ⁻¹
# Validation set
u, s, v = np.linalg.svd(vail_cov, full_matrices=False)
vail_inv_cov = np.matmul(v.T * 1/s, u.T)  # Σ⁻¹
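np.linalg.pinv computes the same Moore-Penrose pseudo-inverse via SVD and additionally zeroes out singular values below a tolerance, which can be more stable when the covariance is near-singular; a drop-in alternative sketch for the two blocks above:

train_inv_cov = np.linalg.pinv(train_cov)  # pseudo-inverse with small-singular-value cutoff
vail_inv_cov = np.linalg.pinv(vail_cov)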
# Closed-form w and b from the training set
train_w = np.dot(train_mean_0 - train_mean_1, train_inv_cov).reshape(-1,1)
train_b = (-0.5)*train_mean_0.T.dot(train_inv_cov).dot(train_mean_0) \
          + 0.5*np.dot(train_mean_1.T, train_inv_cov).dot(train_mean_1) \
          + np.log(len(train_0)/len(train_1))
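These two lines are the standard closed-form solution for a generative model with Gaussian class-conditionals and a shared covariance. With class 0 in the role of the "positive" class, the posterior is

P(y=0 | x) = σ( (μ0 − μ1)ᵀ Σ⁻¹ x − ½ μ0ᵀ Σ⁻¹ μ0 + ½ μ1ᵀ Σ⁻¹ μ1 + ln(N0/N1) )

which is why the prediction step below flips the sigmoid output with 1 − round(·) to recover the class-1 label.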
# Sigmoid function
def sigmoid(z):
    # clip z to [-10, 10] so np.exp never overflows for large |z|
    return 1/(1.0 + np.exp(-np.clip(z, -10, 10)))
# Cross-entropy loss
def ComputeCost(X, y):  # X: predicted probabilities, y: true labels
    loss = -np.sum(y*np.log(X) + (1-y)*np.log(1-X))
    return loss
# Accuracy
def ComputeAcc(y_p, y_t):  # y_p: 0/1 predictions, y_t: true labels
    acc = 1 - np.mean(np.abs(y_p - y_t))
    return acc
# Accuracy on the training set
train_x = train_data[:,0:-1]
train_x = train_x.reshape(len(train_data),-1)
train_y_label = train_data[:,-1]  # true labels
train_y_label = train_y_label.reshape(-1,1)
f = np.matmul(train_x, train_w) + train_b
# w was built from (μ0 - μ1), so sigmoid(f) estimates P(y = 0 | x); flip to get the class-1 prediction
train_y_pred = 1 - np.round(sigmoid(f))
train_acc = ComputeAcc(train_y_pred, train_y_label)
print("Training accuracy: " + str(train_acc))
Training accuracy: 0.8702440822559836
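As a follow-up (not part of the original run), the validation split can be scored with the same w and b derived from the training split, which is the fairer comparison with the logistic-regression numbers above:

# Validation accuracy of the generative model, reusing the training-set w and b
vail_x = vail_data[:, 0:-1]
vail_y_label = vail_data[:, -1].reshape(-1, 1)
f_v = np.matmul(vail_x, train_w) + train_b
vail_y_pred = 1 - np.round(sigmoid(f_v))  # flip: sigmoid estimates P(y = 0 | x)
print("Validation accuracy: " + str(ComputeAcc(vail_y_pred, vail_y_label)))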