Hung-yi Lee Machine Learning Training Camp: Homework 2, Annual Income Prediction

Homework 2: Annual Income Prediction

Course link

https://aistudio.baidu.com/aistudio/education/group/info/1978

Project link

https://aistudio.baidu.com/aistudio/projectdetail/1774774

Project description

Binary classification is one of the most fundamental problems in machine learning. In this assignment you will learn how to implement a linear binary classifier that predicts, from a person's attributes, whether their annual income exceeds 50,000 USD. We use two methods, logistic regression and a generative model; try to understand and analyze their design ideas and differences. Both ultimately yield a linear decision boundary of the form σ(wᵀx + b): logistic regression learns w and b discriminatively by gradient descent, while the generative model derives them in closed form from per-class Gaussian statistics.
The binary classification task:

  • Does a person's annual income exceed 50,000 USD?

Dataset description

This dataset is derived from the Census-Income (KDD) Data Set in the UCI Machine Learning Repository, with some preprocessing. To make training easier, some unnecessary attributes were removed and the ratio of positive to negative labels was roughly balanced. During training, only the three processed files X_train, Y_train, and X_test are actually used; the raw files train.csv and test.csv can give you some additional information.

  • Unnecessary attributes have been removed.
  • The ratio of positive to negative labels has been balanced.

Feature format

  1. train.csv, test_no_label.csv
  • Raw, text-based data
  • Unnecessary attributes removed, positive/negative ratio balanced
  2. X_train, Y_train, X_test
  • Discrete features in train.csv => one-hot encoded in X_train (education, marital status, ...); see the sketch after this list
  • Continuous features in train.csv => kept unchanged in X_train (age, capital losses, ...)
  • X_train, X_test: each row contains one 510-dim feature vector representing one sample
  • Y_train: label = 0 means "<= 50K", label = 1 means "> 50K"
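
For intuition about the one-hot step above, here is a minimal, hypothetical sketch using pandas. The actual preprocessing that produced X_train is not included in the handout, and the column names below are made up:

import pandas as pd

# Toy stand-in for train.csv; column names and values are hypothetical
df = pd.DataFrame({"education": ["Masters", "Some college", "7th-8th grade"],
                   "age": [33, 63, 71]})
# Discrete feature -> one-hot indicator columns; the continuous "age" column stays unchanged
encoded = pd.get_dummies(df, columns=["education"])
print(encoded)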

Project requirements

  1. Implement logistic regression with gradient descent by hand.
  2. Implement the probabilistic generative model by hand.
  3. Each code cell should run in under five minutes.
  4. Do not use any open-source code (for example, a decision tree implementation found on GitHub).

Data preparation

The project data is stored under work/data/.

Environment setup / installation


# Your turn below!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data1 = pd.read_csv('work/data/train.csv',header=None,encoding='big5')
print(data1.shape)
data1.head()
(54257, 42)


The first five rows of train.csv (5 rows × 42 columns): the columns include id, age, class of worker, detailed industry recode, detailed occupation recode, education, wage per hour, marital stat, country of birth, citizenship, weeks worked in year, year, and the label y, which takes the raw string values '50000+.' or '- 50000.'.

data2 = pd.read_csv('work/data/X_train',header=None,encoding='big5')
data2.head()
data2_np = data2.to_numpy() # to_numpy() returns a new array; assign it to keep the result
print(data2[0:10])


  0    1         2                            3                  4    \
0  id  age   Private   Self-employed-incorporated   State government   
1   0   33         1                            0                  0   
2   1   63         1                            0                  0   
3   2   71         0                            0                  0   
4   3   43         0                            0                  0   
5   4   57         0                            0                  0   
6   5   42         1                            0                  0   
7   6   16         0                            0                  0   
8   7   16         1                            0                  0   
9   8   20         1                            0                  0   

                               5                 6             7    \
0   Self-employed-not incorporated   Not in universe   Without pay   
1                                0                 0             0   
2                                0                 0             0   
3                                0                 1             0   
4                                0                 0             0   
5                                0                 0             0   
6                                0                 0             0   
7                                0                 1             0   
8                                0                 0             0   
9                                0                 0             0   

                   8              9    ... 501               502   503  504  \
0   Federal government   Never worked  ...   1   Not in universe   Yes   No   
1                    0              0  ...   0                 1     0    0   
2                    0              0  ...   0                 1     0    0   
3                    0              0  ...   0                 1     0    0   
4                    0              0  ...   0                 1     0    0   
5                    0              0  ...   0                 1     0    0   
6                    0              0  ...   0                 1     0    0   
7                    0              0  ...   0                 1     0    0   
8                    0              0  ...   0                 1     0    0   
9                    0              0  ...   0                 1     0    0   

   505  506  507                   508  509  510  
0    2    0    1  weeks worked in year   94   95  
1    1    0    0                    52    0    1  
2    1    0    0                    52    0    1  
3    1    0    0                     0    0    1  
4    1    0    0                    52    0    1  
5    1    0    0                    52    0    1  
6    1    0    0                    52    1    0  
7    1    0    0                     3    1    0  
8    1    0    0                     0    0    1  
9    1    0    0                    24    0    1  

[10 rows x 511 columns]
# Load the training features X and labels Y
with open('work/data/X_train') as f:
    next(f) # skip the header row
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f],dtype=float)
with open('work/data/Y_train') as f:
    next(f) # skip the header row
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f],dtype=float)

Data processing

print(X_train.shape)
print(Y_train.shape)
print(Y_train)
(54256, 510)
(54256,)
[1. 0. 0. ... 0. 0. 0.]

Merge the features and labels into one matrix so that the rows can be shuffled together for the mini-batch gradient descent below.

Y_train = Y_train.reshape(-1,1) # reshape to 54256 x 1
print(Y_train.shape)
data = np.concatenate((X_train,Y_train),axis=1) # merge into one matrix so shuffling keeps rows aligned
print(data.shape)
(54256, 1)
(54256, 511)

Following @变化hyq's blog, the data is shuffled first, and then the training and validation sets are standardized separately, each with its own statistics (the training set is standardized with the training mean and std, the validation set with its own). Note that at prediction time further below, the test set is standardized with the training-set statistics instead, which is the more conventional treatment of unseen data.

# Shuffle the rows of the dataset
def shuffle(x):
    return np.random.permutation(x) # permutes along the first axis, i.e., shuffles rows
shuffle_data = shuffle(data) # shuffle the dataset first
print(shuffle_data)
[[37.  1.  0. ...  1.  0.  0.]
 [58.  1.  0. ...  1.  0.  0.]
 [26.  1.  0. ...  0.  1.  0.]
 ...
 [47.  0.  0. ...  1.  0.  0.]
 [19.  1.  0. ...  0.  1.  0.]
 [20.  1.  0. ...  0.  1.  0.]]

Split into training and validation sets

t_len = int(len(data)*0.9) # split 9:1 into training and validation sets

train_set = shuffle_data[:t_len,:] # rows are samples, columns are the features plus the label
vali_set = shuffle_data[t_len:,:]
print(train_set.shape)
print(vali_set.shape)
(48830, 511)
(5426, 511)

Standardize the training and validation sets

# Normalizing
train_set_mean = np.mean(train_set[:,0:-1],axis=0) # exclude the label column
print(train_set_mean.shape)
train_set_std = np.std(train_set[:,0:-1],axis=0).reshape(1,-1)
print(train_set_std.shape)
vali_set_mean = np.mean(vali_set[:,0:-1],axis=0)
vali_set_std = np.std(vali_set[:,0:-1],axis=0).reshape(1,-1)

train_set[:,0:-1] = (train_set[:,0:-1]-train_set_mean)/(train_set_std+0.00000001) # small constant avoids division by a zero std
vali_set[:,0:-1] = (vali_set[:,0:-1]-vali_set_mean)/(vali_set_std+0.00000001) # small constant avoids division by a zero std
(510,)
(1, 510)
# Sigmoid function
def sigmoid(z):
    return 1/(1.0+np.exp(-np.clip(z,-10,10))) # clip z to [-10, 10] so exp() never overflows on large-magnitude inputs

# Cross-entropy loss; X holds the predictions, y the true labels
def ComputeCost(X,y):
    loss = -np.sum(y*np.log(X)+(1-y)*np.log(1-X))
    return loss

# Accuracy; y_p: predictions, y_t: true labels
def ComputeAcc(y_p,y_t):
    acc = 1 - np.mean(np.abs(y_p - y_t))
    return acc

# Mini-batch gradient descent setup
# print(train_set.shape)
dim = 511 # 510 features plus one bias term
batch = 1000 # samples per mini-batch
batch_times = int(t_len/batch) # number of mini-batches per iteration
lr = 0.007 # learning rate
iteration = 501 # number of iterations
adagrad = np.zeros([dim,1]) # accumulated squared gradients for the AdaGrad step size
eps = 0.0000000001 # tiny constant so the AdaGrad denominator is never zero
w = np.zeros([dim,1])

Now run gradient descent. In testing, reshuffling the dataset on every iteration barely improved accuracy and cost a lot of time, so the data is shuffled once before training.
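
For reference, a sketch of the math the loop below implements: the logistic-regression cross-entropy gradient over a mini-batch, and the AdaGrad step size (the square and the division are elementwise). Here $X$ is the 1000-sample mini-batch with a bias column appended, $y$ holds the mini-batch labels, $\eta$ is lr, and $G$ is the adagrad accumulator:

$$\nabla_w L = X^\top\bigl(\sigma(Xw) - y\bigr),\qquad G \leftarrow G + (\nabla_w L)^2,\qquad w \leftarrow w - \frac{\eta}{\sqrt{G + \epsilon}}\,\nabla_w L$$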

# Training
train_set = shuffle(train_set) # shuffle the training set
train_set_x = train_set[:,0:-1]
train_set_x = np.concatenate((train_set_x,np.ones([t_len,1])),axis=1) # append a constant column for the bias
train_set_y = train_set[:,-1]

vali_set = shuffle(vali_set) # shuffle the validation set
vali_set_x = vali_set[:,0:-1]
vali_set_x = np.concatenate((vali_set_x,np.ones([len(data)-t_len,1])),axis=1)
vali_set_y = vali_set[:,-1]
vali_set_y = vali_set_y.reshape(-1,1)
for t in range(iteration):
    y = np.zeros([batch,1]) # predictions for the current mini-batch
    y_hat = np.zeros([batch,1]) # labels for the current mini-batch
    for b in range(batch_times):
        x = train_set_x[batch*b:batch*(b+1),:]
        x = x.reshape(batch,-1)
        y_hat = train_set_y[batch*b:batch*(b+1)]
        y_hat = y_hat.reshape(batch,1) # true labels
        y = sigmoid(np.dot(x,w)) # predictions
        err = y - y_hat
        gradient = np.dot(x.transpose(),err)
        adagrad += gradient**2
        w = w - lr * gradient / np.sqrt(adagrad + eps) # AdaGrad update

    if(t%100 == 0):
        y_predict = sigmoid(np.dot(vali_set_x,w))
        loss_test = ComputeCost(y_predict,vali_set_y)/(len(data)-t_len)
        acc_test = ComputeAcc(np.round(y_predict),vali_set_y) # validation accuracy; np.round() thresholds at 0.5
        # Note: y and y_hat hold only the last mini-batch (1000 samples), so dividing by
        # t_len under-states the per-sample training loss; /batch would put it on the
        # same scale as the validation loss above.
        loss = ComputeCost(y,y_hat)/t_len # training cross-entropy (last mini-batch)
        acc = ComputeAcc(np.round(y),y_hat) # training accuracy (last mini-batch)

        print(str(t) + " training cross-entropy: "+str(loss))
        print(str(t) + " training accuracy: "+str(acc))
        print(str(t) + " validation cross-entropy: "+str(loss_test))
        print(str(t) + " validation accuracy: "+str(acc_test))

print(w[0:10])
0 training cross-entropy: 0.010911902493790744
0 training accuracy: 0.759
0 validation cross-entropy: 0.5247057318994045
0 validation accuracy: 0.7736822705492075
200 training cross-entropy: 0.006509445262417623
200 training accuracy: 0.873
200 validation cross-entropy: 0.3202077619856097
200 validation accuracy: 0.8761518614080354
400 training cross-entropy: 0.006031300922069907
400 training accuracy: 0.879
400 validation cross-entropy: 0.2975839189616665
400 validation accuracy: 0.879653520088463
500 training cross-entropy: 0.005918204175830794
500 training accuracy: 0.882
500 validation cross-entropy: 0.29214399402380054
500 validation accuracy: 0.88020641356432
[[ 0.37061675]
 [ 0.02173027]
 [ 0.13248705]
 [-0.07897371]
 [ 0.02843867]
 [-0.02160288]
 [-0.01915566]
 [ 0.01999223]
 [ 0.01932086]
 [-0.07371981]]
# Prediction
with open('work/data/X_test') as f:
    next(f) # skip the header row
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f],dtype=float)
X_test = (X_test-train_set_mean)/(train_set_std+0.0000001) # standardize with the training-set statistics
X_test = np.concatenate((X_test,np.ones([len(X_test),1])),axis=1) # append the constant bias column
print(X_test.shape)
predict = np.round(sigmoid(np.dot(X_test,w)))
(27622, 511)
# Save the predictions
import csv
with open('submit_logistic.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'label']
    print(header)
    csv_writer.writerow(header)
    for i in range(len(predict)): # write every test sample (the original looped over only 240 rows)
        row = ['id_' + str(i), int(predict[i][0])]
        csv_writer.writerow(row)
        if(i<10):
            print(row)
['id', 'label']
['id_0', 0]
['id_1', 0]
['id_2', 0]
['id_3', 0]
['id_4', 0]
['id_5', 1]
['id_6', 0]
['id_7', 1]
['id_8', 0]
['id_9', 0]

Generative Model

import numpy as np
import pandas as pd
with open('work/data/X_train') as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
with open('work/data/Y_train') as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)

Y_train = Y_train.reshape(-1,1)
data = np.concatenate((X_train,Y_train),axis=1)
print(data.shape)
(54256, 511)
# Shuffle the data
def shuffle(x):
    return np.random.permutation(x)
shuffle_data = shuffle(data) # shuffle the dataset first
print(shuffle_data)
[[43.  1.  0. ...  1.  0.  0.]
 [68.  0.  0. ...  0.  1.  0.]
 [67.  0.  0. ...  0.  1.  0.]
 ...
 [21.  0.  0. ...  0.  1.  0.]
 [ 3.  0.  0. ...  1.  0.  0.]
 [16.  0.  0. ...  1.  0.  0.]]
# Split into training and validation sets
train_len = int(len(shuffle_data)*0.7) # split 7:3 into training and validation sets
train_data = shuffle_data[:train_len,:]
vail_data = shuffle_data[train_len:,:]
print(train_data.shape)
print(vail_data.shape)
(37979, 511)
(16277, 511)
# Standardize the training and validation sets separately
# Note: the last column is the label y, which must not be standardized
train_mean = np.mean(train_data[:,0:-1],axis=0)
train_std = np.std(train_data[:,0:-1],axis=0).reshape(1,-1)
train_data[:,0:-1] = (train_data[:,0:-1] - train_mean)/(train_std+0.00000001)
vail_mean = np.mean(vail_data[:,0:-1],axis=0)
vail_std = np.std(vail_data[:,0:-1],axis=0).reshape(1,-1)
vail_data[:,0:-1] = (vail_data[:,0:-1]-vail_mean)/(vail_std + 0.000000001)

Compute the class means μ0, μ1 and the shared covariance Σ
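
For reference, the maximum-likelihood estimates computed in the cells below, where $N_k$ is the number of samples in class $k$:

$$\mu_k = \frac{1}{N_k}\sum_{x \in C_k} x,\qquad \Sigma_k = \frac{1}{N_k}\sum_{x \in C_k} (x - \mu_k)(x - \mu_k)^\top,\qquad \Sigma = \frac{N_0\,\Sigma_0 + N_1\,\Sigma_1}{N_0 + N_1}$$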

# Training set: split by class
e0 = (train_data[:,-1]==0) # mask for rows whose true label is 0
e1 = (train_data[:,-1]==1) # mask for rows whose true label is 1
train_0 = train_data[e0]
train_0 = train_0[:,0:-1] # drop the label column
train_1 = train_data[e1]
train_1 = train_1[:,0:-1]
# Validation set: split by class
e0 = (vail_data[:,-1]==0) # mask for rows whose true label is 0
e1 = (vail_data[:,-1]==1) # mask for rows whose true label is 1
vail_0 = vail_data[e0]
vail_0 = vail_0[:,0:-1] # drop the label column
vail_1 = vail_data[e1]
vail_1 = vail_1[:,0:-1]


Compute μ0 and μ1

# Training-set class means
train_mean_0 = np.mean(train_0,axis=0) # μ0
train_mean_1 = np.mean(train_1,axis=0) # μ1
# Validation-set class means
vail_mean_0 = np.mean(vail_0,axis=0) # μ0
vail_mean_1 = np.mean(vail_1,axis=0) # μ1

The per-class covariance is estimated as $\Sigma = \frac{1}{N}\,(X - \mu)^\top (X - \mu)$, where $N$ is the number of samples in the class, $X$ stacks those samples as rows, and $\cdot^\top$ denotes the transpose. This is the maximum-likelihood estimate, dividing by $N$; dividing by $N - 1$ would give the unbiased estimate, but the code below uses $N$.

# Training set
# Compute Σ
# First compute Σ0 and Σ1 separately
train_cov0 = (train_0-train_mean_0).T.dot((train_0-train_mean_0))/len(train_0)
train_cov1 = (train_1-train_mean_1).T.dot((train_1-train_mean_1))/len(train_1)
# Shared covariance: weighted average of the class covariances
train_cov = (len(train_0)*train_cov0 + len(train_1)*train_cov1)/len(train_data)
# Validation set
# Compute Σ
# First compute Σ0 and Σ1 separately
vail_cov0 = (vail_0-vail_mean_0).T.dot((vail_0-vail_mean_0))/len(vail_0)
vail_cov1 = (vail_1-vail_mean_1).T.dot((vail_1-vail_mean_1))/len(vail_1)

# Shared covariance
vail_cov = (len(vail_0)*vail_cov0 + len(vail_1)*vail_cov1)/len(vail_data)
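
A quick, optional sanity check of the covariance code above, assuming the variables from the previous cell are in scope: np.cov with rowvar=False treats columns as variables, and bias=True selects the same 1/N normalization used here.

# Optional sanity check: should agree up to floating-point error
assert np.allclose(train_cov0, np.cov(train_0, rowvar=False, bias=True))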

# Training set: invert Σ via SVD (Σ is symmetric and may be singular because of the one-hot columns)
u, s, v = np.linalg.svd(train_cov, full_matrices=False)
train_inv_cov = np.matmul(v.T * 1 / s, u.T) # pseudo-inverse of Σ
# Validation set
u, s, v = np.linalg.svd(vail_cov, full_matrices=False)
vail_inv_cov = np.matmul(v.T * 1 / s, u.T) # pseudo-inverse of Σ
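
The SVD route is used because the one-hot features make Σ rank-deficient, so a plain np.linalg.inv would be numerically unstable or fail outright. An equivalent shortcut, if you are happy with NumPy's default rcond cutoff, is the built-in pseudo-inverse:

# Equivalent alternative: NumPy's pseudo-inverse (also SVD-based); unlike the manual
# version above it truncates singular values below rcond * max(s)
train_inv_cov = np.linalg.pinv(train_cov)
vail_inv_cov = np.linalg.pinv(vail_cov)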

Compute w and b
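
For reference, the closed-form parameters implemented in the next cell, with class 0 placed in the numerator of the posterior (which is why the final prediction below is flipped with 1 - round):

$$w^\top = (\mu_0 - \mu_1)^\top \Sigma^{-1},\qquad b = -\tfrac{1}{2}\,\mu_0^\top \Sigma^{-1} \mu_0 + \tfrac{1}{2}\,\mu_1^\top \Sigma^{-1} \mu_1 + \ln\frac{N_0}{N_1}$$

so that $P(y = 0 \mid x) = \sigma(w^\top x + b)$.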

# Compute w and b from the training-set statistics
train_w = np.dot(train_mean_0-train_mean_1,train_inv_cov).reshape(-1,1)
train_b = train_mean_0.T.dot(train_inv_cov).dot(train_mean_0)*(-0.5)+0.5*np.dot(train_mean_1.T,train_inv_cov).dot(train_mean_1)+np.log(len(train_0)/len(train_1))
# Sigmoid function
def sigmoid(z):
    return 1/(1.0+np.exp(-np.clip(z,-10,10))) # clip z to [-10, 10] so exp() never overflows
# Cross-entropy loss; X holds the predictions, y the true labels
def ComputeCost(X,y):
    loss = -np.sum(y*np.log(X)+(1-y)*np.log(1-X))
    return loss
# Accuracy; y_p: predictions, y_t: true labels
def ComputeAcc(y_p,y_t):
    acc = 1 - np.mean(np.abs(y_p - y_t))
    return acc
# Compute the accuracy on the training set
train_x = train_data[:,0:-1]
train_x = train_x.reshape(len(train_data),-1)
train_y_label = train_data[:,-1] # true labels
train_y_label = train_y_label.reshape(-1,1)

f = np.matmul(train_x, train_w) + train_b
train_y_pred = 1 - np.round(sigmoid(f)) # sigmoid(f) is P(y=0|x) since w uses (μ0 - μ1), so flip to predict y
train_acc = ComputeAcc(train_y_pred,train_y_label)
print("Training accuracy: " + str(train_acc))
Training accuracy: 0.8702440822559836
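
The original run stops at the training accuracy. As a natural follow-up (not part of the original post), a minimal sketch of the validation accuracy, reusing the w and b fitted on the training split:

# Evaluate the generative model on the held-out validation split
vail_x = vail_data[:,0:-1]
vail_y_label = vail_data[:,-1].reshape(-1,1)

f_vail = np.matmul(vail_x, train_w) + train_b
vail_y_pred = 1 - np.round(sigmoid(f_vail)) # sigmoid gives P(y=0|x); flip to predict y
vail_acc = ComputeAcc(vail_y_pred, vail_y_label)
print("Validation accuracy: " + str(vail_acc))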
