Statistical Learning Methods (《统计学习方法》), Chapter 2: The Perceptron, Python Implementation

Reference links

Perceptron theory and derivation: https://blog.csdn.net/ACM_hades/article/details/89496175
Data: https://github.com/WenDesi/lihang_book_algorithm/blob/master/data

Dataset and features

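  • Dataset: We use the MNIST dataset, which contains images of handwritten digits (0-9), each 28×28 pixels. MNIST has 10 classes, so to turn it into a binary classification problem we relabel it: samples with label 0 keep label 0, and samples with a label greater than 0 are relabeled 1.
  • Feature selection: There are many possible features, including:
    • hand-crafted features
    • the whole image as the feature vector
    • HOG features
  • We use HOG features (324 dimensions) and the whole image as the feature vector (784 = 28×28 dimensions).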
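
The relabeling and the "whole image as a feature vector" choice described in the list above can be sketched in a few lines of NumPy. This is only an illustration: digit_labels and toy_image are made-up stand-ins for the real MNIST columns, and the last step (mapping 0/1 to -1/+1) is the conversion the perceptron needs for its targets.

```python
import numpy as np

# Hypothetical toy labels standing in for the MNIST digit column (0-9).
digit_labels = np.array([0, 3, 7, 0, 9, 1])

# Label 0 stays 0; every label greater than 0 becomes 1.
binary_labels = (digit_labels > 0).astype(np.int64)

# The perceptron works with targets in {-1, +1}, so map 0 -> -1 and 1 -> +1.
signed_labels = 2 * binary_labels - 1

# Using the whole image as the feature vector means flattening 28x28 to 784.
toy_image = np.zeros((28, 28), dtype=np.uint8)  # hypothetical blank image
flat_features = toy_image.reshape(-1)

print(binary_labels)        # [0 1 1 0 1 1]
print(signed_labels)        # [-1  1  1 -1  1  1]
print(flat_features.shape)  # (784,)
```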

Code

import pandas as pd
import numpy as np
import cv2  # OpenCV, needed by get_hog_features
import random
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Extract HOG features from the images with OpenCV
def get_hog_features(trainset):
    features = []
    hog = cv2.HOGDescriptor('../hog.xml')  # HOG parameters are loaded from this file
    for img in trainset:
        img = np.reshape(img, (28, 28))
        cv_img = img.astype(np.uint8)
        hog_feature = hog.compute(cv_img)
        features.append(hog_feature)
    features = np.array(features)
    features = np.reshape(features, (-1, 324))  # one 324-dimensional row per image
    return features

# Perceptron model
class Perceptron(object):
    def __init__(self):
        self.learning_step = 0.00001
        self.max_iteration = 5000

    def model_function(self, x):
        wx = x.dot(self.w)
        return np.sign(wx)

    def train(self, features, labels):
        # Fold the bias b into w as its last component
        self.w = np.zeros(len(features[0]) + 1, dtype=np.float32)
        correct_count = 0
        # Stop once more than max_iteration randomly drawn samples
        # have been classified correctly
        while correct_count <= self.max_iteration:
            # Pick one random sample for a stochastic gradient descent step
            index = random.randint(0, len(labels) - 1)
            x = features[index]
            x = np.append(x, 1.0)  # constant 1 that multiplies the bias b
            y = labels[index]
            pred = self.model_function(x)

            if y * pred > 0:  # sample classified correctly
                correct_count += 1
                continue
            # Misclassified: update w <- w + learning_step * y * x
            self.w += self.learning_step * y * x

    def predict(self, features):
        labels = []
        for feature in features:
            x = np.append(feature, 1.0)
            labels.append(self.model_function(x))
        return labels


if __name__ == '__main__':

    print('Start read data')
    S = time.time()
    raw_data = pd.read_csv('../data/train_binary.csv')  # read the data
    data = raw_data.values  # convert to a NumPy array
    print("data shape:", data.shape)
    imgs = data[:, 1:]
    labels = data[:, 0]

    # imgs = get_hog_features(imgs)  # uncomment to use HOG features instead of raw pixels
    print("imgs shape:", imgs.shape)
    print("labels shape:", labels.shape)

    # Use about 2/3 of the data for training and about 1/3 for testing (test_size=0.33)
    train_features, test_features, train_labels, test_labels = train_test_split(
        imgs, labels, test_size=0.33, random_state=23323)
    train_labels = 2 * train_labels - 1  # map 0/1 to -1/+1
    test_labels = 2 * test_labels - 1
    print("train data count :%d" % len(train_labels))
    print("test data count :%d" % len(test_labels))
    print('read data cost ', time.time() - S, ' second')

    print('Start training')
    S = time.time()
    p = Perceptron()
    p.train(train_features, train_labels)
    print('training cost ', time.time() - S, ' second')

    print('Start predicting')
    S = time.time()
    test_predict = p.predict(test_features)
    print('predicting cost ', time.time() - S, ' second')

    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is ", score)
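
To see the update rule w <- w + learning_step * y * x at work on something small, here is a minimal sketch on three toy points. It is not the training loop from the listing: it cycles through the samples in order instead of drawing them at random, so the run is reproducible, and the function name train_perceptron, its parameters, and the data are all illustrative assumptions.

```python
import numpy as np

def train_perceptron(features, labels, learning_step=1.0, max_epochs=100):
    """Train on labels in {-1, +1}; returns weights with the bias as the last entry."""
    # Append a constant 1 to each sample so the bias b is folded into w.
    augmented = np.hstack([features, np.ones((len(features), 1))])
    w = np.zeros(augmented.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(augmented, labels):
            if y * np.dot(w, x) <= 0:        # misclassified, or exactly on the boundary
                w += learning_step * y * x   # w <- w + learning_step * y * x
                mistakes += 1
        if mistakes == 0:                    # a full pass without errors: stop
            break
    return w

# Hypothetical toy data: two positive points and one negative point.
toy_features = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
toy_labels = np.array([1, 1, -1])

w = train_perceptron(toy_features, toy_labels)
predictions = np.sign(np.hstack([toy_features, np.ones((3, 1))]) @ w)

print(w)            # [ 1.  1. -3.]
print(predictions)  # [ 1.  1. -1.]
```

On linearly separable data like this the loop stops after a finite number of updates. The MNIST run in the listing uses a different stopping rule, based on a count of correctly classified draws.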

Output:
    HOG features:
        Start read data
        data shape: (42000, 785)
        imgs shape: (42000, 324)
        labels shape: (42000,)
        train data count :28140
        test data count :13860
        read data cost  5.35866117477417  second
        Start training
        training cost  0.07878541946411133  second
        Start predicting
        predicting cost  0.12164664268493652  second
        The accuracy score is  0.9935786435786436
    Raw image features:
        Start read data
        data shape: (42000, 785)
        imgs shape: (42000, 784)
        labels shape: (42000,)
        train data count :28140
        test data count :13860
        read data cost  3.7569241523742676  second
        Start training
        training cost  0.08876228332519531  second
        Start predicting
        predicting cost  0.12666058540344238  second
        The accuracy score is  0.9242424242424242
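
The two accuracy figures are easier to compare as error counts. The short calculation below converts each reported accuracy back into the number of misclassified test images, using only the numbers printed in the two runs; the helper name misclassified is just for illustration.

```python
# Figures copied from the two runs in the output listing.
n_test = 13860
accuracy_hog = 0.9935786435786436
accuracy_raw = 0.9242424242424242

def misclassified(accuracy, n_samples):
    """Number of wrong predictions implied by an accuracy over n_samples."""
    return n_samples - round(accuracy * n_samples)

errors_hog = misclassified(accuracy_hog, n_test)
errors_raw = misclassified(accuracy_raw, n_test)

print(errors_hog, errors_raw)  # 89 1050
```

So on the same 13860 test images, the HOG run makes 89 mistakes and the raw-pixel run makes 1050.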

API notes

  • accuracy_score:https://blog.csdn.net/u011630575/article/details/79645814
  • train_test_split:https://blog.csdn.net/u011089523/article/details/72810720
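
For a quick feel of the two scikit-learn functions used in the script, here is a small sketch on made-up arrays. The variable names and data are hypothetical; only train_test_split (with test_size and random_state) and accuracy_score are the calls used in the listing.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples with 2 features each, labels in {-1, +1}.
toy_features = np.arange(16).reshape(8, 2)
toy_labels = np.array([1, -1, 1, -1, 1, -1, 1, -1])

# test_size=0.25 holds out a quarter of the samples; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels, test_size=0.25, random_state=0)
print(len(X_train), len(X_test))  # 6 2

# accuracy_score returns the fraction of predictions that match the true labels.
y_true = [1, -1, 1, 1]
y_pred = [1, -1, -1, 1]
toy_score = accuracy_score(y_true, y_pred)
print(toy_score)  # 0.75
```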
