Machine Learning Foundations: Homework 1

Hsuan-Tien Lin (林轩田)'s Machine Learning Foundations (《机器学习基石》) on Coursera is a great course. I've collected some of the programming assignments here, with reference to mac Jiang's answers: https://blog.csdn.net/a1015553840/article/details/51085129

Homework 1

Questions 15-17 use the naive PLA (perceptron learning algorithm):

  1. Initialize w
    repeat {
      1. Find the next point (x, y) misclassified by w(t), i.e. sign(w(t)' * x) != y;
      2. Correct the mistake: w(t+1) = w(t) + y * x;
    } until (every sample is classified correctly)
  2. Return w
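The steps above can be sketched as a self-contained example on a tiny linearly separable toy set (the data here is synthetic, not the homework file; sign(0) is taken as -1, the convention used in this homework):

```python
import numpy as np

def sign(v):
    # Homework convention: sign(0) counts as -1.
    return 1 if v > 0 else -1

def pla(X, y):
    """Naive PLA: returns (weights, number of updates)."""
    w = np.zeros(X.shape[1])
    updates = 0
    while True:
        halted = True
        for xi, yi in zip(X, y):
            if sign(w.dot(xi)) != yi:
                w += yi * xi          # correct the mistake
                updates += 1
                halted = False
        if halted:                    # a full pass with no mistakes
            return w, updates

# Toy data: x0 = 1 is the bias coordinate; the points are linearly separable.
X = np.array([[1., 0., 0.], [1., 1., 0.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([-1, 1, 1, 1])
w, updates = pla(X, y)
```

On this toy set the algorithm halts after 4 updates with a weight vector that classifies every point correctly.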
import csv

import numpy as np


def sign(v):
    # Homework convention: sign(0) counts as -1.
    return 1 if v > 0 else -1


def naive_PLA():
    updates = 0
    w = np.zeros(5)  # 4 features plus the bias coordinate x0 = 1
    while True:
        halt = True
        with open('hw1_15_train.dat') as csvfile:
            reader = csv.reader(csvfile, delimiter='\t')
            for line in reader:
                x = np.asarray(line[0].split(), dtype=float)  # np.float is removed in NumPy 1.24+
                x = np.insert(x, 0, 1)  # prepend the bias coordinate
                y = np.array(line[1], dtype=int)
                if sign(w.dot(x)) != y:
                    updates += 1
                    w += y * x  # correct the mistake
                    halt = False
        if halt:  # a full pass with no mistakes
            break
    return updates

The final answer: 45 updates.


For convenience, a DataSet class and a PLA class are defined.

import csv
import sys
from random import shuffle

import numpy as np


class DataSet:
    def __init__(self, filename):
        self.input = []
        self.output = []
        self.load_data(filename)

    def load_data(self, filename):
        with open(filename) as csvfile:
            reader = csv.reader(csvfile, delimiter='\t')
            for line in reader:
                x = line[0].split()
                x.insert(0, '1')  # prepend the bias coordinate x0 = 1
                y = line[1]
                self.input.append(x)
                self.output.append(y)
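Assuming each line of the .dat file holds whitespace-separated features, a tab, then the +1/-1 label, the parsing done in load_data can be checked against an in-memory line (the feature values below are made up for illustration):

```python
import csv
import io

# One sample line in the assumed hw1_15_train.dat layout:
# whitespace-separated features, a tab, then the +1/-1 label.
sample = "0.97681 0.10723 0.64385 0.29556\t1\n"

reader = csv.reader(io.StringIO(sample), delimiter='\t')
line = next(reader)     # ['0.97681 0.10723 0.64385 0.29556', '1']
x = line[0].split()     # feature strings
x.insert(0, '1')        # prepend the bias coordinate x0 = 1
y = line[1]             # label string
```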


class PLA:
    def __init__(self, train_name='hw1_15_train.dat', test_name=None):
        self.train_set = DataSet(train_name)
        if test_name:
            self.test_set = DataSet(test_name)

    def random_cycle_pla(self, times=2000, eta=1, print_out=False):
        total_updates = 0
        data_set = list(zip(self.train_set.input, self.train_set.output))
        for _ in range(times):
            shuffle(data_set)
            current_updates = self.naive_pla(data_set, eta, print_out)
            total_updates += current_updates
        return total_updates / times

    def naive_pla(self, data_set=None, eta=1, print_out=False):
        """naive perceptron learning algorithm"""
        current_updates = 0
        w = np.zeros(5)
        if not data_set:
            data_set = list(zip(self.train_set.input, self.train_set.output))
        while True:
            halt = True
            for item in data_set:
                x = np.array(item[0], dtype=float)
                y = np.array(item[1], dtype=int)
                if sign(w.dot(x)) != y:
                    current_updates += 1
                    w += eta * y * x
                    halt = False
            if halt:
                break
        if print_out:
            print(f'Halted after {current_updates} updates')
        return current_updates

Averaged over 200 runs, the number of updates is 38.145.


When updating w, simply multiply by eta = 0.5: w += eta * y * x
Averaged over 200 runs, the number of updates is 40.245.
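A side note worth verifying: because w is initialized to zero, running with learning rate eta simply scales every intermediate w by eta, so sign(w.dot(x)), and therefore the number of updates, is identical for a fixed visiting order; the gap between 38.145 and 40.245 comes from the random shuffles, not from eta. A quick check on toy data (synthetic, not the homework file):

```python
import numpy as np

def sign(v):
    # Homework convention: sign(0) counts as -1.
    return 1 if v > 0 else -1

def pla_updates(X, y, eta):
    """Naive PLA with learning rate eta; returns the update count."""
    w = np.zeros(X.shape[1])
    updates = 0
    while True:
        halted = True
        for xi, yi in zip(X, y):
            if sign(w.dot(xi)) != yi:
                w += eta * yi * xi    # scaling eta scales w uniformly
                updates += 1
                halted = False
        if halted:
            return updates

X = np.array([[1., 0., 0.], [1., 1., 0.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([-1, 1, 1, 1])
```

For any eta > 0 the update count on a fixed visiting order is the same.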

Questions 18-20 involve data that is not linearly separable, so the pocket PLA algorithm is used:

  1. Initialize w and pocket_w
    repeat {
      1. Find a point (x, y) misclassified by w(t);
      2. Correct the mistake: w(t+1) = w(t) + y * x;
      3. If w(t+1) makes fewer errors on the training set than pocket_w, replace pocket_w with w(t+1);
    } until (enough update iterations have run)
  2. Return pocket_w

Because the error rate over all samples must be recomputed after every update of w, this algorithm costs more than naive PLA; the benefit is that it also yields a solution for problems that are not linearly separable.
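The loop above can be sketched as a self-contained example on a small non-separable toy set (the data and helper names here are illustrative assumptions, not the homework file):

```python
import numpy as np

def sign(v):
    # Homework convention: sign(0) counts as -1.
    return 1 if v > 0 else -1

def errors(w, X, y):
    """Number of points misclassified by w."""
    return sum(sign(w.dot(xi)) != yi for xi, yi in zip(X, y))

def pocket_pla(X, y, max_updates=50):
    """Pocket PLA: w is always updated; pocket_w keeps the best w seen."""
    w = np.zeros(X.shape[1])
    pocket_w, min_errors = w.copy(), errors(w, X, y)
    updates = 0
    while updates < max_updates and min_errors > 0:
        made_update = False
        for xi, yi in zip(X, y):
            if sign(w.dot(xi)) != yi:
                w = w + yi * xi             # step 2: always correct w
                updates += 1
                made_update = True
                e = errors(w, X, y)
                if e < min_errors:          # step 3: pocket the better w
                    min_errors, pocket_w = e, w.copy()
                if updates >= max_updates or min_errors == 0:
                    break
        if not made_update:                 # w already classifies everything
            break
    return pocket_w, min_errors

# Four separable points plus one noisy point labelled against the trend,
# so no weight vector can reach zero errors.
X = np.array([[1., 0., 0.], [1., 1., 0.], [1., 0., 1.],
              [1., 1., 1.], [1., 2., 2.]])
y = np.array([-1, 1, 1, 1, -1])
pocket_w, min_err = pocket_pla(X, y, max_updates=50)
```

On this toy set the pocket ends up holding a weight vector that misclassifies only a single point, even though the plain w keeps bouncing around.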


We only need to add an errors_count function to count the mistakes and a pocket_algorithm function to compute pocket_weight, then compute the final error rate.

    def errors_count(self, w, data_set):
        """"统计errors发生次数"""
        count = 0
        for x, y in data_set:
            x = np.array(x, dtype=float)
            y = np.array(y, dtype=int)
            if sign(w.dot(x)) != y:
                count += 1
        return count

    def pocket_algorithm(self, update_times=50, pocket=True):
        """pocket=True: return pocket_weight;
        otherwise return the final w."""
        data_set = list(zip(self.train_set.input, self.train_set.output))
        updates = 0
        w = np.zeros(5)
        pocket_weight = np.zeros(5)
        min_errors = sys.maxsize
        halt = False
        while not halt:
            shuffle(data_set)
            for item in data_set:
                x = np.array(item[0], dtype=float)
                y = np.array(item[1], dtype=int)
                if sign(w.dot(x)) != y:
                    w = w + y * x  # w is updated on every mistake
                    updates += 1
                    # print(f'updates: {updates}')
                    errors_count = self.errors_count(w, data_set)
                    if errors_count < min_errors:
                        min_errors = errors_count
                        pocket_weight = w  # pocket_weight is replaced only when a better w is found
                if updates >= update_times or min_errors == 0:
                    halt = True
                    break
        return pocket_weight if pocket else w

    def cal_test_error_rate(self, update_times=50, times=2000, pocket=True):
        train_set = list(zip(self.train_set.input, self.train_set.output))
        test_set = list(zip(self.test_set.input, self.test_set.output))
        train_avg_rate, test_avg_rate = 0, 0
        for _ in range(times):
            w = self.pocket_algorithm(update_times=update_times, pocket=pocket)
            train_error_counts = self.errors_count(w, train_set)
            test_error_counts = self.errors_count(w, test_set)
            train_avg_rate += train_error_counts / len(train_set)
            test_avg_rate += test_error_counts / len(test_set)
        return train_avg_rate / times, test_avg_rate / times

The average error rate over 2000 runs is about 0.1313.



Here pocket_algorithm returns the final w rather than the best weights pocket_weight; the average error rate over 200 runs is 0.3798.


Increasing the pocket algorithm's update count from 50 to 100 lowers the error rate slightly: the average over 200 runs is about 0.1163.

Links to all the homework posts: https://www.jianshu.com/p/c8d06e7cb3c4
