Neural Network Learning: "You Snap, I Guess" (you take the photo, the AI guesses)

In this project, you will learn to use a neural network to classify whether a photo shows a dog, a cat, or a person.

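This project uses a small, preprocessed dataset that contains only the features extracted from the images. For background on how image features are obtained, see the OpenCV documentation on image features: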
http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_features_meaning/py_features_meaning.html

In this notebook, we use the following four features to estimate the probability that an image shows a dog, a cat, or a person:

Feature1
Feature2
Feature3
Feature4

The 'class' column is the label: 0 means person, 1 means cat, and 2 means dog.

Each row represents one image.

Loading the data

To load the data and format it nicely, we will use two very useful packages, Pandas and NumPy. You can read their documentation here:

https://pandas.pydata.org/pandas-docs/stable/
https://docs.scipy.org/

# Show all plots inline in the notebook
%matplotlib inline
# Importing pandas and numpy
import pandas as pd
import numpy as np
from IPython.display import display

# Reading the csv file into a pandas DataFrame
dataset = pd.read_csv('data.csv')

# Shuffle all the rows in the dataset
dataset = dataset.sample(frac=1)

# Show the first 10 rows
dataset[:10]

	feature1	feature2	feature3	feature4	class
59	5.2	1080.0	3.9	14.0	1
146	6.3	1000.0	5.0	19.0	2
1	4.9	1200.0	1.4	2.0	0
118	7.7	1040.0	6.9	23.0	2
124	6.7	1320.0	5.7	21.0	2
96	5.7	1160.0	4.2	13.0	1
42	4.4	1280.0	1.3	2.0	0
95	5.7	1200.0	4.2	12.0	1
145	6.7	1200.0	5.2	23.0	2
40	5.0	1400.0	1.3	3.0	0

Data analysis: plotting and visualizing the data

Let's start by plotting the data to see how the features relate to one another. We begin with feature1 and feature2.

# Importing matplotlib
import matplotlib.pyplot as plt

# Function to help us plot
def plot_points(dataset):
    X = np.array(dataset[["feature1","feature2"]])
    y = np.array(dataset["class"])
    
    people = X[np.argwhere(y==0)]
    cat = X[np.argwhere(y==1)]
    dog = X[np.argwhere(y==2)]
    
    plt.scatter([s[0][0] for s in people], [s[0][1] for s in people], s = 25, color = 'red', edgecolor = 'k')
    plt.scatter([s[0][0] for s in cat], [s[0][1] for s in cat], s = 25, color = 'cyan', edgecolor = 'k')
    plt.scatter([s[0][0] for s in dog], [s[0][1] for s in dog], s = 25, color = 'yellow', edgecolor = 'k')
    
    plt.xlabel('Feature_1')
    plt.ylabel('Feature_2')
    
# Plotting the points
plot_points(dataset)
plt.show()

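In the plot, red is people, cyan is cats, and yellow is dogs. Roughly speaking, these two features alone do not separate dogs, cats, and people very well. Perhaps taking the other two features into account will help? Next we draw a grid of plots with seaborn's pairplot function: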
https://seaborn.pydata.org/generated/seaborn.pairplot.html

# plotting high-dimensional
import seaborn as sns

sns.pairplot(dataset, hue='class', vars=["feature1","feature2","feature3","feature4"])

In the plots, class=0 means person, 1 means cat, and 2 means dog.

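Task 1: Split the dataset into independent variables (data) and the dependent variable (label)

The columns ['feature1','feature2','feature3','feature4'] are the independent variables, data;

the ['class'] column is the dependent variable, label.

You may find the pandas iloc and loc indexers useful here: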

https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.iloc.html
https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.loc.html

# separate dataset into data - feature table and label table
data = dataset.iloc[:,:4].copy()  # copy, so the scaling in Task 3 does not write into a slice of dataset
label = dataset.loc[:,'class']

display(data[:10])
display(label[:10])

	feature1	feature2	feature3	feature4
59	5.2	1080.0	3.9	14.0
146	6.3	1000.0	5.0	19.0
1	4.9	1200.0	1.4	2.0
118	7.7	1040.0	6.9	23.0
124	6.7	1320.0	5.7	21.0
96	5.7	1160.0	4.2	13.0
42	4.4	1280.0	1.3	2.0
95	5.7	1200.0	4.2	12.0
145	6.7	1200.0	5.2	23.0
40	5.0	1400.0	1.3	3.0

59 1
146 2
1 0
118 2
124 2
96 1
42 0
95 1
145 2
40 0
Name: class, dtype: int64

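Task 2: One-hot encode the classes

So that the labels match the probability distribution produced by softmax, we will use the Pandas get_dummies function to one-hot encode label.

Question 1: What is one-hot encoding for?

Answer: A model works only with numbers and cannot interpret category names directly. One-hot encoding turns a discrete feature into numbers while preserving the fact that the categories are unrelated to one another. For example, if duck, beaver, and walrus are encoded as 0, 1, 2, the three animals have no inherent relationship, yet 0, 1, 2 carry a numerical magnitude and order that clearly do not exist in the original feature. After one-hot encoding, the three codes move from 0, 1, 2 on a one-dimensional number line to 100, 010, 001 on three mutually orthogonal axes, with no order or magnitude relationship among them.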
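To make the orthogonality point concrete, here is a minimal NumPy sketch that is not part of the original notebook. It one-hot encodes integer labels by indexing an identity matrix; the helper name `one_hot` and the sample labels are illustrative assumptions.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return an (n_samples, n_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(n_classes, dtype=int)[labels]

encoded = one_hot([1, 2, 0], n_classes=3)
print(encoded)

# Vectors of two different classes are orthogonal: their dot product is 0.
print(int(encoded[0] @ encoded[1]))
```

Unlike the integer codes 0, 1, 2, no pair of these vectors is "closer" or "larger" than another pair.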

# TODO:  Make dummy variables for labels
dummy_label = pd.get_dummies(label)

# Print the first 10 rows of our data
dummy_label[:10]

	0	1	2
59	0	1	0
146	0	0	1
1	1	0	0
118	0	0	1
124	0	0	1
96	0	1	0
42	1	0	0
95	0	1	0
145	0	0	1
40	1	0	0

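Task 3: Normalizing the data

Because a neural network learns weights that multiply the input values, features on very different scales make training harder, so we need to rescale the data first. Notice that feature2 and feature4 have a much larger range than feature1 and feature3. Let's shrink these two features using (x - min) / (max - min), which maps each of them into [0, 1].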

# TODO: Scale the columns

data['feature2'] = (data['feature2']-data['feature2'].min())/(data['feature2'].max()-data['feature2'].min())
data['feature4'] = (data['feature4']-data['feature4'].min())/(data['feature4'].max()-data['feature4'].min())

# Printing the first 10 rows of our processed data
data[:10]

	feature1	feature2	feature3	feature4
59	5.2	0.291667	3.9	0.541667
146	6.3	0.208333	5.0	0.750000
1	4.9	0.416667	1.4	0.041667
118	7.7	0.250000	6.9	0.916667
124	6.7	0.541667	5.7	0.833333
96	5.7	0.375000	4.2	0.500000
42	4.4	0.500000	1.3	0.041667
95	5.7	0.416667	4.2	0.458333
145	6.7	0.416667	5.2	0.916667
40	5.0	0.625000	1.3	0.083333
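The min-max rule used in Task 3 can also be wrapped in a small helper that remembers the minimum and maximum it was fitted on, so the same statistics can be reused on new rows later. The following is only a sketch: the function names `fit_min_max` and `apply_min_max` and the toy values are assumptions, not part of the original notebook.

```python
import pandas as pd

def fit_min_max(frame, columns):
    """Record the (min, max) pair of each listed column."""
    return {col: (frame[col].min(), frame[col].max()) for col in columns}

def apply_min_max(frame, stats):
    """Return a scaled copy of frame using previously recorded (min, max) pairs."""
    scaled = frame.copy()
    for col, (lo, hi) in stats.items():
        scaled[col] = (frame[col] - lo) / (hi - lo)
    return scaled

# Toy values chosen for illustration only.
toy = pd.DataFrame({"feature2": [1000.0, 1200.0, 1400.0],
                    "feature4": [2.0, 12.0, 22.0]})
stats = fit_min_max(toy, ["feature2", "feature4"])
scaled_toy = apply_min_max(toy, stats)
print(scaled_toy)
```

Keeping the fitted statistics separate from the data makes it harder to accidentally rescale new samples with their own minimum and maximum.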

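Task 4: Split the data into training and test sets

To test our algorithm, we split the data into a training set and a test set. The test set will be 10% of the total data.

You can use either numpy.random.choice or sklearn.model_selection.train_test_split.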

https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

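Question 2: What is the purpose of holding out a test set? Are there other ways to split the data?

Your answer: The training set is used to fit the model, and the test set is used to measure the accuracy of the trained model. The test set must be independent of the data used in training; otherwise it cannot give an objective estimate of the model's accuracy. Another approach is to split the training set further into k subsets, hold out one subset as the validation set each time while training on the remaining k-1 subsets, repeat the training k times, and average the k results. This is k-fold cross-validation.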
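The k-fold procedure described in the answer can be sketched in plain Python as follows. The generator name `k_fold_indices` is hypothetical, and the sketch assumes the rows have already been shuffled (as the notebook does with sample(frac=1)); scikit-learn provides a ready-made KFold class for real use.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k roughly equal folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Each sample lands in the validation set exactly once, and the k validation scores would then be averaged.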

# TODO: split train and test dataset
from sklearn.model_selection import train_test_split
train_data, test_data, train_label, test_label= train_test_split(data, dummy_label, test_size=0.10, random_state=22)

print("Number of training samples is", len(train_data))
print("Number of testing samples is", len(test_data))
print(train_data[:10])
print(test_data[:10])
print(train_label[:10])
print(test_label[:10])

Number of training samples is 135
Number of testing samples is 15
feature1 feature2 feature3 feature4
116 6.5 0.416667 5.5 0.708333
81 5.5 0.166667 3.7 0.375000
6 4.6 0.583333 1.4 0.083333
38 4.4 0.416667 1.3 0.041667
39 5.1 0.583333 1.5 0.041667
20 5.4 0.583333 1.7 0.041667
18 5.7 0.750000 1.7 0.083333
113 5.7 0.208333 5.0 0.791667
26 5.0 0.583333 1.6 0.125000
140 6.7 0.458333 5.6 0.958333
feature1 feature2 feature3 feature4
25 5.0 0.416667 1.6 0.041667
57 4.9 0.166667 3.3 0.375000
87 6.3 0.125000 4.4 0.500000
61 5.9 0.416667 4.2 0.583333
0 5.1 0.625000 1.4 0.041667
141 6.9 0.458333 5.1 0.916667
133 6.3 0.333333 5.1 0.583333
75 6.6 0.416667 4.4 0.541667
71 6.1 0.333333 4.0 0.500000
14 5.8 0.833333 1.2 0.041667
0 1 2
116 0 0 1
81 0 1 0
6 1 0 0
38 1 0 0
39 1 0 0
20 1 0 0
18 1 0 0
113 0 0 1
26 1 0 0
140 0 0 1
0 1 2
25 1 0 0
57 0 1 0
87 0 1 0
61 0 1 0
0 1 0 0
141 0 0 1
133 0 0 1
75 0 1 0
71 0 1 0
14 1 0 0

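Task 5: Train a multi-class neural network

The functions below train a two-layer neural network (an input layer connected directly to a softmax output layer). First, we will write some helper functions.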

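Softmax activation function

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{p} e^{x_j}}$$

Here p is the number of components of the input vector x.

The softmax function is commonly used in models with a multi-class target. It exponentiates every output and divides by the sum of the exponentials, so the results are all positive, sum to 1, and can be read as class probabilities. https://zh.wikipedia.org/wiki/Softmax%E5%87%BD%E6%95%B0

The sigmoid function is commonly used in models with a binary target. It maps any real value into (0, 1), which can be read as a probability. https://zh.wikipedia.org/wiki/S%E5%87%BD%E6%95%B0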

Error function: cross-entropy

$$E = -\sum_{i=1}^{m} y_i \ln(\hat{y}_i)$$

Here m is the number of classes.

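# TODO: Activation (softmax) function
def softmax(x):
    # Note: the sum runs over every element of x, so the result is a
    # probability distribution only when x is a single 1-D score vector.
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x)

# Cross-entropy loss for one sample; x is unused and kept only for the call signature
def loss_CE(x,y,y_hat):
    return -np.sum(y* np.log(y_hat))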
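The softmax helper above divides by the sum over all elements, which is what we want for one score vector but not for a batch of rows. As a hedged alternative, here is a sketch of a row-wise, numerically stable variant. The name `softmax_rows` is an assumption and the function is not used by the rest of the notebook.

```python
import numpy as np

def softmax_rows(z):
    """Softmax over the last axis; works for one vector or a batch of rows."""
    z = np.asarray(z, dtype=float)
    # Subtracting the row maximum leaves the result unchanged
    # but keeps np.exp from overflowing on large scores.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

batch = np.array([[1.0, 2.0, 3.0],
                  [1000.0, 1000.0, 1000.0]])
probs = softmax_rows(batch)
print(probs.sum(axis=1))
```

With this variant every row sums to 1 on its own, and the per-row argmax is the same as with the original helper.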

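Backpropagating the error

Now it is your turn to practice: write the error term. Remember that, for a softmax output with cross-entropy loss, it is given by the equation

$$\nabla_W E = -x\,(y - \hat{y})^{T}$$

where x is the input written as a column vector.

Hint: you can use numpy.reshape() or numpy.newaxis to implement this.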

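# TODO: Write the error term formula
def error_term_formula(x, y, y_hat):
    # Outer product of the input (as a column) with the output error (as a row);
    # the result has the same shape as the weight matrix.
    return -np.dot(x.reshape(-1,1), (y-y_hat).reshape(1,-1))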
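One way to gain confidence in an error term like this is a finite-difference gradient check. The sketch below re-implements the pieces it needs so that it runs on its own; the helper names and the random test values are assumptions made for illustration, and the one-hot target is assumed to sum to 1.

```python
import numpy as np

def softmax_vec(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def cross_entropy(W, x, y):
    return -np.sum(y * np.log(softmax_vec(x @ W)))

def analytic_grad(W, x, y):
    # Same expression as error_term_formula: -x (y - y_hat)^T
    y_hat = softmax_vec(x @ W)
    return -np.outer(x, y - y_hat)

def numeric_grad(W, x, y, eps=1e-6):
    # Central difference, one weight at a time.
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (cross_entropy(W + step, x, y)
                     - cross_entropy(W - step, x, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # 4 features, 3 classes
x = rng.normal(size=4)
y = np.array([0.0, 1.0, 0.0])    # one-hot target
max_diff = np.max(np.abs(analytic_grad(W, x, y) - numeric_grad(W, x, y)))
print(max_diff)
```

If the analytic formula is right, the largest difference should be tiny compared with the gradient entries themselves.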

# Training function
def train_nn(features, targets, epochs, learnrate):
    
    # Use the same seed to make debugging easier
    np.random.seed(42)

    n_records, n_features = features.shape
    last_loss = None

    # Initialize weights
    weights = np.zeros([features.shape[1],targets.shape[1]])

    for e in range(epochs):
        del_w = np.zeros(weights.shape)
        loss = []
        for x, y in zip(features.values, targets.values):
            # Loop through all records, x is the input, y is the target

            # Activation of the output unit
            #   Notice we multiply the inputs and the weights here 
            #   rather than storing h as a separate variable 
            output = softmax(np.dot(x, weights))
            
            # The cross-entropy loss for this record
            error = loss_CE(x, y, output)
            loss.append(error)
            # The error term           
            error_term = error_term_formula(x, y, output)
            #print(weights.shape)
            del_w += error_term
            
        # Update the weights here. The learning rate times the 
        # change in weights, divided by the number of records to average
        weights -= learnrate * del_w / n_records

        # Printing out the mean error on the training set
        if e % (epochs / 10) == 0:
            
            #out = softmax(np.dot(x, weights))
            loss = np.mean(np.array(loss))
            print("Epoch:", e)
            if last_loss and last_loss < loss:
                print("Train loss: ", loss, "  WARNING - Loss Increasing")
            else:
                print("Train loss: ", loss)
            last_loss = loss
            loss = []
            print("=========")
    print("Finished training!")
    return weights

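Task 6: Train your neural network

Set your hyperparameters and train your neural network.

Question 3: Are there any tips for choosing learnrate?

Answer: Start from a small value and increase it gradually. If the loss starts to increase, the step size is too large and training cannot converge. Look for a value that converges without converging too slowly.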
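The effect described in the answer can be seen on a toy problem. This sketch, which is an illustration and not the notebook's training loop, runs gradient descent on f(w) = w**2 with three assumed learning rates; the function name `descend` is hypothetical.

```python
def descend(learnrate, steps=20, w0=1.0):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    losses = []
    for _ in range(steps):
        w -= learnrate * 2 * w
        losses.append(w ** 2)
    return losses

small = descend(0.01)   # converges, but slowly
good = descend(0.3)     # converges quickly
large = descend(1.1)    # overshoots, so the loss grows every step
print(small[-1], good[-1], large[-1])
```

The same pattern shows up in the training log: a loss that keeps rising suggests the step is too large, and a loss that barely moves suggests it is too small.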

# TODO: SET Neural Network hyperparameters
epochs = 1000
learnrate = 0.18
weights = train_nn(train_data, train_label, epochs, learnrate)

Epoch: 0
Train loss: 1.09861228867
=========
Epoch: 100
Train loss: 0.550774008836
=========
Epoch: 200
Train loss: 0.446943578961
=========
Epoch: 300
Train loss: 0.281037086602
=========
Epoch: 400
Train loss: 0.193038619336
=========
Epoch: 500
Train loss: 0.182039814143
=========
Epoch: 600
Train loss: 0.174945169191
=========
Epoch: 700
Train loss: 0.169294951329
=========
Epoch: 800
Train loss: 0.164635233793
=========
Epoch: 900
Train loss: 0.160701520788
=========
Finished training!

Task 7: Compute the accuracy on the test data

Your outputs and labels are now one-hot encoded. Think about how to compare them to obtain the accuracy.

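# TODO: Calculate accuracy on test data
tes_out = softmax(np.dot(test_data, weights))

# Predicted and true class index for each test sample
predictions = np.argmax(tes_out, axis=1)
true_classes = np.argmax(test_label.values, axis=1)

# Fraction of test samples whose predicted class matches the label
accuracy = np.mean(predictions == true_classes)

print("Prediction accuracy: {:.3f}".format(accuracy))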

Prediction accuracy: 0.933

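Task 8: Use your neural network to predict what an image shows

There are two pictures under the "images/" path. We have already extracted their 4 feature values and stored them in "validations.csv".

Now it is your turn: see whether your neural network can predict them correctly!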

# TODO: Open the 'validations.csv' file and predict the label. 
# Remember, 0 = people, 1 = cat, 2 = dog
valid=pd.read_csv('./images/validations.csv')

valid['feature2'] = (valid['feature2']-dataset['feature2'].min())/(dataset['feature2'].max()-dataset['feature2'].min())
valid['feature4'] = (valid['feature4']-dataset['feature4'].min())/(dataset['feature4'].max()-dataset['feature4'].min())
print(valid)

valid_out= softmax(np.dot(valid, weights))

valid_predictions = pd.get_dummies(np.argmax((valid_out),axis=1))

print (valid_out)
print (valid_predictions)

feature1 feature2 feature3 feature4
0 6.2 0.583333 5.4 0.916667
1 5.9 0.416667 5.1 0.708333
[[ 3.65878501e-06 1.73262510e-02 6.54050560e-01]
[ 6.00144459e-06 2.25746156e-02 3.06038913e-01]]

	2
0	1
1	1

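The first image is predicted as class 2 (dog), and the second is also class 2 (dog). Because both rows have the same argmax, get_dummies produces a single column named 2. The rows of valid_out do not each sum to 1 because softmax here normalizes over the whole 2x3 array rather than per row; the per-row argmax is unaffected.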
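To turn class indices into readable names, a small lookup is enough. The sketch below uses the 0 = person, 1 = cat, 2 = dog mapping from the notebook and the two score rows printed above; the names `CLASS_NAMES` and `decode` are assumptions.

```python
import numpy as np

# Mapping taken from the notebook: 0 = person, 1 = cat, 2 = dog.
CLASS_NAMES = {0: "person", 1: "cat", 2: "dog"}

def decode(scores):
    """Map each row of class scores to the name of its highest-scoring class."""
    indices = np.argmax(np.asarray(scores), axis=1)
    return [CLASS_NAMES[int(i)] for i in indices]

# The two score rows printed above.
scores = [[3.65878501e-06, 1.73262510e-02, 6.54050560e-01],
          [6.00144459e-06, 2.25746156e-02, 3.06038913e-01]]
names = decode(scores)
print(names)
```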

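Task 9 (optional): Extending the neural network classifier

After the training above, we now have a network that can guess three kinds of subjects!

If you want your neural network to recognize more kinds of objects, we need to provide more labeled data for it to learn from.

We also need to teach the network what the features are (we have already done this part for you :)). When we make the network deeper, a multi-layer neural network can itself extract features from images! In the full course, we will get to implementing deep networks.

Here, let's borrow a network that has already been trained to recognize 1000 kinds of objects to do the "you snap, I guess" trick. You can upload any photo to the "images" folder, and the network will use its learned weights to guess what is in your photo. Give it a try!

To upload: click the Jupyter icon at the top left to return to the parent directory, go into the '/images' folder, and upload the picture you want to classify.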

from ResNet_CAM import *
import glob

lists = glob.glob('images/*.jpg')

# TODO: Upload your image or pick any image in the folder 'images/' (the glob above matches only .jpg files)
for img_path in lists:
    fig, (ax1, ax2) = plt.subplots(1,2)
    CAM = plot_CAM(img_path,ax1,ax2,fig)
    plt.show()

I uploaded a corgi and a poodle, and both were recognized. Amazing!

Neural Network Learning_Udacity