Bartender_Jill

[CUDA by hand] Building a convolutional neural network (LeNet) from scratch in C++ CUDA, and understanding the algorithms behind each network layer

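Contents

Preface
1. Environment
2. Approach
- 2.1. Define the LeNet model structure and train it for 20 epochs
- 2.2 Export the training result as txt (weights, biases and other parameters of each layer)
- 2.3 (Optional) Export the training result as pth to make later debugging easier
- 2.4 What the C++ CUDA code has to do
3. C++ CUDA implementation
- 3.1 Create a .cu file and fill in the skeleton
- 3.2 Implement each network layer in C++
  - 3.0 The core idea of CUDA programming
  - 3.1 Convolution layer Conv1
  - 3.2 Activation function ReLu1
  - 3.2 Pooling layer MaxPool1
  - 3.3 Convolution layer Conv2
  - 3.4 Activation function ReLu2
  - 3.5 Pooling layer MaxPool2
  - 3.6 Fully connected layer fc1
  - 3.7 Activation function ReLu3
  - 3.8 Fully connected layer fc2
  - 3.9 Activation function ReLu4
  - 3.10 Fully connected layer fc3
  - 3.11 Output result
  - 3.12 Further improvements
4. Source code
- 4.1 Final CUDA source
Summary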

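Preface

I recently started learning CUDA and wanted to write a small neural network for practice. Since there is little material online, I am recording the process and what I learned along the way.

This article shows how to build LeNet inference code from scratch in C++ CUDA, using the MNIST dataset as the example (the code below actually loads FashionMNIST through torchvision, which has the same 28x28 grayscale images and the same file format). Note that this tutorial covers inference only; training is done with existing Python code.

Training in C++ involves backpropagation and related algorithms, which would take much longer to explain in a blog post, so I will write that up separately when I have time.

"From scratch" does not mean "from zero background": you should know basic Python, the basics of neural networks, and a little CUDA.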

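1. Environment

Training code: python, pytorch, numpy. (Any version that can train the model will do; the requirements are low.)

Inference code: C++ and a matching version of CUDA. (If you have the Visual Studio compiler, tick the VS integration package when installing CUDA, so that you can create CUDA projects directly in VS.)

Tick the option marked with the red box (Visual Studio Integration).

If you already have a CUDA environment but did not tick Visual Studio Integration earlier, you can refer to this article. If the configuration feels like too much trouble, you can also uninstall CUDA and install it again.

This article does not go into the installation itself; follow an online tutorial that matches your versions.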

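2. Approach

To implement LeNet inference (that is, forward propagation) in C++ CUDA, we first need to know the LeNet architecture. The LeNet training code used in this article is as follows: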

#Train_LeNet.py

'''
Package                  Version
------------------------ ----------
certifi                  2023.7.22
charset-normalizer       3.2.0
cmake                    3.27.4.1
filelock                 3.12.4
idna                     3.4
Jinja2                   3.1.2
lit                      16.0.6
MarkupSafe               2.1.3
mpmath                   1.3.0
networkx                 3.1
numpy                    1.26.0
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
Pillow                   10.0.1
pip                      23.2.1
requests                 2.31.0
setuptools               68.0.0
sympy                    1.12
torch                    2.0.1
torchaudio               2.0.2
torchvision              0.15.2
triton                   2.0.0
typing_extensions        4.7.1
urllib3                  2.0.4
wheel                    0.38.4
'''

import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.optim as optim
import numpy as np
import torch.nn.functional as F
import os


# Define the LeNet model
class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


script_dir = os.path.dirname(__file__)  # directory containing this script

# Data preprocessing
transform = transforms.Compose([transforms.ToTensor()])

# Load the dataset
trainset = torchvision.datasets.FashionMNIST(os.path.join(script_dir, '../../data'), download=True, train=True,
                                             transform=transform)
testset = torchvision.datasets.FashionMNIST(os.path.join(script_dir, '../../data'), download=True, train=False,
                                            transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)

# Create the model
model = LeNet()
model = model.to('cuda')

# Define the loss function and the optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.002, momentum=0.9)

# Train the model
for epoch in range(20):
    print('epoch ', epoch)
    for inputs, labels in trainloader:
        inputs, labels = inputs.to('cuda'), labels.to('cuda')

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

# Measure the model's accuracy on the test set
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        images, labels = images.to('cuda'), labels.to('cuda')
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(correct / total)

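# Export the model parameters as txt. You could also define your own file format; this is simply the easiest approach.
for name, param in model.named_parameters():
    np.savetxt(os.path.join(script_dir, f'./{name}.txt'), param.detach().cpu().numpy().flatten())

# Save the whole model so that the Python inference script can load it for debugging
os.makedirs("./model", exist_ok=True)  # torch.save fails if the directory does not exist
torch.save(model, "./model/modeltrain.pth")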

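In short, the training code does three things:

2.1. Define the LeNet model structure and train it for 20 epochs

As the code shows, the LeNet model is built from four kinds of layers: Conv2d (convolution), MaxPool2d (max pooling), Linear (fully connected) and ReLu (activation):

2.2 Export the training result as txt (weights, biases and other parameters of each layer)

Each layer's parameters are exported as txt so that the C++ code can read them easily. If you stored the model as pth/ckpt or a similar format, reading it from C++ would be more troublesome.

The exported txt files look like this:

These txt files are the result of training LeNet. If you do not want to train it yourself, you can download them directly (extraction code: 4DEF).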
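Because `param.flatten()` writes each tensor in row-major order, a Conv2d weight of shape [out_channels, in_channels, kH, kW] ends up in the txt file in exactly that nesting order, one value per line. The sketch below shows the index arithmetic the C++ side can rely on; `flat_weight_index` is a hypothetical helper name introduced for illustration and does not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: position of weight[o][c][r][col] in the flattened txt file,
// assuming a row-major [out_ch, in_ch, k, k] tensor with square k x k kernels.
inline std::size_t flat_weight_index(std::size_t o, std::size_t c,
                                     std::size_t r, std::size_t col,
                                     std::size_t in_ch, std::size_t k) {
    return ((o * in_ch + c) * k + r) * k + col;
}
```

For conv1 (1 input channel, 5x5 kernels) the last weight of the last kernel sits at index 149; for conv2 (6 input channels) it sits at index 2399.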

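2.3 (Optional) Export the training result as pth to make later debugging easier

We have already exported the trained model as txt, so why export it again as pth?

Parallel code is awkward to debug. Setting breakpoints and inspecting variables in CUDA code with tools such as cuda-gdb works, but it is cumbersome for beginners.

On top of that, in a multi-layer network like LeNet one wrong step makes every later step wrong, which makes debugging very tricky. So how do we know whether our CUDA code is correct?

This article therefore offers a simple layer-by-layer debugging method:
Besides implementing LeNet inference in C++ CUDA, we also implement it once in Python with PyTorch.
LeNet inference in Python is very simple: changing a few lines of the training code is enough, so there is little room for error. We can therefore treat the PyTorch result as the reference answer, use PyTorch's hook mechanism to print the output of every layer, and compare it layer by layer with the output of our C++ CUDA code to find out whether, and where, the CUDA implementation goes wrong.

The Python inference code for LeNet is provided here as well: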

#Inference_LeNet.py
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.optim as optim
import numpy as np
import torch.nn.functional as F
import os
import struct

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
script_dir = os.path.dirname(__file__)  # directory containing this script

# Data preprocessing
transform = transforms.Compose([transforms.ToTensor()])

# Load the dataset
trainset = torchvision.datasets.FashionMNIST(os.path.join(script_dir, './data'), download=False, train=True, transform=transform)
testset = torchvision.datasets.FashionMNIST(os.path.join(script_dir, './data'), download=False, train=False, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=1, shuffle=False)

# Forward hooks: each one prints the output of the layer it is attached to
def conv1_hook1(model, input, output):
    print("conv1 ", output[0, 0, :, :])  # conv1 output, channel index 0
    print("conv1 ", output[0, 5, :, :])  # conv1 output, channel index 5 (the sixth channel)
    print("relu: ", F.relu(output[0, 0, :, :]))

def conv2_hook1(model, input, output):
    print("relu2: ", F.relu(output[0, 0, :, :]))
    print("conv2: ", output[0, 0, :, :])

def maxpool_hook1(model, input, output):
    try:
        print("max pool ", output[0, 0, :, :])
    except Exception:
        return

def fc1_hook1(model, input, output):
    print("fc1 ", output)
    print("fc1 relu ", F.relu(output))

# Loads the fully pickled model saved by the training script.
# On PyTorch 2.6 and later, pass weights_only=False to load a whole pickled model.
model = torch.load("./model/modeltrain.pth")
model.eval()
model = model.to('cuda')

# Uncomment the hook for whichever layer you want to inspect.
# ReLu is applied through F.relu inside forward(), so there is no model.relu module to hook;
# the hooks above print F.relu(output) instead.
model.conv1.register_forward_hook(conv1_hook1)  # print the conv1 output
#model.pool.register_forward_hook(maxpool_hook1)
#model.conv2.register_forward_hook(conv2_hook1)
#model.fc1.register_forward_hook(fc1_hook1)

correct = 0
with torch.no_grad():
    for image, label in testloader:
        output = model(image.to('cuda'))
        pred = output.argmax(dim=1).item()  # index of the largest output value
        if pred == label.item():
            correct += 1
# Accuracy
print(correct / len(testset))

When debugging, uncomment the registration line of the relevant hook to print the output of that layer.
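On the C++ side, the layer-by-layer comparison amounts to copying a device buffer back to the host and checking it against the values PyTorch printed. The sketch below is one possible way to do the check, not part of the original code: `max_abs_diff` and `layers_match` are hypothetical helper names, and the default tolerance of 1e-4 is an assumption chosen arbitrarily to absorb float rounding differences.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute element-wise difference between two equally sized buffers.
inline float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// True when every element agrees within tol (assumed tolerance, adjust as needed).
inline bool layers_match(const std::vector<float>& cuda_out,
                         const std::vector<float>& reference,
                         float tol = 1e-4f) {
    return max_abs_diff(cuda_out, reference) <= tol;
}
```

Checking one layer at a time, starting from Conv1, points at the first layer whose output diverges, which is usually where the bug is.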

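2.4 What the C++ CUDA code has to do

Image data can be treated as a matrix, so the per-pixel operations of a neural network, such as convolution and pooling, are well suited to parallel execution: CUDA can compute the convolution at all pixels in parallel instead of finishing one pixel before moving on to the next.

What we have to do is implement these four kinds of layers in C++ CUDA, allocate an array for each layer to hold its parameters from the txt files, and move those parameters from Host memory to Device memory (from the CPU side to the GPU side). We then write CUDA functions that run on the Device and read the parameters in Device memory in parallel to perform convolution and the other operations. This is what speeds up inference compared with a serial CPU implementation.

3. C++ CUDA implementation

3.1 Create a .cu file and fill in the skeleton

First create a .cu file. I created a CUDA project directly in VS2022.

Then add the functions the file needs: read the MNIST images, read the MNIST labels, read the exported txt parameter files, and run inference image by image (the part we will implement).

To get the dataset, the simplest way is to run the LeNet training code above, which downloads it automatically (download=True), or download it yourself and put it in the matching folder ("/…/…/data/FashionMNIST/raw/t10k-images-idx3-ubyte" and "/…/…/data/FashionMNIST/raw/t10k-labels-idx1-ubyte").

For convenience, I have written the basic skeleton and the helper functions in advance: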

//Inference_LeNet.cu
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#ifndef __CUDACC__ 
#define __CUDACC__
#endif
//#include 

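//Macro wbCheck: checks the return code of a CUDA runtime call (allocation, copy, ...) and returns -1 on failure, so the check does not have to be written out every time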
#define wbCheck(stmt)  do {                                                    \
        cudaError_t err = stmt;                                               \
        if (err != cudaSuccess) {                                             \
            printf( "\n\nFailed to run stmt %d ", __LINE__);                       \
            printf( "Got CUDA error ...  %s \n\n", cudaGetErrorString(err));    \
            return -1;                                                        \
        }                                                                     \
    } while(0)

// Read the MNIST-format image file. Download the dataset yourself, or run the LeNet Python training script above to download it automatically
std::vector<std::vector<float>> read_mnist_images(const std::string & path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        std::cout << "Cannot open file!" << std::endl;
        return {};
    }

    int magic_number = 0, num_images = 0, num_rows = 0, num_cols = 0;
    file.read((char*)&magic_number, sizeof(magic_number));
    file.read((char*)&num_images, sizeof(num_images));
    file.read((char*)&num_rows, sizeof(num_rows));
    file.read((char*)&num_cols, sizeof(num_cols));

    // Reverse Integers (MNIST data is in big endian format)
    magic_number = ((magic_number & 0xff000000) >> 24) | ((magic_number & 0x00ff0000) >> 8) |
        ((magic_number & 0x0000ff00) << 8) | ((magic_number & 0x000000ff) << 24);
    num_images = ((num_images & 0xff000000) >> 24) | ((num_images & 0x00ff0000) >> 8) |
        ((num_images & 0x0000ff00) << 8) | ((num_images & 0x000000ff) << 24);
    num_rows = ((num_rows & 0xff000000) >> 24) | ((num_rows & 0x00ff0000) >> 8) |
        ((num_rows & 0x0000ff00) << 8) | ((num_rows & 0x000000ff) << 24);
    num_cols = ((num_cols & 0xff000000) >> 24) | ((num_cols & 0x00ff0000) >> 8) |
        ((num_cols & 0x0000ff00) << 8) | ((num_cols & 0x000000ff) << 24);

    int image_size = num_rows * num_cols;
    std::vector<std::vector<float>> images(num_images, std::vector<float>(image_size));

    for (int i = 0; i < num_images; ++i) {
        for (int j = 0; j < image_size; ++j) {
            unsigned char pixel = 0;
            file.read((char*)&pixel, sizeof(pixel));
            images[i][j] = static_cast<float>(pixel) / 255.0f;
        }
    }

    return images;
}
// loading MNIST Labels
std::vector<int> read_mnist_labels(const std::string & path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        std::cout << "Cannot open file!" << std::endl;
        return {};
    }

    int magic_number = 0, num_items = 0;
    file.read((char*)&magic_number, sizeof(magic_number));
    file.read((char*)&num_items, sizeof(num_items));

    // Reverse Integers (MNIST data is in big endian format)
    magic_number = ((magic_number & 0xff000000) >> 24) | ((magic_number & 0x00ff0000) >> 8) |
        ((magic_number & 0x0000ff00) << 8) | ((magic_number & 0x000000ff) << 24);
    num_items = ((num_items & 0xff000000) >> 24) | ((num_items & 0x00ff0000) >> 8) |
        ((num_items & 0x0000ff00) << 8) | ((num_items & 0x000000ff) << 24);

    std::vector<int> labels(num_items);
    for (int i = 0; i < num_items; ++i) {
        unsigned char label = 0;
        file.read((char*)&label, sizeof(label));
        labels[i] = static_cast<int>(label);
    }

    return labels;
}
// Read parameters from a txt file
std::vector<float> read_param(const std::string & path) {
    std::ifstream file(path);
    std::vector<float> params;
    float param;
    while (file >> param) {
        params.push_back(param);
    }
    return params;
}


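int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cout << "Usage: " << argv[0] << " <dir containing the exported txt files>" << std::endl;
        return -1;
    }
    std::string dir = argv[1];  //dir from args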
    // cout << dir;
    //printf("%s", dir.c_str());
    auto images = read_mnist_images(dir + "/../../data/FashionMNIST/raw/t10k-images-idx3-ubyte");   //input height = input width = 28
    // loading label
    auto labels = read_mnist_labels(dir + "/../../data/FashionMNIST/raw/t10k-labels-idx1-ubyte");
    // loading param from .txt
    auto conv1_weight = read_param(dir + "/conv1.weight.txt");
    auto conv1_bias = read_param(dir + "/conv1.bias.txt");
    auto conv2_weight = read_param(dir + "/conv2.weight.txt");
    auto conv2_bias = read_param(dir + "/conv2.bias.txt");
    auto fc1_weight = read_param(dir + "/fc1.weight.txt");
    auto fc1_bias = read_param(dir + "/fc1.bias.txt");
    auto fc2_weight = read_param(dir + "/fc2.weight.txt");
    auto fc2_bias = read_param(dir + "/fc2.bias.txt");
    auto fc3_weight = read_param(dir + "/fc3.weight.txt");
    auto fc3_bias = read_param(dir + "/fc3.bias.txt");
 


    int correct_nums = 0, predict_label;// images.size()
    int index = 0,k=0;
   
    auto start = std::chrono::high_resolution_clock::now();
    
    for (int t = 0; t < images.size(); t++) {
		//TODO: run inference on each image here

    }

    // Wait until all GPU work has finished
    cudaDeviceSynchronize();

    // calculate time
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff = end - start;

    // print result
    std::cout << std::fixed << std::setprecision(4) << diff.count() << ":" << float(correct_nums) / float(images.size()) << std::endl;


    return 0;
}
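The two reader functions above repeat the same byte-swap expression six times. As a sketch of how it could be factored out, the helper below reverses the byte order of a 32-bit header field; `swap_endian32` is a hypothetical name, and it uses an unsigned type so that the shifts are well defined for every input.

```cpp
#include <cassert>
#include <cstdint>

// Reverse the byte order of a 32-bit value (big endian <-> little endian).
inline std::uint32_t swap_endian32(std::uint32_t v) {
    return ((v & 0xff000000u) >> 24) | ((v & 0x00ff0000u) >> 8) |
           ((v & 0x0000ff00u) << 8)  | ((v & 0x000000ffu) << 24);
}
```

On a little-endian machine, the image-file magic number 2051 (0x00000803) is read raw as 0x03080000, and one call to the helper restores it.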

3.2 Implement each network layer in C++

With the skeleton and helper functions in place, we can concentrate on implementing each layer in C++ CUDA.

The forward pass is: input image -> Conv1 -> ReLu -> MaxPool -> Conv2 -> ReLu -> MaxPool -> fc1 -> ReLu -> fc2 -> ReLu -> fc3 -> inference result.

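3.0 The core idea of CUDA programming

Suppose we want to add 1 to every pixel of six 24x24 images. Doing this serially is clearly slow.

Instead, we can launch a CUDA function with 6 parallel blocks of 24x24 threads each. The 6x24x24 input pixels are spread over these 6x24x24 threads, which process them in parallel (read the pixel value and add 1) and write their results into one shared output buffer, giving the final result.

//Pseudocode

dim3 blocksperGrid(6);	//6 parallel blocks
dim3 threadsperBlock(24, 24); //24x24 threads in each block

kernel_function << < blocksperGrid, threadsperBlock >> > (balabalabala);

For efficiency, CUDA code usually stores a 2D matrix as a 1D array (look up matrix flattening/vectorization). For example, six 24x24 input images are not stored as six [24][24] 2D matrices but as a single 1D array of 6x24x24=3456 elements.

So how do we get the pixel at a given 2D position out of the 1D array? A little address arithmetic is enough.

Each block processes one image and has its own ID, blockIdx.x, so the image of a block occupies indices blockIdx.x * 24 * 24 through (blockIdx.x + 1) * 24 * 24 - 1. In this example blockIdx.x ranges over 0~5.

Each block can be seen as a 24x24 2D matrix whose elements are threads. Every thread has a 2D ID, (threadIdx.x, threadIdx.y), and blockIdx.x * 24 * 24 + threadIdx.x * 24 + threadIdx.y gives the index of the pixel it is responsible for. In this example threadIdx.x and threadIdx.y range over 0~23.

So each thread only has to do the following: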

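//Minimal runnable version of the idea above
#include <cstdio>
#include <vector>
#include "cuda_runtime.h"

//Every thread runs this kernel once
__global__ void AddOne(const float* input_image, float* output_image)
{
	//Flat index of the pixel this thread is responsible for
	int pixel_index = blockIdx.x * 24 * 24 + threadIdx.x * 24 + threadIdx.y;

	//Output value = input pixel value + 1. The output is also a flat 6x24x24 array,
	//so the same index is used for the write
	output_image[pixel_index] = input_image[pixel_index] + 1.0f;
}

int main()
{
	const int n = 6 * 24 * 24;
	std::vector<float> host(n, 0.0f);
	float* device_in;
	float* device_out;
	cudaMalloc((void**)&device_in, n * sizeof(float));
	cudaMalloc((void**)&device_out, n * sizeof(float));
	cudaMemcpy(device_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

	dim3 blocksperGrid(6);	//6 parallel blocks, one per image
	dim3 threadsperBlock(24, 24); //24x24 threads in each block, one per pixel

	AddOne << < blocksperGrid, threadsperBlock >> > (device_in, device_out);

	cudaMemcpy(host.data(), device_out, n * sizeof(float), cudaMemcpyDeviceToHost);
	printf("%f\n", host[0]);	//every element started at 0 and has had 1 added
	cudaFree(device_in);
	cudaFree(device_out);
	return 0;
}

//With this, every parallel thread can use its own indices to find the pixel it is responsible for.

Every thread calls the same kernel function, but because its IDs differ, it reads and processes a different pixel. One function serves many data elements in parallel, and this is the core idea of CUDA programming.

Put plainly, writing a CUDA kernel is a matter of finding index mappings. Once you have worked out which element each thread is responsible for, the code is easy to write; the principle is not complicated.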
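Since a kernel is only as correct as its index mapping, it can help to test the mapping on the host before launching anything. The sketch below replays the arithmetic blockIdx.x * 24 * 24 + threadIdx.x * 24 + threadIdx.y for every (block, thread) combination and checks that each element of the flat array is hit exactly once. The function names are hypothetical and introduced only for this illustration.

```cpp
#include <cassert>
#include <vector>

// Same arithmetic as in the kernel: block b, thread (tx, ty) -> flat index.
inline int flat_pixel_index(int block, int tx, int ty, int height) {
    return block * height * height + tx * height + ty;
}

// Counts how many (block, tx, ty) triples map to each flat index.
inline std::vector<int> count_hits(int blocks, int height) {
    std::vector<int> hits(blocks * height * height, 0);
    for (int b = 0; b < blocks; ++b)
        for (int tx = 0; tx < height; ++tx)
            for (int ty = 0; ty < height; ++ty)
                ++hits[flat_pixel_index(b, tx, ty, height)];
    return hits;
}

// True when every element of the flat array is touched exactly once.
inline bool covers_each_once(int blocks, int height) {
    for (int h : count_hits(blocks, height))
        if (h != 1) return false;
    return true;
}
```

A mapping that skips or double-counts elements (for example, multiplying by the wrong width) makes `covers_each_once` return false, which is much easier to diagnose than a wrong inference result.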

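3.1 Convolution layer Conv1

From the LeNet model definition, after reading one image from MNIST we feed it to the first convolution layer, nn.Conv2d(1, 6, 5).

Conv1	Value
Input channels	1
Output channels	6
Kernel size	5*5
Number of kernels	6 (one kernel per output channel)
Number of weights	150 (6 kernels, each a 5*5 matrix, each matrix element is one weight)
Number of biases	6 (one bias per kernel)

By the basic definition of a convolution layer, the computation proceeds as follows:

The matrices are flattened to 1D for ease of processing; to fetch a pixel from the 1D array, apply the index conversion described above.

A single MNIST image is 28x28. According to the convolution formula in the official PyTorch documentation, a 5x5 kernel with no padding gives a 24x24 output image: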
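The output-size formula from the PyTorch documentation for Conv2d (and, in the same form, MaxPool2d) is floor((in + 2*padding - dilation*(kernel - 1) - 1) / stride + 1). A small sketch of it in C++ follows; `conv_output_size` is a hypothetical helper name, and the integer division matches the floor only because the numerator is non-negative for the sizes used here.

```cpp
#include <cassert>

// Output height (or width) of a convolution or pooling window, following the
// formula in the PyTorch documentation. Assumes a non-negative numerator.
inline int conv_output_size(int in, int kernel, int stride = 1,
                            int padding = 0, int dilation = 1) {
    return (in + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1;
}
```

With the defaults (stride 1, no padding, no dilation) this reduces to in - kernel + 1, which is the expression used for output_height in the code of this article.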

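In other words, without CUDA the C++ code would perform the 5x5 convolution at these 24x24 positions one after another, which is slow.
With CUDA we launch 24x24 threads that run in parallel, each doing its own 5x5 convolution, so the whole image is convolved in roughly the time of a single 5x5 convolution.

There are 6 output channels, so we launch 6x24x24 parallel threads (6 blocks with 24x24 threads each).

In addition:
Each kernel has 25 weights, so 6 kernels have 150 weights;
Each kernel has one bias of its own, so 6 kernels have 6 biases.

The parameter files we exported earlier for the first convolution layer, conv1.weight.txt and conv1.bias.txt, contain exactly 150 weights and 6 biases.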
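The same counting works for every layer, and it gives a cheap sanity check right after the txt files are loaded. The sketch below uses two hypothetical helpers (not in the original code) that compute the expected number of weights from the layer shapes in the LeNet definition.

```cpp
#include <cassert>
#include <cstddef>

// Number of weights in a convolution layer with square k x k kernels:
// each of the out_ch kernels spans all in_ch input channels.
inline std::size_t conv_weight_count(std::size_t in_ch, std::size_t out_ch, std::size_t k) {
    return out_ch * in_ch * k * k;
}

// Number of weights in a fully connected layer.
inline std::size_t linear_weight_count(std::size_t in_features, std::size_t out_features) {
    return in_features * out_features;
}
```

For example, comparing `conv1_weight.size()` with `conv_weight_count(1, 6, 5)` immediately after `read_param` catches a wrong path or a truncated file before any kernel runs.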

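That makes things easy! What we need to do is:
1. Read the 150 trained weights and assign them to the 6 kernels.
2. Slide each of the 6 kernels over the input image; the 24x24 valid positions give a 24x24 output per kernel, 6 channels in total.
3. Read the 6 trained biases and add each one to every one of the 24x24 values of its own channel. The sums are the output of Conv1.

So we first define the following (directly inside the for loop, to keep things easy to follow):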

    for (int t = 0; t < images.size(); t++) {
		//TODO: run inference on each image here
		
		//Conv1
	    int input_height = 28;	//a single MNIST image is 28x28 (height = width = 28)
	    int kernel_height = 5;
	    int output_height = input_height - kernel_height + 1; 	//output height from the convolution formula
	    int input_channels = 1, output_channels = 6, kernel_channel = 1, kernel_nums = 6;
	
	    float* device_InputImage;	//device (GPU) buffer for the input image
	    float* device_OutputImage;	//device buffer for the output image
	    float* device_kernel_weight;	//device buffer for the 150 kernel weights of the convolution layer
	    float* device_kernel_bias;		//device buffer for the 6 kernel biases
		
		//Allocate device memory with cudaMalloc; wbCheck verifies that each call succeeded.
		//Note: allocating inside the loop without a matching cudaFree leaks device memory on every image; it is kept here only for readability.
	    wbCheck(cudaMalloc((void**)&device_InputImage, input_height * input_height * input_channels * sizeof(float)));//28x28x1
	    wbCheck(cudaMalloc((void**)&device_OutputImage, output_height * output_height * output_channels * sizeof(float)));//24x24x6
	    wbCheck(cudaMalloc((void**)&device_kernel_weight, kernel_height * kernel_height * kernel_channel * kernel_nums * sizeof(float)));//5x5x6=150
	    wbCheck(cudaMalloc((void**)&device_kernel_bias, kernel_nums * sizeof(float)));	//6

		//The input buffer must be allocated before it is copied into: load image t of the dataset
		wbCheck(cudaMemcpy(device_InputImage, &images[t][0], images[t].size() * sizeof(float), cudaMemcpyHostToDevice));

		//Copy the weights and biases read from the txt files into the allocated device buffers
	    wbCheck(cudaMemcpy(device_kernel_weight, &conv1_weight[0], kernel_height * kernel_height * kernel_nums * sizeof(float), cudaMemcpyHostToDevice)); 
	    wbCheck(cudaMemcpy(device_kernel_bias, &conv1_bias[0], kernel_nums * sizeof(float), cudaMemcpyHostToDevice));
	    
		//Then choose the launch configuration
	    dim3 threadsperBlock(output_height, output_height); //(24,24)
	    dim3 blocksperGrid(6);		//6 output channels: 6 blocks of 24x24 threads each, 6x24x24 threads in total
    	
    	//Launch the kernel Convolution1, which runs on 6x(24x24) threads
    	Convolution1 << < blocksperGrid, threadsperBlock >> > (device_InputImage, device_OutputImage, device_kernel_weight, device_kernel_bias, input_height, output_height, kernel_height);
    }

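For efficiency, the 6x5x5 kernels are not stored as six 5x5 2D matrices but as one 1D array of 6x5x5 elements. Each block uses its block ID to find the kernel parameters of its own channel: indices blockIdx.x * 5 * 5 through blockIdx.x * 5 * 5 + 24.

When storing results into the 1D 6x24x24 output image, each thread computes the target index from its block ID and thread ID: blockIdx.x * 24 * 24 + threadIdx.x * 24 + threadIdx.y

blockIdx.x * 24 * 24 is the number of values occupied by the 24x24 images of the preceding channels; the data of this channel comes right after them.
threadIdx.x * 24 + threadIdx.y is the index of this thread's pixel within the channel. Their sum is the actual position in the 6x24x24 array.

To simplify this arithmetic, I have wrapped the address conversion in the OFFSET() macro. The final kernel for the convolution layer is: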

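// Given a row and a column index in a 2D matrix with ld columns, compute the index of that element in the flattened 1D array.
#define OFFSET(row, col, ld) ((row) * (ld) + (col))	

//__global__ marks a kernel: a function that runs on the GPU and is launched from host code
__global__ void Convolution1(float* input_image, float* output_image, float* kernel_weights, float* kernel_bias, int input_height, int output_height, int kernel_height)
{
    
    int input_image_index;	//index of the input pixel being read
    int kernel_index;		//index of the kernel weight being read
    float value = 0;
    //The launch uses exactly (24,24) threads per block, so no thread is out of range and this check is optional. With (32,32) threads it would be required
    if (threadIdx.y < output_height && threadIdx.x < output_height)
    {
    	//Convolution: blockIdx and threadIdx are only used for address arithmetic, it just looks complicated
        for (int i = 0; i < kernel_height; i++)       
            for (int j = 0; j < kernel_height; j++) { 
                input_image_index = OFFSET(threadIdx.x+i, threadIdx.y+j, input_height);
                kernel_index = blockIdx.x * kernel_height * kernel_height + OFFSET(i, j, kernel_height);
                value += input_image[input_image_index] * kernel_weights[kernel_index];
            }
        //Store the result (plus the bias of this channel) at the matching position of the output image
        output_image[blockIdx.x * output_height* output_height + threadIdx.x * output_height + threadIdx.y] = value + kernel_bias[blockIdx.x];
		
		//No __syncthreads() is needed here: each thread writes only its own element,
		//and a barrier inside a branch is unsafe if some threads of the block skip it
    }
}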

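Every thread runs the same Convolution1() function, but because thread IDs and block IDs differ, each thread works on a different pixel. One function is reused by all 6x24x24 threads, each convolving the position it is responsible for, and the results are gathered in output_image.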
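To check the kernel against something simpler than PyTorch, a serial CPU version of the same computation can be handy. The sketch below is a hypothetical reference implementation, not the author's code: it assumes stride 1, no padding, square images and kernels, and the same row-major layouts the kernels use ([channel][row][col] for images, [out][in][row][col] for weights).

```cpp
#include <cassert>
#include <vector>

// CPU reference for a stride-1, no-padding convolution (cross-correlation, as in PyTorch).
// input:   [in_ch][in_h][in_h]    flattened row-major
// weights: [out_ch][in_ch][k][k]  flattened row-major
// returns: [out_ch][out_h][out_h] with out_h = in_h - k + 1
inline std::vector<float> conv2d_reference(const std::vector<float>& input,
                                           const std::vector<float>& weights,
                                           const std::vector<float>& bias,
                                           int in_ch, int out_ch, int in_h, int k) {
    const int out_h = in_h - k + 1;
    std::vector<float> output(out_ch * out_h * out_h, 0.0f);
    for (int o = 0; o < out_ch; ++o)
        for (int r = 0; r < out_h; ++r)
            for (int c = 0; c < out_h; ++c) {
                float value = bias[o];  // start from the bias of this output channel
                for (int z = 0; z < in_ch; ++z)
                    for (int i = 0; i < k; ++i)
                        for (int j = 0; j < k; ++j)
                            value += input[z * in_h * in_h + (r + i) * in_h + (c + j)] *
                                     weights[((o * in_ch + z) * k + i) * k + j];
                output[o * out_h * out_h + r * out_h + c] = value;
            }
    return output;
}
```

Called with in_ch = 1, out_ch = 6, in_h = 28, k = 5 it mirrors Convolution1. The three outer loops correspond to blockIdx.x, threadIdx.x and threadIdx.y in the kernel, which is exactly the work CUDA spreads across threads.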

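3.2 Activation function ReLu1

The output of Conv1 goes into the ReLu layer, which is comparatively simple to implement.

Principle: set every value of the input that is less than 0 to 0 and leave the rest unchanged.

ReLu does not change the image size, so its input size equals the output size of Conv1. The rest follows the same pattern: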

for (int t = 0; t < images.size(); t++) {
	//TODO: run inference on each image here
	//Conv1
	//.....
	
	//Relu
    int relu_input_height = output_height;  //relu input height = conv1 output height
    int relu_output_height = relu_input_height;
    int relu_input_channels = output_channels; //relu input channels = conv1 output channels
    float* device_relu_Output_image;
    
    wbCheck(cudaMalloc((void**)&device_relu_Output_image, relu_input_height * relu_input_height * relu_input_channels * sizeof(float)));
    
    ReLu << < blocksperGrid, threadsperBlock >> > (device_OutputImage, device_relu_Output_image, relu_input_height, relu_output_height);
    }

__global__ void ReLu(float* input_image, float* output_image,int input_height,int output_height) {
    if (threadIdx.y < output_height && threadIdx.x < output_height)
    {
        int input_index = blockIdx.x * input_height * input_height + threadIdx.x * input_height + threadIdx.y;
        //Values below 0 become 0, everything else passes through unchanged
        if (input_image[input_index] <= 0)
            output_image[input_index] = 0;
        else
            output_image[input_index] = input_image[input_index];
        //No __syncthreads() is needed: each thread touches only its own element
    }
}

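3.2 Pooling layer MaxPool1

After ReLu, the data goes into nn.MaxPool2d(2, 2).
The first argument 2 of nn.MaxPool2d(2, 2) is the kernel size (2x2); the second is the stride, the number of pixels the window moves at each step.

The layer splits each channel into non-overlapping 2x2 windows and outputs the maximum of the four values in each window:

The official formula gives the relation between input and output image size:

MaxPool1	Value
Input channels	6
Output channels	6
Input image size	24*24
Output image size	12*12
Kernel size	2*2

With the relation between input and output known (24x24 in, 12x12 out), we continue in the same way: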

for (int t = 0; t < images.size(); t++) {
	//TODO: run inference on each image here
	//Conv1
	//ReLu1
	//.....	
	
	//Max Pool1
    int pool1_input_height = relu_output_height;
    int pool1_output_height = 12;
    int stride = 2, pool1_kernel_height = 2;
    int pool1_channels = 6;
    float* device_pool1_Output_image;
    
    wbCheck(cudaMalloc((void**)&device_pool1_Output_image, pool1_output_height * pool1_output_height * pool1_channels * sizeof(float)));
    dim3 pool1_threadsperBlock(12, 12);//threadsperBlock

	MaxPool1 << <blocksperGrid, pool1_threadsperBlock >> > (device_relu_Output_image, device_pool1_Output_image, pool1_input_height, pool1_output_height, pool1_kernel_height, stride, pool1_channels);
    }

//Find the maximum within each pooling window
__global__ void MaxPool1(float* input_image, float* output_image, int input_height, int output_height, int kernel_height, int stride,int channel) {
    int input_image_index;
    if (threadIdx.y < output_height && threadIdx.x < output_height)
    {
        //Start from the first element of the window rather than 0, so the result stays correct even if the input contains negative values
        float value = input_image[blockIdx.x*input_height*input_height + OFFSET(threadIdx.x*stride, threadIdx.y*stride, input_height)];
        for (int i = 0; i < kernel_height; i++)        
            for (int j = 0; j < kernel_height; j++) {  
                input_image_index = blockIdx.x*input_height*input_height+ OFFSET(threadIdx.x*stride + i, threadIdx.y*stride + j, input_height);
                if (input_image[input_image_index] > value)	//keep the larger value
                {
                    value = input_image[input_image_index];
                }
            }
        output_image[blockIdx.x * output_height * output_height + threadIdx.x * output_height + threadIdx.y] = value;
        //No __syncthreads() is needed: each thread writes only its own element
    }
}
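As with the convolution, a serial CPU version is useful for checking the pooling kernel on small inputs. The sketch below is a hypothetical reference, assuming no padding, square inputs and the same [channel][row][col] row-major layout as the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU reference for max pooling without padding.
// input: [channels][in_h][in_h] flattened row-major; returns [channels][out_h][out_h].
inline std::vector<float> maxpool2d_reference(const std::vector<float>& input,
                                              int channels, int in_h, int k, int stride) {
    const int out_h = (in_h - k) / stride + 1;
    std::vector<float> output(channels * out_h * out_h);
    for (int ch = 0; ch < channels; ++ch)
        for (int r = 0; r < out_h; ++r)
            for (int c = 0; c < out_h; ++c) {
                // Start from the first element of the window so negative inputs work too.
                float best = input[ch * in_h * in_h + (r * stride) * in_h + (c * stride)];
                for (int i = 0; i < k; ++i)
                    for (int j = 0; j < k; ++j)
                        best = std::max(best, input[ch * in_h * in_h +
                                                    (r * stride + i) * in_h + (c * stride + j)]);
                output[ch * out_h * out_h + r * out_h + c] = best;
            }
    return output;
}
```

In LeNet the pooling input always comes from ReLu and is therefore non-negative, but seeding the maximum from the window itself keeps the function correct if it is ever reused elsewhere.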

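3.3 Convolution layer Conv2

Same idea as Conv1. What changes is the input and output image size (12x12 in, 8x8 out) and the channel counts (6 in, 16 out). conv2.weight.txt now holds 16x6x5x5 = 2400 weights, because each of the 16 kernels spans all 6 input channels, and conv2.bias.txt holds 16 biases.

This time CUDA launches 16 blocks of 8x8 threads each: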

for (int t = 0; t < images.size(); t++) {
	//TODO: run inference on each image here
	//Conv1
	//ReLu1
	//Max Pool1
	//.....	
	
	//Conv2
    int conv2_input_height = pool1_output_height;//12
    int conv2_kernel_height = 5;
    int conv2_output_height = conv2_input_height - conv2_kernel_height + 1;//8
    int conv2_input_channels = 6, conv2_output_channels = 16, conv2_kernel_channel = 6, conv2_kernel_nums = 16;
    
    float* device_conv2__OutputImage;
    float* device_conv2__kernel_weight;
    float* device_conv2__kernel_bias;

    wbCheck(cudaMalloc((void**)&device_conv2__OutputImage, conv2_output_height * conv2_output_height * conv2_output_channels * sizeof(float)));
    wbCheck(cudaMalloc((void**)&device_conv2__kernel_weight, conv2_kernel_height * conv2_kernel_height * conv2_kernel_channel * conv2_kernel_nums * sizeof(float)));//5*5*6*16 = 2400
    wbCheck(cudaMalloc((void**)&device_conv2__kernel_bias, conv2_kernel_nums * sizeof(float)));//16
    
    //Copy the weights and biases to the device
    wbCheck(cudaMemcpy(device_conv2__kernel_weight, &conv2_weight[0], conv2_kernel_height * conv2_kernel_height * conv2_kernel_channel * conv2_kernel_nums * sizeof(float), cudaMemcpyHostToDevice));
    wbCheck(cudaMemcpy(device_conv2__kernel_bias, &conv2_bias[0], conv2_kernel_nums * sizeof(float), cudaMemcpyHostToDevice));
    
    dim3 conv2_threadsperBlock(conv2_output_height, conv2_output_height); //(8,8)
    dim3 conv2_blocksperGrid(16);
    
    Convolution2 << < conv2_blocksperGrid, conv2_threadsperBlock >> > (device_pool1_Output_image, device_conv2__OutputImage, device_conv2__kernel_weight, device_conv2__kernel_bias
            , conv2_input_height, conv2_output_height, conv2_kernel_height, conv2_input_channels);
    }

__global__ void Convolution2(float* input_image, float* output_image, float* kernel_weights, float* kernel_bias, int input_height, int output_height, int kernel_height,int input_channel)
{
    int input_image_index;
    int kernel_index;

    if (threadIdx.y < output_height && threadIdx.x < output_height)
    {
        int output_index = blockIdx.x * output_height * output_height + threadIdx.x * output_height + threadIdx.y;
        float value = 0;
        
        //Convolution: sum over all input channels and over the 5x5 window
        for (int z = 0; z < input_channel; z++) {
            for (int i = 0; i < kernel_height; i++)        
            {
                for (int j = 0; j < kernel_height; j++) {  
                    input_image_index = z * input_height * input_height + OFFSET(threadIdx.x + i, threadIdx.y + j, input_height);
                    kernel_index = (blockIdx.x) * (input_channel) * kernel_height * kernel_height + z * kernel_height * kernel_height + OFFSET(i, j, kernel_height);
                    
                    value += input_image[input_image_index] * kernel_weights[kernel_index];
                }
            }
        }
        output_image[output_index] = value + kernel_bias[blockIdx.x];
    }
    
}

3.4 Activation function ReLu2

Next comes another ReLu layer. It works like the ReLu above; only the sizes change:

for (int t = 0; t < images.size(); t++) {
	//TODO: run inference on each image here
	//Conv1
	//ReLu1
	//Max Pool1
	//Conv2
	//.....
	
	//ReLu2
	int relu2_input_channels = conv2_output_channels;//16
    int relu2_input_height = conv2_output_height;//8
    int relu2_output_height = relu2_input_height;//8
    float* device_relu2_Output_image;

    wbCheck(cudaMalloc((void**)&device_relu2_Output_image, relu2_output_height * relu2_output_height * relu2_input_channels * sizeof(float)));

    dim3 relu2_threadsperBlock(conv2_output_height, conv2_output_height); //(8,8)
    dim3 relu2_blocksperGrid(16);
    
	ReLu << <relu2_blocksperGrid, relu2_threadsperBlock >> > (device_conv2__OutputImage, device_relu2_Output_image, relu2_input_height, relu2_output_height);
    }

The ReLu kernel itself is unchanged.

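3.5 Pooling layer MaxPool2

MaxPool2	Value
Input channels	16
Output channels	16
Input image size	8*8
Output image size	4*4
Kernel size	2*2

Next comes another pooling layer. It works like the pooling layer above; only the sizes change (8x8 in, 4x4 out):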

for (int t = 0; t < images.size(); t++) {
	//TODO: run inference on each image here
	//Conv1
	//ReLu1
	//Max Pool1
	//Conv2
	//ReLu2
	//.....
	
	//Max pool2
    int pool2_input_height = relu2_output_height;
    int pool2_output_height = 4;//(pool2_input_height - 1)/2+1 ;
    int pool2_stride = 2, pool2_kernel_height = 2;
    int pool2_channels = relu2_input_channels;//16

    float* device_pool2_Output_image;

    wbCheck(cudaMalloc((void**)&device_pool2_Output_image, pool2_output_height* pool2_output_height* pool2_channels * sizeof(float)));
    dim3 pool2_threadsperBlock(pool2_output_height, pool2_output_height);//4
    dim3 pool2_blocksperGrid(16);
	
	MaxPool1 << <pool2_blocksperGrid, pool2_threadsperBlock >> > (device_relu2_Output_image, device_pool2_Output_image, pool2_input_height, pool2_output_height, pool2_kernel_height, pool2_stride, pool2_channels);
    }

The MaxPool kernel itself is unchanged.

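3.6 Fully connected layer fc1

The fully connected layer takes the 16x4x4=256 input values and produces 1x120 output values.

A fully connected layer can be viewed as a special convolution layer whose kernel covers the entire input:

fc1	Value
Input channels	16
Output channels	1
Input image size	4*4
Output image size	1*120
Kernel size	16*4*4 = 256
Number of kernels	120
Number of weights	256*120 = 30720
Number of biases	120 (one bias per kernel)

The 256 elements of the 16x4x4 input are multiplied element-wise with a kernel of size 256 and summed, and a bias is added to the sum; this gives the first of the 120 output elements. Repeating this with each of the 120 kernels gives all 120 outputs.

So we could launch 120 blocks with 256 threads each, where each block handles one kernel and each thread handles one multiplication.

For simplicity, the code here launches only 120 blocks with a single thread each, and that thread performs the whole computation for one kernel: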

for (int t = 0; t < images.size(); t++) {
	//TODO: run inference on each image here
	//Conv1
	//ReLu1
	//Max Pool1
	//Conv2
	//ReLu2
	//Max pool2
	//.....
	
	//fc1
    int fc1_input_channels = pool2_channels;//16
    int fc1_input_height = pool2_output_height; //4
    int fc1_output_height = 120;
    float* device_fc1__kernel_weight;
    float* device_fc1__kernel_bias;
    float* device_fc1_Output_image;

    wbCheck(cudaMalloc((void**)&device_fc1__kernel_weight, fc1_input_height* fc1_input_height* fc1_input_channels* fc1_output_height * sizeof(float)));//4*4*16*120
    wbCheck(cudaMalloc((void**)&device_fc1__kernel_bias, fc1_output_height * sizeof(float)));//120
    wbCheck(cudaMalloc((void**)&device_fc1_Output_image, fc1_output_height * sizeof(float)));//120
    wbCheck(cudaMemcpy(device_fc1__kernel_weight, &fc1_weight[0], fc1_input_height* fc1_input_height* fc1_input_channels* fc1_output_height * sizeof(float), cudaMemcpyHostToDevice));
    wbCheck(cudaMemcpy(device_fc1__kernel_bias, &fc1_bias[0], fc1_output_height * sizeof(float), cudaMemcpyHostToDevice));
    
    dim3 fc1_threadsperBlock(1);//one thread per block
    dim3 fc1_blocksperGrid(fc1_output_height);//(120)

	Fc1_naive << <fc1_blocksperGrid, fc1_threadsperBlock >> > (device_pool2_Output_image, device_fc1_Output_image, device_fc1__kernel_weight, device_fc1__kernel_bias, fc1_input_height, fc1_input_channels);
    }

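__global__ void Fc1_naive(float* input_image, float* output_image, float* fc1_weights, float* fc1_bias, int input_height, int input_channel) {

    //Number of inputs per output element: 16 * 4 * 4 = 256
    int input_size = input_channel * input_height * input_height;
    //Accumulate in a local variable that starts from the bias. Memory returned by cudaMalloc
    //is not zero-initialized, so using += directly on output_image would start from garbage
    float value = fc1_bias[blockIdx.x];
    //One kernel operation: multiply matching elements and sum them up
    for (int i = 0; i < input_size; i++)
    {
        value += input_image[i] * fc1_weights[blockIdx.x * input_size + i];
    }
    output_image[blockIdx.x] = value;
}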
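The fully connected layer is a matrix-vector product plus a bias, so its CPU reference is short. The sketch below is hypothetical code for checking the kernel: it assumes the weights are stored row-major as [out_features][in_features], which is the order produced by flattening an nn.Linear weight.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for y = W x + b, with W flattened row-major as [out_features][in_features].
inline std::vector<float> linear_reference(const std::vector<float>& input,
                                           const std::vector<float>& weights,
                                           const std::vector<float>& bias) {
    const std::size_t in_features = input.size();
    const std::size_t out_features = bias.size();
    assert(weights.size() == in_features * out_features);
    std::vector<float> output(out_features);
    for (std::size_t o = 0; o < out_features; ++o) {
        float value = bias[o];  // start from the bias, then add the dot product
        for (std::size_t i = 0; i < in_features; ++i)
            value += input[i] * weights[o * in_features + i];
        output[o] = value;
    }
    return output;
}
```

The outer loop over `o` is what the kernel distributes over blockIdx.x; the inner loop is the per-block dot product. The same function covers fc1, fc2 and fc3, since only the sizes differ.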

3.7 Activation function ReLu3

for (int t = 0; t < images.size(); t++) {
	//TODO: run inference on each image here
	//Conv1
	//ReLu1
	//Max Pool1
	//Conv2
	//ReLu2
	//Max pool2
	//fc1
	//.....
	
	//relu fc1
	int relu_fc1_input_channels = 1;//1
    int relu_fc1_input_height = fc1_output_height;//120
    int relu_fc1_output_height = relu_fc1_input_height;//120
    float* device_relu_fc1_Output_image;
    wbCheck(cudaMalloc((void**)&device_relu_fc1_Output_image, relu_fc1_output_height * sizeof(float)));
    dim3 relu_fc1_threadsperBlock(1); //one thread per block
    dim3 relu_fc1_blocksperGrid(relu_fc1_output_height);
	
	ReLu_fc1 << <relu_fc1_blocksperGrid, relu_fc1_threadsperBlock >> > (device_fc1_Output_image, device_relu_fc1_Output_image);
    }


__global__ void ReLu_fc1(float* input_image, float* output_image) {
    if (input_image[blockIdx.x] <= 0)
        output_image[blockIdx.x] = 0;
    else
        output_image[blockIdx.x] = input_image[blockIdx.x];
}
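The same element-wise rule can be written on the CPU to compare against the kernel's output. This is a small sketch under assumptions: relu_reference is a hypothetical name, and it keeps separate input and output buffers only to mirror the kernel's signature.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for ReLU (not from the original code).
// Mirrors ReLu_fc1: values <= 0 become 0, positive values pass through.
void relu_reference(const std::vector<float>& input, std::vector<float>& output) {
    assert(output.size() == input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        output[i] = input[i] > 0.0f ? input[i] : 0.0f;
    }
}
```

Because every output element is overwritten, the previous contents of the output buffer do not matter for this layer.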

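3.8 Fully connected layer fc2

Fully connected layer fc2	Value
Input channels	1
Output channels	1
Input size	1*120
Output size	1*84
Kernel size	120
Number of kernels	84
Number of weight parameters	120*84 = 10080
Number of bias parameters	84 (one bias per kernel)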
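These counts can be checked against what was actually loaded from the txt files before anything is copied to the device. The helper below is a sketch under assumptions: fc_param_sizes_match is a hypothetical name that does not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original code): checks that the number of
// values loaded for a fully connected layer matches its dimensions.
bool fc_param_sizes_match(std::size_t weight_count, std::size_t bias_count,
                          std::size_t in_features, std::size_t out_features) {
    assert(in_features > 0 && out_features > 0);
    return weight_count == in_features * out_features &&
           bias_count == out_features;
}
```

For fc2 the call would be fc_param_sizes_match(fc2_weight.size(), fc2_bias.size(), 120, 84). If a parameter file is missing, read_param returns an empty vector, and taking &fc2_weight[0] for cudaMemcpy would then be invalid, so a check like this is worth running first.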

Same as fully connected layer fc1, only the sizes differ:

for (int t = 0; t < images.size(); t++) {
	//TODO: run inference on each image here
	//Conv1
	//ReLu1
	//Max Pool1
	//Conv2
	//ReLu2
	//Max pool2
	//fc1
	//relu fc1
	//.....
	
	//fc2
	int fc2_input_channels = 1;
    int fc2_input_height = relu_fc1_output_height; //120
    int fc2_output_height = 84;
    float* device_fc2__kernel_weight;
    float* device_fc2__kernel_bias;
    float* device_fc2_Output_image;

    wbCheck(cudaMalloc((void**)&device_fc2__kernel_weight, fc2_input_height* fc2_input_channels* fc2_output_height * sizeof(float)));//120*84
    wbCheck(cudaMalloc((void**)&device_fc2__kernel_bias, fc2_output_height * sizeof(float)));//84
    wbCheck(cudaMalloc((void**)&device_fc2_Output_image, fc2_output_height * sizeof(float)));//84
    wbCheck(cudaMemcpy(device_fc2__kernel_weight, &fc2_weight[0], fc2_input_height* fc2_input_channels* fc2_output_height * sizeof(float), cudaMemcpyHostToDevice));
    wbCheck(cudaMemcpy(device_fc2__kernel_bias, &fc2_bias[0], fc2_output_height * sizeof(float), cudaMemcpyHostToDevice));
    dim3 fc2_threadsperBlock(1);//(16)
    dim3 fc2_blocksperGrid(fc2_output_height);//(84)

	Fc2_naive << <fc2_blocksperGrid, fc2_threadsperBlock >> > (device_relu_fc1_Output_image, device_fc2_Output_image, device_fc2__kernel_weight, device_fc2__kernel_bias, fc2_input_height, fc2_input_channels);
    }

__global__ void Fc2_naive(float* input_image, float* output_image, float* fc1_weights, float* fc1_bias, int input_height, int input_channel) {

    int input_index = 0;
    int fc1_w_index = 0;
    for (int i = 0; i < 120; i++)
    {
        output_image[blockIdx.x] += input_image[i] * fc1_weights[blockIdx.x * 120 + i];
    }
    output_image[blockIdx.x] += fc1_bias[blockIdx.x];
}

3.9 Activation function ReLu4

Same as activation function ReLu3, only the sizes differ:

for (int t = 0; t < images.size(); t++) {
	//TODO: run inference on each image here
	//Conv1
	//ReLu1
	//Max Pool1
	//Conv2
	//ReLu2
	//Max pool2
	//fc1
	//relu fc1
	//fc2
	//.....
	
	
	//relu fc2
	int relu_fc2_input_channels = 1;//1
    int relu_fc2_input_height = fc2_output_height;//84
    int relu_fc2_output_height = relu_fc2_input_height;//84
    float* device_relu_fc2_Output_image;
    wbCheck(cudaMalloc((void**)&device_relu_fc2_Output_image, relu_fc2_output_height * sizeof(float)));
    dim3 relu_fc2_threadsperBlock(1); //(1)
    dim3 relu_fc2_blocksperGrid(relu_fc2_output_height);//84

	ReLu_fc1 << <relu_fc2_blocksperGrid, relu_fc2_threadsperBlock >> > (device_fc2_Output_image, device_relu_fc2_Output_image);
    }

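3.10 Fully connected layer fc3

Fully connected layer fc3	Value
Input channels	1
Output channels	1
Input size	1*84
Output size	1*10
Kernel size	84
Number of kernels	10
Number of weight parameters	84*10 = 840
Number of bias parameters	10 (equal to the output size)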

Same as fully connected layer fc1, only the sizes differ:

for (int t = 0; t < images.size(); t++) {
	//TODO: run inference on each image here
	//Conv1
	//ReLu1
	//Max Pool1
	//Conv2
	//ReLu2
	//Max pool2
	//fc1
	//relu fc1
	//fc2
	//relu fc2
	//.....
	
	
	//fc3
	int fc3_input_channels = 1;
    int fc3_input_height = relu_fc2_output_height; //84
    int fc3_output_height = 10;
    float* host_fc3_Output_image;
    host_fc3_Output_image = (float*)malloc(sizeof(float) * fc3_output_height);
    float* device_fc3__kernel_weight;
    float* device_fc3__kernel_bias;
    float* device_fc3_Output_image;


    wbCheck(cudaMalloc((void**)&device_fc3__kernel_weight, fc3_input_height* fc3_input_channels* fc3_output_height * sizeof(float)));//84*10
    wbCheck(cudaMalloc((void**)&device_fc3__kernel_bias, fc3_output_height * sizeof(float)));//10
    wbCheck(cudaMalloc((void**)&device_fc3_Output_image, fc3_output_height * sizeof(float)));//10
    wbCheck(cudaMemcpy(device_fc3__kernel_weight, &fc3_weight[0], fc3_input_height* fc3_input_channels* fc3_output_height * sizeof(float), cudaMemcpyHostToDevice));
    wbCheck(cudaMemcpy(device_fc3__kernel_bias, &fc3_bias[0], fc3_output_height * sizeof(float), cudaMemcpyHostToDevice));
    dim3 fc3_threadsperBlock(1);//(1)
    dim3 fc3_blocksperGrid(fc3_output_height);//(10)

	Fc3_naive << <fc3_blocksperGrid, fc3_threadsperBlock >> > (device_relu_fc2_Output_image, device_fc3_Output_image, device_fc3__kernel_weight, device_fc3__kernel_bias, fc3_input_height, fc3_input_channels);
	
	//Copy the output back to host memory
	cudaMemcpy(host_fc3_Output_image, device_fc3_Output_image, fc3_output_height * sizeof(float), cudaMemcpyDeviceToHost);
    }

__global__ void Fc3_naive(float* input_image, float* output_image, float* fc1_weights, float* fc1_bias, int input_height, int input_channel) {

    int input_index = 0;
    int fc1_w_index = 0;
    for (int i = 0; i < 84; i++)
    {
        output_image[blockIdx.x] += input_image[i] * fc1_weights[blockIdx.x * 84 + i];
    }
    output_image[blockIdx.x] += fc1_bias[blockIdx.x];
}
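The ten values copied back into host_fc3_Output_image are the class scores, and the predicted label is the index of the largest one. A minimal sketch of that step follows; argmax_index is a hypothetical helper name, not a function from the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original code): index of the largest of
// `count` scores. On ties the lowest index wins.
int argmax_index(const float* scores, std::size_t count) {
    assert(scores != nullptr && count > 0);
    std::size_t best = 0;
    for (std::size_t k = 1; k < count; ++k) {
        if (scores[k] > scores[best]) {
            best = k;
        }
    }
    return static_cast<int>(best);
}
```

With this helper the prediction for image t would be argmax_index(host_fc3_Output_image, 10), which is then compared with labels[t] to count correct predictions.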

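3.11 Output

After the steps above we obtain the inference results. Inference over the 10,000 test images should take less than 2 seconds, and the accuracy should be around 80%.

Besides building the network layers, a few details of the actual CUDA programming need attention:

When inferring image by image, the buffers allocated earlier must be zeroed after every image. The fully connected kernels accumulate into their output buffers with +=, so without clearing, the error grows with every image.
After each call to a __global__ function, check that it executed correctly (for example with wbCheck(cudaGetLastError())).
Variables such as each layer's input/output sizes should be defined outside the for loop, to avoid redefining them on every iteration.
When everything is finished, remember to call cudaFree() to release the device memory.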
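The first point follows from how the fully connected kernels are written: they add onto whatever the output buffer already holds. The CPU sketch below illustrates the effect and one possible alternative. Both function names are hypothetical, and the second variant is a suggestion rather than something the original code does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU illustration (not from the original code).
// Variant A mirrors the kernels in this article: it adds onto whatever is
// already stored in `output`, so values from a previous image leak in.
void fc_accumulate_into_output(const std::vector<float>& input,
                               const std::vector<float>& weights,
                               std::vector<float>& output) {
    const std::size_t in_features = input.size();
    assert(weights.size() == in_features * output.size());
    for (std::size_t o = 0; o < output.size(); ++o) {
        for (std::size_t i = 0; i < in_features; ++i) {
            output[o] += input[i] * weights[o * in_features + i];
        }
    }
}

// Variant B sums into a local variable and overwrites `output`, so the
// result does not depend on the buffer's previous contents.
void fc_overwrite_output(const std::vector<float>& input,
                         const std::vector<float>& weights,
                         std::vector<float>& output) {
    const std::size_t in_features = input.size();
    assert(weights.size() == in_features * output.size());
    for (std::size_t o = 0; o < output.size(); ++o) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < in_features; ++i) {
            sum += input[i] * weights[o * in_features + i];
        }
        output[o] = sum;
    }
}
```

Calling variant A twice on the same buffer without clearing it doubles the result, while variant B returns the same value each time. Writing the kernels in the style of variant B would presumably make the cudaMemset calls for the fully connected outputs unnecessary.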

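3.12 Further improvements

Readers who want to go deeper and speed the program up can try the following:

Dynamic parallelism: launch 10,000 threads, each of which in turn processes one image in parallel, avoiding the time cost of the serial for loop.
Tiling: make good use of shared memory to reduce redundant work.
Learn how CUDA shared-memory bank conflicts arise and improve the memory read/write pattern accordingly.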
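To make the bank-conflict point concrete, the following sketch counts on the CPU how many threads of one warp hit the same shared-memory bank. It assumes 32 banks that are each 4 bytes wide and a warp of 32 threads, which is the usual configuration on current NVIDIA GPUs. The function name and the tile layout are hypothetical and do not come from the original code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical CPU model (not from the original code). Assumes 32 banks of
// 4-byte words and a 32-thread warp. Thread t reads the float tile[t][column]
// of a row-major tile whose rows are `row_width` floats long. Returns the
// largest number of threads that land in one bank (1 means conflict-free).
int max_bank_conflict_degree(int row_width, int column) {
    assert(row_width > 0 && column >= 0 && column < row_width);
    constexpr int kBanks = 32;
    constexpr int kWarpSize = 32;
    std::array<int, kBanks> hits{};  // zero-initialised counters, one per bank
    for (int t = 0; t < kWarpSize; ++t) {
        int word_index = t * row_width + column;  // float index equals 4-byte word index
        ++hits[word_index % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions a row width of 32 floats puts all 32 threads of a column access into one bank, while padding each row to 33 floats spreads them over 32 different banks. This is why shared-memory tiles are often declared with one extra column.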

IV. Source code

4.1 Final CUDA source code

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <chrono>
#include <iomanip>
#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#ifndef __CUDACC__ 
#define __CUDACC__
#endif
//#include 

#define wbCheck(stmt)  do {                                                    \
        cudaError_t err = stmt;                                               \
        if (err != cudaSuccess) {                                             \
            printf( "\n\nFailed to run stmt %d ", __LINE__);                       \
            printf( "Got CUDA error ...  %s \n\n", cudaGetErrorString(err));    \
            return -1;                                                        \
        }                                                                     \
    } while(0)

// loading MNIST images
std::vector<std::vector<float>> read_mnist_images(const std::string & path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        std::cout << "Cannot open file!" << std::endl;
        return {};
    }

    int magic_number = 0, num_images = 0, num_rows = 0, num_cols = 0;
    file.read((char*)&magic_number, sizeof(magic_number));
    file.read((char*)&num_images, sizeof(num_images));
    file.read((char*)&num_rows, sizeof(num_rows));
    file.read((char*)&num_cols, sizeof(num_cols));

    // Reverse Integers (MNIST data is in big endian format)
    magic_number = ((magic_number & 0xff000000) >> 24) | ((magic_number & 0x00ff0000) >> 8) |
        ((magic_number & 0x0000ff00) << 8) | ((magic_number & 0x000000ff) << 24);
    num_images = ((num_images & 0xff000000) >> 24) | ((num_images & 0x00ff0000) >> 8) |
        ((num_images & 0x0000ff00) << 8) | ((num_images & 0x000000ff) << 24);
    num_rows = ((num_rows & 0xff000000) >> 24) | ((num_rows & 0x00ff0000) >> 8) |
        ((num_rows & 0x0000ff00) << 8) | ((num_rows & 0x000000ff) << 24);
    num_cols = ((num_cols & 0xff000000) >> 24) | ((num_cols & 0x00ff0000) >> 8) |
        ((num_cols & 0x0000ff00) << 8) | ((num_cols & 0x000000ff) << 24);

    int image_size = num_rows * num_cols;
    std::vector<std::vector<float>> images(num_images, std::vector<float>(image_size));

    for (int i = 0; i < num_images; ++i) {
        for (int j = 0; j < image_size; ++j) {
            unsigned char pixel = 0;
            file.read((char*)&pixel, sizeof(pixel));
            images[i][j] = static_cast<float>(pixel) / 255.0f;
        }
    }

    return images;
}
// loading MNIST Labels
std::vector<int> read_mnist_labels(const std::string & path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        std::cout << "Cannot open file!" << std::endl;
        return {};
    }

    int magic_number = 0, num_items = 0;
    file.read((char*)&magic_number, sizeof(magic_number));
    file.read((char*)&num_items, sizeof(num_items));

    // Reverse Integers (MNIST data is in big endian format)
    magic_number = ((magic_number & 0xff000000) >> 24) | ((magic_number & 0x00ff0000) >> 8) |
        ((magic_number & 0x0000ff00) << 8) | ((magic_number & 0x000000ff) << 24);
    num_items = ((num_items & 0xff000000) >> 24) | ((num_items & 0x00ff0000) >> 8) |
        ((num_items & 0x0000ff00) << 8) | ((num_items & 0x000000ff) << 24);

    std::vector<int> labels(num_items);
    for (int i = 0; i < num_items; ++i) {
        unsigned char label = 0;
        file.read((char*)&label, sizeof(label));
        labels[i] = static_cast<int>(label);
    }

    return labels;
}

//Read parameters from a text file
std::vector<float> read_param(const std::string & path) {
    std::ifstream file(path);
    std::vector<float> params;
    float param;
    while (file >> param) {
        params.push_back(param);
    }
    return params;
}

//Helper for printing; remember to copy the data from device memory back to the host before printing
void printVector(float* a)
{
    printf("\nprintconv1 : \n");
    for (int i = 0; i < 24; i++)
    {
        for (int j = 0; j < 24; j++)
        {
            std::cout << a[0 * 24 * 24 + i * 24 + j] << " ";
        }
        std::cout << std::endl;
    }
    std::cout << std::endl;
}

#define BLOCK_SIZE 32


// 24 * 24
#define OFFSET(row, col, ld) ((row) * (ld) + (col))
__global__ void Convolution1(float* input_image, float* output_image, float* kernel_weights, float* kernel_bias, int input_height, int output_height, int kernel_height)
{
    //printf("in\n\n");
    int input_image_index;
    int kernel_index;
    float value = 0;
    if (threadIdx.y < output_height && threadIdx.x < output_height)
    {
        for (int i = 0; i < kernel_height; i++)       
            for (int j = 0; j < kernel_height; j++) { 
                input_image_index = OFFSET(threadIdx.x+i, threadIdx.y+j, input_height);
                kernel_index = blockIdx.x * kernel_height * kernel_height + OFFSET(i, j, kernel_height);
                value += input_image[input_image_index] * kernel_weights[kernel_index];
            }
        output_image[blockIdx.x * output_height* output_height + threadIdx.x * output_height + threadIdx.y] = value + kernel_bias[blockIdx.x];
        __syncthreads();
    }
}

__global__ void ReLu(float* input_image, float* output_image,int input_height,int output_height) {
    if (threadIdx.y < output_height && threadIdx.x < output_height)
    {
        int input_index = blockIdx.x * input_height * input_height + threadIdx.x * input_height + threadIdx.y;
        if (input_image[input_index] <= 0)
            output_image[input_index] = 0;
        else
            output_image[input_index] = input_image[input_index];

        __syncthreads();
    }
}


__global__ void MaxPool1(float* input_image, float* output_image, int input_height, int output_height, int kernel_height, int stride,int channel) {
    int input_image_index;
    int kernel_index;
    float value = 0;
    if (threadIdx.y < output_height && threadIdx.x < output_height)
    {
        for (int i = 0; i < kernel_height; i++)        
            for (int j = 0; j < kernel_height; j++) {  
                input_image_index = blockIdx.x*input_height*input_height+ OFFSET(threadIdx.x*stride + i, threadIdx.y*stride + j, input_height);
                if (input_image[input_image_index] >= value)
                {
                    value = input_image[input_image_index];
                    
                }
            }
        output_image[blockIdx.x * output_height * output_height + threadIdx.x * output_height + threadIdx.y] = value;
        __syncthreads();
    }
}



__global__ void Convolution2(float* input_image, float* output_image, float* kernel_weights, float* kernel_bias, int input_height, int output_height, int kernel_height,int input_channel)
{
    int input_image_index;
    int kernel_index;

    if (threadIdx.y < output_height && threadIdx.x < output_height)
    {
        int output_index = blockIdx.x * output_height * output_height + threadIdx.x * output_height + threadIdx.y;
        float value = 0;
        for (int z = 0; z < input_channel; z++) {
            for (int i = 0; i < kernel_height; i++)        
            {
                for (int j = 0; j < kernel_height; j++) {  
                    input_image_index = z * input_height * input_height + OFFSET(threadIdx.x + i, threadIdx.y + j, input_height);
                    kernel_index = (blockIdx.x) * (input_channel) * kernel_height * kernel_height + z * kernel_height * kernel_height + OFFSET(i, j, kernel_height);
                    
                    value += input_image[input_image_index] * kernel_weights[kernel_index];
                }
            }
        }
        output_image[output_index] = value + kernel_bias[blockIdx.x];
    }
    
}


__global__ void Fc1(float* input_image, float* output_image, float* fc1_weights, float* fc1_bias,int input_height,int input_channel) {
    int ouput_index = blockIdx.x;
    float value = 0;
    int input_index = 0;
    int fc1_weights_index = 0;
    for (int i = 0; i < input_height; i++) {    //4*4
        for (int j = 0; j < input_height; j++) {
            //1*16*4*4
            
            fc1_weights_index = blockIdx.x * (input_channel) * input_height * input_height + threadIdx.x * input_height * input_height + OFFSET(i, j, input_height);
            input_index = threadIdx.x * input_height * input_height + OFFSET(i, j, input_height);

            value += input_image[input_index] * fc1_weights[fc1_weights_index];
            
        }
    }
    
    output_image[blockIdx.x] = output_image[blockIdx.x]+ value;
    __syncthreads();
}

__global__ void Fc1_naive(float* input_image, float* output_image, float* fc1_weights, float* fc1_bias, int input_height, int input_channel) {

    int input_index = 0;
    int fc1_w_index = 0;
    for (int i = 0; i < 16 * 4 * 4; i++)
    {
        output_image[blockIdx.x] += input_image[i] * fc1_weights[blockIdx.x*16*4*4 + i];
    }
    output_image[blockIdx.x] += fc1_bias[blockIdx.x];
}

__global__ void ReLu_fc1(float* input_image, float* output_image) {
    if (input_image[blockIdx.x] <= 0)
        output_image[blockIdx.x] = 0;
    else
        output_image[blockIdx.x] = input_image[blockIdx.x];
}


__global__ void Fc2_naive(float* input_image, float* output_image, float* fc1_weights, float* fc1_bias, int input_height, int input_channel) {

    int input_index = 0;
    int fc1_w_index = 0;
    for (int i = 0; i < 120; i++)
    {
        output_image[blockIdx.x] += input_image[i] * fc1_weights[blockIdx.x * 120 + i];
    }
    output_image[blockIdx.x] += fc1_bias[blockIdx.x];
}

__global__ void Fc3_naive(float* input_image, float* output_image, float* fc1_weights, float* fc1_bias, int input_height, int input_channel) {

    int input_index = 0;
    int fc1_w_index = 0;
    for (int i = 0; i < 84; i++)
    {
        output_image[blockIdx.x] += input_image[i] * fc1_weights[blockIdx.x * 84 + i];
    }
    output_image[blockIdx.x] += fc1_bias[blockIdx.x];
}


int main(int argc, char* argv[]) {

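    if (argc < 2) {
        std::cout << "Usage: pass the parameter directory as the first argument" << std::endl;
        return -1;
    }
    std::string dir = argv[1];  //dir from args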
    // cout << dir;
    //printf("%s", dir.c_str());
    auto images = read_mnist_images(dir + "/../../data/FashionMNIST/raw/t10k-images-idx3-ubyte");   //input height = input width = 28
    // loading label
    auto labels = read_mnist_labels(dir + "/../../data/FashionMNIST/raw/t10k-labels-idx1-ubyte");
    // loading param from .txt
    auto conv1_weight = read_param(dir + "/conv1.weight.txt");
    auto conv1_bias = read_param(dir + "/conv1.bias.txt");
    auto conv2_weight = read_param(dir + "/conv2.weight.txt");
    auto conv2_bias = read_param(dir + "/conv2.bias.txt");
    auto fc1_weight = read_param(dir + "/fc1.weight.txt");
    auto fc1_bias = read_param(dir + "/fc1.bias.txt");
    auto fc2_weight = read_param(dir + "/fc2.weight.txt");
    auto fc2_bias = read_param(dir + "/fc2.bias.txt");
    auto fc3_weight = read_param(dir + "/fc3.weight.txt");
    auto fc3_bias = read_param(dir + "/fc3.bias.txt");
 

    //Conv1
    int input_height = 28;
    int kernel_height = 5;
    int output_height = input_height - kernel_height + 1;
    int input_channels = 1, output_channels = 6, kernel_channel = 1, kernel_nums = 6;

    float* device_InputImage;
    float* device_OutputImage;
    float* device_kernel_weight;
    float* device_kernel_bias;

    wbCheck(cudaMalloc((void**)&device_InputImage, input_height * input_height * input_channels * sizeof(float)));
    wbCheck(cudaMalloc((void**)&device_OutputImage, output_height * output_height * output_channels * sizeof(float)));
    wbCheck(cudaMalloc((void**)&device_kernel_weight, kernel_height * kernel_height * kernel_channel * kernel_nums * sizeof(float)));
    wbCheck(cudaMalloc((void**)&device_kernel_bias, kernel_nums * sizeof(float)));
    wbCheck(cudaMemcpy(device_kernel_weight, &conv1_weight[0], kernel_height * kernel_height * kernel_nums * sizeof(float), cudaMemcpyHostToDevice));
    wbCheck(cudaMemcpy(device_kernel_bias, &conv1_bias[0], kernel_nums * sizeof(float), cudaMemcpyHostToDevice));
    dim3 threadsperBlock(output_height, output_height); //(24,24)
    dim3 blocksperGrid(6);

    //Relu
    int relu_input_height = output_height;  //relu input height = conv1 output height
    int relu_output_height = relu_input_height;
    int relu_input_channels = output_channels; //relu input channels = conv1 output channels
    float* device_relu_Output_image;
    
    wbCheck(cudaMalloc((void**)&device_relu_Output_image, relu_input_height * relu_input_height * relu_input_channels * sizeof(float)));

    //Max Pool1
    int pool1_input_height = relu_output_height;
    int pool1_output_height = 12;
    int stride = 2, pool1_kernel_height = 2;
    int pool1_channels = 6;
    float* device_pool1_Output_image;
    
    wbCheck(cudaMalloc((void**)&device_pool1_Output_image, pool1_output_height * pool1_output_height * pool1_channels * sizeof(float)));
    dim3 pool1_threadsperBlock(12, 12);//threadsperBlock

    //Conv2
    int conv2_input_height = pool1_output_height;//12
    int conv2_kernel_height = 5;
    int conv2_output_height = conv2_input_height - conv2_kernel_height + 1;//8
    int conv2_input_channels = 6, conv2_output_channels = 16, conv2_kernel_channel = 6, conv2_kernel_nums = 16;
    
    float* device_conv2__OutputImage;
    float* device_conv2__kernel_weight;
    float* device_conv2__kernel_bias;

    wbCheck(cudaMalloc((void**)&device_conv2__OutputImage, conv2_output_height * conv2_output_height * conv2_output_channels * sizeof(float)));
    wbCheck(cudaMalloc((void**)&device_conv2__kernel_weight, conv2_kernel_height * conv2_kernel_height * conv2_kernel_channel * conv2_kernel_nums * sizeof(float)));//5*5*6*16
    wbCheck(cudaMalloc((void**)&device_conv2__kernel_bias, conv2_kernel_nums * sizeof(float)));//16
    wbCheck(cudaMemcpy(device_conv2__kernel_weight, &conv2_weight[0], conv2_kernel_height * conv2_kernel_height * conv2_kernel_channel * conv2_kernel_nums * sizeof(float), cudaMemcpyHostToDevice));
    wbCheck(cudaMemcpy(device_conv2__kernel_bias, &conv2_bias[0], conv2_kernel_nums * sizeof(float), cudaMemcpyHostToDevice));
    dim3 conv2_threadsperBlock(conv2_output_height, conv2_output_height); //(8,8)
    dim3 conv2_blocksperGrid(16);

    //ReLu2
    int relu2_input_channels = conv2_output_channels;//16
    int relu2_input_height = conv2_output_height;//8
    int relu2_output_height = relu2_input_height;//8
    float* device_relu2_Output_image;

    wbCheck(cudaMalloc((void**)&device_relu2_Output_image, relu2_output_height * relu2_output_height * relu2_input_channels * sizeof(float)));

    dim3 relu2_threadsperBlock(conv2_output_height, conv2_output_height); //(8,8)
    dim3 relu2_blocksperGrid(16);

    //Max pool2
    int pool2_input_height = relu2_output_height;
    int pool2_output_height = 4;//(pool2_input_height - 1)/2+1 ;
    int pool2_stride = 2, pool2_kernel_height = 2;
    int pool2_channels = relu2_input_channels;//16

    float* device_pool2_Output_image;

    wbCheck(cudaMalloc((void**)&device_pool2_Output_image, pool2_output_height* pool2_output_height* pool2_channels * sizeof(float)));
    dim3 pool2_threadsperBlock(pool2_output_height, pool2_output_height);//4
    dim3 pool2_blocksperGrid(16);

    //fc1
    int fc1_input_channels = pool2_channels;//16
    int fc1_input_height = pool2_output_height; //4
    int fc1_output_height = 120;
    float* device_fc1__kernel_weight;
    float* device_fc1__kernel_bias;
    float* device_fc1_Output_image;

    wbCheck(cudaMalloc((void**)&device_fc1__kernel_weight, fc1_input_height* fc1_input_height* fc1_input_channels* fc1_output_height * sizeof(float)));//4*4*16*120
    wbCheck(cudaMalloc((void**)&device_fc1__kernel_bias, fc1_output_height * sizeof(float)));//120
    wbCheck(cudaMalloc((void**)&device_fc1_Output_image, fc1_output_height * sizeof(float)));//120
    wbCheck(cudaMemcpy(device_fc1__kernel_weight, &fc1_weight[0], fc1_input_height* fc1_input_height* fc1_input_channels* fc1_output_height * sizeof(float), cudaMemcpyHostToDevice));
    wbCheck(cudaMemcpy(device_fc1__kernel_bias, &fc1_bias[0], fc1_output_height * sizeof(float), cudaMemcpyHostToDevice));
    dim3 fc1_threadsperBlock(1);//(16)
    dim3 fc1_blocksperGrid(fc1_output_height);//(120)

    //relu fc1
    int relu_fc1_input_channels = 1;//1
    int relu_fc1_input_height = fc1_output_height;//120
    int relu_fc1_output_height = relu_fc1_input_height;//120
    float* device_relu_fc1_Output_image;
    wbCheck(cudaMalloc((void**)&device_relu_fc1_Output_image, relu_fc1_output_height * sizeof(float)));
    dim3 relu_fc1_threadsperBlock(1); //(8,8)
    dim3 relu_fc1_blocksperGrid(relu_fc1_output_height);

    //fc2 
    int fc2_input_channels = 1;
    int fc2_input_height = relu_fc1_output_height; //120
    int fc2_output_height = 84;
    float* device_fc2__kernel_weight;
    float* device_fc2__kernel_bias;
    float* device_fc2_Output_image;

    wbCheck(cudaMalloc((void**)&device_fc2__kernel_weight, fc2_input_height* fc2_input_channels* fc2_output_height * sizeof(float)));//120*84
    wbCheck(cudaMalloc((void**)&device_fc2__kernel_bias, fc2_output_height * sizeof(float)));//84
    wbCheck(cudaMalloc((void**)&device_fc2_Output_image, fc2_output_height * sizeof(float)));//84
    wbCheck(cudaMemcpy(device_fc2__kernel_weight, &fc2_weight[0], fc2_input_height* fc2_input_channels* fc2_output_height * sizeof(float), cudaMemcpyHostToDevice));
    wbCheck(cudaMemcpy(device_fc2__kernel_bias, &fc2_bias[0], fc2_output_height * sizeof(float), cudaMemcpyHostToDevice));
    dim3 fc2_threadsperBlock(1);//(16)
    dim3 fc2_blocksperGrid(fc2_output_height);//(84)

    //Relu fc2
    int relu_fc2_input_channels = 1;//1
    int relu_fc2_input_height = fc2_output_height;//84
    int relu_fc2_output_height = relu_fc2_input_height;//84
    float* device_relu_fc2_Output_image;
    wbCheck(cudaMalloc((void**)&device_relu_fc2_Output_image, relu_fc2_output_height * sizeof(float)));
    dim3 relu_fc2_threadsperBlock(1); //(1)
    dim3 relu_fc2_blocksperGrid(relu_fc2_output_height);//84

    //fc3
    int fc3_input_channels = 1;
    int fc3_input_height = relu_fc2_output_height; //84
    int fc3_output_height = 10;
    float* host_fc3_Output_image;
    host_fc3_Output_image = (float*)malloc(sizeof(float) * fc3_output_height);
    float* device_fc3__kernel_weight;
    float* device_fc3__kernel_bias;
    float* device_fc3_Output_image;


    wbCheck(cudaMalloc((void**)&device_fc3__kernel_weight, fc3_input_height* fc3_input_channels* fc3_output_height * sizeof(float)));//84*10
    wbCheck(cudaMalloc((void**)&device_fc3__kernel_bias, fc3_output_height * sizeof(float)));//10
    wbCheck(cudaMalloc((void**)&device_fc3_Output_image, fc3_output_height * sizeof(float)));//10
    wbCheck(cudaMemcpy(device_fc3__kernel_weight, &fc3_weight[0], fc3_input_height* fc3_input_channels* fc3_output_height * sizeof(float), cudaMemcpyHostToDevice));
    wbCheck(cudaMemcpy(device_fc3__kernel_bias, &fc3_bias[0], fc3_output_height * sizeof(float), cudaMemcpyHostToDevice));
    dim3 fc3_threadsperBlock(1);//(16)
    dim3 fc3_blocksperGrid(fc3_output_height);//(10)

    int correct_nums = 0, predict_label;// images.size()
    int index = 0,k=0;
   
    auto start = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < images.size(); t++) {

        //Host to Device
        //Conv1
        wbCheck(cudaMemcpy(device_InputImage, &images[t][0], images[t].size() * sizeof(float), cudaMemcpyHostToDevice));// images[0].size()*sizeof(float)
        Convolution1 << < blocksperGrid, threadsperBlock >> > (device_InputImage, device_OutputImage, device_kernel_weight, device_kernel_bias, input_height, output_height, kernel_height);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());


        //ReLu1
        ReLu << < blocksperGrid, threadsperBlock >> > (device_OutputImage, device_relu_Output_image, relu_input_height, relu_output_height);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());

        //Max Pool 1

        MaxPool1 << <blocksperGrid, pool1_threadsperBlock >> > (device_relu_Output_image, device_pool1_Output_image, pool1_input_height, pool1_output_height, pool1_kernel_height, stride, pool1_channels);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());


        //Conv2

        
        Convolution2 << < conv2_blocksperGrid, conv2_threadsperBlock >> > (device_pool1_Output_image, device_conv2__OutputImage, device_conv2__kernel_weight, device_conv2__kernel_bias
            , conv2_input_height, conv2_output_height, conv2_kernel_height, conv2_input_channels);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());



        //ReLu2
        
        ReLu << <relu2_blocksperGrid, relu2_threadsperBlock >> > (device_conv2__OutputImage, device_relu2_Output_image, relu2_input_height, relu2_output_height);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());
        


         //Max Pool 2


        MaxPool1 << <pool2_blocksperGrid, pool2_threadsperBlock >> > (device_relu2_Output_image, device_pool2_Output_image, pool2_input_height, pool2_output_height, pool2_kernel_height, pool2_stride, pool2_channels);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());
        

        //fc1


        
        Fc1_naive << <fc1_blocksperGrid, fc1_threadsperBlock >> > (device_pool2_Output_image, device_fc1_Output_image, device_fc1__kernel_weight, device_fc1__kernel_bias, fc1_input_height, fc1_input_channels);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());


        //relu(fc1)


        ReLu_fc1 << <relu_fc1_blocksperGrid, relu_fc1_threadsperBlock >> > (device_fc1_Output_image, device_relu_fc1_Output_image);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());

        //fc2

        Fc2_naive << <fc2_blocksperGrid, fc2_threadsperBlock >> > (device_relu_fc1_Output_image, device_fc2_Output_image, device_fc2__kernel_weight, device_fc2__kernel_bias, fc2_input_height, fc2_input_channels);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());



        //relu(fc2)


        ReLu_fc1 << <relu_fc2_blocksperGrid, relu_fc2_threadsperBlock >> > (device_fc2_Output_image, device_relu_fc2_Output_image);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());



        //fc3

        Fc3_naive << <fc3_blocksperGrid, fc3_threadsperBlock >> > (device_relu_fc2_Output_image, device_fc3_Output_image, device_fc3__kernel_weight, device_fc3__kernel_bias, fc3_input_height, fc3_input_channels);
        //wbCheck(cudaGetLastError());
        //wbCheck(cudaDeviceSynchronize());
        //wbCheck(cudaMemcpy(host_fc3_Output_image, device_fc3_Output_image, fc3_output_height * sizeof(float), cudaMemcpyDeviceToHost));
        cudaMemcpy(host_fc3_Output_image, device_fc3_Output_image, fc3_output_height * sizeof(float), cudaMemcpyDeviceToHost);
        
        index = 0;
        for (k = 0; k < 10; k++) {
            if (host_fc3_Output_image[k] > host_fc3_Output_image[index]) {
                index = k;
            }
        }
        if (index == labels[t])
            correct_nums++;
        //Conv1
        //wbCheck(cudaMemset(device_InputImage, 0, input_height * input_height * input_channels * sizeof(float)));
        cudaMemset(device_InputImage, 0, input_height * input_height * input_channels * sizeof(float));
        //wbCheck(cudaMemset(device_OutputImage, 0, output_height * output_height * output_channels * sizeof(float)));
        cudaMemset(device_OutputImage, 0, output_height * output_height * output_channels * sizeof(float));
        //ReLu1
        //wbCheck(cudaMemset(device_relu_Output_image, 0, relu_input_height * relu_input_height * relu_input_channels * sizeof(float)));
        cudaMemset(device_relu_Output_image, 0, relu_input_height * relu_input_height * relu_input_channels * sizeof(float));
        //Max Pool 1
        //wbCheck(cudaMemset(device_pool1_Output_image, 0, pool1_output_height * pool1_output_height * pool1_channels * sizeof(float)));
        cudaMemset(device_pool1_Output_image, 0, pool1_output_height * pool1_output_height * pool1_channels * sizeof(float));
        //Conv2
        //wbCheck(cudaMemset(device_conv2__OutputImage, 0, conv2_output_height * conv2_output_height * conv2_output_channels * sizeof(float)));
        cudaMemset(device_conv2__OutputImage, 0, conv2_output_height* conv2_output_height* conv2_output_channels * sizeof(float));

        //Relu2
        //wbCheck(cudaMemset(device_relu2_Output_image, 0,relu2_output_height * relu2_output_height * relu2_input_channels * sizeof(float)));
        cudaMemset(device_relu2_Output_image, 0, relu2_output_height * relu2_output_height * relu2_input_channels * sizeof(float));

        //Max Pool2
        //wbCheck(cudaMemset(device_pool2_Output_image, 0, pool2_output_height * pool2_output_height * pool2_channels * sizeof(float)));
        cudaMemset(device_pool2_Output_image, 0, pool2_output_height * pool2_output_height * pool2_channels * sizeof(float));
        //fc1 device_fc1_Output_image
        //wbCheck(cudaMemset(device_fc1_Output_image, 0, fc1_output_height * sizeof(float)));
        cudaMemset(device_fc1_Output_image, 0, fc1_output_height * sizeof(float));

        //Relu fc1
        //wbCheck(cudaMemset(device_relu_fc1_Output_image, 0, relu_fc1_output_height * sizeof(float))); 
        cudaMemset(device_relu_fc1_Output_image, 0, relu_fc1_output_height * sizeof(float));
        //fc2
        //wbCheck(cudaMemset(device_fc2_Output_image, 0, fc2_output_height * sizeof(float)));
        cudaMemset(device_fc2_Output_image, 0, fc2_output_height * sizeof(float));
        //Relu fc2
        //wbCheck(cudaMemset(device_relu_fc2_Output_image, 0, relu_fc2_output_height * sizeof(float)));
        cudaMemset(device_relu_fc2_Output_image, 0, relu_fc2_output_height * sizeof(float));

        //fc3
        //wbCheck(cudaMemset(device_fc3_Output_image, 0, fc3_output_height * sizeof(float)));
        cudaMemset(device_fc3_Output_image, 0, fc3_output_height * sizeof(float));

    }

    // CUDA Sync
    cudaDeviceSynchronize();

    // calculate time
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff = end - start;

    // print result
    std::cout << std::fixed << std::setprecision(4) << diff.count() << ":"<<float(correct_nums)/float(images.size());

    //cudaFree(dev_image);
    

    //Conv1
    cudaFree(device_InputImage);
    cudaFree(device_OutputImage);
    cudaFree(device_kernel_weight);
    cudaFree(device_kernel_bias);

    //Relu
    cudaFree(device_relu_Output_image);
    //Pool1
    cudaFree(device_pool1_Output_image);
    //Conv2
    cudaFree(device_conv2__OutputImage);
    cudaFree(device_conv2__kernel_weight);
    cudaFree(device_conv2__kernel_bias);
    //Relu2
    cudaFree(device_relu2_Output_image);
    //Pool2
    cudaFree(device_pool2_Output_image);
    //fc1
    cudaFree(device_fc1__kernel_weight);
    cudaFree(device_fc1__kernel_bias);
    cudaFree(device_fc1_Output_image);
    //Relu fc1
    cudaFree(device_relu_fc1_Output_image);
    //fc2
    cudaFree(device_fc2__kernel_weight);
    cudaFree(device_fc2__kernel_bias);
    cudaFree(device_fc2_Output_image);
    //Relu fc2
    cudaFree(device_relu_fc2_Output_image);
    //fc3
    cudaFree(device_fc3__kernel_weight);
    cudaFree(device_fc3__kernel_bias);
    cudaFree(device_fc3_Output_image);

    return 0;
}
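
The timing and reporting at the end of main() can be pulled into two small host-side helpers. What follows is only a sketch under assumptions: the names format_result and seconds_since are hypothetical and do not appear in the original source, and std::chrono::steady_clock is swapped in for the high_resolution_clock used above because it is guaranteed to be monotonic. The output layout (seconds, a colon, then accuracy, both with four decimals) mirrors what the code above prints.

```cpp
#include <chrono>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not in the original source): builds the
// "seconds:accuracy" string with four decimals, the layout main() prints.
std::string format_result(double seconds, std::size_t correct, std::size_t total) {
    // Guard against dividing by zero when no image was loaded.
    const double accuracy =
        (total == 0) ? 0.0
                     : static_cast<double>(correct) / static_cast<double>(total);
    std::ostringstream out;
    out << std::fixed << std::setprecision(4) << seconds << ":" << accuracy;
    return out.str();
}

// Hypothetical helper: seconds elapsed since `start`.
// In the CUDA program, call cudaDeviceSynchronize() before this so that
// all queued GPU work is finished and therefore included in the measurement.
double seconds_since(std::chrono::steady_clock::time_point start) {
    const std::chrono::duration<double> diff =
        std::chrono::steady_clock::now() - start;
    return diff.count();
}
```

In main() this would be used by recording `auto start = std::chrono::steady_clock::now();` before the image loop and printing `format_result(seconds_since(start), correct_nums, images.size())` after the synchronize call. Keeping the formatting in one function makes it harder to change the output layout by accident.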

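Summary

That is how to hand-write a simple neural network in C++ CUDA. I am a beginner myself and only put this rough tutorial together because there is so little material on the topic online. Many parts are not written as elegantly as they could be, so please point out any problems you find.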
