[CUDA] Getting Started with CUDA (2): Task-Oriented Practice - Solving a Least Squares Problem

Table of Contents

  • Getting Started with CUDA (2): Task-Oriented Practice - Solving a Least Squares Problem
    • 1 Problem Description
    • 2 MATLAB Implementation
    • 3 CUDA Implementation
      • 3.1 Overview
      • 3.2 Step 1: Reading the Matrix
      • 3.3 Moving the Matrix onto the GPU
      • 3.4 Matrix Operations
      • 3.5 Computing the SVD of A
    • 4 Complete Code

1 Problem Description

Today we are going to solve a least squares problem, which also happens to be my homework assignment. Below is a brief description of the problem and the formulas involved, so that we know exactly what to compute. It is fine if the underlying math is unfamiliar; what matters is understanding the formulas and the operations we need to perform.

Given an $n \times n$ matrix $A$, we want to find an approximate solution $x$ to the least squares problem

$$\arg\min_{x \in \mathbb{R}^n} \left\| Ax - b \right\|_2$$

The vector $b$ is computed by multiplying $A$ by the vector $e$ consisting of all ones. Assume $A$ admits a singular value decomposition (SVD)

$$A = U \Sigma V^T$$

where $U$ and $V$ are $n \times n$ orthogonal matrices and $\Sigma$ is an $n \times n$ diagonal matrix whose diagonal entries satisfy

$$\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \cdots \ge \sigma_n \ge 0$$

Then, as long as $\sigma_n > 0$, we can compute the solution

$$x = V \Sigma^{-1} U^T b$$

Alternatively, if we write $U$ and $V$ column by column as

$$U = [u_1, u_2, \cdots, u_n], \quad V = [v_1, v_2, \cdots, v_n]$$

the same solution can be written as

$$x = \sum_{i=1}^{n} \frac{u_i^T b}{\sigma_i} v_i$$

For some $1 < k < n$ we can find a gap where $\sigma_k \gg \sigma_{k+1}$. In that case, keeping only the first $k$ columns of $U$ and $V$ and the leading $k \times k$ block of $\Sigma$,

$$U_k = U(:, 1\!:\!k), \quad V_k = V(:, 1\!:\!k), \quad \Sigma_k = \Sigma(1\!:\!k, 1\!:\!k)$$

gives an approximation $A_k$ of $A$, known as the truncated SVD (TSVD):

$$A_k = U_k \Sigma_k V_k^T$$

It follows that

$$\left\| A - A_k \right\|_2 \approx \sigma_{k+1}$$

As before, there are two equivalent ways to compute the approximate solution $x_k$:

$$x_k = \sum_{i=1}^{k} \frac{u_i^T b}{\sigma_i} v_i, \qquad x_k = V_k \Sigma_k^{-1} U_k^T b$$

Finally, we compute the error and the residual

$$e = \left\| x - x_k \right\|_2, \qquad r = \left\| A_k x_k - b \right\|_2$$

To summarize: for a given matrix $A$, our goal is to compute $x_k$, $e$, and $r$.

2 MATLAB Implementation

Following the task description above, we can first implement the solution in MATLAB. MATLAB has a large number of built-in functions, is convenient to use, and runs fast. We can also use the MATLAB results later to verify the correctness of our CUDA implementation.

clear all 
clc
m = 256;
n = 128;
% load MyMatrix.txt       % read the matrix from a file
% A = MyMatrix;
A = rand(m, n);           % or generate a random matrix
b = A * ones(n, 1);
[U,S,V] = svd(A);         % compute the SVD
sigma = diag(S);
% temp = S;
% temp(1:128,1:128) = inv(S(1:128,:));
% x_true = V * temp * U' * b;  % solution 1
x = zeros(n, 1);
for i = 1:n
    x = x + V(:,i) * ((U(:,i)'*b) / sigma(i)); % solution 2
end
k = 2;
for i = 2:n-1
    fprintf('sigma(%d)=%e, sigma(%d)=%e, sigma(%d)/sigma(%d)=%e\n', i+1, sigma(i+1), i, sigma(i), i+1, i, sigma(i + 1) / sigma(i));
    if sigma(i + 1) / sigma(i) <= 1e-3
        k = i;
        break
    else
        k = i;
    end
end
U_k = U(:,1:k);
V_k = V(:,1:k);
sigma_k = S(1:k,1:k);
A_k = U_k * sigma_k * V_k';
x_k = zeros(n, 1);
for i = 1:k
    x_k = x_k + V(:,i) * ((U(:,i)'*b) / sigma(i)); % solution 2, truncated at k
end
% x_k = V_k * inv(sigma_k) * U_k' * b
e = norm(x - x_k)   % error
r = norm(A_k * x_k - b)   % residual

As you can see, the MATLAB implementation is fairly straightforward: it is essentially a direct transcription of the formulas.

3 CUDA Implementation

Now let's get down to programming with CUDA. Before we start, it is worth thinking about which operations we will need and which functions correspond to them.

3.1 Overview

This problem clearly involves matrix multiplication, SVD computation, 2-norm evaluation, matrix transposition, and matrix inversion. The natural approach would be to implement a function for each of these operations and call them in sequence. However, implementing all of this ourselves in CUDA would be fairly difficult, and our hand-written versions would almost certainly be slower than the official APIs NVIDIA provides (unless you happen to be a computing genius). Most of us are ordinary beginners, so let's happily lean on the libraries!
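
One note before diving in: all of the cuBLAS and cuSOLVER snippets below assume that library handles named cublasH and cusolverH already exist. A minimal sketch of that setup (the full program in section 4 does exactly this):

cublasHandle_t cublasH;
cusolverDnHandle_t cusolverH;
cublasCreate(&cublasH);         // create the cuBLAS context
cusolverDnCreate(&cusolverH);   // create the cuSOLVER dense-solver context
// ... use cublasH / cusolverH for the calls below ...
cublasDestroy(cublasH);         // release the contexts when done
cusolverDnDestroy(cusolverH);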

3.2 Step 1: Reading the Matrix

The first step is to read in the matrix $A$ and allocate space for it. Note that this still happens on the host, i.e. the CPU handles this task. The function below reads the matrix directly; the file format is: the first two numbers are the row and column counts, followed by the matrix entries themselves.

float* readMatrix(char* filename) {
    int i, j;
    FILE *file;
    file = fopen(filename, "r");
    if (file == NULL)
    {
      printf("[ERROR] File \'%s\' does not exist in the current directory\n", filename);
      printf("[ERROR] Program exited.\n");
      exit(1);
    }

    int m, n;
    float float_m, float_n;
    fscanf(file, "%f", &float_m);
    fscanf(file, "%f", &float_n);
    m = (int) float_m;
    n = (int) float_n;
    float *A = (float *) malloc(m * n * sizeof(float)); 
    for (i = 0; i < m; i++)
      for (j = 0; j < n; j++)
      {
        // column-major storage: element (i, j) lives at i + j * m,
        // which is the layout cuBLAS and cuSOLVER expect
        fscanf(file, "%f", &A[i + j * m]);
      }

    fclose(file);
    printf("[INFO] Test matrix has been successfully imported, size=%dx%d.\n", m, n);
    return A;
}
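
One detail worth flagging: cuBLAS and cuSOLVER assume column-major storage, which is why the loop above writes to A[i + j * m] rather than the row-major A[i * n + j]. The full program in section 4 wraps this indexing in a macro; a quick sanity check of the mapping:

#define IDX2C(i,j,ld) (((j)*(ld))+(i))   // column-major: ld = number of rows
// for a 2 x 3 matrix stored column by column:
// IDX2C(0,0,2)=0, IDX2C(1,0,2)=1, IDX2C(0,1,2)=2,
// IDX2C(1,1,2)=3, IDX2C(0,2,2)=4, IDX2C(1,2,2)=5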

3.3 Moving the Matrix onto the GPU

Since the computation happens on the GPU, the data must first be uploaded to it. The upload speed depends on

  • the speed of the CPU
  • the communication bandwidth between host and device

Analogous to malloc(), cudaMalloc() allocates memory on the GPU. Once the space is allocated, data on the host can be copied to the GPU, and data on the GPU can likewise be copied back to the host. The function for this is cudaMemcpy(device_x, host_x, n * sizeof(float), cudaMemcpyHostToDevice); the example shown copies host_x to the GPU, and both device_x and host_x must be pointers. The last parameter specifies the direction of the transfer and can take the values in the table below.

Name                       Direction
cudaMemcpyHostToDevice     host to device
cudaMemcpyDeviceToHost     device to host
cudaMemcpyDeviceToDevice   device to device

Unified memory

With the approach above, data may end up being copied back and forth frequently, which is inconvenient when writing programs. Fortunately, CUDA supports unified memory: a single allocation that both the host and the device can access, which makes programs simpler to write. Here is how to allocate it.

Use __host__ cudaError_t cudaMallocManaged ( void** devPtr, size_t size, unsigned int flags = cudaMemAttachGlobal ), for example

cudaMallocManaged(&x, N*sizeof(float)); // allocate x in unified (managed) memory

Although this is easier to write, it also makes it easier to introduce subtle bugs that are hard to track down, so this article sticks with the explicit allocate-and-copy approach for GPU memory.
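
To illustrate the kind of bug meant here, a minimal sketch (the kernel scale below is a hypothetical example, not part of this assignment): with managed memory the host must synchronize with the device before touching the data again, and a forgotten cudaDeviceSynchronize() is a classic source of wrong results.

__global__ void scale(float *x, int n) {   // hypothetical example kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// ... in host code, with N defined:
float *x;
cudaMallocManaged(&x, N * sizeof(float));  // visible to both host and device
for (int i = 0; i < N; i++) x[i] = 1.0f;   // initialize on the host
scale<<<(N + 255) / 256, 256>>>(x, N);     // modify on the device
cudaDeviceSynchronize();                   // required before the host reads x again
printf("x[0] = %f\n", x[0]);               // prints 2.0
cudaFree(x);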

We first read the matrix from the file and store it on the host, then allocate GPU memory and transfer the host data to the GPU. The code is as follows:

float *host_A = readMatrix("MyMatrix.txt");   // A has shape m x n
float *host_x = matrixOnes(n);                // x has shape n x 1
float *host_b = (float*)malloc(sizeof(float) * m);
float *device_b;   // b has shape m x 1
float *device_x;   // x has shape n x 1
float *device_A;   // A has shape m x n
cudaMalloc((void**)&device_x, n * sizeof(float));
cudaMalloc((void**)&device_b, m * sizeof(float));
cudaMalloc((void**)&device_A, m * n * sizeof(float));
cudaMemcpy(device_A, host_A, m * n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(device_x, host_x, n * sizeof(float), cudaMemcpyHostToDevice);
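
For brevity, these snippets ignore return codes. In a real program it is worth checking every CUDA call; one common pattern is a small wrapper macro along these lines (a sketch, not part of the original assignment code):

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "[CUDA ERROR] %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// usage:
CUDA_CHECK(cudaMalloc((void**)&device_A, m * n * sizeof(float)));
CUDA_CHECK(cudaMemcpy(device_A, host_A, m * n * sizeof(float), cudaMemcpyHostToDevice));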

3.4 Matrix Operations

Next we compute $b = Ax$: $A$ and $x$ are known and we solve for $b$. Note that $b$ and $x$ are really just vectors, since they are one-dimensional, so the function we use is cublasSgemv(). This function is fairly involved and takes many parameters; the S stands for single precision and the trailing mv for matrix-vector multiply. I will cover the function in detail in the next post; for now, just follow how it is used here.

Our steps are as follows:

  1. Set up the matrix and the vector
  2. Configure the parameters
  3. Call cublasSgemv() to perform the multiplication

The code is as follows:

cublasSetMatrix(m, n, sizeof(float), host_A, m, device_A, m);
cublasSetVector(n, sizeof(float), host_x, 1, device_x, 1);
float alpha = 1.0f;
float beta = 0.0f;
cublas_status = cublasSgemv(cublasH, CUBLAS_OP_N, m, n, &alpha, device_A, m, device_x, 1, &beta, device_b, 1);

This block computes exactly $b = Ax$.
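
For reference, here is the same call again with each argument annotated. gemv computes y = alpha * op(A) * x + beta * y, so with alpha = 1 and beta = 0 it reduces to b = Ax:

cublas_status = cublasSgemv(cublasH,      // cuBLAS handle
                            CUBLAS_OP_N,  // op(A) = A (no transpose)
                            m, n,         // A is m x n
                            &alpha,       // alpha = 1.0f
                            device_A, m,  // A and its leading dimension (number of rows)
                            device_x, 1,  // x and its stride
                            &beta,        // beta = 0.0f, so old contents of b are ignored
                            device_b, 1); // y = b and its stride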

3.5 Computing the SVD of A

Next, our task is to compute the SVD of $A$. In MATLAB this is a single call to the built-in svd(matrix); does CUDA have an equivalent? It does. In the previous step we used cuBLAS, which contains many more useful routines; here we turn to cuSOLVER, which packages a number of highly optimized dense factorizations such as the SVD (FFTs live in the separate cuFFT library). It must be said that cusolverDnSgesvd() is also quite complex; if you want the meaning of every parameter you will need the official documentation, but I suggest just following this example.

The steps are similar to the previous section:

  1. Allocate space and set up the output matrices
  2. Allocate the workspace the SVD computation needs
  3. Call cusolverDnSgesvd() to compute the SVD

The code is as follows:

int bufferSize = 0;
cusolver_status = cusolverDnSgesvd_bufferSize(cusolverH, m, n, &bufferSize);
float *buffer;
cudaMalloc((void**)&buffer, bufferSize * sizeof(float));
float *device_S, *device_U, *device_VT;
int *devInfo;
cudaMalloc((void**)&devInfo, sizeof(int));
cudaMalloc((void**)&device_S, n * sizeof(float));       // singular values: min(m, n) = n of them
cudaMalloc((void**)&device_U, m * m * sizeof(float));   // U is m x m with jobu = 'A'
cudaMalloc((void**)&device_VT, n * n * sizeof(float));  // V^T is n x n with jobvt = 'A'
// note: cusolverDnSgesvd requires m >= n, and it overwrites device_A
cusolver_status = cusolverDnSgesvd(cusolverH, 'A', 'A', m, n, device_A, m, device_S, device_U, m, device_VT, n, buffer, bufferSize, NULL, devInfo);
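
One thing this snippet skips: gesvd reports its status through devInfo. Zero means success, a negative value flags an invalid parameter, and a positive value means that many superdiagonals failed to converge. A quick check would look like this (a sketch; the full code in section 4 also omits it):

int info = 0;
cudaMemcpy(&info, devInfo, sizeof(int), cudaMemcpyDeviceToHost);
if (info != 0) {
    printf("[ERROR] gesvd failed, devInfo = %d\n", info);
}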

The remaining steps will be covered in the next update!

4 Complete Code

Putting it all together, the full code for this experiment is given below for reference.

/**
 * @file jz544_hw4_code.cu
 * @author Jingkai Zhang (jz544) ([email protected])
 * @date 2022-04-07
 * 
 * compile and run the program with the following commands:
 * nvcc -o jz544_hw4_code jz544_hw4_code.cu -lcublas -lcusolver
 * ./jz544_hw4_code
 */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <sys/time.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

#define IDX2C(i,j,ld) (((j)*(ld))+(i))
#define READ_MATRIX_FROM_FILE 0     // 1 means read the matrix from file, 0 means generate it
#define MATRIX_SIZE_ROW (1 << 13)   // used only when READ_MATRIX_FROM_FILE is 0
#define MATRIX_SIZE_COL (1 << 13)   // used only when READ_MATRIX_FROM_FILE is 0

double* generateMatrix(int row, int col, int *m, int *n); 
double* readMatrix(char *filename, int *row, int *col);
double* matrixOnes(int size);
double* transpose(double *A, int m, int n);
double* matrixDiag(int n, double *sigmas);

__global__ void transpose(double *A, double *AT, int m, int n) {
    // one thread per element: AT(ny, nx) = A(nx, ny), column-major storage
    int nx = blockIdx.x * blockDim.x + threadIdx.x;
    int ny = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (nx < m && ny < n) {
        AT[nx * n + ny] = A[ny * m + nx];
    }
}

__global__ void calculate_x(double *x, double *u_t_times_b, double *sigma, double *V, int N, int i) {
    // one block per entry of x: x[nx] = sum_{j < i} V(nx, j) * (u_j^T b) / sigma_j
    // x must be zero-initialized before launch, since this kernel accumulates
    int nx = blockIdx.x;
    if (nx < N) {
        for (int j = 0; j < i; j++) {
            x[nx] += V[j * N + nx] * u_t_times_b[j] / sigma[j];
        }
    }
}

__global__ void get_S_k(double *S, double *S_k, int k, int M) {
    // one block per column: build the k x k diagonal matrix S_k
    // from the vector of singular values S
    int nx = blockIdx.x;
    if (nx < k) {
        for (int i = nx * M; i < nx * M + M; i++) {
            if (i == nx * M + nx) {
                S_k[nx * M + nx] = S[nx];   // diagonal entry
            } else {
                S_k[i] = 0.0;               // zero elsewhere
            }
        }
    }
}

__global__ void get_U_or_V_k(double *U, double *U_k, int k, int M) {
    // one block per column: copy the first k columns (M rows each, column-major)
    int nx = blockIdx.x;
    if (nx < k) {
        for (int i = nx * M; i < nx * M + M; i++) {
            U_k[i] = U[i];
        }
    }
}

int main() {
    int m;   // size of matrix A 
    int n; 
    printf("Hello, World!\n");
    cublasHandle_t cublasH;
    cusolverDnHandle_t cusolverH; 
    cublasStatus_t cublas_status = CUBLAS_STATUS_SUCCESS;
    cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS;
    cublas_status = cublasCreate(&cublasH);  
    assert(CUBLAS_STATUS_SUCCESS == cublas_status);
    cusolver_status = cusolverDnCreate(&cusolverH);
    assert(CUSOLVER_STATUS_SUCCESS == cusolver_status);
    /**************************************************************************
     * Step 1 & 2 & 3: 
     *      create n-length all one matrix x 
     *      read matrix A from MyMatrix.txt
     *      copy data from host to device
     *************************************************************************/
    double *host_A;
    if (READ_MATRIX_FROM_FILE) {
        host_A = readMatrix("MyMatrix.txt", &m, &n);   // A has shape m x n
    } else {
        printf("[INFO] generate matrix, row=%d, col=%d\n", MATRIX_SIZE_ROW, MATRIX_SIZE_COL);
        host_A = generateMatrix(MATRIX_SIZE_ROW, MATRIX_SIZE_COL, &m, &n);
    }
    double *host_x = matrixOnes(n);         // x has shape n x 1
    double *host_b = (double*)malloc(sizeof(double) * m);
    double *device_b;   // b has shape m x 1
    double *device_x;   // x has shape n x 1
    double *device_A;   // A has shape m x n
    struct timeval start, end;   // start and stop timer
    float el_time;               // elapsed time
    gettimeofday(&start, NULL);  // start counting time
    cudaMalloc((void**)&device_x, n * sizeof(double));
    cudaMalloc((void**)&device_b, m * sizeof(double));
    cudaMalloc((void**)&device_A, m * n * sizeof(double));
    cudaMemcpy(device_A, host_A, m * n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(device_x, host_x, n * sizeof(double), cudaMemcpyHostToDevice);
    /*************************************************************
     * Step 4: 
     *      set b = Ax
     *************************************************************/
    cublasSetMatrix(m, n, sizeof(double), host_A, m, device_A, m);
    cublasSetVector(n, sizeof(double), host_x, 1, device_x, 1);

    double alpha = 1.0;
    double beta = 0.0;
    cublas_status = cublasDgemv(cublasH, CUBLAS_OP_N, m, n, &alpha, device_A, m, device_x, 1, &beta, device_b, 1);
    /**
     * Step 5: 
     *      cusolverDnDgesvd to get A = U*S*V^T
     *      first call cusolverDnDgesvd_bufferSize to get the workspace size
     *      then perform the SVD using cusolverDnDgesvd
     */
    int bufferSize = 0;
    cusolver_status = cusolverDnDgesvd_bufferSize(cusolverH, m, n, &bufferSize);
    double *buffer;
    cudaMalloc((void**)&buffer, bufferSize * sizeof(double));
    double *device_S, *device_U, *device_VT;
    int *devInfo;
    cudaMalloc((void**)&devInfo, sizeof(int));
    cudaMalloc((void**)&device_S, n * sizeof(double));
    cudaMalloc((void**)&device_U, m * m * sizeof(double));
    cudaMalloc((void**)&device_VT, n * n * sizeof(double));
    
    // note: gesvd requires m >= n and overwrites device_A (we re-upload A later)
    cusolver_status = cusolverDnDgesvd(cusolverH, 'A', 'A', m, n, device_A, m, device_S, device_U, m, device_VT, n, buffer, bufferSize, NULL, devInfo);
    /*************************************************************
     * Step 6: 
     *      find true value of x
     *      find k either on the device or on the host
     ************************************************************/
    // 2-D launch configuration sized to cover the whole n x n matrix
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    double *device_U_T_times_b;   // holds U^T * b
    cudaMalloc((void**)&device_U_T_times_b, m * sizeof(double));
    cublas_status = cublasDgemv(cublasH, CUBLAS_OP_T, m, m, &alpha, device_U, m, device_b, 1, &beta, device_U_T_times_b, 1);
    double *device_true_x;
    cudaMalloc((void**)&device_true_x, n * sizeof(double));
    cudaMemset(device_true_x, 0, n * sizeof(double));   // calculate_x accumulates, so start from zero
    double *device_V;
    cudaMalloc((void**)&device_V, n * n * sizeof(double));
    transpose<<<grid, block>>>(device_VT, device_V, n, n);
    calculate_x<<<n, 1>>>(device_true_x, device_U_T_times_b, device_S, device_V, n, n);

    int k = 1;
    double *host_S = (double*)malloc(sizeof(double) * n);
    cublasGetVector(n, sizeof(double), device_S, 1, host_S, 1);
    // find the first large gap between consecutive singular values
    for (int i = 0; i < n - 1; i++) {
        if (host_S[i+1] / host_S[i] <= 1e-3) {
            k = i;
            break;
        } 
    }
    k = k + 1;
    /**
     * Step 7 & 8: 
     *      form U_k, V_k, S_k
     *      b_k = (U_k)' * b
     *      d_k = inv(sigma_k) * (U_k)' * b_k
     *      x_k = V_k * d_k
     */
    double *device_x_k;
    cudaMalloc((void**)&device_x_k, n * sizeof(double));
    cudaMemset(device_x_k, 0, n * sizeof(double));   // calculate_x accumulates, so start from zero
    calculate_x<<<n, 1>>>(device_x_k, device_U_T_times_b, device_S, device_V, n, k);
    /**
     * Step 9 to 13
     *      compute the error = ||x_k - x_true||_2 
     *      compute the Ax_k - b
     *      compute the residual error = ||Ax_k - b||_2
     *      move x_k, e, r to the host
     *      print e, r and the first 8 entries of x_k
     */
    double *device_A_k, *device_U_k, *device_V_k, *device_S_k;
    cudaMalloc((void**)&device_A_k, m * n * sizeof(double));
    cudaMalloc((void**)&device_S_k, k * k * sizeof(double));
    cudaMalloc((void**)&device_U_k, m * k * sizeof(double));
    cudaMalloc((void**)&device_V_k, n * k * sizeof(double));
    get_S_k<<<k, 1>>>(device_S, device_S_k, k, k);
    get_U_or_V_k<<<k, 1>>>(device_U, device_U_k, k, m);
    get_U_or_V_k<<<k, 1>>>(device_V, device_V_k, k, n);
    double *device_w;
    cudaMalloc((void**)&device_w, m * k * sizeof(double));
    alpha = 1.0;
    cublasDgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_N, m, k, k, &alpha, device_U_k, m, device_S_k, k, &beta, device_w, m);
    cublasDgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k, &alpha, device_w, m, device_V_k, n, &beta, device_A_k, m);
    alpha = -1.0;
    // copy device_true_x so the axpy below does not overwrite it
    double *copy_device_true_x;
    cudaMalloc((void**)&copy_device_true_x, n * sizeof(double));
    cudaMemcpy(copy_device_true_x, device_true_x, n * sizeof(double), cudaMemcpyDeviceToDevice);
    cublasDaxpy(cublasH, n, &alpha, device_x_k, 1, copy_device_true_x, 1);
    double error;
    cublasDnrm2(cublasH, n, copy_device_true_x, 1, &error);
    double *device_A_k_times_x_k;
    cudaMalloc((void**)&device_A_k_times_x_k, m * sizeof(double));
    // re-upload A, since gesvd overwrote device_A
    cudaMemcpy(device_A, host_A, m * n * sizeof(double), cudaMemcpyHostToDevice);
    alpha = 1.0;
    cublas_status = cublasDgemv(cublasH, CUBLAS_OP_N, m, n, &alpha, device_A_k, m, device_x_k, 1, &beta, device_A_k_times_x_k, 1);
    alpha = -1.0;
    cublasDaxpy(cublasH, m, &alpha, device_b, 1, device_A_k_times_x_k, 1);
    double residual_error;
    cublasDnrm2(cublasH, m, device_A_k_times_x_k, 1, &residual_error);
    cudaDeviceSynchronize();
    gettimeofday(&end, NULL);   // stop counting time
    el_time = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
    printf("[INFO] k = %d, same as MATLAB index\n", k);
    printf("[INFO] error = %e, residual error = %e\n", error, residual_error);
    // print the first 8 entries of x_k
    double *host_x_k = (double*)malloc(sizeof(double) * n);
    double *host_x_true = (double*)malloc(sizeof(double) * n);
    cublasGetVector(n, sizeof(double), device_x_k, 1, host_x_k, 1);
    cublasGetVector(n, sizeof(double), device_true_x, 1, host_x_true, 1);
    for (int i = 0; i < 8; i++) {
        printf("[INFO] x_k[%d] = %e\n", i, host_x_k[i]);
    }
    printf("[INFO] time consumption: %e s\n", el_time);
    cudaDeviceSynchronize();
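    // release device memory, host memory, and the library handles
    cudaFree(device_A); cudaFree(device_b); cudaFree(device_x);
    cudaFree(device_S); cudaFree(device_U); cudaFree(device_VT);
    cudaFree(device_V); cudaFree(device_true_x); cudaFree(device_x_k);
    cudaFree(device_U_T_times_b); cudaFree(copy_device_true_x);
    cudaFree(device_A_k); cudaFree(device_U_k); cudaFree(device_V_k);
    cudaFree(device_S_k); cudaFree(device_w); cudaFree(device_A_k_times_x_k);
    cudaFree(buffer); cudaFree(devInfo);
    free(host_A); free(host_x); free(host_b); free(host_S);
    free(host_x_k); free(host_x_true);
    cublasDestroy(cublasH);
    cusolverDnDestroy(cusolverH);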
    return 0;
}

double* generateMatrix(int row, int col, int *m, int *n) {
    double *A = (double *) malloc(row * col * sizeof(double));
    for (int i = 0; i < row; i++) {
        for (int j = 0; j < col; j++) {
            A[IDX2C(i, j, row)] = (double) drand48();
        }
    }
    *m = row;
    *n = col;
    return A;
}

double* readMatrix(char* filename, int *row, int *col) {
    int i, j;
    FILE *file;
    file = fopen(filename, "r");
    if (file == NULL)
    {
      printf("[ERROR] File \'%s\' does not exist in the current directory\n", filename);
      printf("[ERROR] Program exited.\n");
      exit(1);
    }

    int m, n;
    double double_m, double_n;
    fscanf(file, "%lf", &double_m);
    fscanf(file, "%lf", &double_n);
    m = (int) double_m;
    n = (int) double_n;
    *row = m;
    *col = n;
    double *A = (double *) malloc(m * n * sizeof(double)); 
    for (i = 0; i < m; i++)
      for (j = 0; j < n; j++)
      {
        // column-major storage, as expected by cuBLAS / cuSOLVER
        fscanf(file, "%lf", &A[IDX2C(i, j, m)]);
      }

    fclose(file);
    printf("[INFO] Test matrix has been successfully imported, size=%dx%d.\n", m, n);
    return A;
}

double* matrixOnes(int size) {
  int i;
  double *mat = (double *) malloc(size * sizeof(double)); 
  for (i = 0; i < size; i++) 
  {
      mat[i] = 1.0;
  }
  return mat;
}

double* matrixDiag(int n, double *sigmas) {
    // form a diagonal matrix from sigmas
    int i, j;
    int count = 0;
    double *diag = (double *) malloc(n * n * sizeof(double));
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (i == j) {
                diag[i + j * n] = sigmas[count];
                count += 1;
            }
            else {
                diag[i + j * n] = 0.0;
            }
        }
    }
    return diag;
}

double* transpose(double *A, int m, int n) {
    int i, j;
    double *B = (double *) malloc(m * n * sizeof(double)); 
    for (i = 0; i < m; i++)
      for (j = 0; j < n; j++)
      {
        // B[i + j * n] = A[i*n + j];
        B[i*n + j] = A[n*j + i];
      }
    return B;
}

