Today we are going to solve a least-squares problem (it also happens to be my homework). Below is a brief description of the problem and the formulas involved, which is all we need to complete the task. Don't worry if the underlying math is unfamiliar; the key is to understand the formulas and the computations we need to perform.
Given an $n \times n$ matrix $A$, we want to find an approximate solution $x$ of the least-squares problem

$$\arg\min_{x \in \mathbb{R}^n} \|Ax - b\|_2$$
The vector $b$ is computed by multiplying $A$ with the all-ones vector $e$. Assume $A$ admits a singular value decomposition (SVD)

$$A = U \Sigma V^T$$
where $U$ and $V$ are $n \times n$ orthogonal matrices and $\Sigma$ is an $n \times n$ diagonal matrix whose diagonal entries satisfy

$$\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \cdots \ge \sigma_n \ge 0$$
Then, as long as $\sigma_n > 0$, we can solve for

$$x = V \Sigma^{-1} U^T b$$
Alternatively, if we write $U$ and $V$ column by column as

$$U = [u_1, u_2, \cdots, u_n], \qquad V = [v_1, v_2, \cdots, v_n]$$

then the same solution can be expressed as

$$x = \sum_{i=1}^{n} \frac{u_i^T b}{\sigma_i} v_i$$
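To see why this column form agrees with $x = V\Sigma^{-1}U^T b$: multiplication by an orthogonal matrix preserves the 2-norm, so

$$\|Ax - b\|_2 = \|U\Sigma V^T x - b\|_2 = \|\Sigma V^T x - U^T b\|_2,$$

and writing $y = V^T x$, the minimum is attained at $y_i = u_i^T b / \sigma_i$, which gives $x = Vy = \sum_i (u_i^T b / \sigma_i)\, v_i$.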
For a chosen $k$ with $1 < k < n$, let

$$U_k = U(:, 1:k), \qquad V_k = V(:, 1:k), \qquad \Sigma_k = \Sigma(1:k,\, 1:k)$$
From these we obtain $A_k$, known as the truncated SVD (TSVD) of $A$:

$$A_k = U_k \Sigma_k V_k^T$$
By the Eckart–Young theorem, this truncation satisfies

$$\|A - A_k\|_2 = \sigma_{k+1}$$
As before, there are two equivalent ways to compute the approximate solution $x_k$:

$$x_k = \sum_{i=1}^{k} \frac{u_i^T b}{\sigma_i} v_i = V_k \Sigma_k^{-1} U_k^T b$$
Finally, we can compute the error and the residual:

$$e = \|x - x_k\|_2, \qquad r = \|A_k x_k - b\|_2$$

This pins down our goal: for a given matrix $A$, we want to compute $x_k$, $e$, and $r$.
Given this task, we can first implement it in MATLAB: MATLAB has a wealth of built-in functions, is convenient to use, and runs fast. Moreover, we can use the MATLAB results later to verify the correctness of our CUDA implementation.
```matlab
clear all
clc
m = 256;
n = 128;
% load MyMatrix.txt   % read the matrix from a file
% A = MyMatrix;
A = rand(m, n); % or generate a random matrix
b = A * ones(n, 1);
[U, S, V] = svd(A); % compute the SVD
sigma = diag(S);
% x = V * inv(S(1:n, :)) * U(:, 1:n)' * b; % solution 1: matrix form
x = zeros(n, 1);
for i = 1:n
x = x + V(:,i) * ((U(:,i)' * b) / sigma(i)); % solution 2: column-wise sum
end
k = 2;
for i = 2:n-1
fprintf('sigma(%d)=%e, sigma(%d)=%e, sigma(%d)/sigma(%d)=%e\n', i+1, sigma(i+1), i, sigma(i), i+1, i, sigma(i+1) / sigma(i));
if sigma(i+1) / sigma(i) <= 1e-3
k = i;
break
else
k = i;
end
end
U_k = U(:,1:k);
V_k = V(:,1:k);
sigma_k = S(1:k,1:k);
A_k = U_k * sigma_k * V_k';
x_k = zeros(n, 1);
for i = 1:k
x_k = x_k + V(:,i) * ((U(:,i)' * b) / sigma(i)); % truncated sum
end
% x_k = V_k * inv(sigma_k) * U_k' * b % equivalent matrix form
e = norm(x - x_k) % error
r = norm(A_k * x_k - b) % residual
```
As you can see, the MATLAB implementation is fairly easy; it essentially follows the formulas line by line.

Now we turn to CUDA proper. Before writing any code, it is worth thinking about which operations we will need and which functions correspond to them.

This problem clearly involves matrix multiplication, computing an SVD, 2-norm evaluation, matrix transposition, and matrix inversion. The natural plan is to implement a function for each operation and then call them in turn. However, implementing all of this ourselves in CUDA would be fairly hard, and our hand-rolled versions would certainly be slower than the APIs NVIDIA provides (unless you are a programming genius). I suspect most readers are, like me, ordinary beginners, so let's happily lean on the official libraries! The boilerplate that every such program starts with is sketched below.
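Every program built on these libraries begins the same way: create a cuBLAS handle and a cuSOLVER handle, and destroy them at the end. A minimal sketch (the handle names match the full listing at the end of this post):

```cpp
#include <assert.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

int main() {
    cublasHandle_t cublasH;
    cusolverDnHandle_t cusolverH;
    // create the library contexts once and reuse them for every call
    assert(cublasCreate(&cublasH) == CUBLAS_STATUS_SUCCESS);
    assert(cusolverDnCreate(&cusolverH) == CUSOLVER_STATUS_SUCCESS);

    // ... cuBLAS / cuSOLVER calls go here ...

    cublasDestroy(cublasH);       // release the contexts when done
    cusolverDnDestroy(cusolverH);
    return 0;
}
```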
The first step is to read in the matrix $A$ and allocate space for it. Note that this still happens on the host, i.e. the CPU handles this task. The reading function is given below; the file format is: the first two numbers are the matrix's row and column counts, followed by the matrix entries themselves.
```cpp
float* readMatrix(char* filename) {
int i, j;
FILE *file = fopen(filename, "r");
if (file == NULL)
{
printf("[ERROR] File '%s' does not exist in the current directory\n", filename);
printf("[ERROR] Program exited.\n");
exit(1);
}
int m, n;
float float_m, float_n;
fscanf(file, "%f", &float_m); // the first two numbers are the dimensions
fscanf(file, "%f", &float_n);
m = (int) float_m;
n = (int) float_n;
float *A = (float *) malloc(m * n * sizeof(float));
for (i = 0; i < m; i++)
for (j = 0; j < n; j++)
{
fscanf(file, "%f", &A[i + j * m]); // column-major: entry (i, j) at offset i + j * m
}
fclose(file);
printf("[INFO] Test matrix has been successfully imported, size=%dx%d.\n", m, n);
return A;
}
```
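One detail worth stressing: cuBLAS and cuSOLVER assume column-major storage, which is why the reader above stores entry $(i, j)$ at offset `i + j * m`. A tiny CPU-only sketch of that layout:

```cpp
#include <stdio.h>

int main() {
    // a 2 x 3 matrix stored column by column, as cuBLAS expects:
    // [ 1 3 5 ]
    // [ 2 4 6 ]
    int m = 2, n = 3;
    float A[] = {1, 2, 3, 4, 5, 6}; // columns (1,2), (3,4), (5,6)
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++)
            printf("%4.0f", A[i + j * m]); // entry (i, j) in column-major order
        printf("\n");
    }
    return 0;
}
```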
Because the computation runs on the GPU, the data must first be uploaded to it. Analogous to `malloc()`, `cudaMalloc()` allocates space on the GPU. Once the space is allocated, data on the host can be copied to the GPU, and likewise data on the GPU can be copied back to the host; the function for this is `cudaMemcpy(device_x, host_x, n * sizeof(float), cudaMemcpyHostToDevice)`. The example shown copies `host_x` to the GPU; both `device_x` and `host_x` must be pointers. The last argument specifies the transfer direction and can take the values in the table below (a round-trip sketch follows the table).
| Constant | Transfer direction |
|---|---|
| cudaMemcpyHostToDevice | host → device |
| cudaMemcpyDeviceToHost | device → host |
| cudaMemcpyDeviceToDevice | device → device |
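Putting allocation and transfer together, a minimal host-to-device round trip looks like this (a sketch; the array contents are arbitrary):

```cpp
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    const int n = 8;
    float host_x[n], back[n];
    for (int i = 0; i < n; i++) host_x[i] = (float)i;

    float *device_x;
    cudaMalloc((void**)&device_x, n * sizeof(float));                        // allocate on the GPU
    cudaMemcpy(device_x, host_x, n * sizeof(float), cudaMemcpyHostToDevice); // upload
    cudaMemcpy(back, device_x, n * sizeof(float), cudaMemcpyDeviceToHost);   // download
    cudaFree(device_x);

    for (int i = 0; i < n; i++)
        printf("back[%d] = %.1f\n", i, back[i]);
    return 0;
}
```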
**Unified memory**

With the approach above you may end up copying data back and forth frequently, which is inconvenient. Fortunately, CUDA supports unified memory: an allocation made this way is accessible from both the host and the device, which makes programs simpler to write. Here is how to allocate it.
It suffices to call `cudaMallocManaged()`, whose signature is `__host__ cudaError_t cudaMallocManaged(void** devPtr, size_t size, unsigned int flags = cudaMemAttachGlobal)`. For example:

```cpp
cudaMallocManaged(&x, N * sizeof(float)); // x is now accessible from both host and device
```
Although this makes the code easier to write, it also makes it easier to introduce subtle bugs that are hard to diagnose, so this article sticks with the explicit allocate-and-copy approach. Still, for reference, a minimal managed-memory example follows.
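A minimal sketch of the managed-memory style (the kernel `scale` and the sizes are made up for illustration):

```cpp
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int N = 1024;
    float *x;
    cudaMallocManaged(&x, N * sizeof(float)); // visible to both host and device
    for (int i = 0; i < N; i++) x[i] = 1.0f;  // host writes directly
    scale<<<(N + 255) / 256, 256>>>(x, N);    // device reads/writes the same pointer
    cudaDeviceSynchronize();                  // wait before the host touches x again
    printf("x[0] = %f\n", x[0]);              // prints 2.0
    cudaFree(x);
    return 0;
}
```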
We first read the data from the file and store it on the host, then allocate space on the GPU and transfer the host data over. The code is as follows:
```cpp
float *host_A = readMatrix("MyMatrix.txt"); // A is m x n
float *host_x = matrixOnes(n); // x is n x 1
float *host_b = (float*)malloc(sizeof(float) * m);
float *device_b; // b is m x 1
float *device_x; // x is n x 1
float *device_A; // A is m x n
cudaMalloc((void**)&device_x, n * sizeof(float));
cudaMalloc((void**)&device_b, m * sizeof(float));
cudaMalloc((void**)&device_A, m * n * sizeof(float));
cudaMemcpy(device_A, host_A, m * n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(device_x, host_x, n * sizeof(float), cudaMemcpyHostToDevice);
```
Next we compute $b = Ax$: $A$ and $x$ are known, and we solve for $b$. Note that $b$ and $x$ are really just vectors, since they are one-dimensional, so the function we use is `cublasSgemv()`. This function is fairly involved and takes many parameters: the `S` stands for single precision, and the trailing `mv` stands for matrix-vector multiplication. I will cover the function in detail in the next post; for now, just follow how it is used here.
Our steps are: upload $A$ and $x$ to the device with `cublasSetMatrix()` and `cublasSetVector()`, then call `cublasSgemv()` to perform the matrix-vector multiplication. The code is as follows:
```cpp
cublasSetMatrix(m, n, sizeof(float), host_A, m, device_A, m);
cublasSetVector(n, sizeof(float), host_x, 1, device_x, 1);
float alpha = 1.0f;
float beta = 0.0f;
cublas_status = cublasSgemv(cublasH, CUBLAS_OP_N, m, n, &alpha, device_A, m, device_x, 1, &beta, device_b, 1);
```
All this code does is compute $b = Ax$. (In general `cublasSgemv()` computes $y = \alpha\,\mathrm{op}(A)\,x + \beta y$, so with $\alpha = 1$ and $\beta = 0$ it reduces to a plain matrix-vector product; a spot-check follows.)
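Because $x$ is the all-ones vector, every entry of $b$ should equal the corresponding row sum of $A$. A quick sketch of that check, continuing with the variables from the snippets above:

```cpp
// spot-check: b[i] should be the sum of row i of A, because x is all ones
float *check_b = (float*)malloc(m * sizeof(float));
cublasGetVector(m, sizeof(float), device_b, 1, check_b, 1);
for (int i = 0; i < 4 && i < m; i++) {
    float row_sum = 0.0f;
    for (int j = 0; j < n; j++)
        row_sum += host_A[i + j * m]; // column-major access
    printf("b[%d] = %f, row sum = %f\n", i, check_b[i], row_sum);
}
free(check_b);
```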
Next we need the SVD of $A$. In MATLAB this is just a call to the built-in `svd(matrix)`; does CUDA have an equivalent? It does. In the previous step we used `cublas`, which offers many more useful routines; now we turn to `cusolver`, which likewise provides practical features and wraps a number of highly optimized algorithms, such as the SVD and other matrix factorizations. It must be said that `cusolverDnSgesvd()` is also very complex; if you want to know what every parameter means you will need the official documentation, but I suggest just following this example.
As in the previous step: first query the required workspace size with `cusolverDnSgesvd_bufferSize()`, then call `cusolverDnSgesvd()` to compute the SVD. The code is as follows:
```cpp
int bufferSize = 0;
cusolver_status = cusolverDnSgesvd_bufferSize(cusolverH, m, n, &bufferSize); // workspace for an m x n problem
float *buffer;
cudaMalloc((void**)&buffer, bufferSize * sizeof(float));
float *device_S, *device_U, *device_VT;
int *devInfo;
cudaMalloc((void**)&devInfo, sizeof(int));
cudaMalloc((void**)&device_S, n * sizeof(float)); // min(m, n) = n singular values
cudaMalloc((void**)&device_U, m * m * sizeof(float)); // jobu = 'A': U is m x m
cudaMalloc((void**)&device_VT, n * n * sizeof(float)); // jobvt = 'A': V^T is n x n
cusolver_status = cusolverDnSgesvd(cusolverH, 'A', 'A', m, n, device_A, m, device_S, device_U, m, device_VT, n, buffer, bufferSize, NULL, devInfo);
```
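gesvd reports success or failure through `devInfo`, so it is worth copying that flag back before trusting the factors. A small sketch continuing the snippet above (`devInfo > 0` means the bidiagonal solver did not fully converge):

```cpp
// devInfo == 0 means success; > 0 means that many superdiagonals failed to converge
int info = 0;
cudaMemcpy(&info, devInfo, sizeof(int), cudaMemcpyDeviceToHost);
if (info != 0) {
    printf("[ERROR] cusolverDnSgesvd failed, devInfo = %d\n", info);
}
```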
I'll cover the rest in the next update!
Putting it all together, the full code for this experiment is given below for reference.

```cpp
/**
* @file jz544_hw4_code.cu
* @author Jingkai Zhang (jz544) ([email protected])
* @date 2022-04-07
*
* run the program with the following command:
 * nvcc -o jz544_hw4_code jz544_hw4_code.cu -lcublas -lcusolver
* ./jz544_hw4_code
*/
#include <stdio.h>      // printf, fscanf
#include <stdlib.h>     // malloc, exit, drand48
#include <assert.h>     // assert
#include <sys/time.h>   // gettimeofday
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>
#define IDX2C(i,j,ld) (((j)*ld)+(i))
#define READ_MATRIX_FROM_FILE 0 // 1 means read the matrix from a file, 0 means generate it
#define MATRIX_SIZE_ROW (1 << 13) // used only when READ_MATRIX_FROM_FILE is set to 0
#define MATRIX_SIZE_COL (1 << 13) // used only when READ_MATRIX_FROM_FILE is set to 0
double* generateMatrix(int row, int col, int *m, int *n);
double* readMatrix(char *filename, int *row, int *col);
double* matrixOnes(int size);
double* transpose(double *A, int m, int n);
double* matrixDiag(int n, double *sigmas);
__global__ void transpose(double *A, double *AT, int m, int n) {
int nx = blockIdx.x * blockDim.x + threadIdx.x;
int ny = blockIdx.y * blockDim.y + threadIdx.y;
if (nx < m && ny < n) {
// column-major transpose: AT(ny, nx) = A(nx, ny)
AT[nx * n + ny] = A[ny * m + nx];
}
}
// x[nx] = sum_{j < i} (u_j^T b / sigma_j) * v_j[nx], one block per entry of x
__global__ void calculate_x(double *x, double *u_t_times_b, double *sigma, double *V, int N, int i) {
int nx = blockIdx.x;
if (nx < N) {
for (int j = 0; j < i; j++) {
x[nx] += V[j * N + nx] * u_t_times_b[j] / sigma[j];
}
}
}
// write the leading k singular values onto the diagonal of the k x k matrix S_k
__global__ void get_S_k(double *S, double *S_k, int k, int M) {
int nx = blockIdx.x; // one block per column, nx = 0 .. k-1
if (nx < k) {
for (int i = nx * M; i < nx * M + M; i++) {
if (i == nx * M + nx) {
S_k[nx * M + nx] = S[nx]; // diagonal entry
} else {
S_k[i] = 0.0; // zero out the rest of the column
}
}
}
}
// copy the first k columns of an M x M column-major matrix into an M x k matrix
__global__ void get_U_or_V_k(double *U, double *U_k, int k, int M) {
int nx = blockIdx.x; // one block per column, nx = 0 .. k-1
if (nx < k) {
for (int i = nx * M; i < nx * M + M; i++) {
U_k[i] = U[i];
}
}
}
int main() {
int m; // define size of matrix A
int n;
printf("Hello, World!\n");
cublasHandle_t cublasH;
cusolverDnHandle_t cusolverH;
// cudaError_t cudaStat1 = cudaSuccess;
cublasStatus_t cublas_status = CUBLAS_STATUS_SUCCESS;
cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS;
cublas_status = cublasCreate(&cublasH);
assert(CUBLAS_STATUS_SUCCESS == cublas_status);
cusolver_status = cusolverDnCreate(&cusolverH);
assert(CUSOLVER_STATUS_SUCCESS == cusolver_status);
/**************************************************************************
* Step 1 & 2 & 3:
* create n-length all one matrix x
* read matrix A from MyMatrix.txt
* copy data from host to device
*************************************************************************/
double *host_A;
if (READ_MATRIX_FROM_FILE) {
host_A = readMatrix("MyMatrix.txt", &m, &n); // A'shape = m * n
} else {
printf("[INFO] generate matrix, row=%d, col=%d\n", MATRIX_SIZE_ROW, MATRIX_SIZE_COL);
host_A = generateMatrix(MATRIX_SIZE_ROW, MATRIX_SIZE_COL, &m, &n);
}
double *host_x = matrixOnes(n); // x'shape = n * 1
double *host_b = (double*)malloc(sizeof(double) * m);
double *device_b; // b'shape = m * 1
double *device_x; // x'shape = n * 1
double *device_A; // A'shape = m * n
struct timeval start, end; // start and stop timer
float el_time; // elapsed time
gettimeofday(&start, NULL); // start counting time
cudaMalloc((void**)&device_x, n * sizeof(double));
cudaMalloc((void**)&device_b, m * sizeof(double));
cudaMalloc((void**)&device_A, m * n * sizeof(double));
cudaMemcpy(device_A, host_A, m * n * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(device_x, host_x, n * sizeof(double), cudaMemcpyHostToDevice);
/*************************************************************
* Step 4:
* set b = Ax
*************************************************************/
cublasSetMatrix(m, n, sizeof(double), host_A, m, device_A, m);
cublasSetVector(n, sizeof(double), host_x, 1, device_x, 1);
double alpha = 1.0;
double beta = 0.0;
cublas_status = cublasDgemv(cublasH, CUBLAS_OP_N, m, n, &alpha, device_A, m, device_x, 1, &beta, device_b, 1);
/**
* Step 5:
* cusolverDnDgesvd to get A = U*S*V^T
* first call cusolverDnDgesvd_bufferSize to get the buffer size,
* then perform the SVD using cusolverDnDgesvd
*/
int bufferSize = 0;
cusolver_status = cusolverDnDgesvd_bufferSize(cusolverH, m, n, &bufferSize);
double *buffer;
cudaMalloc((void**)&buffer, bufferSize * sizeof(double));
double *device_S, *device_U, *device_VT;
int *devInfo;
cudaMalloc((void**)&devInfo, sizeof(int));
cudaMalloc((void**)&device_S, n * sizeof(double));
cudaMalloc((void**)&device_U, m * m * sizeof(double));
cudaMalloc((void**)&device_VT, n * n * sizeof(double));
cusolver_status = cusolverDnDgesvd(cusolverH, 'A', 'A', m, n, device_A, m, device_S, device_U, m, device_VT, n, buffer, bufferSize, NULL, devInfo);
/*************************************************************
* Step 6:
* find true value of x
* find k either on the device or on the host
************************************************************/
// launch enough threads to cover the full n x n transpose
dim3 block(16, 16);
dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
double *device_U_T_times_b; // holds U^T * b, length m
cudaMalloc((void**)&device_U_T_times_b, m * sizeof(double));
cublas_status = cublasDgemv(cublasH, CUBLAS_OP_T, m, m, &alpha, device_U, m, device_b, 1, &beta, device_U_T_times_b, 1);
double *device_true_x;
cudaMalloc((void**)&device_true_x, n * sizeof(double));
cudaMemset(device_true_x, 0, n * sizeof(double)); // calculate_x accumulates, so zero it first
double *device_V;
cudaMalloc((void**)&device_V, n * n * sizeof(double));
transpose<<<grid, block>>>(device_VT, device_V, n, n);
calculate_x<<<n, 1>>>(device_true_x, device_U_T_times_b, device_S, device_V, n, n); // one block per entry
int k = 1;
double *host_S = (double*)malloc(sizeof(double) * n);
cublasGetVector(n, sizeof(double), device_S, 1, host_S, 1);
// pick k at the first drop of three orders of magnitude between consecutive singular values
for (int i = 0; i < n - 1; i++) {
if (host_S[i+1] / host_S[i] <= 1e-3) {
k = i;
break;
} else {
k = i;
}
}
k = k + 1; // convert to a 1-based count, matching the MATLAB index
/**
* Step 7 & 8:
* form U_k, V_k, S_k
* b_k = (U_k)' * b
* d_k = inv(S_k) * b_k
* x_k = V_k * d_k
*/
double *device_x_k;
cudaMalloc((void**)&device_x_k, n * sizeof(double));
cudaMemset(device_x_k, 0, n * sizeof(double)); // zero before the kernel accumulates into it
calculate_x<<<n, 1>>>(device_x_k, device_U_T_times_b, device_S, device_V, n, k);
/**
* Step 9 to 13
* compute the error = ||x_k - x_true||_2
* compute the Ax_k - b
* compute the residual error = ||Ax_k - b||_2
* move x_k, e, r to the host
* print e, r and the first 8 entries of x_k
*/
double *device_A_k, *device_U_k, *device_V_k, *device_S_k;
cudaMalloc((void**)&device_A_k, m * n * sizeof(double));
cudaMalloc((void**)&device_S_k, k * k * sizeof(double));
cudaMalloc((void**)&device_U_k, m * k * sizeof(double));
cudaMalloc((void**)&device_V_k, n * k * sizeof(double));
get_S_k<<<k, 1>>>(device_S, device_S_k, k, k);
get_U_or_V_k<<<k, 1>>>(device_U, device_U_k, k, m);
get_U_or_V_k<<<k, 1>>>(device_V, device_V_k, k, n);
double *device_w;
cudaMalloc((void**)&device_w, m * k * sizeof(double));
alpha = 1.0f;
cublasDgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_N, m, k, k, &alpha, device_U_k, m, device_S_k, k, &beta, device_w, m);
cublasDgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k, &alpha, device_w, m, device_V_k, n, &beta, device_A_k, m);
alpha = -1.0;
// copy device_true_x so the axpy below does not overwrite it
double *copy_device_true_x;
cudaMalloc((void**)&copy_device_true_x, n * sizeof(double));
cudaMemcpy(copy_device_true_x, device_true_x, n * sizeof(double), cudaMemcpyDeviceToDevice);
cublasDaxpy(cublasH, n, &alpha, device_x_k, 1, copy_device_true_x, 1);
double error;
cublasDnrm2(cublasH, n, copy_device_true_x, 1, &error);
double *device_A_k_times_x_k;
cudaMalloc((void**)&device_A_k_times_x_k, m * sizeof(double));
// restore device_A, which cusolverDnDgesvd overwrote (the residual below uses device_A_k)
cudaMemcpy(device_A, host_A, m * n * sizeof(double), cudaMemcpyHostToDevice);
alpha = 1.0f;
cublas_status = cublasDgemv(cublasH, CUBLAS_OP_N, m, n, &alpha, device_A_k, m, device_x_k, 1, &beta, device_A_k_times_x_k, 1);
alpha = -1.0f;
cublasDaxpy(cublasH, m, &alpha, device_b, 1, device_A_k_times_x_k, 1);
double residual_error;
cublasDnrm2(cublasH, m, device_A_k_times_x_k, 1, &residual_error);
cudaDeviceSynchronize();
gettimeofday(&end, NULL); // stop counting time
el_time = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
printf("[INFO] k = %d, same as MATLAB index\n", k);
printf("[INFO] error = %e, residual error = %e\n", error, residual_error);
// print the first 8 entries of x_k
double *host_x_k = (double*)malloc(sizeof(double) * n);
double *host_x_true = (double*)malloc(sizeof(double) * n);
cublasGetVector(n, sizeof(double), device_x_k, 1, host_x_k, 1);
cublasGetVector(n, sizeof(double), device_true_x, 1, host_x_true, 1);
for (int i = 0; i < 8; i++) {
printf("[INFO] x_k[%d] = %e\n", i, host_x_k[i]);
}
printf("[INFO] time consumption: %e s\n", el_time);
// release the library handles
cublasDestroy(cublasH);
cusolverDnDestroy(cusolverH);
cudaDeviceSynchronize();
return 0;
}
double* generateMatrix(int row, int col, int *m, int *n) {
double *A = (double *) malloc(row * col * sizeof(double));
for (int i = 0; i < row; i++) {
for (int j = 0; j < col; j++) {
A[IDX2C(i, j, row)] = (double) drand48();
}
}
*m = row;
*n = col;
return A;
}
double* readMatrix(char* filename, int *row, int *col) {
int i, j;
FILE *file = fopen(filename, "r");
if (file == NULL)
{
printf("[ERROR] File '%s' does not exist in the current directory\n", filename);
printf("[ERROR] Program exited.\n");
exit(1);
}
int m, n;
double double_m, double_n;
fscanf(file, "%lf", &double_m); // the first two numbers are the dimensions
fscanf(file, "%lf", &double_n);
m = (int) double_m;
n = (int) double_n;
*row = m;
*col = n;
double *A = (double *) malloc(m * n * sizeof(double));
for (i = 0; i < m; i++)
for (j = 0; j < n; j++)
{
fscanf(file, "%lf", &A[IDX2C(i, j, m)]); // column-major storage
}
fclose(file);
printf("[INFO] Test matrix has been successfully imported, size=%dx%d.\n", m, n);
return A;
}
double* matrixOnes(int size) {
int i;
double *mat = (double *) malloc(size * sizeof(double));
for (i = 0; i < size; i++)
{
mat[i] = 1.0;
}
return mat;
}
double* matrixDiag(int n, double *sigmas) {
// form a diagonal matrix from sigmas
int i, j;
int count = 0;
double *diag = (double *) malloc(n * n * sizeof(double));
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
if (i == j) {
diag[i + j * n] = sigmas[count];
count += 1;
}
else {
diag[i + j * n] = 0.0;
}
}
}
return diag;
}
double* transpose(double *A, int m, int n) {
int i, j;
double *B = (double *) malloc(m * n * sizeof(double));
for (i = 0; i < m; i++)
for (j = 0; j < n; j++)
{
B[i*n + j] = A[n*j + i];
}
return B;
}
```
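A final remark: the listing asserts on cuBLAS/cuSOLVER status codes but ignores most CUDA runtime return values. A common pattern is a small checking macro; here is a sketch (the name `CUDA_CHECK` is mine, not part of CUDA):

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// wrap every runtime call: aborts with file/line information on failure
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "[CUDA ERROR] %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(1);                                              \
        }                                                         \
    } while (0)

int main() {
    double *p;
    CUDA_CHECK(cudaMalloc((void**)&p, 1024 * sizeof(double)));
    CUDA_CHECK(cudaMemset(p, 0, 1024 * sizeof(double)));
    CUDA_CHECK(cudaFree(p));
    return 0;
}
```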