These are my personal study notes for Algo C++, the new C++ course for AI algorithms from 手写AI (link); they are only for my own reference.
This lesson focuses on implementing algorithms on top of matrices.
The course outline is summarized in the mind map below.
First, a quick review of the relevant matrix background.
Matrix multiplication is defined as:
$$\begin{bmatrix}a & b\\ d & e\end{bmatrix}\times\begin{bmatrix}1 & 3\\ 2 & 4\end{bmatrix}=\begin{bmatrix}a\cdot 1+b\cdot 2 & a\cdot 3+b\cdot 4\\ d\cdot 1+e\cdot 2 & d\cdot 3+e\cdot 4\end{bmatrix}$$
Mnemonic: C[r][c] = multiply-accumulate of row r of A with column c of B.
Reference: https://www.cnblogs.com/ljy-endl/p/11411665.html
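As a quick sanity check of this definition, here is a minimal numpy sketch (numpy is assumed and is not part of the course code):

import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[1., 3.],
              [2., 4.]])

# C[r][c] = multiply-accumulate of row r of A with column c of B
C = np.zeros((A.shape[0], B.shape[1]))
for r in range(A.shape[0]):
    for c in range(B.shape[1]):
        C[r, c] = np.dot(A[r, :], B[:, c])

print(np.allclose(C, A @ B))  # True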
Matrix differentiation:
For $A\cdot B = C$, let $L$ be a loss function of $C$.
Let $G = \dfrac{\partial L}{\partial C}$ (if $C$ is differentiated with respect to $A$ directly, $G$ is taken to be an all-ones matrix of the same size as $C$). Then:
$$\dfrac{\partial L}{\partial A}=G\cdot B^T \qquad \dfrac{\partial L}{\partial B}=A^T \cdot G$$
Example of matrix differentiation:
import torch
import torch.nn as nn
X = nn.parameter.Parameter(torch.tensor([
    [1, 2],
    [2, 1],
    [0, 2]
], dtype=torch.float32))
theta = nn.parameter.Parameter(torch.tensor([
    [5, 1, 0],
    [2, 3, 1]
], dtype=torch.float32))
loss = (X @ theta).sum()
loss.backward()
print(f"Loss = {loss.item()}")
print(f"dloss / dX = \n{X.grad}")
print(f"dloss / dtheta = \n{theta.grad}")
print("================ Manually computed matrix gradients ===================")
G = torch.ones_like(X @ theta)
X_grad = G @ theta.data.T
theta_grad = X.data.T @ G
print(f"dloss / dX = \n{X_grad}")
print(f"dloss / dtheta = \n{theta_grad}")
The output is as follows:
Loss = 48.0
dloss / dX =
tensor([[6., 6.],
[6., 6.],
[6., 6.]])
dloss / dtheta =
tensor([[3., 3., 3.],
[5., 5., 5.]])
================ Manually computed matrix gradients ===================
dloss / dX =
tensor([[6., 6.],
[6., 6.],
[6., 6.]])
dloss / dtheta =
tensor([[3., 3., 3.],
[5., 5., 5.]])
The verification above shows that the results computed with the derived formulas match those from gradient backpropagation, which confirms that the derived formulas are correct.
Consider the matrix product $A \cdot B = C$.
Consider the loss function $L = \sum^m_{i}\sum^n_{j}(C_{ij} - p)^2$.
Consider the derivative of $L$ with respect to each entry of $C$: $\nabla C_{ij} = \frac{\partial L}{\partial C_{ij}}$.
When $A$, $B$ and $C$ are all 2x2 matrices, define $G$ as the derivative of $L$ with respect to $C$:
$$A = \begin{bmatrix} a & b\\ c & d \end{bmatrix} \quad B = \begin{bmatrix} e & f \\ g & h \end{bmatrix} \quad C = \begin{bmatrix} i & j \\ k & l \end{bmatrix} \quad G = \frac{\partial L}{\partial C} = \begin{bmatrix} \frac{\partial L}{\partial i} & \frac{\partial L}{\partial j} \\ \frac{\partial L}{\partial k} & \frac{\partial L}{\partial l} \end{bmatrix} = \begin{bmatrix} w & x \\ y & z \end{bmatrix}$$
Expanding the left-hand side $A \cdot B$:
$$C = \begin{bmatrix} i = ae + bg & j = af + bh\\ k = ce + dg & l = cf + dh \end{bmatrix}$$
The derivative of $L$ with respect to each entry of $A$ is $\nabla A_{ij} = \frac{\partial L}{\partial A_{ij}}$. By the chain rule:
$$\begin{aligned} \frac{\partial L}{\partial a} &= \frac{\partial L}{\partial i} \cdot \frac{\partial i}{\partial a} + \frac{\partial L}{\partial j} \cdot \frac{\partial j}{\partial a} \\ \frac{\partial L}{\partial b} &= \frac{\partial L}{\partial i} \cdot \frac{\partial i}{\partial b} + \frac{\partial L}{\partial j} \cdot \frac{\partial j}{\partial b} \\ \frac{\partial L}{\partial c} &= \frac{\partial L}{\partial k} \cdot \frac{\partial k}{\partial c} + \frac{\partial L}{\partial l} \cdot \frac{\partial l}{\partial c} \\ \frac{\partial L}{\partial d} &= \frac{\partial L}{\partial k} \cdot \frac{\partial k}{\partial d} + \frac{\partial L}{\partial l} \cdot \frac{\partial l}{\partial d} \end{aligned}$$
$$\begin{aligned} \frac{\partial L}{\partial a} &= we + xf \\ \frac{\partial L}{\partial b} &= wg + xh \\ \frac{\partial L}{\partial c} &= ye + zf \\ \frac{\partial L}{\partial d} &= yg + zh \end{aligned}$$
Therefore the gradient with respect to $A$ is
$$\frac{\partial L}{\partial A} = \begin{bmatrix} we + xf & wg + xh\\ ye + zf & yg + zh \end{bmatrix} = \begin{bmatrix} w & x\\ y & z \end{bmatrix} \begin{bmatrix} e & g\\ f & h \end{bmatrix}$$
$$\frac{\partial L}{\partial A} = G \cdot B^T$$
Similarly, the gradient with respect to $B$ is
$$\begin{aligned} \frac{\partial L}{\partial e} &= wa + yc \\ \frac{\partial L}{\partial f} &= xa + zc \\ \frac{\partial L}{\partial g} &= wb + yd \\ \frac{\partial L}{\partial h} &= xb + zd \end{aligned}$$
$$\frac{\partial L}{\partial B} = \begin{bmatrix} wa + yc & xa + zc\\ wb + yd & xb + zd \end{bmatrix} = \begin{bmatrix} a & c\\ b & d \end{bmatrix} \begin{bmatrix} w & x\\ y & z \end{bmatrix}$$
$$\frac{\partial L}{\partial B} = A^T \cdot G$$
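The same formulas can also be checked numerically for a concrete loss. Below is a small numpy sketch (numpy assumed; the loss L = Σ(C_ij − p)² with p = 0.5 is just an illustrative choice) that compares the analytic gradient G·Bᵀ against finite differences:

import numpy as np

np.random.seed(0)
A = np.random.randn(2, 3)
B = np.random.randn(3, 2)
p = 0.5

def loss(A, B):
    return (((A @ B) - p) ** 2).sum()

# analytic gradients: G = dL/dC = 2 * (C - p)
G = 2 * (A @ B - p)
dA = G @ B.T
dB = A.T @ G

# finite-difference check of dL/dA
eps = 1e-5
dA_num = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        Ap = A.copy(); Ap[i, j] += eps
        Am = A.copy(); Am[i, j] -= eps
        dA_num[i, j] = (loss(Ap, B) - loss(Am, B)) / (2 * eps)

print(np.allclose(dA, dA_num, atol=1e-4))  # True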
Matrix.hpp
#ifndef GEMM_HPP
#define GEMM_HPP
#include <iostream>
#include <vector>
#include <initializer_list>
/* A custom Matrix class */
class Matrix{
public:
Matrix();
Matrix(int rows, int cols, const std::initializer_list<float>& pdata={});
/* Operator overloading so that () element access is supported */
const float& operator()(int irow, int icol)const {return data_[irow * cols_ + icol];}
float& operator()(int irow, int icol){return data_[irow * cols_ + icol];}
Matrix operator*(float value);
Matrix operator-(const Matrix& other) const;
int rows() const{return rows_;}
int cols() const{return cols_;}
Matrix view(int rows, int cols) const;
Matrix power(float y) const;
float reduce_sum() const;
float* ptr() const{return (float*)data_.data();}
Matrix gemm(const Matrix& other, bool at=false, bool bt=false, float alpha=1.0f, float beta=0.0f);
Matrix inv();
private:
int rows_ = 0;
int cols_ = 0;
std::vector<float> data_;
};
/* Global operator overload so a matrix can be printed with cout << m; */
std::ostream& operator << (std::ostream& out, const Matrix& m);
/* Free-function wrapper around gemm */
Matrix gemm(const Matrix& a, bool ta, const Matrix& b, bool tb, float alpha, float beta);
Matrix inverse(const Matrix& a);
#endif // GEMM_HPP
The Matrix.hpp header defines a Matrix class that covers the basic matrix operations plus a few higher-level ones. The () operator is overloaded so elements can be accessed array-style, e.g. M(0,0). Matrix.hpp also overloads the output stream operator for Matrix, so a matrix can be printed with cout.
Matrix.cpp
#include <cmath>
#include <cstdio>
#include <iostream>
#include "openblas/cblas.h"
#include "openblas/lapacke.h"
#include "matrix.hpp"
Matrix::Matrix(){}
Matrix::Matrix(int rows, int cols, const std::initializer_list<float>& pdata){
this->rows_ = rows;
this->cols_ = cols;
this->data_ = pdata;
if(this->data_.size() < rows * cols)
this->data_.resize(rows * cols);
}
Matrix Matrix::gemm(const Matrix& other, bool at, bool bt, float alpha, float beta){
return ::gemm(*this, at, other, bt, alpha, beta);
}
Matrix Matrix::view(int rows, int cols) const{
if(rows * cols != this->rows_ * this->cols_){
printf("Invalid view to %d x %d\n", rows, cols);
return Matrix();
}
Matrix newmat = *this;
newmat.rows_ = rows;
newmat.cols_ = cols;
return newmat;
}
Matrix Matrix::operator-(const Matrix& other) const{
Matrix output = *this;
auto p0 = output.ptr();
auto p1 = other.ptr();
for(int i = 0; i < output.data_.size(); ++i)
*p0++ -= *p1++;
return output;
}
Matrix Matrix::power(float y) const{
Matrix output = *this;
auto p0 = output.ptr();
for(int i = 0; i < output.data_.size(); ++i, ++p0)
*p0 = std::pow(*p0, y);
return output;
}
float Matrix::reduce_sum() const{
auto p0 = this->ptr();
float output = 0;
for(int i = 0; i < this->data_.size(); ++i)
output += *p0++;
return output;
}
Matrix Matrix::inv(){
return ::inverse(*this);
}
Matrix Matrix::operator*(float value){
Matrix m = *this;
for(int i = 0; i < data_.size(); ++i)
m.data_[i] *= value;
return m;
}
std::ostream& operator << (std::ostream& out, const Matrix& m){
for(int i = 0; i < m.rows(); ++i){
for(int j = 0; j < m.cols(); ++j){
out << m(i, j) << "\t";
}
out << "\n";
}
return out;
}
Matrix gemm(const Matrix& a, bool ta, const Matrix& b, bool tb, float alpha, float beta){
int a_elastic_rows = ta ? a.cols() : a.rows(); /* if transposed, swap the dimensions */
int a_elastic_cols = ta ? a.rows() : a.cols(); /* if transposed, swap the dimensions */
int b_elastic_rows = tb ? b.cols() : b.rows(); /* if transposed, swap the dimensions */
int b_elastic_cols = tb ? b.rows() : b.cols(); /* if transposed, swap the dimensions */
/* c takes the post-transpose row and column counts */
Matrix c(a_elastic_rows, b_elastic_cols);
int m = a_elastic_rows;
int n = b_elastic_cols;
int k = a_elastic_cols;
int lda = a.cols();
int ldb = b.cols();
int ldc = c.cols();
/* This is the standard cblas-style gemm calling convention; it will come up again later */
cblas_sgemm(
CblasRowMajor, ta ? CblasTrans : CblasNoTrans, tb ? CblasTrans : CblasNoTrans,
m, n, k, alpha, a.ptr(), lda, b.ptr(), ldb, beta, c.ptr(), ldc
);
return c;
}
Matrix inverse(const Matrix& a){
if(a.rows() != a.cols()){
printf("Invalid to compute inverse matrix by %d x %d\n", a.rows(), a.cols());
return Matrix();
}
Matrix output = a;
int n = a.rows();
int *ipiv = new int[n];
/* LU factorization */
int code = LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, output.ptr(), n, ipiv);
if(code == 0){
/* use the LU factors to compute the inverse */
code = LAPACKE_sgetri(LAPACK_COL_MAJOR, n, output.ptr(), n, ipiv);
}
delete[] ipiv; /* free the pivot array on both the success and failure paths */
if(code != 0){
printf("LAPACKE inverse matrix failed, code = %d\n", code);
return Matrix();
}
return output;
}
The most important parts of Matrix.cpp are matrix multiplication and matrix inversion: the gemm function implements matrix multiplication by calling cblas_sgemm from the BLAS library, and the inverse function implements matrix inversion by using the LAPACK routines sgetrf and sgetri to perform an LU factorization and compute the inverse from it.
main.cpp
#include <cstdarg>
#include <cstdio>
#include <iostream>
#include <vector>
#include "matrix.hpp"
using namespace std;
namespace Application{
namespace logger{
#define INFO(...) Application::logger::__printf(__FILE__, __LINE__, __VA_ARGS__)
void __printf(const char* file, int line, const char* fmt, ...){
va_list vl;
va_start(vl, fmt);
printf("\e[32m[%s:%d]:\e[0m ", file, line);
vprintf(fmt, vl);
va_end(vl);
printf("\n");
}
};
struct Point{
float x, y;
Point(float x, float y):x(x), y(y){}
Point() = default;
};
Matrix mygemm(const Matrix& a, const Matrix& b){
Matrix c(a.rows(), b.cols());
for(int i = 0; i < c.rows(); ++i){
for(int j = 0; j < c.cols(); ++j){
float summary = 0;
for(int k = 0; k < a.cols(); ++k)
summary += a(i, k) * b(k, j);
c(i, j) = summary;
}
}
return c;
}
/* Solve for the affine transform matrix */
Matrix get_affine_transform(const vector<Point>& src, const vector<Point>& dst){
// P M Y
// x1, y1, 1, 0, 0, 0 m00 x1
// 0, 0, 0, x1, y1, 1 m01 y1
// x2, y2, 1, 0, 0, 0 m02 x2
// 0, 0, 0, x2, y2, 1 m10 y2
// x3, y3, 1, 0, 0, 0 m11 x3
// 0, 0, 0, x3, y3, 1 m12 y3
// Y = PM
// P.inv() @ Y = M
if(src.size() != 3 || dst.size() != 3){
printf("Invalid to compute affine transform, src.size = %d, dst.size = %d\n", (int)src.size(), (int)dst.size());
return Matrix();
}
Matrix P(6, 6, {
src[0].x, src[0].y, 1, 0, 0, 0,
0, 0, 0, src[0].x, src[0].y, 1,
src[1].x, src[1].y, 1, 0, 0, 0,
0, 0, 0, src[1].x, src[1].y, 1,
src[2].x, src[2].y, 1, 0, 0, 0,
0, 0, 0, src[2].x, src[2].y, 1
});
Matrix Y(6, 1, {
dst[0].x, dst[0].y, dst[1].x, dst[1].y, dst[2].x, dst[2].y
});
return P.inv().gemm(Y).view(2, 3);
}
void test_matrix(){
Matrix a1(2, 3, {
1, 2, 3,
4, 5, 6
});
Matrix b1(3, 2,{
3, 0,
2, 1,
0, 2
});
INFO("A1 @ B1 = ");
std::cout << a1.gemm(b1) << std::endl;
Matrix a2(3, 2, {
1, 4,
2, 5,
3, 6
});
INFO("A2.T @ B1 =");
std::cout << gemm(a2, true, b1, false, 1.0f, 0.0f) << std::endl;
INFO("A1 @ B1 = ");
std::cout << mygemm(a1, b1) << std::endl;
INFO("a2 * 2 = ");
std::cout << a2 * 2 << std::endl;
Matrix c(2, 2, {
1, 2,
3, 4
});
INFO("C.inv = ");
std::cout << c.inv() << std::endl;
std::cout << c.gemm(c.inv()) << std::endl;
}
void test_affine(){
auto M = get_affine_transform({
Point(0, 0), Point(10, 0), Point(10, 10)
},
{
Point(20, 20), Point(100, 20), Point(100, 100)
});
INFO("Affine matrix = ");
std::cout << M << std::endl;
}
/* Test the matrix differentiation workflow */
void test_matrix_derivation(){
/* loss = (X @ theta).sum() */
Matrix X(
3, 2, {
1, 2,
2, 1,
0, 2
}
);
Matrix theta(
2, 3, {
5, 1, 0,
2, 3, 1
}
);
auto loss = X.gemm(theta).reduce_sum();
// G = dloss / d(X @ theta) = ones_like(X @ theta)
// dloss/dX = G @ theta.T
// dloss/dtheta = X.T @ G
INFO("Loss = %f", loss);
Matrix G(3, 3, {1, 1, 1, 1, 1, 1, 1, 1, 1});
INFO("dloss / dX = ");
std::cout << G.gemm(theta, false, true);
INFO("dloss / dtheta = ");
std::cout << X.gemm(G, true);
}
int run(){
test_matrix();
test_affine();
test_matrix_derivation();
return 0;
}
};
int main(){
return Application::run();
}
main.cpp mainly demonstrates how to use the custom Matrix class and runs a few simple tests on it.
The test_matrix function exercises the Matrix class with simple tests: matrix multiplication, multiplication with a transposed operand, the hand-written mygemm, scalar multiplication, matrix inversion, and so on.
The test_affine function solves for the affine transform defined by three point pairs. P is a 6x6 matrix built from the source points and Y is a 6x1 matrix holding the target coordinates; multiplying the inverse of P by Y gives the 6x1 solution M, which is then viewed as a 2x3 affine matrix and returned. See "YOLOv5推理详解及预处理高性能实现" for details.
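As a cross-check of this construction, here is a numpy sketch (not part of the course code) using the same three point pairs as test_affine:

import numpy as np

# source points (0,0), (10,0), (10,10) map to (20,20), (100,20), (100,100)
src = [(0, 0), (10, 0), (10, 10)]
dst = [(20, 20), (100, 20), (100, 100)]

P = np.zeros((6, 6))
Y = np.zeros((6, 1))
for k, ((x, y), (u, v)) in enumerate(zip(src, dst)):
    P[2 * k]     = [x, y, 1, 0, 0, 0]
    P[2 * k + 1] = [0, 0, 0, x, y, 1]
    Y[2 * k, 0] = u
    Y[2 * k + 1, 0] = v

M = np.linalg.solve(P, Y).reshape(2, 3)
print(M)  # [[8, 0, 20], [0, 8, 20]]: scale by 8, translate by 20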
The test_matrix_derivation function walks through matrix differentiation: given a loss computed from X and theta, it computes the partial derivatives of the loss with respect to X and theta.
The code also defines an INFO macro that calls Application::logger::__printf; its arguments __FILE__ and __LINE__ are built-in macros holding the current source file and line number, and __VA_ARGS__ is a placeholder for a variable number of arguments. Using this macro prints log messages prefixed with the file name and line number.
__printf takes a variable argument list and calls vprintf to do the formatted output. (from chatGPT)
cblas_sgemm is the matrix multiplication routine in the BLAS library; its parameters are as follows: (from chatGPT)
void cblas_sgemm(
const enum CBLAS_ORDER Order,
const enum CBLAS_TRANSPOSE TransA,
const enum CBLAS_TRANSPOSE TransB,
const int M,
const int N,
const int K,
const float alpha,
const float *A,
const int lda,
const float *B,
const int ldb,
const float beta,
float *C,
const int ldc
);
The parameters have the following meanings: Order selects row-major or column-major storage; TransA and TransB choose whether A and B are transposed; M, N and K are the dimensions of the product (after any transposition, op(A) is MxK, op(B) is KxN and C is MxN); alpha and beta are scalar factors; A, B and C are the data pointers; and lda, ldb, ldc are the leading dimensions, i.e. the stride between consecutive rows in row-major storage.
cblas_sgemm computes C = alpha * op(A) * op(B) + beta * C and stores the result in C, where op(A) is MxK, op(B) is KxN and C is MxN.
LU decomposition factors a matrix into a lower triangular matrix L and an upper triangular matrix U, so that the original matrix can be written as the product LU. With an LU factorization one can solve linear systems and compute determinants, inverses, and so on. (from chatGPT)
For an $n\times n$ matrix $A$, LU decomposition gives:
$$A = LU$$
where $L$ is an $n\times n$ lower triangular matrix and $U$ is an $n\times n$ upper triangular matrix. Concretely, $L$ and $U$ can be obtained from $A$ by Gaussian elimination:
$$\begin{bmatrix}a_{11}&a_{12}&a_{13}&\ldots&a_{1n}\\ a_{21}&a_{22}&a_{23}&\ldots&a_{2n}\\ a_{31}&a_{32}&a_{33}&\ldots&a_{3n}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ a_{n1}&a_{n2}&a_{n3}&\ldots&a_{nn}\end{bmatrix}= \begin{bmatrix}1& & & & \\ l_{21}&1& & & \\ l_{31}&l_{32}&1& & \\ \vdots&\vdots&\vdots&\ddots& \\ l_{n1}&l_{n2}&l_{n3}&\cdots&1\end{bmatrix} \begin{bmatrix}u_{11}&u_{12}&u_{13}&\cdots&u_{1n}\\ &u_{22}&u_{23}&\cdots&u_{2n}\\ & &u_{33}&\cdots&u_{3n}\\ & & &\ddots&\vdots\\ & & & &u_{nn}\end{bmatrix}$$
The inverse can then be computed from the $L$ and $U$ factors via:
$$A^{-1} = U^{-1}L^{-1}$$
where $L^{-1}$ is an $n\times n$ lower triangular matrix and $U^{-1}$ is an $n\times n$ upper triangular matrix.
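A small sketch of the idea in Python (assuming scipy is available; scipy.linalg.lu returns the pivoting permutation as a separate matrix P, so the factorization is A = P·L·U):

import numpy as np
from scipy.linalg import lu

A = np.array([[4., 3.],
              [6., 3.]])

P, L, U = lu(A)  # A = P @ L @ U, L lower triangular, U upper triangular
print(np.allclose(A, P @ L @ U))  # True

# inverse from the factors: A^-1 = U^-1 L^-1 P^-1 (P^-1 = P.T for a permutation)
A_inv = np.linalg.inv(U) @ np.linalg.inv(L) @ P.T
print(np.allclose(A_inv, np.linalg.inv(A)))  # True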
The LAPACK library provides a family of routines for solving linear systems and computing matrix factorizations; here LAPACKE_sgetrf performs the LU factorization and LAPACKE_sgetri computes the inverse from that factorization.
LAPACKE_sgetrf takes the matrix layout (row- or column-major), the dimensions m and n, the data pointer a, the leading dimension lda and an output pivot array ipiv, and overwrites a with the L and U factors.
Its return value is 0 on success, negative if an argument is invalid, and positive if the matrix is singular.
LAPACKE_sgetri takes the matrix layout, the order n, the factored matrix a, the leading dimension lda and the pivot array ipiv produced by LAPACKE_sgetrf, and overwrites a with the inverse.
Its return value follows the same convention: 0 on success, non-zero on failure.
Using LAPACKE_sgetrf followed by LAPACKE_sgetri therefore yields the inverse: first factor the matrix with LAPACKE_sgetrf, then invert it from the LU factors with LAPACKE_sgetri.
The multivariate linear regression model is:
$$h_\theta(x)=\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n=\sum_{i=0}^n\theta_i x_i=\theta^T X \quad (x_0=1)$$
Take the house-price linear regression model discussed earlier as an example, with features $x$, $\sin(x)$ and $\cos(x)$.
Let $X$ be the $n\times 4$ data matrix and $\theta$ the $4\times 1$ parameter vector, initialized with random numbers:
$$X=\begin{bmatrix}1 & x^{(1)} & \sin(x^{(1)}) & \cos(x^{(1)})\\ 1 & x^{(2)} & \sin(x^{(2)}) & \cos(x^{(2)})\\ & & \cdots & \\ 1 & x^{(n)} & \sin(x^{(n)}) & \cos(x^{(n)})\end{bmatrix} \qquad \theta=\begin{bmatrix}bias\\ k_{identity}\\ k_{sin}\\ k_{cos}\end{bmatrix}=\begin{bmatrix}\theta_{1}\\ \theta_{2}\\ \theta_{3}\\ \theta_{4}\end{bmatrix}$$
The prediction $P$ is:
$$P=\begin{bmatrix}p^{(1)}\\ p^{(2)}\\ \cdots\\ p^{(n)}\end{bmatrix}=X\theta= \begin{bmatrix}1 & x^{(1)} & \sin(x^{(1)}) & \cos(x^{(1)})\\ 1 & x^{(2)} & \sin(x^{(2)}) & \cos(x^{(2)})\\ & & \cdots & \\ 1 & x^{(n)} & \sin(x^{(n)}) & \cos(x^{(n)})\end{bmatrix} \begin{bmatrix}bias\\ k_{identity}\\ k_{sin}\\ k_{cos}\end{bmatrix}$$
The ground truth is defined as $Y$, an $n\times 1$ vector:
$$Y_{n\times 1}=\begin{bmatrix}y^{(1)}\\ y^{(2)}\\ \cdots\\ y^{(n)}\end{bmatrix}$$
The loss is computed as:
$$L=\frac{1}{2n}\sum_{i=1}^n(p^{(i)}-y^{(i)})^2$$
Using the matrix differentiation rules derived above, the gradients are:
$$\frac{\partial L}{\partial P} =\frac{1}{n}(P-Y) \qquad \frac{\partial L}{\partial\theta} =\frac{1}{n}X^T(P-Y)$$
Example code:
alpha = 0.01
n = len(X)
for i in range(100):
    L = 0.5 * ((X @ theta - Y)**2).sum() / n
    G = (X @ theta - Y) / n
    grad = X.T @ G
    theta = theta - alpha * grad
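The snippet above assumes X, Y and theta already exist. A self-contained sketch on synthetic data (the feature design 1, x, sin x, cos x matches the matrix above; the generating coefficients 3, 0.5, 2, -1 are made up for illustration):

import numpy as np

np.random.seed(0)
x = np.random.uniform(0, 6, size=100)
y = 3 + 0.5 * x + 2 * np.sin(x) - 1 * np.cos(x) + np.random.randn(100) * 0.1

X = np.stack([np.ones_like(x), x, np.sin(x), np.cos(x)], axis=1)  # n x 4
Y = y.reshape(-1, 1)                                              # n x 1
theta = np.random.randn(4, 1)                                     # 4 x 1

alpha = 0.05
n = len(X)
for i in range(20000):
    L = 0.5 * ((X @ theta - Y) ** 2).sum() / n
    G = (X @ theta - Y) / n    # dL/dP
    grad = X.T @ G             # dL/dtheta
    theta = theta - alpha * grad

print(L, theta.ravel())  # theta should end up close to [3, 0.5, 2, -1]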
Take the resident-happiness logistic regression model discussed earlier as an example, with features area and distance.
Let $X$ be the $n\times 3$ data matrix and $\theta$ the $3\times 1$ parameter vector, initialized with random numbers:
$$X=\begin{bmatrix}1 & area^{(1)} & distance^{(1)}\\ 1 & area^{(2)} & distance^{(2)}\\ & & \cdots\\ 1 & area^{(n)} & distance^{(n)}\end{bmatrix} \qquad \theta=\begin{bmatrix}bias\\ k_{area}\\ k_{distance}\end{bmatrix}=\begin{bmatrix}\theta_{1}\\ \theta_{2}\\ \theta_{3}\end{bmatrix}$$
The prediction $P$ is:
$$P=\begin{bmatrix}p^{(1)}\\ p^{(2)}\\ \cdots\\ p^{(n)}\end{bmatrix}=X\theta= \begin{bmatrix}1 & area^{(1)} & distance^{(1)}\\ 1 & area^{(2)} & distance^{(2)}\\ & & \cdots\\ 1 & area^{(n)} & distance^{(n)}\end{bmatrix} \begin{bmatrix}bias\\ k_{area}\\ k_{distance}\end{bmatrix}$$
The ground truth is defined as $Y$, an $n\times 1$ vector:
$$Y_{n\times 1}=\begin{bmatrix}y^{(1)}\\ y^{(2)}\\ \cdots\\ y^{(n)}\end{bmatrix}$$
Let $H = \mathrm{sigmoid}(P)$; the loss is:
$$L=-\frac{1}{n}\sum_{i=1}^n\left[y^{(i)}\cdot \ln(h^{(i)})+(1-y^{(i)})\cdot \ln(1-h^{(i)})\right]$$
The gradients are then:
$$\frac{\partial L}{\partial P} =\frac{1}{n}(H-Y) \qquad \frac{\partial L}{\partial\theta} =\frac{1}{n}X^T(H-Y)$$
Example code:
Python version:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

alpha = 0.01
n = len(X)
for i in range(100):
    H = sigmoid(X @ theta)
    L = -(Y * np.log(H) + (1 - Y) * np.log(1 - H)).sum() / n
    G = (H - Y) / n
    grad = X.T @ G
    theta = theta - alpha * grad
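Here too X, Y and theta are assumed to already exist. A self-contained sketch on made-up area/distance data (the labelling rule and the coefficients 0.3, 1.5, -2.0 are invented purely for illustration; the features are standardized so plain gradient descent behaves well):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(0)
n = 200
area = np.random.uniform(50, 150, size=n)
distance = np.random.uniform(0, 10, size=n)

# standardize the two features
area_n = (area - area.mean()) / area.std()
dist_n = (distance - distance.mean()) / distance.std()

# invented rule: happier with larger area and smaller distance
p_true = sigmoid(0.3 + 1.5 * area_n - 2.0 * dist_n)
Y = (np.random.rand(n) < p_true).astype(float).reshape(-1, 1)

X = np.stack([np.ones(n), area_n, dist_n], axis=1)  # n x 3
theta = np.zeros((3, 1))

alpha = 1.0
for i in range(5000):
    H = sigmoid(X @ theta)
    L = -(Y * np.log(H) + (1 - Y) * np.log(1 - H)).sum() / n
    G = (H - Y) / n
    grad = X.T @ G
    theta = theta - alpha * grad

acc = ((sigmoid(X @ theta) > 0.5) == (Y > 0.5)).mean()
print(L, acc, theta.ravel())  # theta lands near the generating coefficients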
C++ version:
Matrix datas_matrix, label_matrix;
tie(datas_matrix, label_matrix) = statistics::datas_to_matrix(datas);
// Matrix theta = random::create_normal_distribution_matrix(3, 1);
Matrix theta(3, 1, {0, 0.1, 0.1});
int batch_size = datas.size();
float lr = 0.1;
for(int iter = 0; iter < 1000; ++iter){
auto logistic = datas_matrix.gemm(theta).sigmoid();
auto loss = -(logistic.log() * label_matrix + logistic.log_1subx() * label_matrix._1subx()).reduce_sum() / batch_size;
auto G = (logistic - label_matrix) * (1.0f / batch_size);
auto grad = datas_matrix.gemm(G, true);
theta = theta - grad * lr;
if(iter % 100 == 0)
cout << "Iter " << iter <<", Loss: " << setprecision(3) << loss << endl;
}
The notion of the norm (length) of a vector:
Taking the inner product of a vector $V$ with itself gives
$$V^TV = v_1^2 + v_2^2 + v_3^2 + \dots + v_n^2$$
which is exactly the square of its norm:
$$V^TV=\|V\|_2^2=v_1^2+v_2^2+v_3^2+\dots+v_n^2$$
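A one-line numpy check of the identity:

import numpy as np

V = np.array([1., 2., 3.])
print(V @ V, (V ** 2).sum(), np.linalg.norm(V) ** 2)  # all equal 14 (the last up to rounding)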
For $X_{m\times n}\theta_{n\times 1}=Y_{m\times 1}$ with $X$ and $Y$ known, how do we find the best $\theta_{n\times 1}$ that minimizes $\|X\theta-Y\|_2^2$, where $m$ is the number of samples and $n$ is the feature dimension?
$$\|X\theta-Y\|_2^2=(X\theta-Y)^T(X\theta-Y)$$
Define the objective to optimize as:
$$L(\theta)=\frac{1}{2}(X\theta-Y)^{T}(X\theta-Y)$$
Setting
$$\frac{\partial L(\theta)}{\partial\theta}=X^T(X\theta-Y)=0$$
and solving gives
$$X^TX\theta=X^TY \qquad \theta=(X^TX)^{-1}X^TY$$
Compared with gradient descent, this least-squares solution requires a matrix inverse, but $X^TX$ can easily be non-invertible, and computing the inverse is relatively expensive, so the two methods each have their pros and cons.
Example code:
theta = np.linalg.inv(X.T @ X) @ X.T @ Y
theta
Ridge regression (i.e. L2-regularized regression) adds a regularization term to the least-squares objective:
$$\mathrm{minimize}\;\|X\theta-Y\|_2^2+\lambda\|\theta\|_2^2$$
which yields the closed-form solution:
$$\theta=(X^TX+\lambda I)^{-1}X^TY$$
The regularization term can be understood as making $X^TX$ invertible in more situations.
Example code:
eps = 1e-5
theta = np.linalg.inv(X.T @ X + np.eye(X.shape[1]) * eps) @ X.T @ Y
theta
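To see what the regularizer buys, here is a small numpy sketch with a deliberately rank-deficient X (one feature column duplicated, so X^T X is singular and the plain normal equation typically fails), using made-up data:

import numpy as np

np.random.seed(0)
x = np.random.randn(50, 1)
X = np.hstack([np.ones((50, 1)), x, x])   # duplicated column -> X^T X is singular
Y = 1 + 2 * x + np.random.randn(50, 1) * 0.1

# np.linalg.inv(X.T @ X) would typically raise LinAlgError here; the ridge version still works
eps = 1e-5
theta = np.linalg.inv(X.T @ X + np.eye(X.shape[1]) * eps) @ X.T @ Y
print(theta.ravel())  # roughly [1, 1, 1]: the weight 2 is split across the two identical columns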
Back to the same problem: for $X_{m\times n}\theta_{n\times 1}=Y_{m\times 1}$ with $X$ and $Y$ known, find the best $\theta_{n\times 1}$ that minimizes $\|X\theta-Y\|_2^2$, where $m$ is the number of samples and $n$ is the feature dimension.
Define the objective as:
$$L=\frac{1}{2}\|X\theta-Y\|_2^2$$
Update $\theta$ using the Hessian matrix:
$$\theta^+=\theta-H^{-1}\frac{\partial L(\theta)}{\partial\theta}$$
where $H$ is the Hessian of $L$ with respect to the parameters $\theta$, i.e. the square matrix of second-order partial derivatives:
$$H=\begin{bmatrix}\frac{\partial^{2}L}{\partial\theta_{1}\partial\theta_{1}} & \frac{\partial^{2}L}{\partial\theta_{1}\partial\theta_{2}} & \cdots & \frac{\partial^{2}L}{\partial\theta_{1}\partial\theta_{n}}\\ \frac{\partial^{2}L}{\partial\theta_{2}\partial\theta_{1}} & \frac{\partial^{2}L}{\partial\theta_{2}\partial\theta_{2}} & \cdots & \frac{\partial^{2}L}{\partial\theta_{2}\partial\theta_{n}}\\ \cdots & \cdots & \cdots & \cdots\\ \frac{\partial^{2}L}{\partial\theta_{n}\partial\theta_{1}} & \frac{\partial^{2}L}{\partial\theta_{n}\partial\theta_{2}} & \cdots & \frac{\partial^{2}L}{\partial\theta_{n}\partial\theta_{n}}\end{bmatrix}$$
The Hessian matrix code is as follows:
def hessian(X, theta, Y):
    O = np.zeros((theta.shape[0], theta.shape[0]))
    for i in range(O.shape[0]):
        for j in range(O.shape[1]):
            for k in range(X.shape[0]):
                O[i, j] += X[k, i] * X[k, j]
    return O

# or, equivalently
def hessian(X, theta, Y):
    return X.T @ X
The gradient code is as follows:
def gradient(X, theta, Y):
    g = X @ theta - Y
    return X.T @ g

for i in range(100):
    L = ((X @ theta - Y)**2).sum()
    theta = theta - np.linalg.inv(hessian(X, theta, Y)) @ gradient(X, theta, Y)
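Because this loss is quadratic in theta, the Hessian is constant and a single Newton step already lands on the least-squares solution. A self-contained numpy sketch on synthetic data (the hessian and gradient expressions are inlined, marked in the comments):

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 3)
theta_true = np.array([[1.0], [-2.0], [0.5]])
Y = X @ theta_true + np.random.randn(100, 1) * 0.1

theta = np.zeros((3, 1))
H = X.T @ X                           # hessian(X, theta, Y)
g = X.T @ (X @ theta - Y)             # gradient(X, theta, Y)
theta = theta - np.linalg.inv(H) @ g  # one Newton step

theta_ls = np.linalg.inv(X.T @ X) @ X.T @ Y  # normal-equation solution
print(np.allclose(theta, theta_ls))          # True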
Reference: https://zhuanlan.zhihu.com/p/139159521
Define the loss function as:
$$L=\frac{1}{2}\|X\theta-Y\|_2^2$$
Define the residual $r$ as:
$$r = X\theta-Y$$
so that
$$L=\frac{1}{2}\|X\theta-Y\|_2^2=\frac{1}{2}\|r\|_2^2$$
Written out, $r$ can be expressed as:
$$r=\begin{bmatrix}r^{(1)}\\ r^{(2)}\\ \ldots\\ r^{(m)}\end{bmatrix}=\begin{bmatrix}x_{1}^{(1)}\theta_{1}+x_{2}^{(1)}\theta_{2}+\dots+x_{n}^{(1)}\theta_{n}-y^{(1)}\\ x_{1}^{(2)}\theta_{1}+x_{2}^{(2)}\theta_{2}+\dots+x_{n}^{(2)}\theta_{n}-y^{(2)}\\ \ldots\\ x_{1}^{(m)}\theta_{1}+x_{2}^{(m)}\theta_{2}+\dots+x_{n}^{(m)}\theta_{n}-y^{(m)}\end{bmatrix}$$
The Jacobian of the residual $r$ with respect to the parameters $\theta$ is:
$$J(r(\theta)) =\begin{bmatrix}\frac{\partial r^{(1)}}{\partial\theta_{1}} & \frac{\partial r^{(1)}}{\partial\theta_{2}} & \cdots & \frac{\partial r^{(1)}}{\partial\theta_{n}}\\ \frac{\partial r^{(2)}}{\partial\theta_{1}} & \frac{\partial r^{(2)}}{\partial\theta_{2}} & \cdots & \frac{\partial r^{(2)}}{\partial\theta_{n}}\\ & \cdots & &\\ \frac{\partial r^{(m)}}{\partial\theta_{1}} & \frac{\partial r^{(m)}}{\partial\theta_{2}} & \cdots & \frac{\partial r^{(m)}}{\partial\theta_{n}}\end{bmatrix}$$
Since each $r^{(i)}$ is linear in $\theta$ with $\partial r^{(i)}/\partial\theta_{j}=x_{j}^{(i)}$, this Jacobian is simply $J(r(\theta)) = X$.
Starting from the multivariate Newton iteration:
$$\theta^+=\theta-H^{-1}\frac{\partial L(\theta)}{\partial\theta}$$
Gauss-Newton approximates the Hessian $H(L(\theta))$ of the objective with respect to $\theta$ using the Jacobian $J(r(\theta))$ of the residual:
$$H(L(\theta))\approx J(r(\theta))^T J(r(\theta))$$
which finally gives the iteration:
$$\theta^+=\theta - (J^T J)^{-1} J^T r$$
Reference: https://www.bilibili.com/video/av93296032
Reference: https://zhuanlan.zhihu.com/p/139159521
Reference: http://www.whudj.cn/?p=1122
In Gauss-Newton, the update for $\theta$ looks a lot like the least-squares solution:
$$\theta^+=\theta - (J^T J)^{-1} J^T r$$
The Levenberg-Marquardt (LM) correction introduces a regularization-like term $\mu$, which deals with $J^TJ$ being singular; $\mu$ is also called the damping factor:
$$\theta^+=\theta - (J^T J + \mu I)^{-1} J^T r$$
Example code:
u = 0.00001
for i in range(100):
    r = X @ theta - Y
    L = (r**2).sum()
    J = jacobian(X, theta, Y)
    theta = theta - np.linalg.inv(J.T @ J + u * np.eye(J.shape[1])) @ J.T @ r
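The loop above calls a jacobian(X, theta, Y) helper that is never defined in these notes. Since the residual r = Xθ - Y is linear in θ, its Jacobian is simply X; a minimal sketch (with a finite-difference sanity check) could be:

import numpy as np

def jacobian(X, theta, Y):
    # r = X @ theta - Y is linear in theta, so dr/dtheta = X
    return X

# quick finite-difference check of the Jacobian
np.random.seed(0)
X = np.random.randn(5, 3)
Y = np.random.randn(5, 1)
theta = np.random.randn(3, 1)

eps = 1e-6
J_num = np.zeros((5, 3))
for j in range(3):
    tp = theta.copy(); tp[j] += eps
    tm = theta.copy(); tm[j] -= eps
    J_num[:, j] = ((X @ tp - Y) - (X @ tm - Y)).ravel() / (2 * eps)

print(np.allclose(J_num, jacobian(X, theta, Y)))  # True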
This lesson covered matrices, the derivation of matrix gradients and their code implementation, and introduced, on top of matrices, some common optimization algorithms such as the multivariate Newton method, Gauss-Newton and LM. These were only touched on briefly without detailed analysis; I will fill in the details later when they are needed.