These are my personal study notes for Algo C++, a new C++ course for AI algorithms released by 手写AI (link). They are for my own reference only.
$$\left\{\begin{array}{cc}a&b\\ d&e\end{array}\right\}\times\left\{\begin{array}{cc}1&3\\ 2&4\end{array}\right\}=\left\{\begin{array}{cc}a\cdot 1+b\cdot 2&a\cdot 3+b\cdot 4\\ d\cdot 1+e\cdot 2&d\cdot 3+e\cdot 4\end{array}\right\}$$
Mnemonic: C[r][c] = multiply-accumulate of row r of A with column c of B
For $A\cdot B = C$, let $L$ be a loss function of $C$.
Let $G = \dfrac{\partial L}{\partial C}$. If $C$ is differentiated with respect to $A$ directly (that is, $L$ is just the sum of all elements of $C$), then $G$ is an all-ones matrix with the same shape as $C$. In either case:
$$\dfrac{\partial L}{\partial A}=G\cdot B^T \qquad \dfrac{\partial L}{\partial B}=A^T \cdot G$$
import torch
import torch.nn as nn

X = nn.parameter.Parameter(torch.tensor([
    [1, 2],
    [2, 1],
    [0, 2]
], dtype=torch.float32))
theta = nn.parameter.Parameter(torch.tensor([
    [5, 1, 0],
    [2, 3, 1]
], dtype=torch.float32))

loss = (X @ theta).sum()
loss.backward()  # run autograd so that X.grad and theta.grad are populated
print(f"Loss = {loss.item()}")
print(f"dloss / dX = \n{X.grad}")
print(f"dloss / dtheta = \n{theta.grad}")

# Manual check: loss is a plain sum, so G = dloss / d(X @ theta) is all ones
G = torch.ones_like(X @ theta)
X_grad = G @ theta.data.T
theta_grad = X.data.T @ G
print(f"dloss / dX = \n{X_grad}")
print(f"dloss / dtheta = \n{theta_grad}")
Loss = 48.0
dloss / dX =
tensor([[6., 6.],
        [6., 6.],
        [6., 6.]])
dloss / dtheta =
tensor([[3., 3., 3.],
        [5., 5., 5.]])
dloss / dX =
tensor([[6., 6.],
        [6., 6.],
        [6., 6.]])
dloss / dtheta =
tensor([[3., 3., 3.],
        [5., 5., 5.]])
Consider the matrix multiplication $A \cdot B = C$.
Consider the loss function $L = \sum^m_{i}\sum^n_{j}{(C_{ij} - p)^2}$.
Consider the derivative with respect to each element of $C$: $\triangledown C_{ij} = \frac{\partial L}{\partial C_{ij}}$.
Suppose $A$, $B$, and $C$ are all 2x2 matrices, and define $G$ as the derivative of $L$ with respect to $C$:
$$A = \begin{bmatrix} a & b\\ c & d \end{bmatrix} \quad B = \begin{bmatrix} e & f \\ g & h \end{bmatrix} \quad C = \begin{bmatrix} i & j \\ k & l \end{bmatrix} \quad G = \frac{\partial L}{\partial C} = \begin{bmatrix} \frac{\partial L}{\partial i} & \frac{\partial L}{\partial j} \\ \frac{\partial L}{\partial k} & \frac{\partial L}{\partial l} \end{bmatrix} = \begin{bmatrix} w & x \\ y & z \end{bmatrix}$$
Expanding the left-hand side $A \cdot B$:
$$C = \begin{bmatrix} i = ae + bg & j = af + bh\\ k = ce + dg & l = cf + dh \end{bmatrix}$$
The derivative of $L$ with respect to each element of $A$:
$$\triangledown A_{ij} = \frac{\partial L}{\partial A_{ij}}$$
$$\begin{aligned} \frac{\partial L}{\partial a} &= \frac{\partial L}{\partial i} * \frac{\partial i}{\partial a} + \frac{\partial L}{\partial j} * \frac{\partial j}{\partial a} \\ \frac{\partial L}{\partial b} &= \frac{\partial L}{\partial i} * \frac{\partial i}{\partial b} + \frac{\partial L}{\partial j} * \frac{\partial j}{\partial b} \\ \frac{\partial L}{\partial c} &= \frac{\partial L}{\partial k} * \frac{\partial k}{\partial c} + \frac{\partial L}{\partial l} * \frac{\partial l}{\partial c} \\ \frac{\partial L}{\partial d} &= \frac{\partial L}{\partial k} * \frac{\partial k}{\partial d} + \frac{\partial L}{\partial l} * \frac{\partial l}{\partial d} \end{aligned}$$
Substituting the partial derivatives read off from the expansion of $C$ (for example $\frac{\partial i}{\partial a} = e$ and $\frac{\partial j}{\partial a} = f$):
$$\begin{aligned} \frac{\partial L}{\partial a} &= we + xf \\ \frac{\partial L}{\partial b} &= wg + xh \\ \frac{\partial L}{\partial c} &= ye + zf \\ \frac{\partial L}{\partial d} &= yg + zh \end{aligned}$$
Therefore the derivative with respect to $A$ is:
$$\frac{\partial L}{\partial A} = \begin{bmatrix} we + xf & wg + xh\\ ye + zf & yg + zh \end{bmatrix} = \begin{bmatrix} w & x\\ y & z \end{bmatrix} \begin{bmatrix} e & g\\ f & h \end{bmatrix}$$
$$\frac{\partial L}{\partial A} = G \cdot B^T$$
Similarly, the derivative with respect to $B$ is:
$$\begin{aligned} \frac{\partial L}{\partial e} &= wa + yc \\ \frac{\partial L}{\partial f} &= xa + zc \\ \frac{\partial L}{\partial g} &= wb + yd \\ \frac{\partial L}{\partial h} &= xb + zd \end{aligned}$$
$$\frac{\partial L}{\partial B} = \begin{bmatrix} wa + yc & xa + zc\\ wb + yd & xb + zd \end{bmatrix} = \begin{bmatrix} a & c\\ b & d \end{bmatrix} \begin{bmatrix} w & x\\ y & z \end{bmatrix}$$
$$\frac{\partial L}{\partial B} = A^T \cdot G$$
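The two formulas can be sanity-checked numerically. The sketch below is not part of the course code: it assumes the loss $L = \sum_{ij}(C_{ij} - p)^2$ from above, for which $G = 2(C - p)$, and compares $G \cdot B^T$ and $A^T \cdot G$ against central finite differences. The helper names (`loss_fn`, `analytic_grads`, `numeric_grad`) and the random test matrices are made up for illustration.

```python
import numpy as np

def loss_fn(A, B, p):
    """L = sum_ij (C_ij - p)^2 with C = A @ B."""
    C = A @ B
    return ((C - p) ** 2).sum()

def analytic_grads(A, B, p):
    """Gradients from dL/dA = G @ B.T and dL/dB = A.T @ G."""
    G = 2.0 * (A @ B - p)  # dL/dC for this particular loss
    return G @ B.T, A.T @ G

def numeric_grad(f, M, eps=1e-6):
    """Central finite differences of the scalar f() with respect to M (perturbed in place)."""
    grad = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        old = M[idx]
        M[idx] = old + eps
        f_plus = f()
        M[idx] = old - eps
        f_minus = f()
        M[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))
B = rng.normal(size=(2, 2))
p = 0.5

dA, dB = analytic_grads(A, B, p)
dA_num = numeric_grad(lambda: loss_fn(A, B, p), A)
dB_num = numeric_grad(lambda: loss_fn(A, B, p), B)
print(np.allclose(dA, dA_num, atol=1e-5), np.allclose(dB, dB_num, atol=1e-5))
```

Only $G$ depends on the choice of loss; the factors $B^T$ and $A^T$ come from the matrix product itself.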
#ifndef GEMM_HPP
#define GEMM_HPP

#include <vector>
#include <initializer_list>
#include <ostream>

/* A custom matrix class (floats stored row-major) */
class Matrix{
public:
    Matrix() = default;
    Matrix(int rows, int cols, const std::initializer_list<float>& pdata={});

    /* Overload operator() so elements can be accessed as m(irow, icol) */
    const float& operator()(int irow, int icol)const {return data_[irow * cols_ + icol];}
    float& operator()(int irow, int icol){return data_[irow * cols_ + icol];}
    Matrix operator*(float value);
    Matrix operator-(const Matrix& other) const;
    int rows() const{return rows_;}
    int cols() const{return cols_;}
    Matrix view(int rows, int cols) const;
    Matrix power(float y) const;
    float reduce_sum() const;
    float* ptr() const{return (float*)data_.data();}
    Matrix gemm(const Matrix& other, bool at=false, bool bt=false, float alpha=1.0f, float beta=0.0f);
    Matrix inv();

private:
    int rows_ = 0;
    int cols_ = 0;
    std::vector<float> data_;
};

/* Global operator overload so a matrix can be printed with cout << m; */
std::ostream& operator << (std::ostream& out, const Matrix& m);

/* Wrapper around gemm */
Matrix gemm(const Matrix& a, bool ta, const Matrix& b, bool tb, float alpha, float beta);
Matrix inverse(const Matrix& a);

#endif // GEMM_HPP
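The matrix.hpp header defines a matrix class, Matrix, with the basic matrix arithmetic plus some higher-level operations, such as:
Operator overloading: allows array-like element access, e.g. M(0,0).
In addition, matrix.hpp declares an output-stream operator overload for Matrix so that a matrix can be printed with cout.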
#include <cmath>
#include <cstdio>
#include <iostream>
#include <vector>
#include "openblas/cblas.h"
#include "openblas/lapacke.h"
#include "matrix.hpp"

Matrix::Matrix(int rows, int cols, const std::initializer_list<float>& pdata){
    this->rows_ = rows;
    this->cols_ = cols;
    this->data_ = pdata;
    /* Zero-fill when fewer than rows * cols values were supplied */
    if(this->data_.size() < (size_t)(rows * cols))
        this->data_.resize(rows * cols);
}

Matrix Matrix::gemm(const Matrix& other, bool at, bool bt, float alpha, float beta){
    return ::gemm(*this, at, other, bt, alpha, beta);
}

/* Reinterpret the same data with a new shape; the element count must match */
Matrix Matrix::view(int rows, int cols) const{
    if(rows * cols != this->rows_ * this->cols_){
        printf("Invalid view to %d x %d\n", rows, cols);
        return Matrix();
    }
    Matrix newmat = *this;
    newmat.rows_ = rows;
    newmat.cols_ = cols;
    return newmat;
}

/* Element-wise subtraction; both matrices are assumed to have the same shape */
Matrix Matrix::operator-(const Matrix& other) const{
    Matrix output = *this;
    auto p0 = output.ptr();
    auto p1 = other.ptr();
    for(size_t i = 0; i < output.data_.size(); ++i)
        *p0++ -= *p1++;
    return output;
}

/* Element-wise power */
Matrix Matrix::power(float y) const{
    Matrix output = *this;
    auto p0 = output.ptr();
    for(size_t i = 0; i < output.data_.size(); ++i, ++p0)
        *p0 = std::pow(*p0, y);
    return output;
}

/* Sum of all elements */
float Matrix::reduce_sum() const{
    auto p0 = this->ptr();
    float output = 0;
    for(size_t i = 0; i < this->data_.size(); ++i)
        output += *p0++;
    return output;
}

Matrix Matrix::inv(){
    return ::inverse(*this);
}

/* Scalar multiplication */
Matrix Matrix::operator*(float value){
    Matrix m = *this;
    for(size_t i = 0; i < data_.size(); ++i)
        m.data_[i] *= value;
    return m;
}

std::ostream& operator << (std::ostream& out, const Matrix& m){
    for(int i = 0; i < m.rows(); ++i){
        for(int j = 0; j < m.cols(); ++j){
            out << m(i, j) << "\t";
        }
        out << "\n";
    }
    return out;
}
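Matrix gemm(const Matrix& a, bool ta, const Matrix& b, bool tb, float alpha, float beta){
    int a_elastic_rows = ta ? a.cols() : a.rows();   /* dimensions swap when transposed */
    int a_elastic_cols = ta ? a.rows() : a.cols();   /* dimensions swap when transposed */
    int b_elastic_rows = tb ? b.cols() : b.rows();   /* dimensions swap when transposed */
    int b_elastic_cols = tb ? b.rows() : b.cols();   /* dimensions swap when transposed */
    if(a_elastic_cols != b_elastic_rows){
        printf("Invalid gemm: inner dimensions %d and %d do not match\n", a_elastic_cols, b_elastic_rows);
        return Matrix();
    }
    /* c takes its rows and columns from the (possibly transposed) operands */
    Matrix c(a_elastic_rows, b_elastic_cols);
    int m = a_elastic_rows;
    int n = b_elastic_cols;
    int k = a_elastic_cols;
    /* Leading dimensions are the stored row lengths (row-major), regardless of transposition */
    int lda = a.cols();
    int ldb = b.cols();
    int ldc = c.cols();
    /* The cblas gemm calling convention; the same style shows up again later */
    cblas_sgemm(
        CblasRowMajor, ta ? CblasTrans : CblasNoTrans, tb ? CblasTrans : CblasNoTrans,
        m, n, k, alpha, a.ptr(), lda, b.ptr(), ldb, beta, c.ptr(), ldc
    );
    return c;
}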
Matrix inverse(const Matrix& a){
    if(a.rows() != a.cols()){
        printf("Invalid to compute inverse matrix by %d x %d\n", a.rows(), a.cols());
        return Matrix();
    }
    Matrix output = a;
    int n = a.rows();
    int *ipiv = new int[n];
    /* LU factorization. The row-major data is handed over as column-major, so LAPACK
       sees A^T; because inv(A^T) = inv(A)^T, the result read back row-major is inv(A). */
    int code = LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, output.ptr(), n, ipiv);
    if(code == 0){
        /* Compute the general inverse from the LU factors */
        code = LAPACKE_sgetri(LAPACK_COL_MAJOR, n, output.ptr(), n, ipiv);
    }
    delete[] ipiv;   /* freed before any return so the failure path does not leak */
    if(code != 0){
        printf("LAPACKE inverse matrix failed, code = %d\n", code);
        return Matrix();
    }
    return output;
}
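The most important parts of matrix.cpp are matrix multiplication and matrix inversion. The gemm function implements matrix multiplication by calling cblas_sgemm from the BLAS library. The inverse function implements matrix inversion, using sgetrf and sgetri from the LAPACK library to perform the LU factorization and compute the inverse from it.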
#include <cstdarg>
#include <cstdio>
#include <iostream>
#include <vector>
#include "matrix.hpp"
using namespace std;

namespace Application{

namespace logger{
    #define INFO(...) Application::logger::__printf(__FILE__, __LINE__, __VA_ARGS__)
    void __printf(const char* file, int line, const char* fmt, ...){
        va_list vl;
        va_start(vl, fmt);
        printf("\033[32m[%s:%d]:\033[0m ", file, line);   /* green [file:line] prefix */
        vprintf(fmt, vl);
        printf("\n");
        va_end(vl);
    }
} // namespace logger

struct Point{
    float x, y;
    Point(float x, float y):x(x), y(y){}
    Point() = default;
};

/* Naive triple-loop matrix multiplication, for comparison with the BLAS version */
Matrix mygemm(const Matrix& a, const Matrix& b){
    Matrix c(a.rows(), b.cols());
    for(int i = 0; i < c.rows(); ++i){
        for(int j = 0; j < c.cols(); ++j){
            float summary = 0;
            for(int k = 0; k < a.cols(); ++k)
                summary += a(i, k) * b(k, j);
            c(i, j) = summary;
        }
    }
    return c;
}

/* Solve for the affine transform matrix */
Matrix get_affine_transform(const vector<Point>& src, const vector<Point>& dst){
    //          P                 M      Y   (primed values are dst coordinates)
    // x1, y1, 1,  0,  0, 0      m00    x1'
    //  0,  0, 0, x1, y1, 1      m01    y1'
    // x2, y2, 1,  0,  0, 0      m02    x2'
    //  0,  0, 0, x2, y2, 1      m10    y2'
    // x3, y3, 1,  0,  0, 0      m11    x3'
    //  0,  0, 0, x3, y3, 1      m12    y3'
    // Y = PM
    // P.inv() @ Y = M
    if(src.size() != 3 || dst.size() != 3){
        printf("Invalid to compute affine transform, src.size = %zu, dst.size = %zu\n", src.size(), dst.size());
        return Matrix();
    }
    Matrix P(6, 6, {
        src[0].x, src[0].y, 1, 0, 0, 0,
        0, 0, 0, src[0].x, src[0].y, 1,
        src[1].x, src[1].y, 1, 0, 0, 0,
        0, 0, 0, src[1].x, src[1].y, 1,
        src[2].x, src[2].y, 1, 0, 0, 0,
        0, 0, 0, src[2].x, src[2].y, 1
    });
    Matrix Y(6, 1, {
        dst[0].x, dst[0].y, dst[1].x, dst[1].y, dst[2].x, dst[2].y
    });
    return P.inv().gemm(Y).view(2, 3);
}
void test_matrix(){
    Matrix a1(2, 3, {
        1, 2, 3,
        4, 5, 6
    });
    Matrix b1(3, 2,{
        3, 0,
        2, 1,
        0, 2
    });
    INFO("A1 @ B1 = ");
    std::cout << a1.gemm(b1) << std::endl;

    Matrix a2(3, 2, {
        1, 4,
        2, 5,
        3, 6
    });
    INFO("A2.T @ B1 =");
    std::cout << gemm(a2, true, b1, false, 1.0f, 0.0f) << std::endl;
    INFO("A1 @ B1 = ");
    std::cout << mygemm(a1, b1) << std::endl;
    INFO("a2 * 2 = ");
    std::cout << a2 * 2 << std::endl;

    Matrix c(2, 2, {
        1, 2,
        3, 4
    });
    INFO("C.inv = ");
    std::cout << c.inv() << std::endl;
    std::cout << c.gemm(c.inv()) << std::endl;
}

void test_affine(){
    auto M = get_affine_transform({
        Point(0, 0), Point(10, 0), Point(10, 10)
    }, {
        Point(20, 20), Point(100, 20), Point(100, 100)
    });
    INFO("Affine matrix = ");
    std::cout << M << std::endl;
}
/* Test the matrix derivative */
void test_matrix_derivation(){
    /* loss = (X @ theta).sum() */
    Matrix X(
        3, 2, {
        1, 2,
        2, 1,
        0, 2
    });
    Matrix theta(
        2, 3, {
        5, 1, 0,
        2, 3, 1
    });
    auto loss = X.gemm(theta).reduce_sum();
    // G = dloss / d(X @ theta) = ones_like(X @ theta)
    // dloss/dX = G @ theta.T
    // dloss/dtheta = X.T @ G
    INFO("Loss = %f", loss);
    Matrix G(3, 3, {1, 1, 1, 1, 1, 1, 1, 1, 1});
    INFO("dloss / dX = ");
    std::cout << G.gemm(theta, false, true);
    INFO("dloss / dtheta = ");
    std::cout << X.gemm(G, true);
}

int run(){
    /* Run the three tests described in these notes */
    test_matrix();
    test_affine();
    test_matrix_derivation();
    return 0;
}

} // namespace Application

int main(){
    return Application::run();
}
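main.cpp mainly demonstrates how to use the custom Matrix class and runs a few simple tests on it.
The test_matrix function runs simple tests on the Matrix class, covering matrix multiplication, multiplication with a transposed operand, the hand-written matrix multiplication, scalar multiplication, and matrix inversion.
The get_affine_transform function solves for the affine transform determined by three point pairs. P is a 6x6 matrix built from the source point coordinates, and Y is a 6x1 matrix holding the target point coordinates. Multiplying the inverse of P by Y gives the 6x1 vector of affine parameters M, which is finally viewed as a 2x3 matrix and returned. For details, see YOLOv5推理详解及预处理高性能实现.
The test_matrix_derivation function demonstrates the matrix derivative: for a loss computed from X and theta, it evaluates the partial derivatives with respect to X and theta.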
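As a cross-check of the affine solve described above, here is a minimal NumPy sketch that builds the same 6x6 system and solves it for the same three point pairs used in test_affine. The function name `get_affine_transform_np` is made up for this note, and `np.linalg.solve` is used in place of an explicit inverse.

```python
import numpy as np

def get_affine_transform_np(src, dst):
    """Solve the 2x3 affine matrix M that maps three src points onto three dst points."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    assert src.shape == (3, 2) and dst.shape == (3, 2)
    P = np.zeros((6, 6))
    for i, (x, y) in enumerate(src):
        P[2 * i] = [x, y, 1, 0, 0, 0]      # row that produces x'
        P[2 * i + 1] = [0, 0, 0, x, y, 1]  # row that produces y'
    Y = dst.reshape(6, 1)                  # [x1', y1', x2', y2', x3', y3']^T
    M = np.linalg.solve(P, Y)              # same result as inv(P) @ Y
    return M.reshape(2, 3)

M = get_affine_transform_np(
    [(0, 0), (10, 0), (10, 10)],
    [(20, 20), (100, 20), (100, 100)],
)
print(M)  # scale by 8 and shift by 20 on both axes
```

For these points the transform is a uniform scale of 8 plus a translation of 20, which is easy to verify by hand: (10, 10) maps to (100, 100).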
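The code also defines a macro named INFO that calls Application::logger::__printf. __FILE__ and __LINE__ are predefined macros that expand to the current source file name and line number, and __VA_ARGS__ stands for the variable arguments of the macro, so any number of arguments can be passed. The macro prints log messages prefixed with the file name and line number.
The __printf function takes a variadic argument list and calls vprintf to produce the formatted output (from chatGPT).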
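cblas_sgemm is the single-precision matrix multiplication function of the BLAS library. Its parameters are as follows (from chatGPT):
void cblas_sgemm(
    const enum CBLAS_ORDER Order,        // storage order: CblasRowMajor or CblasColMajor
    const enum CBLAS_TRANSPOSE TransA,   // whether A is used transposed
    const enum CBLAS_TRANSPOSE TransB,   // whether B is used transposed
    const int M,                         // rows of op(A) and of C
    const int N,                         // columns of op(B) and of C
    const int K,                         // columns of op(A) = rows of op(B)
    const float alpha,                   // scale applied to op(A) * op(B)
    const float *A,
    const int lda,                       // leading dimension of A as stored
    const float *B,
    const int ldb,                       // leading dimension of B as stored
    const float beta,                    // scale applied to the existing C
    float *C,
    const int ldc                        // leading dimension of C as stored
);
The function computes C = alpha * op(A) * op(B) + beta * C and stores the result in C, where op(X) is X or its transpose depending on TransA and TransB. op(A) is MxK, op(B) is KxN, and C is MxN.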
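To make the roles of the transpose flags, alpha, and beta concrete, here is a small NumPy reference of what an sgemm call computes. It is only a sketch of the semantics, not the BLAS implementation; the function name `sgemm_reference` is made up, and the inputs are the a2 and b1 matrices from test_matrix.

```python
import numpy as np

def sgemm_reference(trans_a, trans_b, alpha, A, B, beta, C):
    """Return alpha * op(A) @ op(B) + beta * C, where op(X) is X or X.T."""
    op_a = A.T if trans_a else A
    op_b = B.T if trans_b else B
    m, k = op_a.shape
    k2, n = op_b.shape
    assert k == k2, "inner dimensions of op(A) and op(B) must match"
    assert C.shape == (m, n), "C must be M x N"
    return (alpha * (op_a @ op_b) + beta * C).astype(np.float32)

A2 = np.array([[1, 4], [2, 5], [3, 6]], dtype=np.float32)  # 3 x 2, used transposed
B1 = np.array([[3, 0], [2, 1], [0, 2]], dtype=np.float32)  # 3 x 2
C0 = np.zeros((2, 2), dtype=np.float32)

out = sgemm_reference(True, False, 1.0, A2, B1, 0.0, C0)
print(out)  # A2.T @ B1
```

With beta = 0 the previous contents of C are ignored, which is why the C++ wrapper can pass a freshly constructed matrix.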
LU decomposition factors a matrix into a lower triangular matrix L and an upper triangular matrix U whose product is the original matrix. With the LU factors one can solve linear systems and compute the determinant, the inverse, and so on (from chatGPT). Practical routines such as sgetrf additionally swap rows (partial pivoting) for numerical stability, so what they factor is a row-permuted version of the matrix.
An $n\times n$ matrix $A$ can be factored by LU decomposition as:
$$A = LU$$
where $L$ is an $n\times n$ lower triangular matrix and $U$ is an $n\times n$ upper triangular matrix. Concretely, $L$ and $U$ are obtained from $A$ by Gaussian elimination:
$$\begin{bmatrix}a_{11}&a_{12}&a_{13}&\ldots&a_{1n}\\ a_{21}&a_{22}&a_{23}&\ldots&a_{2n}\\ a_{31}&a_{32}&a_{33}&\ldots&a_{3n}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ a_{n1}&a_{n2}&a_{n3}&\ldots&a_{nn}\end{bmatrix}= \begin{bmatrix}1\\l_{21}&1\\l_{31}&l_{32}&1\\\vdots&\vdots&\vdots&\ddots\\l_{n1}&l_{n2}&l_{n3}&\cdots&1\end{bmatrix} \begin{bmatrix}u_{11}&u_{12}&u_{13}&\cdots&u_{1n}\\ &u_{22}&u_{23}&\cdots&u_{2n}\\ &&u_{33}&\cdots&u_{3n}\\ &&&\ddots&\vdots\\ &&&&u_{nn}\end{bmatrix}$$
The inverse is then computed from the $L$ and $U$ factors with the formula:
$$A^{-1} = U^{-1}L^{-1}$$
where $L^{-1}$ is an $n\times n$ lower triangular matrix and $U^{-1}$ is an $n\times n$ upper triangular matrix.
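LAPACKE_sgetrf performs the LU factorization, and LAPACKE_sgetri computes the inverse from the factors. The inverse function therefore first calls LAPACKE_sgetrf to factor the matrix and then calls LAPACKE_sgetri to invert the factored matrix.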
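The factor-then-invert flow can be sketched in plain NumPy. This is an illustrative re-implementation, not what LAPACK does internally: `lu_decompose`, `forward_sub`, `backward_sub`, and `lu_inverse` are names made up for this note. It uses partial pivoting, so the factorization is $PA = LU$ and the inverse becomes $A^{-1} = U^{-1}L^{-1}P$, computed with triangular substitutions rather than explicit triangular inverses.

```python
import numpy as np

def lu_decompose(A):
    """Return P, L, U with P @ A = L @ U (Gaussian elimination, partial pivoting)."""
    n = A.shape[0]
    U = A.astype(np.float64).copy()
    L = np.eye(n)
    P = np.eye(n)
    for col in range(n):
        pivot = col + int(np.argmax(np.abs(U[col:, col])))
        if U[pivot, col] == 0:
            raise np.linalg.LinAlgError("matrix is singular")
        if pivot != col:
            # Swap rows in U and P, and the already-computed part of L
            U[[col, pivot]] = U[[pivot, col]]
            P[[col, pivot]] = P[[pivot, col]]
            L[[col, pivot], :col] = L[[pivot, col], :col]
        for row in range(col + 1, n):
            L[row, col] = U[row, col] / U[col, col]
            U[row] -= L[row, col] * U[col]
    return P, L, U

def forward_sub(L, B):
    """Solve L @ X = B for lower triangular L."""
    X = np.zeros_like(B)
    for i in range(L.shape[0]):
        X[i] = (B[i] - L[i, :i] @ X[:i]) / L[i, i]
    return X

def backward_sub(U, B):
    """Solve U @ X = B for upper triangular U."""
    X = np.zeros_like(B)
    for i in reversed(range(U.shape[0])):
        X[i] = (B[i] - U[i, i + 1:] @ X[i + 1:]) / U[i, i]
    return X

def lu_inverse(A):
    """A^-1 = U^-1 @ L^-1 @ P, evaluated as two triangular solves."""
    P, L, U = lu_decompose(A)
    return backward_sub(U, forward_sub(L, P))

C = np.array([[1.0, 2.0], [3.0, 4.0]])  # same matrix as in test_matrix
C_inv = lu_inverse(C)
print(C_inv)
print(C @ C_inv)
```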
$$h_\theta(x)=\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n=\sum_{i=0}^n\theta_i x_i=\theta^T x \quad (x_0=1)$$
Take the house-price linear regression model discussed earlier as an example, with features $x$, $\sin(x)$, and $\cos(x)$.
Let $X$ be the $n\times 4$ data matrix and $\theta$ the $4\times 1$ parameter vector, initialized with random numbers:
$$X=\begin{bmatrix}1&x^{(1)}&\sin(x^{(1)})&\cos(x^{(1)})\\ 1&x^{(2)}&\sin(x^{(2)})&\cos(x^{(2)})\\ \vdots&\vdots&\vdots&\vdots\\ 1&x^{(n)}&\sin(x^{(n)})&\cos(x^{(n)})\end{bmatrix} \qquad \theta=\begin{bmatrix}bias\\ k_{identity}\\ k_{sin}\\ k_{cos}\end{bmatrix}=\begin{bmatrix}\theta_{1}\\ \theta_{2}\\ \theta_{3}\\ \theta_{4}\end{bmatrix}$$
The prediction $P$ is:
$$P=\begin{bmatrix}p^{(1)}\\ p^{(2)}\\ \vdots\\ p^{(n)}\end{bmatrix}=X\theta=\begin{bmatrix}1&x^{(1)}&\sin(x^{(1)})&\cos(x^{(1)})\\ 1&x^{(2)}&\sin(x^{(2)})&\cos(x^{(2)})\\ \vdots&\vdots&\vdots&\vdots\\ 1&x^{(n)}&\sin(x^{(n)})&\cos(x^{(n)})\end{bmatrix}\begin{bmatrix}bias\\ k_{identity}\\ k_{sin}\\ k_{cos}\end{bmatrix}$$
The ground truth is defined as $Y$, of size $n\times 1$:
$$Y_{n\times 1}=\begin{bmatrix}y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(n)}\end{bmatrix}$$
The loss is computed as:
$$L=\frac{1}{2n}\sum_{i=1}^n(p^{(i)}-y^{(i)})^2$$
Differentiating, and applying the matrix rule $\frac{\partial L}{\partial B}=A^T\cdot G$ with $A = X$ and $G = \frac{\partial L}{\partial P}$:
$$\begin{aligned} &\frac{\partial L}{\partial P} =\frac{1}{n}(P-Y) \\ &\frac{\partial L}{\partial\theta} =\frac{1}{n}X^T(P-Y) \end{aligned}$$
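# X (n x 4), Y (n x 1) and theta (4 x 1) are assumed to be NumPy arrays defined as above
alpha = 0.01  # learning rate
n = len(X)
for i in range(100):
    L = 0.5 * ((X @ theta - Y)**2).sum() / n   # loss, for monitoring only
    G = (X @ theta - Y) / n                    # dL/dP
    grad = X.T @ G                             # dL/dtheta
    theta = theta - alpha * grad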
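The loop above needs data to run. A self-contained sketch on synthetic data follows, assuming features 1, x, sin(x), cos(x); the "true" coefficients, noise level, learning rate, and iteration count are all made up for illustration and are not from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3.0, 3.0, size=n)
X = np.stack([np.ones(n), x, np.sin(x), np.cos(x)], axis=1)  # n x 4 design matrix
theta_true = np.array([[1.0], [0.5], [2.0], [-1.0]])         # made-up coefficients
Y = X @ theta_true + 0.01 * rng.normal(size=(n, 1))          # noisy targets

theta = rng.normal(size=(4, 1))  # random initialization
alpha = 0.1                      # learning rate (assumed value)
losses = []
for i in range(2000):
    P = X @ theta
    losses.append(0.5 * ((P - Y) ** 2).sum() / n)
    G = (P - Y) / n              # dL/dP
    theta = theta - alpha * (X.T @ G)

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.6f}")
print("theta =", theta.ravel())
```

Recording the loss per iteration makes it easy to see whether the learning rate is too large (the loss grows) or too small (the loss barely moves).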
Take the logistic regression model for residents' happiness discussed earlier as an example, with features area and distance.
Let $X$ be the $n\times 3$ data matrix and $\theta$ the $3\times 1$ parameter vector, initialized with random numbers:
$$X=\begin{bmatrix}1&area^{(1)}&distance^{(1)}\\ 1&area^{(2)}&distance^{(2)}\\ \vdots&\vdots&\vdots\\ 1&area^{(n)}&distance^{(n)}\end{bmatrix} \qquad \theta=\begin{bmatrix}bias\\ k_{area}\\ k_{distance}\end{bmatrix}=\begin{bmatrix}\theta_{1}\\ \theta_{2}\\ \theta_{3}\end{bmatrix}$$
The prediction $P$ is:
$$P=\begin{bmatrix}p^{(1)}\\ p^{(2)}\\ \vdots\\ p^{(n)}\end{bmatrix}=X\theta=\begin{bmatrix}1&area^{(1)}&distance^{(1)}\\ 1&area^{(2)}&distance^{(2)}\\ \vdots&\vdots&\vdots\\ 1&area^{(n)}&distance^{(n)}\end{bmatrix}\begin{bmatrix}bias\\ k_{area}\\ k_{distance}\end{bmatrix}$$
The ground truth is defined as $Y$, of size $n\times 1$:
$$Y_{n\times 1}=\begin{bmatrix}y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(n)}\end{bmatrix}$$
Let $H = sigmoid(P)$. The loss is computed as follows (with the factor $\frac{1}{n}$, which is what the gradients and the code below correspond to):
$$L=-\frac{1}{n}\sum_{i=1}^n[y^{(i)}\cdot \ln(h^{(i)})+(1-y^{(i)})\cdot \ln(1-h^{(i)})]$$
$$\begin{aligned} &\frac{\partial L}{\partial P} =\frac{1}{n}(H-Y) \\ &\frac{\partial L}{\partial\theta} =\frac{1}{n}X^T(H-Y) \end{aligned}$$
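import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# X (n x 3), Y (n x 1) and theta (3 x 1) are assumed to be NumPy arrays defined as above
alpha = 0.01  # learning rate
n = len(X)
for i in range(100):
    H = sigmoid(X @ theta)
    L = -(Y * np.log(H) + (1 - Y) * np.log(1 - H)).sum() / n   # cross-entropy loss
    G = (H - Y) / n                                            # dL/dP
    grad = X.T @ G                                             # dL/dtheta
    theta = theta - alpha * grad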
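A runnable version of the same loop on synthetic data is sketched below. The labelling rule (a sample is positive when area exceeds distance), the learning rate, and the small `eps` inside the logarithms (to avoid log(0)) are assumptions added for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 200
area = rng.normal(size=n)
distance = rng.normal(size=n)
X = np.stack([np.ones(n), area, distance], axis=1)            # n x 3
Y = (area - distance > 0).astype(np.float64).reshape(n, 1)    # made-up labelling rule

theta = np.zeros((3, 1))
alpha = 0.1   # learning rate (assumed value)
eps = 1e-12   # guards against log(0)
losses = []
for i in range(1000):
    H = sigmoid(X @ theta)
    losses.append(-(Y * np.log(H + eps) + (1 - Y) * np.log(1 - H + eps)).sum() / n)
    G = (H - Y) / n                      # dL/dP
    theta = theta - alpha * (X.T @ G)    # dL/dtheta = X^T @ G

accuracy = ((sigmoid(X @ theta) > 0.5) == (Y > 0.5)).mean()
print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}, accuracy = {accuracy:.3f}")
```

With theta initialized to zero every prediction starts at 0.5, so the first recorded loss is ln 2.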
// Note: sigmoid(), log(), log_1subx(), _1subx() and the element-wise Matrix * Matrix
// used here are Matrix extensions that are not part of the header shown earlier.
Matrix datas_matrix, label_matrix;
tie(datas_matrix, label_matrix) = statistics::datas_to_matrix(datas);
// Matrix theta = random::create_normal_distribution_matrix(3, 1);
Matrix theta(3, 1, {0, 0.1, 0.1});
int batch_size = datas.size();
float lr = 0.1;
for(int iter = 0; iter < 1000; ++iter){
    auto logistic = datas_matrix.gemm(theta).sigmoid();          // H = sigmoid(X @ theta)
    auto loss = -(logistic.log() * label_matrix + logistic.log_1subx() * label_matrix._1subx()).reduce_sum() / batch_size;
    auto G = (logistic - label_matrix) * (1.0f / batch_size);    // dL/dP
    auto grad = datas_matrix.gemm(G, true);                      // dL/dtheta = X^T @ G
    theta = theta - grad * lr;
    if(iter % 100 == 0)
        cout << "Iter " << iter <<", Loss: " << setprecision(3) << loss << endl;
}
The inner product of a vector $V$ with itself equals its squared L2 norm:
$$V^TV=||V||_2^2=v_1^2+v_2^2+v_3^2+...+v_n^2$$
For $X_{m\times n}\theta_{n\times 1}=Y_{m\times 1}$, with $X$ and $Y$ known, how do we find the best $\theta_{n\times 1}$ that minimizes $||X\theta-Y||_2^2$? Here $m$ is the number of samples and $n$ is the feature dimension.
$$||X\theta-Y||_2^2=(X\theta-Y)^T(X\theta-Y)$$
$$L(\theta)=\frac{1}{2}(X\theta-Y)^{T}(X\theta-Y)$$
Setting the gradient to zero:
$$\frac{\partial L(\theta)}{\partial\theta}=X^T(X\theta-Y)=0$$
$$\begin{aligned} &X^TX\theta=X^TY \\ &\theta=(X^TX)^{-1}X^TY \end{aligned}$$
Compared with gradient descent, least squares gives the solution in closed form but requires a matrix inverse. $X^TX$ can easily be singular (for example when features are linearly dependent or when $m < n$), and computing the inverse is expensive for large $n$, so each method has its own strengths and weaknesses.
theta = np.linalg.inv(X.T @ X) @ X.T @ Y
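The closed-form solution can be compared against NumPy's own solvers. In this sketch the data is random and noise-free and `theta_true` is a made-up vector, so all three approaches should recover it: the explicit inverse from the formula, `np.linalg.solve` on the normal equations, and `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
theta_true = np.array([[1.0], [-2.0], [0.5]])  # made-up parameters
Y = X @ theta_true                             # noise-free targets

theta_inv = np.linalg.inv(X.T @ X) @ X.T @ Y          # formula from the text
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)       # same equations, no explicit inverse
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # library least-squares solver

print(theta_inv.ravel())
print(theta_solve.ravel())
print(theta_lstsq.ravel())
```

In practice `solve` or `lstsq` is usually preferred over forming the inverse explicitly, since they are cheaper and numerically better behaved.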
$$\text{minimize}\ ||X\theta-Y||_2^2+\lambda||\theta||_2^2$$
$$\theta=(X^TX+\lambda I)^{-1}X^TY$$
The regularization term can be understood as making $X^TX+\lambda I$ invertible in more cases: $X^TX$ is positive semi-definite, so adding $\lambda I$ with $\lambda > 0$ makes the matrix positive definite.
eps = 1e-5
theta = np.linalg.inv(X.T @ X + np.eye(X.shape[1]) * eps) @ X.T @ Y
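A small sketch of why the extra term helps: below, the third column of X is deliberately twice the second, so $X^TX$ is rank deficient and has no inverse, while the regularized system still solves cleanly. The data and the target relation are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(20, 1))
X = np.hstack([np.ones((20, 1)), a, 2.0 * a])  # third column = 2 * second column
Y = 3.0 + 4.0 * a                              # made-up targets, representable by X

gram = X.T @ X
rank = np.linalg.matrix_rank(gram)             # 2, although gram is 3 x 3

eps = 1e-5
theta = np.linalg.solve(gram + eps * np.eye(X.shape[1]), X.T @ Y)
residual = np.abs(X @ theta - Y).max()

print("rank of X^T X:", rank)
print("theta:", theta.ravel())
print("max residual:", residual)
```

Because the two dependent columns are interchangeable, many parameter vectors fit equally well; the regularizer picks one with a small norm.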
Same problem as before: for $X_{m\times n}\theta_{n\times 1}=Y_{m\times 1}$, with $X$ and $Y$ known, how do we find the best $\theta_{n\times 1}$ that minimizes $||X\theta-Y||_2^2$? Here $m$ is the number of samples and $n$ is the feature dimension.
$$L=\frac{1}{2}||X\theta-Y||_2^2$$
Update $\theta$ using the Hessian matrix:
$$\theta^+=\theta-H^{-1}\frac{\partial L(\theta)}{\partial\theta}$$
where $H$ is the Hessian of $L$ with respect to the parameters $\theta$, defined as the square matrix of second-order partial derivatives. For this loss, $H = X^TX$.
$$H=\begin{bmatrix}\frac{\partial^{2}L}{\partial\theta_{1}\partial\theta_{1}}&\frac{\partial^{2}L}{\partial\theta_{1}\partial\theta_{2}}&\cdots&\frac{\partial^{2}L}{\partial\theta_{1}\partial\theta_{n}}\\ \frac{\partial^{2}L}{\partial\theta_{2}\partial\theta_{1}}&\frac{\partial^{2}L}{\partial\theta_{2}\partial\theta_{2}}&\cdots&\frac{\partial^{2}L}{\partial\theta_{2}\partial\theta_{n}}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial^{2}L}{\partial\theta_{n}\partial\theta_{1}}&\frac{\partial^{2}L}{\partial\theta_{n}\partial\theta_{2}}&\cdots&\frac{\partial^{2}L}{\partial\theta_{n}\partial\theta_{n}}\end{bmatrix}$$
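import numpy as np

# X (m x n), Y (m x 1) and theta (n x 1) are assumed to be NumPy arrays defined as above
def hessian(X, theta, Y):
    # Element-wise version: H[i, j] = sum_k X[k, i] * X[k, j]
    O = np.zeros((theta.shape[0], theta.shape[0]))
    for i in range(O.shape[0]):
        for j in range(O.shape[1]):
            for k in range(X.shape[0]):
                O[i, j] += X[k, i] * X[k, j]
    return O

# or, equivalently
def hessian(X, theta, Y):
    return X.T @ X

def gradient(X, theta, Y):
    g = X @ theta - Y
    return X.T @ g

for i in range(100):
    L = 0.5 * ((X @ theta - Y)**2).sum()   # loss, for monitoring only
    theta = theta - np.linalg.inv(hessian(X, theta, Y)) @ gradient(X, theta, Y)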
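Because this loss is exactly quadratic in $\theta$, a single Newton step from any starting point already lands on the least-squares solution: $\theta - (X^TX)^{-1}X^T(X\theta - Y) = (X^TX)^{-1}X^TY$. The sketch below checks this on random, made-up data; `np.linalg.solve` replaces the explicit inverse.

```python
import numpy as np

def hessian(X, theta, Y):
    return X.T @ X

def gradient(X, theta, Y):
    return X.T @ (X @ theta - Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
Y = rng.normal(size=(30, 1))

theta0 = rng.normal(size=(3, 1))  # arbitrary starting point
theta1 = theta0 - np.linalg.solve(hessian(X, theta0, Y), gradient(X, theta0, Y))
theta_ls = np.linalg.solve(X.T @ X, X.T @ Y)  # normal-equation solution

print(np.allclose(theta1, theta_ls))
print(np.abs(gradient(X, theta1, Y)).max())  # close to zero at the optimum
```

The remaining iterations of the loop in the text therefore leave theta essentially unchanged for this particular loss.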
$$L=\frac{1}{2}||X\theta-Y||_2^2$$
Define the residual $r$ as:
$$r = X\theta-Y$$
$$L=\frac{1}{2}||X\theta-Y||_2^2=\frac{1}{2}||r||_2^2$$
Written out, $r$ is:
$$r=\begin{bmatrix}r^{(1)}\\ r^{(2)}\\ \vdots\\ r^{(m)}\end{bmatrix}=\begin{bmatrix}x_{1}^{(1)}\theta_{1}+x_{2}^{(1)}\theta_{2}+...+x_{n}^{(1)}\theta_{n}-y^{(1)}\\ x_{1}^{(2)}\theta_{1}+x_{2}^{(2)}\theta_{2}+...+x_{n}^{(2)}\theta_{n}-y^{(2)}\\ \vdots\\ x_{1}^{(m)}\theta_{1}+x_{2}^{(m)}\theta_{2}+...+x_{n}^{(m)}\theta_{n}-y^{(m)}\end{bmatrix}$$
The Jacobian matrix of the residual $r$ with respect to the parameters $\theta$ is then:
$$J(r(\theta))=\begin{bmatrix}\frac{\partial r^{(1)}}{\partial\theta_{1}}&\frac{\partial r^{(1)}}{\partial\theta_{2}}&\cdots&\frac{\partial r^{(1)}}{\partial\theta_{n}}\\ \frac{\partial r^{(2)}}{\partial\theta_{1}}&\frac{\partial r^{(2)}}{\partial\theta_{2}}&\cdots&\frac{\partial r^{(2)}}{\partial\theta_{n}}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial r^{(m)}}{\partial\theta_{1}}&\frac{\partial r^{(m)}}{\partial\theta_{2}}&\cdots&\frac{\partial r^{(m)}}{\partial\theta_{n}}\end{bmatrix}$$
Recall the Newton update:
$$\theta^+=\theta-H^{-1}\frac{\partial L(\theta)}{\partial\theta}$$
Use the Jacobian $J(r(\theta))$ of the residual $r$ with respect to $\theta$ to approximate the Hessian $H(L(\theta))$ of the objective with respect to $\theta$:
$$H(L(\theta))\approx J(r(\theta))^T J(r(\theta))$$
Since $\frac{\partial L(\theta)}{\partial\theta} = J^T r$, the update becomes:
$$\theta^+=\theta - (J^T J)^{-1} J^T r$$
In the Gauss-Newton method, the update for $\theta$ closely resembles the least-squares solution:
$$\theta^+=\theta - (J^T J)^{-1} J^T r$$
The LM (Levenberg-Marquardt) correction introduces $\mu$, which acts like a regularization term and resolves the problem of $J^TJ$ being singular. $\mu$ is also called the damping coefficient.
$$\theta^+=\theta - (J^T J + \mu I)^{-1} J^T r$$
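import numpy as np

def jacobian(X, theta, Y):
    # Assumed helper: r = X @ theta - Y is linear in theta, so dr/dtheta = X
    return X

u = 0.00001  # damping coefficient mu
for i in range(100):
    r = X @ theta - Y
    L = 0.5 * (r**2).sum()
    J = jacobian(X, theta, Y)
    theta = theta - np.linalg.inv(J.T @ J + u * np.eye(J.shape[1])) @ J.T @ r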
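The damping term matters most when $J^TJ$ is singular. The sketch below runs the LM update on a made-up linear problem with two linearly dependent columns; `jacobian` returns X because the residual is linear in theta, which is an assumption of this example rather than something shown in the course code. A plain Gauss-Newton step would need the inverse of a rank-deficient matrix here.

```python
import numpy as np

def jacobian(X, theta, Y):
    # r = X @ theta - Y is linear in theta, so dr/dtheta = X
    return X

rng = np.random.default_rng(0)
a = rng.normal(size=(20, 1))
X = np.hstack([a, 2.0 * a, np.ones((20, 1))])  # dependent columns -> singular J^T J
Y = X @ np.array([[1.0], [1.0], [0.5]])        # targets that X can fit exactly

rank = np.linalg.matrix_rank(X.T @ X)          # 2, although J^T J is 3 x 3

theta = np.zeros((3, 1))
mu = 1e-5  # damping coefficient
losses = []
for i in range(10):
    r = X @ theta - Y
    losses.append(0.5 * (r ** 2).sum())
    J = jacobian(X, theta, Y)
    theta = theta - np.linalg.solve(J.T @ J + mu * np.eye(J.shape[1]), J.T @ r)

final_loss = 0.5 * ((X @ theta - Y) ** 2).sum()
print("rank of J^T J:", rank)
print("initial loss:", losses[0], "final loss:", final_loss)
```

A fixed mu is used here to match the loop in the text; practical LM implementations usually adapt mu between iterations depending on whether the last step reduced the loss.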