【Deep Learning - Andrew Ng】L1-2 Neural Network Basics

L1 Introduction to Deep Learning

2 Neural Network Basics

Total video length: 145 min 6 s

2.1 Binary Classification

Notation

  • m: size of the dataset
    • m_train: size of the training set
    • m_test: size of the test set
  • n_x: dimension of the input feature vector, often abbreviated as n
  • (x, y): a single training example
  • y: the output label in binary classification, y ∈ {0, 1}
  • x: an n_x-dimensional input feature vector, x ∈ R^{n_x}
  • Training set: {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}
    • X = [x^(1), x^(2), …, x^(m)]
      • X ∈ R^{n_x × m}
      • X.shape = (n_x, m)
    • Y = [y^(1), y^(2), …, y^(m)]
      • Y ∈ R^{1 × m}
      • Y.shape = (1, m)
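A minimal NumPy sketch of these shape conventions; the values of n_x and m are made up:

```python
import numpy as np

n_x, m = 4, 3                          # hypothetical: 4 features, 3 examples

# Each column of X is one example x^(i); each entry of Y is the label y^(i).
X = np.random.randn(n_x, m)            # X.shape == (n_x, m)
Y = np.random.randint(0, 2, (1, m))    # Y.shape == (1, m), entries in {0, 1}

print(X.shape)  # (4, 3)
print(Y.shape)  # (1, 3)
```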
2.2 Logistic Regression

Logistic regression: a binary classification algorithm whose output label is 0 or 1.

Given x, we want $\hat{y} = P(y = 1 \mid x)$.

With plain linear regression, $\hat{y} = w^T x + b$

  • but this value is not confined to [0, 1]; it can be very large or negative

Sigmoid function: $\hat{y} = \sigma(w^T x + b)$

  • $\sigma(z) = \frac{1}{1 + e^{-z}}$

  • the output lies between 0 and 1

  • it asymptotically approaches 0 (for large negative z) or 1 (for large positive z)

  • the parameters to learn are $w$ and $b$ (a short code sketch follows this list)

  • an alternative notation (not important here)

    • $\hat{y} = \sigma(\theta^T x)$

    • the input x is augmented with $x_0 = 1$, so $x \in \mathbb{R}^{n_x + 1}$

    • $\theta = \begin{pmatrix} \theta_0\\ \theta_1\\ \vdots\\ \theta_{n_x} \end{pmatrix}$

    • $\theta_0$ multiplies $x_0$ and plays the role of $b$

    • $\theta_1$ through $\theta_{n_x}$ play the role of $w^T$
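A minimal sketch of the sigmoid and the prediction $\hat{y} = \sigma(w^T x + b)$; the values of w, b, and x below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([[0.5], [-1.0], [2.0]])   # hypothetical weights, shape (n_x, 1)
b = 0.1                                # hypothetical bias
x = np.array([[1.0], [0.5], [-0.2]])   # one input example, shape (n_x, 1)

z = np.dot(w.T, x) + b                 # scalar stored in a (1, 1) array
y_hat = sigmoid(z)                     # predicted probability P(y = 1 | x)
print(y_hat[0, 0])
```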

2.3 Logistic Regression Loss Function

The superscript $(i)$ refers to the $i$-th training example:

  • $\sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}$

Loss function $L(\hat{y}, y)$

  • defined on a single training example

  • $L(\hat{y}, y) = -\big(y\log\hat{y} + (1-y)\log(1-\hat{y})\big)$

    • if $y = 1$: $L(\hat{y}, y) = -\log\hat{y}$
      • to make the loss small, $\hat{y}$ should be large, i.e. close to 1
    • if $y = 0$: $L(\hat{y}, y) = -\log(1-\hat{y})$
      • to make the loss small, $\hat{y}$ should be small, i.e. close to 0

Cost function

  • the overall cost of the parameters on the entire training set

  • $J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\big(y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\big)$
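A small sketch that evaluates the single-example loss and the cost J on a toy batch; the predictions and labels are made up:

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for a single example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Average loss over the m examples; Y_hat and Y have shape (1, m)."""
    m = Y.shape[1]
    return float(np.sum(loss(Y_hat, Y)) / m)

Y_hat = np.array([[0.9, 0.2, 0.6]])   # hypothetical predictions
Y = np.array([[1, 0, 1]])             # hypothetical labels
print(loss(0.9, 1))                   # small loss: confident and correct
print(loss(0.9, 0))                   # large loss: confident and wrong
print(cost(Y_hat, Y))
```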

2.4 Gradient Descent

We want to find parameters $w, b$ that minimize $J(w, b)$.

$J(w,b)$ is a convex function, so gradient descent reaches (roughly) the same final point, the global optimum, from any initialization.

[Figure 1]
$$w := w - \alpha\frac{dJ(w,b)}{dw},\qquad b := b - \alpha\frac{dJ(w,b)}{db}$$

  • $\alpha$: the learning rate

  • update $w, b$ iteratively until (close to) the optimum (a minimal sketch follows this list)

  • strictly speaking, with two or more variables the derivatives should be written with $\partial$

  • in code, these derivatives are usually stored in variables named dw and db
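A minimal sketch of the update rule on the simple convex function $J(w) = (w-3)^2$, whose global minimum is at $w = 3$; the learning rate and initialization are arbitrary:

```python
# Gradient descent on J(w) = (w - 3)^2, with dJ/dw = 2*(w - 3)
w = -5.0        # arbitrary initialization
alpha = 0.1     # learning rate

for _ in range(100):
    dw = 2 * (w - 3)      # derivative of J at the current w
    w = w - alpha * dw    # the update rule w := w - alpha * dJ/dw

print(w)   # converges close to the global optimum w = 3
```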

2.5 Derivatives

$f(a) = 3a$

$\frac{d}{da}f(a) = 3$

Key terms: derivative, slope (the derivative of a function at a point is its slope there).

2.6 Derivatives (continued)

$f(a) = a^2 \qquad \frac{d}{da}f(a) = 2a$

$f(a) = 3a \qquad \frac{d}{da}f(a) = 3$

$f(a) = \ln(a) \qquad \frac{d}{da}f(a) = \frac{1}{a}$
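These derivatives can be sanity-checked numerically by nudging a and measuring the slope $(f(a+\varepsilon) - f(a))/\varepsilon$; a small sketch:

```python
import math

def numerical_slope(f, a, eps=1e-6):
    """Finite-difference approximation of df/da at the point a."""
    return (f(a + eps) - f(a)) / eps

a = 2.0
print(numerical_slope(lambda x: x ** 2, a), 2 * a)       # ~4.0 vs 4.0
print(numerical_slope(lambda x: 3 * x, a), 3)            # ~3.0 vs 3
print(numerical_slope(lambda x: math.log(x), a), 1 / a)  # ~0.5 vs 0.5
```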

2.7 Computation Graph

$J(a,b,c) = 3(a + bc)$

[Figure 2]

Computation graph: a forward (left-to-right) pass computes the output J.

2.8 Derivatives with a Computation Graph

Derivatives: computed by a backward (right-to-left) pass through the graph.

Chain rule (with the intermediate variables $u = bc$, $v = a + u$, $J = 3v$):

$\frac{dJ}{dv} = 3$

$\frac{dJ}{da} = 3$

$\frac{dJ}{du} = 3$

$\frac{dJ}{db} = 3c$

In code, the derivative of the final output with respect to some intermediate variable, $\frac{d\,\mathrm{FinalOutputVar}}{d\,\mathrm{var}}$, is conventionally stored in a variable named dvar.
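A sketch of the forward pass and the right-to-left derivative computation for $J = 3(a + bc)$, with arbitrary inputs; the dvar names follow the convention above:

```python
# Forward pass (left to right) for J(a, b, c) = 3 * (a + b*c)
a, b, c = 5.0, 3.0, 2.0   # arbitrary example inputs
u = b * c                 # u = bc
v = a + u                 # v = a + u
J = 3 * v                 # J = 3v
print(J)                  # 33.0

# Backward pass (right to left), applying the chain rule
dv = 3.0                  # dJ/dv
da = dv * 1.0             # dJ/da = dJ/dv * dv/da = 3
du = dv * 1.0             # dJ/du = 3
db = du * c               # dJ/db = dJ/du * du/db = 3c
dc = du * b               # dJ/dc = 3b
print(dv, da, du, db, dc)
```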

2.9 Gradient Descent for Logistic Regression

Logistic regression recap

  • $z = w^T x + b$
  • $\hat{y} = a = \sigma(z)$
  • $L(a, y) = -\big(y\log(a) + (1-y)\log(1-a)\big)$

Computation graph with two input features:

[Figure 3]

Variable da $= \frac{dL(a,y)}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$

Variable dz $= \frac{dL}{dz} = a - y$

Variable dw1 $= \frac{dL}{dw_1} = x_1(a-y)$

Variable dw2 $= \frac{dL}{dw_2} = x_2(a-y)$

Variable db $= \frac{dL}{db} = a - y$

Gradient descent update for a single example (a code sketch follows):

$w_1 := w_1 - \alpha\frac{dL}{dw_1}$

$w_2 := w_2 - \alpha\frac{dL}{dw_2}$

$b := b - \alpha\frac{dL}{db}$
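A sketch of one gradient-descent step on a single example with two features; all input values are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and a single training example (x1, x2, y)
w1, w2, b = 0.1, -0.2, 0.0
x1, x2, y = 1.5, -0.5, 1.0
alpha = 0.05

# Forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# Backward pass: the dz, dw1, dw2, db derived above
dz = a - y
dw1 = x1 * dz
dw2 = x2 * dz
db = dz

# Parameter update
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db
print(w1, w2, b)
```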

2.10 Gradient Descent on m Examples

$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(a^{(i)}, y^{(i)})$

  • the average of the loss over all training examples

$a^{(i)} = \hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$

$\frac{\partial}{\partial w_1}J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial w_1} L(a^{(i)}, y^{(i)})$

One iteration, written out for two features:

$J = 0,\; dw_1 = 0,\; dw_2 = 0,\; db = 0$

For $i = 1$ to $m$:

$\quad z^{(i)} = w^T x^{(i)} + b$

$\quad a^{(i)} = \sigma(z^{(i)})$

$\quad J \mathrel{+}= -\big(y^{(i)}\log(a^{(i)}) + (1-y^{(i)})\log(1-a^{(i)})\big)$

$\quad dz^{(i)} = a^{(i)} - y^{(i)}$

$\quad dw_1 \mathrel{+}= x_1^{(i)} dz^{(i)}$

$\quad dw_2 \mathrel{+}= x_2^{(i)} dz^{(i)}$

$\quad db \mathrel{+}= dz^{(i)}$

$J \mathrel{/}= m;\quad dw_1 \mathrel{/}= m;\quad dw_2 \mathrel{/}= m;\quad db \mathrel{/}= m$

$w_1 := w_1 - \alpha\, dw_1$

$w_2 := w_2 - \alpha\, dw_2$

$b := b - \alpha\, db$

  • Drawbacks (a runnable version of this loop follows the list)
    • it needs two explicit for loops: one over the m examples and one over the n_x features
      • explicit for loops are inefficient
      • vectorization removes them
    • (the lecture's point is that both loops should be eliminated; no separate second drawback is given)
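A runnable sketch of one iteration of the explicit-loop version above, for two features; the dataset is randomly generated:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 5                                  # hypothetical number of examples
X = rng.normal(size=(2, m))            # two features per example
Y = rng.integers(0, 2, size=m)         # labels in {0, 1}
w1, w2, b, alpha = 0.0, 0.0, 0.0, 0.1

J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
for i in range(m):                     # explicit loop over the m examples
    z = w1 * X[0, i] + w2 * X[1, i] + b
    a = sigmoid(z)
    J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
    dz = a - Y[i]
    dw1 += X[0, i] * dz                # second "loop" over the features, unrolled
    dw2 += X[1, i] * dz
    db += dz

J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db
print(J, w1, w2, b)
```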
2.11 Vectorization

Non-vectorized version:

# compute z = w^T x + b with an explicit loop over the nx features
z = 0
for i in range(nx):
    z += w[i] * x[i]
z += b

Vectorized version:

z = np.dot(w, x) + b   # one call to NumPy's optimized dot product
  • Both CPUs and GPUs provide SIMD parallel instructions
    • single instruction, multiple data
    • NumPy exploits this parallelism, so prefer its built-in operations to explicit for loops (a timing sketch follows)
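A small sketch comparing the explicit loop with np.dot on the same vectors; exact timings depend on the machine, but the vectorized call is typically orders of magnitude faster:

```python
import time
import numpy as np

n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)

t0 = time.time()
z_loop = 0.0
for i in range(n):          # explicit Python for loop
    z_loop += w[i] * x[i]
loop_ms = (time.time() - t0) * 1000

t0 = time.time()
z_vec = np.dot(w, x)        # single vectorized call
vec_ms = (time.time() - t0) * 1000

print(z_loop, z_vec)        # same result up to floating-point rounding
print(f"loop: {loop_ms:.1f} ms, vectorized: {vec_ms:.2f} ms")
```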
2.12 Vectorization 2

Avoid for loops by using NumPy's elementwise functions:

  • elementwise exponential $e^{v}$

    u = np.exp(v)

  • elementwise logarithm

    np.log(v)

  • elementwise absolute value

    np.abs(v)

  • elementwise maximum (here against 0)

    np.maximum(v, 0)

Logistic regression, non-vectorized version:

$J = 0,\; dw_1 = 0,\; dw_2 = 0,\; db = 0$

For $i = 1$ to $m$:

$\quad z^{(i)} = w^T x^{(i)} + b$

$\quad a^{(i)} = \sigma(z^{(i)})$

$\quad J \mathrel{+}= -\big(y^{(i)}\log(a^{(i)}) + (1-y^{(i)})\log(1-a^{(i)})\big)$

$\quad dz^{(i)} = a^{(i)} - y^{(i)}$

$\quad dw_1 \mathrel{+}= x_1^{(i)} dz^{(i)}$

$\quad dw_2 \mathrel{+}= x_2^{(i)} dz^{(i)}$

$\quad db \mathrel{+}= dz^{(i)}$

$J \mathrel{/}= m;\quad dw_1 \mathrel{/}= m;\quad dw_2 \mathrel{/}= m;\quad db \mathrel{/}= m$

$w_1 := w_1 - \alpha\, dw_1$

$w_2 := w_2 - \alpha\, dw_2$

$b := b - \alpha\, db$

Logistic regression with dw vectorized over the features (still looping over the m examples):

$J = 0,\; dw = \mathrm{np.zeros}((n_x, 1)),\; db = 0$

For $i = 1$ to $m$:

$\quad z^{(i)} = w^T x^{(i)} + b$

$\quad a^{(i)} = \sigma(z^{(i)})$

$\quad J \mathrel{+}= -\big(y^{(i)}\log(a^{(i)}) + (1-y^{(i)})\log(1-a^{(i)})\big)$

$\quad dz^{(i)} = a^{(i)} - y^{(i)}$

$\quad dw \mathrel{+}= x^{(i)} dz^{(i)}$

$\quad db \mathrel{+}= dz^{(i)}$

$J \mathrel{/}= m;\quad dw \mathrel{/}= m;\quad db \mathrel{/}= m$

$w := w - \alpha\, dw$

$b := b - \alpha\, db$

2.13 Vectorization 3

Computing the forward propagation for all m examples at once:

$$Z = [z^{(1)}\; z^{(2)}\; \dots\; z^{(m)}] = w^T X + [b\; b\; \dots\; b] = [\,w^T x^{(1)} + b \quad w^T x^{(2)} + b \quad \dots \quad w^T x^{(m)} + b\,]$$

Since b is a scalar, this takes a single line of code, with no for loop:

Z = np.dot(w.T, X) + b

Here b is a scalar; Python broadcasting expands it to a 1×m row vector.
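A sketch of the vectorized forward pass over m examples; the sizes and data are made up, and the shapes are noted in the comments:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, m = 3, 5                    # hypothetical sizes
X = np.random.randn(n_x, m)      # (n_x, m): one example per column
w = np.zeros((n_x, 1))           # (n_x, 1)
b = 0.0                          # scalar, broadcast across the m columns

Z = np.dot(w.T, X) + b           # (1, m): all z^(i) at once, no for loop
A = sigmoid(Z)                   # (1, m): all activations a^(i) = y_hat^(i)
print(Z.shape, A.shape)
```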

2.14 Vectorization 4

Computing the gradients for all m examples at once.

Non-vectorized:

  • Definition of dZ:
    $dz^{(1)} = a^{(1)} - y^{(1)},\quad dz^{(2)} = a^{(2)} - y^{(2)},\quad \dots,\quad dz^{(m)} = a^{(m)} - y^{(m)}$
    $dZ = [dz^{(1)}\quad dz^{(2)}\quad \dots\quad dz^{(m)}]$
  • dZ expressed with A and Y:
    $A = [a^{(1)}\dots a^{(m)}],\qquad Y = [y^{(1)}\dots y^{(m)}]$
    $\Rightarrow\; dZ = A - Y = [a^{(1)}-y^{(1)}\;\; \dots\;\; a^{(m)}-y^{(m)}]$
  • Computing dw and db by looping over the examples:
    $dw = 0;\quad dw \mathrel{+}= x^{(1)} dz^{(1)};\quad dw \mathrel{+}= x^{(2)} dz^{(2)};\quad \dots;\quad dw \mathrel{+}= x^{(m)} dz^{(m)};\quad dw \mathrel{/}= m$
    $db = 0;\quad db \mathrel{+}= dz^{(1)};\quad db \mathrel{+}= dz^{(2)};\quad \dots;\quad db \mathrel{+}= dz^{(m)};\quad db \mathrel{/}= m$
  • Vectorized:
    $db = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)} = \frac{1}{m}\,\mathrm{np.sum}(dZ)$
    $dw = \frac{1}{m} X\, dZ^T = \frac{1}{m}\,[x^{(1)}\dots x^{(m)}]\,[dz^{(1)}\dots dz^{(m)}]^T$

Full implementation of many gradient-descent iterations:

For iter in range(1000):   # repeat the update many times

$\quad Z = w^T X + b = \mathrm{np.dot}(w.T, X) + b$

$\quad A = \sigma(Z)$

$\quad dZ = A - Y$

$\quad dw = \frac{1}{m} X\, dZ^T$

$\quad db = \frac{1}{m}\,\mathrm{np.sum}(dZ)$

$\quad w := w - \alpha\, dw$

$\quad b := b - \alpha\, db$

[Figure 4]

[Figure 5]
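A runnable sketch of the fully vectorized training loop above; the synthetic dataset, learning rate, and iteration count are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical synthetic dataset: m examples with n_x features each
rng = np.random.default_rng(1)
n_x, m = 2, 200
X = rng.normal(size=(n_x, m))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # (1, m) labels from a simple rule

w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.5

for it in range(1000):              # repeat many gradient-descent iterations
    Z = np.dot(w.T, X) + b          # forward pass, shape (1, m)
    A = sigmoid(Z)
    dZ = A - Y                      # shape (1, m)
    dw = np.dot(X, dZ.T) / m        # shape (n_x, 1)
    db = np.sum(dZ) / m             # scalar
    w -= alpha * dw
    b -= alpha * db

eps = 1e-12                         # guard against log(0) when A saturates
cost = -np.mean(Y * np.log(A + eps) + (1 - Y) * np.log(1 - A + eps))
print(cost, w.ravel(), b)
```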

2.15 Broadcasting in Python

A broadcasting example:

# sum down each column of A (axis=0); to sum across each row, use axis=1
cal = A.sum(axis=0)
percentage = 100 * A / cal.reshape(1, 4)   # an (m, 4) matrix divided by a (1, 4) row vector

General rule (see the example below)

  • elementwise addition or subtraction (and other elementwise operations) with an m×n matrix
    • a (1, n) row vector or an (m, 1) column vector is broadcast (copied) up to shape (m, n)
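A self-contained sketch of the example above; the 3×4 matrix A and the broadcast operands are made-up values:

```python
import numpy as np

A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])

cal = A.sum(axis=0)                        # column sums, shape (4,)
percentage = 100 * A / cal.reshape(1, 4)   # (3, 4) / (1, 4): the row is broadcast down
print(percentage)

# The general rule in action: (m, 1) columns and scalars broadcast the same way
print(A + np.array([[100.0], [200.0], [300.0]]))   # (3, 4) + (3, 1) -> (3, 4)
print(A + 1)                                       # scalar broadcast to every entry
```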
2.16 A Note on Python/NumPy

Advantages

  • great expressive power: a lot can be done in a single line

Disadvantages

  • that flexibility makes it easy to introduce subtle bugs

Avoid rank-1 arrays, i.e. arrays of shape (n,); use explicit (n, 1) column vectors or (1, n) row vectors instead (see the example below).
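A sketch of the kind of subtle bug rank-1 arrays cause, and the recommended fix:

```python
import numpy as np

a = np.random.randn(5)        # rank-1 array, shape (5,) -- avoid
print(a.shape)                # (5,)
print(a.T.shape)              # (5,): transposing a rank-1 array changes nothing
print(np.dot(a, a.T))         # a scalar, not the 5x5 outer product you might expect

b = np.random.randn(5, 1)     # explicit column vector, shape (5, 1) -- prefer this
print(b.T.shape)              # (1, 5)
print(np.dot(b, b.T).shape)   # (5, 5) outer product, as the math suggests

assert b.shape == (5, 1)      # cheap shape checks catch such mistakes early
a = a.reshape(5, 1)           # or reshape a rank-1 array into an explicit column
```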

2.17 Jupyter Notebook Guide

Assignment link: https://www.heywhale.com/home/column/5e8181ce246a590036b875f9

2.18 Explanation of the Logistic Loss Function

Loss function $L$

  • If $y = 1$: $P(y \mid x) = \hat{y}$

  • If $y = 0$: $P(y \mid x) = 1 - \hat{y}$

  • Both cases combine into $P(y \mid x) = \hat{y}^{\,y}\,(1-\hat{y})^{(1-y)}$

  • $\log P(y \mid x) = y\log\hat{y} + (1-y)\log(1-\hat{y}) = -L(\hat{y}, y)$

Cost function $J$

Assuming the training examples are i.i.d., the likelihood of all the labels in the training set is
$$P(\text{labels in train set}) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)})$$
$$\log P(\text{labels in train set}) = \log \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}) = -\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$
Maximizing this log-likelihood is therefore equivalent to minimizing the cost
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$
where the $\frac{1}{m}$ factor only rescales and does not change the minimizer.
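A tiny numeric check that $-\log P(y \mid x)$ with $P(y \mid x) = \hat{y}^{\,y}(1-\hat{y})^{(1-y)}$ matches the cross-entropy loss, for an arbitrary $\hat{y}$:

```python
import numpy as np

def loss(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = 0.8                                   # arbitrary predicted probability
for y in (0, 1):
    p = y_hat ** y * (1 - y_hat) ** (1 - y)   # P(y | x)
    print(y, -np.log(p), loss(y_hat, y))      # the two values agree
```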
