Backpropagation in a General Neural Network (DNN)

The DNN backpropagation process

Differentiation of multivariate functions

The loss function is always a scalar function: a norm-type loss turns a vector into a scalar. Computing the derivative of the loss with respect to the input of layer $L$ is therefore a scalar-by-vector derivative. In fact, a vector or matrix of any shape can be viewed as a flat array of arguments of a multivariate function.
For example, an $m \times n$ matrix $\{W_{ij}\}$ can be flattened into such an argument array:
$$\{W_{ij}\} \rightarrow (W_{11}, W_{12}, \dots, W_{mn})$$
A scalar function of $\{W_{ij}\}$ can then be treated as a multivariate function of $(W_{11}, W_{12}, \dots, W_{mn})$, and the gradient of that multivariate function is exactly the derivative of the scalar function with respect to the matrix. Recall how the gradient of a multivariate function is written:
$$\frac{\partial f}{\partial \vec{x}} = \left(\frac{\partial f}{\partial x_{1}}, \frac{\partial f}{\partial x_{2}}, \dots, \frac{\partial f}{\partial x_{n}}\right)$$
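To make the "matrix as a flat argument list" view concrete, here is a minimal NumPy sketch. The scalar function `f`, the fixed vector `x`, and all sizes are illustrative choices (not anything defined above): it compares the analytic gradient of a scalar function of a matrix with the entry-by-entry finite-difference gradient obtained by perturbing one $W_{ij}$ at a time.

```python
import numpy as np

# Illustrative scalar function of a matrix W: f(W) = ||W x||^2 for a fixed vector x,
# treated as a multivariate function of the entries W_11, W_12, ..., W_mn.
rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))
x = rng.normal(size=(n,))

def f(W):
    return np.sum((W @ x) ** 2)            # scalar output

# Analytic gradient: df/dW = 2 (W x) x^T, an m-by-n matrix of partials df/dW_ij
grad_analytic = 2.0 * np.outer(W @ x, x)

# Numerical gradient: perturb one entry at a time, exactly as the flattened
# argument list (W_11, ..., W_mn) suggests
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        grad_numeric[i, j] = (f(Wp) - f(Wm)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))   # True
```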

Vector-by-vector derivatives

A vector-valued function can be viewed as a vector of scalar multivariate functions. For example, take a vector function $G$ that maps a vector $B$ to a vector $A$:
$$A = G(B), \quad \text{where } A \in \mathbb{R}^{N\times 1},\; B \in \mathbb{R}^{M\times 1}$$

If we view the vector $A$ as a vector of scalar multivariate functions, the derivative becomes much easier to write down:
$$\begin{aligned}
A &= \big(a_{1}(b_{1}, b_{2}, \dots, b_{m}),\; a_{2}(b_{1}, b_{2}, \dots, b_{m}),\; \dots\big)\\
\frac{\partial A}{\partial B} &= \left(\frac{\partial a_{1}}{\partial B}, \frac{\partial a_{2}}{\partial B}, \dots\right)\\
&= \begin{pmatrix}
\frac{\partial a_{1}}{\partial b_{1}} & \dots & \frac{\partial a_{1}}{\partial b_{m}}\\
\frac{\partial a_{2}}{\partial b_{1}} & \dots & \frac{\partial a_{2}}{\partial b_{m}}\\
\vdots & \ddots & \vdots\\
\frac{\partial a_{n}}{\partial b_{1}} & \dots & \frac{\partial a_{n}}{\partial b_{m}}
\end{pmatrix}
\end{aligned}$$
See, the vector derivative is much clearer now. Of course, whether you lay the derivative out as an $n \times m$ matrix or an $m \times n$ matrix does not matter, as long as you stay consistent throughout the derivation.
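As a quick sanity check of this Jacobian layout, the following sketch uses an assumed toy map $G(B) = \tanh(MB)$ with an illustrative matrix $M$: it builds the $n\times m$ matrix $\{\partial a_i/\partial b_j\}$ analytically and compares it column by column with finite differences.

```python
import numpy as np

# Illustrative vector function: A = G(B) = tanh(M B), so da_i/db_j = (1 - tanh(MB)_i^2) * M_ij
rng = np.random.default_rng(1)
n, m = 2, 3
M = rng.normal(size=(n, m))
B = rng.normal(size=(m,))

def G(B):
    return np.tanh(M @ B)                  # vector in, vector out

A = G(B)
J_analytic = (1.0 - A ** 2)[:, None] * M   # shape (n, m): rows over outputs, columns over inputs

# Numerical Jacobian, one column per input component b_j
eps = 1e-6
J_numeric = np.zeros((n, m))
for j in range(m):
    Bp = B.copy(); Bp[j] += eps
    Bm = B.copy(); Bm[j] -= eps
    J_numeric[:, j] = (G(Bp) - G(Bm)) / (2 * eps)

print(J_analytic.shape)                          # (2, 3)
print(np.allclose(J_analytic, J_numeric, atol=1e-6))   # True
```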

Differentiating the DNN loss function

The loss function of a neural network is a scalar function; common choices include the L1 and L2 norm losses. Taking the L2 norm loss as an example, the loss of a plain fully connected network is:
$$\epsilon = \frac{1}{2}\left\|\sigma(\mathbf{a}^{L}) - \mathbf{y}\right\|^{2} \qquad (Eq.1)$$
where $\mathbf{a}^{L} = \mathbf{W}^{L}\cdot\mathbf{a}^{L-1} + \mathbf{b}^{L}$, with $\mathbf{a}^{L}, \mathbf{b}^{L}\in\mathbb{R}^{N_{L}}$ and $\mathbf{W}^{L}\in\mathbb{R}^{N_{L}\times N_{L-1}}$, is the linear output of layer $L$ (the activation $\sigma$ is applied inside the loss), and $\mathbf{y}$ is the ground truth. Now, how do we get the gradient of the loss with respect to $\mathbf{W}^{L}$ and $\mathbf{b}^{L}$? We only have to expand Eq.1 into the following expression:
$$\begin{aligned}
\epsilon &= \frac{1}{2}\sum_{i}^{N}\left[\sigma\Big(\sum_{j}^{M}W_{ij}^{L}\, a_{j}^{L-1} + b_{i}^{L}\Big) - y_{i}\right]^{2}\\
\frac{\partial\epsilon}{\partial W_{xy}^{L}} &= \left[\sigma\Big(\sum_{j}^{M}W_{xj}^{L}\, a_{j}^{L-1} + b_{x}^{L}\Big) - y_{x}\right]\times\sigma'\Big(\sum_{j}^{M}W_{xj}^{L}\, a_{j}^{L-1} + b_{x}^{L}\Big)\times a_{y}^{L-1}\\
\text{so,}\quad \frac{\partial\epsilon}{\partial\mathbf{W}^{L}} &= \left\{\frac{\partial\epsilon}{\partial W_{xy}^{L}}\right\}_{x:1\rightarrow N,\; y:1\rightarrow M}\\
&= \left[\big(\sigma(\mathbf{W}^{L}\mathbf{a}^{L-1}+\mathbf{b}^{L}) - \mathbf{y}\big)\odot\sigma'(\mathbf{W}^{L}\mathbf{a}^{L-1}+\mathbf{b}^{L})\right]\cdot(\mathbf{a}^{L-1})^{T}
\end{aligned}$$
Similarly, differentiating the loss with respect to the bias gives:
$$\frac{\partial\epsilon}{\partial\mathbf{b}^{L}} = \big(\sigma(\mathbf{W}^{L}\mathbf{a}^{L-1}+\mathbf{b}^{L}) - \mathbf{y}\big)\odot\sigma'(\mathbf{W}^{L}\mathbf{a}^{L-1}+\mathbf{b}^{L})$$
We usually write $\mathbf{z}^{L} = \mathbf{W}^{L}\cdot\mathbf{a}^{L-1}+\mathbf{b}^{L}$ for the pre-activation output and $\boldsymbol{\delta}^{L} = \big(\sigma(\mathbf{z}^{L})-\mathbf{y}\big)\odot\sigma'(\mathbf{z}^{L})$ for the Hadamard product above. The gradient of the loss with respect to the last layer's parameters is then:
$$\begin{aligned}
\frac{\partial\epsilon}{\partial\mathbf{W}^{L}} &= \boldsymbol{\delta}^{L}\cdot(\mathbf{a}^{L-1})^{T}\\
\frac{\partial\epsilon}{\partial\mathbf{b}^{L}} &= \boldsymbol{\delta}^{L}
\end{aligned}$$
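These last-layer formulas are easy to check numerically. Below is a minimal sketch that assumes a sigmoid activation for $\sigma$ (any differentiable activation would do) and random illustrative shapes: it forms $\boldsymbol{\delta}^{L}$, applies the two formulas above, and compares single entries against finite differences of the loss.

```python
import numpy as np

# Illustrative sigma and layer sizes; a_prev plays the role of a^{L-1}
def sigma(z):   return 1.0 / (1.0 + np.exp(-z))
def sigma_p(z): return sigma(z) * (1.0 - sigma(z))

rng = np.random.default_rng(2)
N_L, N_prev = 3, 4
W_L = rng.normal(size=(N_L, N_prev))
b_L = rng.normal(size=(N_L,))
a_prev = rng.normal(size=(N_prev,))            # a^{L-1}
y = rng.normal(size=(N_L,))                    # ground truth

def loss(W_L, b_L):
    z_L = W_L @ a_prev + b_L
    return 0.5 * np.sum((sigma(z_L) - y) ** 2)

# delta^L = (sigma(z^L) - y) ⊙ sigma'(z^L), then the two last-layer formulas
z_L = W_L @ a_prev + b_L
delta_L = (sigma(z_L) - y) * sigma_p(z_L)
dW_L = np.outer(delta_L, a_prev)               # delta^L · (a^{L-1})^T
db_L = delta_L

# Finite-difference check on one weight entry and one bias entry
eps = 1e-6
Wp = W_L.copy(); Wp[1, 2] += eps
Wm = W_L.copy(); Wm[1, 2] -= eps
print(np.isclose(dW_L[1, 2], (loss(Wp, b_L) - loss(Wm, b_L)) / (2 * eps)))   # True
bp = b_L.copy(); bp[0] += eps
bm = b_L.copy(); bm[0] -= eps
print(np.isclose(db_L[0], (loss(W_L, bp) - loss(W_L, bm)) / (2 * eps)))      # True
```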
Hold on, it looks like we have stumbled onto something useful. Differentiating with respect to the parameters of an arbitrary layer $H$, we have:
$$\begin{aligned}
\frac{\partial\epsilon}{\partial\mathbf{W}^{H}} &= \boldsymbol{\delta}^{H}\cdot(\mathbf{a}^{H-1})^{T} &\quad (Eq.2)\\
\frac{\partial\epsilon}{\partial\mathbf{b}^{H}} &= \boldsymbol{\delta}^{H} &\quad (Eq.3)\\
\text{where}\quad \boldsymbol{\delta}^{H} &= \frac{\partial\epsilon}{\partial\mathbf{z}^{L}}\cdot\frac{\partial\mathbf{z}^{L}}{\partial\mathbf{z}^{L-1}}\cdots\frac{\partial\mathbf{z}^{H+1}}{\partial\mathbf{z}^{H}}
\end{aligned}$$
Clearly, the key to the derivative is the Jacobian of one layer's pre-activation output with respect to the previous layer's pre-activation output. Since $\mathbf{a}^{L-1} = \sigma(\mathbf{z}^{L-1})$, we have:
$$\begin{aligned}
\frac{\partial\mathbf{z}^{L}}{\partial\mathbf{z}^{L-1}} &= \left\{\frac{\partial z_{i}^{L}}{\partial z_{j}^{L-1}}\right\}\\
\frac{\partial z_{i}^{L}}{\partial z_{j}^{L-1}} &= W_{ij}^{L}\cdot\sigma'(z_{j}^{L-1})\\
\text{which indicates}\quad \frac{\partial\mathbf{z}^{L}}{\partial\mathbf{z}^{L-1}} &= \mathbf{W}^{L}\cdot diag\big(\sigma'(\mathbf{z}^{L-1})\big)\\
\text{where}\quad diag\big(\sigma'(\mathbf{z}^{L-1})\big) &= \begin{pmatrix}
\sigma'(z_{1}^{L-1}) & 0 & \dots\\
0 & \sigma'(z_{2}^{L-1}) & \dots\\
\vdots & \vdots & \ddots\\
\dots & \dots & \sigma'(z_{N_{L-1}}^{L-1})
\end{pmatrix}
\end{aligned}$$
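Here is a small sketch checking this Jacobian numerically, again assuming a sigmoid $\sigma$ and illustrative sizes: it compares $\mathbf{W}^{L}\cdot diag\big(\sigma'(\mathbf{z}^{L-1})\big)$ with the finite-difference Jacobian of $\mathbf{z}^{L} = \mathbf{W}^{L}\sigma(\mathbf{z}^{L-1}) + \mathbf{b}^{L}$ with respect to $\mathbf{z}^{L-1}$.

```python
import numpy as np

# Illustrative sigma and sizes; z_prev plays the role of z^{L-1}
def sigma(z):   return 1.0 / (1.0 + np.exp(-z))
def sigma_p(z): return sigma(z) * (1.0 - sigma(z))

rng = np.random.default_rng(3)
N_L, N_prev = 3, 5
W_L = rng.normal(size=(N_L, N_prev))
b_L = rng.normal(size=(N_L,))
z_prev = rng.normal(size=(N_prev,))

def z_next(z_prev):
    return W_L @ sigma(z_prev) + b_L          # z^L as a function of z^{L-1}

J_analytic = W_L @ np.diag(sigma_p(z_prev))   # W^L · diag(sigma'(z^{L-1}))

# Numerical Jacobian, one column per component of z^{L-1}
eps = 1e-6
J_numeric = np.zeros((N_L, N_prev))
for j in range(N_prev):
    zp = z_prev.copy(); zp[j] += eps
    zm = z_prev.copy(); zm[j] -= eps
    J_numeric[:, j] = (z_next(zp) - z_next(zm)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-6))   # True
```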

Substituting the expression above into $\boldsymbol{\delta}^{H}$, we obtain:
$$\begin{aligned}
\boldsymbol{\delta}^{H} &= \left(\frac{\partial\mathbf{z}^{L}}{\partial\mathbf{z}^{L-1}}\cdots\frac{\partial\mathbf{z}^{H+1}}{\partial\mathbf{z}^{H}}\right)^{T}\cdot\boldsymbol{\delta}^{L}\\
&= \Pi^{T}\Big(\mathbf{W}^{l}\cdot diag\big(\sigma'(\mathbf{z}^{l-1})\big)\Big)\cdot\boldsymbol{\delta}^{L} \qquad (Eq.4)
\end{aligned}$$
where the product $\Pi$ runs over the layers $l = L, L-1, \dots, H+1$. Analyzing Eq.4 from the dimension aspect:
$$\big[(N^{L}\times N^{L-1})\,(N^{L-1}\times N^{L-2})\cdots(N^{H+1}\times N^{H})\big]^{T}\cdot(N^{L}\times 1) = (N^{H}\times 1)$$
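The dimension bookkeeping, and Eq.4 itself, can also be checked in code. The sketch below uses an illustrative 3-layer network with sigmoid activations: it chains the transposed Jacobians exactly as in Eq.4 and compares the result with the usual recursive form $\boldsymbol{\delta}^{l-1} = \big(\mathbf{W}^{l}\big)^{T}\boldsymbol{\delta}^{l}\odot\sigma'(\mathbf{z}^{l-1})$.

```python
import numpy as np

# Illustrative sigma and layer widths N^0 ... N^L
def sigma(z):   return 1.0 / (1.0 + np.exp(-z))
def sigma_p(z): return sigma(z) * (1.0 - sigma(z))

rng = np.random.default_rng(4)
sizes = [6, 5, 4, 3]
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
bs = [rng.normal(size=(sizes[l + 1],)) for l in range(len(sizes) - 1)]

# Forward pass, storing every pre-activation z^l (zs[l-1] holds z^l)
a = rng.normal(size=(sizes[0],))           # a^0, the network input
zs = []
for W, b in zip(Ws, bs):
    z = W @ a + b
    zs.append(z)
    a = sigma(z)

y = rng.normal(size=(sizes[-1],))
delta_L = (sigma(zs[-1]) - y) * sigma_p(zs[-1])    # delta^L from the last-layer formula

L, H = len(zs), 1                          # H is the illustrative target layer

# Eq.4: chain the Jacobians dz^l/dz^{l-1} = W^l · diag(sigma'(z^{l-1})) for l = L ... H+1,
# then transpose the product and multiply by delta^L
prod = np.eye(sizes[-1])
for l in range(L, H, -1):
    prod = prod @ (Ws[l - 1] @ np.diag(sigma_p(zs[l - 2])))
delta_H_eq4 = prod.T @ delta_L

# The usual recursive backprop form, for comparison
delta = delta_L
for l in range(L, H, -1):
    delta = (Ws[l - 1].T @ delta) * sigma_p(zs[l - 2])

print(delta_H_eq4.shape)                   # (5,) = (N^H,), as the dimension analysis predicts
print(np.allclose(delta_H_eq4, delta))     # True
```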
It is then straightforward to write down the parameter gradients of an arbitrary layer:
$$\begin{aligned}
\frac{\partial\epsilon}{\partial\mathbf{W}^{H}} &= \Pi^{T}\Big(\mathbf{W}^{l}\cdot diag\big(\sigma'(\mathbf{z}^{l-1})\big)\Big)\cdot\boldsymbol{\delta}^{L}\cdot(\mathbf{a}^{H-1})^{T}\\
\frac{\partial\epsilon}{\partial\mathbf{b}^{H}} &= \Pi^{T}\Big(\mathbf{W}^{l}\cdot diag\big(\sigma'(\mathbf{z}^{l-1})\big)\Big)\cdot\boldsymbol{\delta}^{L}
\end{aligned}$$
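Putting everything together, the following end-to-end sketch (sigmoid activations, illustrative layer widths, and a hypothetical layer index H) backpropagates $\boldsymbol{\delta}^{L}$ down to layer $H$, applies Eq.2 and Eq.3, and verifies one weight entry of $\partial\epsilon/\partial\mathbf{W}^{H}$ against a finite difference of the full-network loss.

```python
import numpy as np

# Illustrative sigma, layer widths, input x and target y
def sigma(z):   return 1.0 / (1.0 + np.exp(-z))
def sigma_p(z): return sigma(z) * (1.0 - sigma(z))

rng = np.random.default_rng(5)
sizes = [6, 5, 4, 3]
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
bs = [rng.normal(size=(sizes[l + 1],)) for l in range(len(sizes) - 1)]
x = rng.normal(size=(sizes[0],))
y = rng.normal(size=(sizes[-1],))

def forward(Ws, bs):
    """Return all pre-activations z^l and activations a^l (with a^0 = x)."""
    a, zs, acts = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        zs.append(z)
        a = sigma(z)
        acts.append(a)
    return zs, acts

def loss(Ws, bs):
    zs, _ = forward(Ws, bs)
    return 0.5 * np.sum((sigma(zs[-1]) - y) ** 2)

zs, acts = forward(Ws, bs)
L = len(zs)
delta = (sigma(zs[-1]) - y) * sigma_p(zs[-1])      # delta^L

H = 2                                              # layer whose parameters we want
for l in range(L, H, -1):                          # propagate delta^L back to delta^H
    delta = (Ws[l - 1].T @ delta) * sigma_p(zs[l - 2])

dW_H = np.outer(delta, acts[H - 1])                # delta^H · (a^{H-1})^T   (Eq.2)
db_H = delta                                       # Eq.3

# Finite-difference check on one weight entry of layer H
eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[H - 1][0, 1] += eps
Wm = [W.copy() for W in Ws]; Wm[H - 1][0, 1] -= eps
print(np.isclose(dW_H[0, 1], (loss(Wp, bs) - loss(Wm, bs)) / (2 * eps)))   # True
```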