Common Neural Network Optimizers



Preface

  These notes summarize what I learned from the TensorFlow course taught by Cao Jian of Peking University on the Chinese University MOOC platform.
  Before this, I had watched Andrew Ng's deep learning course and CS231n, both of which cover several optimizers and explain why each one works; by comparison, Cao Jian's more direct presentation is easier to memorize.


1. Preliminaries and Notation

Parameter to optimize: $w$
Loss function: $loss$
Learning rate: $lr$
Each iteration processes one $batch$
$t$: total number of batch iterations so far

Parameter update steps:

  1. Compute the gradient of the loss with respect to the current parameters at time $t$: $g_t=\nabla loss=\dfrac{\partial\, loss}{\partial w_t}$
  2. Compute the first-order momentum $m_t$ and second-order momentum $V_t$ at time $t$
  3. Compute the descent step at time $t$: $\eta_t=lr\cdot m_t/\sqrt{V_t}$
  4. Compute the parameters at time $t+1$: $w_{t+1}=w_t-\eta_t=w_t-lr\cdot m_t/\sqrt{V_t}$

First-order momentum: a function of the gradient
Second-order momentum: a function of the squared gradient
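
To make this four-step template concrete, here is a minimal Python sketch of the generic update loop. The toy quadratic loss, its gradient, and the hyperparameter values are my own assumptions for illustration, not code from the course; with $m_t=g_t$ and $V_t=1$ the template reduces to plain SGD.

```python
import numpy as np

# Assumed toy objective: loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
def compute_gradient(w):
    return 2.0 * (w - 3.0)

lr = 0.1
w = 0.0

for t in range(1, 101):
    g_t = compute_gradient(w)         # step 1: gradient at time t
    m_t, V_t = g_t, 1.0               # step 2: momenta (SGD's choice, see below)
    eta_t = lr * m_t / np.sqrt(V_t)   # step 3: descent step
    w = w - eta_t                     # step 4: parameter update

print(w)  # approaches the minimum at w = 3
```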

2. Stochastic Gradient Descent (SGD)

First-order momentum: $m_t=g_t$        Second-order momentum: $V_t=1$

$\eta_t=lr\cdot m_t/\sqrt{V_t}$
  $=lr\cdot g_t$

$w_{t+1}=w_t-\eta_t$
  $=w_t-lr\cdot m_t/\sqrt{V_t}$
  $=w_t-lr\cdot g_t$
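
A minimal sketch of this update rule, using the same assumed toy gradient as above:

```python
def grad(w):                  # assumed toy gradient: loss = (w - 3)^2
    return 2.0 * (w - 3.0)

lr, w = 0.1, 0.0
for t in range(1, 101):
    g = grad(w)
    w -= lr * g               # w_{t+1} = w_t - lr * g_t
print(w)                      # approaches 3
```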

3. SGDM (SGD with Momentum)

SGDM adds first-order momentum on top of SGD.
In SGDM, $m_t$ is an exponential moving average of the gradient directions over time.

First-order momentum: $m_t=\beta\cdot m_{t-1}+(1-\beta)\cdot g_t$        Second-order momentum: $V_t=1$

$\eta_t=lr\cdot m_t/\sqrt{V_t}$
  $=lr\cdot m_t$
  $=lr\cdot(\beta\cdot m_{t-1}+(1-\beta)\cdot g_t)$

$w_{t+1}=w_t-\eta_t$
  $=w_t-lr\cdot(\beta\cdot m_{t-1}+(1-\beta)\cdot g_t)$
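
A sketch of SGDM under the same toy-problem assumptions; the value of $\beta$ here is illustrative (in practice it is commonly around 0.9):

```python
def grad(w):                  # assumed toy gradient: loss = (w - 3)^2
    return 2.0 * (w - 3.0)

lr, beta = 0.1, 0.9
w, m = 0.0, 0.0
for t in range(1, 101):
    g = grad(w)
    m = beta * m + (1 - beta) * g   # exponential moving average of gradients
    w -= lr * m                     # w_{t+1} = w_t - lr * m_t
print(w)                            # approaches 3
```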

4. Adagrad

Adagrad adds second-order momentum on top of SGD.
The second-order momentum is the cumulative sum of squared gradients from the start of training.

First-order momentum: $m_t=g_t$         Second-order momentum: $V_t=\sum_{\tau=1}^{t}g_{\tau}^2$

$\eta_t=lr\cdot m_t/\sqrt{V_t}$
  $=lr\cdot g_t/\sqrt{\sum_{\tau=1}^{t}g_{\tau}^2}$

$w_{t+1}=w_t-\eta_t$
  $=w_t-lr\cdot g_t/\sqrt{\sum_{\tau=1}^{t}g_{\tau}^2}$
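
A sketch of Adagrad on the same assumed toy problem. Practical implementations add a small $\epsilon$ to the denominator to avoid division by zero; the hyperparameter values are illustrative:

```python
import math

def grad(w):                  # assumed toy gradient: loss = (w - 3)^2
    return 2.0 * (w - 3.0)

lr, eps = 0.5, 1e-8
w, V = 0.0, 0.0
for t in range(1, 101):
    g = grad(w)
    V += g ** 2                         # cumulative sum of squared gradients
    w -= lr * g / (math.sqrt(V) + eps)  # effective step shrinks over time
print(w)                                # approaches 3
```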

5. RMSProp

RMSProp adds second-order momentum on top of SGD.
The second-order momentum is an exponential moving average of the squared gradients, so it represents the average over a recent window rather than the whole history.

First-order momentum: $m_t=g_t$         Second-order momentum: $V_t=\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2$

$\eta_t=lr\cdot m_t/\sqrt{V_t}$
  $=lr\cdot g_t/\sqrt{\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2}$

$w_{t+1}=w_t-\eta_t$
  $=w_t-lr\cdot g_t/\sqrt{\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2}$
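
A sketch of RMSProp under the same toy assumptions; again an $\epsilon$ term guards the division, and the hyperparameter values are illustrative:

```python
import math

def grad(w):                  # assumed toy gradient: loss = (w - 3)^2
    return 2.0 * (w - 3.0)

lr, beta, eps = 0.01, 0.9, 1e-8
w, V = 0.0, 0.0
for t in range(1, 301):
    g = grad(w)
    V = beta * V + (1 - beta) * g ** 2   # EMA of squared gradients
    w -= lr * g / (math.sqrt(V) + eps)
print(w)                                 # approaches 3
```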

6. Adam

Adam combines SGDM's first-order momentum with RMSProp's second-order momentum.

First-order momentum: $m_t=\beta_1\cdot m_{t-1}+(1-\beta_1)\cdot g_t$

Bias-corrected first-order momentum: $\hat{m}_t=\dfrac{m_t}{1-\beta_1^t}$

Second-order momentum: $V_t=\beta_2\cdot V_{t-1}+(1-\beta_2)\cdot g_t^2$

Bias-corrected second-order momentum: $\hat{V}_t=\dfrac{V_t}{1-\beta_2^t}$

$\eta_t=lr\cdot\hat{m}_t/\sqrt{\hat{V}_t}$
  $=lr\cdot\dfrac{m_t}{1-\beta_1^t}\Big/\sqrt{\dfrac{V_t}{1-\beta_2^t}}$

$w_{t+1}=w_t-\eta_t$
  $=w_t-lr\cdot\dfrac{m_t}{1-\beta_1^t}\Big/\sqrt{\dfrac{V_t}{1-\beta_2^t}}$
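
A sketch of the full Adam update under the same toy assumptions; $\beta_1=0.9$ and $\beta_2=0.999$ are the commonly used defaults, and the $\epsilon$ term is the usual numerical guard:

```python
import math

def grad(w):                  # assumed toy gradient: loss = (w - 3)^2
    return 2.0 * (w - 3.0)

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, m, V = 0.0, 0.0, 0.0
for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g         # first-order momentum (EMA of g)
    V = beta2 * V + (1 - beta2) * g ** 2    # second-order momentum (EMA of g^2)
    m_hat = m / (1 - beta1 ** t)            # bias correction for m
    V_hat = V / (1 - beta2 ** t)            # bias correction for V
    w -= lr * m_hat / (math.sqrt(V_hat) + eps)
print(w)                                    # approaches 3
```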