Common Neural Network Optimizers
Preface
These notes summarize the author's study of the TensorFlow course taught by Professor Cao Jian of Peking University on the Chinese University MOOC platform.
Before this, the author had watched Andrew Ng's deep learning course and CS231n, both of which cover several optimizers and explain why each one works; by comparison, though, Professor Cao Jian's presentation is easier to commit to memory.
1. Preliminaries and Notation
- Parameters to be optimized: $w$
- Loss function: $loss$
- Learning rate: $lr$
- Each iteration processes one batch
- $t$: the index of the current batch iteration (the total number of batches processed so far)
Parameter update steps:
- Compute the gradient of the loss function with respect to the current parameters at step $t$: $g_t=\nabla loss=\dfrac{\partial\, loss}{\partial w_t}$
- Compute the first-order momentum $m_t$ and second-order momentum $V_t$ at step $t$
- Compute the descent step at step $t$: $\eta_t=lr\cdot m_t/\sqrt{V_t}$
- Compute the parameters at step $t+1$: $w_{t+1}=w_t-\eta_t=w_t-lr\cdot m_t/\sqrt{V_t}$
First-order momentum: a function of the gradient.
Second-order momentum: a function of the squared gradient.
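Every optimizer below instantiates this same template and differs only in how $m_t$ and $V_t$ are defined. A minimal NumPy sketch of the shared skeleton (the function name `update_step` and the NumPy setting are illustrative assumptions, not from the course):

```python
import numpy as np

def update_step(w, m, V, lr):
    # Shared template: eta_t = lr * m_t / sqrt(V_t); w_{t+1} = w_t - eta_t.
    # Each optimizer only changes how m (first-order momentum) and
    # V (second-order momentum) are computed from the gradients g_t.
    eta = lr * m / np.sqrt(V)
    return w - eta
```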
2. Stochastic Gradient Descent (SGD)
First-order momentum: $m_t=g_t$; second-order momentum: $V_t=1$
$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot g_t$
$w_{t+1}=w_t-\eta_t=w_t-lr\cdot m_t/\sqrt{V_t}=w_t-lr\cdot g_t$
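A minimal NumPy sketch of SGD on a toy quadratic loss (the loss $(w-3)^2$, the step count, and the variable names are assumptions for illustration only):

```python
import numpy as np

lr = 0.1
w = np.array([5.0])              # parameter to be optimized

for t in range(50):
    g = 2 * (w - 3.0)            # gradient of loss = (w - 3)^2
    m, V = g, 1.0                # SGD: m_t = g_t, V_t = 1
    w = w - lr * m / np.sqrt(V)  # reduces to w - lr * g

print(w)                         # approaches the minimum at 3.0
```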
3. SGDM (SGD with Momentum)
SGDM adds first-order momentum on top of SGD.
In SGDM, $m_t$ is an exponential moving average of the gradient directions over all steps so far.
First-order momentum: $m_t=\beta\cdot m_{t-1}+(1-\beta)\cdot g_t$; second-order momentum: $V_t=1$
$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot m_t=lr\cdot(\beta\cdot m_{t-1}+(1-\beta)\cdot g_t)$
$w_{t+1}=w_t-\eta_t=w_t-lr\cdot(\beta\cdot m_{t-1}+(1-\beta)\cdot g_t)$
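The same toy problem with SGDM; the value $\beta=0.9$ is a commonly used default and, like the toy loss, an assumption of this sketch rather than part of the course formulas:

```python
import numpy as np

lr, beta = 0.1, 0.9
w = np.array([5.0])
m = np.array([0.0])                # first-order momentum state

for t in range(50):
    g = 2 * (w - 3.0)              # gradient of (w - 3)^2
    m = beta * m + (1 - beta) * g  # EMA of gradients; V_t stays 1
    w = w - lr * m
```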
4. Adagrad
Adagrad adds second-order momentum on top of SGD.
The second-order momentum is the cumulative sum of squared gradients from the start of training.
First-order momentum: $m_t=g_t$; second-order momentum: $V_t=\sum_{\tau=1}^{t}g_{\tau}^2$
$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot g_t/\sqrt{\sum_{\tau=1}^{t}g_{\tau}^2}$
$w_{t+1}=w_t-\eta_t=w_t-lr\cdot g_t/\sqrt{\sum_{\tau=1}^{t}g_{\tau}^2}$
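A sketch on the same toy loss; the small `eps` added to the denominator is a standard numerical-stability guard and, like the other constants, an assumption of this example rather than part of the formulas above:

```python
import numpy as np

lr, eps = 0.5, 1e-7
w = np.array([5.0])
V = np.array([0.0])              # running sum of squared gradients

for t in range(100):
    g = 2 * (w - 3.0)            # gradient of (w - 3)^2
    V = V + g ** 2               # accumulates forever, so steps keep shrinking
    w = w - lr * g / (np.sqrt(V) + eps)
```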
5. RMSProp
RMSProp also adds second-order momentum on top of SGD.
Here the second-order momentum is an exponential moving average of the squared gradients, so it represents an average over the recent past rather than the entire history.
First-order momentum: $m_t=g_t$; second-order momentum: $V_t=\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2$
$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot g_t/\sqrt{\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2}$
$w_{t+1}=w_t-\eta_t=w_t-lr\cdot g_t/\sqrt{\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2}$
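The corresponding sketch differs from the Adagrad one in a single line: the squared-gradient accumulator decays instead of growing without bound (constants again assumed for illustration):

```python
import numpy as np

lr, beta, eps = 0.05, 0.9, 1e-7
w = np.array([5.0])
V = np.array([0.0])

for t in range(100):
    g = 2 * (w - 3.0)                   # gradient of (w - 3)^2
    V = beta * V + (1 - beta) * g ** 2  # EMA of squared gradients
    w = w - lr * g / (np.sqrt(V) + eps)
```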
6. Adam
Adam combines the first-order momentum of SGDM with the second-order momentum of RMSProp.
First-order momentum: $m_t=\beta_1\cdot m_{t-1}+(1-\beta_1)\cdot g_t$
Bias-corrected first-order momentum: $\hat{m}_t=\dfrac{m_t}{1-\beta_1^t}$
Second-order momentum: $V_t=\beta_2\cdot V_{t-1}+(1-\beta_2)\cdot g_t^2$
Bias-corrected second-order momentum: $\hat{V}_t=\dfrac{V_t}{1-\beta_2^t}$
$\eta_t=lr\cdot\hat{m}_t/\sqrt{\hat{V}_t}=lr\cdot\dfrac{m_t}{1-\beta_1^t}\Big/\sqrt{\dfrac{V_t}{1-\beta_2^t}}$
$w_{t+1}=w_t-\eta_t=w_t-lr\cdot\dfrac{m_t}{1-\beta_1^t}\Big/\sqrt{\dfrac{V_t}{1-\beta_2^t}}$
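Putting the pieces together, a minimal NumPy sketch of Adam; the hyperparameter values follow common defaults ($\beta_1=0.9$, $\beta_2=0.999$) and are, together with the toy loss and `eps`, assumptions of this example:

```python
import numpy as np

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-7
w = np.array([5.0])
m = np.array([0.0])                       # first-order momentum state
V = np.array([0.0])                       # second-order momentum state

for t in range(1, 101):                   # t starts at 1 for bias correction
    g = 2 * (w - 3.0)                     # gradient of (w - 3)^2
    m = beta1 * m + (1 - beta1) * g       # SGDM-style first-order momentum
    V = beta2 * V + (1 - beta2) * g ** 2  # RMSProp-style second-order momentum
    m_hat = m / (1 - beta1 ** t)          # bias correction
    V_hat = V / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(V_hat) + eps)
```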