Reinforcement Learning (3): Policy-Based Learning and Policy Gradient

1. Policy Learning

Policy Network

  • Approximate the policy function with a policy network whose parameters are $\theta$:
    $\pi(a \mid s_t) \approx \pi(a \mid s_t; \theta)$
  • The state-value function and its approximation:
    $V_\pi(s_t) = \sum_a \pi(a \mid s_t)\, Q_\pi(s_t, a)$
    $V(s_t; \theta) = \sum_a \pi(a \mid s_t; \theta) \cdot Q_\pi(s_t, a)$
  • The objective function that policy learning maximizes:
    $J(\theta) = \mathbb{E}_S[V(S; \theta)]$
  • Update $\theta$ by gradient ascent, using the gradient at the observed state $s$ as a stochastic estimate of $\partial J(\theta) / \partial \theta$, with learning rate $\beta$:
    $\theta \leftarrow \theta + \beta \cdot \frac{\partial V(s; \theta)}{\partial \theta}$
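
The definitions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "policy network" here is a hypothetical linear softmax model (one weight vector per action), and the values in `q_values` are made up, standing in for $Q_\pi(s_t, a)$, which the text does not say how to obtain.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, state):
    """pi(. | s; theta): one logit per action, linear in the state features.

    theta is a list of weight vectors, one per action (an assumed layout).
    """
    logits = [sum(w * x for w, x in zip(weights, state)) for weights in theta]
    return softmax(logits)


def state_value(theta, state, q_values):
    """V(s; theta) = sum_a pi(a | s; theta) * Q_pi(s, a)."""
    probs = policy(theta, state)
    return sum(p * q for p, q in zip(probs, q_values))


# Hypothetical example: 2 state features, 3 actions.
theta = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # all-zero weights give a uniform policy
state = [1.0, -0.5]
q_values = [1.0, 2.0, 3.0]                     # assumed Q_pi(s, a), one per action

probs = policy(theta, state)
value = state_value(theta, state, q_values)
```

With all-zero weights the policy is uniform, so `value` is simply the average of the assumed Q values.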

2. Policy Gradient

Policy Gradient

The derivation treats $Q_\pi$ as if it did not depend on $\theta$, which is a simplification. For discrete actions:

$$
\begin{aligned}
\frac{\partial V(s;\theta)}{\partial \theta}
&= \sum_a Q_\pi(s,a)\,\frac{\partial \pi(a \mid s;\theta)}{\partial \theta} \\
&= \sum_a \pi(a \mid s;\theta) \cdot Q_\pi(s,a)\,\frac{\partial \ln \pi(a \mid s;\theta)}{\partial \theta} \\
&= \mathbb{E}_{A \sim \pi(\cdot \mid s;\theta)}\left[ Q_\pi(s,A)\,\frac{\partial \ln \pi(A \mid s;\theta)}{\partial \theta} \right] \\
&\approx Q_\pi(s_t,a_t)\,\frac{\partial \ln \pi(a_t \mid s_t;\theta)}{\partial \theta}
\end{aligned}
$$

The second line uses the identity $\frac{\partial \pi}{\partial \theta} = \pi \cdot \frac{\partial \ln \pi}{\partial \theta}$. For continuous actions the sum is replaced by an integral, $\int_a Q_\pi(s,a)\,\frac{\partial \pi(a \mid s;\theta)}{\partial \theta}\,da$, and the same steps apply. The last line is a Monte Carlo approximation of the expectation using a single sampled action $a_t$.
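
The chain of equalities can be checked numerically. The sketch below assumes a single state and a softmax policy whose logits are the parameters $\theta$ themselves (a tabular setup chosen only for illustration), with made-up values for $Q_\pi(s,a)$ held fixed. It compares the log-derivative form of the gradient against a finite-difference estimate of $\partial V / \partial \theta$ and against an average over sampled actions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(probs, action):
    """d ln pi(a) / d theta_k = 1[k == a] - pi(k) for a softmax over logits theta."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def exact_gradient(theta, q_values):
    """sum_a pi(a) * Q(a) * d ln pi(a) / d theta (the expectation form)."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p, q) in enumerate(zip(probs, q_values)):
        d = grad_log_pi(probs, a)
        for k in range(len(theta)):
            grad[k] += p * q * d[k]
    return grad


def finite_difference_gradient(theta, q_values, eps=1e-6):
    """Central-difference estimate of dV/dtheta, V = sum_a pi(a) Q(a), Q held fixed."""
    def value(t):
        return sum(p * q for p, q in zip(softmax(t), q_values))

    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        grad.append((value(up) - value(down)) / (2 * eps))
    return grad


def sampled_gradient(theta, q_values, n_samples, rng):
    """Average of Q(A) * d ln pi(A) / d theta over actions A sampled from pi."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        a = rng.choices(range(len(theta)), weights=probs)[0]
        d = grad_log_pi(probs, a)
        for k in range(len(theta)):
            grad[k] += q_values[a] * d[k] / n_samples
    return grad


theta = [0.2, -0.1, 0.5]        # arbitrary logits
q_values = [1.0, 2.0, 3.0]      # assumed Q_pi(s, a), treated as constants
exact = exact_gradient(theta, q_values)
numeric = finite_difference_gradient(theta, q_values)
sampled = sampled_gradient(theta, q_values, 20000, random.Random(0))
```

`exact` and `numeric` should agree up to finite-difference error, while `sampled` only approaches them as the number of samples grows. The single-sample approximation in the last line of the derivation is the noisy extreme of that average.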

  • Observe the state $s_t$.
  • Randomly sample an action from the policy function:
    $a_t \sim \pi(\cdot \mid s_t; \theta_t)$
  • Compute the action value:
    $q_t = Q_\pi(s_t, a_t)$
  • Compute the gradient of the policy network:
    $d_{\theta,t} = \left. \frac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta} \right|_{\theta = \theta_t}$
  • Compute the approximate policy gradient:
    $g(a_t, \theta_t) = q_t \cdot d_{\theta,t}$
  • Update the policy network:
    $\theta_{t+1} = \theta_t + \beta \cdot g(a_t, \theta_t)$
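
The update steps above can be sketched as follows. This is a minimal illustration, not a full training setup: it assumes a single state, a softmax policy whose logits are the parameters $\theta$, and a hypothetical lookup table `q_table` that supplies $q_t$ directly. In practice $Q_\pi$ is unknown and has to be estimated somehow; the table is only a stand-in so the loop can run.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(theta, action, q_value, beta):
    """One update: theta_{t+1} = theta_t + beta * q_t * d ln pi(a_t) / d theta."""
    probs = softmax(theta)
    # Gradient of ln pi(a_t) w.r.t. the logits of a softmax policy.
    d = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]
    # Approximate policy gradient g = q_t * d.
    g = [q_value * dk for dk in d]
    return [t + beta * gk for t, gk in zip(theta, g)]


def train(q_table, beta, n_steps, rng):
    """Repeat the sample / evaluate / update cycle for a single-state problem."""
    theta = [0.0] * len(q_table)
    for _ in range(n_steps):
        probs = softmax(theta)
        a_t = rng.choices(range(len(theta)), weights=probs)[0]  # a_t ~ pi
        q_t = q_table[a_t]                                      # assumed known
        theta = policy_gradient_step(theta, a_t, q_t, beta)
    return theta


# One step from a uniform policy, having sampled action 2 with q_t = 3.0.
one_step = policy_gradient_step([0.0, 0.0, 0.0], action=2, q_value=3.0, beta=0.1)

# A short run with made-up action values.
final_theta = train([1.0, 2.0, 3.0], beta=0.05, n_steps=500, rng=random.Random(0))
final_probs = softmax(final_theta)
```

Because each update uses a single sampled action, individual steps are noisy. One would typically expect probability mass to drift toward actions with larger assumed values, but a short run does not guarantee it.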

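3. Example

No worked example is included here, because at this point there is no good way to approximate the action-value function $Q_\pi$.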

by CyrusMay 2022 03 29