reinforcement learning

actor-critic

original version

  • $s_{t}$: state at time $t$
  • $V(s_{t})$: value function (critic); predicts the reward from $s_{t}$
  • $A(s_{t})$: action function (actor); returns the action probabilities based on $s_{t}$
  • top1: selects the action with the highest probability
  • $R(a_{[0:t]})$: reward function; takes a sequence of actions and returns a float
  • advantage value: positive if the actions earned more reward than the value function predicted, negative otherwise
$$
\begin{aligned}
\text{action probability}&=A(s_{[0:t-1]}) \\
a_{[0:t-1]}&=\text{top1}(\text{action probability})\\
\text{reward}&=R(a_{[0:t-1]})\\
L_\text{critic}&=\text{MSE}(\text{reward},V(s_{t-1}))\\
\text{advantage value}&=\text{reward}-V(s_{t-1})\\
L_\text{actor}&=\text{CrossEntropy}(\text{action probability},a_{[0:t-1]})\times \text{advantage value}
\end{aligned}
$$
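The loss computation above can be traced numerically with a minimal pure-Python sketch. It is only an illustration under stated assumptions: the logits, the critic's prediction, and the reward are made-up numbers, the helper names `softmax` and `actor_critic_losses` are hypothetical, and a single step stands in for the whole action sequence.

```python
import math

def softmax(logits):
    """Turn raw actor scores into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_losses(logits, value_pred, reward):
    """Return (action, critic_loss, advantage, actor_loss) for one step."""
    probs = softmax(logits)                                  # action probability
    action = max(range(len(probs)), key=probs.__getitem__)   # top1
    critic_loss = (reward - value_pred) ** 2                 # MSE(reward, V(s))
    advantage = reward - value_pred                          # reward - V(s)
    cross_entropy = -math.log(probs[action])                 # CrossEntropy(probs, action)
    actor_loss = cross_entropy * advantage
    return action, critic_loss, advantage, actor_loss

# Hypothetical numbers: three actions, the critic predicted 0.5, the reward was 1.0.
action, critic_loss, advantage, actor_loss = actor_critic_losses(
    logits=[2.0, 1.0, 0.1], value_pred=0.5, reward=1.0
)
print(action, critic_loss, advantage, round(actor_loss, 4))
```

With a positive advantage, minimising the actor loss raises the probability of the chosen action; with a negative advantage the sign flips and that probability is pushed down.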

with temporal-difference error (TD error)

  • The goal is for $V(s_{t})$ to also predict the reward that may be obtained in the future, discounted by the factor $\gamma$
  • Note: apply a stop gradient to $TD_\text{target}$ (i.e. detach in PyTorch)
$$
\begin{aligned}
TD_\text{target}&=\text{reward}+\gamma V(s_{t})\\
TD_\text{error}&=TD_\text{target}-V(s_{t-1})\\
L_\text{critic}&=\text{MSE}(TD_\text{error},0)=\text{MSE}(TD_\text{target},V(s_{t-1}))\\
\text{advantage value}&=TD_\text{error}\\
L_\text{actor}&=\text{CrossEntropy}(\text{action probability},a_{[0:t-1]})\times \text{advantage value}
\end{aligned}
$$
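The TD variant can be checked by hand in the same way. The following is a minimal sketch with plain floats, assuming made-up values for the reward, the two critic predictions, the cross-entropy term, and $\gamma$; the function name `td_losses` is hypothetical.

```python
def td_losses(reward, value_prev, value_next, cross_entropy, gamma=0.99):
    """Return (td_target, td_error, critic_loss, actor_loss) for one transition.

    value_prev is V(s_{t-1}), value_next is V(s_t).
    """
    # With plain floats td_target is already a constant. In an autograd
    # framework it needs a stop gradient so that only V(s_{t-1}) is trained.
    td_target = reward + gamma * value_next
    td_error = td_target - value_prev
    critic_loss = td_error ** 2            # MSE(TD_target, V(s_{t-1}))
    actor_loss = cross_entropy * td_error  # the advantage value is the TD error
    return td_target, td_error, critic_loss, actor_loss

# Hypothetical numbers chosen so the arithmetic is easy to follow.
td_target, td_error, critic_loss, actor_loss = td_losses(
    reward=1.0, value_prev=0.5, value_next=2.0, cross_entropy=0.5, gamma=0.5
)
print(td_target, td_error, critic_loss, actor_loss)
```

Here the target is 1.0 + 0.5 × 2.0 = 2.0, so the TD error is 2.0 − 0.5 = 1.5: the critic underestimated $V(s_{t-1})$ and the chosen action is reinforced.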