reinforcement learning

actor-critic

original

$s_{t}$: state at time $t$
$V(s_{t})$: value function (the critic), which predicts the reward from $s_{t}$
$A(s_{t})$: action function (the actor), which returns the action probabilities based on $s_{t}$
top1: select the action with the maximum probability
$R(a_{[0:t]})$: reward function, which takes a sequence of actions and returns a float
advantage value: positive if the actions obtain more reward than the value function predicted, negative otherwise

$$
\begin{aligned}
\text{action probability}&=A(s_{[0:t-1]}) \\
a_{[0:t-1]}&=\text{top1}(\text{action probability})\\
\text{reward}&=R(a_{[0:t-1]})\\
L_\text{critic}&=\text{MSE}(\text{reward},V(s_{t-1}))\\
\text{advantage value}&=\text{reward}-V(s_{t-1})\\
L_\text{actor}&=\text{CrossEntropy}(\text{action probability},a_{[0:t-1]})\times \text{advantage value}
\end{aligned}
$$
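
The equations above can be traced numerically for a single step. The following is a minimal sketch in plain Python, not a training loop: the function names `cross_entropy` and `actor_critic_losses` and the example numbers are assumptions made for illustration, and the MSE is taken over a single sample.

```python
import math


def cross_entropy(probs, action):
    """Cross-entropy against a one-hot target: -log p(action)."""
    return -math.log(probs[action])


def actor_critic_losses(probs, reward, value):
    """One step of the basic actor-critic losses (hypothetical helper).

    probs  : action probabilities returned by the actor A
    reward : float returned by the reward function R
    value  : critic prediction V(s_{t-1})
    """
    # top1: index of the action with the maximum probability
    action = max(range(len(probs)), key=lambda i: probs[i])
    # Critic loss: squared error between the reward and the prediction
    critic_loss = (reward - value) ** 2
    # Advantage: positive if the reward exceeded the critic's prediction
    advantage = reward - value
    # Actor loss: cross-entropy of the chosen action, scaled by the advantage
    actor_loss = cross_entropy(probs, action) * advantage
    return action, critic_loss, advantage, actor_loss


action, critic_loss, advantage, actor_loss = actor_critic_losses(
    probs=[0.7, 0.2, 0.1], reward=1.0, value=0.4
)
print(action, critic_loss, advantage, actor_loss)
```

With a positive advantage, minimizing the actor loss raises the probability of the chosen action; with a negative advantage the sign flips and the probability is pushed down.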

with Temporal Difference error (TD-error)

The goal is for $V(s_{t})$ to also predict the reward that may be obtained in the future, discounted by a factor $\gamma$.
Note: apply a stop-gradient to $TD_\text{target}$ (detach in PyTorch), so that the critic loss only updates $V(s_{t-1})$.

$$
\begin{aligned}
TD_\text{target}&=\text{reward}+\gamma V(s_{t})\\
TD_\text{error}&=TD_\text{target} - V(s_{t-1})\\
L_\text{critic}&=\text{MSE}(TD_\text{error},0)=\text{MSE}(TD_\text{target},V(s_{t-1}))\\
\text{advantage value}&=TD_\text{error}\\
L_\text{actor}&=\text{CrossEntropy}(\text{action probability},a_{[0:t-1]})\times \text{advantage value}
\end{aligned}
$$
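
The TD-error variant can be checked the same way. This is a sketch under stated assumptions: `td_actor_critic_losses`, its parameter names, and the example numbers are hypothetical, and plain Python has no autograd, so the stop-gradient is imitated by treating `td_target` as a constant when writing out the critic gradient by hand.

```python
import math


def td_actor_critic_losses(probs, action, reward, value_prev, value_next, gamma):
    """One step of the TD-error losses (hypothetical helper).

    value_prev : critic prediction V(s_{t-1})
    value_next : critic prediction V(s_t)
    gamma      : discount factor applied to the future prediction
    """
    # TD target: reward plus the discounted prediction for the next state
    td_target = reward + gamma * value_next
    # TD error: how far the previous prediction was from the target
    td_error = td_target - value_prev
    # Critic loss: MSE(TD_error, 0) for a single sample
    critic_loss = td_error ** 2
    # With td_target held constant (stop-gradient), the derivative of the
    # critic loss with respect to V(s_{t-1}) is -2 * TD_error
    critic_grad_wrt_value_prev = -2.0 * td_error
    # Actor loss: cross-entropy of the chosen action, scaled by the TD error
    actor_loss = -math.log(probs[action]) * td_error
    return td_target, td_error, critic_loss, critic_grad_wrt_value_prev, actor_loss


td_target, td_error, critic_loss, critic_grad, actor_loss = td_actor_critic_losses(
    probs=[0.5, 0.3, 0.2], action=0, reward=1.0,
    value_prev=0.5, value_next=2.0, gamma=0.9,
)
print(td_target, td_error, critic_loss, critic_grad, actor_loss)
```

Setting `gamma=0` removes the future term, so the TD error reduces to `reward - value_prev`, the advantage value of the original version.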