reinforcement learning
actor-critic
original
- $s_t$: state at time $t$
- $V(s_t)$: value function; predicts the reward obtainable from $s_t$
- $\pi(a \mid s_t)$: action function; returns the probability of each action based on $s_t$
- top-1: select the action with the maximum probability
- $R$: reward function; takes a sequence of actions as input and outputs a float
- advantage value: positive if the action gets more reward than the value function predicted, otherwise negative
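The pieces listed above can be sketched in plain Python. This is a minimal illustration, not a full actor-critic implementation: the helper names `top1` and `advantage` are hypothetical, the numbers are made up, and defining the advantage as "obtained reward minus predicted value" is an assumption based on the common convention.

```python
from typing import Sequence


def top1(probs: Sequence[float]) -> int:
    """Return the index of the action with the highest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


def advantage(reward: float, predicted_value: float) -> float:
    """Assumed definition: obtained reward minus the value function's prediction.

    Positive when the action did better than predicted, negative otherwise.
    """
    return reward - predicted_value


# Hypothetical outputs for a single state.
action_probs = [0.1, 0.7, 0.2]  # what the action function might return
predicted = 0.5                 # what the value function might return

action = top1(action_probs)          # picks the most probable action
adv_good = advantage(1.0, predicted)  # reward above the prediction
adv_bad = advantage(0.0, predicted)   # reward below the prediction
```

A positive advantage would push the probability of the chosen action up, and a negative one would push it down.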
with temporal difference error (TD error)
- the goal is for the value function to predict the reward that may be obtained in the future, weighted by a discount proportion $\gamma$
- note: you should add a stop-gradient (aka detach in PyTorch)
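One common way to write the TD error is sketched below. The symbols $r_t$ (reward at step $t$), $\gamma$ (discount proportion) and $\operatorname{sg}[\cdot]$ (stop-gradient) are assumptions, not taken from the notes. Placing the stop-gradient on the bootstrapped term $V(s_{t+1})$ is also an assumption about where it belongs.

```latex
\delta_t = r_t + \gamma \, \operatorname{sg}\!\left[ V(s_{t+1}) \right] - V(s_t)
```

Under this reading, the PyTorch equivalent would be calling `.detach()` on the tensor holding $V(s_{t+1})$, so that minimizing $\delta_t^2$ only moves $V(s_t)$ toward the target and not the target itself.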