SARSA

 

In TD, the state-value function was used for policy evaluation, while the Q-function (action-value function) was used only for policy control. However, both policy evaluation and policy control can be carried out with the Q-function alone, since it assigns a value to every state-action pair.

SARSA

After taking action At in state St, the agent receives a reward Rt+1. Then, in the next state St+1, it takes action At+1 and the process repeats, producing a continuous sequence of S, A, R, S, A. This sequence gives the algorithm its name: SARSA.
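Written out as an update rule (standard notation; the step size α and the discount factor γ are not named in the post above), the one-step SARSA update is:

```latex
% One-step SARSA update. \alpha is the learning rate and \gamma the
% discount factor; both are assumed hyperparameters, not named above.
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
    + \alpha \left[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
```

All five quantities in the sequence S, A, R, S, A appear in this update, which is exactly where the name comes from.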

Looking back at TD, the agent moves one timestep in the environment according to its policy and then updates the state-value function. The state-value function is defined as an expectation under the policy, so in principle every action the policy could take (and every state it could lead to) should be weighted into the value. In MC and TD, however, the value is updated from the single action actually sampled from the policy, so the estimate rests on samples rather than on the full expectation.
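To make that contrast concrete (standard notation, not spelled out in the post): the definition of the state-value function takes an expectation over the policy, while the TD(0) update replaces that expectation with the one transition that was actually sampled.

```latex
% Definition: expectation over the policy, i.e. all actions are weighted
v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]

% TD(0) update: a single sampled transition stands in for that expectation
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \right]
```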

Thus, the SARSA update is the appropriate rule for policy evaluation in a model-free environment. For policy control, SARSA likewise selects the action that maximizes the Q-function, as in MC and TD.
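As a minimal sketch of how this looks in code, the loop below runs tabular SARSA with ε-greedy action selection for control (mostly the Q-maximizing action, with occasional random exploration). The environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)), the helper epsilon_greedy, and all hyperparameter values are illustrative assumptions, not something defined in the post.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, epsilon):
    """With probability epsilon explore randomly; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: both evaluation and control use the Q-function alone."""
    Q = defaultdict(float)                      # Q[(state, action)] -> estimate
    for _ in range(episodes):
        state = env.reset()                     # S_t
        action = epsilon_greedy(Q, state, n_actions, epsilon)            # A_t
        done = False
        while not done:
            next_state, reward, done = env.step(action)                  # R_{t+1}, S_{t+1}
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)  # A_{t+1}
            # One-step SARSA target uses the action that will actually be taken next.
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```

Because the next action At+1 in the target is chosen by the same policy being improved, the update follows the S, A, R, S, A sequence described above step for step.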


