In TD, the state-value function was used for policy evaluation, while the Q-function (action-value function) was used only for policy control. However, both policy evaluation and policy control can be carried out with the Q-function, since it stores a value for every state-action pair.
SARSA
Under the Q-function, the agent takes action A_t in state S_t and receives reward R_{t+1}; then, in the next state S_{t+1}, it takes action A_{t+1}, and the process repeats. The experience therefore forms a continuous sequence S, A, R, S, A, and this sequence gives the algorithm its name: SARSA.
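For reference, one such sample (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) drives the standard one-step SARSA update of the Q-function; the step size \alpha and discount factor \gamma below are the usual hyperparameters and are not defined in the text above:

\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr]
\]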
Looking back at TD, the agent moves one timestep in the environment according to its policy and then updates the state-value function. The state-value function is defined as an expectation under the policy, so in principle every possible action and successor state should be taken into account. In MC and TD, however, the value is updated from only the single action actually sampled from the policy, so each update is only an approximation of that expectation rather than an exact evaluation.
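For comparison, the standard TD(0) update of the state-value function uses only the one sampled transition (again with step size \alpha and discount factor \gamma):

\[
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \bigr]
\]

The bracketed target is built from a single sampled reward and successor state rather than from the full expectation over the policy.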
Thus, the SARSA algorithm becomes the appropriate update rule for policy evaluation in model-free environments. In SARSA, policy control is also performed by selecting the action that maximizes the Q-function, similar to MC and TD.
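To make the evaluation and control steps concrete, here is a minimal tabular sketch in Python. The function names, the 3-state/2-action table size, and the values of alpha, gamma, and epsilon are illustrative assumptions; the environment transition is faked with hard-coded numbers, and the small epsilon adds exploration to the greedy action choice, a common practice not stated in the text above.

import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    # Policy control: explore with probability epsilon, otherwise take the
    # action that maximizes the Q-function in the current state.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Policy evaluation: move Q(s, a) toward the sampled target
    # r + gamma * Q(s', a'), where a' is the action the policy actually chose.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

# One SARSA step with made-up numbers (3 states, 2 actions).
rng = np.random.default_rng(0)
Q = np.zeros((3, 2))
s = 0
a = epsilon_greedy(Q, s, epsilon=0.1, rng=rng)
r, s_next = 1.0, 1                      # pretend the environment returned these
a_next = epsilon_greedy(Q, s_next, epsilon=0.1, rng=rng)
sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9)
print(Q)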
