Earlier, we explained that the Q-function is used for policy control in MC and TD. The environments where MC and TD are applied are model-free: full information about the model is unavailable, so the next state cannot be known in advance. This makes it impossible to use the state-value function to find the optimal policy, because improving a policy with state values requires knowing which state each action would lead to.
However, by using the action-value function (Q-function), we can estimate the value of each possible action a in the current state s, even without knowing the next state. This enables policy control: the policy is updated to choose the action with the highest Q-value.
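To make this concrete, here is a minimal sketch of greedy policy control over a tabular Q-function; the states, actions, and Q-values are hypothetical, chosen only for illustration.

```python
# Minimal sketch: greedy policy control from a tabular Q-function.
# The states, actions, and Q-values below are made up for illustration.
Q = {
    ("s0", "left"): 0.1,
    ("s0", "right"): 0.7,   # highest-valued action in state "s0"
}

def greedy_action(Q, state, actions):
    """Policy control: pick the action with the highest Q-value in this state."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

print(greedy_action(Q, "s0", ["left", "right"]))   # -> "right"
```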
SARSA is an extension of TD that replaces the state-value function with the action-value function (Q-function) for policy evaluation and control. In TD, a single action a is taken in state s according to the policy and its value is estimated, so what is really being learned is the value of a state-action pair; strictly speaking, the Q-function captures this more accurately, which is why SARSA adopts it in place of the state-value function.
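As a sketch of the resulting update rule (the learning rate alpha and discount factor gamma below are assumed hyperparameters, not values from the text), the one-step SARSA update for a transition (s, a, r, s', a') can be written as:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step SARSA update.

    The target bootstraps from Q(s', a'), where a' is the action the
    policy actually chose in s'; this is what makes SARSA on-policy.
    Q is assumed to be a dict mapping (state, action) pairs to values.
    """
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```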
Both TD and MC use the same policy π for policy
evaluation and control, making them examples of on-policy learning methods.
So far, this covers what we've studied. Now, let's explore Q-learning, which allows for more efficient learning.
Sampling in Q-Learning
In SARSA, experience is accumulated by choosing the next action according to the policy (π) and using that action's Q-value in the update. In Q-learning, however, the action used to evaluate the next state is the one that maximizes the Q-value, rather than the one the policy would choose. This is the key difference between SARSA and Q-learning.
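A corresponding sketch of the one-step Q-learning update, under the same assumptions as the SARSA sketch above; note that only the target term changes:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One-step Q-learning update.

    The target bootstraps from the maximum Q-value in s_next, no matter
    which action the behavior policy will actually take there.
    """
    td_target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```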
Q-learning does not use importance sampling; it is nevertheless considered an off-policy method, because the policy used for evaluation (the greedy max) is different from the policy used for control (π). Generally, Q-learning shows better performance than SARSA.
Let's examine policy evaluation and policy control in Q-learning in the terms discussed previously. In the SARSA algorithm, policy evaluation consists of calculating the Q-function: actions are chosen according to the current policy, their values are computed, and the Q-function is updated. Policy control then modifies the policy to select the action with the highest Q-value obtained during policy evaluation.
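Putting evaluation and control together, one SARSA episode might be sketched as follows. The environment interface (env.reset(), env.step() returning the next state, reward, and a done flag) and the ε-greedy exploration rate are assumptions made for illustration, not details from the original text.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Policy control: usually pick the highest-valued action, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def run_sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One SARSA episode: evaluate Q along the trajectory the policy actually follows."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, actions, epsilon)   # on-policy next action
        target = r + (0.0 if done else gamma * Q.get((s_next, a_next), 0.0))
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s, a = s_next, a_next
    return Q
```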
Q-learning, on the other hand, does not use a fixed policy. In the policy evaluation process, the action with the highest Q-value in the next state is used to form the update target, regardless of which action the behavior policy actually takes.
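For contrast, here is a sketch of the Q-learning episode loop under the same assumed environment interface, reusing the epsilon_greedy helper from the SARSA sketch above. The action actually executed still comes from an exploratory behavior policy, but the update target uses the greedy max, which is exactly why the method is off-policy.

```python
def run_q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One Q-learning episode: behave with epsilon-greedy, evaluate with the greedy max."""
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, actions, epsilon)          # control: exploratory behavior
        s_next, r, done = env.step(a)
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        target = r + (0.0 if done else gamma * best_next)   # evaluation: greedy target
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s = s_next
    return Q
```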