Q-Learning

Earlier, we explained that the Q-function is used for policy control in MC and TD. In the environments where MC and TD are applied, complete information about the model is not available (model-free), so the next state cannot be known in advance. This means the state-value function cannot be used to find the optimal policy.

However, by using the action-value function (Q-function), we can calculate a value for every possible action (a) in the current state (s), even without knowing the next state. This makes policy control possible: the policy is updated to choose the action that returns the highest value.
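To make this concrete, here is a minimal sketch of greedy action selection from a Q-table. The dictionary layout and function names are assumptions for illustration, not part of any specific library.

```python
# A minimal sketch (the Q-table layout and names are assumptions):
# with a Q-table we can pick the best action in the current state s
# without needing to know which state comes next.

def greedy_action(Q, state, actions):
    """Return the action with the largest Q(state, action)."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Toy usage: the Q-table says "right" is worth more than "left" in state "s0".
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.7}
print(greedy_action(Q, "s0", ["left", "right"]))  # -> right
```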

SARSA is an extension of TD that replaces the state-value function with the action-value function (Q-function) for policy evaluation and policy control. In TD, a single action (a) is taken in state (s) according to the policy and its value is estimated, so what is really being evaluated is a state-action pair; strictly speaking, the Q-function is the more accurate quantity to use. SARSA therefore replaces the state-value function with the Q-function.
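As a sketch, one SARSA update can be written as follows; alpha (learning rate), gamma (discount factor), and the Q-table layout are assumed names for illustration.

```python
# A minimal sketch of one SARSA update (alpha, gamma and the Q-table
# layout are assumptions). SARSA is on-policy: a_next is the action the
# current policy actually chooses in s_next, and its Q-value is used as
# the bootstrap target.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    q_sa = Q.get((s, a), 0.0)
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (target - q_sa)
```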

Both TD and MC use the same policy π for policy evaluation and control, making them examples of on-policy learning methods.

That covers what we have studied so far. Now let's look at Q-learning, which allows for more efficient learning.

Sampling in Q-Learning


In SARSA, experience is accumulated by choosing the next action according to the policy (π) and using its Q-value in the update. In Q-learning, by contrast, the update uses the action that maximizes the Q-value in the next state, rather than the action the policy would take. This is the key difference between SARSA and Q-learning.
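A sketch of the corresponding Q-learning update, under the same assumed Q-table layout, makes the difference visible: the bootstrap target uses the maximum Q-value over the actions available in the next state.

```python
# A minimal sketch of one Q-learning update (names are assumptions).
# Unlike SARSA above, the target bootstraps from the best action in the
# next state, not from the action the behavior policy actually takes.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    q_sa = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    target = r + gamma * best_next
    Q[(s, a)] = q_sa + alpha * (target - q_sa)
```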

Q-learning does not use importance sampling; nevertheless, it is considered an off-policy method because the policy used for evaluation (the greedy max) is different from the policy used for control (π). In general, Q-learning shows better performance than SARSA.

Let's examine policy evaluation and policy control in Q-learning, following the framework discussed earlier. In the SARSA algorithm, policy evaluation means calculating the Q-function: actions are chosen according to a fixed policy, their values are calculated, and the Q-function is updated. Policy control then modifies the policy so that it selects the action with the highest Q-value found during policy evaluation.

Q-learning, on the other hand, does not follow a fixed policy. In the policy evaluation step, the Q-function is updated using the action with the highest Q-value in the next state, regardless of which action is actually taken there, and policy control again steers the policy toward the highest-valued action.
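Putting the two steps together, here is a sketch of a full tabular Q-learning loop. The environment interface (reset/step), the episode count, and the parameter names are assumptions for illustration, not a specific library API.

```python
import random

# A sketch of tabular Q-learning under assumed interfaces: env.reset()
# returns a state, env.step(a) returns (next_state, reward, done), and
# actions is a finite list. These are assumptions, not a specific library.

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = {}  # Q-table: (state, action) -> value
    q = lambda s, a: Q.get((s, a), 0.0)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Policy control: epsilon-greedy behavior over the current Q-table.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: q(s, b))

            s_next, r, done = env.step(a)

            # Policy evaluation: bootstrap from the highest-valued action in
            # s_next, regardless of which action will actually be taken there.
            best_next = 0.0 if done else max(q(s_next, b) for b in actions)
            Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_next - q(s, a))
            s = s_next
    return Q
```

Note that in this sketch the behavior policy is ε-greedy (it still explores), while the update target is greedy; this is exactly the evaluation/control split described above.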

The algorithms we have studied so far are the foundational theories of reinforcement learning. Dynamic Programming, MC, TD, and Q-learning are not frequently used in practical applications. Starting with DQN, which we will examine next, we’ll cover algorithms widely used in practice. However, it is essential to thoroughly understand the early reinforcement learning algorithms because it is nearly impossible to understand advanced algorithms without grasping these foundational ones.
