MC has one drawback: the state-value function is calculated after the episode is completed, which slows down learning. To address this, a new concept called Temporal Difference Learning (TD) was introduced.
Temporal Difference Learning (TD)
(1) The return Gt used in MC is only obtained at the end of an episode. For more efficient learning, it can be replaced by a value that is available as soon as a single timestep is completed. (2) Specifically, Gt is replaced by the reward received at the next timestep (Rt+1) plus the discounted estimate of the next state's value (γV(st+1)).
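To make the replacement concrete, here is a minimal sketch of a TD(0) update for the state-value function; the table V, the step size alpha, and the discount gamma are illustrative assumptions, not quantities defined in the text above.

from collections import defaultdict

def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.99):
    # Replace the full return Gt with the one-step target Rt+1 + gamma * V(st+1).
    td_target = r_next + gamma * V[s_next]
    # Move the current estimate a small step toward that target.
    V[s] += alpha * (td_target - V[s])
    return V[s]

V = defaultdict(float)                      # state-value estimates, assumed to start at 0
td0_update(V, s="A", r_next=1.0, s_next="B")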
DP, MC, and TD
To aid understanding of Temporal Difference (TD),
let’s visualize Dynamic Programming (DP), Monte Carlo (MC), and TD. In Dynamic
Programming, all possible future states from a given state are considered to
calculate value, and the policy is immediately evaluated (value is updated).
This process is repeated continuously. In MC, values are calculated by
following an episode, and the policy is evaluated all at once when the episode
ends. In contrast, TD considers only the value obtained from the single action actually taken and evaluates the policy immediately, repeating this process continually. In other words, TD combines the frequent one-step updates of Dynamic Programming with MC's approach of learning from individual sampled actions rather than from a full model.
TD can calculate the value function even before an episode is fully completed, so unlike MC it can be used not only in episodic (terminating) environments but also in non-terminating (continuing) environments.
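For contrast, a sketch of an MC update under the same illustrative assumptions: every state value stays untouched until the episode terminates and the returns Gt can finally be computed, which is exactly the delay TD avoids.

def mc_episode_update(V, episode, alpha=0.1, gamma=0.99):
    # episode is a list of (state, reward) pairs gathered until termination;
    # nothing can be updated before the terminal step is reached.
    G = 0.0
    for s, r in reversed(episode):          # accumulate returns backwards: Gt = Rt+1 + gamma * Gt+1
        G = r + gamma * G
        V[s] += alpha * (G - V[s])          # every-visit, constant-step-size update
    return V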
Before exploring policy control in MC and TD, let's look at how policy control is achieved in Dynamic Programming. In DP, the policy is first evaluated, producing a value function, and then the value function of every state reachable from each available action is examined. The policy is then updated to take the action that leads to the state with the highest value function. A critical point is that Dynamic Programming is only possible in a model-based environment, where all information about the model is known; only then can we calculate which state yields the highest value function.
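A compact sketch of that model-based improvement step, under the assumption that the model is available as P[s][a], a list of (probability, next_state, reward) triples; none of these names come from the text, they are just one way the model might be laid out.

def greedy_policy_from_model(V, P, gamma=0.99):
    # Because the transition probabilities and rewards are known (model-based),
    # every state reachable from every action can be scored directly.
    policy = {}
    for s, actions in P.items():
        action_values = {
            a: sum(prob * (reward + gamma * V[s_next])
                   for prob, s_next, reward in outcomes)
            for a, outcomes in actions.items()
        }
        policy[s] = max(action_values, key=action_values.get)
    return policy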
However, since MC and TD operate in model-free
environments, they lack sufficient information about the environment.
Consequently, the next state cannot be predicted, nor can we know which state
would yield the highest value function. Nevertheless, the Q-function can be
used to evaluate good actions. Because the Q-function represents the value of a
specific action, it can be evaluated even without complete information about
the next state.
Q-function (Action-Value Function)
Let's revisit the Q-function formula from MDP. The Q-function measures the value (expected return) obtained by selecting one specific action in a given state. To derive the Q-function from the state-value function, we take the expectation, weighted by the state transition probabilities, of the reward plus the discounted state-value function over all states that can be reached by that single action (a).
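In the usual MDP notation (a reconstruction from the description above, not a formula copied from the original), this reads:

q_\pi(s, a) = \mathbb{E}\bigl[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\ A_t = a \bigr] = \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, v_\pi(s') \bigr]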
Returning to TD: in TD, the agent moves one timestep forward, computes the one-step value according to the current policy (which is initially set randomly), and subtracts its previous estimate of the state's value from this new value. In Dynamic Programming, updating the randomly initialized policy means calculating the state-value function for all states in the next timestep and modifying the policy to take the action that leads to the state with the highest value. In TD, however, the agent knows neither which states are possible in the next timestep nor what those states' value functions are.
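The difference formed by that subtraction is usually called the TD error; in the same standard notation as above:

\delta_t = R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t)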
Policy Control in Model-Free Environments
In TD, the only information available is the set of actions possible in the current state. Therefore, if each action is tried and the action yielding the highest Q-function (Action-Value Function) value is identified, the policy can be adjusted to prefer that action.
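As a hedged sketch of this kind of model-free control (one common concrete instance is the SARSA algorithm, which the text above has not named; the Q-table, epsilon, alpha, and gamma below are all illustrative assumptions), the policy can pick the highest-valued action most of the time while the Q-function is updated one step at a time:

import random
from collections import defaultdict

Q = defaultdict(float)                      # Q[(state, action)], assumed to start at 0

def epsilon_greedy(state, actions, epsilon=0.1):
    # Mostly follow the action with the highest Q-value, but keep exploring.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # One-step TD update of the Q-function: no model of the environment is
    # needed, only the action actually taken at the next timestep.
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])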