MC has one drawback: the state-value function is calculated after the episode is completed, which slows down learning. To address this, a new concept called Temporal Difference Learning (TD) was introduced.
Temporal Difference Learning (TD)
(1) The return Gt used in MC is only obtained at the end of an episode. For more efficient learning, it can be replaced by a value that is available as soon as a single timestep is completed. (2) Specifically, Gt is replaced by the reward received at the next timestep (Rt+1) plus the discounted estimate of the next state's value (γV(st+1)).
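To make the replacement concrete, here is a minimal sketch of a TD(0) update for the state-value function; the table V, the step size alpha, and the discount gamma are illustrative assumptions, not quantities defined in the text above.

from collections import defaultdict

def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.99):
    # Replace the full return Gt with the one-step target Rt+1 + gamma * V(st+1).
    td_target = r_next + gamma * V[s_next]
    # Move the current estimate a small step toward that target.
    V[s] += alpha * (td_target - V[s])
    return V[s]

V = defaultdict(float)                      # state-value estimates, assumed to start at 0
td0_update(V, s="A", r_next=1.0, s_next="B")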
DP, MC, and TD
To aid understanding of Temporal Difference (TD),
let’s visualize Dynamic Programming (DP), Monte Carlo (MC), and TD. In Dynamic
Programming, all possible future states from a given state are considered to
calculate value, and the policy is immediately evaluated (value is updated).
This process is repeated continuously. In MC, values are calculated by
following an episode, and the policy is evaluated all at once when the episode
ends. In contrast, TD considers only the value obtained from the single action actually taken and evaluates the policy immediately, repeating this process continually. In other words, TD combines the frequent one-step updates of Dynamic Programming with MC's approach of learning from individual sampled actions rather than from a full model.
TD can calculate the value function even before an episode is fully completed, so unlike MC it can be used not only in episodic (terminating) environments but also in non-terminating (continuing) environments.
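For contrast, a sketch of an MC update under the same illustrative assumptions: every state value stays untouched until the episode terminates and the returns Gt can finally be computed, which is exactly the delay TD avoids.

def mc_episode_update(V, episode, alpha=0.1, gamma=0.99):
    # episode is a list of (state, reward) pairs gathered until termination;
    # nothing can be updated before the terminal step is reached.
    G = 0.0
    for s, r in reversed(episode):          # accumulate returns backwards: Gt = Rt+1 + gamma * Gt+1
        G = r + gamma * G
        V[s] += alpha * (G - V[s])          # every-visit, constant-step-size update
    return V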
Before exploring policy control in MC and TD, let's look at how policy control is achieved in Dynamic Programming. In DP, the policy is first evaluated, producing a value function, and then the value function of every state reachable from each available action is examined. The policy is then updated to take the action that leads to the state with the highest value function. A critical point is that Dynamic Programming is only possible in a model-based environment, where all information about the model is known; only then can we calculate which state yields the highest value function.
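A compact sketch of that model-based improvement step, under the assumption that the model is available as P[s][a], a list of (probability, next_state, reward) triples; none of these names come from the text, they are just one way the model might be laid out.

def greedy_policy_from_model(V, P, gamma=0.99):
    # Because the transition probabilities and rewards are known (model-based),
    # every state reachable from every action can be scored directly.
    policy = {}
    for s, actions in P.items():
        action_values = {
            a: sum(prob * (reward + gamma * V[s_next])
                   for prob, s_next, reward in outcomes)
            for a, outcomes in actions.items()
        }
        policy[s] = max(action_values, key=action_values.get)
    return policy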
However, since MC and TD operate in model-free
environments, they lack sufficient information about the environment.
Consequently, the next state cannot be predicted, nor can we know which state
would yield the highest value function. Nevertheless, the Q-function can be
used to evaluate good actions. Because the Q-function represents the value of a
specific action, it can be evaluated even without complete information about
the next state.
Q-function (Action-Value Function)
Let's revisit the Q-function formula from MDP. The Q-function measures the value (expected return) obtained by selecting one specific action in a given state. To derive the Q-function from the state-value function, we take the expectation, weighted by the state transition probabilities, of the reward plus the discounted state-value function over all states that can be reached by that single action (a).
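In the usual MDP notation (a reconstruction from the description above, not a formula copied from the original), this reads:

q_\pi(s, a) = \mathbb{E}\bigl[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\ A_t = a \bigr] = \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, v_\pi(s') \bigr]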
Returning to TD: in TD, the agent moves one timestep forward, computes the one-step value according to the current policy (which is initially set randomly), and subtracts its previous estimate of the state's value from this new value. In Dynamic Programming, updating the randomly initialized policy means calculating the state-value function for all states in the next timestep and modifying the policy to take the action that leads to the state with the highest value. In TD, however, the agent knows neither which states are possible in the next timestep nor what those states' value functions are.
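The difference formed by that subtraction is usually called the TD error; in the same standard notation as above:

\delta_t = R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t)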
Policy Control in Model-Free Environments
In TD, the only information available is the set of actions possible in the current state. Therefore, if each action is tried and the action yielding the highest Q-function (Action-Value Function) value is identified, the policy can be adjusted to prefer that action.
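As a hedged sketch of this kind of model-free control (one common concrete instance is the SARSA algorithm, which the text above has not named; the Q-table, epsilon, alpha, and gamma below are all illustrative assumptions), the policy can pick the highest-valued action most of the time while the Q-function is updated one step at a time:

import random
from collections import defaultdict

Q = defaultdict(float)                      # Q[(state, action)], assumed to start at 0

def epsilon_greedy(state, actions, epsilon=0.1):
    # Mostly follow the action with the highest Q-value, but keep exploring.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # One-step TD update of the Q-function: no model of the environment is
    # needed, only the action actually taken at the next timestep.
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])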