On-Policy and Off-Policy
All the content we have studied so far pertains to on-policy learning, because the policy used for evaluation (π) and the policy used for control (π) are the same. In on-policy TD, one more timestep is taken to calculate the state-value function and evaluate the policy, and the policy is then improved greedily with respect to the Q-function by choosing the action with the highest Q-value; this process is repeated continuously. There are two issues here: first, experiences used once for evaluation are not reused and are simply discarded; second, only a single policy can be applied, because evaluation and control must share the same policy.
To address these issues of experience reuse and of applying various policies, the off-policy approach was introduced. In off-policy learning, the policy used for evaluation and the policy used for control are applied separately.
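For example (not mentioned in the text above, but the standard pair of algorithms used to illustrate this distinction), SARSA is the on-policy form of TD control and Q-learning is the off-policy form. Below is a minimal sketch, assuming a tabular Q stored as a NumPy array and illustrative hyperparameters:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps=0.1):
    """Behavior policy used to generate experience."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

# On-policy (SARSA): the next action a_next is chosen by the SAME policy
# that is being evaluated and improved.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Off-policy (Q-learning): experience comes from the behavior policy,
# but the target evaluates the greedy target policy: max over Q(s_next, .).
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```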
Importance Sampling
Importance sampling is a technique for estimating the expected value of f(x) under a probability distribution p(x) when sampling from p is difficult, by instead drawing samples from a distribution q(x) that is easy to sample from. The expectation of f(x) under p(x) is then computed using the samples obtained from q(x).
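Stated as a formula, the substitution works because weighting each sample by the ratio p(x)/q(x) recovers the expectation under p:

```latex
\mathbb{E}_{x \sim p}[f(x)]
  = \sum_x p(x)\, f(x)
  = \sum_x q(x)\, \frac{p(x)}{q(x)}\, f(x)
  = \mathbb{E}_{x \sim q}\!\left[ \frac{p(x)}{q(x)}\, f(x) \right]
```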
Probability Distributions and Probability Density Function
A random variable, simply put, represents the types of actions (A: set of actions), and the probability distribution can be considered the policy (π: policy). In the figure above, there are three types of actions: high, medium, and low. The policy assigns probabilities of 0.3, 0.4, and 0.3 to these actions, respectively. To obtain a reasonably accurate expected value, the agent must observe many actions (samples) as it transitions from state R1 to the next state and average over them. If we think of this as finding a new navigation route, such samples may not be available, so existing data must be utilized instead.
By using existing route data, not only can the random
variable and probability distribution be obtained, but a large number of
samples can also be acquired. This allows us to apply the theory of Importance
Sampling to calculate an appropriate expected value for a new route.
Importance Sampling
To solve a problem using importance sampling, we need the probability distribution (Q) of the data-rich environment we can sample from and the probability distribution (P) of the environment we actually want to target. The expectation is then computed over samples drawn from Q, with each sample's value f(x) weighted by the ratio P(x)/Q(x). This identity has been mathematically proven, and approaching reinforcement learning with this level of understanding is sufficient.
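A minimal numerical sketch of this idea (the distributions P and Q and the values f below are illustrative assumptions, loosely matching the three-action example above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution P (hard to sample from in this story) and
# proposal distribution Q (easy to sample from), over three actions.
actions = np.array([0, 1, 2])          # e.g. high / medium / low
P = np.array([0.3, 0.4, 0.3])          # target policy probabilities
Q = np.array([0.6, 0.2, 0.2])          # behavior policy probabilities
f = np.array([10.0, 5.0, 1.0])         # illustrative value f(x) of each action

# Draw samples from Q only.
samples = rng.choice(actions, size=100_000, p=Q)

# Reweight each sample by the ratio P(x)/Q(x).
weights = P[samples] / Q[samples]
estimate = np.mean(weights * f[samples])

print(estimate)        # importance-sampling estimate of E_P[f]
print(np.sum(P * f))   # exact E_P[f] = 0.3*10 + 0.4*5 + 0.3*1 = 5.3
```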
Importance Sampling in MC and TD
Both MC and TD can be modified using importance sampling. Here, μ represents the policy from an information-rich environment with extensive experience; it is likely to be well trained and can provide samples easily. π is the policy we aim to learn, but obtaining samples from it is difficult. To train policy π with MC, we can collect samples using policy μ and train π through importance sampling.
In MC, samples keep being generated until an episode ends, so the importance-sampling ratio must be multiplied at every step to calculate the expected value. In TD, only a single timestep is executed before its value is calculated, so only one importance-sampling ratio is needed.
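In symbols (using the common notation in which μ is the behavior policy providing the samples and π is the target policy being learned), the off-policy MC return multiplies one ratio per remaining step of the episode, while the off-policy TD target contains only a single ratio:

```latex
% Off-policy MC: one importance ratio per step until the episode ends at T
G_t^{\pi/\mu} = \left( \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)} \right) G_t,
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)

% Off-policy TD: only the ratio for the current step appears
V(S_t) \leftarrow V(S_t) + \alpha \left(
  \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}
  \bigl( R_{t+1} + \gamma\, V(S_{t+1}) \bigr) - V(S_t) \right)
```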
In MC, the continuous multiplication of importance-sampling ratios can distort values severely. In practice, therefore, using importance sampling with MC is infeasible.
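The distortion can be seen with a small simulation sketch (the two-action policies π and μ below are illustrative assumptions): the per-step ratio has expectation 1, but the product over an episode becomes increasingly erratic as the episode grows longer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-action policies: target pi and behavior mu.
pi = np.array([0.9, 0.1])
mu = np.array([0.5, 0.5])

def episode_ratio(T):
    """Product of per-step importance ratios pi(a)/mu(a) over an
    episode of T steps, with actions drawn from the behavior policy mu."""
    actions = rng.choice(2, size=T, p=mu)
    return np.prod(pi[actions] / mu[actions])

for T in (1, 5, 10, 20):
    ratios = np.array([episode_ratio(T) for _ in range(10_000)])
    # The true expectation of the product is 1 for every T, but as T grows
    # the estimate becomes erratic and its variance blows up -- the value
    # distortion that makes MC importance sampling impractical.
    print(T, round(ratios.mean(), 3), round(ratios.var(), 3))
```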