On-Policy and Off-Policy
All the content we have studied so far pertains to on-policy learning, because the policy used for evaluation (π) and the policy used for control (π) are the same. In on-policy TD, one more timestep is taken to calculate the state-value function and evaluate the policy, and the policy is then improved greedily with respect to the Q-function by choosing the action with the highest Q-value; this process is repeated continuously. There are two issues here: first, experiences used once for evaluation are not reused and are simply discarded; second, only a single policy can be applied, because evaluation and control must share the same policy.
To address these issues of experience reuse and of applying various policies, the off-policy approach was introduced. In off-policy learning, the policy used for evaluation and the policy used for control are applied separately.
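For example (not mentioned in the text above, but the standard pair of algorithms used to illustrate this distinction), SARSA is the on-policy form of TD control and Q-learning is the off-policy form. Below is a minimal sketch, assuming a tabular Q stored as a NumPy array and illustrative hyperparameters:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps=0.1):
    """Behavior policy used to generate experience."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

# On-policy (SARSA): the next action a_next is chosen by the SAME policy
# that is being evaluated and improved.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Off-policy (Q-learning): experience comes from the behavior policy,
# but the target evaluates the greedy target policy: max over Q(s_next, .).
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```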
Importance Sampling
Importance sampling is a technique for estimating the expected value of f(x) under a probability distribution p(x) when sampling from p is difficult, by instead drawing samples from a distribution q(x) that is easy to sample from. The expectation of f(x) under p(x) is then computed using the samples obtained from q(x).
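Stated as a formula, the substitution works because weighting each sample by the ratio p(x)/q(x) recovers the expectation under p:

```latex
\mathbb{E}_{x \sim p}[f(x)]
  = \sum_x p(x)\, f(x)
  = \sum_x q(x)\, \frac{p(x)}{q(x)}\, f(x)
  = \mathbb{E}_{x \sim q}\!\left[ \frac{p(x)}{q(x)}\, f(x) \right]
```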
Probability Distributions and Probability Density Function
A random variable, simply put, represents the types of actions (A: set of actions), and the probability distribution can be considered the policy (π: policy). In the figure above, there are three types of actions: high, medium, and low. The policy assigns probabilities of 0.3, 0.4, and 0.3 to these actions, respectively. To obtain a reasonably accurate expected value, the agent must observe many actions (samples) as it transitions from state R1 to the next state and average over them. If we think of this as finding a new navigation route, such samples may not be available, so existing data must be utilized instead.
By using existing route data, not only can the random
variable and probability distribution be obtained, but a large number of
samples can also be acquired. This allows us to apply the theory of Importance
Sampling to calculate an appropriate expected value for a new route.
Importance Sampling
To solve a problem using importance sampling, we need the probability distribution (Q) of the data-rich environment we can sample from and the probability distribution (P) of the environment we actually want to target. The expectation is then computed over samples drawn from Q, with each sample's value f(x) weighted by the ratio P(x)/Q(x). This identity has been mathematically proven, and approaching reinforcement learning with this level of understanding is sufficient.
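A minimal numerical sketch of this idea (the distributions P and Q and the values f below are illustrative assumptions, loosely matching the three-action example above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution P (hard to sample from in this story) and
# proposal distribution Q (easy to sample from), over three actions.
actions = np.array([0, 1, 2])          # e.g. high / medium / low
P = np.array([0.3, 0.4, 0.3])          # target policy probabilities
Q = np.array([0.6, 0.2, 0.2])          # behavior policy probabilities
f = np.array([10.0, 5.0, 1.0])         # illustrative value f(x) of each action

# Draw samples from Q only.
samples = rng.choice(actions, size=100_000, p=Q)

# Reweight each sample by the ratio P(x)/Q(x).
weights = P[samples] / Q[samples]
estimate = np.mean(weights * f[samples])

print(estimate)        # importance-sampling estimate of E_P[f]
print(np.sum(P * f))   # exact E_P[f] = 0.3*10 + 0.4*5 + 0.3*1 = 5.3
```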
Importance Sampling in MC and TD
Both MC and TD can be modified using importance sampling. Here, μ represents the policy from an information-rich environment with extensive experience; it is likely to be well trained and can provide samples easily. π is the policy we aim to learn, but obtaining samples from it is difficult. To train policy π with MC, we can collect samples using policy μ and train π through importance sampling.
In MC, samples keep being generated until an episode ends, so the importance-sampling ratio must be multiplied at every step to calculate the expected value. In TD, only a single timestep is executed before its value is calculated, so only one importance-sampling ratio is needed.
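In symbols (using the common notation in which μ is the behavior policy providing the samples and π is the target policy being learned), the off-policy MC return multiplies one ratio per remaining step of the episode, while the off-policy TD target contains only a single ratio:

```latex
% Off-policy MC: one importance ratio per step until the episode ends at T
G_t^{\pi/\mu} = \left( \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)} \right) G_t,
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)

% Off-policy TD: only the ratio for the current step appears
V(S_t) \leftarrow V(S_t) + \alpha \left(
  \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}
  \bigl( R_{t+1} + \gamma\, V(S_{t+1}) \bigr) - V(S_t) \right)
```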
In MC, the continuous multiplication of importance-sampling ratios can distort values severely. In practice, therefore, using importance sampling with MC is infeasible.
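The distortion can be seen with a small simulation sketch (the two-action policies π and μ below are illustrative assumptions): the per-step ratio has expectation 1, but the product over an episode becomes increasingly erratic as the episode grows longer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-action policies: target pi and behavior mu.
pi = np.array([0.9, 0.1])
mu = np.array([0.5, 0.5])

def episode_ratio(T):
    """Product of per-step importance ratios pi(a)/mu(a) over an
    episode of T steps, with actions drawn from the behavior policy mu."""
    actions = rng.choice(2, size=T, p=mu)
    return np.prod(pi[actions] / mu[actions])

for T in (1, 5, 10, 20):
    ratios = np.array([episode_ratio(T) for _ in range(10_000)])
    # The true expectation of the product is 1 for every T, but as T grows
    # the estimate becomes erratic and its variance blows up -- the value
    # distortion that makes MC importance sampling impractical.
    print(T, round(ratios.mean(), 3), round(ratios.var(), 3))
```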