MC: Monte-Carlo Method

 

In dynamic programming, policy evaluation and control are performed under the assumption that the model is known (Model-Based). When the model is known, the next state can be predicted, allowing the problem to be broken down into smaller units and calculated sequentially to find the optimal policy.

However, in situations where the model is unknown (Model-Free), the reward function (R) and state transition probability (P) are not available, and particularly, the next state is unpredictable. Thus, methods like dynamic programming cannot be used to solve the problem. In such cases, Monte Carlo Prediction is needed.

In dynamic programming, the value of each state is updated by sweeping over all states, but in Monte Carlo prediction the algorithm runs until an episode is completed, accumulating experience and calculating the value function from that experience.

The Monte Carlo method (MC) is one of the most commonly used approaches when there is insufficient information about the environment. Instead of precise mathematical calculations, the Monte Carlo method estimates values statistically using probabilistic techniques. It is typically used to obtain approximate results when calculating an exact value is too complex.

The name "Monte Carlo" originates from Polish-American mathematician Stanislaw Ulam, who named it after the famous gambling city in Monaco. The Monte Carlo method is well-suited for computer simulations, making it useful in the development of atomic and hydrogen bombs and still widely applied in various fields today.

Monte Carlo Method

For example, calculating the area of an irregular polygon like the one shown above can be difficult to do exactly. Using the Monte Carlo method, we can proceed as follows. First, draw a square that encloses the polygon; suppose each side of the square measures 1 meter, so its area is simply width times height, or 1 m². Now scatter balls at random inside the square. Suppose 30 balls are thrown and 15 land inside the polygon. Using the ratio of balls inside the polygon to the total number thrown (15/30), we can estimate the polygon's area as 0.5 × 1 m² = 0.5 m². This is the essence of the Monte Carlo method.
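To make the ball-throwing estimate concrete, here is a minimal sketch in Python. The polygon from the figure is not specified in the text, so the example uses an assumed test shape, a right triangle of known area 0.5, and the `inside` predicate is a hypothetical helper you would swap for your own shape test.

```python
import random

def estimate_area(inside, n_samples=100_000, side=1.0):
    """Estimate the area of a shape enclosed by a side x side square
    by throwing random points and counting how many land inside."""
    hits = 0
    for _ in range(n_samples):
        x = random.uniform(0.0, side)
        y = random.uniform(0.0, side)
        if inside(x, y):
            hits += 1
    # area of shape ~= (fraction of points inside) * (area of the square)
    return (hits / n_samples) * side * side

# Assumed test shape: right triangle with vertices (0,0), (1,0), (0,1); true area 0.5
print(estimate_area(lambda x, y: x + y <= 1.0))  # prints roughly 0.5
```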

In this approach, the polygon's area is estimated based on the ratio of balls, rather than through precise measurement. In reinforcement learning, the Monte Carlo method can approximate values effectively in cases where complete information about the environment is unavailable.

There is one prerequisite for using the Monte Carlo method in reinforcement learning: the environment in which the agent operates must have a defined beginning and end. This type of environment is called episodic. For example, in games, MMORPGs like Lineage do not have clear episode boundaries, while games like Diablo progress in episodic units. The Monte Carlo method can be applied only to cases that can be divided into episodes, such as the latter.

The Monte Carlo method involves executing an episode to its end and using the result to estimate a value. For example, to calculate the value function, the Bellman equation is used, which sums the value obtained in the current timestep with all future discounted values. When all information about the environment is available, the next state can be predicted, allowing for precise calculation, as in dynamic programming. However, in the absence of complete information, the next state is unknown. Therefore, the agent must operate through the episode, gathering information based on the chosen policy and the state transition probabilities provided by the environment, and calculate the value function by accumulating this information.
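In standard notation, the quantity each completed episode supplies for a visited state is the return: the discounted sum of the rewards collected from that timestep until the episode terminates at time T.

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}
$$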

When the agent completes its first episode, there may be a significant difference between the calculated value function and the true value function (which can only be computed if all environmental information is known and is unattainable in a model-free environment). However, by averaging the values obtained over many episodes, the estimated value function will approach the true value function. This approach is known as the Monte Carlo method.

Solving MDP with the Monte Carlo Method

To solve an MDP using the Monte Carlo (MC) method, let's first express the problem mathematically. Recall that in an MDP, the state-value function is defined as (1) the expected value of the return G. Computing this expectation exactly means summing over all possible actions ((2)-1) and all possible successor states ((2)-2) under a fixed policy, which is only possible in a Model-Based environment where all information about the environment is known.
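The numbered equations themselves are not reproduced here, but from the description they correspond to the standard definitions: (1) the state-value function as the expected return under the policy π, expanded into a sum over all actions ((2)-1) and all successor states ((2)-2).

$$
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\big[ R(s, a, s') + \gamma\, v_\pi(s') \big]
$$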

 

Now, let’s calculate the state-value function using MC. In MC, (3) the average of the observed returns is used as the state-value function. (4) First, run the agent until the end of an episode. When the episode ends, increment the count N (accumulated count) for each state that was visited. (5) Add the return obtained from each visited state to that state's accumulated-return variable S. In contrast to the full MDP calculation, this approach considers only the actions the agent actually took and the states it actually visited, not all possible actions and states. (6) Finally, divide the accumulated returns by the accumulated count to obtain the average, which gives the state-value function. This estimated state-value function can then be used to evaluate the fixed policy.
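As a rough sketch of steps (3) through (6) in code, the following every-visit Monte Carlo prediction loop keeps the accumulated count N, the accumulated returns S, and their ratio V for each state. The `run_episode` function and the discount factor value are assumptions: `run_episode` stands in for running the fixed policy in the environment and is expected to return the visited (state, reward) pairs of one episode.

```python
from collections import defaultdict

def mc_prediction(run_episode, num_episodes, gamma=0.9):
    """Monte Carlo prediction: V(s) = S(s) / N(s) over sampled episodes.
    Assumes run_episode() returns [(state, reward), ...] for one episode,
    where reward is the reward received after leaving that state."""
    N = defaultdict(int)     # accumulated visit counts per state
    S = defaultdict(float)   # accumulated returns per state
    V = defaultdict(float)   # estimated state-value function

    for _ in range(num_episodes):
        episode = run_episode()
        G = 0.0
        # Walk the episode backwards so G is the return from each visited state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V
```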

Incremental Mean

Incremental Mean provides a way to update a running average quickly when a new value arrives, using only the average computed up to the previous timestep. Here x_j denotes the values as they arrive one after another. (1) To calculate the overall average of the x values, add them all together and divide by their count k. (The full chain of equalities for steps (1) through (6) is written out after step (6) below.)

(2) Separate out the most recent value, x_k, and group the remaining values, those up to the previous timestep, into their own sum.

(3) Multiply and divide that grouped sum by k−1. The value is unchanged; the adjustment is made purely for convenience in the next step.

(3)-1 The grouped term is the sum of the data up to index k−1, since only those values are included.

(4) That sum divided by k−1 is exactly the average up to the previous timestep, written μ_{k−1}.

(5) Expand and regroup the equation once more, so that the new average appears as the previous average plus a correction term.

(6) Finally, this yields a form that is easy to program: the new average equals the previous average plus 1/k times the difference between the newest value and the previous average. Strictly the right-hand side should use μ_{k−1}, but in code the same variable is simply overwritten, so replacing it with μ_k is acceptable for convenience.
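Putting steps (1) through (6) together, the chain of equalities reads as follows (reconstructed in standard notation, since the numbered equations are not shown in the text):

$$
\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j
      = \frac{1}{k}\Big( x_k + \sum_{j=1}^{k-1} x_j \Big)
      = \frac{1}{k}\big( x_k + (k-1)\,\mu_{k-1} \big)
      = \mu_{k-1} + \frac{1}{k}\big( x_k - \mu_{k-1} \big)
$$

where $\mu_{k-1} = \frac{1}{k-1}\sum_{j=1}^{k-1} x_j$ is the average up to the previous timestep.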

MC Using Incremental Mean

Now, let’s use the incremental mean to transform MC into a form that can be programmed. (1) In the previous formula, the average was calculated by dividing the accumulated return by the accumulated count. To calculate values this way, information from all past episodes must be stored, which burdens the system and slows down computation.

(2) By using the incremental mean, we can simplify this formula: subtract the average return up to the previous timestep, V(s), from the latest return G_t, divide the difference by the visit count N(s), and add the result back to V(s). This gives the updated state-value function.
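Written out, this incremental update takes the standard form, where N(S_t) is the visit count for the state:

$$
N(S_t) \leftarrow N(S_t) + 1, \qquad
V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big( G_t - V(S_t) \big)
$$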

A notable property of this formula is that when G_t and V(s) are equal, the update term is zero and the equation reduces to V(s) = V(s), meaning the state-value function no longer changes. A value function that has stopped changing in this way is the true value function. In reinforcement learning, learning proceeds by repeatedly shrinking this difference between G_t and V(s).

MC for Programming

Now, let’s replace 1/N(s) in formula (1) with a constant step size α to create a new formula (2). Changing 1/N(s) to α makes this value a fixed constant rather than one that shrinks as the count grows. Typically, α is a value between 0 and 1. Turning it into a fixed constant simplifies programming considerably. Replacing parts of the exact mathematical formula for ease of programming is acceptable because MC is an empirical method that approaches the solution statistically through experience rather than through precise calculation.
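As a minimal sketch, the constant step-size version of the update looks like this in code. The value 0.1 for the step size is an assumed example, not a value given in the text:

```python
from collections import defaultdict

ALPHA = 0.1  # assumed fixed step size in (0, 1)

def mc_update(V, state, G):
    """Constant step-size MC update: V(s) <- V(s) + alpha * (G - V(s))."""
    V[state] += ALPHA * (G - V[state])

V = defaultdict(float)
mc_update(V, "s0", G=1.0)   # V["s0"] moves 10% of the way toward the return: 0.1
```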

