In dynamic programming, policy evaluation and control
are performed under the assumption that the model is known (Model-Based). When
the model is known, the next state can be predicted, allowing the problem to be
broken down into smaller units and calculated sequentially to find the optimal
policy.
However, in situations where the model is unknown
(Model-Free), the reward function (R) and state transition probability (P) are
not available and, in particular, the next state cannot be predicted. Thus, methods
like dynamic programming cannot be used to solve the problem. In such cases,
Monte Carlo Prediction is needed.
In dynamic programming, the value of every state is
updated by sweeping over the entire state space; in Monte Carlo Prediction, by
contrast, the agent runs until an episode is completed, accumulating experience
and calculating the value function from that experience.
The Monte Carlo method (MC) is one of the most
commonly used approaches when there is insufficient information about the
environment. Instead of precise mathematical calculations, the Monte Carlo
method estimates values statistically using probabilistic techniques. It is
typically used to obtain approximate results when calculating an exact value is
too complex.
The name "Monte Carlo" originates from
Polish-American mathematician Stanislaw Ulam, who named it after the famous
gambling district in Monaco. The Monte Carlo method is well suited to computer
simulation; it proved useful during the development of the atomic and hydrogen
bombs and is still widely applied in many fields today.
Monte Carlo Method
For example, calculating the area of a polygon like
the one above can be difficult mathematically. Using the Monte Carlo method, we
can proceed as follows: First, draw a square that surrounds the polygon.
Suppose we draw a square with each side measuring 1 meter. The area of the
square is easily calculated by multiplying the width and height, resulting in 1
m². Now, let’s scatter round balls inside the square. Suppose 30 balls are
thrown, and 15 fall inside the polygon. Using the ratio of balls that landed
inside the polygon to the total number thrown (15 out of 30), we can estimate the
polygon’s area to be 0.5 m². This is the essence of the Monte Carlo method.
In this approach, the polygon's area is estimated
based on the ratio of balls, rather than through precise measurement. In
reinforcement learning, the Monte Carlo method can approximate values
effectively in cases where complete information about the environment is
unavailable.
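To make the ball-throwing experiment concrete, here is a minimal sketch in Python. The polygon is a hypothetical example shape (a right triangle with true area 0.5 m²); the inside_polygon helper and the sample count are illustrative choices, not something given in the text.

```python
import random

def inside_polygon(x, y):
    # Hypothetical example shape: the right triangle with vertices
    # (0, 0), (1, 0), (0, 1), whose true area is 0.5 m^2.
    return x + y <= 1.0

def estimate_area(n_balls=30_000, square_area=1.0):
    # Scatter balls uniformly over the 1 m x 1 m square and count
    # how many land inside the polygon.
    hits = 0
    for _ in range(n_balls):
        x, y = random.random(), random.random()
        if inside_polygon(x, y):
            hits += 1
    # Estimated area = (balls inside / balls thrown) * area of the square.
    return square_area * hits / n_balls

print(estimate_area())  # approaches 0.5 as n_balls grows
```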
There is one prerequisite for using the Monte Carlo
method in reinforcement learning: the environment in which the agent operates
must have a defined beginning and end. This type of environment is called
episodic. For example, among games, MMORPGs like Lineage have no clear episode
boundaries, whereas games like Diablo progress in episodic units. The Monte Carlo
method can be applied only to environments that can be divided into episodes,
such as the latter.
The Monte Carlo method involves executing an episode
to its end and using the result to estimate a value. For example, to calculate the
value function, the Bellman equation is used, which sums the reward obtained
at the current timestep with all future rewards discounted over time. When all information
about the environment is available, the next state can be predicted, allowing
for precise calculation, as in dynamic programming. However, in the absence of
complete information, the next state is unknown. Therefore, the agent must
operate through the episode, gathering information based on the chosen policy
and the state transition probabilities provided by the environment, and
calculate the value function by accumulating this information.
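For reference, the return mentioned above can be written out explicitly. This is the standard definition, with γ as the discount factor and R as the reward, rather than a formula reproduced from the text:

```latex
\begin{align*}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\
    &= R_{t+1} + \gamma\, G_{t+1}
\end{align*}
```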
When the agent completes its first episode, there may
be a significant difference between the calculated value function and the true
value function (which could only be computed with complete knowledge of the
environment and is therefore unattainable in a model-free setting). However, by averaging
the values obtained over many episodes, the estimated value function will
approach the true value function. This approach is known as the Monte Carlo
method.
Solving an MDP with the Monte Carlo Method
To solve an MDP using the Monte Carlo (MC) method,
let's express it mathematically. Recall that in an MDP, the state-value
function is defined as (1) the expected value of the return (G). This means
that to calculate the value, all possible actions and all possible states (2)-1
and (2)-2 must be considered under a fixed policy. This is possible in a
Model-Based environment, where all information about the environment is known.
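Since the formula those markers refer to is not reproduced here, the following is a reconstruction of the standard model-based form it describes: (1) the state-value function as the expected return under a fixed policy π, expanded into a sum over all actions and all next states (presumably the parts labeled (2)-1 and (2)-2). The notation R(s, a, s') and γ follows the usual MDP convention and is assumed rather than quoted:

```latex
\begin{align*}
V_{\pi}(s) &= \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right]
           && \text{(1)} \\
           &= \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
              \big[\, R(s, a, s') + \gamma\, V_{\pi}(s') \,\big]
           && \text{(2)-1, (2)-2}
\end{align*}
```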
Now, let’s calculate the state-value function using
MC. In MC, (3) the average of the returns is used as the state-value function.
(4) First, run the agent until the end of an episode. When an episode ends,
increment the count (N: accumulated count) by one. (5) Gather all the returns
collected during the episode and store them in the variable S (accumulated
returns). In contrast to the model-based calculation, this approach considers only
the actions actually taken by the agent and the states actually visited during the
episode, rather than all possible actions and states. (6) Finally, divide the accumulated returns by the
accumulated count to obtain the average, which gives the state-value function.
This calculated state-value function can be used to evaluate the fixed policy.
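A minimal sketch of this averaging scheme in Python, assuming per-state tables N(s) and S(s) and a hypothetical run_episode(policy) helper that plays one episode and yields (state, return) pairs; the helper and data layout are illustrative, not from the text:

```python
from collections import defaultdict

def mc_prediction(run_episode, policy, n_episodes=1000):
    N = defaultdict(int)    # accumulated visit counts N(s)
    S = defaultdict(float)  # accumulated returns S(s)
    V = defaultdict(float)  # state-value estimates V(s)
    for _ in range(n_episodes):
        # (4) Run the agent to the end of an episode; the hypothetical helper
        # yields each visited state together with the return G observed from it.
        for state, G in run_episode(policy):
            N[state] += 1                    # (4) increment the accumulated count
            S[state] += G                    # (5) accumulate the returns
            V[state] = S[state] / N[state]   # (6) average = state-value estimate
    return V
```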
Incremental Mean
Incremental Mean provides a method to quickly update
the overall average when a new value arrives by using the previously calculated
average up to the last timestep. xj represents a continuously occurring value.
(1) To calculate the overall average of the continuously arriving x values, add
all of them together and divide by their count, k.
(2) Separate out the most recent value as xk and group
the values up to the previous timestep into their own sum.
(3) Multiply and divide the grouped sum by k−1. Mathematically nothing changes;
this adjustment is made purely for convenience in the later steps.
(3)-1 This part represents the sum of data up to k−1
since only values up to k−1 are summed.
(4) This can be viewed as the average calculated up to
the previous timestep, represented as μk−1.
(5) Transform the equation once more for calculation
purposes.
(6) Finally, convert it into a form that’s easier to
program. While the precise formula should use μk−1, it’s acceptable to replace
it with μk for convenience.
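Because the equation figure is not shown here, the derivation that steps (1) through (6) walk through can be reconstructed as the standard incremental-mean algebra:

```latex
\begin{align*}
\mu_k &= \frac{1}{k} \sum_{j=1}^{k} x_j
      && \text{(1)} \\
      &= \frac{1}{k} \Big( x_k + \sum_{j=1}^{k-1} x_j \Big)
      && \text{(2)} \\
      &= \frac{1}{k} \Big( x_k + (k-1) \cdot \frac{1}{k-1} \sum_{j=1}^{k-1} x_j \Big)
      && \text{(3), (3)-1} \\
      &= \frac{1}{k} \big( x_k + (k-1)\,\mu_{k-1} \big)
      && \text{(4)} \\
      &= \mu_{k-1} + \frac{1}{k} \big( x_k - \mu_{k-1} \big)
      && \text{(5), (6)}
\end{align*}
```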
MC Using Incremental Mean
Now, let’s use the incremental mean to transform MC
into a form that can be programmed. (1) In the previous formula, the average
was calculated by dividing the accumulated return by the accumulated count. To
calculate values this way, information from all past episodes must be stored,
which burdens the system and slows down computation.
(2) By using the incremental mean, we can simplify
this formula: subtract the average return up to the previous timestep, V(s),
from the latest return Gt, and divide by the episode count. Adding this result
to the average return up to the previous timestep provides the state-value
function.
A notable property of this formula is that if Gt and V(s)
are equal, the update reduces to V(s)=V(s) and the state-value function no
longer changes; at that point the estimate has reached the true value function.
In Monte Carlo prediction, the goal is therefore to keep updating until the
difference between Gt and V(s) shrinks toward zero.
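A minimal sketch of this incremental update in Python, reusing the hypothetical run_episode(policy) helper from the earlier sketch; note that no accumulated-return table S(s) is needed anymore, only the count N(s) and the previous average V(s):

```python
from collections import defaultdict

def mc_prediction_incremental(run_episode, policy, n_episodes=1000):
    N = defaultdict(int)    # visit counts N(s)
    V = defaultdict(float)  # running average of returns, i.e. V(s)
    for _ in range(n_episodes):
        for state, G in run_episode(policy):
            N[state] += 1
            # Incremental mean: V(s) <- V(s) + (G - V(s)) / N(s)
            V[state] += (G - V[state]) / N[state]
    return V
```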
MC for Programming
Now, let’s replace 1/N(s) in formula (1) with a step size α to create a new
formula (2). Changing to α makes this value a fixed constant rather than one
that changes over time. Typically, α is a value between 0 and 1. Transforming it
into a fixed constant simplifies programming considerably. Replacing elements of
the exact mathematical formula for ease of programming is acceptable because MC
is an empirical method that approaches a solution statistically through
experience rather than through precise calculation.
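A minimal sketch of the constant-α version, again reusing the hypothetical run_episode(policy) helper; 1/N(s) is replaced by a fixed step size alpha, so the count table disappears entirely:

```python
from collections import defaultdict

def mc_prediction_alpha(run_episode, policy, n_episodes=1000, alpha=0.1):
    V = defaultdict(float)  # state-value estimates V(s)
    for _ in range(n_episodes):
        for state, G in run_episode(policy):
            # Constant-alpha MC update: V(s) <- V(s) + alpha * (G - V(s))
            V[state] += alpha * (G - V[state])
    return V
```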