Optimization Theory and MDP
There are two conditions under which a problem can be
broken down using optimization theory (the Principle of Optimality). The first
is Optimal Substructure: the solution to a large problem can be constructed by
dividing it into smaller problems and combining their optimal solutions. The
second is Overlapping Subproblems: the same small problems recur repeatedly, so
a solution found once can be reused again and again. An MDP satisfies both
conditions.
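To see the two properties in code, here is a minimal Python sketch (not from the text) using the classic Fibonacci recursion: the answer for n is built from the answers for n-1 and n-2 (optimal substructure), and those smaller problems repeat, so caching lets each one be solved only once (overlapping subproblems).

from functools import lru_cache

@lru_cache(maxsize=None)          # reuse each subproblem's answer once it is found
def fib(n: int) -> int:
    if n < 2:
        return n
    # Optimal substructure: the solution for n is assembled from the
    # optimal solutions of the smaller problems n-1 and n-2.
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025, computed in linear time thanks to memoization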
Now, let’s look at dynamic programming, a
representative method for solving MDP problems in model-based environments by
breaking them into smaller units.
Model-Based and Model-Free
First, let’s use a routing example to distinguish model-based
from model-free settings. In a model-based environment, all information about
states and actions is known in advance. In a model-free environment, only the
start and the end are known; there is no prior knowledge of which state an
action will lead to, and the next state is revealed only by actually taking
the action.
Dynamic Programming (DP) is a model-based method in
reinforcement learning: it assumes full knowledge of the environment. In DP,
<S, A, P, R, γ> is known, so vπ, v*, and π* can be computed. DP first
calculates vπ for a fixed policy through policy evaluation, then updates the
policy through policy control, and by iterating these two steps it arrives at
v* and π*.
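As a rough illustration of what "the model is known" means, the tuple <S, A, P, R, γ> can be written down directly. The sketch below uses a tiny three-state chain; the container names (States, Actions, P, R) are illustrative choices for this sketch, not notation from the text.

States  = [0, 1, 2]                           # S: a tiny three-state chain
Actions = ["left", "right"]                   # A
gamma   = 0.9                                 # discount factor

# P[s][a]: list of (probability, next state) pairs -- the transition model
P = {s: {"left":  [(1.0, max(s - 1, 0))],
         "right": [(1.0, min(s + 1, len(States) - 1))]}
     for s in States}

# R[s][a]: expected immediate reward for taking action a in state s
R = {s: {a: -1.0 for a in Actions} for s in States}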
Let’s solve an MDP using dynamic programming.
First, perform policy evaluation. The value function
of an MDP breaks into two parts: the reward received at the current timestep
and the (discounted) value of everything that follows. Policy evaluation uses
this decomposition: with the policy held fixed, the value of each state is
updated by looking only one timestep ahead and using the current estimates of
the next states’ values. Repeating this sweep keeps refining the value of every
state, and in the limit of infinitely many repetitions it yields the true value
function vπ of the fixed policy; this procedure is called iterative policy
evaluation.
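A sketch of iterative policy evaluation under these ideas might look as follows. Here policy[s][a] is the probability of choosing action a in state s, and P and R follow the layout sketched earlier; both are assumptions of this sketch, not code from the text.

def policy_evaluation(states, actions, policy, P, R, gamma, theta=1e-6):
    V = {s: 0.0 for s in states}              # initial value estimates
    while True:
        delta = 0.0
        for s in states:
            # Bellman expectation backup: look one timestep ahead under the
            # fixed policy, using the current estimates of the next states.
            new_v = sum(policy[s][a] *
                        (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                        for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                     # repeated sweeps converge to v_pi
            return V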
Policy control then updates the policy using the value
function computed under the fixed policy: the new policy greedily selects, in
each state, the action that leads to the highest value. Alternating policy
evaluation and policy control in this way makes it possible to find the
optimal policy.
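Continuing the same sketch, greedy policy control can be written as a one-step lookahead over the value function just computed; again, the data layout is an assumption of this sketch.

def policy_improvement(states, actions, V, P, R, gamma):
    new_policy = {}
    for s in states:
        # Action values under the current estimate V (one-step lookahead)
        q = {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
             for a in actions}
        best = max(q.values())
        greedy = [a for a in actions if q[a] == best]
        # Act greedily: all probability goes to the best action(s), ties split evenly
        new_policy[s] = {a: (1.0 / len(greedy) if a in greedy else 0.0)
                         for a in actions}
    return new_policy

Alternating policy_evaluation and policy_improvement until the greedy policy stops changing is the evaluation/control loop described in the text.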
Now, let's look at an example of gridworld to
understand this in more detail. Gridworld is a game where an agent navigates a
grid-like environment to reach a destination. The example game has 16 states.
The agent is randomly placed in one of the 14 states (excluding the
destinations), with the goal of establishing a policy that guides the agent to
the destination in the shortest path.
Gridworld Example
Assume a state transition probability of 1, and a
reward of -1 is received as each timestep progresses. There are four actions
(up, down, left, right), and the initial policy is evenly set to 0.25 for each
action. The game has 16 states: (1) the top-left and bottom-right cells are
the destinations; (2) every state except the destinations receives a reward of
-1 as each timestep progresses; (3) at timestep k=2, the value of state (0,1)
is -1.75.
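The gridworld itself can be sketched directly. One detail the text leaves implicit is what happens when a move would leave the grid; the common convention, assumed here, is that the agent stays where it is and still receives the -1 reward.

ROWS, COLS = 4, 4
STATES   = [(r, c) for r in range(ROWS) for c in range(COLS)]   # 16 states
TERMINAL = {(0, 0), (ROWS - 1, COLS - 1)}                       # the two destinations
MOVES    = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    # Deterministic model: transition probability 1, reward -1 per timestep.
    if state in TERMINAL:
        return state, 0.0
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    next_state = (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state
    return next_state, -1.0

# Initial policy: each of the four actions equally likely (0.25) in every state.
policy = {s: {a: 0.25 for a in MOVES} for s in STATES}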
At timestep k=2, let’s directly calculate the state
value for (0,1). With the initial policy assigning probability 0.25 to each
action, the new value of the state is obtained by summing, over the four
actions, the policy-weighted one-step returns: the reward of -1 plus the value
(from the previous sweep) of the state each action leads to. Repeating this
process continuously updates the state values.
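As a concrete check of the -1.75 figure: after the first sweep (k=1) every non-terminal cell has value -1 and the terminal corner has value 0, and with no discounting each of the four equally likely actions contributes -1 plus the value of the cell it reaches (an off-grid "up" move bounces back to (0,1)). The small computation below reproduces that number; the variable names are just for illustration.

v1_terminal, v1_nonterminal = 0.0, -1.0   # values after the first sweep (k=1)
backups = [
    -1 + v1_terminal,      # left  -> (0,0), the terminal corner
    -1 + v1_nonterminal,   # right -> (0,2)
    -1 + v1_nonterminal,   # down  -> (1,1)
    -1 + v1_nonterminal,   # up    -> off the grid, so the agent stays at (0,1)
]
v2 = 0.25 * sum(backups)   # each action weighted by the uniform policy
print(v2)                  # -1.75, the value quoted in the text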
Policy Update
Policy evaluation is repeated until the value of each
cell has been calculated sufficiently. The initially uniform policy (0.25 for
each action) is then updated; this is called policy control. Policy control
assigns new probabilities to the actions in each cell, and with a greedy
approach the policy is set to move toward the neighboring cell with the
highest value. In the example above, the neighboring cells with a value of -14
are the highest-valued, so the two actions that move toward them are each
given a probability of 0.5, and the probabilities of moving toward the other
cells are set to 0. Ultimately, dynamic programming finds the optimal value
function and policy through sufficient iterations of policy evaluation and
policy control.
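The greedy update described here is easy to reproduce in isolation. In the snippet below, only the -14 value comes from the text; the other numbers are placeholders chosen so that two neighboring cells tie at -14 and the corresponding two actions split the probability 0.5 each.

# One-step lookahead values a cell might see after policy evaluation
# (only the -14 figure comes from the text; the others are placeholders).
action_values = {"up": -20.0, "left": -20.0, "down": -14.0, "right": -14.0}
best = max(action_values.values())
greedy = [a for a, v in action_values.items() if v == best]
new_probs = {a: (1.0 / len(greedy) if a in greedy else 0.0) for a in action_values}
print(new_probs)   # {'up': 0.0, 'left': 0.0, 'down': 0.5, 'right': 0.5}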
Fully understanding dynamic programming from this explanation alone is
difficult, but a deeper treatment is not essential for studying reinforcement
learning, so we will conclude with this overview.