Dynamic Programming


Dynamic programming is a representative algorithm that leverages optimization theory to simplify problem-solving. Optimization problems like MDPs are often set in complex environments. While the small example we’ve examined so far can be solved with a few calculations, the problem becomes significantly harder when the network grows to hundreds of nodes.

Optimization Theory and MDP

There are two conditions under which a problem can be broken down using optimization theory (the Principle of Optimality). The first is Optimal Substructure: the solution to the large problem can be constructed by dividing it into smaller problems and combining their optimal solutions. The second is Overlapping Subproblems: the same small problems recur repeatedly, so a solution found once can be reused again and again. MDPs satisfy both conditions, which is why they can be solved with dynamic programming.

Now, let’s look at dynamic programming, a representative method for solving MDP problems in model-based environments by breaking them into smaller units.

Model-Based and Model-Free

First, let’s divide a routing example into model-based and model-free. In a model-based environment, all information regarding actions and states is known. In a model-free environment, however, only the information at the start and end is known, with no prior knowledge of the next state resulting from an action. It’s only by taking action that the next state is revealed.

Dynamic Programming (DP) is a model-based method in reinforcement learning, assuming full knowledge of the environment. In DP, <S, A, P, R, γ> is known, and vπ, v*, and π* can be computed. DP first calculates vπ through policy evaluation, then calculates v* and π* through policy control, and updates the policy accordingly.
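As a rough illustration of what "full knowledge of the environment" looks like in code, the model of an MDP can be written out as explicit tables. The two-state MDP below is a made-up toy example (the names S, A, P, R, and gamma simply mirror the tuple above); it is only a sketch of the data a model-based method gets to see, not the routing or gridworld problem from the text.

```python
# A model-based view of an MDP: every element of <S, A, P, R, gamma> is
# written down explicitly. The two-state toy MDP here is purely illustrative.

S = ["s0", "s1"]          # states
A = ["stay", "move"]      # actions

# P[(s, a)] is a list of (next_state, probability) pairs
P = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "move"): [("s1", 1.0)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "move"): [("s0", 1.0)],
}

# R[(s, a)] is the immediate reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): -1.0,
    ("s1", "stay"): 1.0,
    ("s1", "move"): -1.0,
}

gamma = 0.9  # discount factor

# Because P and R are fully known, dynamic programming can sweep over
# these tables directly instead of sampling the environment.
```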

Let’s solve an MDP using dynamic programming.

First, perform policy evaluation. The value function of an MDP can be decomposed into two parts: the reward at the current timestep plus the value of the timesteps that follow. In policy evaluation, the policy is held fixed, and the value of the current timestep is updated by looking only one timestep ahead. Sweeping this update over every state and repeating the process gradually refines the value of each state, and if it is repeated indefinitely, it converges to the true value function of the MDP under that policy; this procedure is known as iterative policy evaluation.
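A minimal sketch of iterative policy evaluation for a finite MDP given as tables might look like the following; the table layout (P as lists of next-state/probability pairs, R as immediate rewards) is an assumption carried over from the toy sketch above, not code from the text.

```python
def policy_evaluation(S, A, P, R, gamma, policy, sweeps=100):
    """Iteratively approximate v_pi for a fixed policy.

    policy[(s, a)] is the probability of taking action a in state s.
    P[(s, a)] is a list of (next_state, probability) pairs.
    R[(s, a)] is the immediate reward.
    """
    v = {s: 0.0 for s in S}          # start from an all-zero value function
    for _ in range(sweeps):
        new_v = {}
        for s in S:
            total = 0.0
            for a in A:
                # expected return of taking action a in state s, one step ahead
                q = R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)])
                total += policy[(s, a)] * q
            new_v[s] = total
        v = new_v                    # synchronous update of all states
    return v
```

Calling it with the toy tables above and a uniform policy would return an approximation of vπ after the given number of sweeps.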

Policy control involves updating the policy based on the value function calculated under the fixed policy, greedily selecting the actions that yield the highest value. Alternating policy evaluation and policy control in this way eventually finds the optimal policy.
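Policy control can then be sketched as a greedy step with respect to the evaluated value function, again using the same assumed table layout:

```python
def greedy_improvement(S, A, P, R, gamma, v):
    """Return a deterministic policy that is greedy with respect to v."""
    policy = {}
    for s in S:
        # action value q(s, a) under the current value estimate v
        q = {a: R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)])
             for a in A}
        best = max(q, key=q.get)
        for a in A:
            policy[(s, a)] = 1.0 if a == best else 0.0
    return policy
```

Policy iteration simply alternates policy_evaluation and greedy_improvement until the policy stops changing.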

Now, let's look at a gridworld example to understand this in more detail. Gridworld is a game where an agent navigates a grid-like environment to reach a destination. The example game has 16 states. The agent is randomly placed in one of the 14 non-destination states, and the goal is to establish a policy that guides the agent to a destination along the shortest path.

Gridworld Example

Assume a state transition probability of 1 and a reward of -1 at every timestep. There are four actions (up, down, left, right), and the initial policy assigns an equal probability of 0.25 to each. Of the 16 states, the top-left and bottom-right cells are the destinations (terminal states). After the first sweep (k=1), every state except the destinations has a value of -1, and at the second sweep (k=2) the value of state (0,1) becomes -1.75.

Calculating State Value

At timestep k=2, let’s directly calculate the state value for (0,1). With the initial policy assigning a probability of 0.25 to each action, the new value of a state is obtained by summing, over the four actions, the policy-weighted quantity of the immediate reward plus the value of the resulting state from the previous sweep. Repeating this process continuously updates the state values.
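As a quick check of that number, the single backup for state (0,1) at k=2 can be written out directly (assuming, as in the usual 4x4 gridworld, that moving off the grid leaves the agent in place and that no discounting is applied):

```python
# Value of every non-terminal state after the first sweep (k=1) is -1,
# and the terminal state's value is 0.
v1_terminal = 0.0       # left neighbour of (0,1) is the terminal state
v1_nonterminal = -1.0   # every other neighbour, including (0,1) itself

# Uniform random policy: 0.25 per action; reward -1 per step; gamma = 1.
v2_01 = 0.25 * (-1 + v1_terminal)      # left: reaches the terminal state
v2_01 += 0.25 * (-1 + v1_nonterminal)  # right: moves to (0,2)
v2_01 += 0.25 * (-1 + v1_nonterminal)  # down: moves to (1,1)
v2_01 += 0.25 * (-1 + v1_nonterminal)  # up: bumps the wall, stays at (0,1)

print(v2_01)  # -1.75, matching the value quoted in the text
```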

Policy Update

Policy evaluation is repeated until the value of each grid cell has been sufficiently calculated. The initially uniform policy (0.25 for each action) is then updated; this is called policy control. Policy control assigns new probabilities to the actions in each cell, and with a greedy approach the policy is set to move toward the neighboring cell with the highest value. In the example above, the two neighboring cells with a value of -14 are the highest, so the probability of moving to each of them is set to 0.5 and the probabilities of the other actions are set to 0. By iterating policy evaluation and policy control enough times, dynamic programming ultimately finds the optimal value function and the optimal policy.
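Putting the pieces together, a compact sketch of policy evaluation followed by a greedy update on the 4x4 gridworld might look like this; the helper names and grid layout are assumptions for illustration, and the printed values include the -14 cells mentioned above.

```python
import numpy as np

N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}          # top-left and bottom-right
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    if state in TERMINALS:
        return state
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else state

# Policy evaluation under the uniform random policy (0.25 per action),
# with reward -1 per step and no discounting (gamma = 1).
v = np.zeros((N, N))
for _ in range(1000):                         # enough sweeps to converge
    new_v = np.zeros((N, N))
    for r in range(N):
        for c in range(N):
            if (r, c) in TERMINALS:
                continue
            new_v[r, c] = sum(0.25 * (-1 + v[step((r, c), a)]) for a in ACTIONS)
    v = new_v
print(np.round(v))   # the converged grid contains 0, -14, -18, -20 and -22

# Policy control: act greedily with respect to the evaluated values,
# splitting probability equally among tied best actions.
greedy_policy = {}
for r in range(N):
    for c in range(N):
        if (r, c) in TERMINALS:
            continue
        q = {a: -1 + v[step((r, c), a)] for a in ACTIONS}
        best = max(q.values())
        winners = [a for a, value in q.items() if value >= best - 1e-9]
        greedy_policy[(r, c)] = {a: 1.0 / len(winners) for a in winners}
print(greedy_policy[(1, 1)])   # two actions toward the -14 cells, 0.5 each
```

From the cell at (1,1), for example, the two tied actions point toward the neighboring -14 cells and each receives probability 0.5, matching the greedy update described above.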

Fully understanding dynamic programming from this explanation alone is challenging, but a deep command of it is not essential for studying reinforcement learning, so we’ll conclude with this overview.





