Markov Decision Process (MDP) Concept


The MDP is an extension of the Markov Reward Process (MRP) with Action (A) and Policy (π) added. While the goal of MRP is to calculate the overall value of an episode or environment, the goal of MDP is to determine a policy that maximizes that value. This may not be entirely clear yet, so for now just keep the phrase "deciding a policy" in mind.

Agent Movement in MRP and MDP

In MDP, the concept of an agent is introduced. We previously explained Brownian motion by observing the movement of pollen particles in a stochastic process; that pollen corresponds to the agent in MDP. An agent could also have been used to describe stochastic processes and MRP, but it was not necessary there: the state (S: State) of the environment changes on its own over time according to the state transition probability (P), so the environment is passive. In MDP, however, the next state is determined jointly by the agent's actions and the state transition probabilities. Because the agent actively chooses its actions according to the policy, MDP is active in a way that broadens its range of applications compared to MRP.

The term agent means an actor or entity that takes actions. In MDP, the agent acts based on the policy (π), and the state (S: State) changes based on the actions taken by the agent and the state transition probabilities (P).

Components of MDP

In MDP, the state transition matrix (P) represents the conditional probability that, given the state s at time t and action a, the state will be s' at time t+1. The reward function (R) is the expected reward at time t+1 when action a is taken in state s at time t. MDP adds action as a condition to the state transition matrix and reward function, meaning actions (A: Set of Actions) must be considered along with these components. Actions influence the next state, and just like states, the number of actions in MDP is finite.
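Written out in the notation commonly used for MDPs (the symbols below follow convention and are a reconstruction, not formulas shown in this post), these two components are:

$$P^{a}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s,\ A_t = a\right]$$

$$R^{a}_{s} = \mathbb{E}\left[R_{t+1} \mid S_t = s,\ A_t = a\right]$$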

MDP Policy

In MDP, a policy represents the probability of choosing each action, and it is similar in structure to the state transition matrix. If there are four possible actions, the probabilities of those four actions in a given state sum to 1. Since policies are probabilistic, following a policy does not mean always taking the highest-probability action; it means that higher-probability actions are more likely to be chosen.
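In the same conventional notation (again a reconstruction), a policy is a conditional probability distribution over actions given a state:

$$\pi(a \mid s) = \mathbb{P}\left[A_t = a \mid S_t = s\right], \qquad \sum_{a \in A} \pi(a \mid s) = 1$$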

For example, if a policy assigns a 60% probability to action A and a 40% probability to action B, the agent does not always choose action A just because its probability is higher. Instead, action A has a 60% likelihood of being chosen, with the remaining 40% for action B. Similar to real-world decisions, MDP allows for unpredictability in the agent's behavior. This concept is closely related to the exploration problem discussed later.
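As a minimal sketch of this behavior (the 0.6/0.4 split comes from the example above; everything else, including the action names, is illustrative), sampling actions from such a policy might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

actions = ["A", "B"]     # the two actions from the example
policy = [0.6, 0.4]      # pi(A|s) = 0.6, pi(B|s) = 0.4

# Sample 10 decisions: A is chosen more often, but B still appears.
samples = rng.choice(actions, size=10, p=policy)
print(samples)
```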

Comparison of MRP and MDP Examples

The sudden appearance of actions and policies might seem confusing, but an example should make this clearer. The MRP environment is relatively simple: at timestep t1 the state is S1, and at t2 the possible states are S2 and S3, reached with state transition probabilities of 0.7 and 0.3, respectively. The probability of being in S2 at t2 is therefore simply the state transition probability, 0.7.

In an MDP environment, the probability of being in S2 is more complex. In state S1, the available actions are A1 and A2. Action A1 leads to state S2, and action A2 leads to state S3. The agent’s policy assigns probabilities of 0.4 and 0.6 for choosing A1 and A2, respectively. Even if the agent selects A1 per the policy, transition to S2 isn’t guaranteed due to the influence of state transition probabilities. In an MDP environment, the agent resembles a captain steering a ship, while transition probabilities are like ocean currents and wind, affecting movement independently of the agent’s intentions. To compute probabilities in MDP, multiply each action probability by its respective transition probability and sum the results.
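A small numeric sketch of this calculation (the policy probabilities 0.4 and 0.6 come from the example; the per-action transition probabilities below are assumptions chosen purely for illustration, picked so the total matches the 0.7 of the MRP example):

```python
# Policy in state S1 (from the example): pi(A1|S1) = 0.4, pi(A2|S1) = 0.6
pi = {"A1": 0.4, "A2": 0.6}

# Assumed per-action probabilities of landing in S2 (illustrative values only):
# even under A1, the transition to S2 is not guaranteed.
p_s2_given_action = {"A1": 0.85, "A2": 0.60}

# P(S2 | S1) = sum over actions of pi(a|S1) * P(S2 | S1, a)
p_s2 = sum(pi[a] * p_s2_given_action[a] for a in pi)
print(p_s2)  # 0.4*0.85 + 0.6*0.60 ≈ 0.70
```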

In this example, we see that the probability of moving from S1 to S2 is the same in both MRP and MDP. However, MDP introduces the new element of policy, requiring a more complex formula to account for it. That added complexity is what provides the new capability: the agent's behavior can now be shaped by choosing the policy.

In MDP, the agent’s actions are solely determined by the policy, which does not change over time. Additionally, MDP assumes the Markov property, meaning the policy depends only on the current state, not past states.

Considering Policy in State Transition Matrix and Reward Function

In MDPs, the policy is an important factor, so the state transition matrix (P) should be reconstructed to take the policy into account. The state transition matrix expresses, in matrix form, the conditional probabilities of moving between states of the environment. In MDPs, actions are introduced, and the policy (π) is the sole determinant of which action is taken; the policy itself is a matrix of conditional probabilities. Changes of state in an MDP are therefore influenced by both the original state transition matrix and the policy. By taking the expected value (mean) over the actions chosen by the policy and their state transition probabilities, we obtain a state transition matrix that accounts for the policy.
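In conventional notation (a reconstruction of the formula described here), the policy-aware transition matrix is:

$$P^{\pi}_{ss'} = \sum_{a \in A} \pi(a \mid s)\, P^{a}_{ss'}$$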

Similarly, a reward function that takes the policy into account, Rπ, is obtained by averaging the per-action reward function over the action probabilities given by the policy, just as with the state transition matrix.
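As a minimal numeric sketch (the two-state, two-action environment below is entirely made up for illustration), computing Pπ and Rπ from the per-action quantities might look like:

```python
import numpy as np

# Illustrative 2-state, 2-action environment (all numbers are assumptions).
# P[a, s, s2] = probability of moving from state s to state s2 under action a.
P = np.array([
    [[0.9, 0.1],   # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],   # action 1
     [0.6, 0.4]],
])

# R[a, s] = expected immediate reward for taking action a in state s.
R = np.array([
    [1.0, 0.0],    # action 0
    [0.5, 2.0],    # action 1
])

# pi[s, a] = probability of choosing action a in state s.
pi = np.array([
    [0.4, 0.6],
    [0.7, 0.3],
])

# Policy-averaged transition matrix: P_pi[s, s2] = sum_a pi[s, a] * P[a, s, s2]
P_pi = np.einsum("sa,ast->st", pi, P)

# Policy-averaged reward: R_pi[s] = sum_a pi[s, a] * R[a, s]
R_pi = np.einsum("sa,as->s", pi, R)

print(P_pi)  # [[0.66, 0.34], [0.32, 0.68]] -- each row still sums to 1
print(R_pi)  # [0.7, 0.6]
```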

State Value Function in MDP

Just as in MRP, a state value function (State Value Function) can be calculated in MDP. From here on the formulae become slightly more complex, but analysed step by step they remain manageable. (1) The general form is similar to that of MRP, except that the policy (π) is taken into account. (2) The policy is the probability of choosing each action, and the probabilities of all actions in a state sum to 1; to take the expected value under the policy, each action's term is multiplied by its conditional probability under the policy and the results are summed over the actions. (3)-1 computes the immediate reward in state s, weighted by the policy. (3)-2 requires both the policy and the state transition matrix, so the summation (∑) appears twice: first over actions, weighted by the policy, and then over next states, weighted by the transition probabilities.
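The equations labelled (1)–(3) are not reproduced in this text, so the following is a reconstruction from the description above, in conventional notation:

$$\text{(1)}\quad v_{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \mid S_t = s\right]$$

$$\text{(2)}\quad \sum_{a \in A} \pi(a \mid s) = 1$$

$$\text{(3)}\quad v_{\pi}(s) = \underbrace{\sum_{a \in A} \pi(a \mid s)\, R^{a}_{s}}_{(3)\text{-}1} \;+\; \underbrace{\gamma \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} P^{a}_{ss'}\, v_{\pi}(s')}_{(3)\text{-}2}$$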

Relationship Between State Value Functions in MRP and MDP

A common mistake here is to think that v(s) and vπ(s) measure different things. Both ultimately express the value of a state; vπ(s) simply makes explicit that the value is computed while following a particular policy, which is the additional factor MDP introduces.

The state value function is used to evaluate how valuable a given state is.

Simplified Form of State Value Function (Matrix Form)

The state value function can also be written in a simplified (matrix) form, which for many algorithms is more useful than the detailed expression. (1) The immediate reward is obtained by following policy π, so it can be written as Rπ. The reward received at the next time step must account for the reward expected under policy π, the state transition probability, and the discount rate. (2) To express vπ using only the policy-averaged reward, the policy-averaged state transition matrix, and the discount rate, γPπvπ is moved to the left-hand side, vπ is factored out, and both sides are multiplied by the inverse of the remaining matrix (the matrix counterpart of dividing).
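The equations referred to as (1)–(3) are not shown in this text, so the following is a reconstruction from the description, in conventional notation:

$$\text{(1)}\quad v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi}$$

$$\text{(2)}\quad (I - \gamma P^{\pi})\, v_{\pi} = R^{\pi} \;\;\Longrightarrow\;\; v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}$$

$$\text{(3)}\quad \begin{bmatrix} v_{\pi}(s_1) \\ \vdots \\ v_{\pi}(s_n) \end{bmatrix} = \begin{bmatrix} R^{\pi}_{s_1} \\ \vdots \\ R^{\pi}_{s_n} \end{bmatrix} + \gamma \begin{bmatrix} P^{\pi}_{s_1 s_1} & \cdots & P^{\pi}_{s_1 s_n} \\ \vdots & \ddots & \vdots \\ P^{\pi}_{s_n s_1} & \cdots & P^{\pi}_{s_n s_n} \end{bmatrix} \begin{bmatrix} v_{\pi}(s_1) \\ \vdots \\ v_{\pi}(s_n) \end{bmatrix}$$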

There is one apparent oddity, however. In (1), the value for the next time step should strictly be the value of the next state, vπ(s'), rather than simply vπ(s); yet the same vπ appears on both sides. The reason is that (3) the expression is in matrix form: the calculation is carried out for every state (s ∈ S) at once, weighted by the transition probability from each state to each next state, so using the single vector vπ gives the same result. Both s and s' are elements of the same state set S, which is why one vector covers both.
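As a minimal sketch (reusing the made-up 2-state Pπ and Rπ from the earlier example and an assumed discount rate), solving this matrix equation directly might look like:

```python
import numpy as np

# Policy-averaged quantities from the earlier sketch (illustrative numbers only).
P_pi = np.array([[0.66, 0.34],
                 [0.32, 0.68]])
R_pi = np.array([0.70, 0.60])
gamma = 0.9                      # assumed discount rate

# v_pi = (I - gamma * P_pi)^(-1) * R_pi, solved as a linear system.
I = np.eye(len(R_pi))
v_pi = np.linalg.solve(I - gamma * P_pi, R_pi)
print(v_pi)  # state values under the policy
```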

