Agent Movement in MRP and MDP
MDP introduces the concept of an agent. We previously explained Brownian motion by observing pollen particles moving in a stochastic process; that pollen corresponds to the agent of an MDP. An agent could have been used to explain stochastic processes and MRP as well, but it was not necessary: in those settings the state (S: State) of the environment changes on its own over time according to the state transition probability (P), so the entity in the environment plays a purely passive role. In an MDP, by contrast, the agent's actions and the state transition probabilities together determine the environment's state. The agent actively chooses actions according to its policy, and this active role is what broadens the range of problems MDP can be applied to compared with MRP.
The term agent refers to the actor, that is, the entity that takes actions. In an MDP, the agent acts according to the policy (π), and the state (S: State) changes according to both the actions taken by the agent and the state transition probabilities (P).
Components of MDP
In an MDP, the state transition matrix (P) gives the conditional probability that, given state s and action a at time t, the state at time t+1 will be s'. The reward function (R) is the expected reward at time t+1 when action a is taken in state s at time t. In other words, MDP adds the action as a condition to both the state transition matrix and the reward function, so the set of actions (A: Set of Actions) must be considered alongside these components. Actions influence the next state, and like the set of states, the set of actions in an MDP is finite.
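Written out in standard notation (a restatement of the two definitions above; the shorthand symbols are the conventional ones rather than notation taken from this text):

\[
P^{a}_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s,\, A_t = a \right]
\]
\[
R^{a}_{s} = \mathbb{E}\left[ R_{t+1} \mid S_t = s,\, A_t = a \right]
\]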
MDP Policy
In an MDP, a policy gives the probability of choosing each action, and it has a structure similar to the state transition matrix. If there are four available actions, the probabilities of those four actions in a given state must sum to 1. Because policies are probabilistic, following a policy does not mean always taking the highest-probability action; it means that higher-probability actions are simply more likely to be chosen.
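In the same notation, the policy is the conditional probability of choosing each action given the current state, and these probabilities sum to one in every state:

\[
\pi(a \mid s) = \mathbb{P}\left[ A_t = a \mid S_t = s \right], \qquad \sum_{a \in A} \pi(a \mid s) = 1
\]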
For example, if a policy assigns a 60% probability to
action A and a 40% probability to action B, the agent does not always choose
action A just because its probability is higher. Instead, action A has a 60%
likelihood of being chosen, with the remaining 40% for action B. Similar to
real-world decisions, MDP allows for unpredictability in the agent's behavior.
This concept is closely related to the exploration problem discussed later.
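The following Python sketch illustrates this stochastic action selection; the 60/40 split mirrors the example above, but the function and variable names are illustrative assumptions rather than anything defined in the text.

import random

# Hypothetical policy for a single state: action -> probability (must sum to 1).
policy = {"A": 0.6, "B": 0.4}

def sample_action(policy):
    """Draw one action at random according to the policy's probabilities."""
    actions = list(policy.keys())
    weights = list(policy.values())
    return random.choices(actions, weights=weights, k=1)[0]

# Over many draws, roughly 60% come out as "A" and 40% as "B",
# even though "A" is never chosen deterministically.
counts = {"A": 0, "B": 0}
for _ in range(10_000):
    counts[sample_action(policy)] += 1
print(counts)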
Comparison of MRP and MDP
Examples
The sudden appearance of actions and policies might seem confusing, but an example should make things clearer. The MRP environment is relatively simple. At timestep t1 the state is S1, and at t2 the possible states are S2 and S3, reached with state transition probabilities 0.7 and 0.3 respectively. The probability of being in S2 at t2 is therefore simply the state transition probability, 0.7.
In an MDP environment, the probability of being in S2
is more complex. In state S1, the available actions are A1 and A2. Action A1
leads to state S2, and action A2 leads to state S3. The agent’s policy assigns
probabilities of 0.4 and 0.6 for choosing A1 and A2, respectively. Even if the
agent selects A1 per the policy, transition to S2 isn’t guaranteed due to the
influence of state transition probabilities. In an MDP environment, the agent
resembles a captain steering a ship, while transition probabilities are like
ocean currents and wind, affecting movement independently of the agent’s
intentions. To compute the probability of reaching a particular next state in an MDP, multiply each action's policy probability by that action's transition probability to that state, and sum the results over all actions.
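As one concrete set of numbers consistent with this example (the per-action transition probabilities below are assumptions chosen to reproduce the 0.7 stated in the text; only the policy probabilities 0.4 and 0.6 come from the example): suppose taking A1 in S1 reaches S2 with probability 1.0, while taking A2 still drifts to S2 with probability 0.5. Then

\[
P^{\pi}(S_2 \mid S_1)
= \pi(A_1 \mid S_1)\, P(S_2 \mid S_1, A_1) + \pi(A_2 \mid S_1)\, P(S_2 \mid S_1, A_2)
= 0.4 \times 1.0 + 0.6 \times 0.5 = 0.7 ,
\]

which matches the 0.7 probability of reaching S2 in the MRP.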
In this example, the probability of moving from S1 to S2 turns out to be the same in both MRP and MDP. However, MDP introduces the new element of policy, which makes the formula more complex. What the extra complexity buys us is new capability: the agent can now influence, through its choice of actions, which states it is likely to reach.
Considering Policy in State Transition Matrix and Reward Function
In an MDP, the policy is an important factor, so we should rebuild the state transition matrix (P) to account for it. The state transition matrix expresses, in matrix form, the conditional probabilities of moving between states within the environment. In an MDP, actions are introduced, and the policy (π) is what determines which actions are taken; the policy itself is a matrix of conditional probabilities. Consequently, state changes in an MDP are influenced by both the original state transition matrix and the policy. By taking the expected value (mean) of the per-action transition probabilities, weighted by the policy's action probabilities, we can derive a state transition matrix that takes the policy into account.
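In equation form (standard notation, not an equation reproduced from this text's figures), this policy-weighted state transition matrix is:

\[
P^{\pi}_{ss'} = \sum_{a \in A} \pi(a \mid s)\, P^{a}_{ss'}
\]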
Similarly, a reward function that takes the policy into account is obtained by taking the policy-weighted expected value (mean) of the per-action reward function, just as with the state transition matrix.
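The corresponding policy-weighted reward is obtained the same way:

\[
R^{\pi}_{s} = \sum_{a \in A} \pi(a \mid s)\, R^{a}_{s}
\]

A minimal Python sketch computing both quantities with NumPy follows; the array shapes, variable names, and toy numbers are assumptions for illustration, not data from the text.

import numpy as np

def policy_weighted(P, R, pi):
    """Return the policy-weighted transition matrix P_pi and reward vector R_pi.

    Assumed shapes: P[a, s, s'] per-action transition probabilities,
    R[s, a] expected reward for taking action a in state s,
    pi[s, a] probability of choosing action a in state s.
    """
    P_pi = np.einsum("sa,asn->sn", pi, P)  # P_pi[s, s'] = sum_a pi[s, a] * P[a, s, s']
    R_pi = np.einsum("sa,sa->s", pi, R)    # R_pi[s]     = sum_a pi[s, a] * R[s, a]
    return P_pi, R_pi

# Toy 2-state, 2-action MDP (all numbers are made up for illustration).
P = np.array([
    [[1.0, 0.0],   # action 0, from state 0
     [1.0, 0.0]],  # action 0, from state 1
    [[0.5, 0.5],   # action 1, from state 0
     [0.0, 1.0]],  # action 1, from state 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.4, 0.6],
               [0.5, 0.5]])

P_pi, R_pi = policy_weighted(P, R, pi)
print(P_pi)  # first row is [0.7, 0.3], matching the earlier comparison example
print(R_pi)  # policy-weighted expected reward per state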
State Value Function in MDP
Just as in MRP, a state value function can be calculated in an MDP. From here on, the formulae become slightly more complex, but analyzed step by step they remain understandable. (1) The general formula is similar to that of MRP, except that the policy (π) is taken into account. (2) The policy is the probability of choosing each action, and these probabilities sum to 1 in every state. Therefore, to compute an expected value that accounts for the policy, we weight the quantity associated with each possible action by its policy probability and sum over the actions. (3)-1 computes the immediate reward in state s, taking the policy into account. (3)-2 requires both the policy and the state transition matrix, so the summation (∑) appears twice: the first sum is over actions, weighted by the policy, and the second is over next states, weighted by the transition probabilities.
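Taken together, steps (1) through (3) describe the Bellman expectation equation for the state value function under a policy. A standard rendering, reconstructed from the description above rather than copied from the text's figures, is:

\[
v_{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \left( R^{a}_{s} + \gamma \sum_{s' \in S} P^{a}_{ss'}\, v_{\pi}(s') \right)
\]

The outer sum weighted by π(a|s) is the policy averaging of (2), the R term inside the parentheses is the immediate reward of (3)-1, and the inner sum over next states is the transition-probability term of (3)-2.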
Relationship Between State Value Functions in MRP and MDP
A common mistake here is to think that v(s) and vπ(s) measure different things. Both functions compute the same kind of state value; vπ(s) simply takes the additional factor of the policy into account when calculating it. In both cases, the state value function is used to evaluate how valuable a given state is.
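In other words, both are expected returns from a state; the only difference is whether the expectation is taken while following a policy. Written in standard notation (a restatement, not an equation from the text), with G_t denoting the discounted return:

\[
v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right] \;\; \text{(MRP)}, \qquad
v_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right] \;\; \text{(MDP)}
\]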
Simplified Form of State Value Function (Matrix Form)
The state value function can also be written in a simplified form. For many algorithms, this compact expression is more useful than the detailed one. (1) The immediate reward is received while following policy π, so it can be written as Rπ. The value received from the next time step onward must account for the value expected under policy π, the state transition probability, and the discount rate, which gives the term γPπvπ. (2) To express the state value using only the policy-weighted reward, the policy-weighted transition probability, and the discount rate, γPπvπ is moved to the left side, vπ is factored out, and both sides are then multiplied by the inverse of (I − γPπ).
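Step by step, the manipulation described in (1) and (2) reads as follows (a reconstruction in matrix notation consistent with the symbols used above):

\[
v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi}
\;\;\Longrightarrow\;\;
(I - \gamma P^{\pi})\, v_{\pi} = R^{\pi}
\;\;\Longrightarrow\;\;
v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}
\]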
There is a slight oddity here, however. In (1), the value from the next time step should, strictly speaking, be written as vπ(s') rather than simply vπ(s); yet vπ is used directly. The reason is (3): the expression is in matrix form. vπ is a vector containing the value of every state (s ∈ S), and multiplying by the transition probabilities in Pπ picks out the values of the successor states, so using the same vπ yields the same result. Both s and s' are elements of the same state set S.