MDP Optimal Value Function

Now let's discuss the Optimal Value Function, which is what we need in order to solve an MDP. The Optimal Value Function comes in two forms: the Optimal State-Value Function and the Optimal Action-Value Function.


Optimal State-Value Function and Optimal Action-Value Function

(1) The Optimal State-Value Function (v*(s)) is the largest state value attainable in state s over all policies. Each policy π has its own state-value function v_π(s), and v*(s) is the maximum of v_π(s) over π. Similarly, (2) the Optimal Action-Value Function (q*(s,a)) is the largest action value attainable for the pair (s,a) over all policies: the maximum of q_π(s,a) over π.
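To make the "maximum over policies" idea concrete, here is a minimal sketch on a tiny MDP. The two states, two actions, rewards, discount factor, and helper names are all made up for illustration. Since a finite MDP always has a deterministic optimal policy, the sketch only enumerates deterministic policies, evaluates each one, and keeps the best value per state.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]


def evaluate(policy, sweeps=1000):
    """Iterative policy evaluation for a deterministic policy {state: action}."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {
            s: sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][policy[s]])
            for s in STATES
        }
    return v


# Evaluate every deterministic policy: one state-value function per policy.
values = {}
for choice in product(ACTIONS, repeat=len(STATES)):
    values[choice] = evaluate(dict(zip(STATES, choice)))

# v*(s): the best value attainable in s across all policies.
v_star = {s: max(v[s] for v in values.values()) for s in STATES}

# q*(s, a): take action a once, then collect the optimal value afterwards.
q_star = {
    s: {a: sum(p * (r + GAMMA * v_star[s2]) for p, s2, r in P[s][a]) for a in ACTIONS}
    for s in STATES
}
```

Brute-force enumeration is only practical for toy problems like this one, because the number of deterministic policies grows exponentially with the number of states.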

Knowing the Optimal Action-Value Function of an MDP is equivalent to knowing an optimal policy: in each state, simply choose the action with the largest q*(s,a). Therefore, once the Optimal Action-Value Function has been found, the MDP problem is solved.

Characteristics of the Optimal Policy

We can now naturally define the Optimal Policy (π*). The Optimal Policy is a policy whose actions achieve the optimal value. It has several characteristics: (1) The Optimal Policy is at least as good as any other policy. Policies are compared through their value functions: a policy π is at least as good as a policy π' when v_π(s) is greater than or equal to v_π'(s) in every state s. The comparison is about expected return, not about the action probabilities themselves. (2) The state-value function obtained by following the Optimal Policy is equal to the Optimal State-Value Function, v*(s). (3) The action-value function obtained by following the Optimal Policy is equal to the Optimal Action-Value Function, q*(s,a).

A Method to Represent the Optimal Policy

Let's look at one way to represent the Optimal Policy in an MDP. (1) In state s, if an action a is the action that maximizes the Optimal Action-Value Function, that is, a = argmax q*(s,a) taken over all actions, the policy assigns it probability 1; every other action gets probability 0. Since the policy is the probability of selecting an action, the policy in state s will always select the action whose probability is 1.
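The rule above can be sketched in a few lines of Python. The q* table, the state and action names, and the helper `greedy_policy` are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical optimal action values q*(s, a); the numbers are illustrative only.
q_star = {
    "s0": {"left": 17.1, "right": 19.0},
    "s1": {"left": 17.1, "right": 20.0},
}


def greedy_policy(q):
    """Return pi(a|s): 1 for the action that maximizes q(s, a), 0 for all others."""
    policy = {}
    for state, action_values in q.items():
        # argmax over actions; on a tie, max() keeps the first action it sees
        best = max(action_values, key=action_values.get)
        policy[state] = {a: 1.0 if a == best else 0.0 for a in action_values}
    return policy


pi_star = greedy_policy(q_star)
```

Here `pi_star["s0"]` puts probability 1 on "right" and 0 on "left". When several actions tie for the maximum, any of them can be chosen; this sketch simply takes the first.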

A Quick Note
Mathematical Symbols ∀ and argmax
The symbol ∀ signifies "for all" or "any." For example, ∀π means "for all policies." The function argmax finds the input x that maximizes the value of a function, subject to a given condition. For instance, for argmax sin(x) with 0 ≤ x ≤ 2π, the answer is x = 0.5π, because that is where sin(x) reaches its maximum.
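As a quick numerical check of the sin example, the sketch below searches a uniform grid on the interval from 0 to 2π. The grid size is an arbitrary assumption. It also shows the difference between argmax (the input that wins) and max (the winning value).

```python
import math

# Sample 0 <= x <= 2*pi on a uniform grid; N is an arbitrary choice.
N = 1000
xs = [2 * math.pi * i / N for i in range(N + 1)]

best_x = max(xs, key=math.sin)  # argmax: the x that maximizes sin(x)
best_val = math.sin(best_x)     # max: the maximized value itself
```

With this grid, `best_x` lands on 0.5π and `best_val` is 1, matching the example.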




