(1) The Optimal State-Value Function (v*(s)) is
defined as the state-value function with the highest value among the
state-value functions obtained by following different policies. Similarly,
(2) the Optimal Action-Value Function (q*(s,a)) is the action-value function
with the highest value among the action-value functions obtained by following
different policies.
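In the usual MDP notation (a restatement of the two definitions above, not a formula quoted from the text), these read:

```latex
v^*(s) = \max_{\pi} v_{\pi}(s), \qquad q^*(s,a) = \max_{\pi} q_{\pi}(s,a)
```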
Knowing the Optimal Action-Value Function in an MDP is
equivalent to knowing a policy that selects the best action in every state.
Therefore, if the Optimal Action-Value Function can be found,
the MDP problem can be solved.
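As a minimal sketch of this point (the q_star table, state names, and actions below are hypothetical and chosen only for illustration), once q*(s,a) is known, acting optimally reduces to picking the highest-valued action in each state:

```python
# Hypothetical optimal action-value table: q_star[state][action] -> value.
q_star = {
    "s0": {"left": 1.0, "right": 3.5},
    "s1": {"left": 2.0, "right": 0.5},
}

def best_action(state):
    """Return the action with the highest optimal action-value in the given state."""
    actions = q_star[state]
    return max(actions, key=actions.get)

print(best_action("s0"))  # -> "right", the action with the largest q* value in s0
```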
Characteristics of the Optimal Policy
We can now naturally define the Optimal Policy (π*).
The Optimal Policy is the policy whose actions achieve the optimal value.
The Optimal Policy has several characteristics: (1) the value of the
Optimal Policy is greater than or equal to that of any other policy in every
state; since a policy determines the probability of action selection, the
Optimal Policy places its probability on the actions that lead to the higher
value. (2) The value of the state-value function obtained by following the
Optimal Policy is equal to the value of the Optimal State-Value Function.
(3) The value of the action-value function obtained by following the Optimal
Policy is likewise equal to the value of the Optimal Action-Value Function.
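Stated compactly in standard MDP notation (a restatement of the three characteristics above, not formulas quoted from the text):

```latex
v_{\pi^*}(s) \ge v_{\pi}(s) \ \text{for every state } s \text{ and every policy } \pi, \qquad
v_{\pi^*}(s) = v^*(s), \qquad
q_{\pi^*}(s,a) = q^*(s,a)
```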
A Method to Represent the Optimal Policy
Let's look at one way to represent the Optimal Policy in an MDP. (1) If an action (a: Action) is the action that returns the maximum value of the Optimal Action-Value Function in state s, the policy assigns that action a probability of 1; otherwise, it assigns a probability of 0. Since the policy is the probability of selecting an action, the policy in state s will always select the action whose probability is set to 1.
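A minimal sketch of this representation (again assuming q* is available as a table; q_star and greedy_policy are illustrative names, not from the text):

```python
# Hypothetical optimal action-value table: q_star[state][action] -> value.
q_star = {
    "s0": {"left": 1.0, "right": 3.5},
    "s1": {"left": 2.0, "right": 0.5},
}

def greedy_policy(state):
    """Return pi*(a|s): probability 1 for the q*-maximizing action, 0 for all others."""
    actions = q_star[state]
    best = max(actions, key=actions.get)
    return {a: (1.0 if a == best else 0.0) for a in actions}

print(greedy_policy("s0"))  # -> {'left': 0.0, 'right': 1.0}
```

Because all probability mass sits on the maximizing action, following this policy in state s always selects the action with probability 1, which is exactly the representation described above.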
A Quick Note |