Previously, it was mentioned that the goal of an MDP is to determine a policy that maximizes the value of the environment. So, what is a policy? A policy is a probability distribution over actions: given a state, it gives the probability of selecting each possible action. A policy that maximizes value, then, is the policy that yields the highest value-function outcome when followed. The state-value function introduced earlier evaluates value in terms of states rather than actions. To evaluate a policy, a function that evaluates value in terms of actions is needed. This is the action-value function, denoted Q and therefore also called the Q-function.
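In standard MDP notation (a sketch using the usual symbols, not this text's numbered equations), the Q-function is the expected return given that the agent is in state s and has chosen action a, with the policy π followed afterwards and γ denoting the discount rate:

q_\pi(s, a) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a \right]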
MDP Action-Value Function (Q-function)
The action-value function (Q-function) calculates the value obtained when one of the possible actions is selected. Equation (1) restates the state-value function. In the action-value function, the action has already been chosen, so there is no need to take an expectation over actions; the function can therefore be written as Equation (3), without parts (1)-1 and (1)-2. A notable point here is the addition of π(a′|s′): to calculate the value obtained from the next state accurately, the probability of selecting each next action (the policy) must be multiplied together with the state transition probability.
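Written out in the usual Bellman form (a sketch in standard notation matching the quantities described above: the immediate reward R^a_s, the discount rate γ, the state transition probability P^a_{ss'}, and the policy π(a′|s′)):

q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') \, q_\pi(s', a')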
Relationship Between Action-Value Function and State-Value Function
The action-value function calculates the value of a chosen action, while the state-value function calculates the value of a specific state. In an MDP, moving from one state to another involves both the state transition matrix and the probability of selecting an action (the policy). Therefore, to derive the state-value function from the action-value function, an expected value, that is, an average weighted by the policy, is required. Conversely, the action-value function calculates the value of an action a as the sum of the immediate reward received in the current state and the future rewards. The future rewards depend on the state reached after taking the action, which is determined by the chosen action and the state transition probabilities of the environment. Therefore, the discount rate, the state transition matrix, and the value-function values of the future states are all taken into account when calculating the action-value function. The key to understanding this equation is P^a_{ss'}, which considers only the single chosen action rather than all actions in the state transition matrix.
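The two directions of this relationship can be summarized as follows (again a sketch in standard notation rather than the text's numbered equations): averaging the Q-function over the policy yields the state-value function, and expanding one step through the transition probabilities expresses the Q-function using the state-value function of the next state.

v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \, q_\pi(s, a)

q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_\pi(s')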
As previously mentioned, the goal of an MDP is to determine a policy that maximizes the value of the environment. The action-value and state-value functions studied so far are both functions for calculating value, and the purpose of calculating value is to evaluate policies in order to find the one that maximizes value (the optimal policy). This is the fundamental concept of reinforcement learning.
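To make these ideas concrete, the following is a minimal Python sketch (all transition probabilities, rewards, and the policy are made-up toy values, not taken from the text). It evaluates v_π and q_π for a fixed policy by iterating the Bellman expectation equations above, then reads off a greedy policy from the resulting Q values, illustrating how value calculation is used to search for a better policy.

import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9  # discount rate

# P[a, s, s'] : probability of moving from s to s' when action a is taken (toy values)
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.1, 0.1, 0.8]],   # action 1
])
# R[s, a] : immediate reward for taking action a in state s (toy values)
R = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
# pi[s, a] : probability of choosing action a in state s (a fixed policy, toy values)
pi = np.array([[0.5, 0.5], [0.2, 0.8], [0.7, 0.3]])

# Evaluate v_pi by iterating the Bellman expectation equations until convergence.
v = np.zeros(n_states)
for _ in range(1000):
    # q_pi(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * v_pi(s')
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    # v_pi(s) = sum_a pi(a | s) * q_pi(s, a)
    v_new = (pi * q).sum(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

# A better (greedy) policy picks, in each state, the action with the highest Q value.
greedy_policy = q.argmax(axis=1)
print("v_pi :", np.round(v, 3))
print("q_pi :", np.round(q, 3))
print("greedy action per state:", greedy_policy)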