Previously, it was mentioned that the goal of an MDP is to determine a policy that maximizes the value of the environment. So, what is a policy? A policy is a probability distribution over actions: given a state, it gives the probability of selecting each possible action. A policy that maximizes value, then, is the policy that yields the highest value-function outcome when followed. The state-value function introduced earlier evaluates value in terms of states rather than actions. To evaluate a policy, a function that evaluates value in terms of actions is needed. This is the action-value function, denoted Q and therefore also called the Q-function.
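In standard MDP notation (a sketch using the usual symbols, not this text's numbered equations), the Q-function is the expected return given that the agent is in state s and has chosen action a, with the policy π followed afterwards and γ denoting the discount rate:

q_\pi(s, a) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a \right]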
MDP Action-Value Function (Q-function)
The action-value function (Q-function) calculates the value obtained when one of the possible actions is selected. Equation (1) restates the state-value function. In the action-value function, the action has already been chosen, so there is no need to take an expectation over actions; the function can therefore be written as Equation (3), without parts (1)-1 and (1)-2. A notable point here is the addition of π(a′|s′): to calculate the value obtained from the next state accurately, the probability of selecting each next action (the policy) must be multiplied together with the state transition probability.
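Written out in the usual Bellman form (a sketch in standard notation matching the quantities described above: the immediate reward R^a_s, the discount rate γ, the state transition probability P^a_{ss'}, and the policy π(a′|s′)):

q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') \, q_\pi(s', a')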
Relationship Between Action-Value Function and State-Value Function
The action-value function calculates the value of a chosen action, while the state-value function calculates the value of a specific state. In an MDP, moving from one state to another involves both the state transition matrix and the probability of selecting an action (the policy). Therefore, to derive the state-value function from the action-value function, an expected value, that is, an average weighted by the policy, is required. Conversely, the action-value function calculates the value of an action a as the sum of the immediate reward received in the current state and the future rewards. The future rewards depend on the state reached after taking the action, which is determined by the chosen action and the state transition probabilities of the environment. Therefore, the discount rate, the state transition matrix, and the value-function values of the future states are all taken into account when calculating the action-value function. The key to understanding this equation is P^a_{ss'}, which considers only the single chosen action rather than all actions in the state transition matrix.
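The two directions of this relationship can be summarized as follows (again a sketch in standard notation rather than the text's numbered equations): averaging the Q-function over the policy yields the state-value function, and expanding one step through the transition probabilities expresses the Q-function using the state-value function of the next state.

v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \, q_\pi(s, a)

q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_\pi(s')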
As previously mentioned, the goal of an MDP is to determine a policy that maximizes the value of the environment. The action-value and state-value functions studied so far are both functions for calculating value, and the purpose of calculating value is to evaluate policies in order to find the one that maximizes value (the optimal policy). This is the fundamental concept of reinforcement learning.
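To make these ideas concrete, the following is a minimal Python sketch (all transition probabilities, rewards, and the policy are made-up toy values, not taken from the text). It evaluates v_π and q_π for a fixed policy by iterating the Bellman expectation equations above, then reads off a greedy policy from the resulting Q values, illustrating how value calculation is used to search for a better policy.

import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9  # discount rate

# P[a, s, s'] : probability of moving from s to s' when action a is taken (toy values)
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.1, 0.1, 0.8]],   # action 1
])
# R[s, a] : immediate reward for taking action a in state s (toy values)
R = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
# pi[s, a] : probability of choosing action a in state s (a fixed policy, toy values)
pi = np.array([[0.5, 0.5], [0.2, 0.8], [0.7, 0.3]])

# Evaluate v_pi by iterating the Bellman expectation equations until convergence.
v = np.zeros(n_states)
for _ in range(1000):
    # q_pi(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * v_pi(s')
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    # v_pi(s) = sum_a pi(a | s) * q_pi(s, a)
    v_new = (pi * q).sum(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

# A better (greedy) policy picks, in each state, the action with the highest Q value.
greedy_policy = q.argmax(axis=1)
print("v_pi :", np.round(v, 3))
print("q_pi :", np.round(q, 3))
print("greedy action per state:", greedy_policy)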