Markov Reward Process (MRP)

The Markov Reward Process (MRP) is an extension of the Markov Chain that adds a reward function and a discount factor, denoted gamma (γ), which represents the depreciation of future rewards over time. While a Markov Chain consists only of states (S) and a state transition matrix (P), an MRP consists of a set of states (S), a state transition matrix (P), a reward function (R), and a discount factor (γ). A Markov Chain gives only the transition probabilities between states, without saying anything about the value of those state changes; an MRP lets us calculate that value. The MRP was first introduced in a book by Ronald Arthur Howard, published in 1971.

Now we’ll introduce a few mathematical concepts. Let’s explore them step by step without worry; they are simpler than they seem. First, we can express the components of an MRP mathematically as follows:

Components of an MRP (Markov Reward Process)

The set of states (S) represents the various states that the environment can take. In an MRP, the set of states must be finite (i.e., there is a fixed number of them). Here, the environment refers to the system or problem being dealt with. If you want to predict stock prices, the various factors needed for that prediction make up the environment. Likewise, if you are predicting department store sales, customer data, sales data, and financial data form the environment.

As explained earlier, P is the state transition matrix, representing the conditional probabilities of transitioning from one state to another in matrix form. Mathematically, this can be shown as the probability that the state will change from s at time t to s′ at time t+1.

R represents the reward function. The reward function is expressed as an expectation (E): the expected reward received at time t+1 when the state is s at time t.
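To make these components concrete, here is a minimal sketch in Python of a tiny two-state MRP. The state names, transition probabilities, rewards, and discount factor are hypothetical values chosen only for illustration; they do not come from the text.

import numpy as np

# A toy MRP defined by its four components (S, P, R, gamma).
# All numbers are hypothetical, chosen only to illustrate the structure.
S = ["sunny", "rainy"]            # finite set of states

P = np.array([[0.8, 0.2],         # P[i][j]: probability of moving from state i to state j
              [0.4, 0.6]])        # each row is a probability distribution and sums to 1

R = np.array([1.0, -0.5])         # R[i]: expected reward received at t+1 when the state at t is i

gamma = 0.9                       # discount factor (explained below), between 0 and 1

# Sanity check: every row of the transition matrix must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)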

Reward Function Calculation

If state s at time t has two possible next states, reached with probabilities p1 and p2 and with rewards r1 and r2 associated with each transition, then the value of the reward function is p1×r1+p2×r2. When the state is s at time t, the reward function captures only the immediate reward (this will become clearer when we discuss the concept of return later). So, when is the reward for state s calculated? It is calculated at t+1. At time t the state is s, and at t+1 it transitions to state s′; the reward for state s is therefore obtained as time progresses to t+1 and the state moves to s′.
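As a quick numerical check of this calculation, the snippet below plugs in hypothetical probabilities and transition rewards (p1, p2, r1, r2 are invented for illustration):

# Expected immediate reward for state s when there are two possible transitions.
# The probabilities and rewards are hypothetical illustration values.
p1, p2 = 0.7, 0.3           # probabilities of the two possible next states (sum to 1)
r1, r2 = 5.0, -2.0          # reward associated with each transition

R_s = p1 * r1 + p2 * r2     # expected reward received at t+1 given state s at time t
print(R_s)                  # 0.7 * 5.0 + 0.3 * (-2.0) = 2.9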

A Quick Note: Expected Value of Probability

The expected value is the sum of each possible outcome's gain multiplied by the probability of that outcome occurring. It represents the average value of a probabilistic event. (Wikipedia)

In a discrete probability distribution, f(x) represents the probability that outcome x occurs, and the expected value is the sum of each outcome’s value (gain) multiplied by its probability.

For a die, the possible values of each outcome range from 1 to 6, with each outcome having a probability of 1/6. By considering all possible outcomes and their probabilities, the expected value can be calculated. Multiplying each possible die value by its probability and summing the results yields the average of the die’s values. Thus, the expected value of probability is equivalent to finding the average value of each outcome.
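Here is the same die calculation as a short sketch:

# Expected value of a fair six-sided die: sum of (outcome value x probability).
values = [1, 2, 3, 4, 5, 6]
probability = 1 / 6

expected_value = sum(v * probability for v in values)
print(round(expected_value, 2))   # 3.5, i.e., the average value of a die roll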

In a continuous probability distribution, f(x) is the probability density function. While the discrete case sums over individual values, the continuous case requires integration to account for every possible value. Integration finds the area under a curve: integrating f(x) over all values gives the total probability (which is 1), while integrating x·f(x) gives the expected value.

Gamma (γ) represents the discount factor, which can take values between 0 and 1. In general, a discount factor is the rate used to determine how value depreciates over time. For example, when valuing a 2-year-old and a 3-year-old car with an annual discount factor of 0.8, we multiply the original price by 0.8×0.8=0.64 for the 2-year-old car and by 0.8×0.8×0.8=0.512 for the 3-year-old car.

The discount factor is also used to express future, not-yet-received rewards in terms of their present value. Receiving payment for goods delivered today differs in perceived value from receiving the same amount a year later: 10 million won today is worth more than 10 million won a year from now. If we assume a discount factor of 0.9, the present value of 10 million won to be received a year later is 9 million won.
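The same arithmetic, written out as a small sketch:

# Depreciation of the car prices from the example above.
discount_factor = 0.8
print(round(discount_factor ** 2, 3))    # 0.64  -> price multiplier for the 2-year-old car
print(round(discount_factor ** 3, 3))    # 0.512 -> price multiplier for the 3-year-old car

# Present value of 10 million won received one year from now, with gamma = 0.9.
gamma = 0.9
future_reward = 10_000_000
present_value = gamma * future_reward
print(present_value)                     # 9000000.0 won in today's terms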

The purpose of MRP is to calculate value. This value calculation involves evaluating the overall value of an entire episode or environment rather than calculating a single momentary value using the reward function. This value must be calculated as the present value. To determine the entire episode value, several timesteps must pass until the episode ends. The discount factor is thus necessary, as it converts the future value into the present value to estimate the episode’s value from the current perspective.

The discount factor reflects the relationship between the current and future rewards. When γ is 0, it indicates no consideration for future rewards, while a γ of 1 implies that current and future rewards are valued equally.

Return

A new concept, the return (G), is now introduced. The return is the cumulative reward measured from timestep t onward, with the discount factor applied to each future reward. The return is usually calculated on an episode basis rather than over the entire environment, allowing us to evaluate an episode’s efficiency or value. Designing an environment to maximize the return is one purpose of MRP.

A unique point in the return calculation is that state transition probabilities are not considered. The return calculates the total reward for a chosen path (episode) without needing transition probabilities, as the path is already determined.
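The return can be written as G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … and computed in a few lines of code. The reward sequence below is hypothetical; it simply stands in for the rewards collected along one already-chosen episode:

def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for one episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical rewards received at t+1, t+2, t+3 along a chosen path.
episode_rewards = [1.0, 0.5, 2.0]
print(discounted_return(episode_rewards, gamma=0.9))   # 1.0 + 0.9*0.5 + 0.81*2.0 ≈ 3.07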

Return Calculation

Assume the discount factor is set to 1/2, and that there are three episodes that reach the destination in three timesteps. Each node has a defined reward, and the reward at each step is multiplied by the discount factor as time progresses; summing these values yields the return for each episode. In this example, the third episode has the highest return and is therefore the most efficient.
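Since the figure’s node rewards are not reproduced here, the sketch below uses three invented reward sequences simply to show how episodes are compared by their returns when the discount factor is 1/2:

# Compare three hypothetical three-step episodes by their returns (gamma = 1/2).
# The reward sequences are invented and do not correspond to the figure's values.
gamma = 0.5
episodes = {
    "episode 1": [1.0, 2.0, 1.0],   # return = 1.0 + 0.5*2.0 + 0.25*1.0 = 2.25
    "episode 2": [0.0, 4.0, 2.0],   # return = 0.0 + 0.5*4.0 + 0.25*2.0 = 2.5
    "episode 3": [3.0, 2.0, 4.0],   # return = 3.0 + 0.5*2.0 + 0.25*4.0 = 5.0
}

for name, rewards in episodes.items():
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    print(name, g)   # the episode with the largest return is the most efficient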

State Value Function

Let’s explore the state value function (v). If the return (G) measures the value of a single episode, the state value function measures the value of the entire environment. As the term "function" suggests, the state value function considers the state transition probabilities.

Return and State Value Function

Formula (1) represents that when the state at timestep t is s, the state value function can be obtained as the expectation of the return. For example, if there are two possible states to which state s can transition, we calculate the return for each possible state in the next timestep (g1, g2), multiply each by its conditional probability (p1, p2), and sum them as v(s)=p1×g1+p2×g2.
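The two-branch case can be checked with a tiny sketch; the probabilities and returns below are hypothetical illustration values, not taken from the text:

# Estimate of v(s) when state s can transition to two possible next states.
p1, p2 = 0.6, 0.4      # conditional probabilities of the two possible transitions
g1, g2 = 4.0, 1.0      # return obtained from each possible next state onward

v_s = p1 * g1 + p2 * g2
print(v_s)             # 0.6*4.0 + 0.4*1.0 = 2.8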

State Value Function Using Equation 1

Formula (2) is derived by substituting the definition of the return; formula (3) uses the discount factor to group the future rewards into the return of the next timestep; and formula (4) substitutes that next-timestep return back into the equation. Finally, the return in formula (4) is replaced with the state value function of the next state. Let’s generalize this formula to make it programmable.
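The chain of steps described above is usually written as follows; this is a reconstruction in standard notation, so it may differ slightly from the formulas shown in the original figure:

\begin{aligned}
v(s) &= \mathbb{E}\left[\, G_t \mid S_t = s \,\right] && \text{(1)} \\
     &= \mathbb{E}\left[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \,\right] && \text{(2)} \\
     &= \mathbb{E}\left[\, R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right) \mid S_t = s \,\right] && \text{(3)} \\
     &= \mathbb{E}\left[\, R_{t+1} + \gamma G_{t+1} \mid S_t = s \,\right] && \text{(4)} \\
     &= \mathbb{E}\left[\, R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s \,\right]
\end{aligned}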

Bellman Equation for the State Value Function

In reinforcement learning, the Bellman equation, named after the American mathematician Richard Ernest Bellman, is commonly used to calculate values through programming.

The Bellman equation expresses the expectation as a summation that relates the value of the current state to the values of the next states. Formula (1) is the conceptual state value function we’ve covered; formula (2) simplifies it by removing a constant term, which has no effect on the expectation. Formula (3) then expresses the expectation as a sum over the next states involving the state value function of each next state; this is the Bellman equation.
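In summation form the Bellman equation reads v(s) = R_s + γ·Σ_{s′} P_{ss′} v(s′), which for a small MRP is just a linear system and can be solved exactly. Below is a minimal sketch reusing the hypothetical two-state numbers from earlier; it is only an illustration, not the text’s example:

import numpy as np

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma*P) v = R.
# Transition matrix, rewards, and gamma are the same hypothetical values used earlier.
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])
R = np.array([1.0, -0.5])
gamma = 0.9

v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)   # state value of each state, including all discounted future rewards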

States to Consider at Each Timestep

Applying the Bellman equation directly may seem challenging, but its purpose is precisely to let us solve such problems through programming. In a simple network-based routing problem, for example, returns can be calculated sequentially to determine the state value function. Since MRP is a foundational step toward understanding reinforcement learning, rather than calculating every state value function it is sufficient to understand what the state value function means and how it is represented with the Bellman equation.
