Understanding the Basics of Markov Decision Processes¶
In this post:
- Markov Processes (or Markov Chains)
- Markov Reward Processes
- Markov Decision Processes
Markov Process/Chain¶
Let's say we are observing some type of system that we can only watch-- we cannot interact with it or change it in any way. Any change in the system is a different state, and the set of all possible states of the system is known as the state space. For this example, we will think about the daily change in value, positive or negative, of an imaginary stock price. We can observe the current day's value as higher or lower than the previous day's value-- this is our state space. Over time, we end up with a sequence of these observations, forming a chain of states ([higher, higher, lower, lower, higher, lower, etc.]). This is the history of our system.
In order for this system to be a Markov process, it needs to satisfy the Markov property; namely, that "future system dynamics from any state must depend on this state only" (Lapan, 12). Each state must be self-contained and not dependent on the whole history-- future states must be able to be modeled from the current state alone. (Of course, this is not true of stock prices in the real world, but we'll bend the rules of reality a bit for our example.)
When a system model fulfills the Markov property, we can build a transition matrix from the probabilities of transitioning from one state to another. Below, we can see the transition matrix for our simple stock model:
import pandas as pd

# Rows are today's state, columns are tomorrow's state;
# each row is a probability distribution, so it sums to 1.
transitionMatrix = pd.DataFrame(
    [[0.60, 0.40],
     [0.25, 0.75]],
    index=["Higher", "Lower"],
    columns=["Higher", "Lower"],
)
transitionMatrix
As we can see, if a day's stock value is higher than the day before, there is a 60% chance that the price will rise further the next day, and a 40% chance it will decline; however, if the value is lower than the previous day, there is a 75% chance it will continue to decline. Our model is pretty bearish, it seems!
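To make the dynamics concrete, here is a minimal sketch (not from the book) that samples a chain of states from this transition matrix; the simulate_chain helper and the fixed seed are just for illustration:

import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sampled chain is repeatable

def simulate_chain(start, n_steps, tm=transitionMatrix):
    """Sample a chain of 'Higher'/'Lower' states from the transition matrix."""
    state, chain = start, [start]
    for _ in range(n_steps):
        row = tm.loc[state]                      # row = distribution over tomorrow's state
        state = str(rng.choice(row.index, p=row.values))
        chain.append(state)
    return chain

simulate_chain("Higher", 10)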
A quick review:¶
- A Markov process consists of a state space which contains all possible states in which the system can be observed
- A sequence of states forms its history, or chain of states
- A transition matrix defines the dynamics of the system by containing the probabilities of the system transitioning between each state
One more important point: the Markov property implies stationarity-- the "underlying transition distribution for any state does not change over time" (Lapan, 15). If some hidden, unobserved factor changes the underlying system dynamics, then the Markov property does not apply.
Markov Reward Process¶
The transition matrix gives us the probability of state-to-state changes, but we also need to assign values to those transitions. This will be the reward. Rewards can be positive or negative and of any size. In addition to the reward, we will also add a discount factor gamma, $\gamma$, which is a single number from 0 to 1. (More on this later.)
Since we observe a chain of states in a Markov process (and a Markov reward process), we now have a reward value for every transition in the system. With our reward values and our discount factor, we can define return as:
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$ where $k$ is the number of steps we are from our starting point at time $t$.
In essence, the return value is the sum of future rewards, degraded by the strength of our discount factor. A discount factor of 1 means we have given our agent perfect vision into future rewards, while a value of 0 means it cannot consider anything beyond the immediate reward.
Gamma is going to be very important in reinforcement learning applications, but for now we will think of it only as our ability to see into the future, remembering that the higher the number, the further we can see.
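As a small sanity check, here is the return computed directly from the formula above; the reward sequence is made up purely for illustration:

def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * R_{t+k+1}, for a finite list of future rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

future_rewards = [1.0, -0.5, 2.0, 1.0]   # hypothetical rewards observed after time t
discounted_return(future_rewards, gamma=0.9), discounted_return(future_rewards, gamma=0.0)

With gamma of 0.0 only the first reward survives; with 0.9 the later rewards still contribute, just less and less the further out they are.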
At the end of the day, individual return values don't mean too much-- each one is tied to a very specific chain, which means the returns observed from a given state can vary widely. To make this more useful, we can average the returns over a large number of possible chains from a given state, giving us the value of state.
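For instance, reusing the simulate_chain and discounted_return sketches above and inventing a reward of +1 for landing in "Higher" and -1 for "Lower" (these numbers are not from the book), we can estimate the value of each state by averaging returns over many sampled chains:

def estimate_state_value(start, gamma=0.9, n_chains=2000, n_steps=50):
    """Monte Carlo estimate of the value of a state: average discounted return over many chains."""
    reward = {"Higher": 1.0, "Lower": -1.0}      # made-up rewards, purely for illustration
    total = 0.0
    for _ in range(n_chains):
        chain = simulate_chain(start, n_steps)
        total += discounted_return([reward[s] for s in chain[1:]], gamma)
    return total / n_chains

estimate_state_value("Higher"), estimate_state_value("Lower")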
A quick review:¶
- In addition to transition probabilities, we also assign a value to each transition, called a reward
- We can calculate a return value by summing future rewards, each discounted by a power of the discount factor $\gamma$
- The larger the discount factor, the further into the future we can see
- By averaging the returns across many chains for a given state, we get a much more useful metric called the value of state
Markov Decision Process¶
Next, we want to add a finite set of actions called an action space to our process. When we add action to our transition matrix from our initial Markov process, it adds an extra dimension, making a transition cube. Instead of passively observing state changes as we did in our stock example, we can now take an action at every single timestep. Our cube identifies the probability that state i will become state j given action k.
To make it a bit clearer, our agent can now affect the probability that the system will end up in a particular state. Not a bad ability to have!
Just as in the Markov Reward Process section, we are not only interested in probabilities; we also need to add these actions to our reward matrix. So, our agent doesn't just get a reward for the state the system is in (or moves into), it also gets rewarded for the actions it takes.
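As a rough numpy sketch of these structures (the "hold"/"sell" action space and every number here are invented for illustration, not taken from the book), the transition cube and action-aware rewards might look like this:

import numpy as np

states  = ["Higher", "Lower"]
actions = ["hold", "sell"]                       # hypothetical action space

# P[a, i, j] = probability that state i becomes state j given action a
P = np.array([
    [[0.60, 0.40],                               # "hold": same dynamics as before
     [0.25, 0.75]],
    [[0.50, 0.50],                               # "sell": different, made-up dynamics
     [0.50, 0.50]],
])

# R[a, i] = reward for taking action a in state i (made-up values)
R = np.array([
    [ 1.0, -1.0],                                # "hold"
    [ 0.5,  0.5],                                # "sell"
])

P.shape, R.shape                                 # (2, 2, 2) and (2, 2)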
This gets us to one of the key features of both Markov Decision Processes and Reinforcement Learning-- policy. A policy is a rule set governing the behavior of our agent. Policy choice is important, because our agent will seek to maximize its return. Small changes in policy (say, rewarding a certain action more than a certain state) can have a dramatic effect on the return that is achieved.
Policy can be defined as:
$$\pi(a|s) = P[A_t = a \mid S_t = s]$$, or, the probability distribution over actions for every possible state.
One final note: If our policy stays fixed and unchanging, we can model our Markov Decision Process as a Markov Reward Process by reducing the transition and reward matrices via the policy's probabilities. No need for the action dimensions.
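Continuing the numpy sketch above with a fixed, made-up policy, this reduction amounts to weighting each action's dynamics and rewards by the policy's probabilities and summing out the action dimension:

# pi[i, a] = probability of choosing action a in state i (a fixed, hypothetical policy)
pi = np.array([
    [0.8, 0.2],                                  # in "Higher": mostly hold
    [0.3, 0.7],                                  # in "Lower": mostly sell
])

# Collapse the action dimension: P_pi[i, j] = sum over a of pi[i, a] * P[a, i, j]
P_pi = np.einsum("ia,aij->ij", pi, P)            # MRP transition matrix
R_pi = np.einsum("ia,ai->i", pi, R)              # MRP reward vector

P_pi, R_pi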
A few notes:¶
- By adding actions with their own set of rewards, we construct a Markov Decision Process
- The policy can be defined as the rule set governing the behavior of our agent
- Setting good policy is vitally important to the success of our agent
Sources
Lapan, M. (2018). Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Packt Publishing.