Building a (Very Simple) Autonomous Agent and Environment

Before reading and following along with this post, it might be helpful to first read this post on agents, environments, and Markov decision processes. Also, the code below comes from Maxim Lapan's excellent book, Deep Reinforcement Learning Hands-On, cited fully at the end of the post. His code without my commentary can be found here.

How to build a (very simple) autonomous agent and environment

As a beginning exercise in my ongoing series on implementing Reinforcement Learning models, we'll define a simplistic environment that gives rewards for a fixed number of steps, regardless of the actions taken.

First, a quick review:

Agent: Some piece of code that implements some policy. Given observations, the policy dictates what action the agent takes at each timestep.

Environment: The model of the world, external to the agent. It provides observations, gives rewards, and changes state based on the agent's actions.

Defining our environment

We'll start by initializing the internal state of the environment, which is simply a counter that limits the total number of steps the agent is allowed to take.

In [1]:
class Environment:
  def __init__(self):
    self.steps_left = 10

Let's now define get_observation(), which returns the observation of the current environment to the agent. Normally, it would be implemented as a function of the internal environment state, but in our case the observation vector is always zero.

In [2]:
def get_observation(self):
  return [0.0, 0.0, 0.0]

Next, we'll define get_actions(), which returns the set of actions from which the agent can choose. For this model, there are only two actions:

In [3]:
def get_actions(self):
  return [0, 1]

As our agent takes actions within the environment, it performs a series of steps called an episode. We need to define when an episode is over, so the agent knows when there is no longer any way to communicate with the environment:

In [4]:
def is_done(self):
  return self.steps_left == 0

The action() method is perhaps the most important piece of the environment. It both handles the agent's action (thereby changing the environment's state) and returns a reward to the agent for that action. In this case, the reward is random, and the action has no effect on the environment. Additionally, we reduce the steps_left counter by one; once no steps remain, the episode is over.

In [5]:
def action(self, action):
  if self.is_done():
    raise Exception("Game is over")
  self.steps_left -= 1
  return random.random()

Defining our agent

We'll begin by initializing our agent with a counter that will store the reward as it accumulates:

In [6]:
class Agent:
  def __init__(self):
    self.total_reward = 0.0

Now, we'll define a step() function that accepts our environment as an argument, allowing the agent to do a few things:

  • Observe the state of the environment via env.get_observation()
  • Choose an action from the available actions returned by env.get_actions()
  • Pass the action to the environment via env.action() and receive the reward
  • Add the reward to the accumulated total

In [7]:
def step(self, env):
  current_obs = env.get_observation() # Observe the environment
  actions = env.get_actions() # Get available actions
  reward = env.action(random.choice(actions)) # Perform the action, get a reward
  self.total_reward += reward # Add reward to total

A quick review:

So, we've done quite a bit here, and it may not be lost on anyone following along that the above code snippets don't mean much until they are combined correctly into a complete program that creates the classes and runs an episode. Before we do that, let's review what we've done:

  • We created classes for both our environment and our agent
  • We defined our observation vector, allowing our agent to see the environment in its current state
  • We defined the action space for our agent in the environment
  • We set the rules for when an episode ends
  • We defined what happens to the environment (nothing, in this case) and what reward is given when our agent takes an action
  • We defined what happens at each step in an episode

Whew, a lot for a simple environment! Let's put it all together and run the code:

In [8]:
import random

class Environment:
  def __init__(self):
    self.steps_left = 10
    
  def get_observation(self):
    return [0.0, 0.0, 0.0]

  def get_actions(self):
    return [0, 1]

  def is_done(self):
    return self.steps_left == 0
  
  def action(self, action):
    if self.is_done():
      raise Exception("Game is over")
    self.steps_left -= 1
    return random.random()
  
class Agent:
  def __init__(self):
    self.total_reward = 0.0
    
  def step(self, env):
    current_obs = env.get_observation() # Observe the environment
    actions = env.get_actions() # Get available actions
    reward = env.action(random.choice(actions)) # Perform the action, get a reward
    self.total_reward += reward # Add reward to total
  
if __name__ == "__main__":
  env = Environment()
  agent = Agent()
  
  while not env.is_done():
    agent.step(env)
    
  print("Total reward collected: %.4f" % agent.total_reward)
Total reward collected: 3.8136

Sources

Lapan, M. (2018). Deep reinforcement learning hands-on: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more.

Understanding the Basics of Markov Decision Processes

In this post:

  • Markov Processes (or Markov Chains)
  • Markov Reward Processes
  • Markov Decision Processes

Markov Process/Chain

Let's say we are observing some type of system in a way that means we can only watch it, not interact with it or change it in any way. Any change in the system is a different state, and the set of all possible states of the system is known as the state space. For this example, we will think about the daily change in value, positive or negative, of an imaginary stock price. We can observe the current day's value as higher or lower than the previous day's value-- this is our state space. Over time, we end up with a sequence of these observations, forming a chain of states ([higher, higher, lower, lower, higher, lower, etc.]). This is the history of our system.

In order for this system to be a Markov process, it needs to satisfy the Markov property; namely, that "future system dynamics from any state must depend on this state only" (Lapan, 12). Each state must be self-contained and not dependent on the whole history-- future states must be able to be modeled from only one state. (Of course, this is not true of stock prices in the real world, but we'll bend the rules of reality a bit for our example.)

When a system model fulfills the Markov property, we can build a transition matrix from the probabilities of transitioning from one state to another. Below, we can see the transition matrix for our simple stock model:

In [1]:
import pandas as pd

transitionMatrix = pd.DataFrame([[0.60, 0.40]
                              , [0.25, 0.75]]
                              , index = ["Higher", "Lower"]
                              , columns = ["Higher", "Lower"])
transitionMatrix
Out[1]:
        Higher  Lower
Higher    0.60   0.40
Lower     0.25   0.75

As we can see, if a day's stock value is higher than the day before, there is a 60% chance that the price will rise further the next day, and a 40% chance it will decline; however, if the value is lower than the previous day, there is a 75% chance it will continue to decline. Our model is pretty bearish, it seems!
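To make the transition dynamics a little more tangible, here is a small sketch (my own addition, not code from the book) that samples a chain of states from the transition matrix above; the helper name sample_chain is hypothetical:

import random

def sample_chain(start_state, n_steps):
  # Repeatedly draw the next state from the transition probabilities
  # of the current state (the rows of transitionMatrix defined above)
  state = start_state
  chain = [state]
  for _ in range(n_steps):
    row = transitionMatrix.loc[state]
    state = random.choices(list(row.index), weights=row.values)[0]
    chain.append(state)
  return chain

sample_chain("Higher", 10)

Running this a few times gives chains like the [higher, higher, lower, ...] history described above, with "Lower" runs tending to last longer because of the 0.75 self-transition probability.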

A quick review:
  • A Markov process consists of a state space which contains all possible states in which the system can be observed
  • A sequence of states forms its history, or chain of states
  • A transition matrix defines the dynamics of the system by containing the probabilities of the system transitioning between each state

One more important point: the Markov property implies stationarity-- the "underlying transition distribution for any state does not change over time" (Lapan, 15). If some hidden, unobserved factor changes the underlying system dynamics, then the Markov property does not apply.

Markov Reward Process

The transition matrix gives us the probability of state-to-state changes, but we also need to assign a value to each of those transitions. This is the reward. Rewards can be positive or negative and of any size. In addition to the reward, we will also add a discount factor gamma, $\gamma$, which is a single number from 0 to 1. (More on this later.)

Since we observe a chain of states in a Markov process (and a Markov reward process), we now have a reward value for every transition in the system. With our reward values and our discount factor, we can define return as:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$, where $k$ is the number of steps ahead of our starting point at time $t$.

In essence, the return value is the sum of future rewards, each weakened by our discount factor according to how far in the future it lies. A discount value of 1 means we have given our agent perfect vision into future rewards, while a value of 0 means it is unable to consider anything but the immediate reward.

Gamma is going to be very important in reinforcement learning applications, but for now we will think of it as only our ability to see into the future and remember that the higher the number, the further we can see.
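To make the formula concrete, here is a tiny sketch (my own illustration, not from the book) that computes the return for a hypothetical list of future rewards under a few different values of gamma:

def discounted_return(rewards, gamma):
  # G_t = sum over k of gamma**k * R_(t+k+1), for a finite list of future rewards
  return sum((gamma ** k) * r for k, r in enumerate(rewards))

future_rewards = [1.0, 0.5, 2.0, 1.5]  # hypothetical rewards R_(t+1), R_(t+2), ...

for gamma in (0.0, 0.5, 0.9, 1.0):
  print(gamma, discounted_return(future_rewards, gamma))

With gamma = 0 only the first reward counts; with gamma = 1 all four rewards count equally.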

At the end of the day, individual return values don't mean too much-- they are tied to very specific chains, which means that every state can have wide variation in its return values. To make this more useful, we can average the returns over a large number of possible chains starting from a given state, giving us the value of the state.
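As a rough sketch of that idea (again my own illustration; the per-transition rewards and the estimate_value helper below are hypothetical), we can estimate the value of a state in our stock example by sampling many chains, computing each chain's discounted return, and averaging:

import random

# Hypothetical rewards: +1 when the price moves higher, -1 when it moves lower
rewards = {"Higher": 1.0, "Lower": -1.0}
transitions = {"Higher": {"Higher": 0.60, "Lower": 0.40},
               "Lower":  {"Higher": 0.25, "Lower": 0.75}}

def estimate_value(start_state, gamma=0.9, n_chains=5000, n_steps=50):
  # Monte Carlo estimate: average the discounted return over many sampled chains
  total = 0.0
  for _ in range(n_chains):
    state, g = start_state, 0.0
    for k in range(n_steps):
      next_state = random.choices(list(transitions[state]),
                                  weights=list(transitions[state].values()))[0]
      g += (gamma ** k) * rewards[next_state]
      state = next_state
    total += g
  return total / n_chains

print(estimate_value("Higher"), estimate_value("Lower"))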

A quick review:
  • In addition to transition probabilities, we also assign a value to each transition, called a reward
  • We can calculate a return by summing future rewards, discounting each one by $\gamma$ raised to the number of steps into the future
  • The larger the discount factor, the further into the future we can see
  • By averaging the returns across many chains for a given state, we get a much more useful metric called the value of state

Markov Decision Process

Next, we want to add a finite set of actions called an action space to our process. When we add action to our transition matrix from our initial Markov process, it adds an extra dimension, making a transition cube. Instead of passively observing state changes as we did in our stock example, we can now take an action at every single timestep. Our cube identifies the probability that state i will become state j given action k.

To make it a bit clearer, our agent can now affect the probability that the system will end up in a particular state. Not a bad ability to have!

Just as in the Markov Reward Process section, we are not only interested in probabilities; we also need to add these actions to our reward matrix. So, our agent doesn't just get a reward for the state the system is in (or moves into), it also gets rewarded for the actions it takes.
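One way to picture this (a sketch under my own assumptions; the numbers below are made up for illustration) is as a pair of NumPy arrays: a transition cube indexed by (state, action, next state), and a reward matrix indexed by (state, action):

import numpy as np

# Transition cube: P[s, a, s2] = probability of moving from state s to s2
# when taking action a. Each P[s, a, :] slice must sum to 1.
P = np.array([[[0.7, 0.3],    # state 0, action 0
               [0.2, 0.8]],   # state 0, action 1
              [[0.5, 0.5],    # state 1, action 0
               [0.1, 0.9]]])  # state 1, action 1

# Reward matrix: R[s, a] = reward for taking action a in state s
R = np.array([[1.0, -1.0],
              [0.5,  2.0]])

assert np.allclose(P.sum(axis=2), 1.0)  # every (state, action) pair gives a valid distribution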

This gets us to one of the key features of both Markov Decision Processes and Reinforcement Learning-- policy. Policies are rule sets governing the behavior of our agent. Policy choice is important, because our agent will seek to maximize its return. Small changes in policy (say, rewarding a certain action more than a certain state) can have a dramatic effect on the return that is achieved.

Policy can be defined as:

$$\pi(a|s) = P[A_t = a \mid S_t = s]$$, or, the probability distribution over actions for every possible state.

One final note: If our policy stays fixed and unchanging, we can model our Markov Decision Process as a Markov Reward Process by reducing the transition and reward matrices via the policy's probabilities. No need for the action dimensions.
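Continuing the sketch above (still my own illustration, with hypothetical policy probabilities), a fixed policy lets us average the transition cube and reward matrix over the policy's action probabilities, which leaves us with an ordinary Markov reward process:

# P and R are the transition cube and reward matrix from the previous sketch
# pi[s, a] = probability of taking action a in state s
pi = np.array([[0.9, 0.1],
               [0.4, 0.6]])

# Collapse the action dimension: weight each action's transitions and rewards
# by how often the policy chooses it in that state
P_mrp = np.einsum("sa,sax->sx", pi, P)  # transition matrix of the induced MRP
R_mrp = (pi * R).sum(axis=1)            # expected reward per state under the policy

print(P_mrp)  # each row still sums to 1
print(R_mrp)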

A few notes:
  • By adding actions with their own set of rewards, we construct a Markov Decision Process
  • The policy can be defined as the rule sets governing the behavior of our agent
  • Setting good policy is vitally important to the success of our agent

Sources

Lapan, M. (2018). Deep reinforcement learning hands-on: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more.