Markov Decision Process: A Comprehensive Guide to Sequential Decision-Making

Advancedor Academy
Apr 8, 2024

Introduction

In the field of artificial intelligence and decision-making, the Markov Decision Process (MDP) has emerged as a fundamental framework for modeling and solving sequential decision-making problems. MDPs provide a structured approach to analyze and optimize decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. This article aims to provide a thorough understanding of MDPs, their theoretical foundations, and their practical applications.

What is a Markov Decision Process? A Markov Decision Process is a mathematical framework that models decision-making in situations where outcomes are uncertain and depend on the actions taken by a decision-maker, also known as an agent. MDPs are characterized by the following components:

  1. States: A set of possible states that the agent can be in at any given time.
  2. Actions: A set of actions available to the agent in each state.
  3. Transition Probabilities: The probability of transitioning from one state to another when a specific action is taken.
  4. Rewards: The immediate reward or cost associated with taking an action in a particular state.
  5. Discount Factor: A parameter that determines the importance of future rewards compared to immediate rewards.
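
As a concrete illustration of these five components, a small finite MDP can be written down directly as a few arrays. The sketch below is purely illustrative: the sizes, probabilities, and rewards are made up for this example and are not taken from the rest of the article.

import numpy as np

num_states, num_actions = 3, 2

# Transition probabilities: P[s, a, s2] = probability of moving from state s to s2 under action a
# (uniform here, so each distribution over next states sums to 1)
P = np.full((num_states, num_actions, num_states), 1.0 / num_states)

# Rewards: R[s, a] = immediate reward for taking action a in state s
R = np.array([[0.0, 1.0],
              [0.5, 0.0],
              [1.0, 2.0]])

# Discount factor: weight of future rewards relative to immediate rewards
gamma = 0.9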

The goal of the agent in an MDP is to find a policy, which is a mapping from states to actions, that maximizes the expected cumulative reward over time. The agent makes decisions based on the current state, considering the long-term consequences of its actions.

Markov Property: One of the key assumptions in MDPs is the Markov property, which states that the future state of the system depends only on the current state and the action taken, not on the history of previous states and actions. This assumption simplifies the decision-making process and allows for efficient computation of optimal policies.
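
In symbols, the Markov property says that conditioning on the full history of states and actions reduces to conditioning on the current state and action alone:

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)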

Bellman Equation: The Bellman equation is a fundamental concept in MDPs that relates the value of a state to the values of its successor states. It is a recursive equation that expresses the optimal value of a state in terms of the immediate reward and the discounted value of the next state. The Bellman equation forms the basis for many algorithms used to solve MDPs, such as value iteration and policy iteration.
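
For the optimal state-value function, the Bellman optimality equation can be written as follows, where R(s, a) is the immediate reward, P(s' \mid s, a) the transition probability, and \gamma the discount factor:

V^*(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \Big]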

Value Functions and Q-Values: In MDPs, value functions play a crucial role in determining the optimal policy. There are two types of value functions:

  1. State-Value Function (V): Represents the expected cumulative reward starting from a particular state and following a specific policy.
  2. Action-Value Function (Q): Represents the expected cumulative reward starting from a particular state, taking a specific action, and then following a specific policy.

The Q-values are used to determine the optimal action in each state, as the action with the highest Q-value is considered the best action to take.
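
Written out for a policy \pi and discount factor \gamma, the two value functions and the greedy action choice they induce are:

V^\pi(s) = \mathbb{E}_\pi \Big[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \;\Big|\; s_0 = s \Big]

Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^\pi(s')

\pi^*(s) = \arg\max_a Q^*(s, a)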

Solving MDPs: There are several algorithms used to solve MDPs and find the optimal policy. Two popular algorithms are:

  1. Value Iteration: An iterative algorithm that updates the state-value function until convergence, using the Bellman equation.
  2. Policy Iteration: An iterative algorithm that alternates between policy evaluation and policy improvement until convergence.

These algorithms aim to find the optimal policy that maximizes the expected cumulative reward over time.
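
Value iteration is implemented later in this article for the inventory example. As a complement, here is a minimal, generic sketch of policy iteration for a finite MDP; it assumes transitions and rewards are supplied as NumPy arrays P[s, a, s2] and R[s, a], and the function name and array layout are illustrative conventions rather than a standard API.

import numpy as np

def policy_iteration(P, R, gamma):
    # Policy iteration for a finite MDP with transitions P (S x A x S) and rewards R (S x A).
    num_states, num_actions = R.shape
    policy = np.zeros(num_states, dtype=int)       # start from an arbitrary policy
    while True:
        # Policy evaluation: solve the linear system V = R_pi + gamma * P_pi V exactly
        P_pi = P[np.arange(num_states), policy]    # (S, S) transitions under the current policy
        R_pi = R[np.arange(num_states), policy]    # (S,) rewards under the current policy
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the evaluated value function
        Q = R + gamma * P @ V                      # (S, A) action values
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):     # converged: the policy is stable
            return V, policy
        policy = new_policy

The exact linear solve in the evaluation step is convenient for small state spaces; for larger problems it is commonly replaced by a few sweeps of iterative evaluation.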

Scenario

Imagine you are the inventory manager of a retail store that sells a variety of products. Your primary goal is to optimize the store’s inventory levels to maximize profits while minimizing the costs associated with holding excess stock and losing potential sales due to stockouts.

The store’s inventory can be modeled as a Markov Decision Process (MDP) with the following components:

States:

  • The states represent the current stock level of a particular product, ranging from 0 to a maximum capacity of N units.

Actions:

  • In each state, you have the option to order a certain quantity of the product to replenish the inventory. The actions available are to order 0, 1, 2, …, or M units of the product.

Transition Probabilities:

  • The transition probabilities depend on the demand for the product. Given the current stock level and the action taken (i.e., the number of units ordered), there is a probability distribution over the possible next states based on the demand distribution.

Rewards:

  • The rewards are based on the profits earned from selling the product and the costs incurred from holding inventory and placing orders.
  • Each unit sold generates a profit of P.
  • Each unit held in inventory incurs a holding cost of H per time step.
  • Each order placed incurs a fixed ordering cost of C, regardless of the quantity ordered.

Discount Factor:

  • The discount factor, denoted as gamma, represents the importance of future rewards compared to immediate rewards. It balances the trade-off between short-term and long-term profitability.

The objective is to find an optimal policy that determines the best action (i.e., the number of units to order) in each state to maximize the expected long-term profits.
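
In formula form, and following the conventions used in the code later in this article (the holding cost is charged on the pre-order stock level s, and the fixed ordering cost applies whenever a > 0), the transition and expected reward for ordering a units in state s with random demand D are:

s' = \min\big(\max(s + a - D, 0), \, N\big)

R(s, a) = P \cdot \mathbb{E}\big[\min(s + a, D)\big] - H \cdot s - C \cdot \mathbf{1}[a > 0]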

To solve this inventory management problem using MDPs, you can follow these steps:

  1. Define the state space, action space, transition probabilities, rewards, and discount factor based on the problem description.
  2. Implement the MDP framework in Python, defining the necessary data structures and functions.
  3. Use dynamic programming algorithms such as value iteration or policy iteration to compute the optimal policy and the corresponding value function.
  4. Simulate the inventory management process using the obtained optimal policy to evaluate its performance and compare it with other baseline policies.

By applying MDPs to the inventory management problem, you can make informed decisions on when and how much to order, balancing the costs and profits associated with inventory levels. The optimal policy obtained from solving the MDP will provide a strategic approach to maintain appropriate stock levels, minimize costs, and maximize long-term profitability.

import numpy as np

class InventoryMDP:
    def __init__(self, max_capacity, max_order, demand_probs, profit, holding_cost, order_cost, discount_factor):
        self.max_capacity = max_capacity
        self.max_order = max_order
        self.demand_probs = demand_probs
        self.profit = profit
        self.holding_cost = holding_cost
        self.order_cost = order_cost
        self.discount_factor = discount_factor
        self.num_states = max_capacity + 1   # stock levels 0..max_capacity
        self.num_actions = max_order + 1     # order quantities 0..max_order
        self.transition_probs = self.compute_transition_probs()
        self.rewards = self.compute_rewards()

    def compute_transition_probs(self):
        # transition_probs[s, a, s2] = probability of moving from stock level s to s2
        # when ordering a units, under the given demand distribution
        transition_probs = np.zeros((self.num_states, self.num_actions, self.num_states))
        for s in range(self.num_states):
            for a in range(self.num_actions):
                for d in range(len(self.demand_probs)):
                    next_s = max(0, min(s + a - d, self.max_capacity))
                    transition_probs[s, a, next_s] += self.demand_probs[d]
        return transition_probs

    def compute_rewards(self):
        # rewards[s, a] = expected immediate reward: sales profit minus holding and ordering costs
        rewards = np.zeros((self.num_states, self.num_actions))
        for s in range(self.num_states):
            for a in range(self.num_actions):
                expected_sales = sum(min(s + a, d) * self.demand_probs[d] for d in range(len(self.demand_probs)))
                rewards[s, a] = expected_sales * self.profit - s * self.holding_cost - (a > 0) * self.order_cost
        return rewards

    def value_iteration(self, epsilon=1e-6):
        V = np.zeros(self.num_states)
        while True:
            delta = 0
            for s in range(self.num_states):
                v = V[s]
                V[s] = max(self.rewards[s, a] + self.discount_factor * np.dot(self.transition_probs[s, a], V)
                           for a in range(self.num_actions))
                delta = max(delta, abs(v - V[s]))
            if delta < epsilon:
                break
        # Derive the greedy policy: for each state, pick the action with the highest expected return
        policy = np.array([
            np.argmax([self.rewards[s, a] + self.discount_factor * np.dot(self.transition_probs[s, a], V)
                       for a in range(self.num_actions)])
            for s in range(self.num_states)
        ])
        return V, policy


max_capacity = 10
max_order = 5
demand_probs = [0.1, 0.2, 0.3, 0.2, 0.1, 0.1]
profit = 10
holding_cost = 1
order_cost = 5
discount_factor = 0.95

mdp = InventoryMDP(max_capacity, max_order, demand_probs, profit, holding_cost, order_cost, discount_factor)
values, policy = mdp.value_iteration()

print("Optimal values:", values)
print("Optimal policy:", policy)

The provided code implements a Markov Decision Process (MDP) for an inventory management scenario using the value iteration algorithm. Let’s go through the code step by step to understand its functionality.

  1. The code starts by importing the NumPy library, which is used for efficient numerical computations.
  2. The InventoryMDP class is defined, which encapsulates the components and methods required to solve the inventory management MDP.
  3. The __init__ method is the constructor of the InventoryMDP class. It initializes the MDP parameters such as maximum capacity, maximum order quantity, demand probabilities, profit per unit sold, holding cost per unit, order cost, and discount factor. It also calculates the number of states and actions based on the maximum capacity and maximum order quantity.
  4. The compute_transition_probs method calculates the transition probabilities. It creates a 3D array transition_probs to store the probabilities of transitioning from one state to another based on the action taken and the demand realization. It iterates over all possible states, actions, and demand values to compute the probabilities using the given demand probabilities.
  5. The compute_rewards method calculates the expected immediate reward for each state-action pair and stores it in a 2D array rewards. For every state and action, it computes the expected sales from the demand probabilities and combines them with the profit, holding cost, and order cost (a worked numerical example follows after this list).
  6. The value_iteration method implements the value iteration algorithm to find the optimal value function and policy. It initializes a value function V with zeros. It iteratively updates the value function until convergence by computing the maximum expected future reward for each state based on the Bellman equation. The convergence criterion is based on the maximum change in the value function (delta) being smaller than a specified threshold (epsilon).
  7. After convergence, the optimal policy is derived by selecting the action that maximizes the expected future reward for each state, considering the immediate reward and the discounted future value.
  8. The code then demonstrates an example usage of the InventoryMDP class. It creates an instance of the class with specific parameters such as maximum capacity, maximum order quantity, demand probabilities, profit, holding cost, order cost, and discount factor.
  9. The value_iteration method is called on the MDP instance to solve the inventory management problem and obtain the optimal value function and policy.
  10. Finally, the code prints the optimal values and the optimal policy obtained from the value iteration algorithm.
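
As a worked example of the reward computation in step 5, take stock level s = 2 and order quantity a = 1, so 3 units are available to meet demand. With the demand probabilities [0.1, 0.2, 0.3, 0.2, 0.1, 0.1] for demands 0 through 5, profit P = 10, holding cost H = 1, and order cost C = 5 from the example code:

expected_sales = 0(0.1) + 1(0.2) + 2(0.3) + 3(0.2) + 3(0.1) + 3(0.1) = 2.0
rewards[2, 1] = 2.0 * 10 - 2 * 1 - 5 = 13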

In summary, the InventoryMDP class represents the inventory management problem as a Markov Decision Process: it computes the transition probabilities and expected rewards, solves the MDP with value iteration, and derives the optimal policy by maximizing expected future rewards. The example usage then prints the optimal values and policy obtained from solving the MDP.

The output provided showcases the results obtained from solving the Markov Decision Process (MDP) for the inventory management scenario using the value iteration algorithm. Let’s break down the meaning of the optimal values and the optimal policy.

Optimal Values: The optimal values represent the expected long-term rewards for each state in the MDP, assuming the optimal policy is followed. In this case, the optimal values are:

[345.50234612, 344.67341842, 343.6734185, 344.55333172, 345.35409114, 345.50234688, 344.67341916, 343.59661548, 342.21208001, 340.40291911, 338.18458793]

Each value corresponds to a specific state, which represents the current stock level of the product. The states range from 0 to the maximum capacity (10 in this example). The optimal values indicate the maximum expected rewards that can be obtained starting from each state and following the optimal policy.

For instance, if the current stock level is 0 (the first state), the expected long-term reward is 345.50234612. Similarly, if the stock level is 5 (the sixth state), the expected long-term reward is 345.50234688.

These optimal values serve as a guide for the inventory manager to assess the long-term profitability of different stock levels and make informed decisions accordingly.

Optimal Policy: The optimal policy determines the best action to take in each state to maximize the expected long-term rewards. In this case, the optimal policy is:

[0, 0, 0, 0, 0, 0]

Each value in the optimal policy corresponds to a state, and the value represents the optimal action to take in that state. The actions represent the number of units to order to replenish the inventory.

In this example, the optimal policy suggests ordering 0 units in all states. This means that regardless of the current stock level, the optimal decision is not to place any new orders.

The interpretation of this optimal policy depends on the specific parameters used in the MDP, such as the demand probabilities, profit, holding cost, and order cost. It suggests that given the current setup, it is more profitable to sell the existing inventory without replenishing it.

However, it’s important to note that the optimal policy may change if the parameters of the MDP are modified. For example, if the profit per unit sold increases or the holding cost decreases, the optimal policy might suggest ordering more units in certain states.

The optimal policy provides the inventory manager with a clear recommendation on how many units to order in each state to maximize long-term profitability. By following the optimal policy, the manager can make data-driven decisions that align with the company’s objectives.

In conclusion, the output of the MDP solution offers valuable insights into the optimal inventory management strategy. The optimal values indicate the expected long-term rewards for each state, while the optimal policy prescribes the best action to take in each state. By leveraging these results, inventory managers can optimize their decision-making process, minimize costs, and maximize profitability in the long run.
