Reinforcement Learning in DataRobot
In this notebook, we implement a very simple model based on the Q-learning algorithm. It is intended to show a basic form of RL that doesn't require a deep understanding of neural networks or advanced mathematics, and to demonstrate how one might deploy such a model in DataRobot.
This example shows the Grid World problem, where an agent learns to navigate a grid to reach a goal.
The notebook will go through the following steps:
- Define State and Action Space
- Create a Q-table to store expected rewards for each state/action combination
- Implement learning algorithm and train model
- Evaluate model
- Deploy to a DataRobot REST API endpoint
1. Define State and Action Space
Let’s first install datarobotx for some convenient DataRobot deployment procedures.
In [ ]:
%%bash
pip install -U datarobotx
In [ ]:
import random
import numpy as np
In [ ]:
# Grid settings
grid_size = 4


# Function to build a list of all state tuples
def build_state_list(grid_size):
    state_list = []
    for i in range(grid_size):
        for j in range(grid_size):
            state_list.append((i, j))
    return state_list


all_states = build_state_list(grid_size)

# Here we just try to reach the corner state (3, 3) (it could be the center or any other state)
goal_state = (3, 3)

n_states = grid_size * grid_size
n_actions = 4  # Up, Down, Left, Right
2. Create a Q-table to store expected rewards for each state/action combination
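The Q-table is a 16 × 4 NumPy array: one row per state and one column per action, initialized to zeros. The helper functions below convert between (row, column) state tuples and flat row indices into this table, list the valid actions in a state, and apply an action to produce the next state (invalid moves leave the state unchanged).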
In [ ]:
# Initialize Q-table
Q = np.zeros((n_states, n_actions))


# Helper functions
def state_to_index(state):
    return state[0] * grid_size + state[1]


def index_to_state(index):
    return (index // grid_size, index % grid_size)


def get_possible_actions(state):
    actions = []
    if state[0] > 0:
        actions.append(0)  # Up
    if state[0] < grid_size - 1:
        actions.append(1)  # Down
    if state[1] > 0:
        actions.append(2)  # Left
    if state[1] < grid_size - 1:
        actions.append(3)  # Right
    return actions


# State transition function; invalid actions leave the state unchanged
def take_action(state, action):
    new_state = list(state)
    if action == 0 and state[0] > 0:
        new_state[0] -= 1  # Up
    if action == 1 and state[0] < grid_size - 1:
        new_state[0] += 1  # Down
    if action == 2 and state[1] > 0:
        new_state[1] -= 1  # Left
    if action == 3 and state[1] < grid_size - 1:
        new_state[1] += 1  # Right
    return tuple(new_state)
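As an optional sanity check, we can exercise these helpers on a few states before training; the expected values in the comments assume the 4 × 4 grid and action encoding defined above.
In [ ]:
print(state_to_index((1, 2)))        # 6
print(index_to_state(6))             # (1, 2)
print(get_possible_actions((0, 0)))  # [1, 3] -- only Down and Right are valid in this corner
print(take_action((0, 0), 3))        # (0, 1)
print(take_action((0, 0), 0))        # (0, 0) -- invalid moves leave the state unchanged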
3. Implement learning algorithm and train model
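At each step, the agent updates its estimate with the standard one-step Q-learning rule: Q(s, a) ← Q(s, a) + α · (r + γ · max over a' of Q(s', a') − Q(s, a)), where α is the learning rate, γ is the discount factor, r is the reward, and s' is the next state. This is exactly the update implemented in the training loop below.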
In [ ]:
# Learning parameters
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.1 # Exploration rate
n_episodes = 100000
# Train the model with tabular Q-learning
for episode in range(n_episodes):
    # Start at a random state
    state = random.choice(all_states)
    done = state == goal_state
    while not done:
        state_index = state_to_index(state)
        if random.uniform(0, 1) < epsilon:
            # Explore: choose a random action
            action = random.choice(get_possible_actions(state))
        else:
            # Exploit: choose the best action from the Q-table
            action = np.argmax(Q[state_index])
        # Take action and observe reward
        next_state = take_action(state, action)
        reward = 1 if next_state == goal_state else 0
        next_state_index = state_to_index(next_state)
        # Q-learning update
        Q[state_index, action] = Q[state_index, action] + learning_rate * (
            reward
            + discount_factor * np.max(Q[next_state_index])
            - Q[state_index, action]
        )
        # Transition to the next state
        state = next_state
        done = state == goal_state
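To get a feel for what was learned, we can optionally print the greedy action for every state; this is a small sketch reusing the objects defined above (U/D/L/R correspond to actions 0-3, and G marks the goal).
In [ ]:
# Print the greedy policy as a grid of actions
action_symbols = {0: "U", 1: "D", 2: "L", 3: "R"}
for i in range(grid_size):
    row = []
    for j in range(grid_size):
        if (i, j) == goal_state:
            row.append("G")
        else:
            row.append(action_symbols[int(np.argmax(Q[state_to_index((i, j))]))])
    print(" ".join(row))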
4. Evaluate model
First, we show a single path, and then measure how many actions it takes on average to reach the goal state.
In [ ]:
# Evaluating the model
state = random.choice(all_states)
print("Initial state:", state)

trajectory = [state]
done = state == goal_state
while not done:
    state_index = state_to_index(state)
    action = np.argmax(Q[state_index])  # Choose the best action
    state = take_action(state, action)
    trajectory.append(state)
    done = state == goal_state

print(trajectory)
Out [ ]:
Initial state: (3, 3)
[(3, 3)]
In [ ]:
total_actions = 0  # Total number of actions taken to reach the goal
for state in all_states:
    # Evaluating the model
    trajectory = [state]
    done = state == goal_state
    while not done:
        state_index = state_to_index(state)
        action = np.argmax(Q[state_index])  # Choose the best action
        state = take_action(state, action)
        trajectory.append(state)
        done = state == goal_state
        total_actions += 1

print(
    "Average number of actions taken to reach the goal:",
    total_actions / len(all_states),
)
Out [ ]:
Average number of actions taken to reach the goal: 3.0
Is this optimal? The optimal policy always moves directly toward the goal, so the minimum number of actions from any state equals its Manhattan distance to (3, 3). From the opposite corner (0, 0) this is 6 actions; 2 states need 5 actions, 3 states need 4, 4 states need 3, 3 states need 2, 2 states need 1, and the goal state itself needs 0. Summing over all states:
1*6 + 2*5 + 3*4 + 4*3 + 3*2 + 2*1 + 1*0 = 48
Total states = 16
Therefore, the optimal average is 48/16 = 3, which is exactly the average number of actions our policy takes.
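We can also verify this number programmatically by averaging the Manhattan distance to the goal over all start states; a quick check reusing the all_states and goal_state defined above.
In [ ]:
# The optimal number of actions from a state is its Manhattan distance to the goal
optimal_avg = sum(
    abs(goal_state[0] - s[0]) + abs(goal_state[1] - s[1]) for s in all_states
) / len(all_states)
print("Optimal average number of actions:", optimal_avg)  # 3.0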
5. Deploy to a DataRobot REST API endpoint
In [ ]:
import pickle
import datarobot as dr
import numpy as np
import pandas as pd
In [ ]:
import os
os.makedirs("./storage/deploy/", exist_ok=True)
# save the Q table to a pickle file
with open("./storage/deploy/q_table.pkl", "wb") as f:
    pickle.dump(Q, f)
Connect to DataRobot
Read more about different options for connecting to DataRobot from the client.
In [ ]:
dr_client = dr.Client()
Define Hooks for Deploying an Unstructured Custom Model. One could use a standard custom model deployment, but we use an unstructured one here to illustrate the flexibility available for more complex RL problems.
In [ ]:
def load_model(input_dir):
    """Custom model hook for loading the Q-table.

    Make sure to execute the cell earlier in the notebook that creates the Q-table before deploying.
    """
    with open(input_dir + "/storage/deploy/" + "q_table.pkl", "rb") as f:
        Q = pickle.load(f)
    return Q


def score_unstructured(model, data, query, **kwargs) -> str:
    """Custom model hook that returns the next action.

    model: The output of load_model is passed to this object.
    data: str
        Expects a JSON string passed in the request body.
        Required keys:
            state: tuple(int, int) .. Current state of the agent
    query: None
        Unused
    **kwargs: dict
        Unused
    Returns:
        JSON string with the output action
    """
    import json

    import numpy as np

    Q = model
    grid_size = int(np.sqrt(len(Q)))  # Grid size is inferred from the Q-table

    # Helper function
    def state_to_index(state):
        return state[0] * grid_size + state[1]

    data_dict = json.loads(data)
    state = data_dict["state"]
    state_index = state_to_index(state)
    action = np.argmax(Q[state_index])
    return json.dumps({"action": action}, default=int)
Test out the prediction structure prior to deployment.
In [ ]:
import json

score_unstructured(
    load_model("."),
    json.dumps({"state": (0, 1)}),
    None,
)
Out [ ]:
'{"action": 1}'
Deploy the RL policy model. We will use the drx.deploy() convenience method, which:
- Builds a new Custom Model Environment
- You can also use a DataRobot Python Drop-in Environment (e.g., "6386dc1159c606b0d8beddc7")
- Assembles a new Custom Model with the provided hooks
- Deploys an Unstructured Custom Model to your Deployments
- Returns an object which can be used to make predictions
Use environment_id to reuse an existing Custom Model Environment that you're happy with, for shorter iteration cycles on the custom model hooks.
Note: See https://app.datarobot.com/docs/api/api-quickstart/index.html for instructions on setting up a drconfig.yaml, or call drx.Context() to initialize your credentials.
In [ ]:
import datarobotx as drx
drx.Context().endpoint = dr_client.endpoint
drx.Context().token = dr_client.token
In [ ]:
deployment = drx.deploy(
    "storage/deploy/",
    hooks={"score_unstructured": score_unstructured, "load_model": load_model},
    extra_requirements=[],
    # environment_id="6386dc1159c606b0d8beddc7",
)
Out [ ]:
# Deploying custom model
- Unable to auto-detect model type; any provided paths and files will be
exported - dependencies should be explicitly specified using
`extra_requirements` or `environment_id`
- Preparing model and environment...
- Configured environment [[Custom]
priceless-ganguly](https://app.datarobot.com/model-registry/custom-environments/65ac4115be769b7f85d5aaf9)
with requirements:
python 3.9.16
datarobot-drum==1.10.14
datarobot-mlops==9.2.8
cloudpickle==2.2.1
- Awaiting custom environment build...
Out [ ]:
- Configuring and uploading custom model...
100%|███████████████████████████| 11.0k/11.0k [00:00<00:00, 5.14MB/s]
- Registered custom model
[priceless-ganguly](https://app.datarobot.com/model-registry/custom-models/65ac42ce046ed058aada50c7/info)
with target type: Unstructured
- Creating and deploying model package...
Out [ ]:
- Created deployment
[priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Custom model deployment complete
Let's try out our deployment and track the trajectory produced by the deployed policy (which returns an action for each state).
In [ ]:
# If your deployment already occurred or your notebook restarted due to inactivity, get the ID from the URL in the UI
# deployment = drx.Deployment("YOUR DEPLOYMENT ID HERE")
deployment.predict_unstructured({"state": (0, 1)})
Out [ ]:
# Making predictions
- Making predictions with deployment
[priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
{'action': 1}
Test and print trajectory.
In [ ]:
state = (0, 1)
goal_state = (3, 3)
print("Initial state:", state)

trajectory = [state]
done = state == goal_state
while not done:
    action = deployment.predict_unstructured({"state": state})["action"]
    state = take_action(state, action)
    trajectory.append(state)
    done = state == goal_state

print(trajectory)
Out [ ]:
Initial state: (0, 1)
# Making predictions
- Making predictions with deployment
[priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
[(0, 1), (1, 1), (2, 1), (3, 1), (3, 2), (3, 3)]