Reinforcement Learning in DataRobot

In this notebook, we implement a very simple model based on the Q-learning algorithm. The goal is to show a basic form of reinforcement learning that doesn't require a deep understanding of neural networks or advanced mathematics, and to demonstrate how one might deploy such a model in DataRobot.
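
For reference, the update rule at the heart of tabular Q-learning, and of the training loop in step 3 below, is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the observed reward, and $s'$ is the state reached after taking action $a$ in state $s$.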

This example shows the Grid World problem, where an agent learns to navigate a grid to reach a goal.

The notebook will go through the following steps:

  1. Define State and Action Space
  2. Create a Q-table to store expected rewards for each state/action combination
  3. Implement learning algorithm and train model
  4. Evaluate model
  5. Deploy to a DataRobot REST API endpoint

1. Define State and Action Space

Let’s first install datarobotx for some convenient DataRobot deployment procedures.

In [ ]:

%%bash
pip install -U datarobotx
In [ ]:

import random

import numpy as np
In [ ]:

# Grid settings
grid_size = 4

# Function to build a list of all state tuples


def build_state_list(grid_size):
    state_list = []
    for i in range(grid_size):
        for j in range(grid_size):
            state_list.append((i, j))
    return state_list


all_states = build_state_list(grid_size)

# Here we just try to reach a corner state (could be the center or any other state)
goal_state = (3, 3)
n_states = grid_size * grid_size
n_actions = 4  # Up, Down, Left, Right
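
As a quick, optional sanity check (not required for anything that follows), we can confirm that the enumeration covers every cell of the grid:

In [ ]:

# Optional sanity check: one state tuple per grid cell
assert len(all_states) == n_states
print(all_states[:5])  # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0)]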

2. Create a Q-table to store expected rewards for each state/action combination

In [ ]:

# Initialize Q-table
Q = np.zeros((n_states, n_actions))

# Helper functions


def state_to_index(state):
    return state[0] * grid_size + state[1]


def index_to_state(index):
    return (index // grid_size, index % grid_size)


def get_possible_actions(state):
    actions = []
    if state[0] > 0:
        actions.append(0)  # Up
    if state[0] < grid_size - 1:
        actions.append(1)  # Down
    if state[1] > 0:
        actions.append(2)  # Left
    if state[1] < grid_size - 1:
        actions.append(3)  # Right
    return actions


# Correct the state transition function to prevent invalid states


def take_action(state, action):
    new_state = list(state)
    if action == 0 and state[0] > 0:
        new_state[0] -= 1  # Up
    if action == 1 and state[0] < grid_size - 1:
        new_state[0] += 1  # Down
    if action == 2 and state[1] > 0:
        new_state[1] -= 1  # Left
    if action == 3 and state[1] < grid_size - 1:
        new_state[1] += 1  # Right
    return tuple(new_state)
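
Before training, it is worth spot-checking the helpers with a small optional sketch: state_to_index and index_to_state should be inverses, and take_action should keep the agent on the grid, leaving the state unchanged when an action would step off the edge.

In [ ]:

# Optional sanity checks for the helper functions
assert all(index_to_state(state_to_index(s)) == s for s in all_states)
assert take_action((0, 0), 0) == (0, 0)  # "Up" from the top row is a no-op
assert take_action((0, 0), 1) == (1, 0)  # "Down" moves one row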

3. Implement learning algorithm and train model

In [ ]:

# Learning parameters
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.1  # Exploration rate
n_episodes = 100000

# Training the model with corrected state transitions
for episode in range(n_episodes):
    # start at a random state
    state = random.choice(all_states)
    done = state == goal_state

    while not done:
        state_index = state_to_index(state)
        if random.uniform(0, 1) < epsilon:
            # Explore: choose a random action
            action = random.choice(get_possible_actions(state))
        else:
            # Exploit: choose the best action from Q-table
            action = np.argmax(Q[state_index])

        # Take action and observe reward
        next_state = take_action(state, action)
        reward = 1 if next_state == goal_state else 0
        next_state_index = state_to_index(next_state)

        # Q-learning update
        Q[state_index, action] = Q[state_index, action] + learning_rate * (
            reward
            + discount_factor * np.max(Q[next_state_index])
            - Q[state_index, action]
        )

        # Transition to the next state
        state = next_state
        done = state == goal_state
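
Once training finishes, a quick way to eyeball the result (an optional sketch, not needed for the evaluation below) is to print the greedy action the Q-table encodes for each cell of the grid:

In [ ]:

# Optional: print the greedy action learned for each cell (U/D/L/R, G marks the goal)
action_symbols = {0: "U", 1: "D", 2: "L", 3: "R"}
for i in range(grid_size):
    row = []
    for j in range(grid_size):
        if (i, j) == goal_state:
            row.append("G")
        else:
            row.append(action_symbols[np.argmax(Q[state_to_index((i, j))])])
    print(" ".join(row))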

4. Evaluate model

First, we will show a single path, and then measure how many actions it takes on average to reach the goal state.

In [ ]:

# Evaluating the model
state = random.choice(all_states)
print("Initial state:", state)
trajectory = [state]
done = state == goal_state
while not done:
    state_index = state_to_index(state)
    action = np.argmax(Q[state_index])  # Choose the best action
    state = take_action(state, action)
    trajectory.append(state)
    done = state == goal_state

print(trajectory)
Out [ ]:

Initial state: (3, 3)
[(3, 3)]
In [ ]:

total_actions = 0  # Total number of actions taken to reach the goal
for state in all_states:
    # Evaluating the model
    trajectory = [state]
    done = state == goal_state
    while not done:
        state_index = state_to_index(state)
        action = np.argmax(Q[state_index])  # Choose the best action
        state = take_action(state, action)
        trajectory.append(state)
        done = state == goal_state
        total_actions += 1
print(
    "Average number of actions taken to reach the goal:",
    total_actions / len(all_states),
)
Out [ ]:

Average number of actions taken to reach the goal: 3.0
Is this optimal? Under the optimal policy, the number of actions needed from any state is its Manhattan distance to the goal corner. One state (the opposite corner) is 6 actions away, 2 states are 5 away, 3 states are 4 away, 4 states are 3 away, 3 states are 2 away, 2 states are 1 away, and the goal state itself needs 0 actions. By simple arithmetic we have

1*6 + 2*5 + 3*4 + 4*3 + 3*2 + 2*1 + 1*0 = 48

Total states = 16

Therefore, the optimal average is 48/16 = 3, which is exactly the average number of actions our policy takes.
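
We can check the same figure programmatically, since the optimal number of actions from any state is just its Manhattan distance to the goal (a small optional check using only the variables defined above):

In [ ]:

# Optional check: the optimal average path length is the mean Manhattan distance to the goal
optimal_lengths = [
    abs(goal_state[0] - i) + abs(goal_state[1] - j) for (i, j) in all_states
]
print(sum(optimal_lengths) / len(all_states))  # 3.0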

5. Deploy to a DataRobot REST API endpoint

In [ ]:

import pickle

import datarobot as dr
import numpy as np
import pandas as pd
In [ ]:

import os

os.makedirs("./storage/deploy/", exist_ok=True)
# save the Q table to a pickle file
with open("./storage/deploy/q_table.pkl", "wb") as f:
    pickle.dump(Q, f)

Connect to DataRobot

Read more about different options for connecting to DataRobot from the client.

In [ ]:

dr_client = dr.Client()

Define the hooks for deploying an Unstructured Custom Model. A standard custom model deployment would also work, but the unstructured variant illustrates the flexibility available for more complex RL problems.

In [ ]:

def load_model(input_dir):
    """Custom model hook for loading our Q-table

    Make sure to execute the cell earlier in the notebook that creates the Q-table before deploying.
    """

    with open(input_dir + "/storage/deploy/" + "q_table.pkl", "rb") as f:
        Q = pickle.load(f)

    return Q


def score_unstructured(model, data, query, **kwargs) -> str:
    """Custom model hook for return action.

    model: The output of load_model is passed to this object
    data: str
        Expects json string passed in request body.
        Required keys:
                state: tuple(int, int) .. Current state of the agent
    query: None
        Unused
    **kwargs: dict
        Unused

    Returns:
        JSON string with output action

    """
    import json

    import numpy as np

    Q = model
    grid_size = int(np.sqrt(len(Q)))  # Grid size is inferred from the Q-table

    # Helper functions
    def state_to_index(state):
        return state[0] * grid_size + state[1]
    
    data_dict = json.loads(data)
    state = data_dict["state"]

    state_index = state_to_index(state)
    action = np.argmax(Q[state_index])

    return json.dumps({"action": action}, default=int)

Test out the prediction structure prior to deployment.

In [ ]:

import json

score_unstructured(
    load_model("."),
    json.dumps({"state": (0, 1)}),
    None,
)
Out [ ]:

'{"action": 1}'

Deploy the RL policy model. We will use the drx.deploy() convenience method, which:

  • Builds a new Custom Model Environment (you can also pass a DataRobot Python Drop-in Environment, e.g. "6386dc1159c606b0d8beddc7")
  • Assembles a new Custom Model with the provided hooks
  • Deploys an Unstructured Custom Model to your Deployments
  • Returns an object which can be used to make predictions

Use environment_id to re-use an existing Custom Model Environment that you're happy with; this shortens iteration cycles on the custom model hooks.

Note: See https://app.datarobot.com/docs/api/api-quickstart/index.html for instructions on setting up a drconfig.yaml file or calling drx.Context() to initialize your credentials.

In [ ]:

import datarobotx as drx

drx.Context().endpoint = dr_client.endpoint
drx.Context().token = dr_client.token
In [ ]:

deployment = drx.deploy(
    "storage/deploy/",
    hooks={"score_unstructured": score_unstructured, "load_model": load_model},
    extra_requirements=[],
    # environment_id="6386dc1159c606b0d8beddc7",
)
Out [ ]:

# Deploying custom model
  - Unable to auto-detect model type; any provided paths and files will be
    exported - dependencies should be explicitly specified using
    `extra_requirements` or `environment_id`
  - Preparing model and environment...
  - Configured environment [[Custom]
    priceless-ganguly](https://app.datarobot.com/model-registry/custom-environments/65ac4115be769b7f85d5aaf9)
    with requirements:
      python 3.9.16
      datarobot-drum==1.10.14
      datarobot-mlops==9.2.8
      cloudpickle==2.2.1
  - Awaiting custom environment build...
Out [ ]:

  - Configuring and uploading custom model...

    100%|███████████████████████████| 11.0k/11.0k [00:00<00:00, 5.14MB/s]
  - Registered custom model
    [priceless-ganguly](https://app.datarobot.com/model-registry/custom-models/65ac42ce046ed058aada50c7/info)
    with target type: Unstructured
  - Creating and deploying model package...
Out [ ]:

  - Created deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Custom model deployment complete

Let's try out our deployment and track the trajectory produced by the deployed policy (each prediction call returns an action).

In [ ]:

# If your deployment already occurred or your notebook restarted due to inactivity, get the ID from the URL in the UI
# deployment = drx.Deployment("YOUR DEPLOYMENT ID HERE")
deployment.predict_unstructured({"state": (0, 1)})
Out [ ]:

# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
{'action': 1}

Test and print trajectory.

In [ ]:

state = (0, 1)
goal_state = (3, 3)

print("Initial state:", state)
trajectory = [state]
done = state == goal_state
while not done:
    action = deployment.predict_unstructured({"state": state})["action"]
    state = take_action(state, action)
    trajectory.append(state)
    done = state == goal_state

print(trajectory)
Out [ ]:

Initial state: (0, 1)
# Making predictions
  - Making predictions with deployment
    [priceless-ganguly](https://app.datarobot.com/deployments/65ac42d34958c314b9badcb9/overview)
# Predictions complete
  (the same prediction output repeats for each of the remaining calls)
[(0, 1), (1, 1), (2, 1), (3, 1), (3, 2), (3, 3)]