Prediction Intervals via Conformal Inference

This AI Accelerator demonstrates various ways for generating prediction intervals for any DataRobot model. The methods presented here are rooted in the area of conformal inference (also known as conformal prediction).

Request a Demo

These types of approaches have become increasingly popular for uncertainty quantification because they do not require strict distributional assumptions to be met in order to achieve proper coverage (i.e., they only require that the testing data is exchangeable with the training data). While conformal inference can be applied across a wide array of prediction problems, the focus in this notebook will be prediction interval generation on regression targets. This notebook is formatted as follows:

  1. Importing libraries
  2. Notebook parameters and helper functions
  3. Loading the example dataset
  4. Modeling building and making predictions
  5. Method 1: Absolute conformal
  6. Method 2: Signed conformal
  7. Method 3: Locally-weighted conformal
  8. Method 4: Conformalized quantile regression
  9. Comparing methods
  10. Conclusion

Note: the particulars for each method have been simplified (e.g., authors use an “adjusted” quantile level rather than the traditional quantile calculation implemented here). For a full treatment of each approach and specific algorithm details, see the cited reference papers below.

1. Importing libraries

Read about different options for connecting to DataRobot from the client. Load the remaining libraries below in the usual way.

In [1]:

# Establish connection
import datarobot as dr

print(f"DataRobot version: {dr.__version__}")
dr.Client()

DataRobot version: 3.2.0

Out [1]:

<datarobot.rest.RESTClientObject at 0x10c74f970>
In [2]:

# Imports
import datarobot as dr
from datarobot.models.modeljob import wait_for_async_model_creation
import numpy as np
import pandas as pd

2. Notebook parameters and helper functions

Below, you’ll have two parameters for this notebook:

  1. COVERAGE_LEVEL: fraction of prediction intervals that should contain the target
  2. TEST_DATA_FRACTION: fraction of data to holdout out and use as a testing dataset to evaluate each method

In addition, a couple functions are provided to make the following analysis easier. One of which computes the two metrics that will be used for comparison:

  1. Coverage: fraction of computed prediction intervals that contain the target across all rows
  2. Average Width: the average width of the prediction intervals across all rows

A desirable method should achieve the proper coverage at the smallest width possible.

In [3]:

# Coverage level
COVERAGE_LEVEL = 0.9

# Fraction of data to use to evaluate each method
TEST_DATA_FRACTION = 0.2
In [4]:

# Courtesy of https://github.com/yromano/cqr/blob/master/cqr/helper.py


def compute_coverage(
    y_test: np.array,
    y_lower: np.array,
    y_upper: np.array,
    significance: float,
    name: str = "",
) -> (float, float):
    """
    Computes coverage and average width

    Parameters
    ----------
    y_test: true labels (n)
    y_lower: estimated lower bound for the labels (n)
    y_upper: estimated upper bound for the labels (n)
    significance: desired significance level
    name: optional output string (e.g. the method name)

    Returns
    -------
    coverage : average coverage
    avg_width : average width

    """
    # Compute coverage
    in_the_range = np.sum((y_test >= y_lower) & (y_test <= y_upper))
    coverage = in_the_range / len(y_test) * 100
    print(
        "%s: Percentage in the range (expecting %.2f): %f"
        % (name, 100 - significance * 100, coverage)
    )

    # Compute average width
    avg_width = np.mean(abs(y_upper - y_lower))
    print("%s: Average width: %f" % (name, avg_width))

    return coverage, avg_width


def compute_training_predictions(model: dr.Model) -> pd.DataFrame:
    """
    Computes (or gathers) the out-of-sample training predictions from a model

    Parameters
    ----------
    model: DataRobot model

    Returns
    -------
    DataFrame of training predictions

    """

    # Get project to unlock holdout
    project = dr.Project.get(model.project_id)
    project.unlock_holdout()

    # Request or gather predictions
    try:
        training_predict_job = model.request_training_predictions(
            dr.enums.DATA_SUBSET.ALL
        )
        training_predictions = training_predict_job.get_result_when_complete()

    except dr.errors.ClientError:
        training_predictions = [
            tp
            for tp in dr.TrainingPredictions.list(project.id)
            if tp.model_id == model.id and tp.data_subset == "all"
        ][0]

    return training_predictions.get_all_as_dataframe()


def quantile_rearrangement(
    test_preds: pd.DataFrame,
    quantile_low: float,
    quantile_high: float,
) -> pd.DataFrame:
    """
    Produces monotonic quantiles
    Based on: https://github.com/yromano/cqr/blob/master/cqr/torch_models.py#L66-#L94

    Parameters
    ----------
    test_preds: dataframe of quantile predictions to rearrange, sorted from lowest quantile to highest
    quantile_low: desired low quantile in the range (0,1)
    quantile_high: desired high quantile in the range (0,1)

    Returns
    -------
    Dataframe of rearranged quantile predictions

    References
    ----------
    .. [1]  Chernozhukov, Victor, Iván Fernández‐Val, and Alfred Galichon.
            "Quantile and probability curves without crossing."
            Econometrica 78.3 (2010): 1093-1125.
    """

    # Based on the code in the referenced function, "all_quantiles" is defined as the following:
    # See https://github.com/yromano/cqr/blob/master/cqr/helper.py#L423
    all_quantiles = np.linspace(0.01, 0.99, 99)

    # This part remains remains the same
    scaling = all_quantiles[-1] - all_quantiles[0]
    low_val = (quantile_low - all_quantiles[0]) / scaling
    high_val = (quantile_high - all_quantiles[0]) / scaling

    # Get new values
    q_fixed = np.quantile(
        test_preds.values, (low_val, high_val), method="linear", axis=1
    )

    return pd.DataFrame(q_fixed.T, columns=test_preds.columns)

3. Loading the example dataset

The dataset you’ll use comes from this DataRobot blog post. Each row represents a player in the National Basketball Association (NBA) and the columns signify different NBA statistics from various repositories, fantasy basketball news sources, and betting information. The target, game_score, is a single statistic that attempts to quantify player performance and productivity.

Additionally, you’ll partition the data into a training and testing sets. The training set will be used for modeling building / evaluation while the testing set will be used to compare each method.

In [5]:

# Load data
df = pd.read_csv(
    "https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_NBA_2017-2018.csv"
)
df.head()

Out [5]:

01234
roto_fpts_per_minNaNNaNNaNNaNNaN
roto_minutesNaNNaNNaNNaNNaN
roto_fptsNaNNaNNaNNaNNaN
roto_valueNaNNaNNaNNaNNaN
free_throws_lag30_meanNaN01.522.5
field_goals_decay1_meanNaN686.2857147.2
game_score_lag30_meanNaN8.113.3510.712.15
minutes_played_decay1_meanNaN27.06666735.51111133.38095230.955556
PF_lastseason3.13.13.13.13.1
free_throws_attempted_lag30_meanNaN02.533.25
teamPHOPHOPHOPHOPHO
opponentPORLALLACSACUTA
over_underNaNNaNNaNNaNNaN
eff_field_goal_percent_lastseason0.4750.4750.4750.4750.475
spread_decay1_meanNaN-48-17.333333-31.428571-13.6
positionSGSGSGSGSG
OWS_lastseason1.31.31.31.31.3
free_throws_percent_decay1NaNNaN0.60.6923080.862069
text_yesterday_and_todayNaNNaNNaNNaNNaN
game_score8.118.65.416.58.4
In [6]:

# Distribution of target
target_column = "game_score"
df[target_column].hist()
Out [6]:

<Axes: >
download 12
In [7]:
# Split data
df_train = df.sample(frac=1 - TEST_DATA_FRACTION, replace=False, random_state=10)
df_test = df.loc[~df.index.isin(df_train.index)]
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
print(df_train.shape)
print(df_test.shape)

(7999, 52)

(2000, 52)

4. Modeling building and making predictions

To create a DataRobot project and start building models, you can use the convenient Project.start function, which chains together project creation, file upload, and target selection. Once models are finished training, you’ll retrieve the one DataRobot recommends for deployment and request predictions for both the training and testing sets. Note that the predictions made on the training dataset are not in-sample, but rather out-of-sample (i.e., also referred to as stacked predictions). These out-of-sample training predictions are a key component to each prediction interval method discussed in this notebook.

In [8]:

# Starting main project
project = dr.Project.start(
    sourcedata=df_train,
    project_name="Conformal Inference AIA - NBA",
    target=target_column,
    worker_count=-1,
)

# Wait
project.wait_for_autopilot(check_interval=120)
In progress: 8, queued: 0 (waited: 0s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 2s)
In progress: 8, queued: 0 (waited: 4s)
In progress: 8, queued: 0 (waited: 6s)
In progress: 8, queued: 0 (waited: 10s)
In progress: 8, queued: 0 (waited: 17s)
In progress: 8, queued: 0 (waited: 30s)
In progress: 7, queued: 0 (waited: 56s)
In progress: 1, queued: 0 (waited: 108s)
In progress: 4, queued: 0 (waited: 211s)
In progress: 1, queued: 0 (waited: 332s)
In progress: 0, queued: 0 (waited: 453s)
In progress: 0, queued: 0 (waited: 574s)
In [9]:

# Get recommended model
best_model = dr.ModelRecommendation.get(project.id).get_model()
best_model
Out [9]:

Model('RandomForest Regressor')
In [10]:

# Compute training predictions (necessary for each method)
training_preds = compute_training_predictions(model=best_model)
training_preds.head()

Out [10]:

row_idpartition_idprediction
000.00.899231
11Holdout9.240274
220.015.043815
334.08.626567
442.015.435130
In [11]:

# Request predictions on testing data
pred_dataset = project.upload_dataset(sourcedata=df_test, max_wait=60 * 60 * 24)
predict_job = best_model.request_predictions(dataset_id=pred_dataset.id)
testing_preds = predict_job.get_result_when_complete(max_wait=60 * 60 * 24)
testing_preds.head()

Out [11]:

row_idprediction
0014.275717
1113.238045
2212.827469
3314.141054
447.113611
In [12]:

# Join predictions training and testing datasets
df_train = df_train.join(training_preds.set_index("row_id"))
df_test = df_test.join(testing_preds.set_index("row_id"))
display(df_train[[target_column, "prediction"]])
display(df_test[[target_column, "prediction"]])

game_scoreprediction
00.00.899231
10.09.240274
221.615.043815
34.48.626567
426.715.435130
799418.419.801230
799512.910.349299
799619.314.453104
79979.823.360390
79988.19.220965
7999 rows × 2 columns

game_scoreprediction
05.414.275717
116.513.238045
27.212.827469
323.814.141054
40.07.113611
199517.010.358305
199625.016.455246
1997-1.24.356278
199815.714.503841
199914.310.568885
2000 rows × 2 columns
In [13]:

# Compute the residuals on the training data
df_train["residuals"] = df_train[target_column] - df_train["prediction"]
df_train["residuals"].hist()

Out[13]:

<Axes: >


download 13

5. Method: Absolute conformal

The first method you’ll implement, regarded here as “absolute conformal”, is as follows:

  1. Take the absolute value of the out-of-sample residuals (these will be the conformity scores)
  2. Compute the quantile associated with the specified COVERAGE_LEVEL on the conformity scores
  3. Add and subtract this quantile value to the prediction

The resulting prediction intervals are guaranteed to be symmetric and the same width (since you’re simply applying a scalar value across all rows). For more information regarding this approach, see Section 2.3.

In [14]:

# Compute the conformity 

df_train["abs_residuals"] = df_train["residuals"].abs()
abs_residuals_q = df_train["abs_residuals"].quantile(COVERAGE_LEVEL)
abs_residuals_q

Out [14]:

11.028431310477108

In [15]:

# Using the conformity score, create the prediction intervals
df_test["method_1_lower"] = df_test["prediction"] - abs_residuals_q
df_test["method_1_upper"] = df_test["prediction"] + abs_residuals_q
In [16]:

# Compute metrics
method_1_coverage = compute_coverage(
    y_test=df_test[target_column].values,
    y_lower=df_test["method_1_lower"].values,
    y_upper=df_test["method_1_upper"].values,
    significance=1 - COVERAGE_LEVEL,
    name="Absolute Conformal",
)
Absolute Conformal: Percentage in the range (expecting 90.00): 89.000000
Absolute Conformal: Average width: 22.056863

6. Method: Signed conformal

“Signed conformal” follows a very similar procedure as the previous one:

  1. Compute lower and upper quantile levels based on the specified COVERAGE_LEVEL
  2. Apply these quantile levels to the out-of-sample residuals (i.e., conformity scores)
  3. Add these quantile values to the prediction

The main advantage to this approach over the previous one is that the prediction intervals are not forced to be symmetric, which can lead to better coverage for skewed targets. For more information regarding this approach, see Section 3.2.

In [17]:

# Compute lower and upper quantile levels to use based on the coverage
lower_coverage_q = round((1 - COVERAGE_LEVEL) / 2, 2)
upper_coverage_q = COVERAGE_LEVEL + (1 - COVERAGE_LEVEL) / 2
lower_coverage_q, upper_coverage_q

Out [17]:

(0.05, 0.95)

In [18]:

# Compute quantiles on the conformity scores
residuals_q_low = df_train["residuals"].quantile(lower_coverage_q)
residuals_q_high = df_train["residuals"].quantile(upper_coverage_q)
residuals_q_low, residuals_q_high

Out [18]:

(-10.573999229291612, 11.617478915155703)

In [19]:

# Using the quantile levels, create the prediction intervals
df_test["method_2_lower"] = df_test["prediction"] + residuals_q_low
df_test["method_2_upper"] = df_test["prediction"] + residuals_q_high
In [20]:

# Compute coverage / width
method_2_coverage = compute_coverage(
    y_test=df_test[target_column].values,
    y_lower=df_test["method_2_lower"].values,
    y_upper=df_test["method_2_upper"].values,
    significance=1 - COVERAGE_LEVEL,
    name="Signed Conformal",
)

Signed Conformal: Percentage in the range (expecting 90.00): 88.900000

Signed Conformal: Average width: 22.191478

7. Method: Locally-weighted conformal

While the primary advantage of the previous two methods is their simplicity, the disadvantage is that each prediction interval ends up being the exact same width. In many cases, it’s desirable to have varying widths that reflect the degree of confidence (i.e., harder to predict rows get a larger prediction interval and vice versa). To this end, you can make them more adaptive by using an auxiliary model to help augment the width on a per-row basis, depending on how much error we’d expect to see in a particular row. The “locally-weighted conformal” method is as follows:

  1. Take the absolute value of the out-of-sample residuals
  2. Build a model that regresses against the absolute residuals using the same feature set
  3. Compute the out-of-sample predictions from the absolute residuals model
  4. Scale the out-of-sample residuals using the the auxillary model’s predictions to create the conformity scores
  5. Compute the quantile associated with the specified COVERAGE_LEVEL on the conformity scores
  6. Multiply this quantile value and the auxillary model’s predictions together (this will result in a locally-weighted offset to apply, specific to each row)
  7. Add and subtract this multiplied value to each prediction

Although this approach is more involved, it addresses the disadvantage above by making the prediction intervals adaptive to each row while still being symmetric. For more information regarding this approach, see Section 5.2.

In [21]:

# Starting project to predict absolute residuals
project_abs_residuals = dr.Project.start(
    sourcedata=df_train.drop(
        columns=[
            target_column,
            "partition_id",
            "prediction",
            "residuals",
        ],
        axis=1,
    ),
    project_name=f"Predicting absolute residuals from {project.project_name}",
    target="abs_residuals",
    worker_count=-1,
)
project_abs_residuals.wait_for_autopilot(check_interval=120)
In progress: 8, queued: 0 (waited: 0s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 1s)
In progress: 8, queued: 0 (waited: 2s)
In progress: 8, queued: 0 (waited: 3s)
In progress: 8, queued: 0 (waited: 5s)
In progress: 8, queued: 0 (waited: 9s)
In progress: 8, queued: 0 (waited: 16s)
In progress: 8, queued: 0 (waited: 29s)
In progress: 7, queued: 0 (waited: 55s)
In progress: 1, queued: 0 (waited: 108s)
In progress: 3, queued: 0 (waited: 211s)
In progress: 1, queued: 0 (waited: 331s)
In progress: 0, queued: 0 (waited: 452s)
In progress: 0, queued: 0 (waited: 572s)
In [22]:

# Get recommended model
best_model_abs_residuals = dr.ModelRecommendation.get(
    project_abs_residuals.id
).get_model()
best_model_abs_residuals
Out [22]:

Model('RandomForest Regressor')

In [23]:

# Compute training predictions and join
df_train = df_train.join(
    compute_training_predictions(model=best_model_abs_residuals)
    .rename(columns={"prediction": f"abs_residuals_prediction"})
    .set_index("row_id")
    .drop(columns=["partition_id"], axis=1)
)
In [24]:

# Now compute prediction on testing data and join
pred_dataset_abs_residuals = project_abs_residuals.upload_dataset(
    sourcedata=df_test, max_wait=60 * 60 * 24
)
df_test = df_test.join(
    best_model_abs_residuals.request_predictions(
        dataset_id=pred_dataset_abs_residuals.id
    )
    .get_result_when_complete(max_wait=60 * 60 * 24)
    .rename(columns={"prediction": f"abs_residuals_prediction"})
    .set_index("row_id")
)
In [25]:

# Now we need to compute our locally-weighted conformity score and take the quantile
scaled_abs_residuals = df_train["abs_residuals"] / df_train["abs_residuals_prediction"]
scaled_abs_residuals_q = scaled_abs_residuals.quantile(COVERAGE_LEVEL)
scaled_abs_residuals_q
Out [25]:

2.0517307447009157
In [26]:

# Using the conformity score and absolute residuals model, create the prediction intervals
df_test["method_3_lower"] = (
    df_test["prediction"] - df_test["abs_residuals_prediction"] * scaled_abs_residuals_q
)
df_test["method_3_upper"] = (
    df_test["prediction"] + df_test["abs_residuals_prediction"] * scaled_abs_residuals_q
)
In [27]:

# Compute coverage / width
method_3_coverage = compute_coverage(
    y_test=df_test[target_column].values,
    y_lower=df_test["method_3_lower"].values,
    y_upper=df_test["method_3_upper"].values,
    significance=1 - COVERAGE_LEVEL,
    name="Locally-Weighted Conformal",
)
Locally-Weighted Conformal: Percentage in the range (expecting 90.00): 89.800000
Locally-Weighted Conformal: Average width: 21.710734

8. Method: Conformalized Quantile Regression

If you consider “locally-weighted conformal” to be a model-based extension of “absolute conformal”, then you could consider “conformalized quantile regression” to be a model-based extension of “signed conformal.” The goal is similar – create more adaptive prediction intervals, but it inherits the quality that the prediction intervals are not forced to be symmetric. The reference paper offers a symmetric and asymmetric formulation for the conformity scores. The former (Theorem 1) “allows coverage errors to be spread arbitrarily over the left and right tails” while the latter (Theorem 2) controls “the left and right tails independently, resulting in a stronger coverage guarantee” at the cost of slightly wider prediction intervals. Here, you’ll use the symmetric version. The full method is as follows:

  1. Compute lower and upper quantile levels based on the specified COVERAGE_LEVEL
  2. Train two quantile regression models at the lower and upper quantile levels on the training set
  3. Compute the out-of-sample predictions for both quantile models
  4. Compute the conformity scores, E, such that E = max[^qlowery, y^qupper]  where у is the target, ^qlower is the lower quantile predictions, and ^qupper is the upper quantile predictions
  5. Compute the quantile associated with the specified COVERAGE_LEVEL on E
  6. Add and subtract this quantile value to the quantile predictions

Notably, this approach is completely independent of your main model. That is, it doesn’t use any information about the recommended model defined earlier. This may or may not be desired, depending on the user’s preference or use case requirements. To ensure the main model’s predictions fall within the prediction interval, you’ll simply extend the interval’s boundary to be equal to the prediction itself (if the prediction lies outside of the respective prediction interval). Additionally, when using non-linear quantile regression methods (e.g., tree-based approaches, neural networks), it’s possible to experience quantile crossing (i.e., non-monotonic quantile predictions). To combat this, the referenced paper offers a solution via rearrangement, which is implemented here.

There are two ways to run quantile regression models in DataRobot:

  1. Set the project metric to quantile loss (which is currently a public-preview feature)
  2. Use certain blueprints with algorithms that support quantile loss as a hyperparameter in your current project. These includes gradient boosted trees from scikit-learn and Keras neural networks.

In this notebook, you’ll use the second approach, since it’s generally available. This involves using DataRobot’s advanced tuning functionality to change the loss function to the desired quantile loss.

In [28]:

# Get a GBT from scikit-learn (using the first one)
models = project.get_models()
gbt_models = [x for x in models if x.model_type.startswith("Gradient Boosted")]

# Check for GBT model. If none, make one.
if gbt_models:
    # Get most accurate one on validation set
    gbt_model = gbt_models[0]

else:
    # Pull models (will usually be at least one blueprint with a scikit-learn GBT)
    gbt_bps = [
        x
        for x in project.get_blueprints()
        if x.model_type.startswith("Gradient Boosted")
    ]

    # Get first one
    gbt_bp = gbt_bps[0]

    # Train it
    gbt_model = wait_for_async_model_creation(
        project_id=project.id,
        model_job_id=project.train(gbt_bp),
        max_wait=60 * 60 * 24,
    )

gbt_model
Out [28]:

Model('Gradient Boosted Greedy Trees Regressor with Early Stopping (Least-Squares Loss)')
In [29]:

# Train it on all the data
model_job_id = gbt_model.train(
    sample_pct=100,
    featurelist_id=gbt_model.featurelist_id,
)
gbt_model_100 = wait_for_async_model_creation(
    project_id=project.id, model_job_id=model_job_id, max_wait=60 * 60 * 24
)
gbt_model_100
Out [29]:

Model('Gradient Boosted Greedy Trees Regressor with Early Stopping (Least-Squares Loss)')

In [30]:

# Train quantile models
quantile_models = {lower_coverage_q: None, upper_coverage_q: None}

# Tune main keras model
for q in quantile_models.keys():
    # Start
    tune = gbt_model_100.start_advanced_tuning_session()

    # Set loss and level
    tune.set_parameter(
        task_name=gbt_model_100.model_type, parameter_name="loss", value="quantile"
    )
    tune.set_parameter(
        task_name=gbt_model_100.model_type, parameter_name="alpha", value=q
    )

    # Save job
    quantile_models[q] = tune.run()

# Wait and get resulting models
for q in quantile_models.keys():
    quantile_models[q] = quantile_models[q].get_result_when_complete(
        max_wait=60 * 60 * 24
    )

quantile_models
Out [30]:

{0.05: Model('Gradient Boosted Greedy Trees Regressor with Early Stopping (Quantile Loss)'),
 0.95: Model('Gradient Boosted Greedy Trees Regressor with Early Stopping (Quantile Loss)')}
In [31]:

# Compute training predictions
for q in quantile_models.keys():
    df_train = df_train.join(
        compute_training_predictions(model=quantile_models[q])
        .rename(columns={"prediction": f"quantile_prediction_{q}"})
        .set_index("row_id")
        .drop(columns=["partition_id"], axis=1)
    )

# Check
df_train[
    [
        target_column,
        f"quantile_prediction_{lower_coverage_q}",
        f"quantile_prediction_{upper_coverage_q}",
    ]
]

Out[31]:

game_scorequantile_prediction_0.05quantile_prediction_0.95
00.0-0.6617017.367953
10.0-0.10951016.625219
221.63.75237324.534147
34.40.52144718.209039
426.71.36721326.972765
799418.44.88263235.834840
799512.90.87948818.311118
799619.31.23530023.512004
79979.85.62211432.164047
79988.1-0.04649319.430948
7999 rows × 3 columns
In [32]:

# Making prediction on test data
quantile_models_test_predict = quantile_models.copy()
for q in quantile_models.keys():
    quantile_models_test_predict[q] = quantile_models[q].request_predictions(
        dataset_id=pred_dataset.id
    )

# Joining the results
for q in quantile_models.keys():
    df_test = df_test.join(
        quantile_models_test_predict[q]
        .get_result_when_complete(max_wait=60 * 60 * 24)
        .rename(columns={"prediction": f"quantile_prediction_{q}"})
        .set_index("row_id")
    )

# Check
df_test[
    [
        target_column,
        f"quantile_prediction_{lower_coverage_q}",
        f"quantile_prediction_{upper_coverage_q}",
    ]
]

Out[32]:

game_scorequantile_prediction_0.05quantile_prediction_0.95
05.40.91827725.233671
116.51.48829125.160815
27.20.31511724.488930
323.80.86442723.131123
40.00.23983821.055298
199517.0-0.11318923.151948
199625.05.38651823.425884
1997-1.2-1.63187722.678644
199815.72.10761523.573162
199914.33.64408222.918123
2000 rows × 3 columns
In [33]:

# Implement quantile rearrangement
q_crossing_train = (
    df_train[f"quantile_prediction_{lower_coverage_q}"]
    > df_train[f"quantile_prediction_{upper_coverage_q}"]
).sum()
q_crossing_test = (
    df_test[f"quantile_prediction_{lower_coverage_q}"]
    > df_test[f"quantile_prediction_{upper_coverage_q}"]
)
print(
    f"Number of rows with quantile crossing in training set (before rearrangement): {q_crossing_train}"
)
print(
    f"Number of rows with quantile crossing in testing set (before rearrangement): {q_crossing_test}"
)

# Capture quantile columns
quantile_pred_cols = [x for x in df_train.columns if x.startswith("quantile")]

# On training set
df_train = df_train.drop(quantile_pred_cols, axis=1).join(
    quantile_rearrangement(
        test_preds=df_train[
            [
                f"quantile_prediction_{lower_coverage_q}",
                f"quantile_prediction_{upper_coverage_q}",
            ]
        ],
        quantile_low=lower_coverage_q,
        quantile_high=upper_coverage_q,
    )
)

# On testing set
df_test = df_test.drop(quantile_pred_cols, axis=1).join(
    quantile_rearrangement(
        test_preds=df_test[
            [
                f"quantile_prediction_{lower_coverage_q}",
                f"quantile_prediction_{upper_coverage_q}",
            ]
        ],
        quantile_low=lower_coverage_q,
        quantile_high=upper_coverage_q,
    )
)

# Check again
q_crossing_train = (
    df_train[f"quantile_prediction_{lower_coverage_q}"]
    > df_train[f"quantile_prediction_{upper_coverage_q}"]
).sum()
q_crossing_test = (
    df_test[f"quantile_prediction_{lower_coverage_q}"]
    > df_test[f"quantile_prediction_{upper_coverage_q}"]
)
print(
    f"Number of rows with quantile crossing in training set (after rearrangement): {q_crossing_train}"
)
print(
    f"Number of rows with quantile crossing in testing set (after rearrangement): {q_crossing_test}"
)
Number of rows with quantile crossing in training set (before rearrangement): 7
Number of rows with quantile crossing in testing set (before rearrangement): 0       False
1       False
2       False
3       False
4       False
        ...  
1995    False
1996    False
1997    False
1998    False
1999    False
Length: 2000, dtype: bool
Number of rows with quantile crossing in training set (after rearrangement): 0
Number of rows with quantile crossing in testing set (after rearrangement): 
0       False
1       False
2       False
3       False
4       False
        ...  
1995    False
1996    False
1997    False
1998    False
1999    False
Length: 2000, dtype: bool
In [34]:

# Now we compute our conformity score and take the quantile
E_cqr = np.maximum(
    df_train[f"quantile_prediction_{lower_coverage_q}"] - df_train[target_column],
    df_train[target_column] - df_train[f"quantile_prediction_{upper_coverage_q}"],
)
E_cqr_q = E_cqr.quantile(COVERAGE_LEVEL)
E_cqr_q
In [34]:

1.7189887578306196
In [35]:

# Create the prediction intervals
df_test["method_4_lower"] = df_test[f"quantile_prediction_{lower_coverage_q}"] - E_cqr_q
df_test["method_4_upper"] = df_test[f"quantile_prediction_{upper_coverage_q}"] + E_cqr_q
In [36]:

# Extend to make sure the prediction is inside the interval
df_test["method_4_lower"] = df_test[["method_4_lower", "prediction"]].min(axis=1)
df_test["method_4_upper"] = df_test[["method_4_upper", "prediction"]].max(axis=1)
In [37]:

# Compute coverage / width
method_4_coverage = compute_coverage(
    y_test=df_test[target_column].values,
    y_lower=df_test["method_4_lower"].values,
    y_upper=df_test["method_4_upper"].values,
    significance=1 - COVERAGE_LEVEL,
    name="Conformalized Quantile Regression",
)
Conformalized Quantile Regression: Percentage in the range (expecting 90.00): 89.700000
Conformalized Quantile Regression: Average width: 21.706370

9. Comparing methods

Below you can see that the more advanced methods (i.e., “locally-weighted conformal” and “conformalized quantile regression”) yield similar coverage rates while producing smaller prediction intervals on average. Notably, this is just one dataset, and it’s suggested to empirically experiment with your own data to find the best method for your use case.

In [38]:

# Organize
summary = pd.DataFrame(
    {
        "Coverage": [
            method_1_coverage[0],
            method_2_coverage[0],
            method_3_coverage[0],
            method_4_coverage[0],
        ],
        "Average Width": [
            method_1_coverage[1],
            method_2_coverage[1],
            method_3_coverage[1],
            method_4_coverage[1],
        ],
        "Method": [
            "Absolute Conformal",
            "Signed Conformal",
            "Locally-Weighted Conformal",
            "Conformalized Quantile Regression",
        ],
    }
)
summary

Out[38]:

CoverageAverage WidthMethod
089.022.056863Absolute Conformal
188.922.191478Signed Conformal
289.821.710734Locally-Weighted Conformal
389.721.706370Conformalized Quantile Regression

10. Conclusion

This notebook demonstrates how one could build prediction intervals for any DataRobot model using methods derived from the conformal inference space. Conformal inference is a popular framework to use for generating such prediction intervals because they don’t require strict distributional assumptions to achieve the desired coverage, so long as the testing data is exchangeable with the training data. This characteristic was confirmed in the analysis done here. Because each approach offers different pros and cons, it’s worthwhile to use this AI Accelerator as a starting point for your own experiments to decide which one to implement for your use case. DataRobot offers the ability to easily implement each of these methods, even for the more advanced techniques. For more information on the topic of conformal inference, see the following introductory paper.

Get Started with Free Trial

Experience new features and capabilities previously only available in our full AI Platform product.

Get Started with Prediction Intervals

Explore more AI Accelerators