Part 10: Producing banded factor tables

Continuous feature curves are good for diagnostics, but factor tables require discrete bands. Actuarial convention and rating system constraints both demand breakpoints.

The key principle: band the SHAP values; do not band the feature before modelling. The model was trained on continuous driver_age, so you cannot pass it age_band instead. What you can do is take the SHAP values for driver_age, which have already been computed for each observation, and aggregate them by your chosen age bands.
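To make the aggregation concrete, here is a minimal sketch on made-up numbers, using only the standard library. It assumes the SHAP values are on the log scale, as the exponentiation later in this part implies. The rows and the helper names weighted_mean_shap and band_relativity are illustrative and are not part of the tutorial dataset or of any library.

```python
import math

# Toy per-policy rows (made-up numbers, not the tutorial dataset):
# (age band, SHAP value for driver_age on the log scale, exposure in years)
rows = [
    ("17-21", 0.60, 0.5),
    ("17-21", 0.50, 1.0),
    ("30-39", 0.02, 1.0),
    ("30-39", -0.02, 1.0),
]


def weighted_mean_shap(rows, band):
    """Exposure-weighted mean SHAP value for one band."""
    num = sum(shap * expo for b, shap, expo in rows if b == band)
    den = sum(expo for b, _, expo in rows if b == band)
    return num / den


def band_relativity(rows, band, base_band):
    """Relativity = exp(mean SHAP of the band - mean SHAP of the base band)."""
    diff = weighted_mean_shap(rows, band) - weighted_mean_shap(rows, base_band)
    return math.exp(diff)


young_vs_base = band_relativity(rows, "17-21", "30-39")
print(f"17-21 relativity vs 30-39: {young_vs_base:.3f}")
```

The model never sees the bands. They are applied only when the per-observation SHAP values are summarised, so changing the breakpoints does not require retraining.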

In a new cell, type this and run it (Shift+Enter):

# Define age bands - round numbers that are defensible to the committee
age_breaks = [17, 22, 25, 30, 40, 55, 70, 86]
age_labels  = ["17-21", "22-24", "25-29", "30-39", "40-54", "55-69", "70+"]

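# Add age_band to the Polars DataFrame.
# cut() builds right-closed intervals by default, which would put age 22 in
# "17-21". left_closed=True makes each band start at its breakpoint, so the
# bands match the labels: [17, 22), [22, 25), ..., [70, inf).
df_banded = df.with_columns(
    pl.col("driver_age").cut(
        breaks=age_breaks[1:-1],
        labels=age_labels,
        left_closed=True,
    ).alias("age_band")
)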

print("Age band distribution:")
print(
    df_banded.group_by("age_band")
    .agg(
        pl.len().alias("n_obs"),
        pl.col("exposure").sum().alias("total_exposure"),
        pl.col("claim_count").sum().alias("claims"),
    )
    .with_columns((pl.col("claims") / pl.col("total_exposure")).alias("observed_freq"))
    .sort("age_band")
)

You will see a table showing the policy count, exposure, claims, and observed frequency for each age band. Verify that no band has fewer than 500 policies: sparse bands produce unreliable relativities.
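The two checks at this step (are boundary ages landing in the band their label implies, and is any band too sparse?) can be sketched without Polars. This is a minimal illustration using only the standard library. The helper names assign_band and sparse_bands are hypothetical, and the 500-policy threshold is the rule of thumb stated above.

```python
from bisect import bisect_right
from collections import Counter

inner_breaks = [22, 25, 30, 40, 55, 70]  # same values as age_breaks[1:-1]
labels = ["17-21", "22-24", "25-29", "30-39", "40-54", "55-69", "70+"]


def assign_band(age):
    """Left-closed banding: an age equal to a breakpoint starts the next band."""
    return labels[bisect_right(inner_breaks, age)]


def sparse_bands(ages, min_policies=500):
    """Return {band: count} for every band with fewer than min_policies."""
    counts = Counter(assign_band(a) for a in ages)
    return {b: counts.get(b, 0) for b in labels if counts.get(b, 0) < min_policies}


# Boundary ages: 21 is the last age in "17-21", 22 the first in "22-24".
for age in (21, 22, 69, 70):
    print(age, "->", assign_band(age))
```

Calling sparse_bands on a list of driver ages returns an empty dict when every band meets the threshold. If it does not, merging the sparse band with a neighbour is usually easier to defend than publishing a relativity based on very little data.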

Now extract the per-observation SHAP values and aggregate by age band. In a new cell, type this and run it (Shift+Enter):

shap_vals    = sr.shap_values()            # numpy array, shape (100_000, n_features)
feature_names = sr.feature_names_          # list matching the SHAP columns

age_idx = feature_names.index("driver_age")
age_shap = shap_vals[:, age_idx]

# Build a Polars frame: age_band, age SHAP value, exposure
shap_frame = pl.DataFrame({
    "age_band": df_banded["age_band"].to_list(),
    "age_shap": age_shap.tolist(),
    "exposure":  df["exposure"].to_list(),
})

# Exposure-weighted mean SHAP per band
band_stats = shap_frame.group_by("age_band").agg([
    (pl.col("age_shap") * pl.col("exposure")).sum().alias("weighted_shap_sum"),
    pl.col("exposure").sum().alias("total_exposure"),
    pl.col("exposure").count().alias("n_obs"),
    pl.col("age_shap").std().alias("shap_std"),
]).with_columns(
    (pl.col("weighted_shap_sum") / pl.col("total_exposure")).alias("mean_shap")
)

# Base level: 30-39 (a low-risk mid-range band; its relativity is fixed at 1.0)
base_shap = band_stats.filter(pl.col("age_band") == "30-39")["mean_shap"][0]

band_rels = band_stats.with_columns(
    (pl.col("mean_shap") - base_shap).exp().alias("relativity")
).sort("age_band")

print("Age band relativities (base: 30-39):")
print(band_rels.select(["age_band", "relativity", "n_obs", "total_exposure"]))

You will see:

Age band relativities (base: 30-39):
shape: (7, 4)
┌──────────┬────────────┬───────┬────────────────┐
│ age_band ┆ relativity ┆ n_obs ┆ total_exposure │
╞══════════╪════════════╪═══════╪════════════════╡
│ 17-21    ┆ 1.823      ┆ 4987  ┆ 3905.1         │
│ 22-24    ┆ 1.421      ┆ 3519  ┆ 2794.3         │
│ 25-29    ┆ 1.178      ┆ 7103  ┆ 5661.4         │
│ 30-39    ┆ 1.000      ┆ 18241 ┆ 14538.2        │
│ 40-54    ┆ 0.988      ┆ 24803 ┆ 19758.9        │
│ 55-69    ┆ 1.042      ┆ 21374 ┆ 17023.1        │
│ 70+      ┆ 1.187      ┆ 19973 ┆ 15893.6        │
└──────────┴────────────┴───────┴────────────────┘

The 17-21 band should show a relativity well above 1.0, and the 70+ band a milder uplift. The true data-generating process (DGP) adds +0.55 on the log scale for drivers under 25 and +0.20 for drivers over 70, giving exp(0.55) ≈ 1.73 and exp(0.20) ≈ 1.22. Your extracted relativities should be in that neighbourhood.
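A quick way to judge "in that neighbourhood" is to compute the targets and the percentage gap for the two bands that map most directly onto the DGP effects. This is a minimal sketch: the extracted values are copied from the example output above, so yours will differ slightly, and pct_gap is a hypothetical helper name.

```python
import math

# Log-scale age effects quoted above for the data-generating process
dgp_effects = {"under_25": 0.55, "over_70": 0.20}
targets = {name: math.exp(effect) for name, effect in dgp_effects.items()}

# Extracted relativities copied from the example output (base: 30-39)
extracted = {"17-21": 1.823, "70+": 1.187}


def pct_gap(extracted_rel, target_rel):
    """Percentage by which the extracted relativity differs from the target."""
    return 100.0 * (extracted_rel / target_rel - 1.0)


for band, name in [("17-21", "under_25"), ("70+", "over_70")]:
    gap = pct_gap(extracted[band], targets[name])
    print(f"{band}: extracted {extracted[band]:.3f}, "
          f"target {targets[name]:.3f}, gap {gap:+.1f}%")
```

The comparison is approximate by construction. The extracted relativities are measured against the 30-39 band rather than against a zero-effect baseline, and the bands do not line up exactly with the DGP's age thresholds.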