Your Model Validation Is a Checklist, Not a Test

Most UK pricing teams treat model validation as a documentation exercise. Run the Gini. If it is above the threshold someone wrote in a PowerPoint three years ago, it passes. Write “model performs well on validation data” in the paper and move on. The PRA disagrees with this approach. SS1/23, the Supervisory Statement on model risk management in force since May 2023, requires that validation tests have defined quantitative thresholds with documented pass/fail outcomes, and that those outcomes are reproducible and auditable. A narrative saying “discrimination is adequate” does not satisfy Principle 3. A TestResult(passed=True, metric_value=0.412, details="Gini 0.412, bootstrap 95% CI [0.389, 0.435], threshold 0.30. PASS") does.

insurance-governance automates the full SS1/23 validation suite: Gini with bootstrap CI, Hosmer-Lemeshow goodness-of-fit, PSI on score distributions, actual-vs-expected by decile with Poisson CI, calibration slope, and a risk-tier scorecard. It outputs a self-contained HTML report plus a JSON sidecar. The entire suite runs in under five seconds on a 50,000-row validation set.

pip install insurance-governance

The problem with checklist validation

SS1/23 has nine validation principles. Principle 3 (performance measurement) and Principle 4 (outcome analysis) are where most teams are most exposed. A typical pricing team’s validation process looks like:

Gini: computed, compared against a threshold inherited from a previous model
Lift chart: produced, eyeballed, declared “satisfactory”
A/E ratio: computed at the aggregate level, possibly by peril
Hosmer-Lemeshow: rarely done, because nobody remembers how to set up the test correctly for a Poisson model
PSI: sometimes done, with no defined threshold for failure
Calibration slope: almost never done

None of these is wrong. The Gini is a real metric. But “Gini = 0.41, above our threshold of 0.30” is a checklist item, not a test. A test has a null hypothesis, a rejection criterion, and a confidence level. A test produces a number that either passes or fails, not a number that someone interprets. And critically, a test is reproducible: running the same code on the same data tomorrow gives the same result.

That reproducibility matters to the PRA. If a supervisory visit asks “show me your validation results for the current motor frequency model”, the answer should not be “let me find the notebook Brian ran in March”. It should be a committed HTML file with a run ID, a generation date, and JSON that can be ingested into your MRM system.

Step 1: fit a CatBoost frequency model on synthetic motor data

We use synthetic data throughout. The insurance-governance library does not ship a dataset loader, so we generate one with Polars and NumPy. The pattern matches a realistic UK private motor portfolio: 50,000 policies, Poisson frequency around 8%, with age, vehicle age, area, and NCD as predictors.

import numpy as np
import polars as pl
from catboost import CatBoostRegressor

rng = np.random.default_rng(42)
n = 50_000

# Synthetic UK motor portfolio
df = pl.DataFrame({
    "driver_age":    rng.integers(17, 80, n),
    "vehicle_age":   rng.integers(0, 15, n),
    "ncd_years":     rng.integers(0, 9, n),
    "area":          rng.choice(["urban", "suburban", "rural"], n),
    "exposure":      rng.uniform(0.5, 1.0, n),
})

# True frequency: younger drivers, urban area, high vehicle age -> higher risk
log_mu = (
    -2.5
    - 0.015 * df["driver_age"].to_numpy()
    + 0.04  * df["vehicle_age"].to_numpy()
    - 0.08  * df["ncd_years"].to_numpy()
    + np.where(df["area"].to_numpy() == "urban", 0.35, 0.0)
    + np.where(df["area"].to_numpy() == "suburban", 0.15, 0.0)
)
freq = np.exp(log_mu)
claims = rng.poisson(freq * df["exposure"].to_numpy())
df = df.with_columns(pl.Series("claims", claims))

# 60/20/20 temporal split (simulate calendar-year ordering)
idx = np.arange(n)
train_idx = idx[:30_000]
val_idx   = idx[30_000:40_000]
test_idx  = idx[40_000:]

feature_cols = ["driver_age", "vehicle_age", "ncd_years"]
cat_cols     = ["area"]

X_train = df[train_idx].select(feature_cols + cat_cols).to_pandas()
X_val   = df[val_idx].select(feature_cols + cat_cols).to_pandas()
y_train = df[train_idx]["claims"].to_numpy().astype(float)
y_val   = df[val_idx]["claims"].to_numpy().astype(float)
exp_train = df[train_idx]["exposure"].to_numpy()
exp_val   = df[val_idx]["exposure"].to_numpy()

model = CatBoostRegressor(
    iterations=300,
    learning_rate=0.05,
    depth=5,
    loss_function="Poisson",
    cat_features=cat_cols,
    verbose=0,
)
model.fit(X_train, y_train / exp_train, sample_weight=exp_train)

y_pred_train = model.predict(X_train) * exp_train
y_pred_val   = model.predict(X_val)   * exp_val

The model predicts expected claim counts. We pass both training and validation predictions into the validation suite so PSI can compare score distributions.

Step 2: run the full SS1/23 validation suite

The ModelValidationReport facade runs every test in one call. Pass training data to get PSI; pass feature DataFrames to get data quality and feature drift; pass a named monitoring owner to satisfy Principle 5.

from insurance_governance import ModelValidationReport, ValidationModelCard

card = ValidationModelCard(
    name="Motor Frequency v4.1",
    version="4.1.0",
    purpose="Predict claim frequency for UK private motor portfolio",
    methodology="CatBoost gradient boosting with Poisson objective, 300 trees",
    target="claim_count",
    features=feature_cols + cat_cols,
    limitations=["No telematics data", "Synthetic training data"],
    owner="Pricing Team",
)

X_train_pl = df[train_idx].select(feature_cols + cat_cols)
X_val_pl   = df[val_idx].select(feature_cols + cat_cols)

report = ModelValidationReport(
    model_card=card,
    y_val=y_val,
    y_pred_val=y_pred_val,
    exposure_val=exp_val,
    y_train=y_train,
    y_pred_train=y_pred_train,
    exposure_train=exp_train,
    X_train=X_train_pl,
    X_val=X_val_pl,
    monitoring_owner="Head of Pricing",
    monitoring_triggers={"psi": 0.25, "ae_ratio": 1.20},
    random_state=42,
)

# Write HTML report + JSON sidecar
report.generate("motor_freq_v41_validation.html")
report.to_json("motor_freq_v41_validation.json")

# Inspect results programmatically
results = report.get_results()
rag = report.get_rag_status()
print(f"Overall RAG: {rag.value}")

Overall RAG: Green

The rag_status field reflects the worst-severity failure across all tests. Green means no test failed at Warning or Critical severity. The HTML report is self-contained - no CDN, no external JS - and includes all test results with their thresholds, a Lorenz curve, a lift chart, and the monitoring plan.

Step 3: inspect the individual test results

The tests return TestResult objects with passed, metric_value, details, and severity. Interrogating them directly is useful for CI pipelines.

for r in results:
    status = "PASS" if r.passed else "FAIL"
    val    = f"{r.metric_value:.4f}" if r.metric_value is not None else "—"
    print(f"{status}  {r.test_name:<40} {val}  {r.details[:60]}")

PASS  gini_coefficient                         0.4124  Gini 0.412, threshold 0.30. PASS
PASS  gini_bootstrap_ci                        0.4124  Gini 0.412 [0.391, 0.432] bootstrap 95% CI. T
PASS  hosmer_lemeshow_test                     0.3218  HL chi2=11.42, df=8, p=0.322. Cannot reject ca
PASS  psi_score                                0.0412  PSI 0.041. Stable (< 0.10). Score distribution
PASS  actual_vs_expected                       0.9873  A/E 0.987 across 10 deciles. All within Poisso
PASS  ae_with_poisson_ci                       0.9873  A/E 0.987, exact 95% Poisson CI [0.961, 1.014]
PASS  lorenz_curve                             —       Lorenz curve computed.
PASS  calibration_plot_data                    —       Calibration data computed.
PASS  monitoring_plan                          —       Monitoring plan: Owner = Head of Pricing. Rev

Every test that can fail has a documented threshold. The Gini threshold is 0.30 - a model that cannot beat random on a motor portfolio needs rebuilding, not validating. The HL p-value threshold is 0.05: a p-value below 0.05 means the calibration by decile is statistically poor. PSI above 0.25 triggers a warning and flags that the score distribution has shifted materially since training. The A/E Poisson CI check flags whether aggregate actual-vs-expected is within the exact confidence interval for the observed claim count.

These thresholds are defaults. Override them by passing extra_results with custom TestResult objects, or by subclassing PerformanceReport directly.

Step 4: risk-tier scoring for the MRM pack

SS1/23 Principle 1 requires that models are classified by materiality and complexity, with review frequency and sign-off authority tied to the tier. RiskTierScorer makes this objective and auditable.

from insurance_governance import MRMModelCard, RiskTierScorer, GovernanceReport

mrm_card = MRMModelCard(
    model_id="motor-freq-v41",
    model_name="Motor TPPD Frequency",
    version="4.1.0",
    model_class="pricing",
    intended_use="Frequency pricing for UK private motor, all distribution channels.",
)

scorer = RiskTierScorer()
tier = scorer.score(
    gwp_impacted=180_000_000,   # £180m GWP
    model_complexity="high",
    deployment_status="champion",
    regulatory_use=False,
    external_data=False,
    customer_facing=True,
)

print(f"Tier: {tier.tier}  ({tier.label})")
print(f"Score: {tier.score}/100")
print(f"Review frequency: {tier.review_frequency}")
print(f"Sign-off: {tier.sign_off_authority}")

Tier: 1  (Critical)
Score: 73/100
Review frequency: Annual
Sign-off: Model Risk Committee

The scorer uses six dimensions: GWP materiality (25 pts), model complexity (20 pts), data quality and external data use (10 pts), validation coverage (10 pts), drift history (10 pts), and regulatory exposure including deployment status and customer-facing flag (25 pts). A score of 60+ is Tier 1. At £180m GWP, high complexity, champion deployment, and customer-facing: this model scores 73 and goes to the MRC annually. That is not a judgement call - it is a formula, and the formula is in the code, not in a governance policy document that three people interpret differently.

Generate the full governance pack in one call:

GovernanceReport(card=mrm_card, tier=tier).save_html("motor_freq_v41_mrm.html")

The HTML pack covers model purpose, risk tier rationale, last validation RAG (Green), assumptions register, outstanding issues, approval conditions, and next review date. It is print-to-PDF ready. Print it, sign it, file it.

What this does not do

insurance-governance automates the tests. It does not automate the judgement. A Gini of 0.31 passes the default threshold of 0.30, but it is a poor model for UK motor and a validator who knows the portfolio will say so. The HTML report gives them everything they need to make that call - all test results, CIs, charts - but the call is still theirs.

The library also does not cover reserving or capital models. The scope is explicitly pricing. For those, you need a different framework entirely.

The practical workflow: run ModelValidationReport as part of your model release process, commit the HTML and JSON to your model repository, register the run in ModelInventory, and update the MRMModelCard with the new validation date. When the PRA visits, you have an auditable trail of every validation run for every model in the portfolio, going back to whenever you started using the library. That is what SS1/23 actually requires.

insurance-governance is open source under MIT at github.com/burning-cost/insurance-governance. Requires Python 3.10+, NumPy, SciPy, and Polars.

Your Pricing Model Is Drifting - monitoring model performance and detecting when a refit is overdue
Tracking Trend Between Model Updates with GAS Filters - between-refit trend tracking with a principled statistical foundation
Per-Risk Volatility Scoring with Distributional GBMs - uncertainty quantification at the policy level, complementary to portfolio-level validation

validationpra-ss123governanceginihosmer-lemeshowpsicalibrationcatboostpolarsinsurance-governancemotoruk-pricingpythonmodel-risk

Back to all articles