Most pricing tutorials show you one thing: how to fit a GLM, or how to train a GBM, or how to calculate a Gini coefficient. This is not that. This is the full workflow — from raw data to governance sign-off — using seven libraries that were built to work together. If you are a pricing actuary who wants to see what a properly instrumented model pipeline looks like before you commit to building one yourself, this is the page.

The full runnable script is end-to-end-demo.py. It runs on Databricks Free Edition, single-node Python, no Spark. Runtime is roughly 3–5 minutes.

```bash
pip install insurance-datasets "shap-relativities[ml]" "insurance-distill[catboost]" \
    "insurance-conformal[catboost]" insurance-fairness \
    insurance-monitoring insurance-governance
```

Step 1

Data you can actually test against

The hardest problem in developing pricing models is that you never know if your implementation is right. Real policyholder data has unknown true coefficients. You can fit a GLM and get plausible-looking numbers, but you cannot tell whether those numbers are correct or whether they are absorbing some omitted variable.

insurance-datasets gives you synthetic UK motor data where the true parameters are published. The frequency model is Poisson with a log-linear predictor; the exact coefficients are in MOTOR_TRUE_FREQ_PARAMS. If the true coefficient for ncd_years is -0.12 and you fitted on 50,000 policies and got -0.125, your implementation is correct. If you got -0.18, something is wrong with your specification.

```python
from insurance_datasets import load_motor, MOTOR_TRUE_FREQ_PARAMS

df = load_motor(n_policies=50_000, seed=42)
print(MOTOR_TRUE_FREQ_PARAMS)
# {'intercept': -3.2, 'vehicle_group': 0.025, 'ncd_years': -0.12, ...}
```
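The recovery check itself needs nothing beyond a GLM fitter: simulate claim counts from a known log-linear rate and confirm the fitted coefficient lands near the truth. A minimal sketch with numpy and scikit-learn standing in for load_motor (the single-factor setup is illustrative, not the library's full DGP):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(42)
n = 50_000

# One rating factor with a known true effect, mirroring ncd_years = -0.12
ncd_years = rng.integers(0, 10, size=n)
true_intercept, true_beta = -3.2, -0.12
lam = np.exp(true_intercept + true_beta * ncd_years)
claims = rng.poisson(lam)

# Fit an (almost) unpenalised Poisson GLM and compare against the truth
glm = PoissonRegressor(alpha=1e-12, max_iter=1000)
glm.fit(ncd_years.reshape(-1, 1), claims)
print(round(glm.coef_[0], 3))  # should sit close to -0.12
```

With 50,000 policies the standard error on the slope is around 0.01, so an estimate of -0.18 is many standard errors from the truth and signals a specification error, not noise.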

The split is temporal, not random. Train on 2019–2022, validate on 2023. Insurance data is not exchangeable across time — validating on a random shuffle inflates your Gini and you will not find out until the model goes live.
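A minimal sketch of the temporal split, assuming the policy frame carries a year column (frame and column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022, 2023, 2023],
    "claim_count": [0, 1, 0, 0, 2, 0],
})

# Train on 2019-2022, validate on 2023 -- never a random shuffle
train = df[df["year"] <= 2022]
val = df[df["year"] == 2023]
assert len(train) + len(val) == len(df)
```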


Step 2

Rating factors from a GBM

The standard complaint about GBMs in pricing is that you cannot get the factor table out. The regulator wants exp(beta) — a number you can defend in a rate filing. The GBM gives you a black box.

shap-relativities closes that gap. For a Poisson model with log link, SHAP values are additive in log space. That means you can compute the exposure-weighted mean SHAP value for each level of each feature and exponentiate the difference — which is exactly the same operation as computing exp(beta_k - beta_ref) from a GLM coefficient.
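The arithmetic is small enough to show by hand. Given per-policy SHAP values in log space for one feature, the relativity for a level is the exponential of the exposure-weighted mean SHAP at that level minus the mean at the base level (the SHAP values below are made up for illustration; this is a hand-rolled sketch, not the library's internals):

```python
import numpy as np

# Per-policy SHAP values (log space) for one feature, with levels and exposure
shap_vals = np.array([0.00, 0.02, 0.15, 0.18, -0.05])
levels = np.array([0, 0, 1, 1, 2])
exposure = np.array([1.0, 0.5, 1.0, 1.0, 0.8])

def relativity(level, base=0):
    def wmean(lvl):
        mask = levels == lvl
        return np.average(shap_vals[mask], weights=exposure[mask])
    # exp(mean_shap_level - mean_shap_base), the analogue of exp(beta_k - beta_ref)
    return np.exp(wmean(level) - wmean(base))

print(round(relativity(1), 3))
```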

```python
from shap_relativities import SHAPRelativities

sr = SHAPRelativities(
    model=gbm,
    X=X_train,
    exposure=train["exposure"],
    categorical_features=["area_code", "ncd_years", "has_convictions", "vehicle_group"],
)
sr.fit()
rels = sr.extract_relativities(
    normalise_to="base_level",
    base_levels={"area_code": 0, "ncd_years": 0, "has_convictions": 0, "vehicle_group": 1},
)
```

The output is one row per feature level, with columns for the relativity and its lower and upper confidence bounds, which is the same format your rating engine expects for a factor table import.

Always call sr.validate() before trusting the output. The reconstruction check verifies that the SHAP decomposition is numerically exact. If it fails, you have a mismatch between the model objective and the SHAP output type.
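The property being checked is additivity: the base value plus the sum of per-feature SHAP contributions must reproduce the model's log prediction for every row. A sketch of the same test on toy numbers (names and tolerance are illustrative, not the library's internals):

```python
import numpy as np

# Toy decomposition: base value plus per-feature SHAP contributions (log space)
base_value = -3.2
shap_matrix = np.array([[0.10, -0.05], [0.02, 0.08]])   # rows: policies
log_pred = np.array([-3.15, -3.10])                     # model output, log space

# Reconstruction must be numerically exact up to float tolerance
recon = base_value + shap_matrix.sum(axis=1)
assert np.allclose(recon, log_pred, atol=1e-6)
```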


Step 3

A GLM the rating engine can load

SHAP relativities are descriptive. You can inspect them and export them as a CSV, but you cannot load a GBM into Radar or Emblem. If your rating engine requires a multiplicative GLM, you need distillation.

insurance-distill fits a Poisson GLM using the GBM's predictions as the target rather than the raw claims. This matters: fitting on GBM pseudo-predictions eliminates the noise from individual claim events. The GBM has already smoothed that away. The result is a surrogate GLM that retains 90–97% of the GBM's Gini coefficient on typical UK motor books.

```python
from insurance_distill import SurrogateGLM

surrogate = SurrogateGLM(
    model=gbm,
    X_train=X_train,
    y_train=train["claim_count"].to_numpy(),
    exposure=train["exposure"].to_numpy(),
    family="poisson",
)
surrogate.fit(max_bins=10, method_overrides={"ncd_years": "isotonic"})
report = surrogate.report()
print(report.metrics.summary())
```

The isotonic override for ncd_years is deliberate. NCD discount should be monotone — more years of no claims means lower frequency, without exception. Isotonic regression enforces that constraint directly. The tree-based default would find the statistically optimal bins but might produce a non-monotone step at NCD=4 if the data happens to support one. That is not a result you can defend to a pricing committee.
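What the isotonic override buys can be seen with scikit-learn directly: a decreasing isotonic fit turns a noisy, non-monotone empirical frequency curve into a monotone one by pooling adjacent violators. A sketch on invented frequencies (not the library's implementation):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

ncd_years = np.arange(10)
# Empirical frequency with a non-monotone blip at NCD=4
emp_freq = np.array([0.120, 0.105, 0.095, 0.088, 0.091, 0.080,
                     0.074, 0.070, 0.066, 0.063])

iso = IsotonicRegression(increasing=False)
smooth = iso.fit_transform(ncd_years, emp_freq)

# The fitted curve is monotone non-increasing -- defensible to a committee
assert np.all(np.diff(smooth) <= 0)
```

The blip at NCD=4 gets averaged with NCD=3; every other level is left untouched.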


Step 4

Prediction intervals that actually hold

Point estimates are not enough for pricing decisions involving large limits, reinsurance attachment points, or capital allocation. You need uncertainty quantification. The question is whether your intervals actually achieve their stated coverage levels.

Parametric intervals for insurance data typically fail in the high-risk tail because they assume every risk shares the same coefficient of variation. High-risk policies — young drivers, high vehicle groups — have genuinely higher dispersion than the parametric model allows. The result: the model gives low-risk policies unnecessarily wide intervals and barely meets the coverage target for top-decile risks.

Conformal prediction is distribution-free. The coverage guarantee holds regardless of the true data distribution and regardless of model misspecification. The pearson_weighted score accounts for the Var(Y) ~ mu relationship in Poisson data, which is why intervals are 13% narrower than the parametric alternative without sacrificing coverage.

```python
from insurance_conformal import InsuranceConformalPredictor

cp = InsuranceConformalPredictor(
    model=gbm,
    nonconformity="pearson_weighted",
    distribution="tweedie",
    tweedie_power=1.0,
)
cp.calibrate(X_cal.to_pandas(), cal_mask["claim_count"].to_numpy())
intervals = cp.predict_interval(X_val.to_pandas(), alpha=0.10)
```

Two things matter for getting the coverage guarantee to hold. First, the calibration set must not overlap with training — we use the 2022 accident year, held out from the GBM training. Second, calibrate on data that is temporally close to the test period. Conformal prediction requires exchangeability, and a 2019 calibration set is not exchangeable with a 2023 test set after claims inflation.
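The calibration step itself is compact: score the held-out set, take the (1 - alpha) quantile with the finite-sample correction, and widen point predictions by that amount. A generic split-conformal sketch with numpy, using a plain absolute-residual score rather than pearson_weighted, purely to show the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
y_cal = rng.poisson(2.0, size=500).astype(float)
pred_cal = np.full(500, 2.0)

# Nonconformity scores on the calibration set (here: absolute residuals)
scores = np.abs(y_cal - pred_cal)

# Finite-sample-corrected quantile for 90% coverage
alpha = 0.10
n = len(scores)
q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))

# Symmetric intervals around the point predictions
pred_test = np.full(100, 2.0)
lower, upper = pred_test - q, pred_test + q
```

The ceil((n+1)(1-alpha))/n correction, rather than a plain 90th percentile, is what makes the finite-sample coverage guarantee hold.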


Step 5

Fairness audit before anything goes to production

The FCA's Consumer Duty (PS22/9) requires firms to evidence fair value by customer group. The FCA's 2024 thematic review found most insurers' fair value assessments were, in their words, "high-level summaries with little substance." Six Consumer Duty investigations followed.

The mechanism creating fair value failures in motor is proxy discrimination. Area band — which you almost certainly have in your model — correlates with ethnicity because urban postcodes are more diverse. You are not modelling ethnicity. But if area band is systematically correlated with ethnicity, and area band drives price, the Equality Act Section 19 prohibition on indirect discrimination is a live concern regardless of your intent.

insurance-fairness checks whether this is happening in your specific book:

```python
from insurance_fairness import FairnessAudit

audit = FairnessAudit(
    model=gbm,
    data=val_with_preds,
    protected_cols=["area_band"],
    prediction_col="predicted_rate",
    outcome_col="incurred",
    exposure_col="exposure",
    factor_cols=["area_band", "ncd_years", "has_convictions", "vehicle_group"],
    model_name="Motor Frequency GBM v1.0",
    run_proxy_detection=True,
)
report = audit.run()
report.to_markdown("fairness_audit.md")
```

The primary test is calibration by group. If the model's A/E ratio is 1.0 for urban and rural areas at every predicted pricing level, any premium differences reflect genuine risk differences and are defensible under the proportionality test. The proxy detection module flags factors where a CatBoost model can predict the protected characteristic with R-squared above 0.05 — the threshold where it is worth investigating further. It also runs mutual information scores to catch non-linear relationships that R-squared misses.
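The proxy-detection idea reduces to fitting protected ~ factor and checking R-squared against the threshold. A sketch on synthetic data with scikit-learn (the correlation strength is invented; only the 0.05 threshold comes from above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n = 5_000

# Synthetic book: a binary protected flag correlated with area band
area_band = rng.integers(1, 9, size=n)
protected = (rng.random(n) < 0.1 + 0.08 * area_band).astype(float)

# Can a model predict the protected flag from the rating factor alone?
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(area_band.reshape(-1, 1), protected)
r2 = r2_score(protected, tree.predict(area_band.reshape(-1, 1)))

flag = r2 > 0.05  # above threshold: investigate the factor as a proxy
print(round(r2, 3), flag)
```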


Step 6

Monitoring that catches drift before the loss ratio does

Deployed models go stale. The standard approach is to track aggregate A/E — total actual claims divided by total expected — and investigate if it moves outside a band. The problem is that errors cancel. A model that is 15% cheap on under-25s and 15% expensive on over-65s reads 1.00 at portfolio level and nobody raises an alarm.
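The cancellation is easy to demonstrate: two equal-exposure segments mispriced by 15% in opposite directions produce a portfolio A/E of exactly 1.00 (a toy numpy example):

```python
import numpy as np

# Two equal-exposure segments: model 15% cheap on one, 15% expensive on the other
actual = np.array([115.0, 85.0])     # claims by segment
expected = np.array([100.0, 100.0])  # model's expected claims

ae_by_segment = actual / expected           # [1.15, 0.85] -- both badly off
ae_portfolio = actual.sum() / expected.sum()
print(ae_by_segment, ae_portfolio)          # portfolio A/E reads exactly 1.0
```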

insurance-monitoring monitors the features, not just the headline number. PSI per rating factor detects distributional shift before it shows up in A/E. The Gini drift z-test — implemented from arXiv:2510.04556 — answers the question that A/E cannot: has the model's ranking degraded? This is the difference between a recalibration (hours of work) and a refit (weeks of work).
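PSI itself is a one-line statistic once exposure shares per level are tabulated: the sum over levels of (current share minus reference share) times the log of their ratio. A hand-rolled sketch (the 0.10/0.25 reading bands are the conventional ones, not necessarily the library's defaults):

```python
import numpy as np

# Share of exposure by vehicle_group level, reference vs current period
ref = np.array([0.30, 0.40, 0.20, 0.10])
cur = np.array([0.25, 0.35, 0.25, 0.15])

psi = np.sum((cur - ref) * np.log(cur / ref))

# Conventional reading: <0.10 stable, 0.10-0.25 investigate, >0.25 shifted
print(round(psi, 4))  # 0.0472 on this toy shift
```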

```python
from insurance_monitoring import MonitoringReport

monitoring = MonitoringReport(
    reference_actual=act_ref,
    reference_predicted=pred_ref,
    current_actual=act_cur,
    current_predicted=pred_cur,
    feature_df_reference=feat_ref,
    feature_df_current=feat_cur,
    features=["driver_age", "vehicle_group", "ncd_years", "area_code"],
    murphy_distribution="poisson",
)
print(monitoring.recommendation)
# 'NO_ACTION' | 'RECALIBRATE' | 'REFIT' | 'INVESTIGATE'
```

The Murphy decomposition sharpens the RECALIBRATE/REFIT decision. It decomposes miscalibration into a global component (fixed by multiplying all predictions by A/E — a recalibration) and a local component (requires rebuilding the model). If local MCB exceeds global MCB, the ranking is broken and you need a refit. A/E alone cannot tell you this.


Step 7

Governance sign-off

Before the model goes to production, you need documented validation evidence and a risk tier assessment. insurance-governance automates both.

ModelValidationReport runs the standard test suite — Gini with bootstrap confidence interval, lift chart, A/E by predicted decile with Poisson CI, Hosmer-Lemeshow goodness-of-fit, PSI — and produces a self-contained HTML report. Each test returns a TestResult with a pass/fail flag, severity level, and a human-readable detail string. The report is printable as a PDF and goes directly into the model validation pack.
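The Gini in that suite is the rank-based Lorenz-curve statistic: sort policies by predicted risk, accumulate actual claims, and measure the area between the curve and the diagonal. A hand-rolled sketch (normalisation conventions vary between implementations):

```python
import numpy as np

def gini(actual, predicted):
    # Sort by predicted risk, accumulate actual losses (Lorenz curve)
    order = np.argsort(predicted)
    lorenz = np.cumsum(actual[order]) / actual.sum()
    n = len(actual)
    # Twice the area between the Lorenz curve and the diagonal
    return 1 - 2 * np.sum(lorenz) / n + 1 / n

actual = np.array([0.0, 0.0, 1.0, 0.0, 2.0, 1.0])
predicted = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(round(gini(actual, predicted), 3))  # 0.417 on this toy book
```

Sorting by the actuals themselves gives the maximum attainable Gini for the book, which is why validation packs usually quote the normalised ratio of the two.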

RiskTierScorer assigns a risk tier from objective criteria: GWP impacted, model complexity, deployment status, regulatory use, external data, customer-facing flag. The 0–100 composite score has documented rules for every point. This removes the subjectivity from MRC presentations — you are not arguing about whether the model is "complex"; you are presenting a score with documented methodology.

```python
from insurance_governance import (
    ModelValidationReport,
    ValidationModelCard,
    MRMModelCard,
    RiskTierScorer,
    GovernanceReport,
)

val_report = ModelValidationReport(
    model_card=ValidationModelCard(name="Motor Frequency GBM v1.0", ...),
    y_val=act_cur,
    y_pred_val=pred_cur,
    exposure_val=val["exposure"].to_numpy(),
)
val_report.generate("validation_report.html")

tier = RiskTierScorer().score(
    gwp_impacted=85_000_000,
    model_complexity="high",
    deployment_status="champion",
    customer_facing=True,
)
GovernanceReport(card=MRMModelCard(...), tier=tier).save_html("mrm_pack.html")
```

What this pipeline gives you

Seven libraries, one workflow. Each one solves a specific problem that the others do not:

  • insurance-datasets: a known-DGP test environment, so you can verify implementations rather than assume they work.
  • shap-relativities: makes GBM predictions interpretable in the format pricing teams and regulators expect.
  • insurance-distill: converts the GBM into something a rating engine can actually load.
  • insurance-conformal: quantifies uncertainty without parametric assumptions that fail on heterogeneous motor books.
  • insurance-fairness: produces the documented evidence trail that FCA Consumer Duty requires.
  • insurance-monitoring: catches the segment-level drift that aggregate A/E misses.
  • insurance-governance: automates the test suite and governance pack that sign-off requires.