Every library ships with a Databricks-runnable benchmark on a known data-generating process. All benchmarks are self-contained (synthetic data, no external files), run on Databricks Serverless unless noted, and produce the same output on every run. The results below are extracted from those runs.

We publish the failures alongside the successes. Where a method has conditions under which it does not win, we say so.


GBM Deployment

shap-relativities — SHAP rating relativities from GBMs

What is measured: GBM+SHAP relativity extraction vs direct Poisson GLM on a 20,000-policy synthetic motor DGP with known true coefficients (NCD=5 true relativity 0.549, conviction true 1.57×).

Method | Gini | NCD=5 relativity error | Conviction relativity error
Poisson GLM (direct) | baseline | 9.44% |
CatBoost + shap-relativities | +2.85pp | 4.47% | recovers 1.57× within CI

With an interaction added to the DGP (vehicle_group × NCD), SHAP relativities give +3.4pp Gini over a main-effects GLM. TreeSHAP folds the interaction into the marginal relativities of the two factors, which is its expected behaviour.
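As an illustration of what a SHAP relativity is, the sketch below averages log-scale SHAP contributions per factor level and exponentiates the difference from a base level. It assumes a log-link model, so contributions add on the log scale. The function name `relativities_from_shap` and the toy values are hypothetical; this is not the shap-relativities API.

```python
import math
from collections import defaultdict

def relativities_from_shap(levels, shap_values, base_level):
    """Mean log-scale SHAP per level, exponentiated relative to a base level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, value in zip(levels, shap_values):
        sums[level] += value
        counts[level] += 1
    means = {lvl: sums[lvl] / counts[lvl] for lvl in sums}
    base = means[base_level]
    return {lvl: math.exp(m - base) for lvl, m in means.items()}

# Toy data (hypothetical): NCD level per policy and that factor's SHAP value.
ncd = [0, 0, 5, 5, 5]
shap_ncd = [0.30, 0.30, -0.30, -0.30, -0.30]
rels = relativities_from_shap(ncd, shap_ncd, base_level=0)
```

The base level always gets relativity 1.0; every other level is read as a multiplier against it, which is the same form a GLM factor table takes.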

github.com/burning-cost/shap-relativities


insurance-distill — GBM-to-GLM distillation

What is measured: SurrogateGLM fidelity vs direct GLM on 30,000 synthetic motor policies with a CatBoost teacher model. 80/20 holdout.

Method | Max segment deviation | Holdout Gini | Fidelity R²
Direct GLM | 21.4% | 97.6%* | 0.51
SurrogateGLM | 3.6% | 86.5%* | 0.54
LassoGuidedGLM (α=0.0005) | 10.4% | |

*The Gini reversal is noise at this n; fidelity R² is the correct metric. SurrogateGLM is 6× more faithful at cell level (max segment deviation 3.6% vs 21.4%), which is the view a pricing actuary would present to a CRO.
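The two fidelity columns can be computed along the following lines. The definitions here are assumptions made for illustration (R² of student predictions against teacher predictions, and the largest relative gap between segment totals); they are not taken from the insurance-distill source, and the function names are hypothetical.

```python
def fidelity_r2(teacher, student):
    """Share of the teacher's prediction variance reproduced by the student."""
    mean_t = sum(teacher) / len(teacher)
    ss_res = sum((t - s) ** 2 for t, s in zip(teacher, student))
    ss_tot = sum((t - mean_t) ** 2 for t in teacher)
    return 1.0 - ss_res / ss_tot

def max_segment_deviation(segments, teacher, student):
    """Largest |student total / teacher total - 1| over all segments."""
    totals = {}
    for seg, t, s in zip(segments, teacher, student):
        tot_t, tot_s = totals.get(seg, (0.0, 0.0))
        totals[seg] = (tot_t + t, tot_s + s)
    return max(abs(s / t - 1.0) for t, s in totals.values())

# Toy predictions (hypothetical premiums) for four policies in two segments.
segments = ["a", "a", "b", "b"]
teacher = [100.0, 200.0, 300.0, 400.0]
student = [110.0, 190.0, 300.0, 380.0]
r2 = fidelity_r2(teacher, student)
worst = max_segment_deviation(segments, teacher, student)
```

The two metrics answer different questions: R² is policy-level, while the segment deviation shows whether errors cancel or accumulate inside a rating cell.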

github.com/burning-cost/insurance-distill


Severity Modelling

insurance-severity — EVT tail modelling

EVT benchmark (TruncatedGPD vs naive GPD, £100k policy limit):

Method | Shape parameter (ξ) bias | Q99 error
Naive GPD | 0.035 | 10.3%
TruncatedGPD | 0.006 | 1.2%
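A minimal sketch of why the truncation matters, assuming the policy limit acts as right truncation on the excesses and ξ > 0. The truncated likelihood renormalises the GPD density by the probability mass below the limit; a naive fit omits that term. The function names are hypothetical and this is not the TruncatedGPD implementation.

```python
import math

def gpd_cdf(y, xi, sigma):
    """GPD distribution function for an excess y >= 0 (xi > 0)."""
    return 1.0 - (1.0 + xi * y / sigma) ** (-1.0 / xi)

def gpd_logpdf(y, xi, sigma):
    """GPD log-density for an excess y >= 0 (xi > 0)."""
    return -math.log(sigma) - (1.0 / xi + 1.0) * math.log1p(xi * y / sigma)

def truncated_gpd_loglik(excesses, xi, sigma, upper):
    """Log-likelihood when excesses above `upper` can never be observed."""
    log_norm = math.log(gpd_cdf(upper, xi, sigma))
    return sum(gpd_logpdf(y, xi, sigma) - log_norm for y in excesses)

# Toy excesses over a threshold (hypothetical), truncated at 100,000.
excesses = [1_000.0, 5_000.0, 20_000.0]
naive = sum(gpd_logpdf(y, 0.3, 10_000.0) for y in excesses)
truncated = truncated_gpd_loglik(excesses, 0.3, 10_000.0, upper=100_000.0)
```

Because the normalising term depends on ξ and σ, dropping it changes where the likelihood peaks, which is one way a naive fit can end up with a biased shape parameter.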

Heavy-tail benchmark (α=1.5 Pareto, infinite variance, n=20,000): Gamma GLM structurally fails at the tail. LognormalGPDComposite and GammaGPDComposite (both threshold_method='profile_likelihood') recover the tail shape. Q99 error reduction vs Gamma: 15–20+pp. ILF error at £5m limit: 20+pp lower for composite models. Run: Databricks Serverless (43s, SUCCESS).

WeibullTemperedPareto vs standard Pareto: +31.3 log-likelihood; standard Pareto Q99.5 error 15.9%.

github.com/burning-cost/insurance-severity


Prediction Intervals

insurance-conformal — Conformal prediction intervals

What is measured: Locally-weighted (LW) conformal vs naive parametric intervals vs standard conformal on 50,000-policy Gamma DGP (heteroscedastic, shape decreasing with predicted mean). 60/20/20 temporal split.

Method | Worst-decile coverage | Interval width vs parametric
Naive parametric | ~70–75% (misses 90% target) | baseline
Standard conformal | 87.9% (still misses worst decile) | −13.4%
LW conformal | 90%+ in every decile | −11.7%

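Standard conformal meets aggregate coverage but undercovers the highest-risk decile (87.9% against the 90% target, about 2pp short), precisely the segment that drives SCR and reinsurance cost. LW conformal reaches the 90% target in every decile while remaining 11.7% narrower than the parametric baseline (standard conformal is 13.4% narrower but misses the worst decile).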
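The difference between standard and locally-weighted conformal comes down to the nonconformity score. A minimal split-conformal sketch follows, assuming the score is the absolute residual divided by a per-policy spread estimate. The function names and the toy calibration data are hypothetical, not the insurance-conformal API.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def lw_scores(y, y_hat, spread):
    """Absolute residuals scaled by a local spread estimate."""
    return [abs(a - p) / s for a, p, s in zip(y, y_hat, spread)]

def lw_interval(y_hat, spread, q):
    """Interval whose half-width grows with the local spread."""
    return (y_hat - q * spread, y_hat + q * spread)

# Toy calibration set (hypothetical): spread grows with the predicted mean.
cal_y_hat = [100.0 + 10.0 * i for i in range(19)]
cal_spread = [10.0 + i for i in range(19)]
cal_y = [p + (-1) ** i * s * (i % 5) / 4
         for i, (p, s) in enumerate(zip(cal_y_hat, cal_spread))]

q = conformal_quantile(lw_scores(cal_y, cal_y_hat, cal_spread), alpha=0.10)
low, high = lw_interval(y_hat=200.0, spread=20.0, q=q)
```

Setting every spread to 1 recovers standard conformal, where all policies get the same half-width regardless of how noisy their segment is.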

github.com/burning-cost/insurance-conformal


Causal Inference

insurance-causal-policy — Synthetic DiD / SDID for pricing interventions

What is measured: SDID vs naive before-after and plain DiD. 30-simulation Monte Carlo, 80 segments, 12 periods, true ATT = −0.08, market inflation 0.5pp per period.

Method | Bias | 95% CI coverage
Naive before-after | ~+2pp upward (market inflation absorbed) |
SDID | near-zero | ~93–95%

Naive before-after is biased upward by roughly 4 × 0.5pp inflation over the post-period window. SDID recovers the true ATT with valid confidence intervals.
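The inflation arithmetic can be reproduced with a noise-free toy. This sketch uses plain difference-in-differences rather than SDID, and a 4-period pre window and 4-period post window chosen purely for illustration (the benchmark's own window split is not stated here).

```python
INFLATION = 0.005   # market drift per period (0.5pp)
TRUE_ATT = -0.08
PRE, POST = 4, 4    # illustrative window lengths, an assumption

def mean(values):
    return sum(values) / len(values)

periods = range(PRE + POST)
control = [INFLATION * t for t in periods]
treated = [INFLATION * t + (TRUE_ATT if t >= PRE else 0.0) for t in periods]

# Before-after on the treated group keeps the market drift in the estimate.
before_after = mean(treated[PRE:]) - mean(treated[:PRE])

# Subtracting the control group's change removes the shared drift.
did = before_after - (mean(control[PRE:]) - mean(control[:PRE]))
```

Here the before-after estimate is −0.06, two points above the true −0.08, while the control-adjusted estimate recovers −0.08.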

github.com/burning-cost/insurance-causal-policy


insurance-causal — DML causal effect estimation

What is measured: DML (CausalPricingModel) vs naive Poisson GLM on confounded telematics DGP (5,000 policies, true effect −0.15, confounding via driver safety score).

Honest finding: on a single-run Databricks benchmark at n=5,000, the naive GLM estimated −0.2124 (bias 41.6%, CI covers the true effect) and DML estimated −0.0202 (bias 86.5%, CI misses the true effect). DML underperforms here because of over-partialling: when CatBoost nuisance models absorb most outcome variance, the residualised treatment has low variance and the final regression is imprecise. DML wins when n ≥ 50,000, treatment effects are large, and GLM misspecification compounds across many factors. See the README for the full conditions.
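The over-partialling effect shows up in the final-stage regression of outcome residuals on treatment residuals. The sketch below uses hand-made residuals rather than cross-fitted nuisance models, and a classical homoskedastic standard error; both are simplifying assumptions, and `final_stage` is a hypothetical name.

```python
import math

def final_stage(res_y, res_t):
    """No-intercept OLS of outcome residuals on treatment residuals."""
    stt = sum(t * t for t in res_t)
    theta = sum(y * t for y, t in zip(res_y, res_t)) / stt
    resid = [y - theta * t for y, t in zip(res_y, res_t)]
    sigma2 = sum(e * e for e in resid) / (len(res_t) - 1)
    return theta, math.sqrt(sigma2 / stt)

TRUE_EFFECT = -0.15
noise = [0.1, 0.1, -0.1, -0.1]        # orthogonal to both treatment vectors
wide_t = [1.0, -1.0, 1.0, -1.0]       # plenty of residual treatment variation
narrow_t = [0.1 * t for t in wide_t]  # nuisance model absorbed most of it

theta_wide, se_wide = final_stage(
    [TRUE_EFFECT * t + e for t, e in zip(wide_t, noise)], wide_t)
theta_narrow, se_narrow = final_stage(
    [TRUE_EFFECT * t + e for t, e in zip(narrow_t, noise)], narrow_t)
```

Shrinking the residual treatment by a factor of 10 leaves the point estimate unchanged in this toy but inflates the standard error tenfold, which is the imprecision described above.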

github.com/burning-cost/insurance-causal


Fairness & Compliance

insurance-fairness — Proxy discrimination detection

What is measured: CatBoost proxy R² vs manual Spearman correlation inspection. 20,000 synthetic UK motor policies, postcode area as proxy for ethnicity (non-linear categorical relationship).

Method | Postcode proxy flagged? | Time
Manual Spearman (threshold 0.25) | No (Spearman r = 0.064) |
Library proxy R² | YES (R² = 0.777, RED) | 0.5s
Library MI score | YES (0.817 nats) | included

Rank correlation cannot detect non-linear categorical proxy relationships. The library caught the postcode proxy in under a second. All other rating factors returned proxy R² = 0.000.
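A small deterministic example shows how a categorical proxy can be invisible to rank correlation. The correlation ratio (between-group share of variance) stands in here for the library's CatBoost proxy R²; that substitution, the function names, and the toy data are all assumptions made for illustration.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def correlation_ratio(groups, y):
    """Between-group sum of squares divided by total sum of squares."""
    mean_y = sum(y) / len(y)
    by_group = {}
    for g, v in zip(groups, y):
        by_group.setdefault(g, []).append(v)
    between = sum(len(v) * (sum(v) / len(v) - mean_y) ** 2 for v in by_group.values())
    return between / sum((v - mean_y) ** 2 for v in y)

# Eight area codes, ten policies each; the protected share is 90% in some
# areas and 10% in others, with no monotone pattern in the code itself.
high_areas = {0, 3, 4, 7}
area, protected = [], []
for a in range(8):
    for k in range(10):
        area.append(a)
        protected.append(1 if (k < 9) == (a in high_areas) else 0)

rho = spearman(area, protected)
eta2 = correlation_ratio(area, protected)
```

The area code explains most of the variation in the protected attribute, yet its rank correlation with that attribute is zero because the relationship is not monotone in the code.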

github.com/burning-cost/insurance-fairness


insurance-governance — PRA model validation

What is measured: Automated 5-test validation suite vs manual 4-check checklist. Three scenarios: well-specified (Model A), miscalibrated with age-band bias (Model B), drifted population (Model C). 20k training + 8k validation policies.

Scenario | Manual checklist | Automated suite
Model A (well-specified) | PASS | PASS
Model B (miscalibrated) | Flags global A/E only | Detects age-band bias via Hosmer-Lemeshow (p<0.0001)
Model C (drifted) | Flags PSI | Flags PSI + Poisson CI on A/E

The manual checklist cannot detect that miscalibration is concentrated in young drivers (age < 30). Overhead: automated suite is ~13× slower (1.2s vs 0.09s) due to 500-resample bootstrap for Gini CI — acceptable for a sign-off workflow.
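Why a global A/E check can miss segment-level bias is easy to see with toy numbers. The claim counts below are hypothetical and chosen so the errors cancel exactly.

```python
# Hypothetical (actual, expected) claim counts by age band.
book = {"age < 30": (120.0, 100.0), "age >= 30": (380.0, 400.0)}

# Actual-over-expected per segment and for the whole book.
by_segment = {name: actual / expected for name, (actual, expected) in book.items()}
aggregate = (sum(actual for actual, _ in book.values())
             / sum(expected for _, expected in book.values()))
```

The aggregate ratio is exactly 1.0 while young drivers are under-predicted by 20%, so only a test that looks inside the age bands can see the problem.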

github.com/burning-cost/insurance-governance


Model Monitoring

insurance-monitoring — In-production drift detection

What is measured: MonitoringReport vs manual aggregate A/E ratio check on 14,000 synthetic UK motor policies (10k reference, 4k monitoring) with three deliberately embedded failure modes: young driver covariate shift, vehicle calibration drift, discrimination decay.

Check | Manual A/E | MonitoringReport
Covariate shift (young drivers) | Missed | PSI = 0.211 [AMBER]
Calibration drift (vehicle_age < 3) | Missed | Murphy MCB local > global → REFIT
Discrimination decay (30% randomised) | Missed | Gini z-test (underpowered at n=4k; statistically correct)

Manual aggregate A/E: 0.962 reference, 0.942 monitoring → verdict INVESTIGATE (errors cancel at portfolio level). MonitoringReport verdict: REFIT.
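For reference, the population stability index behind the covariate-shift row is a short calculation over binned shares. The sketch below uses the standard PSI formula with hypothetical bin shares; it is not the MonitoringReport code, and the thresholds behind the AMBER label are not reproduced here.

```python
import math

def psi(reference_share, monitoring_share):
    """Population stability index over matching bins (all shares must be > 0)."""
    return sum((m - r) * math.log(m / r)
               for r, m in zip(reference_share, monitoring_share))

# Hypothetical share of policies per driver-age bin.
reference = [0.25, 0.25, 0.25, 0.25]
monitoring = [0.40, 0.30, 0.20, 0.10]

psi_same = psi(reference, reference)
psi_shifted = psi(reference, monitoring)
```

PSI looks only at the mix of business, not at claims, which is why it can flag a shift that an aggregate A/E ratio does not.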

github.com/burning-cost/insurance-monitoring


Covariate Shift Diagnostics

insurance-covariate-shift — Shift severity classification

What is measured: ESS/KL diagnostic accuracy on three shift scenarios, plus importance-weighted metric correction at n=5,000. Databricks Serverless, 2026-03-21.

Diagnostic accuracy (3 scenarios, all correctly classified):

Scenario | ESS | KL | Verdict
NEGLIGIBLE (same book) | 0.849 | 0.09 | NEGLIGIBLE ✓
MODERATE (broker, age +6, urban −11pp) | 0.532 | 0.34 | MODERATE ✓
SEVERE (acquired MGA, age +27, NCD +4) | 0.004 | 4.55 | SEVERE ✓

Metric correction at n=5,000: importance-weighted (IW) MAE 0.0552 vs unweighted MAE 0.0636 vs oracle 0.0528. The IW estimate is 4.5× closer to the oracle than the unweighted one (gap 0.0024 vs 0.0108). Correction is secondary to diagnostics; it requires n ≥ 2,000 and ESS ≥ 0.30.
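The ESS diagnostic and the weighted metric can be sketched as follows, assuming ESS is reported as Kish's effective sample size divided by n (so it lies between 0 and 1). That reading and the function names are assumptions; the last lines only redo the arithmetic on the MAE figures quoted above.

```python
def ess_ratio(weights):
    """Kish effective sample size as a fraction of n."""
    total = sum(weights)
    return total * total / (len(weights) * sum(w * w for w in weights))

def weighted_mae(errors, weights):
    """Mean absolute error under importance weights."""
    return sum(w * abs(e) for e, w in zip(errors, weights)) / sum(weights)

# Uniform weights keep the full sample; one dominant weight collapses it.
ess_uniform = ess_ratio([1.0, 1.0, 1.0, 1.0])
ess_collapsed = ess_ratio([1.0, 0.0, 0.0, 0.0])

# Distance to the oracle MAE, using the figures reported in the text.
unweighted, iw, oracle = 0.0636, 0.0552, 0.0528
improvement = (unweighted - oracle) / (iw - oracle)
```

A low ESS means a handful of policies carry most of the weight, so any weighted metric computed from them is unstable; that is the reason the correction comes with an ESS floor.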

github.com/burning-cost/insurance-covariate-shift


Credibility & Smoothing

insurance-credibility — Bühlmann-Straub credibility weighting

What is measured: Credibility weighting vs raw segment averages vs portfolio average. 30 schemes, 5 accident years, 64,302 policy-years, known DGP (K=4.0). Run: 4.4 seconds on ARM64 Pi.

Method | Thin schemes MAE | Medium MAE | Thick MAE
Portfolio average | 0.0596 | 0.0423 | 0.0337
Raw segment | 0.0074 | 0.0030 | 0.0014
Bühlmann-Straub | 0.0069 | 0.0029 | 0.0014 (tie)

Portfolio average is uniformly worst. Credibility beats raw on thin and medium schemes; ties on thick (correct — credibility weight Z approaches 1). K estimated at 8.36 (true 4.0); conservative K means extra shrinkage that is still better than raw on thin schemes.
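The effect of an over-estimated K follows directly from the credibility formula Z = w / (w + K). A minimal sketch, with hypothetical function names and an illustrative exposure of 10:

```python
def credibility_weight(exposure, k):
    """Bühlmann-Straub credibility factor Z for a given exposure."""
    return exposure / (exposure + k)

def credibility_estimate(segment_mean, portfolio_mean, exposure, k):
    """Blend of the segment's own experience and the portfolio mean."""
    z = credibility_weight(exposure, k)
    return z * segment_mean + (1.0 - z) * portfolio_mean

# Same exposure, true K vs the estimated K reported above.
z_true_k = credibility_weight(10.0, 4.0)
z_estimated_k = credibility_weight(10.0, 8.36)
blended = credibility_estimate(0.10, 0.06, exposure=4.0, k=4.0)
```

A larger K lowers Z for the same exposure, pulling thin schemes harder towards the portfolio mean; as exposure grows, Z approaches 1 under either K and the estimate converges to the raw segment average.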

github.com/burning-cost/insurance-credibility


insurance-whittaker — Whittaker-Henderson age curve smoothing

What is measured: Whittaker-Henderson (order=2, REML) vs raw rates vs weighted 5-point moving average. 63 age bands, U-shaped loss ratio DGP, Poisson noise.

Method | MSE (overall)
Raw | 0.000417
5-pt moving average | 0.000184
Whittaker-Henderson (REML λ=55,539) | 0.000179

WH improvement vs raw: +57.2%. Vs moving average: +2.8%. REML selected EDF=7.7. Honest caveat: at the young-driver peak WH max error is slightly worse than raw — the smoothness penalty trades off local precision for global regularity.
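The smoothness trade-off can be seen in a bare-bones Whittaker-Henderson smoother, which minimises weighted squared error plus λ times the squared differences of the fitted curve. In this sketch λ is supplied by the caller rather than chosen by REML, and the dense solver is for illustration only; the function names are hypothetical and this is not the insurance-whittaker implementation.

```python
def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def whittaker_henderson(y, weights, lam, order=2):
    """Solve (W + lam * D'D) theta = W y, with D the order-th difference matrix."""
    n = len(y)
    d = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(order):
        d = [[d[i + 1][j] - d[i][j] for j in range(n)] for i in range(len(d) - 1)]
    a = [[lam * sum(row[i] * row[j] for row in d) for j in range(n)] for i in range(n)]
    for i in range(n):
        a[i][i] += weights[i]
    return solve(a, [w * v for w, v in zip(weights, y)])

def roughness(curve):
    """Sum of squared second differences."""
    return sum((curve[i + 2] - 2 * curve[i + 1] + curve[i]) ** 2
               for i in range(len(curve) - 2))

# Hypothetical noisy rates over six age bands, equal weights.
raw = [1.0, 2.3, 2.8, 4.4, 4.9, 6.2]
smooth = whittaker_henderson(raw, [1.0] * 6, lam=100.0)
```

With order=2 a straight line passes through unchanged, while a sharp local peak is flattened; that flattening is the mechanism behind the young-driver caveat.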

github.com/burning-cost/insurance-whittaker


Interaction Detection

insurance-interactions — Automated GLM interaction detection

What is measured: CANN+NID interaction detection at scale. 50 features, 3 planted interactions, 1,225 candidate pairs.

NID filters candidate pairs before statistical testing. Bonferroni threshold is 82× stricter for exhaustive pairwise testing (1,225 pairs) vs NID-pre-filtered testing (~15 pairs). Without NID, the multiple testing burden makes real interactions undetectable in moderate-sized portfolios.
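The 82× figure is the ratio of the two Bonferroni thresholds. A quick check, assuming a family-wise α of 0.05 (the ratio does not depend on that choice):

```python
import math

ALPHA = 0.05          # assumed family-wise error rate
N_FEATURES = 50
N_PREFILTERED = 15    # approximate number of pairs kept by NID

n_pairs = math.comb(N_FEATURES, 2)          # all unordered feature pairs
threshold_exhaustive = ALPHA / n_pairs      # per-test p-value cut-off
threshold_prefiltered = ALPHA / N_PREFILTERED
strictness_ratio = threshold_prefiltered / threshold_exhaustive
```

An interaction has to clear a per-test threshold roughly 82 times smaller under exhaustive testing, which is what costs power at moderate sample sizes.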

The 10-feature exhaustive benchmark is included in the README as the honest “when exhaustive works” case.

github.com/burning-cost/insurance-interactions


Joint Frequency-Severity

insurance-frequency-severity — Sarmanov copula

What is measured: JointFreqSev IFM estimator on a pure Sarmanov DGP (ω=3.5 planted directly in the copula). Databricks Serverless (58s, SUCCESS).

The benchmark uses a pure Sarmanov DGP via SarmanovCopula.sample(), the same family as the model being fit. Earlier benchmarks using a latent-factor DGP were methodologically invalid (the planted parameter has no correspondence to the IFM estimate). The current benchmark validates parameter recovery directly: ω planted at 3.5, IFM relative error expected <20%.

Assuming independence biases the pure premium in high-frequency, high-severity segments; the pure premium correction factor from the joint model is the differentiating metric.

github.com/burning-cost/insurance-frequency-severity


Rate Optimisation

insurance-optimise — Constrained portfolio optimisation

What is measured: PortfolioOptimiser vs uniform +7% rate change. 2,000 renewals, heterogeneous elasticities (PCW ~−2.0, direct ~−1.2), constraints: LR cap 68%, retention floor 78%, max rate change ±25%, ENBP compliance.

The optimiser achieves the same GWP target as the flat increase with higher profit and better retention by applying larger increases to inelastic customers and smaller increases to elastic ones. Typical profit uplift: 3–8% vs flat rate change.

ParetoFrontier benchmark: 3×3 epsilon-constraint grid (N=150 solutions). Single-objective EfficientFrontier achieves profit-max but produces disparity ratio 1.168 — a fairness cost that is invisible to the optimiser. The Pareto surface makes the profit–retention–fairness trade-off explicit: 4 non-dominated solutions, TOPSIS selection picks the balanced solution.
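Selecting the non-dominated solutions from a set of candidates is a simple filter. The sketch below assumes profit and retention are maximised and the disparity ratio is minimised; the candidate values are hypothetical, and the TOPSIS step is not shown.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    no_worse = (a["profit"] >= b["profit"]
                and a["retention"] >= b["retention"]
                and a["disparity"] <= b["disparity"])
    better = (a["profit"] > b["profit"]
              or a["retention"] > b["retention"]
              or a["disparity"] < b["disparity"])
    return no_worse and better

def non_dominated(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]

# Hypothetical candidate rate strategies.
candidates = [
    {"name": "profit-max", "profit": 100.0, "retention": 0.80, "disparity": 1.17},
    {"name": "balanced", "profit": 95.0, "retention": 0.82, "disparity": 1.05},
    {"name": "weak", "profit": 90.0, "retention": 0.81, "disparity": 1.10},
    {"name": "worse-max", "profit": 98.0, "retention": 0.79, "disparity": 1.20},
]
front = [s["name"] for s in non_dominated(candidates)]
```

A single-objective optimiser would return only the profit-max candidate; the filter keeps every candidate that is a defensible trade-off and leaves the final choice to a selection rule.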

github.com/burning-cost/insurance-optimise


Synthetic Data

insurance-synthetic — Vine copula portfolio synthesis

What is measured: Vine copula synthesis vs naive independent sampling. 8,000-row UK motor DGP with known correlations (ρ(age,NCD)=+0.502, ρ(NCD,vehicle_group)=−0.338).

Metric | Vine copula | Naive independent
Frobenius norm (Spearman matrix) | 0.315 | 0.880
Age/NCD correlation | +0.400 (true +0.502) | +0.001 (destroys correlation)
Impossible combinations | 0.26% (real: 0.32%) | 2.30%
TSTR Gini gap | 0.0006 | 0.0016

Known issue: claim_amount KS=0.93 for vine (severity synthesis limitation, documented in README). Naive performs marginally better on discrete columns — expected with a continuous copula.

github.com/burning-cost/insurance-synthetic


Multilevel Modelling

insurance-multilevel — BLUP random effects for sparse segments

What is measured: MultilevelPricingModel (REML) vs one-hot encoding vs no group effect. 8,000 policies, 200 occupation codes, true ICC=0.36.

Method | Deviance | Thin-group MAPE
No group effect | 0.338092 |
One-hot encoding | 0.272967 | 66.09%
MultilevelPricingModel | 0.272280 | 63.55%

BLUP recovery r=0.729 (target >0.6, passed). ICC estimated 0.332 (true 0.360). Stage 2 lift: +15.93% deviance reduction vs Stage 1 alone. REML under-estimates variance components when Stage 1 is strong — documented in README.
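How ICC translates into shrinkage for a sparse group can be sketched with the textbook Gaussian random-intercept formula, where a group's BLUP is its raw mean deviation multiplied by nσ²ᵤ / (nσ²ᵤ + σ²ₑ). The Gaussian setting is an assumption for illustration and may differ from what MultilevelPricingModel does internally; the function names are hypothetical.

```python
def icc(var_group, var_residual):
    """Intraclass correlation: share of variance between groups."""
    return var_group / (var_group + var_residual)

def blup_shrinkage(n_obs, icc_value):
    """Weight on a group's own mean deviation, given its size and the ICC."""
    return n_obs * icc_value / (n_obs * icc_value + (1.0 - icc_value))

TRUE_ICC = 0.36
single_policy = blup_shrinkage(1, TRUE_ICC)
average_group = blup_shrinkage(40, TRUE_ICC)   # 8,000 policies / 200 codes
```

A group with one policy keeps only about a third of its raw deviation, while a group of average size keeps most of it. One-hot encoding applies no such shrinkage, which is where thin groups suffer.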

github.com/burning-cost/insurance-multilevel


Notes on methodology

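All benchmarks follow the same design contract: synthetic data from a known data-generating process, no external files, and the same output on every run.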

Where a library does not yet have a run-verified result in the KB, we have omitted it from this page rather than publish unverified numbers.