Benchmarks

Every library ships with a Databricks-runnable benchmark on a known data-generating process (DGP). All benchmarks are self-contained (synthetic data, no external files), run on Databricks Serverless unless noted, and produce the same output on every run. The results below are extracted from those runs.

We publish the failures alongside the successes. Where a method has conditions under which it does not win, we say so.

GBM Deployment

shap-relativities — SHAP rating relativities from GBMs

What is measured: GBM+SHAP relativity extraction vs direct Poisson GLM on a 20,000-policy synthetic motor DGP with known true coefficients (NCD=5 true relativity 0.549, conviction true relativity 1.57×).

Method	Gini	NCD=5 relativity error	Conviction relativity error
Poisson GLM (direct)	baseline	9.44%	—
CatBoost + shap-relativities	+2.85pp Gini	4.47%	recovers 1.57× within CI

On a second DGP with a planted interaction (vehicle_group × NCD), SHAP relativities give +3.4pp Gini over a main-effects GLM. SHAP absorbs the interaction into the marginal relativities, which is the correct TreeSHAP behaviour.
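
One common way to turn log-scale SHAP contributions into rating relativities is to average them per factor level and rebase to a chosen base level. The sketch below assumes that approach; the function names are hypothetical and are not the shap-relativities API, and the toy contributions are chosen so that the NCD=5 relativity lands near the planted 0.549.

```python
import math
from collections import defaultdict


def relativities(levels, log_contributions, base_level):
    """exp(mean log-scale contribution per level), rebased so the base level is 1.0."""
    sums, counts = defaultdict(float), defaultdict(int)
    for lvl, c in zip(levels, log_contributions):
        sums[lvl] += c
        counts[lvl] += 1
    means = {lvl: sums[lvl] / counts[lvl] for lvl in sums}
    return {lvl: math.exp(m - means[base_level]) for lvl, m in means.items()}


def relative_error(estimate, truth):
    """Absolute relative error of an estimated relativity against the planted one."""
    return abs(estimate - truth) / truth


# Toy per-policy contributions for the NCD factor (illustrative, not benchmark data).
ncd_level = [0, 0, 5, 5]
contribution = [0.3, 0.3, -0.3, -0.3]
rel = relativities(ncd_level, contribution, base_level=0)
print(round(rel[5], 3))
print(round(relative_error(rel[5], 0.549), 4))
```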

github.com/burning-cost/shap-relativities

insurance-distill — GBM-to-GLM distillation

What is measured: SurrogateGLM fidelity vs direct GLM on 30,000 synthetic motor policies with a CatBoost teacher model. 80/20 holdout.

Method	Max segment deviation	Holdout Gini	Fidelity R²
Direct GLM	21.4%	97.6%*	0.51
SurrogateGLM	3.6%	86.5%*	0.54
LassoGuidedGLM (α=0.0005)	10.4%	—	—

*The Gini reversal is noise at this n; fidelity R² is the correct metric. SurrogateGLM is about 6× more faithful at cell level (3.6% vs 21.4% max segment deviation), which is the level a pricing actuary would present to a CRO.
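
The two fidelity columns can be computed from teacher and student predictions roughly as follows. This is a minimal sketch under assumed definitions, since the page does not spell them out: fidelity R² as the share of teacher-prediction variance explained by the student, and max segment deviation as the largest relative gap between student and teacher totals within a segment. The function names are hypothetical and not part of insurance-distill.

```python
from collections import defaultdict


def fidelity_r2(teacher, student):
    """R² of student predictions against teacher predictions (assumed definition)."""
    mean_t = sum(teacher) / len(teacher)
    ss_res = sum((t - s) ** 2 for t, s in zip(teacher, student))
    ss_tot = sum((t - mean_t) ** 2 for t in teacher)
    return 1.0 - ss_res / ss_tot


def max_segment_deviation(segments, teacher, student):
    """Largest |student total / teacher total - 1| over segments (assumed definition)."""
    t_sum = defaultdict(float)
    s_sum = defaultdict(float)
    for seg, t, s in zip(segments, teacher, student):
        t_sum[seg] += t
        s_sum[seg] += s
    return max(abs(s_sum[k] / t_sum[k] - 1.0) for k in t_sum)


# Toy inputs, purely illustrative.
segments = ["A", "A", "B", "B"]
teacher = [0.10, 0.20, 0.30, 0.40]
student = [0.11, 0.19, 0.33, 0.44]
print(fidelity_r2(teacher, student))
print(max_segment_deviation(segments, teacher, student))
```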

github.com/burning-cost/insurance-distill

Severity Modelling

insurance-severity — EVT tail modelling

EVT benchmark (TruncatedGPD vs naive GPD, £100k policy limit):

Method	Shape parameter (ξ) bias	Q99 error
Naive GPD	0.035	10.3%
TruncatedGPD	0.006	1.2%
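
The gap between the two rows plausibly comes from how the policy limit enters the likelihood. If excesses can only be observed below a cap, each density term needs renormalising by the probability mass below that cap; ignoring this distorts the fitted shape. The sketch below shows that renormalisation for a GPD with positive shape. It is an assumption about the general technique, not the TruncatedGPD implementation, and all function names are hypothetical.

```python
import math


def gpd_cdf(y, xi, sigma):
    """CDF of a generalised Pareto excess y >= 0 with shape xi > 0 and scale sigma > 0."""
    return 1.0 - (1.0 + xi * y / sigma) ** (-1.0 / xi)


def gpd_logpdf(y, xi, sigma):
    """Log-density of the same GPD."""
    return -math.log(sigma) - (1.0 / xi + 1.0) * math.log1p(xi * y / sigma)


def naive_gpd_loglik(excesses, xi, sigma):
    """Log-likelihood that ignores the cap."""
    return sum(gpd_logpdf(y, xi, sigma) for y in excesses)


def truncated_gpd_loglik(excesses, xi, sigma, cap):
    """Log-likelihood when excesses are only observable below `cap` (limit minus threshold).

    Each density term is renormalised by the probability mass below the cap.
    """
    log_mass = math.log(gpd_cdf(cap, xi, sigma))
    return sum(gpd_logpdf(y, xi, sigma) - log_mass for y in excesses)


# Toy excesses over the threshold, illustrative only.
sample = [1.0, 2.0]
print(naive_gpd_loglik(sample, xi=0.5, sigma=1.0))
print(truncated_gpd_loglik(sample, xi=0.5, sigma=1.0, cap=10.0))
```

Maximising the truncated version over xi and sigma, rather than the naive one, is what accounts for the limit.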

Heavy-tail benchmark (α=1.5 Pareto, infinite variance, n=20,000): Gamma GLM structurally fails at the tail. LognormalGPDComposite and GammaGPDComposite (both threshold_method='profile_likelihood') recover the tail shape. Q99 error reduction vs Gamma: 15–20+ percentage points. ILF error at £5m limit: 20+ percentage points lower for composite models. Run: Databricks Serverless (43s, SUCCESS).

WeibullTemperedPareto vs standard Pareto: +31.3 log-likelihood; standard Pareto Q99.5 error 15.9%.

github.com/burning-cost/insurance-severity

Prediction Intervals

insurance-conformal — Conformal prediction intervals

What is measured: Locally-weighted (LW) conformal vs naive parametric intervals vs standard conformal on a 50,000-policy Gamma DGP (heteroscedastic, shape decreasing with predicted mean). 60/20/20 temporal split.

Method	Worst-decile coverage	Interval width vs parametric
Naive parametric	~70–75% (misses 90% target)	baseline
Standard conformal	87.9% (still misses worst decile)	−13.4%
LW conformal	90%+ in every decile	−11.7%

Standard conformal meets aggregate coverage but undercovers the highest-risk decile (87.9% against the 90% target), which is precisely the segment that drives SCR and reinsurance cost. LW conformal reaches the target in every decile while remaining 11.7% narrower than the parametric baseline.
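
The difference between standard and locally-weighted conformal comes down to the nonconformity score. A minimal sketch, assuming the textbook locally-weighted score (absolute residual divided by a local spread estimate) and the usual finite-sample quantile; the function names are hypothetical and the library's actual score may differ.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def lw_interval(mu, sigma, q):
    """Locally-weighted interval: half-width scales with the local spread estimate."""
    return mu - q * sigma, mu + q * sigma


# Calibration set of (observed, predicted mean, predicted spread). Toy values.
calib = [(105, 100, 10), (190, 200, 20), (330, 300, 30), (380, 400, 40)]
scores = [abs(y - mu) / sigma for y, mu, sigma in calib]  # normalised residuals
q = conformal_quantile(scores, alpha=0.2)
print(lw_interval(500, 50, q))
```

Standard conformal would use the raw residual as the score and add the same half-width to every policy, which is why it can hold on aggregate yet fall short where the spread is largest.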

github.com/burning-cost/insurance-conformal

Causal Inference

insurance-causal-policy — Synthetic DiD / SDID for pricing interventions

What is measured: SDID vs naive before-after and plain DiD. 30-simulation Monte Carlo, 80 segments, 12 periods, true ATT = −0.08, market inflation 0.5pp per period.

Method	Bias	95% CI coverage
Naive before-after	~+2pp upward (market inflation absorbed)	—
SDID	near-zero	~93–95%

Naive before-after is biased upward by roughly 4 × 0.5pp = 2pp, the market inflation accumulated over the post-period window. SDID recovers the true ATT with valid confidence intervals.
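
The source of the naive bias can be checked with noise-free arithmetic. The sketch below uses plain difference-in-differences rather than SDID (which additionally reweights control segments and pre-periods), and it assumes the before and after points are four periods apart; both choices are for illustration only.

```python
INFLATION = 0.005   # 0.5pp market drift per period (from the benchmark description)
TRUE_ATT = -0.08    # planted treatment effect


def outcome(period, treated, post):
    """Noise-free outcome: shared market drift plus the treatment effect after intervention."""
    return INFLATION * period + (TRUE_ATT if treated and post else 0.0)


# Assumption for illustration: compare the last pre-period with a point four periods later.
t_pre, t_post = 7, 11

naive = outcome(t_post, True, True) - outcome(t_pre, True, False)
control_change = outcome(t_post, False, True) - outcome(t_pre, False, False)
did = naive - control_change

print(f"naive before-after: {naive:+.3f}")        # drift leaks into the estimate
print(f"difference-in-differences: {did:+.3f}")   # drift cancels against controls
```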

github.com/burning-cost/insurance-causal-policy

insurance-causal — DML causal effect estimation

What is measured: DML (CausalPricingModel) vs naive Poisson GLM on a confounded telematics DGP (5,000 policies, true effect −0.15, confounding via driver safety score).

Honest finding: on a single-run Databricks benchmark at n=5,000, the naive GLM estimated −0.2124 (bias 41.6%, CI covers the true effect) and DML estimated −0.0202 (bias 86.5%, CI misses the true effect). DML underperforms here because of over-partialling: when the CatBoost nuisance models absorb most of the variance, the residualised treatment has little variation left and the final regression is imprecise. DML wins when n ≥ 50,000, treatment effects are large, and GLM misspecification compounds across many factors. See the README for the full conditions.
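
The bias figures and the over-partialling mechanism can both be sketched in a few lines. The first helper reproduces the reported relative biases from the estimates. The second is a generic residual-on-residual final stage with a plain homoscedastic standard error; it is an illustration of the mechanism, not CausalPricingModel's estimator, and both function names are hypothetical.

```python
import math


def relative_bias(estimate, truth):
    """Absolute relative bias of an estimate against the planted effect."""
    return abs(estimate - truth) / abs(truth)


def final_stage(y_res, t_res):
    """Residual-on-residual OLS (no intercept): slope and its standard error.

    The standard error divides by the residual treatment variation, so it grows
    when the nuisance model leaves little variation in t_res.
    """
    stt = sum(t * t for t in t_res)
    theta = sum(t * y for t, y in zip(t_res, y_res)) / stt
    resid = [y - theta * t for t, y in zip(t_res, y_res)]
    sigma2 = sum(e * e for e in resid) / (len(y_res) - 1)
    return theta, math.sqrt(sigma2 / stt)


print(round(relative_bias(-0.2124, -0.15), 3))  # naive GLM
print(round(relative_bias(-0.0202, -0.15), 3))  # DML
```

Shrinking every treatment residual by a factor of ten while keeping the outcome noise fixed multiplies the standard error by ten, which is the imprecision described above.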

github.com/burning-cost/insurance-causal

Fairness & Compliance

insurance-fairness — Proxy discrimination detection

What is measured: CatBoost proxy R² vs manual Spearman correlation inspection. 20,000 synthetic UK motor policies, postcode area as proxy for ethnicity (non-linear categorical relationship).

Method	Postcode proxy flagged?	Time
Manual Spearman (threshold 0.25)	No — Spearman r = 0.064	—
Library proxy R²	YES — R² = 0.777 (RED)	0.5s
Library MI score	YES — 0.817 nats	included

Rank correlation cannot detect non-linear categorical proxy relationships. The library caught the postcode proxy in under a second. All other rating factors returned proxy R² = 0.000.
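
Why rank correlation misses this kind of proxy can be shown on a toy example. The sketch below swaps in the correlation ratio (eta squared) as a simple stand-in for a model-based proxy R²; it is not the library's CatBoost-based measure, and the area codes and protected values are made up for illustration.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_ratio(categories, y):
    """Share of variance in y explained by category membership (eta squared)."""
    mean_y = sum(y) / len(y)
    groups = {}
    for c, v in zip(categories, y):
        groups.setdefault(c, []).append(v)
    between = sum(len(g) * (sum(g) / len(g) - mean_y) ** 2 for g in groups.values())
    total = sum((v - mean_y) ** 2 for v in y)
    return between / total


# The protected attribute is fully determined by area, but not monotonically in the code.
area_code = [0, 0, 1, 1, 2, 2, 3, 3]
protected = [0, 0, 1, 1, 1, 1, 0, 0]
print(spearman(area_code, protected))
print(correlation_ratio(area_code, protected))
```

The rank correlation is zero because the relationship is not monotone in the arbitrary area coding, while the category-aware measure reports that area explains all of the variance.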

github.com/burning-cost/insurance-fairness

insurance-governance — PRA model validation

What is measured: Automated 5-test validation suite vs manual 4-check checklist. Three scenarios: well-specified (Model A), miscalibrated with age-band bias (Model B), drifted population (Model C). 20k training + 8k validation policies.

Scenario	Manual checklist	Automated suite
Model A (well-specified)	PASS	PASS
Model B (miscalibrated)	Flags global A/E only	Detects age-band bias via Hosmer-Lemeshow (p<0.0001)
Model C (drifted)	Flags PSI	Flags PSI + Poisson CI on A/E

The manual checklist cannot detect that miscalibration is concentrated in young drivers (age < 30). Overhead: automated suite is ~13× slower (1.2s vs 0.09s) due to 500-resample bootstrap for Gini CI — acceptable for a sign-off workflow.
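
A global A/E check can pass while one age band is badly miscalibrated, because segment errors cancel in the total. The sketch below uses a simple segment-level A/E with a normal-approximation Poisson interval; it is a simpler stand-in for the suite's Hosmer-Lemeshow test, the function name is hypothetical, and the claim counts are toy values.

```python
import math


def ae_with_ci(actual, expected, z=1.96):
    """A/E ratio with a normal-approximation Poisson interval on the claim count."""
    ratio = actual / expected
    half_width = z * math.sqrt(actual) / expected
    return ratio, ratio - half_width, ratio + half_width


# Toy claim counts, chosen so the segment errors cancel in aggregate.
bands = {"age < 30": (130, 100.0), "age >= 30": (870, 900.0)}

total_actual = sum(a for a, _ in bands.values())
total_expected = sum(e for _, e in bands.values())
print("global", ae_with_ci(total_actual, total_expected))
for name, (a, e) in bands.items():
    ratio, lo, hi = ae_with_ci(a, e)
    flag = "BIASED" if lo > 1.0 or hi < 1.0 else "ok"
    print(name, round(ratio, 3), flag)
```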

github.com/burning-cost/insurance-governance

Model Monitoring

insurance-monitoring — In-production drift detection

What is measured: MonitoringReport vs manual aggregate A/E ratio check on 14,000 synthetic UK motor policies (10k reference, 4k monitoring) with three deliberately embedded failure modes: young driver covariate shift, vehicle calibration drift, discrimination decay.

Check	Manual A/E	MonitoringReport
Covariate shift (young drivers)	Missed	PSI = 0.211 [AMBER]
Calibration drift (vehicle_age < 3)	Missed	Murphy MCB local > global → REFIT
Discrimination decay (30% randomised)	Missed	Gini z-test (underpowered at n=4k — statistically correct)

Manual aggregate A/E: 0.962 reference, 0.942 monitoring → verdict INVESTIGATE (errors cancel at portfolio level). MonitoringReport verdict: REFIT.
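
The covariate shift row relies on the Population Stability Index. A minimal sketch of the standard PSI formula follows; the 0.10 and 0.25 cut-offs are the conventional ones and are assumed here rather than taken from MonitoringReport, and the bin shares are toy values.

```python
import math


def psi(reference_share, monitoring_share):
    """Population Stability Index over matching bins of two share vectors."""
    return sum(
        (m - r) * math.log(m / r)
        for r, m in zip(reference_share, monitoring_share)
    )


def psi_band(value, amber=0.10, red=0.25):
    """Traffic-light band using the conventional 0.10 / 0.25 cut-offs (assumed here)."""
    if value >= red:
        return "RED"
    if value >= amber:
        return "AMBER"
    return "GREEN"


reference = [0.10, 0.30, 0.40, 0.20]   # toy age-band shares at training time
monitoring = [0.25, 0.30, 0.30, 0.15]  # toy shares with more young drivers
value = psi(reference, monitoring)
print(round(value, 3), psi_band(value))
```

Under these cut-offs the reported PSI of 0.211 falls in the amber band, consistent with the table.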

github.com/burning-cost/insurance-monitoring

Covariate Shift Diagnostics

insurance-covariate-shift — Shift severity classification

What is measured: ESS/KL diagnostic accuracy on three shift scenarios, plus importance-weighted metric correction at n=5,000. Databricks Serverless, 2026-03-21.

Diagnostic accuracy (3 scenarios, all correctly classified):

Scenario	ESS	KL	Verdict
NEGLIGIBLE (same book)	0.849	0.09	NEGLIGIBLE ✓
MODERATE (broker, age +6, urban −11pp)	0.532	0.34	MODERATE ✓
SEVERE (acquired MGA, age +27, NCD +4)	0.004	4.55	SEVERE ✓

Metric correction at n=5,000: importance-weighted (IW) MAE 0.0552 vs unweighted MAE 0.0636 vs oracle 0.0528. The IW estimate is 4.5× closer to the oracle than the unweighted one (absolute error 0.0024 vs 0.0108). Correction is secondary to diagnostics; it requires n ≥ 2,000 and ESS ≥ 0.30.
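
The ESS diagnostic and the 4.5× figure can both be sketched briefly. The ESS below is the Kish effective sample size divided by n, which matches the 0 to 1 range in the table but is an assumption about the library's exact definition; the function names are hypothetical.

```python
def normalised_ess(weights):
    """Kish effective sample size divided by n, so 1.0 means no loss of information."""
    n = len(weights)
    total = sum(weights)
    return total * total / (n * sum(w * w for w in weights))


def weighted_mae(y_true, y_pred, weights):
    """Importance-weighted mean absolute error."""
    num = sum(w * abs(a - b) for a, b, w in zip(y_true, y_pred, weights))
    return num / sum(weights)


print(normalised_ess([1.0, 1.0, 1.0, 1.0]))    # identical books
print(normalised_ess([100.0, 1.0, 1.0, 1.0]))  # one policy dominates

# Reproduces the 4.5x figure from the reported MAEs.
oracle, iw, unweighted = 0.0528, 0.0552, 0.0636
print(round(abs(unweighted - oracle) / abs(iw - oracle), 1))
```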

github.com/burning-cost/insurance-covariate-shift

Credibility & Smoothing

insurance-credibility — Bühlmann-Straub credibility weighting

What is measured: Credibility weighting vs raw segment averages vs portfolio average. 30 schemes, 5 accident years, 64,302 policy-years, known DGP (K=4.0). Run: 4.4 seconds on ARM64 Pi.

Method	Thin schemes MAE	Medium MAE	Thick MAE
Portfolio average	0.0596	0.0423	0.0337
Raw segment	0.0074	0.0030	0.0014
Bühlmann-Straub	0.0069	0.0029	0.0014 (tie)

Portfolio average is uniformly worst. Credibility beats raw on thin and medium schemes; ties on thick (correct — credibility weight Z approaches 1). K estimated at 8.36 (true 4.0); conservative K means extra shrinkage that is still better than raw on thin schemes.
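
The effect of an over-estimated K is easy to see from the credibility factor Z = w / (w + K). The sketch below compares Z under the true and estimated K; the exposure values are illustrative and their units are assumed to match those of K, and the function names are hypothetical.

```python
def credibility_weight(exposure, k):
    """Bühlmann-Straub credibility factor Z = w / (w + K)."""
    return exposure / (exposure + k)


def credibility_estimate(segment_mean, portfolio_mean, exposure, k):
    """Blend of segment and portfolio experience weighted by Z."""
    z = credibility_weight(exposure, k)
    return z * segment_mean + (1.0 - z) * portfolio_mean


# Z under the true K (4.0) and the estimated K (8.36) for illustrative exposures.
for exposure in (10, 100, 5000):
    z_true = credibility_weight(exposure, 4.0)
    z_est = credibility_weight(exposure, 8.36)
    print(exposure, round(z_true, 3), round(z_est, 3))
```

For large exposures both values of K give Z close to 1, which is why credibility and raw estimates tie on thick schemes.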

github.com/burning-cost/insurance-credibility

insurance-whittaker — Whittaker-Henderson age curve smoothing

What is measured: Whittaker-Henderson (order=2, REML) vs raw rates vs weighted 5-point moving average. 63 age bands, U-shaped loss ratio DGP, Poisson noise.

Method	MSE (overall)
Raw	0.000417
5-pt moving average	0.000184
Whittaker-Henderson (REML λ=55,539)	0.000179

WH improvement vs raw: +57.2%. Vs moving average: +2.8%. REML selected EDF=7.7. Honest caveat: at the young-driver peak WH max error is slightly worse than raw — the smoothness penalty trades off local precision for global regularity.
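
The trade-off between local precision and global regularity follows from the objective: an order-2 Whittaker-Henderson smoother minimises weighted squared error plus λ times the sum of squared second differences, which leads to the linear system (W + λD'D)z = Wy. The sketch below solves that system in pure Python for a given λ. It does not select λ by REML, and the function names are hypothetical rather than the insurance-whittaker API.

```python
def second_difference_matrix(n):
    """(n-2) x n matrix D with rows (1, -2, 1), so D applied to z gives second differences."""
    return [[(1, -2, 1)[j - i] if 0 <= j - i <= 2 else 0 for j in range(n)]
            for i in range(n - 2)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def whittaker_henderson(y, weights, lam):
    """Order-2 smoother: solve (W + lam * D'D) z = W y for the smoothed values z."""
    n = len(y)
    d = second_difference_matrix(n)
    a = [[lam * sum(d[k][i] * d[k][j] for k in range(n - 2)) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        a[i][i] += weights[i]
    return solve(a, [w * v for w, v in zip(weights, y)])


# Toy rates with a sharp peak at the first age band; weights are exposures.
raw = [0.90, 0.60, 0.55, 0.50, 0.52, 0.58]
print([round(z, 3) for z in whittaker_henderson(raw, [1.0] * 6, lam=5.0)])
```

A larger λ pulls the curve toward a straight line, so a sharp peak such as the young-driver one is flattened more than the rest of the curve.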

github.com/burning-cost/insurance-whittaker

Interaction Detection

insurance-interactions — Automated GLM interaction detection

What is measured: CANN+NID interaction detection at scale. 50 features, 3 planted interactions, 1,225 candidate pairs.

NID filters candidate pairs before statistical testing. Bonferroni threshold is 82× stricter for exhaustive pairwise testing (1,225 pairs) vs NID-pre-filtered testing (~15 pairs). Without NID, the multiple testing burden makes real interactions undetectable in moderate-sized portfolios.
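
The 82× figure is the ratio of the two Bonferroni thresholds and can be reproduced directly. The family-wise α of 0.05 is an assumption for illustration; the ratio does not depend on it.

```python
import math

ALPHA = 0.05  # family-wise error rate, assumed for illustration

n_features = 50
exhaustive_pairs = math.comb(n_features, 2)   # every unordered feature pair
prefiltered_pairs = 15                        # approximate NID shortlist size

threshold_exhaustive = ALPHA / exhaustive_pairs
threshold_prefiltered = ALPHA / prefiltered_pairs

print(exhaustive_pairs)                                     # 1225
print(round(threshold_prefiltered / threshold_exhaustive))  # 82
```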

The 10-feature exhaustive benchmark is included in the README as the honest “when exhaustive works” case.

github.com/burning-cost/insurance-interactions

Joint Frequency-Severity

insurance-frequency-severity — Sarmanov copula

What is measured: JointFreqSev IFM estimator on a pure Sarmanov DGP (ω=3.5 planted directly in the copula). Databricks Serverless (58s, SUCCESS).

The benchmark uses a pure Sarmanov DGP via SarmanovCopula.sample() — the same family as the model being fit. Earlier benchmarks using a latent-factor DGP were methodologically invalid (the planted parameter has no correspondence to the IFM estimate). The current benchmark validates parameter recovery directly: omega planted 3.5, IFM relative error expected <20%.

Assuming independence biases the pure premium in high-severity/high-frequency segments; the pure premium correction factor from the joint model is the differentiating metric.

github.com/burning-cost/insurance-frequency-severity

Rate Optimisation

insurance-optimise — Constrained portfolio optimisation

What is measured: PortfolioOptimiser vs uniform +7% rate change. 2,000 renewals, heterogeneous elasticities (PCW ~−2.0, direct ~−1.2), constraints: LR cap 68%, retention floor 78%, max rate change ±25%, ENBP compliance.

The optimiser achieves the same GWP target as the flat increase with higher profit and better retention by applying larger increases to inelastic customers and smaller increases to elastic ones. Typical profit uplift: 3–8% vs flat rate change.

ParetoFrontier benchmark: 3×3 epsilon-constraint grid (N=150 solutions). Single-objective EfficientFrontier achieves profit-max but produces disparity ratio 1.168 — a fairness cost that is invisible to the optimiser. The Pareto surface makes the profit–retention–fairness trade-off explicit: 4 non-dominated solutions, TOPSIS selection picks the balanced solution.
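
The notion of a non-dominated solution can be sketched as a simple filter over candidate solutions. This is a generic Pareto filter, not the ParetoFrontier implementation, and it does not cover the TOPSIS step. Apart from the 1.168 disparity ratio, the candidate values are made up for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one (maximising)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Toy (profit, retention, -disparity) triples; disparity is negated so higher is better.
candidates = [
    (100.0, 0.78, -1.168),
    (96.0, 0.80, -1.05),
    (95.0, 0.79, -1.10),   # dominated by the second candidate
    (90.0, 0.83, -1.02),
]
front = non_dominated(candidates)
print(len(front))
```

A single-objective optimiser would return only the first candidate; the filter keeps every solution that represents a distinct trade-off.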

github.com/burning-cost/insurance-optimise

Synthetic Data

insurance-synthetic — Vine copula portfolio synthesis

What is measured: Vine copula synthesis vs naive independent sampling. 8,000-row UK motor DGP with known correlations (ρ(age,NCD)=+0.502, ρ(NCD,vehicle_group)=−0.338).

Metric	Vine copula	Naive independent
Frobenius norm (Spearman matrix)	0.315	0.880
Age/NCD correlation	+0.400 (true +0.502)	+0.001 (destroys correlation)
Impossible combinations	0.26% (real: 0.32%)	2.30%
TSTR Gini gap	0.0006	0.0016

Known issue: claim_amount KS=0.93 for vine (severity synthesis limitation, documented in README). Naive performs marginally better on discrete columns — expected with a continuous copula.

github.com/burning-cost/insurance-synthetic

Multilevel Modelling

insurance-multilevel — BLUP random effects for sparse segments

What is measured: MultilevelPricingModel (REML) vs one-hot encoding vs no group effect. 8,000 policies, 200 occupation codes, true ICC=0.36.

Method	Deviance	Thin-group MAPE
No group effect	0.338092	—
One-hot encoding	0.272967	66.09%
MultilevelPricingModel	0.272280	63.55%

BLUP recovery r=0.729 (target >0.6, passed). ICC estimated 0.332 (true 0.360). Stage 2 lift: +15.93% deviance reduction vs Stage 1 alone. REML under-estimates variance components when Stage 1 is strong — documented in README.
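
How much a sparse group is pulled toward the portfolio depends on the ICC and the group size. The sketch below uses the shrinkage factor of a Gaussian random-intercept model, n / (n + (1 − ICC) / ICC); this is an assumption for illustration and MultilevelPricingModel may work on a different scale. The function names are hypothetical.

```python
def shrinkage_factor(n_obs, icc):
    """Weight on a group's own mean residual in a Gaussian random-intercept BLUP."""
    ratio = (1.0 - icc) / icc          # residual variance / between-group variance
    return n_obs / (n_obs + ratio)


def blup(group_mean_residual, n_obs, icc):
    """Shrunken group effect: the raw group mean residual pulled toward zero."""
    return shrinkage_factor(n_obs, icc) * group_mean_residual


# Shrinkage under the true ICC (0.36) and the estimated ICC (0.332).
for n_obs in (2, 10, 40, 400):
    print(n_obs, round(shrinkage_factor(n_obs, 0.36), 3), round(shrinkage_factor(n_obs, 0.332), 3))
```

With 8,000 policies across 200 occupation codes the average group has 40 policies, so most of the shrinkage falls on the thin groups where one-hot encoding is noisiest.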

github.com/burning-cost/insurance-multilevel

Notes on methodology

All benchmarks follow the same design contract:

Known DGP — synthetic data with planted parameters so bias is measurable, not just approximate.
Self-contained — no external data files; everything is generated in the script.
Honest failures — where a method has conditions under which it underperforms, these are documented. See insurance-causal (DML over-partialling), insurance-synthetic (severity KS), insurance-whittaker (young-driver peak).
Databricks Serverless — all scripts are in notebooks/benchmark.py in each repo, formatted for Databricks import. Run times are noted where material.
Parameter recovery — benchmarks that validate an estimator use DGPs from the same distributional family as the model (see insurance-frequency-severity notes above).

Where a library does not yet have a run-verified result in our knowledge base, we have omitted it from this page rather than publish unverified numbers.