Every library in the Burning Cost portfolio has been benchmarked against a standard baseline on synthetic insurance data with a known data-generating process. This page aggregates those results. The pattern is consistent: specialised methods beat generic approaches on insurance problems, and where they don't, we say so.

34 libraries benchmarked
35 Databricks notebooks
MIT licence (all libraries)
Python 3.10+ runtime requirement

Model Building

7 libraries
shap-relativities
Measured: Gini lift vs Poisson GLM; relativity accuracy vs true DGP.
Standard approach: GLM - 4.5% mean relativity error; lower Gini baseline.
Burning Cost: +2.85pp Gini lift; 9.4% relativity error.
Takeaway: GBM wins on discrimination; GLM wins on factor accuracy. An honest trade-off: use shap-relativities for ranking, not when you need the coefficients to be actuarially exact.

insurance-glm-tools
Measured: nested GLM embeddings for 500 vehicle makes vs dummy-coded GLM; R2VF fused lasso clustering vs manual quintile banding.
Standard approach: dummy-coded GLM - overfits on high-cardinality vehicle makes.
Burning Cost: nested embeddings reduce overfitting; R2VF clustering eliminates arbitrary band choices.
Takeaway: a dummy-coded GLM becomes unreliable above ~50 factor levels. Nested embeddings give consistent coefficient estimates at 500+ levels.

insurance-gam (flagship)
Measured: EBM and Neural Additive Model vs Poisson GLM on synthetic data with planted non-linear effects; exact Shapley values vs approximation.
Standard approach: Poisson GLM misses the planted non-linear age-mileage interaction.
Burning Cost: EBM/NAM recover the planted non-linear effects; shape functions are directly interpretable as factor tables.
Takeaway: where a GLM forces linearity, an EBM fits the true shape. The transparency cost vs a GBM is zero - you get exact Shapley values and shape functions instead of approximate attribution.

insurance-interactions
Measured: CANN/NID interaction detection vs exhaustive pairwise GLM search on planted interactions.
Standard approach: exhaustive pairwise GLM - slow, and misses non-linear interactions.
Burning Cost: production defaults recover both planted interactions; the compact config is less reliable on weak signals.
Takeaway: CANN/NID is reliable when planted effects are of meaningful size. On weak interactions (<5% deviance contribution), SHAP interaction values are more stable than NID scores.

insurance-frequency-severity (flagship)
Measured: Sarmanov copula joint model vs the independence assumption; premium error under dependence.
Standard approach: independence assumption - premium bias wherever frequency and severity are correlated.
Burning Cost: analytical premium correction removes the dependence bias; IFM estimation with dependence tests.
Takeaway: most pricing models assume frequency and severity are independent. Where they aren't - typically in high-mileage or commercial segments - the independence assumption inflates expected loss cost. This library tests and corrects for it.

insurance-spatial
Measured: BYM2 territory factors vs raw postcode rates vs manual banding on synthetic spatial data.
Standard approach: manual banding ignores spatial autocorrelation; raw rates are noisy on thin postcodes.
Burning Cost: BYM2 smooth factors preserve spatial structure; Moran's I confirms residual autocorrelation is removed.
Takeaway: raw postcode rates are unstable on thin cells; manual banding loses information at boundaries. BYM2 borrows strength from neighbours using a proper adjacency structure.

insurance-distill
Measured: R² match between CatBoost predictions and surrogate GLM factor tables.
Standard approach: direct GLM - lower predictive performance than a GBM by construction.
Burning Cost: 90-97% R² match between GBM predictions and distilled factor tables.
Takeaway: a distilled GLM captures 90-97% of the GBM's variance in a Radar/Emblem-compatible multiplicative structure. The residual 3-10% is the honest cost of interpretability.
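The distillation idea can be illustrated generically: fit a GBM teacher, then fit a banded additive surrogate to the teacher's predictions on the log scale, so the bands multiply on the original scale. This is a minimal sketch with scikit-learn under an assumed two-factor loss-cost surface; it is not the insurance-distill API, and all variable names here are hypothetical.

```python
# Sketch: distilling a GBM into a banded factor-table surrogate
# (generic scikit-learn illustration, not the insurance-distill API).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(18, 80, n)
mileage = rng.uniform(2, 30, n)
# Hypothetical log loss-cost surface with a non-linear age effect
log_cost = (age - 45) ** 2 / 2500 + 0.03 * mileage + rng.normal(0, 0.1, n)
X = np.column_stack([age, mileage])

teacher = GradientBoostingRegressor(random_state=0).fit(X, log_cost)

# Student: additive model on banded factors, trained on the teacher's
# predictions; additivity on the log scale = multiplicative factor table.
bands = KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="quantile")
Xb = bands.fit_transform(X)
student = LinearRegression().fit(Xb, teacher.predict(X))

r2 = r2_score(teacher.predict(X), student.predict(Xb))
print(f"R2 between teacher and factor-table surrogate: {r2:.3f}")
```

The residual variance the surrogate cannot reach is exactly the interpretability cost the table above quantifies.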

Distributional and Tail Risk

5 libraries
insurance-distributional-glm (flagship)
Measured: sigma (dispersion) correlation vs the true DGP on heterogeneous-variance claims data.
Standard approach: constant-phi Gamma GLM - sigma correlation 0.000 (cannot model dispersion variation).
Burning Cost: GAMLSS sigma correlation 0.998 vs the true DGP.
Takeaway: this is the starkest result in the portfolio. When dispersion varies by risk (as it does in most motor books), a standard GLM is structurally incapable of capturing it. GAMLSS models the whole distribution, not just the mean.

insurance-distributional
Measured: log-likelihood and prediction interval calibration for distributional GBMs vs a point-estimate GBM.
Standard approach: standard point-prediction GBM - no per-risk volatility estimate.
Burning Cost: GammaGBM +1.5% log-likelihood; prediction intervals calibrated vs an uncalibrated bootstrap.
Takeaway: per-risk volatility scoring is the main gain. Beyond the 1.5% log-likelihood improvement, you get a distributional output usable for capital allocation and per-policy uncertainty scoring.

insurance-dispersion
Measured: double GLM vs constant-phi Gamma GLM on heteroscedastic claims data; per-risk volatility scoring.
Standard approach: constant-phi GLM - assumes all risks have the same dispersion.
Burning Cost: double GLM captures heteroscedasticity the constant-phi model cannot see.
Takeaway: a simpler alternative to GAMLSS for teams that want dispersion modelling without the full distributional GLM complexity. The alternating IRLS approach fits quickly and the output maps naturally to factor tables.

insurance-quantile
Measured: TVaR bias on a heavy-tailed DGP; pinball loss at small n.
Standard approach: lognormal - lower pinball loss at small n; higher TVaR bias on heavy tails.
Burning Cost: GBM - lower TVaR bias on heavy-tailed data; the lognormal beats the GBM on pinball loss at small n.
Takeaway: GBM quantile regression is the right choice for ILF curves on large portfolios with heavy tails. At small n, a parametric lognormal or Pareto is more stable - this library benchmarks both honestly.

insurance-severity
Measured: tail error reduction vs a single lognormal on a composite Lognormal-GPD DGP.
Standard approach: single lognormal - misspecified in the tail by construction.
Burning Cost: composite Lognormal-GPD reduces tail error by 5.6% vs the single lognormal.
Takeaway: 5.6% sounds modest; on a large-loss book it compounds into material reserve error. The EQRN extreme quantile network is the appropriate choice when the GPD threshold is uncertain.

Credibility and Thin Data

5 libraries
insurance-whittaker (flagship)
Measured: MSE on smoothed age relativities vs raw observed rates; REML lambda vs manual step smoothing.
Standard approach: raw observed rates - noisy, with inconsistent hand-picked lambda choices.
Burning Cost: 57.2% MSE reduction vs raw rates; REML lambda selected automatically.
Takeaway: manually chosen smoothing parameters are unreliable and non-reproducible. REML selects lambda by maximising the marginal likelihood - the same principled approach used in mixed models. A 57% MSE reduction is the payoff.

insurance-credibility (flagship)
Measured: MAE on thin scheme segments; Bühlmann-Straub credibility vs raw experience rating.
Standard approach: raw experience - high variance on thin segments; overreacts to single bad years.
Burning Cost: 6.8% MAE improvement on thin schemes vs raw experience.
Takeaway: a 6.8% MAE improvement on thin segments where raw experience routinely moves by 20-40% year-on-year is a meaningful stabilisation. The mixed-model equivalence check confirms the credibility weights are correctly derived.

insurance-multilevel
Measured: Gamma deviance and thin-group MAPE - two-stage CatBoost + REML vs a one-hot encoded GLM.
Standard approach: one-hot GLM - thin-group MAPE 66.1%; 15.9% worse deviance.
Burning Cost: 15.9% Gamma deviance reduction; thin-group MAPE 63.6% vs 66.1%.
Takeaway: one-hot encoding treats each group as independent. The two-stage approach uses REML random effects to share information across groups - the same gain as Bühlmann-Straub but applicable inside a CatBoost pipeline.

insurance-thin-data
Measured: bootstrap 90% CI width on thin-segment GLMs - GLMTransfer vs a standalone GLM.
Standard approach: standalone GLM - wide confidence intervals on <200-policy segments.
Burning Cost: 30-60% CI width reduction via GLMTransfer prior transfer from related segments.
Takeaway: narrower CIs on thin segments mean pricing decisions based on those factors are less likely to reverse at the next renewal review. The transfer approach works best when the source and target segments have similar underlying DGPs.

bayesian-pricing
Measured: hierarchical Bayesian vs raw experience on thin segments (PyMC 5).
Standard approach: raw experience - unreliable on <100 policies per segment.
Burning Cost: posterior pooling stabilises thin-segment estimates via hierarchical priors.
Takeaway: the Bayesian approach gives full posterior distributions, not point estimates - useful for risk committee presentations where uncertainty communication matters. Slower than Bühlmann-Straub; use it when you need full posteriors.
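The shrinkage mechanism behind these results can be shown with the classical Bühlmann estimator (the equal-exposure special case of Bühlmann-Straub) in pure NumPy. The loss-rate DGP and all parameter values below are assumptions for illustration; this is not the insurance-credibility API.

```python
# Sketch: classical Bühlmann credibility on thin groups
# (illustration of the estimator, not the insurance-credibility API).
import numpy as np

rng = np.random.default_rng(2)
groups, years = 50, 5
true_means = rng.gamma(shape=4, scale=25, size=groups)  # hypothetical scheme loss rates
obs = rng.normal(true_means[:, None], 80, size=(groups, years))  # noisy annual experience

group_mean = obs.mean(axis=1)
overall = group_mean.mean()
s2 = obs.var(axis=1, ddof=1).mean()                 # within-group (process) variance
a = max(group_mean.var(ddof=1) - s2 / years, 1e-9)  # between-group variance estimate
Z = years / (years + s2 / a)                        # credibility weight
cred = Z * group_mean + (1 - Z) * overall           # shrink experience toward the book

mae_raw = np.abs(group_mean - true_means).mean()
mae_cred = np.abs(cred - true_means).mean()
print(f"Z={Z:.2f}  MAE raw={mae_raw:.1f}  credibility={mae_cred:.1f}")
```

The credibility weight Z rises with the number of years of experience and falls as the process variance grows relative to the between-group variance, which is why thin, volatile segments get shrunk hardest.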

Causal Inference

2 libraries
insurance-causal (flagship)
Measured: confounding bias removal - DML vs a naive Poisson GLM on confounded data (n=50k+).
Standard approach: naive Poisson GLM - confounding bias persists; coefficient estimates are biased wherever rating variables correlate with channel or selection.
Burning Cost: DML removes non-linear confounding bias at scale (n≥50k); honestly reported, it over-partials at small n.
Takeaway: standard GLM coefficients are correlational, not causal. DML removes confounding without a structural model. At small n (<50k), DML over-partials and introduces its own bias - use it on the full portfolio, not on segment-level data.

insurance-causal-policy
Measured: CI coverage and bias on rate change evaluation - SDID vs a naive before-after comparison.
Standard approach: naive before-after - biased by +3.8pp from concurrent market inflation.
Burning Cost: SDID 98% CI coverage; isolates the rate change effect from market movement.
Takeaway: before-after comparisons of rate changes are almost always confounded by market trends. A 3.8pp inflation bias in the benchmark is typical of what teams are currently acting on. SDID with HonestDiD sensitivity bounds is the defensible alternative for FCA evidence packs.
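The DML mechanics are simple to sketch: partial the outcome and the treatment out of the confounders with a flexible learner (cross-fitted), then regress residual on residual. The data-generating process, effect size, and learner below are all assumptions for illustration; this is generic DML, not the insurance-causal API.

```python
# Sketch: double/debiased ML via cross-fitted residual-on-residual
# regression (generic DML illustration, not the insurance-causal API).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
n, theta = 5000, 1.0                        # true treatment effect (assumed)
x = rng.normal(0, 1, (n, 1))                # confounder, e.g. channel mix
t = 0.5 * x[:, 0] + rng.normal(0, 0.5, n)   # treatment correlated with confounder
y = theta * t + 2 * x[:, 0] + x[:, 0] ** 2 + rng.normal(0, 1, n)

naive = np.polyfit(t, y, 1)[0]              # plain regression of y on t: biased

rf = RandomForestRegressor(n_estimators=100, random_state=0)
ry = y - cross_val_predict(rf, x, y, cv=5)  # partial out E[y|x], cross-fitted
rt = t - cross_val_predict(rf, x, t, cv=5)  # partial out E[t|x], cross-fitted
dml = (rt @ ry) / (rt @ rt)                 # residual-on-residual slope

print(f"naive={naive:.2f}  dml={dml:.2f}  (true {theta})")
```

At small n the nuisance estimates E[y|x] and E[t|x] absorb part of the treatment signal, which is the over-partialling failure mode the table above flags.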

Fairness and Regulation

2 libraries
insurance-fairness (flagship)
Measured: proxy discrimination detection - proxy R² vs Spearman correlation on a planted postcode-to-ethnicity proxy.
Standard approach: Spearman correlation - r=0.06, fails to detect the proxy entirely.
Burning Cost: proxy R²=0.78 - catches the same proxy that Spearman misses.
Takeaway: Spearman correlation is not a valid test for proxy discrimination. A rating variable can have near-zero rank correlation with a protected characteristic yet still act as a near-perfect proxy via a non-linear relationship. This is the result that matters for FCA Consumer Duty compliance.

insurance-covariate-shift
Measured: importance-weighted evaluation after distribution shift - density ratio correction vs unweighted evaluation.
Standard approach: unweighted evaluation - performance metrics biased after a book composition change.
Burning Cost: CatBoost/RuLSIF/KLIEP density ratio correction removes the evaluation bias after shift; LR-QR conformal bounds included.
Takeaway: when a book's risk mix changes - via broker switches, scheme exits, or a marketing campaign - historic model performance statistics become misleading. Density ratio weighting corrects this before a model review misinterprets drift as deterioration.

Validation and Monitoring

6 libraries
insurance-cv
Measured: optimism in the Gini estimate - walk-forward temporal CV vs random k-fold on insurance data.
Standard approach: random k-fold - 10.5% optimism vs a true out-of-time (OOT) holdout.
Burning Cost: walk-forward CV matches the OOT holdout; eliminates future-data leakage.
Takeaway: 10.5% optimism means models look better in random k-fold than they perform on live business. Walk-forward CV, with proper IBNR buffers between folds, eliminates this. Use it as your default validation approach.

insurance-conformal (flagship)
Measured: marginal coverage on Tweedie frequency data vs a nominal 90% target.
Standard approach: bootstrap intervals - miscalibrated coverage; computationally expensive.
Burning Cost: 90.1% marginal coverage on Tweedie frequency data at 90% nominal.
Takeaway: conformal intervals give finite-sample coverage guarantees without distributional assumptions. 90.1% coverage at 90% nominal is the finite-sample guarantee working as intended. Bootstrap intervals typically over- or under-cover depending on the tail behaviour assumed.

insurance-conformal-ts
Measured: coverage on non-exchangeable claims time series - ACI/SPCI vs static split conformal.
Standard approach: static split conformal - coverage degrades on non-exchangeable time series.
Burning Cost: ACI/SPCI maintain coverage on non-exchangeable series where static methods fail.
Takeaway: standard conformal prediction assumes exchangeability - an assumption that fails on claims time series with trends, seasonality, or reporting delays. ACI and SPCI adapt the non-conformity threshold sequentially.

insurance-monitoring
Measured: false positive rate under repeated peeking - mSPRT vs a standard t-test.
Standard approach: peeking t-test - 25% FPR (5x the nominal 5% level).
Burning Cost: mSPRT holds the FPR at 1% under repeated looks (Johari et al. 2022).
Takeaway: this is the A/B testing result. Teams that check their champion/challenger results daily using a t-test have a 25% chance of declaring a false winner. The mSPRT is anytime-valid: you can look at any point without inflating the false positive rate.

insurance-deploy
Measured: champion/challenger routing, shadow mode quote logging, bootstrap LR test for winner declaration.
Standard approach: manual routing - no deterministic allocation, no audit trail.
Burning Cost: SHA-256 deterministic routing; bootstrap LR test for winner declaration; ICOBS 6B.2 audit trail.
Takeaway: the benchmark here is operational correctness rather than a performance metric. SHA-256 routing ensures the same risk always sees the same model variant. The bootstrap LR test gives a principled stopping rule for the experiment.

insurance-governance (flagship)
Measured: PRA SS1/23 validation - automated suite vs a manual checklist on an age-band-miscalibrated model.
Standard approach: manual checklist - misses age-band miscalibration that only appears in a double-lift chart.
Burning Cost: automated suite catches miscalibration that manual checklists miss; HTML/JSON output for PRA review.
Takeaway: the value is in the completeness. Manual validation checklists are selective by nature; the automated suite runs every required test on every model. The age-band miscalibration case in the benchmark is the kind of finding that appears in real PRA model reviews.
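The split conformal recipe that produces the finite-sample guarantee fits in a few lines: fit any point model, take the ⌈(n+1)(1−α)⌉-th smallest absolute residual on a held-out calibration set, and use it as a symmetric interval half-width. The linear DGP below is an assumption for illustration; this is textbook split conformal, not the insurance-conformal API.

```python
# Sketch: split conformal prediction intervals with a finite-sample
# coverage guarantee (numpy only; not the insurance-conformal API).
import numpy as np

rng = np.random.default_rng(5)
n_fit, n_cal, n_test, alpha = 1000, 500, 500, 0.10
x = rng.uniform(0, 10, n_fit + n_cal + n_test)
y = 2.0 * x + rng.normal(0, 1.5, x.size)

fit, cal, test = np.split(np.arange(x.size), [n_fit, n_fit + n_cal])
coef = np.polyfit(x[fit], y[fit], 1)   # any point model works here
pred = np.polyval(coef, x)

# Conformal quantile of absolute calibration residuals
scores = np.abs(y[cal] - pred[cal])
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]

covered = np.abs(y[test] - pred[test]) <= q
print(f"target {1 - alpha:.0%}, empirical coverage {covered.mean():.1%}")
```

The guarantee needs exchangeability between calibration and test points, which is exactly the assumption the ACI/SPCI entry above relaxes for time series.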

Optimisation and Pricing Strategy

1 library
insurance-optimise
Measured: profit lift vs a flat loading on synthetic demand curve data.
Standard approach: flat technical loading - treats all risks as equally price-elastic.
Burning Cost: +143.8% profit lift over flat loading via demand-curve-aware pricing.
Takeaway: the 143.8% profit lift is the upper bound when demand curves are perfectly estimated - real-world gains are lower and depend on elasticity estimation quality. The benchmark establishes the ceiling. The ParetoFrontier component trades profit against retention and fairness, which is the decision most pricing committees actually face.
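The core trade-off is that expected profit is conversion probability times margin, so the optimal price balances the two rather than applying a fixed loading. The logistic demand curve, cost, and loading below are assumptions for illustration; this is the generic idea, not the insurance-optimise API.

```python
# Sketch: demand-curve-aware price optimisation vs a flat loading on a
# hypothetical logistic demand curve (not the insurance-optimise API).
import numpy as np

cost = 400.0                                 # expected loss cost per policy

def demand(price):                           # assumed conversion curve
    return 1.0 / (1.0 + np.exp(0.01 * (price - 500.0)))

def expected_profit(price):                  # conversion x margin
    return demand(price) * (price - cost)

flat_price = cost * 1.10                     # flat 10% technical loading
grid = np.linspace(cost, cost * 2.0, 401)
best_price = grid[np.argmax(expected_profit(grid))]

print(f"flat loading: profit {expected_profit(flat_price):.1f} at {flat_price:.0f}; "
      f"optimised: profit {expected_profit(best_price):.1f} at {best_price:.0f}")
```

With a perfectly known demand curve the lift over the flat loading is large, which is why the benchmark result above should be read as a ceiling rather than an expected production gain.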

Time Series and Trends

2 libraries
insurance-trend
Measured: MAPE on loss cost trend vs naive OLS; structural break detection.
Standard approach: naive OLS trend - 3.93pp higher MAPE; no structural break detection.
Burning Cost: 3.93pp MAPE improvement over naive OLS; BOCPD/PELT detects structural breaks.
Takeaway: OLS trend fitting on loss costs ignores the non-stationarity that characterises post-2020 UK motor data. The MAPE improvement comes from proper frequency/severity decomposition and structural break testing - not just a more complex model.

insurance-dynamics
Measured: MAE on dynamic frequency - GAS Poisson filter vs a static GLM trend.
Standard approach: static GLM trend - cannot track within-year frequency movements.
Burning Cost: GAS Poisson filter - 13% MAE improvement over the static GLM trend.
Takeaway: a 13% MAE improvement on frequency tracking means earned premium projections based on GAS filters are materially more accurate than static trend assumptions. Bayesian changepoint detection handles the regime shifts that GAS filters smooth through.
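A minimal score-driven update gives the flavour of the GAS idea: for a Poisson observation with identity link, the inverse-Fisher-information-scaled score is y − λ, so a simple score-driven recursion updates the intensity by a fraction of that surprise. The regime-shift series and the learning rate below are assumptions for illustration; this is a minimal sketch of the mechanism, not the insurance-dynamics filter.

```python
# Sketch: a minimal score-driven (GAS-style) Poisson intensity update vs
# a static mean on a series with a regime shift (not insurance-dynamics).
import numpy as np

rng = np.random.default_rng(6)
true_lam = np.concatenate([np.full(100, 5.0), np.full(100, 10.0)])
counts = rng.poisson(true_lam)

alpha, lam = 0.2, counts[:20].mean()
filtered = np.empty_like(true_lam)
for t, y in enumerate(counts):
    filtered[t] = lam
    lam = lam + alpha * (y - lam)  # scaled-score update: fraction of surprise

static = np.full_like(true_lam, counts.mean())  # static trend analogue
mae_gas = np.abs(filtered - true_lam).mean()
mae_static = np.abs(static - true_lam).mean()
print(f"MAE: score-driven filter {mae_gas:.2f} vs static mean {mae_static:.2f}")
```

The static estimate averages across the regime shift and is wrong on both sides of it; the filter re-centres within a few periods of the break.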

Other Libraries

9 libraries
insurance-telematics (flagship)
Measured: Gini improvement - HMM latent-state features vs raw trip aggregates in a Poisson GLM.
Standard approach: raw trip averages (mean speed, hard braking counts) - lower Gini baseline.
Burning Cost: 3-8pp Gini improvement from HMM state features over raw trip averages.
Takeaway: raw trip statistics conflate driving contexts - a hard brake at 30mph is not the same risk signal as one at 70mph. HMM latent states separate driving contexts before aggregation, which is why the Gini improvement is consistent across segments.

insurance-survival
Measured: cure fraction recovery - cure model vs KM/Cox PH extrapolation on long-tailed lapse data.
Standard approach: KM/Cox - extrapolates to zero, overestimating ultimate lapse for low-risk segments.
Burning Cost: cure model recovers a 34.1% cure fraction (true DGP: 35.0%); KM/Cox misses this entirely.
Takeaway: for motor customers with 10+ years no-claims, lapse probability effectively reaches a floor - not zero. KM and Cox PH extrapolate past this floor and overstate CLV for the most loyal segment. The cure model is the correct specification.

insurance-synthetic
Measured: correlation preservation - vine copula synthetic generation vs naive independent sampling.
Standard approach: naive sampling - ignores the multivariate dependence structure.
Burning Cost: 64% better correlation preservation vs naive independent generation.
Takeaway: synthetic portfolios that ignore dependence structure produce datasets where the risk segmentation looks right but the portfolio-level behaviour is wrong. The 64% improvement in correlation preservation makes synthetic stress tests materially more realistic.

insurance-datasets
Measured: parameter recovery RMSE on a synthetic UK motor DGP; omitted variable bias demo.
Standard approach: no standard baseline - this is a data library, not a model.
Burning Cost: GLM parameter recovery RMSE 0.069; OVB demo shows 24% NCD inflation when age is omitted.
Takeaway: the 24% NCD inflation figure from the OVB demo matches the kind of bias we see when actuaries fit NCD factors on books where driver age is poorly captured. Use this library to validate any new method before running it on real data.

insurance-multilevel
Covered above under Credibility and Thin Data; its benchmark results are reported in that section.
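The omitted-variable-bias effect is easy to reproduce generically: when NCD is correlated with age and age is dropped, the NCD coefficient absorbs part of the age effect. The correlation structure and coefficients below are assumptions for illustration, not the insurance-datasets DGP.

```python
# Sketch: omitted-variable bias in a Poisson GLM when a correlated rating
# factor is dropped (generic illustration, not the insurance-datasets DGP).
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(7)
n = 20000
age = rng.normal(0, 1, n)                     # standardised driver age
ncd = 0.6 * age + rng.normal(0, 0.8, n)       # NCD correlated with age
lam = np.exp(-2.0 + 0.30 * ncd + 0.40 * age)  # true log-linear frequencies
claims = rng.poisson(lam)

full = PoissonRegressor(alpha=0.0).fit(np.column_stack([ncd, age]), claims)
short = PoissonRegressor(alpha=0.0).fit(ncd.reshape(-1, 1), claims)
print(f"true NCD coef 0.30, full model {full.coef_[0]:.2f}, "
      f"age omitted {short.coef_[0]:.2f}")
```

The inflated short-model coefficient is the same mechanism behind the 24% NCD inflation figure reported above.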

Note: insurance-multilevel appears in the tools page under Credibility; its benchmark results are in that section above. Four libraries - insurance-glm-tools, insurance-spatial, insurance-deploy, and bayesian-pricing - have qualitative benchmark results rather than single headline numbers because the relevant comparisons are structural (correctness, coverage, output format) rather than metric-based.


Methodology

All benchmarks were run on synthetic data with a known data-generating process (DGP). Using synthetic data means we can verify that a method recovers the true parameters - something impossible with real data where the ground truth is unknown. The DGPs are calibrated to resemble UK motor insurance portfolios: Poisson frequencies with realistic base rates (5-12%), Gamma severities, log-linear relativities, and correlation structures consistent with published industry data.
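A minimal version of this kind of DGP, with a parameter recovery check, can be sketched as follows. The base rate, coefficients, and severity parameters are assumed values in the ranges described above, not the exact benchmark calibration.

```python
# Sketch of a known-DGP benchmark: Poisson frequency with log-linear
# relativities, Gamma severity, and a GLM parameter recovery check
# (an assumed parameterisation, not the exact benchmark DGP).
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(8)
n = 50000
X = rng.normal(0, 1, (n, 2))                 # standardised rating factors
beta = np.array([0.25, -0.15])               # planted log relativities
freq = np.exp(np.log(0.08) + X @ beta)       # ~8% base frequency
claims = rng.poisson(freq)
severity = rng.gamma(shape=2.0, scale=1500.0, size=n)  # mean 3000 per claim

# Because the DGP is known, recovery error is directly measurable:
glm = PoissonRegressor(alpha=0.0).fit(X, claims)
rmse = np.sqrt(np.mean((glm.coef_ - beta) ** 2))
print(f"parameter recovery RMSE: {rmse:.3f}")
```

With real data this check is impossible: there is no planted beta to compare against, which is the point made above.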

Benchmarks were run as Databricks notebooks in the burning-cost-examples repository. Each notebook installs its own dependencies, generates data inline, fits models, and computes comparison metrics. They run on Databricks serverless compute (Free Edition) - no cluster configuration required. To reproduce a result: import the relevant notebook, run all cells, and the comparison table is the final output.

Where a library honestly underperforms in some scenario - shap-relativities on relativity accuracy, insurance-quantile on pinball loss at small n - we report it. A benchmark that only shows wins is not a benchmark; it is marketing. The honest results are the ones worth reading.

Numbers reported are point estimates from a single benchmark run on a fixed random seed. We do not report confidence intervals around the benchmark comparisons themselves, though most notebooks include sensitivity checks across multiple seeds. If you need replication code or want to run a modified DGP, the notebooks are the starting point.