Every library in the Burning Cost portfolio has been benchmarked against a standard baseline on synthetic insurance data with a known data-generating process. This page aggregates those results. The pattern is consistent: specialised methods beat generic approaches on insurance problems, and where they don't, we say so.

34 libraries benchmarked
35 Databricks notebooks
MIT licence (all libraries)
Python 3.10+ runtime requirement

Model Building

7 libraries
shap-relativities
Measured: Gini lift vs Poisson GLM; relativity accuracy vs true DGP.
Standard approach: GLM - 4.5% mean relativity error; lower Gini baseline.
Burning Cost: +2.85pp Gini lift; 9.4% relativity error.
Takeaway: GBM wins on discrimination; GLM wins on factor accuracy. An honest trade-off: use shap-relativities for ranking, not when you need the coefficients to be actuarially exact.

insurance-glm-tools
Measured: nested GLM embeddings for 500 vehicle makes vs dummy-coded GLM; R2VF fused lasso clustering vs manual quintile banding.
Standard approach: dummy-coded GLM - overfits on high-cardinality vehicle makes.
Burning Cost: nested embeddings reduce overfitting; R2VF clustering eliminates arbitrary band choices.
Takeaway: a dummy-coded GLM becomes unreliable above ~50 factor levels. Nested embeddings give consistent coefficient estimates at 500+ levels.

insurance-gam (flagship)
Measured: EBM and Neural Additive Model vs Poisson GLM on synthetic data with planted non-linear effects; exact Shapley values vs approximation.
Standard approach: Poisson GLM misses the planted non-linear age-mileage interaction.
Burning Cost: EBM/NAM recover the planted non-linear effects; shape functions are directly interpretable as factor tables.
Takeaway: where a GLM forces linearity, an EBM fits the true shape. The transparency cost vs a GBM is zero - you get exact Shapley values and shape functions instead of approximate attribution.

insurance-interactions
Measured: CANN/NID interaction detection vs exhaustive pairwise GLM search on planted interactions.
Standard approach: exhaustive pairwise GLM - slow, and misses non-linear interactions.
Burning Cost: production defaults recover both planted interactions; the compact config is less reliable on weak signals.
Takeaway: CANN/NID is reliable when planted effects are of meaningful size. On weak interactions (<5% deviance contribution), SHAP interaction values are more stable than NID scores.

insurance-frequency-severity (flagship)
Measured: Sarmanov copula joint model vs the independence assumption; premium error under dependence.
Standard approach: independence assumption - premium bias wherever frequency and severity are correlated.
Burning Cost: analytical premium correction removes the dependence bias; IFM estimation with dependence tests.
Takeaway: most pricing models assume frequency and severity are independent. Where they aren't - typically in high-mileage or commercial segments - the independence assumption inflates expected loss cost. This library tests and corrects for it.

insurance-spatial
Measured: BYM2 territory factors vs raw postcode rates vs manual banding on synthetic spatial data.
Standard approach: manual banding ignores spatial autocorrelation; raw rates are noisy on thin postcodes.
Burning Cost: BYM2 smooth factors preserve spatial structure; Moran's I confirms residual autocorrelation is removed.
Takeaway: raw postcode rates are unstable on thin cells; manual banding loses information at boundaries. BYM2 borrows strength from neighbours using a proper adjacency structure.

insurance-distill
Measured: R² match between CatBoost predictions and surrogate GLM factor tables.
Standard approach: direct GLM - lower predictive performance than a GBM by construction.
Burning Cost: 90-97% R² match between GBM predictions and distilled factor tables.
Takeaway: a distilled GLM captures 90-97% of the GBM's variance in a Radar/Emblem-compatible multiplicative structure. The residual 3-10% is the honest cost of interpretability.
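The distillation idea can be illustrated generically: fit a GBM teacher, then fit a banded additive surrogate to the teacher's predictions on the log scale, so the bands multiply on the original scale. This is a minimal sketch with scikit-learn under an assumed two-factor loss-cost surface; it is not the insurance-distill API, and all variable names here are hypothetical.

```python
# Sketch: distilling a GBM into a banded factor-table surrogate
# (generic scikit-learn illustration, not the insurance-distill API).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(18, 80, n)
mileage = rng.uniform(2, 30, n)
# Hypothetical log loss-cost surface with a non-linear age effect
log_cost = (age - 45) ** 2 / 2500 + 0.03 * mileage + rng.normal(0, 0.1, n)
X = np.column_stack([age, mileage])

teacher = GradientBoostingRegressor(random_state=0).fit(X, log_cost)

# Student: additive model on banded factors, trained on the teacher's
# predictions; additivity on the log scale = multiplicative factor table.
bands = KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="quantile")
Xb = bands.fit_transform(X)
student = LinearRegression().fit(Xb, teacher.predict(X))

r2 = r2_score(teacher.predict(X), student.predict(Xb))
print(f"R2 between teacher and factor-table surrogate: {r2:.3f}")
```

The residual variance the surrogate cannot reach is exactly the interpretability cost the table above quantifies.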

Distributional and Tail Risk

5 libraries
insurance-distributional-glm (flagship)
Measured: sigma (dispersion) correlation vs the true DGP on heterogeneous-variance claims data.
Standard approach: constant-phi Gamma GLM - sigma correlation 0.000 (cannot model dispersion variation).
Burning Cost: GAMLSS sigma correlation 0.998 vs the true DGP.
Takeaway: this is the starkest result in the portfolio. When dispersion varies by risk (as it does in most motor books), a standard GLM is structurally incapable of capturing it. GAMLSS models the whole distribution, not just the mean.

insurance-distributional
Measured: log-likelihood and prediction interval calibration for distributional GBMs vs a point-estimate GBM.
Standard approach: standard point-prediction GBM - no per-risk volatility estimate.
Burning Cost: GammaGBM +1.5% log-likelihood; prediction intervals calibrated vs an uncalibrated bootstrap.
Takeaway: per-risk volatility scoring is the main gain. Beyond the 1.5% log-likelihood improvement, you get a distributional output usable for capital allocation and per-policy uncertainty scoring.

insurance-dispersion
Measured: double GLM vs constant-phi Gamma GLM on heteroscedastic claims data; per-risk volatility scoring.
Standard approach: constant-phi GLM - assumes all risks have the same dispersion.
Burning Cost: double GLM captures heteroscedasticity the constant-phi model cannot see.
Takeaway: a simpler alternative to GAMLSS for teams that want dispersion modelling without the full distributional GLM complexity. The alternating IRLS approach fits quickly and the output maps naturally to factor tables.

insurance-quantile
Measured: TVaR bias on a heavy-tailed DGP; pinball loss at small n.
Standard approach: lognormal - lower pinball loss at small n; higher TVaR bias on heavy tails.
Burning Cost: GBM - lower TVaR bias on heavy-tailed data; the lognormal beats the GBM on pinball loss at small n.
Takeaway: GBM quantile regression is the right choice for ILF curves on large portfolios with heavy tails. At small n, a parametric lognormal or Pareto is more stable - this library benchmarks both honestly.

insurance-severity
Measured: tail error reduction vs a single lognormal on a composite Lognormal-GPD DGP.
Standard approach: single lognormal - misspecified in the tail by construction.
Burning Cost: composite Lognormal-GPD reduces tail error by 5.6% vs the single lognormal.
Takeaway: 5.6% sounds modest; on a large-loss book it compounds into material reserve error. The EQRN extreme quantile network is the appropriate choice when the GPD threshold is uncertain.

Credibility and Thin Data

5 libraries
insurance-whittaker (flagship)
Measured: MSE on smoothed age relativities vs raw observed rates; REML lambda vs manual step smoothing.
Standard approach: raw observed rates - noisy, with inconsistent hand-picked lambda choices.
Burning Cost: 57.2% MSE reduction vs raw rates; REML lambda selected automatically.
Takeaway: manually chosen smoothing parameters are unreliable and non-reproducible. REML selects lambda by maximising the marginal likelihood - the same principled approach used in mixed models. A 57% MSE reduction is the payoff.

insurance-credibility (flagship)
Measured: MAE on thin scheme segments; Bühlmann-Straub credibility vs raw experience rating.
Standard approach: raw experience - high variance on thin segments; overreacts to single bad years.
Burning Cost: 6.8% MAE improvement on thin schemes vs raw experience.
Takeaway: a 6.8% MAE improvement on thin segments where raw experience routinely moves by 20-40% year-on-year is a meaningful stabilisation. The mixed-model equivalence check confirms the credibility weights are correctly derived.

insurance-multilevel
Measured: Gamma deviance and thin-group MAPE - two-stage CatBoost + REML vs a one-hot encoded GLM.
Standard approach: one-hot GLM - thin-group MAPE 66.1%; 15.9% worse deviance.
Burning Cost: 15.9% Gamma deviance reduction; thin-group MAPE 63.6% vs 66.1%.
Takeaway: one-hot encoding treats each group as independent. The two-stage approach uses REML random effects to share information across groups - the same gain as Bühlmann-Straub but applicable inside a CatBoost pipeline.

insurance-thin-data
Measured: bootstrap 90% CI width on thin-segment GLMs - GLMTransfer vs a standalone GLM.
Standard approach: standalone GLM - wide confidence intervals on <200-policy segments.
Burning Cost: 30-60% CI width reduction via GLMTransfer prior transfer from related segments.
Takeaway: narrower CIs on thin segments mean pricing decisions based on those factors are less likely to reverse at the next renewal review. The transfer approach works best when the source and target segments have similar underlying DGPs.

bayesian-pricing
Measured: hierarchical Bayesian vs raw experience on thin segments (PyMC 5).
Standard approach: raw experience - unreliable on <100 policies per segment.
Burning Cost: posterior pooling stabilises thin-segment estimates via hierarchical priors.
Takeaway: the Bayesian approach gives full posterior distributions, not point estimates - useful for risk committee presentations where uncertainty communication matters. Slower than Bühlmann-Straub; use it when you need full posteriors.
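The shrinkage mechanism behind these results can be shown with the classical Bühlmann estimator (the equal-exposure special case of Bühlmann-Straub) in pure NumPy. The loss-rate DGP and all parameter values below are assumptions for illustration; this is not the insurance-credibility API.

```python
# Sketch: classical Bühlmann credibility on thin groups
# (illustration of the estimator, not the insurance-credibility API).
import numpy as np

rng = np.random.default_rng(2)
groups, years = 50, 5
true_means = rng.gamma(shape=4, scale=25, size=groups)  # hypothetical scheme loss rates
obs = rng.normal(true_means[:, None], 80, size=(groups, years))  # noisy annual experience

group_mean = obs.mean(axis=1)
overall = group_mean.mean()
s2 = obs.var(axis=1, ddof=1).mean()                 # within-group (process) variance
a = max(group_mean.var(ddof=1) - s2 / years, 1e-9)  # between-group variance estimate
Z = years / (years + s2 / a)                        # credibility weight
cred = Z * group_mean + (1 - Z) * overall           # shrink experience toward the book

mae_raw = np.abs(group_mean - true_means).mean()
mae_cred = np.abs(cred - true_means).mean()
print(f"Z={Z:.2f}  MAE raw={mae_raw:.1f}  credibility={mae_cred:.1f}")
```

The credibility weight Z rises with the number of years of experience and falls as the process variance grows relative to the between-group variance, which is why thin, volatile segments get shrunk hardest.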

Causal Inference

2 libraries
insurance-causal (flagship)
Measured: confounding bias removal - DML vs a naive Poisson GLM on confounded data (n=50k+).
Standard approach: naive Poisson GLM - confounding bias persists; coefficient estimates are biased wherever rating variables correlate with channel or selection.
Burning Cost: DML removes non-linear confounding bias at scale (n≥50k); honestly reported, it over-partials at small n.
Takeaway: standard GLM coefficients are correlational, not causal. DML removes confounding without a structural model. At small n (<50k), DML over-partials and introduces its own bias - use it on the full portfolio, not on segment-level data.

insurance-causal-policy
Measured: CI coverage and bias on rate change evaluation - SDID vs a naive before-after comparison.
Standard approach: naive before-after - biased by +3.8pp from concurrent market inflation.
Burning Cost: SDID 98% CI coverage; isolates the rate change effect from market movement.
Takeaway: before-after comparisons of rate changes are almost always confounded by market trends. A 3.8pp inflation bias in the benchmark is typical of what teams are currently acting on. SDID with HonestDiD sensitivity bounds is the defensible alternative for FCA evidence packs.
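The DML mechanics are simple to sketch: partial the outcome and the treatment out of the confounders with a flexible learner (cross-fitted), then regress residual on residual. The data-generating process, effect size, and learner below are all assumptions for illustration; this is generic DML, not the insurance-causal API.

```python
# Sketch: double/debiased ML via cross-fitted residual-on-residual
# regression (generic DML illustration, not the insurance-causal API).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
n, theta = 5000, 1.0                        # true treatment effect (assumed)
x = rng.normal(0, 1, (n, 1))                # confounder, e.g. channel mix
t = 0.5 * x[:, 0] + rng.normal(0, 0.5, n)   # treatment correlated with confounder
y = theta * t + 2 * x[:, 0] + x[:, 0] ** 2 + rng.normal(0, 1, n)

naive = np.polyfit(t, y, 1)[0]              # plain regression of y on t: biased

rf = RandomForestRegressor(n_estimators=100, random_state=0)
ry = y - cross_val_predict(rf, x, y, cv=5)  # partial out E[y|x], cross-fitted
rt = t - cross_val_predict(rf, x, t, cv=5)  # partial out E[t|x], cross-fitted
dml = (rt @ ry) / (rt @ rt)                 # residual-on-residual slope

print(f"naive={naive:.2f}  dml={dml:.2f}  (true {theta})")
```

At small n the nuisance estimates E[y|x] and E[t|x] absorb part of the treatment signal, which is the over-partialling failure mode the table above flags.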

Fairness and Regulation

2 libraries
insurance-fairness (flagship)
Measured: proxy discrimination detection - proxy R² vs Spearman correlation on a planted postcode-to-ethnicity proxy.
Standard approach: Spearman correlation - r=0.06, fails to detect the proxy entirely.
Burning Cost: proxy R²=0.78 - catches the same proxy that Spearman misses.
Takeaway: Spearman correlation is not a valid test for proxy discrimination. A rating variable can have near-zero rank correlation with a protected characteristic yet still act as a near-perfect proxy via a non-linear relationship. This is the result that matters for FCA Consumer Duty compliance.

insurance-covariate-shift
Measured: importance-weighted evaluation after distribution shift - density ratio correction vs unweighted evaluation.
Standard approach: unweighted evaluation - performance metrics biased after a book composition change.
Burning Cost: CatBoost/RuLSIF/KLIEP density ratio correction removes the evaluation bias after shift; LR-QR conformal bounds included.
Takeaway: when a book's risk mix changes - via broker switches, scheme exits, or a marketing campaign - historic model performance statistics become misleading. Density ratio weighting corrects this before a model review misinterprets drift as deterioration.

Validation and Monitoring

6 libraries
insurance-cv
Measured: optimism in the Gini estimate - walk-forward temporal CV vs random k-fold on insurance data.
Standard approach: random k-fold - 10.5% optimism vs a true out-of-time (OOT) holdout.
Burning Cost: walk-forward CV matches the OOT holdout; eliminates future-data leakage.
Takeaway: 10.5% optimism means models look better in random k-fold than they perform on live business. Walk-forward CV, with proper IBNR buffers between folds, eliminates this. Use it as your default validation approach.

insurance-conformal (flagship)
Measured: marginal coverage on Tweedie frequency data vs a nominal 90% target.
Standard approach: bootstrap intervals - miscalibrated coverage; computationally expensive.
Burning Cost: 90.1% marginal coverage on Tweedie frequency data at 90% nominal.
Takeaway: conformal intervals give finite-sample coverage guarantees without distributional assumptions. 90.1% coverage at 90% nominal is the finite-sample guarantee working as intended. Bootstrap intervals typically over- or under-cover depending on the tail behaviour assumed.

insurance-conformal-ts
Measured: coverage on non-exchangeable claims time series - ACI/SPCI vs static split conformal.
Standard approach: static split conformal - coverage degrades on non-exchangeable time series.
Burning Cost: ACI/SPCI maintain coverage on non-exchangeable series where static methods fail.
Takeaway: standard conformal prediction assumes exchangeability - an assumption that fails on claims time series with trends, seasonality, or reporting delays. ACI and SPCI adapt the non-conformity threshold sequentially.

insurance-monitoring
Measured: false positive rate under repeated peeking - mSPRT vs a standard t-test.
Standard approach: peeking t-test - 25% FPR (5x the nominal 5% level).
Burning Cost: mSPRT holds the FPR at 1% under repeated looks (Johari et al. 2022).
Takeaway: this is the A/B testing result. Teams that check their champion/challenger results daily using a t-test have a 25% chance of declaring a false winner. The mSPRT is anytime-valid: you can look at any point without inflating the false positive rate.

insurance-deploy
Measured: champion/challenger routing, shadow mode quote logging, bootstrap LR test for winner declaration.
Standard approach: manual routing - no deterministic allocation, no audit trail.
Burning Cost: SHA-256 deterministic routing; bootstrap LR test for winner declaration; ICOBS 6B.2 audit trail.
Takeaway: the benchmark here is operational correctness rather than a performance metric. SHA-256 routing ensures the same risk always sees the same model variant. The bootstrap LR test gives a principled stopping rule for the experiment.

insurance-governance (flagship)
Measured: PRA SS1/23 validation - automated suite vs a manual checklist on an age-band-miscalibrated model.
Standard approach: manual checklist - misses age-band miscalibration that only appears in a double-lift chart.
Burning Cost: automated suite catches miscalibration that manual checklists miss; HTML/JSON output for PRA review.
Takeaway: the value is in the completeness. Manual validation checklists are selective by nature; the automated suite runs every required test on every model. The age-band miscalibration case in the benchmark is the kind of finding that appears in real PRA model reviews.
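The split conformal recipe that produces the finite-sample guarantee fits in a few lines: fit any point model, take the ⌈(n+1)(1−α)⌉-th smallest absolute residual on a held-out calibration set, and use it as a symmetric interval half-width. The linear DGP below is an assumption for illustration; this is textbook split conformal, not the insurance-conformal API.

```python
# Sketch: split conformal prediction intervals with a finite-sample
# coverage guarantee (numpy only; not the insurance-conformal API).
import numpy as np

rng = np.random.default_rng(5)
n_fit, n_cal, n_test, alpha = 1000, 500, 500, 0.10
x = rng.uniform(0, 10, n_fit + n_cal + n_test)
y = 2.0 * x + rng.normal(0, 1.5, x.size)

fit, cal, test = np.split(np.arange(x.size), [n_fit, n_fit + n_cal])
coef = np.polyfit(x[fit], y[fit], 1)   # any point model works here
pred = np.polyval(coef, x)

# Conformal quantile of absolute calibration residuals
scores = np.abs(y[cal] - pred[cal])
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]

covered = np.abs(y[test] - pred[test]) <= q
print(f"target {1 - alpha:.0%}, empirical coverage {covered.mean():.1%}")
```

The guarantee needs exchangeability between calibration and test points, which is exactly the assumption the ACI/SPCI entry above relaxes for time series.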

Optimisation and Pricing Strategy

1 library
insurance-optimise
Measured: profit lift vs a flat loading on synthetic demand curve data.
Standard approach: flat technical loading - treats all risks as equally price-elastic.
Burning Cost: +143.8% profit lift over flat loading via demand-curve-aware pricing.
Takeaway: the 143.8% profit lift is the upper bound when demand curves are perfectly estimated - real-world gains are lower and depend on elasticity estimation quality. The benchmark establishes the ceiling. The ParetoFrontier component trades profit against retention and fairness, which is the decision most pricing committees actually face.
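The core trade-off is that expected profit is conversion probability times margin, so the optimal price balances the two rather than applying a fixed loading. The logistic demand curve, cost, and loading below are assumptions for illustration; this is the generic idea, not the insurance-optimise API.

```python
# Sketch: demand-curve-aware price optimisation vs a flat loading on a
# hypothetical logistic demand curve (not the insurance-optimise API).
import numpy as np

cost = 400.0                                 # expected loss cost per policy

def demand(price):                           # assumed conversion curve
    return 1.0 / (1.0 + np.exp(0.01 * (price - 500.0)))

def expected_profit(price):                  # conversion x margin
    return demand(price) * (price - cost)

flat_price = cost * 1.10                     # flat 10% technical loading
grid = np.linspace(cost, cost * 2.0, 401)
best_price = grid[np.argmax(expected_profit(grid))]

print(f"flat loading: profit {expected_profit(flat_price):.1f} at {flat_price:.0f}; "
      f"optimised: profit {expected_profit(best_price):.1f} at {best_price:.0f}")
```

With a perfectly known demand curve the lift over the flat loading is large, which is why the benchmark result above should be read as a ceiling rather than an expected production gain.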

Time Series and Trends

2 libraries
insurance-trend
Measured: MAPE on loss cost trend vs naive OLS; structural break detection.
Standard approach: naive OLS trend - 3.93pp higher MAPE; no structural break detection.
Burning Cost: 3.93pp MAPE improvement over naive OLS; BOCPD/PELT detects structural breaks.
Takeaway: OLS trend fitting on loss costs ignores the non-stationarity that characterises post-2020 UK motor data. The MAPE improvement comes from proper frequency/severity decomposition and structural break testing - not just a more complex model.

insurance-dynamics
Measured: MAE on dynamic frequency - GAS Poisson filter vs a static GLM trend.
Standard approach: static GLM trend - cannot track within-year frequency movements.
Burning Cost: GAS Poisson filter - 13% MAE improvement over the static GLM trend.
Takeaway: a 13% MAE improvement on frequency tracking means earned premium projections based on GAS filters are materially more accurate than static trend assumptions. Bayesian changepoint detection handles the regime shifts that GAS filters smooth through.
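A minimal score-driven update gives the flavour of the GAS idea: for a Poisson observation with identity link, the inverse-Fisher-information-scaled score is y − λ, so a simple score-driven recursion updates the intensity by a fraction of that surprise. The regime-shift series and the learning rate below are assumptions for illustration; this is a minimal sketch of the mechanism, not the insurance-dynamics filter.

```python
# Sketch: a minimal score-driven (GAS-style) Poisson intensity update vs
# a static mean on a series with a regime shift (not insurance-dynamics).
import numpy as np

rng = np.random.default_rng(6)
true_lam = np.concatenate([np.full(100, 5.0), np.full(100, 10.0)])
counts = rng.poisson(true_lam)

alpha, lam = 0.2, counts[:20].mean()
filtered = np.empty_like(true_lam)
for t, y in enumerate(counts):
    filtered[t] = lam
    lam = lam + alpha * (y - lam)  # scaled-score update: fraction of surprise

static = np.full_like(true_lam, counts.mean())  # static trend analogue
mae_gas = np.abs(filtered - true_lam).mean()
mae_static = np.abs(static - true_lam).mean()
print(f"MAE: score-driven filter {mae_gas:.2f} vs static mean {mae_static:.2f}")
```

The static estimate averages across the regime shift and is wrong on both sides of it; the filter re-centres within a few periods of the break.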

Other Libraries

9 libraries
insurance-telematics (flagship)
Measured: Gini improvement - HMM latent-state features vs raw trip aggregates in a Poisson GLM.
Standard approach: raw trip averages (mean speed, hard braking counts) - lower Gini baseline.
Burning Cost: 3-8pp Gini improvement from HMM state features over raw trip averages.
Takeaway: raw trip statistics conflate driving contexts - a hard brake at 30mph is not the same risk signal as one at 70mph. HMM latent states separate driving contexts before aggregation, which is why the Gini improvement is consistent across segments.

insurance-survival
Measured: cure fraction recovery - cure model vs KM/Cox PH extrapolation on long-tailed lapse data.
Standard approach: KM/Cox - extrapolates to zero, overestimating ultimate lapse for low-risk segments.
Burning Cost: cure model recovers a 34.1% cure fraction (true DGP: 35.0%); KM/Cox misses this entirely.
Takeaway: for motor customers with 10+ years no-claims, lapse probability effectively reaches a floor - not zero. KM and Cox PH extrapolate past this floor and overstate CLV for the most loyal segment. The cure model is the correct specification.

insurance-synthetic
Measured: correlation preservation - vine copula synthetic generation vs naive independent sampling.
Standard approach: naive sampling - ignores the multivariate dependence structure.
Burning Cost: 64% better correlation preservation vs naive independent generation.
Takeaway: synthetic portfolios that ignore dependence structure produce datasets where the risk segmentation looks right but the portfolio-level behaviour is wrong. The 64% improvement in correlation preservation makes synthetic stress tests materially more realistic.

insurance-datasets
Measured: parameter recovery RMSE on a synthetic UK motor DGP; omitted variable bias demo.
Standard approach: no standard baseline - this is a data library, not a model.
Burning Cost: GLM parameter recovery RMSE 0.069; OVB demo shows 24% NCD inflation when age is omitted.
Takeaway: the 24% NCD inflation figure from the OVB demo matches the kind of bias we see when actuaries fit NCD factors on books where driver age is poorly captured. Use this library to validate any new method before running it on real data.

insurance-multilevel
Covered above under Credibility and Thin Data; its benchmark results are reported in that section.
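The omitted-variable-bias effect is easy to reproduce generically: when NCD is correlated with age and age is dropped, the NCD coefficient absorbs part of the age effect. The correlation structure and coefficients below are assumptions for illustration, not the insurance-datasets DGP.

```python
# Sketch: omitted-variable bias in a Poisson GLM when a correlated rating
# factor is dropped (generic illustration, not the insurance-datasets DGP).
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(7)
n = 20000
age = rng.normal(0, 1, n)                     # standardised driver age
ncd = 0.6 * age + rng.normal(0, 0.8, n)       # NCD correlated with age
lam = np.exp(-2.0 + 0.30 * ncd + 0.40 * age)  # true log-linear frequencies
claims = rng.poisson(lam)

full = PoissonRegressor(alpha=0.0).fit(np.column_stack([ncd, age]), claims)
short = PoissonRegressor(alpha=0.0).fit(ncd.reshape(-1, 1), claims)
print(f"true NCD coef 0.30, full model {full.coef_[0]:.2f}, "
      f"age omitted {short.coef_[0]:.2f}")
```

The inflated short-model coefficient is the same mechanism behind the 24% NCD inflation figure reported above.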

Note: insurance-multilevel appears in the tools page under Credibility; its benchmark results are in that section above. Four libraries - insurance-glm-tools, insurance-spatial, insurance-deploy, and bayesian-pricing - have qualitative benchmark results rather than single headline numbers because the relevant comparisons are structural (correctness, coverage, output format) rather than metric-based.


Methodology

All benchmarks were run on synthetic data with a known data-generating process (DGP). Using synthetic data means we can verify that a method recovers the true parameters - something impossible with real data where the ground truth is unknown. The DGPs are calibrated to resemble UK motor insurance portfolios: Poisson frequencies with realistic base rates (5-12%), Gamma severities, log-linear relativities, and correlation structures consistent with published industry data.
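A minimal version of this kind of DGP, with a parameter recovery check, can be sketched as follows. The base rate, coefficients, and severity parameters are assumed values in the ranges described above, not the exact benchmark calibration.

```python
# Sketch of a known-DGP benchmark: Poisson frequency with log-linear
# relativities, Gamma severity, and a GLM parameter recovery check
# (an assumed parameterisation, not the exact benchmark DGP).
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(8)
n = 50000
X = rng.normal(0, 1, (n, 2))                 # standardised rating factors
beta = np.array([0.25, -0.15])               # planted log relativities
freq = np.exp(np.log(0.08) + X @ beta)       # ~8% base frequency
claims = rng.poisson(freq)
severity = rng.gamma(shape=2.0, scale=1500.0, size=n)  # mean 3000 per claim

# Because the DGP is known, recovery error is directly measurable:
glm = PoissonRegressor(alpha=0.0).fit(X, claims)
rmse = np.sqrt(np.mean((glm.coef_ - beta) ** 2))
print(f"parameter recovery RMSE: {rmse:.3f}")
```

With real data this check is impossible: there is no planted beta to compare against, which is the point made above.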

Benchmarks were run as Databricks notebooks in the burning-cost-examples repository. Each notebook installs its own dependencies, generates data inline, fits models, and computes comparison metrics. They run on Databricks serverless compute (Free Edition) - no cluster configuration required. To reproduce a result: import the relevant notebook, run all cells, and the comparison table is the final output.

Where a library honestly underperforms in some scenario - shap-relativities on relativity accuracy, insurance-quantile on pinball loss at small n - we report it. A benchmark that only shows wins is not a benchmark; it is marketing. The honest results are the ones worth reading.

Numbers reported are point estimates from a single benchmark run on a fixed random seed. We do not report confidence intervals around the benchmark comparisons themselves, though most notebooks include sensitivity checks across multiple seeds. If you need replication code or want to run a modified DGP, the notebooks are the starting point.