Testing models before they go to production. Gini, double-lift, calibration curves, out-of-time validation, and the specific failure modes that standard ML metrics miss on insurance data. 11 articles.
The Pool Adjacent Violators Algorithm solves an O(N) monotonicity problem with no parametric assumptions. It appears in three distinct insurance pricing contexts: as the link fu...
NeuralGaussianMixture is now in insurance-distributional v0.4.0. The question is not whether it can fit bimodal severity — it can. The question is whether your data actually nee...
Most governance tooling is tested on toy examples with clean DGPs and inflated Gini coefficients. We ran the full insurance-governance validation suite on 677K freMTPL2 policies...
We benchmarked Whittaker-Henderson against raw rates and a 5-point weighted moving average on a synthetic UK motor driver age curve with known truth. W-H reduces MSE by 57.2% vs...
The standard UK motor pricing formula multiplies E[N] by E[S] and assumes independence. On a 15,000-policy benchmark with planted omega=3.5, that assumption understates portfoli...
PSI detects covariate shift but not rank collapse. On a synthetic UK motor book where a new risk factor emerges post-deployment, PSI stays GREEN while Gini drops 8 points. The B...
Manual Spearman correlation missed postcode as an ethnicity proxy in 100% of 50 benchmark runs. CatBoost proxy R-squared caught it in 100% of runs. The difference is the non-lin...
On a UK motor DGP with a monotone young-driver requirement, unconstrained EBM violates monotonicity in 31% of runs. Constrained EBM matches GLM monotonicity compliance at 100% w...
HMM-derived driving state features improve Gini by 5–10 percentage points over raw trip averages on a state-structured DGP. The reason is temporal: the HMM knows that aggressive...
We benchmarked constrained portfolio optimisation against a uniform +7% rate change on a 2,000-policy UK motor book. The optimiser achieved the same GWP target with £4,000–8,000...
We benchmarked Bühlmann-Straub credibility against raw experience and manual Z-factors on a 30-segment synthetic UK motor fleet book with a known DGP. On thin schemes, it reduce...
REML-selected lambda beats manual tuning on a 63-band age curve benchmark: 22% lower MSE on thin tail bands, zero analyst discretion, and principled credible intervals. The hone...
We planted three simultaneous model failures in a 50,000-policy UK motor book. The aggregate A/E never triggered. The library detected the first problem after 1,500 policies. He...
Parametric Tweedie intervals under-cover high-risk policies by 10–15 percentage points. We tested conformal prediction on 50,000 UK motor policies to find out whether the fix act...
We ran Double Machine Learning against a naive GLM on a 50,000-policy UK motor telematics book. The GLM overestimated the treatment effect by 50–90%. Here is what that means for...
Benchmark results on a known-DGP synthetic UK motor book. EBM beats the GLM by 12.6 Gini points (0.455 vs 0.329). But the deviance number is misleading. We explain why, and when...
Benchmark results on a known-DGP synthetic UK motor fleet. HMM state fractions deliver 5–10pp Gini lift over simple aggregates. State classification recovers >50% of true high-r...
Honest benchmark: does fitting a surrogate GLM on CatBoost pseudo-predictions recover more discriminatory power than a direct GLM? We test it on 30,000 synthetic UK motor policies.
Benchmark results on a known-DGP synthetic UK motor age curve. REML recovers the true frequency well in the data-rich middle. The tails are a different story. Numbers, not claims.
Aggregate A/E at 0.94 looks fine. The model has been mispricing under-25s for eight months. Benchmark results on a synthetic UK motor book with three planted failure modes.
We ran the benchmarks. On a synthetic UK motor book with nonlinear confounding, naive logistic GLM overestimates the telematics treatment effect by 50–90%. DML recovers the grou...
Benchmark results on a known-DGP synthetic motor book. Conformal hits 90% across all deciles. Parametric Tweedie under-covers the top decile by 10–15pp. Numbers, not theory.
Benchmark results on 100 synthetic schemes with known true loss rates. Credibility blending reduces MSE by 25–35% vs the best naive alternative. Numbers, not theory.
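For readers new to the term, credibility blending can be sketched in a few lines. This is a minimal illustration of the textbook Bühlmann-Straub form, not the code behind any benchmark listed here. The function name `credibility_blend`, the parameter `k` (the ratio of expected process variance to the variance of hypothetical means, treated as known) and all the numbers are assumptions chosen for the example.

```python
def credibility_blend(observed_rate, exposure, collective_rate, k):
    """Blend a scheme's own loss rate with the collective (book-level) rate.

    Z = exposure / (exposure + k). Here k is assumed known; in practice it
    would be estimated from the data.
    """
    if exposure < 0 or k <= 0:
        raise ValueError("exposure must be non-negative and k must be positive")
    z = exposure / (exposure + k)
    blended = z * observed_rate + (1.0 - z) * collective_rate
    return blended, z


# Hypothetical schemes: same observed loss rate, very different exposure.
thin_rate, thin_z = credibility_blend(0.90, exposure=50, collective_rate=0.60, k=450)
thick_rate, thick_z = credibility_blend(0.90, exposure=4050, collective_rate=0.60, k=450)

print(f"thin scheme:  Z={thin_z:.2f}, blended rate={thin_rate:.3f}")
print(f"thick scheme: Z={thick_z:.2f}, blended rate={thick_rate:.3f}")
```

The thin scheme is pulled most of the way towards the collective rate, while the thick scheme keeps most of its own experience. That shrinkage is what reduces error on thin segments.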