Ten Databricks notebooks from the Burning Cost library stack, covering the full pricing workflow from data generation to model deployment. Each notebook runs top-to-bottom on synthetic UK motor data with no external dependencies beyond what it installs with %pip.
Download all 10 notebooks (.zip, 53 KB)
How to use these
- Download a `.py` file below (or the full `.zip`)
- In Databricks: Workspace → Import → File, then select the `.py` file
- Databricks will recognise the `# Databricks notebook source` header and render the markdown cells
- Click Run All; the first cell does `%pip install`, so no cluster setup is needed
Notebooks are tested on Databricks Runtime 14.x+ with serverless SQL compute. They work on any standard cluster with Python 3.11+.
The notebooks
Foundation
01 — End-to-end motor pricing workflow
The full pipeline in one script: synthetic portfolio generation, CatBoost frequency model, SHAP relativities, PRA validation report, and champion/challenger deployment. Start here if you want to see how the libraries connect.
Uses: insurance-synthetic, catboost, shap-relativities, insurance-validation, insurance-deploy
02 — Synthetic portfolio generation
Build a realistic UK motor book using vine copulas that preserve multivariate dependence structure. Fit the synthesizer, generate 50k policies, and assess fidelity with SyntheticFidelityReport. Useful for sharing data with vendors without moving real policyholder records.
Uses: insurance-synthetic
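The insurance-synthetic API isn't shown on this page. As an illustration of the underlying idea (a copula preserves dependence while you control each marginal), here is a minimal Gaussian-copula sketch in numpy/scipy; vine copulas generalise this to trees of pair-copulas. All variable names and distribution parameters are illustrative, not the library's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Target correlation between driver age and vehicle group (illustrative)
corr = np.array([[1.0, -0.4],
                 [-0.4, 1.0]])

# 1. Sample from a correlated standard normal (the copula's latent space)
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=corr, size=50_000)

# 2. Map to uniforms via the normal CDF: dependence survives, marginals become U(0,1)
u = stats.norm.cdf(z)

# 3. Push uniforms through each target marginal's inverse CDF
driver_age = stats.gamma.ppf(u[:, 0], a=9.0, scale=5.0) + 17   # skewed ages from 17
vehicle_group = np.clip(np.ceil(u[:, 1] * 20), 1, 20)          # groups 1-20

# The negative association carries through to the synthetic book
rho, _ = stats.spearmanr(driver_age, vehicle_group)
print(f"Spearman rho: {rho:.2f}")
```

Fidelity assessment (what SyntheticFidelityReport does) then amounts to comparing marginals and dependence measures like this between the real and synthetic books.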
Frequency and severity modelling
03 — Bayesian hierarchical frequency model
Hierarchical Bayesian model with partial pooling for sparse rating cells. Pathfinder for fast variational inference, NUTS for the posterior you actually want. Extracts multiplicative relativities in rate-table format and flags thin segments for manual review.
Uses: bayesian-pricing, pymc, arviz
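The notebook fits the full hierarchical model with PyMC; the shrinkage behaviour that partial pooling delivers for sparse cells can be sketched with a conjugate gamma-Poisson empirical-Bayes approximation. This is an illustration of the effect, not the notebook's code, and the prior strength below is an assumed tuning value that the Bayesian model would learn instead.

```python
import numpy as np

# Claim counts and exposure (policy-years) per rating cell; cell 3 is sparse
claims   = np.array([120, 95, 60, 2])
exposure = np.array([1000.0, 800.0, 500.0, 15.0])

# Portfolio-level Gamma(alpha, beta) prior on frequency, centred on the
# pooled rate. beta acts as prior "pseudo-exposure" in policy-years
# (assumed here; a hierarchical model estimates it from the data).
pooled = claims.sum() / exposure.sum()
beta = 200.0
alpha = pooled * beta

# Gamma-Poisson conjugacy: the posterior mean is a credibility blend
post_mean = (alpha + claims) / (beta + exposure)
raw = claims / exposure

# Sparse cells shrink hard toward the pooled frequency; big cells barely move
for r, p in zip(raw, post_mean):
    print(f"raw {r:.4f} -> partially pooled {p:.4f}")
```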
04 — Bühlmann-Straub credibility
Polars-native Bühlmann-Straub credibility for account-level and segment-level experience rating. The actuarial workhorse for blending individual experience with the collective prior. Sklearn-compatible scorer included.
Uses: credibility
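The credibility library's API isn't reproduced here, but the Bühlmann-Straub estimators themselves are standard and fit in a few lines of numpy: estimate the expected process variance (within-account) and the variance of hypothetical means (between-account), then blend with Z_i = m_i / (m_i + k) where k = EPV / VHM. The data below is made up for illustration.

```python
import numpy as np

# X[i, t]: loss ratios for account i in year t; m[i, t]: earned exposure
X = np.array([[0.62, 0.71, 0.55],
              [0.90, 1.10, 0.95],
              [0.40, 0.80, 0.60]])
m = np.array([[500.0, 520.0, 540.0],
              [ 60.0,  55.0,  70.0],
              [ 20.0,  25.0,  15.0]])

I, T = X.shape
m_i = m.sum(axis=1)
x_i = (m * X).sum(axis=1) / m_i            # exposure-weighted account means
x_bar = (m_i * x_i).sum() / m_i.sum()      # collective mean

# Expected process variance (within-account variability)
epv = (m * (X - x_i[:, None]) ** 2).sum() / (I * (T - 1))

# Variance of hypothetical means (between-account), Bühlmann-Straub estimator
c = m_i.sum() - (m_i ** 2).sum() / m_i.sum()
vhm = ((m_i * (x_i - x_bar) ** 2).sum() - (I - 1) * epv) / c

k = epv / max(vhm, 1e-12)                  # credibility constant
Z = m_i / (m_i + k)                        # per-account credibility factors

premium = Z * x_i + (1 - Z) * x_bar        # blended experience rates
print(np.round(Z, 3), np.round(premium, 3))
```

More exposure means more credibility: the large account keeps most of its own experience, the thin one is pulled toward the collective mean.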
Model structure
05 — Automated GLM interaction detection
CANN + Neural Interaction Detection (NID) pipeline that finds the interactions your Poisson GLM missed. Ranks candidates by NID score, validates survivors with likelihood-ratio tests. Built into data: age_band × vehicle_group is a real interaction — watch the library find it.
Uses: insurance-interactions
Causal inference
06 — Causal deconfounding via double machine learning
DML for separating the causal effect of a rating factor from confounding driven by correlated features. Recovers the true ATE against a known data-generating process. The confounding bias report shows how much naive GLM estimates are distorted.
Uses: insurance-causal
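The partialling-out form of DML that the notebook relies on is easy to sketch with sklearn: cross-fit nuisance models for the treatment and the outcome, then regress residuals on residuals. The data-generating process below is hypothetical, chosen so that a naive regression is visibly confounded while the orthogonalised estimate recovers the planted effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 4_000

# Confounders W drive both the rating factor T and the outcome Y
W = rng.normal(size=(n, 3))
T = W[:, 0] + 0.5 * W[:, 1] + rng.normal(size=n)
true_ate = 0.3
Y = true_ate * T + 2.0 * W[:, 0] - W[:, 2] + rng.normal(size=n)

# Stage 1: cross-fitted (out-of-fold) nuisance predictions to avoid overfitting bias
t_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), W, T, cv=5)
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), W, Y, cv=5)

# Stage 2: regress outcome residuals on treatment residuals (partialling-out)
t_res, y_res = T - t_hat, Y - y_hat
theta = (t_res @ y_res) / (t_res @ t_res)

naive = (T @ Y) / (T @ T)   # no deconfounding: biased upward by W[:, 0]
print(f"naive {naive:.3f} vs DML {theta:.3f} (true {true_ate})")
```

The gap between `naive` and `theta` is exactly what the library's confounding bias report quantifies per rating factor.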
Compliance and governance
07 — Fairness and proxy discrimination audit
FCA Consumer Duty framing: proxy detection, disparate impact metrics, and counterfactual fairness tests. Produces the structured audit report you need before a rate change sign-off.
Uses: insurance-fairness
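One of the disparate impact metrics in such an audit can be sketched directly: compare the rate of a favourable outcome across groups and screen the ratio against the four-fifths rule. The data, the grouping flag, and the favourability definition below are all illustrative, not the insurance-fairness API.

```python
import numpy as np

rng = np.random.default_rng(7)

# Quoted premiums for two groups split by a candidate proxy (e.g. a postcode cluster)
premium = np.concatenate([rng.normal(620, 60, 1000),    # group A
                          rng.normal(540, 60, 1000)])   # group B
group = np.array([1] * 1000 + [0] * 1000)               # 1 = A, 0 = B

# Favourable outcome: quoted below the portfolio median premium
favourable = premium < np.median(premium)
rate_a = favourable[group == 1].mean()
rate_b = favourable[group == 0].mean()

di_ratio = rate_a / rate_b
print(f"selection rates A={rate_a:.2f} B={rate_b:.2f}, DI ratio={di_ratio:.2f}")

# Common screening threshold: flag if the ratio falls below 0.8 (four-fifths rule)
flagged = di_ratio < 0.8
```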
08 — Model drift monitoring
CSI heatmap across rating factors, actual-vs-expected calibration by segment, and Gini drift z-test (arXiv 2510.04556). Scenario: frequency model fitted on 2022-2023 data, monitored on Q1 2025 data.
Uses: insurance-monitoring
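The CSI behind the heatmap is the PSI formula applied per rating factor: bin the baseline sample, compare bin shares in the monitoring sample, and sum (actual - expected) × ln(actual / expected). A minimal sketch (thresholds are the usual rule of thumb, not the library's defaults):

```python
import numpy as np

def csi(expected, actual, bins=10):
    """Characteristic stability index between a baseline and a monitoring
    sample of one rating factor (same formula as PSI, applied per feature)."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range drift
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(3)
baseline = rng.normal(40, 12, 20_000)              # driver age, 2022-2023 book
drifted = rng.normal(44, 12, 20_000)               # Q1 2025: the book has aged

print(f"stable: {csi(baseline, baseline):.3f}, drifted: {csi(baseline, drifted):.3f}")
# Rule of thumb: < 0.10 stable, 0.10-0.25 watch, > 0.25 investigate
```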
09 — Champion/challenger deployment
Model registry, routing verification, shadow mode logging, KPI dashboard, bootstrap loss ratio comparison, and power analysis. ICOBS 6B.2.51R ENBP audit report included.
Uses: insurance-deploy
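The bootstrap loss ratio comparison can be sketched without the library: resample whole policies (paired, so both models see the same resampled book) and read off a confidence interval on the loss-ratio difference. The portfolio below is synthetic and the 3% rate difference is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 3_000

# Earned premium per policy under each model's rates, and incurred losses
premium_champ = rng.gamma(2.0, 250.0, n)
premium_chall = premium_champ * 1.03               # challenger charges ~3% more
losses = rng.gamma(0.05, 3000.0, n)                # heavy-tailed, mostly near zero

# Paired bootstrap: resample the same policies for both models
idx = rng.integers(0, n, size=(2_000, n))
loss_b = losses[idx].sum(axis=1)
lr_champ = loss_b / premium_champ[idx].sum(axis=1)
lr_chall = loss_b / premium_chall[idx].sum(axis=1)

diff = lr_champ - lr_chall
lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"champion minus challenger LR, 95% CI: [{lo:.4f}, {hi:.4f}]")
# A CI excluding zero is evidence of a real loss-ratio difference
```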
Spatial
10 — BYM2 spatial territory ratemaking
Besag-York-Mollié (BYM2) model for postcode-level territory rates. Borrows strength from neighbouring areas to handle sparse geographic cells. Full INLA-style workflow in Python.
Uses: insurance-spatial
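The BYM2 parameterisation itself is compact: one random effect per area built from an unstructured component theta and a spatially structured (ICAR) component phi, with sigma as overall scale and rho as the share of variance that is spatial. The sketch below shows only that combination step; phi is drawn iid here as a stand-in, since a real ICAR draw depends on the postcode adjacency graph.

```python
import numpy as np

# BYM2: b = sigma * (sqrt(1 - rho) * theta + sqrt(rho / s) * phi)
# sigma: overall scale, rho: spatial share of variance,
# s: scaling factor making Var(phi_i) ~ 1 under the ICAR graph.
rng = np.random.default_rng(5)
n_areas = 200

theta = rng.normal(size=n_areas)        # iid heterogeneity
phi = rng.normal(size=n_areas)          # stand-in for a scaled ICAR draw
s = 1.0                                 # assume phi is already scaled to unit variance

sigma, rho = 0.4, 0.7                   # 70% of territory variation is spatial
b = sigma * (np.sqrt(1 - rho) * theta + np.sqrt(rho / s) * phi)

# Territory relativity per postcode area on the rate scale
relativity = np.exp(b)
print(relativity[:5].round(3))
```

Because the two components are scaled to unit variance, sigma and rho stay interpretable, which is what makes BYM2 preferable to the original BYM convolution for priors and reporting.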
All notebooks are MIT-licensed. Source repos live under github.com/burning-cost. If a notebook is broken or out of date, raise an issue on the relevant repo.