Ten Databricks notebooks from the Burning Cost library stack, covering the full pricing workflow from data generation to model deployment. Each notebook runs top-to-bottom on synthetic UK motor data with no external dependencies beyond what it installs with %pip.
Download all 10 notebooks (.zip, 53 KB)
How to use these
- Download a `.py` file below (or the full `.zip`)
- In Databricks: Workspace → Import → File, then select the `.py` file
- Databricks will recognise the `# Databricks notebook source` header and render the markdown cells
- Click Run All; the first cell does `%pip install`, so no cluster setup is needed
Notebooks are tested on Databricks Runtime 14.x+ with serverless SQL compute. They work on any standard cluster with Python 3.11+.
The notebooks
Foundation
01 — End-to-end motor pricing workflow
The full pipeline in one script: synthetic portfolio generation, CatBoost frequency model, SHAP relativities, PRA validation report, and champion/challenger deployment. Start here if you want to see how the libraries connect.
Uses: insurance-synthetic, catboost, shap-relativities, insurance-validation, insurance-deploy
02 — Synthetic portfolio generation
Build a realistic UK motor book using vine copulas that preserve multivariate dependence structure. Fit the synthesizer, generate 50k policies, and assess fidelity with SyntheticFidelityReport. Useful for sharing data with vendors without moving real policyholder records.
Uses: insurance-synthetic
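The insurance-synthetic API isn't shown on this page. As an illustration of the underlying idea (a copula preserves dependence while you control each marginal), here is a minimal Gaussian-copula sketch in numpy/scipy; vine copulas generalise this to trees of pair-copulas. All variable names and distribution parameters are illustrative, not the library's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Target correlation between driver age and vehicle group (illustrative)
corr = np.array([[1.0, -0.4],
                 [-0.4, 1.0]])

# 1. Sample from a correlated standard normal (the copula's latent space)
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=corr, size=50_000)

# 2. Map to uniforms via the normal CDF: dependence survives, marginals become U(0,1)
u = stats.norm.cdf(z)

# 3. Push uniforms through each target marginal's inverse CDF
driver_age = stats.gamma.ppf(u[:, 0], a=9.0, scale=5.0) + 17   # skewed ages from 17
vehicle_group = np.clip(np.ceil(u[:, 1] * 20), 1, 20)          # groups 1-20

# The negative association carries through to the synthetic book
rho, _ = stats.spearmanr(driver_age, vehicle_group)
print(f"Spearman rho: {rho:.2f}")
```

Fidelity assessment (what SyntheticFidelityReport does) then amounts to comparing marginals and dependence measures like this between the real and synthetic books.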
Frequency and severity modelling
03 — Bayesian hierarchical frequency model
Hierarchical Bayesian model with partial pooling for sparse rating cells. Pathfinder for fast variational inference, NUTS for the posterior you actually want. Extracts multiplicative relativities in rate-table format and flags thin segments for manual review.
Uses: bayesian-pricing, pymc, arviz
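The notebook fits the full hierarchical model with PyMC; the shrinkage behaviour that partial pooling delivers for sparse cells can be sketched with a conjugate gamma-Poisson empirical-Bayes approximation. This is an illustration of the effect, not the notebook's code, and the prior strength below is an assumed tuning value that the Bayesian model would learn instead.

```python
import numpy as np

# Claim counts and exposure (policy-years) per rating cell; cell 3 is sparse
claims   = np.array([120, 95, 60, 2])
exposure = np.array([1000.0, 800.0, 500.0, 15.0])

# Portfolio-level Gamma(alpha, beta) prior on frequency, centred on the
# pooled rate. beta acts as prior "pseudo-exposure" in policy-years
# (assumed here; a hierarchical model estimates it from the data).
pooled = claims.sum() / exposure.sum()
beta = 200.0
alpha = pooled * beta

# Gamma-Poisson conjugacy: the posterior mean is a credibility blend
post_mean = (alpha + claims) / (beta + exposure)
raw = claims / exposure

# Sparse cells shrink hard toward the pooled frequency; big cells barely move
for r, p in zip(raw, post_mean):
    print(f"raw {r:.4f} -> partially pooled {p:.4f}")
```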
04 — Bühlmann-Straub credibility
Polars-native Bühlmann-Straub credibility for account-level and segment-level experience rating. The actuarial workhorse for blending individual experience with the collective prior. Sklearn-compatible scorer included.
Uses: credibility
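The credibility library's API isn't reproduced here, but the Bühlmann-Straub estimators themselves are standard and fit in a few lines of numpy: estimate the expected process variance (within-account) and the variance of hypothetical means (between-account), then blend with Z_i = m_i / (m_i + k) where k = EPV / VHM. The data below is made up for illustration.

```python
import numpy as np

# X[i, t]: loss ratios for account i in year t; m[i, t]: earned exposure
X = np.array([[0.62, 0.71, 0.55],
              [0.90, 1.10, 0.95],
              [0.40, 0.80, 0.60]])
m = np.array([[500.0, 520.0, 540.0],
              [ 60.0,  55.0,  70.0],
              [ 20.0,  25.0,  15.0]])

I, T = X.shape
m_i = m.sum(axis=1)
x_i = (m * X).sum(axis=1) / m_i            # exposure-weighted account means
x_bar = (m_i * x_i).sum() / m_i.sum()      # collective mean

# Expected process variance (within-account variability)
epv = (m * (X - x_i[:, None]) ** 2).sum() / (I * (T - 1))

# Variance of hypothetical means (between-account), Bühlmann-Straub estimator
c = m_i.sum() - (m_i ** 2).sum() / m_i.sum()
vhm = ((m_i * (x_i - x_bar) ** 2).sum() - (I - 1) * epv) / c

k = epv / max(vhm, 1e-12)                  # credibility constant
Z = m_i / (m_i + k)                        # per-account credibility factors

premium = Z * x_i + (1 - Z) * x_bar        # blended experience rates
print(np.round(Z, 3), np.round(premium, 3))
```

More exposure means more credibility: the large account keeps most of its own experience, the thin one is pulled toward the collective mean.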
Model structure
05 — Automated GLM interaction detection
CANN + Neural Interaction Detection (NID) pipeline that finds the interactions your Poisson GLM missed. Ranks candidates by NID score, validates survivors with likelihood-ratio tests. Built into data: age_band × vehicle_group is a real interaction — watch the library find it.
Uses: insurance-interactions
Causal inference
06 — Causal deconfounding via double machine learning
DML for separating the causal effect of a rating factor from confounding driven by correlated features. Recovers the true ATE against a known data-generating process. The confounding bias report shows how much naive GLM estimates are distorted.
Uses: insurance-causal
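The partialling-out form of DML that the notebook relies on is easy to sketch with sklearn: cross-fit nuisance models for the treatment and the outcome, then regress residuals on residuals. The data-generating process below is hypothetical, chosen so that a naive regression is visibly confounded while the orthogonalised estimate recovers the planted effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 4_000

# Confounders W drive both the rating factor T and the outcome Y
W = rng.normal(size=(n, 3))
T = W[:, 0] + 0.5 * W[:, 1] + rng.normal(size=n)
true_ate = 0.3
Y = true_ate * T + 2.0 * W[:, 0] - W[:, 2] + rng.normal(size=n)

# Stage 1: cross-fitted (out-of-fold) nuisance predictions to avoid overfitting bias
t_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), W, T, cv=5)
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), W, Y, cv=5)

# Stage 2: regress outcome residuals on treatment residuals (partialling-out)
t_res, y_res = T - t_hat, Y - y_hat
theta = (t_res @ y_res) / (t_res @ t_res)

naive = (T @ Y) / (T @ T)   # no deconfounding: biased upward by W[:, 0]
print(f"naive {naive:.3f} vs DML {theta:.3f} (true {true_ate})")
```

The gap between `naive` and `theta` is exactly what the library's confounding bias report quantifies per rating factor.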
Compliance and governance
07 — Fairness and proxy discrimination audit
FCA Consumer Duty framing: proxy detection, disparate impact metrics, and counterfactual fairness tests. Produces the structured audit report you need before a rate change sign-off.
Uses: insurance-fairness
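One of the disparate impact metrics in such an audit can be sketched directly: compare the rate of a favourable outcome across groups and screen the ratio against the four-fifths rule. The data, the grouping flag, and the favourability definition below are all illustrative, not the insurance-fairness API.

```python
import numpy as np

rng = np.random.default_rng(7)

# Quoted premiums for two groups split by a candidate proxy (e.g. a postcode cluster)
premium = np.concatenate([rng.normal(620, 60, 1000),    # group A
                          rng.normal(540, 60, 1000)])   # group B
group = np.array([1] * 1000 + [0] * 1000)               # 1 = A, 0 = B

# Favourable outcome: quoted below the portfolio median premium
favourable = premium < np.median(premium)
rate_a = favourable[group == 1].mean()
rate_b = favourable[group == 0].mean()

di_ratio = rate_a / rate_b
print(f"selection rates A={rate_a:.2f} B={rate_b:.2f}, DI ratio={di_ratio:.2f}")

# Common screening threshold: flag if the ratio falls below 0.8 (four-fifths rule)
flagged = di_ratio < 0.8
```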
08 — Model drift monitoring
CSI heatmap across rating factors, actual-vs-expected calibration by segment, and Gini drift z-test (arXiv 2510.04556). Scenario: frequency model fitted on 2022-2023 data, monitored on Q1 2025 data.
Uses: insurance-monitoring
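The CSI behind the heatmap is the PSI formula applied per rating factor: bin the baseline sample, compare bin shares in the monitoring sample, and sum (actual - expected) × ln(actual / expected). A minimal sketch (thresholds are the usual rule of thumb, not the library's defaults):

```python
import numpy as np

def csi(expected, actual, bins=10):
    """Characteristic stability index between a baseline and a monitoring
    sample of one rating factor (same formula as PSI, applied per feature)."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range drift
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(3)
baseline = rng.normal(40, 12, 20_000)              # driver age, 2022-2023 book
drifted = rng.normal(44, 12, 20_000)               # Q1 2025: the book has aged

print(f"stable: {csi(baseline, baseline):.3f}, drifted: {csi(baseline, drifted):.3f}")
# Rule of thumb: < 0.10 stable, 0.10-0.25 watch, > 0.25 investigate
```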
09 — Champion/challenger deployment
Model registry, routing verification, shadow mode logging, KPI dashboard, bootstrap loss ratio comparison, and power analysis. ICOBS 6B.2.51R ENBP audit report included.
Uses: insurance-deploy
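The bootstrap loss ratio comparison can be sketched without the library: resample whole policies (paired, so both models see the same resampled book) and read off a confidence interval on the loss-ratio difference. The portfolio below is synthetic and the 3% rate difference is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 3_000

# Earned premium per policy under each model's rates, and incurred losses
premium_champ = rng.gamma(2.0, 250.0, n)
premium_chall = premium_champ * 1.03               # challenger charges ~3% more
losses = rng.gamma(0.05, 3000.0, n)                # heavy-tailed, mostly near zero

# Paired bootstrap: resample the same policies for both models
idx = rng.integers(0, n, size=(2_000, n))
loss_b = losses[idx].sum(axis=1)
lr_champ = loss_b / premium_champ[idx].sum(axis=1)
lr_chall = loss_b / premium_chall[idx].sum(axis=1)

diff = lr_champ - lr_chall
lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"champion minus challenger LR, 95% CI: [{lo:.4f}, {hi:.4f}]")
# A CI excluding zero is evidence of a real loss-ratio difference
```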
Spatial
10 — BYM2 spatial territory ratemaking
Besag-York-Mollié (BYM2) model for postcode-level territory rates. Borrows strength from neighbouring areas to handle sparse geographic cells. Full INLA-style workflow in Python.
Uses: insurance-spatial
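The BYM2 parameterisation itself is compact: one random effect per area built from an unstructured component theta and a spatially structured (ICAR) component phi, with sigma as overall scale and rho as the share of variance that is spatial. The sketch below shows only that combination step; phi is drawn iid here as a stand-in, since a real ICAR draw depends on the postcode adjacency graph.

```python
import numpy as np

# BYM2: b = sigma * (sqrt(1 - rho) * theta + sqrt(rho / s) * phi)
# sigma: overall scale, rho: spatial share of variance,
# s: scaling factor making Var(phi_i) ~ 1 under the ICAR graph.
rng = np.random.default_rng(5)
n_areas = 200

theta = rng.normal(size=n_areas)        # iid heterogeneity
phi = rng.normal(size=n_areas)          # stand-in for a scaled ICAR draw
s = 1.0                                 # assume phi is already scaled to unit variance

sigma, rho = 0.4, 0.7                   # 70% of territory variation is spatial
b = sigma * (np.sqrt(1 - rho) * theta + np.sqrt(rho / s) * phi)

# Territory relativity per postcode area on the rate scale
relativity = np.exp(b)
print(relativity[:5].round(3))
```

Because the two components are scaled to unit variance, sigma and rho stay interpretable, which is what makes BYM2 preferable to the original BYM convolution for priors and reporting.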
All notebooks are MIT-licensed. Source repos live under github.com/burning-cost. If a notebook is broken or out of date, raise an issue on the relevant repo.