There is a version of this question that comes up at least once on every pricing project involving schemes, brokers, or MGA portfolios: do we apply credibility adjustments to the GBM output, or do we just feed the group identifiers into the model and let it figure it out?

The question sounds like it has a simple answer. It does not. The right approach depends on how many groups you have, how thin the data is, what your regulator expects, and whether you need to explain the group adjustment separately from the base rate. We have built separate libraries for the two ends of this spectrum and a third for the middle ground. This post explains when to use each.


Classical Bühlmann-Straub: what it does well

The Bühlmann-Straub model (1970) solves a specific problem. You have a set of groups – schemes, territories, NCD classes, fleet categories – each with its own observed loss rate across several periods, and each with a different volume of data. You want to blend each group’s observed experience with the portfolio mean, but the blend weight should reflect how credible that experience actually is.

The mathematics is clean. Three structural parameters govern everything:

From these, Bühlmann’s k = v/a gives you the noise-to-signal ratio. The credibility factor for group i is Z_i = w_i / (w_i + k), where w_i is total exposure. A large k means the portfolio is relatively homogeneous – trust the collective. A small k means groups genuinely differ – trust the individual experience.

The practical output is a credibility premium for each group: a weighted blend of that group’s own loss rate and the portfolio mean, where the blend weight is derived entirely from the data.

Our insurance-credibility library implements this as BuhlmannStraub, fitting structural parameters non-parametrically from a panel of (group, period) observations:

import polars as pl
from insurance_credibility import BuhlmannStraub

# One row per (scheme, accident year) -- loss rate and earned exposure
scheme_panel = pl.DataFrame({
    "scheme":     ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year":       [2021, 2022, 2023] * 3,
    "loss_rate":  [0.55, 0.60, 0.58, 0.80, 0.75, 0.82, 0.40, 0.42, 0.38],
    "exposure":   [1000, 1200, 1100,  300,  350,  320, 5000, 4800, 5200],
})

bs = BuhlmannStraub()
bs.fit(scheme_panel, group_col="scheme", period_col="year",
       loss_col="loss_rate", weight_col="exposure")

summary = bs.summary()
# scheme  exposure  loss_rate_observed  credibility_weight  credibility_premium
# A       3300      0.578               0.76                0.561
# B       970       0.789               0.40                0.674
# C       15000     0.401               0.96                0.406

Scheme B has only 970 earned exposures across three years. Its observed loss rate of 78.9% is pulled toward the portfolio mean. Scheme C, with 15,000 exposures, earns 96% credibility and its premium barely moves. This is exactly the behaviour you want.

The limitations are equally clear. Bühlmann-Straub operates on a univariate loss rate. It cannot handle interactions between driver age and vehicle group. It has no view of the policy-level rating factors that drive the underlying risk – it only sees aggregate loss experience at the group level. If Scheme B’s bad experience is explained by a high proportion of young drivers, Bühlmann-Straub cannot see that. It will apply a partial credibility adjustment for what is actually a compositional difference.

This is the right tool for scheme-level adjustments when you have aggregate data and need an auditable, defensible blend with a clear mathematical interpretation. It is not the right tool for building the base rate structure.


CatBoost for base rate structure: what it does well

The GBM, fitted at policy level on all rating factors, solves a different problem. Given driver age, vehicle group, postcode sector, NCD, and fifty other factors, it learns the non-linear, interacting structure of claim frequency and severity across the portfolio. CatBoost handles high-cardinality categoricals natively, which matters a great deal when vehicle group alone has 400 values and postcode sector has 2,800.

On a UK personal lines motor portfolio of 500,000 policies, a well-tuned CatBoost model typically outperforms a GLM with manually engineered interactions by 4-7 Gini points on holdout data. That gap is not noise. It represents genuine rating structure that the GLM’s additive assumptions cannot capture.

The problem arises when you want to add group-level adjustments on top. Feeding broker_id or scheme_id directly into CatBoost only works when groups are large enough for the tree to learn stable effects. With 200 brokers and an average of 2,500 policies each, a high-cardinality categorical in CatBoost will produce unreliable leaves for the smaller brokers and will not naturally shrink toward the portfolio mean the way Bühlmann-Straub does. You can tune the regularisation parameters to get some shrinkage, but you cannot inspect the effective credibility weight given to each group, and you certainly cannot explain it to a regulator.

Where shap-relativities fits here: once you have a CatBoost base rate model, you still need to produce factor tables for your pricing committee and for Radar or Emblem. Our shap-relativities library extracts exposure-weighted relativities in multiplicative format from any CatBoost model:

from shap_relativities import ShapRelativities

sr = ShapRelativities(model=catboost_model, X=X_train, exposure=exposure)
tables = sr.extract(factors=["driver_age_band", "vehicle_group", "ncd_years"])
tables["driver_age_band"].write_csv("age_relativities.csv")

The resulting relativities are what go into Radar. The CatBoost model is the source of truth; the factor tables are the governance-compatible representation of it. This workflow – CatBoost in training, GLM-format tables in production – is how the Gini gap actually gets into production.


The hybrid: two-stage multilevel modelling

The right architecture for most UK motor scheme and MGA portfolios is neither pure credibility nor pure GBM. It is sequential: fit CatBoost on individual risk factors to learn the base rate structure, then apply REML-estimated random effects to capture group-level departures from that structure.

Our insurance-multilevel library implements this as MultilevelPricingModel:

Stage 1: CatBoost is fitted on all features excluding group columns. The group exclusion is deliberate and critical. If broker_id is included in Stage 1, the GBM partially absorbs the group signal. Stage 2 then sees only the residual, underestimates the between-group variance tau2, and applies insufficient shrinkage. Excluding group columns from Stage 1 preserves identifiability.

Stage 2: Log-ratio residuals r_i = log(y_i / f_hat_i) are fitted with REML random intercepts per group. The BLUP for each group – the Best Linear Unbiased Predictor – is exactly a credibility-weighted adjustment: Z_g * (r_bar_g - mu_hat), where Z_g = tau2 / (tau2 + sigma2/n_g) is the Bühlmann credibility weight.

Final premium: f_hat(x) * exp(b_hat_group). Multiplicative structure throughout, compatible with standard UK personal lines rating.

from insurance_multilevel import MultilevelPricingModel

model = MultilevelPricingModel(
    catboost_params={"iterations": 500, "loss_function": "Poisson", "depth": 6},
    random_effects=["broker_id", "scheme_id"],
    min_group_size=10,
)

model.fit(
    X_train,                          # all features including broker_id, scheme_id
    y_train["claim_count"],
    weights=y_train["exposure"],
    group_cols=["broker_id", "scheme_id"],
)

premiums = model.predict(X_test, group_cols=["broker_id", "scheme_id"])

# Inspect credibility weights per group
summary = model.credibility_summary()
# group_col   group_id   n_obs  tau2   sigma2   credibility_weight  blup_adjustment
# broker_id   BRK_007    4200   0.031  0.284    0.82                1.043
# broker_id   BRK_119    180    0.031  0.284    0.17                1.008
# scheme_id   SCH_003    8900   0.018  0.219    0.88               0.961

Broker 007 has 4,200 policies. Its 82% credibility weight means the BLUP adjustment is heavily anchored to its own experience – a 4.3% uplift, probably a worse-than-average book. Broker 119 has only 180 policies: 17% credibility weight, its adjustment barely departs from zero. The shrinkage is automatic, principled, and inspectable.

This is the notation that mirrors Bühlmann-Straub deliberately. The tau2/sigma2/k/credibility_weight output from credibility_summary() maps directly to v/a/k/Z in the classical model. A pricing actuary who knows credibility theory can read this output without any new conceptual framework.


When each approach wins

Use Bühlmann-Straub (insurance-credibility) when:

Use CatBoost base rate plus shap-relativities when:

Use insurance-multilevel when:

The diagnostic you want from insurance-multilevel is the ICC – the intraclass correlation coefficient – which tells you what fraction of total premium variance sits between groups rather than within them. If ICC is below 2%, the group structure is probably not worth the additional complexity. If it is above 10%, treating groups as random effects will materially improve both accuracy and stability.

from insurance_multilevel import diagnostics

vc = model._variance_components["broker_id"]
icc = diagnostics.icc(vc, group_col="broker_id")
# 0.087 -- 8.7% of variance between brokers, worth modelling separately

The actual workflow for UK motor

For a UK personal lines motor portfolio with direct and broker channels mixed together, the approach we use is:

  1. Fit CatBoost on individual risk factors (driver, vehicle, geography, NCD) excluding any channel or group identifiers. Extract factor tables via shap-relativities for the pricing committee. This is the base rate model.

  2. On the residuals from Step 1, fit Bühlmann-Straub at scheme level if you only have aggregate scheme data. Or fit insurance-multilevel at broker and scheme level if you have policy-level data with group identifiers. The BLUP outputs become the scheme and broker loading tables.

  3. The final premium is base rate (from the CatBoost factor tables) multiplied by the scheme loading multiplied by the broker loading. Each component is separately auditable and separately governable.

Step 3 is what makes this architecture work in practice. The base rate review and the scheme rate review happen on different cycles and involve different stakeholders. Separating them structurally – rather than baking everything into one model that nobody can decompose – is what allows the pricing function to actually manage the portfolio.

The credibility approach did not disappear when GBMs arrived. It got a better Stage 1.


insurance-credibility insurance-multilevel shap-relativities

Back to all articles