Part 6: Choosing the Tweedie power and fitting the base model

The Tweedie family

CatBoost's Tweedie loss function models the compound Poisson-Gamma distribution that characterises aggregate insurance losses. The variance_power parameter p controls the variance-to-mean relationship:

  • p = 1: Poisson (variance proportional to mean). Appropriate for claim counts only.
  • p = 2: Gamma (variance proportional to mean squared). Appropriate for claim severity only.
  • 1 < p < 2: Compound Poisson-Gamma. This is the distribution of aggregate losses - a point mass at zero (no claims) combined with a positive continuous distribution when claims occur.
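To make the 1 < p < 2 case concrete, here is a minimal simulation sketch; it is not part of the tutorial's pipeline. It draws Poisson claim counts and Gamma severities and sums them per policy. The function name and all parameter values (claim rate 0.1, Gamma shape 2, scale 1500) are illustrative assumptions, not estimates from any real book.

```python
import numpy as np

def simulate_aggregate_losses(n_policies, claim_rate, sev_shape, sev_scale, seed=0):
    """Simulate one period of aggregate losses per policy (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Frequency: number of claims on each policy
    n_claims = rng.poisson(claim_rate, size=n_policies)
    # Severity: the sum of k independent Gamma(shape, scale) claims is
    # Gamma(k * shape, scale), so each policy total is drawn in one step
    totals = np.zeros(n_policies)
    has_claim = n_claims > 0
    totals[has_claim] = rng.gamma(sev_shape * n_claims[has_claim], sev_scale)
    return totals

losses = simulate_aggregate_losses(
    100_000, claim_rate=0.1, sev_shape=2.0, sev_scale=1500.0
)
share_zero = float((losses == 0).mean())
mean_loss = float(losses.mean())
print(f"Share of policies with zero loss: {share_zero:.3f}")
print(f"Mean aggregate loss: {mean_loss:.1f}")
```

Under these assumed inputs, roughly exp(-0.1), or about 90%, of simulated policies have zero loss, and the mean sits near 0.1 × 2 × 1500 = 300. A Gamma severity shape of α corresponds to a Tweedie power of p = (α + 2) / (α + 1), so α = 2 gives p ≈ 1.33.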

For UK motor pure premiums, p = 1.5 is the standard choice. It sits in the middle of the compound Poisson-Gamma range and reflects both the frequency structure (many zeros from no-claim policies) and the severity structure (right-skewed losses when claims occur). Moving to p = 1.3 or p = 1.7 typically makes only a small difference to the fit. Values outside the open interval (1, 2) are inappropriate for aggregate loss data, and CatBoost's documentation gives (1, 2) as the supported range for variance_power.

Practical note: if your book has a very low claims rate (e.g. liability, where most policies never claim), you might choose p closer to 1.0. If severity is the dominant driver of variation (e.g. catastrophe-exposed property), p closer to 2.0 is more appropriate. For standard UK motor, 1.5 is a sound default.
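When comparing models fitted with different values of p, one option is to score all of them with the Tweedie deviance at a single, fixed evaluation power. The helper below is a NumPy sketch of the standard unit deviance for 1 < p < 2. The name `mean_tweedie_deviance` is hypothetical and the toy arrays are made up for illustration. Deviances computed at different p are on different scales, so they should not be compared directly.

```python
import numpy as np

def mean_tweedie_deviance(y, mu, p=1.5):
    """Mean Tweedie unit deviance for 1 < p < 2 (lower is better)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    if not 1.0 < p < 2.0:
        raise ValueError("this sketch only covers 1 < p < 2")
    if np.any(y < 0) or np.any(mu <= 0):
        raise ValueError("need y >= 0 and mu > 0")
    unit_dev = 2.0 * (
        np.power(y, 2.0 - p) / ((1.0 - p) * (2.0 - p))
        - y * np.power(mu, 1.0 - p) / (1.0 - p)
        + np.power(mu, 2.0 - p) / (2.0 - p)
    )
    return float(unit_dev.mean())

# Toy data: three no-claim policies and two with losses (made-up numbers)
y_toy = np.array([0.0, 0.0, 0.0, 250.0, 1200.0])
flat = np.full_like(y_toy, y_toy.mean())               # same prediction for everyone
sharper = np.array([100.0, 150.0, 200.0, 400.0, 600.0])  # ranks the risks

dev_flat = mean_tweedie_deviance(y_toy, flat, p=1.5)
dev_sharper = mean_tweedie_deviance(y_toy, sharper, p=1.5)
print(f"Flat model deviance:    {dev_flat:.2f}")
print(f"Sharper model deviance: {dev_sharper:.2f}")
```

The deviance is zero when predictions equal the observations and grows as they diverge, so the set of predictions that ranks the toy risks sensibly scores lower than the flat one.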

Build the Pool objects and fit the model

In a new cell:

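%md
## Part 6: Training the base Tweedie model

Then, in a separate Python cell:

# Imports repeated so this cell runs on its own
import numpy as np
from catboost import CatBoostRegressor, Pool

# CatBoost Pool objects package features, labels, and metadata together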
train_pool = Pool(X_train, y_train, cat_features=CAT_FEATURES)
cal_pool   = Pool(X_cal,   y_cal,   cat_features=CAT_FEATURES)
test_pool  = Pool(X_test,  y_test,  cat_features=CAT_FEATURES)

tweedie_params = {
    "loss_function":    "Tweedie:variance_power=1.5",
    "eval_metric":      "Tweedie:variance_power=1.5",
    "learning_rate":    0.05,
    "depth":            5,
    "grow_policy":      "Depthwise",  # needed for min_data_in_leaf to take effect
    "min_data_in_leaf": 50,    # limits overfitting to thin insurance cells
    "iterations":       500,
    "random_seed":      42,
    "verbose":          100,   # print progress every 100 trees
}

model = CatBoostRegressor(**tweedie_params)
model.fit(train_pool, eval_set=cal_pool, early_stopping_rounds=50)

# Sanity-check predictions on the test set
preds_test = model.predict(test_pool)
print("\nTest set predictions:")
print(f"  Min: {preds_test.min():.2f}")
print(f"  Median: {np.median(preds_test):.2f}")
print(f"  Mean: {preds_test.mean():.2f}")
print(f"  Max: {preds_test.max():.2f}")
print(f"  Actual mean pure premium: {y_test.mean():.2f}")

What this does: creates the three CatBoost Pool objects, sets the Tweedie hyperparameters, and trains the model using the calibration set as an early-stopping validation set. The verbose=100 setting prints the loss every 100 iterations.

Why min_data_in_leaf=50: without a minimum leaf size, trees can split on cells with only a handful of observations. Such splits fit the training data closely but generalise poorly for thin-cell risks. Thin-cell risks are exactly where conformal intervals will be widest, so we need stable base model predictions in those regions, not wildly varying predictions from overfit splits. Note that the CatBoost documentation lists min_data_in_leaf as usable only with the Depthwise and Lossguide grow policies, not the default SymmetricTree.

A note on early stopping: using the calibration pool for early stopping means the model's iteration count has been influenced by the calibration data. This introduces a very minor dependency. In practice the effect on coverage is negligible. However, if you need strict separation for a regulatory audit, use a separate validation pool (drawn from the training set, not the calibration set) for early stopping and keep the calibration set entirely unseen during model fitting.
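If strict separation is needed, the validation rows can be carved out of the training set before the pools are built. The sketch below produces only the row indices, using NumPy. The name `split_fit_validation`, the 20% validation share, and the seed are assumptions for illustration.

```python
import numpy as np

def split_fit_validation(n_rows, val_fraction=0.2, seed=42):
    """Return (fit_idx, val_idx): disjoint, sorted row positions."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_val = int(round(n_rows * val_fraction))
    val_idx = np.sort(shuffled[:n_val])   # held out for early stopping only
    fit_idx = np.sort(shuffled[n_val:])   # used to fit the trees
    return fit_idx, val_idx

# Hypothetical training set of 1,000 rows
fit_idx, val_idx = split_fit_validation(n_rows=1000)
print(f"Fit rows: {len(fit_idx)}, validation rows: {len(val_idx)}")
```

Assuming `X_train` and `y_train` are pandas objects, `X_train.iloc[fit_idx]` and `X_train.iloc[val_idx]` (with the matching `y_train` rows) would back two separate Pool objects, and the validation pool would be passed as `eval_set` in place of `cal_pool`.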

What you should see: training output like this, with the Tweedie loss printed every 100 iterations:

0:      learn: 2.05xxx  test: 2.07xxx   best: 2.07xxx (0)   total: ...
100:    learn: 1.82xxx  test: 1.85xxx   best: 1.84xxx (87)  total: ...
...
Stopped by overfitting detector  (50 iterations wait)

The test set mean prediction should be close to (but not identical to) the actual mean pure premium. A large discrepancy here would suggest a misconfigured model.
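The comparison of means can be made slightly more systematic. The sketch below computes the ratio of mean prediction to mean actual, then repeats the comparison within bands of policies sorted by prediction. The function names and the synthetic arrays are illustrative assumptions. In the notebook you would pass `preds_test` and `y_test` (as NumPy arrays) instead.

```python
import numpy as np

def mean_ratio(preds, actual):
    """Ratio of mean prediction to mean actual; near 1.0 is expected."""
    return float(np.mean(preds) / np.mean(actual))

def banded_means(preds, actual, n_bands=5):
    """(mean prediction, mean actual) per band, from lowest to highest prediction."""
    preds = np.asarray(preds, dtype=float)
    actual = np.asarray(actual, dtype=float)
    order = np.argsort(preds)
    return [
        (float(preds[idx].mean()), float(actual[idx].mean()))
        for idx in np.array_split(order, n_bands)
    ]

# Synthetic stand-ins: mostly-zero losses whose expectation equals the prediction
rng = np.random.default_rng(0)
toy_preds = rng.uniform(50.0, 500.0, size=10_000)
has_claim = rng.random(10_000) < 0.1
toy_actual = np.where(
    has_claim, toy_preds / 0.1 * rng.gamma(2.0, 0.5, size=10_000), 0.0
)

ratio = mean_ratio(toy_preds, toy_actual)
bands = banded_means(toy_preds, toy_actual)
print(f"Mean prediction / mean actual: {ratio:.3f}")
for pred_mean, actual_mean in bands:
    print(f"  predicted {pred_mean:8.2f}   actual {actual_mean:8.2f}")
```

A ratio far from 1.0, or band-level actuals that do not rise with the predictions, would be a prompt to recheck the target definition and the loss configuration before moving on.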