Part 5: Feature definitions

Before fitting anything, we need to decide which features go into the model and which are categorical.

Create a new cell:

# Feature definitions for the motor GBM
CONTINUOUS_FEATURES = ["driver_age", "vehicle_group", "ncd_years"]
CAT_FEATURES        = ["area", "conviction_points"]
FEATURES            = CONTINUOUS_FEATURES + CAT_FEATURES

FREQ_TARGET  = "claim_count"
EXPOSURE_COL = "exposure"
SEV_TARGET   = "incurred"

Run it. There is no output - you are just setting up variable names that the rest of the notebook will use.
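Because later cells rely on these names, a quick consistency check can catch typos early. The following is a minimal sketch, not part of the original notebook: the helper `check_feature_definitions` and the `example_columns` list are hypothetical, and in practice you would pass the column names of your own DataFrame.

```python
# Definitions repeated from the cell above so this sketch runs on its own
CONTINUOUS_FEATURES = ["driver_age", "vehicle_group", "ncd_years"]
CAT_FEATURES = ["area", "conviction_points"]
FEATURES = CONTINUOUS_FEATURES + CAT_FEATURES

FREQ_TARGET = "claim_count"
EXPOSURE_COL = "exposure"
SEV_TARGET = "incurred"


def check_feature_definitions(available_columns):
    """Return a list of problems; an empty list means the definitions are consistent."""
    problems = []

    # A feature should be either continuous or categorical, never both
    overlap = set(CONTINUOUS_FEATURES) & set(CAT_FEATURES)
    if overlap:
        problems.append(f"both continuous and categorical: {sorted(overlap)}")

    # Every feature, target and exposure column must exist in the data
    required = FEATURES + [FREQ_TARGET, EXPOSURE_COL, SEV_TARGET]
    missing = [name for name in required if name not in available_columns]
    if missing:
        problems.append(f"missing from the data: {missing}")

    return problems


# Hypothetical column list; in the notebook this would come from the loaded data
example_columns = [
    "driver_age", "vehicle_group", "ncd_years", "area",
    "conviction_points", "claim_count", "exposure", "incurred",
]
print(check_feature_definitions(example_columns))
```

An empty list means every name in the definitions matches a column and no feature is listed twice.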

Why conviction_points is categorical: The values are 0, 3, 6, and 9. These are penalty point totals, not continuous quantities: the step from 0 to 3 points (one minor offence) is qualitatively different from the step from 6 to 9 (approaching disqualification territory). Treating them as a number would tell the model that the levels are ordered and would restrict it to threshold splits along that order. As a categorical, CatBoost learns the effect of each penalty level separately, encoding the levels with ordered target statistics.
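One practical consequence is that CatBoost expects categorical values to be integers or strings, not floats, so a column that was read in as 3.0 needs converting first. The sketch below shows one way to prepare a row and work out which positions are categorical. The helpers `cat_feature_indices` and `to_category_label` and the sample row are hypothetical, introduced only for illustration.

```python
# Definitions repeated from the cell above so this sketch runs on its own
CONTINUOUS_FEATURES = ["driver_age", "vehicle_group", "ncd_years"]
CAT_FEATURES = ["area", "conviction_points"]
FEATURES = CONTINUOUS_FEATURES + CAT_FEATURES


def cat_feature_indices(features, cat_features):
    """Positions of the categorical columns within the feature list."""
    return [features.index(name) for name in cat_features]


def to_category_label(value):
    """Turn a penalty-point total such as 3 or 3.0 into the string label '3'."""
    number = float(value)
    if not number.is_integer():
        raise ValueError(f"expected a whole number of points, got {value!r}")
    return str(int(number))


# Hypothetical policy row with points stored as a float
row = {"driver_age": 34, "vehicle_group": 12, "ncd_years": 5,
       "area": "B", "conviction_points": 3.0}

# Build the feature vector in FEATURES order, relabelling the points column
prepared = [
    to_category_label(row[name]) if name == "conviction_points" else row[name]
    for name in FEATURES
]
indices = cat_feature_indices(FEATURES, CAT_FEATURES)

print(prepared)
print(indices)
# The indices (or the names in CAT_FEATURES) are what you would pass as the
# cat_features argument when building a CatBoost Pool or calling fit.
```

Using string labels also makes it explicit that "3" and "6" are names of levels and not quantities to be compared.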

Why vehicle_group is continuous: ABI groups 1-50 have a roughly monotone relationship with risk - higher groups are generally more expensive vehicles. The continuous treatment allows CatBoost to find non-linear effects within that trend, which tree splits handle naturally.

Why driver_age is continuous: Age has a well-known non-linear effect - a peak of risk in the under-25 band and a secondary peak in the over-70 band, with a flat middle. Tree splits on a continuous variable capture this shape without requiring manual bucketing.
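To see why threshold splits suit this shape, here is a toy sketch of how a single regression-tree split is chosen: try each midpoint between neighbouring ages and keep the one that minimises the squared error. This is a simplified stand-in for what a tree learner does, not CatBoost's actual algorithm, and the ages and claim rates are invented purely for illustration.

```python
def sum_squared_error(values):
    """Squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(ages, rates):
    """Return (threshold, error) for the single split with the lowest squared error."""
    pairs = sorted(zip(ages, rates))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical ages
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [rate for _, rate in pairs[:i]]
        right = [rate for _, rate in pairs[i:]]
        error = sum_squared_error(left) + sum_squared_error(right)
        if best is None or error < best[1]:
            best = (threshold, error)
    return best


# Invented toy data: high rates for young drivers, flat middle, rise for older drivers
ages = [19, 22, 30, 40, 50, 60, 72, 78]
rates = [0.30, 0.25, 0.10, 0.10, 0.10, 0.10, 0.18, 0.20]

# First split over all drivers
first_threshold, _ = best_split(ages, rates)

# Second split over the drivers above the first threshold
older = [(a, r) for a, r in zip(ages, rates) if a > first_threshold]
second_threshold, _ = best_split([a for a, _ in older], [r for _, r in older])

print(first_threshold, second_threshold)
```

On this toy data the first split separates the young drivers and the second separates the older ones, leaving the flat middle as its own band, with no age buckets defined by hand.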