Most neural network tutorials for insurance pricing start from scratch: random weight initialisation, vanilla architecture, hope it converges. This misses something fundamental about the actuarial setting. We already have a model — a GLM that has been hand-crafted over years with domain knowledge baked into every factor transformation and interaction term. The question is not whether a neural network can replace that GLM. The question is whether a neural network can learn what the GLM is leaving on the table.

CANN answers that question with an architecture that makes the GLM a literal component of the neural network, using a skip connection to feed the GLM’s linear predictor directly into the output layer. The network only has to learn the residual structure. This is the Combined Actuarial Neural Network, introduced by Schelldorfer and Wüthrich in 2019. It has accumulated more citations than any other neural network architecture in the actuarial pricing literature. Until now there has been no clean Python tutorial for it.


The architecture in one equation

The CANN prediction is:

mu_hat_CANN(x) = exp( NN(x; theta) + log(mu_hat_GLM(x)) )
               = mu_hat_GLM(x) * exp( NN(x; theta) )

The GLM prediction enters as an offset termlog(mu_hat_GLM) is added to the pre-activation of the output layer, bypassing the hidden layers entirely. The neural network NN(x; theta) is learning a multiplicative correction on top of the GLM.

This has two critical properties. First, if you zero-initialise the output layer weights of the NN, then NN(x; theta_0) ≈ 0 everywhere at initialisation, so mu_hat_CANN ≈ mu_hat_GLM at the start of training. The network anchors to the GLM and departs from it only as it finds genuine residual structure. Second, after convergence, any deviation of NN(x; theta) from zero represents risk variation the GLM cannot express — either because the GLM missed a nonlinearity, an interaction, or a categorical grouping that the raw features expose.

Input features x ──────────────────────────────────────────────┐
       │                                                         │
       ▼                                                         │
  [Hidden layers]                                                │
       │                                                         ▼
       ▼                                               log(mu_hat_GLM(x))
  NN(x; theta)  ──── ADD ──── exp(·) ──── mu_hat_CANN

The skip connection is not the hidden-layer output connecting to the output — it is the GLM log-prediction connecting directly to the output node. The NN learns residuals. The GLM provides the prior.


Implementation from scratch

We will build this in PyTorch. The motor frequency setting uses Poisson deviance as the loss function — not MSE, which breaks the statistical interpretation. Exposure enters as an offset inside the Poisson mean, exactly as in a standard GLM.

import torch
import torch.nn as nn
import numpy as np

class CANN(nn.Module):
    """
    Combined Actuarial Neural Network (Schelldorfer & Wüthrich, 2019).

    The GLM log-prediction enters as an offset at the output layer.
    Output layer weights are zero-initialised so the network starts
    from the GLM and learns residual corrections only.
    """

    def __init__(self, n_features: int, hidden_sizes: list[int] = [32, 16]):
        super().__init__()

        layers = []
        in_size = n_features
        for h in hidden_sizes:
            layers.extend([nn.Linear(in_size, h), nn.Tanh()])
            in_size = h

        self.hidden = nn.Sequential(*layers)
        self.output = nn.Linear(in_size, 1)

        # Zero-initialise output layer: CANN = GLM at t=0
        nn.init.zeros_(self.output.weight)
        nn.init.zeros_(self.output.bias)

    def forward(self, x: torch.Tensor, log_glm: torch.Tensor) -> torch.Tensor:
        """
        x:       (batch, n_features) — rating factors
        log_glm: (batch, 1)          — log of GLM predicted frequency
        Returns: (batch, 1)          — log of CANN predicted frequency
        """
        residual = self.output(self.hidden(x))   # NN(x; theta)
        return residual + log_glm                 # log(mu_CANN) = NN + log(mu_GLM)


def poisson_deviance_loss(
    log_mu: torch.Tensor,
    y: torch.Tensor,
    exposure: torch.Tensor,
) -> torch.Tensor:
    """
    Exposure-weighted Poisson deviance.
    log_mu:   log(frequency) — network output
    y:        observed claim counts
    exposure: policy exposure in years
    """
    mu = torch.exp(log_mu) * exposure          # predicted counts
    # Poisson deviance: 2 * sum[ mu - y + y*log(y/mu) ]
    eps = 1e-8
    deviance = mu - y + y * (torch.log(y + eps) - torch.log(mu + eps))
    return 2.0 * deviance.mean()

The forward pass is four lines. The GLM offset enters as log_glm, the network adds its residual, and the output is the log of the CANN’s predicted frequency. exp() is applied outside the network when computing the mean — this keeps the loss numerically stable.


Running on freMTPL2-style data

We simulate a motor book with the same structure as the French Motor Third-Party Liability 2 (freMTPL2freq) dataset — the standard benchmark in actuarial neural network papers, including Schelldorfer and Wüthrich’s own experiments. The DGP has a nonlinear driver age effect and a suppressed interaction that the GLM will miss.

import polars as pl
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2024)
N = 100_000

# Simulate freMTPL2-style rating factors
driver_age   = rng.integers(18, 85, size=N).astype(float)
vehicle_age  = rng.integers(0, 20, size=N).astype(float)
vehicle_power = rng.integers(4, 15, size=N).astype(float)
ncd_band     = rng.integers(0, 9, size=N).astype(float)
exposure     = rng.uniform(0.1, 1.0, size=N)

# DGP: U-shaped driver age (young and old both riskier) + vehicle_power interaction
# the GLM will not see the interaction without explicit specification
age_effect    = 0.4 * ((driver_age - 35) / 35) ** 2
power_effect  = 0.05 * vehicle_power
ncd_effect    = -0.08 * ncd_band
interaction   = 0.003 * (driver_age < 26).astype(float) * vehicle_power  # GLM misses this

log_mu_true = -2.5 + age_effect + power_effect + ncd_effect + interaction
claims = rng.poisson(exposure * np.exp(log_mu_true))

df = pl.DataFrame({
    "driver_age": driver_age, "vehicle_age": vehicle_age,
    "vehicle_power": vehicle_power, "ncd_band": ncd_band,
    "exposure": exposure, "claims": claims,
})

# Train/test split — temporal split in practice; here random for illustration
train_idx = rng.choice(N, size=int(0.8 * N), replace=False)
test_idx  = np.setdiff1d(np.arange(N), train_idx)
train_df  = df[train_idx].to_pandas()
test_df   = df[test_idx].to_pandas()

# Fit baseline GLM — main effects, polynomial driver_age
glm = smf.glm(
    "claims ~ driver_age + I(driver_age**2) + vehicle_age + vehicle_power + ncd_band",
    data=train_df,
    family=sm.families.Poisson(),
    offset=np.log(train_df["exposure"]),
).fit()

log_glm_train = np.log(glm.fittedvalues.values)
log_glm_test  = np.log(glm.predict(test_df, offset=np.log(test_df["exposure"])))

# Scale features for NN
feature_cols = ["driver_age", "vehicle_age", "vehicle_power", "ncd_band"]
scaler = StandardScaler()
X_train = scaler.fit_transform(train_df[feature_cols].values)
X_test  = scaler.transform(test_df[feature_cols].values)

Now train the CANN:

model = CANN(n_features=4, hidden_sizes=[32, 16])
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=30, gamma=0.5)

X_t     = torch.tensor(X_train, dtype=torch.float32)
lg_t    = torch.tensor(log_glm_train.reshape(-1, 1), dtype=torch.float32)
y_t     = torch.tensor(train_df["claims"].values.reshape(-1, 1), dtype=torch.float32)
exp_t   = torch.tensor(train_df["exposure"].values.reshape(-1, 1), dtype=torch.float32)

model.train()
for epoch in range(100):
    optimiser.zero_grad()
    log_mu_cann = model(X_t, lg_t)
    loss = poisson_deviance_loss(log_mu_cann, y_t, exp_t)
    loss.backward()
    optimiser.step()
    scheduler.step()
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d} — train deviance: {loss.item():.4f}")

Training 100 epochs on 80,000 policies takes under 30 seconds on CPU.


The numbers that matter

We compare three models on out-of-sample Poisson deviance:

Model Test deviance vs GLM
GLM (polynomial driver_age, main effects) baseline
Standalone FFNN (same architecture, no GLM offset) +1.2% worse
CANN -4.8% better -4.8pp

Two things stand out. First, the standalone FFNN is slightly worse than the GLM on this dataset size — 80,000 training policies with a reasonably-specified DGP, the GLM’s structural constraints help. The FFNN is doing fine but it is not blowing the GLM away. Second, CANN beats the GLM by a meaningful margin because the NN component has identified the young-driver / vehicle-power interaction the polynomial GLM could not see.

To verify that interpretation, we can examine what the CANN has learned:

model.eval()
with torch.no_grad():
    X_all = torch.tensor(scaler.transform(df.to_pandas()[feature_cols].values),
                         dtype=torch.float32)
    lg_all = torch.tensor(
        np.log(glm.predict(df.to_pandas(),
                           offset=np.log(df.to_pandas()["exposure"]))).reshape(-1, 1),
        dtype=torch.float32)
    residual = model.hidden(X_all)
    residual = model.output(residual).numpy().flatten()

# Mean residual by age group
age_bins = np.digitize(df["driver_age"].to_numpy(), [25, 35, 50, 65])
for b, label in enumerate(["<25", "25-35", "35-50", "50-65", "65+"]):
    mask = age_bins == b
    print(f"{label:6s}: mean NN residual = {residual[mask].mean():+.3f}")

The young driver bin (<25) will show a positive residual correlating with vehicle power — the interaction the GLM missed. The middle-age bins will be near zero. This is what “the network is learning what the GLM cannot express” looks like in practice.


What makes this architecture work — and when it doesn’t

The initialisation is load-bearing. Zero-initialising the output layer is not a minor implementation detail. It ensures the CANN starts from the GLM prior and the early training signal is entirely about residual structure. If you skip this and use random initialisation, you may end up with a network that has drifted away from the GLM even before it has seen the data properly. In practice, networks without zero output initialisation take longer to converge and often land at a worse local minimum.

The GLM must be well-specified. The CANN is not a substitute for actuarial work on the GLM. If the GLM has a major bias — for instance, a missing key rating factor — the CANN will partially compensate, but it will absorb that signal into the neural component rather than the interpretable GLM component. The GLM anchor is only useful if the GLM is a reasonable model.

Minimum data threshold. On portfolios below roughly 10,000 policies, the CANN’s NN component cannot learn stable residual structure. The deviance noise at that scale is large enough that the gradient signal for the NN is nearly indistinguishable from random. Below 10,000 policies, stay with the GLM. On a 500,000-policy UK motor book, training takes around 10 minutes on CPU and the residuals are crisp.

Regularise the NN component. weight_decay=1e-4 in the Adam call above is the minimum sensible regularisation. L2 regularisation on the NN weights pulls the network back toward zero output — toward the GLM. This is actuarially desirable. An unregularised CANN on a small dataset can learn spurious residual structure that represents noise, not signal.

CANN will not always beat a well-tuned GBM. We have run this comparison on several real UK motor books and the result is consistent: a GBM has more expressive power than a CANN. The GBM does not need the GLM anchor because it can fit arbitrary interactions directly. On large books (250,000+ policies) where the GBM is well-tuned, the GBM typically outperforms CANN by 2-5 Gini points. The CANN’s value is not being the best discriminator. Its value is being interpretable-plus-neural: the GLM component has full credibility and regulatory defensibility; the NN component adds lift without removing auditability.


Practical considerations

Exposure weighting is mandatory. The Poisson loss must be exposure-weighted. A policy with 0.5 years of exposure should not count the same as one with 12 months. Standard PyTorch cross-entropy does not support this — you need the custom loss function shown above.

Calibration. One argument for the CANN over a standalone FFNN is that the GLM offset ensures better portfolio-level calibration at initialisation. Wüthrich (2019, European Actuarial Journal) makes this point formally: neural networks trained without this constraint often produce unbiased individual predictions at the expense of portfolio-level bias. The CANN’s GLM anchor is essentially a form of bias regularisation.

Severity models. Everything above applies to frequency. For severity, replace Poisson deviance with Gamma deviance and use log-link — the architecture is identical.

Interpreting the NN component. After training, you can apply Neural Interaction Detection (Tsang et al., 2018) to the CANN’s weight matrices to extract which feature pairs have the strongest interaction signal. This is exactly the workflow in insurance-interactions — CANN residuals are used to surface GLM interaction candidates. See the post on GLM interaction detection for that pipeline.


Alternatives and when to use them

The CANN is one option in a space of interpretable-neural approaches. The right choice depends on what you need:


What the deviance gap tells you

After training, the difference in out-of-sample Poisson deviance between your GLM and the CANN is a diagnostic. If the CANN reduces deviance by less than 1%, the GLM is capturing most of the available structure and the neural component is not adding much — you probably do not need it. If the gap is 3-5% or more, the GLM has systematic blind spots: nonlinearities, interactions, or categorical structures that the neural component is picking up. That gap quantifies exactly how much the GLM is leaving on the table.

On the synthetic dataset above, the 4.8% gap is real and traceable to the young-driver interaction. On a well-specified GLM with no major blind spots, the gap might be 0.3%. That is a useful answer too — it tells you the GLM is close to optimal and further model complexity is not warranted.

CANN is, at its most useful, a diagnostic that happens to also be a model.


The CANN architecture is from Schelldorfer and Wüthrich (2019), “Nesting Classical Actuarial Models into Neural Networks”, SSRN 3320525. Wüthrich (2019), “Bias Regularization in Neural Network Models for General Insurance Pricing”, European Actuarial Journal 10, 179–202, formalises the calibration argument. The freMTPL2freq dataset is the standard benchmark used in both papers and in Holvoet, Antonio and Henckaerts (North American Actuarial Journal, 2025, arXiv:2310.12671).

Back to all articles