We published a post earlier today on LLM feature engineering for insurance pricing — what the frameworks do, where the evidence is thin, and which use cases are worth pursuing. This post is about the risk you take on when any of those features enters a pricing model.

The short version: LLMs encode societal stereotypes from training data. When you use an LLM to generate insurance rating features — vehicle model scores, occupation risk bands, interaction term suggestions — those stereotypes can become embedded in your model. The insurer is responsible for the outcome. The LLM vendor is not. And “a human reviewed it” is not a defence if the actuary did not understand what the LLM optimised for.


The three risk channels

There are three distinct pathways through which an LLM-generated feature can introduce proxy discrimination.

Channel 1: Subjective categorical scoring. You have 3,000 vehicle model codes and you ask GPT-4 to score each by “driver risk signal” or “sporty driving likelihood.” The LLM draws on its training data — automotive journalism, forum discussions, insurance pricing commentary. Its scores correlate with the demographic profile of typical owners. Age, gender, and in some cases ethnicity through cultural associations with vehicle brands are all represented in that training corpus. You now have a feature that is a proxy for multiple protected characteristics, generated by a third-party model you cannot inspect.

Channel 2: Occupation band generation. You ask the LLM to group 10,000 occupation codes into five risk tiers. Its groupings reflect training data bias: occupations correlated with lower income or with particular ethnic groups may be systematically rated into higher risk tiers independent of claims experience. You have introduced a discriminatory feature with actuarial-sounding names on the bands.

Channel 3: Interaction term suggestion. The LLM recommends testing postcode_district × vehicle_age as an interaction. Postcode is a confirmed ethnicity proxy — the FCA has said so explicitly. Adding an LLM-suggested interaction involving postcode amplifies the proxy discrimination pathway in your model, without the actuary having explicitly decided to add a protected-characteristic-correlated interaction. The LLM made that decision; the actuary approved the test without noticing what was being amplified.

Channel 3 is the one that catches people off guard. The LLM’s operator bias toward simple terms (documented in arXiv:2410.17787: GPT-4o-mini and Gemini-1.5-flash have 90% of their suggestions concentrated in five basic operators) means it will readily suggest arithmetic combinations of postcode with other variables. These interactions can look entirely benign — postcode × vehicle_age looks like a geographic-vehicle interaction, not a demographics interaction — and still amplify proxy pathways.
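The amplification is easy to demonstrate on synthetic data. The sketch below uses only numpy and entirely simulated data: a hidden proxy drives both postcode deprivation and vehicle age, and regressing the proxy on the two variables plus their interaction recovers at least as much of it as the two variables alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated protected proxy (e.g. name-ethnicity probability), unseen by the model
proxy = rng.uniform(0, 1, n)

# Postcode deprivation decile correlates with the proxy; vehicle age partly does too
postcode_depriv = np.clip(np.round(proxy * 9 + rng.normal(0, 1.5, n)), 0, 9)
vehicle_age = np.clip(rng.poisson(6, n) + (proxy > 0.5) * rng.poisson(2, n), 0, 25)

def r2(x_cols, y):
    """R-squared of an OLS regression of y on x_cols (with intercept)."""
    X = np.column_stack([np.ones(len(y))] + x_cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_base = r2([postcode_depriv, vehicle_age], proxy)
r2_with_interaction = r2(
    [postcode_depriv, vehicle_age, postcode_depriv * vehicle_age], proxy
)

print(f"proxy R² without interaction: {r2_base:.3f}")
print(f"proxy R² with interaction:    {r2_with_interaction:.3f}")
```

Because the interaction is a nested regressor, the proxy R² can only stay level or move up; the question Check 2 below answers is whether it moves up by enough to matter.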


Who is responsible

The insurer. This is settled, not debatable.

The FCA’s position on third-party data is that firms must assure themselves that data used in pricing does not discriminate against customers based on protected characteristics. The LLM is a third-party feature vendor. The fact that it is an AI model rather than a data supplier makes no difference to that obligation. The FCA December 2024 Research Note on ML bias confirmed that senior manager responsibility under SM&CR cannot be outsourced to a model or to a vendor. Consumer Duty applies: AI-driven decisions must be fair, explainable, and free of unjustified discrimination.

NY DFS Circular Letter 2024-7 establishes the US precedent: insurers must demonstrate that AI systems and external data do not proxy for protected classes, and vendors must be auditable. UK firms should expect the FCA to reach the same operational position as the 2026 AI review agenda matures.

The “human in the loop” defence deserves specific scrutiny. It is weak if the actuary reviewed a model fit improvement (deviance reduction, Gini) without running the discrimination checks described below. An actuary who approved a new feature because it reduced Poisson deviance by 0.3% — without checking whether it concentrates Shapley value on protected-characteristic proxies — has not discharged the firm’s fairness obligation. A governance sign-off on model performance is not a fairness sign-off.


The regulatory documentation requirement

EIOPA-BoS-25-360 (August 2025) is the most operationally direct document here. It requires that for built or engineered features: “records should exist on how the feature was built and the associated intention.”

This language captures LLM-generated features explicitly. “How the feature was built” means: which LLM was used, what the prompt was, what validation was applied to the LLM’s output, and what fairness checks were performed before the feature entered the model. “The associated intention” means: what was the feature designed to measure, and is that thing legitimately predictive of insurance risk or does it stand in as a proxy for something else?

EIOPA’s opinion is addressed to EEA supervisors and does not directly bind UK-only firms post-Brexit. But any firm with EEA activities is in scope, and the FCA has been moving in the same direction on feature governance. PRA SS1/23 (operative May 2024) adds a further requirement: LLMs used during model development must be in the model inventory with their own documentation and vendor dependency chain. If GPT-4 generates your interaction candidates, that usage must be recorded — even if the LLM never runs in the live scoring path.

The practical documentation requirement for a UK pricing team using LLMs for feature generation therefore includes:

  1. The LLM model name and version used (GPT-4o, Claude 3.5 Sonnet, etc.)
  2. The full prompt or prompt template applied
  3. The dataset or categories passed to the LLM (vehicle model names, occupation codes, etc.)
  4. The validation method applied to LLM outputs before model fitting
  5. The fairness checks applied and their results
  6. The SM&CR senior manager sign-off

If any of those six things are missing, the feature record does not satisfy EIOPA-BoS-25-360.
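A minimal sketch of what such a record could look like in code. The dataclass fields below mirror the six items; the field names, example values, and the `to_audit_json` helper are all illustrative, not a regulatory schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class LLMFeatureRecord:
    """One governance record per LLM-generated feature (items 1-6 above)."""
    llm_model_version: str        # 1. LLM name and version
    prompt_template: str          # 2. full prompt or prompt template
    input_data_description: str   # 3. data/categories passed to the LLM
    output_validation_method: str # 4. validation applied before model fitting
    fairness_checks: dict         # 5. checks applied and their results
    smcr_signoff: str             # 6. accountable senior manager sign-off


def to_audit_json(record: LLMFeatureRecord) -> str:
    """Serialise the record, refusing to emit an incomplete one."""
    missing = [k for k, v in asdict(record).items() if not v]
    if missing:
        raise ValueError(f"Incomplete feature record, missing: {missing}")
    return json.dumps(asdict(record), indent=2)


record = LLMFeatureRecord(
    llm_model_version="gpt-4o-2024-08-06",
    prompt_template="Score each vehicle model 0-100 for driver risk signal ...",
    input_data_description="3,000 ABI vehicle model codes",
    output_validation_method="Manual review of 200 sampled scores; range checks",
    fairness_checks={"spearman_vs_deprivation": 0.08, "proxy_r2_delta": 0.003},
    smcr_signoff="SMF4, 2025-09-12",
)
print(to_audit_json(record))
```

The refusal-to-serialise behaviour is the useful part: a record with any of the six items blank never makes it into the audit store.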


The testing protocol

Any LLM-generated feature should pass four checks before entering a pricing model.

Check 1: Marginal correlation. Compute Spearman rank correlation between the new feature and known protected characteristic proxies. The minimum proxy set for UK motor: postcode deprivation decile (Income Deprivation Domain of the 2019 English Indices of Deprivation, at LSOA level), name-based ethnicity probability (ONOMAP or similar), and ABI occupation class band. A Spearman ρ > 0.15 against any of these is a flag.
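Check 1 is a few lines with scipy. A sketch on synthetic data — `flag_marginal_correlation` is a hypothetical helper, not part of any named library:

```python
import numpy as np
from scipy.stats import spearmanr


def flag_marginal_correlation(feature, proxies, threshold=0.15):
    """Check 1: Spearman rank correlation of the new feature against each
    protected-characteristic proxy. Returns only the proxies that flag."""
    flags = {}
    for name, values in proxies.items():
        rho, _ = spearmanr(feature, values)
        if abs(rho) > threshold:
            flags[name] = round(float(rho), 3)
    return flags


# Synthetic illustration: a score partly driven by deprivation decile
rng = np.random.default_rng(1)
depriv = rng.integers(1, 11, 5_000)
score = depriv * 0.4 + rng.normal(0, 1.5, 5_000)

flags = flag_marginal_correlation(
    score,
    {
        "postcode_deprivation_decile": depriv,
        "abi_occupation_band": rng.integers(1, 6, 5_000),  # uncorrelated control
    },
)
print(flags)  # only the deprivation decile should flag
```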

Check 2: Proxy R-squared change. Fit a regression of each protected proxy on the full rating factor set, with and without the new LLM feature. If R-squared against the proxy improves by more than one percentage point when the new feature is added, the feature is increasing the model’s discriminatory potential. This is the proxy detection logic in proxy_detection.py in the insurance-fairness library.
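A sketch of the same logic in plain scikit-learn, for teams not using that library — the function below is illustrative, not the `proxy_detection.py` API:

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def proxy_r2_delta(X_base, new_feature, proxy):
    """Check 2: change in R² when regressing the protected proxy on the
    rating factors, with vs without the new LLM feature."""
    r2_base = LinearRegression().fit(X_base, proxy).score(X_base, proxy)
    X_with = np.column_stack([X_base, new_feature])
    r2_with = LinearRegression().fit(X_with, proxy).score(X_with, proxy)
    return r2_with - r2_base


rng = np.random.default_rng(2)
n = 8_000
proxy = rng.normal(0, 1, n)                      # protected proxy
X_base = rng.normal(0, 1, (n, 4))                # existing factors, independent
llm_feature = 0.3 * proxy + rng.normal(0, 1, n)  # feature leaking proxy signal

delta = proxy_r2_delta(X_base, llm_feature, proxy)
print(f"proxy R² delta: {delta:.3f}")  # > 0.01 triggers the one-point rule
```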

Check 3: Shapley attribution. Compute SHAP values for the LLM feature across the portfolio. If the feature’s SHAP distribution concentrates in records that are high on a protected proxy (high deprivation decile, high name-ethnicity probability), the feature is delivering most of its predictive lift on proxied-demographic groups. This should be red-flagged.
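A sketch of the concentration test, assuming you have already computed per-record SHAP values for the feature (e.g. with the shap library). The attributions below are simulated, and the ratio statistic is one reasonable way to express "concentrates in high-proxy records", not a standard named metric:

```python
import numpy as np


def shap_concentration_ratio(shap_values, proxy, top_q=0.8):
    """Check 3 sketch: ratio of mean |SHAP| among records in the top proxy
    quintile vs the rest. A ratio well above 1 means the feature delivers
    most of its predictive lift on proxied-demographic groups."""
    cutoff = np.quantile(proxy, top_q)
    top = np.abs(shap_values[proxy >= cutoff]).mean()
    rest = np.abs(shap_values[proxy < cutoff]).mean()
    return top / rest


# Synthetic attributions concentrated in high-deprivation records
rng = np.random.default_rng(3)
depriv = rng.uniform(0, 1, 10_000)
shap_vals = rng.normal(0, 0.05, 10_000) + np.where(
    depriv > 0.8, rng.normal(0.3, 0.1, 10_000), 0.0
)

ratio = shap_concentration_ratio(shap_vals, depriv)
print(f"concentration ratio: {ratio:.2f}")  # well above 1 -> red flag
```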

Check 4: A/B premium comparison. Run the model with and without the new feature. Compare the premium distribution change by postcode deprivation decile quintile and by geographic ethnicity proxy quintile. A feature that improves accuracy for the overall portfolio but increases premiums for high-deprivation postcodes relative to low-deprivation postcodes is doing something you need to understand before it goes live.
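Check 4 reduces to a quintile summary of the premium change. A minimal numpy sketch on simulated premiums:

```python
import numpy as np


def premium_shift_by_quintile(prem_without, prem_with, proxy):
    """Check 4: mean premium change (%) in each protected-proxy quintile.
    A positive gradient from Q1 to Q5 means the feature loads premium
    onto high-proxy records."""
    quintile = np.digitize(proxy, np.quantile(proxy, [0.2, 0.4, 0.6, 0.8]))
    pct_change = 100 * (prem_with - prem_without) / prem_without
    return [round(float(pct_change[quintile == q].mean()), 2) for q in range(5)]


rng = np.random.default_rng(4)
n = 10_000
depriv = rng.uniform(0, 1, n)
base_prem = rng.gamma(5, 80, n)
# A feature that raises premiums with deprivation while lowering them elsewhere
with_prem = base_prem * (1 + 0.10 * (depriv - 0.5))

shifts = premium_shift_by_quintile(base_prem, with_prem, depriv)
print(shifts)  # roughly monotone increasing across quintiles
```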

Here is an implementation using the insurance-fairness library. The approach: run FairnessAudit on the model with and without the LLM-generated feature, compare proxy detection scores between the two, and flag any factor whose proxy R-squared against protected proxies increased:

import polars as pl
from insurance_fairness import FairnessAudit


def audit_feature_delta(
    model_without: object,
    model_with: object,
    df: pl.DataFrame,
    new_feature_col: str,
    protected_cols: list[str],
    prediction_col: str,
    factor_cols_base: list[str],
    exposure_col: str = "exposure",
    proxy_r2_threshold: float = 0.01,
) -> dict:
    """
    Compare FairnessAudit results before and after adding an LLM-generated feature.

    Returns a summary dict flagging:
      - whether any proxy R-squared increased by more than proxy_r2_threshold
      - the SHAP proxy score for the new feature
      - the premium disparity ratio change by protected proxy quintile

    Parameters
    ----------
    model_without : fitted model (CatBoost or sklearn API)
        Pricing model without the LLM feature.
    model_with : fitted model
        Pricing model including the LLM feature.
    df : pl.DataFrame
        Policy-level dataset. Must contain all factor_cols, protected_cols,
        the new_feature_col, prediction_col from model_with, and exposure_col.
    new_feature_col : str
        Name of the LLM-generated feature column.
    protected_cols : list[str]
        Protected characteristic proxy columns (e.g., postcode deprivation decile,
        name-ethnicity probability).
    prediction_col : str
        Column containing predicted premiums from model_with.
    factor_cols_base : list[str]
        Rating factor columns present in the base model (without LLM feature).
    exposure_col : str
        Exposure column name.
    proxy_r2_threshold : float
        Flag if proxy R-squared for any factor increases by more than this
        when the new feature is added.
    """
    factor_cols_with = factor_cols_base + [new_feature_col]

    # Audit without the new feature
    # (compute predictions separately for the base model)
    df_base = df.with_columns(
        pl.Series("_pred_base", model_without.predict(
            df.select(factor_cols_base).to_pandas()
        ))
    )
    audit_base = FairnessAudit(
        model=model_without,
        data=df_base,
        protected_cols=protected_cols,
        prediction_col="_pred_base",
        exposure_col=exposure_col,
        factor_cols=factor_cols_base,
    )
    report_base = audit_base.run()

    # Audit with the new feature
    audit_with = FairnessAudit(
        model=model_with,
        data=df,
        protected_cols=protected_cols,
        prediction_col=prediction_col,
        exposure_col=exposure_col,
        factor_cols=factor_cols_with,
    )
    report_with = audit_with.run()

    # Compare proxy R-squared for each protected column
    flags = []
    for prot in protected_cols:
        base_scores = {
            s.factor: s.proxy_r2
            for s in report_base.proxy_detection[prot].scores
            if s.proxy_r2 is not None
        }
        with_scores = {
            s.factor: s.proxy_r2
            for s in report_with.proxy_detection[prot].scores
            if s.proxy_r2 is not None
        }
        for factor, r2_with in with_scores.items():
            r2_base = base_scores.get(factor, 0.0)
            delta = r2_with - r2_base
            if delta > proxy_r2_threshold:
                flags.append({
                    "protected_proxy": prot,
                    "factor": factor,
                    "proxy_r2_base": round(r2_base, 4),
                    "proxy_r2_with_llm_feature": round(r2_with, 4),
                    "delta": round(delta, 4),
                    "flag": "PROXY_R2_INCREASED",
                })

    # Isolate the new feature's own proxy score
    new_feature_proxy = {}
    for prot in protected_cols:
        scores_dict = {
            s.factor: s
            for s in report_with.proxy_detection[prot].scores
        }
        if new_feature_col in scores_dict:
            s = scores_dict[new_feature_col]
            new_feature_proxy[prot] = {
                "proxy_r2": s.proxy_r2,
                "mutual_information": s.mutual_information,
                "rag": s.rag,
            }

    return {
        "new_feature": new_feature_col,
        "proxy_flags": flags,
        "new_feature_proxy_scores": new_feature_proxy,
        "clean": len(flags) == 0,
    }


# --- Example usage ---

result = audit_feature_delta(
    model_without=model_base,
    model_with=model_with_llm_feature,
    df=df_policies,
    new_feature_col="llm_vehicle_risk_score",
    protected_cols=["postcode_deprivation_decile", "name_ethnicity_prob"],
    prediction_col="predicted_premium",
    factor_cols_base=["driver_age", "vehicle_group", "ncd_years",
                      "annual_mileage_band", "region"],
    exposure_col="exposure",
)

if not result["clean"]:
    print("WARNING: LLM feature triggered proxy discrimination flags")
    for flag in result["proxy_flags"]:
        print(f"  {flag['factor']} → proxy R² for {flag['protected_proxy']} "
              f"increased by {flag['delta']:.4f} "
              f"({flag['proxy_r2_base']:.4f} → "
              f"{flag['proxy_r2_with_llm_feature']:.4f})")
else:
    print("No proxy R-squared flags triggered. Check new feature scores directly:")
    for prot, scores in result["new_feature_proxy_scores"].items():
        print(f"  {prot}: proxy_r2={scores['proxy_r2']:.4f}, "
              f"MI={scores['mutual_information']:.4f}, RAG={scores['rag']}")

Run this before and after every LLM-generated feature enters the model. It is not a one-time check — run it again after any model refitting on updated data.


What the EU AI Act does and does not say here

For completeness, since this is a question we get asked: the pricing model does not inherit the LLM’s classification under the EU AI Act.

GPT-4 and Claude are GPAI models under Chapter V of Regulation (EU) 2024/1689. The pricing model itself is classified by what it does, not by what built it. For UK motor and property pricing, the Annex III high-risk classification applies only to AI systems used for individual risk assessment in life and health insurance — motor and property are excluded, confirmed in EU Commission Guidelines C/2025/3554, paragraph 42.

There is an arguable case that a life/health insurer using an LLM for feature generation brings it within Article 6(3) as a “preparatory task” for an Annex III system. No Commission guidance has directly addressed this yet, and it is an open legal question. The practical response is to inventory the LLM as a model component and document the dependency chain, which you should be doing anyway under PRA SS1/23.

The explainability requirement in Article 13 does create friction even for non-high-risk systems: if a key pricing feature is an opaque LLM score, explaining that feature’s contribution is harder than explaining a conventional derived variable. This is not an insuperable problem — a numeric score with a documented rationale is explainable — but it requires the feature documentation to be completed before the model goes live, not retrospectively when the auditor asks.


Why the “we validated it against claims data” defence does not hold

Fitting a model and confirming that the LLM-generated feature has a statistically significant coefficient does not eliminate the discrimination risk. It confirms only that the feature is correlated with claims experience in your historical data. That historical data already reflects the societal conditions — which occupations and postcodes historically had higher claims — that the LLM was trained on. A feature that is statistically predictive in your training data can simultaneously be a proxy for a protected characteristic.

This is why the discrimination testing is separate from the actuarial validation. The actuarial question is: does this feature reduce Poisson deviance on held-out data? The fairness question is: is this feature encoding a protected characteristic pathway independent of its predictive power? Both must be answered. Answering one does not answer the other.
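The point can be made concrete with a simulation: a protected attribute drives both the LLM feature and historical claims, so the feature earns a genuinely positive Poisson coefficient while also being a measurable proxy. All data below is synthetic:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(5)
n = 20_000

# The protected attribute influences both the LLM feature and historical claims
protected = rng.binomial(1, 0.3, n)
llm_feature = 0.8 * protected + rng.normal(0, 1, n)
claims = rng.poisson(np.exp(-2.0 + 0.4 * protected))

# Actuarial validation: the feature is "predictive" of claims frequency
model = PoissonRegressor(alpha=0.0).fit(llm_feature.reshape(-1, 1), claims)

# Fairness check: the same feature is a proxy for the protected attribute
rho, _ = spearmanr(llm_feature, protected)

print(f"fitted Poisson coefficient:        {model.coef_[0]:.3f}")
print(f"Spearman vs protected attribute:   {rho:.3f}")
```

Both numbers come out positive: the claims validation passes and the discrimination check fails, on the same feature, from the same data.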

The Lindholm-Richman-Tsanakas-Wüthrich (2022) discrimination-free pricing framework — implemented in discrimination_insensitive.py in the insurance-fairness library — addresses exactly this by applying conditional marginalisation across the protected characteristic. If you have a feature you suspect may be correlated with a protected attribute, applying LRTW-style conditional marginalisation will tell you how much of that feature’s premium contribution would survive if the protected characteristic information were removed. That is the right test.
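The core of that test is a single formula: average the best-estimate price μ(x, d) over the unconditional distribution P(D = d) of the protected attribute, instead of the conditional P(D = d | x). A minimal sketch of that marginalisation step — not the `discrimination_insensitive.py` API, and the prices are illustrative:

```python
import numpy as np


def discrimination_free_price(mu, p_d):
    """LRTW-style discrimination-free price: marginalise the best-estimate
    price over the unconditional distribution of the protected attribute D.

    mu  : array (n_policies, n_protected_levels) of best-estimate prices
    p_d : array (n_protected_levels,) with unconditional P(D = d)
    """
    return mu @ p_d


# Two policies, binary protected attribute with P(D=1) = 0.4
mu = np.array([
    [100.0, 140.0],   # policy A: price if D=0, price if D=1
    [ 90.0,  90.0],   # policy B: price does not depend on D
])
p_d = np.array([0.6, 0.4])

print(discrimination_free_price(mu, p_d))  # [116.  90.]
```

Policy A's price moves because part of its best estimate rides on D; policy B's does not. The gap between a feature's contribution before and after this marginalisation is the quantity the framework asks you to inspect.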


The governance record

Put concretely: if your team uses an LLM to generate features and you are asked in an FCA supervisory visit to produce the governance record for those features, you need to be able to show:

  1. Which LLM, version, and prompt produced the feature
  2. How the LLM’s output was validated before model fitting
  3. The results of the four discrimination checks above
  4. The SM&CR senior manager sign-off

If you cannot produce those four things, you have an EIOPA-BoS-25-360 documentation gap and a Consumer Duty audit exposure. The documentation is not difficult — it is an afternoon of work per feature. What is difficult is building the habit of treating LLM-generated features as requiring the same governance process as any other new rating factor, which is what they are.

The payoff is that the workflow we described in the earlier post — LLM-assisted interaction search for GLM models — remains a sound technique. The operator bias toward simple terms (arXiv:2410.17787) means you are unlikely to get an LLM to suggest the most valuable complex aggregation features. But for semantic prioritisation of GLM interaction candidates, the mechanism is credible. The discrimination testing protocol above is what makes it deployable.


FCA Evaluation Paper EP25/2: Supervisory Approach to AI in Insurance. Financial Conduct Authority, February 2025.

EIOPA Supervisory Statement on Artificial Intelligence (2025). EIOPA-BoS-25-360. August 2025.

PRA Supervisory Statement SS1/23: Model Risk Management Principles for Banks. Bank of England, May 2023 (operative May 2024).

FCA Research Note on Fairness and Machine Learning in Insurance Pricing. Financial Conduct Authority, December 2024.

NY DFS Circular Letter 2024-7: Use of External Consumer Data and Information Sources in Underwriting for Life Insurance. New York Department of Financial Services, 2024.

Lindholm M, Richman R, Tsanakas A, Wüthrich M (2022). “Discrimination-Free Insurance Pricing.” ASTIN Bulletin 52(1): 55–89.

Operator bias paper (2024). “Large Language Models Engineer Too Many Simple Features for Tabular Data.” arXiv:2410.17787.

insurance-fairness library: github.com/burning-cost/insurance-fairness
