The FCA’s Consumer Duty (PS22/9, in force July 2023) and the General Insurance Pricing Practices rules (PS21/5) together create a clear obligation: firms must be able to demonstrate that their pricing models produce fair outcomes and do not use factors that act as conduits for protected characteristics the firm is not permitted to use directly. This is not a new idea, but the FCA’s 2024 multi-firm Consumer Duty review made clear that the standard of evidence most firms were producing was inadequate.
This post explains what that obligation actually requires technically, why the general-purpose fairness tooling you might reach for first does not answer the right question, and how to run a conformant proxy discrimination audit in Python using the insurance-fairness library.
What the regulatory obligation actually requires
The obligation comes from two directions simultaneously.
From the FCA side: Consumer Duty’s Price and Value Outcome (PRIN 2A.4) requires firms to demonstrate that their products and services represent fair value. The FCA’s multi-firm review of Consumer Duty implementation (2024) found that evidence quality was “too high level and lacking the granularity to adequately evidence good outcomes.” General Insurance Pricing Practices (PS21/5) adds the specific prohibition on price walking and the requirement that renewal premiums are fair. Both require documented, reproducible evidence — not a one-line assertion.
From the Equality Act side: Section 19 prohibits indirect discrimination — applying a provision, criterion, or practice that is not directly related to a protected characteristic but which puts persons sharing a protected characteristic at a particular disadvantage. A rating factor that is a strong statistical proxy for ethnicity or disability may constitute indirect discrimination unless the firm can demonstrate a legitimate aim proportionate to the discriminatory effect. This is a factor-level question, not an output-level question.
Three things follow from this that shape what a technical audit needs to produce.
First, the test is about factor-level correlation, not output-level disparity. A model that charges different average premiums to men and women is not necessarily discriminating - risk differs. A model that uses a factor which is a strong predictor of gender, without having considered that relationship, is a problem. The regulatory obligation is to examine the inputs, not just compare the group average outputs.
Second, conditional independence is the legal standard. The Lindholm, Richman, Tsanakas and Wüthrich (LRTW) framework, now the academic reference point for FCA enforcement discussions, defines discrimination as: a pricing model that is not conditionally independent of a protected attribute S given the observed rating factors X. In other words, if knowing a customer’s protected characteristic tells you something about what the model will charge them - even after you know all their risk factors - the model is discriminating. This is a stricter and more precise test than demographic parity.
Third, the audit trail is part of the obligation. The same 2024 multi-firm review criticised evidence that was too high level and lacked the granularity to demonstrate good outcomes. The audit needs to be reproducible, factor-level, and documented - not a one-line assertion that the model was reviewed.
Why Fairlearn and AIF360 do not fit
The two most widely used Python fairness libraries are Fairlearn (Microsoft) and AIF360 (IBM). Both are well-built. Neither was designed for this problem.
They implement demographic parity. The core metric in both libraries is a comparison of model outputs across protected groups - are the average predictions, positive prediction rates, or error rates similar across groups? This is the right question for fraud classification or credit approval. For insurance pricing, it is the wrong question. UK insurers are explicitly permitted to charge different average premiums to different groups when risk differs. Achieving demographic parity does not make a pricing model compliant - it probably means the model is mispricing some risks.
They do not test rating factor inputs. Fairlearn’s MetricFrame and AIF360’s ClassificationMetric both take model outputs and ask how those outputs vary by protected characteristic. They cannot tell you whether occupation carries ethnicity information, or whether postcode_district is a stronger predictor of the protected characteristic than it is of claim frequency. Factor-level proxy detection - the thing the Equality Act Section 19 analysis requires - is not in either library’s scope.
They do not handle exposure weights. An insurance portfolio is not a set of equally weighted rows: a single-month policy contributes a twelfth of the exposure of an annual policy on the same risk. Both libraries treat every row equally, so their calibration-by-group and demographic parity computations are unweighted - and unweighted metrics are wrong for insurance.
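To make the weighting point concrete, here is a minimal Polars sketch with made-up numbers showing how an unweighted group mean and an exposure-weighted group mean diverge on the same predictions:

import polars as pl

# Illustrative data: two groups, mixed policy durations (exposure in policy-years)
toy = pl.DataFrame({
    "group":          ["A", "A", "B", "B"],
    "exposure":       [1.0, 1.0 / 12, 1.0, 1.0],   # annual vs one-month policies
    "predicted_freq": [0.10, 0.40, 0.12, 0.14],
})

comparison = toy.group_by("group").agg(
    # What a row-based fairness metric effectively computes: every row counts equally
    unweighted_mean=pl.col("predicted_freq").mean(),
    # What an insurance metric needs: weight each policy by its exposure
    weighted_mean=(pl.col("predicted_freq") * pl.col("exposure")).sum()
                  / pl.col("exposure").sum(),
)
print(comparison.sort("group"))

For group A the unweighted mean is 0.25 while the exposure-weighted mean is roughly 0.12 - the one-month policy dominates the row count but contributes almost nothing to the exposure.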
They do not produce insurance-specific audit outputs. A Fairlearn MetricFrame is not an FCA evidence document. It does not contain a regulatory mapping, a RAG status, or the factor-level proxy scores that a pricing committee sign-off requires. The output format matters: if you want to put something in a model risk register or send it to a Consumer Duty owner, you need a structured document, not a Python object.
The right tool is one built for the specific regulatory question: do your non-protected rating factors proxy for protected characteristics? That is what insurance-fairness is for.
The three-step workflow
Install the library:
uv add insurance-fairness
The library requires a trained CatBoost model and a Polars DataFrame. It produces a structured FairnessReport with RAG statuses, factor-level proxy scores, and a Markdown output suitable for governance documentation.
Step 1: Fit the model and prepare the data
The audit takes a policy-level DataFrame with a prediction column already populated. The typical setup for a UK motor frequency model:
import polars as pl
from catboost import CatBoostRegressor
from insurance_fairness import FairnessAudit
# Load your policy data and trained model
df = pl.read_parquet("motor_policies_2025.parquet")
model = CatBoostRegressor()
model.load_model("frequency_model_v4.cbm")
# Add predictions to the DataFrame if not already present
X = df.select(["postcode_district", "vehicle_group", "occupation", "ncd_years", "age_band"])
df = df.with_columns(
    pl.Series("predicted_freq", model.predict(X.to_pandas()))
)
The audit in the next step also expects an ethnicity_prop column on the DataFrame. That column is an area-level ONS Census 2021 continuous proxy: the proportion of residents who are not white British at LSOA level, joined to each policy via a postcode-to-LSOA lookup (the join itself is covered in the data engineering section at the end of this post). It is a floating-point number between 0 and 1, not a binary flag. The library handles continuous protected characteristic proxies natively.
Step 2: Run the proxy audit
audit = FairnessAudit(
    model=model,
    data=df,
    protected_cols=["ethnicity_prop"],
    prediction_col="predicted_freq",
    outcome_col="claim_count",
    exposure_col="policy_years",
    factor_cols=["postcode_district", "vehicle_group", "occupation", "ncd_years", "age_band"],
    model_name="Motor Frequency v4 — Q1 2026",
    run_proxy_detection=True,
)
report = audit.run()
report.summary()
The audit runs three complementary proxy detection methods for each (factor, protected characteristic) pair:
- CatBoost proxy R-squared: fits a gradient boosting model to predict ethnicity_prop from each rating factor alone, using exposure as sample weights. The held-out R-squared measures how much a single factor explains of the protected characteristic proxy. The amber threshold is 0.05; red is 0.10.
- Mutual information: model-free, captures non-linear dependencies. A postcode band that is not monotonically ordered with ethnicity proportion - higher in inner London, lower in the Home Counties, higher again in Bradford - will show up here when a linear or rank correlation misses it. Measured in nats.
- Partial Pearson-residualised Spearman correlation: the association between a factor and the protected characteristic after controlling for the other rating factors. This answers: does postcode carry ethnicity information beyond what vehicle group, occupation, and NCD already explain?
All three are exposure-weighted throughout.
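For intuition, the first of those three checks is approximately the following sketch - illustrative only, not the library's internals, and it assumes scikit-learn for the split and the scoring:

from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def single_factor_proxy_r2(df, factor, protected="ethnicity_prop", exposure="policy_years"):
    """Held-out R-squared of a one-factor model predicting the protected proxy."""
    pdf = df.select([factor, protected, exposure]).to_pandas()
    train, test = train_test_split(pdf, test_size=0.3, random_state=42)

    model = CatBoostRegressor(depth=4, iterations=200, verbose=False)
    model.fit(
        train[[factor]],
        train[protected],
        sample_weight=train[exposure],                       # exposure-weighted fit
        cat_features=[factor] if train[factor].dtype == object else [],  # string columns as categorical
    )
    preds = model.predict(test[[factor]])
    # Exposure-weighted held-out R-squared: how much this one factor explains of the proxy
    return r2_score(test[protected], preds, sample_weight=test[exposure])

# e.g. single_factor_proxy_r2(df, "postcode_district")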
Step 3: Generate the audit report
# Structured Markdown for governance documentation
report.to_markdown("fairness_audit_motor_2026q1.md")
# Machine-readable JSON for audit trail and MI submissions
import json
with open("fairness_audit_motor_2026q1.json", "w") as f:
    json.dump(report.to_dict(), f, indent=2)
The Markdown report maps each finding to the specific FCA regulatory requirement: PRIN 2A.4 (Price and Value Outcome), Consumer Duty (PS22/9), and Equality Act 2010 Section 19. It contains the factor-level proxy scores, RAG statuses, calibration-by-group results, and the overall audit status. It is structured to go directly into a pricing committee pack or model risk register.
Reading the output
The proxy detection scores for each rating factor:
proxy_result = report.results["ethnicity_prop"].proxy_detection
print(proxy_result.to_polars().select(
["factor", "proxy_r2", "mutual_information", "partial_correlation", "rag"]
))
shape: (5, 5)
┌────────────────────┬──────────┬───────────────────┬─────────────────────┬───────┐
│ factor ┆ proxy_r2 ┆ mutual_information ┆ partial_correlation ┆ rag │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞════════════════════╪══════════╪═══════════════════╪═════════════════════╪═══════╡
│ postcode_district ┆ 0.3847 ┆ 0.2913 ┆ 0.5821 ┆ red │
│ occupation ┆ 0.0832 ┆ 0.0721 ┆ 0.1934 ┆ amber │
│ vehicle_group ┆ 0.0621 ┆ 0.0514 ┆ 0.1183 ┆ amber │
│ age_band ┆ 0.0198 ┆ 0.0231 ┆ -0.0412 ┆ green │
│ ncd_years ┆ 0.0087 ┆ 0.0119 ┆ 0.0203 ┆ green │
└────────────────────┴──────────┴───────────────────┴─────────────────────┴───────┘
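The result frame is an ordinary Polars DataFrame, so pulling out the factors that need attention is a one-line filter on the columns shown above:

flagged = proxy_result.to_polars().filter(pl.col("rag") != "green")
print(flagged.select(["factor", "proxy_r2", "partial_correlation", "rag"]))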
What each result means in practice:
postcode_district, proxy R-squared 0.38, red. A CatBoost model trained on postcode district alone explains 38% of the variance in the ethnicity proxy. This is not a borderline result. Postcode encodes substantial demographic information; the model’s postcode relativity is doing pricing work that partially reflects ethnicity. This does not automatically mean the model is unlawfully discriminating - postcode also reflects urban density, theft rates, road quality, and traffic patterns that genuinely predict claims. But it means the factor warrants full decomposition: how much of the postcode premium variation is genuine risk, and how much is demographic correlation?
occupation, proxy R-squared 0.08, amber. Sits between the amber threshold (0.05) and the red threshold (0.10). Manual occupations correlate with socioeconomic status, which correlates with multiple protected characteristics. An amber result means: document this, monitor it, understand the evidence. It does not mean remove occupation from the model.
age_band and ncd_years, green. These factors have low proxy R-squared for ethnicity. That is the expected result - NCD years and driver age are not strong proxies for ethnicity in UK motor data.
The partial correlation for occupation (0.19) is noticeably higher than its proxy R-squared (0.08). This indicates occupation carries ethnicity-correlated information beyond what the other factors explain - after controlling for postcode, vehicle group, and NCD, occupation still tells you something about ethnicity. Worth investigating whether your occupation banding creates avoidable concentration.
The overall report also contains the calibration-by-group check: actual-to-expected claim rates within each decile of the protected characteristic proxy. A well-calibrated model with a high proxy R-squared for postcode means the price variation is tracking genuine risk, not systematic mispricing. A miscalibrated model on top of a high proxy R-squared is the worst outcome.
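If you want to reproduce that check outside the report, an exposure-weighted actual-to-expected by decile of the proxy looks roughly like this - a sketch, not the library's implementation, using the column names from the audit above:

import polars as pl

ae_by_decile = (
    df.with_columns(
        # Decile of the protected characteristic proxy, 1 (lowest) to 10 (highest)
        decile=(pl.col("ethnicity_prop").rank("ordinal") * 10 / pl.len()).ceil().clip(1, 10)
    )
    .group_by("decile")
    .agg(
        actual=pl.col("claim_count").sum(),
        # Expected claims = predicted frequency times exposure in policy-years
        expected=(pl.col("predicted_freq") * pl.col("policy_years")).sum(),
    )
    .with_columns(actual_to_expected=pl.col("actual") / pl.col("expected"))
    .sort("decile")
)
print(ae_by_decile)

A ratio drifting systematically away from 1.0 in the high-proxy deciles is the calibration disparity the report flags.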
What to do about amber and red results
The FCA does not require a zero-proxy model. It requires evidence of engagement.
For a red result (proxy R-squared above 0.10): the factor needs explicit investigation and documentation. The minimum requirement is to quantify how much of the factor’s premium variation is attributable to genuine risk differentiation versus demographic correlation. The insurance_fairness.optimal_transport subpackage handles this - it implements the LRTW discrimination-free pricing calculation, which averages the model’s output over the unconditional, portfolio-level distribution of the protected characteristic rather than the conditional distribution implied by the non-protected features, which is the channel proxy discrimination uses. This produces a decomposition: here is the discrimination-free price, here is the adjustment relative to the uncorrected model, here is the magnitude of the proxy discrimination channel.
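Stated compactly - this is the published LRTW framework, not necessarily the subpackage's exact internals. With Y the claims outcome, X the non-protected rating factors, and S the protected characteristic:

\mu(x) = E[Y \mid X = x] = \sum_s E[Y \mid X = x, S = s] \, P(S = s \mid X = x)   (best-estimate price)

h^*(x) = \sum_s E[Y \mid X = x, S = s] \, P(S = s)   (discrimination-free price)

The only difference is the weights: the discrimination-free price averages over the portfolio-level distribution of S, so the price can no longer infer the protected characteristic from the customer's own rating factors.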
For an amber result: document the finding, the proxy R-squared, the mutual information, and the calibration result. Record the conclusion: either “the calibration is strong, the factor reflects genuine risk, and we are monitoring it at each model review” or “we are taking the following mitigation action.” Either conclusion is defensible if it is documented.
For green results: the audit trail records that the test was run and passed. File it.
The specific trigger for deeper investigation should be: amber or red proxy R-squared combined with a calibration disparity above the amber threshold (0.10). That combination - a factor that proxies for a protected characteristic and a model that is systematically miscalibrating one demographic group - is the pattern that the FCA’s enforcement concern is focused on.
If remediation is required, the options in order of preference are: (1) add features that capture the legitimate causal channel more directly, reducing the proxy correlation at source (urban density index, telematics features, road quality scores instead of raw postcode); (2) apply LRTW marginalisation to correct the prices directly; (3) document the justification that the remaining correlation reflects genuine risk variation under Equality Act 2010 Section 19 indirect discrimination’s proportionate means / legitimate aim defence.
Integration with model governance
The FairnessAudit is designed to run as part of the standard model review cycle, not as a one-off exercise. The right integration is:
# Run as part of annual model review
report = audit.run()
report.to_markdown(f"governance/fairness_audit_{review_date}.md")
# Check overall status and fail the review if red
if report.overall_rag == "red":
    raise RuntimeError(
        f"Fairness audit RED status. Flagged factors: {report.flagged_factors}. "
        "Escalate to Chief Actuary before model deployment."
    )
# Append to the model risk register
print(f"Audit date: {report.audit_date}")
print(f"Overall RAG: {report.overall_rag.upper()}")
print(f"Flagged proxy factors: {report.flagged_factors}")
The insurance-governance library’s model registry is designed to accept FairnessReport outputs as structured evidence items in the model risk register. Linking the two means that every model deployment is associated with a dated fairness audit, the audit output is retrievable on demand, and changes in RAG status between review cycles are logged automatically. See PRA SS1/23-Compliant Model Validation in Python for the governance integration.
The Consumer Duty obligation is ongoing. The FCA is explicit that Consumer Duty monitoring requires regular review, not a single assessment at model launch. Annual re-run on the current in-force book is the minimum; quarterly on books with significant mix shifts is more defensible.
The data engineering question
The biggest practical obstacle for most firms is not the statistical method - it is assembling the protected characteristic proxy column.
For ethnicity in UK motor and home insurance, the standard approach is the ONS Census 2021 ethnic group tables at Lower Layer Super Output Area (LSOA) level, joined to your policy data via a postcode-to-LSOA lookup. The ONS publishes both: the TS021 ethnic group tables and the National Statistics Postcode Lookup (NSPL). The join itself is straightforward: policy postcode to LSOA code via the NSPL, then LSOA code to the Census ethnic group proportions. The result is a floating-point ethnicity proportion per policy, which becomes the ethnicity_prop column you pass in protected_cols. The library handles continuous protected characteristic proxies natively; you do not need to binarise.
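A minimal sketch of that join in Polars; the file names and the NSPL/TS021 column labels below are assumptions, so check them against the extracts you actually download:

import polars as pl

# NSPL extract: unit postcode -> LSOA 2021 code (column names assumed)
nspl = pl.read_csv("nspl_2021.csv").select(
    pl.col("pcds").alias("postcode"),
    pl.col("lsoa21").alias("lsoa_code"),
)

# Census 2021 TS021 derived table: proportion of residents who are not white British, per LSOA
ethnicity = pl.read_csv("ts021_lsoa.csv").select(
    pl.col("lsoa_code"),
    pl.col("non_white_british_prop").alias("ethnicity_prop"),
)

# Attach the proxy to each policy via its postcode
df = (
    df.join(nspl, on="postcode", how="left")
      .join(ethnicity, on="lsoa_code", how="left")
)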
For gender, the situation has changed since the EU Test-Achats ruling (gender-neutral pricing required from December 2012). Insurers can no longer use gender as a rating factor and many no longer collect it at quote, so building a gender proxy requires either a name-based estimation method or a proxy from fleet composition data. This is harder, and the proxies are noisier.
For disability, the Equality Act 2010 definition covers a broad range of conditions that insurers do not ask about and would not be permitted to use. A disability proxy requires either self-reported data or area-level DWP statistics. The library accepts any numeric column as a protected characteristic - the statistical machinery is the same regardless of which protected characteristic the column represents.
The regulatory timetable here is not abstract. The FCA’s 2024 Consumer Duty multi-firm review told the market that the quality of evidence firms were producing was inadequate. Consumer Duty (PS22/9) set the analytical standard and the FCA has signalled that the next intervention will be enforcement, not another consultation paper.
A FairnessAudit on your current production model takes an afternoon of data engineering and a few minutes of compute. The output is a dated, factor-level, RAG-rated document with regulatory mapping. That is what the FCA is asking for.
uv add insurance-fairness
Source and issue tracker at github.com/burning-cost/insurance-fairness.
Related posts:
- Proxy Discrimination in UK Motor Pricing: Detection and Correction - the LRTW framework, the Citizens Advice data, and the full audit workflow
- Fairlearn vs insurance-fairness - direct comparison of why general-purpose ML fairness tools miss the FCA’s specific concern
- Discrimination-Free Pricing in Python - LRTW marginalisation and causal path decomposition for correcting proxy discrimination
- PRA SS1/23-Compliant Model Validation in Python - governance integration for fairness audit outputs