
shap_relativities.datasets

Synthetic datasets for testing and demonstration.

The motor dataset provides a synthetic UK personal lines motor portfolio with a known data generating process. Use it to validate relativity extraction against the true parameters.

"""
Synthetic datasets for testing and demonstration.

The motor dataset provides a synthetic UK personal lines motor portfolio with
a known data generating process. Use it to validate relativity extraction
against the true parameters.
"""

from .motor import TRUE_FREQ_PARAMS, TRUE_SEV_PARAMS, load_motor

__all__ = ["load_motor", "TRUE_FREQ_PARAMS", "TRUE_SEV_PARAMS"]
def load_motor(n_policies: int = 50000, seed: int = 42) -> polars.dataframe.frame.DataFrame:
def load_motor(
    n_policies: int = 50_000,
    seed: int = 42,
) -> pl.DataFrame:
    """
    Load a synthetic UK motor insurance dataset.

    Generates ``n_policies`` rows with realistic UK personal lines motor
    characteristics and simulated claims from a known data generating process.
    Because the true parameters are known (see ``TRUE_FREQ_PARAMS`` and
    ``TRUE_SEV_PARAMS``), you can fit GLMs and validate coefficient recovery.

    Args:
        n_policies: Number of policies to generate. Default 50,000 gives
            stable GLM estimates. Use 5,000-10,000 for quick tests.
        seed: Random seed for reproducibility. Changing this seed gives a
            different but equally valid synthetic portfolio.

    Returns:
        Polars DataFrame with one row per policy and columns:

        - ``policy_id``: Int64, sequential identifier
        - ``inception_date``: Date, policy start
        - ``expiry_date``: Date, policy end (may be < 12 months for
          cancellations)
        - ``accident_year``: Int64, year of inception (used for cohort splits)
        - ``vehicle_age``: Int64, 0-20 years
        - ``vehicle_group``: Int64, ABI group 1-50
        - ``driver_age``: Int64, 17-85
        - ``driver_experience``: Int64, years licensed
        - ``ncd_years``: Int64, 0-5 (UK NCD scale)
        - ``ncd_protected``: Boolean
        - ``conviction_points``: Int64, total endorsement points
        - ``annual_mileage``: Int64, 2,000-30,000 miles
        - ``area``: Utf8, ABI area band A-F
        - ``occupation_class``: Int64, 1-5
        - ``policy_type``: Utf8, 'Comp' or 'TPFT'
        - ``claim_count``: Int64, number of claims in period
        - ``incurred``: Float64, total incurred cost (0.0 if no claims)
        - ``exposure``: Float64, earned years (< 1.0 for cancellations)

    Examples:
        >>> df = load_motor(n_policies=10_000, seed=0)
        >>> df.shape[0]
        10000
        >>> df["claim_count"].mean()  # roughly 6-8% claim rate
        # ~0.07
    """
    rng = np.random.default_rng(seed)

    policy_data = _generate_policies(n_policies, rng)

    inception_dates = policy_data["inception_date"]
    expiry_dates = policy_data["expiry_date"]
    exposure = _calculate_earned_exposure(inception_dates, expiry_dates)

    accident_year = np.array([d.year for d in inception_dates], dtype=int)

    claim_count, incurred = _generate_claims(policy_data, exposure, rng)

    df = pl.DataFrame({
        "policy_id": np.arange(1, n_policies + 1, dtype=int),
        "inception_date": inception_dates,
        "expiry_date": expiry_dates,
        "accident_year": accident_year,
        "vehicle_age": policy_data["vehicle_age"].astype(int),
        "vehicle_group": policy_data["vehicle_group"].astype(int),
        "driver_age": policy_data["driver_age"].astype(int),
        "driver_experience": policy_data["driver_experience"].astype(int),
        "ncd_years": policy_data["ncd_years"].astype(int),
        "ncd_protected": policy_data["ncd_protected"],
        "conviction_points": policy_data["conviction_points"].astype(int),
        "annual_mileage": policy_data["annual_mileage"].astype(int),
        "area": policy_data["area"],
        "occupation_class": policy_data["occupation_class"].astype(int),
        "policy_type": policy_data["policy_type"],
        "claim_count": claim_count.astype(int),
        "incurred": incurred.astype(float),
        "exposure": exposure.astype(float),
    })

    # Polars infers date columns from Python date objects; cast to Date type
    df = df.with_columns([
        pl.col("inception_date").cast(pl.Date),
        pl.col("expiry_date").cast(pl.Date),
    ])

    column_order = [
        "policy_id", "inception_date", "expiry_date", "accident_year",
        "vehicle_age", "vehicle_group", "driver_age", "driver_experience",
        "ncd_years", "ncd_protected", "conviction_points", "annual_mileage",
        "area", "occupation_class", "policy_type", "claim_count", "incurred",
        "exposure",
    ]
    return df.select(column_order)

Load a synthetic UK motor insurance dataset.

Generates n_policies rows with realistic UK personal lines motor characteristics and simulated claims from a known data generating process. Because the true parameters are known (see TRUE_FREQ_PARAMS and TRUE_SEV_PARAMS), you can fit GLMs and validate coefficient recovery.

Arguments:
  • n_policies: Number of policies to generate. Default 50,000 gives stable GLM estimates. Use 5,000-10,000 for quick tests.
  • seed: Random seed for reproducibility. Changing this seed gives a different but equally valid synthetic portfolio.
Returns:

Polars DataFrame with one row per policy and columns:

  • policy_id: Int64, sequential identifier
  • inception_date: Date, policy start
  • expiry_date: Date, policy end (may be < 12 months for cancellations)
  • accident_year: Int64, year of inception (used for cohort splits)
  • vehicle_age: Int64, 0-20 years
  • vehicle_group: Int64, ABI group 1-50
  • driver_age: Int64, 17-85
  • driver_experience: Int64, years licensed
  • ncd_years: Int64, 0-5 (UK NCD scale)
  • ncd_protected: Boolean
  • conviction_points: Int64, total endorsement points
  • annual_mileage: Int64, 2,000-30,000 miles
  • area: Utf8, ABI area band A-F
  • occupation_class: Int64, 1-5
  • policy_type: Utf8, 'Comp' or 'TPFT'
  • claim_count: Int64, number of claims in period
  • incurred: Float64, total incurred cost (0.0 if no claims)
  • exposure: Float64, earned years (< 1.0 for cancellations)
Examples:
>>> df = load_motor(n_policies=10_000, seed=0)
>>> df.shape[0]
10000
>>> df["claim_count"].mean()  # roughly 6-8% claim rate
# ~0.07
TRUE_FREQ_PARAMS = {'intercept': -3.2, 'vehicle_group': 0.025, 'driver_age_young': 0.55, 'driver_age_old': 0.3, 'ncd_years': -0.12, 'area_B': 0.1, 'area_C': 0.2, 'area_D': 0.35, 'area_E': 0.5, 'area_F': 0.65, 'has_convictions': 0.45}
TRUE_SEV_PARAMS = {'intercept': 7.8, 'vehicle_group': 0.018, 'driver_age_young': 0.25}
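Since the module is meant for validating relativity extraction against the true parameters, it can help to see those parameters on the multiplicative scale. The sketch below assumes the coefficients are on a log-link scale (the usual choice for frequency and severity GLMs, and consistent with the intercepts above, but not stated on this page). The dictionaries are copied from the values listed above so the snippet runs without the package installed; `to_relativities` is a hypothetical helper, not part of `shap_relativities`.

```python
import math

# Values copied from the constants documented above.
TRUE_FREQ_PARAMS = {
    "intercept": -3.2, "vehicle_group": 0.025, "driver_age_young": 0.55,
    "driver_age_old": 0.3, "ncd_years": -0.12, "area_B": 0.1, "area_C": 0.2,
    "area_D": 0.35, "area_E": 0.5, "area_F": 0.65, "has_convictions": 0.45,
}
TRUE_SEV_PARAMS = {"intercept": 7.8, "vehicle_group": 0.018, "driver_age_young": 0.25}


def to_relativities(params):
    """Exponentiate every non-intercept coefficient.

    Hypothetical helper; assumes the coefficients come from a log-link model,
    so exp(coefficient) is a multiplicative relativity.
    """
    return {name: math.exp(beta) for name, beta in params.items() if name != "intercept"}


freq_rel = to_relativities(TRUE_FREQ_PARAMS)
sev_rel = to_relativities(TRUE_SEV_PARAMS)

# Under the same log-link assumption, exp(intercept) is the base level:
# roughly 0.041 claims per policy year and a base severity of roughly 2,440.
base_frequency = math.exp(TRUE_FREQ_PARAMS["intercept"])
base_severity = math.exp(TRUE_SEV_PARAMS["intercept"])

for name, rel in freq_rel.items():
    print(f"{name:>18}: {rel:.3f}")
```

Relativities extracted from a fitted model can then be compared entry by entry with `freq_rel` and `sev_rel`. For continuous features such as `vehicle_group` and `ncd_years`, the value is presumably a per-unit multiplier; how each feature enters the linear predictor is defined in the `motor` module, which is not shown here.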