shap_relativities.datasets
Synthetic datasets for testing and demonstration.
The motor dataset provides a synthetic UK personal lines motor portfolio with a known data generating process. Use it to validate relativity extraction against the true parameters.
"""
Synthetic datasets for testing and demonstration.

The motor dataset provides a synthetic UK personal lines motor portfolio with
a known data generating process. Use it to validate relativity extraction
against the true parameters.
"""

from .motor import TRUE_FREQ_PARAMS, TRUE_SEV_PARAMS, load_motor

__all__ = ["load_motor", "TRUE_FREQ_PARAMS", "TRUE_SEV_PARAMS"]
def load_motor(n_policies: int = 50000, seed: int = 42) -> polars.dataframe.frame.DataFrame:
def load_motor(
    n_policies: int = 50_000,
    seed: int = 42,
) -> pl.DataFrame:
    """
    Load a synthetic UK motor insurance dataset.

    Generates ``n_policies`` rows with realistic UK personal lines motor
    characteristics and simulated claims from a known data generating process.
    Because the true parameters are known (see ``TRUE_FREQ_PARAMS`` and
    ``TRUE_SEV_PARAMS``), you can fit GLMs and validate coefficient recovery.

    Args:
        n_policies: Number of policies to generate. Default 50,000 gives
            stable GLM estimates. Use 5,000-10,000 for quick tests.
        seed: Random seed for reproducibility. Changing this seed gives a
            different but equally valid synthetic portfolio.

    Returns:
        Polars DataFrame with one row per policy and columns:

        - ``policy_id``: Int64, sequential identifier
        - ``inception_date``: Date, policy start
        - ``expiry_date``: Date, policy end (may be < 12 months for
          cancellations)
        - ``accident_year``: Int64, year of inception (used for cohort splits)
        - ``vehicle_age``: Int64, 0-20 years
        - ``vehicle_group``: Int64, ABI group 1-50
        - ``driver_age``: Int64, 17-85
        - ``driver_experience``: Int64, years licensed
        - ``ncd_years``: Int64, 0-5 (UK NCD scale)
        - ``ncd_protected``: Boolean
        - ``conviction_points``: Int64, total endorsement points
        - ``annual_mileage``: Int64, 2,000-30,000 miles
        - ``area``: Utf8, ABI area band A-F
        - ``occupation_class``: Int64, 1-5
        - ``policy_type``: Utf8, 'Comp' or 'TPFT'
        - ``claim_count``: Int64, number of claims in period
        - ``incurred``: Float64, total incurred cost (0.0 if no claims)
        - ``exposure``: Float64, earned years (< 1.0 for cancellations)

    Examples:
        >>> df = load_motor(n_policies=10_000, seed=0)
        >>> df.shape[0]
        10000
        >>> df["claim_count"].mean()  # roughly 6-8% claim rate
        # ~0.07
    """
    rng = np.random.default_rng(seed)

    policy_data = _generate_policies(n_policies, rng)

    inception_dates = policy_data["inception_date"]
    expiry_dates = policy_data["expiry_date"]
    exposure = _calculate_earned_exposure(inception_dates, expiry_dates)

    accident_year = np.array([d.year for d in inception_dates], dtype=int)

    claim_count, incurred = _generate_claims(policy_data, exposure, rng)

    df = pl.DataFrame({
        "policy_id": np.arange(1, n_policies + 1, dtype=int),
        "inception_date": inception_dates,
        "expiry_date": expiry_dates,
        "accident_year": accident_year,
        "vehicle_age": policy_data["vehicle_age"].astype(int),
        "vehicle_group": policy_data["vehicle_group"].astype(int),
        "driver_age": policy_data["driver_age"].astype(int),
        "driver_experience": policy_data["driver_experience"].astype(int),
        "ncd_years": policy_data["ncd_years"].astype(int),
        "ncd_protected": policy_data["ncd_protected"],
        "conviction_points": policy_data["conviction_points"].astype(int),
        "annual_mileage": policy_data["annual_mileage"].astype(int),
        "area": policy_data["area"],
        "occupation_class": policy_data["occupation_class"].astype(int),
        "policy_type": policy_data["policy_type"],
        "claim_count": claim_count.astype(int),
        "incurred": incurred.astype(float),
        "exposure": exposure.astype(float),
    })

    # Polars infers date columns from Python date objects; cast to Date type
    df = df.with_columns([
        pl.col("inception_date").cast(pl.Date),
        pl.col("expiry_date").cast(pl.Date),
    ])

    column_order = [
        "policy_id", "inception_date", "expiry_date", "accident_year",
        "vehicle_age", "vehicle_group", "driver_age", "driver_experience",
        "ncd_years", "ncd_protected", "conviction_points", "annual_mileage",
        "area", "occupation_class", "policy_type", "claim_count", "incurred",
        "exposure",
    ]
    return df.select(column_order)
Load a synthetic UK motor insurance dataset.
Generates n_policies rows with realistic UK personal lines motor
characteristics and simulated claims from a known data generating process.
Because the true parameters are known (see TRUE_FREQ_PARAMS and
TRUE_SEV_PARAMS), you can fit GLMs and validate coefficient recovery.
Arguments:
- n_policies: Number of policies to generate. Default 50,000 gives stable GLM estimates. Use 5,000-10,000 for quick tests.
- seed: Random seed for reproducibility. Changing this seed gives a different but equally valid synthetic portfolio.
Returns:
Polars DataFrame with one row per policy and columns:
- policy_id: Int64, sequential identifier
- inception_date: Date, policy start
- expiry_date: Date, policy end (may be < 12 months for cancellations)
- accident_year: Int64, year of inception (used for cohort splits)
- vehicle_age: Int64, 0-20 years
- vehicle_group: Int64, ABI group 1-50
- driver_age: Int64, 17-85
- driver_experience: Int64, years licensed
- ncd_years: Int64, 0-5 (UK NCD scale)
- ncd_protected: Boolean
- conviction_points: Int64, total endorsement points
- annual_mileage: Int64, 2,000-30,000 miles
- area: Utf8, ABI area band A-F
- occupation_class: Int64, 1-5
- policy_type: Utf8, 'Comp' or 'TPFT'
- claim_count: Int64, number of claims in period
- incurred: Float64, total incurred cost (0.0 if no claims)
- exposure: Float64, earned years (< 1.0 for cancellations)
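Because exposure can be below 1.0 for cancelled policies, rates are usually computed per earned year rather than per policy. The sketch below illustrates this with four made-up rows that mimic the claim_count, incurred and exposure columns; the values are hypothetical and are not drawn from load_motor.

```python
# Toy rows mimicking three of the documented columns; the values are made up
# for illustration and do not come from load_motor.
rows = [
    {"claim_count": 0, "incurred": 0.0, "exposure": 1.0},
    {"claim_count": 1, "incurred": 1800.0, "exposure": 0.5},
    {"claim_count": 0, "incurred": 0.0, "exposure": 0.25},
    {"claim_count": 1, "incurred": 3200.0, "exposure": 1.0},
]

total_claims = sum(r["claim_count"] for r in rows)
total_exposure = sum(r["exposure"] for r in rows)
total_incurred = sum(r["incurred"] for r in rows)

frequency = total_claims / total_exposure        # claims per earned year
severity = total_incurred / total_claims         # average cost per claim
burning_cost = total_incurred / total_exposure   # incurred per earned year
naive_rate = total_claims / len(rows)            # per policy, ignores partial exposure

print(f"frequency={frequency:.3f} naive_rate={naive_rate:.3f}")
print(f"severity={severity:.0f} burning_cost={burning_cost:.0f}")
```

The per-policy rate understates the per-year frequency whenever some policies were on risk for less than a full year, which is why exposure is typically passed to a frequency model as an offset or weight.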
Examples:
>>> df = load_motor(n_policies=10_000, seed=0)
>>> df.shape[0]
10000
>>> df["claim_count"].mean()  # roughly 6-8% claim rate
# ~0.07
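One way to check coefficient recovery is to compare fitted coefficients against the known values name by name. The helper below is a minimal sketch: recovery_report is a hypothetical function, not part of this package, the "fitted" numbers are invented placeholders for whatever your GLM returns, and the tolerance of 0.05 is an arbitrary choice.

```python
def recovery_report(fitted, truth, tol=0.05):
    """Return {name: (fitted, true, abs_error, within_tol)} for shared names.

    Hypothetical helper for illustration; not part of shap_relativities.
    """
    report = {}
    for name, true_value in truth.items():
        if name not in fitted:
            continue  # skip parameters the fitted model does not expose
        err = abs(fitted[name] - true_value)
        report[name] = (fitted[name], true_value, err, err <= tol)
    return report


# Subset of values copied from TRUE_FREQ_PARAMS.
true_freq = {"intercept": -3.2, "ncd_years": -0.12, "area_F": 0.65}

# Invented placeholder coefficients standing in for a fitted GLM's output.
fitted = {"intercept": -3.17, "ncd_years": -0.125, "area_F": 0.52}

report = recovery_report(fitted, true_freq)
for name, (est, true_value, err, ok) in report.items():
    status = "ok" if ok else "CHECK"
    print(f"{name:12s} fitted={est:+.3f} true={true_value:+.3f} err={err:.3f} {status}")
```

A fixed absolute tolerance is crude. With a real fit, comparing the error to the coefficient's standard error is usually a better test.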
TRUE_FREQ_PARAMS = {'intercept': -3.2, 'vehicle_group': 0.025, 'driver_age_young': 0.55, 'driver_age_old': 0.3, 'ncd_years': -0.12, 'area_B': 0.1, 'area_C': 0.2, 'area_D': 0.35, 'area_E': 0.5, 'area_F': 0.65, 'has_convictions': 0.45}
TRUE_SEV_PARAMS = {'intercept': 7.8, 'vehicle_group': 0.018, 'driver_age_young': 0.25}
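If these parameters sit on a log scale, as they would under a log-link GLM, exponentiating them gives multiplicative relativities. That is an assumption: the magnitudes are consistent with it, but the link function is not stated here. The sketch below copies both dictionaries so that it runs on its own; relativities is a hypothetical helper, not part of this package.

```python
import math

# Values copied from the constants documented above.
TRUE_FREQ_PARAMS = {
    "intercept": -3.2, "vehicle_group": 0.025, "driver_age_young": 0.55,
    "driver_age_old": 0.3, "ncd_years": -0.12, "area_B": 0.1, "area_C": 0.2,
    "area_D": 0.35, "area_E": 0.5, "area_F": 0.65, "has_convictions": 0.45,
}
TRUE_SEV_PARAMS = {"intercept": 7.8, "vehicle_group": 0.018, "driver_age_young": 0.25}


def relativities(params):
    """Exponentiate every non-intercept parameter (assumes a log link)."""
    return {name: math.exp(value) for name, value in params.items() if name != "intercept"}


# Under the log-link assumption, exp(intercept) is the base level for a risk
# with every other term at zero.
base_frequency = math.exp(TRUE_FREQ_PARAMS["intercept"])
base_severity = math.exp(TRUE_SEV_PARAMS["intercept"])

freq_rel = relativities(TRUE_FREQ_PARAMS)
sev_rel = relativities(TRUE_SEV_PARAMS)

print(f"base frequency: {base_frequency:.4f}, base severity: {base_severity:.0f}")
for name, rel in freq_rel.items():
    print(f"{name:18s} x{rel:.3f}")
```

For terms that look like per-unit effects, such as vehicle_group and ncd_years, the relativity would apply per unit of the feature. Whether those features enter the model linearly is also an assumption.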