French motor third-party liability claims prediction
The French motor third-party liability dataset contains ~680k policies with their claim count and amount over an exposure period. A standard way to model claim counts is Poisson regression where exposure enters as an offset:
\[\log(\mathbb{E}[\text{ClaimNb}]) = f(\text{features}) + \log(\text{Exposure})\]
This example trains rgbm with objective="poisson" and a log-exposure offset, then compares it to LightGBM with the same setup.
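To see what the offset does, consider an intercept-only model, where the maximum-likelihood estimate has a closed form: the fitted rate is total claims divided by total exposure. The sketch below uses synthetic data only. The rate of 0.1 and the exposure range are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic portfolio (illustrative values, not the freMTPL2 data).
exposure = rng.uniform(0.1, 1.0, size=10_000)  # policy-years
true_rate = 0.1                                # claims per policy-year
claims = rng.poisson(true_rate * exposure)

# Intercept-only Poisson model with offset log(exposure):
#   log E[claims] = b + log(exposure)
# The score equation sum(claims - exp(b) * exposure) = 0 gives:
b_hat = np.log(claims.sum() / exposure.sum())

# With the offset, predicted counts scale linearly with exposure.
predicted = np.exp(b_hat + np.log(exposure))
assert np.allclose(predicted, np.exp(b_hat) * exposure)

print(f"estimated rate: {np.exp(b_hat):.3f} (true rate: {true_rate})")
```

The boosted models in this example replace the constant `b` with a learned function of the features, but the offset plays the same role: the model learns a claim rate per unit of exposure.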
[1]:
import lightgbm as lgb
import numpy as np
import polars as pl
import rgbm
from sklearn.datasets import fetch_openml
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split
Load the data
Fetch the data from OpenML and split it 90% / 10% into train and test sets.
[2]:
raw = fetch_openml(data_id=41214, as_frame=True, parser="auto").frame
df = pl.from_pandas(raw)
categorical_columns = ["Area", "VehBrand", "VehGas", "Region"]
numerical_columns = ["VehPower", "VehAge", "DrivAge", "BonusMalus", "Density"]
df = df.with_columns(
    pl.col("ClaimNb").cast(pl.Float64),
    pl.col("Exposure").cast(pl.Float64),
    *[pl.col(c).cast(pl.Categorical) for c in categorical_columns],
    *[pl.col(c).cast(pl.Float64) for c in numerical_columns],
)
train, test = train_test_split(df, test_size=0.1, random_state=0)
y_train = train["ClaimNb"].to_numpy()
y_test = test["ClaimNb"].to_numpy()
exposure_train = train["Exposure"].to_numpy()
exposure_test = test["Exposure"].to_numpy()
feature_columns = categorical_columns + numerical_columns
X_train = train.select(feature_columns)
X_test = test.select(feature_columns)
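Both models take `np.log` of the exposure as the offset, which is only finite when every exposure is strictly positive. A minimal sketch of a guard, assuming a hypothetical helper `log_exposure` that is not part of rgbm or LightGBM:

```python
import numpy as np


def log_exposure(exposure):
    """Return log(exposure), rejecting values that have no finite log.

    Hypothetical helper for illustration; not part of rgbm or LightGBM.
    """
    exposure = np.asarray(exposure, dtype=float)
    bad = ~np.isfinite(exposure) | (exposure <= 0)
    if bad.any():
        raise ValueError(
            f"{int(bad.sum())} exposure value(s) are not strictly positive and finite"
        )
    return np.log(exposure)


# Toy exposures in policy-years (illustrative values).
offsets = log_exposure([0.25, 0.5, 1.0])
```

Such a helper could stand in for the bare `np.log(exposure_train)` calls if the input data is not already known to be clean.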
Fit rgbm gradient boosting model
[3]:
PARAMS = dict(
    objective="poisson",
    num_iterations=1000,
    learning_rate=0.1,
    max_depth=6,
    max_leaves=31,
    lambda_l2=1.0,
    min_sum_hessian_in_leaf=10,
)
dataset = rgbm.Dataset(X_train, y_train, offsets=np.log(exposure_train))
booster = rgbm.Booster(**PARAMS)
booster.fit(dataset)
rgbm_pred = booster.predict(X_test, offsets=np.log(exposure_test))
print(f"Deviance: {mean_poisson_deviance(y_test, rgbm_pred):.4f}")
Deviance: 0.3012
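For reference, the unweighted mean Poisson deviance reported here is \(\frac{2}{n}\sum_i \left[y_i \log(y_i / \mu_i) - y_i + \mu_i\right]\), with the convention that \(y_i \log(y_i / \mu_i) = 0\) when \(y_i = 0\). The following is a plain NumPy sketch of that formula. The function name `poisson_deviance` and the toy inputs are illustrative assumptions, and predictions are assumed to be strictly positive.

```python
import numpy as np


def poisson_deviance(y_true, y_pred):
    """Unweighted mean Poisson deviance; y_pred must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # y * log(y / mu), using the convention 0 * log(0) = 0:
    # where y == 0 the ratio is replaced by 1, so the log term vanishes.
    ratio = np.where(y_true > 0, y_true / y_pred, 1.0)
    log_term = y_true * np.log(ratio)
    return 2.0 * np.mean(log_term - y_true + y_pred)


# Toy claim counts and predicted means (illustrative values).
y = [0, 1, 0, 2]
mu = [0.1, 0.5, 0.2, 1.0]
toy_deviance = poisson_deviance(y, mu)
```

Lower is better, and a prediction that equals the observed count contributes zero deviance for that policy.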
Compare to LightGBM
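LightGBM is trained with the same hyperparameters, with log-exposure passed as init_score. Since init_score is not applied at prediction time, we predict raw scores and add log-exposure back before exponentiating. min_data_in_leaf is set to 0, overriding LightGBM's default, so that leaf size is constrained only by min_sum_hessian_in_leaf.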
[4]:
X_train_lgbm, X_test_lgbm = (
    frame.with_columns(
        pl.col(c).to_physical() for c in categorical_columns
    ).to_arrow()
    for frame in (X_train, X_test)
)
LGBM_PARAMS = {"verbose": -1, "min_data_in_leaf": 0, **PARAMS}
lgb_train = lgb.Dataset(
    X_train_lgbm,
    label=y_train,
    init_score=np.log(exposure_train),
    categorical_feature=categorical_columns,
    free_raw_data=False,
)
lgbm_booster = lgb.train(
    LGBM_PARAMS, lgb_train, num_boost_round=PARAMS["num_iterations"]
)
lgbm_scores = lgbm_booster.predict(X_test_lgbm, raw_score=True)
lgbm_pred = np.exp(lgbm_scores + np.log(exposure_test))
print(f"lgbm deviance: {mean_poisson_deviance(y_test, lgbm_pred):.4f}")
lgbm deviance: 0.2991
Benchmark fitting time
[5]:
def fit_rgbm():
    ds = rgbm.Dataset(X_train, y_train, offsets=np.log(exposure_train))
    rgbm.Booster(**PARAMS).fit(ds)


def fit_lgbm():
    ds = lgb.Dataset(
        X_train_lgbm,
        label=y_train,
        init_score=np.log(exposure_train),
        categorical_feature=categorical_columns,
        free_raw_data=False,
    )
    lgb.train(
        {"verbose": -1, "min_data_in_leaf": 0, **PARAMS},
        ds,
        num_boost_round=PARAMS["num_iterations"],
    )


%timeit -r 3 -n 1 fit_rgbm()
%timeit -r 3 -n 1 fit_lgbm()
9.9 s ± 62.8 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
9.46 s ± 45 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
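The `%timeit` magic is only available in IPython and Jupyter. Outside a notebook, a roughly analogous measurement (3 runs, 1 call each) can be sketched with the standard-library `timeit` module. The `benchmark` helper and the `placeholder_fit` workload are illustrative assumptions; in practice one would pass `fit_rgbm` or `fit_lgbm` instead.

```python
import statistics
import timeit


def benchmark(fn, repeat=3, number=1):
    """Return (mean, population std. dev.) of seconds per call over `repeat` runs."""
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_loop = [t / number for t in totals]
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


def placeholder_fit():
    # Stand-in workload so the sketch runs on its own; replace with a real fit.
    sum(i * i for i in range(100_000))


mean_s, std_s = benchmark(placeholder_fit)
print(f"{mean_s:.4f} s ± {std_s:.4f} s per loop (3 runs, 1 loop each)")
```

With only three runs the spread is a rough indication at best, so small differences between the two libraries should not be over-interpreted.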
Summary
rgbm matches LightGBM’s quality on this dataset (Poisson deviance 0.3012 vs. 0.2991, within ~1%) at comparable fit time (9.9 s vs. 9.46 s).
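As a quick arithmetic check of these figures, using only the deviances and timings printed in the cell outputs above:

```python
# Values copied from the printed outputs of this notebook run.
rgbm_deviance, lgbm_deviance = 0.3012, 0.2991
rgbm_seconds, lgbm_seconds = 9.9, 9.46

# Relative gap in deviance, measured against LightGBM.
deviance_gap = (rgbm_deviance - lgbm_deviance) / lgbm_deviance

# Ratio of mean fit times (rgbm relative to LightGBM).
time_ratio = rgbm_seconds / lgbm_seconds

print(f"relative deviance gap: {deviance_gap:.2%}")
print(f"fit-time ratio: {time_ratio:.3f}")
```

These numbers come from a single train/test split and three timing runs, so they describe this run rather than a general ranking of the two libraries.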