{ "cells": [ { "cell_type": "markdown", "id": "f7f9de29", "metadata": {}, "source": [ "## French motor third-party liability claims prediction\n", "\n", "The French motor third-party liability dataset contains ~680k policies with\n", "their claim count and amount over an exposure period. A standard way to model\n", "claim counts is Poisson regression with the log of the exposure as an offset:\n", "\n", "$$\\log(\\mathbb{E}[\\text{ClaimNb}]) = f(\\text{features}) + \\log(\\text{Exposure})$$\n", "\n", "This example trains `rgbm` with `objective=\"poisson\"` and the offset, then\n", "compares it to LightGBM with the same setup." ] }, { "cell_type": "code", "execution_count": 1, "id": "ddac112e", "metadata": { "execution": { "iopub.execute_input": "2026-05-10T04:23:05.132030Z", "iopub.status.busy": "2026-05-10T04:23:05.131944Z", "iopub.status.idle": "2026-05-10T04:23:06.566304Z", "shell.execute_reply": "2026-05-10T04:23:06.565819Z" } }, "outputs": [], "source": [ "import lightgbm as lgb\n", "import numpy as np\n", "import polars as pl\n", "import rgbm\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.metrics import mean_poisson_deviance\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "id": "e006d125", "metadata": {}, "source": [ "### Load the data\n", "\n", "Fetch the data from OpenML and hold out 10% of the policies as a test set." 
] }, { "cell_type": "code", "execution_count": 2, "id": "c36362a4", "metadata": { "execution": { "iopub.execute_input": "2026-05-10T04:23:06.568109Z", "iopub.status.busy": "2026-05-10T04:23:06.567944Z", "iopub.status.idle": "2026-05-10T04:23:07.114489Z", "shell.execute_reply": "2026-05-10T04:23:07.113998Z" } }, "outputs": [], "source": [ "raw = fetch_openml(data_id=41214, as_frame=True, parser=\"auto\").frame\n", "df = pl.from_pandas(raw)\n", "\n", "categorical_columns = [\"Area\", \"VehBrand\", \"VehGas\", \"Region\"]\n", "numerical_columns = [\"VehPower\", \"VehAge\", \"DrivAge\", \"BonusMalus\", \"Density\"]\n", "df = df.with_columns(\n", "    pl.col(\"ClaimNb\").cast(pl.Float64),\n", "    pl.col(\"Exposure\").cast(pl.Float64),\n", "    *[pl.col(c).cast(pl.Categorical) for c in categorical_columns],\n", "    *[pl.col(c).cast(pl.Float64) for c in numerical_columns],\n", ")\n", "\n", "train, test = train_test_split(df, test_size=0.1, random_state=0)\n", "y_train = train[\"ClaimNb\"].to_numpy()\n", "y_test = test[\"ClaimNb\"].to_numpy()\n", "exposure_train = train[\"Exposure\"].to_numpy()\n", "exposure_test = test[\"Exposure\"].to_numpy()\n", "\n", "feature_columns = categorical_columns + numerical_columns\n", "X_train = train.select(feature_columns)\n", "X_test = test.select(feature_columns)" ] }, { "cell_type": "markdown", "id": "1f36322e", "metadata": {}, "source": [ "### Fit `rgbm` gradient boosting model" ] }, { "cell_type": "code", "execution_count": 3, "id": "02dd4d9c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Deviance: 0.3012\n" ] } ], "source": [ "PARAMS = dict(\n", "    objective=\"poisson\",\n", "    num_iterations=1000,\n", "    learning_rate=0.1,\n", "    max_depth=6,\n", "    max_leaves=31,\n", "    lambda_l2=1.0,\n", "    min_sum_hessian_in_leaf=10,\n", ")\n", "\n", "dataset = rgbm.Dataset(X_train, y_train, offsets=np.log(exposure_train))\n", "booster = rgbm.Booster(**PARAMS)\n", "booster.fit(dataset)\n", "\n", "rgbm_pred = 
booster.predict(X_test, offsets=np.log(exposure_test))\n", "print(f\"Deviance: {mean_poisson_deviance(y_test, rgbm_pred):.4f}\")" ] }, { "cell_type": "markdown", "id": "e342fb9b", "metadata": {}, "source": [ "### Compare to LightGBM\n", "\n", "LightGBM is trained with the same `PARAMS`; the log-exposure offset is passed as\n", "`init_score`. Because `predict` does not apply `init_score`, the offset is added\n", "to the raw scores before exponentiating." ] }, { "cell_type": "code", "execution_count": 4, "id": "b6da49cb", "metadata": { "execution": { "iopub.execute_input": "2026-05-10T04:23:09.209266Z", "iopub.status.busy": "2026-05-10T04:23:09.209150Z", "iopub.status.idle": "2026-05-10T04:23:11.323379Z", "shell.execute_reply": "2026-05-10T04:23:11.323008Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lgbm deviance: 0.2991\n" ] } ], "source": [ "X_train_lgbm, X_test_lgbm = (\n", "    df.with_columns(pl.col(c).to_physical() for c in categorical_columns).to_arrow()\n", "    for df in (X_train, X_test)\n", ")\n", "\n", "LGBM_PARAMS = {\"verbose\": -1, \"min_data_in_leaf\": 0, **PARAMS}\n", "lgb_train = lgb.Dataset(\n", "    X_train_lgbm,\n", "    label=y_train,\n", "    init_score=np.log(exposure_train),\n", "    categorical_feature=categorical_columns,\n", "    free_raw_data=False,\n", ")\n", "lgbm_booster = lgb.train(\n", "    LGBM_PARAMS, lgb_train, num_boost_round=PARAMS[\"num_iterations\"]\n", ")\n", "lgbm_scores = lgbm_booster.predict(X_test_lgbm, raw_score=True)\n", "lgbm_pred = np.exp(lgbm_scores + np.log(exposure_test))\n", "print(f\"lgbm deviance: {mean_poisson_deviance(y_test, lgbm_pred):.4f}\")" ] }, { "cell_type": "markdown", "id": "dcf3c40d", "metadata": {}, "source": [ "### Benchmark fitting time" ] }, { "cell_type": "code", "execution_count": 5, "id": "a9a8b082", "metadata": { "execution": { "iopub.execute_input": "2026-05-10T04:23:11.325372Z", "iopub.status.busy": "2026-05-10T04:23:11.325253Z", "iopub.status.idle": "2026-05-10T04:23:20.548484Z", "shell.execute_reply": "2026-05-10T04:23:20.547843Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "9.9 s ± 62.8 ms per loop (mean ± std. dev. 
of 3 runs, 1 loop each)\n", "9.46 s ± 45 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)\n" ] } ], "source": [ "def fit_rgbm():\n", "    ds = rgbm.Dataset(X_train, y_train, offsets=np.log(exposure_train))\n", "    rgbm.Booster(**PARAMS).fit(ds)\n", "\n", "\n", "def fit_lgbm():\n", "    ds = lgb.Dataset(\n", "        X_train_lgbm,\n", "        label=y_train,\n", "        init_score=np.log(exposure_train),\n", "        categorical_feature=categorical_columns,\n", "        free_raw_data=False,\n", "    )\n", "    lgb.train(\n", "        {\"verbose\": -1, \"min_data_in_leaf\": 0, **PARAMS},\n", "        ds,\n", "        num_boost_round=PARAMS[\"num_iterations\"],\n", "    )\n", "\n", "\n", "%timeit -r 3 -n 1 fit_rgbm()\n", "%timeit -r 3 -n 1 fit_lgbm()" ] }, { "cell_type": "markdown", "id": "e6a7e045", "metadata": {}, "source": [ "### Summary\n", "\n", "On this dataset `rgbm` comes within ~1% of LightGBM's mean Poisson deviance\n", "(0.3012 vs. 0.2991) at a comparable fit time (9.9 s vs. 9.46 s per fit)." ] }, { "cell_type": "markdown", "id": "2999ab70", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "rgbm", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.12" } }, "nbformat": 4, "nbformat_minor": 5 }