{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f7f9de29",
   "metadata": {},
   "source": [
    "## French motor third-party liability claims prediction\n",
    "\n",
    "The French motor third-party liability dataset contains ~680k policies with\n",
    "their claim count and amount over an exposure period. A standard way to model\n",
    "claim counts is Poisson regression where exposure enters as an offset:\n",
    "\n",
    "$$\\log(\\mathbb{E}[\\text{ClaimNb}]) = f(\\text{features}) + \\log(\\text{Exposure})$$\n",
    "\n",
    "This example trains `rgbm` with `objective=\"poisson\"` and the offset, then\n",
    "compares to LightGBM with the same setup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ddac112e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-10T04:23:05.132030Z",
     "iopub.status.busy": "2026-05-10T04:23:05.131944Z",
     "iopub.status.idle": "2026-05-10T04:23:06.566304Z",
     "shell.execute_reply": "2026-05-10T04:23:06.565819Z"
    }
   },
   "outputs": [],
   "source": [
    "import lightgbm as lgb\n",
    "import numpy as np\n",
    "import polars as pl\n",
    "import rgbm\n",
    "from sklearn.datasets import fetch_openml\n",
    "from sklearn.metrics import mean_poisson_deviance\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e006d125",
   "metadata": {},
   "source": [
    "### Load the data\n",
    "\n",
    "Fetch data from OpenML. We split 90% / 10% train / test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c36362a4",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-10T04:23:06.568109Z",
     "iopub.status.busy": "2026-05-10T04:23:06.567944Z",
     "iopub.status.idle": "2026-05-10T04:23:07.114489Z",
     "shell.execute_reply": "2026-05-10T04:23:07.113998Z"
    }
   },
   "outputs": [],
   "source": [
    "raw = fetch_openml(data_id=41214, as_frame=True, parser=\"auto\").frame\n",
    "df = pl.from_pandas(raw)\n",
    "\n",
    "categorical_columns = [\"Area\", \"VehBrand\", \"VehGas\", \"Region\"]\n",
    "numerical_columns = [\"VehPower\", \"VehAge\", \"DrivAge\", \"BonusMalus\", \"Density\"]\n",
    "df = df.with_columns(\n",
    "    pl.col(\"ClaimNb\").cast(pl.Float64),\n",
    "    pl.col(\"Exposure\").cast(pl.Float64),\n",
    "    *[pl.col(c).cast(pl.Categorical) for c in categorical_columns],\n",
    "    *[pl.col(c).cast(pl.Float64) for c in numerical_columns],\n",
    ")\n",
    "\n",
    "train, test = train_test_split(df, test_size=0.1, random_state=0)\n",
    "y_train = train[\"ClaimNb\"].to_numpy()\n",
    "y_test = test[\"ClaimNb\"].to_numpy()\n",
    "exposure_train = train[\"Exposure\"].to_numpy()\n",
    "exposure_test = test[\"Exposure\"].to_numpy()\n",
    "\n",
    "feature_columns = categorical_columns + numerical_columns\n",
    "X_train = train.select(feature_columns)\n",
    "X_test = test.select(feature_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f36322e",
   "metadata": {},
   "source": [
    "### Fit `rgbm` gradient boosting model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "02dd4d9c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Deviance: 0.3012\n"
     ]
    }
   ],
   "source": [
    "PARAMS = dict(\n",
    "    objective=\"poisson\",\n",
    "    num_iterations=1000,\n",
    "    learning_rate=0.1,\n",
    "    max_depth=6,\n",
    "    max_leaves=31,\n",
    "    lambda_l2=1.0,\n",
    "    min_sum_hessian_in_leaf=10,\n",
    ")\n",
    "\n",
    "dataset = rgbm.Dataset(X_train, y_train, offsets=np.log(exposure_train))\n",
    "booster = rgbm.Booster(**PARAMS)\n",
    "booster.fit(dataset)\n",
    "\n",
    "rgbm_pred = booster.predict(X_test, offsets=np.log(exposure_test))\n",
    "print(f\"Deviance: {mean_poisson_deviance(y_test, rgbm_pred):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e342fb9b",
   "metadata": {},
   "source": [
    "### Compare to LightGBM\n",
    "\n",
    "Same setup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b6da49cb",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-10T04:23:09.209266Z",
     "iopub.status.busy": "2026-05-10T04:23:09.209150Z",
     "iopub.status.idle": "2026-05-10T04:23:11.323379Z",
     "shell.execute_reply": "2026-05-10T04:23:11.323008Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "lgbm deviance:  0.2991\n"
     ]
    }
   ],
   "source": [
    "X_train_lgbm, X_test_lgbm = (\n",
    "    df.with_columns(pl.col(c).to_physical() for c in categorical_columns).to_arrow()\n",
    "    for df in (X_train, X_test)\n",
    ")\n",
    "\n",
    "LGBM_PARAMS = {\"verbose\": -1, \"min_data_in_leaf\": 0, **PARAMS}\n",
    "lgb_train = lgb.Dataset(\n",
    "    X_train_lgbm,\n",
    "    label=y_train,\n",
    "    init_score=np.log(exposure_train),\n",
    "    categorical_feature=categorical_columns,\n",
    "    free_raw_data=False,\n",
    ")\n",
    "lgbm_booster = lgb.train(\n",
    "    LGBM_PARAMS, lgb_train, num_boost_round=PARAMS[\"num_iterations\"]\n",
    ")\n",
    "lgbm_scores = lgbm_booster.predict(X_test_lgbm, raw_score=True)\n",
    "lgbm_pred = np.exp(lgbm_scores + np.log(exposure_test))\n",
    "print(f\"lgbm deviance:  {mean_poisson_deviance(y_test, lgbm_pred):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dcf3c40d",
   "metadata": {},
   "source": [
    "### benchmark fitting time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a9a8b082",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-10T04:23:11.325372Z",
     "iopub.status.busy": "2026-05-10T04:23:11.325253Z",
     "iopub.status.idle": "2026-05-10T04:23:20.548484Z",
     "shell.execute_reply": "2026-05-10T04:23:20.547843Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9.9 s ± 62.8 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)\n",
      "9.46 s ± 45 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "def fit_rgbm():\n",
    "    ds = rgbm.Dataset(X_train, y_train, offsets=np.log(exposure_train))\n",
    "    rgbm.Booster(**PARAMS).fit(ds)\n",
    "\n",
    "\n",
    "def fit_lgbm():\n",
    "    ds = lgb.Dataset(\n",
    "        X_train_lgbm,\n",
    "        label=y_train,\n",
    "        init_score=np.log(exposure_train),\n",
    "        categorical_feature=categorical_columns,\n",
    "        free_raw_data=False,\n",
    "    )\n",
    "    lgb.train(\n",
    "        {\"verbose\": -1, \"min_data_in_leaf\": 0, **PARAMS},\n",
    "        ds,\n",
    "        num_boost_round=PARAMS[\"num_iterations\"],\n",
    "    )\n",
    "\n",
    "\n",
    "%timeit -r 3 -n 1 fit_rgbm()\n",
    "%timeit -r 3 -n 1 fit_lgbm()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6a7e045",
   "metadata": {},
   "source": [
    "### Summary\n",
    "\n",
    "`rgbm` matches LightGBM's quality on this dataset (Poisson deviance\n",
    "within ~1%) at comparable fit-time."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "rgbm",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}