rgbm

A lightweight, Rust-native gradient boosting machine.

Installation

Install rgbm with pip:

pip install rgbm

Quick start

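import polars as pl
import rgbm

# Load the training data; the label column is named "y".
df = pl.read_csv("train.csv")
X, y = df.drop("y"), df["y"]

# Bin the features once, then fit a booster on the binned dataset.
dataset = rgbm.Dataset(X, y)
booster = rgbm.Booster(objective="gaussian", num_iterations=100)
booster.fit(dataset)

# Predict on a feature matrix with the same schema as the training data.
predictions = booster.predict(X)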

API Reference

class rgbm.Dataset(x, y, weights=None, offsets=None, max_bin=255, min_data_in_bin=3, n_jobs=-1, seed=0)

A binned representation of a feature matrix and labels for training.

Numerical columns are bucketed into max_bin bins via greedy quantile binning. Categorical (Arrow Dictionary) columns are mapped to bin indices per category.

Parameters:

x (polars.DataFrame, pyarrow.RecordBatch, or pyarrow.Table) – Feature matrix. Each column must be numerical (Float64 / Float32) or Categorical (Arrow Dictionary with string values).
y (polars.Series, pyarrow.Array, or numpy.ndarray) – Float64 labels.
weights (polars.Series, pyarrow.Array, or numpy.ndarray, optional) – Per-row weights (Float64). Defaults to uniform weights.
offsets (polars.Series, pyarrow.Array, or numpy.ndarray, optional) – Per-row baseline (Float64) added to the raw score during fit and predict.
max_bin (int, default 255) – Maximum number of bins per feature, including the missing/sentinel bin. Must satisfy max_bin <= 255.
min_data_in_bin (int, default 3) – Minimum number of rows that must accumulate before opening a new bin during quantile binning of numerical features.
n_jobs (int, default -1) – Number of threads used for binning. -1 uses all logical cores.
seed (int, default 0) – Seed for the row subsample used to determine bin boundaries on large datasets (>200,000 rows).
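To make the interaction between max_bin and min_data_in_bin concrete, here is a minimal pure-Python sketch of one possible greedy quantile binning pass. It is an illustration under stated assumptions, not rgbm's actual implementation: the helper names greedy_bin_upper_bounds and bin_index are hypothetical, the per-bin target of n / (max_bin - 1) rows is assumed, and boundaries are assumed to sit halfway between adjacent distinct values.

```python
from collections import Counter


def greedy_bin_upper_bounds(values, max_bin=255, min_data_in_bin=3):
    """Return upper bounds for bins over non-missing numerical values.

    Hypothetical helper: a bin is closed once it holds at least
    max(min_data_in_bin, n / usable_bins) rows.
    """
    counts = sorted(Counter(values).items())  # (distinct value, row count)
    n = len(values)
    usable_bins = max_bin - 1  # max_bin includes the missing/sentinel bin
    target = max(min_data_in_bin, n / usable_bins)

    bounds = []
    in_bin = 0
    for i, (value, count) in enumerate(counts):
        in_bin += count
        if i == len(counts) - 1:
            break  # the last distinct value always falls in the final bin
        if in_bin >= target and len(bounds) < usable_bins - 1:
            # Assumed convention: boundary halfway to the next distinct value.
            next_value = counts[i + 1][0]
            bounds.append((value + next_value) / 2)
            in_bin = 0
    bounds.append(float("inf"))  # final bin catches everything above
    return bounds


def bin_index(value, bounds):
    """Map a value to the index of the first bin whose upper bound covers it."""
    for idx, upper in enumerate(bounds):
        if value <= upper:
            return idx
    return len(bounds) - 1


values = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
bounds = greedy_bin_upper_bounds(values, max_bin=4, min_data_in_bin=3)
binned = [bin_index(v, bounds) for v in values]
```

In this sketch, raising min_data_in_bin or lowering max_bin both produce fewer, wider bins; duplicated values always land in the same bin because boundaries are only placed between distinct values.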

class rgbm.Booster(objective='gaussian', num_iterations=100, learning_rate=0.1, max_depth=6, min_sum_hessian_in_leaf=0.001, min_gain_to_split=0.0, lambda_l1=0.0, lambda_l2=0.0, max_leaves=31, leaf_wise=True, n_jobs=-1)

Gradient-boosted decision tree model.

Parameters:

objective ({"gaussian", "logistic", "probit", "poisson"}, default "gaussian") – Loss function. gaussian for regression, logistic and probit for binary classification with labels in {0, 1}, poisson for non-negative count regression.
num_iterations (int, default 100) – Number of boosting rounds (trees).
learning_rate (float, default 0.1) – Multiplier applied to each tree’s leaf values before adding to the ensemble.
max_depth (int, default 6) – Maximum tree depth.
max_leaves (int, default 31) – Maximum number of leaves per tree.
min_sum_hessian_in_leaf (float, default 1e-3) – Minimum sum of hessians required for a leaf to be split.
min_gain_to_split (float, default 0.0) – Minimum split gain required for a leaf to be split.
lambda_l1 (float, default 0.0) – L1 regularization on leaf values.
lambda_l2 (float, default 0.0) – L2 regularization on leaf values.
leaf_wise (bool, default True) – If True, grow trees leaf-wise by always splitting the highest-gain leaf first (LightGBM-style). If False, grow level-wise: split the shallowest leaves first, with ties broken by gain (as with XGBoost's grow_policy=depthwise).
n_jobs (int, default -1) – Number of threads used for fitting and prediction. -1 uses all logical cores.
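The two growth policies selected by leaf_wise differ only in which candidate leaf is split next. The sketch below illustrates that ordering rule as described above; the function next_leaf_to_split and the (leaf_id, depth, gain) tuple layout are hypothetical and are not part of the rgbm API.

```python
def next_leaf_to_split(candidates, leaf_wise=True):
    """Pick the next leaf to split from (leaf_id, depth, gain) tuples."""
    if leaf_wise:
        # Leaf-wise: highest gain first, regardless of depth.
        return max(candidates, key=lambda leaf: leaf[2])
    # Level-wise: shallowest leaf first, ties broken by higher gain.
    return min(candidates, key=lambda leaf: (leaf[1], -leaf[2]))


# Hypothetical candidate leaves: (leaf_id, depth, best split gain).
candidates = [
    ("A", 1, 0.4),
    ("B", 2, 0.9),
    ("C", 1, 0.6),
]

leaf_first = next_leaf_to_split(candidates, leaf_wise=True)
level_first = next_leaf_to_split(candidates, leaf_wise=False)
```

With these candidates, the leaf-wise policy picks B (the highest gain, even though it is deeper), while the level-wise policy picks C (the higher-gain leaf among the shallowest ones).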

fit(dataset)

Fit the booster on a Dataset.

Parameters:

dataset (Dataset) – Training dataset, constructed with rgbm.Dataset.

model_to_string()

Serialize the model to a LightGBM-compatible model.txt (v4) string.

The returned string can be loaded back via lightgbm.Booster(model_str=...) for prediction. This is useful for interoperability with downstream tooling that expects LightGBM models.

predict(x, offsets=None)

Predict on a feature matrix.

Parameters:

x (polars.DataFrame, pyarrow.RecordBatch, or pyarrow.Table) – Feature matrix with the same schema as the training data.
offsets (polars.Series, pyarrow.Array, or numpy.ndarray, optional) – Per-row baseline (Float64) added to the raw score before applying the objective’s inverse link function.

Returns:

Per-row predictions. For gaussian, raw scores. For logistic and probit, probabilities in [0, 1]. For poisson, expected counts.

Return type:

numpy.ndarray of float64
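As a rough illustration of how a raw score could map to the returned values for each objective, the sketch below applies the conventional inverse link functions. This is an assumption rather than a statement of rgbm internals: the helper inverse_link is hypothetical, and the sigmoid, standard normal CDF, and exponential are simply the usual choices for logistic, probit, and Poisson models.

```python
import math


def inverse_link(objective, score):
    """Map a raw score (tree sum plus any offset) to the prediction scale."""
    if objective == "gaussian":
        return score  # identity: predictions are raw scores
    if objective == "logistic":
        # Numerically stable sigmoid.
        if score >= 0:
            return 1.0 / (1.0 + math.exp(-score))
        e = math.exp(score)
        return e / (1.0 + e)
    if objective == "probit":
        # Standard normal CDF written with the error function.
        return 0.5 * (1.0 + math.erf(score / math.sqrt(2.0)))
    if objective == "poisson":
        return math.exp(score)  # assumed log link: expected count
    raise ValueError(f"unknown objective: {objective}")


raw_score = 0.0  # hypothetical ensemble output for one row
offset = 0.0     # hypothetical per-row offset
predictions = {
    name: inverse_link(name, raw_score + offset)
    for name in ("gaussian", "logistic", "probit", "poisson")
}
```

Under these assumptions, a combined score of zero corresponds to a probability of 0.5 for logistic and probit and an expected count of 1 for poisson.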

Examples

Poisson regression with exposure
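
A common way to handle exposure in Poisson regression is to pass log(exposure) as the per-row offset, so that the model learns a rate per unit of exposure. The sketch below shows only the arithmetic, assuming a log link; the exposure values and the raw score are made up for illustration and do not come from a fitted rgbm model.

```python
import math

# Hypothetical exposure per row (for example, years observed).
exposure = [0.5, 1.0, 2.0]

# Offsets on the raw-score scale: the log of each row's exposure.
offsets = [math.log(e) for e in exposure]

# Hypothetical ensemble output, identical for every row in this toy case.
raw_score = -1.2
rate = math.exp(raw_score)  # expected count per unit of exposure

# Adding the offset before exponentiating scales the rate by the exposure.
expected_counts = [math.exp(raw_score + o) for o in offsets]
```

Each expected count equals exposure times rate, which is the reason for using log(exposure) as the offset. With rgbm, offsets of this kind would be supplied through the offsets argument of rgbm.Dataset for training and of predict for inference, as described in the API reference.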

Other