rgbm
A lightweight, Rust-native gradient boosting machine.
Installation
Install rgbm with pip:
pip install rgbm
Quick start
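import polars as pl
import rgbm

# Load the training data; the label column is named "y" in this example.
df = pl.read_csv("train.csv")
X, y = df.drop("y"), df["y"]

# Bin the features once, then fit a booster on the binned dataset.
dataset = rgbm.Dataset(X, y)
booster = rgbm.Booster(objective="gaussian", num_iterations=100)
booster.fit(dataset)

# Predict on a feature matrix with the same schema as the training data.
predictions = booster.predict(X)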
API Reference
- class rgbm.Dataset(x, y, weights=None, offsets=None, max_bin=255, min_data_in_bin=3, n_jobs=-1, seed=0)
A binned representation of a feature matrix and labels for training.
Numerical columns are bucketed into max_bin bins via greedy quantile binning. Categorical (Arrow Dictionary) columns are mapped to bin indices per category.
- Parameters:
x (polars.DataFrame, pyarrow.RecordBatch, or pyarrow.Table) – Feature matrix. Each column must be numerical (Float64 / Float32) or Categorical (Arrow Dictionary with string values).
y (polars.Series, pyarrow.Array, or numpy.ndarray) – Float64 labels.
weights (polars.Series, pyarrow.Array, or numpy.ndarray, optional) – Per-row weights (Float64). Defaults to uniform weights.
offsets (polars.Series, pyarrow.Array, or numpy.ndarray, optional) – Per-row baseline (Float64) added to the raw score during fit and predict.
max_bin (int, default 255) – Maximum number of bins per feature, including the missing/sentinel bin. Must satisfy max_bin <= 255.
min_data_in_bin (int, default 3) – Minimum number of rows that must accumulate before opening a new bin during quantile binning of numerical features.
n_jobs (int, default -1) – Number of threads used for binning. -1 uses all logical cores.
seed (int, default 0) – Seed for the row subsample used to determine bin boundaries on large datasets (>200,000 rows).
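To make the roles of max_bin and min_data_in_bin concrete, here is a minimal pure-Python sketch of greedy quantile binning for one numerical column. It is an illustration under stated assumptions, not rgbm's actual implementation: the helper names greedy_bin_edges and to_bin_index are hypothetical, missing values are not handled, and one bin is assumed to be reserved for the missing/sentinel bin.

```python
from bisect import bisect_left


def greedy_bin_edges(values, max_bin=255, min_data_in_bin=3):
    """Return sorted cut points for one numerical column.

    Hypothetical helper: walks the sorted values and closes a bin once it
    holds roughly an equal share of the rows and at least min_data_in_bin rows.
    """
    usable_bins = max_bin - 1  # assumption: one bin is kept for missing/sentinel
    ordered = sorted(values)
    n = len(ordered)
    # Rows per bin: the equal share ceil(n / usable_bins), never below min_data_in_bin.
    target = max(min_data_in_bin, -(-n // usable_bins))

    edges = []
    count_in_bin = 0
    for i, v in enumerate(ordered):
        count_in_bin += 1
        is_last = i == n - 1
        # Only cut between distinct values so that equal values share a bin.
        if not is_last and v != ordered[i + 1] and count_in_bin >= target:
            if len(edges) < usable_bins - 1:
                edges.append((v + ordered[i + 1]) / 2.0)
                count_in_bin = 0
    return edges


def to_bin_index(value, edges):
    """Map a value to its bin: the number of cut points strictly below it."""
    return bisect_left(edges, value)


column = [0.1, 0.2, 0.2, 0.4, 0.5, 0.9, 1.3, 1.3, 2.0, 7.5]
edges = greedy_bin_edges(column, max_bin=4, min_data_in_bin=3)
bins = [to_bin_index(v, edges) for v in column]
print(edges, bins)
```

With max_bin=4 the sketch leaves three bins for observed values, so the ten rows above end up in bins 0, 1, and 2. Raising min_data_in_bin makes each bin wider and can produce fewer bins than max_bin allows.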
- class rgbm.Booster(objective='gaussian', num_iterations=100, learning_rate=0.1, max_depth=6, min_sum_hessian_in_leaf=0.001, min_gain_to_split=0.0, lambda_l1=0.0, lambda_l2=0.0, max_leaves=31, leaf_wise=True, n_jobs=-1)
Gradient-boosted decision tree model.
- Parameters:
objective ({"gaussian", "logistic", "probit", "poisson"}, default "gaussian") – Loss function. gaussian for regression, logistic and probit for binary classification with labels in {0, 1}, poisson for non-negative count regression.
num_iterations (int, default 100) – Number of boosting rounds (trees).
learning_rate (float, default 0.1) – Multiplier applied to each tree’s leaf values before adding to the ensemble.
max_depth (int, default 6) – Maximum tree depth.
max_leaves (int, default 31) – Maximum number of leaves per tree.
min_sum_hessian_in_leaf (float, default 1e-3) – Minimum sum of hessians required for a leaf to be split.
min_gain_to_split (float, default 0.0) – Minimum split gain required for a leaf to be split.
lambda_l1 (float, default 0.0) – L1 regularization on leaf values.
lambda_l2 (float, default 0.0) – L2 regularization on leaf values.
leaf_wise (bool, default True) – If True, grow trees by splitting the highest-gain leaf first (LightGBM-style). If False, grow level-wise: split the shallowest leaves first, with ties broken by gain (as in XGBoost's grow_policy=depthwise).
n_jobs (int, default -1) – Number of threads used for fitting and prediction. -1 uses all logical cores.
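The difference between the two leaf_wise settings can be sketched with a priority queue over candidate leaves. This is a toy illustration, not rgbm's internals: TOY_TREE and split_order are hypothetical names, the gains are made-up numbers, and max_depth is assumed to bound the depth of the deepest leaf.

```python
import heapq

# Toy tree: leaf name -> (depth, best split gain, names of the two children).
# Hypothetical numbers; a real learner computes gains from gradient statistics.
TOY_TREE = {
    "root": (0, 10.0, ("A", "B")),
    "A": (1, 2.0, ("A1", "A2")),
    "B": (1, 8.0, ("B1", "B2")),
    "B1": (2, 5.0, ("B1a", "B1b")),
    "B2": (2, 1.0, ("B2a", "B2b")),
    "A1": (2, 0.5, ("A1a", "A1b")),
}


def split_order(tree, leaf_wise=True, max_leaves=31, max_depth=6):
    """Return leaf names in the order they would be split."""

    def priority(name):
        depth, gain, _ = tree[name]
        # heapq pops the smallest key, so gains are negated.
        # Leaf-wise: highest gain first. Level-wise: shallowest first, then gain.
        return (-gain,) if leaf_wise else (depth, -gain)

    heap = [(priority("root"), "root")]
    order = []
    num_leaves = 1
    while heap and num_leaves < max_leaves:
        _, name = heapq.heappop(heap)
        depth, _, children = tree[name]
        order.append(name)
        num_leaves += 1  # splitting replaces one leaf with two
        for child in children:
            # A child is a candidate only if it has a known split and
            # splitting it would not exceed max_depth.
            if child in tree and depth + 1 < max_depth:
                heapq.heappush(heap, (priority(child), child))
    return order


print(split_order(TOY_TREE, leaf_wise=True, max_leaves=4))
print(split_order(TOY_TREE, leaf_wise=False, max_leaves=4))
```

With a budget of four leaves, the leaf-wise order follows the high-gain branch downward (root, B, B1), while the level-wise order finishes depth 1 before going deeper (root, B, A).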
- fit(dataset)
Fit the booster on a Dataset.
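How learning_rate, lambda_l1, lambda_l2, and min_gain_to_split typically enter a fit can be sketched with the conventional second-order formulas used by gradient boosting libraries such as XGBoost and LightGBM. That rgbm uses exactly these formulas is an assumption, and all function names below are hypothetical.

```python
def soft_threshold(g, lambda_l1):
    """Shrink the gradient sum toward zero by lambda_l1 (L1 penalty)."""
    if g > lambda_l1:
        return g - lambda_l1
    if g < -lambda_l1:
        return g + lambda_l1
    return 0.0


def leaf_value(sum_grad, sum_hess, lambda_l1=0.0, lambda_l2=0.0, learning_rate=0.1):
    """Regularized Newton step for one leaf, scaled by the learning rate."""
    return -learning_rate * soft_threshold(sum_grad, lambda_l1) / (sum_hess + lambda_l2)


def leaf_score(sum_grad, sum_hess, lambda_l1=0.0, lambda_l2=0.0):
    """Loss reduction of a leaf at its optimal value, up to a constant factor."""
    return soft_threshold(sum_grad, lambda_l1) ** 2 / (sum_hess + lambda_l2)


def split_gain(left, right, lambda_l1=0.0, lambda_l2=0.0):
    """Gain of splitting a parent into children with (grad, hess) sums."""
    parent = (left[0] + right[0], left[1] + right[1])
    return (
        leaf_score(*left, lambda_l1, lambda_l2)
        + leaf_score(*right, lambda_l1, lambda_l2)
        - leaf_score(*parent, lambda_l1, lambda_l2)
    )


def is_worth_splitting(left, right, min_gain_to_split=0.0, **reg):
    """Accept the split only if its gain exceeds min_gain_to_split."""
    return split_gain(left, right, **reg) > min_gain_to_split


left, right = (-4.0, 2.0), (3.0, 2.0)  # made-up (sum of gradients, sum of hessians)
gain = split_gain(left, right)                     # 8.0 + 4.5 - 0.25 = 12.25
gain_l2 = split_gain(left, right, lambda_l2=2.0)   # smaller: L2 damps every term
value = leaf_value(*left)                          # roughly 0.2 with learning_rate=0.1
```

In this sketch, increasing lambda_l2 shrinks leaf values and gains smoothly, lambda_l1 zeroes out leaves whose gradient sum is small, and min_gain_to_split rejects splits whose gain does not clear the threshold.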
- model_to_string()
Serialize the model to a LightGBM-compatible model.txt (v4) string.
The returned string can be loaded back via lightgbm.Booster(model_str=...) for prediction. Useful for interoperability with downstream tooling that expects LightGBM models.
- predict(x, offsets=None)
Predict on a feature matrix.
- Parameters:
x (polars.DataFrame, pyarrow.RecordBatch, or pyarrow.Table) – Feature matrix with the same schema as the training data.
offsets (polars.Series, pyarrow.Array, or numpy.ndarray, optional) – Per-row baseline (Float64) added to the raw score before applying the objective’s link function.
- Returns:
Per-row predictions. For gaussian, raw scores. For logistic and probit, probabilities in [0, 1]. For poisson, expected counts.
- Return type:
numpy.ndarray of float64
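The mapping from a raw score to the returned prediction can be sketched per objective as follows. The transforms are the standard ones for each model family (identity, sigmoid, standard normal CDF, exponential); that rgbm applies exactly these, including a log link for poisson, is an assumption. The names inverse_link and predict_row are hypothetical.

```python
import math


def inverse_link(objective, raw_score):
    """Map a raw additive score to the prediction scale for one objective."""
    if objective == "gaussian":
        return raw_score  # identity: the raw score is the prediction
    if objective == "logistic":
        return 1.0 / (1.0 + math.exp(-raw_score))  # sigmoid
    if objective == "probit":
        # Standard normal CDF expressed through the error function.
        return 0.5 * (1.0 + math.erf(raw_score / math.sqrt(2.0)))
    if objective == "poisson":
        return math.exp(raw_score)  # assumed log link: expected count
    raise ValueError(f"unknown objective: {objective!r}")


def predict_row(objective, tree_score, offset=0.0):
    """Add the per-row offset to the raw score, then transform it."""
    return inverse_link(objective, tree_score + offset)


# A Poisson offset of log(exposure) scales the expected count by the exposure.
base = predict_row("poisson", 0.3)
doubled = predict_row("poisson", 0.3, offset=math.log(2.0))
print(base, doubled)
```

Under these assumptions, an offset acts additively for gaussian, shifts the log-odds for logistic, and acts multiplicatively on the expected count for poisson, which is why log-exposure is a common choice of offset for count models.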
Examples