
openboost

Core module - main entry point for OpenBoost.

Quick Reference

import openboost as ob

# Check version and backend
print(ob.__version__)
print(ob.get_backend())  # "cuda" or "cpu"

# Data binning
X_binned = ob.array(X, n_bins=256)

# Models
model = ob.GradientBoosting(n_trees=100)
model = ob.NaturalBoostNormal(n_trees=100)
model = ob.OpenBoostGAM(n_rounds=500)

Data Layer

array

array(
    X, n_bins=256, *, categorical_features=None, device=None
)

Convert input data to binned format for tree building.

This is the primary entry point for data. Binning is done once, then the binned data can be used for training many models.

Missing values (NaN) are automatically detected and encoded as bin 255. The model learns the optimal direction for missing values at each split.

Categorical features use native category encoding instead of quantile binning, enabling the model to learn optimal category groupings.
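For intuition, here is a minimal NumPy sketch of quantile binning with the NaN encoding described above. It assumes only what this page states (bin 255 as the NaN sentinel) and is not OpenBoost's internal implementation:

```python
import numpy as np

MISSING_BIN = 255  # NaN sentinel, per the docs above

def quantile_bin(col, n_bins=256):
    """Bin one numeric column into uint8 codes; NaN maps to MISSING_BIN."""
    finite = col[~np.isnan(col)]
    # Interior quantiles give at most n_bins - 2 distinct edges
    edges = np.unique(np.quantile(finite, np.linspace(0, 1, n_bins)[1:-1]))
    codes = np.searchsorted(edges, col, side="right").astype(np.uint8)
    codes[np.isnan(col)] = MISSING_BIN
    return codes, edges

codes, edges = quantile_bin(np.array([1.0, 2.0, np.nan, 4.0]))
```

Keeping the edges around is what makes train/test binning consistent; that is the role of `bin_edges` on the returned BinnedArray.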

PARAMETER DESCRIPTION
X

Input features, shape (n_samples, n_features). Accepts NumPy arrays, PyTorch tensors, JAX arrays, and CuPy arrays. NaN values are handled automatically.

TYPE: ArrayLike

n_bins

Maximum number of bins for numeric features. At most 255 bins hold data values, since bin 255 is reserved for NaN.

TYPE: int DEFAULT: 256

categorical_features

List of column indices that are categorical. These use category encoding instead of quantile binning. Max 254 unique categories per feature (255 reserved for NaN).

TYPE: Sequence[int] | None DEFAULT: None

device

Target device ("cuda" or "cpu"). Auto-detected if None.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
BinnedArray

BinnedArray with binned data in feature-major layout (n_features, n_samples). NaN values are encoded as MISSING_BIN (255).

Example

import openboost as ob
import numpy as np

# Numeric features with missing values
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
X_binned = ob.array(X)
print(X_binned.has_missing)  # [True, True]

# Mixed numeric and categorical
X = np.array([[25, 0, 50000], [30, 1, 60000], [35, 2, 70000]])
X_binned = ob.array(X, categorical_features=[1])  # Feature 1 is categorical
print(X_binned.is_categorical)  # [False, True, False]

BinnedArray dataclass

BinnedArray(
    data,
    bin_edges,
    n_features,
    n_samples,
    device,
    has_missing=np.array([], dtype=np.bool_),
    is_categorical=np.array([], dtype=np.bool_),
    category_maps=[],
    n_categories=np.array([], dtype=np.int32),
)

Binned feature matrix ready for tree building.

ATTRIBUTE DESCRIPTION
data

Binned data, shape (n_features, n_samples), dtype uint8. NaN values are encoded as bin 255 (MISSING_BIN).

TYPE: NDArray[uint8]

bin_edges

List of bin edges per feature, for inverse transform

TYPE: list[NDArray[float64]]

n_features

Number of features

TYPE: int

n_samples

Number of samples

TYPE: int

device

"cuda" or "cpu"

TYPE: str

has_missing

Boolean array (n_features,) indicating which features have NaN

TYPE: NDArray[bool_]

is_categorical

Boolean array (n_features,) indicating categorical features

TYPE: NDArray[bool_]

category_maps

List of dicts mapping original values -> bin indices (None for numeric)

TYPE: list[dict | None]

n_categories

Number of categories per feature (0 for numeric)

TYPE: NDArray[int32]

any_missing property

any_missing

Check if any feature has missing values.

any_categorical property

any_categorical

Check if any feature is categorical.

transform

transform(X)

Transform new data using the bin edges from this BinnedArray.

Use this method to transform test/validation data using the same binning learned from training data. This ensures tree splits work correctly across train and test sets.
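Under the hood, transforming new data amounts to reusing the stored edges. An illustrative NumPy sketch (apply_edges is a hypothetical helper, not part of the API):

```python
import numpy as np

MISSING_BIN = 255  # NaN sentinel, per the docs above

def apply_edges(col_new, edges):
    """Bin a new column with edges learned from training data."""
    codes = np.searchsorted(edges, col_new, side="right").astype(np.uint8)
    codes[np.isnan(col_new)] = MISSING_BIN
    return codes

train_edges = np.array([1.5, 2.5])  # edges learned on training data
codes = apply_edges(np.array([1.0, 2.0, 3.0, np.nan]), train_edges)
# -> bins [0, 1, 2], NaN -> 255
```

Because the edges come from training data, a test value falls in the same bin a training value of the same magnitude would, so tree split thresholds remain meaningful.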

PARAMETER DESCRIPTION
X

New input features, shape (n_samples_new, n_features). Must have the same number of features as the training data.

TYPE: ArrayLike

RETURNS DESCRIPTION
BinnedArray

BinnedArray with new data binned using training bin edges.

Example

X_train_binned = ob.array(X_train)
model.fit(X_train_binned, y_train)
X_test_binned = X_train_binned.transform(X_test)
predictions = model.predict(X_test_binned)

Backend Control

get_backend

get_backend()

Get the current compute backend.

RETURNS DESCRIPTION
Literal['cuda', 'cpu']

"cuda" if NVIDIA GPU is available, "cpu" otherwise.

set_backend

set_backend(backend)

Force a specific backend.

Thread-safe: uses a lock to prevent concurrent modification.

PARAMETER DESCRIPTION
backend

"cuda" or "cpu"

TYPE: Literal['cuda', 'cpu']

RAISES DESCRIPTION
ValueError

If backend is not "cuda" or "cpu"

RuntimeError

If CUDA is requested but not available

is_cuda

is_cuda()

Check if using CUDA backend.

Low-Level Tree Building

fit_tree

fit_tree(
    X,
    grad,
    hess,
    *,
    max_depth=6,
    min_child_weight=1.0,
    reg_lambda=1.0,
    reg_alpha=0.0,
    min_gain=0.0,
    gamma=None,
    growth="levelwise",
    max_leaves=None,
    subsample=1.0,
    colsample_bytree=1.0,
)

Fit a single gradient boosting tree.

This is the core function of OpenBoost. It builds a tree using the specified growth strategy and returns a TreeStructure that can be used for prediction.

Phase 8: uses composable growth strategies from _growth.py.
Phase 11: added reg_alpha, subsample, and colsample_bytree.
Phase 14: handles missing values automatically via BinnedArray.has_missing.
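Since fit_tree takes raw gradient and hessian vectors, any twice-differentiable loss works. For reference, the standard per-sample derivatives for two common losses in plain NumPy (textbook gradient-boosting math, not OpenBoost internals):

```python
import numpy as np

def mse_grad_hess(pred, y):
    # L = (pred - y)^2  =>  grad = 2(pred - y), hess = 2
    grad = (2.0 * (pred - y)).astype(np.float32)
    hess = np.full_like(pred, 2.0, dtype=np.float32)
    return grad, hess

def logloss_grad_hess(pred, y):
    # Binary log loss on raw scores, y in {0, 1}:
    # grad = sigmoid(pred) - y, hess = sigmoid(pred) * (1 - sigmoid(pred))
    p = 1.0 / (1.0 + np.exp(-pred))
    return (p - y).astype(np.float32), (p * (1.0 - p)).astype(np.float32)
```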

PARAMETER DESCRIPTION
X

Binned feature data (a BinnedArray from ob.array(), or a raw binned array). Missing values (NaN in the original data) are encoded as bin 255.

TYPE: BinnedArray | NDArray

grad

Gradient vector, shape (n_samples,), float32

TYPE: NDArray

hess

Hessian vector, shape (n_samples,), float32

TYPE: NDArray

max_depth

Maximum tree depth

TYPE: int DEFAULT: 6

min_child_weight

Minimum sum of hessian in a leaf

TYPE: float DEFAULT: 1.0

reg_lambda

L2 regularization on leaf values

TYPE: float DEFAULT: 1.0

reg_alpha

L1 regularization on leaf values (Phase 11)

TYPE: float DEFAULT: 0.0

min_gain

Minimum gain to make a split

TYPE: float DEFAULT: 0.0

gamma

Alias for min_gain (XGBoost compatibility)

TYPE: float | None DEFAULT: None

growth

Growth strategy - "levelwise", "leafwise", "symmetric", or a GrowthStrategy instance

TYPE: str | GrowthStrategy DEFAULT: 'levelwise'

max_leaves

Maximum leaves (for leafwise growth)

TYPE: int | None DEFAULT: None

subsample

Row sampling ratio (0.0-1.0), 1.0 = no sampling (Phase 11)

TYPE: float DEFAULT: 1.0

colsample_bytree

Column sampling ratio (0.0-1.0), 1.0 = no sampling (Phase 11)

TYPE: float DEFAULT: 1.0

RETURNS DESCRIPTION
TreeStructure

TreeStructure that can predict via tree.predict(X) or tree(X)

Example

import openboost as ob
import numpy as np

# Missing values handled automatically
X_train = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
y = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # example regression targets
X_binned = ob.array(X_train)
pred = np.zeros(3, dtype=np.float32)

for _ in range(100):
    grad = 2 * (pred - y)  # MSE gradient
    hess = np.ones_like(grad) * 2
    tree = ob.fit_tree(X_binned, grad, hess)
    pred = pred + 0.1 * tree.predict(X_binned)

# Use leaf-wise growth (LightGBM style)
tree = ob.fit_tree(X_binned, grad, hess, growth="leafwise", max_leaves=32)

# Use symmetric growth (CatBoost style)
tree = ob.fit_tree(X_binned, grad, hess, growth="symmetric")

# Stochastic gradient boosting (Phase 11)
tree = ob.fit_tree(X_binned, grad, hess, subsample=0.8, colsample_bytree=0.8)

predict_tree

predict_tree(tree, X)

Predict using a fitted tree.

PARAMETER DESCRIPTION
tree

Fitted Tree object

TYPE: Tree

X

BinnedArray or binned data (n_features, n_samples)

TYPE: BinnedArray | NDArray

RETURNS DESCRIPTION
predictions

Shape (n_samples,), float32

TYPE: NDArray

predict_ensemble

predict_ensemble(
    trees, X, learning_rate=1.0, init_score=0.0
)

Predict using an ensemble of trees.
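Conceptually this is just init_score plus a learning-rate-scaled sum of per-tree outputs. A NumPy sketch of the equivalent computation (predict_ensemble_sketch is an illustrative name, not the library function):

```python
import numpy as np

def predict_ensemble_sketch(tree_preds, learning_rate=1.0, init_score=0.0):
    """tree_preds: list of per-tree prediction arrays, each shape (n_samples,)."""
    out = np.full_like(tree_preds[0], init_score, dtype=np.float32)
    for p in tree_preds:
        out += learning_rate * p  # each tree contributes a scaled correction
    return out

preds = predict_ensemble_sketch(
    [np.array([1.0, -1.0]), np.array([0.5, 0.5])],
    learning_rate=0.1,
    init_score=0.5,
)
# -> approximately [0.65, 0.45]
```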

PARAMETER DESCRIPTION
trees

List of fitted Tree objects

TYPE: list[Tree]

X

BinnedArray or binned data

TYPE: BinnedArray | NDArray

learning_rate

Learning rate to apply to each tree

TYPE: float DEFAULT: 1.0

init_score

Initial prediction value

TYPE: float DEFAULT: 0.0

RETURNS DESCRIPTION
predictions

Shape (n_samples,)

TYPE: NDArray