openboost¶
Core module - main entry point for OpenBoost.
Quick Reference¶
```python
import openboost as ob

# Check version and backend
print(ob.__version__)
print(ob.get_backend())  # "cuda" or "cpu"

# Data binning (max 255 bins; bin 255 is reserved for NaN)
X_binned = ob.array(X, n_bins=255)

# Models
model = ob.GradientBoosting(n_trees=100)
model = ob.NaturalBoostNormal(n_trees=100)
model = ob.OpenBoostGAM(n_rounds=500)
```
Data Layer¶
array ¶
Convert input data to binned format for tree building.
This is the primary entry point for data. Binning is done once, then the binned data can be used for training many models.
Missing values (NaN) are automatically detected and encoded as bin 255. The model learns the optimal direction for missing values at each split.
Categorical features use native category encoding instead of quantile binning, enabling the model to learn optimal category groupings.
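The quantile-binning-with-missing scheme described above can be sketched in plain NumPy. This is an illustrative toy, not OpenBoost's actual implementation; only the MISSING_BIN = 255 convention is taken from the documentation.

```python
import numpy as np

MISSING_BIN = 255  # per the doc: NaN is encoded as bin 255

def quantile_bin(x, n_bins=255):
    """Quantile-bin one numeric feature; NaN maps to MISSING_BIN (toy sketch)."""
    finite = x[~np.isnan(x)]
    # interior quantiles of the finite values become the bin edges
    edges = np.quantile(finite, np.linspace(0, 1, n_bins + 1)[1:-1])
    binned = np.searchsorted(edges, x, side="right").astype(np.uint8)
    binned[np.isnan(x)] = MISSING_BIN
    return binned, edges

x = np.array([1.0, 2.0, np.nan, 4.0])
binned, edges = quantile_bin(x)
print(binned[2])  # 255 (the NaN sample)
```

Because bin 255 is reserved for NaN, at most 255 regular bins remain, which matches the `n_bins` limit stated below.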
| PARAMETER | DESCRIPTION |
|---|---|
| `X` | Input features, shape (n_samples, n_features). Accepts NumPy arrays, PyTorch tensors, JAX arrays, and CuPy arrays. NaN values are handled automatically. |
| `n_bins` | Maximum number of bins for numeric features (max 255). |
| `categorical_features` | List of column indices that are categorical. These use category encoding instead of quantile binning. Max 254 unique categories per feature (255 is reserved for NaN). |
| `device` | Target device ("cuda" or "cpu"). Auto-detected if None. |
| RETURNS | DESCRIPTION |
|---|---|
| `BinnedArray` | BinnedArray with binned data in feature-major layout (n_features, n_samples). NaN values are encoded as MISSING_BIN (255). |
Example

```python
import openboost as ob
import numpy as np

# Numeric features with missing values
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
X_binned = ob.array(X)
print(X_binned.has_missing)  # [True, True]

# Mixed numeric and categorical
X = np.array([[25, 0, 50000], [30, 1, 60000], [35, 2, 70000]])
X_binned = ob.array(X, categorical_features=[1])  # Feature 1 is categorical
print(X_binned.is_categorical)  # [False, True, False]
```
BinnedArray dataclass ¶

```python
BinnedArray(
    data,
    bin_edges,
    n_features,
    n_samples,
    device,
    has_missing=np.array([], dtype=bool),
    is_categorical=np.array([], dtype=bool),
    category_maps=[],
    n_categories=np.array([], dtype=np.int32),
)
```
Binned feature matrix ready for tree building.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `data` | Binned data, shape (n_features, n_samples), dtype uint8. NaN values are encoded as bin 255 (MISSING_BIN). |
| `bin_edges` | List of bin edges per feature, for inverse transform. |
| `n_features` | Number of features. |
| `n_samples` | Number of samples. |
| `device` | "cuda" or "cpu". |
| `has_missing` | Boolean array (n_features,) indicating which features have NaN. |
| `is_categorical` | Boolean array (n_features,) indicating categorical features. |
| `category_maps` | List of dicts mapping original values -> bin indices (None for numeric features). |
| `n_categories` | Number of categories per feature (0 for numeric features). |
transform ¶
Transform new data using the bin edges from this BinnedArray.
Use this method to transform test/validation data using the same binning learned from training data. This ensures tree splits work correctly across train and test sets.
| PARAMETER | DESCRIPTION |
|---|---|
| `X` | New input features, shape (n_samples_new, n_features). Must have the same number of features as the training data. |
| RETURNS | DESCRIPTION |
|---|---|
| `BinnedArray` | BinnedArray with the new data binned using the training bin edges. |
Example

```python
X_train_binned = ob.array(X_train)
model.fit(X_train_binned, y_train)
X_test_binned = X_train_binned.transform(X_test)
predictions = model.predict(X_test_binned)
```
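Why reusing the training edges matters can be seen with a plain-NumPy sketch (illustrative only, not the library's code): test values are placed into the bins learned at train time, so a tree split like "bin <= 1" means the same thing for both sets.

```python
import numpy as np

# Train-time: learn edges from the training data only
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
edges = np.quantile(x_train, [0.25, 0.5, 0.75])  # 3 edges -> 4 bins

def apply_bins(x, edges):
    # each value falls into the bin learned at train time;
    # out-of-range values land in the outermost bins
    return np.searchsorted(edges, x, side="right")

x_test = np.array([0.5, 2.5, 10.0])
print(apply_bins(x_test, edges))  # [0 1 3]
```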
Backend Control¶
get_backend ¶
Get the current compute backend.
| RETURNS | DESCRIPTION |
|---|---|
| `Literal['cuda', 'cpu']` | "cuda" if an NVIDIA GPU is available, "cpu" otherwise. |
set_backend ¶
Force a specific backend.
Thread-safe: uses a lock to prevent concurrent modification.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | "cuda" or "cpu". |
| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If backend is not "cuda" or "cpu". |
| `RuntimeError` | If CUDA is requested but not available. |
Low-Level Tree Building¶
fit_tree ¶
```python
fit_tree(
    X,
    grad,
    hess,
    *,
    max_depth=6,
    min_child_weight=1.0,
    reg_lambda=1.0,
    reg_alpha=0.0,
    min_gain=0.0,
    gamma=None,
    growth="levelwise",
    max_leaves=None,
    subsample=1.0,
    colsample_bytree=1.0,
)
```
Fit a single gradient boosting tree.
This is the core function of OpenBoost. It builds a tree using the specified growth strategy and returns a TreeStructure that can be used for prediction.
- Phase 8: uses composable growth strategies from `_growth.py`.
- Phase 11: added `reg_alpha`, `subsample`, and `colsample_bytree`.
- Phase 14: handles missing values automatically via `BinnedArray.has_missing`.
| PARAMETER | DESCRIPTION |
|---|---|
| `X` | Binned feature data (BinnedArray from ob.array(), or a raw binned array). Missing values (NaN in the original data) are encoded as bin 255. |
| `grad` | Gradient vector, shape (n_samples,), float32. |
| `hess` | Hessian vector, shape (n_samples,), float32. |
| `max_depth` | Maximum tree depth. |
| `min_child_weight` | Minimum sum of hessian in a leaf. |
| `reg_lambda` | L2 regularization on leaf values. |
| `reg_alpha` | L1 regularization on leaf values (Phase 11). |
| `min_gain` | Minimum gain required to make a split. |
| `gamma` | Alias for min_gain (XGBoost compatibility). |
| `growth` | Growth strategy: "levelwise", "leafwise", "symmetric", or a GrowthStrategy instance. |
| `max_leaves` | Maximum number of leaves (for leafwise growth). |
| `subsample` | Row sampling ratio (0.0-1.0); 1.0 = no sampling (Phase 11). |
| `colsample_bytree` | Column sampling ratio (0.0-1.0); 1.0 = no sampling (Phase 11). |
| RETURNS | DESCRIPTION |
|---|---|
| `TreeStructure` | TreeStructure that can predict via tree.predict(X) or tree(X). |
Example

```python
import openboost as ob
import numpy as np

# Missing values handled automatically
X_train = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
y = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # example targets
X_binned = ob.array(X_train)
pred = np.zeros(3, dtype=np.float32)

for _ in range(100):
    grad = 2 * (pred - y)  # MSE gradient
    hess = np.ones_like(grad) * 2
    tree = ob.fit_tree(X_binned, grad, hess)
    pred = pred + 0.1 * tree.predict(X_binned)

# Use leaf-wise growth (LightGBM style)
tree = ob.fit_tree(X_binned, grad, hess, growth="leafwise", max_leaves=32)

# Use symmetric growth (CatBoost style)
tree = ob.fit_tree(X_binned, grad, hess, growth="symmetric")

# Stochastic gradient boosting (Phase 11)
tree = ob.fit_tree(X_binned, grad, hess, subsample=0.8, colsample_bytree=0.8)
```
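The training loop above is standard second-order gradient boosting. As a self-contained illustration of what a single `fit_tree` round does internally, here is a toy depth-1 version in plain NumPy, using the same Newton leaf value -G/(H + reg_lambda) and gain formula as XGBoost-style boosters. This is a sketch for intuition, not OpenBoost's histogram-based implementation.

```python
import numpy as np

def fit_stump(x, grad, hess, reg_lambda=1.0):
    """Best single split on one feature; Newton leaf values -G/(H + lambda)."""
    order = np.argsort(x)
    g, h = grad[order], hess[order]
    G, H = g.sum(), h.sum()
    gl, hl = np.cumsum(g)[:-1], np.cumsum(h)[:-1]
    # split gain: left score + right score - unsplit score
    gain = gl**2 / (hl + reg_lambda) + (G - gl)**2 / (H - hl + reg_lambda) \
        - G**2 / (H + reg_lambda)
    i = int(np.argmax(gain))
    thr = x[order][i]
    left = -gl[i] / (hl[i] + reg_lambda)
    right = -(G - gl[i]) / (H - hl[i] + reg_lambda)
    return thr, left, right

def predict_stump(x, stump):
    thr, left, right = stump
    return np.where(x <= thr, left, right)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = (x > 0.5).astype(np.float32)  # toy step-function target
pred = np.zeros_like(y)
for _ in range(50):
    grad = 2 * (pred - y)            # MSE gradient
    hess = np.full_like(y, 2.0)      # MSE hessian
    pred += 0.1 * predict_stump(x, fit_stump(x, grad, hess))
print(np.mean((pred - y)**2))  # small training error
```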
predict_tree ¶
Predict using a fitted tree.
| PARAMETER | DESCRIPTION |
|---|---|
| `tree` | Fitted Tree object. |
| `X` | BinnedArray or binned data, shape (n_features, n_samples). |

| RETURNS | DESCRIPTION |
|---|---|
| `predictions` | Shape (n_samples,), float32. |
predict_ensemble ¶
Predict using an ensemble of trees.
| PARAMETER | DESCRIPTION |
|---|---|
| `trees` | List of fitted Tree objects. |
| `X` | BinnedArray or binned data. |
| `learning_rate` | Learning rate applied to each tree. |
| `init_score` | Initial prediction value. |

| RETURNS | DESCRIPTION |
|---|---|
| `predictions` | Shape (n_samples,). |