openboost¶
Core module - main entry point for OpenBoost.
Quick Reference¶
```python
import openboost as ob

# Check version and backend
print(ob.__version__)
print(ob.get_backend())  # "cuda" or "cpu"

# Data binning (max 255 bins; bin 255 is reserved for NaN)
X_binned = ob.array(X, n_bins=255)

# Models
model = ob.GradientBoosting(n_trees=100)
model = ob.NaturalBoostNormal(n_trees=100)
model = ob.OpenBoostGAM(n_rounds=500)
```
Data Layer¶
array ¶
Convert input data to binned format for tree building.
This is the primary entry point for data. Binning is done once, then the binned data can be used for training many models.
Missing values (NaN) are automatically detected and encoded as bin 255. The model learns the optimal direction for missing values at each split.
Categorical features use native category encoding instead of quantile binning, enabling the model to learn optimal category groupings.
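The quantile-binning-with-missing scheme described above can be sketched in plain NumPy. This is an illustrative toy, not OpenBoost's actual implementation; only the MISSING_BIN = 255 convention is taken from the documentation.

```python
import numpy as np

MISSING_BIN = 255  # per the doc: NaN is encoded as bin 255

def quantile_bin(x, n_bins=255):
    """Quantile-bin one numeric feature; NaN maps to MISSING_BIN (toy sketch)."""
    finite = x[~np.isnan(x)]
    # interior quantiles of the finite values become the bin edges
    edges = np.quantile(finite, np.linspace(0, 1, n_bins + 1)[1:-1])
    binned = np.searchsorted(edges, x, side="right").astype(np.uint8)
    binned[np.isnan(x)] = MISSING_BIN
    return binned, edges

x = np.array([1.0, 2.0, np.nan, 4.0])
binned, edges = quantile_bin(x)
print(binned[2])  # 255 (the NaN sample)
```

Because bin 255 is reserved for NaN, at most 255 regular bins remain, which matches the `n_bins` limit stated below.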
| PARAMETER | DESCRIPTION |
|---|---|
| `X` | Input features, shape (n_samples, n_features). Accepts NumPy arrays, PyTorch tensors, JAX arrays, and CuPy arrays. NaN values are handled automatically. |
| `n_bins` | Maximum number of bins for numeric features (max 255). |
| `categorical_features` | List of column indices that are categorical. These use category encoding instead of quantile binning. Max 254 unique categories per feature (255 is reserved for NaN). |
| `device` | Target device ("cuda" or "cpu"). Auto-detected if None. |
| RETURNS | DESCRIPTION |
|---|---|
| `BinnedArray` | BinnedArray with binned data in feature-major layout (n_features, n_samples). NaN values are encoded as MISSING_BIN (255). |
Example

```python
import openboost as ob
import numpy as np

# Numeric features with missing values
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
X_binned = ob.array(X)
print(X_binned.has_missing)  # [True, True]

# Mixed numeric and categorical
X = np.array([[25, 0, 50000], [30, 1, 60000], [35, 2, 70000]])
X_binned = ob.array(X, categorical_features=[1])  # Feature 1 is categorical
print(X_binned.is_categorical)  # [False, True, False]
```
BinnedArray dataclass ¶

```python
BinnedArray(
    data,
    bin_edges,
    n_features,
    n_samples,
    device,
    has_missing=np.array([], dtype=bool),
    is_categorical=np.array([], dtype=bool),
    category_maps=[],
    n_categories=np.array([], dtype=np.int32),
)
```
Binned feature matrix ready for tree building.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `data` | Binned data, shape (n_features, n_samples), dtype uint8. NaN values are encoded as bin 255 (MISSING_BIN). |
| `bin_edges` | List of bin edges per feature, for inverse transform. |
| `n_features` | Number of features. |
| `n_samples` | Number of samples. |
| `device` | "cuda" or "cpu". |
| `has_missing` | Boolean array (n_features,) indicating which features have NaN. |
| `is_categorical` | Boolean array (n_features,) indicating categorical features. |
| `category_maps` | List of dicts mapping original values -> bin indices (None for numeric features). |
| `n_categories` | Number of categories per feature (0 for numeric features). |
transform ¶
Transform new data using the bin edges from this BinnedArray.
Use this method to transform test/validation data using the same binning learned from training data. This ensures tree splits work correctly across train and test sets.
| PARAMETER | DESCRIPTION |
|---|---|
| `X` | New input features, shape (n_samples_new, n_features). Must have the same number of features as the training data. |
| RETURNS | DESCRIPTION |
|---|---|
| `BinnedArray` | BinnedArray with the new data binned using the training bin edges. |
Example

```python
X_train_binned = ob.array(X_train)
model.fit(X_train_binned, y_train)
X_test_binned = X_train_binned.transform(X_test)
predictions = model.predict(X_test_binned)
```
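Why reusing the training edges matters can be seen with a plain-NumPy sketch (illustrative only, not the library's code): test values are placed into the bins learned at train time, so a tree split like "bin <= 1" means the same thing for both sets.

```python
import numpy as np

# Train-time: learn edges from the training data only
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
edges = np.quantile(x_train, [0.25, 0.5, 0.75])  # 3 edges -> 4 bins

def apply_bins(x, edges):
    # each value falls into the bin learned at train time;
    # out-of-range values land in the outermost bins
    return np.searchsorted(edges, x, side="right")

x_test = np.array([0.5, 2.5, 10.0])
print(apply_bins(x_test, edges))  # [0 1 3]
```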
Backend Control¶
get_backend ¶
Get the current compute backend.
| RETURNS | DESCRIPTION |
|---|---|
| `Literal['cuda', 'cpu']` | "cuda" if an NVIDIA GPU is available, "cpu" otherwise. |
set_backend ¶
Force a specific backend.
Thread-safe: uses a lock to prevent concurrent modification.
| PARAMETER | DESCRIPTION |
|---|---|
| `backend` | "cuda" or "cpu". |
| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If backend is not "cuda" or "cpu". |
| `RuntimeError` | If CUDA is requested but not available. |
Low-Level Tree Building¶
fit_tree ¶
```python
fit_tree(
    X,
    grad,
    hess,
    *,
    max_depth=6,
    min_child_weight=1.0,
    reg_lambda=1.0,
    reg_alpha=0.0,
    min_gain=0.0,
    gamma=None,
    growth="levelwise",
    max_leaves=None,
    subsample=1.0,
    colsample_bytree=1.0,
)
```
Fit a single gradient boosting tree.
This is the core function of OpenBoost. It builds a tree using the specified growth strategy and returns a TreeStructure that can be used for prediction.
- Phase 8: uses composable growth strategies from `_growth.py`.
- Phase 11: added `reg_alpha`, `subsample`, and `colsample_bytree`.
- Phase 14: handles missing values automatically via `BinnedArray.has_missing`.
| PARAMETER | DESCRIPTION |
|---|---|
| `X` | Binned feature data (BinnedArray from ob.array(), or a raw binned array). Missing values (NaN in the original data) are encoded as bin 255. |
| `grad` | Gradient vector, shape (n_samples,), float32. |
| `hess` | Hessian vector, shape (n_samples,), float32. |
| `max_depth` | Maximum tree depth. |
| `min_child_weight` | Minimum sum of hessian in a leaf. |
| `reg_lambda` | L2 regularization on leaf values. |
| `reg_alpha` | L1 regularization on leaf values (Phase 11). |
| `min_gain` | Minimum gain required to make a split. |
| `gamma` | Alias for min_gain (XGBoost compatibility). |
| `growth` | Growth strategy: "levelwise", "leafwise", "symmetric", or a GrowthStrategy instance. |
| `max_leaves` | Maximum number of leaves (for leafwise growth). |
| `subsample` | Row sampling ratio (0.0-1.0); 1.0 = no sampling (Phase 11). |
| `colsample_bytree` | Column sampling ratio (0.0-1.0); 1.0 = no sampling (Phase 11). |
| RETURNS | DESCRIPTION |
|---|---|
| `TreeStructure` | TreeStructure that can predict via tree.predict(X) or tree(X). |
Example

```python
import openboost as ob
import numpy as np

# Missing values handled automatically
X_train = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
y = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # example targets
X_binned = ob.array(X_train)
pred = np.zeros(3, dtype=np.float32)

for _ in range(100):
    grad = 2 * (pred - y)  # MSE gradient
    hess = np.ones_like(grad) * 2
    tree = ob.fit_tree(X_binned, grad, hess)
    pred = pred + 0.1 * tree.predict(X_binned)

# Use leaf-wise growth (LightGBM style)
tree = ob.fit_tree(X_binned, grad, hess, growth="leafwise", max_leaves=32)

# Use symmetric growth (CatBoost style)
tree = ob.fit_tree(X_binned, grad, hess, growth="symmetric")

# Stochastic gradient boosting (Phase 11)
tree = ob.fit_tree(X_binned, grad, hess, subsample=0.8, colsample_bytree=0.8)
```
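The training loop above is standard second-order gradient boosting. As a self-contained illustration of what a single `fit_tree` round does internally, here is a toy depth-1 version in plain NumPy, using the same Newton leaf value -G/(H + reg_lambda) and gain formula as XGBoost-style boosters. This is a sketch for intuition, not OpenBoost's histogram-based implementation.

```python
import numpy as np

def fit_stump(x, grad, hess, reg_lambda=1.0):
    """Best single split on one feature; Newton leaf values -G/(H + lambda)."""
    order = np.argsort(x)
    g, h = grad[order], hess[order]
    G, H = g.sum(), h.sum()
    gl, hl = np.cumsum(g)[:-1], np.cumsum(h)[:-1]
    # split gain: left score + right score - unsplit score
    gain = gl**2 / (hl + reg_lambda) + (G - gl)**2 / (H - hl + reg_lambda) \
        - G**2 / (H + reg_lambda)
    i = int(np.argmax(gain))
    thr = x[order][i]
    left = -gl[i] / (hl[i] + reg_lambda)
    right = -(G - gl[i]) / (H - hl[i] + reg_lambda)
    return thr, left, right

def predict_stump(x, stump):
    thr, left, right = stump
    return np.where(x <= thr, left, right)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = (x > 0.5).astype(np.float32)  # toy step-function target
pred = np.zeros_like(y)
for _ in range(50):
    grad = 2 * (pred - y)            # MSE gradient
    hess = np.full_like(y, 2.0)      # MSE hessian
    pred += 0.1 * predict_stump(x, fit_stump(x, grad, hess))
print(np.mean((pred - y)**2))  # small training error
```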
predict_tree ¶
Predict using a fitted tree.
| PARAMETER | DESCRIPTION |
|---|---|
| `tree` | Fitted Tree object. |
| `X` | BinnedArray or binned data, shape (n_features, n_samples). |

| RETURNS | DESCRIPTION |
|---|---|
| `predictions` | Shape (n_samples,), float32. |
predict_ensemble ¶
Predict using an ensemble of trees.
| PARAMETER | DESCRIPTION |
|---|---|
| `trees` | List of fitted Tree objects. |
| `X` | BinnedArray or binned data. |
| `learning_rate` | Learning rate applied to each tree. |
| `init_score` | Initial prediction value. |

| RETURNS | DESCRIPTION |
|---|---|
| `predictions` | Shape (n_samples,). |