Uncertainty Quantification with NaturalBoost¶

NaturalBoost predicts full probability distributions instead of just point estimates, giving you uncertainty bounds on your predictions.

Why Uncertainty Matters¶

Traditional gradient boosting gives you a single number: "the price will be $100". But in reality, you might want to know:

How confident is the model?
What's the range of likely values?
What's the probability of exceeding a threshold?

NaturalBoost answers these questions by predicting distribution parameters (e.g., mean and variance for a Normal distribution).

Basic Usage¶

import numpy as np
import openboost as ob

# Generate heteroscedastic data (variance depends on X)
np.random.seed(42)
X = np.random.randn(1000, 10).astype(np.float32)
noise_std = 0.5 + np.abs(X[:, 0])  # Noise increases with |X[:, 0]|
y = (X[:, 0] * 2 + X[:, 1] + noise_std * np.random.randn(1000)).astype(np.float32)

X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

# Train NaturalBoost with Normal distribution
model = ob.NaturalBoostNormal(
    n_trees=100,
    max_depth=4,
    learning_rate=0.1,
)
model.fit(X_train, y_train)

# Point prediction (mean)
mean = model.predict(X_test)

# Prediction interval (90% confidence)
lower, upper = model.predict_interval(X_test, alpha=0.1)

print(f"Coverage: {np.mean((y_test >= lower) & (y_test <= upper)):.1%}")
# Should be close to 90%

Distribution Output¶

For full control, use predict_distribution():

# Get full distribution output
output = model.predict_distribution(X_test)

# Access distribution parameters
print(f"Mean (loc):  {output.params['loc'][:5]}")
print(f"Std (scale): {output.params['scale'][:5]}")

# Convenience methods
mean = output.mean()
std = output.std()
variance = output.variance()

# Prediction intervals
lower, upper = output.interval(alpha=0.1)  # 90% interval
lower, upper = output.interval(alpha=0.05)  # 95% interval

# Negative log-likelihood (useful for evaluation)
nll = output.nll(y_test)
print(f"Mean NLL: {np.mean(nll):.4f}")

Monte Carlo Sampling¶

Sample from the predicted distribution for downstream analysis:

# Draw samples from predicted distributions
samples = model.sample(X_test, n_samples=1000)  # Shape: (1000, n_test)

# Use samples for risk analysis
threshold = 10.0
prob_exceed = np.mean(samples > threshold, axis=0)  # P(Y > 10) for each sample

# Quantile estimation
q90 = np.percentile(samples, 90, axis=0)  # 90th percentile

Available Distributions¶

Normal (Gaussian)¶

Best for: General regression with symmetric errors.

model = ob.NaturalBoostNormal(n_trees=100)
model.fit(X_train, y_train)

Log-Normal¶

Best for: Positive data with right skew (prices, incomes, durations).

model = ob.NaturalBoostLogNormal(n_trees=100)
model.fit(X_train, y_train)  # y must be positive

# Predictions are in original scale
mean = model.predict(X_test)  # E[Y], not E[log(Y)]

Gamma¶

Best for: Positive continuous data (insurance claims, rainfall).

model = ob.NaturalBoostGamma(n_trees=100)
model.fit(X_train, y_train)  # y must be positive

Poisson¶

Best for: Count data (number of events, visitors, transactions).

model = ob.NaturalBoostPoisson(n_trees=100)
model.fit(X_train, y_train)  # y: non-negative integers

Student-t¶

Best for: Data with heavy tails and outliers.

model = ob.NaturalBoostStudentT(n_trees=100)
model.fit(X_train, y_train)

Tweedie¶

Best for: Zero-inflated positive data (insurance claims, Kaggle competitions).

model = ob.NaturalBoostTweedie(n_trees=100)
model.fit(X_train, y_train)  # y: non-negative, many zeros allowed

# Default power=1.5 (compound Poisson-Gamma)
# Use power closer to 1 for more zeros
# Use power closer to 2 for Gamma-like behavior

Negative Binomial¶

Best for: Overdispersed count data (sales forecasting, Rossmann competition).

model = ob.NaturalBoostNegBin(n_trees=100)
model.fit(X_train, y_train)  # y: non-negative integers

# Better than Poisson when variance > mean

Comparing Distributions¶

To choose the right distribution, compare negative log-likelihood:

import openboost as ob

distributions = [
    ('Normal', ob.NaturalBoostNormal),
    ('LogNormal', ob.NaturalBoostLogNormal),
    ('Gamma', ob.NaturalBoostGamma),
]

results = []
for name, ModelClass in distributions:
    model = ModelClass(n_trees=100, max_depth=4)
    model.fit(X_train, y_train)
    nll = model.score(X_test, y_test)  # Mean NLL
    results.append((name, nll))

# Lower NLL is better
for name, nll in sorted(results, key=lambda x: x[1]):
    print(f"{name}: NLL = {nll:.4f}")

Custom Distributions¶

Define your own distribution:

import openboost as ob
import numpy as np

# Method 1: Use create_custom_distribution (with autodiff)
MyDist = ob.create_custom_distribution(
    name='MyDist',
    param_names=['loc', 'scale'],
    nll_fn=lambda params, y: (
        0.5 * np.log(2 * np.pi * params['scale']**2) +
        0.5 * ((y - params['loc']) / params['scale'])**2
    ),
    mean_fn=lambda params: params['loc'],
    param_transforms={'scale': 'softplus'},  # Ensure scale > 0
)

# Use with NaturalBoost
model = ob.NaturalBoost(distribution=MyDist(), n_trees=100)
model.fit(X_train, y_train)

Calibration¶

Check if your prediction intervals are well-calibrated:

def check_calibration(model, X_test, y_test, alphas=[0.1, 0.2, 0.3, 0.4, 0.5]):
    """Check if prediction intervals have correct coverage."""
    print("Alpha | Expected | Observed")
    print("-" * 30)

    for alpha in alphas:
        lower, upper = model.predict_interval(X_test, alpha=alpha)
        coverage = np.mean((y_test >= lower) & (y_test <= upper))
        expected = 1 - alpha
        print(f"{alpha:.2f}  | {expected:.0%}       | {coverage:.1%}")

check_calibration(model, X_test, y_test)

Best Practices¶

1. Choose the Right Distribution¶

Data Type	Recommended Distribution
General continuous	Normal
Positive values	LogNormal, Gamma
Count data	Poisson, Negative Binomial
Heavy tails	Student-t
Zero-inflated positive	Tweedie
Overdispersed counts	Negative Binomial

2. Use Appropriate Tree Depth¶

Shallower trees (max_depth=3-4) often work better for uncertainty estimation than deep trees.

3. Train Longer¶

NaturalBoost learns two (or more) parameters per sample, so it may need more trees than standard GBDT.

model = ob.NaturalBoostNormal(
    n_trees=500,      # More trees
    max_depth=4,      # Shallower trees
    learning_rate=0.05,  # Lower LR for stability
)

4. Evaluate with NLL¶

Use negative log-likelihood for model comparison, not just RMSE:

nll = model.score(X_test, y_test)  # Lower is better

Example: Insurance Claims Prediction¶

import openboost as ob

# Tweedie distribution for insurance claims (many zeros, heavy tail)
model = ob.NaturalBoostTweedie(
    n_trees=300,
    max_depth=4,
    learning_rate=0.05,
)
model.fit(X_train, claims)

# Predict expected claim amount
expected_claim = model.predict(X_new)

# Probability of large claim (risk assessment)
output = model.predict_distribution(X_new)
samples = model.sample(X_new, n_samples=10000)
prob_large_claim = np.mean(samples > 10000, axis=0)

print(f"Expected claim: ${expected_claim[0]:.2f}")
print(f"P(claim > $10k): {prob_large_claim[0]:.1%}")

Next Steps¶

Custom Loss Functions - Define your own objectives
Migration from XGBoost - Switching from XGBoost