Why Use Bayesian Optimization for Hyperparameter Tuning?

Mar 29, 2025
Bayesian Optimization for Hyperparameter Tuning

Estimated reading time - 6 minutes

 

Introduction

 

Hyperparameter tuning is one of the most important parts of building Machine Learning models. In my experience, the difference in accuracy between a baseline model and one with optimized hyperparameters can reach 30-40% in relative terms.

Why does this happen? The main reason is that for a relatively large hyperparameter search space, e.g. 5-6 hyperparameters for Gradient Boosting, there can be many local minima, and the differences between them can be noticeable.

If you randomly choose a set of hyperparameters, or even perform a Grid Search, there is a high probability that you have left other local minima unexplored. And those minima can easily yield a lower cross-validation error than the one you found with a fixed or Grid Search approach.

On the other hand, every model training run can be costly, so a fully brute-force approach that explores tens of thousands of hyperparameter combinations is often infeasible. This is exactly where Bayesian Optimization comes into play.

 

What is Bayesian Optimization?

Bayesian Optimization is an approach to finding the minimum of a black-box function using probabilistic models to select the next sampling point based on the results of the previous samples.

Here are the main steps Bayesian Optimization takes (the gif below shows the process dynamically):

 

Step 1: Set the Objective function. 

Imagine we want to optimize the regularization coefficient, alpha, in a Lasso model. Our objective function is then the Mean Squared Error on a validation set or averaged across K-fold cross-validation.
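As a minimal sketch of such an objective function, assuming scikit-learn and a synthetic regression dataset (the dataset and CV settings here are illustrative, not from the article):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Toy regression data standing in for a real problem.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

def objective(alpha: float) -> float:
    """Mean Squared Error of a Lasso model, averaged over 5 CV folds."""
    scores = cross_val_score(
        Lasso(alpha=alpha, max_iter=10_000), X, y,
        scoring="neg_mean_squared_error", cv=5,
    )
    return -scores.mean()  # flip the sign: lower is better

print(objective(0.1))  # CV error for one candidate alpha
```

This is the black-box function Bayesian Optimization will minimize: it takes an alpha and returns a validation error.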

 

Step 2: Perform the initial sampling of several points

We begin by selecting a few alphas at random and calculating validation errors for each.
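A sketch of this initial sampling step, with a simple stand-in curve playing the role of the CV error from Step 1 (the curve and bounds are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)

def objective(alpha):
    # Hypothetical stand-in for the validation-error curve.
    return (np.log10(alpha) + 1.0) ** 2 + 0.5

# Draw 5 alphas at random on a log scale between 1e-4 and 10.
alphas = 10 ** rng.uniform(-4, 1, size=5)
errors = np.array([objective(a) for a in alphas])
print(list(zip(np.round(alphas, 4), np.round(errors, 3))))
```

Sampling on a log scale is a common choice for regularization strengths, since useful alphas often span several orders of magnitude.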

 

Step 3: Build a probabilistic model

Using these samples, we fit a probabilistic model, e.g. a Gaussian Process (see the gif). The model estimates both the most probable error and its uncertainty for each potential alpha, which lets us estimate errors for untested alphas with confidence intervals.
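A minimal sketch of this step, assuming scikit-learn's GaussianProcessRegressor as the probabilistic model; the observed (log10-alpha, error) pairs below are illustrative:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy samples: (log10(alpha), observed CV error) pairs from Step 2.
X_obs = np.array([[-3.0], [-1.5], [0.0], [1.0]])
y_obs = np.array([4.1, 1.2, 0.9, 2.5])

# Note: GP's `alpha` is a noise jitter, unrelated to Lasso's alpha.
gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=1.0), alpha=1e-6, normalize_y=True
).fit(X_obs, y_obs)

# Predicted mean and uncertainty for a grid of untested alphas.
grid = np.linspace(-4, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(grid, return_std=True)
print(mu.shape, sigma.shape)
```

The `mu` and `sigma` arrays are exactly the "most probable error" and "uncertainty" the step describes, evaluated at every candidate alpha.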

 

Step 4. Define an Acquisition Function. 

The probabilistic predictions are then used in an acquisition function. In the example above, we use the Lower Confidence Bound (LCB).

LCB(x) = μ(x) − κ ⋅ σ(x)

↳ μ(x): Predicted mean (expected function value)

↳ σ(x): Standard deviation (model uncertainty)

↳ κ (kappa): Trade-off parameter controlling exploration vs. exploitation.

The next hyperparameter value chosen is the one that minimizes the LCB (see the gif ↑↑↑). By choosing the point with the lowest LCB, we explore uncertain areas while focusing on promising regions, optimizing the search for the best solution. As we collect more samples, the acquisition function approximates the true error curve more closely, allowing Bayesian Optimization to find a better minimum.
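The LCB formula above can be sketched directly in a few lines; here `mu` and `sigma` would come from the fitted Gaussian Process, but are replaced by illustrative stand-in arrays:

```python
import numpy as np

grid = np.linspace(-4, 1, 200)            # candidate log10(alpha) values
mu = (grid + 1.0) ** 2                    # stand-in predicted means
sigma = 0.5 * np.abs(np.sin(grid))        # stand-in uncertainties
kappa = 2.0                               # exploration vs. exploitation weight

# LCB(x) = mu(x) - kappa * sigma(x); the next sample minimizes it.
lcb = mu - kappa * sigma
next_log_alpha = grid[np.argmin(lcb)]
print(next_log_alpha)
```

A larger kappa rewards uncertain regions more, pushing the search toward exploration; kappa = 0 reduces the rule to pure exploitation of the predicted mean.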

 

Step 5. Specify the number of steps

Ideally, this procedure would run until the model has converged to the optimum. In practice, however, we most often specify a fixed number of trials. In the example, the model finds the minimum within 15 iterations.

 

Step 6. Select the lowest found error and associated hyperparameters

Whether stopped by the trial budget or by a convergence criterion, the iterations end and the best alpha is selected.
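Putting Steps 1-6 together, here is a minimal, hand-rolled Bayesian Optimization loop over log10(alpha), assuming scikit-learn for the Gaussian Process. The true error curve is a synthetic stand-in with a known minimum at log10(alpha) = -1, not a real training run:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def true_error(log_alpha):                    # Step 1: synthetic objective
    return (log_alpha + 1.0) ** 2 + 0.5

rng = np.random.default_rng(0)
grid = np.linspace(-4, 1, 500).reshape(-1, 1)
X_obs = rng.uniform(-4, 1, size=(3, 1))       # Step 2: initial random samples
y_obs = true_error(X_obs).ravel()

for _ in range(15):                           # Step 5: fixed trial budget
    gp = GaussianProcessRegressor(            # Step 3: probabilistic model
        kernel=RBF(length_scale=1.0), alpha=1e-6, normalize_y=True
    ).fit(X_obs, y_obs)
    mu, sigma = gp.predict(grid, return_std=True)
    lcb = mu - 2.0 * sigma                    # Step 4: LCB acquisition
    x_next = grid[np.argmin(lcb)].reshape(1, 1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, true_error(x_next).ravel())

best = X_obs[np.argmin(y_obs), 0]             # Step 6: best alpha found
print(best, y_obs.min())
```

In a real tuning run, `true_error` would be replaced by the cross-validation objective from Step 1, and each iteration would cost one model training; libraries such as scikit-optimize or Optuna package this loop behind a single call.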

 

Bayesian Optimization vs Grid Search vs Random Search

As we have seen, Bayesian Optimization iteratively optimizes the given objective function. In contrast, the two most common hyperparameter optimization approaches either follow pre-specified values (Grid Search) or randomly sample the search space (Random Search).

The example below compares the three approaches. As we can see, the cumulative value of the objective function for Bayesian Optimization is much lower and converges over time. This means that the quality of the samples Bayesian Optimization draws is higher (which can also be seen in the first subplot).

The example above is built on a very simple convex case. For the larger search spaces we often face in practice, this difference will be even greater. And when every model training run is costly, sample quality becomes even more crucial, which is where Bayesian Optimization comes to the rescue.
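For comparison, here is how the two baselines look for the same Lasso-alpha problem, assuming scikit-learn (the dataset, grid values, and budget are illustrative):

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Grid Search: exhaustively evaluates a fixed list of alphas.
grid = GridSearchCV(
    Lasso(max_iter=10_000),
    {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error", cv=5,
).fit(X, y)

# Random Search: samples alphas log-uniformly, same 15-trial budget
# as the Bayesian Optimization example.
rand = RandomizedSearchCV(
    Lasso(max_iter=10_000),
    {"alpha": loguniform(1e-4, 10)},
    n_iter=15, scoring="neg_mean_squared_error", cv=5, random_state=0,
).fit(X, y)

print(grid.best_params_, rand.best_params_)
```

Neither baseline uses the results of earlier trials to pick the next alpha, which is precisely the information Bayesian Optimization exploits.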

 

Conclusion

Bayesian Optimization provides a powerful and efficient approach to hyperparameter tuning and often outperforms traditional methods like Grid Search and Random Search. This is especially true when dealing with complex search spaces and expensive model evaluations.

In real-world applications, where training runs can be costly and time-consuming, Bayesian Optimization can significantly improve model performance while reducing computational overhead. If you’re working with models that have multiple hyperparameters, incorporating Bayesian Optimization into your workflow can be a strong step towards accurate and robust models.

To stay up to date with my articles and roadmaps on both the technical and career sides of your ML journey, subscribe to my weekly newsletter below!

Join my newsletter for 1 weekly piece with 2 ML guides:


1. Technical ML tutorial or skill learning guide

2. Tips list to grow ML career, LinkedIn, income

Join now!