Abstract
Hyperparameter tuning in machine learning is often performed using naive techniques, such as random search and grid search. However, these methods seldom lead to an optimal set of hyperparameters and are often computationally expensive. The hyperparameter optimization problem is inherently a bilevel optimization task, and several studies have applied bilevel solution methodologies to it. These techniques typically assume a unique set of weights that minimizes the loss on the training set, an assumption that is violated by deep learning architectures. We propose a bilevel solution method for the hyperparameter optimization problem that does not suffer from this drawback. The proposed method is general and can be readily applied to any class of machine learning algorithms with continuous hyperparameters. The idea is to approximate the lower-level optimal value function mapping, which reduces the bilevel problem to a single-level constrained optimization task. The single-level constrained optimization problem is then solved using the augmented Lagrangian method. An extensive computational study on three datasets confirms the efficiency of the proposed method. A comparative study against grid search, random search, the Tree-structured Parzen Estimator, and the Quasi-Monte Carlo sampler shows that the proposed algorithm is multiple times faster and leads to models that generalize better on the testing set.
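The reduction described in the abstract can be sketched for a single ridge-regularization hyperparameter. The sketch below is illustrative only, not the authors' implementation: it stands in for the paper's approximated lower-level optimal value function φ(λ) with ridge regression's closed form, and the names (`L_tr`, `L_va`, `phi`) and the penalty schedule are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic regression data, split into training and validation sets.
X = rng.normal(size=(120, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=120)
Xtr, ytr, Xva, yva = X[:80], y[:80], X[80:], y[80:]

def L_tr(w, lam):  # lower-level objective: regularized training loss
    return np.mean((Xtr @ w - ytr) ** 2) + lam * w @ w

def L_va(w):       # upper-level objective: validation loss
    return np.mean((Xva @ w - yva) ** 2)

def phi(lam):      # lower-level optimal value function (closed form for ridge)
    n = len(ytr)
    w_star = np.linalg.solve(Xtr.T @ Xtr / n + lam * np.eye(5), Xtr.T @ ytr / n)
    return L_tr(w_star, lam)

# Single-level reduction:  min_{w, lam} L_va(w)  s.t.  L_tr(w, lam) - phi(lam) = 0
# (the constraint is >= 0 by definition of phi, so it behaves as an equality).
def aug_lagrangian(z, mu, rho):
    w, lam = z[:5], max(z[5], 1e-8)
    c = L_tr(w, lam) - phi(lam)
    return L_va(w) + mu * c + 0.5 * rho * c ** 2

mu, rho = 0.0, 10.0
z = np.r_[np.zeros(5), 0.1]            # initial weights and hyperparameter
for _ in range(15):                    # outer augmented-Lagrangian loop
    z = minimize(aug_lagrangian, z, args=(mu, rho), method="BFGS").x
    lam = max(z[5], 1e-8)
    mu += rho * (L_tr(z[:5], lam) - phi(lam))  # multiplier update

print(f"lambda = {lam:.4f}, validation loss = {L_va(z[:5]):.4f}")
```

In the paper's setting φ(λ) has no closed form and must be approximated from sampled lower-level solutions; the surrounding scheme is the standard augmented Lagrangian loop.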
Cite this article
Sinha, A., Khandait, T. & Mohanty, R. A gradient-based bilevel optimization approach for tuning regularization hyperparameters. Optim Lett (2023). https://doi.org/10.1007/s11590-023-02057-x