Reinforcement Learning for Variable Selection in Branch and Bound
Useful Links
Project Overview
In our work, we propose an off-policy policy gradient (OFF-PG) approach to learn variable selection in Branch and Bound. Unlike on-policy RL methods, which suffer from sparse rewards and difficult exploration, OFF-PG leverages pre-collected trajectories generated by a behavioral policy to stabilize training and improve sample efficiency. By incorporating importance sampling, generalized advantage estimation (GAE), and actor-critic networks, our method enables effective learning of branching strategies without requiring expensive expert data or extensive trial-and-error. As a result, our approach achieves competitive or superior solution quality compared to imitation learning methods while also offering greater scalability and flexibility across diverse MILP instances.
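To make the update concrete, below is a minimal sketch of an importance-weighted policy-gradient step with GAE, assuming a PyTorch actor-critic; all function names and hyperparameters are illustrative and not taken from the project's code.

```python
# Minimal sketch (assumed PyTorch implementation, not the project's actual code)
# of an importance-weighted policy-gradient update with GAE on trajectories
# collected by a behavioral policy.
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: shape (T,); values: shape (T+1,), including a bootstrap value
    # for the final state.
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

def off_policy_pg_loss(policy_log_probs, behavior_log_probs, advantages, clip=10.0):
    # Per-step importance ratios pi_theta(a|s) / mu(a|s); clipping keeps the
    # off-policy correction from exploding when the two policies diverge.
    ratios = torch.exp(policy_log_probs - behavior_log_probs.detach()).clamp(max=clip)
    return -(ratios * advantages.detach()).mean()

# The critic is fitted separately, e.g. with an MSE loss between predicted
# values and the GAE return targets (advantages + values[:-1]).
```

In this sketch, `policy_log_probs` come from re-evaluating the current actor on stored states, while `behavior_log_probs` would be logged at data-collection time.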
Results
We train our model on the following set of combinatorial optimization (CO) problems (an illustrative instance-generation sketch follows the table):
Problem | Training: easy | Testing: easy | Testing: medium | Testing: hard |
---|---|---|---|---|
Set Covering | $n_{\text{cols}}$ = 1000, $n_{\text{rows}}$ = 500 | $n_{\text{cols}}$ = 1000, $n_{\text{rows}}$ = 500 | $n_{\text{cols}}$ = 1000, $n_{\text{rows}}$ = 600 | $n_{\text{cols}}$ = 1000, $n_{\text{rows}}$ = 700 |
Capacitated Facility Location | $n_{\text{facilities}}$ = 50, $n_{\text{customers}}$ = 100 | $n_{\text{facilities}}$ = 50, $n_{\text{customers}}$ = 100 | $n_{\text{facilities}}$ = 50, $n_{\text{customers}}$ = 150 | $n_{\text{facilities}}$ = 50, $n_{\text{customers}}$ = 200 |
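For illustration only, instances of these sizes could be sampled with the Ecole library's built-in generators; this is an assumption about tooling, not necessarily how the project's datasets were produced.

```python
# Illustrative only: sampling instances of the sizes above with the Ecole
# library's generators (an assumption; the project's own generation scripts
# may differ).
import ecole

setcover_train = ecole.instance.SetCoverGenerator(n_rows=500, n_cols=1000)
setcover_hard = ecole.instance.SetCoverGenerator(n_rows=700, n_cols=1000)
facility_train = ecole.instance.CapacitatedFacilityLocationGenerator(
    n_customers=100, n_facilities=50
)

instance = next(setcover_train)          # yields an ecole.scip.Model
instance.write_problem("setcover_0.lp")  # placeholder output path
```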
We compare against SCIP 8.1, accessed through PySCIPOpt [1], which by default uses a variant of hybrid pseudocost branching (HPB), and against the imitation learning method proposed in [2], denoted GCNN. We do not include full strong branching (FSB), as it is not competitive in running time despite producing very small search trees.
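As background on how a learned policy interacts with the solver, the sketch below shows one way to register a custom branching rule through PySCIPOpt; `score_candidates` is a hypothetical stand-in for a trained network, and the snippet is illustrative rather than this project's actual integration.

```python
# Illustrative sketch: plugging a learned scoring function into SCIP as a
# branching rule via PySCIPOpt. `score_candidates` is a hypothetical stand-in
# for a trained policy network, not this project's actual implementation.
from pyscipopt import Model, Branchrule, SCIP_RESULT

class LearnedBranchrule(Branchrule):
    def __init__(self, score_candidates):
        self.score_candidates = score_candidates

    def branchexeclp(self, allowaddcons):
        # Fractional LP candidates at the current node.
        lpcands, lpcandssol, lpcandsfrac, nlpcands, *_ = self.model.getLPBranchCands()
        if nlpcands == 0:
            return {"result": SCIP_RESULT.DIDNOTRUN}
        scores = self.score_candidates(lpcands, lpcandssol, lpcandsfrac)
        best = max(range(nlpcands), key=lambda i: scores[i])
        self.model.branchVar(lpcands[best])
        return {"result": SCIP_RESULT.BRANCHED}

model = Model()
model.readProblem("instance.lp")  # placeholder instance path
model.includeBranchrule(
    LearnedBranchrule(score_candidates=lambda cands, sols, fracs: fracs),  # toy scorer
    "learned", "branching rule driven by a learned policy",
    priority=1000000, maxdepth=-1, maxbounddist=1.0,
)
model.optimize()
```

Giving the rule a high priority ensures SCIP consults it before its built-in branching rules at every node.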
Performance comparison on Set Covering
Model | Easy (Time ± %) | Easy (Wins) | Easy (Nodes ± %) | Medium (Time ± %) | Medium (Wins) | Medium (Nodes ± %) | Hard (Time ± %) | Hard (Wins) | Hard (Nodes ± %) |
---|---|---|---|---|---|---|---|---|---|
HPB | 3.78 ± 28.16% | 6/30 | 381.10 ± 41.05% | 4.32 ± 18.60% | 7/30 | 339.97 ± 23.14% | 6.43 ± 7.41% | 17/30 | 727.83 ± 20.43% |
GCNN | 3.61 ± 33.38% | 12/30 | 325.50 ± 33.86% | 4.22 ± 19.99% | 13/30 | 299.43 ± 20.02% | 8.09 ± 5.06% | 7/30 | 609.13 ± 18.28% |
OFF-PG | 3.71 ± 31.15% | 6/30 | 344.80 ± 35.18% | 4.39 ± 17.97% | 10/30 | 309.13 ± 19.33% | 7.78 ± 5.42% | 6/30 | 601.47 ± 18.92% |
Performance comparison on Capacitated Facility Location
Model | Easy (Time ± %) | Easy (Wins) | Easy (Nodes ± %) | Medium (Time ± %) | Medium (Wins) | Medium (Nodes ± %) | Hard (Time ± %) | Hard (Wins) | Hard (Nodes ± %) |
---|---|---|---|---|---|---|---|---|---|
HPB | 10.91 ± 3.06% | 7/30 | 162.87 ± 20.47% | 20.55 ± 1.19% | 3/30 | 151.33 ± 17.29% | 26.82 ± 0.82% | 1/30 | 86.07 ± 16.53% |
GCNN | 9.72 ± 4.23% | 8/30 | 234.27 ± 20.11% | 15.09 ± 2.02% | 14/30 | 248.33 ± 19.43% | 15.69 ± 1.75% | 17/30 | 135.70 ± 16.73% |
OFF-PG | 8.35 ± 4.99% | 15/30 | 210.53 ± 20.31% | 13.96 ± 2.11% | 13/30 | 218.73 ± 16.17% | 15.08 ± 1.80% | 12/30 | 127.10 ± 15.47% |
References
1. MAHER, Stephen; MILTENBERGER, Matthias; PEDROSO, João P.; REHFELDT, Daniel; SCHWARZ, Robert; SERRANO, Felipe: PySCIPOpt: Mathematical Programming in Python with the SCIP Optimization Suite. In: Mathematical Software – ICMS 2016. Springer International Publishing, 2016, pp. 301–307.
2. GASSE, Maxime; CHETELAT, Didier; FERRONI, Nicola; CHARLIN, Laurent; LODI, Andrea: Exact Combinatorial Optimization with Graph Convolutional Neural Networks. In: WALLACH, H.; LAROCHELLE, H.; BEYGELZIMER, A.; D'ALCHÉ-BUC, F.; FOX, E.; GARNETT, R. (Eds.): Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc., 2019.