1 Introduction
A definition proposed by Azzouz [1] for a dynamic problem is: "a dynamic multi-objective optimization problem (DMOP) is the problem of finding a vector of decision variables which satisfies a set of constraints and optimizes a vector of functions whose scalar values represent objectives that change over time." A DMOP can be defined as follows in Equation 1:
where the objective functions, and possibly the constraints, depend on a time variable t that changes during the optimization process.
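In the notation commonly used in the literature, and consistent with the definition above, such a formulation can be sketched as follows (the symbols are illustrative, not necessarily those of Equation 1):

```latex
\begin{aligned}
\min_{\mathbf{x} \in \Omega} \quad & F(\mathbf{x}, t) = \bigl(f_1(\mathbf{x}, t),\, f_2(\mathbf{x}, t),\, \dots,\, f_M(\mathbf{x}, t)\bigr)^{T} \\
\text{subject to} \quad & g_i(\mathbf{x}, t) \le 0, \quad i = 1, \dots, p, \\
                        & h_j(\mathbf{x}, t) = 0, \quad j = 1, \dots, q,
\end{aligned}
```

Here x is the decision vector, Ω the decision space, and M the number of time-dependent objectives.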
Ke Li [2] mentions that two recurring problems in single-objective and multi-objective optimization are determining a suitable configuration for the control parameters and selecting the correct genetic operators for each type of problem. This leads to different scenarios for solving problems, and it becomes even more complicated when the user is not an EA expert.
Search operators are of great importance in metaheuristics: some operators are better suited to a specific problem, and the selection and order in which these operators are used can affect the performance of an algorithm [3].
In this work we focus on selecting the correct genetic operator through adaptive operator selection (AOS). AOS is responsible for automatically determining which variation operator to use at a given time within a MOEA process.
Within the state of the art for AOS, the most recent works correspond to methods based on multi-armed bandits applied to static decomposition-based multi-objective optimization algorithms.
In this work, a new adaptive operator selection mechanism based on the SARSA(λ) reinforcement learning technique is proposed. This new AOS mechanism has been incorporated into the DMOEA/D algorithm and compared against a state-of-the-art AOS strategy, which was also integrated into DMOEA/D.
2 Background and Related Work
AOS comprises two main tasks [4]: credit assignment and operator selection. The first task determines the reward or weight of each operator according to its performance; the second task selects the best available operator.
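To make the two tasks concrete, a minimal (hypothetical) AOS interface could expose one method per task; the names below are illustrative and not taken from any of the cited works:

```java
/** Minimal sketch of an adaptive operator selection (AOS) component. */
public interface AdaptiveOperatorSelection {

    /** Operator selection task: returns the index of the operator to apply next. */
    int selectOperator();

    /**
     * Credit assignment task: rewards operator {@code op} according to the
     * improvement of the child's fitness with respect to its parent
     * (for minimization, e.g. max(0, parentFitness - childFitness)).
     */
    void assignCredit(int op, double parentFitness, double childFitness);
}
```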
Credits can be assigned in different ways depending on the method or algorithm. In general, it can be done based on the improvement of the child's fitness with respect to its parents, or based on a ranking of the operators.
Depending on how the operator selection task works, AOS methods are divided into two groups: those based on a probabilistic approach and those based on a multi-armed bandit (MAB) approach.
In the multi-armed bandit approach, AOS uses the "multi-armed bandit (MAB) problem paradigm" [2], which considers each operator as an arm of a slot machine, each with an unknown reward probability. These methods seek to maximize the reward accumulated during the process and model these rewards to select the best operator (arm) at each moment.
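As an illustration only, a UCB-style decision rule of the kind underlying bandit-based AOS methods can be sketched as follows; the class name, the sample-average credit update, and the exploration constant are assumptions and do not reproduce the exact fitness-rate-rank credit assignment of FRRMAB [2]:

```java
/** Illustrative UCB-style selection over a pool of operators (arms). */
public final class BanditSelector {
    private final double[] quality;   // estimated reward of each operator
    private final int[] uses;         // how many times each operator has been applied
    private final double c;           // exploration constant

    public BanditSelector(int numOperators, double explorationConstant) {
        this.quality = new double[numOperators];
        this.uses = new int[numOperators];
        this.c = explorationConstant;
    }

    /** Returns the operator maximizing quality plus an exploration bonus. */
    public int select() {
        int total = 0;
        for (int n : uses) total += n;
        if (total == 0) return 0;                    // nothing tried yet: pick the first arm
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int op = 0; op < quality.length; op++) {
            if (uses[op] == 0) return op;            // untried arms are explored first
            double score = quality[op] + c * Math.sqrt(2.0 * Math.log(total) / uses[op]);
            if (score > bestScore) { bestScore = score; best = op; }
        }
        return best;
    }

    /** Credit assignment: incremental sample-average update of the quality estimate. */
    public void update(int op, double reward) {
        uses[op]++;
        quality[op] += (reward - quality[op]) / uses[op];
    }
}
```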
In our previous work [5], we applied the "Fitness-Rate-Rank-based Multi-Armed Bandit (FRRMAB)" [2], "Adaptive Operator Selection Based on Dynamic Thompson Sampling (DYTS)" [6], and "Adaptive operator selection with test-and-apply structure for decomposition-based multi-objective optimization (TAOS)" [7] methods to a dynamic version of the MOEA/D algorithm to observe their behavior.
Another of the most recent works in the area is “A novel bicriteria assisted adaptive operator selection (B-AOS) strategy for decomposition-based multi-objective evolutionary algorithms (MOEA/Ds)” proposed in 2021 by Wu Lin [8]. This approach employs two groups of operators; each group includes two genetic operators with different search patterns.
In addition, it uses two criteria, which emphasize convergence and diversity, to help select the appropriate operator.
In the probabilistic approach, a probability is attached to each operator and the selection process resembles a roulette wheel: operators with higher probabilities occupy a larger area of the wheel and therefore have a greater chance of being selected.
The most popular methods within this group are Probability Matching [9] and Adaptive Pursuit [10]. Another work, "Adaptive crossover operator based multi-objective binary genetic algorithm for feature selection in classification," uses a probability-based AOS within a multi-objective genetic algorithm to solve feature selection problems [11].
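A minimal sketch of this roulette-wheel selection, in the spirit of Probability Matching, is given below; the class name, the minimum probability pMin, and the quality update are illustrative assumptions rather than the exact formulation of [9]:

```java
import java.util.Random;

/** Illustrative Probability-Matching-style operator selection (roulette wheel). */
public final class ProbabilityMatchingSelector {
    private final double[] quality;     // running quality estimate of each operator
    private final double pMin;          // minimum selection probability kept for every operator
    private final Random rng = new Random();

    public ProbabilityMatchingSelector(int numOperators, double pMin) {
        this.quality = new double[numOperators];
        java.util.Arrays.fill(this.quality, 1.0);   // optimistic start so every operator gets tried
        this.pMin = pMin;
    }

    /** Spins the roulette wheel: operators with higher probability own a larger slice. */
    public int select() {
        int k = quality.length;
        double sum = 0.0;
        for (double q : quality) sum += q;
        double[] prob = new double[k];
        for (int op = 0; op < k; op++) {
            prob[op] = pMin + (1.0 - k * pMin) * (quality[op] / sum);
        }
        double r = rng.nextDouble(), acc = 0.0;
        for (int op = 0; op < k; op++) {
            acc += prob[op];
            if (r <= acc) return op;
        }
        return k - 1;
    }

    /** Credit assignment: relax the quality estimate towards the observed reward. */
    public void update(int op, double reward, double adaptationRate) {
        quality[op] += adaptationRate * (reward - quality[op]);
    }
}
```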
The strategies mentioned above are applied to static multi-objective algorithms in the state of the art. In this work, the most recent strategy, "Adaptive operator selection with test-and-apply structure for decomposition-based multi-objective optimization (TAOS)" [7], is applied to a dynamic multi-objective algorithm in order to compare it with the strategy proposed in this work.
2.1 Adaptive Operator Selection with Test-and-Apply Structure for Decomposition-based Multi-Objective Optimization (TAOS)
In this approach, proposed by Lisha Dong in 2022 [7], the whole evolutionary process is structured into several consecutive sections, each designed to execute a testing phase and an application phase.
2.1.1 Test Phase
In the testing phase, each operator is tested in the same environment. The testing phase is divided into N parts, as shown in Figure 1: test1, …, testN. All operators are evaluated once, in order. Therefore, the total number of function evaluations for the testing phase is defined in Equation 2:
where
A child solution updates the first solution that it improves, and the corresponding update is counted as successful for the operator that generated it.
2.1.2 Apply Phase
When the testing phase has finished, the number of successful updates is obtained for each operator, as shown in Equation 4:
where
The operator with the highest number of successful updates is selected and applied during the application phase,
where k handles the allocation of resources between the testing and application phases.
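Based on the description above, one section of a TAOS-like schedule can be sketched as follows; the class name, the helper applyOperator, and the way the budget is split between the two phases are assumptions for illustration, not the exact procedure of [7]:

```java
import java.util.function.IntPredicate;

/** Sketch of one test-and-apply section of a TAOS-like schedule (illustrative only). */
public final class TestAndApplySection {

    /**
     * @param sectionBudget total function evaluations available in this section
     * @param numOperators  number of candidate operators
     * @param testShare     fraction of the section spent testing (role of the parameter k; assumption)
     * @param applyOperator applies operator {@code op} once and returns true on a successful update
     * @return index of the operator chosen for the application phase
     */
    public static int run(int sectionBudget, int numOperators, double testShare, IntPredicate applyOperator) {
        int testPerOperator = (int) (sectionBudget * testShare) / numOperators;
        int[] successfulUpdates = new int[numOperators];

        // Test phase: every operator gets the same budget, in order, under the same environment.
        for (int op = 0; op < numOperators; op++) {
            for (int fe = 0; fe < testPerOperator; fe++) {
                if (applyOperator.test(op)) successfulUpdates[op]++;
            }
        }

        // Application phase: the operator with the most successful updates takes the remaining budget.
        int best = 0;
        for (int op = 1; op < numOperators; op++) {
            if (successfulUpdates[op] > successfulUpdates[best]) best = op;
        }
        int remaining = sectionBudget - numOperators * testPerOperator;
        for (int fe = 0; fe < remaining; fe++) applyOperator.test(best);
        return best;
    }
}
```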
2.2 Reinforcement Learning
Reinforcement learning is "learning what to do and how to map situations to actions to maximize a numerical reward signal. The Agent is not told which actions to take; instead, it must discover which actions yield the most significant reward by attempting them" [12].
In Figure 2, it can be seen that the two main components of reinforcement learning are the Agent and the Environment. The Agent is the model that is trained to make decisions, while the Environment is everything the Agent interacts with and from which it receives states and rewards.
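A schematic version of this interaction loop, with hypothetical Environment and Agent types, is sketched below; it already anticipates the on-policy update used by SARSA(λ):

```java
/** Schematic agent-environment interaction loop of reinforcement learning. */
public final class InteractionLoop {

    interface Environment {
        int reset();                 // returns the initial state
        double step(int action);     // executes the action and returns the reward
        int state();                 // current state after the last action
        boolean done();
    }

    interface Agent {
        int act(int state);                                        // chooses an action for a state
        void learn(int s, int a, double r, int sNext, int aNext);  // updates its value estimates
    }

    /** One episode: the Agent acts, the Environment answers with a reward and a new state. */
    public static void episode(Environment env, Agent agent) {
        int s = env.reset();
        int a = agent.act(s);
        while (!env.done()) {
            double r = env.step(a);
            int sNext = env.state();
            int aNext = agent.act(sNext);      // on-policy: the next action comes from the same policy
            agent.learn(s, a, r, sNext, aNext);
            s = sNext;
            a = aNext;
        }
    }
}
```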
2.2.1 SARSA (λ)
SARSA(λ) is an on-policy temporal-difference learning method that extends the basic SARSA algorithm with eligibility traces.
The SARSA(λ) update propagates the credit for a reward back to previously visited state-action pairs, with the parameter λ controlling how quickly the eligibility traces decay.
Next, Algorithm 1 presents the general mechanism of SARSA(λ).
The calculation of the temporal error is defined in Equation 7:
The temporal error is used to update the Q values and to improve the estimation of the quality of the actions in the different states. The update of the Q values is defined by Equation 8:
where α is the learning rate, γ is the discount factor, λ is the trace-decay parameter, and e(s, a) is the eligibility trace of each state-action pair.
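For reference, the standard textbook forms that these two equations take for SARSA(λ) [12] are sketched below (the notation is illustrative and may differ from the exact symbols of Equations 7 and 8):

```latex
% TD error at step t (standard form of Equation 7):
\delta_t = r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)

% Eligibility traces and Q-value update for every pair (s, a) (standard form of Equation 8):
e_t(s, a) = \gamma \lambda \, e_{t-1}(s, a) + \mathbf{1}\{ s = s_t,\ a = a_t \},
\qquad
Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta_t \, e_t(s, a)
```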
3 Proposed Algorithm DMOEA/D-SL
This work proposes a new AOS method using the reinforcement learning technique SARSA(λ).
Two mechanisms proposed by Deb [13] have been added to the MOEA/D algorithm proposed by Zhang and Li in 2007 [14] to make it capable of working with dynamic problems: a change-detection mechanism based on detectors and the change-response mechanism called "A", which in [13] replaces a percentage of the population with randomly generated solutions when a change is detected.
We have taken the arms of the multi-armed bandit approach and used them as actions; in this case, four variants of the differential evolution crossover operator are used as actions. The SARSA(λ) mechanism has been incorporated into the main loop of DMOEA/D. The general structure of this integration is shown in Algorithm 2.
SARSA(λ) is initialized with the values corresponding to each of its parameters and is assigned to an object called the Agent; in this step, the QTable is also initialized (line 3). For the episode loop, the main DMOEA/D loop is used (line 14), and for the step loop, the loop that runs through the population list is used (line 18).
In each turn of the main loop, the current state is observed.
We get all the values stored in the QTable for that state, and the action (genetic operator) to apply is chosen from them following the exploration policy.
Subsequently, the following action, corresponding to the next step, is selected, and the temporal error and the QTable are updated as described in Section 2.2.1.
This process is repeated until the stopping criterion is met. It should also be considered that at each problem change, the QTable must be reinitialized (line 67).
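A condensed sketch of this integration is shown below. The type and method names (SarsaLambdaAgent, Dmoead, applyAction, and so on) are hypothetical and only illustrate how the Agent is wired into the loops described above; they are not the authors' implementation, and the reward definition is an assumption:

```java
/** Illustrative wiring of a SARSA(lambda) agent into the main DMOEA/D loop (hypothetical names). */
public final class DmoeadSlSketch {

    /** Minimal view of the agent: epsilon-greedy selection over a QTable plus the SARSA(lambda) update. */
    interface SarsaLambdaAgent {
        void resetQTable();
        int selectAction(int state);
        void update(int s, int a, double reward, int sNext, int aNext);
    }

    /** Minimal view of the dynamic MOEA/D needed by the sketch. */
    interface Dmoead {
        boolean detectChange();
        void respondToChange();         // Deb's change-response mechanism "A"
        int populationSize();
        int currentState();
        /** Applies DE variant {@code action} plus polynomial mutation to subproblem {@code i}; returns the reward. */
        double applyAction(int action, int i);
    }

    public static void evolve(SarsaLambdaAgent agent, Dmoead moead, int maxIterations) {
        agent.resetQTable();                                   // Algorithm 2, line 3: initialize the QTable
        int state = 0;                                         // initial state (assumption)
        int action = agent.selectAction(state);

        for (int it = 0; it < maxIterations; it++) {           // episode loop = main DMOEA/D loop (line 14)
            if (moead.detectChange()) {
                moead.respondToChange();
                agent.resetQTable();                           // line 67: reinitialize the QTable on each change
            }
            for (int i = 0; i < moead.populationSize(); i++) { // step loop = population loop (line 18)
                double reward = moead.applyAction(action, i);  // reward, e.g. improvement of the child (assumption)
                int nextState = moead.currentState();
                int nextAction = agent.selectAction(nextState);
                agent.update(state, action, reward, nextState, nextAction); // SARSA(lambda) update (Eqs. 7-8)
                state = nextState;
                action = nextAction;
            }
        }
    }
}
```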
3.1 Actions Pool
Five genetic operators are involved. Four different versions of the differential evolution (DE) crossover operator [6] were used as actions, and in addition to each crossover operator, the polynomial mutation operator was also applied:
Action 1: apply DE/rand/1,
Action 2: apply DE/rand/2,
Action 3: apply DE/current-to-rand/1,
Action 4: apply DE/current-to-rand/2.
The four crossover operators use the usual DE control parameters, the scaling factor F and the crossover rate CR.
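As an example of how one of these actions produces a child, a minimal DE/rand/1 crossover with binomial recombination is sketched below; F and CR are the usual DE control parameters, and the polynomial mutation operator would then be applied to the returned child:

```java
import java.util.Random;

/** Minimal DE/rand/1 crossover: v = x_r1 + F * (x_r2 - x_r3), binomial recombination with the parent. */
public final class DeRand1 {
    private static final Random RNG = new Random();

    public static double[] child(double[] parent, double[] r1, double[] r2, double[] r3,
                                 double f, double cr) {
        int n = parent.length;
        double[] child = new double[n];
        int jRand = RNG.nextInt(n);                       // at least one variable comes from the mutant vector
        for (int j = 0; j < n; j++) {
            double mutant = r1[j] + f * (r2[j] - r3[j]);  // differential mutation
            child[j] = (RNG.nextDouble() < cr || j == jRand) ? mutant : parent[j];
        }
        return child;
    }
}
```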
4 Computational Experiments
Table 1 shows the eight dynamic multi-objective benchmark problems, with 2 and 3 objectives, used in this experiment. For each algorithm and each front of each dynamic multi-objective problem, 30 independent runs were conducted.
The objective of the experimentation is to compare the proposed algorithm against the state-of-the-art strategy "Adaptive operator selection with test-and-apply structure for decomposition-based multi-objective optimization (TAOS)" [7]. Table 2 shows the parameters used for each algorithm. The algorithms were implemented in the Java language.
Variables / Parameters | DMOEA/D-TAOS | DMOEA/D-SL |
maxIt | 100 | 100 |
nPop | 100 | 100 |
fileSize | 100 | 100 |
Zeta | 0.2 | 0.2 |
K | 1 | -- |
alpha | -- | 0.04 |
gamma | -- | 0.08 |
lambda | -- | 0.07 |
epsilon | -- | 0.1 |
The parameter values for MOEA/D have been taken from the state of the art, and the parameter values of the SARSA(λ) Agent were obtained empirically by performing multiple preliminary experiments.
4.1 Results
The experimentation results are presented below in one table per metric (hypervolume, generalized spread, and inverted generational distance). The Wilcoxon non-parametric test was applied with a significance level of 5%.
The first column in each table presents the problem, and the second column presents the front of the problem. Columns three and four present the results of each algorithm; the algorithm in the third column (DMOEA/D-TAOS) is taken as the reference.
The following symbols are included in the results tables: the symbol ▲ means that there is statistical significance in favor of the reference algorithm, ▼ that there is statistical significance in favor of the algorithm that is compared with the reference algorithm (in favor of the current column), and == means there is no statistical significance.
The cells marked in dark gray represent the winning algorithm in a given problem and the front, and second places are marked in light gray.
4.1.1 Hypervolume
The hypervolume denotes the multidimensional volume of the objective space weakly dominated by an approximation set [5]. In Table 3, the third column is considered the reference algorithm. As shown in Table 3, for the hypervolume metric the Wilcoxon test gives the DMOEA/D-SL algorithm 8 first places with statistical significance in its favor, whereas the DMOEA/D-TAOS algorithm obtains 18 first places with statistical significance in its favor.
Problem | Front | DMOEA/D-TAOS | DMOEA/D-SL |
FDA1 | 1 | 6.32e-01 6.24e-03 | 6.31e-01 8.39e-03 ▲ |
2 | 6.33e-01 5.76e-03 | 6.32e-01 5.35e-03 ▲ | |
3 | 6.29e-01 7.83e-03 | 6.30e-01 8.70e-03 == | |
4 | 6.33e-01 5.44e-03 | 6.33e-01 8.34e-03 ▼ | |
5 | 6.31e-01 7.67e-03 | 6.35e-01 5.83e-03 ▼ | |
FDA3 | 1 | 6.56e-01 5.97e-03 | 6.55e-01 4.65e-03 == |
2 | 6.86e-01 7.86e-03 | 6.84e-01 4.33e-03 == | |
3 | 6.63e-01 3.52e-03 | 6.61e-01 4.50e-03 ▲ | |
4 | 6.51e-01 3.86e-03 | 6.49e-01 4.44e-03 ▲ | |
5 | 6.50e-01 6.20e-03 | 6.47e-01 8.16e-03 ▲ | |
DMOP1 | 1 | 4.28e-01 3.71e-03 | 4.27e-01 4.36e-03 ▲ |
2 | 4.07e-01 3.13e-03 | 4.07e-01 2.48e-03 == | |
3 | 3.88e-01 1.98e-03 | 3.88e-01 2.34e-03 ▲ | |
4 | 3.72e-01 2.27e-03 | 3.71e-01 2.42e-03 == | |
5 | 3.58e-01 2.20e-03 | 3.58e-01 1.96e-03 == | |
DMOP2 | 1 | 4.28e-01 2.30e-03 | 4.27e-01 1.93e-03 ▲ |
2 | 4.05e-01 2.61e-03 | 4.05e-01 3.10e-03 ▲ | |
3 | 3.86e-01 2.55e-03 | 3.86e-01 3.10e-03 ▼ | |
4 | 3.69e-01 2.25e-03 | 3.68e-01 2.53e-03 ▲ | |
5 | 3.55e-01 1.49e-03 | 3.55e-01 3.76e-03 ▼ | |
DF4 | 1 | 6.83e-01 4.06e-03 | 6.80e-01 8.08e-03 ▲ |
2 | 7.34e-01 4.68e-03 | 7.32e-01 3.60e-03 ▲ | |
3 | 7.72e-01 3.46e-03 | 7.72e-01 3.36e-03 == | |
4 | 8.04e-01 2.04e-03 | 8.03e-01 2.91e-03 ▲ | |
5 | 8.26e-01 3.16e-03 | 8.26e-01 3.39e-03 ▲ | |
DF6 | 1 | 3.71e-02 6.53e-04 | 3.70e-02 9.11e-04 ▲ |
2 | 0.00e+00 0.00e+00 | 0.00e+00 0.00e+00 == | |
3 | 3.51e-02 1.06e-01 | 3.81e-02 2.02e-01 ▼ | |
4 | 1.06e-01 1.75e-01 | 1.08e-01 2.97e-01 ▼ | |
5 | 2.12e-01 2.13e-01 | 2.15e-01 3.40e-01 ▼ | |
DF10 | 1 | 8.36e-01 6.66e-03 | 8.36e-01 5.48e-03 ▼ |
2 | 8.30e-01 7.12e-03 | 8.31e-01 5.09e-03 ▼ | |
3 | 8.11e-01 6.06e-03 | 8.12e-01 5.20e-03 ▼ | |
4 | 7.82e-01 5.58e-03 | 7.82e-01 5.14e-03 ▲ | |
5 | 7.45e-01 6.38e-03 | 7.45e-01 4.35e-03 ▼ | |
DF12 | 1 | 3.04e-01 1.18e-02 | 3.06e-01 2.47e-02 == |
2 | 6.53e-01 1.39e-02 | 6.47e-01 1.71e-02 ▲ | |
3 | 6.46e-01 1.16e-02 | 6.51e-01 1.17e-02 == | |
4 | 6.53e-01 1.63e-02 | 6.50e-01 1.90e-02 ▲ | |
5 | 6.53e-01 1.57e-02 | 6.46e-01 2.71e-02 ▲ |
4.1.2 Generalized Spread
Generalized spread measures the uniformity and dispersion of the solutions found [5]. In Table 4, the third column is considered the reference algorithm. As shown in Table 4, for the generalized spread metric the Wilcoxon test gives the DMOEA/D-SL algorithm 20 first places with statistical significance in its favor, whereas the DMOEA/D-TAOS algorithm obtains 8 first places with statistical significance in its favor.
Problem | Front | DMOEA/D-TAOS | DMOEA/D-SL |
FDA1 | 1 | 5.16e-01 1.91e-01 | 4.63e-01 1.90e-01 ▼ |
2 | 5.58e-01 1.45e-01 | 4.38e-01 1.20e-01 ▼ | |
3 | 5.34e-01 1.67e-01 | 4.68e-01 1.49e-01 ▼ | |
4 | 5.27e-01 1.59e-01 | 4.82e-01 1.50e-01 ▼ | |
5 | 5.59e-01 1.40e-01 | 4.65e-01 1.29e-01 ▼ | |
FDA3 | 1 | 6.65e-01 2.03e-01 | 6.96e-01 1.57e-01 == |
2 | 6.22e-01 2.27e-01 | 5.86e-01 3.57e-01 ▼ | |
3 | 8.44e-01 1.47e-01 | 8.34e-01 1.41e-01 ▼ | |
4 | 6.48e-01 7.09e-02 | 6.37e-01 6.86e-02 ▼ | |
5 | 4.70e-01 4.49e-02 | 4.44e-01 4.73e-02 ▼ | |
DMOP1 | 1 | 7.46e-01 5.48e-02 | 7.33e-01 1.47e-01 == |
2 | 7.42e-01 7.13e-02 | 7.27e-01 6.02e-02 ▼ | |
3 | 7.62e-01 1.05e-01 | 7.45e-01 6.63e-02 ▼ | |
4 | 7.53e-01 7.05e-02 | 7.48e-01 7.32e-02 ▼ | |
5 | 7.40e-01 1.17e-01 | 7.37e-01 6.95e-02 == | |
DMOP2 | 1 | 7.27e-01 7.21e-02 | 7.50e-01 9.14e-02 ▲ |
2 | 6.86e-01 8.37e-02 | 6.73e-01 7.78e-02 ▼ | |
3 | 6.92e-01 9.39e-02 | 7.00e-01 7.33e-02 == | |
4 | 6.86e-01 5.62e-02 | 6.93e-01 1.12e-01 == | |
5 | 6.85e-01 6.99e-02 | 6.83e-01 6.99e-02 == | |
DF4 | 1 | 8.38e-01 1.47e-01 | 8.77e-01 1.78e-01 == |
2 | 8.39e-01 1.11e-01 | 8.56e-01 1.47e-01 == | |
3 | 1.00e+00 4.66e-02 | 9.83e-01 6.08e-02 ▼ | |
4 | 1.12e+00 3.52e-02 | 1.10e+00 5.37e-02 ▼ | |
5 | 1.16e+00 7.50e-02 | 1.15e+00 5.48e-02 ▼ | |
DF6 | 1 | 7.02e-01 8.48e-02 | 6.96e-01 8.28e-02 == |
2 | 9.41e-03 4.32e-03 | 9.20e-03 1.12e-02 == | |
3 | 1.05e-01 4.86e-02 | 1.10e-01 1.01e-01 ▲ | |
4 | 1.07e-01 3.28e-02 | 1.10e-01 7.07e-02 ▲ | |
5 | 1.29e-01 8.68e-03 | 1.33e-01 3.10e-02 ▲ | |
DF10 | 1 | 7.32e-01 1.49e-01 | 7.61e-01 1.50e-01 ▲ |
2 | 7.78e-01 1.78e-01 | 7.93e-01 1.11e-01 == | |
3 | 7.66e-01 1.69e-01 | 7.73e-01 1.15e-01 ▲ | |
4 | 7.54e-01 1.83e-01 | 7.50e-01 9.83e-02 ▼ | |
5 | 7.96e-01 2.02e-01 | 7.57e-01 1.38e-01 == | |
DF12 | 1 | 5.14e-01 3.90e-02 | 4.76e-01 5.83e-02 ▼ |
2 | 4.96e-01 4.41e-02 | 5.05e-01 3.89e-02 ▲ | |
3 | 5.14e-01 3.21e-02 | 5.07e-01 3.29e-02 ▼ | |
4 | 5.12e-01 5.60e-02 | 5.14e-01 3.97e-02 ▲ | |
5 | 5.05e-01 5.91e-02 | 4.91e-01 5.79e-02 ▼ |
4.1.3 Inverted Generational Distance
The inverted generational distance provides the average distance between any point in the reference set and its closest point in the approximation set [5]. In Table 5, the third column is considered the reference algorithm. As shown in Table 5, for the inverted generational distance metric the Wilcoxon test gives the DMOEA/D-SL algorithm 21 first places with statistical significance in its favor.
Problem | Front | DMOEA/D-TAOS | DMOEA/D-SL |
FDA1 | 1 | 7.49e-04 3.08e-04 | 8.06e-04 2.87e-04 == |
2 | 8.25e-04 2.43e-04 | 7.31e-04 3.24e-04 == | |
3 | 8.49e-04 2.67e-04 | 7.99e-04 2.93e-04 ▼ | |
4 | 7.71e-04 2.71e-04 | 7.14e-04 2.70e-04 ▼ | |
5 | 8.33e-04 2.82e-04 | 7.00e-04 2.38e-04 ▼ | |
FDA3 | 1 | 6.80e-04 4.16e-04 | 6.92e-04 3.22e-04 ▲ |
2 | 2.20e-03 1.25e-03 | 1.97e-03 1.46e-03 ▼ | |
3 | 2.93e-03 2.27e-03 | 2.90e-03 2.30e-03 ▼ | |
4 | 3.21e-03 3.31e-03 | 3.27e-03 2.57e-03 ▲ | |
5 | 3.78e-03 2.67e-03 | 4.57e-03 4.78e-03 == | |
DMOP1 | 1 | 5.43e-04 1.13e-04 | 5.19e-04 1.96e-04 == |
2 | 5.04e-04 1.43e-04 | 4.63e-04 9.23e-05 ▼ | |
3 | 5.29e-04 2.26e-04 | 4.98e-04 9.75e-05 ▼ | |
4 | 5.13e-04 1.59e-04 | 5.12e-04 1.40e-04 ▼ | |
5 | 5.08e-04 2.18e-04 | 4.92e-04 1.20e-04 ▼ | |
DMOP2 | 1 | 5.34e-04 1.34e-04 | 5.33e-04 1.64e-04 ▼ |
2 | 4.76e-04 1.54e-04 | 4.72e-04 1.12e-04 == | |
3 | 5.18e-04 1.37e-04 | 5.09e-04 1.26e-04 == | |
4 | 4.94e-04 1.13e-04 | 5.13e-04 1.83e-04 ▲ | |
5 | 4.79e-04 1.26e-04 | 4.76e-04 1.53e-04 ▼ | |
DF4 | 1 | 2.99e-03 1.43e-03 | 3.57e-03 1.57e-03 ▲ |
2 | 2.86e-03 1.01e-03 | 3.07e-03 1.71e-03 ▲ | |
3 | 5.49e-03 5.07e-04 | 5.35e-03 5.72e-04 ▼ | |
4 | 8.52e-03 3.20e-04 | 8.46e-03 3.85e-04 ▼ | |
5 | 1.12e-02 1.95e-04 | 1.12e-02 2.76e-04 ▲ | |
DF6 | 1 | 3.86e-04 1.61e-04 | 4.00e-04 1.73e-04 ▲ |
2 | 2.01e-01 1.92e-01 | 2.04e-01 2.74e-01 == | |
3 | 1.51e-02 7.50e-03 | 1.15e-02 1.07e-02 ▼ | |
4 | 1.43e-02 7.36e-03 | 1.07e-02 1.06e-02 ▼ | |
5 | 1.24e-02 6.55e-03 | 9.28e-03 9.45e-03 ▼ | |
DF10 | 1 | 6.62e-03 5.77e-04 | 6.50e-03 4.09e-04 ▼ |
2 | 6.48e-03 5.11e-04 | 6.43e-03 2.15e-04 ▼ | |
3 | 6.52e-03 5.74e-04 | 6.50e-03 3.29e-04 ▼ | |
4 | 6.62e-03 4.84e-04 | 6.66e-03 2.71e-04 == | |
5 | 6.76e-03 3.57e-04 | 6.81e-03 2.04e-04 == | |
DF12 | 1 | 2.21e-03 1.43e-04 | 2.10e-03 3.25e-04 ▼ |
2 | 2.17e-03 2.94e-04 | 2.24e-03 3.61e-04 ▲ | |
3 | 2.21e-03 2.13e-04 | 2.18e-03 2.05e-04 ▼ | |
4 | 1.97e-03 2.39e-04 | 1.98e-03 1.94e-04 ▲ | |
5 | 2.13e-03 2.49e-04 | 2.09e-03 1.89e-04 == |
In comparison, DMOEA/D-TAOS obtains 9 first places with statistical significance in its favor. The results obtained for the group of problems presented in this experiment suggest that the proposed algorithm is better in the generalized spread and inverted generational distance metrics.
These metrics indicate that the proposed algorithm produces quality solutions with a good approximation to the Pareto front and good dispersion. Regarding the hypervolume metric, the clear winner is the state-of-the-art algorithm. The main limitation of using an Agent to automatically select genetic operators in the algorithm is the parameter configuration of the Agent: a good parameter configuration can give us quality solutions, but if we use a wrong parameter configuration for the Agent, we will get low-quality solutions.
In this work, the parameter values of the SARSA(λ) Agent were obtained empirically by performing multiple experiments to determine the current values. The source code can be downloaded from the link provided in the footnote.
5 Conclusions and Future Work
This work proposes a new adaptive operator selection strategy using a reinforcement learning agent. This SARSA Lambda reinforcement learning strategy has been integrated into a dynamic multi-objective decomposition-based algorithm called DMOEA/D-SL. Furthermore, a state-of-the-art adaptive operator selection strategy, in this case, “Adaptive operator selection with test-and-apply structure for decomposition-based multi-objective optimization (TAOS),” has also been integrated into a dynamic multi-objective decomposition-based algorithm.
In the case of the proposed algorithm, the Agent learns to select the best genetic operator at a given moment, even when the definition of the problem changes over time. Extensive experimentation has been performed, and the results have been evaluated with three metrics: hypervolume, generalized spread, and inverted generational distance.
The Wilcoxon test has been applied with a significance level of 5%; its results are favorable to DMOEA/D-SL in two of the three metrics. These results suggest that the algorithm produces high-quality solutions with a good approximation to the Pareto front and good dispersion.
Future work includes exploring the parameters of the SARSA(λ) Agent more broadly to achieve better results in all three metrics used. In addition, using more than four genetic operators would allow testing the behavior and performance of the strategies used in this work.
Finally, other reinforcement learning strategies can also be considered to further improve the quality of the solutions generated by the algorithm.