1 Introduction

The algorithms used for classification, regression, or clustering typically do not perform well on high-dimensional data [1]. Therefore, a traditional approach in machine learning is to implement procedures to reduce the number of features used by those algorithms. The goal of all feature selection procedures is to find a small subset of features that provide a high value of the implemented performance measure.

In many areas, the actual feature reduction rate is not a critical parameter. For example, when the algorithm’s task is to identify cancer areas in medical images in offline mode, additional features may be acceptable if they enhance recognition. However, there are areas where we cannot afford any additional features. The need for the smallest possible feature set is particularly crucial when the classifier is used for real-time decision-making, and the time needed to extract each feature from the current stream of data is considerably high. In such a situation, each additional feature that must be extracted from the data stream is an unnecessary burden that can disrupt the real-time classification or even make it impossible.

There are many traditional feature selection approaches, including filters such as ReliefF [2], Correlation-based Feature Selection [3], and Consistency-based Feature Selection, as well as wrappers like step-wise selection [4, 5], random selection [6], and Recursive Feature Elimination [7]. Embedded methods such as Lasso [8] can also be used. Besides classic approaches, heuristic approaches inspired by nature are becoming increasingly popular. A comprehensive review of these approaches is provided in [9], which reports dozens of algorithms inspired by the behaviour of insects, reptiles, birds, and other animals.

Among all nature-inspired feature selection algorithms, the most popular and commonly used are those based on swarm intelligence (SI) and genetic algorithms (GA). Some of the well-known algorithms from the first category are particle swarm optimization (PSO) [10,11,12,13,14], ant colony optimization (ACO) [15,16,17], and artificial bee colony optimization (ABC) [18,19,20]. Many studies have shown that SI algorithms are efficient feature selection techniques [21,22,23]. However, according to [1], most SI-based feature selection algorithms suffer from the poor scalability of representation, which usually makes them unsuitable for applications where thousands or millions of features are possible. Furthermore, although SI-inspired feature selection algorithms use different types of representation (either binary or continuous), they do not directly support the search process among feature sets containing a fixed and very small portion of all the possible features. Hence, when the smallest possible feature subset is needed, GA-based approaches might be a better option [24,25,26,27,28,29,30,31].

One advantage of using a GA as a feature selector is that it evaluates the entire set of solutions simultaneously, instead of sequentially [24]. Therefore, it can explore different parts of the search space in the same generation, rather than focusing on one particular area and potentially getting stuck in a local minimum. Additionally, even if it falls into a local minimum, it can escape on its own. Another feature of GA that is critical for the feature selection process is that it does not assume any interactions among features existing in the feature set [32,33,34]. However, the main disadvantage of using GAs as feature selectors is their long processing time, which is a consequence of wrapping the feature selection process around a classification scheme [6, 35].

Several GA approaches have been proposed to solve the feature selection problem, with one of the most popular being based on the classic Holland GA [27, 31, 32, 34, 36,37,38]. In this approach, features are coded into genes using a binary scheme, wherein a gene with a value of 1 denotes that the feature is included in an individual, while a gene with a value of 0 indicates its absence. With this scheme, the number of genes in an individual is equal to the total number of features in the feature space. However, a high number of genes in an individual has two disadvantages. First, when the GA fitness function is based solely on classification accuracy, as in the classic Holland GA, the classifiers are trained with a massive number of features, which is a time-consuming process. Second, although the classifiers’ accuracy is often approximately 100% at the very first algorithm iteration, most have limited application due to their limited generalization capabilities. Both problems can be resolved by introducing a penalty term into the classic Holland GA’s fitness function, penalizing individuals with too many features [24, 39, 40]. This solution generally works, but the feature reduction process is slow since the individuals in the initial population start with approximately half of all available features.

Another well-known feature selection approach based on a genetic algorithm is the non-dominated sorting genetic algorithm (NSGA), and its modification, NSGA-II [24, 25, 40,41,42,43,44,45]. The NSGA algorithm encodes features into individuals using a binary scheme, as in the classic Holland GA. The main difference between the algorithms is the optimization criterion. While the Holland GA optimises only the classification accuracy (maximization), the NSGA algorithm optimizes two criteria simultaneously: classification accuracy (maximization) and the number of features encoded in an individual (minimization). To perform this simultaneous optimization, the NSGA fitness function uses the domination principle, where individuals are assigned different ranks based on their level of dominance. The problem with the NSGA algorithm is similar to that with the Holland GA. Both algorithms start with a random population of individuals containing approximately 50% of the possible features (a result of a binary coding scheme) and then try to find a balance between the individuals’ accuracy and their number of features, which is a time-consuming process.

In [40], we proposed another GA dedicated to feature selection, a genetic algorithm with aggressive mutation (GAAM). This algorithm was designed to address one of the problems of the Holland approach, which emerged in datasets composed of thousands of features—the huge size of individuals. To overcome this issue, we changed the coding scheme from binary to integer, with genes encoding the feature indexes. We also designed a new mutation operator and adjusted other GA steps. These changes allowed us to use the algorithm with individuals of arbitrary length, from individuals containing only one gene to individuals coding all the features from the feature space. However, we encountered a subsequent problem.

The classic version of GAAM works with individuals of a fixed number of genes (features) set as a parameter. This highly restricts the possibility of training over-sized classifiers, but also limits the algorithm’s ability to search for subsets of features smaller than the initially chosen size. The only chance for decreasing the number of features is when an individual with repeated features is born in the reproduction process. When such a situation occurs, one of the redundant features is discarded, and the individual shrinks. Unfortunately, this is a sporadic event. Hence, although GAAM works well in terms of accuracy, classifier generalisation capabilities, and computation time, as compared to other approaches [40, 46,47,48,49], there is still room for improvement in changing individuals’ length. Although we can set the number of individuals’ genes to a reasonable value for the given task, we cannot predict if this value is optimal. Therefore, from the beginning of our work with GAAM, we have tried to adjust the algorithm so it could reduce the initial number of genes. To address this problem, we proposed a second version of the algorithm: Melting GAAM [24].

Melting GAAM uses an iterative approach to reduce the number of genes in individuals. It begins with the number of genes set by the user and attempts to improve classification accuracy to the specified level (also set by the user) during a given number of iterations. If the desired level is achieved, one random gene is removed from all individuals in the current population, and the optimization process restarts. The algorithm stops when it cannot surpass the accuracy threshold within the given number of iterations. While this approach works well, it is not flawless. Firstly, the process of optimising the classification accuracy is repeated numerous times for each number of genes. Although the process restarts from individuals highly similar to those from the previous iteration (with only one gene discarded), it requires some time for the algorithm to stabilize. Secondly, setting the accuracy threshold correctly is challenging: too high a threshold quickly halts the feature reduction process, while too low a threshold removes many features but yields individuals with low accuracy.

In this paper, we propose an alternative solution to decrease the number of individuals’ genes. The proposed solution utilizes both the penalty term employed in Holland GA and the ranks assigned to individuals in the NSGA algorithm. The paper presents a detailed algorithm of our approach, called Genetic Algorithm with Aggressive Mutation and Minimum Features (GAAMmf), and the results of experiments comparing its performance to four other GAs (GAAM, Melting GAAM, Holland with a penalty term, and NSGA-II) and four non-genetic feature selection methods (Correlation-based Feature Selection (CFS), Lasso, Sequential Forward Selection (SFS) [4], and IniPG [11], a PSO algorithm) conducted on eleven datasets with different characteristics. The experiments demonstrate that GAAMmf produces individuals encoding feature sets of comparable accuracy but containing a significantly smaller number of features. Furthermore, in addition to returning the final solution, the algorithm also generates individuals with the highest accuracy for each analyzed number of features. Thanks to this property, the algorithm’s users can decide for themselves which feature subset is preferable: a larger but more precise one, or a less precise one containing fewer features.

The paper’s structure is as follows. The subsequent section describes the main concepts of GAAM and provides a detailed explanation of GAAMmf. The following two sections discuss the experimental setup used to compare the algorithm’s performance and report the results. Finally, the last section concludes the paper.

2 Methods

The algorithm described in this manuscript is a modification of the GAAM algorithm introduced in [25]. The proposed modification involves altering the fitness function used to evaluate individuals in subsequent generations. While in the original GAAM the individuals are ordered exclusively according to their classification accuracy, GAAMmf combines two criteria in the fitness function: classification accuracy and the number of features encoded in an individual. This change allows for a gradual decrease in the number of features in successive generations, which is unattainable with the original GAAM algorithm.

2.1 The original GAAM

The pseudocode of the original GAAM is presented in Algorithm 1. The algorithm begins by setting four parameters: N, M, T, and tourN. Parameters N and M represent the initial population size, where N denotes the number of genes (features) encoded in each individual, and M denotes the number of individuals in the population. Parameter T determines the stopping condition for the algorithm, i.e., the number of generations to perform. Lastly, parameter tourN indicates the number of individuals used in the tournament selection procedure.

After setting the algorithm’s parameters, the initial mother population (motherP) is drawn (function: DrawInitialPopulation). The population is composed of M randomly selected individuals, where each individual contains N genes. The genes are integer values corresponding to the indexes of features in the set of possible features {1, ..., P}.
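The integer coding described above can be sketched in a few lines of Python. This is only an illustration, not the pseudocode of Algorithm 1: the function names `draw_initial_population` and `encoded_features` and the example value P = 500 are hypothetical, and drawing genes independently and uniformly (so that a feature may repeat within an individual) is an assumption. The helper that drops repeated features reflects the remark in Sect. 1 that an individual with redundant features shrinks.

```python
import random

def draw_initial_population(m, n, p, rng=None):
    """Draw m individuals, each with n integer genes taken from {1, ..., p}.

    Assumption: genes are drawn independently and uniformly, so a feature
    index may appear more than once in the same individual.
    """
    rng = rng or random.Random()
    return [[rng.randint(1, p) for _ in range(n)] for _ in range(m)]

def encoded_features(individual):
    """Return the distinct feature indexes of an individual, in gene order."""
    return list(dict.fromkeys(individual))

# Example: M = 10 individuals, N = 20 genes, P = 500 features (hypothetical P).
mother_p = draw_initial_population(10, 20, 500, random.Random(42))
```

Because a gene stores a feature index rather than an inclusion flag, the length of an individual is independent of P, which is what allows the algorithm to work with very small feature subsets.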

figure a

The main algorithm loop commences with two reproduction operations executed on the individuals from the initial population (motherP): a traditional one-point crossover (function: OnePointCrossover) and the aggressive mutation (function: AggressiveMutation). The latter is a GAAM-specific concept, performed individually on each gene of each individual, in accordance with the pseudocode presented in Algorithm 2. The two populations created during the reproduction operations are concatenated with the mother population, forming the final population (finalP).
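A minimal sketch of the two reproduction operations is given below. It is not a transcription of Algorithm 2. The following details are assumptions made for illustration: parents are paired at random for the crossover, each pair produces two children, and the aggressive mutation creates one child per gene of each parent by replacing that gene with a randomly drawn feature index. The `prob_m` argument mirrors the probM parameter introduced later for GAAMmf. All function names are hypothetical.

```python
import random

def one_point_crossover(population, rng):
    """Pair parents at random; each pair swaps tails at one random cut point."""
    parents = population[:]
    rng.shuffle(parents)
    children = []
    for a, b in zip(parents[0::2], parents[1::2]):
        shortest = min(len(a), len(b))
        if shortest < 2:                      # nothing to cut, copy the parents
            children.extend([a[:], b[:]])
            continue
        cut = rng.randint(1, shortest - 1)    # cut point strictly inside
        children.append(a[:cut] + b[cut:])
        children.append(b[:cut] + a[cut:])
    return children

def aggressive_mutation(population, p, rng, prob_m=1.0):
    """Create one mutated child per gene of every parent (assumed scheme)."""
    children = []
    for individual in population:
        for position in range(len(individual)):
            if rng.random() < prob_m:
                child = individual[:]
                child[position] = rng.randint(1, p)   # new feature index
                children.append(child)
    return children

rng = random.Random(0)
mother_p = [[rng.randint(1, 50) for _ in range(5)] for _ in range(4)]
crossed_p = one_point_crossover(mother_p, rng)
mutated_p = aggressive_mutation(mother_p, 50, rng)
final_p = mother_p + crossed_p + mutated_p    # concatenated population
```

Under these assumptions, the mutated population grows with the number of genes per individual, which is why each iteration evaluates many more individuals than a classic GA with the same mother population size.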

Subsequently, each individual is assessed based on the classification accuracy attained by the classifier utilizing the features encoded in the individual’s genes. Following evaluation, the M best individuals are selected from the current population using the tournament selection procedure. The algorithm terminates after achieving the predefined number of iterations.

figure b

2.2 The GAAMmf

As mentioned previously, the primary motivation for developing a new version of GAAM was to further reduce the number of features returned by the algorithm. To achieve this goal, we designed the GAAMmf fitness function to incorporate two criteria: classification accuracy, as in the original GAAM, and the number of features encoded in an individual. This new fitness function required several modifications in the GAAMmf algorithm, which are described below and presented in pseudocode in Algorithm 3. The three functions—DrawInitialPopulation, OnePointCrossover, and AggressiveMutation—remain unchanged and perform the same operations as described in Sect. 2.1.

figure c

First, we had to ensure that the algorithm’s individuals would be composed of a variable number of genes. We achieved this goal by using a redundant number of genes at the beginning of the algorithm. Second, we had to ensure the comparability of both criteria used in the evaluation function. To deal with this task, we employed the concept of ranks from the NSGA-II algorithm and assigned a set of ranks to different levels of accuracy and another set of ranks to different numbers of features. Third, we had to ensure that the original (unranked) values of both criteria would be passed between successive populations. To this end, we divided the algorithm evaluation function into two parts. The first part (function: accFitness) evaluated each individual’s accuracy. The second part (function: GAAMmfFitness) counted the individuals’ features, ranked the individuals separately according to both criteria, and calculated the final fitness of each individual. While we used the accFitness function twice in the algorithm body, first to evaluate the initial population and then in the main algorithm loop to evaluate each new population, the GAAMmfFitness function was used only once: in the main loop, after concatenating the mother and child populations.

Finally, to ensure an equal contribution of both criteria in the total fitness value, we added two algorithm constants, accFactor and fsFactor, which were calculated based on the number of possible accuracy levels and the number of possible features in an individual. We assumed that the accuracy levels range from 0 to 100% (100 integer levels) and the number of features, from 1 to N (N−1 levels). Under these assumptions, the accFactor and fsFactor constants were calculated as shown in Formulas (1) and (2), respectively.

$$\begin{aligned} accFactor=100\cdot \frac{100}{N-1} \end{aligned}$$
(1)
$$\begin{aligned} fsFactor=(N-1)\cdot \frac{100}{N-1}=100 \end{aligned}$$
(2)

In GAAMmf, the fitness of individuals is calculated within the GAAMmfFitness function (Algorithm 4). The function takes two input parameters: the final population of individuals, which includes individuals from the mother population and all offspring born during the crossover and mutation operations, and the accuracy vector containing the classification accuracy of all individuals from the final population. Inside the function, a sequence of seven operations is performed.

First, the individuals from the final population are sorted according to their increasing accuracy. Then, they are ranked based on the rule that individuals with the same accuracy (rounded to integer values) are assigned the same rank. Rank 1 is assigned to individuals with the worst accuracy. Since the range of ranks can differ for both criteria, the ranks are normalized using a pseudo min-max normalization (3) after assigning accuracy ranks to all individuals. We refer to this as "pseudo min–max normalization" to indicate that accuracy values are normalized based on fixed boundaries set to 0 and 100 for the accuracy criterion, rather than the minimum and maximum accuracy obtained in the current population.

$$\begin{aligned} accNorm(i)=\frac{acc(i)}{100} \end{aligned}$$
(3)

where accNorm(i)—accuracy rank of individual i after normalisation, and acc(i)—accuracy rank of individual i.

The three steps described for the accuracy criterion are repeated for the number of features criterion, with two subtle changes. First, since the worst rank (Rank 1) for this criterion should be assigned to individuals with the largest number of features, individuals are sorted in descending order. Second, the pseudo min–max normalization boundaries are set to 1 and N, indicating that the normalized ranks are calculated according to Formula (4).

$$\begin{aligned} fsNorm(i)=\frac{fs(i)-1}{N-1} \end{aligned}$$
(4)

where fsNorm(i)—number-of-feature rank of individual i after normalisation, fs(i)—number-of-feature rank of individual i, and N—number of genes in an individual.

Finally, the normalized ranks are multiplied by the corresponding weights and added together, as shown in (5):

$$\begin{aligned} {\begin{matrix} fitness(i)=accWeight*accFactor*accNorm(i) \\ \qquad\qquad\quad\;\; +\;fsWeight*fsFactor*fsNorm(i) \end{matrix}} \end{aligned}$$
(5)

where fitness(i)—the final fitness of individual i, accNorm(i) and fsNorm(i)—the normalized ranks of individual i with respect to the accuracy (accNorm) and the number of features (fsNorm) criteria, accFactor and fsFactor—factors that ensure an equal contribution of both criteria to the total fitness, accWeight and fsWeight—weights that allow the importance of both criteria to be regulated.
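The ranking, normalisation, and weighting steps behind Formulas (1) to (5) can be sketched as follows. This is an illustrative reading, not the authors' Algorithm 4. The factors are computed exactly as printed in Formulas (1) and (2). Two details are assumptions: tied individuals receive consecutive ("dense") ranks, and accuracies are rounded with Python's built-in `round`, which rounds halves to the nearest even integer. The function names are hypothetical, and N is assumed to be at least 2.

```python
def dense_ranks(values, descending=False):
    """Rank 1 goes to the first value in sort order; ties share one rank."""
    ordered = sorted(set(values), reverse=descending)
    rank_of = {value: rank for rank, value in enumerate(ordered, start=1)}
    return [rank_of[value] for value in values]

def gaammf_fitness(accuracies, feature_counts, n, acc_weight=1.0, fs_weight=1.0):
    """Combine accuracy ranks and feature-count ranks as in Formula (5).

    accuracies     -- accuracy of each individual in percent (0..100)
    feature_counts -- number of features encoded in each individual
    n              -- initial number of genes N (assumed >= 2)
    """
    acc_factor = 100 * 100 / (n - 1)        # Formula (1) as printed
    fs_factor = (n - 1) * 100 / (n - 1)     # Formula (2), equals 100

    # Rank 1 = worst accuracy; equal rounded accuracies share a rank.
    acc_rank = dense_ranks([round(a) for a in accuracies])
    # Rank 1 = largest number of features, hence descending order.
    fs_rank = dense_ranks(feature_counts, descending=True)

    acc_norm = [r / 100 for r in acc_rank]              # Formula (3)
    fs_norm = [(r - 1) / (n - 1) for r in fs_rank]      # Formula (4)

    return [acc_weight * acc_factor * a + fs_weight * fs_factor * f
            for a, f in zip(acc_norm, fs_norm)]         # Formula (5)

# Three individuals: the third is both more accurate and smaller.
fitness = gaammf_fitness([90.2, 90.4, 95.0], [5, 5, 3], n=5)
```

In this toy example the first two individuals tie on both criteria and therefore receive the same fitness, while the third obtains a higher value on both terms.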

figure d

Since the algorithm simultaneously explores subsets of features of different sizes, its output is not a single individual with the best characteristics, but a set of individuals. Each individual in the final set of bestIndividuals represents the solution of the highest accuracy obtained for the feature set of the given number of features. As a result, the algorithm user might decide which solution better suits their needs: that with slightly lower accuracy but a smaller number of features or that with slightly higher accuracy but a higher number of features.
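One way to maintain such a set is to keep, for every number of features, the most accurate individual seen so far. The sketch below is a hypothetical illustration of this bookkeeping: the function name, the dictionary layout, and the rule that repeated genes count as one feature are assumptions, not details taken from Algorithm 3.

```python
def update_best_individuals(best, population, accuracies):
    """Keep the most accurate individual for each number of features.

    best -- dict mapping feature count -> (individual, accuracy); updated
            in place and returned for convenience.
    """
    for individual, accuracy in zip(population, accuracies):
        count = len(set(individual))          # repeated genes count once
        if count not in best or accuracy > best[count][1]:
            best[count] = (individual, accuracy)
    return best

best_individuals = {}
update_best_individuals(
    best_individuals,
    population=[[4, 9, 2], [7, 1, 3], [5, 8], [6, 6, 2]],
    accuracies=[91.0, 93.5, 90.0, 88.0],
)
# best_individuals now holds one entry for 3 features and one for 2 features.
```

Calling the function after every generation yields, at the end of the run, one candidate solution per feature count from which the user can choose.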

Apart from the change in the fitness function, we introduced two other subtle modifications that we had tested previously in other papers on GAAM [25, 50]. Firstly, we introduced a new algorithm parameter probM (probability of mutation) to control the intensity of the mutation process. This parameter enhanced the algorithm’s scalability and enabled its application to problems described in a high-dimensional feature space.

Secondly, we changed the selection procedure from tournament selection to rank selection. This alteration resulted in a significant reduction of the computational burden imposed by the algorithm. By employing aggressive mutation, the GAAMmf explores various regions within the problem space during each iteration. Consequently, it requires only a limited number of iterations to attain the final results, although each of these iterations is computationally intensive. The transition from tournament selection to rank selection facilitated a decrease in the number of iterations needed (as the best individuals consistently prevail in rank selection), thereby leading to a significant reduction in the overall processing time of the algorithm.

3 Experiment setup

To evaluate the effectiveness of GAAMmf, we conducted a study using eleven datasets, namely Pima-Indians-diabetes, Orlraws10P, Dermatology, Adult, Gisette, Humanactivity, Coil100, Gli_85, Orl_32×32, WarpAR10P, and Yale_32×32, downloaded from sources cited in Table 1. The datasets differed in terms of the number of features, classes, and examples. The aim of our study was to demonstrate that regardless of the dataset characteristics, GAAMmf produces individuals with classification accuracy comparable to reference methods but containing significantly fewer genes. The results of GAAMmf were compared with those of the original GAAM and three other genetic approaches that allow for changes in the number of individuals’ genes, namely Melting GAAM, Holland GA with a penalty term, and NSGA-II. In addition to genetic algorithms, we compared the GAAMmf results with results returned by four non-genetic feature selection methods (CFS, Lasso, SFS, and IniPG).

Before using the datasets in the study, we applied the following preprocessing procedures: (i) removal of all records containing NaN values, (ii) removal of redundant features (features that had the same value for each record), and (iii) identification of pairs of features whose linear correlation exceeded 99%, and discarding one feature from each pair. The detailed characteristics of the datasets before and after applying the preprocessing procedures are presented in Table 1.
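The three preprocessing steps can be sketched in plain Python as shown below. The sketch rests on assumptions that the text does not specify: correlation is measured with the Pearson coefficient, its absolute value is compared against the 0.99 threshold, and the earlier feature of each highly correlated pair is the one that is kept. The function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def preprocess(rows, threshold=0.99):
    """Return (cleaned rows, indexes of the retained feature columns)."""
    # (i) remove records containing NaN values
    rows = [r for r in rows if not any(math.isnan(v) for v in r)]
    columns = list(zip(*rows))
    # (ii) remove features that have the same value in every record
    varying = [j for j, col in enumerate(columns) if len(set(col)) > 1]
    # (iii) drop the later feature of every pair correlated above threshold
    kept = []
    for j in varying:
        if all(abs(pearson(columns[j], columns[k])) <= threshold for k in kept):
            kept.append(j)
    return [[r[j] for j in kept] for r in rows], kept

data = [
    [1.0, 2.0, 5.0, 3.0],
    [2.0, 4.0, 5.0, 1.0],
    [3.0, 6.0, 5.0, 4.0],
    [float("nan"), 1.0, 5.0, 2.0],   # removed in step (i)
    [4.0, 8.0, 5.0, 0.0],
]
cleaned, kept = preprocess(data)
```

In this toy dataset the record containing NaN is removed, the third column is dropped because it is constant, and the second column is dropped because it is a multiple of the first.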

Table 1 The characteristics of the datasets used in the survey

For all datasets, the main GAAMmf parameters were set at the same levels: M (number of individuals in the mother population) was set to 10, probM (mutation probability) to 1, and T (number of algorithm iterations) to 100 (for the first two experiments) or 1000 (for the last experiment). The value of the N parameter, denoting the initial number of genes in an individual, was also standardized for most datasets and was set to 20. Only for two datasets, namely Adult and Pima-Indians-diabetes, which contained 14 and 8 potential features, respectively, was the N parameter set to the total number of features.

The accuracy of individuals was evaluated using a linear discriminant analysis (LDA) classifier. Our decision to employ the LDA classifier was motivated by two key factors. Firstly, the adoption of a linear classification procedure allows for the generation of a classification model with a minimal number of parameters. Consequently, this choice mitigates the potential impact of variations that may occur in each training instance on the outcomes produced by the feature selection procedures. Secondly, the LDA classifier does not require an iterative numerical procedure to estimate the model parameters. As a result, the estimation of the LDA model is significantly faster than that of classifiers whose parameters are estimated through an iterative training process. The parameters of each LDA classifier were estimated according to the 10-fold cross-validation procedure on 80% of the data chosen randomly from the dataset. The remaining 20% of the data was used to test the generalisation capabilities of the final classification model returned by the algorithm. The LDA classifier was employed in all algorithms tested in the paper.

In the case of GAAM and Melting GAAM, the three parameters shared by both algorithms (M, N, and T) were set at the same levels as in GAAMmf. In addition, for Melting GAAM, an extra parameter needed to be set: the accuracy threshold. This parameter informs the algorithm that the current number of genes has achieved satisfactory accuracy, and the algorithm should proceed with N = N−1 genes. We assumed that we would be satisfied with a classifier of 90% accuracy, and hence we set the accuracy threshold at 90%. Unfortunately, setting the accuracy threshold beforehand can be challenging as it depends on the characteristics of the dataset. As discussed in Sect. 4, our threshold was too high for some datasets, resulting in no feature reduction, and too low for others, leading to the convergence of the algorithm to individuals with low classification accuracy.

For Holland GA, the classic scheme was employed, utilizing the two most popular genetic operations, flip mutation (with a probability of 0.1) and one-point crossover (with a probability of 1). The selection process was performed with the tournament method (the tourN parameter was set to 2). The two primary algorithm parameters, the number of individuals in the mother population and the number of iterations, were set to the same levels as in GAAMmf. The fitness function was composed of accuracy and penalty terms, where the penalty term was introduced to penalise individuals for having too many genes. Both terms were assigned equal importance (6).

$$\begin{aligned} fitness(i)=0.5*acc(i)+{0.5*}\frac{P-features(i)}{P}, \end{aligned}$$
(6)

where fitness(i)—the fitness of individual i, acc(i)—accuracy of the classifier equipped with features encoded in individual i, P—number of all features in the feature set, and features(i)—number of features encoded in individual i.
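Formula (6) translates directly into code for a binary-coded individual. The sketch below assumes that acc(i) is expressed as a fraction in the range 0 to 1, so that both terms share the same scale. This is an assumption, since the text does not state the unit. The function name is hypothetical.

```python
def holland_penalized_fitness(genes, accuracy):
    """Fitness from Formula (6) for a binary-coded individual.

    genes    -- list of 0/1 flags, one per feature in the feature set (length P)
    accuracy -- classification accuracy as a fraction in [0, 1] (assumption)
    """
    p = len(genes)                  # P: number of all features
    features = sum(genes)           # features(i): genes set to 1
    return 0.5 * accuracy + 0.5 * (p - features) / p

# Two individuals with equal accuracy: the one using fewer features wins.
small = holland_penalized_fitness([1, 0, 0, 0], 0.90)
large = holland_penalized_fitness([1, 1, 1, 0], 0.90)
```

With equal accuracy, the penalty term alone decides the ordering, so the individual that encodes fewer features receives the higher fitness.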

The primary parameters of NSGA-II, the last GA employed in the experiments (number of individuals in the population, mutation and crossover probability, and number of iterations), were set to the same levels as those for the Holland algorithm. Binary coding of features was applied, and the algorithm scheme proposed in [43] was implemented.

In addition to the four GAs, the set of reference methods employed in the experiments also included four non-genetic feature selection techniques: one filter (CFS), one embedded method (Lasso), and two wrappers (SFS and IniPG). To ensure comparability with the genetic algorithms, the upper boundary of the feature set size was set for all four methods at the same level as that for all GAs (8 for Pima-Indians-diabetes, 14 for Adult, and 20 for the remaining datasets). All parameters required to run IniPG algorithm were set at the levels reported in [11].

The experiments were conducted on a machine with the following specification: AMD Ryzen 5 1400 quad-core CPU @ 3.20 GHz, 16 GB RAM, Windows 10 Pro x64.

4 Results and discussion

In order to showcase the features of the proposed algorithm, we conducted a series of experiments. The first experiment aimed to demonstrate the impact of the two parameters incorporated in the algorithm for controlling the significance of the two competing criteria, namely classification accuracy (accWeight) and the number of features (fsWeight). Subsequently, we compared GAAMmf with four other genetic algorithms (GAs) capable of performing feature selection, namely GAAM, Melting GAAM, Holland with a penalty term, and NSGA-II. To facilitate the visual presentation of the results obtained from the first two experiments, both were executed on a single dataset only (Gisette in the first experiment; Humanactivity in the second experiment). Finally, we compared the performance of GAAMmf with all eight reference algorithms across the eleven datasets described in Sect. 3.

4.1 GAAMmf parameters (Gisette dataset)

This subsection provides an overview of GAAMmf’s performance on the Gisette dataset, described in Sect. 3. To showcase the impact of the accWeight and fsWeight parameters on the algorithm’s results, we executed the algorithm three times, each time with different values of both parameters. For the initial run, we set both parameters to 1 (accWeight = 1, fsWeight = 1). For the second and third runs, we doubled the significance of one of the fitness criteria, the number of features criterion (accWeight = 1, fsWeight = 2) in run 2 and the accuracy criterion (accWeight = 2, fsWeight = 1) in run 3. Each run was conducted over a period of 100 iterations. Figure 1 illustrates the average validation accuracy and the number of features encoded in the best individual returned by the algorithm for each iteration of each run. Additionally, Table 2 shows the average accuracy of the best individual found for different numbers of features.

Fig. 1 The GAAMmf performance for the Gisette dataset; the plots on the left present the average validation accuracy of the best individual returned in each iteration, the plots on the right present the number of features encoded in that individual; the rows of plots present results for different levels of accWeight and fsWeight parameters: (a) accWeight = 1, fsWeight = 1; (b) accWeight = 1, fsWeight = 2; (c) accWeight = 2, fsWeight = 1

Table 2 The highest accuracy obtained for different numbers of features for the Gisette dataset

As presented in Table 2, the algorithm produced comparable results across all three levels of accWeight and fsWeight. The highest classification accuracy was equal to 93.15% for an equal value of both parameters, 92.04% for the doubled significance of the number-of-features criterion (fsWeight = 2), and 94.03% for the doubled importance of the accuracy criterion (accWeight = 2). Regarding the second criterion, the algorithm attained the smallest number of features (i.e., 10) with a doubled fsWeight parameter. The other algorithm runs returned feature sets composed of 15 features. Upon comparing the three sets of results, the algorithm obtained the most promising outcomes with a doubled accuracy weight. As demonstrated in the last two columns of Table 2, by applying greater pressure on the accuracy criterion, we forced the algorithm to conduct a more thorough search amongst the individuals with the same number of features. Consequently, the algorithm returned individuals of greater accuracy for each number of features in comparison to the other two cases. Since we were interested in high accuracy in the two following experiments, we utilized the variant with accWeight = 2 and fsWeight = 1 in both.

As illustrated in Fig. 1, the performance of the algorithm was consistent across all levels of the parameters accWeight and fsWeight, leading to individuals with high accuracy and a small number of features. The convergence rate varied across runs and was dependent on the parameter levels, with higher values of fsWeight resulting in a more rapid reduction of features but with some fluctuations in accuracy (Fig. 1b). Conversely, for higher values of accWeight (Fig. 1c), the algorithm demonstrated a highly stable accuracy performance, albeit with a slower rate of feature reduction.

4.2 Comparison of GAAMmf with genetic reference methods (Humanactivity dataset)

The second experiment aimed to compare the performance of GAAMmf with four other GAs (GAAM, Melting GAAM, Holland with a penalty term, and NSGA-II) on the Humanactivity dataset. Results are presented in Tables 3, 4 and Fig. 2. Table 3 shows the average validation accuracy of the best individuals identified for various numbers of features, Table 4 presents the processing time required to complete 100 iterations, and Fig. 2 compares the performance of all five algorithms across 100 iterations.

Table 3 demonstrates that the highest accuracy obtained by the four reference algorithms was similar, ranging from 96.98% to 97.80%. The best accuracy of 97.80% was obtained with GAAM, followed by Melting GAAM with 97.30%, NSGA-II with 97.22%, and finally Holland with 96.98%. The accuracy of the best individual provided by the GAAMmf algorithm (96.99%) was practically equal to that of Holland and slightly lower than that of the other three reference algorithms, with the differences ranging from 0.23 to 0.81 percentage points.

Although the accuracy obtained with all five algorithms was similar, the number of features in individuals with the highest accuracy varied significantly. For the algorithms using a binary coding scheme (NSGA-II and Holland), the feature reduction was relatively small, with the best individuals containing 22 features (NSGA-II). In the case of GAAM, there was no feature reduction at all; the change from an initial 20 to 19 genes was caused only by a duplicated feature in the individual. The highest reduction was achieved for GAAMmf and Melting GAAM. In the case of GAAMmf, an individual with only 7 features had an accuracy of 96.54%, only 1.26 percentage points lower than that of the best individual in the table (20 features, 97.80%).

Table 3 The highest accuracy returned by five GAs for different numbers of features (Humanactivity dataset)

When comparing the feature reduction plots (the plots on the right-hand side of Fig. 2), it can be observed that the reduction process for GAAMmf and Melting GAAM remained relatively stable over time, gradually decreasing until the final number of features was reached. In contrast, the number of features in individuals produced by the two other algorithms designed for feature reduction (Holland and NSGA-II) slightly fluctuated during the initial period. Since the goal of the last algorithm (GAAM) was not feature reduction, the number of features in individuals produced by this algorithm remained largely consistent across all 100 iterations.

Fig. 2 The algorithms’ performance for the Humanactivity dataset; the plots on the left present the average validation accuracy of the best individual returned in each iteration, the plots on the right present the number of features encoded in that individual; the rows of plots present results for different algorithms: (a) GAAMmf; (b) GAAM; (c) Melting GAAM; (d) NSGA-II; (e) Holland

Table 4 The processing time needed to complete 100 iterations by each GA (results for Humanactivity dataset)

Finally, regarding the processing time, it can be observed (Table 4) that GAAMmf required a significant amount of time to complete the required number of iterations (1 h 57 min), especially when compared to Holland (18 min), Melting GAAM (24 min) and NSGA-II (1 h 02 min). One of the reasons for such a long processing time was a significantly higher number of individuals evaluated by GAAMmf in each iteration. While Holland and NSGA-II evaluated only 10 individuals per iteration, i.e., 1000 individuals in 100 iterations, GAAMmf evaluated between 94 and 190 individuals in each iteration (10 mother individuals, 10 crossed-over, and from 70 to 170 mutated, depending on the average number of features in individuals in the current population). Although Melting GAAM started with the same number of individuals as GAAMmf, it quickly reduced the number of individuals needing evaluation to only one per iteration. Nevertheless, regardless of the reason, the long processing time should be considered a limitation of the proposed algorithm.
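The difference in evaluation effort can be made concrete with a few lines of arithmetic. The sketch below uses only the per-iteration counts quoted above (10 individuals for Holland and NSGA-II, 94 to 190 for GAAMmf); the function name `evaluation_budget` is a hypothetical helper introduced for illustration, not part of any of the compared implementations.

```python
ITERATIONS = 100  # number of iterations used in the experiment


def evaluation_budget(min_per_iteration, max_per_iteration, iterations=ITERATIONS):
    """Return the (lowest, highest) total number of evaluated individuals."""
    return min_per_iteration * iterations, max_per_iteration * iterations


# Holland and NSGA-II: a fixed population of 10 individuals per iteration.
fixed = evaluation_budget(10, 10)
# GAAMmf: between 94 and 190 individuals per iteration.
gaammf = evaluation_budget(94, 190)

print(fixed)   # (1000, 1000)
print(gaammf)  # (9400, 19000)
```

Under these counts, GAAMmf trains roughly 9 to 19 times as many classifiers as the two binary-coded algorithms over the same 100 iterations.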

4.3 Comparison of GAAMmf with reference methods (all datasets)

In the previous subsection, we demonstrated that the individuals returned by GAAMmf for the Humanactivity dataset had slightly lower accuracy but were composed of a significantly smaller number of features compared to those produced by most other GAs (with the exception of Melting GAAM). The objective of the experiment reported in this section was to determine whether this observation is consistent across datasets of different numbers of features, classes, and examples. Unlike in the two previous sections, we do not present here the detailed results obtained in individual iterations. Instead, for each method and dataset, we report the characteristics of the feature set with the best classification accuracy. The comparison of results achieved by GAAMmf and eight reference methods (GAAM, Melting GAAM, Holland, NSGA-II, CFS, Lasso, SFS, IniPG) for eleven datasets described in Sect. 3 (Table 1) is presented in Table 5.

To test the statistical significance of GAAMmf’s results against those obtained with the reference methods, we conducted a series of one-sample tests with the significance level (\(\alpha\)) set to 0.05. Each test evaluated the null hypothesis H0: the difference between the accuracy (or number of features) of individuals returned by GAAMmf and one of the reference methods is equal to zero, against the alternative hypothesis H1: the difference between the accuracy (or number of features) of individuals returned by GAAMmf and one of the reference methods is not equal to zero.

To verify this set of hypotheses, we first calculated the differences between the results obtained by GAAMmf and each of the reference methods. This process yielded a set of 12 samples, consisting of 6 samples for differences in accuracy and 6 samples for differences in the number of features. Subsequently, we assessed the normality condition for each sample using the one-sample Lilliefors test with \(\alpha\) set to 0.05. Finally, since not all samples met the normality condition, we applied a one-sample Wilcoxon signed-rank test to assess the significance of the differences. The results of the Wilcoxon test are presented in Fig. 3, which shows the mean differences in accuracy (Fig. 3a) and the number of features (Fig. 3b) calculated between GAAMmf and each of the reference methods (the actual p-value is provided for all statistically significant results).
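The final step of this procedure can be illustrated with a minimal, self-contained sketch of an exact two-sided one-sample Wilcoxon signed-rank test. It is not the implementation used in the study: the function name `signed_rank_test` and the example differences are hypothetical, zero differences are simply dropped, tied absolute values receive average ranks, and the Lilliefors normality check is omitted. Exhaustive enumeration of sign patterns is practical here only because each sample contains at most eleven per-dataset differences.

```python
from itertools import product


def signed_rank_test(differences):
    """Exact two-sided Wilcoxon signed-rank test of H0: the differences are centred at zero.

    Returns (W+, p_value), where W+ is the sum of ranks of the positive differences.
    """
    # Zero differences carry no sign information and are dropped (an assumption of this sketch).
    d = [x for x in differences if x != 0]
    n = len(d)
    if n == 0:
        raise ValueError("all differences are zero")

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    centre = sum(ranks) / 2  # expected value of W+ under H0
    observed = abs(w_plus - centre)

    # Under H0 each of the 2^n sign patterns is equally likely; count those at
    # least as far from the centre as the observed statistic.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical accuracy differences (GAAMmf minus a reference method), one per dataset.
example_differences = [1.2, 0.4, -0.3, 2.5, 0.9, 3.1, 0.7, 1.8, -0.1, 4.0, 2.2]
w_plus, p_value = signed_rank_test(example_differences)
print(w_plus, p_value, p_value < 0.05)
```

With all eleven differences sharing the same sign, the smallest attainable two-sided p-value of this exact test is 2/2^11, which is well below the 0.05 level used in the paper.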

As shown in Table 5, not all methods produced results for all datasets. The Holland and NSGA-II algorithms encountered problems in the classifiers’ training process for the individuals from seven datasets (Orlraws10P, Gisette, Coil100, Gli_85, Orl_32×32, WarpAR10P, and Yale_32×32). For all of those datasets, due to the unfavourable ratio of the number of features to the number of examples, the classification process could not be completed because the covariance matrices did not meet the positive definiteness condition. Since only four valid results could be obtained for the Holland and NSGA-II algorithms, they were excluded from the statistical tests. At first, a similar problem was encountered with IniPG. However, we managed to overcome it by slightly changing the parameters proposed in [11]. Our modification was to initialize all the particles with a small number of features (the same as was used for other algorithms) instead of using particles with sparse and dense initialization.
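The failure of the positive definiteness condition follows from a standard rank argument. As a sketch, assume the classifier estimates a sample covariance matrix from \(n\) training examples \(x_1, \dots, x_n \in \mathbb{R}^{p}\), where \(p\) is the number of features encoded in the individual (the symbols \(n\), \(p\), \(x_i\), and \(\hat{\Sigma}\) are introduced here for illustration only):

\[
\hat{\Sigma} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(x_i-\bar{x}\right)^{\top},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
\operatorname{rank}\bigl(\hat{\Sigma}\bigr) \le n-1 .
\]

The centred vectors \(x_i-\bar{x}\) sum to zero, so they span at most \(n-1\) dimensions. Whenever \(p > n-1\), there exists a non-zero vector \(v\) with \(v^{\top}\hat{\Sigma}\,v = 0\), which means that \(\hat{\Sigma}\) is only positive semidefinite and cannot be inverted. An individual that encodes roughly half of several thousand features, evaluated on a few hundred examples or fewer, always falls into this case.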

Table 5 The algorithms’ results across all datasets

In terms of the classification accuracy presented in Table 5, GAAM outperformed the other methods for almost all datasets. Only for three datasets (Adult, Gisette, and Orl_32×32) did other algorithms return classifiers with marginally higher accuracy. The second place was shared between GAAMmf (eight datasets), Holland (two datasets), and CFS (one dataset). Comparing the accuracy differences averaged over eleven datasets (Fig. 3a), it can be noticed that GAAMmf exhibited superior performance compared to five reference methods: Melting GAAM, Lasso, CFS, IniPG, and SFS. For all of those methods, the difference in accuracy was statistically significant. In fact, the only algorithm that performed better than GAAMmf in terms of accuracy was the original GAAM. However, the difference in accuracy between those two algorithms was not statistically significant.

Fig. 3 The statistical significance of differences calculated between GAAMmf and other methods in terms of (a) classification accuracy and (b) number of features; the p-value for each significant difference is presented over the corresponding bar

As shown in Table 5, although Holland and NSGA-II returned individuals with accuracies comparable to those generated by GAAM-based algorithms, their application can sometimes be challenging. This is because both algorithms employ a binary coding scheme, which begins the search process with roughly half of the total number of features. As a result, their individuals can be difficult to evaluate in terms of classification accuracy when applied to datasets with an unfavourable ratio between the number of features and the number of examples (e.g., Gisette and Orlraws10P). In contrast, all three GAAM-based algorithms permit the initial selection of the number of genes (i.e., features) in individuals. Consequently, they are not affected by the curse of dimensionality and can be applied to datasets with arbitrary characteristics.

Moreover, it is worth noting that some non-genetic feature selection methods exhibited significantly poorer performance when confronted with multiclass problems. For example, as can be seen in Table 5, Lasso and CFS performed much worse for most multiclass datasets (apart from Dermatology and, in the case of Lasso, Gisette). The most extreme drops in accuracy relative to GAAMmf were observed for Orlraws10P, Coil100, Orl_32×32, WarpAR10P, and Yale_32×32.

The second section of Table 5 displays the accuracy of the final classification model estimated for each algorithm using the 20% of data that was not utilized in the parameter estimation process. We marked in bold all the cases where the test accuracy was 10% lower than the corresponding validation accuracy reported in the first part of the table. Analysis of the table reveals that the parameters of most classifiers were correctly estimated, with the test accuracy being only slightly lower or, in some cases, slightly higher than the validation accuracy. Additionally, it is evident that the occurrence of overfitting primarily depended on the characteristics of the datasets rather than on the algorithms themselves. Notably, for four datasets (Orlraws10P, Gli_85, WarpAR10P, and Yale_32×32), almost all classifiers exhibited overfitting, while for the remaining datasets, overfitting was not observed.

Regarding the average accuracy calculated across all eleven datasets, the classic GAAM outperformed other algorithms in terms of average validation accuracy, achieving a score of 91.73%. However, it displayed the poorest generalization capabilities among the estimated classifiers. On the other hand, GAAMmf demonstrated slightly lower average validation accuracy (90.48%) compared to GAAM, but it emerged as the winner in terms of test accuracy and generalization capabilities.

The third section of Table 5 presents the number of features encoded in individuals with the highest accuracy, and the outcomes differ significantly from those presented in the first section of the table. Here, Melting GAAM returned individuals with the fewest features, averaging 8 features. GAAMmf followed closely with 10 features, while SFS obtained the third-best performance with 11 features. Other algorithms yielded significantly larger feature sets (Fig. 3b).

The comparison across different datasets reveals one undesirable feature of Melting GAAM that motivated us to seek an approach that balances the accuracy and number-of-features criteria in the feature selection process. By fixing the accuracy, the algorithm might halt the search process with individuals that are far from optimal, either in terms of accuracy or the number of features. When the accuracy threshold is underestimated, the algorithm terminates with individuals of a much worse accuracy than optimal (for Orlraws10P, the accuracy of Melting GAAM was over 10% worse than that of GAAM). Conversely, when the accuracy threshold exceeds the accuracy that can be achieved for the given dataset, the algorithm focuses entirely on the accuracy criterion, and no feature reduction is achieved. This situation was observed for the Adult dataset, where Melting GAAM stopped with 14 features, although the individual composed of only three features returned by GAAMmf provided even better accuracy.

The final part of Table 5 compares the total processing time required by each algorithm to complete the task. The ranking of the methods in this section of the table was consistent with expectations. The quickest method was CFS, followed by Lasso. All wrappers required a significantly longer time to complete the task. Among the wrappers, the SFS method was the fastest, followed by IniPG, GAAM, Melting GAAM, and GAAMmf. The two least time-efficient methods were Holland and NSGA-II, in that order.

5 Conclusions

This study developed and evaluated a genetic algorithm for feature selection (GAAMmf), which is a modified version of a genetic algorithm with aggressive mutation. The new version of the algorithm was developed to overcome the limitations of both preceding GAAM-based algorithms, the original GAAM and Melting GAAM. The original GAAM is focused on feature subsets of a fixed size. Hence, it optimises the feature set in terms of classification accuracy but not the number of features. Conversely, Melting GAAM fixes the classification accuracy and optimises only the size of the feature space. GAAMmf combines both criteria and simultaneously optimises the classification accuracy and the size of the feature set, similarly to Holland with a penalty term and NSGA-II. One distinguishing feature of GAAMmf is its ability to start the feature reduction process from either the entire feature set or an arbitrarily chosen number of features, which is impossible with the direct binary coding scheme. This feature makes GAAMmf applicable to datasets with arbitrary characteristics.

In summary, the main benefit of GAAMmf is its ability to be run without tedious tuning of parameters. All parameters can be set to the levels used in the experiments described in the paper, which may not be optimal in terms of processing time but will produce a sufficiently small feature set of satisfactory classification accuracy for most datasets. On the other hand, the main limitation of GAAMmf is its processing time. The aggressive mutation used in the evolution process allows the algorithm to explore different subspaces of the search space, but also produces a large number of new individuals that must be evaluated. Attempts were made to overcome this problem by introducing the concept of mutation probability, but the processing time is still unsatisfactory. Hence, in future work, we plan to apply the concept of keeping track of previously evaluated individuals to avoid the cost of their reevaluation.
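The idea of keeping track of previously evaluated individuals can be sketched as a simple memoisation layer around the fitness evaluation. The code below is only an illustration under stated assumptions, not the planned implementation: `CachedEvaluator` and the `evaluate` callable are hypothetical names, an individual is assumed to be a sequence of feature indices, two individuals encoding the same set of features (regardless of gene order or duplicated genes) are assumed to have the same accuracy, and the evaluation is assumed to be deterministic.

```python
class CachedEvaluator:
    """Wrap a costly evaluation function and reuse results for repeated feature sets."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # hypothetical function: list of feature indices -> accuracy
        self._cache = {}
        self.calls = 0  # number of times the costly evaluation was actually run

    def __call__(self, individual):
        # Gene order and duplicated genes do not change the encoded feature set.
        key = frozenset(individual)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._evaluate(sorted(key))
        return self._cache[key]


# Usage with a stand-in evaluation function (a real one would train a classifier).
def dummy_accuracy(features):
    return 1.0 / (1 + len(features))


evaluator = CachedEvaluator(dummy_accuracy)
population = [[4, 7, 9], [9, 4, 7], [4, 4, 7], [4, 7]]
scores = [evaluator(ind) for ind in population]
print(evaluator.calls)  # 2: only two distinct feature sets were evaluated
```

Such a cache pays off only when mutation and crossover frequently reproduce feature sets that were already seen, and it trades memory for time, since every evaluated set must be stored.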